Lightweight Conversion from Arithmetic to Boolean Masking for Embedded IoT Processor

A masking method is a widely known countermeasure against side-channel attacks. To apply a masking method to cryptosystems consisting of Boolean and arithmetic operations, such as ARX (Addition, Rotation, XOR) block ciphers, a masking conversion algorithm should be used. Masking conversion algorithms can be classified into two categories: “Boolean to Arithmetic (B2A)” and “Arithmetic to Boolean (A2B)”. The A2B algorithm generally requires more execution time than the B2A algorithm. Using pre-computation tables substantially reduces the execution time of the A2B algorithm, although it requires additional space in RAM. At CHES 2012, B. Debraize proposed a conversion algorithm that somewhat reduced the memory cost of using pre-computation tables. However, it still requires $2^{k+1}$ entries of $(k+1)$ bits each, where $k$ denotes the size of the processed data. In this paper, we propose a low-memory A2B conversion algorithm whose table requires only $2^{k}$ entries of $k$ bits each. Our contributions are three-fold. First, we show specifically how to reduce the entries of the pre-computation table from $(k+1)$ bits to $k$ bits; as a result, the memory use for the pre-computation table is reduced from $2^{k+1}(k+1)$ bits to $2^{k} \cdot k$ bits. Second, we optimize the execution times of the pre-computation phase and the conversion phase, and determine that our pre-computation algorithm requires approximately half as many operations as Debraize’s algorithm. The results of the 8/16/32-bit simulation show improved speed in the pre-computation phase and the conversion phase as compared to Debraize’s results. Finally, we verify the security of the algorithm against side-channel attacks as well as the soundness of the proposed algorithm.


Introduction
Side-channel attacks exploit various types of physical leakage (including power consumption, electromagnetic radiation, and running time) during the execution of a cryptographic algorithm on a real device [1][2][3]. Differential Power Analysis (DPA), which was introduced in 1999 by P. Kocher, is a statistical method that retrieves secret keys using information leakage from the power consumption of the device [4]. Since the introduction of DPA, the importance of implementing countermeasures against side-channel attacks has only grown.
The masking method was suggested by T. Messerges to provide theoretical security against the DPA attack [5][6][7]. The masking method uses random numbers to break the relation between the algorithmic value (the value specified by the algorithm) and the processed value (the value that is actually processed by the device). The typical types of masking methods include Boolean masking and arithmetic masking [8]. Boolean masking uses an XOR (exclusive or) to blind values, such as $x' = x \oplus r$, and arithmetic masking uses an algebraic operation, such as $A = (x - r) \bmod 2^{k}$.
These two types of masking should be selectively used for cryptographic algorithms that consist of Boolean and arithmetic operations, such as ARX (Addition, Rotation, XOR) block ciphers [9][10][11], cryptographic hash functions [12,13], and stream ciphers [14]. In general, Boolean operations (AND, XOR, SHIFT, etc.) and arithmetic operations (Addition, Subtraction, Multiplication, etc.) can be efficiently computed using Boolean masking and arithmetic masking, respectively; however, it is very difficult to execute arithmetic operations in Boolean masking and to execute Boolean operations in arithmetic masking. This problem can be solved using a masking conversion algorithm between Boolean and arithmetic masking.

Related Work
The first masking conversion algorithm to counteract first-order DPA was proposed by L. Goubin in 2001 [15]. His conversion algorithm from Boolean to arithmetic masking (B2A) has been widely implemented and has seen no further improvement. In contrast, the conversion algorithm from arithmetic to Boolean masking (A2B) is slow, and its execution time can still be improved. More specifically, the B2A algorithm has a constant execution time regardless of the bit size of the processed data, while the A2B algorithm requires substantially more execution time as the bit size increases.
Coron et al. used memory to address this limitation [16]. They described the use of pre-computation tables to obtain large benefits in terms of execution time, although this requires additional space in RAM. There have also been numerous attempts to optimize the memory use [17][18][19][20][21].
Recently, B. Debraize proposed a conversion algorithm that offers a substantial reduction in the memory used by pre-computation tables [22]. Briefly, this proposal iterates the conversion phase on a k-bit value using a (k + 1)-bit table.

Our Contribution
Masking conversion algorithms are mainly used to mask ARX ciphers. Most of the ARX ciphers used in IoT environments consist of 32-bit operations. These ciphers aim to achieve fast encryption and a small code size. However, even if an ARX cipher is masked by Debraize's method, memory usage is still of concern. To overcome this problem, we propose an extremely low-memory algorithm that converts from arithmetic masking to Boolean masking.
The contributions of this paper are as follows:
• Reducing the memory usage: We further reduce the memory usage of the pre-computation table from the $2^{k+1}(k+1)$ bits achieved by Debraize's method [22] to $2^{k} \cdot k$ bits. The main idea is that the pre-computation table can be reduced by one bit based on the fact that the XOR operation is the same as subtraction on $\mathbb{F}_2$; this is the so-called LSB trick. The LSB trick has been mentioned in previous papers, but we apply it specifically to the A2B algorithms. Furthermore, we validate this intuitive fact using mathematical induction. As a result, our proposal saves one bit in each table entry. Our proposal also allows for compact memory usage on real devices. For example, if k is 8, our table can be stored using the char data type.
• Reducing the execution time: We reduce the execution time of the pre-computation phase to approximately half of that achieved by Debraize's method. When measuring the execution time of algorithms that use a pre-computation table, some researchers have focused only on the encryption parts. However, in the real world, the execution time of the pre-computation phase cannot be disregarded. We design a new pre-computation algorithm that minimizes the number of for-loop iterations and the number of operators in each iteration. Our evaluation shows that our proposal requires approximately half of the operations of Debraize's pre-computation algorithm. In addition, we simulate 8-bit, 16-bit, and 32-bit environments. On average, the results of the simulation show improvements in speed of 87% in the pre-computation phase and of 23% in the conversion phase, compared to [22].
• Verifying the security and the soundness: We verify the security against side-channel attacks and the soundness of the proposed algorithm using mathematical induction. That is, we show that all intermediate values of the proposed algorithm are random, and the soundness is reflected by the fact that the proposed algorithm always returns the correct output for an arbitrary input.

Outline of the Paper
The rest of this paper is organized as follows. In Section 2, we introduce the conversion algorithms currently used to convert between Boolean masking and arithmetic masking. Section 3 is the core of the paper, in which we present a new conversion algorithm to convert from arithmetic masking to Boolean masking. In Section 4, we analyze the security against side-channel attacks as well as the soundness of the proposed algorithm. In Section 5, we present performance metrics for our method and Debraize's method. Finally, we conclude in Section 6.

Masking Method
Masking methods use random numbers to break the relationship between the power consumption of a crypto-device and sensitive values in a crypto-algorithm. Numerous articles have been published on a variety of masking types, such as Boolean masking [23], arithmetic masking [8], polynomial masking [24], and inner product masking [25]. Among these, Boolean masking and arithmetic masking are the most widely known masking types.
Boolean masking and arithmetic masking use the following respective formulae:
$$x' = x \oplus r_x, \qquad A = (x - r_x) \bmod 2^{k}.$$
All values are k-bit: $x$ is a sensitive value depending on the key, and $x'$, $A$, and $r_x$ are the Boolean masked value, the arithmetic masked value, and the masking value, respectively.
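As a small illustration of the two masking formulae, the following Python sketch masks and unmasks a value in both ways. The word size K = 8 and all function names are assumptions chosen for this example; they do not come from the paper.

```python
import secrets

K = 8                      # word size in bits (assumed for illustration)
MASK = (1 << K) - 1        # reduces results modulo 2^K


def boolean_mask(x, r):
    """Boolean masking: x' = x XOR r."""
    return x ^ r


def boolean_unmask(x_masked, r):
    """Recover x from x' = x XOR r."""
    return x_masked ^ r


def arithmetic_mask(x, r):
    """Arithmetic masking: A = (x - r) mod 2^K."""
    return (x - r) & MASK


def arithmetic_unmask(a, r):
    """Recover x from A = (x - r) mod 2^K."""
    return (a + r) & MASK


if __name__ == "__main__":
    x = 0x5A                               # stand-in for a sensitive value
    r = secrets.randbelow(1 << K)          # fresh random mask
    print(boolean_unmask(boolean_mask(x, r), r) == x)
    print(arithmetic_unmask(arithmetic_mask(x, r), r) == x)
```

In both cases the device only ever handles the pair (masked value, mask), never x itself.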
Boolean and arithmetic masking are used for algorithms that consist of Boolean and arithmetic operations, such as ARX block ciphers [9][10][11], cryptographic hash functions [13], and stream ciphers [14]. More specifically, additions are computed on the arithmetic masking, and SHIFT, XOR, and AND are computed on Boolean masking. In this case, a conversion algorithm between arithmetic and Boolean masking is necessary, and this algorithm must be secure against DPA attacks.
The first masking conversion algorithm was introduced by T. Messerges [26], but vulnerabilities against DPA attacks were discovered in it. At CHES 2001, L. Goubin proposed an improved masking conversion algorithm that guaranteed security against DPA attacks [15]. That algorithm can convert from Boolean to arithmetic masking with only 5 XORs and 2 subtractions. This algorithm has recently been extended to higher-order masking schemes [20,27]. According to [20,27], these countermeasures have time complexity $O(n^{2} \cdot k)$ and $O(2^{n})$, respectively, for n shares with security against t-th order DPA attacks.
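The operation count of 5 XORs and 2 subtractions can be made concrete with a short sketch. The code below follows the commonly cited form of Goubin's B2A conversion, which relies on the function $r \mapsto (x' \oplus r) - r$ being affine over $\mathbb{F}_2$. The word size K = 8, the function name `b2a_goubin`, and the optional `gamma` parameter (used here only to make testing deterministic) are assumptions of this example, not notation from the paper.

```python
import secrets

K = 8                      # word size in bits (assumed for illustration)
MASK = (1 << K) - 1


def b2a_goubin(x_masked, r, gamma=None):
    """Convert x' = x XOR r into A = (x - r) mod 2^K.

    The unmasked value x is never computed; gamma is a fresh random word.
    """
    if gamma is None:
        gamma = secrets.randbelow(1 << K)
    t = x_masked ^ gamma           # XOR 1
    t = (t - gamma) & MASK         # SUB 1
    t ^= x_masked                  # XOR 2
    gamma ^= r                     # XOR 3
    a = x_masked ^ gamma           # XOR 4
    a = (a - gamma) & MASK         # SUB 2
    a ^= t                         # XOR 5
    return a


if __name__ == "__main__":
    x, r = 0xC3, 0x5A
    a = b2a_goubin(x ^ r, r)
    print((a + r) & MASK == x)     # A + r recovers x modulo 2^K
```

The number of operations does not depend on K, which matches the constant execution time attributed to the B2A direction.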
In contrast, conversion algorithms from arithmetic to Boolean masking have been improved through various efforts.

Arithmetic to Boolean (A2B) Masking
Conversion algorithms from arithmetic to Boolean masking can be broadly classified into two types based on the concept of hardware adders: carry look-ahead-based algorithms and ripple-carry-based algorithms. The first category operates on two signals, Generation and Propagation, which indicate whether a carry is generated (both inputs are 1) or propagated (at least one input is 1). Carry look-ahead-based algorithms can be implemented with Boolean operators such as SHIFT, AND, and XOR. A conversion algorithm of the first category, presented at FSE 2015 by Coron et al., reduces the complexity to a logarithmic scale [21]. However, it is hard to speed up this algorithm using memory storage, such as pre-computation tables. On the other hand, algorithms in the second category iterate the task of adding the carry value generated in the lower stages to the upper stages. Ripple-carry-based algorithms are easy to optimize in terms of execution time by using memory. There have been various comparisons between the two categories. However, the performance of each algorithm and type of algorithm depends on the implementation environment, such as memory size, CPU architecture, and cryptographic algorithms.
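To make the Generation and Propagation signals concrete, the following sketch computes them for two unmasked k-bit inputs and ripples the carry from the lowest bit upward. It illustrates only the adder concept behind the classification; it is not a masked conversion algorithm, and the function name `ripple_carry_add` is an assumption of this example.

```python
def ripple_carry_add(a, b, k):
    """Add two k-bit integers bit by bit using generate/propagate signals."""
    g = a & b          # generate: both input bits are 1
    p = a | b          # propagate: at least one input bit is 1
    carry = 0
    result = 0
    for i in range(k):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        result |= (a_i ^ b_i ^ carry) << i       # sum bit of stage i
        g_i = (g >> i) & 1
        p_i = (p >> i) & 1
        carry = g_i | (p_i & carry)              # carry into stage i + 1
    return result                                # carry out of stage k - 1 is dropped


if __name__ == "__main__":
    print(ripple_carry_add(0x7F, 0x01, 8) == 0x80)
```

A carry look-ahead adder evaluates the same g and p signals in a tree of logarithmic depth, whereas the loop above visits the k stages one after another.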
In this paper, we propose an A2B algorithm to optimize memory storage based on the second type of algorithm.

The A2B Algorithm Based on the Ripple-Carry Adder
Ripple-carry-based A2B algorithms were first proposed by Goubin and then improved by Coron, Neiße, and Debraize by using a pre-computation table. This paper briefly summarizes the key idea of each algorithm and explains their working principles in the following sections.

Goubin's A2B Algorithm
Algorithm 1 was proposed by L. Goubin at CHES 2001. We classify it as a ripple-carry-based algorithm because, in Lines 11-16, it propagates the carry bit from the lower bits to the upper bits, as a ripple-carry adder does. Goubin's A2B algorithm iterates K times for a K-bit input, meaning that this algorithm is inefficient for large inputs. The algorithm is based on the following recursion formula:
Theorem 1 (Goubin's recursion formula [15]). If we denote $x' = (A + r) \oplus r$, we also have $x' = A \oplus u_{K-1}$, where $u_{K-1}$ is obtained from the following recursion formula:
$$u_0 = 0, \qquad u_{j+1} = 2\,\bigl[\,u_j \wedge (A \oplus r) \oplus (A \wedge r)\,\bigr] \quad (j \ge 0).$$
Algorithm 1 Goubin's A2B Algorithm [15]
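The recursion in Theorem 1 can be checked numerically. The sketch below iterates $u_{j+1} = 2[u_j \wedge (A \oplus r) \oplus (A \wedge r)]$ from $u_0 = 0$, as the recursion is usually stated for [15], and returns $A \oplus u_{K-1}$. It verifies only the functional identity: the intermediate values here are not protected by an extra random mask as in Goubin's Algorithm 1, so this is not a secure implementation. K = 8 and the function name are assumptions of this example.

```python
K = 8                      # word size in bits (assumed for illustration)
MASK = (1 << K) - 1


def a2b_by_recursion(a, r):
    """Return x' = (A + r) XOR r computed as A XOR u_{K-1}."""
    u = 0                                          # u_0 = 0
    for _ in range(K - 1):                         # produces u_1 ... u_{K-1}
        u = (2 * ((u & (a ^ r)) ^ (a & r))) & MASK
    return a ^ u


if __name__ == "__main__":
    x, r = 0x3C, 0xA7
    a = (x - r) & MASK                             # arithmetic share of x
    print(a2b_by_recursion(a, r) == x ^ r)         # Boolean share of x
```

The loop runs K − 1 times, which illustrates why the cost of this approach grows linearly with the word size.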

Coron's A2B Algorithm
As Goubin's A2B algorithm requires a number of operations that is linear in the bit size, it can become a bottleneck when implemented. Coron et al. improved the A2B algorithm by using pre-computation look-up tables. They used the pre-computation table T[.] of $(k \cdot n)$-bit size to reduce the number of iterations from K to K/k. Algorithms 2 and 3 are the improved versions given in [22]. Although Coron's algorithm takes time in the pre-computation phase, the execution time of its conversion phase is reduced in comparison to that of Goubin's algorithm. In summary, Coron's algorithm is an improvement in terms of execution time. However, it still has a disadvantage in terms of its memory usage (i.e., the table size is too large). Table 1 shows the intermediate value A of Algorithm 3 when loop i = 0. $0^{k}$ denotes k zero bits, and c denotes the carry bit at the pre-computation table T[A] in Algorithm 2. As can be verified in Table 1, the initial and final forms of the for loop are the same, i.e., it is intuitively confirmed that the algorithm operates recursively and correctly. The main idea for handling the carry bit is to blind the carry bit with a large random number γ. To omit the step of removing the masking value γ in the conversion phase, these algorithms calculate Γ in the pre-computation phase and subtract it at the beginning of the conversion phase.

Algorithm 2 Improved Coron's Table T Generation [22]
Require: None
Ensure: T[.], r, Γ
1: Generate a random k-bit r and a random ((n

(such that $A_l$ and $R_l$ have size k)

Table 1. Intermediate value of A in Algorithm 3 when loop i = 0.
Neiße et al. suggested a new method that handles carry bits using complemented values in the conversion phase. This method reduces the size of the pre-computation table. An adapted version of the Neiße-Pulkus method was proposed in [22]; however, the author claimed that this A2B algorithm is vulnerable to combination attacks that distinguish 00...00 from 11...11 (−1) using a Simple Power Analysis (SPA) attack and then recover the secret values using a DPA attack. For example, suppose that the value z at the conversion phase in [22] is extracted as 0 by the SPA attack. In this case, the most significant bit (i.e., the carry bit) is biased, and this biased bit allows for side-channel attacks. However, this vulnerability can easily be fixed. In addition, Neiße's A2B algorithm as given in [22] does not work. Algorithms 4 and 5 are versions of this algorithm that are correct and secure against combination attacks due to their use of the LSB (Least Significant Bit) trick. The LSB trick is a technique that reduces pre-computation tables by one bit based on the equivalence that the Boolean masked value $B_i$ is the same as the arithmetic masked value $A_i$ on $\mathbb{F}_2$. The LSB trick was mentioned in Neiße's paper, and we apply it specifically to the A2B algorithms. However, changing (0, −1) to (0, 1) is not perfectly safe against SPA: it has been shown that binary classification of "0" and "1" is possible using side-channel information [28]. Neiße's A2B algorithm could therefore still be exploited via side-channel vulnerabilities.
The working principle of Neiße's algorithm is that the data converted from arithmetic to Boolean masking are either $(x', R)$ or their complements $(\overline{x'}, \overline{R})$; which of the two is used is determined by z. Although this masking scheme does not mask the carry bit with a probability of 1/2, the carry bit is 0 or 1 with probabilities $(2^{k}+1)/2^{k+1}$ and $(2^{k}-1)/2^{k+1}$, respectively, for arbitrary z and r. In other words, the distribution of the carry bit is the same for any sensitive value x, and so the scheme is safe against DPA attacks. Table 2 shows the intermediate value A of Algorithm 5 when loop i = 0. For a value w, let $\tilde{w}$ denote $w$ if z = 0, and $\overline{w}$ if z = 1. These computations are based on the following equations.
As we can verify in Table 2, the initial and final forms of the for loop are the same. The algorithm operates recursively.

Table 2. Intermediate value of A in Algorithm 5 when loop i = 0.
Debraize proposed an A2B algorithm that is well optimized in terms of both the memory usage of the pre-computation table and the execution time, while remaining secure against DPA attacks. Debraize's algorithms reduce the complexity of the conversion phase by using the masked carry bit as part of the input to the table. This means that the algorithm does not incur extra costs to handle the carry bit. However, the LSB trick mentioned previously in this paper is not easy to apply directly to Debraize's algorithms. Algorithms 6 and 7 are versions of Debraize's algorithms adjusted to include the LSB trick. As shown at Line 6 in Algorithm 7, the process must update the LSB with the carry bit of the previous loop iteration. Table 3 illustrates Debraize's conversion process. Although the original algorithms are well designed in terms of execution time and memory usage, applying the LSB trick incurs additional costs.

Algorithm 6 Debraize's Table T Generation [22]
Require: None
Ensure: T[.], r, ρ
1: Generate a random k-bit r and a random bit ρ
2: for A = 0 to 2^k − 1 do
3:

(such that $A_l$ and $R_l$ have size k)
$x_i \leftarrow x_i \oplus R_l$

Table 3. Intermediate values in Debraize's conversion process (β is not yet updated in the register A).

Our Proposal
In this section, we propose an A2B algorithm that is designed for extremely low memory usage while preserving execution time. Our A2B algorithm uses only one table (like [17]), combined with a more secure management of the carry bit in which the carry bit is masked with the same probability (like [22]). The key idea is to construct a pre-computation table whose entries have the same size as the input bits. To handle the carry bit with stronger security, we designed a memory of size two, constructed to minimize the additional costs in the A2B conversion phase.

Pre-Computation Phase
Algorithm 8 is a new algorithm that generates the pre-computation tables of our A2B algorithm. Table G stores $\{(A + r) \oplus (\gamma \| r)\} \in \mathbb{F}_2^{k+1}$ at Line 3, except for its LSB. The A2B algorithm works correctly even if the pre-computation table does not store the LSB; this is the so-called LSB trick. The reason for this is that the subtraction in the arithmetic masking and the XOR in the Boolean masking are equal on $\mathbb{F}_2$. This means that the information regarding the LSB can be handled by a trick in the A2B algorithm without needing to be stored in the pre-computation table. The conversion process is described in further detail in Section 4. In summary, our pre-computation table G is constructed with k-bit entries, which can be highly advantageous in real devices that use the char type. In addition, table C is used to handle carry bits, and Γ is used to guarantee security against DPA attacks. For a carry bit c, table C is based on the following equation, with α and k denoting a random value and the size of the converted bits, respectively.
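The LSB trick for table G can be illustrated with a small sketch. For each k-bit input A, the code below computes the (k + 1)-bit value $(A + r) \oplus (\gamma \| r)$, stores it without its LSB as a k-bit entry, and then restores the missing bit from the LSB of A itself. This covers a single k-bit word only and is not a transcription of Algorithms 8 and 9: tables C and Γ are omitted, and K = 4 and the function names are assumptions of this example.

```python
K = 4                      # word size in bits (assumed for illustration)


def build_tables(r, gamma):
    """Return the full (K+1)-bit table and the K-bit table without the LSB."""
    full = []
    for a in range(1 << K):
        # (A + r) is a (K+1)-bit value whose top bit is the carry;
        # the carry is masked by gamma and the low K bits by r.
        full.append((a + r) ^ ((gamma << K) | r))
    g = [v >> 1 for v in full]         # drop the LSB: every entry fits in K bits
    return full, g


def lookup_with_lsb_trick(g, a):
    """Rebuild the (K+1)-bit entry: its LSB always equals the LSB of A."""
    return (g[a] << 1) | (a & 1)


if __name__ == "__main__":
    full, g = build_tables(r=0b1011, gamma=1)
    print(all(lookup_with_lsb_trick(g, a) == full[a] for a in range(1 << K)))
```

The trick works because the LSB of (A + r) is the XOR of the LSBs of A and r, so XORing with r again leaves exactly the LSB of A.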
The bit sizes of table C and Γ should make up a total bit size of $(k \cdot n)$ bits; the reason for this is discussed in further detail in Section 4. Algorithm 8 is remarkable not only in terms of memory usage but also in terms of execution time. Debraize's Algorithm 6 has two steps (Lines 3 and 4) inside the for loop, which means that it generates $2^{k+1}$ table entries. Quantitatively, the number of loop iterations of Algorithm 6 can be considered to be $2^{k+1}$, whereas ours has only $2^{k}$, like [17]. Therefore, our algorithm takes approximately half of the execution time required by Debraize's algorithm. In terms of the side-channel security of the carry bit, our A2B masking scheme masks the carry bit with a probability of 1/2, like [22], and the outputs of table C are uniformly distributed in $[0, 2^{k \cdot (n-1)}) \,\|\, 0^{k}$. We designed Γ to eliminate the extra step of removing α in the conversion phase.

Conversion Phase
Algorithm 9 is the proposed A2B algorithm using the pre-computation table. This algorithm divides the arithmetic masked value A into k-bit words, converts each word to a Boolean masked value $B_i$, and handles the carry bit using table C. At Line 4, the masked LSB can be computed correctly because the Boolean masked value $B_i$ is the same as the arithmetic masked value $A_i$ on $\mathbb{F}_2$. At Line 6, the carry bit t, which is masked by γ in Algorithm 8 (Line 3), is handled by table C[t], which adds α or α + 1 when the carry bit is 0 or 1, respectively. As we designed the algorithm to subtract Γ at Line 1, the α of table C is removed without the need for any extra steps. It is also worth noting that our algorithm does not require modulus operations. In real devices, bits that overflow the register width are discarded automatically. Based on this, our algorithm proceeds with the A2B conversion from the least significant word to the most significant word without any modulus operations.

Security Analysis and Soundness of Algorithm
When proposing a new countermeasure, two crucial points are the security analysis and the soundness. First, to achieve security against first-order DPA attacks, we prove that all intermediate values processed by our A2B algorithm are masked by random numbers. Namely, if the intermediate values of the algorithm are uniformly distributed random numbers (i.e., masked values), the algorithm can be considered secure against first-order DPA attacks. Second, soundness means that an algorithm achieves its goal for arbitrary inputs in every case. We use mathematical induction to prove the soundness of our countermeasure. We first analyze the security of our A2B algorithm. To this end, we enumerate all intermediate values of our algorithms (Algorithms 8 and 9), then verify whether each of these values is masked by a random number.
Algorithm 8 computes the pre-computation tables. The data it mainly handles are the masking values. This phase is a good target for horizontal correlation attacks [29,30]. However, such attacks can be effectively countered with shuffling and dummy operations. This algorithm is computed without any sensitive values such as the key. The only information attackers can gain is the masking value. To recover the key, both the pre-computation phase and the conversion phase must be measured.
These two probings deviate from the assumption of first-order DPA. That is, a higher-order DPA would be required to recover the secret key from the pre-computation and conversion algorithms.
Algorithm 9 is the A2B algorithm. We prove the randomness of each intermediate value as shown in Table 4. Table 4 shows how the sensitive value x and the random values r, R, γ, and α are combined in $V_i$. $0^{l}$ denotes l zero bits. $V_i$ denotes the intermediate value at A in Algorithm 9, where i is the line number, i.e., $V_1$, $V_3$, $V_6$, and $V_7$ correspond to Lines 1, 3, 6, and 7, respectively.

• $V_1$ is constructed with the masking value R, the masking value r of table G, and the masking value α of table C. α does not mask the lower k bits. $V_1$ is uniformly distributed because of R, so it guarantees side-channel security.
• $V_3$ is the concatenation of the upper part, which is like $V_1$; the middle part, which is $(x - r)$, obtained by adding $R_i \cdot 2^{i \cdot k}$; and the lower part, which is always filled with 0s by Line 7. Since the masking values R and r are independent, this distribution is uniformly random.
• $V_4$ and $V_5$ are values that have been converted from arithmetic to Boolean masking by table G. The carry bit is masked with γ, and the rest are masked with r. These lines are secure because γ and r are independent random numbers.
• $V_6$ is constructed with the upper part like $V_1$; the upper-middle part, which is $(x + c - R - r)$; the lower-middle part like $V_3$; and the lower part as 0s. The masking value α of the upper-middle part is removed by adding table C. The security against side-channel attacks is explained by the same principle as that of $V_3$.
• $V_7$ is the process of clearing the lower part to 0s; this step is important for side-channel safety. If the algorithm omits this clearing process, the masking values of the lower part appear in a concatenated form such as r||r. In this case, the distribution of the lower part is not uniform. As a result, the distribution of the side-channel signals is determined by sensitive data. In other words, the algorithm cannot provide security against side-channel attacks, and therefore the clearing process must be performed.
We now use mathematical induction to prove the soundness of the proposed algorithm. In detail, we observe the changes in the internal state at i = 0 (the first loop) in the Base case. There, we prove the operating principles of the pre-computation tables by Lemmas 1 and 2. In the Inductive hypothesis, we define the state when loop i = a; this can be easily inferred from the form of the Base case. Finally, in the Inductive step, we claim that the proposed A2B algorithm works correctly for arbitrary k and n. $X_{i}^{j}$ denotes the partial bits of X from the i-th to the j-th bit. For example, $(R_{k}^{n-1} \,\|\, 0_{0}^{k-1})$ means the concatenation of the bits of R from the (n − 1)-th to the k-th bit with the remaining bits, which are zeros.
Theorem 2. The proposed A2B algorithm for arbitrary k, n works correctly.

Proof.
Base case: The algorithm works correctly in the initial state (i = 0), as shown in the following steps.
We must verify that the A2B operation at Lines 4 and 5 is correct. Refer to the following Lemma 1:
Lemma 1. At Lines 4 and 5, the arithmetic masked value $A_i$ is correctly converted from arithmetic masking to Boolean masking with pre-computation table G.
Proof. The value of table G is as follows.
In the LSB, the Boolean masked value ($x' = x \oplus r$) is the same as the arithmetic masked value on $\mathbb{F}_2$ ($A = x - r \bmod 2$), based on the following definition.
Remark 1. Exclusive or (XOR) is defined in arithmetic as modulo-2 addition/subtraction.

In the above definition, the LSB of $x \oplus r$ equals the LSB of $(x - r) \bmod 2^{k}$.
The important implication of this equation is that we can calculate without storing the LSB in the masking conversion process. This is the so-called LSB trick. Line 4 is summarized as follows.
Therefore, Lines 4 and 5 are valid.

Performance Analysis
We summarize the performance of the proposed algorithm, Debraize's algorithm [22], and Neiße's algorithm [17] in Table 6. Like [17], the proposed algorithms have advantages in terms of the memory usage and the execution time of the pre-computation phase. The execution time of the A2B conversion phase with the LSB trick also shows good performance in various environments.
To summarize, one advantage in terms of memory usage is that the proposed algorithm uses approximately half the memory in the pre-computation phase compared to Debraize's algorithm. Another advantage in terms of memory usage is that the data type of our pre-computation table is compact and practical in real devices, because typical devices only support char, short, and int. The advantage in the execution time of the pre-computation phase is that it requires approximately half the time of Debraize's algorithm, as the table has only half as many entries.
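The table sizes quoted above can be tabulated with a short calculation. The sketch below compares $2^{k+1}$ entries of $(k+1)$ bits with $2^{k}$ entries of $k$ bits, both as raw bit counts and as bytes when each entry is stored in the smallest of an 8-, 16-, or 32-bit type. That storage model and all function names are assumptions of this example, and the small tables C and Γ are not counted.

```python
def debraize_table(k):
    """(number of entries, bits per entry) for the table of [22]."""
    return 1 << (k + 1), k + 1


def proposed_table(k):
    """(number of entries, bits per entry) for the proposed table G."""
    return 1 << k, k


def table_bits(entries, entry_bits):
    """Raw table size in bits."""
    return entries * entry_bits


def storage_bytes(entries, entry_bits):
    """Size in bytes when each entry uses the smallest 8/16/32-bit type."""
    for width in (8, 16, 32):
        if entry_bits <= width:
            return entries * width // 8
    raise ValueError("entry does not fit in a 32-bit type")


if __name__ == "__main__":
    for k in (4, 8):
        for name, shape in (("Debraize", debraize_table(k)),
                            ("proposed", proposed_table(k))):
            print(k, name, table_bits(*shape), "bits,",
                  storage_bytes(*shape), "bytes")
```

For k = 8, a 9-bit entry no longer fits in a char, so under this storage model the gap in bytes is wider than the gap in raw bits.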
In the conversion phase, the numbers of ADD/SUB, XOR/AND, and SHIFT operations are very similar in the three algorithms; however, the proposed algorithm uses slightly fewer operators. We counted the number of operators according to the following rules.

- In Algorithm 5, the concatenating process at Line 6 (front) as one AND.
- In Algorithm 5, the concatenating process at Line 6 (rear) as one SHIFT, one AND, and one XOR.
- In Algorithm 6, the concatenating process at Lines 3 and 4 (front) as one SHIFT and one XOR.
- In Algorithm 6, the concatenating process at Lines 3 and 4 (rear) as one SHIFT and one XOR. (Computed only once)
- In Algorithm 7, the splitting process at Line 4 as two SHIFTs and two ANDs.
- In Algorithm 7, the modulus at Line 5 as one AND.
- In Algorithm 7, the concatenating at Line 6 (front) as one SHIFT and one AND.
- In Algorithm 7, the concatenating at Line 6 (T[β||A l ]) as one SHIFT, one AND, and one XOR.
- In Algorithm 7, the concatenating at Line 6 (rear) as one SHIFT, one AND, and one XOR.
- In Algorithm 8, the concatenating process at Line 3 as one SHIFT and one XOR. (Computed only once)
- In Algorithm 9, the concatenating process at Line 4 (front) as one SHIFT and one AND.
- In Algorithm 9, the concatenating process at Line 4 (rear) as one SHIFT, one AND, and one XOR.

Table 5. The Inductive Step.