Highly Efficient SCA-Resistant Binary Field Multiplication on 8-Bit AVR Microcontrollers

Abstract: Binary field (BF) multiplication is a basic and important operation for widely used crypto algorithms such as the GHASH function of GCM (Galois/Counter Mode) and NIST-compliant binary Elliptic Curve Cryptosystems (ECCs). Recently, Seo et al. proposed a novel SCA-resistant binary field multiplication method in the context of GHASH optimization in AES-GCM on 8-bit AVR microcontrollers (MCUs). They proposed the concept of a Dummy XOR operation with a set of garbage registers and the concept of instruction level atomicity (ILA) for resistance against Timing Analysis (TA) and Simple Power Analysis (SPA), and used a Karatsuba Block-Comb multiplication approach for efficiency. Even though their method achieved a large performance improvement compared with previous works, it still has room for improvement on the 8-bit AVR platform. In this paper, we propose an improved binary field multiplication method on 8-bit AVR MCUs. Our method adopts the Dummy XOR technique with garbage registers for TA and SPA security; however, we reduce the number of garbage registers from eight to one by using the fact that the number of garbage registers does not affect TA and SPA security. In addition, we apply a multiplier encoding approach to decrease the number of registers required when accessing the multiplier, which enables the use of an extended block size in the Karatsuba Block-Comb multiplication technique. The proposed technique extends the block size from four to eight, and the proposed binary field multiplication method can compute a 128-bit BF multiplication in only 3816 clock cycles (cc) (resp. 3490 cc) with (resp. without) the multiplier encoding process, which is almost a 32.8% (resp. 38.5%) improvement over the 5675 cc of the best previous work. We apply the proposed technique to the GHASH function of the GCM mode with several additional optimization techniques. The proposed GHASH implementation improves performance by over 42% compared with the previous best result. The concept of the proposed BF method can be extended to other MCUs, including 16-bit MSP430 MCUs and 32-bit ARM MCUs.


Introduction
Binary field (BF) multiplication is an important and the most time-consuming arithmetic operation in several widely used cryptographic algorithms, including the Galois/Counter Mode (GCM) of operation and NIST-compliant binary Elliptic Curve Cryptosystems (ECC). For example, the central computation of the GHASH function in GCM is a sequence of 128-bit BF multiplications, and BF multiplications also occupy almost 80% of the running time of binary Elliptic Curve Cryptosystems, such as ECDH (Elliptic Curve Diffie-Hellman), ECDSA (Elliptic Curve Digital Signature Algorithm), ECIES (Elliptic the available registers on 8-bit AVR MCUs, we apply a multiplier encoding approach that further reduces the number of registers needed to access the multiplier during multiplication from s to just one (where s is the block size of the Block-Comb multiplication method). Thus, we can expand the block size of the secure Block-Comb multiplication method from four to eight, the known maximum block size on 8-bit AVR MCUs. As a result, we can decrease the number of partial multiplications from nine to three when calculating a field multiplication over $GF(2^{128})$, which results in a large performance improvement.

Research Contributions
The following are our summarized contributions.

1. Presenting an enhanced secure Block-Comb multiplication method on 8-bit AVR MCUs. We present an enhanced secure Block-Comb multiplication method on 8-bit AVR MCUs. Through experiments with SPA trace analysis and security analysis using clustering algorithms, we show that the number of garbage registers used does not affect SPA security. Based on this fact, we configure our method to use a single garbage register, which saves seven registers. In order to further extend the block size of our Block-Comb method, we apply a multiplier encoding technique that optimizes register usage on 8-bit AVR MCUs. As a result, we extend the block size of the secure Block-Comb multiplication method from four (32-bit wise) to eight (64-bit wise), identical to the maximum block size on AVR MCUs, which significantly decreases the number of partial multiplications from nine to three when calculating a BF multiplication over $GF(2^{128})$. The proposed method can be a building block for binary field multiplication in the GHASH function of GCM and NIST-compliant binary ECC.

2. Implementing the proposed secure Block-Comb multiplication method on an ATmega128 MCU. By implementing it on an 8-bit ATmega128, we show that our proposed multiplication technique consumes much less running time. In our method, the basic multiplication unit is 64-bit wise, and we apply an enhanced Karatsuba technique when calculating a 128-bit wise BF multiplication. The proposed method takes 3816 cc, including the multiplier encoding process, when computing a 128-bit BF multiplication, which is almost 32.8% faster than Seo et al.'s method (which consumes 5675 cc), while providing TA/SPA security. Without the multiplier encoding process, the proposed method takes 3490 cc, which is almost a 38.5% improvement over Seo et al.'s method.

3. Application to the GHASH function of GCM mode. We show how to apply the proposed BF method to the GHASH function of GCM. Even though our method requires a multiplier encoding process before executing a BF multiplication, this process can be omitted in the context of the GHASH function. In the 128-bit binary field multiplication of the GHASH function, one of the two inputs is fixed as the hash key. Thus, we can apply multiplier encoding to the hash key once, store the encoded hash key in memory before starting the GHASH function, and reuse it throughout the GHASH computation.
In the GHASH function, the inputs and output of BF multiplication need to be bit-reflected, which requires three bit-reflection operations per BF multiplication. We propose a technique that can decrease the count of bit-reflection from three to one. We present an enhanced GHASH function implementation on ATmega128, and it provides an improved performance by over 42% compared to the previous best result.

Comparison to the Previous Work
Even though the previous work first introduced the concept of a dummy XOR operation with garbage registers and instruction level atomicity (ILA) [10], its performance was still low. This is because the work in [10] utilized a block size of four and the naive Karatsuba technique. Furthermore, the process of the GHASH function was not optimized in [10]. Whereas the basic concept of our current work is similar to that of the previous work, our proposed method applies several optimization techniques that are crucial for improving the performance of binary field multiplication and of the GHASH function in the GCM mode of operation. Firstly, we have shown through in-depth experiments that the number of garbage registers does not affect SPA security. By reducing the number of garbage registers from eight to one and applying a multiplier encoding technique, the proposed method achieves block size eight, which is the well-known maximum block size of the Block-Comb method on 8-bit AVR MCUs. Furthermore, by further optimizing the execution of the Karatsuba technique and the process of the GHASH function, our method achieves a significant performance improvement. As a result, the proposed method achieves a performance improvement of over 42% compared with the previous best result while providing the same level of security.
The remainder of our paper is organized as follows. Section 2 introduces the characteristics of 8-bit AVR microprocessors and multiplication over a binary field. Section 3 describes existing BF multiplication algorithms on 8-bit AVR MCUs. Section 4 presents the proposed secure and efficient multiplication method over a binary field on 8-bit AVR MCUs. The proposed methods are also analyzed with respect to performance and security in Section 4. Section 5 describes the proposed GHASH function implementation with the proposed multiplication method on 8-bit AVR MCUs. Section 6 concludes the paper with remarks on future work.

Related Works
This section introduces the characteristics of 8-bit AVR MCUs regarding the number of registers, memory size, and AVR instruction set. Then, we describe the basics of the BF multiplication methods. There are two main categories of BF multiplication approaches: LookUp Table-based (LUT-based) approaches and Block-Comb-based (BC-based) approaches. A detailed description of the existing multiplication methods over a binary field on 8-bit AVR MCUs will be given in Section 3.

Eight-Bit AVR Microcontrollers and Notations
Currently, devices using 8-bit AVR MCUs are broadly used for diverse applications, such as RFIDs, smartcards, embedded controllers, and wireless sensor nodes. Typically, 8-bit AVR MCUs, including our target platform ATmega128, contain 32 general-purpose registers $(R_{31}, \ldots, R_1, R_0)$. Among the 32 registers, six are used as memory address pointers: the pairs $(R_{26}, R_{27})$, $(R_{28}, R_{29})$, and $(R_{30}, R_{31})$ are aliased as the X, Y, and Z pointer registers, respectively [13]. Because their architecture is based on the Harvard architecture, AVR MCUs have separate memory spaces and buses for data and program instructions, with a simple single-issue pipeline. There are 133 instructions in total, and each instruction typically executes with constant latency. For instance, logical/arithmetic instructions (e.g., ROR (rotate right through carry), LSL (logical shift left), EOR (bit-wise XOR), ADD (arithmetic add), and so forth) are executed within a single clock cycle, while instructions related to memory accesses (e.g., ST (store from register to memory), LD (load from memory to register), and so on) consume two clock cycles [13]. In the case of conditional branch instructions, the number of clock cycles depends on whether the tested condition is true or not. For instance, in the case of SBRS (skip next instruction if bit in register is set), if the condition is true, it takes two or three cycles depending on the word size of the skipped instruction. Otherwise, it consumes one cycle. The memory and computation capabilities of 8-bit AVR MCUs are limited. For example, an 8-bit ATmega128 MCU has 4 Kbytes of RAM and 128 Kbytes of ROM memory, and its running clock speed is 7.3728 MHz. In contrast to state-of-the-art ARM MCUs and Intel CPUs, which provide a carry-less multiplier or a generic binary field hardware multiplier, AVR MCUs still do not embed a dedicated hardware multiplier.
Throughout our paper, the following notations are used. The general-purpose registers are represented as R, and $R_i$ denotes the i-th general-purpose register, where 0 ≤ i ≤ 31. The

Multiplication over Binary Field
Binary Field (BF) multiplication is a core operation of several cryptographic algorithms, such as the GHASH function of GCM and NIST-compliant binary elliptic curve operations. For example, in the GHASH function of GCM, BF multiplications are executed with input operands as associated data blocks or ciphertext blocks, and a secret constant hash key H. In the case of binary ECC, scalar multiplication is the most performance-critical part of the entire ECC-based protocols and its almost 80% running time comes from BF multiplications. Thus, the performance of BF multiplication needs to be optimized as much as possible.
BF multiplication computes

$$A(z) \cdot B(z) = \left(\sum_{i=0}^{m-1} a_i z^i\right) \cdot \left(\sum_{i=0}^{m-1} b_i z^i\right),$$

where m is the degree of the underlying binary extension field. In the above notation, the multiplicand and the multiplier are represented as A and B, respectively. The result of the multiplication can be represented as

$$C(z) = \sum_{i=0}^{2m-2} c_i z^i.$$

The most basic algorithm for a multiplication over a binary field is the Shift-and-Add method. It scans the multiplier from the LSB (Least Significant Bit, the 0-th bit) to the MSB (Most Significant Bit, the (m − 1)-th bit). At every bit position, the multiplicand A is shifted left by one bit (that is, multiplied by z), and if the i-th bit of the multiplier B, $b_i$, is 1, the accumulator is XORed with $A \cdot z^i$. The Comb multiplication algorithm, which is the basis of both LUT-based and Block-Comb-based multiplication algorithms, improves the performance of binary field multiplication. Comb multiplication algorithms make use of the fact that $A \cdot z^{Wj+k}$ can be easily obtained by appending j zero words to the right side of the vector representation of $A \cdot z^k$, once $A \cdot z^k$ has been computed for some k ∈ [0, W − 1] (W is eight in the case of 8-bit AVR). Therefore, the Comb method decreases the number of shift operations compared with the Shift-and-Add method. Two categories of Comb methods exist: the RtL version and the LtR version. While the RtL version of the Comb method scans the multiplier from LSB to MSB, the LtR version operates the other way round [2,14].
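To make the difference between the two methods concrete, the following is a minimal Python sketch (not the paper's AVR code) that models binary polynomials as integers, with bit i holding the coefficient of $z^i$. The function names `shift_and_add` and `comb_rtl` are illustrative assumptions, and no modular reduction is performed.

```python
W = 8  # word size in bits on 8-bit AVR


def shift_and_add(a: int, b: int, m: int) -> int:
    """Multiply two binary polynomials of degree < m, without reduction."""
    acc = 0
    for i in range(m):              # scan the multiplier from LSB to MSB
        if (b >> i) & 1:
            acc ^= a << i           # accumulate A * z^i
    return acc


def comb_rtl(a: int, b: int, m: int) -> int:
    """Right-to-left Comb: A is shifted once per bit position k, not once per bit."""
    words = (m + W - 1) // W
    acc = 0
    for k in range(W):              # bit position inside every word
        for j in range(words):      # word index of the multiplier
            if (b >> (W * j + k)) & 1:
                # A * z^(W*j + k): on hardware this is A * z^k stored j words higher
                acc ^= a << (W * j)
        a <<= 1                     # A * z^k becomes A * z^(k+1)
    return acc
```

In this model the Comb variant performs only W − 1 useful bit-shifts of the multiplicand for any operand length, because the offset `W * j` corresponds to word addressing rather than an actual shift on the MCU.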

Multiplication Methods over GF(2 m ) on 8-bit AVR MCUs
Until now, many studies have been conducted on optimizing the performance of BF multiplication on 8-bit AVR platforms [2-8]. They can be categorized into two main approaches: LookUp Table-based (LUT-based) approaches [2-5] and Block-Comb-based (BC-based) approaches [6-9]. Table 1 summarizes the existing results, and the details are explained in Sections 3.1 and 3.2. Because the number of accessible registers is constrained on 8-bit AVR, many memory accesses take place. Namely, among the 32 general-purpose registers, only 26 are accessible for calculating a BF multiplication, excluding the six registers used as memory address pointers. For instance, at least 64 registers would be necessary to hold the whole multiplier, multiplicand, and multiplication result. However, due to the limited number of available registers, only certain parts of the operands can be kept in the registers, which generates a huge number of redundant memory accesses. Therefore, on 8-bit AVR, the major goal of existing research on binary field multiplication methods is minimizing redundant memory accesses by optimizing the use of the available registers.

Table 1. Existing implementations of BF multiplication on 8-bit AVR MCUs.

| Authors | Method | Application | Field | Clock cycles | Security |
|---|---|---|---|---|---|
| Seo et al. [9] | Enhanced Karatsuba Block-Comb | ECC | $GF(2^{233})$ | 6896 | none |
| Liu et al. [11,15] | Masked Block-Comb | GCM | $GF(2^{128})$ | 14,445 | TA |
| Seo et al. [10] | Block-Comb with Dummy XOR and ILA | GCM | $GF(2^{128})$ | 5675 | SPA, TA |

Look-Up Table-Based Methods
To enhance the performance of field multiplications of the GHASH function in GCM, McGrew et al. first presented a table-based method in their GCM implementation, using different table sizes to trade off computational speed against memory consumption [16,17]. They used tables of 256 bytes, 4 Kbytes, 8 Kbytes, and 64 Kbytes and measured the performance on a 32-bit Motorola G4 device. Although their table-based methods are efficient in terms of computational speed, their memory consumption is too large for 8-bit AVR MCUs. Therefore, when implementing the GCM algorithm on resource-limited embedded devices, including AVR and MSP430, researchers have usually taken advantage of López et al.'s Look-Up Table multiplication method [3-5], which was originally aimed at field multiplication for binary elliptic curve operations [2,14].
López et al.'s LUT-based technique is an extended version of the LtR Comb technique (it is called the wLtR Comb technique) [2,14]. The wLtR Comb technique processes the multiplier w bits at a time rather than one bit at a time, at the cost of building a precomputation table. Thus, it can reduce the number of bit operations, such as bit XOR and shift operations [2,4,14,18]. At the beginning of the multiplication, it computes all possible results of $A \cdot u(z)$ for all polynomials u(z) of degree at most w − 1 and stores them in a precomputation table. Then, in the actual multiplication process, the multiplier B is scanned w bits at a time from left (MSB) to right (LSB), and the corresponding value from the precomputation table is chosen. Namely, the corresponding value from the table is XORed with the intermediate value in the accumulator without actual computation. On 8-bit AVR MCUs, it is widely believed that 4 bits is the most favorable width w for the wLtR Comb technique. In that case, it uses 16 × m bits of RAM to hold the precomputation table consisting of the sixteen multiplication results from $0 \cdot A$ to $(z^3 + z^2 + z + 1) \cdot A$. This LUT-based method and its variants have been broadly implemented on 8-bit AVR devices [3-5]. For instance, 163-bit binary field multiplication was implemented in the NesC language on an ATmega128 MCU in Seo et al.'s work [3], and it was improved by integrating two iterations of the main loop into one, which decreases the number of redundant memory accesses. Seo et al. obtained 19,670 cc as the timing result for a field multiplication over $GF(2^{163})$, which was a 21.1% improvement. In [4,5], Aranha et al. proposed the concept of a rotating register technique in the wLtR Comb method, which greatly decreases the number of redundant memory accesses necessary for executing a multiplication.
Their method was implemented in AVR assembly language, and they reported 4508 cc, 8314 cc, and 11,727 cc for calculating a multiplication over $GF(2^{163})$, $GF(2^{233})$, and $GF(2^{271})$, respectively. LUT-based multiplication methods not only give good performance but are also resistant against both TA and SPA. However, owing to their large number of memory accesses, they are vulnerable to side channel attacks that use information about the memory address [11,12,19]. In [11,15], Liu et al. successfully analyzed the wLtR Comb multiplication technique with a kind of horizontal correlation analysis [12]. Namely, they could recover the indices used for accessing the LUT by using the correlation between the power consumption traces from building the lookup table and those from referencing the LUT elements during the multiplication.
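The table-driven scan described in this section can be sketched as follows. This is a simplified Python model under the assumption that polynomials are stored as integers (bit i is the coefficient of $z^i$); the names `wltr_comb` and `reference_mul` are illustrative and not taken from the cited implementations, and register rotation and memory layout are not modelled.

```python
W = 8  # register width in bits on 8-bit AVR


def reference_mul(a: int, b: int) -> int:
    """Plain shift-and-add carry-less product, used only as a cross-check."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def wltr_comb(a: int, b: int, m: int, w: int = 4) -> int:
    """Windowed left-to-right Comb product of two polynomials of degree < m.

    Assumes that w divides W.
    """
    # Precomputation: table[u] = A * u(z) for every u(z) of degree < w.
    table = [0] * (1 << w)
    for u in range(1, 1 << w):
        low = u & -u                                   # lowest set bit of u
        table[u] = table[u ^ low] ^ (a << (low.bit_length() - 1))
    words = (m + W - 1) // W
    mask = (1 << w) - 1
    acc = 0
    for k in reversed(range(W // w)):                  # window inside each byte, MSB first
        for j in range(words):                         # byte index of the multiplier
            u = (b >> (W * j + w * k)) & mask
            acc ^= table[u] << (W * j)                 # one lookup replaces w bit tests
        if k:
            acc <<= w                                  # one w-bit shift per window
    return acc
```

Note that the table index `u` is derived directly from the multiplier, which is the address-dependent behaviour that the horizontal correlation analysis mentioned above exploits.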

Block-Comb Based Multiplication Methods
As an alternative to binary field multiplication using a LUT, the Block-Comb (BC) multiplication method was originally proposed for efficient field multiplication of the $\eta_T$ pairing over $GF(2^{239})$ on an ATmega128 MCU [6]. In the BC multiplication method, both the multiplier and the multiplicand are partitioned into s-byte blocks. Then, the partial products of the generated blocks are calculated in a column-wise fashion. Each of the partial products is calculated with the LtR Comb method for performance efficiency. In the BC multiplication method, the set of accessible registers is partitioned into three parts of s registers, s registers, and 2s + 1 registers, which are used for the multiplicand, the multiplier, and the result of the partial multiplication, respectively. Because the intermediate results are kept in the 2s + 1 working registers, the results of partial multiplications positioned in the same column can be accumulated directly in the registers without accessing memory, which decreases the number of redundant memory accesses. In [6], Shirase et al. concluded that six is the optimum block size s, based on the fact that (4s + 1) < 26 on 8-bit AVR MCUs. Shirase et al.'s BC multiplication method calculates a multiplication over $GF(2^{239})$ in 9511 clock cycles (cc).
Seo et al. [7] proposed the Unbalanced Block-Comb multiplication method (UBC), which can expand the block size from 6 to 7 for a multiplication over $GF(2^{163})$. They exploited the fact that the tested bits of the multiplier become unnecessary during the partial product computation. In other words, they recycled the register used to keep the multiplier in order to hold the most significant byte of the multiplicand. Consequently, the expanded block size decreases the number of partial multiplications from sixteen to nine when calculating a field multiplication over $GF(2^{163})$. Note that block size 7 (resp. block size 6) partitions a 163-bit field element into 3 blocks (resp. 4 blocks). Seo et al.'s UBC could calculate a 163-bit binary multiplication within 4546 cc. Then, Seo et al. [8] presented the so-called Karatsuba Block-Comb multiplication method (KBC), an integration of the Karatsuba technique and the Block-Comb multiplication approach, which decreases the number of partial multiplications from nine to six at the cost of several low-cost field additions when calculating a multiplication over $GF(2^{163})$. KBC achieved 3274 cc for a multiplication over $GF(2^{163})$. They also proposed a constant-time version of their Karatsuba Block-Comb multiplication technique. Although it is resistant against timing attacks, it can still be attacked by simple power analysis. In 2018, Seo et al. [9] proposed an enhanced version of the Karatsuba Block-Comb (EKBC) multiplication technique by applying a new multiplier encoding method, which greatly decreases the number of registers necessary for keeping the multiplier. In addition, they showed that with their proposed technique, the maximum block size of the Block-Comb multiplication method could be 8 on 8-bit AVR MCUs. As a result, they achieved a new timing record for NIST-compliant K-233 elliptic curve scalar multiplication. To date, EKBC is regarded as the fastest multiplication method over a binary field on 8-bit AVR MCUs.
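The effect of the block size on the number of partial multiplications can be checked with a small functional model. The sketch below is an illustration in Python, assuming integer-encoded polynomials; `block_comb` and `clmul` are hypothetical names, the inner partial product is a plain carry-less multiply rather than a register-level LtR Comb, and no Karatsuba step is applied.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def block_comb(a: int, b: int, m_bits: int, s: int):
    """Split both operands into s-byte blocks and accumulate column by column.

    Returns the product and the number of partial multiplications performed.
    """
    block_bits = 8 * s
    n_blocks = -(-m_bits // block_bits)                # ceiling division
    mask = (1 << block_bits) - 1
    blocks_a = [(a >> (block_bits * i)) & mask for i in range(n_blocks)]
    blocks_b = [(b >> (block_bits * i)) & mask for i in range(n_blocks)]
    acc = 0
    partial_products = 0
    for col in range(2 * n_blocks - 1):                # column-wise traversal
        for i in range(n_blocks):
            j = col - i
            if 0 <= j < n_blocks:
                acc ^= clmul(blocks_a[i], blocks_b[j]) << (block_bits * col)
                partial_products += 1
    return acc, partial_products
```

For 163-bit operands this model performs 16 partial multiplications with s = 6 and 9 with s = 7, matching the counts quoted above for UBC.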

Secure Block-Comb Multiplication Methods
Since Block-Comb multiplication methods do not utilize any Lookup Table, they have resistance against a sort of horizontal correlation analysis [11,15], which was used for analyzing LUT-based methods. However, they are vulnerable to TA and SPA because they contain a conditional branch. Algorithm A1 in Appendix A shows a simple 56-bit wise Block-Comb multiplication method. Steps 11-15 are executed only when the l-th bit of the multiplier is 1, which is the source of TA and SPA vulnerability.
Most recently, Seo et al. proposed a novel secure Block-Comb method that is resistant against TA and SPA (in addition, their method is resistant against horizontal CPA because it does not use any lookup table) for a secure GCM implementation on 8-bit AVR MCUs [10]. Similar to MBC, Seo et al.'s method makes use of a 32-bit wise Block-Comb multiplication method. To make the 32-bit wise Block-Comb multiplication method secure against SPA, they introduced the concept of Dummy XOR operations with a set of garbage registers. In other words, the number of registers for the accumulator C is doubled from 8 registers to 16 registers $(R_{15}, \ldots, R_0)$. Thus, their method uses 25 registers in total ($(R_{15}, \ldots, R_0)$ for the accumulator C, $(R_{20}, \ldots, R_{16})$ for the multiplicand A, and $(R_{24}, \ldots, R_{21})$ for the multiplier B), which is acceptable on 8-bit AVR MCUs with 32 registers. Among the registers $(R_{15}, \ldots, R_0)$, the set $(R_7, \ldots, R_0)$ plays the role of the garbage registers, and the set $(R_{15}, \ldots, R_8)$ holds the real intermediate result of the multiplication. With the Dummy XOR operations, the multiplicand is XORed at a different position depending on the value of the tested bit. If the tested bit is 0, the registers $(R_{20}, \ldots, R_{16})$ containing the multiplicand are XORed with the garbage registers $(R_7, \ldots, R_0)$. Otherwise, the same registers are XORed with the real accumulator $(R_{15}, \ldots, R_8)$. Since the registers holding the real multiplicand values are XORed into a register set in both cases, the power consumption patterns of the two cases are not distinguishable from each other with respect to SPA. In order to make this Dummy XOR concept secure against TA as well, they introduced the concept of instruction level atomicity (ILA).
On 8-bit AVR MCUs, branch instructions typically consume a different number of clock cycles depending on whether the tested condition is true or not. For instance, if the condition is false (resp. true), a branch usually takes 1 clock cycle (resp. 2 clock cycles). They identified that the main role of the branch instruction is to increment the program counter (PC) depending on whether the condition is true or not, and used a dummy ADD instruction to fill the timing difference. Even though their method uses the SBRS branch instruction, the timing difference is hidden by the dummy ADD instruction. Seo et al. use their 32-bit wise Block-Comb multiplication method for calculating a 128-bit BF multiplication. For efficiency, they applied a two-level Karatsuba technique, which requires nine 32-bit partial products, each computed by their proposed Block-Comb multiplication method. They reported the timing cost of a 128-bit BF multiplication as 5675 cc.
Require: 32-bit multiplicand A and 32-bit multiplier B
Ensure: 64-bit result C = A · B
1: for k = 7 to 0 do
...
    C ← C ≪ 1
10: end for
11: Return C

Table 1 shows the existing implementations of BF multiplication on 8-bit AVR MCUs.
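The data flow of the Dummy XOR idea described in this section can be illustrated with a functional Python model. This is only a sketch under stated assumptions: the function name `secure_comb32` is hypothetical, the garbage register set is modelled as a single integer, and Python itself is of course not constant-time, so the model shows which values are touched rather than any timing or power behaviour. The dummy ADD used for ILA is not modelled.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product, used only as a cross-check."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def secure_comb32(a: int, b: int):
    """LtR Comb on 32-bit operands where every tested bit triggers an XOR.

    Returns the product and the number of XOR steps executed.
    """
    acc = 0        # models the real accumulator registers
    garbage = 0    # models the garbage registers; its value is never used
    xor_steps = 0
    for k in range(7, -1, -1):               # bit position, MSB first
        for j in range(4):                   # byte index of the multiplier
            term = a << (8 * j)
            if (b >> (8 * j + k)) & 1:
                acc ^= term                  # real XOR into the accumulator
            else:
                garbage ^= term              # dummy XOR with the same operand
            xor_steps += 1
        if k:
            acc <<= 1                        # C <- C << 1 between bit positions
    return acc, xor_steps
```

Whatever the multiplier value, the model executes 32 XOR steps; only the destination changes, which is the property the garbage registers are meant to provide.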

Proposed Binary Field Multiplication
In this section, we describe the proposed BF multiplication method, which is not only efficient but also secure against TA and SPA. With several optimization techniques, we present a secure Block-Comb method using block size 8, which is known as the maximum block size on 8-bit AVR MCUs.

Enhanced Secure Block-Comb Method
Seo et al.'s method utilized n garbage registers, equal to the number of real accumulator registers (namely, eight registers were used as the garbage register set). Our method makes use of a single garbage register rather than n garbage registers. The security analysis of using a single garbage register is described in Section 4.4. The saved registers can be used for extending the block size of the Block-Comb method. If the Block-Comb method with a single garbage register uses block size s, the total number of registers is 4s + 2: 1, s, s + 1, and 2s registers for the garbage register, the multiplier, the multiplicand, and the accumulator, respectively. Since 26 registers are available on 8-bit AVR MCUs when the address registers are excluded, the block size s can be 6 (48-bit). Since Seo et al. utilized the 32-bit secure Block-Comb technique for calculating a 128-bit BF multiplication, 9 partial products were required (they integrated their secure Block-Comb method into the Karatsuba technique, which reduces 16 partial products to 9). With the 48-bit wise Block-Comb method, 128-bit operands are divided into three terms. Thus, the 128-bit BF multiplication can be computed with 6 partial products by integration with the Karatsuba technique. Even though using a single garbage register extends the block size of the secure Block-Comb method from 4 to 6, it still does not reach the maximum block size of 8, which was presented in the work on the non-constant-time Enhanced Karatsuba Block-Comb method [9].
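The register arithmetic in this paragraph can be verified with a few lines of Python. The 4s + 2 budget is taken from the text; the 6s + 1 budget for a full garbage set is an inference from the 25 registers reported for s = 4 in [10] (2s accumulator, 2s garbage, s + 1 multiplicand, and s multiplier registers), and all function names are illustrative.

```python
AVAILABLE = 26  # 32 general-purpose registers minus the 6 address-pointer registers


def regs_single_garbage(s: int) -> int:
    """1 garbage + s multiplier + (s + 1) multiplicand + 2s accumulator = 4s + 2."""
    return 1 + s + (s + 1) + 2 * s


def regs_full_garbage(s: int) -> int:
    """Assumed layout of [10]: 2s garbage + 2s accumulator + (s + 1) + s = 6s + 1."""
    return 2 * s + 2 * s + (s + 1) + s


def max_block_size(budget, available: int = AVAILABLE) -> int:
    """Largest block size s whose register budget fits into the available registers."""
    s = 0
    while budget(s + 1) <= available:
        s += 1
    return s


def blocks_per_operand(operand_bytes: int, s: int) -> int:
    """Number of s-byte blocks needed to cover an operand (ceiling division)."""
    return -(-operand_bytes // s)
```

Under these assumptions, `max_block_size(regs_full_garbage)` evaluates to 4 and `max_block_size(regs_single_garbage)` to 6, and a 16-byte operand splits into four blocks for s = 4 and three blocks for s = 6, consistent with the partial-product counts of 9 and 6 given above.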
In order to expand the block size of the Block-Comb multiplication method from 6 to 8, more registers need to be reserved. However, since only 26 registers are available on 8-bit AVR MCUs when the address registers are excluded, one of the register sets holding the multiplier, the multiplicand, or the accumulator needs to be reduced. Common Block-Comb methods load s bytes of the multiplier into s registers (Steps 4 to 7 of Algorithm A1 in Appendix A) and sequentially access the l-th bit of the s registers, where l runs from 0 to 7 (Step 11 of Algorithm A1 in Appendix A). Figure 1 shows the multiplier access pattern of Algorithm A1 in Appendix A. Since the bits being accessed are distributed over s registers, s registers are required for accessing the multiplier. Thus, we encode the multiplier so that the bits being accessed are in one register [9]. In other words, by rearranging the l-th bits of the s registers into one register, Steps 10 to 16 of Algorithm A1, the main inner loop of the Block-Comb method, require only one register at a time. Algorithm 2 depicts a 128-bit wise multiplier encoding process. The algorithm makes use of AVR bit handling instructions such as LSR (logical shift right) and ROR (rotate right through carry). Figure 2 shows the encoded multiplier produced by the bit reordering-based multiplier encoding process when the multiplier is 128-bit, which can be used for BF multiplication in the GHASH function of GCM. The encoding process operates 64-bit wise, so that the i-th encoded byte EB[i] contains the i-th bits of the original multiplier bytes B[0] to B[7]. With the application of multiplier encoding, our method loads the l-th bits of the original multiplier's s bytes into one register. Thus, our method requires 3s + 3 registers for computing a Block-Comb multiplication (namely, 1, 1, s + 1, and 2s registers for the multiplier, the garbage register, the multiplicand, and the real accumulator, respectively).
Even though the largest s satisfying 3s + 3 ≤ 26 is 7, we can extend the block size from 7 to 8 by exploiting the address registers $(R_{31}, \ldots, R_{26})$. Address registers can also be used as arithmetic registers: the memory address for the final result is needed only when the final multiplication result in the accumulator is stored into memory at the end of the BF multiplication. Thus, at the beginning of the multiplication, the address for the final result can be saved on the stack with the PUSH instruction and then restored with the POP instruction when storing the final result into memory. This technique requires only 8 cc because 2 PUSH and 2 POP instructions are used for saving and restoring the 16-bit address value. Therefore, our secure Block-Comb method uses block size 8, the same as the maximum block size reported in the work on the enhanced Karatsuba Block-Comb method [9].

Algorithm 3 depicts the proposed secure Block-Comb method using block size 8 (64-bit wise). We assume that the multiplier B is converted into EB with Algorithm 2 and that EB is then used as the multiplier input of Algorithm 3. In the algorithm, $(R_{24}, \ldots, R_{16})$, $R_{25}$, $R_{26}$, and $(R_{15}, \ldots, R_0)$ hold the multiplicand A, the encoded multiplier EB, the garbage register, and the intermediate result, respectively. Note that Algorithm 3 uses a single register for keeping the target bits of the multiplier because each byte of EB holds one bit column of eight consecutive bytes of the original multiplier.

12: for n = 0 to 7 do
13:   if the n-th bit of R25 == 1 then
14:     R26 ← R26 + R27 // Dummy ADD operation for ILA
15:     for k = 0 to 8 do
16:
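The multiplier encoding and the single-garbage-register loop described above can be modelled functionally as follows. This Python sketch is an illustration, not the paper's Algorithm 2 or Algorithm 3: the bit layout (bit j of EB[l] equals bit l of B[j]), the little-endian byte order, the right-to-left loop order, and the names `encode_multiplier` and `secure_block_comb64` are all assumptions, and the ILA dummy ADD is not modelled.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product, used only as a cross-check."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def encode_multiplier(block: bytes) -> bytes:
    """Transpose the 8x8 bit matrix of an 8-byte multiplier block.

    Assumed layout: bit j of EB[l] equals bit l of B[j].
    """
    assert len(block) == 8
    eb = bytearray(8)
    for l in range(8):
        for j in range(8):
            eb[l] |= ((block[j] >> l) & 1) << j
    return bytes(eb)


def secure_block_comb64(a: int, eb: bytes) -> int:
    """64-bit Block-Comb product driven by one encoded multiplier byte at a time."""
    acc = 0
    garbage = 0                               # a single 8-bit garbage register
    for l in range(8):
        row = eb[l]                           # the only multiplier register in use
        term = a << l                         # A * z^l, at most 9 bytes wide
        for n in range(8):
            if (row >> n) & 1:
                acc ^= term << (8 * n)        # real XOR at byte offset n
            else:
                for k in range(9):            # dummy XORs all land in one byte
                    garbage ^= (term >> (8 * k)) & 0xFF
    return acc
```

A 64-bit multiplier `b` would be encoded with `encode_multiplier(b.to_bytes(8, "little"))` before being passed in. Because the encoding is a bit-matrix transpose under the assumed layout, applying it twice returns the original block.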

Proposed Karatsuba Technique
With the proposed Block-Comb multiplication method using a block size of 8, a 128-bit BF multiplication can be computed with 4 partial products, where each partial product is computed with Algorithm 3. To decrease the number of partial products, we combine our Block-Comb method with the enhanced Karatsuba technique [20]. Even though the enhanced Karatsuba technique was originally proposed for prime field multiplication on 8-bit AVR MCUs, we modify it for our proposed Block-Comb method. By applying the enhanced Karatsuba technique rather than the classic Karatsuba technique, s XOR instructions can be saved, where s is the number of words per operand (s = 8 in Algorithm 4). Algorithm 4 depicts the proposed 1-level Karatsuba secure Block-Comb method, which computes 3 partial products with Algorithm 3. In the algorithm, $H_H$, $H_L$, $M_H$, $M_L$, $L_H$, $L_L$, and T are all s bytes long. Therefore, Algorithm 4 saves one partial product at the expense of 56 additional XOR instructions compared with a classical 128-bit BF multiplication using Algorithm 3. In Algorithm 4, L, H, and M denote the low term, high term, and middle term of the Karatsuba multiplication, respectively (thus, each of them is 128-bit). The subscripts L and H denote the lower part and higher part of each term (thus, each of them is 64-bit).
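The saving of one partial product can be seen in the following Python sketch of a one-level Karatsuba multiplication over binary polynomials. It shows the classic formulation, not the enhanced variant of [20] used in Algorithm 4, and the names `karatsuba128` and `clmul` are illustrative; `clmul` merely stands in for the 64-bit secure Block-Comb multiplication.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product; stands in for the 64-bit Block-Comb multiplication."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def karatsuba128(a: int, b: int, mul64=clmul) -> int:
    """128-bit binary polynomial product from three 64-bit partial products."""
    a_l, a_h = a & MASK64, a >> 64
    b_l, b_h = b & MASK64, b >> 64
    low = mul64(a_l, b_l)                    # L = A_L * B_L
    high = mul64(a_h, b_h)                   # H = A_H * B_H
    mid = mul64(a_l ^ a_h, b_l ^ b_h)        # M = (A_L + A_H) * (B_L + B_H)
    mid ^= low ^ high                        # M + L + H = A_L*B_H + A_H*B_L
    return (high << 128) ^ (mid << 64) ^ low
```

In characteristic two, addition and subtraction are both XOR, so the middle term is corrected with XORs only; this is why the extra cost is a fixed number of XOR instructions rather than additions with carry handling.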

Implementation Results on an 8-Bit ATmega128 MCU and Comparison to Previous Work
We have implemented our methods on a target board containing an ATmega128 MCU. Table 2 shows the timing results of the proposed methods and compares them with those of Seo et al.'s method [10]. For efficiency, the proposed methods are developed in AVR assembly language. The proposed 64-bit wise secure Block-Comb method requires 1213 cc (resp. 1050 cc) with (resp. without) the multiplier encoding process, which is an 8.8% (resp. 21.1%) improvement compared with Seo et al.'s 64-bit wise multiplication method. In the case of 128-bit multiplication, the proposed method with (resp. without) multiplier encoding requires 3816 cc (resp. 3490 cc), which is an improvement of 32.8% (resp. 38.5%) compared with Seo et al.'s 128-bit wise multiplication method. The improvement ratio is larger for 128-bit multiplication than for 64-bit multiplication. This is because we implement 128-bit multiplication with the enhanced Karatsuba technique entirely in assembly language, while Seo et al.'s 128-bit multiplication method is implemented in a mix of C and assembly language. Furthermore, the partial product at step 8 in Algorithm 4 does not require a multiplier encoding process because the encoded multiplier can be obtained by XORing the precomputed encoded multipliers as (EB[15 . . . 8] ⊕ EB[7 . . . 0]).
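The last observation relies on the encoding being linear over XOR. A minimal sketch of that property follows, assuming (as Algorithm 2 is not shown here) that the encoding is a bit-column transposition of eight multiplier bytes; `encode_multiplier` and `xor_bytes` are hypothetical helper names.

```python
def encode_multiplier(b):
    """Assumed encoding: byte n of the result holds bit n of each of the
    eight bytes of the 64-bit value b (bit-column transposition)."""
    b_bytes = [(b >> (8 * j)) & 0xFF for j in range(8)]
    return [
        sum(((b_bytes[j] >> n) & 1) << j for j in range(8))
        for n in range(8)
    ]


def xor_bytes(x, y):
    """Byte-wise XOR of two equally long lists of byte values."""
    return [p ^ q for p, q in zip(x, y)]


b_low = 0x0123456789ABCDEF    # lower 64 bits of the multiplier
b_high = 0xFEDCBA9876543210   # upper 64 bits of the multiplier

eb_low = encode_multiplier(b_low)
eb_high = encode_multiplier(b_high)

# Encoding only moves bits to new positions, so it commutes with XOR:
# the encoded middle-term multiplier needs no third encoding pass.
eb_mid = xor_bytes(eb_high, eb_low)
```

Under this assumption, `eb_mid` equals `encode_multiplier(b_high ^ b_low)`, so eight byte-wise XORs replace a full encoding pass for the middle term.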

Security Impact Analysis According to the Number of Registers Used
We found that the number of garbage registers does not affect SPA security on the target 8-bit AVR MCU, and we therefore reduced the number of garbage registers from eight to just one. To confirm the security impact of this reduction, we conducted an SPA security analysis. We used the KLA-SCARF evaluation board with an 8-bit ATmega128 MCU and gathered power traces with a LeCroy HDO06104A oscilloscope at a sampling rate of 500 MS/s. Figure 4 compares the power consumption traces of the False and True cases when a single garbage register, rather than n garbage registers, is used for SPA and TA security. Both cases shown in the figure take the same number of clock cycles, and the two power consumption patterns are not distinguishable from each other. We additionally verified that the proposed BF multiplication method resists analysis with popular clustering algorithms, namely K-Means and Spectral clustering. Specifically, we first labeled the traces as the True case (XOR operations with the real accumulator) or the False case (XOR operations with a garbage register), assuming that the condition is already known. Using 5000 traces for the True case and 5000 traces for the False case, we then ran both clustering algorithms. In our experiments, the K-Means and Spectral clustering algorithms achieved success rates of 0.6385 and 0.6023, respectively. With a success rate of 0.6385, the entropy is log_2(0.6385^{-128}) ≈ 82.85 bits, which is still a high security level with respect to SPA security.
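The entropy figure can be reproduced with a few lines of Python. The sketch assumes that each of the 128 bit decisions is classified correctly with the same probability p and independently of the others, so that the whole value is recovered with probability p^128; the function name is hypothetical.

```python
import math


def remaining_entropy_bits(success_rate, secret_bits=128):
    """Return log2(success_rate ** -secret_bits) = -secret_bits * log2(p)."""
    return -secret_bits * math.log2(success_rate)


kmeans_entropy = remaining_entropy_bits(0.6385)    # K-Means success rate
spectral_entropy = remaining_entropy_bits(0.6023)  # Spectral success rate
```

Because the Spectral clustering success rate is lower, its remaining entropy is higher than the K-Means figure, which is why the K-Means value is the one quoted as the worst case.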

Application to GCM Mode's GHASH Function Implementation
We have applied the proposed BF multiplication method to the GHASH function of GCM as a case study. Since the GHASH function of GCM requires several 128-bit BF multiplications, BF multiplication needs to be implemented in a secure and efficient way. Although Seo et al. proposed an efficient and secure BF multiplication method for a secure GHASH function of the GCM mode of operation [10], its performance leaves room for improvement. Our method is 38.5% more efficient than Seo et al.'s method while providing TA and SPA security. We apply our method to the GHASH function and suggest some optimization methods specific to GCM.
Since the hash key H is fixed in the GHASH function, it can be encoded once at the beginning of the GHASH function, and the encoded H can then be used for all BF multiplications without encoding H every time. Thus, we save the overhead of encoding the multiplier in the BF multiplications of the GHASH function. In the GCM standard, the bits of the state are reflected: the leftmost bit is considered the 0-th bit and the rightmost bit the 127-th bit, whereas general crypto algorithms typically use the opposite notation. Therefore, the two inputs of a BF multiplication need to be bit-reflected, and the output also needs to be bit-reflected. We apply a table-based bit-reflection method, which requires a 256-byte table. Whereas Seo et al.'s method uses three bit-reflections (two inputs and one output), our implementation requires only one. Figure 5 describes the process of the GHASH function of GCM. The inputs of the BF multiplication (each of (A_1, ..., A_m, C_1, ..., C_n) and H) and the output (V_1, ..., V_{m+n}) need to be bit-reflected. Our implementation encodes the hash key H at the beginning of the GHASH function and stores it in bit-reflected form, which removes the need to bit-reflect H at each BF multiplication. Our implementation also combines the bit-reflection of the BF multiplication output with the bit-reflection of the other input (A_1, ..., A_m, C_1, ..., C_n). Since the output of the previous BF multiplication is XORed with one input of the next BF multiplication, we can combine the two bit-reflections into one. In other words, BitReflect(O) XOR BitReflect(I) = BitReflect(I XOR O), where BitReflect is the bit-reflection function, O is the output of the previous BF multiplication, and I is the input of the next BF multiplication.
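The identity above holds because bit-reflection only permutes bit positions. A minimal Python sketch of a table-based reflection with a 256-entry table is shown below; the table construction, the function names, and the sample blocks are illustrative assumptions and are not taken from the implementation described here.

```python
# 256-entry lookup table: REFLECT_TABLE[v] is v with its 8 bits reversed.
REFLECT_TABLE = [int(format(v, "08b")[::-1], 2) for v in range(256)]


def bit_reflect_128(block):
    """Reflect all 128 bits of a 16-byte block: reverse the byte order
    and reverse the bits inside each byte via the lookup table."""
    return bytes(REFLECT_TABLE[x] for x in reversed(block))


def xor_blocks(x, y):
    """Byte-wise XOR of two equally long byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))


prev_output = bytes.fromhex("00112233445566778899aabbccddeeff")  # O
next_input = bytes.fromhex("0f1e2d3c4b5a69788796a5b4c3d2e1f0")   # I

two_reflections = xor_blocks(bit_reflect_128(prev_output),
                             bit_reflect_128(next_input))
one_reflection = bit_reflect_128(xor_blocks(prev_output, next_input))
```

Both `two_reflections` and `one_reflection` hold the same 16 bytes, so XORing first and reflecting once is equivalent to reflecting each operand separately.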
Thus, in our implementation, only one bit-reflection is required at each BF multiplication, which further improves the performance of the GHASH function. We have measured the performance of the proposed GHASH implementation on an 8-bit ATmega128 MCU for 16-byte, 64-byte, and 256-byte messages and compared the results with those of Seo et al.'s work. Whereas Seo et al. used a C version of the 128-bit reduction method in their BF multiplication method [10], we have implemented a fast reduction method similar to the fast reduction algorithm in [14] in assembly language, and the running time of the reduction is 350 cc. Note that, including the one additional final BF multiplication for computing V_{m+n+1} in Figure 5, 2, 5, and 17 128-bit BF multiplications are required for 16-byte, 64-byte, and 256-byte messages, respectively. Table 3 compares the performance of our GHASH implementation with that of the previous work. Since Seo et al. did not provide timing results for 64-byte and 256-byte messages in their paper [10], we have implemented their method and measured the timings for 64-byte and 256-byte messages ourselves. Our proposed implementation improves performance by over 42% compared with the previous best result from [9].
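The multiplication counts quoted above follow from one BF multiplication per 16-byte block plus one final multiplication. A short sketch, assuming the whole message is processed in 16-byte blocks with any partial block padded to a full one (the function name is hypothetical):

```python
def ghash_multiplications(message_bytes, block_bytes=16):
    """One 128-bit BF multiplication per block, plus the final one
    used to compute V_{m+n+1}."""
    blocks = -(-message_bytes // block_bytes)  # ceiling division
    return blocks + 1


counts = [ghash_multiplications(n) for n in (16, 64, 256)]
```

For the three message sizes measured here, `counts` matches the figures 2, 5, and 17 stated in the text.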

Conclusions
In this paper, we have proposed a highly efficient SCA-resistant binary field multiplication method and applied it to the GHASH function of GCM on 8-bit AVR MCUs. The proposed BF multiplication method is efficient and secure against TA and SPA. Our method adopts the Dummy XOR technique with garbage registers and reduces the number of garbage registers from n to one, based on our investigation of the security impact of using only one garbage register. Furthermore, with a novel multiplier encoding, our method achieves block size 8, which is known as the largest block size in the Block-Comb multiplication method on 8-bit AVR MCUs. As a result, our BF multiplication method improves performance by 32.8% (resp. 38.5%) with (resp. without) the multiplier encoding compared with the previous best work. The proposed BF multiplication method can be used as the underlying BF multiplication in the GHASH function of GCM and in NIST-compliant binary ECC arithmetic. As a case study, we have also presented an enhanced GHASH function implementation that combines the proposed BF multiplication method with additional optimization techniques and improves performance by over 42% compared with the previous best GHASH function implementation.
As future work, we will apply the concept of the proposed method to 16-bit and 32-bit embedded MCUs, such as MSP430 and ARM processors. Furthermore, we will apply our proposed BF multiplication method to NIST-compliant binary Elliptic Curve Cryptosystems (ECCs).