Fast Implementation of NIST P-256 Elliptic Curve Cryptography on 8-Bit AVR Processor

Abstract: In this paper, we present a highly optimized implementation of elliptic curve cryptography (ECC) over the NIST P-256 curve for an 8-bit AVR microcontroller. To improve the performance of the ECC implementation, we focus on optimizing field arithmetic. In particular, we optimize modular multiplication and squaring by exploiting a state-of-the-art optimization technique, namely range shifted representation (RSR). With the optimized field arithmetic, we significantly improve the performance of scalar multiplication and set the speed record for variable-base scalar multiplication over the NIST P-256 curve. Compared with previous works, we achieve a performance gain of 17.3% over the best previous result on the same platform. Moreover, our execution time is even faster than that of the well-known TinyECC library over the NIST P-192 curve. Our result shows that RSR can be applied to all field arithmetic operations, and we evaluate the impact of adopting RSR on the performance of scalar multiplication. Additionally, our implementation provides a high degree of regularity to withstand side-channel attacks.


Introduction
Wireless sensor networks (WSNs), which consist of a large number of resource-constrained sensor devices, have attracted substantial attention due to the rapid advancement of the Internet of Things (IoT). In various IoT applications, such as monitoring physical and environmental conditions (temperature, sound, and pollution levels), battlefield reconnaissance, and home automation, many constrained devices are employed as wireless sensor nodes for their low cost and energy efficiency. Compared with traditional wired networks, it is difficult to ensure secure and reliable communication in WSNs, since wireless sensor nodes are often deployed in unattended environments, where they can be easily accessed and manipulated by malicious adversaries. Thus, cryptographic mechanisms are required in order to provide sufficient security in WSNs. However, it is hard to deploy cryptographic schemes (especially public-key cryptography) on wireless sensor nodes due to their constraints in resources, such as computation power, memory, energy, and even storage space, since they are usually assumed to be battery-powered. For example, the MICAz mote, which is recognized as one of the most widely used constrained 8-bit devices, is equipped with an AVR ATmega128 processor that has 128 KB of programmable flash memory and 4 KB of RAM, with a clock frequency of 7.3728 MHz.
In the early days, Public-Key Cryptosystems (PKCs) were considered infeasible on resource-constrained devices due to their significant amount of computation. In order to overcome

Related Work
In [1], Gura et al. presented the first ECC implementation on an 8-bit AVR processor. Before this result, ECC was believed to be infeasible on constrained devices, since it requires a significant amount of computation. Their work focuses on the large volume of data transported to and from memory during multi-precision multiplication of large integers in limited register space. To minimize the number of load instructions on a processor with a large register file, they proposed a new hybrid method, which combines the advantages of conventional byte-wise multiplication techniques, such as the row-wise and the column-wise methods. They computed a scalar multiplication in 2.19 s for the NIST P-224 curve at a frequency of 8 MHz, i.e., $17.52 \times 10^6$ clock cycles on an AVR processor. In 2008, Liu et al. presented TinyECC [2], the first well-known and widely used cryptographic library covering various platforms, such as 8-bit AVR-based, 16-bit MSP-based, and 32-bit ARM-based devices. TinyECC includes a set of optimization switches that developers can configure according to their needs, which yields different execution times and memory consumption. TinyECC supports all 128-bit, 160-bit, and 192-bit NIST-recommended elliptic curves. TinyECC was evaluated on MICAz, TelosB, Tmote Sky, and Imote2, which provides measurements of performance and resource consumption from low-end to high-end devices. It also uses the hybrid method to optimize the multiplication and squaring operations.
Recently, research on ECC implementations over NIST prime curves on an 8-bit AVR processor was conducted in [3,6]. Liu et al. implemented ECC over the NIST P-192 curve on an 8-bit AVR processor [3] and achieved an execution time of $8.62 \times 10^6$ clock cycles for scalar multiplication, which is the best result for that curve. With respect to the NIST P-256 curve, Zhou et al. [6] achieved the fastest performance, with an execution time of $25.38 \times 10^6$ clock cycles. Both works adopted the Karatsuba method proposed by Hutter and Schwabe [7] to optimize the multiplication and squaring operations, which leads to the best performance with respect to scalar multiplication. Hence, this Karatsuba method is considered to be the best way to implement multiplication.
In summary, many studies have shown that ECC implementations are feasible on resource-constrained devices; their main purpose is to optimize the performance of scalar multiplication by using fast multiplication techniques. However, they did not consider the modular reduction operation as a crucial part of improving performance, even though it follows every multiplication and squaring. The reduction operation introduces a large number of memory accesses, because it recalls the previous results of multiplication and squaring. In the context of modular multiplication, Park et al. [8] introduced a new Karatsuba technique that uses a new integer representation, namely Range Shifted Representation (RSR). Their work achieved the best performance for 192-bit modular multiplication over the NIST P-192 prime. Because their optimization technique is highly prime-dependent, different optimization approaches are required to adapt it to modular multiplication over other primes. Moreover, in order to apply RSR to other field operations, i.e., squaring, addition, and inversion, new approaches need to be considered for an efficient implementation.
In addition, resistance against side-channel attacks is required to ensure a secure ECC implementation. Traditional ECC implementations did not consider any countermeasures against side-channel attacks, such as Simple Power Analysis (SPA), which might cause security threats on embedded devices [9][10][11]. SPA attacks usually exploit conditional statements in an ECC implementation, which leak secret information. In practice, SPA attacks exploit the conditional subtraction in field operations and the irregular execution pattern of scalar multiplication, which reveal key-related information [12][13][14][15][16]. Hence, constant-time field operations and regular scalar multiplication algorithms are required for resistance to SPA attacks.

Our Contributions
The contributions of this paper are summarized as follows:
• We present an efficient algorithm for 256-bit modular multiplication over the NIST P-256 prime for an 8-bit AVR processor, taking advantage of the optimization technique for modular multiplication on RSR in [8]. Because the technique presented in [8] is highly prime-dependent, it cannot be directly adapted to multiplication over other prime fields, and larger operand lengths require different approaches given the limited number of registers. Because the 256-bit intermediate result of multiplication occupies 32 registers, it cannot be held in the working registers all at once during the modular multiplication. Hence, careful scheduling of the accumulation of the intermediate result is required in order to avoid unnecessary memory accesses, i.e., load/store instructions, which are known to be the most time-consuming instructions for modular multiplication on constrained devices. In order to extend the optimization technique on RSR in [8] to 256-bit modular multiplication, we propose a new algorithm, which divides the reduction process into two 128-bit parts and carefully schedules the accumulation of the intermediate results of multiplication. Consequently, we significantly reduce the number of load/store instructions during modular multiplication.
• The work in [8] only introduces modular multiplication on RSR and does not cover other field operations, such as squaring, addition, and inversion. We focus on applying RSR to all field operations in order to improve the performance of scalar multiplication. Firstly, we propose a new modular squaring over the NIST P-256 prime, which is based on the optimization of our 256-bit modular multiplication. This modular squaring sets the speed record for 256-bit modular squaring over the NIST P-256 prime on an 8-bit AVR processor, providing an improvement of about 24% compared with the state-of-the-art method. Moreover, we propose a modular addition on RSR with a constant execution time in order to protect against SPA attacks. Finally, we present conversion algorithms from the original integer representation to RSR, and vice versa, which are required for applying RSR to field operations.
• We present a highly optimized ECC implementation over the NIST P-256 curve, providing 128-bit security, on an 8-bit AVR ATmega128 processor. Our implementation requires only $20.98 \times 10^6$ cycles, which is faster than the previous work in [6]. Our proposed implementation achieves the best performance for the P-256 curve on an AVR processor.
The rest of this paper is organized, as follows. In Section 2, we briefly review the basic ECC, including NIST curve P-256, and also describe the main features of the 8-bit AVR ATmega128 microcontroller. In Section 3, we overview RSR for modular multiplication applied to the NIST P-256 prime. In Section 4, we describe the optimization technique for field operations on RSR. In particular, we propose modular multiplication and squaring method that is optimized for 8-bit AVR processor. Section 5 compares our implementation of scalar multiplication with other previous works. Finally, we conclude this work in Section 6.

Elliptic Curve Cryptography
ECC has been the most popular public-key cryptosystem since Neal Koblitz [17] and Victor S. Miller [18] first introduced ECC in 1985, due to its short key sizes and low computational cost compared to RSA, DSA, or Diffie-Hellman. Its security is based on the Elliptic Curve Discrete Logarithm Problem (ECDLP), for which no general-purpose subexponential algorithm is known. An elliptic curve $E$ over a finite field $\mathbb{F}_P$ can be defined by a short Weierstraß equation, as follows:

$$E: y^2 = x^3 + ax + b, \quad (1)$$

where $a, b \in \mathbb{F}_P$ and $4a^3 + 27b^2 \neq 0$. It is common practice to choose the curve parameter $a$ as $-3$ in order to improve the performance of point arithmetic in scalar multiplication.
In 1999, NIST proposed five prime-field curves for standardization [19], which adopt this approach. Hence, the NIST curves $E$ can be defined by a short Weierstraß equation as

$$E: y^2 = x^3 - 3x + b. \quad (2)$$

The NIST P-256 curve uses the prime field $\mathbb{F}_{P_{256}}$, defined by the prime $P_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$. This prime has a special form: it is a sum and difference of a small number of powers of 2, and all of the exponents are multiples of 8, which is an advantage on 8-bit devices. With this special form, the modular reduction operation can be carried out with a few additions and subtractions using the congruence $2^{256} \equiv 2^{224} - 2^{192} - 2^{96} + 1 \pmod{P_{256}}$.
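The reduction by folding can be illustrated with a short Python sketch. It is a minimal illustration on arbitrary-precision integers, not the byte-level AVR routine; the names `FOLD` and `fold_p256` are hypothetical and do not come from the paper.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
# 2^256 is congruent to this value modulo P256 (the congruence stated above)
FOLD = 2**224 - 2**192 - 2**96 + 1

def fold_p256(z: int) -> int:
    """Reduce a non-negative integer modulo P256 by folding the high part."""
    while z >> 256:
        hi, lo = z >> 256, z & (2**256 - 1)
        # hi * 2^256 is replaced by hi * FOLD, which is strictly smaller
        z = lo + hi * FOLD
    # now z < 2^256 < 2 * P256, so one conditional subtraction is enough
    if z >= P256:
        z -= P256
    return z
```

Each pass replaces the part above bit 256 by a few shifted copies of itself, which is why the special form of the prime lets a reduction be built from additions and subtractions only. The data-dependent loop and the final `if` are for clarity; they are not meant to model a constant-time implementation.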

8-Bit AVR ATmega Microcontroller
We used the ATmega128 as our target platform, which is a representative of the AVR family based on a modern, highly structured RISC design. The ATmega128 features 128 KB of programmable flash memory, 4 KB of SRAM, 4 KB of EEPROM, an 8-channel 10-bit A/D converter, and a JTAG interface for on-chip debugging. This AVR microcontroller provides 133 instructions; most of them take a single clock cycle, but multiplication and memory access (load and store) instructions require two clock cycles. There is a set of 32 × 8-bit working registers, where six registers (R26-R31) serve as the address pointers X, Y, and Z, and the remaining 26 registers (R0-R25) are for general purposes. In particular, the two least significant registers (R0-R1) hold the result of an 8 × 8-bit multiplication.
There are many development and compilation tools available for the Atmel AVR family. Atmel Corporation offers a free AVR Studio development environment, which includes a compiler, an assembler, and a convenient graphical simulator for Visual Studio environments.

Range Shifted Representation for Modular Multiplication
In this section, we introduce the RSR proposed in [8] and apply it to the representation of 256-bit integers. To keep the explanation of RSR simple, we assume a 256-bit input operand size.

Range Shifted Representation
Using RSR with an 8-bit word size ($W = 2^8$), we represent 256-bit integers $X$ and $Y$ in the range from $2^{-128}$ to $2^{128} - 1$, together with their product $Z = X \cdot Y$, as follows:

$$X = \sum_{i=0}^{31} x_i W^{i-16}, \quad Y = \sum_{i=0}^{31} y_i W^{i-16}, \quad Z = X \cdot Y = \sum_{i=0}^{63} z_i W^{i-32}. \quad (3)$$

The result $Z$ of the multiplication $X \cdot Y$ expands to both sides, and its shape is symmetric with respect to $W^0$. In order to reduce $Z$ into the range from $2^{-128}$ to $2^{128} - 1$, we use the shifted prime $P_{256} \cdot 2^{-128} = 2^{128} - 2^{96} + 2^{64} + 2^{-32} - 2^{-128}$. We can then reduce $Z$ at both sides: the lower part $z_0 W^{-32} + z_1 W^{-31} + \cdots + z_{15} W^{-17}$ and the upper part $z_{48} W^{16} + z_{49} W^{17} + \cdots + z_{63} W^{31}$ in Equation (3) are reduced using the congruence relations

$$2^{-128} \equiv 2^{128} - 2^{96} + 2^{64} + 2^{-32}, \quad 2^{128} \equiv 2^{96} - 2^{64} - 2^{-32} + 2^{-128} \pmod{P_{256} \cdot 2^{-128}}.$$

Applying these relations to $Z$ gives the reduction in Equation (5). Note that this is not a complete reduction: part of the result in (5) still lies outside the range from $2^{-128}$ to $2^{128} - 1$. Here, we omit the completing step for simplicity.
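The shifted prime and the two congruences that follow from it can be checked with exact rational arithmetic. The sketch below is only a numerical illustration; the helper names `rsr_value` and `congruent` are hypothetical, and congruence is taken to mean that the difference is a multiple of the shifted prime by an integer or by an integer divided by a power of two.

```python
from fractions import Fraction

W = 2**8
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
P_SHIFT = Fraction(P256, 2**128)            # the shifted prime P256 * 2^-128

def rsr_value(words):
    """Value of the bytes (x_0, ..., x_31) read as sum of x_i * W^(i-16)."""
    return sum(Fraction(x) * Fraction(W) ** (i - 16) for i, x in enumerate(words))

def congruent(a, b):
    """True if a - b is the shifted prime times k, with k of the form n / 2^j."""
    k = (Fraction(a) - Fraction(b)) / P_SHIFT
    d = k.denominator
    return d & (d - 1) == 0                  # denominator is a power of two

two = Fraction(2)
hi, lo = two**128, two**-128
```

With these definitions, `congruent(hi, two**96 - two**64 - two**-32 + lo)` and `congruent(lo, hi - two**96 + two**64 + two**-32)` both hold, which is what allows words above $W^{15}$ and below $W^{-16}$ to be folded back toward the middle of the range.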

Modular Multiplication on RSR
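We perform modular multiplication on RSR using the subtractive Karatsuba method, as in [8]. Let $X, Y \in \mathbb{F}_{P_{256}}$ be represented with RSR, and let $Z = X \cdot Y$. The operands $X$ and $Y$ are split into 128-bit halves, and the three 128-bit products $L$, $H$, and $M$ of the Karatsuba method (Equations (13)-(15)) are each split into a lower part (subscript $A$) and an upper part (subscript $B$). $Z$ can then be expressed in terms of $L$, $H$, and $M$, as in Equation (16). In Equation (16), only two parts, $L_A \cdot W^{-32}$ and $H_B \cdot W^{16}$, overflow on both sides of our range. By reducing them, we can compute $Z \pmod{P_{256} \cdot W^{-16}}$, as in Equation (17). In Equation (17), the accumulation $L_A + L_B + H_A + H_B$ appears twice. Hence, only one computation is required for these duplicated accumulations, and the result can be used twice.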
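The structure described above can be sketched in Python under stated assumptions. The sketch uses the standard subtractive Karatsuba identity with a signed middle product $M = (X_L - X_H)(Y_L - Y_H)$ (an implementation would instead track absolute differences and a sign), and it scales every quantity by $W^{32}$ so that plain integers can be used. The function names and the exact folded form are assumptions derived from the shifted-prime congruences, not a transcription of the paper's equations.

```python
W16 = 2**128                                  # W^16 with W = 2^8
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def split(v):
    """Split v into (low 128 bits, remaining high part)."""
    return v & (W16 - 1), v >> 128

def karatsuba_rsr(x, y):
    """x, y: integers holding the 32 RSR bytes. Returns (full, folded) * W^32."""
    xl, xh = split(x)
    yl, yh = split(y)
    L, H = xl * yl, xh * yh
    M = (xl - xh) * (yl - yh)                 # signed middle product
    LA, LB = split(L)
    HA, HB = split(H)
    MA, MB = split(M)
    # Unreduced product: LA*W^-32 + (..)*W^-16 + (..)*W^0 + HB*W^16, times W^32
    full = LA + (LA + LB + HA - MA) * W16 + (LB + HA + HB - MB) * W16**2 + HB * W16**3
    # Fold LA*W^-32 and HB*W^16 into the range; T is needed in two places
    T = LA + LB + HA + HB                     # shared accumulation, computed once
    folded = ((T - MA) * W16 + (T - MB) * W16**2
              + HB * (2**352 - 2**320 - 2**224)     # HB * (W^12 - W^8 - W^-4)
              + LA * (-2**224 + 2**192 + 2**96))    # LA * (-W^-4 + W^-8 + W^-20)
    return full, folded
```

The two return values differ by a multiple of $P_{256}$, so `folded` represents the same residue as the full product while only $L_A$ and $H_B$ had to be reduced. As in the text, this is not yet a complete reduction: the term $L_A \cdot W^{-20}$ still reaches below $W^{-16}$.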

Modular Multiplication
We propose a 256-bit modular multiplication over $\mathbb{F}_{P_{256} \cdot W^{-16}}$, as shown in Algorithm 1, which is composed of a product part, an accumulation part, and two reduction parts. At first, three 128-bit multiplications are computed for $L$, $H$, and $M$, as represented in Equations (13)-(15), which basically follow the same scheduling as the 128-bit Karatsuba multiplication in [7]. Thanks to the technique presented in [8], in the accumulation part, $(L_A + L_B + H_A + H_B)$ is computed only once for the duplicated accumulations in Equation (17) and saved in 17 registers, which are represented by $(T, \mathit{carry}_2) = (t_0, \ldots, t_{15}, \mathit{carry}_2)$. The result can be used twice, once in each reduction part. We cannot hold the intermediate result of the 256-bit multiplication for the reduction operation due to the limited number of registers. Thus, the reduction process is divided into two 128-bit reduction parts. In reduction part 1, the left half of the intermediate result is computed for the reduction of $L_A$ and $H_B$, whereas reduction part 2 computes the right half, as shown in Figure 1b.
We rearrange the order of computation from $L \to H \to M$ in Algorithm 2 to $M \to L \to H$, as in [8]. Subsequently, we can directly use $H_B$ without any memory access for the computation of $H_A + H_B$ at Step 7, because it is kept in registers after the previous computation of $H$ at Step 5.
The duplicated computations of $L_A + L_B + H_A + H_B$ can be seen in Figure 1b; they are computed in the accumulation part of Algorithm 1, as mentioned in Section 3. We avoid the unnecessary memory accesses related to these computations by reusing the result of the first computation in reduction part 2. After the accumulation part, the result $(t_0, \ldots, t_{15}, \mathit{carry}_2)$ of the duplicated computation is stored at Step 9 and loaded in reduction part 2. This accumulation part corresponds to Steps 4, 9, 10, 11, and 12 in Algorithm 2, except for the subtractions of $M_A$ and $M_B$ in Steps 9 and 11, respectively. Comparing the number of load and store instructions for manipulating $L_A$, $L_B$, $H_A$, $H_B$, and $T$ during the accumulation parts of the two algorithms, Algorithm 2 requires 80 load and 48 store instructions, whereas Algorithm 1 requires only 48 load and 16 store instructions, even when considering the additional computation for the reduction of $L_A$ and $H_B$.
Algorithm 1. 256-bit × 256-bit Karatsuba multiplication with reduction over $\mathbb{F}_{P_{256} \cdot W^{-16}}$. Require: $X = (x_0, \ldots, x_{31})$, $Y = (y_0, \ldots, y_{31})$.

Generally, for a reduction operation, the result of the multiplication must be computed first. The reduction then proceeds on the double-sized result, such as the 512-bit result $(z_0, \ldots, z_{63})$ in Algorithm 2, which causes a large number of memory accesses, since the number of registers is limited. However, in the case of Algorithm 1, we only need to reduce $L_A$ and $H_B$ at both sides, as shown in Figure 1a. In other words, we can easily merge the reduction operation into the multiplication without requiring the complete result of the multiplication, as shown in Figure 1b.
In reduction parts 1 and 2, we cannot hold $L_A$ and $H_B$ together in registers, because 16 working registers are always occupied by the intermediate results $T_L$ or $T_R$. Hence, repeated load instructions for them are required during reduction parts 1 and 2. In order to minimize the number of load instructions, we carefully schedule the order of the load instructions for $L_A$ and $H_B$, as shown in Figure 1b.

Modular Squaring
We propose an optimized modular squaring on RSR. Modular squaring consists of a product part, an accumulation part, and two reduction parts, as shown in Algorithm 3. We adopt the 2-level subtractive Karatsuba method for the implementation of modular squaring. For the three 128-bit squaring operations $L$, $H$, and $M$ in the product part of Algorithm 3, the 128-bit Karatsuba technique can be applied, which makes use of the left shift instruction to process the equal cross-product terms. Note that in squaring, $M$ is the square of the absolute difference of the two operand halves and is therefore never negative. Thus, no calculation of the sign of $M$ is required in the product part. By combining the squaring operation with the reduction, we obtain the same duplicated accumulation $(L_A + L_B + H_A + H_B)$ as in the modular multiplication of Equation (17). It can be used to avoid redundant memory accesses during modular squaring by reusing it in reduction part 2 of Algorithm 3. Hence, the accumulation part and the two reduction parts in Algorithm 3 follow the same process as in Algorithm 1.
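Two of the points above, the doubling of equal cross-product terms and the sign-free middle term, can be illustrated as follows. This is a minimal sketch on Python integers with hypothetical function names; it shows the arithmetic identities only and says nothing about register scheduling.

```python
def square_schoolbook(xs):
    """Square a little-endian byte list; each cross product x_i*x_j (i < j)
    is computed once and doubled with a single left shift."""
    n = len(xs)
    diag = sum(xs[i] * xs[i] << (16 * i) for i in range(n))
    cross = sum(xs[i] * xs[j] << (8 * (i + j))
                for i in range(n) for j in range(i + 1, n))
    return diag + (cross << 1)

def square_karatsuba(x):
    """One Karatsuba level for a 256-bit square using three 128-bit squares."""
    xl, xh = x & (2**128 - 1), x >> 128
    L, H = xl * xl, xh * xh
    M = (xl - xh) ** 2            # a square, so never negative: no sign handling
    return L + ((L + H - M) << 128) + (H << 256)
```

For a 16-byte operand, the schoolbook variant needs 16 diagonal products and 120 cross products instead of 256 byte multiplications, which is the saving that the left shift makes possible.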

Modular Addition
Because we use operands $X, Y \in [0, 2^{128} - 1]$ for modular addition, $P_{256} \cdot W^{-16}$ needs to be subtracted from the sum $Z = X + Y \in [0, 2^{129} - 2]$ at most twice in order to ensure that $Z$ becomes less than $2^{128}$. If $Z \in [0, 2^{128} - 1]$, there is nothing to do. If $Z \in [2^{128}, 2^{128} + P_{256} \cdot W^{-16} - 1]$, only one subtraction of $P_{256} \cdot W^{-16}$ is needed in order to ensure that the final result is smaller than $2^{128}$. If $Z \in [2^{128} + P_{256} \cdot W^{-16}, 2^{129} - 2]$, two subtractions of $P_{256} \cdot W^{-16}$ are necessary. These conditional subtractions, which depend on the range of $Z$, cause observable differences in the execution time that a side-channel attacker can use to find a secret value [20]. For a constant execution time, we always execute two final subtractions after the addition $X + Y$. Each of the two subtractions subtracts either $P_{256} \cdot W^{-16}$ or $0$.
Because the 256-bit addition result cannot be held in the 32 working registers, some parts of it have to be stored and reloaded repeatedly during the following two subtractions. We can reduce memory accesses by avoiding unnecessary register use. For example, the addition consists only of 32 byte additions, without an additional register for the carry byte that is produced by the last byte addition. The carry held in the carry flag decides whether the next subtraction subtracts $P_{256} \cdot W^{-16}$ or $0$. We only use three registers, R0, R1, and R2, to hold $P_{256} \cdot W^{-16}$, because it can be expressed with only three byte values, 0x00, 0x01, and 0xFF in hexadecimal.
Let R0, R1, and R2 be zero. Through the following instructions after the addition, we can hold $P_{256} \cdot W^{-16}$ or $0$ in R0, R1, and R2, depending on the carry flag.

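SBC R1, R0 ; R1 = 0x00 - 0x00 - C: becomes 0xFF and leaves C set if the addition produced a carry
ADC R2, R0 ; R2 = 0x00 + 0x00 + C: becomes 0x01 in that case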
After the 32 byte additions for $Z = X + Y$, if there is no carry, then all three registers remain zero, and the following two subtractions have no effect on the result. If the addition produces a carry, the carry flag is set. The subtract-with-borrow instruction (SBC) then sets R1 to 0xFF, and the carry flag is set again. Afterwards, R2 is set to 0x01 by the following add-with-carry (ADC) instruction. Through these instructions, we can hold $P_{256} \cdot W^{-16}$ in R0, R1, and R2.
After the first 32 byte subtractions for $Z = Z - P_{256} \cdot W^{-16}$, the carry flag may be set by the borrow originating from the last byte subtraction. Note that this new borrow means that $Z$ is smaller than $2^{128}$. In other words, no borrow means that $Z$ is greater than or equal to $2^{128}$. Through the following instructions after the first subtraction, we can hold $P_{256} \cdot W^{-16}$ or $0$ in R0, R1, and R2.

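ADC R1, R0 ; R1 = R1 + 0x00 + C: 0xFF wraps to 0x00 and sets C again if the subtraction borrowed
SBC R2, R0 ; R2 = R2 - 0x00 - C: 0x01 becomes 0x00 in that case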
If the first subtraction produces a borrow, then the ADC instruction sets R1 to 0x00 and the carry flag is set again, because R1 was set to 0xFF in the previous step. Subsequently, R2 is set to 0x00 by the SBC instruction, and we hold zeros in R0, R1, and R2. If there was no borrow, $P_{256} \cdot W^{-16}$ remains in the three registers.
Through these simple carry manipulations, we always execute one addition and two final subtractions, which gives a constant-time modular addition. Table 1 lists all of the cases of the carry and the borrow, which decide whether the next subtraction subtracts $P_{256} \cdot W^{-16}$ or $0$.
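The carry manipulation can be emulated in Python to check every case. The sketch below models the 8-bit ADC and SBC instructions on three register values and treats the 32 operand bytes as one integer, so the subtrahend appears as $P_{256}$ rather than the shifted $P_{256} \cdot W^{-16}$. All function names are hypothetical, and the emulation makes no claim about timing.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
MASK256 = 2**256 - 1

def adc(rd, rr, c):
    """8-bit add with carry: returns (result, carry out)."""
    t = rd + rr + c
    return t & 0xFF, t >> 8

def sbc(rd, rr, c):
    """8-bit subtract with borrow: returns (result, borrow out)."""
    t = rd - rr - c
    return t & 0xFF, int(t < 0)

def subtrahend(r0, r1, r2):
    """Expand the three registers into 32 bytes laid out like the prime:
    r0 where the prime has 0x00, r1 where it has 0xFF, r2 where it has 0x01."""
    pick = {0x00: r0, 0xFF: r1, 0x01: r2}
    layout = P256.to_bytes(32, "little")
    return int.from_bytes(bytes(pick[b] for b in layout), "little")

def mod_add(x, y):
    r0 = r1 = r2 = 0
    s = x + y
    z, c = s & MASK256, s >> 256          # 32 byte additions, carry in the flag
    r1, c = sbc(r1, r0, c)                # carry: R1 = 0xFF
    r2, c = adc(r2, r0, c)                # carry: R2 = 0x01
    t = z - subtrahend(r0, r1, r2)        # first subtraction (prime or zero)
    z, c = t & MASK256, int(t < 0)
    r1, c = adc(r1, r0, c)                # borrow: R1 back to 0x00
    r2, c = sbc(r2, r0, c)                # borrow: R2 back to 0x00
    return (z - subtrahend(r0, r1, r2)) & MASK256   # second subtraction
```

The returned value is below $2^{256}$ and congruent to `x + y` modulo the prime, but it is not necessarily the fully reduced residue, which matches the goal stated above of bringing the sum back into the representable range.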

Modular Inversion
The Extended Euclidean Algorithm (EEA) is widely known as an efficient method for modular inversion. However, EEA has a non-constant execution time that depends on its input, which can be exploited by side-channel attacks, such as timing attacks. In order to prevent side-channel leakage, we implement inversion based on Fermat's Little Theorem, which has a constant execution time. For $a \in \mathbb{F}_{P_{256} \cdot W^{-16}}$, the inverse of $a$ can be computed via $a^{-1} \equiv a^{P_{256} - 2} \pmod{P_{256} \cdot W^{-16}}$, which is significantly slower than EEA.
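A minimal Python sketch of Fermat-based inversion is given below. It works on ordinary integers modulo $P_{256}$ rather than on RSR operands, and the function name `inv_fermat` is hypothetical. The point it illustrates is that the exponent is a public constant, so the sequence of squarings and multiplications is the same for every input.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def inv_fermat(a: int) -> int:
    """Return a^(P256 - 2) mod P256 by left-to-right square-and-multiply."""
    exponent_bits = bin(P256 - 2)[2:]     # fixed and public, independent of a
    r = 1
    for bit in exponent_bits:
        r = r * r % P256                  # one squaring per exponent bit
        if bit == "1":                    # branch depends on the constant only
            r = r * a % P256
    return r
```

For nonzero `a`, the product `a * inv_fermat(a) % P256` equals 1. The cost is on the order of 256 squarings plus one multiplication per set exponent bit, which is why this approach is slower than EEA even though it is regular.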

Conversion of Representation
In order to apply RSR to field operations, conversions from the original representation to RSR and vice versa are required to shift the range of the integer representation. Algorithm 4 describes the conversion from the original representation to RSR, where the range is changed from $[0, P_{256}]$ to $[2^{-128}, P_{256} \cdot W^{-16}]$. This conversion consists of one 256-bit addition, two 256-bit subtractions, and a final subtraction of $P_{256} \cdot W^{-16}$. It costs a negligible number of clock cycles compared to a modular multiplication or squaring and is only required for the coordinates $x$ and $y$ of the input point of scalar multiplication. Conversely, the conversion from RSR to the original representation is required for the output point of scalar multiplication, where the range is restored from $[2^{-128}, P_{256} \cdot W^{-16}]$ to $[0, P_{256}]$. Three 256-bit additions, two 256-bit subtractions, and a final subtraction of $P_{256}$ are required, as shown in Algorithm 5. This conversion also costs a negligible number of clock cycles, since it is only required for the coordinates $X$ and $Y$ of the output point.
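What the two conversions have to achieve can be sketched at the level of residues. The sketch assumes that an RSR byte string with integer value $n$ stands for the value $n \cdot W^{-16}$, so that representing a field element $a$ requires $n \equiv a \cdot 2^{128} \pmod{P_{256}}$. It uses generic modular arithmetic with hypothetical function names and does not reproduce the addition and subtraction sequences of Algorithms 4 and 5.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1
R = 2**128                                # W^16
R_INV = pow(R, P256 - 2, P256)            # inverse of W^16 modulo P256

def to_rsr(a: int) -> int:
    """Integer whose 32 bytes, read with weights W^(i-16), represent a."""
    return a * R % P256

def from_rsr(n: int) -> int:
    """Field element represented by the RSR bytes with integer value n."""
    return n * R_INV % P256

def rsr_mul(n: int, m: int) -> int:
    """RSR bytes of the product: (n * W^-16) * (m * W^-16) = k * W^-16."""
    return n * m * R_INV % P256
```

Under this assumption, `from_rsr(rsr_mul(to_rsr(a), to_rsr(b)))` equals `a * b % P256`, so converting only the input and output coordinates is enough and all intermediate field operations can stay in RSR.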

Coordinate System for Point Arithmetic
A scalar multiplication is composed of two point arithmetic operations, i.e., point addition and point doubling. There are several ways to represent the points. The affine coordinate system uses two coordinates $x, y$ and requires the computation of a field inversion in point arithmetic. Because field inversion is a relatively expensive operation, most previous works represent the points in a projective coordinate system. Many studies have proposed efficient projective coordinate systems, such as Jacobian, Chudnovsky, and mixed coordinate systems [21][22][23][24]. Among these projective coordinates, Jacobian coordinates are preferable for the fastest point arithmetic operations. Using Jacobian coordinates, the point addition requires 8M + 3S + 7A (M = modular multiplication, S = modular squaring, A = modular addition) and the point doubling requires 4M + 4S + 9A.

Co-Z Arithmetic
We choose the co-Z arithmetic, which was proposed by Meloni in [25], for point arithmetic. It provides a very efficient point addition in Jacobian coordinates, where the two involved points share the same Z-coordinate. The co-Z point addition is faster than a point addition in Jacobian coordinates and even faster than a point doubling. Let $P = (X_1, Y_1, Z)$ and $Q = (X_2, Y_2, Z)$. The point addition of $P$ and $Q$ is defined by $P + Q = (X_3, Y_3, Z_3)$, which is computed as follows:

$$A = (X_2 - X_1)^2, \quad B = X_1 A, \quad C = X_2 A, \quad D = (Y_2 - Y_1)^2, \quad E = Y_1 (C - B),$$
$$X_3 = D - (B + C), \quad Y_3 = (Y_2 - Y_1)(B - X_3) - E, \quad Z_3 = Z (X_2 - X_1).$$

The co-Z point addition has a computational cost of 5M + 2S + 7A. Moreover, the Z-coordinate of $P$ is updated to be equal to the Z-coordinate of $P + Q$ for free. From the expressions of $B$ and $E$, we find that $B = X_1 (X_2 - X_1)^2 = x_1 Z_3^2$ and $E = Y_1 (X_2 - X_1)^3 = y_1 Z_3^3$, where $(x_1, y_1) = (X_1 / Z^2, Y_1 / Z^3)$ represents the affine coordinates of $P$. Thus, we have $P \equiv (B : E : Z_3)$. Such a free update allows for the subsequent use of the co-Z point addition between $P + Q$ and $P$ during scalar multiplication. In [25], it was shown that the conjugate $P - Q$ can be obtained with a small additional cost, sharing the same Z-coordinate as $P + Q$. The conjugate $P - Q = (\overline{X_3}, \overline{Y_3}, Z_3)$ can be computed as follows:

$$F = (Y_1 + Y_2)^2, \quad \overline{X_3} = F - (B + C), \quad \overline{Y_3} = (Y_1 + Y_2)(\overline{X_3} - B) - E.$$

Computing the two co-Z points $P + Q$ and $P - Q$ together, the conjugate co-Z point addition has a cost of 6M + 3S + 16A.
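The co-Z addition and its conjugate can be checked against the affine chord rule with a short Python sketch. The intermediate names A to F follow the usual presentation of Meloni's formulas, and the function names `zaddu`, `zaddc`, `to_affine`, and `affine_add` are hypothetical. The base point coordinates are the standard P-256 generator, and the check itself only relies on the two inputs having different x-coordinates.

```python
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
GX = 0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296
GY = 0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5

def inv(v):
    return pow(v, P - 2, P)

def zaddu(X1, Y1, X2, Y2, Z):
    """Co-Z addition (5M + 2S): returns P+Q and P rescaled to the same Z3."""
    A = (X2 - X1) ** 2 % P
    B = X1 * A % P
    C = X2 * A % P
    D = (Y2 - Y1) ** 2 % P
    E = Y1 * (C - B) % P
    X3 = (D - B - C) % P
    Y3 = ((Y2 - Y1) * (B - X3) - E) % P
    Z3 = Z * (X2 - X1) % P
    return (X3, Y3, Z3), (B, E, Z3)

def zaddc(X1, Y1, X2, Y2, Z):
    """Conjugate co-Z addition: returns P+Q and P-Q with a shared Z3."""
    (X3, Y3, Z3), (B, E, _) = zaddu(X1, Y1, X2, Y2, Z)
    C = X2 * (X2 - X1) ** 2 % P       # recomputed for clarity; shared in practice
    F = (Y1 + Y2) ** 2 % P
    X3c = (F - B - C) % P
    Y3c = ((Y1 + Y2) * (X3c - B) - E) % P
    return (X3, Y3, Z3), (X3c, Y3c, Z3)

def to_affine(pt):
    X, Y, Z = pt
    zi = inv(Z)
    return X * zi * zi % P, Y * zi ** 3 % P

def affine_add(p, q):
    """Chord rule for two points with different x-coordinates."""
    (x1, y1), (x2, y2) = p, q
    lam = (y2 - y1) * inv(x2 - x1) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def affine_double(p):
    """Tangent rule for a = -3."""
    x, y = p
    lam = (3 * x * x - 3) * inv(2 * y) % P
    x3 = (lam * lam - 2 * x) % P
    return x3, (lam * (x - x3) - y) % P
```

To use the co-Z routines, two affine points are first lifted to a common Z, for example `(x * Z**2 % P, y * Z**3 % P)` for any nonzero `Z`.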

Montgomery Ladder for Regular Scalar Multiplication
For resistance against SPA attacks, we use a regular scalar multiplication algorithm, which always performs the same operations in each iteration, regardless of the input scalar. For regular scalar multiplication, we choose the Montgomery ladder proposed by Rivain in [26] for short Weierstraß-form elliptic curves over large prime fields. This Montgomery ladder is based on the co-Z arithmetic and uses only the $(X, Y)$ coordinates, without computing the Z-coordinate of the Jacobian coordinate system. The co-Z point addition takes the $(X, Y)$ coordinates of two co-Z points $P$ and $Q$ and computes the $(X, Y)$ coordinates of $P + Q$ and $P$ at a cost of 4M + 2S + 7A. The co-Z conjugate point addition, on the other hand, computes the $(X, Y)$ coordinates of $P + Q$ and $P - Q$ at a cost of 5M + 3S + 11A. The Montgomery ladder computes the scalar multiplication with a regular pattern, always performing one co-Z point addition and one co-Z conjugate point addition in each iteration. Hence, our scalar multiplication provides resistance against SPA attacks.
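The regular pattern of a Montgomery ladder can be illustrated with a textbook version in affine coordinates. This sketch is not the $(X, Y)$-only co-Z ladder of [26]: it uses one generic addition and one doubling per scalar bit, and the swaps are written as Python branches, which are not constant-time. The function names are hypothetical, and the base point coordinates are the standard P-256 generator.

```python
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
G = (0x6B17D1F2E12C4247F8BCE6E563A440F277037D812DEB33A0F4A13945D898C296,
     0x4FE342E2FE1A7F9B8EE7EB4A7C0F9E162BCE33576B315ECECBB6406837BF51F5)

def inv(v):
    return pow(v, P - 2, P)

def add(p, q):
    """Affine group law for a = -3; None stands for the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        if (y1 + y2) % P == 0:
            return None
        lam = (3 * x1 * x1 - 3) * inv(2 * y1) % P
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P
    x3 = (lam * lam - x1 - x2) % P
    return x3, (lam * (x1 - x3) - y1) % P

def ladder(k, p, bits=256):
    """Montgomery ladder: r1 - r0 = p holds before and after every step."""
    r0, r1 = None, p
    for i in reversed(range(bits)):
        b = (k >> i) & 1
        if b:                       # a real implementation swaps without branching
            r0, r1 = r1, r0
        r1 = add(r0, r1)            # one addition per bit
        r0 = add(r0, r0)            # one doubling per bit
        if b:
            r0, r1 = r1, r0
    return r0
```

Every iteration performs exactly one addition and one doubling whatever the value of the bit, which is the property that the co-Z ladder preserves while replacing both operations by a co-Z conjugate addition followed by a co-Z addition.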

Performance Analysis and Comparison
We implement scalar multiplication over the NIST P-256 curve on an 8-bit AVR ATmega128 processor. More precisely, we optimize the field-level operations on RSR, including modular multiplication, modular squaring, and modular addition, in assembly language. We implement the elliptic curve group operations and the scalar multiplication in C. We use the computer algebra system Magma to validate the results of scalar multiplication. We simulate all test programs with AVR Studio 7.0 in order to obtain the execution time in clock cycles.
Table 2 compares the execution time of the proposed field arithmetic operations with the best previously published result. A modular addition or subtraction requires 390 clock cycles, while a modular multiplication and a modular squaring need 5355 and 3691 clock cycles, respectively. Compared with the work of Zhou et al. [6], our modular addition and subtraction are slightly faster, while our modular multiplication and squaring improve performance by about 20% and 24%, respectively.
Table 3 compares our work with other prime-field ECC implementations over different NIST curves with respect to execution time, SPA resistance, and code size. Those implementations use NIST primes that range from 192 bits to 256 bits, which provide more than 80-bit security. Our ECC implementation over the NIST P-256 curve requires only $20.98 \times 10^6$ cycles, which is faster than the previous work in [6]. Our proposed implementation achieves the best performance for the P-256 curve. The work of Zhou et al. [6] on the NIST P-256 curve adopted the Montgomery ladder with $(x, y)$-only co-Z addition for scalar multiplication, as does our work. Hence, the performance gap between the work in [6] and ours comes from our optimization of the field-level operations, especially modular multiplication and modular squaring.
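The reported figures can be related to one another with a few lines of arithmetic. In the Python sketch below, the cycle counts and per-operation costs are taken from the text, whereas the assumption of roughly 256 ladder iterations and the use of the 7.3728 MHz MICAz clock are illustrative choices; the variable names are hypothetical.

```python
M, S, A = 5355, 3691, 390            # cycles per field operation (Table 2 values)

# One ladder iteration: co-Z addition (4M + 2S + 7A)
# plus co-Z conjugate addition (5M + 3S + 11A)
ladder_step = (4 + 5) * M + (2 + 3) * S + (7 + 11) * A
field_estimate = 256 * ladder_step   # assumption: about 256 iterations

total, previous = 20.98e6, 25.38e6   # cycles: this work and Zhou et al. [6]
gain = (previous - total) / previous
seconds_at_micaz = total / 7.3728e6  # assumption: MICAz clock frequency
```

Under these assumptions, the field operations of the ladder account for about $18.9 \times 10^6$ of the $20.98 \times 10^6$ cycles, the relative gain rounds to the reported 17.3%, and one scalar multiplication takes a little under 2.85 s at the MICAz clock frequency. The remainder is plausibly spent on the final inversion, the conversions, and the overhead of the C code, although the text does not break this down.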

Conclusions
In this paper, we introduce a fast ECC implementation over the NIST P-256 curve on an 8-bit AVR ATmega128 processor. We focus on applying RSR to all of the field operations in order to improve the performance of ECC. Firstly, we propose a 256-bit modular multiplication over $\mathbb{F}_{P_{256} \cdot W^{-16}}$, taking advantage of the optimization technique in [8]. Secondly, we propose a new modular squaring over the NIST P-256 prime. This modular squaring sets the speed record, providing an improvement of about 24% compared with the state-of-the-art. In addition, we propose an efficient masking method for modular addition, which ensures a constant execution time. Finally, we evaluate the impact of the RSR-optimized field operations on the performance of scalar multiplication. Compared to the state-of-the-art, our ECC implementation achieves the best performance of scalar multiplication over the NIST P-256 curve, with a 17.3% improvement.
In order to resist SPA attacks, we focus on eliminating the conditional statements that SPA attacks usually exploit. Our implementation does not use any conditional statements throughout the field and group operations. In particular, we implement a constant-time modular addition and a regular scalar multiplication to prevent SPA attacks.
Our optimized scheduling can be similarly applied to modular multiplication and squaring over other NIST primes. In the future, we will implement ECC over other NIST curves, such as P-384 and P-521.