Abstract
In this paper, we present the first constant-time implementations of four-dimensional Gallant–Lambert–Vanstone and Galbraith–Lin–Scott (GLV-GLS) scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. At Asiacrypt 2012, Longa and Sica introduced four-dimensional GLV-GLS scalar multiplication and reported implementation results on Intel processors. However, they did not consider efficient implementations on resource-constrained embedded devices. We have optimized the performance of scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. Our implementations compute a variable-base scalar multiplication in 6,856,026, 4,158,453, and 447,836 cycles on AVR, MSP430, and ARM Cortex-M4 processors, respectively. To date, FourQ-based scalar multiplication has provided the fastest implementation results on AVR, MSP430, and ARM Cortex-M4 processors. Compared to FourQ-based scalar multiplication, the proposed implementations require 4.49% more computational cost on AVR, but save 2.85% and 4.61% of the cycles on MSP430 and ARM, respectively. Our 16-bit and 32-bit implementation results set new speed records for variable-base scalar multiplication.
1. Introduction
Wireless sensor networks (WSNs) are wireless networks consisting of a large number of resource-constrained sensor nodes, where each node is equipped with a sensor to monitor physical phenomena, such as temperature, light, and pressure. The main features of WSNs are resource constraints, such as storage, computing power, and sensing distance. Recently, the energy consumption of data centers has attracted attention because of the fast growth of data throughput. WSNs can provide a solution for data collection and data processing in various applications including data center monitoring. That is, WSNs can be utilized for data center monitoring to improve the efficiency of energy consumption. Several solutions were proposed to solve this problem [1,2].
Since sensor nodes are usually deployed in remote areas and left unattended, they are exposed to network security threats, such as node capture, eavesdropping, and message tampering during data communication. Additionally, many application areas of WSNs require data confidentiality, integrity, authentication, and non-repudiation, meaning there is a need for an efficient cryptographic mechanism that satisfies current security requirements. However, due to the constraints of WSNs, it is difficult to utilize conventional cryptographic algorithms. Therefore, efficient cryptographic algorithms considering code size, computation time, and power consumption are required for the security of WSNs.
In 1985, elliptic curve cryptography (ECC) was proposed independently by Miller and Koblitz [3,4]. ECC is mainly used for digital signatures and key exchange based on the elliptic curve discrete logarithm problem (ECDLP), which is defined over elliptic curve point operations in a finite field. ECC provides the same security level with a smaller key size compared to existing public key cryptography (PKC) algorithms such as the Rivest–Shamir–Adleman (RSA) cryptosystem [5]. For example, ECC over F_p with a 256-bit prime p provides a security level equivalent to RSA with a 3072-bit key. Because RSA typically uses a small integer as the public exponent, RSA public key operations can be computed efficiently. However, RSA private key operations are far slower than ECC operations, so they have limited use in WSN applications. Therefore, ECC can be utilized more efficiently than RSA on resource-constrained WSN devices, such as smart cards and sensor nodes.
However, recently revealed manipulations and backdoors have raised suspicions of weaknesses in previous ECC standards. In particular, the National Institute of Standards and Technology (NIST) P-224 curve is not secure against twist attacks, which combine small-subgroup attacks and invalid-curve attacks using the quadratic twist of the curve [6]. The dual elliptic curve deterministic random bit generator (Dual_EC_DRBG) is a pseudo-random number generator (PRNG) that was standardized in NIST SP 800-90A. However, the revised version of the NIST SP 800-90A standard removed Dual_EC_DRBG because the algorithm is believed to contain a backdoor for the National Security Agency (NSA) [7].
Therefore, the demand for next-generation elliptic curves has increased. Specific examples of such curves are Curve25519, Ed448-Goldilocks, and the twisted Edwards curves [8,9,10]. The main feature of these curves is the selection of efficient parameters. Curve25519 utilizes the prime 2^255 − 19 and a fast Montgomery elliptic curve. The Ed448-Goldilocks curve utilizes the Solinas trinomial prime p = 2^448 − 2^224 − 1, which provides fast field arithmetic on both 32-bit and 64-bit machines because 2^448 ≡ 2^224 + 1 (mod p). These parameters accelerate the performance of ECC-based protocols. The details of twisted Edwards curves can be found in Section 2.3.
Scalar multiplication, or point multiplication, computes kP from an elliptic curve point P and a scalar k. This operation determines the performance of ECC. Therefore, many researchers have proposed various methods to improve the efficiency of scalar multiplication. The speed-up methods for scalar multiplication can be classified into three types: methods that speed up generic exponentiation, such as comb techniques and windowing methods; scalar recoding methods; and methods that are particular to elliptic curve scalar multiplication [11].
Speed-up methods using efficiently computable endomorphisms are one type of method particular to elliptic curve scalar multiplication. The Gallant–Lambert–Vanstone (GLV) method proposed by Gallant et al. accelerates scalar multiplication by using efficiently computable endomorphisms [11]. If the cost of computing the endomorphism is less than (bit-length of the curve order)/3 elliptic curve point doubling (ECDBL) operations, then this method has a computational advantage. Their method eliminates about half of the ECDBL operations and reduces the cost of scalar multiplication by roughly 33%. Additionally, recent studies have reported that scalar multiplication methods using efficiently computable endomorphisms are significantly faster than generic methods. The Galbraith–Lin–Scott (GLS) curves proposed by Galbraith et al. provide an efficiently computable endomorphism for elliptic curves defined over F_{p^2}, where p is a prime number [12]. They demonstrated that the GLV method can efficiently compute scalar multiplication on such curves. Longa and Gebotys [13] presented an efficient implementation of two-dimensional GLS curves over F_{p^2}.
In 2012, Longa and Sica [14] proposed four-dimensional GLV-GLS curves over F_{p^2}, which generalized the GLV method and GLS curves. Hu et al. [15] proposed a GLV-GLS curve over F_{p^2} that supports four-dimensional scalar decomposition. They reported implementation results indicating that four-dimensional GLV-GLS scalar multiplication reduces the computational cost by up to 22% compared to the two-dimensional GLV method. Bos et al. [16] proposed two- and four-dimensional scalar decompositions over genus 2 curves defined over F_p. Bos et al. [17] introduced an eight-dimensional GLV-GLS method over genus 2 curves defined over F_{p^2}. Oliveira et al. [18] presented implementation results for a two-dimensional GLV method over binary GLS elliptic curves. Guillevic and Ionica [19] utilized the four-dimensional GLV method on genus 1 curves defined over F_{p^2} and genus 2 curves defined over F_p. Smith [20] proposed a new family of elliptic curves over F_{p^2}, called “Q-curves”. Costello and Longa [21] introduced a four-dimensional curve defined over F_{p^2}, called “FourQ”. They reported implementation results for FourQ on various Intel and AMD processors.
After FourQ was proposed, many implementation results were reported for various environments, such as AVR, MSP430, ARM, and field-programmable gate array (FPGA) devices [22,23,24]. An efficient FourQ-based implementation on a 32-bit ARM processor with the NEON single instruction multiple data (SIMD) instruction set was proposed by Longa [22]. Järvinen et al. [23] proposed a fast and compact FourQ-based implementation on an FPGA device. At CHES 2017, Liu et al. [24] presented highly optimized implementations using curve FourQ on 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors.
In the case of curve Ted127-glv4, Longa and Sica and Faz-Hernández et al. [14,25] reported implementation results on high-end processors, such as Intel Sandy Bridge, Intel Ivy Bridge, and ARM Cortex-A processors. However, efficient implementations on resource-constrained embedded devices have not been considered to date. Therefore, we focused on optimized implementations of scalar multiplication using curve Ted127-glv4 on the 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors.
Our main contributions can be summarized as follows:
- We present efficient implementations at each level of the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication considering the features of 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors. To improve the performance of scalar multiplication, we carefully selected the internal algorithms at each level of the implementation hierarchy. These implementations also run in constant time to resist timing and cache-timing attacks [26,27].
- We demonstrate that efficiently computable endomorphisms can accelerate four-dimensional GLV-GLS scalar multiplication. For this purpose, we analyze the operation counts of two elliptic curves, “Ted127-glv4” and “FourQ”, which support four-dimensional GLV-GLS scalar multiplication. The GLV-GLS curve Ted127-glv4 requires fewer field arithmetic operations than FourQ to compute a single variable-base scalar multiplication. However, because FourQ uses the Mersenne prime 2^127 − 1 and curve Ted127-glv4 uses the Mersenne-like prime 2^127 − 5997, FourQ has the computational advantage of faster field arithmetic operations. By exploiting the computational advantage of its endomorphisms, we overcome the computational disadvantage of curve Ted127-glv4 at the field arithmetic level.
- We present the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on three target platforms, which had not been considered in previous works. The proposed implementations on AVR, MSP430, and ARM processors require 6,856,026, 4,158,453, and 447,836 cycles, respectively, to compute a single variable-base scalar multiplication. Compared to the FourQ-based implementations [24], which have provided the fastest results to date, our results are 4.49% slower on AVR, but 2.85% and 4.61% faster on MSP430 and ARM, respectively. Our MSP430 and ARM implementations set new speed records for variable-base scalar multiplication.
The remainder of this paper is organized as follows. Section 2 describes preliminaries regarding ECC and its speed-up techniques, including the GLV and GLS methods. Section 3 presents a review of four-dimensional GLV-GLS scalar multiplication and its implementation hierarchy. Section 4 describes the implementation details of field arithmetic and optimization methods for the target platforms. Section 5 describes optimization methods for ECC in terms of point arithmetic and scalar multiplication. Experimental results and a comparison of our work to previous ECC implementations on AVR, MSP430, and ARM processors are presented in Section 6. Finally, we conclude this paper in Section 7.
2. Preliminaries
In Section 2.1, we describe the field representation and notations used for the remainder of this paper. We briefly describe ECC using a short Weierstrass curve and its group law in Section 2.2. We also describe twisted Edwards curves, which are the target of our implementation, in Section 2.3. In Section 2.4, we describe the GLV-GLS method including the GLV method and GLS curves.
2.1. Field Representation and Notations
We assume that the target platform has a w-bit architecture. Let n be the bit-length of a Mersenne-like prime p = 2^e − c, where c is small. Let m = ⌈n/w⌉ be its word-length. Then, an arbitrary element a ∈ F_p is represented by an array of m w-bit words. The notations M, S, I, and A represent multiplication, squaring, inversion, and addition (subtraction) over F_{p^2}, respectively. Similarly, the notations m, s, i, and a represent multiplication, squaring, inversion, and addition (subtraction) over F_p, respectively. We use separate notations for multi-precision addition without modular reduction and for multiplication by a curve parameter.
2.2. Elliptic Curve Cryptography
Let F_q be a finite field with odd characteristic. An elliptic curve E over F_q is defined by a short Weierstrass equation of the following form:

y^2 = x^3 + ax + b,

where a, b ∈ F_q and 4a^3 + 27b^2 ≠ 0.
Because the most important operation in ECC is scalar multiplication kP, it must be implemented efficiently. The basic method for computing kP is comprised of two elliptic curve operations: the elliptic curve point addition (ECADD) and ECDBL operations. Let P = (x1, y1) and Q = (x2, y2) be two points on an elliptic curve E. The ECADD operation P + Q = (x3, y3) and the ECDBL operation 2P can be computed in affine coordinates as follows:

x3 = λ^2 − x1 − x2, y3 = λ(x1 − x3) − y1,

where λ = (y2 − y1)/(x2 − x1) for ECADD (P ≠ ±Q) and λ = (3x1^2 + a)/(2y1) for ECDBL.
The ECADD and ECDBL operations are composed of finite field arithmetic operations, such as field addition, subtraction, multiplication, squaring, and inversion. Therefore, to improve the performance of scalar multiplication, the internal algorithms such as field and curve arithmetic operations should be efficiently implemented.
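The affine formulas above can be sketched as follows. This is an illustrative toy over a small prime (the parameters are not from the paper); a real implementation avoids the per-operation inversion by using projective coordinates and constant-time field code.

```python
# Sketch of the affine group law on y^2 = x^3 + a*x + b over F_p.
# Toy parameters for illustration only.
p, a, b = 97, 2, 3

def inv(x):
    return pow(x, p - 2, p)          # Fermat inversion

def ecadd(P, Q):
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * inv(x2 - x1) % p    # chord slope, P != +-Q
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def ecdbl(P):
    x1, y1 = P
    lam = (3 * x1 * x1 + a) * inv(2 * y1) % p  # tangent slope
    x3 = (lam * lam - 2 * x1) % p
    return x3, (lam * (x1 - x3) - y1) % p

def on_curve(P):
    x, y = P
    return (y * y - x * x * x - a * x - b) % p == 0

P = (3, 6)                # 6^2 = 36 = 3^3 + 2*3 + 3 (mod 97)
Q = ecdbl(P)              # 2P
R = ecadd(P, Q)           # 3P
assert on_curve(P) and on_curve(Q) and on_curve(R)
```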
2.3. Twisted Edwards Curves
The Edwards curves are a normal form of elliptic curves introduced by Edwards [28]. Bernstein and Lange [29] introduced Edwards curves defined by x^2 + y^2 = 1 + dx^2y^2, where d ∈ F_q with d ∉ {0, 1}. In 2007, Bernstein et al. [10] introduced twisted Edwards curves, which are a generalization of Edwards curves defined by

ax^2 + y^2 = 1 + dx^2y^2,

where a, d ∈ F_q with ad(a − d) ≠ 0. The Edwards curves are a special case of twisted Edwards curves with a = 1. The point (0, 1) is the identity element and the point (0, −1) has order two. The points (1/√a, 0) and (−1/√a, 0) have order four. The negative of a point (x, y) is (−x, y). The ECADD operation of two points P = (x1, y1) and Q = (x2, y2) on a twisted Edwards curve E is defined as follows:

P + Q = ((x1y2 + y1x2)/(1 + dx1x2y1y2), (y1y2 − ax1x2)/(1 − dx1x2y1y2)).
Because the addition law is unified, it can also be used for computing the ECDBL operation. Suppose that the two points P and Q have odd order. Then, the denominators 1 + dx1x2y1y2 and 1 − dx1x2y1y2 of the addition formula are nonzero. Therefore, the doubling formula can be obtained as follows:

2P = ((2x1y1)/(1 + dx1^2y1^2), (y1^2 − ax1^2)/(1 − dx1^2y1^2)).
Two relationships can be obtained from the curve equation: ax1^2 + y1^2 = 1 + dx1^2y1^2 and ax2^2 + y2^2 = 1 + dx2^2y2^2. After straightforward elimination, the curve parameter d can be removed from the denominators. Substitutions in the unified addition formula yield the following addition formula:

P + Q = ((x1y1 + x2y2)/(y1y2 + ax1x2), (x1y1 − x2y2)/(x1y2 − y1x2)).
These addition and doubling formulas are used in the dedicated addition and doubling formulas described in Section 5. A notable feature of these formulas is that they are independent of the curve parameter d [30].
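The unified twisted Edwards addition law can be sketched as follows. The parameters are toy values over a small prime, not the Ted127-glv4 constants; the sketch checks the identity element and that doubling via the unified law stays on the curve.

```python
# Sketch of the unified addition law on a*x^2 + y^2 = 1 + d*x^2*y^2.
# Toy parameters for illustration only.
p, a, d = 13, -1, 3

def inv(x):
    return pow(x, p - 2, p)

def on_curve(P):
    x, y = P
    return (a*x*x + y*y - 1 - d*x*x*y*y) % p == 0

def add(P, Q):
    # unified law: also valid when P == Q (doubling)
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2
    x3 = (x1*y2 + y1*x2) * inv((1 + t) % p) % p
    y3 = (y1*y2 - a*x1*x2) * inv((1 - t) % p) % p
    return x3, y3

O = (0, 1)                 # identity element
P = (1, 5)                 # -1 + 25 = 24 = 1 + 3*25 = 76 (mod 13)
assert on_curve(P)
assert add(O, P) == P      # (0, 1) acts as identity
D = add(P, P)              # doubling via the unified law
assert on_curve(D)
```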
2.4. The GLV-GLS Method
We will now describe the GLV method to explain the GLV-GLS method. Let E be an elliptic curve defined over a finite field F_q. An endomorphism φ of E over F_q is a rational map φ: E → E such that φ(O) = O and φ(P) = (g(P), h(P)) for all points P ∈ E, where g and h are rational functions and O is the point at infinity. An endomorphism is a group homomorphism; that is, φ(P + Q) = φ(P) + φ(Q) for all P, Q ∈ E.
Suppose that E(F_q) contains a subgroup of prime order r and let φ be an efficiently computable endomorphism on E such that φ(P) = λP for some λ ∈ [1, r − 1]. The GLV method computes integers k1 and k2 such that k = k1 + k2λ mod r for scalar multiplication kP. Because

kP = k1P + k2(λP) = k1P + k2φ(P),
scalar multiplication kP can be computed by first computing φ(P) and then using multiple scalar multiplication techniques [31]. This is efficient because the multi-scalars k1 and k2 have approximately half the bit-length of the scalar k. The efficiency of the GLV method depends on the scalar decomposition and the cost of computing the endomorphism φ.
The main concept of the GLS curves is described as follows. Let E′ be a quadratic twist of E over F_{q^2} [12]. Let φ: E → E′ be the quadratic twist map and π be the q-th power Frobenius endomorphism. Then, we can obtain the efficiently computable endomorphism ψ = φ ∘ π ∘ φ^{−1} on E′, which satisfies the equation ψ(P) = λP with λ^2 ≡ −1 (mod r) if P ∈ E′(F_{q^2}) is a point of prime order r. However, the GLS construction only works for elliptic curves defined over extension fields such as F_{q^2}; it provides no advantage for curves over prime fields.
As mentioned in the introduction, the GLV-GLS method generalizes the GLV method and GLS curves. Let Φ and Ψ be two efficiently computable endomorphisms defined over F_{p^2} and let P be a point of prime order r. Then, the four-dimensional scalar multiplication for any scalar k ∈ [1, r − 1] can be computed as follows:

kP = k1P + k2Φ(P) + k3Ψ(P) + k4ΨΦ(P),

where max_i(|k_i|) < C·r^{1/4} for i = 1, …, 4 and C is some explicit constant. The details of the internal algorithms of four-dimensional scalar multiplication can be found in Section 4 and Section 5.
3. Review of Four-Dimensional GLV-GLS Scalar Multiplication
The curve Ted127-glv4 was introduced by Longa and Sica [14]. It is based on twisted Edwards curves and has efficiently computable endomorphisms, which facilitates four-dimensional GLV-GLS scalar multiplication. The parameters of curve Ted127-glv4 are as follows:

E/F_{p^2}: −x^2 + y^2 = 1 + dx^2y^2,

where p = 2^127 − 5997, d is a fixed curve constant in F_{p^2}, and #E(F_{p^2}) = 8r, where r is a 251-bit prime. Let F_{p^2} = F_p[i]/(i^2 + 1) and let u be a quadratic non-residue in F_{p^2}. E is isomorphic to a Weierstrass curve via a standard birational map. The curve Ted127-glv4 contains two efficiently computable endomorphisms Φ and Ψ defined over F_{p^2}, whose explicit formulas involve a primitive eighth root of unity ζ8. It can be verified that Φ^2 + 2 = 0 and Ψ^2 + 1 = 0 on the subgroup of order r.
Let P be a point of order r and k be a random scalar in the range [1, r − 1]. Algorithm 1 outlines variable-base scalar multiplication using curve Ted127-glv4 and four-dimensional decompositions. Steps 1 and 2 in Algorithm 1 compute the three endomorphisms Φ(P), Ψ(P), and ΨΦ(P), and then compute the eight points T[u] = P + u0·Φ(P) + u1·Ψ(P) + u2·ΨΦ(P), where u = (u2, u1, u0)_2 for u ∈ [0, 7]. Step 3 decomposes the input scalar k into multi-scalars (a1, a2, a3, a4) such that k ≡ a1 + a2λΦ + a3λΨ + a4λΦλΨ (mod r). For a constant-time implementation, the multi-scalars must guarantee the same number of iterations of the main computation. Because all coordinates of the scalar decomposition are less than 2^64, we apply the scalar recoding algorithm to guarantee a fixed loop length for the main computation in step 4 [25]. The result of the scalar recoding is represented by 66 lookup table indices d_i and 66 masks m_i, where 0 ≤ i ≤ 65. Steps 5 to 9 represent the main computation stage, including point loading, the ECADD operation, and the ECDBL operation. The result of the main computation is converted from extensible coordinates to affine coordinates in step 10. Therefore, a variable-base scalar multiplication using curve Ted127-glv4 requires one Φ endomorphism, two Ψ endomorphisms, and seven ECADD operations in the precomputation; 65 table lookups, 65 ECADD operations, and 65 ECDBL operations in the main computation; and one inversion and two field multiplications over F_{p^2} for point normalization.
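The constant-time table access in the main computation can be sketched as follows. Every entry is touched and a branch-free mask selects the secret-indexed one; the integer entries here stand in for the precomputed points T[0..7] (in the real code, extended coordinates and the recoded sign mask are applied the same way).

```python
# Branch-free lookup sketch: the memory access pattern is independent of
# the secret index because every table entry is read on every lookup.
def ct_select(table, index):
    result = 0
    for j, entry in enumerate(table):
        diff = j ^ index
        sel = ((diff - 1) >> 63) & 1   # 1 iff j == index, without branching
        result |= entry & -sel         # mask is all-ones (select) or zero
    return result

table = [100 + u for u in range(8)]    # stands in for T[0..7]
assert all(ct_select(table, u) == table[u] for u in range(8))
```

Note that Python integers only model the idea; on the target MCUs the same pattern is written with word-sized masks so no data-dependent branch or address is ever produced.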
Figure 1 describes the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication and its internal algorithms. Because the implementation algorithms at each level affect the performance of scalar multiplication, we carefully chose the proper algorithms considering the features of the AVR, MSP430, and ARM processors. Additionally, field arithmetic over F_{p^2} and curve arithmetic are composed of field arithmetic over F_p, which is the primary computational operation. Therefore, field arithmetic over F_p is written at the assembly level.
Figure 1.
The implementation hierarchy of four-dimensional Gallant-Lambert-Vanstone and Galbraith-Lin-Scott (GLV-GLS) scalar multiplication.
| Algorithm 1: Scalar multiplication using curve Ted127-glv4 [21]. |
4. Implementation Details of Field Arithmetic
In this section, we describe the implementation details of field arithmetic on AVR, MSP430X, and ARM Cortex-M4 processors using the Mersenne-like prime p = 2^127 − 5997. We describe the field arithmetic algorithms that are common to the three target platforms in Section 4.1, Section 4.2, Section 4.3 and Section 4.4. In Section 4.5, Section 4.6 and Section 4.7, we describe our optimization strategies for field arithmetic on the AVR, MSP430, and ARM processors, respectively.
4.1. Field Addition and Subtraction over F_p
The curve Ted127-glv4 uses a Mersenne-like prime of the form p = 2^127 − c with c = 5997. An efficient field addition/subtraction method for this setting was proposed by Bos et al. [16]. Field addition over F_p can be computed by r = a + b + c − ε, where ε = 2^127 if a + b + c ≥ 2^127. Otherwise, ε = c. The result is bounded by p because, if a + b + c ≥ 2^127, then r = a + b − p, whereas if a + b + c < 2^127, then r = a + b < p. Because c is small, the initial addition of c does not require full carry propagation. Note that subtraction of 2^127 can be efficiently implemented by clearing the 128-th bit of the intermediate result.
Similar to field addition, field subtraction over F_p can be computed by r = 2^127 + a − b − ε, where ε = 2^127 if a ≥ b; otherwise, ε = c. The subtraction of 2^127 can again be implemented by clearing the 128-th bit of the intermediate result.
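The carry trick above can be sketched as follows, with Python integers standing in for the multi-word values; on the target MCUs, each step maps to word-level add/adc operations and a top-bit test.

```python
# Sketch of addition/subtraction modulo p = 2^127 - 5997 via the
# add-the-constant-first carry trick (after Bos et al.).
import random

c = 5997
p = 2**127 - c
MASK = 2**127 - 1

def fp_add(x, y):
    t = x + y + c
    if t >> 127:          # carry into bit 127: result is x + y - p
        return t & MASK   # subtract 2^127 by clearing the top bit
    return t - c          # no carry: undo the added constant

def fp_sub(x, y):
    t = 2**127 + x - y
    if t >> 127:          # x >= y: clear the top bit
        return t & MASK
    return t - c          # borrow case: result is x - y + p

for _ in range(1000):
    x, y = random.randrange(p), random.randrange(p)
    assert fp_add(x, y) == (x + y) % p
    assert fp_sub(x, y) == (x - y) % p
```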
4.2. Modular Reduction
Using primes of a special form may enable a faster reduction method [31]. NIST recommends five primes for the elliptic curve digital signature algorithm (ECDSA). These primes can be represented as sums or differences of powers of two and facilitate fast reduction methods. The curve Ted127-glv4 uses the Mersenne-like prime p = 2^127 − c with c = 5997. Therefore, modular reduction can be efficiently computed using a NIST-like reduction method [16]. We compute t = a·b = t1·2^127 + t0, where t0 < 2^127. The first reduction step computes r = t0 + c·t1, using the congruence 2^127 ≡ c (mod p). Then, the second reduction step computes r = r0 + c·r1, where r = r1·2^127 + r0.
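The two folding steps can be sketched as follows; the sketch reduces a full 254-bit product and finishes with a conditional subtraction.

```python
# Sketch of the two-step NIST-like reduction modulo p = 2^127 - 5997,
# using 2^127 = 5997 (mod p). Input t may be as large as p^2.
import random

c = 5997
p = 2**127 - c
MASK = 2**127 - 1

def reduce_mod_p(t):
    r = (t & MASK) + c * (t >> 127)   # fold the high half: r < (c+1)*2^127
    r = (r & MASK) + c * (r >> 127)   # fold again: r < 2^127 + c^2
    if r >= p:                        # final conditional subtraction
        r -= p
    return r

for _ in range(1000):
    a, b = random.randrange(p), random.randrange(p)
    assert reduce_mod_p(a * b) == (a * b) % p
```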
4.3. Inversion over F_p
For field inversion over F_p, we use the fact that a^{−1} = a^{p−2} mod p by Fermat's little theorem (in our case, a^{−1} = a^{2^127 − 5999} mod p). This method can be implemented by modular exponentiation using a fixed addition chain and guarantees constant-time execution with a fixed number of multiplication and squaring operations.
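A minimal sketch of Fermat inversion follows; Python's `pow()` stands in for the fixed addition chain of multiplications and squarings used in the real constant-time code.

```python
# Fermat inversion a^(p-2) mod p for p = 2^127 - 5997.
p = 2**127 - 5997

def fp_inv(a):
    return pow(a, p - 2, p)   # a^(2^127 - 5999) mod p

assert fp_inv(2) * 2 % p == 1
assert fp_inv(123456789) * 123456789 % p == 1
```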
4.4. Field Arithmetic over F_{p^2}
The incomplete reduction method proposed by Yanık et al. [32] is one of the optimization methods for field arithmetic over F_{p^2}. Given two elements of F_p, the result of an operation stays in the range [0, 2^m − 1], where p < 2^m and m is a fixed integer (in our case, m = 128). Because the modulus of curve Ted127-glv4 is the Mersenne-like prime 2^127 − 5997, the incomplete reduction method can be applied advantageously.
Let a = a0 + a1·i and b = b0 + b1·i be two arbitrary elements of F_{p^2} = F_p[i]/(i^2 + 1). Field addition and subtraction over F_{p^2} can be computed component-wise as (a0 ± b0) + (a1 ± b1)·i. Field inversion over F_{p^2} can be computed by a^{−1} = (a0 − a1·i)/(a0^2 + a1^2), which requires a single inversion over F_p.
We utilize Karatsuba multiplication to compute field multiplication over F_{p^2}. Karatsuba multiplication uses the fact that a·b = (a0b0 − a1b1) + ((a0 + a1)(b0 + b1) − a0b0 − a1b1)·i, which can be computed with three multiplications over F_p. It requires more additions but saves one multiplication compared to the schoolbook method, which requires four multiplications over F_p. Because a field multiplication costs far more than the multi-precision additions and field additions, Karatsuba multiplication has a computational advantage. Algorithm 2 describes field multiplication over F_{p^2} using Karatsuba multiplication and the incomplete reduction method.
Algorithm 3 describes field squaring over F_{p^2} using the incomplete reduction method. Note that a^2 = (a0^2 − a1^2) + 2a0a1·i = (a0 + a1)(a0 − a1) + 2a0a1·i. The first representation can be computed with two squarings and one multiplication over F_p, and the second with two multiplications over F_p. Because one multiplication can be implemented faster than two squarings, we use the two-multiplication form to compute field squaring over F_{p^2}. The results of steps 3 and 4 in Algorithm 2 and steps 1 and 3 in Algorithm 3 are represented in incompletely reduced form.
| Algorithm 2: Field multiplication over F_{p^2} [25]. |
| Algorithm 3: Field squaring over F_{p^2} [25]. |
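The F_{p^2} multiplication and squaring forms described above can be sketched as follows; the sketch cross-checks the Karatsuba product against the schoolbook result.

```python
# Sketch of F_{p^2} = F_p[i]/(i^2 + 1) arithmetic: Karatsuba-style
# multiplication (3 base-field multiplications) and the 2-multiplication
# squaring form. Elements are pairs (a0, a1) representing a0 + a1*i.
p = 2**127 - 5997

def fp2_mul(a, b):
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 % p
    t1 = a1 * b1 % p
    # (a0 + a1)(b0 + b1) - t0 - t1 = a0*b1 + a1*b0   (Karatsuba)
    mid = ((a0 + a1) * (b0 + b1) - t0 - t1) % p
    return ((t0 - t1) % p, mid)

def fp2_sqr(a):
    a0, a1 = a
    # (a0 + a1*i)^2 = (a0 + a1)(a0 - a1) + (2*a0*a1)*i : two multiplications
    return ((a0 + a1) * (a0 - a1) % p, 2 * a0 * a1 % p)

a, b = (123456789, 987654321), (5, 7)
assert fp2_mul(a, a) == fp2_sqr(a)
# cross-check against schoolbook (a0*b0 - a1*b1, a0*b1 + a1*b0)
assert fp2_mul(a, b) == ((a[0]*b[0] - a[1]*b[1]) % p,
                         (a[0]*b[1] + a[1]*b[0]) % p)
```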
4.5. Optimization Strategy on 8-Bit AVR
The AVR processor is a family of 8-bit microcontrollers that is widely used in MICA2/MICAz sensor motes. The AVR processors are equipped with an 8-bit integer multiplier and a register file of 32 8-bit general-purpose registers numbered R0 to R31. The register pairs R27:R26, R29:R28, and R31:R30 are used as 16-bit indirect address registers called X, Y, and Z. The automatic increment and decrement addressing modes are supported on all of the X, Y, and Z registers, and Y and Z support a fixed positive displacement. The R1:R0 register pair stores the 16-bit result of an 8×8-bit multiplication. The AVR processors provide a typical 8-bit reduced instruction set computer (RISC) instruction set. The most important instructions for ECC are the 8×8-bit multiplication (MUL) and memory access (LD/ST) instructions, which require two cycles each. Instructions between two registers, such as addition (ADD/ADC) or subtraction (SUB/SBC), require only one cycle. Therefore, the basic optimization strategy on 8-bit AVR is reducing the number of memory access instructions.
To simulate our implementations, we targeted the ATxmega256A3 processor [33]. This processor can be clocked up to 32 MHz and provides 256 KB of programmable flash memory, 16 KB of SRAM, and 4 KB of EEPROM.
Recently, Hutter and Schwabe [34] proposed a highly optimized Karatsuba multiplication for the 8-bit AVR processor. There are two variants of the Karatsuba multiplication method: the additive and the subtractive Karatsuba method. Algorithm 4 outlines subtractive Karatsuba multiplication. We consider n×n-bit multiplication, where n is even (in our case, n = 128). The additive Karatsuba method can be computed similarly to Algorithm 4. However, the additive Karatsuba method may produce carry bits in the additions of the two half-operands. The additional multiplication handling these carry bits incurs a significant overhead for integer multiplication. The subtractive Karatsuba method does not produce carry bits in the computation of the middle term M, but instead computes the two absolute differences of the half-operands. This overhead is not only smaller than the overhead required for the additive Karatsuba method, but can also be executed in constant time. Therefore, we chose and implemented subtractive Karatsuba multiplication for the 8-bit AVR implementation.
| Algorithm 4: Subtractive Karatsuba multiplication [34]. |
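The subtractive Karatsuba idea can be sketched as follows. The absolute differences replace the half-operand additions of the additive variant, so no carry bit can appear in the middle-term multiplication.

```python
# Sketch of subtractive Karatsuba for n-bit operands split into halves.
import random

def karatsuba_sub(a, b, n=128):
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h            # low/high halves
    b0, b1 = b & mask, b >> h
    L = a0 * b0                          # low product
    H = a1 * b1                          # high product
    da, db = abs(a0 - a1), abs(b0 - b1)  # absolute differences: no carries
    sign = 1 if (a0 >= a1) == (b0 >= b1) else -1
    M = L + H - sign * da * db           # middle term a0*b1 + a1*b0
    return L + (M << h) + (H << n)

for _ in range(1000):
    a, b = random.getrandbits(128), random.getrandbits(128)
    assert karatsuba_sub(a, b) == a * b
```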
For integer squaring, we chose the sliding block doubling (SBD) method [35], which is more efficient than the subtractive Karatsuba method in the case of 128-bit operands on 8-bit AVR. To improve the performance of field arithmetic, we combined integer multiplication and squaring with modular reduction.
4.6. Optimization Strategy on 16-Bit MSP430X
The MSP430X processor was designed as an ultra-low power microcontroller based on a 16-bit RISC CPU. The MSP430X CPU has 16 20-bit registers numbered R0 to R15. Registers R0 to R3 are special-purpose registers that are used as the program counter, stack pointer, status register, and constant generator, respectively. Registers R4 to R15 are general-purpose registers that are used to store data values, address pointers, and index values.
The MSP430X instruction set does not include multiply or multiply-and-accumulate (MAC) instructions. Instead, the MSP430 family is equipped with a memory-mapped hardware multiplier. The hardware multiplier provides four different multiply operations selected by the register to which the first operand is written: MPY (unsigned multiplication), MPYS (signed multiplication), MAC (unsigned multiplication and accumulation), and MACS (signed multiplication and accumulation). The second operand register, OP2, is common to all multiplier modes. Namely, the first operand determines the operation type of the multiplier but does not start the operation; writing the second operand to the OP2 register starts the selected multiplication of the two values. The multiplication result is written to three result registers: RESLO, RESHI, and SUMEXT. RESLO stores the lower 16 bits of the result, RESHI stores the upper 16 bits of the result, and SUMEXT stores the carry bit or sign of the result.
The MSP430X processor provides seven addressing modes for the source operand and four addressing modes for the destination operand. The total computation time depends on the instruction format and the addressing modes of the operands. Instructions between two CPU registers require only one cycle. However, a memory access instruction (e.g., MOV with a memory operand) requires two to six cycles depending on the addressing modes of its operands. To improve the performance of field arithmetic, the basic optimization strategies are reducing the number of memory access instructions and efficiently utilizing the MAC operations.
In our implementations, we targeted the MSP430FR5969 processor [36]. This processor is equipped with 64 KB of program flash memory and 2 KB of RAM and can be clocked up to 16 MHz.
For integer multiplication on the 16-bit MSP430X processor, we chose and implemented product scanning multiplication. Algorithm 5 outlines the product scanning method for multi-precision multiplication. The first loop in Algorithm 5 computes the lower half of the multiplication result c, and the second loop computes the upper half of the result c. It accumulates the partial products of the inner loop, and these operations can be efficiently computed using the MAC operations of the hardware multiplier. Specifically, two 16-bit operands are multiplied and the result is added to the intermediate value s, which is held in RESLO, RESHI, and SUMEXT.
In FourQ [24], integer squaring was implemented using the SBD method [35]. We utilize the product scanning method for 128-bit integer squaring on the 16-bit MSP430X. It can be easily implemented by modifying product scanning multiplication. Additionally, this method results in better performance than the SBD method used in FourQ. The implementation results can be found in Section 6.2.
| Algorithm 5: Product scanning multiplication. |
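The column-wise accumulation pattern of product scanning can be sketched as follows, with the Python accumulator playing the role of the SUMEXT:RESHI:RESLO triple.

```python
# Sketch of product-scanning (column-wise) multi-precision multiplication
# with 16-bit words, the pattern mapped onto the MSP430X hardware MAC.
import random

W = 16
MASK = (1 << W) - 1

def to_words(x, n):
    return [(x >> (W * i)) & MASK for i in range(n)]

def product_scanning(a, b, n=8):
    A, B = to_words(a, n), to_words(b, n)
    c = [0] * (2 * n)
    acc = 0                          # accumulator: SUMEXT:RESHI:RESLO
    for k in range(2 * n - 1):       # one column per output word
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += A[i] * B[k - i]   # multiply-and-accumulate
        c[k] = acc & MASK            # emit the column's low word
        acc >>= W                    # carry the rest to the next column
    c[2 * n - 1] = acc
    return sum(w << (W * i) for i, w in enumerate(c))

for _ in range(200):
    a, b = random.getrandbits(128), random.getrandbits(128)
    assert product_scanning(a, b) == a * b
```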
4.7. Optimization Strategy on 32-Bit ARM
The ARM Cortex-M is a family of 32-bit RISC ARM processors for microcontrollers. The Cortex-M4 processor is a high-performance Cortex-M processor with digital signal processing (DSP), SIMD, and MAC instructions. It is based on the ARMv7E-M architecture and is equipped with 16 32-bit registers numbered R0 to R15. Registers R13 to R15 are special-purpose registers that are used as the stack pointer (SP), link register (LR), and program counter (PC), respectively. The Cortex-M4 instruction set provides multiply and MAC instructions, such as UMULL, UMLAL, and UMAAL. The UMULL instruction multiplies two unsigned 32-bit operands to obtain a 64-bit result. The UMLAL and UMAAL instructions multiply two unsigned 32-bit operands and accumulate a single 64-bit value and two 32-bit values, respectively.
In our implementations, we used the STM32F407-DISC1 board, which contains a 32-bit ARM Cortex-M4 STM32F407VGT6 microcontroller [37]. This microcontroller is equipped with 1 MB of flash memory, 192 KB of SRAM, and 64 KB of core-coupled memory (CCM) data RAM and can be clocked up to 168 MHz.
For integer multiplication and squaring, we implemented the operand scanning method using the efficient MAC instructions. Additionally, these MAC instructions facilitate an efficient implementation of modular reduction. The first reduction step computes r = t0 + c·t1, where t = t1·2^127 + t0. The intermediate values t0 and t1 are each loaded into a group of 32-bit registers and the constant c into another register; the partial products c·t1 are then accumulated onto t0 with a chain of UMLAL/UMAAL instructions. The results of the first reduction are held in registers. The second reduction can then be computed using simple multiplication (MUL) and addition (ADD/ADC) instructions.
For further improvement of the field arithmetic, we implemented field arithmetic over F_{p^2} at the assembly level [24,38]. In the case of field multiplication over F_{p^2}, we utilized operand scanning multiplication with a lazy reduction method, which accumulates the intermediate products and performs a single reduction at the end. The operand scanning method results in better performance than Karatsuba multiplication in this setting. Field squaring over F_{p^2} is likewise implemented at the assembly level.
5. Implementation Details of Curve Arithmetic
In this section, we describe the scalar decomposition and curve arithmetic that are commonly used on three target platforms. Section 5.1 describes the scalar decomposition and recoding methods for multi-scalars. The details of point arithmetic, coordinate system, and endomorphisms are described in Section 5.2 and Section 5.3.
5.1. Scalar Decomposition
In this subsection, we describe the scalar decomposition method, which maps a random integer k ∈ [1, r − 1] to multi-scalars (k1, k2, k3, k4) such that k ≡ k1 + k2λΦ + k3λΨ + k4λΦλΨ (mod r) with max_i(|k_i|) < C·r^{1/4} for some explicit constant C. Let F: Z^4 → Z_r be the four-dimensional GLV-GLS reduction map defined by

F(x1, x2, x3, x4) = x1 + x2λΦ + x3λΨ + x4λΦλΨ (mod r).
Let B be a matrix consisting of four linearly independent vectors b1, b2, b3, b4 ∈ ker F. Then, for any k, the decomposition method computes (α1, α2, α3, α4) = (k, 0, 0, 0)·B^{−1} and computes the multi-scalars

(k1, k2, k3, k4) = (k, 0, 0, 0) − Σ_{i=1}^{4} ⌊α_i⌉·b_i,
where ⌊·⌉ represents a rounding operation. There are two typical methods for decomposing a scalar: the Babai rounding method [39] and the division-in-a-ring method in Z[φ], where φ is an efficiently computable endomorphism [40]. In [14], lattice reduction algorithms based on Cornacchia's algorithms were proposed for finding a uniformly short basis. The first step finds Cornacchia's GCD in Z and the second step applies Cornacchia's algorithm in Z[i]. We utilize these two algorithms to find four linearly independent vectors in ker F with small rectangle norms. The coordinates of these vectors are used in the scalar decomposition. Additionally, the relationships among the four vectors reduce the number of fixed constants: two vectors b1 and b2 can represent the remaining vectors b3 and b4.
Let be the matrix formed by replacing in B with the vector . We then define four precomputed constants , where . The four-dimensional decomposition computes using four integer multiplications, four integer divisions, and four rounding operations. Bos et al. [17] introduced an efficient rounding method that eliminates the integer divisions. This method chooses an integer m such that and precomputes the fixed constants . Then, can be computed as , where the division by reduces to a shift operation. The four-dimensional decomposition of a random scalar k using curve - can be computed as follows:
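The division-free rounding step of Bos et al. can be sketched as follows. The two-dimensional lattice, basis vectors, and precomputed constants below are toy values chosen for illustration, not the actual curve constants (the real decomposition is four-dimensional).

```python
def round_div_pow2(x, m):
    """round(x / 2^m) using only an addition and a shift (no division)."""
    return (x + (1 << (m - 1))) >> m

def decompose(k, c_hat, basis, m):
    """Babai rounding with precomputed constants.

    c_hat[i] = round(2^m * alpha_i), where alpha_i is the i-th entry of the
    first row of the inverse basis matrix, so that
    z_i = round(k * c_hat[i] / 2^m) approximates round(k * alpha_i).
    Returns the multi-scalar a = k*e_1 - sum_i z_i * basis[i].
    """
    z = [round_div_pow2(k * c, m) for c in c_hat]
    a = [k] + [0] * (len(basis[0]) - 1)
    for zi, b in zip(z, basis):
        a = [aj - zi * bj for aj, bj in zip(a, b)]
    return a
```

For instance, with the toy GLV-style lattice {(x, y) : x + 1000·y ≡ 0 (mod 1000001)} and the short basis (1000, −1), (1, 1000), any scalar k < n decomposes into coordinates of roughly half the bit-length of n, and the congruence a₀ + 1000·a₁ ≡ k (mod n) holds exactly regardless of rounding errors.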
However, Ref. [21] reported that this method yields the correct answer and . They also reported that a larger m decreases the probability of a round-off error.
Because the multi-scalars lie between and , their coordinates can be either positive or negative. Signed multi-scalars incur additional cost in the scalar multiplication. Costello and Longa [21] introduced offset vectors such that all coordinates of the multi-scalars are always positive, which simplifies scalar recoding. However, this odd-only scalar recoding method requires that the first element of the multi-scalars is always odd. For constant-time execution and odd-only recoding, they found two offset vectors and such that and are valid decompositions of the scalar k and one of the two multi-scalars has an odd first element. To apply these methods to curve -, we carefully chose two offset vectors and . The multi-scalars and are valid decompositions of the scalar k. Finally, all four coordinates of both decompositions are positive and less than , and in one of them is odd.
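The constant-time choice between the two candidate decompositions can be sketched as below. This is an illustrative branch-free selection pattern, not the paper's exact code; the key point is that no secret-dependent branch or table index is used.

```python
def ct_select(bit, a, b):
    """Return list a if bit == 1, else list b, without secret-dependent branches.

    Python integers behave as infinite two's complement under bitwise
    operations, so mask is all-ones when bit == 1 and zero when bit == 0.
    """
    mask = -bit
    return [(x & mask) | (y & ~mask) for x, y in zip(a, b)]

def select_odd_first(dec1, dec2):
    """Pick the decomposition whose first coordinate is odd.

    By the offset-vector construction, exactly one of the two valid
    decompositions has an odd first coordinate.
    """
    return ct_select(dec1[0] & 1, dec1, dec2)
```

In an assembly implementation, the same effect is obtained with a mask derived from the low bit of the first coordinate and word-wise AND/OR over both candidate multi-scalars.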
Because all coordinates of multi-scalars are less than , scalar decomposition and recoding require more computational cost compared to Four-based implementation, which has coordinates of multi-scalars less than . However, this additional cost is an extremely small portion of the scalar multiplication.
5.2. Point Arithmetic
The choice of efficient point arithmetic and coordinate system is crucial to the performance of scalar multiplication. The extended Edwards coordinates of the form were proposed by Hisil et al., where [30]. The extended Edwards coordinates are an extended version of the homogeneous coordinates of the form . The identity element is represented by and the negative of is represented by .
Hisil et al. [30] proposed dedicated addition and doubling formulas that are independent of the curve parameter d. Given and of distinct points with and , the ECADD operation can be computed as follows:
Similarly, given with , the ECDBL operation can be computed as follows:
Hamburg [41] proposed extensible coordinates of the form , where . The final step of the ECADD and ECDBL operations in extended Edwards coordinates computes . The extensible coordinates instead store the coordinate T as and , and compute T only when point arithmetic requires it. To further improve the ECADD operation, the precomputed point Q is represented in the form [25]. This representation eliminates two multiplications by 2 and two field additions over compared to the extended Edwards coordinates. For the ECDBL operation, we utilize the transformation to reduce the number of multiplications: one field multiplication and one field addition over are converted into one field squaring and two field subtractions over . Algorithms 6 and 7 describe the ECADD and ECDBL operations in extensible coordinates over with curve parameter , which require and operations, respectively.
Algorithm 6: Twisted Edwards point addition over .
To demonstrate the efficiency of twisted Edwards curves, we compare their cost to that of a short Weierstrass elliptic curve. The ECADD and ECDBL operations on a short Weierstrass curve of the form over using Jacobian coordinates require and operations, respectively. The ECADD operation on the twisted Edwards curve using extensible coordinates saves operations. The ECDBL operation requires additional operations but saves operations. Therefore, twisted Edwards curves with extensible coordinates have a computational advantage over short Weierstrass curves with Jacobian coordinates.
Algorithm 7: Twisted Edwards point doubling over .
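Since the bodies of Algorithms 6 and 7 are not reproduced in this excerpt, the sketch below shows the well-known dedicated formulas of Hisil et al. [30] on which they build, in extended coordinates (X : Y : Z : T) with T = XY/Z. The toy base field and the curve coefficient a = −1 are illustrative assumptions; the paper instead works over a quadratic extension field with the extensible-coordinate caching described above.

```python
# Toy parameters for illustration only.
P = 10007          # small prime stand-in for the real base field
A_COEFF = -1 % P   # twisted Edwards coefficient a (assumed a = -1 here)

def ecadd(P1, P2, p=P, a=A_COEFF):
    """Dedicated addition (independent of d); not valid for P1 == P2."""
    X1, Y1, Z1, T1 = P1
    X2, Y2, Z2, T2 = P2
    A = X1 * X2 % p
    B = Y1 * Y2 % p
    C = Z1 * T2 % p
    D = T1 * Z2 % p
    E = (D + C) % p
    F = ((X1 - Y1) * (X2 + Y2) + B - A) % p
    G = (B + a * A) % p
    H = (D - C) % p
    return (E * F % p, G * H % p, F * G % p, E * H % p)  # (X3, Y3, Z3, T3)

def ecdbl(P1, p=P, a=A_COEFF):
    """Dedicated doubling; the T input coordinate is not needed."""
    X1, Y1, Z1, _ = P1
    A = X1 * X1 % p
    B = Y1 * Y1 % p
    C = 2 * Z1 * Z1 % p
    D = a * A % p
    E = ((X1 + Y1) * (X1 + Y1) - A - B) % p
    G = (D + B) % p
    F = (G - C) % p
    H = (D - B) % p
    return (E * F % p, G * H % p, F * G % p, E * H % p)  # (X3, Y3, Z3, T3)
```

Note that the doubling never reads T1, which is exactly what makes the extensible-coordinate trick of deferring the computation of T profitable.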
5.3. Endomorphisms
In [25], the formulas for the endomorphisms and are described. To reduce the number of representation conversions, we represent the results of the endomorphism operations in extensible coordinates. Let be a point on curve - represented in homogeneous projective coordinates. Then, , where can be computed as follows:
where and . We also utilize the fixed values for curve - as follows:
where A = 143485135153817520976780139629062568752 and B = 170141183460469231731687303715884099729. The endomorphism can be computed using or operations in the case .
Similarly, , where can be computed as follows:
The endomorphism can be computed using or operations in the case . Because requires fewer operations than , can be computed in the order with and .
6. Performance Analysis and Implementation Results
In this section, we analyze the operation counts and implementation results of variable-base scalar multiplication using curve - on AVR (Microchip Technology Inc., Chandler, AZ, USA), MSP430 (Texas Instruments, Dallas, TX, USA), and ARM (ARM Holdings plc, Cambridge, UK) processors. We performed simulations and evaluations using the IAR Embedded Workbench for AVR 6.80.7 (IAR Systems, Uppsala, Sweden), IAR Embedded Workbench for MSP430 7.10.2 (IAR Systems, Uppsala, Sweden), and STM32F4-DISC1 board (STMicroelectronics, Geneva, Switzerland) with the IAR Embedded Workbench for ARM 8.11.1 (IAR Systems, Uppsala, Sweden). All implementations were set to the medium optimization level.
6.1. Operation Counts
Table 1 and Table 2 describe the operation counts of field arithmetic over and their conversion into field arithmetic over for curve - and Four using Algorithm 1. Because both curves support the four-dimensional decomposition, the operation counts for Algorithm 1 can be compared step by step.
Table 1.
The operation counts of curve - using field arithmetic over and operation counts for conversion into field arithmetic over .
Table 2.
The operation counts of curve Four using field arithmetic over and operation counts for conversion into field arithmetic over .
Step 1 of Algorithm 1 computes three endomorphisms , and , and requires operations for Four and operations for curve -. Step 2 requires seven ECADD operations, which require operations for Four and operations for curve -. However, these outputs are all converted for faster ECADD computations, which require operations for Four and operations for curve -. Steps 3 and 4 require only bit and integer operations for all positive scalar decomposition and fixed-length recoding operations. Step 5 requires operations for one point negation and one table lookup, and a conversion to extensible coordinates for the initial point Q, which require operations. Steps 6 to 9 require 64 ECDBL operations, 64 ECADD operations, 64 point negations, and 64 table lookups for Four, and 65 ECDBL operations, 65 ECADD operations, 65 point negations, and 65 table lookups for curve -. The operation counts of these steps are for Four and for curve -. Step 10 requires operations for the normalization of the result point Q.
Variable-base scalar multiplication using the four-dimensional decomposition requires operations for Four and operations for curve -. The curve - requires fewer operations than Four because the endomorphisms in curve - are efficiently computable. However, the operation counts of field inversion over for Four and curve - are and , respectively. Therefore, we convert the operation counts of the field arithmetic over to the field arithmetic over . Field arithmetic over can be represented by field arithmetic over as follows:
The operation counts can be represented by for Four and by for curve -. Scalar multiplication using curve - saves operations compared to Four-based scalar multiplication. Therefore, we can deduce that four-dimensional scalar multiplication using curve - can be faster than the Four-based implementation when the field arithmetic is implemented efficiently.
6.2. Implementation Results of Field Arithmetic
Table 3 lists the cycle counts of field arithmetic over and on the AVR, MSP430, and ARM processors, including function call overhead. The cycle counts of the field inversions and are averaged over executions, and those of the remaining field operations are averaged over executions. To evaluate our implementation of field arithmetic for curve -, we compare its cycle counts with those of Four, which provides the fastest implementation results to date [24].
Table 3.
Cycle counts for field arithmetic on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors, including function call overhead.
We will now compare the cycle counts of field arithmetic on the 8-bit AVR processor. The field arithmetic over for curve - on 8-bit AVR requires 198, 196, 1221, 1796, and 176,901 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively. Similarly, the field arithmetic for Four on AVR requires 155, 159, 1026, 1598, and 150,535 cycles for the same operations, respectively. Thus, curve - requires 43, 37, 195, 198, and 26,366 more cycles than Four for these operations, respectively. The field arithmetic over for curve - requires 452, 448, 4093, 6277, and 183,345 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively, whereas the same operations for Four require 384, 385, 3622, 5758, and 156,171 cycles. Hence, curve - requires 68, 63, 471, 519, and 27,174 more cycles than Four for these operations, respectively.
For the 16-bit MSP430X processor, the field arithmetic over for curve - requires 120, 126, 837, 1087, and 119,629 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively. The same operations for Four require 102, 101, 927, 1027, and 131,819 cycles, respectively. Curve - requires 18, 25, and 60 more cycles than Four for addition, subtraction, and multiplication, respectively, but saves 90 and 12,190 cycles for squaring and inversion over . The field arithmetic over for curve - requires 266, 278, 2476, 3806, and 123,740 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively, whereas these operations for Four require 233, 231, 2391, 3624, and 135,315 cycles. Curve - requires 33, 47, 85, and 182 more cycles than Four for addition, subtraction, squaring, and multiplication, respectively, but saves 11,575 cycles for inversion over .
On the 32-bit ARM Cortex-M4 processor, the field arithmetic for curve - requires 55, 55, 88, 99, and 12,135 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively. However, Ref. [24] does not report implementation results for field arithmetic over . The field arithmetic over for curve - requires 82, 82, 196, 341, and 12,612 cycles to compute addition, subtraction, squaring, multiplication, and inversion over , respectively, whereas these operations for Four require 84, 86, 215, 358, and 21,056 cycles. Curve - saves 2, 4, 20, 17, and 8444 cycles compared to Four for addition, subtraction, squaring, multiplication, and inversion over , respectively.
One can see that the field arithmetic over for Four on AVR and MSP430 is typically faster than that for curve -. This difference arises because the two curves use different primes: a Mersenne prime of the form for Four and a Mersenne-like prime of the form for curve -. Let , where is small. The modular reduction step can then be computed as . Four can compute this step efficiently using simple shift operations because , whereas curve - requires additional instructions because of the multiplication by . In the 8-bit AVR implementation, is represented by two 8-bit words, and , so the operation requires additional 8 × 8-bit multiplications and accumulations. In contrast, fits in a single word on the MSP430 and ARM CPUs, and both CPUs provide efficient MAC instructions. Therefore, the modular reduction in the MSP430 and ARM implementations requires fewer additional instructions than in the AVR implementation.
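The folding reduction for a Mersenne-like prime described above can be sketched as follows. The values k = 127 and c = 5997 are illustrative choices for a prime of the form 2^k − c, not necessarily the paper's exact constants; for Four's Mersenne prime (c = 1) the multiplication by c disappears, which is precisely the advantage discussed in the text.

```python
# Illustrative Mersenne-like prime p = 2^k - c with small c.
K, C = 127, 5997
P = (1 << K) - C

def red(x, k=K, c=C, p=P):
    """Reduce x mod p for p = 2^k - c, for x up to a double-length product.

    Fold the high part: x = a*2^k + b  =>  x ≡ a*c + b (mod p).
    Two folds bring x below 2*p; one conditional subtraction finishes.
    """
    for _ in range(2):
        a, b = x >> k, x & ((1 << k) - 1)
        x = a * c + b
    if x >= p:
        x -= p
    return x
```

On AVR the multiplication a·c becomes several 8 × 8-bit multiply-accumulate steps because c occupies two 8-bit words, while on MSP430 and ARM a single MAC-supported word multiplication suffices, matching the cycle-count behavior reported above.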
On the MSP430, field squaring over for curve - is faster than for Four: curve - requires 837 cycles, whereas Four requires 927 cycles. Our implementation thus saves 9.71% of the cycles for field squaring over compared to the SBD method, despite the modular reduction overhead. Because the principal operation of inversion is field squaring over , our implementation also saves 9.32% and 8.55% of the cycles for inversion over and , respectively. For field squaring over , field squaring over is not required because it can be computed using operations. Therefore, field squaring over for curve - requires more cycles than for Four.
6.3. Implementation Results of Scalar Multiplication
Table 4 summarizes the implementation results of variable-base scalar multiplication and compares them with previous implementations on the 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. We measured the average cycle count of our variable-base scalar multiplication over executions with random scalars k. For comparison, Table 4 includes previous implementations that guarantee constant-time execution. These were implemented using various elliptic curves, such as NIST P-256 [42,43], Curve25519 [44,45,46], Kummer [47], and Four [24]. These curves are designed such that the bit-length of the curve order is slightly smaller than 256 bits for efficient implementation. NIST P-256 has a 256-bit curve order, whereas Curve25519, Kummer, Four, and curve - have 252-bit, 250-bit, 246-bit, and 251-bit curve orders, respectively. Therefore, these curves provide approximately 128-bit security levels.
Table 4.
Cycle counts and memory usage of variable-base scalar multiplication on 8-bit AVR, 16-bit MSP430, 32-bit ARM processors.
We will now summarize the results of previous works on embedded devices that provide approximately 128-bit security levels. Wenger and Werner [42] implemented scalar multiplication using the NIST P-256 curve on various 16-bit microcontrollers, and Wenger et al. [43] did so on 8-bit, 16-bit, and 32-bit microcontrollers. Hutter and Schwabe [48] implemented the NaCl library, which provides Curve25519 scalar multiplication, on the 8-bit AVR processor. Hinterwälder et al. [44] implemented a Diffie–Hellman key exchange on the MSP430X processor using the 16-bit and 32-bit hardware multipliers. In 2015, Düll et al. [45] implemented a Curve25519 scalar multiplication of on 8-bit, 16-bit, and 32-bit microcontrollers. Renes et al. [47] implemented Montgomery ladder scalar multiplication on the Kummer surface of a genus-2 hyperelliptic curve on 8-bit AVR and 32-bit ARM Cortex-M0 processors. Faz-Hernández et al. [25] proposed an efficient implementation of four-dimensional GLV-GLS scalar multiplication using curve - on Intel and ARM processors.
Our implementation results for variable-base scalar multiplication set new speed records on the 16-bit MSP430 and 32-bit ARM Cortex-M4 processors. Scalar multiplication using curve - on AVR, MSP430, and ARM requires 6,856,026, 4,158,453, and 447,836 cycles, respectively. Compared to the previous fastest implementation, namely Four [24], which requires 6,561,500, 4,280,400, and 469,500 cycles on AVR, MSP430, and ARM, respectively, our implementation requires 4.49% more cycles on AVR but saves 2.85% and 4.61% of the cycles on the MSP430X and ARM processors, respectively. Compared to Kummer [47], which requires 9,513,536 cycles on AVR, our implementation saves 27.93% of the cycles. It also saves 50.68% and 47.58% of the cycles compared to Düll et al.'s Curve25519 implementation [45], which requires 13,900,397 and 7,933,296 cycles on AVR and MSP430, respectively. It saves 69.92% of the cycles compared to the NaCl library [48], which requires 22,791,579 cycles on AVR, and 54.50% compared to Hinterwälder et al.'s Curve25519 implementation [44], which requires 9,139,739 cycles on MSP430. Additionally, it saves 68.54% of the cycles compared to the method in [46], which requires 1,423,667 cycles on the ARM Cortex-M4 processor.
The memory of embedded processors is severely constrained, so the memory usage of the various implementations is also important. On the 8-bit AVR, Kummer [47] has the lowest memory usage among recently proposed results, requiring 9490 bytes of code and 99 bytes of stack memory. Wenger et al.'s and Düll et al.'s implementations [43,45] have the lowest code size and stack usage on MSP430, requiring 8378 bytes of code and 384 bytes of stack memory. On the 32-bit ARM, Ref. [46] requires 3750 bytes of code and 740 bytes of stack memory. Four [24] reported the memory usage of the ECDH and signature operations, but not of single scalar multiplication. Our implementations for curve - require 13,891, 9098, and 7532 bytes of code and 2539, 2568, and 2792 bytes of stack memory on AVR, MSP430, and ARM Cortex-M4, respectively. Four and curve -, which utilize four-dimensional decompositions, precompute eight points and therefore require more stack memory than the other implementations. However, four-dimensional scalar multiplication is significantly faster than the other implementations.
7. Conclusions
In this paper, we presented the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve - on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors. We also optimized the performance of the internal algorithms of scalar multiplication on the three target processors. Our implementation of single scalar multiplication requires 4.49% more cycles than the Four-based implementation on AVR, but saves 2.85% and 4.61% of the cycles on MSP430 and ARM Cortex-M4, respectively. Our analysis and implementation results demonstrate that efficiently computable endomorphisms can accelerate scalar multiplication even when the underlying prime yields comparatively inefficient field arithmetic. They also show that four-dimensional GLV-GLS scalar multiplication using curve - is well suited to ECC-based applications on resource-constrained embedded devices.
Author Contributions
J.K. designed and implemented the presented software. S.C.S. and S.H. analyzed the experimental results and improved the choice of internal algorithms of scalar multiplication.
Acknowledgments
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2014-6-00910, Study on Security of Cryptographic Software).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Shojafar, M.; Canali, C.; Lancellotti, R.; Baccarelli, E. Minimizing computing-plus-communication energy consumptions in virtualized networked data centers. In Proceedings of the 2016 IEEE Symposium on Computers and Communication (ISCC), Messina, Italy, 27–30 June 2016; pp. 1137–1144. [Google Scholar]
- Baccarelli, E.; Naranjo, P.G.V.; Shojafar, M.; Scarpiniti, M. Q*: Energy and delay-efficient dynamic queue management in TCP/IP virtualized data centers. Comput. Commun. 2017, 102, 89–106. [Google Scholar] [CrossRef]
- Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of Conference on the Theory and Application of Cryptographic Techniques, Santa Barbara, CA, USA, 18–22 August 1985; Springer: Heidelberg/Berlin, Germany, 1985; pp. 417–426. [Google Scholar]
- Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209. [Google Scholar] [CrossRef]
- Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126. [Google Scholar] [CrossRef]
- SafeCurves: Choosing Safe Curves for Elliptic-Curve Cryptography. Available online: http://safecurves.cr.yp.to (accessed on 10 March 2018).
- Barker, E.; Kelsey, J. NIST Special Publication 800-90A Revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators; Technical Report; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015. [Google Scholar]
- Bernstein, D.J. Curve25519: New Diffie–Hellman Speed Records. In Proceedings of 9th International Workshop on Public Key Cryptography, New York, NY, USA, 24–26 April 2006; Springer: Heidelberg/Berlin, Germany, 2006; pp. 207–228. [Google Scholar]
- Hamburg, M. Ed448-Goldilocks, a new elliptic curve. IACR Cryptol. ePrint Arch. 2015, 2015, 625. [Google Scholar]
- Bernstein, D.J.; Birkner, P.; Joye, M.; Lange, T.; Peters, C. Twisted Edwards Curves. In Proceedings of 1st International Conference on Cryptology in Africa, Casablanca, Morocco, 11–14 June 2008; Springer: Heidelberg/Berlin, Germany, 2008; pp. 389–405. [Google Scholar]
- Gallant, R.P.; Lambert, R.J.; Vanstone, S.A. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. In Proceedings of 21st Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; Springer: Heidelberg/Berlin, Germany, 2001; pp. 190–200. [Google Scholar]
- Galbraith, S.D.; Lin, X.; Scott, M. Endomorphisms for Faster Elliptic Curve Cryptography on a Large Class of Curves. In Proceedings of 28th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Cologne, Germany, 26–30 April 2009; Springer: Heidelberg/Berlin, Germany, 2009; pp. 518–535. [Google Scholar]
- Longa, P.; Gebotys, C. Efficient Techniques for High-Speed Elliptic Curve Cryptography. In Proceedings of 12th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–20 August 2010; Springer: Heidelberg/Berlin, Germany, 2010; pp. 80–94. [Google Scholar]
- Longa, P.; Sica, F. Four-Dimensional Gallant–Lambert–Vanstone Scalar Multiplication. In Proceedings of 18th International Conference on the Theory and Application of Cryptology and Information Security, Beijing, China, 2–6 December 2012; Springer: Heidelberg/Berlin, Germany, 2012; pp. 718–739. [Google Scholar]
- Hu, Z.; Longa, P.; Xu, M. Implementing the 4-dimensional GLV method on GLS elliptic curves with j-invariant 0. Des. Codes Cryptogr. 2012, 63, 331–343. [Google Scholar] [CrossRef]
- Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. Fast cryptography in genus 2. In Proceedings of 32nd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Athens, Greece, 26–30 May 2013; Springer: Heidelberg/Berlin, Germany; pp. 194–210.
- Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. High-Performance Scalar Multiplication Using 8-Dimensional GLV/GLS Decomposition. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 331–348. [Google Scholar]
- Oliveira, T.; López, J.; Aranha, D.F.; Rodríguez-Henríquez, F. Lambda Coordinates for Binary Elliptic Curves. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 311–330. [Google Scholar]
- Guillevic, A.; Ionica, S. Four-Dimensional GLV via the Weil Restriction. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 79–96. [Google Scholar]
- Smith, B. Families of Fast Elliptic Curves from -Curves. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 61–78. [Google Scholar]
- Costello, C.; Longa, P. Four: Four-Dimensional Decompositions on a -Curve over the Mersenne Prime. In Proceedings of 21st International Conference on the Theory and Application of Cryptology and Information Security, Auckland, New Zealand, 29 November–3 December 2015; Springer: Heidelberg/Berlin, Germany, 2015; pp. 214–235. [Google Scholar]
- Longa, P. FourNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors. IACR Cryptol. ePrint Arch. 2016, 2016, 645. [Google Scholar]
- Järvinen, K.; Miele, A.; Azarderakhsh, R.; Longa, P. Four on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 517–537. [Google Scholar]
- Liu, Z.; Longa, P.; Pereira, G.C.; Reparaz, O.; Seo, H. Four on Embedded Devices with Strong Countermeasures Against Side-Channel Attacks. In Proceedings of 19th International Workshop on Cryptographic Hardware and Embedded Systems, Taipei, Taiwan, 25–28 September 2017; Springer: Heidelberg/Berlin, Germany, 2017; pp. 665–686. [Google Scholar]
- Faz-Hernández, A.; Longa, P.; Sánchez, A.H. Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV–GLS curves (extended version). J. Cryptogr. Eng. 2015, 5, 31–52. [Google Scholar] [CrossRef]
- Kocher, P.C. Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In Proceedings of 16th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 1996; Springer: Heidelberg/Berlin, Germany; pp. 104–113.
- Page, D. Theoretical use of cache memory as a cryptanalytic side-channel. IACR Cryptol. ePrint Arch. 2002, 2002, 169. [Google Scholar]
- Edwards, H. A normal form for elliptic curves. Bull. Am. Math. Soc. 2007, 44, 393–422. [Google Scholar] [CrossRef]
- Bernstein, D.J.; Lange, T. Faster addition and doubling on elliptic curves. In Proceedings of 13th International Conference on the Theory and Application of Cryptology and Information Security, Kuching, Malaysia, 2–6 December 2007; Springer: Heidelberg/Berlin, Germany; pp. 29–50.
- Hisil, H.; Wong, K.K.H.; Carter, G.; Dawson, E. Twisted Edwards curves revisited. In Proceedings of 14th International Conference on the Theory and Application of Cryptology and Information Security, Melbourne, Australia, 7–11 December 2008; Springer: Heidelberg/Berlin, Germany; pp. 326–343.
- Hankerson, D.; Menezes, A.J.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer Science & Business Media: Heidelberg/Berlin, Germany, 2006. [Google Scholar]
- Yanık, T.; Savaş, E.; Koç, Ç.K. Incomplete reduction in modular arithmetic. IEE Proc. Comput. Digit. Tech. 2002, 149, 46–52. [Google Scholar] [CrossRef]
- Microchip. 8/16-Bit AVR XMEGA A3 Microcontroller. Available online: http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8068-8-and16-bit-AVR-XMEGA-A3-Microcontrollers_Datasheet.pdf (accessed on 26 February 2018).
- Hutter, M.; Schwabe, P. Multiprecision multiplication on AVR revisited. J. Cryptogr. Eng. 2015, 5, 201–214. [Google Scholar] [CrossRef]
- Seo, H.; Liu, Z.; Choi, J.; Kim, H. Multi-Precision Squaring for Public-Key Cryptography on Embedded Microprocessors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 227–243. [Google Scholar]
- Texas Instruments. MSP430FR59xx Mixed-Signal Microcontrollers. Available online: http://www.ti.com/lit/ds/symlink/msp430fr5969.pdf (accessed on 26 February 2018).
- STMicroelectronics. UM1472: Discovery kit with STM32F407VG MCU. Available online: http://www.st.com/content/ccc/resource/technical/document/user_manual/70/fe/4a/3f/e7/e1/4f/7d/DM00039084.pdf/files/DM00039084.pdf/jcr:content/translations/en.DM00039084.pdf (accessed on 26 February 2018).
- FourQlib library. Available online: https://github.com/Microsoft/FourQlib (accessed on 10 March 2018 ).
- Babai, L. On Lovász' lattice reduction and the nearest lattice point problem. Combinatorica 1986, 6, 1–13. [Google Scholar] [CrossRef]
- Park, Y.H.; Jeong, S.; Lim, J. Speeding Up Point Multiplication on Hyperelliptic Curves With Efficiently-Computable Endomorphisms. In Proceedings of International Conference on the Theory and Applications of Cryptographic Techniques, Amsterdam, The Netherlands, 28 April–2 May 2002; Springer: Heidelberg/Berlin, Germany, 2002; pp. 197–208. [Google Scholar]
- Hamburg, M. Fast and compact elliptic-curve cryptography. IACR Cryptol. ePrint Arch. 2012, 2012, 309. [Google Scholar]
- Wenger, E.; Werner, M. Evaluating 16-bit processors for elliptic curve cryptography. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Leuven, Belgium, 14–16 September 2011; Springer: Heidelberg/Berlin, Germany; pp. 166–181.
- Wenger, E.; Unterluggauer, T.; Werner, M. 8/16/32 shades of elliptic curve cryptography on embedded processors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany; pp. 244–261.
- Hinterwälder, G.; Moradi, A.; Hutter, M.; Schwabe, P.; Paar, C. Full-size high-security ECC implementation on MSP430 microcontrollers. Proceedings of Cryptology—LATINCRYPT 2014, Florianópolis, Brazil, 17–19 September 2014; pp. 31–47. [Google Scholar]
- Düll, M.; Haase, B.; Hinterwälder, G.; Hutter, M.; Paar, C.; Sánchez, A.H.; Schwabe, P. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptogr. 2015, 77, 493–514. [Google Scholar] [CrossRef]
- De Santis, F.; Sigl, G. Towards Side-Channel Protected X25519 on ARM Cortex-M4 Processors. Proceedings of Software performance enhancement for encryption and decryption, and benchmarking, Utrecht, The Netherlands, 19–21 October 2016. [Google Scholar]
- Renes, J.; Schwabe, P.; Smith, B.; Batina, L. μKummer: Efficient Hyperelliptic Signatures and Key Exchange on Microcontrollers. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 301–320. [Google Scholar]
- Hutter, M.; Schwabe, P. NaCl on 8-Bit AVR Microcontrollers. In Proceedings of Cryptology—AFRICACRYPT 2013, Cairo, Egypt, 22–24 June 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 156–172. [Google Scholar]
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
