Article

Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers

1
Center for Information Security Technologies (CIST), Korea University, Seoul 02841, Korea
2
The Affiliated Institute of ETRI, Daejeon 34044, Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2018, 8(6), 900; https://doi.org/10.3390/app8060900
Submission received: 16 March 2018 / Revised: 20 May 2018 / Accepted: 22 May 2018 / Published: 31 May 2018
(This article belongs to the Special Issue Security and Privacy for Cyber Physical Systems)

Abstract

In this paper, we present the first constant-time implementations of four-dimensional Gallant–Lambert–Vanstone and Galbraith–Lin–Scott (GLV-GLS) scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. At Asiacrypt 2012, Longa and Sica introduced the four-dimensional GLV-GLS scalar multiplication and reported implementation results on Intel processors. However, they did not consider efficient implementations on resource-constrained embedded devices. We have optimized the performance of scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. Our implementations compute a variable-base scalar multiplication in 6,856,026, 4,158,453, and 447,836 cycles on AVR, MSP430, and ARM Cortex-M4 processors, respectively. Recently, Four$\mathbb{Q}$-based scalar multiplication has provided the fastest implementation results on AVR, MSP430, and ARM Cortex-M4 processors to date. Compared to Four$\mathbb{Q}$-based scalar multiplication, the proposed implementations require 4.49% more computational cost on AVR, but save 2.85% and 4.61% of the cycles on MSP430 and ARM, respectively. Our 16-bit and 32-bit implementation results set new speed records for variable-base scalar multiplication.

1. Introduction

Wireless sensor networks (WSNs) are wireless networks consisting of a large number of resource-constrained sensor nodes, where each node is equipped with a sensor to monitor physical phenomena, such as temperature, light, and pressure. The main features of WSNs are resource constraints, such as storage, computing power, and sensing distance. Recently, the energy consumption of data centers has attracted attention because of the fast growth of data throughput. WSNs can provide a solution for data collection and data processing in various applications including data center monitoring. That is, WSNs can be utilized for data center monitoring to improve the efficiency of energy consumption. Several solutions were proposed to solve this problem [1,2].
Since sensor nodes are usually deployed in remote areas and left unattended, they are exposed to network security issues, such as node capture, eavesdropping, and message tampering during data communication. Additionally, many application areas of WSNs require data confidentiality, integrity, authentication, and non-repudiation, so an efficient cryptographic mechanism is needed to satisfy current security requirements. However, due to the resource constraints of WSNs, it is difficult to use conventional cryptographic algorithms. Therefore, efficient cryptographic algorithms that take code size, computation time, and power consumption into account are required for the security of WSNs.
In 1985, elliptic curve cryptography (ECC) was proposed independently by Miller and Koblitz as a form of public key cryptography (PKC) [3,4]. ECC is mainly used for digital signatures and key exchange based on the elliptic curve discrete logarithm problem (ECDLP), which is defined by elliptic curve point operations in a finite field. ECC provides the same security level with a smaller key size compared to existing PKC algorithms such as the Rivest-Shamir-Adleman (RSA) cryptosystem [5]. For example, ECC over $\mathbb{F}_p$ with a 256-bit prime $p$ provides a security level equivalent to RSA with a 3072-bit key. Because RSA uses a small integer as the public key, RSA public key operations can be computed efficiently. However, RSA private key operations are much slower than ECC operations, which limits their use in WSN applications. Therefore, ECC can be used more efficiently than RSA on resource-constrained devices, such as smart cards and sensor nodes.
However, recently reported manipulations and backdoors have raised suspicion of weaknesses in previous ECC standards. In particular, the National Institute of Standards and Technology (NIST) P-224 curve is not secure against twist attacks, which combine small-subgroup attacks and invalid-curve attacks using the twist of the curve [6]. The dual elliptic curve deterministic random bit generator (Dual_EC_DRBG) is a pseudo-random number generator (PRNG) standardized in NIST SP 800-90A. However, the revised version of the NIST SP 800-90A standard removes Dual_EC_DRBG because this algorithm contains a backdoor for the National Security Agency (NSA) [7].
Therefore, the demand for next generation elliptic curves has increased. Specific examples of such curves are Curve25519, Ed448-Goldilocks, and twisted Edwards curves [8,9,10]. The main feature of these curves is the selection of efficient parameters. Curve25519 utilizes a prime of the form $p = 2^{255} - 19$ and a fast Montgomery elliptic curve. The Ed448-Goldilocks curve utilizes a Solinas trinomial prime of the form $p = 2^{448} - 2^{224} - 1$, which provides fast field arithmetic on both 32-bit and 64-bit machines because $224 = 28 \times 8 = 32 \times 7 = 56 \times 4$. These parameters can accelerate the performance of ECC-based protocols. The details of the twisted Edwards curves can be found in Section 2.3.
Scalar multiplication, or point multiplication, computes $kP$ from an elliptic curve point $P$ and a scalar $k$. This operation determines the performance of ECC. Therefore, many researchers have proposed various methods to improve the efficiency of scalar multiplication. The speed-up methods for scalar multiplication can be classified into three types: methods based on speeding up the finite field exponentiation, such as comb techniques and windowing methods; scalar recoding methods; and methods that are particular to elliptic curve scalar multiplication [11].
Speed-up methods using efficiently computable endomorphisms are one type of method particular to elliptic curve scalar multiplication. The Gallant–Lambert–Vanstone (GLV) method proposed by Gallant et al. accelerates scalar multiplication by using efficiently computable endomorphisms [11]. If the cost of computing the endomorphism is less than (bit-length of curve order)/3 elliptic curve point doubling (ECDBL) operations, then this method has a computational advantage. Their method eliminates about half of the ECDBL operations and reduces the cost of scalar multiplication by roughly 33%. Additionally, recent studies have reported that scalar multiplication methods using efficiently computable endomorphisms are significantly faster than generalized methods. Galbraith et al. proposed the Galbraith–Lin–Scott (GLS) curves, which provide an efficiently computable endomorphism for elliptic curves defined over $\mathbb{F}_{p^2}$, where $p$ is a prime number [12]. They demonstrated that the GLV method can efficiently compute scalar multiplication on such curves. Longa and Gebotys [13] presented an efficient implementation of two-dimensional GLS curves over $\mathbb{F}_{p^2}$.
In 2012, Longa and Sica [14] proposed four-dimensional GLV-GLS curves over $\mathbb{F}_{p^2}$, which generalized the GLV method and GLS curves. Hu et al. [15] proposed a GLV-GLS curve over $\mathbb{F}_{p^2}$, which supports the four-dimensional scalar decomposition. They reported implementation results indicating that the four-dimensional GLV-GLS scalar multiplication reduces the computational cost by up to 22% compared to the two-dimensional GLV method. Bos et al. [16] proposed two- and four-dimensional scalar decompositions over genus 2 curves defined over $\mathbb{F}_{p^2}$. Bos et al. [17] introduced an eight-dimensional GLV-GLS method over genus 2 curves defined over $\mathbb{F}_{p^2}$. Oliveira et al. [18] presented the implementation results of a two-dimensional GLV method over binary GLS elliptic curves defined over $\mathbb{F}_{2^{254}}$. Guillevic and Ionica [19] utilized the four-dimensional GLV method on genus 1 curves defined over $\mathbb{F}_{p^2}$ and genus 2 curves defined over $\mathbb{F}_p$. Smith [20] proposed a new family of elliptic curves over $\mathbb{F}_{p^2}$, called "$\mathbb{Q}$-curves". Costello and Longa [21] introduced a four-dimensional $\mathbb{Q}$-curve defined over $\mathbb{F}_{p^2}$, called "Four$\mathbb{Q}$". They reported the implementation results of Four$\mathbb{Q}$ on various Intel and AMD processors.
After Four$\mathbb{Q}$ was proposed, many implementation results were reported for various environments, such as AVR, MSP430, ARM, and field-programmable gate array (FPGA) devices [22,23,24]. An efficient Four$\mathbb{Q}$-based implementation on a 32-bit ARM processor with the NEON single instruction multiple data (SIMD) instruction set was proposed by Longa [22]. Järvinen et al. [23] proposed a fast and compact Four$\mathbb{Q}$-based implementation on an FPGA device. At CHES 2017, Liu et al. [24] presented highly optimized implementations using curve Four$\mathbb{Q}$ on 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors.
In the case of curve Ted127-glv4, Longa and Sica [14] and Faz-Hernández et al. [25] reported implementation results on high-end processors, such as Intel Sandy Bridge, Intel Ivy Bridge, and ARM Cortex-A processors. However, efficient implementations on resource-constrained embedded devices have not been considered to date. Therefore, we focused on optimized implementations of scalar multiplication using curve Ted127-glv4 on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors.
Our main contributions can be summarized as follows:
  • We present efficient implementations at each level of the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication considering the features of 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors. To improve the performance of scalar multiplication, we carefully selected the internal algorithms at each level of the implementation hierarchy. These implementations also run in constant time to resist timing and cache-timing attacks [26,27].
  • We demonstrate that the efficiently computable endomorphisms can accelerate the performance of four-dimensional GLV-GLS scalar multiplication. For this purpose, we analyze the operation counts of two elliptic curves, "Ted127-glv4" and "Four$\mathbb{Q}$", which support the four-dimensional GLV-GLS scalar multiplication. The GLV-GLS curve Ted127-glv4 requires fewer field arithmetic operations than the Four$\mathbb{Q}$-based implementation to compute a single variable-base scalar multiplication. However, because Four$\mathbb{Q}$ uses a Mersenne prime $p = 2^{127} - 1$ and the curve Ted127-glv4 uses a Mersenne-like prime $p = 2^{127} - 5997$, Four$\mathbb{Q}$ has the advantage of faster field arithmetic operations. By using the computational advantage of endomorphisms, we overcome the computational disadvantage of curve Ted127-glv4 at the field arithmetic level.
  • We present the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on three target platforms, which have not been considered in previous works. The proposed implementations on AVR, MSP430, and ARM processors require 6,856,026, 4,158,453, and 447,836 cycles to compute a single variable-base scalar multiplication, respectively. Compared to Four$\mathbb{Q}$-based implementations [24], which have provided the fastest results to date, our results are 4.49% slower on AVR, but 2.85% and 4.61% faster on MSP430 and ARM, respectively. Our MSP430 and ARM implementations set new speed records for variable-base scalar multiplication.
The remainder of this paper is organized as follows. Section 2 describes preliminaries regarding ECC and its speed-up techniques, including the GLV and GLS methods. Section 3 presents a review of four-dimensional GLV-GLS scalar multiplication and its implementation hierarchy. Section 4 describes the implementation details of field arithmetic and optimization methods for the target platforms. Section 5 describes optimization methods for ECC in terms of point arithmetic and scalar multiplication. Experimental results and a comparison of our work to previous ECC implementations on AVR, MSP430, and ARM processors are presented in Section 6. Finally, we conclude this paper in Section 7.

2. Preliminaries

In Section 2.1, we describe the field representation and notations used for the remainder of this paper. We briefly describe ECC using a short Weierstrass curve and its group law in Section 2.2. We also describe twisted Edwards curves, which are the target of our implementation, in Section 2.3. In Section 2.4, we describe the GLV-GLS method including the GLV method and GLS curves.

2.1. Field Representation and Notations

We assume that the target platform has a $w$-bit architecture. Let $n = \lceil \log_2 p \rceil$ be the bit-length of a Mersenne-like prime $p = 2^n - c$, where $c$ is small. Let $m = \lceil n / w \rceil$ be its word-length. Then, an arbitrary element $a \in \mathbb{F}_p$ is represented by an array $(a_{m-1}, \ldots, a_2, a_1, a_0)$ of $m$ $w$-bit words. The notations $M_1$, $S_1$, $I_1$, and $A_1$ represent multiplication, squaring, inversion, and addition (subtraction) over $\mathbb{F}_p$, respectively. Similarly, the notations $M_2$, $S_2$, $I_2$, and $A_2$ represent multiplication, squaring, inversion, and addition (subtraction) over $\mathbb{F}_{p^2}$, respectively. The notation $A_i$ represents multi-precision addition without modular reduction and the notation $M_d$ represents multiplication with a curve parameter.

2.2. Elliptic Curve Cryptography

Let $\mathbb{F}_q$ be a finite field with odd characteristic. An elliptic curve $E$ over $\mathbb{F}_q$ is defined by a short Weierstrass equation of the following form:
$$E : y^2 = x^3 + a x + b,$$
where $a, b \in \mathbb{F}_q$ and $4a^3 + 27b^2 \neq 0$.
Because the most important operation in ECC is scalar multiplication $kP$, it must be implemented efficiently. The basic method for computing $kP$ is composed of two elliptic curve operations: elliptic curve point addition (ECADD) and ECDBL. Let $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ be two points on an elliptic curve $E$. The result $(x_3, y_3)$ of the ECADD and ECDBL operations can be computed in affine coordinates as follows:
$$x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = \lambda (x_1 - x_3) - y_1, \quad \lambda = \frac{y_2 - y_1}{x_2 - x_1} \text{ if } P \neq Q, \text{ and } \lambda = \frac{3 x_1^2 + a}{2 y_1} \text{ if } P = Q.$$
The ECADD and ECDBL operations are composed of finite field arithmetic operations, such as field addition, subtraction, multiplication, squaring, and inversion. Therefore, to improve the performance of scalar multiplication, the internal algorithms such as field and curve arithmetic operations should be efficiently implemented.
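The affine formulas above can be sketched in a few lines of Python. This is only an illustration on a toy curve: the parameters `P_MOD`, `A`, `B`, the base point `G`, and all function names are assumptions chosen for readability, not values from this paper, and the code makes no attempt to run in constant time.

```python
# Toy parameters (illustrative only): y^2 = x^3 + 2x + 3 over F_97.
P_MOD, A, B = 97, 2, 3
G = (0, 10)  # 10^2 = 100 = 3 (mod 97), so G lies on the toy curve


def on_curve(pt):
    """Check the short Weierstrass equation; None is the point at infinity."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def ec_add(p1, p2):
    """Affine ECADD/ECDBL using the lambda formulas from the text."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # P + (-P) = O (also covers doubling a point with y = 0)
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mul(k, pt):
    """Left-to-right double-and-add for k >= 0 (not constant time)."""
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)      # ECDBL
        if bit == "1":
            acc = ec_add(acc, pt)   # ECADD
    return acc
```

Each ECADD or ECDBL here costs one field inversion, which is why practical implementations prefer projective-style coordinates and defer the inversion to the end.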

2.3. Twisted Edwards Curves

The Edwards curves are a normal form of elliptic curves introduced by Edwards [28]. Bernstein and Lange [29] introduced Edwards curves defined by $x^2 + y^2 = c^2 (1 + d x^2 y^2)$, where $c, d \in \mathbb{F}_q$ with $c d (1 - d c^4) \neq 0$. In 2007, Bernstein et al. [10] introduced twisted Edwards curves, which are a generalization of Edwards curves defined by
$$E_{a,d} : a x^2 + y^2 = 1 + d x^2 y^2,$$
where $a, d \in \mathbb{F}_q$ with $a d (a - d) \neq 0$. The Edwards curves are a special case of twisted Edwards curves with $a = 1$. The point $(0, 1)$ is the identity element and the point $(0, -1)$ has order two. The points $(1, 0)$ and $(-1, 0)$ have order four. The negative of a point $P = (x_1, y_1)$ is $-P = (-x_1, y_1)$. The ECADD operation of two points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ on a twisted Edwards curve $E$ is defined as follows:
$$(x_1, y_1) + (x_2, y_2) = \left( \frac{x_1 y_2 + y_1 x_2}{1 + d x_1 y_1 x_2 y_2}, \; \frac{y_1 y_2 - a x_1 x_2}{1 - d x_1 y_1 x_2 y_2} \right).$$
Because the addition law is unified, it can be used for computing the ECDBL operation. Suppose that two points $P$ and $Q$ have an odd order. Then, the denominators of the addition formula, $1 + d x_1 y_1 x_2 y_2$ and $1 - d x_1 y_1 x_2 y_2$, are nonzero. Therefore, the doubling formula can be obtained as follows:
$$2 (x_1, y_1) = \left( \frac{2 x_1 y_1}{y_1^2 + a x_1^2}, \; \frac{y_1^2 - a x_1^2}{2 - y_1^2 - a x_1^2} \right).$$
Two relationships can be obtained by considering the curve equation: $a x_1^2 + y_1^2 = 1 + d x_1^2 y_1^2$ and $a x_2^2 + y_2^2 = 1 + d x_2^2 y_2^2$. After straightforward elimination, the curve parameters $a$ and $d$ can be represented by $x_1$, $x_2$, $y_1$, and $y_2$. Substitutions in the unified addition formula yield the addition formula as follows:
$$(x_1, y_1) + (x_2, y_2) = \left( \frac{x_1 y_1 + x_2 y_2}{y_1 y_2 + a x_1 x_2}, \; \frac{x_1 y_1 - x_2 y_2}{x_1 y_2 - y_1 x_2} \right).$$
These addition and doubling formulas are used in the dedicated addition and doubling formulas described in Section 5. The features of these formulas are independent of the curve parameter d [30].
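As a sanity check of the three formulas in this subsection, the following Python sketch evaluates them on a toy twisted Edwards curve. The parameters (`P_MOD = 13`, `A = 1`, `D = 2`) and all function names are illustrative assumptions, not the curve used in this paper; `D` was picked as a non-square modulo 13 so that the unified denominators never vanish on this toy curve.

```python
# Toy twisted Edwards curve a*x^2 + y^2 = 1 + d*x^2*y^2 over F_13 (illustrative).
P_MOD, A, D = 13, 1, 2  # assumption: a is a square and d a non-square mod 13


def inv(x):
    """Modular inverse in F_13 (raises ValueError if x = 0 mod 13)."""
    return pow(x % P_MOD, -1, P_MOD)


def on_curve(pt):
    x, y = pt
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P_MOD == 0


def ted_add(p1, p2):
    """Unified addition formula (also valid for doubling)."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2
    x3 = (x1 * y2 + y1 * x2) * inv(1 + t) % P_MOD
    y3 = (y1 * y2 - A * x1 * x2) * inv(1 - t) % P_MOD
    return (x3, y3)


def ted_dbl(p1):
    """Doubling formula, independent of d."""
    x1, y1 = p1
    s = y1 * y1 + A * x1 * x1
    return (2 * x1 * y1 * inv(s) % P_MOD, (y1 * y1 - A * x1 * x1) * inv(2 - s) % P_MOD)


def ted_add_dedicated(p1, p2):
    """Addition formula independent of d; returns None when a denominator is zero
    (for example when p1 == p2), in which case this sketch gives no answer."""
    x1, y1 = p1
    x2, y2 = p2
    den_x = (y1 * y2 + A * x1 * x2) % P_MOD
    den_y = (x1 * y2 - y1 * x2) % P_MOD
    if den_x == 0 or den_y == 0:
        return None
    return ((x1 * y1 + x2 * y2) * inv(den_x) % P_MOD,
            (x1 * y1 - x2 * y2) * inv(den_y) % P_MOD)


# All affine points of the toy curve, found by exhaustive search.
POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD) if on_curve((x, y))]
```

On this toy curve the doubling formula agrees with the unified formula for every point, and the dedicated formula agrees with it whenever its denominators are invertible.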

2.4. The GLV-GLS Method

We first describe the GLV method, on which the GLV-GLS method builds. Let $E$ be an elliptic curve defined over a finite field $\mathbb{F}_q$. An endomorphism $\phi$ of $E$ over $\mathbb{F}_q$ is a rational map $\phi : E \rightarrow E$ such that $\phi(\mathcal{O}) = \mathcal{O}$ and $\phi(P) = (g(P), h(P))$ for all points $P \in E$, where $g$ and $h$ are rational functions and $\mathcal{O}$ is the point at infinity. An endomorphism $\phi$ is a group homomorphism, that is,
$$\phi(P_1 + P_2) = \phi(P_1) + \phi(P_2) \text{ for all } P_1, P_2 \in E.$$
Suppose that $E(\mathbb{F}_q)$ contains a subgroup of order $r$ and let $\phi$ be an efficiently computable endomorphism on $E$ such that $\phi(P) = \lambda P$ for some $1 \le \lambda \le r - 1$. The GLV method computes the integers $k_0$ and $k_1$ such that $k = k_0 + k_1 \lambda \bmod r$ for scalar multiplication $kP$. Because
$$kP = k_0 P + k_1 \lambda P = k_0 P + k_1 \phi(P),$$
scalar multiplication $kP$ can be computed by computing $\phi(P)$ and then using multiple scalar multiplication [31]. The speed-up arises because the multi-scalars $k_0$ and $k_1$ have approximately half the bit-length of the scalar $k$. The efficiency of the GLV method depends on the scalar decomposition and the efficiency of computing the endomorphism $\phi$.
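To make the two-dimensional decomposition concrete, here is a small Python sketch on a toy curve. Everything in it is an illustrative assumption rather than part of this paper: the curve $y^2 = x^3 + 3$ over $\mathbb{F}_7$ (which has 13 points), the endomorphism $\phi(x, y) = (2x, y)$ built from a cube root of unity, and the brute-force search used both to find $\lambda$ and to decompose $k$. Real implementations use lattice-based decomposition and simultaneous multi-scalar multiplication instead.

```python
# Toy GLV setting (illustrative): y^2 = x^3 + 3 over F_7, group order R = 13.
P_MOD, B, R = 7, 3, 13
BETA = 2        # cube root of unity mod 7, so phi(x, y) = (BETA * x, y)
G = (1, 2)      # point on the toy curve: 2^2 = 4 = 1^3 + 3


def ec_add(p1, p2):
    """Affine addition on y^2 = x^3 + B (curve coefficient a = 0)."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = 3 * x1 * x1 * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def scalar_mul(k, pt):
    """Double-and-add; negative scalars are handled by negating the point."""
    if k < 0:
        k, pt = -k, (None if pt is None else (pt[0], (-pt[1]) % P_MOD))
    acc = None
    for bit in bin(k)[2:]:
        acc = ec_add(acc, acc)
        if bit == "1":
            acc = ec_add(acc, pt)
    return acc


def phi(pt):
    """Efficiently computable endomorphism: one field multiplication."""
    return None if pt is None else (BETA * pt[0] % P_MOD, pt[1])


# Eigenvalue of phi on the order-13 group, found by exhaustive search.
LAMBDA = next(l for l in range(1, R) if scalar_mul(l, G) == phi(G))


def decompose(k):
    """Brute-force search for short (k0, k1) with k = k0 + k1*LAMBDA (mod R)."""
    best = None
    for k1 in range(-(R // 2), R // 2 + 1):
        k0 = (k - k1 * LAMBDA) % R
        if k0 > R // 2:
            k0 -= R
        if best is None or max(abs(k0), abs(k1)) < max(abs(best[0]), abs(best[1])):
            best = (k0, k1)
    return best


def glv_mul(k, pt):
    """kP = k0*P + k1*phi(P), computed here as two separate multiplications."""
    k0, k1 = decompose(k)
    return ec_add(scalar_mul(k0, pt), scalar_mul(k1, phi(pt)))
```

In this toy group the sub-scalars stay within a few units of $\sqrt{13}$, which mirrors the "half the bit-length" statement above at a very small scale.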
The main concept of the GLS curves is as follows. Let $E' / \mathbb{F}_{q^2}$ be the quadratic twist of $E / \mathbb{F}_{q^2}$ [12]. Let $\psi$ be the quadratic twist map and $\pi$ be the $q$-th Frobenius endomorphism. Then, we can obtain the efficiently computable endomorphism $\phi = \psi \pi \psi^{-1}$, which satisfies the equation $X^2 + 1 = 0$ if $p \equiv 5 \pmod{8}$. However, GLS curves only work for elliptic curves over $\mathbb{F}_{q^m}$ with $m > 1$.
As mentioned in the introduction, the GLV-GLS method generalizes the GLV method and GLS curves. Let $\phi$ and $\psi$ be two efficiently computable endomorphisms over $\mathbb{F}_{p^2}$ and $P$ be a point of prime order $r$. Then, the four-dimensional scalar multiplication $kP$ for any scalar $k \in [1, r]$ can be computed as follows:
$$kP = k_0 P + k_1 \phi(P) + k_2 \psi(P) + k_3 \psi(\phi(P)),$$
where $\max_i (|k_i|) < C r^{1/4}$ for $0 \le i \le 3$ and $C$ is some explicit constant. The details of the internal algorithms of the four-dimensional scalar multiplication can be found in Section 4 and Section 5.

3. Review of Four-Dimensional GLV-GLS Scalar Multiplication

The curve Ted127-glv4 was introduced by Longa and Sica [14]. It is based on twisted Edwards curves and has efficiently computable endomorphisms, which facilitate the four-dimensional GLV-GLS scalar multiplication. The parameters of curve Ted127-glv4 are as follows:
$$E / \mathbb{F}_{p^2} : -x^2 + y^2 = 1 + d x^2 y^2,$$
where $d = 170141183460469231731687303715884099728 + 116829086847165810221872975542241037773\,i$, $p = 2^{127} - 5997$, and $\# E(\mathbb{F}_{p^2}) = 8r$, where $r$ is a 251-bit prime. Let $\mathbb{F}_{p^2} = \mathbb{F}_p[i] / (i^2 + 1)$ and $u = 1 + i$ be a quadratic non-residue in $\mathbb{F}_{p^2}$. $E$ is isomorphic to the Weierstrass curve $y^2 = x^3 - \frac{15}{2} u^2 x - 7 u^3$ over $\mathbb{F}_{p^2}$. The curve Ted127-glv4 contains two efficiently computable endomorphisms $\phi$ and $\psi$ defined over $\mathbb{F}_{p^2}$ as follows:
$$\phi(x, y) = \left( \frac{(\zeta_8^3 + 2 \zeta_8^2 + \zeta_8)\, x y^2 + (\zeta_8^3 - 2 \zeta_8^2 + \zeta_8)\, x}{2y}, \; \frac{(\zeta_8^2 - 1)\, y^2 + 2 \zeta_8^3 - \zeta_8^2 + 1}{(2 \zeta_8^3 + \zeta_8^2 - 1)\, y^2 - \zeta_8^2 + 1} \right),$$
$$\psi(x, y) = \left( \zeta_8 x^p, \; \frac{1}{y^p} \right),$$
where $\zeta_8 = u / \sqrt{2}$ is a primitive eighth root of unity. It can be verified that $\phi^2 + 2 = 0$ and $\psi^2 + 1 = 0$.
Let $P$ be a point in $E / \mathbb{F}_{p^2}$ and $k$ be a random scalar in the range $[1, r]$. Algorithm 1 outlines variable-base scalar multiplication using curve Ted127-glv4 and four-dimensional decompositions. Steps 1 and 2 in Algorithm 1 compute three endomorphisms $\phi(P)$, $\psi(P)$, and $\psi(\phi(P))$, and then compute the eight points $T[u] = P + u_0 \phi(P) + u_1 \psi(P) + u_2 \psi(\phi(P))$, where $u = (u_2, u_1, u_0)_2$ with $0 \le u \le 7$. Step 3 decomposes the input scalar $k$ into multi-scalars $(k_0, k_1, k_2, k_3)$ such that $0 \le k_i < 2^{65}$, where $0 \le i \le 3$. For a constant-time implementation, the multi-scalars $(k_0, k_1, k_2, k_3)$ must guarantee the same number of iterations of the main computation. Because all coordinates of the scalar decomposition are less than $2^{65}$, we apply the scalar recoding algorithm at step 4 to guarantee a fixed loop length for the main computation [25]. The result of the scalar recoding is represented by 66 lookup table indices $d_i$ and 66 masks $m_i$, where $0 \le i \le 65$. Steps 5 to 9 represent the main computation stage, including point loading, the ECADD operation, and the ECDBL operation. The result of the main computation is converted from extended coordinates to affine coordinates in step 10. Therefore, a variable-base scalar multiplication using curve Ted127-glv4 requires one $\phi(P)$ endomorphism, two $\psi(P)$ endomorphisms, and seven ECADD operations in the precomputation; 65 table lookups, 65 ECADD operations, and 65 ECDBL operations in the main computation; and one inversion and two field multiplications over $\mathbb{F}_{p^2}$ for point normalization.
Figure 1 describes the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication and its internal algorithms. Because the algorithms chosen at each level affect the performance of scalar multiplication, we carefully choose suitable algorithms considering the features of AVR, MSP430, and ARM processors. Additionally, field arithmetic over $\mathbb{F}_{p^2}$ and curve arithmetic are built on field arithmetic over $\mathbb{F}_p$, which is the computationally dominant operation. Therefore, field arithmetic over $\mathbb{F}_p$ is written at the assembly level.
Algorithm 1: Scalar multiplication using curve Ted127-glv4 [21].
Require: Scalar $k \in [1, r]$ and point $P \in E / \mathbb{F}_{p^2}$.
Ensure: $kP$.
 1: Compute $\phi(P)$, $\psi(P)$, and $\psi(\phi(P))$.
 2: Compute $T[u] = P + u_0 \phi(P) + u_1 \psi(P) + u_2 \psi(\phi(P))$, where $u = (u_2, u_1, u_0)_2$ with $0 \le u \le 7$.
 3: Decompose the scalar $k$ into the multi-scalars $(k_0, k_1, k_2, k_3)$.
 4: Recode the multi-scalars $(k_0, k_1, k_2, k_3)$ to $(d_{65}, \ldots, d_0)$ and $(m_{65}, \ldots, m_0)$. $s_i = 1$ if $m_i = -1$ and $s_i = -1$ if $m_i = 0$.
 5: $Q = s_{65} \cdot T[d_{65}]$.
 6: for $i = 64$ to $0$ do
 7:   $Q \leftarrow 2Q$.
 8:   $Q \leftarrow Q + s_i \cdot T[d_i]$.
 9: end for
10: return $Q$.
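The fixed-length structure of steps 4 to 9 in Algorithm 1 can be illustrated with the following Python sketch. It is a model under several assumptions, not the implementation of this paper: the recoding follows the general sign-aligned column idea (every column gets a nonzero sign taken from an odd $k_0$), the exact recoding of [25] may differ in details, the "points" are replaced by plain integers so that addition and doubling are ordinary integer operations, and the table lookup is a normal list index rather than a constant-time scan. The names `recode`, `multi_scalar_mul`, and `NUM_DIGITS` are hypothetical.

```python
NUM_DIGITS = 66  # digits d_65 .. d_0, as in Algorithm 1


def recode(k0, k1, k2, k3):
    """Sign-aligned column recoding sketch.

    Assumes k0 is odd and all scalars are below 2^65. Returns signs s_i in
    {1, -1} and table indices d_i in 0..7 such that, for every scalar,
    the signed digits reproduce its value.
    """
    if k0 % 2 == 0:
        raise ValueError("this sketch assumes an odd k0")
    rest = [k1, k2, k3]
    signs, digits = [], []
    for i in range(NUM_DIGITS):
        # Sign digit derived from k0: +1 at the top, else 2*bit_{i+1}(k0) - 1.
        s = 1 if i == NUM_DIGITS - 1 else 2 * ((k0 >> (i + 1)) & 1) - 1
        d = 0
        for j in range(3):
            bit = rest[j] & 1
            d |= bit << j
            # Digit is s*bit; subtract it and halve (b // 2 is -1 only for b = -1).
            rest[j] = (rest[j] >> 1) - ((s * bit) // 2)
        signs.append(s)
        digits.append(d)
    if rest != [0, 0, 0]:
        raise ValueError("scalars too large for the fixed digit count")
    return signs, digits


def multi_scalar_mul(scalars, bases):
    """Model of steps 2 and 4-9 with integers standing in for curve points."""
    b0, b1, b2, b3 = bases
    # T[u] = P + u0*phi(P) + u1*psi(P) + u2*psi(phi(P)) with u = (u2, u1, u0)_2.
    table = [b0 + (u & 1) * b1 + ((u >> 1) & 1) * b2 + ((u >> 2) & 1) * b3
             for u in range(8)]
    signs, digits = recode(*scalars)
    q = signs[65] * table[digits[65]]
    for i in range(64, -1, -1):           # always exactly 65 iterations
        q = 2 * q                         # stands in for ECDBL
        q = q + signs[i] * table[digits[i]]  # stands in for ECADD
    return q
```

Because every column has a nonzero sign, each iteration performs exactly one doubling and one addition regardless of the scalar values, which is the property the text relies on for constant-time execution.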

4. Implementation Details of Field Arithmetic

In this section, we describe the implementation details of field arithmetic on AVR, MSP430X, and ARM Cortex-M4 processors using a Mersenne-like prime of the form $p = 2^{127} - 5997$. We describe the field arithmetic algorithms that are common to the three target platforms in Section 4.1, Section 4.2, Section 4.3 and Section 4.4. In Section 4.5, Section 4.6 and Section 4.7, we describe our optimization strategy for field arithmetic on AVR, MSP430, and ARM processors, respectively.

4.1. Field Addition and Subtraction over $\mathbb{F}_p$

The curve Ted127-glv4 uses a Mersenne-like prime of the form $p = 2^{127} - 5997$. An efficient field addition/subtraction method for this scenario was proposed by Bos et al. [16]. Let $0 \le a, b < p = 2^{127} - 5997$. Field addition over $\mathbb{F}_p$ can be computed by $c = a + b \pmod{p} = ((a + 5997) + b) - carry \cdot 2^{127} - (1 - carry) \cdot 5997$, where $carry = 0$ if $a + b + 5997 < 2^{127}$, and $carry = 1$ otherwise. The result is bounded by $p$ because, if $a + b + 5997 < 2^{127}$, then $a + b < 2^{127} - 5997$, whereas if $a + b + 5997 \ge 2^{127}$, then $(a + b + 5997) \pmod{2^{127}} = a + b - p < p$. Because $a + 5997 < 2^{127}$, addition does not require carry propagation. Note that the subtraction of $carry \cdot 2^{127}$ can be efficiently implemented by clearing the 128-th bit of $(a + 5997) + b$.
Similar to field addition, field subtraction over $\mathbb{F}_p$ can be computed by $c = a - b \pmod{p} = (a - b) + borrow \cdot 2^{127} - borrow \cdot 5997$, where $borrow = 0$ if $a \ge b$, and $borrow = 1$ otherwise. The addition of $borrow \cdot 2^{127}$ can be implemented by clearing the 128-th bit of $a - b$.
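The two formulas can be checked with a short Python model. This is only an arithmetic sketch using Python's arbitrary-precision integers; the function names are hypothetical, and the word-level carry handling of the actual assembly code is not modeled.

```python
P = 2**127 - 5997
C = 5997
MASK127 = 2**127 - 1


def fp_add(a, b):
    """c = ((a + 5997) + b) - carry*2^127 - (1 - carry)*5997, for 0 <= a, b < p."""
    t = a + C + b
    carry = t >> 127                 # 0 or 1, because t < 2^128
    t &= MASK127                     # subtract carry*2^127 by clearing bit 127
    return t - (1 - carry) * C


def fp_sub(a, b):
    """c = (a - b) + borrow*2^127 - borrow*5997, for 0 <= a, b < p."""
    t = a - b
    borrow = (t >> 127) & 1          # 1 exactly when a < b (t is negative)
    return t + borrow * 2**127 - borrow * C
```

In both functions the correction term is selected by multiplying with `carry` or `borrow` instead of branching, which mirrors the branch-free style needed for constant-time code.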

4.2. Modular Reduction

Using primes of a special form can result in a faster reduction method [31]. The NIST recommends five primes for the elliptic curve digital signature algorithm (ECDSA). These primes can be represented as sums or differences of powers of two and thus facilitate fast reduction. The curve Ted127-glv4 uses a Mersenne-like prime of the form $p = 2^{127} - 5997$. Therefore, modular reduction can be efficiently computed by using a NIST-like reduction method [16]. Let $0 \le a, b \le p = 2^{127} - 5997$. We compute $c = a \cdot b = 2^{128} c_h + c_l$, where $0 \le c_h, c_l < 2^{128}$. Because $2^{128} \equiv 2 \cdot 5997 \pmod{p}$, the first reduction step can be computed by $c \leftarrow c_l + 2 \cdot 5997 \cdot c_h$. Then, writing this intermediate result as $2^{127} R_h + R_l$, the second reduction step can be computed by $c \leftarrow R_l + 5997 \cdot R_h \pmod{p}$, where $R_l, 5997 \cdot R_h < 2^{127}$.
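The two folding steps can be sketched in Python as follows. The function name `fp_reduce` is hypothetical, and the final conditional subtraction is an assumption added here so that the sketch returns a fully reduced value; the text above does not spell out how the last correction is done in the actual implementation.

```python
P = 2**127 - 5997
C = 5997


def fp_reduce(c):
    """Reduce a double-length value 0 <= c < 2^256 modulo p = 2^127 - 5997."""
    # Step 1: c = 2^128*c_h + c_l  and  2^128 = 2*5997 (mod p).
    c_h, c_l = c >> 128, c & (2**128 - 1)
    c = c_l + 2 * C * c_h
    # Step 2: c = 2^127*R_h + R_l  and  2^127 = 5997 (mod p).
    r_h, r_l = c >> 127, c & (2**127 - 1)
    c = r_l + C * r_h
    # Assumed final correction: here c < 2^127 + 2^28, so one subtraction suffices.
    if c >= P:
        c -= P
    return c


def fp_mul(a, b):
    """Field multiplication as integer multiplication followed by reduction."""
    return fp_reduce(a * b)
```

Only multiplications by the small constant 5997 and additions are needed, which is what makes the Mersenne-like form of the prime attractive on small microcontrollers.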

4.3. Inversion over $\mathbb{F}_p$

For the field inversion $a^{-1} \pmod{p}$, we use the fact that $a^{-1} = a^{p-2} \pmod{p}$ by Fermat's little theorem (in our case, $a^{p-2} \pmod{p} = a^{2^{127} - 5999} \pmod{p}$). This method can be implemented by modular exponentiation using fixed addition chains and guarantees constant-time execution, requiring $13 M_1 + 126 S_1$ operations.
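A minimal Python sketch of Fermat-based inversion is given below. It uses plain left-to-right square-and-multiply over the fixed public exponent $p - 2$ rather than the optimized addition chain mentioned above, so its multiplication count is higher than $13 M_1$; the function name is hypothetical.

```python
P = 2**127 - 5997


def fp_inv(a):
    """Compute a^(p-2) mod p with square-and-multiply over the fixed exponent.

    The sequence of squarings and multiplications depends only on the public
    exponent p - 2, never on the value of a.
    """
    e = P - 2
    result = 1
    for i in reversed(range(e.bit_length())):
        result = result * result % P      # squaring (S_1)
        if (e >> i) & 1:
            result = result * a % P       # multiplication (M_1)
    return result
```

For any nonzero `a`, `fp_inv(a) * a % P` evaluates to 1, and the result matches Python's built-in `pow(a, -1, P)`.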

4.4. Field Arithmetic over $\mathbb{F}_{p^2}$

The incomplete reduction method proposed by Yanık et al. [32] is one of the optimization methods in field arithmetic over $\mathbb{F}_{p^2}$. Given two elements $a, b \in [0, p - 1]$, the result of an operation is allowed to stay in the range $[0, 2^m - 1]$, where $p < 2^m < 2p - 1$ and $m$ is a fixed integer (in our case, $m = 128$). Because the modulus of curve Ted127-glv4 is a Mersenne-like prime of the form $p = 2^{127} - 5997$, the incomplete reduction method can be applied more advantageously.
Let $a = a_0 + a_1 i$ and $b = b_0 + b_1 i$ be two arbitrary elements in a finite field $\mathbb{F}_{p^2}$. Field addition and subtraction over $\mathbb{F}_{p^2}$ can be computed by $a + b = (a_0 + b_0) + (a_1 + b_1) i$ and $a - b = (a_0 - b_0) + (a_1 - b_1) i$, respectively. Field inversion over $\mathbb{F}_{p^2}$ can be computed by $a^{-1} = (a_0 - a_1 i) / (a_0^2 + a_1^2)$.
We utilize Karatsuba multiplication to compute field multiplication over $\mathbb{F}_{p^2}$. The Karatsuba multiplication uses the fact that $a \cdot b = (a_0 + a_1 i)(b_0 + b_1 i) = (a_0 b_0 - a_1 b_1) + \{ (a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1 \} i$, which can be computed by $3 M_1 + 3 A_1 + 2 A_i$ operations. It requires $1 A_1 + 2 A_i$ more operations but saves $1 M_1$ operation compared to general multiplication methods, which require $4 M_1 + 2 A_1$ operations. Because field multiplication is more expensive than multi-precision addition and field addition, the Karatsuba multiplication has a computational advantage. Algorithm 2 describes field multiplication over $\mathbb{F}_{p^2}$ using the Karatsuba multiplication and the incomplete reduction method.
Algorithm 3 describes field squaring over $\mathbb{F}_{p^2}$ using the incomplete reduction method. Note that $a^2 = (a_0^2 - a_1^2) + 2 a_0 a_1 i = (a_0 + a_1)(a_0 - a_1) + 2 a_0 a_1 i$. The first representation can be computed by $1 M_1 + 2 S_1 + 1 A_1 + 1 A_i$ operations, and the second representation can be computed by $2 M_1 + 1 A_1 + 2 A_i$ operations. Because $1 M_1$ operation can be implemented faster than $2 S_1$ operations, we use $2 M_1 + 1 A_1 + 2 A_i$ operations to compute field squaring over $\mathbb{F}_{p^2}$. The results of steps 3 and 4 in Algorithm 2 and steps 1 and 3 in Algorithm 3 are kept in incompletely reduced form.
Algorithm 2: Field multiplication over $\mathbb{F}_{p^2}$ [25].
Require: $a = a_0 + a_1 i$, $b = b_0 + b_1 i \in \mathbb{F}_{p^2}$, $p = 2^{127} - 5997$.
Ensure: $c = a \cdot b = c_0 + c_1 i \in \mathbb{F}_{p^2}$.
 1: $t_1 \leftarrow a_0 \times b_0 \pmod{p}$ {$M_1$}
 2: $t_2 \leftarrow a_1 \times b_1 \pmod{p}$ {$M_1$}
 3: $t_3 \leftarrow a_0 + a_1$ {$A_i$}
 4: $c_1 \leftarrow b_0 + b_1$ {$A_i$}
 5: $c_1 \leftarrow c_1 \times t_3 \pmod{p}$ {$M_1$}
 6: $c_1 \leftarrow c_1 - t_1 \pmod{p}$ {$A_1$}
 7: $c_1 \leftarrow c_1 - t_2 \pmod{p}$ {$A_1$}
 8: $c_0 \leftarrow t_1 - t_2 \pmod{p}$ {$A_1$}
 9: return $c$.
Algorithm 3: Field squaring over $\mathbb{F}_{p^2}$ [25].
Require: $a = a_0 + a_1 i \in \mathbb{F}_{p^2}$, $p = 2^{127} - 5997$.
Ensure: $c = a^2 = c_0 + c_1 i \in \mathbb{F}_{p^2}$.
 1: $t_1 \leftarrow a_0 + a_1$ {$A_i$}
 2: $t_2 \leftarrow a_0 - a_1 \pmod{p}$ {$A_1$}
 3: $t_3 \leftarrow a_0 + a_0$ {$A_i$}
 4: $c_0 \leftarrow t_1 \times t_2 \pmod{p}$ {$M_1$}
 5: $c_1 \leftarrow t_3 \times a_1 \pmod{p}$ {$M_1$}
 6: return $c$.
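Algorithms 2 and 3 translate almost line by line into Python. The sketch below is an arithmetic model only: elements of $\mathbb{F}_{p^2}$ are represented as tuples `(a0, a1)`, the `% P` operator stands in for the platform-specific reduction, the unreduced sums model the $A_i$ steps, and the function names (including the schoolbook reference and the inversion helper) are hypothetical.

```python
P = 2**127 - 5997


def fp2_mul(a, b):
    """Algorithm 2: Karatsuba multiplication in F_p[i]/(i^2 + 1), 3 M_1."""
    a0, a1 = a
    b0, b1 = b
    t1 = a0 * b0 % P          # M_1
    t2 = a1 * b1 % P          # M_1
    t3 = a0 + a1              # A_i: no modular reduction
    c1 = b0 + b1              # A_i: no modular reduction
    c1 = c1 * t3 % P          # M_1
    c1 = (c1 - t1) % P        # A_1
    c1 = (c1 - t2) % P        # A_1
    c0 = (t1 - t2) % P        # A_1
    return (c0, c1)


def fp2_sqr(a):
    """Algorithm 3: squaring via (a0 + a1)(a0 - a1) + 2*a0*a1*i, 2 M_1."""
    a0, a1 = a
    t1 = a0 + a1              # A_i
    t2 = (a0 - a1) % P        # A_1
    t3 = a0 + a0              # A_i
    return (t1 * t2 % P, t3 * a1 % P)


def fp2_mul_schoolbook(a, b):
    """Reference multiplication with 4 M_1, used only for comparison."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 - a1 * b1) % P, (a0 * b1 + a1 * b0) % P)


def fp2_inv(a):
    """a^-1 = (a0 - a1*i) / (a0^2 + a1^2); pow(..., -1, P) stands in for I_1."""
    a0, a1 = a
    n = pow((a0 * a0 + a1 * a1) % P, -1, P)
    return (a0 * n % P, -a1 * n % P)
```

Comparing `fp2_mul` with `fp2_mul_schoolbook` on the same inputs is a quick way to confirm that the Karatsuba rearrangement only changes the operation count, not the result.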

4.5. Optimization Strategy on 8-Bit AVR

The AVR processor is a family of 8-bit microcontrollers that is widely used in MICA2/MICAz sensor motes. The AVR processors are equipped with an 8-bit integer multiplier and a register file with 32 8-bit general-purpose registers numbered from R0 to R31. The register pairs R26:R27, R28:R29, and R30:R31 are used as 16-bit indirect address registers called X, Y, and Z. The automatic increment and decrement addressing modes are supported on all of the X, Y, and Z registers, and Y and Z additionally support fixed positive displacement. Registers R0 and R1 store the 16-bit result of an $8 \times 8$-bit multiplication. The AVR processors provide a typical 8-bit reduced instruction set computer (RISC) instruction set. The most important instructions for ECC are $8 \times 8$-bit multiplication (MUL) and memory access (LD, ST) instructions, which require two cycles. Instructions between two registers, such as addition (ADD, ADC) or subtraction (SUB, SBC), require only one cycle. Therefore, the basic optimization strategy on 8-bit AVR is to reduce the number of memory access instructions.
To simulate our implementations, we targeted the ATxmega256A3 processor [33]. This processor can be clocked up to 32 MHz and provides 256 KB of programmable flash memory, 16 KB of SRAM, and 4 KB of EEPROM.
Recently, Hutter and Schwabe [34] proposed a highly optimized Karatsuba multiplication for the 8-bit AVR processor. There are two variants of the Karatsuba multiplication method: the additive Karatsuba and subtractive Karatsuba methods. Algorithm 4 outlines the subtractive Karatsuba multiplication. We consider $n \times n$-bit multiplication, where $n$ is even and $k = n / 2$ (in our case, $n = 128$ and $k = 64$). The additive Karatsuba method can be computed similarly to Algorithm 4. However, the additive Karatsuba method may produce carry bits in the additions $(a_l + a_h)$ and $(b_l + b_h)$. The additional multiplication using the carry bits incurs a significant overhead for integer multiplication. The subtractive Karatsuba method does not produce carry bits in the computation of $M$, but computes the two absolute values $|a_l - a_h|$ and $|b_l - b_h|$. This overhead is not only smaller than the overhead required for the additive Karatsuba method, but can also be executed in constant time. Therefore, we chose and implemented the subtractive Karatsuba multiplication for the 8-bit AVR implementation.
Algorithm 4: Subtractive Karatsuba multiplication [34].
Require: $a = 2^k a_h + a_l$, $b = 2^k b_h + b_l \in \mathbb{F}_p$ for $k$-bit integers $a_l$, $a_h$, $b_l$, and $b_h$.
Ensure: $c = a \cdot b$.
 1: Compute $L = a_l \cdot b_l$
 2: Compute $H = a_h \cdot b_h$
 3: Compute $M = |a_l - a_h| \cdot |b_l - b_h|$
 4: Set $t = 0$ if $M = (a_l - a_h) \cdot (b_l - b_h)$, $t = 1$ otherwise
 5: Compute $\hat{M} = (-1)^t M = (a_l - a_h) \cdot (b_l - b_h)$
 6: $c = a \cdot b = L + 2^k (L + H - \hat{M}) + 2^n H$
 7: return $c$.
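The steps of Algorithm 4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic identity for one recursion level. The function name `karatsuba_subtractive` is hypothetical, Python integers stand in for the multi-word operands, and the branch on the sign is not constant-time, unlike the assembly implementation described in the text.

```python
def karatsuba_subtractive(a: int, b: int, n: int = 128) -> int:
    """One level of subtractive Karatsuba for n-bit operands (n even)."""
    k = n // 2
    mask = (1 << k) - 1
    a_l, a_h = a & mask, a >> k
    b_l, b_h = b & mask, b >> k

    low = a_l * b_l                    # step 1: L
    high = a_h * b_h                   # step 2: H
    da, db = a_l - a_h, b_l - b_h
    mid = abs(da) * abs(db)            # step 3: M (no carry bits appear)
    t = int((da < 0) != (db < 0))      # step 4: t = 1 iff the signs differ
    mid_signed = -mid if t else mid    # step 5: M^ = (-1)^t * M

    # step 6: c = L + 2^k (L + H - M^) + 2^n H
    return low + ((low + high - mid_signed) << k) + (high << n)
```

Only three half-size products (L, H, and M) are needed, at the cost of two subtractions with absolute values and one sign bit.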
For integer squaring, we chose the sliding block doubling (SBD) method [35], which is more efficient than the subtractive Karatsuba method in the case of 128-bit operands on 8-bit AVR. To improve the performance of field arithmetic, we combined integer multiplication and squaring with modular reduction.

4.6. Optimization Strategy on 16-Bit MSP430X

The MSP430X processor was designed as an ultra-low power microcontroller based on a 16-bit RISC CPU. The MSP430X CPU has sixteen 20-bit registers, numbered R0 to R15. Registers R0 to R3 are special-purpose registers that are used as the program counter, stack pointer, status register, and constant generator, respectively. Registers R4 to R15 are general-purpose registers that are used to store data values, address pointers, and index values.
The MSP430X instruction set does not include multiply and multiply-and-accumulate (MAC) instructions. Instead, the MSP430 family is equipped with a memory-mapped hardware multiplier. The hardware multiplier provides four different multiply operations (unsigned multiplication, signed multiplication, unsigned multiplication and accumulation, and signed multiplication and accumulation), which are selected by the register to which the first operand is written: MPY, MPYS, MAC, or MACS. The second operand register, called OP2, is common to all multiplier modes. In other words, writing the first operand determines the operation type of the multiplier, but does not start the operation. Writing the second operand to the OP2 register starts the selected multiplication of the two values. The multiplication result is written to the three result registers RESLO, RESHI, and SUMEXT. RESLO stores the lower 16 bits of the result, RESHI stores the upper 16 bits of the result, and SUMEXT stores the carry bit or sign of the result.
The MSP430X processor provides seven addressing modes for the source operand and four addressing modes for the destination operand. The total computation time depends on the instruction format and the addressing modes of the operands. Instructions between two CPU registers require only one cycle. However, a memory access instruction (MOV) requires two to six cycles depending on the addressing modes of its operands. To improve the performance of field arithmetic, the basic optimization strategies are to reduce the number of memory access instructions and to use the MAC operations efficiently.
In our implementations, we targeted the MSP430FR5969 processor [36]. This processor is equipped with 64 KB of program flash memory and 2 KB of RAM and can be clocked up to 16 MHz.
For integer multiplication on the 16-bit MSP430X processor, we chose and implemented the product scanning multiplication. Algorithm 5 outlines the product scanning method for multi-precision multiplication. The first loop in Algorithm 5 computes the lower half of the multiplication result $c$, and the second loop computes the upper half of $c$. The inner loop accumulates the partial products $a_j \times b_{i-j}$, and these operations can be efficiently computed using the MAC operations of the hardware multiplier. Specifically, two 16-bit operands are multiplied and the result is added to the intermediate value $s$, which is held in RESLO, RESHI, and SUMEXT.
In FourQ [24], integer squaring was implemented using the SBD method [35]. We utilize the product scanning method for 128-bit integer squaring on 16-bit MSP430X. It can be easily implemented by modifying the product scanning multiplication. Additionally, this method results in better performance than the SBD method in FourQ. The implementation results can be found in Section 6.2.
Algorithm 5: Product scanning multiplication.
Require: $a = (a_{m-1}, \ldots, a_0)$, $b = (b_{m-1}, \ldots, b_0) \in \mathbb{F}_p$.
Ensure: $c = a \cdot b = (c_{2m-1}, \ldots, c_0)$.
 1: $s \leftarrow 0$
 2: for $i$ from 0 to $m - 1$ do
 3:   for $j$ from 0 to $i$ do
 4:     $s \leftarrow s + a_j \cdot b_{i-j}$
 5:   end for
 6:   $c_i \leftarrow s \pmod{2^w}$
 7:   $s \leftarrow \lfloor s / 2^w \rfloor$
 8: end for
 9: for $i$ from $m$ to $2m - 2$ do
10:   for $j$ from $i - m + 1$ to $m - 1$ do
11:     $s \leftarrow s + a_j \cdot b_{i-j}$
12:   end for
13:   $c_i \leftarrow s \pmod{2^w}$
14:   $s \leftarrow \lfloor s / 2^w \rfloor$
15: end for
16: $c_{2m-1} \leftarrow s \pmod{2^w}$
17: return $c = (c_{2m-1}, \ldots, c_0)$.
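As a minimal sketch of Algorithm 5, the following Python function works on little-endian lists of $w$-bit words. The accumulator `s` plays the role that RESLO, RESHI, and SUMEXT play on the MSP430X. The function and helper names are hypothetical, and the word size $w = 16$ with $m = 8$ words is an assumption matching 128-bit operands on a 16-bit CPU.

```python
def product_scanning_mul(a, b, w=16):
    """Multiply two little-endian lists of m w-bit words (Algorithm 5)."""
    m = len(a)
    assert len(b) == m
    mask = (1 << w) - 1
    c = [0] * (2 * m)
    s = 0
    for i in range(m):                      # lower half of the result
        for j in range(i + 1):
            s += a[j] * b[i - j]            # multiply-and-accumulate
        c[i] = s & mask
        s >>= w
    for i in range(m, 2 * m - 1):           # upper half of the result
        for j in range(i - m + 1, m):
            s += a[j] * b[i - j]
        c[i] = s & mask
        s >>= w
    c[2 * m - 1] = s & mask
    return c


def to_words(x, m, w=16):
    """Split a non-negative integer into m little-endian w-bit words."""
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(m)]


def from_words(words, w=16):
    """Recombine little-endian w-bit words into an integer."""
    return sum(word << (w * i) for i, word in enumerate(words))
```

Each result word is written exactly once, which is why this column-wise order suits a CPU where memory accesses are more expensive than accumulations.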

4.7. Optimization Strategy on 32-Bit ARM

The ARM Cortex-M is a family of 32-bit RISC ARM processors for microcontrollers. The Cortex-M4 processor is a high-performance Cortex-M processor with digital signal processing (DSP), SIMD, and MAC instructions. It is based on the ARMv7-M architecture and is equipped with sixteen 32-bit registers, numbered R0 to R15. Registers R13 to R15 are special-purpose registers that are used as the stack pointer (SP), link register (LR), and program counter (PC), respectively. The Cortex-M4 instruction set provides multiply and MAC instructions, such as UMULL, UMLAL, and UMAAL. The UMULL instruction multiplies two unsigned 32-bit operands to obtain a 64-bit result. The UMLAL and UMAAL instructions multiply two unsigned 32-bit operands and accumulate the product with a single 64-bit value and with two 32-bit values, respectively.
In our implementations, we used the STM32F407-DISC1 board, which contains a 32-bit ARM Cortex-M4 STM32F407VGT6 microcontroller [37]. This microcontroller is equipped with 1 MB of flash memory, 192 KB of SRAM, and 64 KB of core-coupled memory (CCM) data RAM and can be clocked up to 168 MHz.
For integer multiplication and squaring, we implemented the operand scanning method by using efficient MAC operations. Additionally, these MAC operations facilitate an efficient implementation of modular reduction. The first reduction computes $c \leftarrow c_l + 2 \cdot 5997 \cdot c_h$, where $0 \le c_h, c_l < 2^{128}$. For example, the words of the intermediate value $c_h$ are loaded in R9 to R12 and the words of $c_l$ are loaded in R5 to R8. The constant $11994 = 2 \cdot 5997$ is loaded in R3 and 0 is loaded in R4. The computation $c \leftarrow c_l + 2 \cdot 5997 \cdot c_h$ is performed as follows:
MOV R3, #11994
MOV R4, #0
UMLAL R5, R4, R3, R9
UMAAL R4, R6, R3, R10
UMAAL R6, R7, R3, R11
UMAAL R7, R8, R3, R12
The result $c$ of the first reduction is held in (R5, R4, R6, R7, R8), from the least to the most significant word. The second reduction can be computed using simple multiplication (MUL) and addition (ADD, ADC) instructions.
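The register sequence above can be traced with a small Python model. This is a sketch under stated assumptions: `umlal` and `umaal` are hypothetical helpers that mimic the documented semantics of the two instructions on 32-bit words (UMLAL adds the product to the 64-bit pair RdHi:RdLo, UMAAL adds the product to RdLo and RdHi), and the local variable names mirror the register allocation described in the text.

```python
M32 = (1 << 32) - 1
P = (1 << 127) - 5997          # p = 2^127 - 5997, so 2^128 = 2p + 11994


def umlal(rd_lo, rd_hi, rn, rm):
    """UMLAL RdLo, RdHi, Rn, Rm: (RdHi:RdLo) += Rn * Rm, modulo 2^64."""
    t = (((rd_hi << 32) | rd_lo) + rn * rm) & ((1 << 64) - 1)
    return t & M32, t >> 32


def umaal(rd_lo, rd_hi, rn, rm):
    """UMAAL RdLo, RdHi, Rn, Rm: (RdHi:RdLo) = Rn * Rm + RdLo + RdHi."""
    t = rn * rm + rd_lo + rd_hi          # always fits in 64 bits
    return t & M32, t >> 32


def first_reduction(c_l, c_h):
    """c_l, c_h: four little-endian 32-bit words each; returns five words."""
    r5, r6, r7, r8 = c_l
    r9, r10, r11, r12 = c_h
    r3, r4 = 11994, 0                    # MOV R3, #11994 ; MOV R4, #0
    r5, r4 = umlal(r5, r4, r3, r9)
    r4, r6 = umaal(r4, r6, r3, r10)
    r6, r7 = umaal(r6, r7, r3, r11)
    r7, r8 = umaal(r7, r8, r3, r12)
    return [r5, r4, r6, r7, r8]          # least significant word first


def to_words32(x, m):
    return [(x >> (32 * i)) & M32 for i in range(m)]


def from_words32(words):
    return sum(word << (32 * i) for i, word in enumerate(words))
```

Because $2^{128} \equiv 11994 \pmod p$, the five-word output is congruent to $c_h \cdot 2^{128} + c_l$ modulo $p$. Each UMAAL absorbs the previous carry and the next word of $c_l$ in a single instruction, which is why no separate ADD/ADC chain is needed in this step.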
To further improve field arithmetic, we implemented the arithmetic over $\mathbb{F}_{p^2}$ at the assembly level [24,38]. In the case of field multiplication over $\mathbb{F}_{p^2}$, we utilized the operand scanning multiplication with a lazy reduction method. This operation computes $a \cdot b = (a_0 + a_1 i)(b_0 + b_1 i) = (a_0 b_0 - a_1 b_1) + (a_0 b_1 + a_1 b_0) i$, where $a = a_0 + a_1 i$, $b = b_0 + b_1 i \in \mathbb{F}_{p^2}$. The operand scanning method results in better performance than the Karatsuba multiplication in our case. The field squaring over $\mathbb{F}_{p^2}$ is implemented using $a^2 = (a_0 + a_1)(a_0 - a_1) + 2 a_0 a_1 i$ at the assembly level.
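The two formulas can be checked with a short Python sketch. Elements of $\mathbb{F}_{p^2}$ are modelled as pairs `(a0, a1)` standing for $a_0 + a_1 i$ with $i^2 = -1$. The function names are hypothetical, and "lazy reduction" is imitated only in the sense that each output coordinate is reduced once, after the unreduced products have been combined.

```python
P = (1 << 127) - 5997          # field prime of curve Ted127-glv4


def fp2_mul(a, b):
    """(a0 + a1 i)(b0 + b1 i) with one reduction per output coordinate."""
    a0, a1 = a
    b0, b1 = b
    real = a0 * b0 - a1 * b1   # double-width value, not yet reduced
    imag = a0 * b1 + a1 * b0   # double-width value, not yet reduced
    return (real % P, imag % P)


def fp2_sqr(a):
    """a^2 = (a0 + a1)(a0 - a1) + 2 a0 a1 i: two base-field multiplications."""
    a0, a1 = a
    return ((a0 + a1) * (a0 - a1) % P, 2 * a0 * a1 % P)
```

The squaring formula replaces the two squarings $a_0^2$ and $a_1^2$ by a single product, which matches the cost $S_2 = 2M_1 + 1A_1 + 2A_i$ used later in the operation counts.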

5. Implementation Details of Curve Arithmetic

In this section, we describe the scalar decomposition and curve arithmetic, which are common to the three target platforms. Section 5.1 describes the scalar decomposition and recoding methods for multi-scalars. The details of point arithmetic, the coordinate system, and the endomorphisms are described in Section 5.2 and Section 5.3.

5.1. Scalar Decomposition

In this subsection, we describe the scalar decomposition method for a random integer $k \in [1, r]$ and corresponding multi-scalars $(k_0, k_1, k_2, k_3) \in \mathbb{Z}^4$ such that $k \equiv k_0 + k_1 \phi + k_2 \psi + k_3 \psi\phi$ with $\max(k_i) < C r^{1/4}$ for $0 \le i \le 3$ and some explicit constant $C > 0$. We assume that $\phi \equiv \lambda \pmod r$ and $\psi \equiv \mu \pmod r$. Let $F$ be a four-dimensional GLV-GLS reduction map defined by
$$F : \mathbb{Z}^4 \rightarrow \mathbb{Z}/r, \quad (k_0, k_1, k_2, k_3) \mapsto k_0 + k_1 \lambda + k_2 \mu + k_3 \lambda\mu \pmod r.$$
Let $B = (b_0, b_1, b_2, b_3)$ be a $4 \times 4$ matrix consisting of four linearly independent vectors with $\max_i |b_i| \le C r^{1/4}$. Then, for any $k \in [1, r - 1]$, the decomposition method computes $(\alpha_0, \alpha_1, \alpha_2, \alpha_3) \in \mathbb{Q}^4$ and computes the multi-scalars
$$(k_0, k_1, k_2, k_3) = (k, 0, 0, 0) - \sum_{i=0}^{3} \lfloor \alpha_i \rceil \cdot b_i,$$
where $\lfloor \cdot \rceil$ represents a rounding operation. There are two typical methods for decomposing a scalar: the Babai rounding method [39] and the method of division in the ring $\mathbb{Z}[\phi]$, where $\phi$ is an efficiently computable endomorphism [40]. In [14], lattice reduction algorithms based on Cornacchia's algorithm were proposed for finding a uniform basis. The first step is finding Cornacchia's GCD in $\mathbb{Z}$ and the second step is applying Cornacchia's algorithm in $\mathbb{Z}[i]$. We utilize these two algorithms to find four linearly independent vectors $b_0, b_1, b_2, b_3 \in \ker F$ with rectangle norms $< 51.5\sqrt{2}\, r^{1/4}$. The coordinates of these vectors are used in the scalar decomposition. Additionally, the relationships among the four vectors reduce the number of fixed constants. The two vectors $b_0 = (b_0[0], b_0[1], b_0[2], b_0[3])$ and $b_1 = (b_1[0], b_1[1], b_1[2], b_1[3])$ determine the remaining vectors $b_2 = (-b_0[2], -b_0[3], b_0[0], b_0[1])$ and $b_3 = (-b_1[2], -b_1[3], b_1[0], b_1[1])$.
Let $B_i$ be the $4 \times 4$ matrix formed by replacing $b_i$ in $B$ with the vector $(1, 0, 0, 0)$. We then define four precomputed constants $h_i = \det(B_i)$, where $0 \le i \le 3$. The four-dimensional decomposition computes $\alpha_i = \lfloor k \cdot h_i / r \rceil$ using four integer multiplications, four integer divisions, and four rounding operations. Bos et al. [17] introduced an efficient rounding method for eliminating integer divisions. This method chooses an integer $m$ such that $r < 2^m$, and precomputes the fixed constants $l_i = \lfloor (h_i / r) \cdot 2^m \rceil$. Then, $\alpha_i$ can be computed as $\lfloor k \cdot l_i / 2^m \rfloor$, where the division by $2^m$ can be computed by a shift operation. The four-dimensional decomposition of a random scalar $k$ using curve Ted127-glv4 can be computed as follows:
$$k_0 = k - \alpha_0 \cdot b_0[0] - \alpha_1 \cdot b_1[0] + \alpha_2 \cdot b_0[2] + \alpha_3 \cdot b_1[2],$$
$$k_1 = -\alpha_0 \cdot b_0[1] - \alpha_1 \cdot b_1[1] + \alpha_2 \cdot b_0[3] + \alpha_3 \cdot b_1[3],$$
$$k_2 = -\alpha_0 \cdot b_0[2] - \alpha_1 \cdot b_1[2] - \alpha_2 \cdot b_0[0] - \alpha_3 \cdot b_1[0],$$
$$k_3 = -\alpha_0 \cdot b_0[3] - \alpha_1 \cdot b_1[3] - \alpha_2 \cdot b_0[1] - \alpha_3 \cdot b_1[1].$$
However, Ref. [21] reported that this method yields either the correct answer or $\lfloor k \cdot h_i / r \rceil - 1$. They also reported that a larger value of $m$ decreases the probability of such a round-off error.
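The division-free rounding and its off-by-one behaviour can be illustrated with toy numbers. In the sketch below, the values of `r`, `h`, and `m` are small hypothetical placeholders, not the constants of curve Ted127-glv4, and all function names are assumptions introduced for the example.

```python
def round_div(num, den):
    """Nearest-integer rounding of num/den for num >= 0 and den > 0."""
    return (2 * num + den) // (2 * den)


def precompute_l(h, r, m):
    """Fixed constant l = round((h / r) * 2^m), computed once offline."""
    return round_div(h << m, r)


def alpha_exact(k, h, r):
    """Reference value round(k * h / r), which needs a division by r."""
    return round_div(k * h, r)


def alpha_shift(k, l, m):
    """Division-free approximation floor(k * l / 2^m) using a shift."""
    return (k * l) >> m


if __name__ == "__main__":
    r, h, m = 1000003, 123457, 20        # toy values with r < 2^m
    l = precompute_l(h, r, m)
    diffs = {alpha_exact(k, h, r) - alpha_shift(k, l, m) for k in range(1, r, 997)}
    print(sorted(diffs))                 # each difference is 0 or 1
```

For $0 < k < r < 2^m$ the error term $k (l - h 2^m / r) / 2^m$ has absolute value below $1/2$, so the shifted value can only equal the rounded quotient or be one less than it.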
Because the multi-scalars $(k_0, k_1, k_2, k_3)$ lie between $-2^{63}$ and $2^{63}$, their coordinates can be either positive or negative. Signed multi-scalars require additional cost to compute scalar multiplication. Costello and Longa [21] introduced offset vectors that make all coordinates of the multi-scalars positive, which simplifies scalar recoding. However, their odd-only scalar recoding method requires that the first element $k_0$ of the multi-scalars is always odd. For constant-time execution and odd-only recoding, they found two offset vectors $c_1$ and $c_2$ such that $(k_0, k_1, k_2, k_3) + c_1$ and $(k_0, k_1, k_2, k_3) + c_2$ are valid decompositions of the scalar $k$ and one of the two multi-scalars has a first element that is odd. To utilize these methods for curve Ted127-glv4, we carefully chose the two offset vectors $c_1 = 2b_0 + b_1 - 3b_2 - 4b_3$ and $c_2 = 3b_0 + 2b_1 - 3b_2 - 2b_3$. The multi-scalars $(k_0, k_1, k_2, k_3) + c_1$ and $(k_0, k_1, k_2, k_3) + c_2$ are valid decompositions of the scalar $k$. Finally, all four coordinates of the two decompositions are positive and less than $2^{65}$, and $k_0$ in one of them is odd.
Because all coordinates of the multi-scalars are less than $2^{65}$, scalar decomposition and recoding require more computational cost than in the FourQ-based implementation, where the coordinates of the multi-scalars are less than $2^{64}$. However, this additional cost is an extremely small portion of the scalar multiplication.

5.2. Point Arithmetic

To enhance the performance of scalar multiplication, the choice of efficient point arithmetic and of the coordinate system is crucial. The extended Edwards coordinates of the form $(X : Y : Z : T)$, where $T = XY/Z$, were proposed by Hisil et al. [30]. The extended Edwards coordinates are an extended version of the homogeneous coordinates of the form $(X : Y : Z)$. The identity element is represented by $(0 : 1 : 1 : 0)$ and the negative of $(X : Y : Z : T)$ is represented by $(-X : Y : Z : -T)$.
Hisil et al. [30] proposed dedicated addition and doubling formulas that are independent of the curve parameter $d$. Given two distinct points $(X_1 : Y_1 : Z_1 : T_1)$ and $(X_2 : Y_2 : Z_2 : T_2)$ with $Z_1 \ne 0$ and $Z_2 \ne 0$, the ECADD operation $(X_3 : Y_3 : Z_3 : T_3) = (X_1 : Y_1 : Z_1 : T_1) + (X_2 : Y_2 : Z_2 : T_2)$ can be computed as follows:
$$X_3 = (X_1 Y_2 - Y_1 X_2)(T_1 Z_2 + Z_1 T_2), \quad Y_3 = (Y_1 Y_2 + a X_1 X_2)(T_1 Z_2 - Z_1 T_2),$$
$$T_3 = (T_1 Z_2 + Z_1 T_2)(T_1 Z_2 - Z_1 T_2), \quad Z_3 = (Y_1 Y_2 + a X_1 X_2)(X_1 Y_2 - Y_1 X_2).$$
Similarly, given $(X_1 : Y_1 : Z_1 : T_1)$ with $Z_1 \ne 0$, the ECDBL operation $(X_3 : Y_3 : Z_3 : T_3) = 2(X_1 : Y_1 : Z_1 : T_1)$ can be computed as follows:
$$X_3 = 2 X_1 Y_1 (2 Z_1^2 - Y_1^2 - a X_1^2), \quad Y_3 = (Y_1^2 + a X_1^2)(Y_1^2 - a X_1^2),$$
$$T_3 = 2 X_1 Y_1 (Y_1^2 - a X_1^2), \quad Z_3 = (Y_1^2 + a X_1^2)(2 Z_1^2 - Y_1^2 - a X_1^2).$$
Hamburg [41] proposed extensible coordinates of the form $(X : Y : Z : T_a : T_b)$, where $T = T_a \cdot T_b$. The final step of the ECADD and ECDBL operations using extended Edwards coordinates computes $T = T_a \cdot T_b$. In contrast, the extensible coordinates store the coordinate $T$ as the pair $T_a$ and $T_b$, and compute $T$ only when it is required for point arithmetic. To further improve the ECADD operation, the precomputed point $Q$ is represented in the form $(X + Y, Y - X, 2Z, 2T)$ [25]. This method eliminates two multiplications by 2 and two field additions over $\mathbb{F}_p$ compared to the extended Edwards coordinates. In the case of the ECDBL operation, we utilize the transformation $2XY = (X + Y)^2 - X^2 - Y^2$ to reduce the number of multiplications. It converts one field multiplication and one field addition over $\mathbb{F}_{p^2}$ into one field squaring and two field subtractions over $\mathbb{F}_{p^2}$. Algorithms 6 and 7 describe the ECADD and ECDBL operations over $\mathbb{F}_{p^2}$ in extensible coordinates with the curve parameter $a = -1$, which require $8M_2 + 6A_2$ and $3M_2 + 4S_2 + 6A_2$ operations, respectively.
Algorithm 6: Twisted Edwards point addition over $\mathbb{F}_{p^2}$.
Require: $P = (X_1, Y_1, Z_1, T_a, T_b)$ where $T_1 = T_a \cdot T_b$ and $Q = (X_2 + Y_2, Y_2 - X_2, 2Z_2, 2T_2)$.
Ensure: $P + Q = (X_3, Y_3, Z_3, T_a, T_b)$ where $T_3 = T_a \cdot T_b$.
 1: $t_2 \leftarrow T_a \times T_b$ {$M_2$}
 2: $t_2 \leftarrow t_2 \times 2Z_2$ {$M_2$}
 3: $t_1 \leftarrow 2T_2 \times Z_1$ {$M_2$}
 4: $T_a \leftarrow t_2 - t_1$ {$A_2$}
 5: $T_b \leftarrow t_2 + t_1$ {$A_2$}
 6: $t_2 \leftarrow X_1 + Y_1$ {$A_2$}
 7: $t_2 \leftarrow (Y_2 - X_2) \times t_2$ {$M_2$}
 8: $t_1 \leftarrow Y_1 - X_1$ {$A_2$}
 9: $t_1 \leftarrow (X_2 + Y_2) \times t_1$ {$M_2$}
10: $Z_3 \leftarrow t_2 - t_1$ {$A_2$}
11: $t_1 \leftarrow t_1 + t_2$ {$A_2$}
12: $X_3 \leftarrow T_b \times Z_3$ {$M_2$}
13: $Z_3 \leftarrow t_1 \times Z_3$ {$M_2$}
14: $Y_3 \leftarrow T_a \times t_1$ {$M_2$}
15: return $P + Q = (X_3, Y_3, Z_3, T_a, T_b)$ where $T_3 = T_a \cdot T_b$.
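As a sanity check of the addition steps, the sketch below runs them on a toy twisted Edwards curve and compares the result with the textbook affine addition law. Several things are assumptions made for the example and are not taken from the paper: the curve lives over the small prime field $\mathbb{F}_{101}$ instead of $\mathbb{F}_{p^2}$, the parameter `D` is simply the smallest non-square, step 9 is read as writing to $t_1$, step 10 is read as $Z_3 \leftarrow t_2 - t_1$ (the order required by the dedicated formulas of [30]), and all function names are hypothetical.

```python
P = 101                                   # toy prime, P % 4 == 1 so -1 is a square
A = P - 1                                 # curve parameter a = -1
D = next(d for d in range(2, P) if pow(d, (P - 1) // 2, P) == P - 1)  # non-square d
POINTS = [(x, y) for x in range(P) for y in range(P)
          if (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0]


def inv(x):
    return pow(x % P, -1, P)


def affine_add(p, q):
    """Complete affine addition law on a x^2 + y^2 = 1 + d x^2 y^2."""
    (x1, y1), (x2, y2) = p, q
    t = D * x1 * x2 * y1 * y2
    return ((x1 * y2 + y1 * x2) * inv(1 + t) % P,
            (y1 * y2 - A * x1 * x2) * inv(1 - t) % P)


def ecadd(p, q_pre):
    """Addition steps 1-14 on extensible coordinates, with the readings above."""
    x1, y1, z1, ta, tb = p
    x2_plus_y2, y2_minus_x2, two_z2, two_t2 = q_pre
    t2 = ta * tb % P
    t2 = t2 * two_z2 % P
    t1 = two_t2 * z1 % P
    ta = (t2 - t1) % P
    tb = (t2 + t1) % P
    t2 = (x1 + y1) % P
    t2 = y2_minus_x2 * t2 % P
    t1 = (y1 - x1) % P
    t1 = x2_plus_y2 * t1 % P
    z3 = (t2 - t1) % P
    t1 = (t1 + t2) % P
    x3 = tb * z3 % P
    z3 = t1 * z3 % P
    y3 = ta * t1 % P
    return (x3, y3, z3, ta, tb)


def check_all_pairs():
    """Compare ecadd with affine_add on every pair where Z3 != 0."""
    checked = 0
    for (x1, y1) in POINTS:
        for (x2, y2) in POINTS:
            p_ext = (3 * x1 % P, 3 * y1 % P, 3, 3 * x1 % P, y1)      # Z1 = 3
            X2, Y2, Z2, T2 = 5 * x2 % P, 5 * y2 % P, 5, 5 * x2 * y2 % P
            q_pre = ((X2 + Y2) % P, (Y2 - X2) % P, 2 * Z2 % P, 2 * T2 % P)
            x3, y3, z3, ta, tb = ecadd(p_ext, q_pre)
            if z3 == 0:
                continue              # exceptional input of the dedicated formulas
            zi = inv(z3)
            assert (x3 * zi % P, y3 * zi % P) == affine_add((x1, y1), (x2, y2))
            assert ta * tb * z3 % P == x3 * y3 % P    # T3 = Ta*Tb = X3*Y3/Z3
            checked += 1
    return checked
```

Pairs with $Z_3 = 0$, such as $P = Q$, are skipped because the dedicated addition formulas are not defined there. Those cases are the reason a separate doubling routine exists.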
To demonstrate the efficiency of the twisted Edwards curves, we compare their cost to that of a short Weierstrass elliptic curve. The ECADD and ECDBL operations of a short Weierstrass curve of the form $y^2 = x^3 + ax + b$ over $\mathbb{F}_{p^2}$ using Jacobian coordinates require $11M_2 + 5S_2 + 9A_2$ and $1M_2 + 8S_2 + 10A_2 + 1M_d$ operations, respectively. The ECADD operation of the twisted Edwards curve using extensible coordinates saves $3M_2 + 5S_2 + 3A_2$ operations. The ECDBL operation requires $2M_2$ additional operations but saves $4S_2 + 5A_2 + 1M_d$ operations. Therefore, the twisted Edwards curves using extensible coordinates have a computational advantage compared to short Weierstrass curves using Jacobian coordinates.
Algorithm 7: Twisted Edwards point doubling over $\mathbb{F}_{p^2}$.
Require: $P = (X_1, Y_1, Z_1)$.
Ensure: $2P = (X_3, Y_3, Z_3, T_a, T_b)$ where $T_3 = T_a \cdot T_b$.
 1: $t_1 \leftarrow X_1^2$ {$S_2$}
 2: $t_2 \leftarrow Y_1^2$ {$S_2$}
 3: $T_b \leftarrow t_1 + t_2$ {$A_2$}
 4: $T_a \leftarrow X_1 + Y_1$ {$A_2$}
 5: $T_a \leftarrow T_a^2$ {$S_2$}
 6: $t_1 \leftarrow t_2 - t_1$ {$A_2$}
 7: $t_2 \leftarrow Z_1^2$ {$S_2$}
 8: $T_a \leftarrow T_a - T_b$ {$A_2$}
 9: $t_2 \leftarrow t_2 + t_2$ {$A_2$}
10: $t_2 \leftarrow t_2 - t_1$ {$A_2$}
11: $Y_3 \leftarrow T_b \times t_1$ {$M_2$}
12: $X_3 \leftarrow T_a \times t_2$ {$M_2$}
13: $Z_3 \leftarrow t_1 \times t_2$ {$M_2$}
14: return $2P = (X_3, Y_3, Z_3, T_a, T_b)$ where $T_3 = T_a \cdot T_b$.
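The doubling steps can be checked in the same spirit on a toy curve. The assumptions are again made only for the example: the field is $\mathbb{F}_{101}$ rather than $\mathbb{F}_{p^2}$, `D` is the smallest non-square, step 4 is read as $T_a \leftarrow X_1 + Y_1$, step 6 as $t_1 \leftarrow t_2 - t_1$, step 11 as $Y_3 \leftarrow T_b \times t_1$, and the function names are hypothetical.

```python
P = 101                                   # toy prime, P % 4 == 1 so -1 is a square
A = P - 1                                 # curve parameter a = -1
D = next(d for d in range(2, P) if pow(d, (P - 1) // 2, P) == P - 1)  # non-square d
POINTS = [(x, y) for x in range(P) for y in range(P)
          if (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0]


def inv(x):
    return pow(x % P, -1, P)


def affine_double(p):
    """Reference doubling through the complete affine addition law."""
    x, y = p
    t = D * x * x * y * y
    return (2 * x * y * inv(1 + t) % P,
            (y * y - A * x * x) * inv(1 - t) % P)


def ecdbl(p):
    """Doubling steps 1-13 with a = -1: 3 multiplications, 4 squarings."""
    x1, y1, z1 = p
    t1 = x1 * x1 % P                      # X1^2
    t2 = y1 * y1 % P                      # Y1^2
    tb = (t1 + t2) % P                    # X1^2 + Y1^2
    ta = (x1 + y1) % P
    ta = ta * ta % P                      # (X1 + Y1)^2
    t1 = (t2 - t1) % P                    # Y1^2 - X1^2
    t2 = z1 * z1 % P                      # Z1^2
    ta = (ta - tb) % P                    # 2 X1 Y1
    t2 = (t2 + t2) % P                    # 2 Z1^2
    t2 = (t2 - t1) % P                    # 2 Z1^2 - Y1^2 + X1^2
    y3 = tb * t1 % P
    x3 = ta * t2 % P
    z3 = t1 * t2 % P
    return (x3, y3, z3, ta, tb)


def check_all_points():
    """Compare ecdbl with affine_double on every affine point, with Z1 = 3."""
    for (x, y) in POINTS:
        x3, y3, z3, ta, tb = ecdbl((3 * x % P, 3 * y % P, 3))
        assert z3 != 0
        zi = inv(z3)
        assert (x3 * zi % P, y3 * zi % P) == affine_double((x, y))
        assert ta * tb * z3 % P == x3 * y3 % P        # T3 = Ta*Tb = X3*Y3/Z3
    return len(POINTS)
```

Step 8 is where the identity $2XY = (X + Y)^2 - X^2 - Y^2$ is applied, so the product $X_1 Y_1$ is never computed directly.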

5.3. Endomorphisms

In [25], the formulas for the endomorphisms $\phi$ and $\psi$ are described. To reduce the number of representation conversions, we represent the results of endomorphism operations using extensible coordinates. Let $P = (X_1, Y_1, Z_1)$ be a point on curve Ted127-glv4 represented in homogeneous projective coordinates. Then, $\phi(P) = (X_2, Y_2, Z_2, T_a, T_b)$, where $T_2 = T_a \cdot T_b$, can be computed as follows:
$$X_2 = X_1 (\alpha Y_1^2 + \theta Z_1^2)(\sigma Y_1^2 - \beta Z_1^2), \quad Y_2 = 2 Y_1 Z_1^2 (\beta Y_1^2 + \gamma Z_1^2), \quad Z_2 = 2 Y_1 Z_1^2 (\sigma Y_1^2 - \beta Z_1^2),$$
$$T_a = X_1 (\alpha Y_1^2 + \theta Z_1^2), \quad T_b = \beta Y_1^2 + \gamma Z_1^2,$$
where $\alpha = \zeta_8^3 + 2\zeta_8^2 + \zeta_8$, $\theta = \zeta_8^3 - 2\zeta_8^2 + \zeta_8$, $\sigma = 2\zeta_8^3 + \zeta_8^2 - 1$, $\gamma = 2\zeta_8^3 - \zeta_8^2 + 1$, and $\beta = \zeta_8^2 - 1$. We also utilize the fixed values for curve Ted127-glv4 as follows:
$$\zeta_8 = 1 + Ai, \quad \sigma = (A - 1) + (A + 1)i, \quad \theta = A + Bi, \quad \alpha = A + 2i, \quad \gamma = (A + 1) + (A - 1)i, \quad \beta = B + 1 + i,$$
where $A = 143485135153817520976780139629062568752$ and $B = 170141183460469231731687303715884099729$. The endomorphism $\phi$ can be computed using $11M_2 + 2S_2 + 5A_2$ operations, or $7M_2 + 1S_2 + 5A_2$ operations in the case $Z_1 = 1$.
Similarly, $\psi(P) = (X_2, Y_2, Z_2, T_a, T_b)$, where $T_2 = T_a \cdot T_b$, can be computed as follows:
$$X_2 = \zeta_8 X_1^p Y_1^p, \quad Y_2 = (Z_1^p)^2, \quad Z_2 = Y_1^p Z_1^p, \quad T_a = \zeta_8 X_1^p, \quad T_b = Z_1^p.$$
The endomorphism $\psi$ can be computed using $3M_2 + 1S_2 + 1.5A_2$ operations, or $2M_2 + 1A_2$ operations in the case $Z_1 = 1$. Because the endomorphism $\psi$ requires fewer operations than the endomorphism $\phi$, $\psi(\phi(P))$ is computed by first evaluating $\phi(P)$ with $Z_1 = 1$ and then applying $\psi$ to the result.

6. Performance Analysis and Implementation Results

In this section, we analyze the operation counts and implementation results of variable-base scalar multiplication using curve Ted127-glv4 on AVR (Microchip Technology Inc., Chandler, AZ, USA), MSP430 (Texas Instruments, Dallas, TX, USA), and ARM (ARM Holdings plc, Cambridge, UK) processors. We performed simulations and evaluations using the IAR Embedded Workbench for AVR 6.80.7 (IAR Systems, Uppsala, Sweden), IAR Embedded Workbench for MSP430 7.10.2 (IAR Systems, Uppsala, Sweden), and the STM32F4-DISC1 board (STMicroelectronics, Geneva, Switzerland) with the IAR Embedded Workbench for ARM 8.11.1 (IAR Systems, Uppsala, Sweden). All implementations were built at the medium optimization level.

6.1. Operation Counts

Table 1 and Table 2 describe the operation counts of field arithmetic over $\mathbb{F}_{p^2}$ and their conversion into field arithmetic over $\mathbb{F}_p$ for curve Ted127-glv4 and FourQ using Algorithm 1. Because both curves support the four-dimensional decomposition, the operation counts for Algorithm 1 can be compared step by step.
Step 1 of Algorithm 1 computes the three endomorphisms $\phi(P)$, $\psi(P)$, and $\phi(\psi(P))$, and requires $73M_2 + 27S_2 + 59.5A_2$ operations for FourQ and $13M_2 + 2S_2 + 11.5A_2$ operations for curve Ted127-glv4. Step 2 requires seven ECADD operations, which require $49M_2 + 28A_2$ operations for FourQ and $56M_2 + 42A_2$ operations for curve Ted127-glv4. In addition, these outputs are all converted for faster ECADD computations, which requires $14M_2 + 28A_2$ operations for FourQ and $7M_2 + 28A_2$ operations for curve Ted127-glv4. Steps 3 and 4 require only bit and integer operations for the all-positive scalar decomposition and the fixed-length recoding. Step 5 requires $1A_2$ operations for one point negation and one table lookup, and a conversion to extensible coordinates $(X, Y, Z, T_a, T_b)$ for the initial point $Q$, which requires $2A_2$ operations. Steps 6 to 9 require 64 ECDBL operations, 64 ECADD operations, 64 point negations, and 64 table lookups for FourQ, and 65 ECDBL operations, 65 ECADD operations, 65 point negations, and 65 table lookups for curve Ted127-glv4. The operation counts of these steps are $704M_2 + 256S_2 + 832A_2$ for FourQ and $715M_2 + 260S_2 + 845A_2$ for curve Ted127-glv4. Step 10 requires $1I_2 + 2M_2$ operations for the normalization of the result point $Q$.
Variable-base scalar multiplication using the four-dimensional decomposition requires $1I_2 + 842M_2 + 283S_2 + 950.5A_2$ operations for FourQ and $1I_2 + 793M_2 + 262S_2 + 929.5A_2$ operations for curve Ted127-glv4. The curve Ted127-glv4 requires $49M_2 + 21S_2 + 21A_2$ fewer operations than FourQ because the endomorphisms of curve Ted127-glv4 are efficiently computable. However, the operation counts of field inversion over $\mathbb{F}_p$ differ for FourQ and curve Ted127-glv4: they are $I_1 = 10M_1 + 126S_1$ and $I_1 = 13M_1 + 126S_1$, respectively. Therefore, we convert the operation counts of the field arithmetic over $\mathbb{F}_{p^2}$ to field arithmetic over $\mathbb{F}_p$. Field arithmetic over $\mathbb{F}_{p^2}$ can be represented by field arithmetic over $\mathbb{F}_p$ as follows:
$$I_2 = 1I_1 + 2M_1 + 2S_1 + 2A_1, \quad M_2 = 3M_1 + 3A_1 + 2A_i, \quad S_2 = 2M_1 + 1A_1 + 2A_i, \quad A_2 = 2A_1.$$
The operation count $1I_2 + 842M_2 + 283S_2 + 950.5A_2$ can be represented by $3104M_1 + 128S_1 + 4712A_1 + 2250A_i$ for FourQ, and $1I_2 + 793M_2 + 262S_2 + 929.5A_2$ can be represented by $2918M_1 + 128S_1 + 4504A_1 + 2124A_i$ for curve Ted127-glv4. The scalar multiplication using curve Ted127-glv4 saves $186M_1 + 208A_1 + 126A_i$ operations compared to FourQ-based scalar multiplication. Therefore, we can deduce that the four-dimensional scalar multiplication using curve Ted127-glv4 can be faster than the FourQ-based implementation when field arithmetic is efficiently implemented.
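The conversion from $\mathbb{F}_{p^2}$ counts to $\mathbb{F}_p$ counts is plain bookkeeping and can be written down as a small helper. The sketch below applies the four relations above to the FourQ total as a worked check. The function name and its parameters are hypothetical.

```python
def to_fp_counts(i2, m2, s2, a2, inv_m1, inv_s1):
    """Convert F_{p^2} operation counts into (M1, S1, A1, Ai).

    inv_m1 and inv_s1 are the M1 and S1 counts of one inversion I1 over F_p.
    """
    m1 = i2 * (inv_m1 + 2) + 3 * m2 + 2 * s2     # I2 -> I1 + 2 M1, M2 -> 3 M1, S2 -> 2 M1
    s1 = i2 * (inv_s1 + 2)                       # I2 -> I1 + 2 S1
    a1 = 2 * i2 + 3 * m2 + 1 * s2 + 2 * a2       # I2, M2, S2, A2 contributions
    ai = 2 * m2 + 2 * s2                         # M2 and S2 each cost 2 Ai
    return m1, s1, a1, ai


if __name__ == "__main__":
    # FourQ: 1 I2 + 842 M2 + 283 S2 + 950.5 A2, with I1 = 10 M1 + 126 S1
    print(to_fp_counts(1, 842, 283, 950.5, inv_m1=10, inv_s1=126))
```

With these inputs the helper returns 3104 $M_1$, 128 $S_1$, 4712 $A_1$, and 2250 $A_i$, which are the FourQ figures quoted in the text.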

6.2. Implementation Results of Field Arithmetic

Table 3 lists how many cycles are used for field arithmetic over $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$ on AVR, MSP430, and ARM processors, including function call overhead. The cycle counts for field inversion over $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$ are averages over $10^4$ executions, and those for the remaining field operations are averages over $10^7$ executions. To evaluate the implementation of field arithmetic for curve Ted127-glv4, we compare its cycle counts with those of FourQ, which provides the fastest implementation results to date [24].
We first compare the number of cycles for field arithmetic on the 8-bit AVR processor. The field arithmetic over $\mathbb{F}_p$ for curve Ted127-glv4 on 8-bit AVR requires 198, 196, 1221, 1796, and 176,901 cycles to compute addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_p$, respectively. Similarly, the field arithmetic for FourQ on AVR requires 155, 159, 1026, 1598, and 150,535 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_p$, respectively. The curve Ted127-glv4 requires 43, 37, 195, 198, and 26,366 more cycles than FourQ for these operations, respectively. The field arithmetic over $\mathbb{F}_{p^2}$ for curve Ted127-glv4 requires 452, 448, 4093, 6277, and 183,345 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_{p^2}$, respectively. The same operations for FourQ require 384, 385, 3622, 5758, and 156,171 cycles, respectively. The curve Ted127-glv4 requires 68, 63, 471, 519, and 27,174 more cycles than FourQ for these operations, respectively.
In the case of the 16-bit MSP430X processor, field arithmetic over $\mathbb{F}_p$ for curve Ted127-glv4 requires 120, 126, 837, 1087, and 119,629 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_p$, respectively. The same operations for FourQ require 102, 101, 927, 1027, and 131,819 cycles, respectively. The curve Ted127-glv4 requires 18, 25, and 60 more cycles than FourQ to compute addition, subtraction, and multiplication, respectively. However, it saves 90 and 12,190 cycles compared to FourQ for squaring and inversion over $\mathbb{F}_p$, respectively. The field arithmetic over $\mathbb{F}_{p^2}$ for curve Ted127-glv4 requires 266, 278, 2476, 3806, and 123,740 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_{p^2}$, respectively. These operations for FourQ require 233, 231, 2391, 3624, and 135,315 cycles, respectively. The curve Ted127-glv4 requires 33, 47, 85, and 182 more cycles than FourQ to compute addition, subtraction, squaring, and multiplication, respectively. It saves 11,575 cycles compared to FourQ for inversion over $\mathbb{F}_{p^2}$.
On the 32-bit ARM Cortex-M4 processor, field arithmetic for curve Ted127-glv4 requires 55, 55, 88, 99, and 12,135 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_p$, respectively. However, Ref. [24] does not report the implementation results of field arithmetic over $\mathbb{F}_p$. The field arithmetic over $\mathbb{F}_{p^2}$ for curve Ted127-glv4 requires 82, 82, 196, 341, and 12,612 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_{p^2}$, respectively. These operations for FourQ require 84, 86, 215, 358, and 21,056 cycles, respectively. The curve Ted127-glv4 saves 2, 4, 19, 17, and 8444 cycles compared to FourQ for addition, subtraction, squaring, multiplication, and inversion over $\mathbb{F}_{p^2}$, respectively.
One can see that the field arithmetic over $\mathbb{F}_p$ in FourQ on AVR and MSP430 is typically faster than in curve Ted127-glv4. This difference occurs because the primes of the two curves are different: FourQ uses the Mersenne prime $p = 2^{127} - 1$, whereas curve Ted127-glv4 uses the Mersenne-like prime $p = 2^{127} - 5997$. Let $p = 2^{127} - \delta$, where $\delta$ is small. The modular reduction step can be computed by $c = c_h \cdot 2^{128} + c_l \equiv c_l + 2 \cdot \delta \cdot c_h \pmod p$. In this process, the reduction for FourQ can be computed efficiently using simple shift operations because $\delta = 1$, but curve Ted127-glv4 requires more instructions because it uses a multiplication by $\delta = 5997$ = 0x176d. In the 8-bit AVR implementation, 0x176d is represented by the two 8-bit words 0x17 and 0x6d. Therefore, the operation $c_l + 2 \cdot \delta \cdot c_h \pmod p$ requires more 8 × 8-bit multiplications and accumulations. Unlike in the AVR implementation, 0x176d can be represented by one word on the MSP430 and ARM CPUs. Additionally, these CPUs provide efficient MAC instructions. Therefore, the modular reduction in the MSP430 and ARM implementations requires fewer additional instructions than in the AVR implementation.
In the case of the MSP430, field squaring over $\mathbb{F}_p$ in curve Ted127-glv4 is faster than in FourQ. The field squaring over $\mathbb{F}_p$ in curve Ted127-glv4 requires 837 cycles, whereas FourQ requires 927 cycles. Our implementation saves 9.71% of the cycles for field squaring over $\mathbb{F}_p$ compared to the SBD method, despite the modular reduction overhead. Additionally, because the principal operation of inversion is field squaring over $\mathbb{F}_p$, our implementation saves 9.32% and 8.55% of the cycles for inversion over $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$, respectively. Field squaring over $\mathbb{F}_{p^2}$ does not use field squaring over $\mathbb{F}_p$, because it is computed with $2M_1 + 1A_1 + 2A_i$ operations. Therefore, field squaring over $\mathbb{F}_{p^2}$ for curve Ted127-glv4 requires more cycles than for FourQ.

6.3. Implementation Results of Scalar Multiplication

Table 4 summarizes the implementation results of variable-base scalar multiplication compared to the previous implementations on the 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. We measured the average cycles for our variable-base scalar multiplication by running it $10^3$ times with random scalars $k$. For comparison, Table 4 includes the previous implementations that guarantee constant-time execution. These were implemented using various elliptic curves, such as NIST P-256 [42,43], Curve25519 [44,45,46], μKummer [47], and FourQ [24]. These curves are designed such that the bit-length of the curve order is slightly smaller than 256 bits for efficient implementation. NIST P-256 has a 256-bit curve order, but Curve25519, μKummer, FourQ, and curve Ted127-glv4 have 252-bit, 250-bit, 246-bit, and 251-bit curve orders, respectively. Therefore, these curves provide approximately 128-bit security levels.
We now summarize the implementation results of previous works on embedded devices that provide approximately 128-bit security levels. Wenger and Werner [42] and Wenger et al. [43] implemented scalar multiplication using the NIST P-256 curve on various 16-bit microcontrollers and on 8-bit, 16-bit, and 32-bit microcontrollers, respectively. Hutter and Schwabe [48] implemented the NaCl library on the 8-bit AVR processor, which provides a Curve25519 scalar multiplication. Hinterwälder et al. [44] implemented a Diffie–Hellman key exchange on the MSP430X processor using 16-bit and 32-bit hardware multipliers. In 2015, Düll et al. [45] implemented a Curve25519 scalar multiplication on 8-bit, 16-bit, and 32-bit microcontrollers. Renes et al. [47] implemented a Montgomery ladder scalar multiplication on the Kummer surface of a genus 2 hyperelliptic curve on 8-bit AVR and 32-bit ARM Cortex-M0 processors. Faz-Hernández et al. [25] proposed an efficient implementation of the four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on Intel and ARM processors.
Our implementation results for variable-base scalar multiplication set new speed records on the 16-bit MSP430 and 32-bit ARM Cortex-M4 processors. Scalar multiplication using curve Ted127-glv4 on AVR, MSP430, and ARM requires 6,856,026, 4,158,453, and 447,836 cycles, respectively. Compared to the previous fastest implementation, namely FourQ [24], which requires 6,561,500, 4,280,400, and 469,500 cycles on AVR, MSP430, and ARM, respectively, our implementation requires 4.49% more cycles on AVR, but saves 2.85% and 4.61% of the cycles on the MSP430X and ARM processors, respectively. Compared to μKummer [47], which requires 9,513,536 cycles on AVR, our implementation saves 27.93% of the cycles. It also saves 50.68% and 47.58% of the cycles compared to Düll et al.'s Curve25519 implementation [45], which requires 13,900,397 and 7,933,296 cycles on AVR and MSP430, respectively. It saves 69.92% of the cycles compared to the NaCl library [48], which requires 22,791,579 cycles on AVR. It saves 54.50% of the cycles compared to Hinterwälder et al.'s Curve25519 implementation [44], which requires 9,139,739 cycles on MSP430. Additionally, it saves 68.54% of the cycles compared to the method in [46], which requires 1,423,667 cycles on the ARM Cortex-M4 processor.
The memory of embedded processors is very constrained, so the memory usage of an implementation is important. In the case of the 8-bit AVR, μKummer [47] has the lowest memory usage among the recently proposed results, requiring 9490 bytes of code and 99 bytes of stack memory. Wenger et al.'s and Düll et al.'s implementations [43,45] have the lowest code size and stack usage on MSP430, requiring 8378 bytes of code and 384 bytes of stack memory. On the 32-bit ARM, Ref. [46] requires 3750 bytes of code and 740 bytes of stack memory. FourQ [24] reported the memory usage of ECDH and signature operations, but did not report the memory usage of a single scalar multiplication. Our implementations for curve Ted127-glv4 require 13,891, 9098, and 7532 bytes of code and 2539, 2568, and 2792 bytes of stack memory on AVR, MSP430, and ARM Cortex-M4, respectively. FourQ and curve Ted127-glv4, which utilize the four-dimensional decomposition, precompute eight points, so they require more stack memory than other implementations. However, four-dimensional scalar multiplication is significantly faster than the other implementations.
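The relative figures against FourQ follow directly from the cycle counts quoted above. A small Python helper (hypothetical name, shown only as a worked check) makes the computation explicit:

```python
def relative_change(ours, reference):
    """Signed percentage; negative means fewer cycles than the reference."""
    return (ours - reference) / reference * 100.0


OURS = {"AVR": 6856026, "MSP430": 4158453, "ARM Cortex-M4": 447836}
FOURQ = {"AVR": 6561500, "MSP430": 4280400, "ARM Cortex-M4": 469500}

if __name__ == "__main__":
    for name in OURS:
        print(f"{name}: {relative_change(OURS[name], FOURQ[name]):+.2f}%")
```

The sign convention is a choice made for this example: the AVR entry is positive (more cycles than FourQ), while the MSP430 and ARM entries are negative (fewer cycles).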

7. Conclusions

In this paper, we presented the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors. We also optimized the performance of the internal algorithms of scalar multiplication on the three target processors. Our implementation of single scalar multiplication on AVR requires 4.49% more cycles than the FourQ-based implementation, but our implementations save 2.85% and 4.61% of the cycles on MSP430 and ARM Cortex-M4, respectively. Our analysis and implementation results demonstrate that efficiently computable endomorphisms can accelerate scalar multiplication, even when using prime numbers that provide less efficient field arithmetic. Our implementations highlight that curve Ted127-glv4 with four-dimensional GLV-GLS scalar multiplication is a suitable elliptic curve for constructing ECC-based applications for resource-constrained embedded devices.

Author Contributions

J.K. designed and implemented the presented software. S.C.S. and S.H. analyzed the experimental results and improved the choice of internal algorithms of scalar multiplication.

Acknowledgments

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2014-6-00910, Study on Security of Cryptographic Software).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shojafar, M.; Canali, C.; Lancellotti, R.; Baccarelli, E. Minimizing computing-plus-communication energy consumptions in virtualized networked data centers. In Proceedings of the 2016 IEEE Symposium on Computers and Communication (ISCC), Messina, Italy, 27–30 June 2016; pp. 1137–1144.
  2. Baccarelli, E.; Naranjo, P.G.V.; Shojafar, M.; Scarpiniti, M. Q*: Energy and delay-efficient dynamic queue management in TCP/IP virtualized data centers. Comput. Commun. 2017, 102, 89–106.
  3. Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of Conference on the Theory and Application of Cryptographic Techniques, Santa Barbara, CA, USA, 18–22 August 1985; Springer: Heidelberg/Berlin, Germany, 1985; pp. 417–426.
  4. Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209.
  5. Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126.
  6. SafeCurves: Choosing Safe Curves for Elliptic-Curve Cryptography. Available online: http://safecurves.cr.yp.to (accessed on 10 March 2018).
  7. Barker, E.; Kelsey, J. NIST Special Publication 800-90A Revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators; Technical Report; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015.
  8. Bernstein, D.J. Curve25519: New Diffie–Hellman Speed Records. In Proceedings of 9th International Workshop on Public Key Cryptography, New York, NY, USA, 24–26 April 2006; Springer: Heidelberg/Berlin, Germany, 2006; pp. 207–228.
  9. Hamburg, M. Ed448-Goldilocks, a new elliptic curve. IACR Cryptol. ePrint Arch. 2015, 2015, 625.
  10. Bernstein, D.J.; Birkner, P.; Joye, M.; Lange, T.; Peters, C. Twisted Edwards Curves. In Proceedings of 1st International Conference on Cryptology in Africa, Casablanca, Morocco, 11–14 June 2008; Springer: Heidelberg/Berlin, Germany, 2008; pp. 389–405.
  11. Gallant, R.P.; Lambert, R.J.; Vanstone, S.A. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. In Proceedings of 21st Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; Springer: Heidelberg/Berlin, Germany, 2001; pp. 190–200.
  12. Galbraith, S.D.; Lin, X.; Scott, M. Endomorphisms for Faster Elliptic Curve Cryptography on a Large Class of Curves. In Proceedings of 28th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Cologne, Germany, 26–30 April 2009; Springer: Heidelberg/Berlin, Germany, 2009; pp. 518–535.
  13. Longa, P.; Gebotys, C. Efficient Techniques for High-Speed Elliptic Curve Cryptography. In Proceedings of 12th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–20 August 2010; Springer: Heidelberg/Berlin, Germany, 2010; pp. 80–94.
  14. Longa, P.; Sica, F. Four-Dimensional Gallant–Lambert–Vanstone Scalar Multiplication. In Proceedings of 18th International Conference on the Theory and Application of Cryptology and Information Security, Beijing, China, 2–6 December 2012; Springer: Heidelberg/Berlin, Germany, 2012; pp. 718–739.
  15. Hu, Z.; Longa, P.; Xu, M. Implementing the 4-dimensional GLV method on GLS elliptic curves with j-invariant 0. Des. Codes Cryptogr. 2012, 63, 331–343.
  16. Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. Fast cryptography in genus 2. In Proceedings of 32nd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Athens, Greece, 26–30 May 2013; Springer: Heidelberg/Berlin, Germany; pp. 194–210.
  17. Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. High-Performance Scalar Multiplication Using 8-Dimensional GLV/GLS Decomposition. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 331–348.
  18. Oliveira, T.; López, J.; Aranha, D.F.; Rodríguez-Henríquez, F. Lambda Coordinates for Binary Elliptic Curves. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 311–330.
  19. Guillevic, A.; Ionica, S. Four-Dimensional GLV via the Weil Restriction. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 79–96.
  20. Smith, B. Families of Fast Elliptic Curves from Q-Curves. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 61–78.
  21. Costello, C.; Longa, P. FourQ: Four-Dimensional Decompositions on a Q-Curve over the Mersenne Prime. In Proceedings of 21st International Conference on the Theory and Application of Cryptology and Information Security, Auckland, New Zealand, 29 November–3 December 2015; Springer: Heidelberg/Berlin, Germany, 2015; pp. 214–235.
  22. Longa, P. FourQNEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors. IACR Cryptol. ePrint Arch. 2016, 2016, 645.
  23. Järvinen, K.; Miele, A.; Azarderakhsh, R.; Longa, P. FourQ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 517–537.
  24. Liu, Z.; Longa, P.; Pereira, G.C.; Reparaz, O.; Seo, H. FourQ on Embedded Devices with Strong Countermeasures Against Side-Channel Attacks. In Proceedings of 19th International Workshop on Cryptographic Hardware and Embedded Systems, Taipei, Taiwan, 25–28 September 2017; Springer: Heidelberg/Berlin, Germany, 2017; pp. 665–686.
  25. Faz-Hernández, A.; Longa, P.; Sánchez, A.H. Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV–GLS curves (extended version). J. Cryptogr. Eng. 2015, 5, 31–52.
  26. Kocher, P.C. Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In Proceedings of 16th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 1996; Springer: Heidelberg/Berlin, Germany; pp. 104–113.
  27. Page, D. Theoretical use of cache memory as a cryptanalytic side-channel. IACR Cryptol. ePrint Arch. 2002, 2002, 169.
  28. Edwards, H. A normal form for elliptic curves. Bull. Am. Math. Soc. 2007, 44, 393–422.
  29. Bernstein, D.J.; Lange, T. Faster addition and doubling on elliptic curves. In Proceedings of 13th International Conference on the Theory and Application of Cryptology and Information Security, Kuching, Malaysia, 2–6 December 2007; Springer: Heidelberg/Berlin, Germany; pp. 29–50.
  30. Hisil, H.; Wong, K.K.H.; Carter, G.; Dawson, E. Twisted Edwards curves revisited. In Proceedings of 14th International Conference on the Theory and Application of Cryptology and Information Security, Melbourne, Australia, 7–11 December 2008; Springer: Heidelberg/Berlin, Germany; pp. 326–343.
  31. Hankerson, D.; Menezes, A.J.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer Science & Business Media: Heidelberg/Berlin, Germany, 2006.
  32. Yanık, T.; Savaş, E.; Koç, Ç.K. Incomplete reduction in modular arithmetic. IEE Proc. Comput. Digit. Tech. 2002, 149, 46–52.
  33. Microchip. 8/16-Bit AVR XMEGA A3 Microcontroller. Available online: http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8068-8-and16-bit-AVR-XMEGA-A3-Microcontrollers_Datasheet.pdf (accessed on 26 February 2018).
  34. Hutter, M.; Schwabe, P. Multiprecision multiplication on AVR revisited. J. Cryptogr. Eng. 2015, 5, 201–214.
  35. Seo, H.; Liu, Z.; Choi, J.; Kim, H. Multi-Precision Squaring for Public-Key Cryptography on Embedded Microprocessors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 227–243.
  36. Texas Instruments. MSP430FR59xx Mixed-Signal Microcontrollers. Available online: http://www.ti.com/lit/ds/symlink/msp430fr5969.pdf (accessed on 26 February 2018).
  37. STMicroelectronics. UM1472: Discovery kit with STM32F407VG MCU. Available online: http://www.st.com/content/ccc/resource/technical/document/user_manual/70/fe/4a/3f/e7/e1/4f/7d/DM00039084.pdf/files/DM00039084.pdf/jcr:content/translations/en.DM00039084.pdf (accessed on 26 February 2018).
  38. FourQlib library. Available online: https://github.com/Microsoft/FourQlib (accessed on 10 March 2018).
  39. Babai, L. On Lovász' lattice reduction and the nearest lattice point problem. Combinatorica 1986, 6, 1–13.
  40. Park, Y.H.; Jeong, S.; Lim, J. Speeding Up Point Multiplication on Hyperelliptic Curves With Efficiently-Computable Endomorphisms. In Proceedings of International Conference on the Theory and Applications of Cryptographic Techniques, Amsterdam, The Netherlands, 28 April–2 May 2002; Springer: Heidelberg/Berlin, Germany, 2002; pp. 197–208.
  41. Hamburg, M. Fast and compact elliptic-curve cryptography. IACR Cryptol. ePrint Arch. 2012, 2012, 309.
  42. Wenger, E.; Werner, M. Evaluating 16-bit processors for elliptic curve cryptography. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Leuven, Belgium, 14–16 September 2011; Springer: Heidelberg/Berlin, Germany; pp. 166–181.
  43. Wenger, E.; Unterluggauer, T.; Werner, M. 8/16/32 shades of elliptic curve cryptography on embedded processors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany; pp. 244–261.
  44. Hinterwälder, G.; Moradi, A.; Hutter, M.; Schwabe, P.; Paar, C. Full-size high-security ECC implementation on MSP430 microcontrollers. In Proceedings of Cryptology—LATINCRYPT 2014, Florianópolis, Brazil, 17–19 September 2014; pp. 31–47.
  45. Düll, M.; Haase, B.; Hinterwälder, G.; Hutter, M.; Paar, C.; Sánchez, A.H.; Schwabe, P. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptogr. 2015, 77, 493–514.
  46. De Santis, F.; Sigl, G. Towards Side-Channel Protected X25519 on ARM Cortex-M4 Processors. In Proceedings of Software performance enhancement for encryption and decryption, and benchmarking, Utrecht, The Netherlands, 19–21 October 2016.
  47. Renes, J.; Schwabe, P.; Smith, B.; Batina, L. μKummer: Efficient Hyperelliptic Signatures and Key Exchange on Microcontrollers. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 301–320.
  48. Hutter, M.; Schwabe, P. NaCl on 8-Bit AVR Microcontrollers. In Proceedings of Cryptology—AFRICACRYPT 2013, Cairo, Egypt, 22–24 June 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 156–172.
Figure 1. The implementation hierarchy of four-dimensional Gallant-Lambert-Vanstone and Galbraith-Lin-Scott (GLV-GLS) scalar multiplication.
Table 1. The operation counts of curve Ted127-glv4 using field arithmetic over F_{p^2} and operation counts for conversion into field arithmetic over F_p.

| Operation (Ted127-glv4) | I_2 | M_2 | S_2 | A_2 | M_1 | S_1 | A_1 | A_i |
|---|---|---|---|---|---|---|---|---|
| Compute endomorphisms | - | 13 | 2 | 11.5 | 43 | - | 66 | 30 |
| Precompute lookup table | - | 63 | - | 70 | 189 | - | 329 | 140 |
| Scalar decomposition | - | - | - | - | - | - | - | - |
| Scalar recoding | - | - | - | - | - | - | - | - |
| Main computation | - | 715 | 260 | 848 | 2665 | - | 4101 | 1950 |
| Normalization | 1 | 2 | - | - | 21 | 128 | 8 | 4 |
| Total Cost | 1 | 793 | 262 | 929.5 | 2918 | 128 | 4504 | 2124 |
Table 2. The operation counts of curve FourQ using field arithmetic over F_{p^2} and operation counts for conversion into field arithmetic over F_p.

| Operation (FourQ [21]) | I_2 | M_2 | S_2 | A_2 | M_1 | S_1 | A_1 | A_i |
|---|---|---|---|---|---|---|---|---|
| Compute endomorphisms | - | 73 | 27 | 59.5 | 273 | - | 365 | 200 |
| Precompute lookup table | - | 63 | - | 56 | 189 | - | 301 | 126 |
| Scalar decomposition | - | - | - | - | - | - | - | - |
| Scalar recoding | - | - | - | - | - | - | - | - |
| Main computation | - | 704 | 256 | 835 | 2,624 | - | 4038 | 1,920 |
| Normalization | 1 | 2 | - | - | 18 | 128 | 8 | 4 |
| Total Cost | 1 | 842 | 283 | 950.5 | 3,104 | 128 | 4712 | 2,250 |
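The M_1 columns of Tables 1 and 2 can be cross-checked against the M_2 and S_2 columns. The sketch below assumes that one F_{p^2} multiplication costs three F_p multiplications and one F_{p^2} squaring costs two. These conversion factors are inferred from the tabulated numbers rather than stated in this part of the text, so treat them as an assumption. Normalization is left out because its M_1 count also includes the field inversion.

```python
# (M2, S2, M1) per phase, copied from Tables 1 and 2.
# Normalization is omitted: its M1 count also covers the inversion.
ROWS = {
    "Ted127-glv4": {
        "compute endomorphisms": (13, 2, 43),
        "precompute lookup table": (63, 0, 189),
        "main computation": (715, 260, 2665),
    },
    "FourQ": {
        "compute endomorphisms": (73, 27, 273),
        "precompute lookup table": (63, 0, 189),
        "main computation": (704, 256, 2624),
    },
}

# Assumed conversion factors (inferred from the tables, not quoted from the text).
M1_PER_M2 = 3  # F_p multiplications per F_{p^2} multiplication
M1_PER_S2 = 2  # F_p multiplications per F_{p^2} squaring


def fp_multiplications(m2: int, s2: int) -> int:
    """F_p multiplication count implied by M2 multiplications and S2 squarings."""
    return M1_PER_M2 * m2 + M1_PER_S2 * s2


def check_rows(rows=ROWS) -> dict:
    """Map (curve, phase) to True when the tabulated M1 matches the model."""
    return {
        (curve, phase): fp_multiplications(m2, s2) == m1
        for curve, phases in rows.items()
        for phase, (m2, s2, m1) in phases.items()
    }


if __name__ == "__main__":
    for key, ok in check_rows().items():
        print(key, "consistent" if ok else "MISMATCH")
```

Under these assumed factors, every listed row of both tables is reproduced exactly.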
Table 3. Cycle counts for field arithmetic on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors, including function call overhead.

| Field | Operation | 8-bit AVR, Ted127-glv4 (this work) | 8-bit AVR, FourQ [24] | 16-bit MSP430, Ted127-glv4 (this work) | 16-bit MSP430, FourQ [24] | 32-bit ARM, Ted127-glv4 (this work) | 32-bit ARM, FourQ [24] |
|---|---|---|---|---|---|---|---|
| F_p | Add | 198 | 155 | 120 | 102 | 55 | n/a |
| F_p | Sub | 196 | 159 | 126 | 101 | 55 | n/a |
| F_p | Sqr | 1221 | 1026 | 837 | 927 | 88 | n/a |
| F_p | Mul | 1796 | 1598 | 1087 | 1027 | 99 | n/a |
| F_p | Inv | 176,901 | 150,535 | 119,629 | 131,819 | 12,135 | n/a |
| F_{p^2} | Add | 452 | 384 | 266 | 233 | 82 | 84 |
| F_{p^2} | Sub | 448 | 385 | 278 | 231 | 82 | 86 |
| F_{p^2} | Sqr | 4093 | 3622 | 2476 | 2391 | 195 | 215 |
| F_{p^2} | Mul | 6277 | 5758 | 3806 | 3624 | 341 | 358 |
| F_{p^2} | Inv | 183,345 | 156,171 | 123,740 | 135,315 | 12,612 | 21,056 |
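As a rough plausibility check, the total F_{p^2} operation counts of Table 1 can be multiplied by the F_{p^2} cycle counts of Table 3 and compared with the measured scalar multiplication costs in Table 4. The sketch below is a first-order model and not the authors' cost analysis. It assumes every A_2 operation costs one F_{p^2} addition, and it ignores scalar decomposition, recoding, constant-time table lookups, and loop overhead, so it is expected to undershoot the measured figures.

```python
# Total F_{p^2} operation counts for Ted127-glv4 (Table 1, "Total Cost" row).
TOTAL_OPS = {"I2": 1, "M2": 793, "S2": 262, "A2": 929.5}

# F_{p^2} cycle counts for Ted127-glv4 (Table 3, "this work" columns).
# Assumption: every A2 operation is charged at the cost of one addition.
FP2_CYCLES = {
    "AVR": {"I2": 183_345, "M2": 6277, "S2": 4093, "A2": 452},
    "MSP430": {"I2": 123_740, "M2": 3806, "S2": 2476, "A2": 266},
    "ARM Cortex-M4": {"I2": 12_612, "M2": 341, "S2": 195, "A2": 82},
}

# Measured cycles for one variable-base scalar multiplication (Table 4).
MEASURED = {"AVR": 6_856_026, "MSP430": 4_158_453, "ARM Cortex-M4": 447_836}


def estimate_cycles(platform: str) -> float:
    """First-order estimate: sum of (operation count x cycles per operation)."""
    costs = FP2_CYCLES[platform]
    return sum(count * costs[op] for op, count in TOTAL_OPS.items())


def coverage(platform: str) -> float:
    """Fraction of the measured cost that the field-arithmetic model explains."""
    return estimate_cycles(platform) / MEASURED[platform]


if __name__ == "__main__":
    for platform in MEASURED:
        print(f"{platform}: estimate {estimate_cycles(platform):,.0f} cycles, "
              f"{100 * coverage(platform):.1f}% of measured")
```

Under these assumptions, field arithmetic alone accounts for most of the measured cycles on all three platforms, which is consistent with the focus on optimizing field operations.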
Table 4. Cycle counts and memory usage of variable-base scalar multiplication on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors.

| Platform | Implementation | Bit-length of curve order | Cost (cycles) | Code size (bytes) | Stack usage (bytes) |
|---|---|---|---|---|---|
| AVR | NIST P-256 [43] | 256 | 34,930,000 | 16,112 | 590 (a) |
| AVR | Curve25519 [48] | 252 | 22,791,579 | n/a | 677 |
| AVR | Curve25519 [45] | 252 | 13,900,397 | 17,710 | 494 |
| AVR | μKummer [47] | 250 | 9,513,536 | 9490 | 99 |
| AVR | FourQ [24] | 246 | 6,561,500 | n/a | n/a |
| AVR | Ted127-glv4 (this work) | 251 | 6,856,026 | 13,891 | 2539 |
| MSP430 | NIST P-256 [42] | 256 | 23,973,000 | n/a | n/a |
| MSP430 | NIST P-256 [43] | 256 | 22,170,000 | 8378 | 418 (a) |
| MSP430 | Curve25519 [44] | 252 | 9,139,739 | 11,778 | 513 |
| MSP430 | Curve25519 [45] | 252 | 7,933,296 | 13,112 | 384 |
| MSP430 | FourQ [24] | 246 | 4,280,400 | n/a | n/a |
| MSP430 | Ted127-glv4 (this work) | 251 | 4,158,453 | 9098 | 2568 |
| ARM Cortex-M4 | Curve25519 [46] | 252 | 1,423,667 | 3750 | 740 |
| ARM Cortex-M4 | FourQ [24] | 246 | 469,500 | n/a | n/a |
| ARM Cortex-M4 | Ted127-glv4 (this work) | 251 | 447,836 | 7532 | 2792 |

(a) Includes RAM and stack.

Share and Cite

MDPI and ACS Style

Kwon, J.; Seo, S.C.; Hong, S. Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers. Appl. Sci. 2018, 8, 900. https://doi.org/10.3390/app8060900
