Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit , 16-Bit , and 32-Bit Microcontrollers

In this paper, we present the first constant-time implementations of four-dimensional Gallant–Lambert–Vanstone and Galbraith–Lin–Scott (GLV-GLS) scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. In Asiacrypt 2012, Longa and Sica introduced the four-dimensional GLV-GLS scalar multiplication, and they reported the implementation results on Intel processors. However, they did not consider efficient implementations on resource-constrained embedded devices. We have optimized the performance of scalar multiplication using curve Ted127-glv4 on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. Our implementations compute a variable-base scalar multiplication in 6,856,026, 4,158,453, and 447,836 cycles on AVR, MSP430, and ARM Cortex-M4 processors, respectively. Recently, FourQ-based scalar multiplication has provided the fastest implementation results on AVR, MSP430, and ARM Cortex-M4 processors to date. Compared to FourQ-based scalar multiplication, the proposed implementations require 4.49% more computational cost on AVR, but save 2.85% and 4.61% cycles on MSP430 and ARM, respectively. Our 16-bit and 32-bit implementation results set new speed records for variable-base scalar multiplication.


Introduction
Wireless sensor networks (WSNs) are wireless networks consisting of a large number of resource-constrained sensor nodes, where each node is equipped with a sensor to monitor physical phenomena, such as temperature, light, and pressure.The main features of WSNs are resource constraints, such as storage, computing power, and sensing distance.Recently, the energy consumption of data centers has attracted attention because of the fast growth of data throughput.WSNs can provide a solution for data collection and data processing in various applications including data center monitoring.That is, WSNs can be utilized for data center monitoring to improve the efficiency of energy consumption.Several solutions were proposed to solve this problem [1,2].
Since sensor nodes are usually deployed in remote areas and left unattended, they can be led to network security issues, such as node capture, eavesdropping, and message tampering during data communication.Additionally, many application areas of WSNs require data confidentiality, integrity, authentication, and non-repudiation, meaning there is a need for an efficient cryptographic mechanism to satisfy current security requirements.However, due to the constraint of WSNs, it is difficult to utilize the conventional cryptographic algorithms.Therefore, efficient cryptographic algorithms considering code size, computation time, and power consumption are required for the security of WSNs.
In 1985, elliptic curve cryptography (ECC) was proposed independently of public key cryptography (PKC) by Miller and Koblitz [3,4].ECC is mainly used for digital signature and key exchange based on the elliptic curve discrete logarithm problem (ECDLP), which is defined by elliptic curve point operations in a finite field.ECC provides the same security level with a smaller key size compared to existing PKC algorithms such as Rivest-Shamir-Adleman (RSA) cryptosystem [5].For example, ECC over F p with a 256-bit prime p provides an equivalent security level as RSA using 3072-bit key.Because RSA uses a small integer as the public key, RSA public key operations can be efficiently computed.However, RSA private key operations are extremely slower than ECC, therefore they have limited use in the applications of WSNs.Therefore, ECC can be efficiently utilized than RSA for resource-constrained WSNs devices, such as smart cards and sensor nodes.
However, recently proposed manipulation and backdoors have raised the suspicion of weakness in previous ECC standards.In particular, the National Institute of Standards and Technology (NIST) P-224 curve is not secure against twist attacks, which are the combined attacks that use the small-subgroup attacks and the invalid-curve attacks using the twist of curve [6].The dual elliptic curve deterministic random bit generator (Dual_EC_DRBG) is a pseudo-random number generator (PRNG) standardized in NIST SP 800-90A.However, the revised version of NIST SP 800-90A standard removes Dual_EC_DRBG because this algorithm contains a backdoor for the national security agency (NSA) [7].
Therefore, the demand for next generation elliptic curves has increased.Specific examples of such curves are Curve25519, Ed448-Goldilocks, and twisted Edwards curves [8][9][10].The main features of these curves are the selection of efficient parameters.The Curve25519 utilizes a prime of the form p = 2 255 − 19 and a fast Montgomery elliptic curve.The Ed448-Goldilocks curve utilizes a Solinas trinomial prime of the form p = 2 448 − 2 224 − 1, which provides fast field arithmetic on both 32-bit and 64-bit machines because 224 = 28 × 8 = 32 × 7 = 56 × 4.These parameters can accelerate the performance of ECC-based protocols.The details of the twisted Edwards curves can be found in Section 2. 3.
Scalar multiplicationor point multiplication computes an operation kP using an elliptic curve point P and a scalar k.This operation determines the performance of ECC.Therefore, many researchers have proposed various methods to improve the efficiency of scalar multiplication.The speed-up methods for scalar multiplication can be classified into three types: methods based on speeding up the finite field exponentiation, such as comb techniques and windowing methods, scalar recoding methods, and methods that are particular to elliptic curve scalar multiplication [11].
Speed-up methods using efficiently computable endomorphisms are one type of method that are particular to elliptic curve scalar multiplication.The Gallant-Lambert-Vanstone (GLV) method proposed by Gallant et al. is a method for accelerating scalar multiplication by using efficiently computable endomorphisms [11].If the cost of computing endomorphism is less than (bit-length of curve order)/3 elliptic curve point doubling (ECDBL) operations, then this method has a computational advantage.Their method reduces about half of the ECDBL operations and saves the costs of scalar multiplication by roughly 33%.Additionally, recent studies have reported that scalar multiplication methods using efficiently computable endomorphisms are significantly faster than generalized methods.The Galbraith-Lin-Scott (GLS) curves proposed by Galbraith et al. constructed an efficiently computable endomorphism for elliptic curves defined over F p 2 , where p is a prime number [12].They demonstrated that the GLV method can efficiently compute scalar multiplication on such curves.Longa and Gebotys [13] presented an efficient implementation of two-dimensional GLS curves over F p 2 .
In 2012, Longa and Sica [14] proposed four-dimensional GLV-GLS curves over F p 2 , which generalized the GLV method and GLS curves.Hu et al. [15] proposed a GLV-GLS curve over F p 2 , which supports the four-dimensional scalar decomposition.They reported the implementation results indicating that the four-dimensional GLV-GLS scalar multiplication reduces at most 22% of computational cost than the two-dimensional GLV method.Bos et al. [16] proposed two-and four-dimensional scalar decompositions over genus 2 curves defined over F p 2 .Bos et al. [17] introduced an eight-dimensional GLV-GLS method over genus 2 curves defined over F p 2 .Oliveira et al. [18] presented the implementation results of a two-dimensional GLV method over binary GLS elliptic curves defined over F 2 254 .Guillevic and Ionica [19] utilized the four-dimensional GLV method on genus 1 curves defined over F p 2 and genus 2 curves defined over F p .Smith [20] proposed a new family of elliptic curves over F p 2 , called "Q-curves".Costello and Longa [21] introduced a four-dimensional Q curve defined over F p 2 , called "FourQ".They reported the implementation results of FourQ on various Intel and AMD processors.
In the case of curve Ted127-glv4, Longa and Sica and Faz-Hernández et al. [14,25] reported the implementation results on high-end processors, such as Intel Sandy Bridge, Intel Ivy Bridge, and ARM Cortex-A processors.However, efficient implementations on resource-constrained embedded devices have not been considered to date.Therefore, we focused on optimized implementations of scalar multiplication using curve Ted127-glv4 on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors, respectively.
Our main contributions can be summarized as follows: • We present efficient implementations at each level of the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication considering the features of 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors.To improve the performance of scalar multiplication, we carefully selected the internal algorithms at each level of the implementation hierarchy.These implementations also run in constant time to resist timing and cache-timing attacks [26,27].

•
We demonstrate that the efficiently computable endomorphisms can accelerate the performance of four-dimensional GLV-GLS scalar multiplication.For this purpose, we analyze the operation counts of two elliptic curves "Ted127-glv4" and "FourQ", which support the four-dimensional GLV-GLS scalar multiplication.The GLV-GLS curve Ted127-glv4 requires fewer number of field arithmetic operations than FourQ-based implementation to compute a single variable-base scalar multiplication.However, because FourQ uses a Mersenne prime p = 2 127 − 1 and the curve Ted127-glv4 uses a Mersenne-like prime p = 2 127 − 5997, FourQ has a computational advantage of faster field arithmetic operations.By using the computational advantage of endomorphisms, we overcome the computational disadvantage of curve Ted127-glv4 at field arithmetic level.

•
We present the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on three target platforms, which have not been considered in previous works.The proposed implementations on AVR, MSP430, and ARM processors require 6,856,026, 4,158,453, and 447,836 cycles to compute a single variable-base scalar multiplication, respectively.Compared to FourQ-based implementations [24], which have provided the fastest results to date, our results are 4.49% slower on AVR, but 2.85% and 4.61% faster on MSP430 and ARM, respectively.Our MSP430 and ARM implementations set new speed records for variable-base scalar multiplication.
The remainder of this paper is organized as follows.Section 2 describes preliminaries regarding ECC and its speed-up techniques, including the GLV and GLS methods.Section 3 presents a review of four-dimensional GLV-GLS scalar multiplication and its implementation hierarchy.Section 4 describes the implementation details of field arithmetic and optimization methods for the target platforms.Section 5 describes optimization methods for ECC in terms of point arithmetic and scalar multiplication.Experimental results and a comparison of our work to previous ECC implementations on AVR, MSP430, and ARM processors are presented in Section 6.Finally, we conclude this paper in Section 7.

Preliminaries
In Section 2.1, we describe the field representation and notations used for the remainder of this paper.We briefly describe ECC using a short Weierstrass curve and its group law in Section 2.2.We also describe twisted Edwards curves, which are the target of our implementation, in Section 2.3.In Section 2.4, we describe the GLV-GLS method including the GLV method and GLS curves.

Field Representation and Notations
We assume that the target platform has a w-bit architecture.Let n = log 2 p be the bit-length of a Mersenne-like prime p = 2 n − c, where c is small.Let m = n/w be its word-length.Then, an arbitrary element a ∈ F p is represented by an array (a m−1 , • • • , a 2 , a 1 , a 0 ) of m w-bit words.The notations M 1 , S 1 , I 1 , and A 1 represent multiplication, squaring, inversion, and addition (subtraction) over F p , respectively.Similarly, the notations M 2 , S 2 , I 2 , and A 2 represent multiplication, squaring, inversion, and addition (subtraction) over F p 2 , respectively.The notation A i represents multi-precision addition without modular reduction and the notation M d represents multiplication with a curve parameter.

Elliptic Curve Cryptography
Let F q be a finite field with odd characteristic.An elliptic curve E over F q is defined by a short Weierstrass equation of the following form: where a, b ∈ F q and 4a 3 + 27b 2 = 0.
Because the most important operation in ECC is scalar multiplication kP, it must be implemented efficiently.The basic method for computing kP is comprised of two elliptic curve operations: elliptic curve point addition (ECADD) and the ECDBL operations.Let P = (x 1 , y 1 ) and Q = (x 2 , y 2 ) be two points on an elliptic curve E. The ECADD and ECDBL operations can be computed in affine coordinates as follows: The ECADD and ECDBL operations are composed of finite field arithmetic operations, such as field addition, subtraction, multiplication, squaring, and inversion.Therefore, to improve the performance of scalar multiplication, the internal algorithms such as field and curve arithmetic operations should be efficiently implemented.

Twisted Edwards Curves
The Edwards curves are a normal form of elliptic curves introduced by Edwards [28].Bernstein and Lange [29] introduced Edwards curves defined by x 2 + y 2 = c 2 (1 + dx 2 y 2 ), where c, d ∈ F q with cd(1 − dc 4 ) = 0.In 2007, Bernstein et al. [10] introduced twisted Edwards curves, which are a generalization of Edwards curves defined by where a, d ∈ F q with ad(a − d) = 0.The Edwards curves are a special case of twisted Edwards curves with a = 1.The point (0, 1) is the identity element and the point (0, −1) has order two.The point (1,0) and (−1, 0) have order four.The negative of a point P = (x 1 , y 1 ) is −P = (−x 1 , y 1 ).The ECADD operation of two points P = (x 1 , y 1 ) and Q = (x 2 , y 2 ) on a twisted Edwards curve E is defined as follows: Because the addition law is unified, it can be used for computing the ECDBL operation.Suppose that two points P and Q have an odd order.Then, the denominators of the addition formula 1 + dx 1 y 1 x 2 y 2 and 1 − dx 1 y 1 x 2 y 2 are nonzero.Therefore, the doubling formula can be obtained as follows: Two relationships can be obtained by considering the curve equation: After straightforward elimination, the curve parameters a and d can be represented by x 1 , x 2 , y 1 , and y 2 .Substitutions in the unified addition formula yield the addition formula as follows: These addition and doubling formulas are used in the dedicated addition and doubling formulas described in Section 5.The features of these formulas are independent of the curve parameter d [30].

The GLV-GLS Method
We will now describe the GLV method to explain the GLV-GLS method.Let E be an elliptic curve defined over a finite field F q .An endomorphism φ of E over F q is a rational map φ : E → E such that φ(O) = O and φ(P) = (g(P), h(P)) for all points P ∈ E, where g and h are rational functions and O is a point at infinity.An endomorphism φ is a group homomorphism, defined as φ(P 1 + P 2 ) = φ(P 1 ) + φ(P 2 ) for all P 1 , P 2 ∈ E.
Suppose that #E(F q ) contains a subgroup of order r and let φ be an efficiently computable endomorphism on E such that φ(P) = λP for some 1 ≤ λ ≤ r − 1.The GLV method computes the integers k 0 and k 1 such that k = k 0 + k 1 λ mod r for scalar multiplication kP.Because kP =k 0 P + k 1 λP =k 0 P + k 1 φ(P), scalar multiplication kP can be computed by computing φ(P) and then using multiple scalar multiplications [31].This is because the multi-scalars k 0 and k 1 have approximately half the bit-length of the scalar k.The efficiency of the GLV method depends on scalar decomposition and the efficiency of computing endomorphism φ.
The main concept of the GLS curves is described as follows: Let E /F q 2 be the quadratic twists of E/F q 2 [12].Let ψ be the quadratic twist map and π be the q-th Frobenius endomorphism.Then, we can obtain the efficiently computable endomorphism φ = ψ • π • ψ −1 , which satisfies the equation X 2 + 1 = 0 if p ≡ 5 (mod 8).However, GLS curves only work for elliptic curves over F q m with m > 1.
As mentioned in the introduction, the GLV-GLS method is the generalized method of the GLV method and GLS curves.Let φ and ψ be two efficiently computable endomorphisms over F p 2 and P be a point of prime order r.Then, the four-dimensional scalar multiplication kP for any scalar k ∈ [1, r] can be computed as follows: where max i (|k i |) < Cr 1/4 for 0 ≤ i ≤ 3 and C is some explicit constant.The details of internal algorithms of the four-dimensional scalar multiplication can be found in Sections 4 and 5.

Review of Four-Dimensional GLV-GLS Scalar Multiplication
The curve Ted127-glv4 was introduced by Longa and Sica [14].It is based on twisted Edwards curves and has efficiently computable endomorphisms, which facilitates the four-dimensional GLV-GLS scalar multiplication.The parameters of curve Ted127-glv4 are as follows: 3 .The curve Ted127-glv4 contains two efficiently computable endomorphisms φ and ψ defined over F p 2 as follows: where ζ 8 = u/ √ 2 is a primitive eighth root of unity.It can be verified that φ 2 + 2 = 0 and ψ 2 + 1 = 0. Let P be a point in E/F p 2 and k be a random scalar in the range [1, r].Algorithm 1 outlines variable-base scalar multiplication using curve Ted127-glv4 and four-dimensional decompositions.
Steps 1 and 2 in Algorithm 1 compute three endomorphisms φ(P), ψ(P), and ψ(φ(P)), and then compute the eight points T 65 , where 0 ≤ i ≤ 3.For constant-time implementation, the multi-scalars (k 0 , k 1 , k 2 , k 3 ) must guarantee the same number of iterations of the main computation.Because all coordinates of scalar decomposition are less than 2 65 , we apply the scalar recoding algorithm to guarantee a fixed loop length for the main computation at step 4 [25].The result of the scalar recoding is represented by 66 lookup table indices d i and 66 masks m i , where 0 ≤ i ≤ 65.Steps 5 to 9 represent the main computation stage, including point loading, the ECADD operation, and the ECDBL operation.The result of the main computation is converted from an extensible coordinates to the affine coordinates in step 10.Therefore, a variable-base scalar multiplication using curve Ted127-glv4 requires one φ(P) endomorphism, two ψ(P) endomorphisms, and seven ECADD operations in the precomputation; 65 table lookups, 65 ECADD operations, and 65 ECDBL operations in the main computation; and one inversion and two field multiplications over F p 2 for point normalization.
Figure 1 describes the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication and its internal algorithms.Because the implementation algorithms at each level affect the performance of scalar multiplication, we carefully choose proper algorithms considering the features of AVR, MSP430, and ARM processors.Additionally, field arithmetic over F p 2 and curve arithmetic are comprised of field arithmetic over F p , which is the computationally primary operations.Therefore, field arithmetic over F p is written at the assembly level.

Implementation Details of Field Arithmetic
In this section, we describe the implementation details of field arithmetic on AVR, MSP430X, and ARM Cortex-M4 processors using a Mersenne-like prime of the form p = 2 127 − 5997.We describe the field arithmetic algorithms that are commonly used in three target platforms in Sections 4.1-4.4.In Sections 4.5-4.7,we describe our optimization strategy for field arithmetic on AVR, MSP430, and ARM processors, respectively.

Field Addition and Subtraction over F p
The curve Ted127-glv4 uses a Mersenne-like prime of the form p = 2 127 − 5997.An efficient field addition/subtraction method for this scenario was proposed by Bos et al. [16].Let 0 ≤ a, b < p = 2 127 − 5997.Field addition over F p can be computed by Because a + 5997 < 2 127 , addition does not require carry propagation.Note that subtraction with carry • 2 127 can be efficiently implemented by clearing the 128-th bit of (a + 5997) + b.
Similar to field addition, field subtraction over F p can be computed by c = a − b (mod p) = (a − b) + borrow • 2 127 − borrow • 5997, where borrow = 0 if a ≥ b, otherwise, borrow = 1.Addition with borrow • 2 127 can be implemented by clearing the 128-th bit of a − b.

Modular Reduction
To use primes of a special form may result in a faster reduction method [31].The NIST recommends five primes for the elliptic curve digital signature algorithm (ECDSA).These primes can be represented as the sums or differences of powers of two and facilitate the fast reduction method.The curve Ted127-glv4 uses a Mersenne-like prime of the form p = 2 127 − 5997.Therefore, modular reduction can be efficiently computed by using a NIST-like reduction method [16].Let 0 ≤ a, b ≤ p = 2 127 − 5997.We compute c = a • b = 2 128 c h + c l , where 0 ≤ c h , c l < 2 128 .The first reduction step can be computed by c ≡ c l + 2 • 5997 • c h .Then, the second reduction step can be computed by

Inversion over F p
For the field inversion a −1 (mod p), we use the fact that a −1 = a p−2 (mod p) in Fermat's little theorem (in our case, a p−2 (mod p) = a 2 127 −5999 (mod p)).This method can be implemented by modular exponentiation using fixed addition chains and guarantees constant-time execution requiring 13M 1 + 126S 1 operations.

Field Arithmetic over F p 2
The incomplete reduction method proposed by Yanık et al. [32] is one of the optimization methods in field arithmetic over F p 2 .Given two elements a, b ∈ [0, p − 1], the result of operations stays in the range [0, 2 m − 1], where p < 2 m < 2p − 1 and m is a fixed integer (in our case, m = 128).Because the modulus of curve Ted127-glv4 is a Mersenne-like prime of the form p = 2 127 − 5997, the incomplete reduction method can be applied more advantageously.
Let a = a 0 + a 1 i and b = b 0 + b 1 i be two arbitrary elements in a finite field F p 2 .Field addition and subtraction over F p 2 can be computed by a ).We utilize Karatsuba multiplication to compute field multiplication over F p 2 .The Karatsuba multiplication uses the fact that a , which can be computed by 3M 1 + 3A 1 + 2A i operations.It requires 1A 1 + 2A i more operations but saves 1M 1 operations compared to general multiplication methods, which require 4M 1 + 2A 1 operations.Because field multiplication requires more computational cost than the multi-precision addition and field addition, the Karatsuba multiplication has a computational advantage.Algorithm 2 describes field multiplication over F p 2 using the Karatsuba multiplication and the incomplete reduction method.
Algorithm 3 describes field squaring over F p 2 using the incomplete reduction method.Note that The first representation can be computed by 1M 1 + 2S 1 + 1A 1 + 1A i operations, and the remaining representation can be computed by 2M 1 + 1A 1 + 2A i operations.Because 1M 1 operation can be implemented faster than 2S 2 operations, we use 2M 1 + 1A 1 + 2A i operations to compute field squaring over F p 2 .The results of steps 3 and 4 in Algorithm 2 and steps 1 and 3 in Algorithm 3 were represented by the incompletely reduced form.

Optimization Strategy on 8-Bit AVR
The AVR processor is a family of 8-bit microcontrollers that is widely used in MICA2/MICAz sensor motes.The AVR processors are equipped with an 8-bit integer multiplier and register file with 32×8-bit general registers that are numbered from R0 to R31.Registers R26 : R27, R28 : R29, and R30 : R31 pairs are used as 16-bit indirect address registers called X, Y, and Z.The automatic increment and decrement addressing modes are supported on all X, Y, and Z registers, and Y and Z support fixed positive displacement.R0 and R1 registers store the 16-bit results of 8 × 8-bit multiplication.The AVR processors provide a typical 8-bit reduced instruction set computer (RISC) instruction set.The most important instructions for ECC are 8 × 8-bit multiplication (MUL) and memory access (LD, ST) instructions, which require two cycles.Instructions between two registers, such as addition (ADD, ADC) or subtraction (SUB, SBC), require only one cycle.Therefore, the basic optimization strategy on 8-bit AVR is reducing the number of memory access instructions.
To simulate our implementations, we targeted the ATxmega256A3 processor [33].This processor can be clocked up to 32 MHz and provides 256 KB of programmable flash memory, 16 KB of SRAM, and 4 KB of EEPROM.
Recently, Hutter and Schwabe [34] proposed a highly optimized Karatsuba multiplication for the 8-bit AVR processor.There are two variants of the Karatsuba multiplication method: the additive Karatsuba and subtractive Karatsuba methods.Algorithm 4 outlines the subtractive Karatsuba multiplication.We consider n × n-bit multiplication, where n is even and k = n/2 (in our case, n = 128 and k = 64).The additive Karatsuba method can be computed similarly to Algorithm 4.However, the additive Karatsuba method may produce the carry bits in the addition of two numbers (a l + a h ) and (b l + b h ).The additional multiplication using the carry bits incurs a significant overhead for integer multiplication.The subtractive Karatsuba method does not produce carry bits in the computation of M, but computes two absolute values |a l − a h | and |b l − b h |.This overhead is not only smaller than the overhead required for the additive Karatsuba method, but can also be executed in constant-time.Therefore, we chose and implemented the subtractive Karatsuba multiplication for the 8-bit AVR implementation.
For integer squaring, we chose the sliding block doubling (SBD) method [35], which is more efficient than the subtractive Karatsuba method in the case of 128-bit operands on 8-bit AVR.To improve the performance of field arithmetic, we combined integer multiplication and squaring with modular reduction.

Optimization Strategy on 16-Bit MSP430X
The MSP430X processor was designed as an ultra-low power microcontroller based on the 16-bit RISC CPU.The MSP430X CPU has 16 20-bit registers that are numbered from R0 to R15.Registers R0 to R3 are special-purpose registers that are used as the program counter, stack pointer, status register, and constant generator, respectively.Registers R4 to R15 are general-purpose registers that are used to store data values, address pointers, and index values.
The MSP430X instruction set does not include multiply and multiply-and-accumulate (MAC) instructions.Instead, the MSP430 family is equipped with a memory-mapped hardware multiplier.The hardware multiplier provides four different multiply operations (unsigned multiplication, signed multiplication, unsigned multiplication and accumulation, and signed multiplication and accumulation) for the first operand, called MPY, MPYS, MAC, and MACS.The second operand register is common to all multiplier modes, called OP2.Namely, the first operand determines the operation type of the multiplier, but does not start the operation.Writing the second operand to the OP2 register starts the selected multiplication with two values.The multiplication result is written in three result registers RESLO, RESHI, and SUMEXT.RESLO stores the lower 16-bit of the result, RESHI stores the upper 16-bit of the result, and SUMEXT stores the carry bit or sign of the result.
The MSP430X processor provides seven addressing modes for the source operand and four addressing modes for the destination operand.The total computation time depends on the instruction format and the addressing modes for the operand.Instructions between two CPU registers only require one cycle.However, memory access instruction (MOV) requires two to six cycles depending on addressing modes of operands.To improve the performance of field arithmetic, reducing the number of memory access instructions and efficiently utilizing MAC operations are the basic optimization strategies.
In our implementations, we targeted the MSP430FR5969 processor [36].This processor is equipped with 64 KB of program flash memory and 2 KB of RAM and can be clocked up to 16 MHz.
For integer multiplication on 16-bit MSP430X processor, we chose and implemented the product scanning multiplication.Algorithm 5 outlines the product scanning method for multi-precision multiplication.The first loop in Algorithm 5 computes the lower half of the multiplication result c, and the second loop computes the upper half of the result c.It accumulates partial multiplications of the inner loop a j × b i−j and these operations can be efficiently computed using the MAC operations of the hardware multiplier.Specifically, two 16-bit operands are multiplied and the results are added to the intermediate value s, which is held in RESLO, RESHI, and SUMEXT.
In FourQ [24], integer squaring was implemented using the SBD method [35].We utilize the product scanning method for 128-bit integer squaring on 16-bit MSP430X.It can be easily implemented by modifying the product scanning multiplication.Additionally, this method results in better performance than the SBD method in FourQ.The implementation results can be found in Section 6.2.
for j from 0 to i do end for 6: s ← s/2 w 8: end for 9: for i from m to 2m − 2 do 10: end for 13:

Optimization Strategy on 32-Bit ARM
The ARM Cortex-M is a family of 32-bit RISC ARM processors for microcontrollers.The Cortex-M4 processor is a high-performance Cortex-M processor with digital signal processing (DSP), SIMD, and MAC instructions.It based on the ARMv7-M architecture and equipped with 16 32-bit general registers that are numbered from R0 to R15.Registers R13 to R15 are special-purpose registers that are used for the stack pointer (SP), link register (LR), and program counter (PC), respectively.The Cortex-M4 instruction set provides multiply and MAC instructions, such as UMULL, UMLAL, and UMAAL.The UMULL instruction multiplies two unsigned 32-bit operands to obtain a 64-bit result.The UMLAL and UMAAL instructions multiply two unsigned 32-bit operands and accumulate a single 64-bit value and two 32-bit values.
In our implementations, we used the STM32F407-DISC1 board, which contains a 32-bit ARM Cortex-M4 STM32F407VGT6 microcontroller [37].This microcontroller is equipped with 1 MB of flash memory, 192 KB of SRAM, and 64 KB of core-coupled memory (CCM) data RAM and can be clocked up to 168 MHz.
For integer multiplication and squaring, we implemented the operand scanning method by using efficient MAC operations.Additionally, these MAC operations facilitate an efficient implementation of modular reduction.The first reduction computes c ≡ c l + 2 • 5997 • c h , where 0 ≤ c h , c l < 2 128  The results of the first reduction c are held in (R5, R4, R6, R7, R8).The second reduction can be computed using simple multiplication (MUL) and addition (ADD, ADC) instructions.
For the further improvement of field arithmetic, we implemented field arithmetic over F p 2 at the assembly level [24,38].In the case of field multiplication over F p 2 , we utilized the operand scanning multiplication with a lazy reduction method.This operation computes a The operand scanning method results in better performance than the Karatsuba multiplication in our case.The field squaring over F p 2 is implemented using a 2 = (a 0 + a 1 )(a 0 − a 1 ) + 2a 0 a 1 i at the assembly level.

Implementation Details of Curve Arithmetic
In this section, we describe the scalar decomposition and curve arithmetic that are commonly used on three target platforms.Section 5.1 describes the scalar decomposition and recoding methods for multi-scalars.The details of point arithmetic, coordinate system, and endomorphisms are described in Sections 5.2 and 5.3.

Scalar Decomposition
In this subsection, we describe the scalar decomposition method for a random integer k ∈ [1, r] and corresponding multi-scalars (k 0 , k 1 , k 2 , k 3 ) ∈ Z 4 such that k ≡ k 0 + k 1 φ + k 2 ψ + k 3 ψφ as max(k i ) < Cr 1/4 for 0 ≤ i ≤ 3 and some explicit constant C > 0. We assume that φ ≡ λ (mod r) and ψ ≡ µ (mod r).Let F be a four-dimensional GLV-GLS reduction map defined by b 3 ) be a 4 × 4 matrix consisting of four linearly independent vectors with max i |b i | ≤ Cr 1/4 .Then, for any k ∈ [1, r − 1], the decomposition method computes (α 0 , α 1 , α 2 , α 3 ) ∈ Q 4 and computes the multi-scalars where represents a rounding operation.There are two typical methods for decomposing a scalar: the Babai rounding method [39] and division in a ring Z[φ] method, where φ is an efficiently computable endomorphism [40].In [14], lattice reduction algorithms based on Cornacchia's algorithms were proposed for finding a uniform basis.The first step is finding Cornacchia's GCD in Z and the second step is using the Cornacchia's algorithm in Z[i].We utilize these two algorithms to find four linearly independent vectors b 0 , b 1 , b 2 , b 3 ∈ kerF, where the rectangle norms < 51.5 √ 2r 1/4 .The coordinates of these vectors utilize the scalar decomposition.Additionally, the relationships of four vectors can reduce the number of fixed constants.Two vectors b Let B i be the 4 × 4 matrix formed by replacing b i in B with the vector (1, 0, 0, 0).We then define four precomputed constants h i = det(B i ), where 0 ≤ i ≤ 3. The four-dimensional decomposition computes α i = k•h i r using four integer multiplication, four integer divisions, and four rounding operations.Bos et al. [17] introduced an efficient rounding method for eliminating integer divisions.This method chooses an integer m such that r < 2 m , and precomputes the fixed constants l i = h i r • 2 m .Then, α i can be computed by k•l i 2 m , where the division by 2 m can be computed by a shift operation.The four-dimensional decomposition of a random scalar k using curve Ted127-glv4 can be computed as follows: However, Ref. [21] reported that this method yields the correct answer and k•h i r − 1.They also reported that a large size of m decreases the probability of a round-off error.
Because the multi-scalars (k 0 , k 1 , k 2 , k 3 ) lie between −2 63 and 2 63 , all coordinates are both positive and negative.Signed multi-scalars require additional cost to compute scalar multiplication.Costello and Longa [21] demonstrated the offset vectors such that all coordinates of the multi-scalars were always positive to simplify scalar recoding.However, this odd-only scalar recoding method requires that the first element k 0 of the muli-scalars is always odd.For constant-time execution and odd-only recoding, they found two offset vectors c 1 and c 2 such that (k 0 , k 1 , k 2 , k 3 ) + c 1 and (k 0 , k 1 , k 2 , k 3 ) + c 2 are valid decompositions of the scalar k and one of the two multi-scalars had a first element that was odd.To utilize these methods for curve Ted127-glv4, we carefully chose two offset vectors c 1 = 2b 0 + b 1 − 3b 2 − 4b 3 and c 2 = 3b 0 + 2b 1 − 3b 2 − 2b 3 .The multi-scalars (k 0 , k 1 , k 2 , k 3 ) + c 1 and (k 0 , k 1 , k 2 , k 3 ) + c 2 are valid decompositions of the scalar k.Finally, all four coordinates of the two decompositions are positive and less than 2 65 , and k 0 in one of them is odd.
Because all coordinates of multi-scalars are less than 2 65 , scalar decomposition and recoding require more computational cost compared to FourQ-based implementation, which has coordinates of multi-scalars less than 2 64 .However, this additional cost is an extremely small portion of the scalar multiplication.

Point Arithmetic
To enhance the performance of scalar multiplication, the selections of efficient point arithmetic and coordinate system are one of the most crucial subjects.The extended Edwards coordinates of the form (X : Y : Z : T) were proposed by Hisil et al., where T = XY/Z [30].The extended Edwards coordinates are an extended version of the homogeneous coordinates of the form (X : Y : Z).The identity element is represented by (0 : 1 : 1 : 0) and the negative element of (X : Y : Z : T) is represented by (−X : Y : Z : −T).
Hamburg [41] proposed extensible coordinates of the form (X : Y : Z : T a : T b ), where T = T a • T b .The final step of the ECADD and ECDBL operations using extended Edwards coordinates computes T = T a • T b .However, the extensible coordinates store the coordinates T as T a and T b , and compute T when required for point arithmetic.For the further improvement of the ECADD operation, the precomputed point Q is represented in the form (X + Y, Y − X, 2Z, 2T) [25].This method eliminates two multiplication by 2 operations and two field additions over F p compared to the extended Edwards coordinates.In the case of the ECDBL operation, we utilize the transformation 2XY = (X + Y) 2 − X 2 − Y 2 to reduce the number of multiplications.It can be computed by converting one field multiplication and one field addition over F p 2 to one field squaring, two field subtractions over F p 2 .Algorithms 6 and 7 describe the extensible coordinates of the ECADD and ECDBL operations over F p 2 with a curve parameter a = −1, which require 8M 2 + 6A 2 and 3M 2 + 4S 2 + 6A 2 operations, respectively.
Algorithm 6: Twisted Edwards point addition over F p 2 .

Require
To demonstrate the efficiency of the twisted Edwards curves, we compare it to the cost of a short Weierstrass elliptic curve.The ECADD and ECDBL operations of a short Weierstrass curve of the form y 2 = x 3 + ax + b over F p 2 using Jacobian coordinates require 11M 2 + 5S 2 + 9A 2 and 1M 2 + 8S 2 + 10A 2 + 1M d operations.The ECADD operation of the twisted Edwards curve using extensible coordinates saves 3M 2 + 5S 2 + 3A 2 operations.The ECDBL operation requires 2M 2 additional operations but saves 4S 2 + 5A 2 + 1M d operations.Therefore, the twisted Edwards curves using extensible coordinates have a computational advantage compared to short Weierstrass curves using Jacobian coordinates.Algorithm 7: Twisted Edwards point doubling over F p 2 .

Endomorphisms
In [25], the formulas for the endomorphisms φ and ψ are described.To reduce the number of representation conversions, we represent the results of endomorphism operations using extensible coordinates.Let P = (X 1 , Y 1 , Z 1 ) be a point in curve Ted127-glv4 represented by homogeneous projective coordinates.Then, φ(P) = (X 2 , Y 2 , Z 2 , T a , T b ), where T = T a • T b can be computed as follows: , We also utilize the fixed values for curve Ted127-glv4 as follows: where A = 143485135153817520976780139629062568752 and B = 1701411834604692317316873037158840 99729.The endomorphism φ can be computed by using , where T 2 = T a • T b can be computed as follows: The endomorphism ψ can be computed using 3M 2 + 1S 2 + 1.5A 2 or 2M 2 + 1A 2 operations in the case Z 1 = 1.Because the endomorphism ψ requires fewer operations than the endomorphism φ, ψ(φ(P)) can be computed on the order of φ(P) with Z 1 = 1 and ψ(φ(P)).

Performance Analysis and Implementation Results
In this section, we analyze the operation counts and implementation results of variable-base scalar multiplication using curve Ted127-glv4 on AVR (Microchip Technology Inc., Chandler, AZ, USA), MSP430 (Texas Instruments, Dallas, TX, USA), and ARM (ARM holdings plc, Cambridge, UK) processors.We performed simulations and evaluations using the IAR Embedded Workbench for AVR 6.80.7 (IAR systems, Uppsala, Sweden), IAR Embedded Workbench for MSP430 7.10.2(IAR systems, Uppsala, Sweden), and STM32F4-DISC1 board (STMicroelectronics, Geneva, Switzerland) with the IAR Embedded Workbench for ARM 8.11.1 (IAR systems, Uppsala, Sweden).All implementations were set to the medium optimization level.

Operation Counts
Tables 1 and 2 describe the operation counts of field arithmetic over F p 2 and their conversion into field arithmetic over F p for curve Ted127-glv4 and FourQ using Algorithm 1.Because both curves support the four-dimensional decomposition, the operation counts for Algorithm 1 can be compared step by step.Step 1 of Algorithm 1 computes three endomorphisms φ(P), ψ(P), and φ(ψ(P)), and requires 73M 2 + 27S 2 + 59.5A operations for FourQ and 13M 2 + 2S 2 + 11.5A 2 operations for curve Ted127-glv4.
Step 2 requires seven ECADD operations, which require 49M 2 + 28A 2 operations for FourQ and 56M 2 + 42A 2 operations for curve Ted127-glv4.However, these outputs are all converted for faster ECADD computations, which require 14M 2 + 28A 2 operations for FourQ and 7M 2 + 28A 2 operations for curve Ted127-glv4.Steps 3 and 4 require only bit and integer operations for all positive scalar decomposition and fixed-length recoding operations.Step  Step 10 requires 1I 2 + 2M 2 operations for the normalization of the result point Q.

Implementation Results of Field Arithmetic
Table 3 lists how many cycles are used for field arithmetic over F p and F p 2 on AVR, MSP430, and ARM processors, including function call overhead.The field inversions F p and F p 2 are the average cycles performed 10 4 times and remaining the field arithmetic is the average cycles performed 10 7 times.To evaluate the implementation of field arithmetic for curve Ted127-glv4, we compare the number of cycles for its implementation with FourQ, which provides the fastest implementation results to date [24].
One can see that the field arithmetic over F p in FourQ on AVR and MSP430 is typically faster than curve Ted127-glv4.This difference occurs because the primes of both curves are different, with a Mersenne prime of the form p = 2 127 − 1 in FourQ and a Mersenne-like prime of the form p = 2 127 − 5997 in curve Ted127-glv4.Let p = 2 127 − δ, where δ is small.The modular reduction step can be computed by In this process, FourQ can be efficiently computed using simple shift operations because δ = 1, but the curve Ted127-glv4 requires more instructions because it uses multiplication by δ = 5997 = 0x176d.In the 8-bit AVR implementation, 0x176d can be represented by two 8-bit words as 0x17 and 0x6d.Therefore, the operation c l + 2 • δ • c h (mod p) requires more 8 × 8-bit multiplications and accumulations.Unlike the AVR implementation, 0x176d can be represented by one word in the MSP430 and ARM CPUs.Additionally, these CPUs provide efficient MAC instructions.Therefore, the modular reduction on MSP430 and ARM implementations require fewer additional instructions than the AVR implementation.
In the case of the MSP430, field squaring over F p in curve Ted127-glv4 is faster than in FourQ.The field squaring over F p in curve Ted127-glv4 requires 837 cycles, whereas FourQ requires 927 cycles.Our implementation saves 9.71% of the cycles for field squaring over F p compared to the SBD method, despite the modular reduction overhead.Additionally, the principal operation of inversion is field squaring over F p , our implementation saves 9.32% and 8.55% of the cycles for inversion over F p and F p 2 .For field squaring over F p 2 , field squaring over F p is not required because it can be computed by 2M 1 + 1A 1 + 2A i operations.Therefore, field squaring over F p 2 for Ted127-glv4 requires more cycles than FourQ.

Implementation Results of Scalar Multiplication
Table 4 summarizes the implementation results of variable-base scalar multiplication compared to the previous implementations on the 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors.We measured the average cycles for our variable-base scalar multiplication by running it 10 3 times with random scalars k.For comparison, Table 4 includes the previous implementations that guarantee constant-time execution.These were implemented using various elliptic curves, such as NIST P-256 [42,43], Curve25519 [44][45][46], µKummer [47], and FourQ [24].These curves are designed such that the bit-length of the curve order is slightly smaller than 256-bit for efficient implementation.NIST P-256 has a 256-bit curve order, but Curve25519, µKummer, FourQ, and curve Ted127-glv4 have 252-bit, 250-bit, 246-bit, and 251-bit curve orders, respectively.Therefore, these curves provide approximately 128-bit security levels.We will now summarize the implementation results of previous works on embedded devices that provide approximately 128-bit security levels.Wenger and Werner [42] and Wenger et al. [43] implemented the scalar multiplication using the NIST P-256 curve on various 16-bit microcontrollers and 8-bit, 16-bit, and 32-bit microcontrollers.Hutter and Schwabe [48] implemented the NaCl library on 8-bit AVR processor, which provides a Curve25519 scalar multiplication.Hinterwälder et al. [44] implemented a Diffie-Hellman key exchange on MSP430X processor using 16-bit and 32-bit hardware multipliers.In 2015, Düll et al. [45] implemented a Curve25519 scalar multiplication of on 8-bit, 16-bit, and 32-bit microcontrollers.Renes et al. [47] implemented a Montgomery ladder scalar multiplication on the Kummer surface of a genus 2 hyperelliptic curve on 8-bit AVR and 32-bit ARM Cortex-M0 processors.Faz-Hernández et al. [25] proposed an efficient implementation of the four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on Intel and ARM processors.
The implementation results of variable-base scalar multiplication set new speed records on the 16-bit MSP430 and 32-bit ARM Cortex-M4 processors.Scalar multiplication using curve Ted127-glv4 on AVR, MSP430, and ARM requires 6,856,026, 4,158,453, and 447,836 cycles, respectively.Compared to the previous fastest implementation, namely FourQ [24], which require 6,561,500, 4,280,400, and 469,500 cycles on AVR, MSP430, and ARM, respectively, our implementation requires 4.49% more cycles on AVR, but saves 2.85% and 4.61% cycles on MSP430X and ARM processors, respectively.Compared to µKummer [47], which requires 9,513,536 cycles on AVR, our implementation saves 27.93% cycles.It also saves 50.68% and 47.58% cycles than Düll et al.'s Curve25519 implementation [45], which requires 13,900,397 and 7,933,296 cycles on AVR and MSP430, respectively.It saves 69.92% cycles compared to the NaCl library [48], which requires 22,791,579 cycles on AVR.It saves 54.50% cycles than Hinterwälder et al.'s Curve25519 implementation [44], which requires 9,139,739 cycles on MSP430.Additionally, it saves 68.54% cycles compared to the method in [46], which requires 1,423,667 cycles on the ARM Cortex-M4 processor.
The memory of embedded processors is very constrained, meaning the memory usage of various implementations is important.In the case of the 8-bit AVR, µKummer [47] requires the lowest memory usage in the recently proposed results, which requires 9490 bytes of code size and 99 bytes of stack memories.Wenger et al.'s and Düll et al.'s implementations [43,45] require the lowest code size and stack memories on MSP430, which require 8378 bytes of code size and 384 bytes of stack memories.In the 32-bit ARM, Ref. [46] require 3750 bytes of code size and 740 bytes of stack memories.FourQ [24] reported the memory usage of ECDH and signature operations, but did not report the memory usage of single scalar multiplication.Our implementations for curve Ted127-glv4 requires 13,891, 9098, and 7532 bytes of code size and 2539, 2568, and 2792 bytes of stack memories on AVR, MSP430, and ARM Cortex-M4, respectively.FourQ and curve Ted127-glv4, which utilize the four-dimensional decompositions, precompute eight points, meaning they require more stack memory than other implementations.However, the performance of four-dimensional scalar multiplication is significantly faster than other implementations.

Conclusions
In this paper, we presented the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors.We also optimized the performance of internal algorithms in scalar multiplication on three target processors.Our implementations for single scalar multiplication on AVR require 4.49% more cycles than FourQ-based implementation, but save 2.85% and 4.61% cycles on MSP430 and ARM Cortex-M4, respectively.Our analysis and implementation results demonstrate that efficiently computable endomorphisms can accelerate scalar multiplication, even when using prime numbers that provide inefficient field arithmetic.Our implementations highlight that the four-dimensional GLV-GLS scalar multiplication using curve Ted127-glv4 is one of the suitable elliptic curves for constructing ECC-based applications for resource-constrained embedded devices.
. For example, the intermediate values c h are loaded in R9 to R12 and c l are loaded in R5 to R8.The constant 11994 = 2 • 5997 is loaded in R3 and 0 is loaded in R4.The computation c ≡ c l + 2 • 5997 • c h is performed as follows:

Table 1 .
The operation counts of curve Ted127-glv4 using field arithmetic over F p 2 and operation counts for conversion into field arithmetic over F p .

Table 2 .
The operation counts of curve FourQ using field arithmetic over F p 2 and operation counts for conversion into field arithmetic over F p .

Table 3 .
Cycle counts for field arithmetic on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors, including function call overhead.
a includes RAM and stack.