Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers

Kwon, Jihoon; Seo, Seog Chung; Hong, Seokhie

doi:10.3390/app8060900

Open AccessArticle

Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers

by

Jihoon Kwon

¹

,

Seog Chung Seo

²

and

Seokhie Hong

^1,*

¹

Center for Information Security Technologies (CIST), Korea University, Seoul 02841, Korea

²

The Affiliated Institute of ETRI, Daejeon 34044, Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2018, 8(6), 900; https://doi.org/10.3390/app8060900

Submission received: 16 March 2018 / Revised: 20 May 2018 / Accepted: 22 May 2018 / Published: 31 May 2018

(This article belongs to the Special Issue Security and Privacy for Cyber Physical Systems)

Download

Browse Figure

Versions Notes

Abstract

In this paper, we present the first constant-time implementations of four-dimensional Gallant–Lambert–Vanstone and Galbraith–Lin–Scott (GLV-GLS) scalar multiplication using curve

Ted 127

-

glv 4

on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. In Asiacrypt 2012, Longa and Sica introduced the four-dimensional GLV-GLS scalar multiplication, and they reported the implementation results on Intel processors. However, they did not consider efficient implementations on resource-constrained embedded devices. We have optimized the performance of scalar multiplication using curve

Ted 127

-

glv 4

on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. Our implementations compute a variable-base scalar multiplication in 6,856,026, 4,158,453, and 447,836 cycles on AVR, MSP430, and ARM Cortex-M4 processors, respectively. Recently, Four

Q

-based scalar multiplication has provided the fastest implementation results on AVR, MSP430, and ARM Cortex-M4 processors to date. Compared to Four

Q

-based scalar multiplication, the proposed implementations require 4.49% more computational cost on AVR, but save 2.85% and 4.61% cycles on MSP430 and ARM, respectively. Our 16-bit and 32-bit implementation results set new speed records for variable-base scalar multiplication.

Keywords:

elliptic curves; scalar multiplication; constant-time implementation; twisted Edwards curves; AVR; MSP430; ARM

1. Introduction

Wireless sensor networks (WSNs) are wireless networks consisting of a large number of resource-constrained sensor nodes, where each node is equipped with a sensor to monitor physical phenomena, such as temperature, light, and pressure. The main features of WSNs are resource constraints, such as storage, computing power, and sensing distance. Recently, the energy consumption of data centers has attracted attention because of the fast growth of data throughput. WSNs can provide a solution for data collection and data processing in various applications including data center monitoring. That is, WSNs can be utilized for data center monitoring to improve the efficiency of energy consumption. Several solutions were proposed to solve this problem [1,2].

Since sensor nodes are usually deployed in remote areas and left unattended, they can be led to network security issues, such as node capture, eavesdropping, and message tampering during data communication. Additionally, many application areas of WSNs require data confidentiality, integrity, authentication, and non-repudiation, meaning there is a need for an efficient cryptographic mechanism to satisfy current security requirements. However, due to the constraint of WSNs, it is difficult to utilize the conventional cryptographic algorithms. Therefore, efficient cryptographic algorithms considering code size, computation time, and power consumption are required for the security of WSNs.

In 1985, elliptic curve cryptography (ECC) was proposed independently of public key cryptography (PKC) by Miller and Koblitz [3,4]. ECC is mainly used for digital signature and key exchange based on the elliptic curve discrete logarithm problem (ECDLP), which is defined by elliptic curve point operations in a finite field. ECC provides the same security level with a smaller key size compared to existing PKC algorithms such as Rivest-Shamir-Adleman (RSA) cryptosystem [5]. For example, ECC over

F_{p}

with a 256-bit prime p provides an equivalent security level as RSA using 3072-bit key. Because RSA uses a small integer as the public key, RSA public key operations can be efficiently computed. However, RSA private key operations are extremely slower than ECC, therefore they have limited use in the applications of WSNs. Therefore, ECC can be efficiently utilized than RSA for resource-constrained WSNs devices, such as smart cards and sensor nodes.

However, recently proposed manipulation and backdoors have raised the suspicion of weakness in previous ECC standards. In particular, the National Institute of Standards and Technology (NIST) P-224 curve is not secure against twist attacks, which are the combined attacks that use the small-subgroup attacks and the invalid-curve attacks using the twist of curve [6]. The dual elliptic curve deterministic random bit generator (Dual_EC_DRBG) is a pseudo-random number generator (PRNG) standardized in NIST SP 800-90A. However, the revised version of NIST SP 800-90A standard removes Dual_EC_DRBG because this algorithm contains a backdoor for the national security agency (NSA) [7].

Therefore, the demand for next generation elliptic curves has increased. Specific examples of such curves are Curve25519, Ed448-Goldilocks, and twisted Edwards curves [8,9,10]. The main features of these curves are the selection of efficient parameters. The Curve25519 utilizes a prime of the form

p = 2^{255} - 19

and a fast Montgomery elliptic curve. The Ed448-Goldilocks curve utilizes a Solinas trinomial prime of the form

p = 2^{448} - 2^{224} - 1

, which provides fast field arithmetic on both 32-bit and 64-bit machines because

224 = 28 \times 8 = 32 \times 7 = 56 \times 4

. These parameters can accelerate the performance of ECC-based protocols. The details of the twisted Edwards curves can be found in Section 2.3.

Scalar multiplicationor point multiplication computes an operation

k P

using an elliptic curve point P and a scalar k. This operation determines the performance of ECC. Therefore, many researchers have proposed various methods to improve the efficiency of scalar multiplication. The speed-up methods for scalar multiplication can be classified into three types: methods based on speeding up the finite field exponentiation, such as comb techniques and windowing methods, scalar recoding methods, and methods that are particular to elliptic curve scalar multiplication [11].

Speed-up methods using efficiently computable endomorphisms are one type of method that are particular to elliptic curve scalar multiplication. The Gallant–Lambert–Vanstone (GLV) method proposed by Gallant et al. is a method for accelerating scalar multiplication by using efficiently computable endomorphisms [11]. If the cost of computing endomorphism is less than (bit-length of curve order)/3 elliptic curve point doubling (ECDBL) operations, then this method has a computational advantage. Their method reduces about half of the ECDBL operations and saves the costs of scalar multiplication by roughly 33%. Additionally, recent studies have reported that scalar multiplication methods using efficiently computable endomorphisms are significantly faster than generalized methods. The Galbraith–Lin–Scott (GLS) curves proposed by Galbraith et al. constructed an efficiently computable endomorphism for elliptic curves defined over

F_{p^{2}}

, where p is a prime number [12]. They demonstrated that the GLV method can efficiently compute scalar multiplication on such curves. Longa and Gebotys [13] presented an efficient implementation of two-dimensional GLS curves over

F_{p^{2}}

.

In 2012, Longa and Sica [14] proposed four-dimensional GLV-GLS curves over

F_{p^{2}}

, which generalized the GLV method and GLS curves. Hu et al. [15] proposed a GLV-GLS curve over

F_{p^{2}}

, which supports the four-dimensional scalar decomposition. They reported the implementation results indicating that the four-dimensional GLV-GLS scalar multiplication reduces at most 22% of computational cost than the two-dimensional GLV method. Bos et al. [16] proposed two- and four-dimensional scalar decompositions over genus 2 curves defined over

F_{p^{2}}

. Bos et al. [17] introduced an eight-dimensional GLV-GLS method over genus 2 curves defined over

F_{p^{2}}

. Oliveira et al. [18] presented the implementation results of a two-dimensional GLV method over binary GLS elliptic curves defined over

F_{2^{254}}

. Guillevic and Ionica [19] utilized the four-dimensional GLV method on genus 1 curves defined over

F_{p^{2}}

and genus 2 curves defined over

F_{p}

. Smith [20] proposed a new family of elliptic curves over

F_{p^{2}}

, called “

Q

-curves”. Costello and Longa [21] introduced a four-dimensional

Q

curve defined over

F_{p^{2}}

, called “Four

Q

”. They reported the implementation results of Four

Q

on various Intel and AMD processors.

After a Four

Q

-based approach has been proposed, many implementation results were reported considering various environments, such as AVR, MSP430, ARM, and field-programmable gate array (FPGA) devices [22,23,24]. An efficient Four

Q

-based implementation on 32-bit ARM processor with the NEON single instruction multiple data (SIMD) instruction set was proposed by Longa [22]. Järvinen et al. [23] proposed a fast and compact Four

Q

-based implementation on FPGA device. In CHES 2017, Liu et al. [24] presented highly optimized implementations using curve Four

Q

on 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors, respectively.

In the case of curve

Ted 127

-

glv 4

, Longa and Sica and Faz-Hernández et al. [14,25] reported the implementation results on high-end processors, such as Intel Sandy Bridge, Intel Ivy Bridge, and ARM Cortex-A processors. However, efficient implementations on resource-constrained embedded devices have not been considered to date. Therefore, we focused on optimized implementations of scalar multiplication using curve

Ted 127

-

glv 4

on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors, respectively.

Our main contributions can be summarized as follows:

We present efficient implementations at each level of the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication considering the features of 8-bit AVR, 16-bit MSP430, and 32-bit ARM Cortex-M4 processors. To improve the performance of scalar multiplication, we carefully selected the internal algorithms at each level of the implementation hierarchy. These implementations also run in constant time to resist timing and cache-timing attacks [26,27].
We demonstrate that the efficiently computable endomorphisms can accelerate the performance of four-dimensional GLV-GLS scalar multiplication. For this purpose, we analyze the operation counts of two elliptic curves “Ted127-glv4” and “Four $Q$ ”, which support the four-dimensional GLV-GLS scalar multiplication. The GLV-GLS curve $Ted 127$ - $glv 4$ requires fewer number of field arithmetic operations than Four $Q$ -based implementation to compute a single variable-base scalar multiplication. However, because Four $Q$ uses a Mersenne prime $p = 2^{127} - 1$ and the curve $Ted 127$ - $glv 4$ uses a Mersenne-like prime $p = 2^{127} - 5997$ , Four $Q$ has a computational advantage of faster field arithmetic operations. By using the computational advantage of endomorphisms, we overcome the computational disadvantage of curve $Ted 127$ - $glv 4$ at field arithmetic level.
We present the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve $Ted 127$ - $glv 4$ on three target platforms, which have not been considered in previous works. The proposed implementations on AVR, MSP430, and ARM processors require 6,856,026, 4,158,453, and 447,836 cycles to compute a single variable-base scalar multiplication, respectively. Compared to Four $Q$ -based implementations [24], which have provided the fastest results to date, our results are 4.49% slower on AVR, but 2.85% and 4.61% faster on MSP430 and ARM, respectively. Our MSP430 and ARM implementations set new speed records for variable-base scalar multiplication.

The remainder of this paper is organized as follows. Section 2 describes preliminaries regarding ECC and its speed-up techniques, including the GLV and GLS methods. Section 3 presents a review of four-dimensional GLV-GLS scalar multiplication and its implementation hierarchy. Section 4 describes the implementation details of field arithmetic and optimization methods for the target platforms. Section 5 describes optimization methods for ECC in terms of point arithmetic and scalar multiplication. Experimental results and a comparison of our work to previous ECC implementations on AVR, MSP430, and ARM processors are presented in Section 6. Finally, we conclude this paper in Section 7.

2. Preliminaries

In Section 2.1, we describe the field representation and notations used for the remainder of this paper. We briefly describe ECC using a short Weierstrass curve and its group law in Section 2.2. We also describe twisted Edwards curves, which are the target of our implementation, in Section 2.3. In Section 2.4, we describe the GLV-GLS method including the GLV method and GLS curves.

2.1. Field Representation and Notations

We assume that the target platform has a w-bit architecture. Let

n = ⌈ {log}_{2} p ⌉

be the bit-length of a Mersenne-like prime

p = 2^{n} - c

, where c is small. Let

m = ⌈ n / w ⌉

be its word-length. Then, an arbitrary element

a \in F_{p}

is represented by an array

(a_{m - 1}, \dots, a_{2}, a_{1}, a_{0})

of mw-bit words. The notations

M_{1}, S_{1}, I_{1}

, and

A_{1}

represent multiplication, squaring, inversion, and addition (subtraction) over

F_{p}

, respectively. Similarly, the notations

M_{2}, S_{2}, I_{2}

, and

A_{2}

represent multiplication, squaring, inversion, and addition (subtraction) over

F_{p^{2}}

, respectively. The notation

A_{i}

represents multi-precision addition without modular reduction and the notation

M_{d}

represents multiplication with a curve parameter.

2.2. Elliptic Curve Cryptography

Let

F_{q}

be a finite field with odd characteristic. An elliptic curve E over

F_{q}

is defined by a short Weierstrass equation of the following form:

\begin{matrix} E : y^{2} = x^{3} + a x + b, \end{matrix}

where

a, b \in F_{q}

and

4 a^{3} + 27 b^{2} \neq 0

.

Because the most important operation in ECC is scalar multiplication

k P

, it must be implemented efficiently. The basic method for computing

k P

is comprised of two elliptic curve operations: elliptic curve point addition (ECADD) and the ECDBL operations. Let

P = (x_{1}, y_{1})

and

Q = (x_{2}, y_{2})

be two points on an elliptic curve E. The ECADD and ECDBL operations can be computed in affine coordinates as follows:

\begin{matrix} x_{3} = λ^{2} - x_{1} - x_{2}, & y_{3} = λ (x_{1} - x_{3}) - y_{1}, \\ λ = \frac{y_{2} - y_{1}}{x_{2} - x_{1}} if P \neq Q, & and λ = \frac{3 x_{1}^{2} + a}{2 y_{1}} if P = Q . \end{matrix}

The ECADD and ECDBL operations are composed of finite field arithmetic operations, such as field addition, subtraction, multiplication, squaring, and inversion. Therefore, to improve the performance of scalar multiplication, the internal algorithms such as field and curve arithmetic operations should be efficiently implemented.

2.3. Twisted Edwards Curves

The Edwards curves are a normal form of elliptic curves introduced by Edwards [28]. Bernstein and Lange [29] introduced Edwards curves defined by

x^{2} + y^{2} = c^{2} (1 + d x^{2} y^{2})

, where

c, d \in F_{q}

with

c d (1 - d c^{4}) \neq 0

. In 2007, Bernstein et al. [10] introduced twisted Edwards curves, which are a generalization of Edwards curves defined by

\begin{matrix} E_{a, d} : a x^{2} + y^{2} = 1 + d x^{2} y^{2}, \end{matrix}

where

a, d \in F_{q}

with

a d (a - d) \neq 0

. The Edwards curves are a special case of twisted Edwards curves with

a = 1

. The point

(0, 1)

is the identity element and the point

(0, - 1)

has order two. The point

(1, 0)

and

(- 1, 0)

have order four. The negative of a point

P = (x_{1}, y_{1})

is

- P = (- x_{1}, y_{1})

. The ECADD operation of two points

P = (x_{1}, y_{1})

and

Q = (x_{2}, y_{2})

on a twisted Edwards curve E is defined as follows:

\begin{matrix} (x_{1}, y_{1}) + (x_{2}, y_{2}) = (\frac{x_{1} y_{2} + y_{1} x_{2}}{1 + d x_{1} y_{1} x_{2} y_{2}}, \frac{y_{1} y_{2} - a x_{1} x_{2}}{1 - d x_{1} y_{1} x_{2} y_{2}}) . \end{matrix}

Because the addition law is unified, it can be used for computing the ECDBL operation. Suppose that two points P and Q have an odd order. Then, the denominators of the addition formula

1 + d x_{1} y_{1} x_{2} y_{2}

and

1 - d x_{1} y_{1} x_{2} y_{2}

are nonzero. Therefore, the doubling formula can be obtained as follows:

\begin{matrix} 2 (x_{1}, y_{1}) = (\frac{2 x_{1} y_{1}}{y_{1}^{2} + a x_{1}^{2}}, \frac{y_{1}^{2} - a x_{1}^{2}}{2 - y_{1}^{2} - a x_{1}^{2}}) . \end{matrix}

Two relationships can be obtained by considering the curve equation:

a x_{1}^{2} + y_{1}^{2} = 1 + d x_{1}^{2} y_{1}^{2}

and

a x_{2}^{2} + y_{2}^{2} = 1 + d x_{2}^{2} y_{2}^{2}

. After straightforward elimination, the curve parameters a and d can be represented by

x_{1}, x_{2}, y_{1}

, and

y_{2}

. Substitutions in the unified addition formula yield the addition formula as follows:

\begin{matrix} (x_{1}, y_{1}) + (x_{2}, y_{2}) = (\frac{x_{1} y_{1} + x_{2} y_{2}}{y_{1} y_{2} + a x_{1} x_{2}}, \frac{x_{1} y_{1} - x_{2} y_{2}}{x_{1} y_{2} - y_{1} x_{2}}) . \end{matrix}

These addition and doubling formulas are used in the dedicated addition and doubling formulas described in Section 5. The features of these formulas are independent of the curve parameter d [30].

2.4. The GLV-GLS Method

We will now describe the GLV method to explain the GLV-GLS method. Let E be an elliptic curve defined over a finite field

F_{q}

. An endomorphism

ϕ

of E over

F_{q}

is a rational map

ϕ : E \to E

such that

ϕ (O) = O

and

ϕ (P) = (g (P), h (P))

for all points

P \in E

, where g and h are rational functions and

O

is a point at infinity. An endomorphism

ϕ

is a group homomorphism, defined as

\begin{matrix} ϕ (P_{1} + P_{2}) = ϕ (P_{1}) + ϕ (P_{2}) for all P_{1}, P_{2} \in E . \end{matrix}

Suppose that

# E (F_{q})

contains a subgroup of order r and let

ϕ

be an efficiently computable endomorphism on E such that

ϕ (P) = λ P

for some

1 \leq λ \leq r - 1

. The GLV method computes the integers

k_{0}

and

k_{1}

such that

k = k_{0} + k_{1} λ

mod r for scalar multiplication

k P

. Because

\begin{matrix} k P & = k_{0} P + k_{1} λ P \\ = k_{0} P + k_{1} ϕ (P), \end{matrix}

scalar multiplication

k P

can be computed by computing

ϕ (P)

and then using multiple scalar multiplications [31]. This is because the multi-scalars

k_{0}

and

k_{1}

have approximately half the bit-length of the scalar k. The efficiency of the GLV method depends on scalar decomposition and the efficiency of computing endomorphism

ϕ

.

The main concept of the GLS curves is described as follows: Let

E^{'} / F_{q^{2}}

be the quadratic twists of

E / F_{q^{2}}

[12]. Let

ψ

be the quadratic twist map and

π

be the q-th Frobenius endomorphism. Then, we can obtain the efficiently computable endomorphism

ϕ = ψ \circ π \circ ψ^{- 1}

, which satisfies the equation

X^{2} + 1 = 0

if

p \equiv 5 (mod 8)

. However, GLS curves only work for elliptic curves over

F_{q^{m}}

with

m > 1

.

As mentioned in the introduction, the GLV-GLS method is the generalized method of the GLV method and GLS curves. Let

ϕ

and

ψ

be two efficiently computable endomorphisms over

F_{p^{2}}

and P be a point of prime order r. Then, the four-dimensional scalar multiplication

k P

for any scalar

k \in [1, r]

can be computed as follows:

\begin{matrix} k P = & k_{0} P + k_{1} ϕ (P) + k_{2} ψ (P) + k_{3} ψ (ϕ (P)), \end{matrix}

where

{max}_{i} (| k_{i} |) < C r^{1 / 4}

for

0 \leq i \leq 3

and C is some explicit constant. The details of internal algorithms of the four-dimensional scalar multiplication can be found in Section 4 and Section 5.

3. Review of Four-Dimensional GLV-GLS Scalar Multiplication

The curve

Ted 127

-

glv 4

was introduced by Longa and Sica [14]. It is based on twisted Edwards curves and has efficiently computable endomorphisms, which facilitates the four-dimensional GLV-GLS scalar multiplication. The parameters of curve

Ted 127

-

glv 4

are as follows:

\begin{matrix} E / F_{p^{2}} : - x^{2} + y^{2} = 1 + d x^{2} y^{2}, \end{matrix}

where

d = 170141183460469231731687303715884099728 + 116829086847165810221872975542241037773 i

,

p = 2^{127} - 5997

and

# E (F_{p^{2}}) = 8 r

, where r is a 251-bit prime. Let

F_{p^{2}} = F_{p} [i] / (i^{2} + 1)

and

u = 1 + i

be a quadratic non-residue in

F_{p^{2}}

. E is isomorphic to the Weierstrass curve

E^{'} / F_{p^{2}} : y^{2} = x^{3} - 15 / 2 u^{2} x - 7 u^{3}

. The curve

Ted 127

-

glv 4

contains two efficiently computable endomorphisms

ϕ

and

ψ

defined over

F_{p^{2}}

as follows:

\begin{matrix} ϕ (x, y) = & (- \frac{(ζ_{8}^{3} + 2 ζ_{8}^{2} + ζ_{8}) x y^{2} + (ζ_{8}^{3} - 2 ζ_{8}^{2} + ζ_{8}) x}{2 y}, \frac{(ζ_{8}^{2} - 1) y^{2} + 2 ζ_{8}^{3} - ζ_{8}^{2} + 1}{(2 ζ_{8}^{3} + ζ_{8}^{2} - 1) y^{2} - ζ_{8}^{2} + 1}), \end{matrix}

\begin{matrix} ψ (x, y) = (ζ_{8} x^{p}, \frac{1}{y^{p}}), \end{matrix}

where

ζ_{8} = u / \sqrt{2}

is a primitive eighth root of unity. It can be verified that

ϕ^{2} + 2 = 0

and

ψ^{2} + 1 = 0

.

Let P be a point in

E / F_{p^{2}}

and k be a random scalar in the range

[1, r]

. Algorithm 1 outlines variable-base scalar multiplication using curve

Ted 127

-

glv 4

and four-dimensional decompositions. Steps 1 and 2 in Algorithm 1 compute three endomorphisms

ϕ (P), ψ (P)

, and

ψ (ϕ (P))

, and then compute the eight points

T [u] = P + u_{0} ϕ (P) + u_{1} ψ (P) + u_{2} ψ (ϕ (P))

, where

u = (u_{2}, u_{1}, u_{0})

in

0 \leq u \leq 7

. Step 3 decomposes the input scalar k into multi-scalars

(k_{0}, k_{1}, k_{2}, k_{3})

such that

0 \leq k_{i} \leq 2^{65}

, where

0 \leq i \leq 3

. For constant-time implementation, the multi-scalars

(k_{0}, k_{1}, k_{2}, k_{3})

must guarantee the same number of iterations of the main computation. Because all coordinates of scalar decomposition are less than

2^{65}

, we apply the scalar recoding algorithm to guarantee a fixed loop length for the main computation at step 4 [25]. The result of the scalar recoding is represented by 66 lookup table indices

d_{i}

and 66 masks

m_{i}

, where

0 \leq i \leq 65

. Steps 5 to 9 represent the main computation stage, including point loading, the ECADD operation, and the ECDBL operation. The result of the main computation is converted from an extensible coordinates to the affine coordinates in step 10. Therefore, a variable-base scalar multiplication using curve

Ted 127

-

glv 4

requires one

ϕ (P)

endomorphism, two

ψ (P)

endomorphisms, and seven ECADD operations in the precomputation; 65 table lookups, 65 ECADD operations, and 65 ECDBL operations in the main computation; and one inversion and two field multiplications over

F_{p^{2}}

for point normalization.

Figure 1 describes the implementation hierarchy of four-dimensional GLV-GLS scalar multiplication and its internal algorithms. Because the implementation algorithms at each level affect the performance of scalar multiplication, we carefully choose proper algorithms considering the features of AVR, MSP430, and ARM processors. Additionally, field arithmetic over

F_{p^{2}}

and curve arithmetic are comprised of field arithmetic over

F_{p}

, which is the computationally primary operations. Therefore, field arithmetic over

F_{p}

is written at the assembly level.

Algorithm 1: Scalar multiplication using curve

Ted 127

-

glv 4

[21].

Require:: Scalar $k \in [1, r]$ and point $P \in E / F_{p^{2}}$ .
Ensure:: $k P$ .
1:: Compute $ϕ (P), ψ (P)$ , and $ψ (ϕ (P))$ .
2:: Compute $T [u] = P + u_{0} ϕ (P) + u_{1} ψ (P) + u_{2} ψ (ϕ (P))$ where $u = (u_{2}, u_{1}, u_{0})$ in $0 \leq u \leq 7$ .
3:: Decompose the scalar k into the multi-scalars $(k_{0}, k_{1}, k_{2}, k_{3})$ .
4:: Recode the multi-scalars $(k_{0}, k_{1}, k_{2}, k_{3})$ to $(d_{65}, \dots, d_{0})$ and $(m_{65}, \dots, m_{0})$ . $s_{i} = 1$ if $m_{i} = - 1$ and $s_{i} = - 1$ if $m_{i} = 0$ .
5:: $Q = s_{65} \cdot T [d_{65}]$ .
6:: for $i = 64$ to 0 do
7:: $Q \leftarrow 2 Q$ .
8:: $Q \leftarrow Q + s_{i} \cdot T [d_{i}]$ .
9:: end for
10:: returnQ.

4. Implementation Details of Field Arithmetic

In this section, we describe the implementation details of field arithmetic on AVR, MSP430X, and ARM Cortex-M4 processors using a Mersenne-like prime of the form

p = 2^{127} - 5997

. We describe the field arithmetic algorithms that are commonly used in three target platforms in Section 4.1, Section 4.2, Section 4.3 and Section 4.4. In Section 4.5, Section 4.6 and Section 4.7, we describe our optimization strategy for field arithmetic on AVR, MSP430, and ARM processors, respectively.

4.1. Field Addition and Subtraction over $F_{p}$

The curve

Ted 127

-

glv 4

uses a Mersenne-like prime of the form

p = 2^{127} - 5997

. An efficient field addition/subtraction method for this scenario was proposed by Bos et al. [16]. Let

0 \leq a, b < p = 2^{127} - 5997

. Field addition over

F_{p}

can be computed by

c = a + b (mod p) = ((a + 5997) + b) - c a r r y \cdot 2^{127} - (1 - c a r r y) \cdot 5997

, where

c a r r y = 0

if

a + b + 5997 < 2^{127}

. Otherwise,

c a r r y = 1

. The result is bounded by p because, if

a + b + 5997 < 2^{127}

, then

a + b < 2^{127} - 5997

, whereas if

a + b + 5997 \geq 2^{127}

, then

(a + b + 5997) (mod 2^{127}) = a + b - p < p

. Because

a + 5997 < 2^{127}

, addition does not require carry propagation. Note that subtraction with

c a r r y \cdot 2^{127}

can be efficiently implemented by clearing the 128-th bit of

(a + 5997) + b

.

Similar to field addition, field subtraction over

F_{p}

can be computed by

c = a - b (mod p) = (a - b) + b o r r o w \cdot 2^{127} - b o r r o w \cdot 5997

, where

b o r r o w = 0

if

a \geq b

, otherwise,

b o r r o w = 1

. Addition with

b o r r o w \cdot 2^{127}

can be implemented by clearing the 128-th bit of

a - b

.

4.2. Modular Reduction

To use primes of a special form may result in a faster reduction method [31]. The NIST recommends five primes for the elliptic curve digital signature algorithm (ECDSA). These primes can be represented as the sums or differences of powers of two and facilitate the fast reduction method. The curve

Ted 127

-

glv 4

uses a Mersenne-like prime of the form

p = 2^{127} - 5997

. Therefore, modular reduction can be efficiently computed by using a NIST-like reduction method [16]. Let

0 \leq a, b \leq p = 2^{127} - 5997

. We compute

c = a \cdot b = 2^{128} c_{h} + c_{l}

, where

0 \leq c_{h}, c_{l} < 2^{128}

. The first reduction step can be computed by

c^{'} \equiv c_{l} + 2 \cdot 5997 \cdot c_{h}

. Then, the second reduction step can be computed by

c^{'} \equiv 2^{127} R_{h} + R_{l} \equiv R_{l} + 5997 \cdot R_{h} (mod p)

, where

R_{l}, 5997 \cdot R_{h} < 2^{127}

.

4.3. Inversion over $F_{p}$

For the field inversion

a^{- 1} (mod p)

, we use the fact that

a^{- 1} = a^{p - 2} (mod p)

in Fermat’s little theorem (in our case,

a^{p - 2} (mod p) = a^{2^{127} - 5999} (mod p)

). This method can be implemented by modular exponentiation using fixed addition chains and guarantees constant-time execution requiring

13 M_{1} + 126 S_{1}

operations.

4.4. Field Arithmetic over $F_{p^{2}}$

The incomplete reduction method proposed by Yanık et al. [32] is one of the optimization methods in field arithmetic over

F_{p^{2}}

. Given two elements

a, b \in [0, p - 1]

, the result of operations stays in the range

[0, 2^{m} - 1]

, where

p < 2^{m} < 2 p - 1

and m is a fixed integer (in our case,

m = 128

). Because the modulus of curve

Ted 127

-

glv 4

is a Mersenne-like prime of the form

p = 2^{127} - 5997

, the incomplete reduction method can be applied more advantageously.

Let

a = a_{0} + a_{1} i

and

b = b_{0} + b_{1} i

be two arbitrary elements in a finite field

F_{p^{2}}

. Field addition and subtraction over

F_{p^{2}}

can be computed by

a + b = (a_{0} + b_{0}) + (a_{1} + b_{1}) i

and

a - b = (a_{0} - b_{0}) + (a_{1} - b_{1}) i

, respectively. Field inversion over

F_{p^{2}}

can be computed by

a^{- 1} = (a_{0} - a_{1} i) / (a_{0}^{2} + a_{1}^{2})

.

We utilize Karatsuba multiplication to compute field multiplication over

F_{p^{2}}

. The Karatsuba multiplication uses the fact that

a \cdot b = (a_{0} + a_{1} i) (b_{0} + b_{1} i) = (a_{0} b_{0} - a_{1} b_{1}) + {(a_{0} + a_{1}) (b_{0} + b_{1}) - a_{0} b_{0} - a_{1} b_{1}} i

, which can be computed by

3 M_{1} + 3 A_{1} + 2 A_{i}

operations. It requires

1 A_{1} + 2 A_{i}

more operations but saves 1

M_{1}

operations compared to general multiplication methods, which require

4 M_{1} + 2 A_{1}

operations. Because field multiplication requires more computational cost than the multi-precision addition and field addition, the Karatsuba multiplication has a computational advantage. Algorithm 2 describes field multiplication over

F_{p^{2}}

using the Karatsuba multiplication and the incomplete reduction method.

Algorithm 3 describes field squaring over

F_{p^{2}}

using the incomplete reduction method. Note that

a^{2} = (a_{0}^{2} - a_{1}^{2}) + 2 a_{0} a_{1} i = (a_{0} + a_{1}) (a_{0} - a_{1}) + 2 a_{0} a_{1} i

. The first representation can be computed by

1 M_{1} + 2 S_{1} + 1 A_{1} + 1 A_{i}

operations, and the remaining representation can be computed by

2 M_{1} + 1 A_{1} + 2 A_{i}

operations. Because

1 M_{1}

operation can be implemented faster than

2 S_{2}

operations, we use

2 M_{1} + 1 A_{1} + 2 A_{i}

operations to compute field squaring over

F_{p^{2}}

. The results of steps 3 and 4 in Algorithm 2 and steps 1 and 3 in Algorithm 3 were represented by the incompletely reduced form.

Algorithm 2: Field multiplication over

F_{p^{2}}

[25].

Require:: $a = a_{0} + a_{1} i, b = b_{0} + b_{1} i \in F_{p^{2}}$ , $p = 2^{127} - 5997$ .
Ensure:: $c = a \cdot b = c_{0} + c_{1} i \in F_{p^{2}}$ .
1:: $t_{1} \leftarrow a_{0} \times b_{0} (mod p)$ { $M_{1}$ }
2:: $t_{2} \leftarrow a_{1} \times b_{1} (mod p)$ { $M_{1}$ }
3:: $t_{3} \leftarrow a_{0} + a_{1}$ { $A_{i}$ }
4:: $c_{1} \leftarrow b_{0} + b_{1}$ { $A_{i}$ }
5:: $c_{1} \leftarrow c_{1} \times t_{3} (mod p)$ { $M_{1}$ }
6:: $c_{1} \leftarrow c_{1} - t_{1} (mod p)$ { $A_{1}$ }
7:: $c_{1} \leftarrow c_{1} - t_{2} (mod p)$ { $A_{1}$ }
8:: $c_{0} \leftarrow t_{1} - t_{2} (mod p)$ { $A_{1}$ }
9:: returnc.

Algorithm 3: Field squaring over

F_{p^{2}}

[25].

Require:: $a = a_{0} + a_{1} i \in F_{p^{2}}$ , $p = 2^{127} - 5997$ .
Ensure:: $c = a^{2} = c_{0} + c_{1} i \in F_{p^{2}}$ .
1:: $t_{1} \leftarrow a_{0} + a_{1}$ { $A_{i}$ }
2:: $t_{2} \leftarrow a_{0} - a_{1} (mod p)$ { $A_{1}$ }
3:: $t_{3} \leftarrow a_{0} + a_{0}$ { $A_{i}$ }
4:: $c_{0} \leftarrow t_{1} \times t_{2} (mod p)$ { $M_{1}$ }
5:: $c_{1} \leftarrow t_{3} \times a_{1} (mod p)$ { $M_{1}$ }
6:: returnc.

4.5. Optimization Strategy on 8-Bit AVR

The AVR processor is a family of 8-bit microcontrollers that is widely used in MICA2/MICAz sensor motes. The AVR processors are equipped with an 8-bit integer multiplier and register file with 32×8-bit general registers that are numbered from

R 0

to

R 31

. Registers

R 26 : R 27

,

R 28 : R 29

, and

R 30 : R 31

pairs are used as 16-bit indirect address registers called

X

,

Y

, and

Z

. The automatic increment and decrement addressing modes are supported on all

X

,

Y

, and

Z

registers, and

Y

and

Z

support fixed positive displacement.

R 0

and

R 1

registers store the 16-bit results of

8 \times 8

-bit multiplication. The AVR processors provide a typical 8-bit reduced instruction set computer (RISC) instruction set. The most important instructions for ECC are

8 \times 8

-bit multiplication (

MUL

) and memory access (

LD, ST

) instructions, which require two cycles. Instructions between two registers, such as addition (

ADD, ADC

) or subtraction (

SUB, SBC

), require only one cycle. Therefore, the basic optimization strategy on 8-bit AVR is reducing the number of memory access instructions.

To simulate our implementations, we targeted the ATxmega256A3 processor [33]. This processor can be clocked up to 32 MHz and provides 256 KB of programmable flash memory, 16 KB of SRAM, and 4 KB of EEPROM.

Recently, Hutter and Schwabe [34] proposed a highly optimized Karatsuba multiplication for the 8-bit AVR processor. There are two variants of the Karatsuba multiplication method: the additive Karatsuba and subtractive Karatsuba methods. Algorithm 4 outlines the subtractive Karatsuba multiplication. We consider

n \times n

-bit multiplication, where n is even and

k = n / 2

(in our case,

n = 128

and

k = 64

). The additive Karatsuba method can be computed similarly to Algorithm 4. However, the additive Karatsuba method may produce the carry bits in the addition of two numbers

(a_{l} + a_{h})

and

(b_{l} + b_{h})

. The additional multiplication using the carry bits incurs a significant overhead for integer multiplication. The subtractive Karatsuba method does not produce carry bits in the computation of M, but computes two absolute values

| a_{l} - a_{h} |

and

| b_{l} - b_{h} |

. This overhead is not only smaller than the overhead required for the additive Karatsuba method, but can also be executed in constant-time. Therefore, we chose and implemented the subtractive Karatsuba multiplication for the 8-bit AVR implementation.

Algorithm 4: Subtractive Karatsuba multiplication [34].

Require:: $a = 2^{k} a_{h} + a_{l}$ , $b = 2^{k} b_{h} + b_{l} \in F_{p}$ for k-bit integers $a_{l}, a_{h}, b_{l},$ and $b_{h}$ .
Ensure:: $c = a \cdot b$ .
1:: Compute $L = a_{l} \cdot b_{l}$
2:: Compute $H = a_{h} \cdot b_{h}$
3:: Compute $M = | a_{l} - a_{h} | \cdot | b_{l} - b_{h} |$
4:: Set $t = 0$ , if $M = (a_{l} - a_{h}) \cdot (b_{l} - b_{h})$ , $t = 1$ otherwise
5:: Compute $\hat{M} = {(- 1)}^{t} M = (a_{l} - a_{h}) \cdot (b_{l} - b_{h})$
6:: $c = A \cdot B = L + 2^{k} (L + H - \hat{M}) + 2^{n} H$
7:: returnc.

For integer squaring, we chose the sliding block doubling (SBD) method [35], which is more efficient than the subtractive Karatsuba method in the case of 128-bit operands on 8-bit AVR. To improve the performance of field arithmetic, we combined integer multiplication and squaring with modular reduction.

4.6. Optimization Strategy on 16-Bit MSP430X

The MSP430X processor was designed as an ultra-low power microcontroller based on the 16-bit RISC CPU. The MSP430X CPU has 16 20-bit registers that are numbered from

R 0

to

R 15

. Registers

R 0

to

R 3

are special-purpose registers that are used as the program counter, stack pointer, status register, and constant generator, respectively. Registers

R 4

to

R 15

are general-purpose registers that are used to store data values, address pointers, and index values.

The MSP430X instruction set does not include multiply and multiply-and-accumulate (MAC) instructions. Instead, the MSP430 family is equipped with a memory-mapped hardware multiplier. The hardware multiplier provides four different multiply operations (unsigned multiplication, signed multiplication, unsigned multiplication and accumulation, and signed multiplication and accumulation) for the first operand, called

MPY, MPYS, MAC

, and

MACS

. The second operand register is common to all multiplier modes, called

OP 2

. Namely, the first operand determines the operation type of the multiplier, but does not start the operation. Writing the second operand to the

OP 2

register starts the selected multiplication with two values. The multiplication result is written in three result registers

RESLO, RESHI

, and

SUMEXT

.

RESLO

stores the lower 16-bit of the result,

RESHI

stores the upper 16-bit of the result, and

SUMEXT

stores the carry bit or sign of the result.

The MSP430X processor provides seven addressing modes for the source operand and four addressing modes for the destination operand. The total computation time depends on the instruction format and the addressing modes for the operand. Instructions between two CPU registers only require one cycle. However, memory access instruction (

MOV

) requires two to six cycles depending on addressing modes of operands. To improve the performance of field arithmetic, reducing the number of memory access instructions and efficiently utilizing MAC operations are the basic optimization strategies.

In our implementations, we targeted the MSP430FR5969 processor [36]. This processor is equipped with 64 KB of program flash memory and 2 KB of RAM and can be clocked up to 16 MHz.

For integer multiplication on 16-bit MSP430X processor, we chose and implemented the product scanning multiplication. Algorithm 5 outlines the product scanning method for multi-precision multiplication. The first loop in Algorithm 5 computes the lower half of the multiplication result c, and the second loop computes the upper half of the result c. It accumulates partial multiplications of the inner loop

a_{j} \times b_{i - j}

and these operations can be efficiently computed using the MAC operations of the hardware multiplier. Specifically, two 16-bit operands are multiplied and the results are added to the intermediate value s, which is held in

RESLO, RESHI

, and

SUMEXT

.

In Four

Q

[24], integer squaring was implemented using the SBD method [35]. We utilize the product scanning method for 128-bit integer squaring on 16-bit MSP430X. It can be easily implemented by modifying the product scanning multiplication. Additionally, this method results in better performance than the SBD method in Four

Q

. The implementation results can be found in Section 6.2.

Algorithm 5: Product scanning multiplication.

Require:: $a = (a_{m - 1}, \dots, a_{0}), b = (b_{m - 1}, \dots, b_{0}) \in F_{p}$ .
Ensure:: $c = a \cdot b = (c_{2 m - 1}, \dots c_{0})$ .
1:: $s \leftarrow 0$
2:: fori from 0 to $m - 1$ do
3:: for j from 0 to i do
4:: $s \leftarrow s + a_{j} \cdot b_{i - j}$
5:: end for
6:: $c_{i} \leftarrow s (mod 2^{w})$
7:: $s \leftarrow s / 2^{w}$
8:: end for
9:: fori from m to $2 m - 2$ do
10:: for j from $i - m + 1$ to $m - 1$ do
11:: $s \leftarrow s + a_{j} \cdot b_{i - j}$
12:: end for
13:: $c_{i} \leftarrow s (mod 2^{w})$
14:: $s \leftarrow s / 2^{w}$
15:: end for
16:: $c_{2 m - 1} \leftarrow s (mod 2^{w})$
17:: return $c = (c_{2 m - 1}, \dots, c_{0})$ .

4.7. Optimization Strategy on 32-Bit ARM

The ARM Cortex-M is a family of 32-bit RISC ARM processors for microcontrollers. The Cortex-M4 processor is a high-performance Cortex-M processor with digital signal processing (DSP), SIMD, and MAC instructions. It based on the ARMv7-M architecture and equipped with 16 32-bit general registers that are numbered from

R 0

to

R 15

. Registers

R 13

to

R 15

are special-purpose registers that are used for the stack pointer (SP), link register (LR), and program counter (PC), respectively. The Cortex-M4 instruction set provides multiply and MAC instructions, such as

UMULL

,

UMLAL

, and

UMAAL

. The

UMULL

instruction multiplies two unsigned 32-bit operands to obtain a 64-bit result. The

UMLAL

and

UMAAL

instructions multiply two unsigned 32-bit operands and accumulate a single 64-bit value and two 32-bit values.

In our implementations, we used the STM32F407-DISC1 board, which contains a 32-bit ARM Cortex-M4 STM32F407VGT6 microcontroller [37]. This microcontroller is equipped with 1 MB of flash memory, 192 KB of SRAM, and 64 KB of core-coupled memory (CCM) data RAM and can be clocked up to 168 MHz.

For integer multiplication and squaring, we implemented the operand scanning method by using efficient MAC operations. Additionally, these MAC operations facilitate an efficient implementation of modular reduction. The first reduction computes

c^{'} \equiv c_{l} + 2 \cdot 5997 \cdot c_{h}

, where

0 \leq c_{h}, c_{l} < 2^{128}

. For example, the intermediate values

c_{h}

are loaded in

R 9

to

R 12

and

c_{l}

are loaded in

R 5

to

R 8

. The constant

11994 = 2 \cdot 5997

is loaded in

R 3

and 0 is loaded in

R 4

. The computation

c^{'} \equiv c_{l} + 2 \cdot 5997 \cdot c_{h}

is performed as follows:

\begin{matrix} MOV R 3, # 11994, \\ MOV R 4, # 0, \\ UMLAL R 5, R 4, R 3, R 9, \\ UMAAL R 4, R 6, R 3, R 10, \\ UMAAL R 6, R 7, R 3, R 11, \\ UMAAL R 7, R 8, R 3, R 12 . \end{matrix}

The results of the first reduction

c^{'}

are held in

(R 5, R 4, R 6, R 7, R 8)

. The second reduction can be computed using simple multiplication (

MUL

) and addition (

ADD, ADC

) instructions.

For the further improvement of field arithmetic, we implemented field arithmetic over

F_{p^{2}}

at the assembly level [24,38]. In the case of field multiplication over

F_{p^{2}}

, we utilized the operand scanning multiplication with a lazy reduction method. This operation computes

a \cdot b = (a_{0} + a_{1} i) (b_{0} + b_{1} i) = (a_{0} b_{0} - a_{1} b_{1}) + (a_{0} b_{1} + a_{1} b_{0}) i

, where

a = a_{0} + a_{1} i, b = b_{0} + b_{1} i \in F_{p^{2}}

. The operand scanning method results in better performance than the Karatsuba multiplication in our case. The field squaring over

F_{p^{2}}

is implemented using

a^{2} = (a_{0} + a_{1}) (a_{0} - a_{1}) + 2 a_{0} a_{1} i

at the assembly level.

5. Implementation Details of Curve Arithmetic

In this section, we describe the scalar decomposition and curve arithmetic that are commonly used on three target platforms. Section 5.1 describes the scalar decomposition and recoding methods for multi-scalars. The details of point arithmetic, coordinate system, and endomorphisms are described in Section 5.2 and Section 5.3.

5.1. Scalar Decomposition

In this subsection, we describe the scalar decomposition method for a random integer

k \in [1, r]

and corresponding multi-scalars

(k_{0}, k_{1}, k_{2}, k_{3}) \in Z^{4}

such that

k \equiv k_{0} + k_{1} ϕ + k_{2} ψ + k_{3} ψ ϕ

as

max (k_{i}) < C r^{1 / 4}

for

0 \leq i \leq 3

and some explicit constant

C > 0

. We assume that

ϕ \equiv λ (mod r)

and

ψ \equiv μ (mod r)

. Let F be a four-dimensional GLV-GLS reduction map defined by

\begin{matrix} F : Z^{4} & \to Z / n, \\ (k_{0}, k_{1}, k_{2}, k_{3}) & \mapsto k_{0} + k_{1} λ + k_{2} μ + k_{3} λ μ (mod r) . \end{matrix}

Let

B = (b_{0}, b_{1}, b_{2}, b_{3})

be a

4 \times 4

matrix consisting of four linearly independent vectors with

{max}_{i} | b_{i} | \leq C r^{1 / 4}

. Then, for any

k \in [1, r - 1]

, the decomposition method computes

(α_{0}, α_{1}, α_{2}, α_{3}) \in Q^{4}

and computes the multi-scalars

\begin{matrix} (k_{0}, k_{1}, k_{2}, k_{3}) = (k, 0, 0, 0) - \sum_{i = 0}^{3} ⌈ α_{i} ⌋ \cdot b_{i}, \end{matrix}

where

⌈ ⌋

represents a rounding operation. There are two typical methods for decomposing a scalar: the Babai rounding method [39] and division in a ring

Z [ϕ]

method, where

ϕ

is an efficiently computable endomorphism [40]. In [14], lattice reduction algorithms based on Cornacchia’s algorithms were proposed for finding a uniform basis. The first step is finding Cornacchia’s GCD in

Z

and the second step is using the Cornacchia’s algorithm in

Z [i]

. We utilize these two algorithms to find four linearly independent vectors

b_{0}, b_{1}, b_{2}, b_{3} \in

kerF, where the rectangle norms

< 51.5 \sqrt{2} r^{1 / 4}

. The coordinates of these vectors utilize the scalar decomposition. Additionally, the relationships of four vectors can reduce the number of fixed constants. Two vectors

b_{0} = (b_{0} [0], b_{0} [1], b_{0} [2], b_{0} [3])

and

b_{1} = (b_{1} [0], b_{1} [1], b_{1} [2], b_{1} [3])

can represent the remaining vectors

b_{2} = (- b_{0} [2], - b_{0} [3], b_{0} [0], b_{0} [1])

and

b_{3} = (- b_{1} [2], - b_{1} [3], b_{1} [0], b_{1} [1])

.

Let

B_{i}

be the

4 \times 4

matrix formed by replacing

b_{i}

in B with the vector

(1, 0, 0, 0)

. We then define four precomputed constants

h_{i} = det (B_{i})

, where

0 \leq i \leq 3

. The four-dimensional decomposition computes

α_{i} = ⌈ \frac{k \cdot h_{i}}{r} ⌋

using four integer multiplication, four integer divisions, and four rounding operations. Bos et al. [17] introduced an efficient rounding method for eliminating integer divisions. This method chooses an integer m such that

r < 2^{m}

, and precomputes the fixed constants

l_{i} = ⌈ \frac{h_{i}}{r} \cdot 2^{m} ⌋

. Then,

α_{i}

can be computed by

⌈ \frac{k \cdot l_{i}}{2^{m}} ⌋

, where the division by

2^{m}

can be computed by a shift operation. The four-dimensional decomposition of a random scalar k using curve

Ted 127

-

glv 4

can be computed as follows:

\begin{matrix} k_{0} = & k - α_{0} \cdot b_{0} [0] - α_{1} \cdot b_{1} [0] + α_{2} \cdot b_{0} [2] + α_{3} \cdot b_{1} [2], \\ k_{1} = & - α_{0} \cdot b_{0} [1] - α_{1} \cdot b_{1} [1] + α_{2} \cdot b_{0} [3] + α_{3} \cdot b_{1} [3], \\ k_{2} = & - α_{0} \cdot b_{0} [2] - α_{1} \cdot b_{1} [2] - α_{2} \cdot b_{0} [0] - α_{3} \cdot b_{1} [0], \\ k_{3} = & - α_{0} \cdot b_{0} [3] - α_{1} \cdot b_{1} [3] - α_{2} \cdot b_{0} [1] - α_{3} \cdot b_{1} [1] . \end{matrix}

However, Ref. [21] reported that this method yields the correct answer and

⌈ \frac{k \cdot h_{i}}{r} ⌋ - 1

. They also reported that a large size of m decreases the probability of a round-off error.

Because the multi-scalars

(k_{0}, k_{1}, k_{2}, k_{3})

lie between

- 2^{63}

and

2^{63}

, all coordinates are both positive and negative. Signed multi-scalars require additional cost to compute scalar multiplication. Costello and Longa [21] demonstrated the offset vectors such that all coordinates of the multi-scalars were always positive to simplify scalar recoding. However, this odd-only scalar recoding method requires that the first element

k_{0}

of the muli-scalars is always odd. For constant-time execution and odd-only recoding, they found two offset vectors

c_{1}

and

c_{2}

such that

(k_{0}, k_{1}, k_{2}, k_{3}) + c_{1}

and

(k_{0}, k_{1}, k_{2}, k_{3}) + c_{2}

are valid decompositions of the scalar k and one of the two multi-scalars had a first element that was odd. To utilize these methods for curve

Ted 127

-

glv 4

, we carefully chose two offset vectors

c_{1} = 2 b_{0} + b_{1} - 3 b_{2} - 4 b_{3}

and

c_{2} = 3 b_{0} + 2 b_{1} - 3 b_{2} - 2 b_{3}

. The multi-scalars

(k_{0}, k_{1}, k_{2}, k_{3}) + c_{1}

and

(k_{0}, k_{1}, k_{2}, k_{3}) + c_{2}

are valid decompositions of the scalar k. Finally, all four coordinates of the two decompositions are positive and less than

2^{65}

, and

k_{0}

in one of them is odd.

Because all coordinates of multi-scalars are less than

2^{65}

, scalar decomposition and recoding require more computational cost compared to Four

Q

-based implementation, which has coordinates of multi-scalars less than

2^{64}

. However, this additional cost is an extremely small portion of the scalar multiplication.

5.2. Point Arithmetic

To enhance the performance of scalar multiplication, the selections of efficient point arithmetic and coordinate system are one of the most crucial subjects. The extended Edwards coordinates of the form

(X : Y : Z : T)

were proposed by Hisil et al., where

T = X Y / Z

[30]. The extended Edwards coordinates are an extended version of the homogeneous coordinates of the form

(X : Y : Z)

. The identity element is represented by

(0 : 1 : 1 : 0)

and the negative element of

(X : Y : Z : T)

is represented by

(- X : Y : Z : - T)

.

Hisil et al. [30] proposed dedicated addition and doubling formulas that are independent of the curve parameter d. Given

(X_{1} : Y_{1} : Z_{1} : T_{1})

and

(X_{2} : Y_{2} : Z_{2} : T_{2})

of distinct points with

Z_{1} \neq 0

and

Z_{2} \neq 0

, the ECADD operation

(X_{3} : Y_{3} : Z_{3} : T_{3}) = (X_{1} : Y_{1} : Z_{1} : T_{1}) + (X_{2} : Y_{2} : Z_{2} : T_{2})

can be computed as follows:

\begin{matrix} X_{3} & = (X_{1} Y_{2} - Y_{1} X_{2}) (T_{1} Z_{2} + Z_{1} T_{2}), \\ Y_{3} & = (Y_{1} Y_{2} + a X_{1} X_{2}) (T_{1} Z_{2} - Z_{1} T_{2}), \\ Z_{3} & = (T_{1} Z_{2} - Z_{1} T_{2}) (T_{1} Z_{2} - Z_{1} T_{2}), \\ T_{3} & = (Y_{1} Y_{2} + a X_{1} X_{2}) (X_{1} Y_{2} - Y_{1} X_{2}) . \end{matrix}

Similarly, given

(X_{1} : Y_{1} : Z_{1} : T_{1})

with

Z_{1} \neq 0

, the ECDBL operation

(X_{3} : Y_{3} : Z_{3} : T_{3}) = 2 (X_{1} : Y_{1} : Z_{1} : T_{1})

can be computed as follows:

\begin{matrix} X_{3} & = 2 X_{1} Y_{1} (2 Z_{1}^{2} - Y_{1}^{2} - a X_{1}^{2}), \\ Y_{3} & = (Y_{1}^{2} + a X_{1}^{2}) (Y_{1}^{2} - a X_{1}^{2}), \\ Z_{3} & = 2 X_{1} Y_{1} (Y_{1}^{2} - a X_{1}^{2}), \\ T_{3} & = (Y_{1}^{2} + a X_{1}^{2}) (2 Z_{1}^{2} - Y_{1}^{2} - a X_{1}^{2}) . \end{matrix}

Hamburg [41] proposed extensible coordinates of the form

(X : Y : Z : T_{a} : T_{b})

, where

T = T_{a} \cdot T_{b}

. The final step of the ECADD and ECDBL operations using extended Edwards coordinates computes

T = T_{a} \cdot T_{b}

. However, the extensible coordinates store the coordinates T as

T_{a}

and

T_{b}

, and compute T when required for point arithmetic. For the further improvement of the ECADD operation, the precomputed point Q is represented in the form

(X + Y, Y - X, 2 Z, 2 T)

[25]. This method eliminates two multiplication by 2 operations and two field additions over

F_{p}

compared to the extended Edwards coordinates. In the case of the ECDBL operation, we utilize the transformation

2 X Y = {(X + Y)}^{2} - X^{2} - Y^{2}

to reduce the number of multiplications. It can be computed by converting one field multiplication and one field addition over

F_{p^{2}}

to one field squaring, two field subtractions over

F_{p^{2}}

. Algorithms 6 and 7 describe the extensible coordinates of the ECADD and ECDBL operations over

F_{p^{2}}

with a curve parameter

a = - 1

, which require

8 M_{2} + 6 A_{2}

and

3 M_{2} + 4 S_{2} + 6 A_{2}

operations, respectively.

Algorithm 6: Twisted Edwards point addition over

F_{p^{2}}

.

Require:: $P = (X_{1}, Y_{1}, Z_{1}, T_{a}, T_{b})$ where $T_{1} = T_{a} \cdot T_{b}$ and $Q = (X_{2} + Y_{2}, Y_{2} - X_{2}, 2 Z_{2}, 2 T_{2})$ .
Ensure:: $P + Q = (X_{3}, Y_{3}, Z_{3}, T_{a}, T_{b})$ where $T_{3} = T_{a} \cdot T_{b}$ .
1:: $t_{2} \leftarrow T_{a} \times T_{b}$ { $M_{2}$ }
2:: $t_{2} \leftarrow t_{2} \times 2 Z_{2}$ { $M_{2}$ }
3:: $t_{1} \leftarrow 2 T_{2} \times Z_{1}$ { $M_{2}$ }
4:: $T_{a} \leftarrow t_{2} - t_{1}$ { $A_{2}$ }
5:: $T_{b} \leftarrow t_{2} + t_{1}$ { $A_{2}$ }
6:: $t_{2} \leftarrow X_{1} + Y_{1}$ { $A_{2}$ }
7:: $t_{2} \leftarrow (Y_{2} - X_{2}) \times t_{2}$ { $M_{2}$ }
8:: $t_{1} \leftarrow Y_{1} - X_{1}$ { $A_{2}$ }
9:: $t_{2} \leftarrow (X_{2} + Y_{2}) \times t_{1}$ { $M_{2}$ }
10:: $Z_{3} \leftarrow t_{1} - t_{2}$ { $A_{2}$ }
11:: $t_{1} \leftarrow t_{1} + t_{2}$ { $A_{2}$ }
12:: $X_{3} \leftarrow T_{b} \times Z_{3}$ { $M_{2}$ }
13:: $Z_{3} \leftarrow t_{1} \times Z_{3}$ { $M_{2}$ }
14:: $Y_{3} \leftarrow t_{a} \times t_{1}$ { $M_{2}$ }
15:: return $P + Q = (X_{3}, Y_{3}, Z_{3}, T_{a}, T_{b})$ where $T = T_{a} \cdot T_{b}$ .

To demonstrate the efficiency of the twisted Edwards curves, we compare it to the cost of a short Weierstrass elliptic curve. The ECADD and ECDBL operations of a short Weierstrass curve of the form

y^{2} = x^{3} + a x + b

over

F_{p^{2}}

using Jacobian coordinates require

11 M_{2} + 5 S_{2} + 9 A_{2}

and

1 M_{2} + 8 S_{2} + 10 A_{2} + 1 M_{d}

operations. The ECADD operation of the twisted Edwards curve using extensible coordinates saves

3 M_{2} + 5 S_{2} + 3 A_{2}

operations. The ECDBL operation requires

2 M_{2}

additional operations but saves

4 S_{2} + 5 A_{2} + 1 M_{d}

operations. Therefore, the twisted Edwards curves using extensible coordinates have a computational advantage compared to short Weierstrass curves using Jacobian coordinates.

Algorithm 7: Twisted Edwards point doubling over

F_{p^{2}}

.

Require:: $P = (X_{1}, Y_{1}, Z_{1})$ .
Ensure:: $2 P = (X_{3}, Y_{3}, Z_{3}, T_{a}, T_{b})$ where $T_{3} = T_{a} \cdot T_{b}$ .
1:: $t_{1} \leftarrow X_{1}^{2}$ { $S_{2}$ }
2:: $t_{2} \leftarrow Y_{1}^{2}$ { $S_{2}$ }
3:: $T_{b} \leftarrow t_{1} + t_{2}$ { $A_{2}$ }
4:: $T_{a} \leftarrow X_{1} + X_{1}$ { $A_{2}$ }
5:: $T_{a} \leftarrow T_{a}^{2}$ { $S_{2}$ }
6:: $t_{1} \leftarrow t_{2} - T_{1}$ { $A_{2}$ }
7:: $t_{2} \leftarrow Z_{1}^{2}$ { $S_{2}$ }
8:: $T_{a} \leftarrow T_{a} - T_{b}$ { $A_{2}$ }
9:: $t_{2} \leftarrow t_{2} + t_{2}$ { $A_{2}$ }
10:: $t_{2} \leftarrow t_{2} - t_{1}$ { $A_{2}$ }
11:: $Y_{3} \leftarrow t_{b} \times t_{1}$ { $M_{2}$ }
12:: $X_{3} \leftarrow T_{a} \times t_{2}$ { $M_{2}$ }
13:: $Z_{3} \leftarrow t_{1} \times t_{2}$ { $M_{2}$ }
14:: return $2 P = (X_{3}, Y_{3}, Z_{3}, T_{a}, T_{b})$ where $T_{3} = T_{a} \cdot T_{b}$ .

5.3. Endomorphisms

In [25], the formulas for the endomorphisms

ϕ

and

ψ

are described. To reduce the number of representation conversions, we represent the results of endomorphism operations using extensible coordinates. Let

P = (X_{1}, Y_{1}, Z_{1})

be a point in curve

Ted 127

-

glv 4

represented by homogeneous projective coordinates. Then,

ϕ (P) = (X_{2}, Y_{2}, Z_{2}, T_{a}, T_{b})

, where

T = T_{a} \cdot T_{b}

can be computed as follows:

\begin{matrix} X_{2} & = - X_{1} (α Y_{1}^{2} + θ Z_{1}^{2}) (σ Y_{1}^{2} - β Z_{1}^{2}), \\ Y_{2} & = 2 Y_{1} Z_{1}^{2} (β Y_{1}^{2} + γ Z_{1}^{2}), \\ Z_{2} & = 2 Y_{1} Z_{1}^{2} (σ Y_{1}^{2} - β Z_{1}^{2}), \\ T_{a} & = - X_{1} (α Y_{1}^{2} + θ Z_{1}^{2}), \\ T_{b} & - (β Y_{1}^{2} + γ Z_{1}^{2}), \end{matrix}

where

α = ζ_{8}^{3} + 2 ζ_{8}^{2} + ζ_{8}, θ = ζ_{8}^{3} - 2 ζ_{8}^{2} + ζ_{8}, σ = 2 ζ_{8}^{3} + ζ_{8}^{2} - 1, γ = 2 ζ_{8}^{3} - ζ_{8}^{2} + 1

and

β = ζ_{8}^{2} - 1

. We also utilize the fixed values for curve

Ted 127

-

glv 4

as follows:

\begin{matrix} ζ_{8} & = 1 + A i, & σ & = (A - 1) + (A + 1) i, & θ & = A + B i, \\ α & = A + 2 i, & γ & = (A + 1) + (A - 1) i, & β & = B + 1 + i, \end{matrix}

where A = 143485135153817520976780139629062568752 and B = 1701411834604692317316873037158840

99729. The endomorphism

ϕ

can be computed by using

11 M_{2} + 2 S_{2} + 5 A_{2}

or

7 M_{2} + 1 S_{2} + 5 A_{2}

operations in the case

Z_{1} = 1

.

Similarly,

ψ (P) = (X_{2}, Y_{2}, Z_{2}, T_{a}, T_{b})

, where

T_{2} = T_{a} \cdot T_{b}

can be computed as follows:

X_{2} = ζ_{8} X_{1}^{p} Y_{1}^{p}, Y_{2} = {(Z_{1}^{p})}^{2}, Z_{2} = Y_{1}^{p} Z_{1}^{p},

T_{a} = ζ_{8} X_{1}^{p}, T_{b} = Z_{1}^{p} .

The endomorphism

ψ

can be computed using

3 M_{2} + 1 S_{2} + 1.5 A_{2}

or

2 M_{2} + 1 A_{2}

operations in the case

Z_{1} = 1

. Because the endomorphism

ψ

requires fewer operations than the endomorphism

ϕ

,

ψ (ϕ (P))

can be computed on the order of

ϕ (P)

with

Z_{1} = 1

and

ψ (ϕ (P))

.

6. Performance Analysis and Implementation Results

In this section, we analyze the operation counts and implementation results of variable-base scalar multiplication using curve

Ted 127

-

glv 4

on AVR (Microchip Technology Inc., Chandler, AZ, USA), MSP430 (Texas Instruments, Dallas, TX, USA), and ARM (ARM holdings plc, Cambridge, UK) processors. We performed simulations and evaluations using the IAR Embedded Workbench for AVR 6.80.7 (IAR systems, Uppsala, Sweden), IAR Embedded Workbench for MSP430 7.10.2 (IAR systems, Uppsala, Sweden), and STM32F4-DISC1 board (STMicroelectronics, Geneva, Switzerland) with the IAR Embedded Workbench for ARM 8.11.1 (IAR systems, Uppsala, Sweden). All implementations were set to the medium optimization level.

6.1. Operation Counts

Table 1 and Table 2 describe the operation counts of field arithmetic over

F_{p^{2}}

and their conversion into field arithmetic over

F_{p}

for curve

Ted 127

-

glv 4

and Four

Q

using Algorithm 1. Because both curves support the four-dimensional decomposition, the operation counts for Algorithm 1 can be compared step by step.

Step 1 of Algorithm 1 computes three endomorphisms

ϕ (P), ψ (P)

, and

ϕ (ψ (P))

, and requires

73 M_{2} + 27 S_{2} + 59.5 A

operations for Four

Q

and

13 M_{2} + 2 S_{2} + 11.5 A_{2}

operations for curve

Ted 127

-

glv 4

. Step 2 requires seven ECADD operations, which require

49 M_{2} + 28 A_{2}

operations for Four

Q

and

56 M_{2} + 42 A_{2}

operations for curve

Ted 127

-

glv 4

. However, these outputs are all converted for faster ECADD computations, which require

14 M_{2} + 28 A_{2}

operations for Four

Q

and

7 M_{2} + 28 A_{2}

operations for curve

Ted 127

-

glv 4

. Steps 3 and 4 require only bit and integer operations for all positive scalar decomposition and fixed-length recoding operations. Step 5 requires

1 A_{2}

operations for one point negation and one table lookup, and a conversion to extensible coordinates

(X, Y, Z, T_{a}, T_{b})

for the initial point Q, which require

2 A_{2}

operations. Steps 6 to 9 require 64 ECDBL operations, 64 ECADD operations, 64 point negations, and 64 table lookups for Four

Q

, and 65 ECDBL operations, 65 ECADD operations, 65 point negations, and 65 table lookups for curve

Ted 127

-

glv 4

. The operation counts of these steps are

704 M_{2} + 256 S_{2} + 835 A_{2}

for Four

Q

and

715 M_{2} + 260 S_{2} + 845 A_{2}

for curve

Ted 127

-

glv 4

. Step 10 requires

1 I_{2} + 2 M_{2}

operations for the normalization of the result point Q.

Variable-base scalar multiplication using the four-dimensional decomposition requires

1 I_{2} + 842 M_{2} + 283 S_{2} + 950.5 A_{2}

operations for Four

Q

and

1 I_{2} + 793 M_{2} + 262 S_{2} + 929.5 A_{2}

operations for curve

Ted 127

-

glv 4

. The curve

Ted 127

-

glv 4

requires

49 M_{2} + 21 S_{2} + 21 A_{2}

fewer operations than Four

Q

because the endomorphisms in curve

Ted 127

-

glv 4

are efficiently computable. However, the operation counts of field inversion over

F_{p}

for Four

Q

and curve

Ted 127

-

glv 4

are

I_{1} = 10 M_{1} + 126 S_{1}

and

I_{1} = 13 M_{1} + 126 S_{1}

, respectively. Therefore, we convert the operation counts of the field arithmetic over

F_{p^{2}}

to the field arithmetic over

F_{p}

. Field arithmetic over

F_{p^{2}}

can be represented by field arithmetic over

F_{p}

as follows:

\begin{matrix} I_{2} & = 1 I_{1} + 2 M_{1} + 2 S_{1} + 2 A_{1}, & M_{2} & = 3 M_{1} + 3 A_{1} + 2 A_{i}, \\ S_{2} & = 2 M_{1} + 1 A_{1} + 2 A_{i}, & A_{2} & = 2 A_{1} . \end{matrix}

The operation counts

1 I_{2} + 842 M_{2} + 283 S_{2} + 950.5 A_{2}

can be represented by

3104 M_{1} + 128 S_{1} + 4712 A_{1} + 2250 A_{i}

for Four

Q

and

1 I_{2} + 793 M_{2} + 262 S_{2} + 929.5 A_{2}

can be represented by

2918 M_{1} + 128 S_{1} + 4504 A_{1} + 2124 A_{i}

for curve

Ted 127

-

glv 4

. The scalar multiplication using curve

Ted 127

-

glv 4

saves

186 M_{1} + 208 A_{1} + 126 A_{i}

operations compared to Four

Q

-based scalar multiplication. Therefore, we can deduce that the four-dimensional scalar multiplication using curve

Ted 127

-

glv 4

can be faster than Four

Q

-based implementation when field arithmetic is efficiently implemented.

6.2. Implementation Results of Field Arithmetic

Table 3 lists how many cycles are used for field arithmetic over

F_{p}

and

F_{p^{2}}

on AVR, MSP430, and ARM processors, including function call overhead. The field inversions

F_{p}

and

F_{p^{2}}

are the average cycles performed

10^{4}

times and remaining the field arithmetic is the average cycles performed

10^{7}

times. To evaluate the implementation of field arithmetic for curve

Ted 127

-

glv 4

, we compare the number of cycles for its implementation with Four

Q

, which provides the fastest implementation results to date [24].

We will now compare the number of cycles for field arithmetic on 8-bit AVR processor. The field arithmetic over

F_{p}

for curve

Ted 127

-

glv 4

on 8-bit AVR requires 198, 196, 1221, 1796, and 176,901 cycles to compute addition, subtraction, squaring, multiplication, and inversion over

F_{p}

, respectively. Similarly, the field arithmetic for Four

Q

on AVR requires 155, 159, 1026, 1598, and 150,535 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p}

, respectively. The curve

Ted 127

-

glv 4

requires 43, 37, 195, 198, and 26,366 more cycles than Four

Q

for these operations, respectively. The field arithmetic over

F_{p^{2}}

for curve

Ted 127

-

glv 4

requires 452, 448, 4093, 6277, and 183,345 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p^{2}}

, respectively. These same operations for Four

Q

require 384, 385, 3622, 5758, and 156,171 cycles, respectively. The curve

Ted 127

-

glv 4

requires 68, 63, 471, 519, and 27,174 more cycles than Four

Q

for these operations, respectively.

In the case of the 16-bit MSP430X processor, field arithmetic over

F_{p}

for curve

Ted 127

-

glv 4

requires 120, 126, 837, 1087, and 119,629 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p}

. The same operations for Four

Q

requires 102, 101, 927, 1027, and 131,819 cycles, respectively. The curve

Ted 127

-

glv 4

requires 18, 25, and 60 more cycles than Four

Q

to compute addition, subtraction, and multiplication, respectively. However, it saves 90 and 12,190 cycles than Four

Q

to compute squaring and inversion over

F_{p}

, respectively. The field arithmetic over

F_{p^{2}}

for curve

Ted 127

-

glv 4

requires 266, 278, 2476, 3806, and 123,740 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p^{2}}

, respectively. These operations for Four

Q

require 233, 231, 2391, 3624, and 135,315 cycles, respectively. The curve

Ted 127

-

glv 4

requires 33, 47, 85, and 182 more cycles than Four

Q

to compute addition, subtraction, squaring, and multiplication, respectively. It saves 11,575 cycles than Four

Q

to compute inversion over

F_{p^{2}}

.

In the 32-bit ARM Cortex-M4 processor, field arithmetic for curve

Ted 127

-

glv 4

requires 55, 55, 88, 99, and 12,135 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p}

, respectively. However, Ref. [24] does not report the implementation results of field arithmetic over

F_{p}

. The field arithmetic over

F_{p^{2}}

for curve

Ted 127

-

glv 4

requires 82, 82, 196, 341, and 12,612 cycles to compute field addition, subtraction, squaring, multiplication, and inversion over

F_{p^{2}}

, respectively. These operations for Four

Q

require 84, 86, 215, 358, and 21,056 cycles, respectively. The curve

Ted 127

-

glv 4

saves 2, 4, 20, 17, and 8444 cycles than Four

Q

to compute addition, subtraction, squaring, multiplication, and inversion over

F_{p^{2}}

, respectively.

One can see that the field arithmetic over

F_{p}

in Four

Q

on AVR and MSP430 is typically faster than curve

Ted 127

-

glv 4

. This difference occurs because the primes of both curves are different, with a Mersenne prime of the form

p = 2^{127} - 1

in Four

Q

and a Mersenne-like prime of the form

p = 2^{127} - 5997

in curve

Ted 127

-

glv 4

. Let

p = 2^{127} - δ

, where

δ

is small. The modular reduction step can be computed by

c = c_{h} \cdot 2^{128} + c_{l} \equiv c_{l} + 2 \cdot δ \cdot c_{h} (mod p)

. In this process, Four

Q

can be efficiently computed using simple shift operations because

δ = 1

, but the curve

Ted 127

-

glv 4

requires more instructions because it uses multiplication by

δ = 5997 = 0 x 176 d

. In the 8-bit AVR implementation,

0 x 176 d

can be represented by two 8-bit words as

0 x 17

and

0 x 6 d

. Therefore, the operation

c_{l} + 2 \cdot δ \cdot c_{h} (mod p)

requires more 8 × 8-bit multiplications and accumulations. Unlike the AVR implementation,

0 x 176 d

can be represented by one word in the MSP430 and ARM CPUs. Additionally, these CPUs provide efficient MAC instructions. Therefore, the modular reduction on MSP430 and ARM implementations require fewer additional instructions than the AVR implementation.

In the case of the MSP430, field squaring over

F_{p}

in curve

Ted 127

-

glv 4

is faster than in Four

Q

. The field squaring over

F_{p}

in curve

Ted 127

-

glv 4

requires 837 cycles, whereas Four

Q

requires 927 cycles. Our implementation saves 9.71% of the cycles for field squaring over

F_{p}

compared to the SBD method, despite the modular reduction overhead. Additionally, the principal operation of inversion is field squaring over

F_{p}

, our implementation saves 9.32% and 8.55% of the cycles for inversion over

F_{p}

and

F_{p^{2}}

. For field squaring over

F_{p^{2}}

, field squaring over

F_{p}

is not required because it can be computed by

2 M_{1} + 1 A_{1} + 2 A_{i}

operations. Therefore, field squaring over

F_{p^{2}}

for

Ted 127

-

glv 4

requires more cycles than Four

Q

.

6.3. Implementation Results of Scalar Multiplication

Table 4 summarizes the implementation results of variable-base scalar multiplication compared to the previous implementations on the 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors. We measured the average cycles for our variable-base scalar multiplication by running it

10^{3}

times with random scalars k. For comparison, Table 4 includes the previous implementations that guarantee constant-time execution. These were implemented using various elliptic curves, such as NIST P-256 [42,43], Curve25519 [44,45,46],

μ

Kummer [47], and Four

Q

[24]. These curves are designed such that the bit-length of the curve order is slightly smaller than 256-bit for efficient implementation. NIST P-256 has a 256-bit curve order, but Curve25519,

μ

Kummer, Four

Q

, and curve

Ted 127

-

glv 4

have 252-bit, 250-bit, 246-bit, and 251-bit curve orders, respectively. Therefore, these curves provide approximately 128-bit security levels.

We will now summarize the implementation results of previous works on embedded devices that provide approximately 128-bit security levels. Wenger and Werner [42] and Wenger et al. [43] implemented the scalar multiplication using the NIST P-256 curve on various 16-bit microcontrollers and 8-bit, 16-bit, and 32-bit microcontrollers. Hutter and Schwabe [48] implemented the NaCl library on 8-bit AVR processor, which provides a Curve25519 scalar multiplication. Hinterwälder et al. [44] implemented a Diffie–Hellman key exchange on MSP430X processor using 16-bit and 32-bit hardware multipliers. In 2015, Düll et al. [45] implemented a Curve25519 scalar multiplication of on 8-bit, 16-bit, and 32-bit microcontrollers. Renes et al. [47] implemented a Montgomery ladder scalar multiplication on the Kummer surface of a genus 2 hyperelliptic curve on 8-bit AVR and 32-bit ARM Cortex-M0 processors. Faz-Hernández et al. [25] proposed an efficient implementation of the four-dimensional GLV-GLS scalar multiplication using curve

Ted 127

-

glv 4

on Intel and ARM processors.

The implementation results of variable-base scalar multiplication set new speed records on the 16-bit MSP430 and 32-bit ARM Cortex-M4 processors. Scalar multiplication using curve

Ted 127

-

glv 4

on AVR, MSP430, and ARM requires 6,856,026, 4,158,453, and 447,836 cycles, respectively. Compared to the previous fastest implementation, namely Four

Q

[24], which require 6,561,500, 4,280,400, and 469,500 cycles on AVR, MSP430, and ARM, respectively, our implementation requires 4.49% more cycles on AVR, but saves 2.85% and

4.61 %

cycles on MSP430X and ARM processors, respectively. Compared to

μ

Kummer [47], which requires 9,513,536 cycles on AVR, our implementation saves 27.93% cycles. It also saves 50.68% and 47.58% cycles than Düll et al.’s Curve25519 implementation [45], which requires 13,900,397 and 7,933,296 cycles on AVR and MSP430, respectively. It saves 69.92% cycles compared to the NaCl library [48], which requires 22,791,579 cycles on AVR. It saves 54.50% cycles than Hinterwälder et al.’s Curve25519 implementation [44], which requires 9,139,739 cycles on MSP430. Additionally, it saves 68.54% cycles compared to the method in [46], which requires 1,423,667 cycles on the ARM Cortex-M4 processor.

The memory of embedded processors is very constrained, meaning the memory usage of various implementations is important. In the case of the 8-bit AVR,

μ

Kummer [47] requires the lowest memory usage in the recently proposed results, which requires 9490 bytes of code size and 99 bytes of stack memories. Wenger et al.’s and Düll et al.’s implementations [43,45] require the lowest code size and stack memories on MSP430, which require 8378 bytes of code size and 384 bytes of stack memories. In the 32-bit ARM, Ref. [46] require 3750 bytes of code size and 740 bytes of stack memories. Four

Q

[24] reported the memory usage of ECDH and signature operations, but did not report the memory usage of single scalar multiplication. Our implementations for curve

Ted 127

-

glv 4

requires 13,891, 9098, and 7532 bytes of code size and 2539, 2568, and 2792 bytes of stack memories on AVR, MSP430, and ARM Cortex-M4, respectively. Four

Q

and curve

Ted 127

-

glv 4

, which utilize the four-dimensional decompositions, precompute eight points, meaning they require more stack memory than other implementations. However, the performance of four-dimensional scalar multiplication is significantly faster than other implementations.

7. Conclusions

In this paper, we presented the first constant-time implementations of four-dimensional GLV-GLS scalar multiplication using curve

Ted 127

-

glv 4

on 8-bit ATxmega256A3, 16-bit MSP430FR5969, and 32-bit ARM Cortex-M4 processors. We also optimized the performance of internal algorithms in scalar multiplication on three target processors. Our implementations for single scalar multiplication on AVR require 4.49% more cycles than Four

Q

-based implementation, but save 2.85% and 4.61% cycles on MSP430 and ARM Cortex-M4, respectively. Our analysis and implementation results demonstrate that efficiently computable endomorphisms can accelerate scalar multiplication, even when using prime numbers that provide inefficient field arithmetic. Our implementations highlight that the four-dimensional GLV-GLS scalar multiplication using curve

Ted 127

-

glv 4

is one of the suitable elliptic curves for constructing ECC-based applications for resource-constrained embedded devices.

Author Contributions

J.K. designed and implemented the presented software. S.C.S. and S.H. analyzed the experimental results and improved the choice of internal algorithms of scalar multiplication.

Acknowledgments

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2014-6-00910, Study on Security of Cryptographic Software).

Conflicts of Interest

The authors declare no conflict of interest.

References

Shojafar, M.; Canali, C.; Lancellotti, R.; Baccarelli, E. Minimizing computing-plus-communication energy consumptions in virtualized networked data centers. In Proceedings of the 2016 IEEE Symposium on Computers and Communication (ISCC), Messina, Italy, 27–30 June 2016; pp. 1137–1144. [Google Scholar]
Baccarelli, E.; Naranjo, P.G.V.; Shojafar, M.; Scarpiniti, M. Q*: Energy and delay-efficient dynamic queue management in TCP/IP virtualized data centers. Comput. Commun. 2017, 102, 89–106. [Google Scholar] [CrossRef]
Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of Conference on the Theory and Application of Cryptographic Techniques, Santa Barbara, CA, USA, 18–22 August 1985; Springer: Heidelberg/Berlin, Germany, 1985; pp. 417–426. [Google Scholar]
Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209. [Google Scholar] [CrossRef]
Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126. [Google Scholar] [CrossRef]
SafeCurves: Choosing Safe Curves for Elliptic-Curve Cryptography. Available online: http://safecurves.cr.yp.to (accessed on 10 March 2018).
Barker, E.; Kelsey, J. NIST Special Publication 800-90A Revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators; Technical Report; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015. [Google Scholar]
Bernstein, D.J. Curve25519: New Diffie–Hellman Speed Records. In Proceedings of 9th International Workshop on Public Key Cryptography, New York, NY, USA, 24–26 April 2006; Springer: Heidelberg/Berlin, Germany, 2006; pp. 207–228. [Google Scholar]
Hamburg, M. Ed448-Goldilocks, a new elliptic curve. IACR Cryptol. ePrint Arch. 2015, 2015, 625. [Google Scholar]
Bernstein, D.J.; Birkner, P.; Joye, M.; Lange, T.; Peters, C. Twisted Edwards Curves. In Proceedings of 1st International Conference on Cryptology in Africa, Casablanca, Morocco, 11–14 June 2008; Springer: Heidelberg/Berlin, Germany, 2008; pp. 389–405. [Google Scholar]
Gallant, R.P.; Lambert, R.J.; Vanstone, S.A. Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms. In Proceedings of 21st Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; Springer: Heidelberg/Berlin, Germany, 2001; pp. 190–200. [Google Scholar]
Galbraith, S.D.; Lin, X.; Scott, M. Endomorphisms for Faster Elliptic Curve Cryptography on a Large Class of Curves. In Proceedings of 28th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Cologne, Germany, 26–30 April 2009; Springer: Heidelberg/Berlin, Germany, 2009; pp. 518–535. [Google Scholar]
Longa, P.; Gebotys, C. Efficient Techniques for High-Speed Elliptic Curve Cryptography. In Proceedings of 12th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–20 August 2010; Springer: Heidelberg/Berlin, Germany, 2010; pp. 80–94. [Google Scholar]
Longa, P.; Sica, F. Four-Dimensional Gallant–Lambert–Vanstone Scalar Multiplication. In Proceedings of 18th International Conference on the Theory and Application of Cryptology and Information Security, Beijing, China, 2–6 December 2012; Springer: Heidelberg/Berlin, Germany, 2012; pp. 718–739. [Google Scholar]
Hu, Z.; Longa, P.; Xu, M. Implementing the 4-dimensional GLV method on GLS elliptic curves with j-invariant 0. Des. Codes Cryptogr. 2012, 63, 331–343. [Google Scholar] [CrossRef]
Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. Fast cryptography in genus 2. In Proceedings of 32nd Annual International Conference on the Theory and Applications of Cryptographic Techniques, Athens, Greece, 26–30 May 2013; Springer: Heidelberg/Berlin, Germany; pp. 194–210.
Bos, J.W.; Costello, C.; Hisil, H.; Lauter, K. High-Performance Scalar Multiplication Using 8-Dimensional GLV/GLS Decomposition. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 331–348. [Google Scholar]
Oliveira, T.; López, J.; Aranha, D.F.; Rodríguez-Henríquez, F. Lambda Coordinates for Binary Elliptic Curves. In Proceedings of 15th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 20–23 August 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 311–330. [Google Scholar]
Guillevic, A.; Ionica, S. Four-Dimensional GLV via the Weil Restriction. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 79–96. [Google Scholar]
Smith, B. Families of Fast Elliptic Curves from $Q$ -Curves. In Proceedings of 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, 1–5 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 61–78. [Google Scholar]
Costello, C.; Longa, P. Four $Q$ : Four-Dimensional Decompositions on a $Q$ -Curve over the Mersenne Prime. In Proceedings of 21st International Conference on the Theory and Application of Cryptology and Information Security, Auckland, New Zealand, 29 November–3 December 2015; Springer: Heidelberg/Berlin, Germany, 2015; pp. 214–235. [Google Scholar]
Longa, P. Four $Q$ NEON: Faster Elliptic Curve Scalar Multiplications on ARM Processors. IACR Cryptol. ePrint Arch. 2016, 2016, 645. [Google Scholar]
Järvinen, K.; Miele, A.; Azarderakhsh, R.; Longa, P. Four $Q$ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 517–537. [Google Scholar]
Liu, Z.; Longa, P.; Pereira, G.C.; Reparaz, O.; Seo, H. Four $Q$ on Embedded Devices with Strong Countermeasures Against Side-Channel Attacks. In Proceedings of 19th International Workshop on Cryptographic Hardware and Embedded Systems, Taipei, Taiwan, 25–28 September 2017; Springer: Heidelberg/Berlin, Germany, 2017; pp. 665–686. [Google Scholar]
Faz-Hernández, A.; Longa, P.; Sánchez, A.H. Efficient and secure algorithms for GLV-based scalar multiplication and their implementation on GLV–GLS curves (extended version). J. Cryptogr. Eng. 2015, 5, 31–52. [Google Scholar] [CrossRef]
Kocher, P.C. Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In Proceedings of 16th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 1996; Springer: Heidelberg/Berlin, Germany; pp. 104–113.
Page, D. Theoretical use of cache memory as a cryptanalytic side-channel. IACR Cryptol. ePrint Arch. 2002, 2002, 169. [Google Scholar]
Edwards, H. A normal form for elliptic curves. Bull. Am. Math. Soc. 2007, 44, 393–422. [Google Scholar] [CrossRef]
Bernstein, D.J.; Lange, T. Faster addition and doubling on elliptic curves. In Proceedings of 13th International Conference on the Theory and Application of Cryptology and Information Security, Kuching, Malaysia, 2–6 December 2007; Springer: Heidelberg/Berlin, Germany; pp. 29–50.
Hisil, H.; Wong, K.K.H.; Carter, G.; Dawson, E. Twisted Edwards curves revisited. In Proceedings of 14th International Conference on the Theory and Application of Cryptology and Information Security, Melbourne, Australia, 7–11 December 2008; Springer: Heidelberg/Berlin, Germany; pp. 326–343.
Hankerson, D.; Menezes, A.J.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer Science & Business Media: Heidelberg/Berlin, Germany, 2006. [Google Scholar]
Yanık, T.; Savaş, E.; Koç, Ç.K. Incomplete reduction in modular arithmetic. IEE Proc. Comput. Digit. Tech. 2002, 149, 46–52. [Google Scholar] [CrossRef]
Microchip. 8/16-Bit AVR XMEGA A3 Microcontroller. Available online: http://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-8068-8-and16-bit-AVR-XMEGA-A3-Microcontrollers_Datasheet.pdf (accessed on 26 February 2018).
Hutter, M.; Schwabe, P. Multiprecision multiplication on AVR revisited. J. Cryptogr. Eng. 2015, 5, 201–214. [Google Scholar] [CrossRef]
Seo, H.; Liu, Z.; Choi, J.; Kim, H. Multi-Precision Squaring for Public-Key Cryptography on Embedded Microprocessors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 227–243. [Google Scholar]
Texas Instruments. MSP430FR59xx Mixed-Signal Microcontrollers. Available online: http://www.ti.com/lit/ds/symlink/msp430fr5969.pdf (accessed on 26 February 2018).
STMicroelectronics. UM1472: Discovery kit with STM32F407VG MCU. Available online: http://www.st.com/content/ccc/resource/technical/document/user_manual/70/fe/4a/3f/e7/e1/4f/7d/DM00039084.pdf/files/DM00039084.pdf/jcr:content/translations/en.DM00039084.pdf (accessed on 26 February 2018).
FourQlib library. Available online: https://github.com/Microsoft/FourQlib (accessed on 10 March 2018 ).
Babai, L. On Lovász’lattice reduction and the nearest lattice point problem. Combinatorica 1986, 6, 1–13. [Google Scholar] [CrossRef]
Park, Y.H.; Jeong, S.; Lim, J. Speeding Up Point Multiplication on Hyperelliptic Curves With Efficiently-Computable Endomorphisms. In Proceedings of International Conference on the Theory and Applications of Cryptographic Techniques, Amsterdam, The Netherlands, 28 April–2 May 2002; Springer: Heidelberg/Berlin, Germany, 2002; pp. 197–208. [Google Scholar]
Hamburg, M. Fast and compact elliptic-curve cryptography. IACR Cryptol. ePrint Arch. 2012, 2012, 309. [Google Scholar]
Wenger, E.; Werner, M. Evaluating 16-bit processors for elliptic curve cryptography. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Leuven, Belgium, 14–16 September 2011; Springer: Heidelberg/Berlin, Germany; pp. 166–181.
Wenger, E.; Unterluggauer, T.; Werner, M. 8/16/32 shades of elliptic curve cryptography on embedded processors. In Proceedings of Cryptology—INDOCRYPT 2013, Mumbai, India, 7–10 December 2013; Springer: Heidelberg/Berlin, Germany; pp. 244–261.
Hinterwälder, G.; Moradi, A.; Hutter, M.; Schwabe, P.; Paar, C. Full-size high-security ECC implementation on MSP430 microcontrollers. Proceedings of Cryptology—LATINCRYPT 2014, Florianópolis, Brazil, 17–19 September 2014; pp. 31–47. [Google Scholar]
Düll, M.; Haase, B.; Hinterwälder, G.; Hutter, M.; Paar, C.; Sánchez, A.H.; Schwabe, P. High-speed Curve25519 on 8-bit, 16-bit, and 32-bit microcontrollers. Des. Codes Cryptogr. 2015, 77, 493–514. [Google Scholar] [CrossRef]
De Santis, F.; Sigl, G. Towards Side-Channel Protected X25519 on ARM Cortex-M4 Processors. Proceedings of Software performance enhancement for encryption and decryption, and benchmarking, Utrecht, The Netherlands, 19–21 October 2016. [Google Scholar]
Renes, J.; Schwabe, P.; Smith, B.; Batina, L. μKummer: Efficient Hyperelliptic Signatures and Key Exchange on Microcontrollers. In Proceedings of 18th International Workshop on Cryptographic Hardware and Embedded Systems, Santa Barbara, CA, USA, 17–19 August 2016; Springer: Heidelberg/Berlin, Germany, 2016; pp. 301–320. [Google Scholar]
Hutter, M.; Schwabe, P. NaCl on 8-Bit AVR Microcontrollers. In Proceedings of Cryptology—AFRICACRYPT 2013, Cairo, Egypt, 22–24 June 2013; Springer: Heidelberg/Berlin, Germany, 2013; pp. 156–172. [Google Scholar]

Figure 1. The implementation hierarchy of four-dimensional Gallant-Lambert-Vanstone and Galbraith-Lin-Scott (GLV-GLS) scalar multiplication.

Table 1. The operation counts of curve

Ted 127

-

glv 4

using field arithmetic over

F_{p^{2}}

and operation counts for conversion into field arithmetic over

F_{p}

.

Table 1. The operation counts of curve

Ted 127

-

glv 4

using field arithmetic over

F_{p^{2}}

and operation counts for conversion into field arithmetic over

F_{p}

.

Operation	$Ted 127$ - $glv 4$
Operation	$I_{2}$	$M_{2}$	$S_{2}$	$A_{2}$	$M_{1}$	$S_{1}$	$A_{1}$	$A_{i}$
Compute endomorphisms	-	13	2	11.5	43	-	66	30
Precompute lookup table	-	63	-	70	189	-	329	140
Scalar decomposition	-	-	-	-	-	-	-	-
Scalar recoding	-	-	-	-	-	-	-	-
Main computation	-	715	260	848	2665	-	4101	1950
Normalization	1	2	-	-	21	128	8	4
Total Cost	1	793	262	929.5	2918	128	4504	2124

Table 2. The operation counts of curve Four

Q

using field arithmetic over

F_{p^{2}}

and operation counts for conversion into field arithmetic over

F_{p}

.

Table 2. The operation counts of curve Four

Q

using field arithmetic over

F_{p^{2}}

and operation counts for conversion into field arithmetic over

F_{p}

.

Operation	Four $Q$ [21]
Operation	$I_{2}$	$M_{2}$	$S_{2}$	$A_{2}$	$M_{1}$	$S_{1}$	$A_{1}$	$A_{i}$
Compute endomorphisms	-	73	27	59.5	273	-	365	200
Precompute lookup table	-	63	-	56	189	-	301	126
Scalar decomposition	-	-	-	-	-	-	-	-
Scalar recoding	-	-	-	-	-	-	-	-
Main computation	-	704	256	835	2,624	-	4038	1,920
Normalization	1	2	-	-	18	128	8	4
Total Cost	1	842	283	950.5	3,104	128	4712	2,250

Table 3. Cycle counts for field arithmetic on 8-bit AVR, 16-bit MSP430, and 32-bit ARM processors, including function call overhead.

Operation		8-Bit AVR		16-Bit MSP430		32-Bit ARM
Operation		$Ted 127$ - $glv 4$ (This Work)	Four $Q$ [24]	$Ted 127$ - $glv 4$ (This Work)	Four $Q$ [24]	$Ted 127$ - $glv 4$ (This Work)	Four $Q$ [24]
$F_{p}$	Add	198	155	120	102	55	n/a
	Sub	196	159	126	101	55	n/a
	Sqr	1221	1026	837	927	88	n/a
	Mul	1796	1598	1087	1027	99	n/a
	Inv	176,901	150,535	119,629	131,819	12,135	n/a
$F_{p^{2}}$	Add	452	384	266	233	82	84
	Sub	448	385	278	231	82	86
	Sqr	4093	3622	2476	2391	195	215
	Mul	6277	5758	3806	3624	341	358
	Inv	183,345	156,171	123,740	135,315	12,612	21,056

Table 4. Cycle counts and memory usage of variable-base scalar multiplication on 8-bit AVR, 16-bit MSP430, 32-bit ARM processors.

Platform	Implementations	Bit-Length of Curve Order	Cost (Cycles)	Code Size (Bytes)	Stack Usage (Bytes)
AVR	NIST P-256 [43]	256	34,930,000	16,112	590 $^{a}$
	Curve25519 [48]	252	22,791,579	n/a	677
	Curve25519 [45]	252	13,900,397	17,710	494
	$μ$ Kummer [47]	250	9,513,536	9490	99
	Four $Q$ [24]	246	6,561,500	n/a	n/a
	$Ted 127$ - $glv 4$ (This work)	251	6,856,026	13,891	2539
MSP430	NIST P-256 [42]	256	23,973,000	n/a	n/a
	NIST P-256 [43]	256	22,170,000	8378	418 $^{a}$
	Curve25519 [44]	252	9,139,739	11,778	513
	Curve25519 [45]	252	7,933,296	13,112	384
	Four $Q$ [24]	246	4,280,400	n/a	n/a
	$Ted 127$ - $glv 4$ (This work)	251	4,158,453	9098	2568
ARM Cortex-M4	Curve25519 [46]	252	1,423,667	3750	740
	Four $Q$ [24]	246	469,500	n/a	n/a
	$Ted 127$ - $glv 4$ (This work)	251	447,836	7532	2792

^{a}

includes RAM and stack.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kwon, J.; Seo, S.C.; Hong, S. Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers. Appl. Sci. 2018, 8, 900. https://doi.org/10.3390/app8060900

AMA Style

Kwon J, Seo SC, Hong S. Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers. Applied Sciences. 2018; 8(6):900. https://doi.org/10.3390/app8060900

Chicago/Turabian Style

Kwon, Jihoon, Seog Chung Seo, and Seokhie Hong. 2018. "Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers" Applied Sciences 8, no. 6: 900. https://doi.org/10.3390/app8060900

APA Style

Kwon, J., Seo, S. C., & Hong, S. (2018). Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers. Applied Sciences, 8(6), 900. https://doi.org/10.3390/app8060900

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers

Abstract

1. Introduction

2. Preliminaries

2.1. Field Representation and Notations

2.2. Elliptic Curve Cryptography

2.3. Twisted Edwards Curves

2.4. The GLV-GLS Method

3. Review of Four-Dimensional GLV-GLS Scalar Multiplication

4. Implementation Details of Field Arithmetic

4.1. Field Addition and Subtraction over $F_{p}$

4.2. Modular Reduction

4.3. Inversion over $F_{p}$

4.4. Field Arithmetic over $F_{p^{2}}$

4.5. Optimization Strategy on 8-Bit AVR

4.6. Optimization Strategy on 16-Bit MSP430X

4.7. Optimization Strategy on 32-Bit ARM

5. Implementation Details of Curve Arithmetic

5.1. Scalar Decomposition

5.2. Point Arithmetic

5.3. Endomorphisms

6. Performance Analysis and Implementation Results

6.1. Operation Counts

6.2. Implementation Results of Field Arithmetic

6.3. Implementation Results of Scalar Multiplication

7. Conclusions

Author Contributions

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Efficient Implementations of Four-Dimensional GLV-GLS Scalar Multiplication on 8-Bit, 16-Bit, and 32-Bit Microcontrollers

Abstract

1. Introduction

2. Preliminaries

2.1. Field Representation and Notations

2.2. Elliptic Curve Cryptography

2.3. Twisted Edwards Curves

2.4. The GLV-GLS Method

3. Review of Four-Dimensional GLV-GLS Scalar Multiplication

4. Implementation Details of Field Arithmetic

4.1. Field Addition and Subtraction over F p

4.2. Modular Reduction

4.3. Inversion over F p

4.4. Field Arithmetic over F p 2

4.5. Optimization Strategy on 8-Bit AVR

4.6. Optimization Strategy on 16-Bit MSP430X

4.7. Optimization Strategy on 32-Bit ARM

5. Implementation Details of Curve Arithmetic

5.1. Scalar Decomposition

5.2. Point Arithmetic

5.3. Endomorphisms

6. Performance Analysis and Implementation Results

6.1. Operation Counts

6.2. Implementation Results of Field Arithmetic

6.3. Implementation Results of Scalar Multiplication

7. Conclusions

Author Contributions

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.1. Field Addition and Subtraction over $F_{p}$

4.3. Inversion over $F_{p}$

4.4. Field Arithmetic over $F_{p^{2}}$