Elliptic Curve Cryptography for Wireless Sensor Networks Using the Number Theoretic Transform

We implement elliptic curve cryptography on the MSP430 which is a commonly used microcontroller in wireless sensor network nodes. We use the number theoretic transform to perform finite field multiplication and squaring as required in elliptic curve scalar point multiplication. We take advantage of the fast Fourier transform for the first time in the literature to speed up the number theoretic transform for an efficient realization of elliptic curve cryptography. Our implementation achieves elliptic curve scalar point multiplication in only 0.65 s and 1.31 s for multiplication of fixed and random points, respectively, and has similar or better timing performance compared to previous works in the literature.


Introduction
Wireless sensor network (WSN) technology is a widespread and enabling technology that has been rapidly penetrating our daily lives. It has environmental applications such as temperature, humidity, pressure and fire monitoring [1,2], health applications such as patient monitoring [3], military applications such as enemy detection and reconnaissance [4], and applications to smart cities such as in smart grids [5]. Securing WSN applications is an important task since sensitive information they communicate should be kept confidential from malicious third parties. A sensor node, which is a single unit of a WSN, is a tiny, cheap and constrained embedded system that is usually equipped with a simple microcontroller. Cryptographic solutions are needed for applications running on constrained microcontrollers on sensor nodes [6][7][8][9][10][11][12][13]. However, due to the complex nature of cryptographic algorithms and the constrained nature of WSN nodes, e.g., their CPU power and memory size limitations, it is a challenge to implement cryptographic algorithms efficiently on WSN nodes [14][15][16][17][18].
Among different types of cryptosystems, symmetric key cryptography comes forward as a good choice to be used for WSNs due to its simplicity and efficiency. However, for many WSN applications, the distribution of the private key between sensor nodes remains as a problem that needs to be addressed. Public key cryptography (PKC) [19] provides a solution to the key distribution problem, yet it is considered computationally expensive for constrained WSN nodes. On the other hand, previous works prove PKC to be applicable on constrained WSN nodes for solving the key distribution problem [20][21][22][23][24]. Elliptic curve cryptography (ECC) [25,26] is a popular option for PKC. It requires a 160-bit or longer key to be considered secure, while the same level of security can be achieved with much longer key sizes with other PKC algorithms, e.g., a 1024-bit key is needed to achieve the same level of security using the RSA cryptosystem [27]. In this work, we realize an efficient implementation of ECC for solving the key distribution problem in WSNs. We present a novel implementation of Our Main Contribution: We present a novel realization of ECC which uses Edwards curves for point arithmetic and the NTT for the underlying finite field multiplication and squaring operations. To the best of our knowledge, our work presents the first realization of ECC using the Fast Fourier Transform (FFT) [53,54] to speed up NTT computations. Our implementation achieves similar or faster timings for ECC scalar point multiplication compared to existing implementations in the literature and proves that NTT-based arithmetic is feasible for ECC implementations on constrained devices such as WSN nodes.
The paper continues as follows. In Section 2, we explain ECC using Edwards curves and also give a detailed explanation of finite field multiplication in GF((2 13 − 1) 13 ) using the NTT. In Section 3, we give the details of our optimized implementation of ECC point multiplication which uses our improved Edwards curves formulae for point arithmetic and NTT-based finite field multiplication/squaring over GF((2 13 − 1) 13 ). In Section 4, we present our implementation results and comparisons with the existing work in the literature. Finally, Section 5 includes our conclusion.

Finite Field Multiplication Using the NTT
In elliptic curve cryptography, a large number of multiplication and squaring operations are performed in a finite field. Elements of the finite field GF(p m ) are typically represented in the time domain as polynomials of degree m − 1 with coefficients in GF(p) [55,56] Multiplication of two GF(p m ) elements, e.g., a(x) and b(x), is achieved typically by computing the polynomial product c (x) = a(x) · b(x) mod p followed by the modular reduction where P(x) is the irreducible field generating polynomial. Please note that if the field generating polynomial can be selected as the binomial x m − 2, the cost of the modular reduction computation becomes negligible. Due to the convolution theorem, the classical polynomial multiplication operation in the time domain, e.g., the computation of c (x) = a(x) · b(x) mod p, which has quadratic complexity, is equivalent to the simple pairwise multiplication of the corresponding frequency domain sequence coefficients which has only linear complexity [53]. Thus, the complexity of polynomial multiplication can be reduced by performing this computation in the frequency domain. The coefficients of a GF(p m ) element form a time domain sequence. To perform the polynomial multiplication of two GF(p m ) elements in the frequency domain, the time domain sequences for the two GF(p m ) elements should be transformed into their corresponding frequency domain sequences. This conversion is achieved by using the NTT [31]. After the polynomial multiplication operation is completed in the frequency domain, the result can be converted back to the time domain by using the inverse NTT computation. Algorithm 1 gives an overview of how polynomial multiplication can be achieved in the frequency domain.

Return (c )
A finite field element a(x) ∈ GF(p m ) can be converted to its d-element frequency domain sequence representation, where d ≥ m, in two steps as explained below: by appending d − m zeros at the end.

2.
Obtain the frequency domain sequence representation (A) = (A 0 , A 1 , A 2 , ..., A d−1 ) for a(x) by performing the following NTT computation over (a): where r is a d th primitive root of unity.
Please note that in order to obtain the time domain sequence (a) back from the frequency domain sequence (A), the inverse NTT computation can be used as follows: In Algorithm 1, since c (x) = a(x) · b(x) mod p may have up to 2m − 1 coefficients, representing it in the frequency domain with a sequence of length shorter than 2m − 1 may result in its value being corrupted. Therefore, for Algorithm 1 to always generate the correct result, one should have d ≥ 2m − 1 as the NTT length. We now describe with an example the execution of Algorithm 1 for computing c (x) = a(x) · b(x) mod p where a(x), b(x) ∈ GF(p 13 ). As described in (1), the polynomials a(x) = a 0 + a 1 x + a 2 x 2 + . . . + a 12 x 12 and b(x) = b 0 + b 1 x + b 2 x 2 + . . . + b 12 x 12 are first converted into their corresponding 26-element time domain sequence representations as (a) = (a 0 , a 1 , a 2 , a 3 , a 4 , a 5 , a 6 , a 7 , a 8 , a 9 , a 10 , a 11 , a 12 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) and 12 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) . Secondly, the NTT is applied to (a) and (b), as described in (2), to obtain the following frequency domain sequences: Thirdly, the coefficients of (A) and (B) are pairwise multiplied in the frequency domain, i.e., by computing C i = A i B i mod p for 0 ≤ i ≤ 25, and thus the following sequence is obtained in the frequency domain: which corresponds to c (x) = a(x) · b(x) mod p in the time domain. Finally, the inverse NTT is applied to (C ), as described in (3), to obtain the following time domain sequence for c (x) = a(x) · b(x) mod p: Please note that since c (x) is a polynomial of degree 24, c 25 is zero and the first 25 coefficients of (c ) give us the polynomial c (x) = a(x) · b(x) mod p , given as follows: As a final step in GF(p 13 ) multiplication, the polynomial c (x) needs to be reduced modulo the field generating polynomial by computing c(x) = c (x) mod P(x), which has only linear complexity.

Elliptic Curve Cryptography Using Edwards Curves
The main operation in ECC is scalar point multiplication, i.e., computing s · P for an integer s and a point P on the elliptic curve. ECC scalar point multiplication involves performing several ECC point addition and doubling operations. To achieve ECC point multiplication, the binary method [57] can be used, where the bits of the scalar s are scanned one bit at a time starting with the most significant bit, and for each scanned bit, a point doubling operation is performed, in addition to a point addition operation if the scanned bit is 1. However, the binary method is both inefficient and vulnerable against simple power analysis [58]. As an alternative to the binary method, and in order to help mitigate its drawbacks, the NAF4 and Comb methods can be used for ECC scalar point multiplication of random and fixed points, respectively. NAF4 and Comb require computing a significantly reduced number point additions and doublings compared to the binary method [59].
Edwards curves, proposed in [30], are a new form for elliptic curves and defined by the following equation: The ECC point addition of the two distinct points P 1 and P 2 on an Edwards curve is computed as The ECC point doubling operation on the point P 1 (x 1 , y 1 ) on an Edwards curve is computed as .
The above ECC point operations can be achieved in projective coordinates [59][60][61] to avoid costly inversions. For the Edwards curve x 2 + y 2 = c 2 (1 + dx 2 y 2 ), with c = 1, d random and d · c 4 = 1, the formulae for ECC point doubling and addition in projective coordinates over prime fields are given in Algorithms 2 and 3, respectively [62].

Algorithm 2:
Elliptic curve point doubling in projective coordinates over prime fields using Edwards curves [62] Input: Algorithm 3: Elliptic curve point addition in projective coordinates over prime fields using Edwards curves [62] Input: P 1 (X 1 : Y 1 : Z 1 ) and P 2 (X 2 :

Our ECC Implementation Using the NTT and Edwards Curves
We implement ECC over an optimal extension field [29,63], namely GF(p m ) with the Mersenne prime field characteristic p = 2 13 − 1 and the prime field extension degree m = 13. Please note that ECC over a prime extension field of the form GF(p m ) is considered secure when the finite field is sufficiently large and its extension degree m is a prime number [59]. We select the field characteristic p such that polynomial coefficients fit in a single processor word, in our case a 16-bit word, eliminating the need for performing multiprecision arithmetic. We use the binomial x 13 − 2 as the field generating polynomial which facilitates efficient modular reduction. For finite field multiplication and squaring, we use the NTT and use the approach described in Algorithm 1. For NTT computations, we use the NTT length of d = 26 and the 26 th primitive root of unity as r = −2. We use the FFT [53,54,64] to speed up NTT computations. For ECC point doubling and addition, we use our improved versions of Algorithms 2 and 3, respectively. Finally, we use the NAF and Comb methods, with a 4-bit window, to perform ECC scalar point multiplication with random and fixed points, respectively [59].
Forward NTT for Converting GF((2 13 − 1) 13 ) Elements to the Frequency Domain: 13 ) is obtained by computing the NTT of the corresponding 26-element time domain sequence (a) = (a 0 , a 1 , a 2 , . . . , a 12 , 0, 0, . . . , 0) as where p = 2 13 − 1 . The above NTT computation can be optimized by applying the FFT as and for 0 ≤ j ≤ 12 [64]. Please note that the first summations in (7) and (8) are the same NTT computation. Likewise, the second summations in (7) and (8) are also the same NTT computation. Both NTT computations are of length 13. Hence, using the FFT, the computation in (6), which is a 26-element NTT computation, is reduced to the computation of roughly two 13-element NTT computations. Since a i = 0 for 13 ≤ i ≤ 25, we compute the summations in (7) and (8) only for i running from 0 to 6 in the first NTT computation, and from 0 to 5 in the second. Our optimized algorithm for computing the forward NTT of a(x) ∈ GF((2 13 − 1) 13 ) on the MSP430 is given in Algorithm 4.
We implement Algorithm 4 with an assembly routine and optimize it by using microcontroller registers as much as possible to minimize the number of memory read/write operations. For the additions in lines 4, 9, 14 and 21, there is no need to do modular reduction after every addition. We reduce the number of modular reductions by accumulating the sums and deferring modular reduction as much as possible.
Multiplication of a GF(2 13 − 1) element with a power of 2, e.g., in lines 7 and 19 of Algorithm 4, corresponds to a bitwise left rotation of the GF(2 13 − 1) element. For instance, for R ∈ GF(2 13 − 1), 2 j R mod 2 13 − 1 can be computed by rotating the bits of R by j mod 13 bits to the left. We realize multiplications of GF(2 13 − 1) elements with powers of 2 with an optimized assembly routine.
Please note that for multiplying two distinct GF((2 13 − 1) 13 ) elements with Algorithm 1, Algorithm 4 needs to be executed twice, i.e., once for each input operand to obtain its frequency domain sequence representation. On the other hand, for squaring a GF((2 13 − 1) 13 ) element, Algorithm 4 needs to be executed only once for the single input operand. Hence, squaring using Algorithm 1 is faster than multiplication.
Pairwise Coefficient Multiplication of GF((2 13 − 1) 13 ) Elements in the Frequency Domain: Polynomial multiplication of two GF((2 13 − 1) 13 ) elements can be achieved in the frequency domain with only linear complexity by multiplying pairwise their frequency domain sequence coefficients. Let a(x), b(x) ∈ GF((2 13 − 1) 13 ), and let (A) = (A 0 , A 1 , · · · , A 25 ) and (B) = (B 0 , B 1 , · · · , B 25 ) be their 26-element frequency domain sequence representations obtained using Algorithm 4. The following 26 pairwise coefficient multiplications generate the 26-element frequency domain sequence representation (C') of the product c (x) = a(x) · b(x) mod p : The frequency domain sequence (C ) can be converted back to the time domain, by applying the inverse NTT, to give us the coefficients of the polynomial product c (x) = a(x) · b(x) mod p .
The multiplications in (9) are the only GF(p) multiplications required for computing the polynomial product c (x) = a(x) · b(x) mod p in the NTT-based multiplication approach. Please note that only 26 coefficient multiplications are performed here, which is significantly less than the 169 coefficient multiplications required in the classical schoolbook method for multiplication.
Inverse NTT for Converting the Frequency Domain Product to a Time Domain GF((2 13 − 1) 13 ) Element: As described in (3), the time domain sequence representation (c ) = (c 0 , c 1 , c 2 , . . . , c 25 ) of c (x) = a(x) · b(x) mod p can be obtained by computing the inverse NTT of the corresponding 26-element frequency domain sequence (C ) = (C 0 , C 1 , C 2 , . . . , C 23 , C 24 , C 25 ) as follows The above inverse NTT computation can be optimized by applying the inverse FFT as and for 0 ≤ j ≤ 12 [64]. Please note that the first summations in (11) and (12) are the same inverse NTT computation. Likewise, the second summations in (11) and (12) are the same inverse NTT computation. Furthermore, both inverse NTT computations are of length 13. Hence, using the inverse FFT, the computation of (10), which is a 26-element inverse NTT computation, is reduced to the computation of roughly two 13-element inverse NTT computations. Since c 25 = 0, we compute the second summations in (11) and (12) only for i running from 0 to 11. Our inverse FFT algorithm for computing the inverse NTT of (C ), the frequency domain sequence corresponding to c (x) = a(x) · b(x) mod p, and for obtaining c(x) = c (x) mod P(x), where P(x) = x 13 − 2, is given in Algorithm 5. Please note that, unlike in the inverse NTT computation in Algorithm 1, in Algorithm 5 (lines 35 − 42) we embed the modular reduction of c (x) = a(x) · b(x) mod p by the field generating polynomial P(x) = x 13 − 2. Hence, for a(x), b(x) ∈ GF((2 13 − 1) 13 ), while the output of Algorithm 1 is a polynomial of degree 24 (with 25 coefficients in GF(2 13 − 1)), the output of Algorithm 5 is an element of GF((2 13 − 1) 13 ) with 13 coefficients.
Similar to our implementation of Algorithm 4, we implement Algorithm 5 with an assembly routine and optimize it by using microcontroller registers exhaustively to minimize the number of memory read/write operations. We reduce the number of performed modular reductions in lines 4,9,14,19,24,29,34 and 40 of Algorithm 5 by accumulating the sums and deferring the modular reduction computation as much as possible.
Division of a GF(2 13 − 1) element by a power of 2, e.g., in lines 7, 17, 27 and 38 of Algorithm 5, can be achieved with a bitwise right rotation. For instance, for R ∈ GF(2 13 − 1), R/2 j mod 2 13 − 1 can be computed by rotating the bits of R by j mod 13 bits to the right. We realize this bitwise rotation operation with an optimized assembly routine.

ECC Point Arithmetic with NTT Based Multiplication and Squaring
For ECC operations, we use the Edwards curve x 2 + y 2 = c 2 (1 + dx 2 y 2 ), with c = 1, d random and d · c 4 = 1, over the 169-bit prime extension field GF((2 13 − 1) 13 ), and use our optimized versions of the elliptic curve point addition and doubling formulae given in Algorithms 2 and 3. We improve Algorithms 2 and 3 by taking advantage of NTT-based multiplication and squaring operations. Our improved algorithms are given in Algorithms 6 and 7.
Algorithm 6: Elliptic curve point doubling in projective coordinates over prime fields using Edwards curves and NTT-based multiplication/squaring Input: P = (X 1 : Y 1 : Z 1 ), R 1 and R 2 are temporary registers.
Algorithm 6 is a reordered and optimized version of Algorithm 2. It takes advantage of NTT-based finite field multiplication and squaring computations. In line 1 of the algorithm, the NTTs of X 1 and Y 1 are computed, and then added in the frequency domain to find the NTT of R 1 = X 1 + Y 1 . The computed NTTs of X 1 , Y 1 and R 1 are stored. The stored frequency domain representations of X 1 , Y 1 and R 1 are used in lines 2 − 4 (marked bold) for the three finite field squarings. Please note that for these three finite field squarings, a total number of only two forward NTT computations are performed, i.e., NTT(X 1 ) and NTT(Y 1 ) in line 1, instead of three as required in Algorithm 1. Furthermore, in line 10, the computed NTT of Z 1 is stored and reused in line 11 (marked bold). Similarly, in line 11, the computed NTT of R 2 is stored and reused in line 12 (marked bold). Please note that each time the stored result of an NTT computation is reused, a forward NTT computation is saved in Algorithm 1.

Algorithm 7:
Elliptic curve point addition in projective coordinates over prime fields using Edwards curves and NTT-based multiplication/squaring Input: P = (X 1 : Y 1 : Z 1 ) , Q = (X 2 : Y 2 : Z 2 ) , R 1 and R 2 are temporary registers. Output: P + Q = (X 3 : Algorithm 7 is a reordered and optimized version of Algorithm 3. It takes advantage of NTT-based finite field multiplication and squaring computations. In lines 2 − 3 of the algorithm, the NTTs of X 1 , X 2 , Y 1 and Y 2 are computed and stored. Only two addition operations are performed in the frequency domain on the stored NTTs to readily obtain the NTTs of R 1 = X 1 + Y 1 and R 2 = X 2 + Y 2 . The NTTs of R 1 and R 2 are also stored. The stored NTTs of R 1 , R 2 , X 1 , X 2 , Y 1 and Y 2 are readily used in lines 4 − 6 (denoted with bold color) for the three finite field multiplication computations. Thus, for three finite field multiplications, a total number of only four forward NTT computations are performed, instead of six as required in Algorithm 1. Furthermore, in lines 11 − 13 of the algorithm, the stored NTTs of Y 1 , X 1 and Z 1 are reused (marked bold). Similarly, in line 16, the NTT of Z 1 is computed and stored. The stored NTT of Z 1 is reused in line 17 (marked bold). Likewise, in line 17, the NTT of X 1 is computed and stored, and reused in line 18 (marked bold).

Implementation Results
We use Texas instrument's MSP430 microcontroller, which is commonly used in wireless sensor nodes, and select version MSP430F1611 [65]. Our target device, MSP430F1611, is a 1-series low power microcontroller which runs at 8 MHz clock frequency, and has a 48 kB flash memory in addition to a 10 kB RAM. We develop our code in the C language but also use the assembly language for computationally intensive and/or commonly performed operations. We use the IAR Workbench IDE as our development environment [66]. We obtain timings by using the IAR Workbench IDE's clock cycle counter in debug mode. The detailed timing figures for our ECC implementations are given in Table 1.
In Table 2 and Figure 1, we present our timings for ECC random point multiplication on the MSP430F1611 as well as the timings of the related work in the literature on the same microcontroller. Liu et al.'s work, which uses a 159-bit Montgomery curve, presents the fastest timing for random point multiplication on the MSP430 microcontroller [67]. They use the Montgomery ladder method and achieve random point multiplication in 3, 460, 000 clock cycles which is equivalent to 0.48 s at 8 MHz clock frequency. Gouvêa et al.'s work, which uses the 160-bit curve secp160r1 that has a slightly smaller elliptic curve group order than ours, achieves ECC random point multiplication in 0.58 s [68]. Our previous ECC implementation over GF((2 13 − 1) 13 ) on the MSP430F149, a similar microcontroller to the MSP430F1611, achieves random point multiplication in 1.55 s [20]. Please note that our ECC random point multiplication implementation in this work, which exploits the NTT-based finite field multiplication/squaring and the FFT, is more than 18% faster than our previous implementation on the same elliptic curve. Wang et al.'s implementation of elliptic curve random point multiplication over a 160-bit elliptic curve has a timing value of 3.51 s which is significantly worse than our timing result [69]. In a later work, the same authors improve their timing to 1.60 s; however, their new implementation is still 22% slower than our work [24]. Please note that the timing figure for our ECC implementation is for a 169-bit elliptic curve with a higher security level, whereas the others' works use the smaller ordered 159-bit and 160-bit elliptic curves. In Table 3 and Figure 2, we present our timings for ECC fixed point multiplication on the MSP430F1611 as well as the timings of the related work in the literature on the same microcontroller. Liu et al.'s work, which uses a 159-bit twisted Edwards curve, presents the fastest timing for fixed point multiplication on the MSP430 microcontroller [67]. They use the Comb method and twisted Edwards curves to achieve fixed point multiplication in 1, 920, 000 clock cycles which is equivalent to 0.24 s at 8 MHz clock frequency. Gouvêa [70]. Szczechowiak et al.'s work achieves elliptic curve fixed point multiplication in 0.72 s using a 160-bit elliptic curve over a prime field [22] and in 1.04 s using a 163-bit elliptic curve over a binary field [22]. Our timing for elliptic curve fixed point multiplication over a larger ordered 169-bit elliptic curve is slightly better than their results. Please note that the timing figure for our ECC implementation is for a 169-bit elliptic curve with a higher security level, whereas the others' works use the the smaller ordered 159-bit, 160-bit and 163-bit elliptic curves.

Conclusions
We implemented ECC on the MSP430 microcontroller, which is a widely used microcontroller in WSNs, by using Edwards curves for point arithmetic and the number theoretic transform for the underlying finite field multiplication and squaring operations. In our work, we realized a novel implementation of the fast Fourier transform over GF((2 13 − 1) 13 ) to speed up the number theoretic transform on the MSP430 microcontroller. Furthermore, for the point addition and doubling operations on Edwards curves, we introduced optimized formulae where some arithmetic operations are eliminated by taking advantage of the number theoretic transform. Our ECC implementation resulted in comparable or better timing values than the existing work in the literature on the same microcontroller. Please note that the techniques introduced in this paper can be applied to ECC implementations over other elliptic curves with more efficient formulae for point arithmetic. We identify the application of the introduced techniques to ECC implementations on other elliptic curves, such as Montgomery curves or twisted Edwards curves, as directions for future research.