Low-Cost, Low-Power FPGA Implementation of ED25519 and CURVE25519 Point Multiplication

Abstract: Twisted Edwards curves have been at the center of attention since their introduction by Bernstein et al. in 2007. The curve ED25519, used for the Edwards-curve Digital Signature Algorithm (EdDSA), provides faster digital signatures than existing schemes without sacrificing security. CURVE25519 is a Montgomery curve that is closely related to ED25519. It provides a simple, constant-time, and fast point multiplication, which is used by the key exchange protocol X25519. Software implementations of EdDSA and X25519 are used in many web-based PC and mobile applications. In this paper, we introduce a low-power, low-area FPGA implementation of the ED25519 and CURVE25519 scalar multiplication that is particularly relevant for Internet of Things (IoT) applications. The efficiency of the arithmetic modulo the prime number $2^{255} - 19$, in particular the modular reduction and modular multiplication, is key to the efficiency of both EdDSA and X25519. To reduce the complexity of the hardware implementation, we propose a high-radix interleaved modular multiplication algorithm. One benefit of this architecture is that it avoids large-integer multipliers that rely on FPGA DSP modules.


Introduction
Building on the work of Euler and Gauss, Edwards introduced a normal form of elliptic curves in 2007 [1]. He generalized the curve as:
$$x^2 + y^2 = a^2 (1 + x^2 y^2) \tag{1}$$
over the field K, where a ∈ K, such that $a^5 \neq a$. As Edwards stated in his paper, every curve of the form given in (1) is birationally equivalent to an elliptic curve in Weierstrass form. Bernstein et al. [2] generalized Edwards' original curves. For a fixed field K of odd characteristic and arbitrary elements c, d ∈ K such that $cd(1 - dc^4) \neq 0$, they introduced the curves:
$$x^2 + y^2 = c^2 (1 + d x^2 y^2) \tag{2}$$
This definition covers "more than 1/4 of all isomorphism classes of elliptic curves over a finite field". They showed that every elliptic curve over a non-binary field is birationally equivalent to a curve in Edwards form over an extension of the field, and in many cases over the original field [2]. In [3], Bernstein et al. introduced a generalization of Edwards curves named twisted Edwards curves. This family is larger: it contains the Edwards curves and every elliptic curve in Montgomery form [4]. As explained in [3], the name comes from the fact that the set of twisted Edwards curves is invariant under quadratic twists, while a quadratic twist of an Edwards curve is not necessarily an Edwards curve. A quadratic twist of a curve is a curve that becomes isomorphic to it over a field extension of degree two.
For a field K of odd characteristic and distinct nonzero elements a, d ∈ K, the twisted Edwards curve $E_{T,a,d}(K)$ is defined as:
$$E_{T,a,d}(K): a x^2 + y^2 = 1 + d x^2 y^2 \tag{3}$$
If a = 1, then $E_{T,a,d}$ is an Edwards curve with c = 1. Moreover, $E_{T,a,d}$ is a quadratic twist of the Edwards curve $E_{O,1,d/a}$ with the map $(\bar{x}, \bar{y}) \mapsto (x, y) = (\bar{x}/\sqrt{a}, \bar{y})$ over the field extension $K(\sqrt{a})$:
$$E_{O,1,d/a}: \bar{x}^2 + \bar{y}^2 = 1 + (d/a)\, \bar{x}^2 \bar{y}^2 \tag{4}$$
Twisted Edwards curves and Montgomery curves are closely related. As shown in [3], every twisted Edwards curve $E_{T,a,d}$ over a field K with char(K) ≠ 2 is birationally equivalent to a Montgomery curve $E_{M,A,B}: B v^2 = u^3 + A u^2 + u$ using the map:
$$(x, y) \mapsto (u, v) = \left( \frac{1 + y}{1 - y}, \frac{1 + y}{(1 - y)\, x} \right) \tag{5}$$
where
$$A = \frac{2(a + d)}{a - d}, \qquad B = \frac{4}{a - d} \tag{6}$$
If a is a square in K, then these curves are isomorphic over K itself. From the operation counts of the point arithmetic given in [5], it is easy to see that twisted Edwards curves outperform curves in Weierstrass form in terms of speed (binary Edwards curves are an exception, being slightly slower than their Weierstrass counterparts [6]). However, twisted Edwards curves are appealing for another reason: their group laws are unified and complete, which leads to implementations that are safer against certain types of attacks [3].
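As a worked check of the birational map above, the following minimal Python sketch plugs in the widely published ED25519 parameters (a = −1, d = −121665/121666 over $p = 2^{255} - 19$) and the base point with y = 4/5. The helper names `inv` and `sqrt_mod_p` are illustrative and are not taken from the paper.

```python
# Worked check of the twisted Edwards -> Montgomery map for ED25519.
# Assumption: standard ED25519 parameters; helper names are hypothetical.
p = 2**255 - 19

def inv(z):
    """Field inverse via Fermat's little theorem."""
    return pow(z, p - 2, p)

a = p - 1                              # a = -1
d = -121665 * inv(121666) % p          # d = -121665/121666

# Montgomery coefficients from the twisted Edwards coefficients
A = 2 * (a + d) * inv((a - d) % p) % p
B = 4 * inv((a - d) % p) % p

def sqrt_mod_p(n):
    """Square root for p = 5 (mod 8); returns None if n is not a square."""
    r = pow(n, (p + 3) // 8, p)
    if (r * r - n) % p != 0:
        r = r * pow(2, (p - 1) // 4, p) % p   # multiply by sqrt(-1)
    return r if (r * r - n) % p == 0 else None

# ED25519 base point: y = 4/5, x recovered from the curve equation
y = 4 * inv(5) % p
x = sqrt_mod_p((y * y - 1) * inv((d * y * y + 1) % p) % p)

# Apply the map (x, y) -> (u, v)
u = (1 + y) * inv((1 - y) % p) % p
v = u * inv(x) % p

print(A)        # Montgomery coefficient A
print(B - p)    # B written as a negative integer
print(u)        # u-coordinate of the mapped base point
```

With these parameters the map yields A = 486662 and B = −486664, and the base point lands on u = 9. CURVE25519 itself is written with B = 1; since −486664 is a square modulo p, rescaling v by its square root carries one form into the other.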
The Edwards-curve Digital Signature Algorithm (EdDSA) is the most significant application of twisted Edwards curves. ED25519 is a twisted Edwards curve used for EdDSA, with parameters defined as [7]:
$$E_{T,a,d}(\mathbb{F}_p): -x^2 + y^2 = 1 - \frac{121665}{121666}\, x^2 y^2, \qquad p = 2^{255} - 19 \tag{7}$$
The Montgomery curve corresponding to ED25519 is CURVE25519, defined as [8]:
$$E_{M,A,B}(\mathbb{F}_p): v^2 = u^3 + 486662\, u^2 + u, \qquad p = 2^{255} - 19 \tag{8}$$
Point multiplication is fast and efficient on Montgomery curves. It combines differential point addition and point doubling [5] with the uniform Montgomery ladder algorithm [9]. The uniform Montgomery ladder runs in constant time, which makes its implementations robust to timing attacks. CURVE25519 has been used in many software implementations since its introduction by Bernstein in 2006 [8]. It has also become a promising candidate for Internet of Things (IoT) applications due to its 128-bit security level and efficient arithmetic. Recently, a number of hardware implementations with a focus on IoT applications have been introduced [10][11][12][13]. All these works use FPGA DSP slices to implement modular multipliers. High-performance cryptographic processors that can be implemented on low-cost FPGAs or ASICs are in demand for mobile applications such as the Internet of Things (IoT) and Intelligent Transport Systems (ITS) [14]. Low-cost FPGAs (including antifuse-based FPGAs) are restricted in the number of hardware resources. A portable low-power design that uses minimal hardware resources without sacrificing performance is therefore the most appealing. In the following, we propose an area-efficient, low-power hardware implementation of CURVE25519 and ED25519 on FPGA.
Our design uses neither hardware multipliers nor the DSP units of the FPGA. We introduce a high-speed interleaved modular multiplier tailored for this application. Section 2 provides background on the Elliptic Curve Discrete Logarithm Problem (ECDLP) and the arithmetic of the curves ED25519 and CURVE25519 used in this work. Section 3 introduces the hardware design of the point multiplication core, and Section 4 presents implementation results and comparisons with previous work.

Background
Like any elliptic curve cryptosystem, the security of ED25519 and CURVE25519 is based on the elliptic curve discrete logarithm problem (ECDLP). Let E be an elliptic curve defined over the prime field $\mathbb{F}_p$ and let the group of rational points on the curve E be denoted by $E(\mathbb{F}_p)$. Now, consider a point $P \in E(\mathbb{F}_p)$ of order n and the cyclic subgroup of $E(\mathbb{F}_p)$ generated by P, i.e., $\langle P \rangle = \{O, P, 2P, \ldots, (n - 1)P\}$. Take a random integer $k \in [1, n - 1]$ and compute Q = k · P. Given the domain parameters and Q, the problem of determining the integer k is called the ECDLP [15]. The point Q can be easily computed from a given k using the one-way function Q = k · P (called elliptic curve point multiplication or scalar multiplication). However, it is computationally difficult to calculate k from known points Q and P.
Optimized explicit point addition and point doubling formulae for twisted Edwards curves and Montgomery curves are presented in [5]. Projective coordinates are used in this work. The input Z-coordinate for the point P is set to Z = 1, so the transformation from affine to projective coordinates can be done at no cost.
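To make the projective arithmetic concrete, here is a minimal Python sketch of point addition and doubling on ED25519 in projective coordinates. It follows the commonly published projective twisted Edwards formulas and is not the rearranged hardware schedule of this work. The function names `ecpa`, `ecpd`, `base_point`, and `scalar_mult` are illustrative.

```python
# Sketch: projective twisted Edwards arithmetic on ED25519.
# Assumption: textbook projective formulas, not the paper's register schedule.
p = 2**255 - 19
a = p - 1                                    # a = -1
d = -121665 * pow(121666, p - 2, p) % p      # d = -121665/121666

def ecpa(P, Q):
    """Unified projective point addition P + Q."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    A = Z1 * Z2 % p
    B = A * A % p
    C = X1 * X2 % p
    D = Y1 * Y2 % p
    E = d * C * D % p
    F = (B - E) % p
    G = (B + E) % p
    X3 = A * F * ((X1 + Y1) * (X2 + Y2) - C - D) % p
    Y3 = A * G * (D - a * C) % p
    return X3, Y3, F * G % p

def ecpd(P):
    """Projective point doubling 2P."""
    X1, Y1, Z1 = P
    B = (X1 + Y1) ** 2 % p
    C = X1 * X1 % p
    D = Y1 * Y1 % p
    E = a * C % p
    F = (E + D) % p
    J = (F - 2 * Z1 * Z1) % p
    return (B - C - D) * J % p, F * (E - D) % p, F * J % p

def to_affine(P):
    """Single inversion at the end converts (X : Y : Z) back to (x, y)."""
    X, Y, Z = P
    zi = pow(Z, p - 2, p)
    return X * zi % p, Y * zi % p

def base_point():
    """Base point with y = 4/5 and Z = 1 (affine to projective at no cost)."""
    y = 4 * pow(5, p - 2, p) % p
    xx = (y * y - 1) * pow(d * y * y + 1, p - 2, p) % p
    x = pow(xx, (p + 3) // 8, p)
    if (x * x - xx) % p != 0:
        x = x * pow(2, (p - 1) // 4, p) % p
    if x % 2:                                # choose the even root
        x = p - x
    return x, y, 1

def scalar_mult(k, P):
    """Left-to-right double-and-add, starting from the neutral point (0, 1)."""
    R = (0, 1, 1)
    for bit in bin(k)[2:]:
        R = ecpd(R)
        if bit == "1":
            R = ecpa(R, P)
    return R
```

Because the addition law is unified, `ecpa(P, P)` and `ecpd(P)` agree, which is a convenient self-check for any rearranged version of the formulas.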
We made minor modifications to the Elliptic Curve Point Doubling (ECPD) formulae (9) and the Elliptic Curve Point Addition (ECPA) formulae (10) to minimize the number of holding registers and to optimize the hardware implementation. The data flow diagrams of ECPD and ECPA are shown in Figures 1 and 2, respectively. At each level, one modular multiplication is performed. Modular addition and/or modular subtraction is performed in parallel with the modular multiplication whenever possible. Point multiplication on the Montgomery curve CURVE25519 can be done using efficient uniform differential point addition and doubling on the X and Z projective coordinates. This allows low-latency, low-power hardware implementations. Explicit formulae can be found in [5]. We have rearranged these formulae as in (11) for a hardware-optimized implementation. As with ED25519, each level performs one modular multiplication in parallel with any possible modular addition and/or modular subtraction. The data flow diagram of differential point addition and doubling on CURVE25519 is shown in Figure 3.
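The uniform differential addition and doubling step can be sketched in Python as follows. This is the widely used X/Z-only Montgomery ladder step as specified for X25519, not the rearranged formulae (11) of this work; scalar clamping is omitted, and the name `ladder` is illustrative. The Python `if` used for the conditional swap is not constant-time; hardware would use a multiplexer instead.

```python
# Sketch: X/Z-only Montgomery ladder on CURVE25519.
# Assumption: standard X25519 step formulas, no scalar clamping.
P = 2**255 - 19
A24 = 121665                      # (486662 - 2) / 4

def ladder(k, u, bits=255):
    """Return the affine u-coordinate of k * (u, .)."""
    x1 = u % P
    x2, z2 = 1, 0                 # point at infinity
    x3, z3 = x1, 1                # input point with Z = 1
    swap = 0
    for t in reversed(range(bits)):
        kt = (k >> t) & 1
        swap ^= kt
        if swap:                  # conditional swap (a mux in hardware)
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = kt
        a = (x2 + z2) % P
        b = (x2 - z2) % P
        aa = a * a % P
        bb = b * b % P
        e = (aa - bb) % P
        c = (x3 + z3) % P
        dd = (x3 - z3) % P
        da = dd * a % P
        cb = c * b % P
        x3 = (da + cb) ** 2 % P           # differential addition
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P                  # doubling
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    return x2 * pow(z2, P - 2, P) % P     # one inversion at the end

# Diffie-Hellman style consistency check with two arbitrary scalars
s1, s2 = 0x1234567, 0x89ABCDEF
shared_1 = ladder(s1, ladder(s2, 9))
shared_2 = ladder(s2, ladder(s1, 9))
```

Every loop iteration performs the same sequence of field operations regardless of the scalar bit, which is what makes the latency of this method constant.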

Interleaved Modular Multiplication Algorithm
Modular multiplication is a basic operation of crucial importance in elliptic curve cryptography. The interleaved modular multiplication algorithm, unlike other modular multiplication methods, does not employ actual multipliers. The basic interleaved modular multiplication algorithm [16] is shown in Algorithm 1. The idea is to interleave the accumulation steps of the multiplication with the steps of a division operation. In each loop iteration, one operand is multiplied by a single bit of the other operand, and the result is reduced by the modulus to control the size of the intermediate values: a multiple of the modulus is subtracted from the accumulator before the next partial product is added. It is essential to start with the most significant bit so that the intermediate values do not grow when the shifted multiplicand is added. A direct implementation of this algorithm is not efficient because of the carry propagation delay of the long adder and the sequential steps, which add delay to the circuit and increase the clock period. Bunimov et al. [17] proposed an architecture to solve these problems. A carry save adder (CSA) is used to eliminate the carry propagation delay.
Instead of comparing with the modulus, they compared the intermediate values with $2^n$, and they precomputed the difference and saved it in a look-up table. This precomputed difference is added to the intermediate value at the next iteration. Figure 5 shows their proposed hardware architecture. To reduce the number of cycles, high-radix techniques have been proposed [18][19][20][21].
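The basic bit-serial idea described above can be sketched in a few lines of Python. This is a minimal software model of the textbook interleaved method, written under the assumption that y < p and x < 2^n; it does not model the carry-save or look-up-table refinements of [17]. The function name is illustrative.

```python
# Sketch: basic (radix-2) interleaved modular multiplication.
# Assumptions: 0 <= y < p and 0 <= x < 2**n.
def interleaved_modmul(x, y, p, n):
    acc = 0
    for i in reversed(range(n)):      # most significant bit first
        acc <<= 1                     # shift the accumulator
        if (x >> i) & 1:
            acc += y                  # add the partial product
        # acc < 3p here, so at most two subtractions bring it below p
        if acc >= p:
            acc -= p
        if acc >= p:
            acc -= p
    return acc

p = 2**255 - 19
x = 0x1F2E3D4C5B6A79881F2E3D4C5B6A7988 ** 2 % p
y = 0x0123456789ABCDEF0123456789ABCDEF ** 2 % p
product = interleaved_modmul(x, y, p, 255)
```

Only shifts, additions, comparisons, and subtractions appear in the loop, which is why no hardware multiplier is needed; the price is one iteration per bit of x.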
Algorithm 1: Basic interleaved modular multiplication algorithm [16]
Input: X, Y, p
Output: X · Y mod p
Algorithm 2 presents our proposed radix-8 interleaved modular multiplier. The values {2Y mod p, ..., 7Y mod p} are precomputed before the shift cycles start and are stored in a look-up table (LUT1). Completing the for loop of the algorithm requires 85 clock cycles. Three bits of the multiplicand X are read at every clock cycle and select the output of LUT1. At the end of the loop, the accumulator value is not greater than 12p. The algorithm was verified with 10,000 random 255-bit integers using MAPLE. The MAPLE implementation of Algorithm 2 is given in Appendix A. The hardware implementation is very efficient. Figure 6 depicts the hardware implementation of the proposed radix-8 Algorithm 2. The loop logic has a maximum net and logic delay of 1.8 ns, so a maximum clock frequency of 550 MHz is achievable. The weakness of this algorithm is the latency of calculating LUT1 for every new run. However, even when this latency is taken into account, the overall improvement over similar works is notable. The first ten clock cycles are treated as waiting cycles to complete the LUT1 table. The modular multiplication is then completed in 95 clock cycles. At a clock frequency of 550 MHz, this is equivalent to 172.7 ns. A reducer logic outputs a complete modular reduction at the end of the last clock cycle; it costs one comparison and one subtraction and adds a 7.2 ns delay to the output. The latency of a complete reduction is therefore 180 ns.
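A radix-8 variant of the interleaved method can be modelled as in the following Python sketch. It keeps the two features stated above (a table of the multiples of Y and 85 iterations that each consume three bits of X), but the partial reduction step is an assumption of this sketch: it folds the overflow using $2^{255} \equiv 19 \pmod p$ rather than reproducing the reduction logic of Algorithm 2, so its intermediate bound differs from the 12p bound quoted for the hardware.

```python
# Sketch: radix-8 interleaved modular multiplication modulo 2**255 - 19.
# Assumption: partial reduction by folding with 2**255 = 19 (mod p);
# this is a stand-in, not the paper's Algorithm 2.
P = 2**255 - 19
MASK = 2**255 - 1

def radix8_modmul(x, y):
    """Return x * y mod P for 0 <= x < 2**255 and 0 <= y < P."""
    # LUT1: multiples 0*Y .. 7*Y mod P, built by repeated modular addition
    lut1 = [0] * 8
    for j in range(1, 8):
        t = lut1[j - 1] + y
        lut1[j] = t - P if t >= P else t
    acc = 0
    for i in reversed(range(85)):            # 85 digits of 3 bits = 255 bits
        digit = (x >> (3 * i)) & 7           # next radix-8 digit, MSB first
        acc = (acc << 3) + lut1[digit]       # shift by 3 and add a multiple of Y
        acc = (acc & MASK) + 19 * (acc >> 255)   # fold the bits above 2**255
    # acc < 2**255 + 171 < 2P, so one conditional subtraction completes it
    if acc >= P:
        acc -= P
    return acc

x = pow(3, 200, P)
y = pow(5, 300, P)
result = radix8_modmul(x, y)
```

Compared with the bit-serial loop, the iteration count drops from 255 to 85 at the cost of filling the table before the loop starts, which mirrors the trade-off described for LUT1.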
In this architecture, the high-frequency logic is very small (900 FPGA LUTs or 22% of total area), hence the dynamic power consumption is very low. Table 1 compares the performance of our design with some similar works in the literature. It must be noted that the circuit area and latency reported in [19,21] relate to a partial reduction, and the final reduction logic is not taken into account. We estimated the power consumption of the designs presented in [19,21] with the Xilinx Power Estimator tool. These estimates are based on the reported FPGA resource usage, clock frequency, average signal toggle rate, and default average fan-out.

Modular Addition and Subtraction
Figures 7 and 8 show the proposed fast and area-efficient modular addition and subtraction units, respectively. The Carry Propagate Adder (CPA) and Carry Save Adder (CSA) resources are shared, which makes the implementation fast and very area-efficient. ($\bar{B}$ and $\bar{p}$ denote the bitwise complements not(B) and not(p), respectively.)
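The role of the complemented operands can be illustrated with a small Python model of two's-complement modular addition and subtraction. This is a sketch under the assumption of a 256-bit datapath and operands already reduced below p; it shows the arithmetic idea only and does not reproduce the shared CPA/CSA structure of Figures 7 and 8.

```python
# Sketch: modular add/subtract via bitwise complements (two's complement).
# Assumptions: W-bit datapath, inputs already reduced (0 <= a, b < P).
W = 256
MASK = (1 << W) - 1
P = 2**255 - 19

def mod_add(a, b, p=P):
    s = a + b                        # plain sum, s < 2p
    t = s + (~p & MASK) + 1          # s + not(p) + 1 = s - p + 2**W
    # carry out of bit W-1 means s >= p, so the reduced value is selected
    return t & MASK if t >> W else s

def mod_sub(a, b, p=P):
    t = a + (~b & MASK) + 1          # a + not(b) + 1 = a - b + 2**W
    # no carry means a < b, so p is added back to wrap into [0, p)
    return t & MASK if t >> W else (t + p) & MASK

total = mod_add(P - 1, P - 2)        # wraps around the modulus
diff = mod_sub(3, 5)                 # negative difference wraps to P - 2
```

Subtraction thus reuses the adder: complementing an operand and injecting a carry-in of one replaces a dedicated subtractor, and the carry-out doubles as the comparison result.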

ED25519 and CURVE25519 Point Multiplication Core
The point multiplication unit uses an Arithmetic Logic Unit (ALU) that consists of a point addition state machine and a point doubling state machine. As shown in Figure 9, the state machines share the modular multiplication and modular add/subtract arithmetic units as well as the register bank resources.

Modular Inversion
There are basically two methods to calculate the modular inverse of an integer in the field $\mathbb{F}_p$: the Euclidean algorithm and Fermat's little theorem [15]. The Euclidean algorithm is a recursive method that needs multiplications, additions/subtractions, and storage of intermediate results at every iteration. Implementing the Euclidean algorithm requires additional dedicated hardware, which conflicts with our low-area design approach. Based on Fermat's little theorem, $a^{-1} \equiv a^{p-2} \pmod{p}$. The modular field inversion unit can therefore be implemented by sharing the interleaved modular multiplier with the point multiplication unit. As mentioned in [8], 254 squarings and 11 multiplications are required to complete a field inversion for $p = 2^{255} - 19$. However, the actual sequence of operations is not provided. Our design uses the chain in (12) to return a modular inversion using 265 modular multiplications.
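One way to reach the count of 254 squarings and 11 multiplications is the addition chain commonly found in public CURVE25519 software. The Python sketch below implements that chain and counts the operations; it is offered as an illustration and may differ from the chain (12) used in this design. The names `fe_invert`, `sqr`, and `mul` are illustrative.

```python
# Sketch: Fermat inversion a^(p-2) mod p with a fixed addition chain.
# Assumption: the chain used in common CURVE25519 software, which may not
# be identical to the paper's chain (12).
P = 2**255 - 19

def fe_invert(z):
    ops = {"sqr": 0, "mul": 0}

    def sqr(v, n=1):
        """Square v n times, counting each squaring."""
        for _ in range(n):
            v = v * v % P
            ops["sqr"] += 1
        return v

    def mul(u, v):
        ops["mul"] += 1
        return u * v % P

    z2 = sqr(z)                               # z^2
    z9 = mul(z, sqr(z2, 2))                   # z^9
    z11 = mul(z2, z9)                         # z^11
    z_5_0 = mul(z9, sqr(z11))                 # z^(2^5 - 1)
    z_10_0 = mul(sqr(z_5_0, 5), z_5_0)        # z^(2^10 - 1)
    z_20_0 = mul(sqr(z_10_0, 10), z_10_0)     # z^(2^20 - 1)
    z_40_0 = mul(sqr(z_20_0, 20), z_20_0)     # z^(2^40 - 1)
    z_50_0 = mul(sqr(z_40_0, 10), z_10_0)     # z^(2^50 - 1)
    z_100_0 = mul(sqr(z_50_0, 50), z_50_0)    # z^(2^100 - 1)
    z_200_0 = mul(sqr(z_100_0, 100), z_100_0) # z^(2^200 - 1)
    z_250_0 = mul(sqr(z_200_0, 50), z_50_0)   # z^(2^250 - 1)
    result = mul(sqr(z_250_0, 5), z11)        # z^(2^255 - 21) = z^(p - 2)
    return result, ops

inverse, op_counts = fe_invert(123456789)
```

Since a squaring is executed on the same interleaved multiplier as a general multiplication, the cost in this architecture is simply the sum of both counts.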

Results and Comparison
The implementation results of our design on ZYNQ 7000 series FPGAs are presented in Table 2. We used minimal hardware resources to achieve the low-area/low-power goals, which are very appealing for IoT applications. Point multiplication on ED25519 uses the Double-and-Add and NAF (Non-Adjacent Form) algorithms [15]. The average core latency is calculated using one thousand 255-bit random scalars. Several implementations of CURVE25519 and ED25519 are presented in [10][11][12][13]. Table 3 summarizes the outcome of these works for comparison. All these works focus on the speed of the point multiplication operation, which they achieve by adding arithmetic resources to the hardware. Heuristic methods are presented in [12,13] to optimize the modular multiplication, which is the critical arithmetic unit. All known works employ FPGA DSP modules to implement the modular multipliers. In our design, however, no DSP module is used. We estimated the power consumption of these works with the Xilinx Power Estimator tool. The power consumption shown in Table 3 is calculated from the FPGA resources used in each work, together with the clock frequency, the default average fan-out, and a 100% toggle rate for the DSP modules. We implemented multipliers using LUT resources only, to provide a baseline comparison. In [10,13], fifteen 17-bit × 17-bit multipliers are used to implement a 255-bit modular multiplier. This implementation requires 345 LUTs for each multiplier. As a comparison, a 64-bit × 64-bit multiplier needs 4256 LUTs and a 127-bit × 127-bit multiplier uses 24335 LUTs on a 7-Series Xilinx FPGA. Table 3 gives an estimate of the core area in [10][11][12][13] assuming that no DSP slice is used.
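The NAF recoding mentioned above for the ED25519 core can be sketched as follows. This is the textbook right-to-left recoding, shown only to illustrate why the latency depends on the scalar and is reported as an average; the function name `naf` is illustrative and the sketch does not model the hardware controller.

```python
# Sketch: textbook non-adjacent form (NAF) recoding of a scalar.
def naf(k):
    """Return NAF digits of k, least significant first, each in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

k = 2**200 + 0xF0F0F0F0F0F0F0F0F7
digits = naf(k)
additions_naf = sum(1 for d in digits if d != 0)      # additions/subtractions
additions_binary = bin(k).count("1")                  # plain double-and-add
print(naf(7))   # → [-1, 0, 0, 1], i.e. 7 = 8 - 1
```

A digit of −1 costs a point subtraction, which is cheap on a twisted Edwards curve because the negative of (x, y) is (−x, y). No two adjacent NAF digits are nonzero, so a random scalar needs fewer point additions on average than its binary expansion.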

Side-Channel Attacks Considerations
Resistance against side-channel attacks can be added to the ED25519 and CURVE25519 point multiplication cores by different approaches. The Montgomery powering ladder algorithm [22] can be employed in the ED25519 point multiplication hardware to hide the power spectrum patterns of ECPD and ECPA and to provide resistance against SPA (Simple Power Analysis) attacks. At every step, both the ECPD and ECPA operations are performed, and it is then decided which result is used for the next step. The latency of the Montgomery ladder algorithm is constant and equal to the latency of 255 point doublings and 254 point additions. The constant calculation time provides immunity to timing attacks. The disadvantage of the Montgomery ladder algorithm is its larger latency compared to other point multiplication methods. CURVE25519, however, uses the uniform differential point addition and point doubling method. To implement resistance to DPA (Differential Power Analysis) attacks, a random factor λ is injected into the projective coordinates of the initial point P(x, y) [23]. The randomized projective point $P(\lambda X_1, \lambda Y_1, \lambda Z_1)$ is then used. The algorithm starts with $X_2 = \lambda$, $Z_2 = 0$, $X_3 = X_1$, $Z_3 = \lambda$. The formulae (11) and Figure 3 must be revised by setting $X_3 = \lambda \cdot G^2$. The DPA attack cannot succeed because it is not possible to predict any specific bit of 4P (or of other multiples of P) in randomized projective coordinates [15]. The timing cost of this approach is 256 additional modular multiplications: two initial ones and 254 in the loop.
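The effect of the random factor λ on the CURVE25519 ladder can be modelled with the Python sketch below. It uses the standard X/Z ladder step with the difference point kept in randomized projective form (λx : λ), so that the new X coordinate of the differential addition picks up one extra multiplication by λ. The variable names and the exact placement of that multiplication are assumptions of this sketch; the revised formulae (11) are not reproduced here.

```python
# Sketch: Montgomery ladder with randomized projective coordinates.
# Assumptions: standard X25519 step formulas; lam is any nonzero field
# element; names are hypothetical.
import secrets

P = 2**255 - 19
A24 = 121665

def ladder_randomized(k, u, lam, bits=255):
    x1r = lam * u % P             # randomized X1 = lam * x
    z1r = lam % P                 # randomized Z1 = lam
    x2, z2 = lam % P, 0           # point at infinity, scaled by lam
    x3, z3 = x1r, z1r             # input point in randomized coordinates
    swap = 0
    for t in reversed(range(bits)):
        kt = (k >> t) & 1
        swap ^= kt
        if swap:
            x2, x3 = x3, x2
            z2, z3 = z3, z2
        swap = kt
        a = (x2 + z2) % P
        b = (x2 - z2) % P
        aa = a * a % P
        bb = b * b % P
        e = (aa - bb) % P
        da = (x3 - z3) * a % P
        cb = (x3 + z3) * b % P
        x3 = z1r * (da + cb) ** 2 % P     # extra multiplication by lam
        z3 = x1r * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    if swap:
        x2, x3 = x3, x2
        z2, z3 = z3, z2
    return x2 * pow(z2, P - 2, P) % P

lam = secrets.randbelow(P - 1) + 1        # fresh nonzero factor per run
masked = ladder_randomized(0xC0FFEE, 9, lam)
plain = ladder_randomized(0xC0FFEE, 9, 1)
```

The affine result is independent of λ, while every intermediate X and Z value changes from run to run, which is what prevents an attacker from predicting intermediate bits.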

Conclusions
Interleaved modular multipliers are very efficient in terms of area and power consumption and work at high clock frequencies. We introduced a radix-8 interleaved modular multiplier algorithm to reduce the number of clock cycles required for one modular multiplication. We implemented ED25519 and CURVE25519 point multiplication cores using the interleaved modular multiplier as a primitive arithmetic unit. A comparison of our results with the most recent works listed in Table 3 shows that we achieved a low-power, area-efficient design. The modular multiplier is the critical arithmetic unit that determines the overall performance of the hardware. Further research on improving the performance of the modular multiplier is recommended as future work.

Figure 3. CURVE25519 differential point addition and point doubling flow diagram. The legend of Figures 1-3 is given separately in Figure 4.

Figure 7. Hardware implementation of modular addition (A + B mod p).

Figure 8. Hardware implementation of modular subtraction (A − B mod p).

Figure 10 shows the point multiplication core configuration. Projective coordinates are used to avoid modular inversions. Finally, one modular inversion is required at the end of the computation to convert the projective coordinates back to affine.

Table 2. Implementation results of ED25519 and CURVE25519 point multiplication cores on ZYNQ 7000 Series.

Table 3. Implementation results of other works.