Design and Implementation of High-Performance ECC Processor with Unified Point Addition on Twisted Edwards Curve

With the swift evolution of wireless technologies, the demand for Internet of Things (IoT) security is rising rapidly. Elliptic curve cryptography (ECC) provides an attractive solution to fulfill this demand. In recent years, Edwards curves have gained widespread acceptance in digital signatures and ECC because they offer faster group operations and higher resistance against side-channel attacks (SCAs) than the Weierstrass form of elliptic curves. In this paper, we propose a high-speed, low-area, simple power analysis (SPA)-resistant field-programmable gate array (FPGA) implementation of an ECC processor with unified point addition on a twisted Edwards curve, namely Edwards25519. Efficient hardware architectures for modular multiplication, modular inversion, unified point addition, and elliptic curve point multiplication (ECPM) are proposed. To reduce the computational complexity of ECPM, the ECPM scheme is designed in projective coordinates instead of affine coordinates. The proposed ECC processor performs 256-bit point multiplication over a prime field in 198,715 clock cycles and takes 1.9 ms with a throughput of 134.5 kbps, occupying only 6543 slices on a Xilinx Virtex-7 FPGA platform. It supports high-speed public-key generation using fewer hardware resources without compromising the security level, which is a challenging requirement for IoT security.


Introduction
The Internet of Things (IoT) refers to a global network in which billions of devices are connected through the Internet and share data with each other. Since most of these devices have constrained resources, data are usually stored in the cloud, where people can continuously upload and download data from anywhere via the Internet [1]. Security concerns arise because data owners have no control over data management in the cloud-computing environment. The importance of data security and the limited resources of IoT devices motivate the use of lightweight cryptographic schemes that can satisfy the security, low-energy, and low-memory requirements of existing IoT applications.
Elliptic curve cryptography (ECC), a form of public-key cryptography (PKC), has become a promising approach to IoT security, smart card security, and digital signatures because it provides high levels of security with small key sizes. Compared with the traditional Rivest-Shamir-Adleman (RSA) algorithm, ECC provides an equal level of security with a shorter key length [2][3][4]. ECC can be implemented with low hardware resource usage and low energy consumption without degrading its security.
Edwards curves, a family of elliptic curves, are gaining enormous attention among security researchers because of their simplicity and high resistance against side-channel attacks (SCAs) [26]. Elliptic curve point multiplication (ECPM) on Edwards curves is faster and more secure than that on the Weierstrass form of elliptic curves [27,28]. Edwards curves have the advantage of providing strongly unified addition formulas [28], which cover both point addition (PA) and point doubling (PD), so separate hardware architectures for PA and PD are not required to perform ECPM. Moreover, unified PA counters simple power analysis (SPA) attacks by making the secret key indistinguishable in power traces: when ECPM uses the same module for PA and PD, the binary bit pattern of the secret key cannot be retrieved by SPA. The twisted Edwards curves are a generalization of Edwards curves [29] and are mainly used in the digital signature scheme EdDSA. One of the twisted Edwards curves best suited to digital signature systems is Edwards25519, which is the Edwards form of the elliptic curve Curve25519 [23,30]. The Edwards25519 curve is used in a high-speed, high-security digital signature scheme called Ed25519 [15,16]. ECPM using a unified twisted Edwards curve not only provides high resistance against SPA but also reduces the area of ECC processors.
ECC can be implemented in both hardware and software. Although a software implementation is simple and cost-effective, it cannot provide the high-speed computation that a hardware implementation can. However, implementing ECC in hardware with limited resources is highly challenging because low hardware use leads to a lower computational speed. From this point of view, Edwards curves are more effective than classical elliptic curves, as they can be implemented in a smaller area with a higher processing speed. Most of the hardware implementations of ECC reported in the literature are based on the Weierstrass form of elliptic curves, and few hardware implementations based on twisted Edwards curves over GF(p) have been reported. Baldwin et al. [31] first documented a hardware implementation of a reconfigurable 192-bit ECC processor adopting a twisted Edwards curve over GF(p). They compare the FPGA implementation of an elliptic curve-based point multiplication with that of a twisted Edwards curve-based point multiplication for different numbers of arithmetic logic units (ALUs) operated in parallel, which shows the Edwards curve to be more efficient. Additionally, the twisted Edwards curve point operations are compared with the unified version of these operations. Although the unified version shows slightly worse performance, it provides a higher resistance against SPA. Liu et al. [21] present a computable endomorphism on twisted Edwards curves to speed up the ECDSA verification process. They provide an area-efficient hardware architecture for signature verification with its FPGA implementation. An application-specific integrated circuit (ASIC) implementation of the architecture is also provided for low-cost applications. The implementation results show that the design reduces the number of PD operations required by approximately 50%. Parallel architectures for ECPM on extended twisted Edwards curves are proposed by Abdulrahman et al. [32]. The authors present a new radix-8 ECPM algorithm to cope with SCAs and speed up computations. However, no hardware implementation of these architectures is reported.
In this paper, a lightweight FPGA-based hardware implementation of ECC over GF(p) is proposed for IoT appliances. The major contributions of this paper are summarized as follows:
• An efficient radix-4 interleaved modular multiplier is proposed to perform 256-bit modular multiplication over a prime field.
• A novel hardware architecture for strongly unified PA on the Edwards25519 curve is proposed.
• An efficient ECPM scheme is proposed to perform high-speed point multiplication on the Edwards25519 curve. The same module is used for PA and PD to prevent probable SPA attacks. The area required by the scheme is significantly lower than that of other available designs for ECPM.
• ECPM is performed in projective coordinates to avoid the modular division operation, which is the most expensive operation in terms of computational complexity. In addition, a projective-to-affine (P2A) converter is proposed to transform the projective output into its affine form. This approach removes the computation time that affine coordinate-based PA would otherwise spend on modular division.
• An ECC processor is designed by combining the ECPM scheme and the P2A converter in such a manner as to reduce the number of modular inversion operations required. The area-delay product of the proposed ECC processor is considerably small, which indicates a better performance of our processor.
The rest of this paper is organized as follows: Section 2 presents the mathematical background of the twisted Edwards curve and unified PA formula. Section 3 presents the proposed hardware architectures for field operations (modular multiplication and modular inversion), unified PA, ECPM, and ECC processor. Section 4 presents the implementation results of the proposed designs. Section 5 shows a performance comparison of our proposed ECC processor with other related processors. Finally, Section 6 concludes this research study.

Mathematical Background
This section presents the twisted Edwards curve with its affine and projective representations as well as the unified PA formula for the curve.

Twisted Edwards Curve
The affine representation of a twisted Edwards curve over a prime field $\mathbb{F}_p$ of characteristic other than 2 is given by the equation [23,29]:

$$T_{a,d}: ax^2 + y^2 = 1 + dx^2y^2 \tag{1}$$

where $a, d \in \mathbb{F}_p \setminus \{0\}$ with $a \neq d$. When $a = 1$, the curve is called the untwisted Edwards curve or, formally, the Edwards curve. In the case of $a = -1$, the curve becomes

$$T_d: -x^2 + y^2 = 1 + dx^2y^2 \tag{2}$$

When $a = -1$, $d = -121665/121666$, and $p = 2^{255} - 19$, the curve is called Edwards25519, which is the Edwards form of the elliptic curve Curve25519 [23]. In a projective or Jacobian coordinate system, each point $(x, y)$ on $T_{a,d}$ is represented by a triplet $(X, Y, Z)$. The affine point $P(x, y)$ corresponds to the projective point $P(X = x, Y = y, Z = 1)$. The projective point $P(X, Y, Z)$ corresponds to the affine point $P(x = X/Z, y = Y/Z)$ with $Z \neq 0$.
The projective representation of the curve $T_{a,d}$, obtained by substituting $x = X/Z$ and $y = Y/Z$ into (1) and multiplying by $Z^4$, is given by the equation [23,29]:

$$(aX^2 + Y^2)Z^2 = Z^4 + dX^2Y^2 \tag{3}$$

The projective form of the curve $T_d$ is given by the equation:

$$(-X^2 + Y^2)Z^2 = Z^4 + dX^2Y^2 \tag{4}$$

Unified Point-Addition Formula
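PA on the curve $T_d$ in projective coordinates is given by the equation:

$$P_3(X_3, Y_3, Z_3) = P_1(X_1, Y_1, Z_1) + P_2(X_2, Y_2, Z_2) \tag{5}$$

where $P_1$ and $P_2$ are two points on the curve and $P_3$ is the resultant point. The unified PA formula [29] for $T_d$ can be given as follows:

$$\begin{aligned} X_3 &= Z_1Z_2\,(X_1Y_2 + Y_1X_2)\,(Z_1^2Z_2^2 - dX_1X_2Y_1Y_2) \\ Y_3 &= Z_1Z_2\,(Y_1Y_2 + X_1X_2)\,(Z_1^2Z_2^2 + dX_1X_2Y_1Y_2) \\ Z_3 &= (Z_1^2Z_2^2 - dX_1X_2Y_1Y_2)\,(Z_1^2Z_2^2 + dX_1X_2Y_1Y_2) \end{aligned} \tag{6}$$

The above formula is applicable for both PA and PD. PD can be performed by taking the points $P_1$ and $P_2$ to be identical.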
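As a minimal software sketch of the unified formula, the Python below evaluates the standard projective twisted Edwards addition law for a = −1 over p = 2^255 − 19, which is assumed here to be the form the text refers to as (6). The function names (`unified_add`, `on_curve`, `to_affine`, `base_point`) and the use of Python's built-in `pow` for field inversion are illustrative assumptions, not part of the proposed hardware.

```python
# Sketch only: software model of unified point addition on Edwards25519.
P = 2**255 - 19                                # field prime
D = (-121665 * pow(121666, P - 2, P)) % P      # curve constant d = -121665/121666

def unified_add(p1, p2):
    """Add two projective points; the same code also handles p1 == p2 (doubling)."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    a = z1 * z2 % P                            # Z1*Z2
    b = a * a % P                              # Z1^2 * Z2^2
    e = D * x1 * x2 * y1 * y2 % P              # d*X1*X2*Y1*Y2
    f, g = (b - e) % P, (b + e) % P
    x3 = a * (x1 * y2 + y1 * x2) * f % P
    y3 = a * (y1 * y2 + x1 * x2) * g % P       # curve coefficient is -1, hence the plus sign
    z3 = f * g % P
    return x3, y3, z3

def on_curve(pt):
    """Check the projective equation (-X^2 + Y^2) Z^2 = Z^4 + d X^2 Y^2."""
    x, y, z = pt
    return ((-x * x + y * y) * z * z - z**4 - D * x * x * y * y) % P == 0

def to_affine(pt):
    """Map (X, Y, Z) to (X/Z, Y/Z) using Fermat inversion."""
    x, y, z = pt
    z_inv = pow(z, P - 2, P)
    return x * z_inv % P, y * z_inv % P

def base_point():
    """Hypothetical helper: recover a curve point whose y-coordinate is 4/5."""
    y = 4 * pow(5, P - 2, P) % P
    x2 = (y * y - 1) * pow(D * y * y + 1, P - 2, P) % P   # x^2 from the curve equation
    x = pow(x2, (P + 3) // 8, P)                          # square-root candidate, p = 5 mod 8
    if (x * x - x2) % P != 0:
        x = x * pow(2, (P - 1) // 4, P) % P               # multiply by sqrt(-1)
    return x, y, 1
```

Calling `unified_add(B, B)` with `B = base_point()` performs a doubling through exactly the same code path as an addition of two distinct points, which is the property the hardware design relies on.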

Proposed Hardware Architectures
This section presents the proposed hardware architectures for ECC operations and the final ECC processor.

Modular Multiplication
Modular multiplication is the most important arithmetic operation of an ECC processor. The speed and occupied area of the processor depend heavily on it. Although a radix-2 multiplier consumes fewer hardware resources than higher-radix (e.g., radix-4 and radix-8) multipliers [33], it is not suitable for high-speed multiplication because of its high latency. To reduce the latency, an efficient radix-4 interleaved modular multiplication algorithm is proposed as demonstrated in Algorithm 1. It requires n/2 + 1 clock cycles (CCs) to multiply two n-bit integers A and B over the prime field GF(p), where p is an n-bit prime number. Figure 2 illustrates the proposed modular multiplier based on this algorithm.

Algorithm 1 Proposed Radix-4 Interleaved Modular Multiplication
Input: A, B, p
Output: C = (A × B) mod p
1: C ← 0;
2: T ← B||"01";
3: while T(n − 1 downto 0) ≠ 0 do
4: D ← 4C;
5: if T(n + 1 downto n) = "01" then
6: E ← D + A;
7: else if T(n + 1 downto n) = "10" then
8: E ← D + 2A;
9: else if T(n + 1 downto n) = "11" then
10: E ← D + 3A;
11: else
12: E ← D;
13: end if;
14: C ← E mod p;
15: T ← T(n − 1 downto 0)||"00"; \\ left shift operation
16: end while;
17: return C;

Modular multiplication is obtained by iteratively adding the interim partial products and reducing the sum modulo p. A shift-left register "Reg T" is used to perform left-to-right multiplication and to make the loop synthesizable. T[(n + 1) : 2] is initialized with the multiplier B and T[1 : 0] is initialized with "01". These two extra bits are appended at the rightmost position of the register T so that the end of the loop is determined correctly even when b_0 = 0. At the beginning of each iteration, the accumulator C is quadrupled to give D. For the digit-wise multiplication, A, 2A, and 3A are separately added to D. MUX1 selects one of the four outputs D, D + A, D + 2A, and D + 3A as E based on the two bits T[(n + 1) : n]. If T_{n+1} and T_n are both zero, E is simply D. At the end of each iteration, E is reduced modulo p and T is shifted to the left by 2 bits. In general, the modulo operation (E mod p) is performed by subtracting the multiples p, 2p, ..., (j − 1)p from E, where E is always less than jp (j = 3, 4, 5, ...). In this module, (E mod p) is obtained by subtracting the multiples p to 6p from E, because E = 4C + iA with C, A < p and i ≤ 3 is always less than 7p. These subtractions are executed using the 2's complement method. MUX2 selects one of the seven outputs E, E − p, E − 2p, E − 3p, E − 4p, E − 5p, and E − 6p as C for the next iteration based on the comparisons E ≥ p, E ≥ 2p, E ≥ 3p, E ≥ 4p, E ≥ 5p, and E ≥ 6p. These comparisons are obtained by checking the three bits E[(n + 1) : (n − 1)]. After n/2 iterations, B, as well as T[(n − 1) : 0], has been shifted to zero and the execution stops. The final content of the register "Reg C" is the modular product of A and B.
A total of n/2 + 1 CCs are required to perform the modular multiplication operation, where n/2 CCs correspond to the n/2 iterations and one extra CC is required for the initialization. To perform modular squaring, the inputs A and B are taken as identical.
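The loop described above can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the VHDL design: the register T is taken to be B followed by the sentinel bits "01", the loop is assumed to run while the low n bits of T are non-zero, and the function name `radix4_interleaved_modmul` is hypothetical. The reduction step mimics MUX2 by subtracting the largest multiple jp (j ≤ 6) that does not exceed E.

```python
def radix4_interleaved_modmul(a: int, b: int, p: int, n: int):
    """Return (A*B mod p, number of loop iterations) for n-bit operands, n even."""
    assert n % 2 == 0 and p < (1 << n) and 0 <= a < p and 0 <= b < p
    low_mask = (1 << n) - 1
    t = (b << 2) | 0b01              # (n+2)-bit register: B followed by sentinel "01"
    c = 0                            # accumulator "Reg C"
    iterations = 0
    while t & low_mask:              # stop once T[(n-1):0] has been shifted to zero
        d = 4 * c                    # quadruple the accumulator
        digit = (t >> n) & 0b11      # two most significant bits T[(n+1):n]
        e = d + digit * a            # D, D + A, D + 2A or D + 3A (always below 7p)
        for j in range(6, 0, -1):    # MUX2: subtract the largest multiple jp <= E
            if e >= j * p:
                e -= j * p
                break
        c = e
        t = (t & low_mask) << 2      # T <- T[(n-1):0] || "00"
        iterations += 1
    return c, iterations
```

For n = 256 the model runs 128 iterations, in line with the n/2 iterations stated above; the extra initialization cycle of the hardware is not modelled.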

Modular Inversion
Modular inversion is the costliest (in terms of the hardware resource requirements) arithmetic operation in finite fields. In affine representations, PA and PD require modular inversion operation to perform modular division. In this study, although our ECC processor is designed in projective coordinates, modular inversion is required for P2A conversion. Algorithm 2 [2] demonstrates the binary modular inversion for the P2A conversion module proposed in this paper. The hardware architecture of this module is depicted in Figure 3.

Algorithm 2 Binary Modular Inversion [2]
Input: B, p
Output: C = B^(−1) mod p
1: q ← B; r ← p; s ← 1; t ← 0;
2: while q > 1 do
3: while q(0) = 0 do
4: q ← q/2;
5: if s(0) = 0 then
6: s ← s/2;
7: else
8: s ← (s + p)/2;
9: end if;
10: end while;
11: while r(0) = 0 do
12: r ← r/2;
13: if t(0) = 0 then
14: t ← t/2;
15: else
16: t ← (t + p)/2;
17: end if;
18: end while;
19: if q > r then
20: q ← q − r;
21: if s > t then
22: s ← s − t;
23: else
24: s ← s + p − t;
25: end if;
26: else
27: r ← r − q;
28: if t > s then
29: t ← t − s;
30: else
31: t ← t + p − s;
32: end if;
33: end if;
34: end while;
35: return s mod p;

The contents of the registers "Reg Q", "Reg R", "Reg S", and "Reg T" are updated in every iteration. Five multiplexers, MUX1 to MUX5, select the corresponding outputs according to the conditions applied to their select lines. If q is even, MUX1 selects q/2 and MUX3 selects s/2 if s is even or (s + p)/2 if s is odd. If q is odd and greater than r, MUX1 selects q − r and MUX3 selects s − t if s > t or s + p − t otherwise. The comparisons q > r and s > t are obtained by checking the sign bits of the subtractions q − r and s − t, respectively. If q is odd and less than r, q and s both remain unchanged. Similarly, MUX2 selects one of the three outputs r, r/2, and r − q based on the conditions r(0) = 0 and r > q. MUX4 selects one of the five outputs t, t/2, (t + p)/2, t − s, and t + p − s based on the conditions r(0) = 0, t(0) = 0, r > q, and t > s. MUX5 selects the final result (s mod p) when q = 1. To detect this, 2 is subtracted from q at the end of each iteration to check whether q < 2. When the sign bit of the subtraction q − 2 is 1, (s mod p) is stored in the register "Reg C", which holds the modular inverse of B.
In this architecture, the modular inversion operation requires n + n/4 CCs on average: n iterations reduce the n-bit variable q to 1 in the regular case, and about n/4 additional iterations account for the cases in which q is odd. The actual number of clock cycles may deviate from this estimate depending on the binary bit pattern of B.
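The register updates of the binary inversion can be sketched in Python as follows. The sketch assumes the usual initialization q = B, r = p, s = 1, t = 0 and a loop guard of q > 1 (suggested by the q < 2 check described above); both are assumptions, and the function name `binary_mod_inverse` is hypothetical. It models the arithmetic only, not the one-update-per-clock multiplexer structure of the hardware.

```python
def binary_mod_inverse(b: int, p: int) -> int:
    """Return b^(-1) mod p for an odd prime p and 0 < b < p (binary inversion)."""
    assert p % 2 == 1 and 0 < b < p
    q, r, s, t = b, p, 1, 0          # invariants: q = s*b (mod p), r = t*b (mod p)
    while q > 1:
        while q % 2 == 0:            # halve q and keep s consistent modulo p
            q //= 2
            s = s // 2 if s % 2 == 0 else (s + p) // 2
        while r % 2 == 0:            # halve r and keep t consistent modulo p
            r //= 2
            t = t // 2 if t % 2 == 0 else (t + p) // 2
        if q > r:                    # subtract the smaller odd value from the larger
            q -= r
            s = s - t if s > t else s + p - t
        else:
            r -= q
            t = t - s if t > s else t + p - s
    return s % p
```

A quick sanity check is that `b * binary_mod_inverse(b, p) % p` equals 1 for any non-zero b, for example with p = 2**255 - 19.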

Unified Point Addition
Unified PA allows both PA and PD to be performed by the same module, which prevents possible SPA attacks in ECPM. The proposed hardware architecture for the unified PA formula described in (6) is depicted in Figure 4. The architecture includes 12 multiplications, 1 squaring, 3 additions, and 1 subtraction, which can be denoted as (12M+1S+4A). The proposed design consists of four consecutive levels, in which the arithmetic modules are connected sequentially. Within each level, the modules are arranged in parallel to achieve the shortest data path. The whole architecture is balanced to reduce the area required. Start signals are used to start the arithmetic operations and Done signals are used to confirm the end of the operations. The Done signals of the modules at each level serve as the Start signals of the modules at the subsequent level. AND blocks are used to synchronize the parallel modules in time (e.g., if the Done signals d_1, d_2, d_3, d_4, and d_5 are all 1, the Start signal s_1 will be 1; otherwise, it will be 0). The modular multiplier and the squarer require n/2 + 1 CCs to perform modular multiplication and squaring. Modular addition and subtraction are completed in only one CC. A level that contains any multiplication or squaring operation requires n/2 + 1 CCs, and a level that contains no multiplication or squaring requires one CC to move to the next level. In this design, a total of 2n + 5 CCs are required to complete the unified PA operation.
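The operation count and latency quoted above can be cross-checked with a small Python model. The grouping of operations into levels below is a hypothetical schedule chosen only so that it reproduces 12M + 1S + 3 additions + 1 subtraction and 2n + 5 CCs; the actual assignment in Figure 4 may differ. The formula evaluated is the standard projective twisted Edwards addition for a = −1, and all names are illustrative.

```python
P = 2**255 - 19
D = (-121665 * pow(121666, P - 2, P)) % P     # d = -121665/121666 mod p

def staged_unified_add(p1, p2, n=256):
    """Evaluate unified PA level by level; return (point, operation counts, cycles)."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    ops = {"M": 0, "S": 0, "A": 0, "Sub": 0}

    def mul(u, v):
        ops["M"] += 1
        return u * v % P

    def sqr(u):
        ops["S"] += 1
        return u * u % P

    def add(u, v):
        ops["A"] += 1
        return (u + v) % P

    def sub(u, v):
        ops["Sub"] += 1
        return (u - v) % P

    # Level 1: five parallel multiplications
    a, c, dd = mul(z1, z2), mul(x1, x2), mul(y1, y2)
    u, v = mul(x1, y2), mul(y1, x2)
    # Level 2: one squaring, one multiplication, two one-cycle additions
    b, cd = sqr(a), mul(c, dd)
    h, s = add(u, v), add(dd, c)
    # Level 3: three parallel multiplications
    e, ah, as_ = mul(D, cd), mul(a, h), mul(a, s)
    # Single-cycle add/subtract step
    f, g = sub(b, e), add(b, e)
    # Level 4: three parallel multiplications produce the result
    x3, y3, z3 = mul(ah, f), mul(as_, g), mul(f, g)

    cycles = 4 * (n // 2 + 1) + 1             # four multiplier levels plus one add step
    return (x3, y3, z3), ops, cycles
```

With n = 256 the model reports 517 cycles, i.e., 2n + 5, and the counters come out as 12 multiplications, 1 squaring, 3 additions, and 1 subtraction.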

Elliptic Curve Point Multiplication
ECPM is the principal operation of an ECC processor. It multiplies a point on an elliptic curve by a scalar. The execution time of ECC schemes is dominated by ECPM. Let $P(X, Y, Z)$ be a point on the curve $T_d$ and let $k$ be a scalar that serves as the secret key. A public key $Q(X, Y, Z)$ is generated from the known base point $P$ and the secret key $k$ by performing ECPM as follows:

$$Q(X, Y, Z) = k \cdot P(X, Y, Z)$$

where $Q$ is also a point on the curve. It can be obtained by adding $P$ to itself $k - 1$ times such as

$$Q = \underbrace{P + P + \cdots + P}_{k \text{ terms}}$$

If $k$ is expressible as a power of 2, $Q$ can be obtained by doubling $P$ repeatedly $\log_2 k$ times such as $Q = \ldots 2(2(2(P)))$.
In the binary, or double-and-add (DAA), method, ECPM is performed by a combination of PD and PA following the binary bit pattern of the secret key, as shown in Algorithm 3. In this algorithm, separate modules are required to perform PA and PD. The power consumption of the two separate modules differs. By monitoring these two power levels through SPA, the bit pattern of k can be retrieved as shown in Figure 5. Moreover, k can be inferred by timing analysis; hence, ECPM based on this algorithm is vulnerable to SPA attacks. To cope with SPA, Algorithm 3 is modified to Algorithm 4, where PD is replaced by unified PA. According to this algorithm, power is only consumed for PA with a fixed power consumption, which is independent of the bit pattern of k as shown in Figure 6. Since the power consumption is the same across all the iterations, this algorithm is resistant to SPA. Figure 7 illustrates the proposed hardware architecture for ECPM based on Algorithm 4. Two point-addition blocks, PA1 and PA2, and three multiplexers, MUX1, MUX2, and MUX3, are used in this processor. Initially, Q_1 is set to P. PA1 adds the point Q_1 to itself and the output Q_2 goes to the input of PA2. Identical inputs are fed to PA1 to perform PD by means of PA. One of the two inputs of PA2 is the output of PA1 and the other one is P or 0. If k_i = 1, PA2 adds the point P to the point Q_2 and the output Q_3 goes to the input of PA1 via the register Rg. Conversely, if k_i = 0, PA2 remains idle and the output of PA1 goes directly to its input via Rg. MUX1 selects the ith bit of k using log_2 l select lines, where l is the bit length of k. Based on k_i, MUX2 selects P or 0 as one of the two inputs of PA2; MUX3 selects Q_2 or Q_3 as the input Q_1 for the subsequent iteration.

Algorithm 3 DAA ECPM without Unified PA [2]
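Input: $P(X, Y, Z)$, $k = \sum_{i=0}^{l-1} k_i 2^i$; $k_i \in \{0, 1\}$, $k_{l-1} = 1$
Output: $Q(X, Y, Z)$
1: Q ← P;
2: for i = l − 2 downto 0 do
3: Q ← 2Q; \\ PD
4: if k_i = 1 then
5: Q ← Q + P; \\ PA
6: end if;
7: end for;
8: return Q;

Algorithm 4 DAA ECPM with Unified PA
Input: $P(X, Y, Z)$, $k = \sum_{i=0}^{l-1} k_i 2^i$; $k_i \in \{0, 1\}$, $k_{l-1} = 1$
Output: $Q(X, Y, Z)$
1: Q ← P;
2: for i = l − 2 downto 0 do
3: Q ← Q + Q; \\ unified PA in place of PD
4: if k_i = 1 then
5: Q ← Q + P; \\ unified PA
6: end if;
7: end for;
8: return Q;

For the l-bit k, the register stores kP as the final result after l − 1 iterations. The average number of CCs required to perform the ECPM can be calculated as

$$CC_{ECPM} = (l - 1)(2n + 6) + \frac{l}{2}(2n + 5)$$

For l = n, PA1 and Rg remain active in every iteration, whereas PA2 goes idle in the case of k_i = 0. In every iteration, a total of 2n + 6 CCs are spent by PA1 and Rg. An additional 2n + 5 CCs are spent by PA2 if k_i = 1. On average, l(n + 2.5) CCs are spent by PA2 across the ECPM. For the n-bit k, the latency of the ECPM is approximately $3n^2 + 6.5n - 6$ CCs. This latency may vary depending on the bit pattern of the key; it increases with the number of 1s and decreases with the number of 0s present in the bit pattern. In this study, an average case is considered, meaning that the key is assumed to have an equal number of 1s and 0s in its bit pattern, although this is not always the case.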
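The double-and-add flow with a single addition routine can be sketched in Python as follows. This is an illustrative model only: `unified_add` is the standard projective addition for a = −1, `base_point` is a hypothetical helper that recovers a curve point with y = 4/5, and the two counters merely stand in for activations of the PA1 and PA2 blocks. `average_ecpm_cycles` encodes the average-case cycle estimate discussed above.

```python
P = 2**255 - 19
D = (-121665 * pow(121666, P - 2, P)) % P     # d = -121665/121666 mod p

def unified_add(p1, p2):
    """Projective addition on -x^2 + y^2 = 1 + d x^2 y^2; also used for doubling."""
    x1, y1, z1 = p1
    x2, y2, z2 = p2
    a = z1 * z2 % P
    b = a * a % P
    e = D * x1 * x2 * y1 * y2 % P
    f, g = (b - e) % P, (b + e) % P
    return (a * (x1 * y2 + y1 * x2) * f % P,
            a * (y1 * y2 + x1 * x2) * g % P,
            f * g % P)

def to_affine(pt):
    x, y, z = pt
    z_inv = pow(z, P - 2, P)
    return x * z_inv % P, y * z_inv % P

def base_point():
    """Hypothetical helper: a curve point with y = 4/5, x recovered by square root."""
    y = 4 * pow(5, P - 2, P) % P
    x2 = (y * y - 1) * pow(D * y * y + 1, P - 2, P) % P
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * pow(2, (P - 1) // 4, P) % P
    return x, y, 1

def ecpm_unified(k, base):
    """Left-to-right double-and-add where doubling is done by unified_add(q, q)."""
    assert k >= 1
    bits = bin(k)[2:]                 # most significant bit first, k_{l-1} = 1
    q = base
    pa1_calls = pa2_calls = 0
    for bit in bits[1:]:              # l - 1 iterations
        q = unified_add(q, q)         # role of PA1: doubling via unified PA
        pa1_calls += 1
        if bit == "1":
            q = unified_add(q, base)  # role of PA2: add the base point
            pa2_calls += 1
    return q, pa1_calls, pa2_calls

def average_ecpm_cycles(n, l=None):
    """Average-case cycles: (l - 1)(2n + 6) for PA1 and Rg plus (l/2)(2n + 5) for PA2."""
    l = n if l is None else l
    return (l - 1) * (2 * n + 6) + l * (2 * n + 5) // 2
```

For example, `ecpm_unified(11, base_point())` uses three doublings and two additions, and `average_ecpm_cycles(256)` evaluates the estimate for a 256-bit key with an equal number of 1s and 0s.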

Proposed ECC Processor
A time-area-efficient ECC processor is designed for public-key generation using the proposed projective coordinate-based ECPM along with a P2A converter as shown in Figure 8. This processor generates a public key from a private key and a base point on $T_d$. Initially, the affine base point P(x, y) is transformed into its projective form P(X, Y, Z) by an affine-to-projective (A2P) converter. The public key Q(X, Y, Z) is obtained by performing ECPM of the projective point P(X, Y, Z) with the secret key k. Finally, Q(X, Y, Z) is transformed into its affine form Q(x, y) by the P2A converter. For the P2A conversion, Z is inverted by the proposed modular inversion module and the result is multiplied separately by X and Y. The latency required by the processor to process the ECPM operation along with the coordinate conversions is $3n^2 + 8.25n - 5$ CCs, which is the sum of the latencies of ECPM ($3n^2 + 6.5n - 6$), modular inversion ($1.25n$), and modular multiplication ($n/2 + 1$).
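The coordinate conversions and the latency sum can be illustrated with a short Python sketch. The helper names `a2p`, `p2a`, and `processor_latency` are hypothetical, and Fermat exponentiation via `pow` is used here only as a software stand-in for the binary inversion module. The two multiplications by the inverse are assumed to run in parallel, so they contribute a single multiplication latency.

```python
P = 2**255 - 19

def a2p(x, y):
    """Affine-to-projective conversion: (x, y) -> (X = x, Y = y, Z = 1)."""
    return x % P, y % P, 1

def p2a(X, Y, Z):
    """Projective-to-affine conversion: one inversion of Z, then two multiplications."""
    z_inv = pow(Z, P - 2, P)          # stand-in for the modular inversion module
    return X * z_inv % P, Y * z_inv % P

def processor_latency(n):
    """Total cycles as ECPM + modular inversion + one modular multiplication."""
    ecpm = 3 * n * n + 13 * n // 2 - 6    # 3n^2 + 6.5n - 6
    inversion = 5 * n // 4                # n + n/4 on average
    multiplication = n // 2 + 1           # X*Z^-1 and Y*Z^-1 assumed in parallel
    return ecpm + inversion + multiplication
```

For n = 256, `processor_latency(256)` gives 198,715 cycles, matching the clock-cycle count quoted for the 256-bit point multiplication.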

Implementation Results
The proposed ECC processor was described in VHDL and implemented using the Xilinx ISE 14.7 Design Suite software. The Xilinx ISim simulator was used to simulate the ECC operations. The simulation results were verified with the Maple 18 software. Synthesis, mapping, placement, and routing of the proposed ECC modules were performed separately on Xilinx Virtex-7 (Platform 1) and Virtex-6 (Platform 2) FPGA platforms. The implementation results of the proposed ECC modules are summarized in Table 1. On Platform 1, all the modules run at a maximum frequency of 104.39 MHz. The proposed ECC processor occupies 6543 slices (25,898 LUTs) and generates a public key from a given 256-bit private key in 1.9 ms with a throughput of 134.5 kbps. On Platform 2, the modules operate at a maximum frequency of 93.23 MHz. The numbers of slices and LUTs used by the processor are 6579 and 25,968, respectively, the delay of the public-key generation is 2.13 ms, and the throughput is 120.1 kbps. The performance of the ECC modules on the Virtex-6 FPGA platform is slightly worse than on the Virtex-7 FPGA platform in terms of speed. However, the area use of the different modules on these platforms is almost the same. It must be noted that no digital signal processing (DSP) slice is used to implement our processor. Although DSP slices increase processing speed, they increase the processor's cost as well.
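The reported delay and throughput follow directly from the cycle count and the clock frequency. The short Python sketch below reproduces that arithmetic; the function names are illustrative, and throughput is taken here as key bits divided by the point-multiplication time, which is an assumption about how the figures were derived.

```python
def ecpm_time_ms(cycles: int, f_mhz: float) -> float:
    """Execution time in milliseconds for a given cycle count and clock frequency."""
    return cycles / (f_mhz * 1e6) * 1e3

def throughput_kbps(key_bits: int, time_ms: float) -> float:
    """Throughput in kbit/s: bits per millisecond equals kilobits per second."""
    return key_bits / time_ms

CYCLES = 198_715                      # 256-bit point multiplication, from the text

t_v7 = ecpm_time_ms(CYCLES, 104.39)   # Platform 1 (Virtex-7)
t_v6 = ecpm_time_ms(CYCLES, 93.23)    # Platform 2 (Virtex-6)

print(round(t_v7, 2), round(throughput_kbps(256, t_v7), 1))
print(round(t_v6, 2), round(throughput_kbps(256, t_v6), 1))
```

Rounded to the precision used in the text, the two lines correspond to 1.9 ms at 134.5 kbps and 2.13 ms at 120.1 kbps.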

Performance Comparison
Several hardware implementations of ECC have been reported in [34][35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53], where some authors aimed to minimize the area use while others tried to reduce the computation time. Achieving a higher processing speed with low area use is technically challenging. We tried to maintain a balance between area and time, as they are two important performance criteria of a cryptographic processor. A performance comparison of our proposed ECC processor with other related designs is presented in Table 2. The residue number system (RNS)-based ECC design reported in [34] provides a higher throughput (1816.2 kbps) by performing ECPM on 21 keys in parallel. The conventional DAA method is adopted for ECPM, where PA and PD are executed by separate modules, carrying a high risk of SPA attacks. On a Virtex-7 FPGA, the design consumes 96,867 LUTs (approx. 24,216 slices) with 2799 additional DSP slices. Although the throughput of this design is higher than that of our design, it costs 3.7 times more hardware resources. The novelty of this design is that it processes 21 keys simultaneously, which prevents template-based attacks by increasing the computational complexity. In [35], the authors propose a high-performance ECC processor with its ASIC and FPGA implementations. A novel hardware architecture for a combined PA-PD operation in Jacobian coordinates is proposed to achieve high-speed ECPM with low hardware use. On a Kintex-7 FPGA, the processor, designed separately in affine and Jacobian coordinates, performs ECPM in 4.7 ms and 3.27 ms, occupying 9.3k and 11.3k slices, respectively. Compared with the version of this processor designed in Jacobian coordinates, our processor implemented on a 7-series FPGA is 1.72 times faster and uses 1.73 times fewer slices. The throughput of our design is 1.76 times higher. A high-speed ECC processor is proposed in [36], providing redundant signed digit (RSD)-based carry-free modular arithmetic. The processor performs high-speed ECPM with a higher throughput. However, it occupies 10 times more slices on a Virtex-6 FPGA than our processor. Although the RSD representation offers fast computation, it consumes a vast amount of hardware resources, which makes the processor bulky and hence not suited for low-power IoT devices. The high-speed RSD-based modular multiplier proposed in [36] performs a single multiplication in only 0.34 µs, consuming 22k LUTs. In comparison, our proposed modular multiplier performs a single multiplication in 1.45 µs and consumes only 1.3k LUTs, which is almost 4 times more efficient in terms of area-time (AT) product. The RSD-based ECC processors reported in [37,38] use a comprehensive pipelining technique for Karatsuba-Ofman multiplication to achieve high throughput. Our processor has a smaller AT product than these processors.
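The "almost 4 times" figure for the multipliers can be checked with a few lines of Python. The sketch simply multiplies the quoted time and LUT counts (taking 22k and 1.3k as 22,000 and 1,300, an assumption about rounding); the function name `at_product` is illustrative.

```python
def at_product(time_us: float, luts: int) -> float:
    """Area-time product, using LUT count as the area measure."""
    return time_us * luts

rsd_multiplier = at_product(0.34, 22_000)    # RSD-based multiplier figures from the text
our_multiplier = at_product(1.45, 1_300)     # proposed radix-4 multiplier figures

# Ratio above 1 means the proposed multiplier has the smaller AT product.
at_ratio = rsd_multiplier / our_multiplier
print(round(at_ratio, 2))
```

The ratio comes out slightly below 4, consistent with the comparison stated above.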
Liu et al. [39] propose a hardware-software approach for a flexible dual-field ECC processor with its ASIC and FPGA implementations. The traditional DAA method for ECPM is replaced by the double-and-add-always (DAAA) method to protect the processor from SPA attacks. Although the DAAA method for ECPM provides high resistance against SPA, it increases the computational complexity and hence reduces the frequency and throughput. In addition, it consumes more power than the conventional DAA method, as PA and PD are performed in every iteration. Our processor is protected against SPA attacks by implementing the cost-effective DAA algorithm with unified PA. Compared with our processor, the main advantage of this processor is that it is flexible and reconfigurable over different field orders. In addition, it can perform ECPM over both GF(2^n) and GF(p), whereas our processor performs ECPM over GF(p) only.
Hu et al. [40] propose an SPA-resistant ECC design over GF(p), providing its ASIC and FPGA implementations. The design uses 9370 slices with 14 additional DSP slices on a Virtex-4 FPGA. Despite employing additional DSP slices, the speed of this design is considerably low. It takes 29.84 ms at a frequency of 20.44 MHz to perform a single ECPM over a 256-bit prime field. The advantage of this design, which makes it well suited for embedded applications, is its reconfigurable computing capability. A low-latency ECPM design is proposed in [41], exploiting parallel multiplication over GF(p). Protection against timing and SPA attacks is provided by using the DAAA method for ECPM. The latency of this design is $3n^2 + 37n + 4n$ CCs, whereas the latency of our design is $3n^2 + 8.25n - 5$ CCs. Therefore, the computational complexity of ECPM in this design is higher than that in our design. The radix-4 parallel interleaved modular multiplier proposed in [41] performs multiplication in 0.79 µs, consuming 6.3k LUTs. Four multiplier units are operated in parallel to speed up the multiplication process. The main feature of this design is its capability to perform ECPM over GF(p) with any arbitrary value of p less than or equal to 256 bits in size.
The design reported in [42] exploits the Montgomery ladder algorithm for SPA-resistant ECPM. Although the Montgomery ladder algorithm offers lower-latency ECPM and higher resistance against SPA than the general DAA method [23], it involves around 50% additional PA operations, which results in a higher power consumption. Hence, the DAA method is more efficient than the Montgomery ladder technique in terms of energy consumption. The advantage of this design is that it supports any prime number p of up to 256 bits. In [43], the authors present a high-performance hardware design for ECPM adopting the non-adjacent form (NAF) method. Although the NAF method has the advantage of reducing the latency of ECPM, its computational complexity and its vulnerability to SCAs are high. Moreover, an additional point subtraction operation is required for NAF scalar multiplication. Like the designs reported in [40,41], this design is programmable for any prime p of up to 256 bits. A parallel crypto design is proposed in [44] using the DAAA method to perform SCA-resistant ECPM over different field orders. The design is represented in affine coordinates, where PA and PD require modular division operations. Modular division is the most time-consuming arithmetic operation in finite fields. Therefore, this design is not convenient for high-speed computation. However, it provides high resistance against timing and SPA attacks through parallel computation of PA and PD.
Ananyi et al. [45] propose a flexible hardware ECC processor that supports five National Institute of Standards and Technology (NIST) recommended prime curves. They provide a comparison between the binary and NAF ECPM over all five NIST prime fields, namely p_192, p_224, p_256, p_384, and p_521, where the NAF ECPM is found to be more time-efficient. Their processor consumes 20,793 slices (31,946 LUTs) with 32 additional DSP blocks on a Virtex-4 FPGA and performs the binary ECPM in 6.9 ms and the NAF ECPM in 6.1 ms over p_256. The modular inverter designed in [45] operates at a frequency of 58.6 MHz, costing 10,921 slices with 32 DSP blocks, whereas our modular inverter implemented on a Virtex-7 FPGA runs at 110.65 MHz, consuming 1197 slices without any DSP block.
A scalable ECC processor developed by Loi et al. [46] can perform ECPM on the five NIST-suggested prime curves P-192, P-224, P-256, P-384, and P-521 without hardware reconfiguration. On a Virtex-4 FPGA, this processor performs ECPM in 5.46 ms, occupying 7020 slices along with 8 additional DSP slices. Despite using DSP slices, the computational speeds of the processors reported in [45,46] are low. The main significance of these processors is that they are flexible over the five NIST prime fields and hence can be programmed to perform ECPM for prime numbers ranging from 192 to 521 bits in size without being architecturally reconfigured. The processors reported in [47][48][49][50][51][52][53] are implemented on older FPGA platforms that are now obsolete.
Performance comparison in terms of AT product is shown in Figure 9. The AT product of our design is lower than that of the other designs tabulated in Table 2. Figure 10 shows performance comparison in terms of throughput per slice. The per slice throughput of our design is higher than that of the other designs except [34]. The RNS-based design reported in [34] provides a higher throughput by performing ECPM on 21 keys concurrently. Our processor's low value of AT product and high value of throughput ensure a better performance in IoT platforms. However, a fair comparison is not possible because the compared processors are implemented on different FPGA platforms. Our proposed ECC processor is implemented only on the Virtex-7 and Virtex-6 FPGAs because the number of input/output blocks (IOBs) is limited in earlier FPGAs. Furthermore, the earlier FPGAs such as Virtex-II-Pro, Virtex-4, and Virtex-5 are not compatible with low-power devices because of their high power consumption.

Conclusions
In this paper, a high-performance ECC processor has been proposed, exploiting unified PA on the Edwards25519 curve to perform SPA-resistant point multiplication. An efficient ECPM module has been designed in projective coordinates, which supports 256-bit point multiplication over a prime field. Unified PA is adopted for the ECPM module to provide strong protection against SPA attacks and to save the area that an additional PD module would require. To perform high-speed modular multiplication, an efficient radix-4 interleaved modular multiplier has been proposed. The proposed ECC processor performs fast point multiplication with a considerably lower area use, providing high resistance against SPA. Because of its low hardware resource requirements and high computation speed, it is well suited for resource-constrained IoT devices. Since it provides faster ECPM, which is increasingly demanded by elliptic curve-based digital signature schemes, it could be used in Bitcoin-like cryptocurrencies for high-speed digital signature generation and verification, which would reduce latency in transaction confirmation. Based on the overall performance analyses, it can be concluded that the proposed ECC processor could be a good choice for IoT security as well as for the emerging blockchain technology.