cryptography

: In cryptography, elliptic curve cryptography (ECC) is considered an efﬁcient and secure method to implement digital signature algorithms (DSAs). ECC plays an essential role in many security applications, such as transport layer security (TLS), internet protocol security (IPsec), and wireless sensor networks (WSNs). The proposed designs of ECC hardware implementation only focus on a single ECC variant and use many resources. These proposals cannot be used for resource-constrained applications or for the devices that need to provide multiple levels of security. This work provides a multi-functional elliptic curve digital signature algorithm (ECDSA) and Edwards-curve digital signature algorithm (EdDSA) hardware implementation. The core can run multiple ECDSA/EdDSA algorithms in a single design. The design consumes fewer resources than the other single-functional design, and is not based on digital signal processors (DSP). The experiments show that the proposed core could run up to 112.2 megahertz with Virtex-7 devices while consuming only 10,259 slices in total.


Introduction
The urgent request for information security has led to the evolution of cryptography. Elliptic curve cryptography (ECC) is more and more preferred, as it provides an impressive security level. ECC is a brand of public-key cryptography based on the structure of elliptic curves over finite fields. Compared to other public-key cryptography algorithms, ECC offers a higher level of security. The elliptic curve digital signature algorithm (ECDSA) and Edwards-curve digital signature algorithm (EdDSA) are the two well-known ECC that are used in many security protocols such as transport layer security (TLS) [1] and internet protocol security (IPsec) [2].
ECDSA was published in 1992 by Scoot Vanstone. Until now, ECDSA has been accepted and recommended by many standards, such as the International Organization for Standardization (ISO) [3], Federal Information Processing Standards (FIPS) [4], and American National Standards Institute (ANSI) [5]. EdDSA [6] is a digital signature algorithm that was developed by Bernstein [7]. EdDSA is very effective, thanks to its fast operations while keeping a high level of security. The EdDSA algorithm includes Ed25519 and Ed448; Ed25519 is preferred when compared with Ed448 because it requires fewer resources while still providing perfectly adequate safety [6].
Many publications of EdDSA and ECDSA have been published over the last years. In 2017, Sghaier et al. proposed an ECDSA system over the field GF(2 163 ) [8,9]. Their implementation consumes 18,504 slices and can run up to 107.4 MHz on Virtex-5 devices.
Montgomery ladder algorithm is used for elliptic curve point multiplication (ECPM). In this design, two separate cores are deployed for point adding and point doubling to increase the throughput. Although the authors try to implement the whole system of ECDSA, a clear structure of scalar multiplication or the number of digital signal processors (DSPs) usage are not reported. In 2015, Panjwani et al. suggested a hardware-software co-design approach for ECDSA with the National Institute of Standards and Technology (NIST) P-163 curve [10]. Their improvement version that was published in 2017 can run all NIST recommended field sizes from 192 to 521 bits [11]. Their ECDSA design relies on a general purpose processor (GPP), which is Microblaze in their work, and a hardware accelerator for some functions of ECDSA. Such an approach could take advantage of GPP, which can run at high frequency, and heterogeneous hardware, which provides a high level of parallelism. Their design provides a high performance, but it encounters some drawbacks. Firstly, the GPP cannot perform another task while the ECDSA core is working, and the design suffers too much overhead because many data exchanges between GPP and FPGA are performed through the communication bus. Secondly, the design consumes too many resources, with 24,135 slices and 234 DSP for NIST p-521. Although the design can be reconfigurable, another synthesis is required for a different configuration. The design in [12][13][14] targets a high-performance design with Curve25519. While the authors of [14] proposed a multicore design, the authors in [12,13] optimized the scalar multiplication by using the Karatsuba multiplication algorithm. In 2019, Turan and Verbauwhede provided a compact and flexible FPGA implementation of Ed25519 [15]. Although their work targets embedded and IoT devices, the resource consumption is still high. All the mentioned designs rely on the DSP of the FPGA. Although the DSP-based architectures provide high performance, they can be adjusted in the low-profile hardware platforms, where the number of DSPs is limited or not supported. On the contrary, the non-DSP architectures could be deployed on all devices. In [16], Islam et al. reports a high-performance design of ECPM for Curve25519. The core consumes 8900 slices on Virtex-7 without using any DSP. Asif et al. in [17] provided a novel method to apply the residue number system (RNS) to implement ECPM. The RNS significantly improves the parallelism of scalar multiplication. Therefore, the throughput is very high. However, the core consumes up to 24,200 slices and 276 block RAMs (BRAMs). These designs cannot be deployed on resource-constrained applications, such as wirelesssensor network (WSN) [18], and software-defined networking (SDN) applications [19]. In [19], the authors reported a platform for SDN, which has a wide range of applications in today's world. In this work, Cheng et al. integrated a variety of ECC algorithms. However, to date, there are not any designs that can combine multiple ECC algorithms in a single optimized architecture.
In this work, we target an ECC processor that integrates multiple ECC algorithms in a single low-cost, area-efficient architecture. The proposed core could perform multiple algorithms without re-synthesizing. The selected algorithms are Ed25519, ECDSA with NIST P-256, NIST P-384, and the NIST P-521 curve. The experimental results show that the design consumes significantly low resources and provides a working frequency comparable with other proposals. In the experiments, we deploy two different configurations of the core. The first one is integrated with two lightweight algorithms, which are Ed25519 and ECDSA with NIST P-256. The second one can operate Ed25519, ECDSA with NIST P-256, NIST P-384, and NIST P-521. The resource usage for the later configuration is 10,259 slices on Virtex-7 devices, and the working frequency achieves 112.2 megahertz (MHz), which can be compared with other EdDSA or ECDSA single-functional architectures.
The rest of the paper is organized as follows. Section 2 introduces the ECDSA and EdDSA algorithms and reveals the essential similarities between them. Section 3 explains the optimized architecture of the multi-functional design. Section 4 gives the experimental results under various aspects. Finally, Section 5 concludes the paper.

Digital Signature Algorithm
Digital signatures are effective in providing the authentication, integrity, and nonrepudiation of messages. The digital signature algorithm (DSA) was adopted as a digital signature standard (DSS) by NIST from 1994. The DSA algorithms include three main stages. The first one is the key-generation stage, where the sender generates a public key from a random number. The generated key is then sent with the message and used to verify the signature. The second one is the signature-generation stage, where the sender generates the signature for the message based on a secret key and the hash of the message. The message is then digitally signed with this signature. The last one is the verification stage. In this stage, the receiver verifies the validity of the message by using the public key and the signature. The message is considered valid if the recovered signature matches the received signature. Figure 1 illustrates the three stages of DSA.

Elliptic Curve Digital Signature Algorithm-ECDSA
The high level of security of ECC relies on the difficulty level of the elliptic curve discrete logarithm problem (ECDLP). ECDSA depends on the elliptic curve cryptosystem. Like the other DSA, ECDSA has three stages. In the key-generation stage, a random number is multiplied by the base point of the selected curve to compute the public key. In the signature-generation stage, based on a secret key, a point multiplication and a hash function, additions are performed to find the signature. In the verification stage, the receiver recovers the signature from the received message and the public key, then compares it with the received signature. If the two signatures are identical, the received message is valid. Algorithm 1 explains ECDSA in detail.

Key-generation
State message m is invalid 8: else 9: Assign v = x 1 10: end if 11: if v = r then 12: State message m is valid 13: else 14: State message m is invalid 15: end if

Edwards-Curve Digital Signature Algorithm-EdDSA
The EdDSA is based on the Edwards curves developed by Bernstein et al. The security of Ed25519 and Curve25519 are based on ECDLP. Ed25519 has three stages. In the keygeneration stage, a public key is generated from a secret key and the based point of Curve25519. In the signature-generation stage, a pair of signatures is produced from the private key and the hash of the input message. The signature is then attached with the message and sent to the receiver. Finally, the received signature and the public key by the receiver are used to verify the message in the verification stage. Algorithm 2 reviews EdDSA in detail.

ECDSA/EdDSA Hierarchy
The hierarchy of the ECDSA/EdDSA combinational design can be illustrated as Figure 2. In this illustration, the operations in the higher position are performed based on the operations in the lower position. Figure 2 also reveals that there are three different layers of abstraction. The first layer is the DSA algorithm, where the processes of keygeneration, signature-generation, and verification of a specific DSA are specified. The control flow for a DSA, such as ECDSA or EdDSA, is implemented in this layer. The second layer is the algorithms for the elliptic curve. Depending on the selected curve, the additions and subtractions between two points in this curve are defined. The point multiplication can be used for different curves. The third layer describes the methods of modular operations, such as modular multiplication, inversion, addition, and subtraction. An additional modular reduction may be required for this layer. Finally, the base of all layers is binary addition, subtraction, and comparison. Among these layers, the implementation for point multiplication, modular multiplication, modular inversion, modular addition, modular subtraction, and binary operation can be reused. In Figure 2, the gray boxes illustrate the functions that could be shared among different elliptic curves.  The base of ECC algorithms is modular operations. Figure 3 shows the combinational architecture of modular addition/subtraction. This module could perform both addition and subtraction with the modularity of a prime number p. The architecture is split into two branches. The left branch performs the operation without p, and the right branch performs the operation with p. When the sub signal is activated, the multiplexer at the left-side selects −b, then < a − b > is performed, while the multiplexer at the right-side selects p and then the addition circuit performs < a − b + p >. When the sub signal is inactivated, < a + b > is calculated on the left circuit, and < a + b − p > is calculated on the right circuit. The final value is selected depending on the carry out of the right branch. In this architecture, the carry save adders (CSAs) are used to reduce the logic delay of the three-operands additions.   Figure 4 illustrates the architecture CSAs that are used in Figure 3. A three-operands CSA includes two layers. In the first layer, the corresponding bits from each operand are simultaneously added together using full adders (FAs). The results of the first layer is the sum bits [s 0 ...s n ] and the carry bits [c 0 ...c n ]. In the second layer, a ripple-carry addition is performed between the generated sum and carry from the first layer. The result of this layer is the final result of the three-operands addition. The delay of CSA is the total delay of two layers. The delay of the first layer is equal to the delay of a 1-bit FA. The delay of the second layer is the delay of the two-operands adder. Figure 5 illustrates the architecture of a conventional three-operands adder. Two operands are added in the first layer, and then the result is added with the third operands in the second layer. Figures 4 and 5 reveal that the conventional three-operand adder suffers more ripple carry delay from two consecutive adders while using the same resource as CSA.

Processing Element
The processing element (PE) includes an addition/subtraction module and a controller. The controller drives the addition/subtraction module to perform modular additions, subtractions, or multiplications. In our design, the addition/subtraction module naturally performs the modular reduction in each operation; there is no need for an additional modular reduction module. Each PE contains three registers. In the case of addition and subtraction, two registers are used to store the operands (registers A and B), and the sum/product (SP) register stores the sum or difference. In the case of multiplication, the multiplier and multiplicand are stored in registers A and B; the product is put into register SP. During the multiplication, the multiplier is shifted for each cycle. Depending on the width of the operands, the multiplication could be performed in multiple cycles. The final result is valid when the finish signal is activated. Figure 3 illustrates the architecture of the processing element. The register p is used to store the prime number for the modular operations. It is shared among different PEs in the system.

Modular Inversion
In our design, the modular inversion is calculated by using the binary Euclidean algorithm (BEA) [20]. BEA is the traditional method that can calculate the inversion of a number over the prime field. It is independent of the usage of the ECC algorithm. Therefore, it is the most suitable for a multi-functional design. Algorithm 3 reveals the binary Euclidean algorithm. Figure 6 shows the optimized hardware implementation of BEA in our proposed architecture. The BEA memory system includes four registers to store the temporal data during calculation. The datapath contains a modular addition/subtraction and a right shift circuit. The controller determines the updated register and the source of updated data. if x mod 2 = 0 then 6: x ← x 1 7: else 8: x ← (x + p) 1 9: end if 10: end while 11: while v mod 2 = 0 do 12: v ← v 1 13: if y mod 2 = 0 then 14: y ← y 1 15: else 16: y ← (y + p) 1 17: end if 18: end while 19: if u ≥ v then 20: u ← (u − v) mod p 21: x ← (x − y) mod p

Elliptic Curve Cryptography Unit
The ECC unit is the top module of the system. It includes a controller, a random access memory (RAM), a read-only memory (ROM), an inversion module (Inv.), and four PEs. Figure 7 illustrates the architecture of the ECC unit. In our design, the point adding and doubling for both ECDSA and EdDSA are performed under projective coordination. The conversion from affine to projective coordination is performed without any additional fee. In comparison with other types of coordination, projective coordination offers optimization for parallelism processing by reducing data dependency. Four PEs and an inversion module are integrated to achieve the highest performance in projective coordination.
Each processing element and inversion module has their register to store the temporal data during their operation. The input data for each PE are selected from the RAM, the initial data from ROM, the calculated data from the other PEs, and the generated data by themselves. The controller controls the operations of the four PEs and the inversion module to produce the appropriate results. The FSM of the ECC unit controller is designed to promote the parallelism of the four PEs. The RAM is used for multiple purposes: to store the data/result and exchange data among the PEs. The result is read from RAM in a specific address. When the operation of the processing elements and the inversion module is completed, the controller raises the finish signal to indicate that the result is valid and it can be read.
The ECPM is the essential part of an elliptic curve cryptography design. We also compare our processing core, with four PEs, with other proposals in Table 2. Table 2 reveals that the processing core occupies nearly 70% of the total resource. In comparison with other ECPM proposals, the resource consumption of our ECPM design is significantly lower, even though the core supports multiple curves. Table 2 also reveals that the DSP-based architectures [13,22] provides high throughput but consumes too many resources for a single curve. Therefore, they cannot be deployed on resource-constrained applications.  78  58  400  371  531  443  8033  5344  10,434   Through-25519  91  117  69  108  ---5773  2159   put  P-256  114  146  87  135  157  173  1816  --(Kbps)  P-384  --58  90  -----P-521  --43  66  -----Slices 2  25519  100  74  511  475  ---81  556   /Kbps  P-256  80  59  409  380  545  455  322 -- We also deploy our system on the Xilinx VC707 Evaluation Kit. Figure 8 illustrates the experimental system, including a RISC-V softcore processor, a Secure Hash Algorithm 2 (SHA2) core, and the EdDSA/ECDSA core with Curve25519 and NIST P-256/384/521. The RISC-V processor communicates with the SHA2 core and EdDSA/ECDSA core through a communication bus. The implemented SHA2 core could perform three SHA2 variants, which are SHA2-256, SHA2-384, and SHA2-512 in a single design. The RISC-V processor configures the operating mode for the SHA2 core and EdDSA/ECDSA core, controls the SHA2 core to perform the hash on the input message, and then transfers the hash value to the EdDSA/ECDSA core. When the EdDSA/ECDSA process terminates, the processor reads the results and then compares them with the correct results in [2,6]. A computer connects with the RISC-V processor through UART to monitor the experimental process. Such a design system allows the processor to perform the SHA2 hash process and EdDSA/ECDSA process at the same time. Furthermore, the SHA2 functions and EdDSA/ECDSA functions can be used separately.

Conclusions
In this study, a low-cost, area-efficient ECC processor is developed for multiple ECC algorithms in a single design. Based on the proposed architecture, four well-known algorithms, which are Ed25519, ECDSA with NIST P-256, NIST P-384, and NIST P-521, are deployed. The novel hardware architecture satisfies multiple requirements that were trade-offs from the previous designs. It consumes significantly few resources and does not use any DSP. Therefore, it can be easily deployed on multiple hardware platforms without re-design. Especially, this proposal provides an approach to deploy multiple ECC algorithms in a single architecture without re-synthesis.
In the experiments, we suggest a system that can improve the flexibility of the design. By combining with a separate SHA2 processor, the system not only performs the EdDSA/ECDSA, but also the SHA2 hash functions. The experimental results show that the core consumes 10,259 slices, which is lower than other single-functional ECC designs and can run up to 112.2 MHz on a Virtex-7 device. Another implementation that can perform two lightweight ECC algorithms, which are Ed25519 and ECDSA with NIST-P256, is also deployed. It consumes only 4276 slices and can run up to 122.05 MHz on the Virtex-7 device. Based on the features of the proposed EdDSA/ECDSA core and the experimental results, it can be concluded that our design can serve a wide range of applications and be able to deploy on multiple hardware platforms, especially for the resource-constrained devices that need a high level of security.
For future works, we are working on higher performance for the processing elements. Two targets can be achieved if the processing elements are able to operate with a lower delay but keep the same resource usage. For the first target, our core can provide higher overall performance while ensuring efficiency in the area. For the second target, the resource consumption can be further reduced by removing some processing elements while maintaining the current working performance.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.