Design and Implementation of Low Area / Power Elliptic Curve Digital Signature Hardware Core

The Elliptic Curve Digital Signature Algorithm (ECDSA) is the elliptic curve analog of the Digital Signature Algorithm (DSA). Because elliptic curve cryptography uses smaller keys than other public-key algorithms, ECDSA is particularly suitable for environments where processing power and storage are limited. This paper focuses on the hardware implementation of ECDSA over elliptic curves with the 163-bit key length recommended by the NIST (National Institute of Standards and Technology). It offers two services: signature generation and signature verification. The proposed processor integrates an ECC IP, a Secure Hash Standard 2 IP (SHA-2 IP) and a Random Number Generator IP (RNG IP). All IPs are optimized, and different types of RNG are implemented in order to choose the most appropriate one. A co-simulation using MATLAB was carried out to verify the ECDSA processor. All modules were implemented on a Xilinx Virtex 5 ML 50 FPGA platform; they require 9670 slices, 2530 slices and 18,504 slices, respectively. FPGA implementations generally represent the first step towards faster ASIC implementations. The proposed design was therefore also implemented in a 45-nm CMOS ASIC technology; it requires a cell area of 0.257 mm², achieves a maximum frequency of 532 MHz and consumes 63.444 mW. Furthermore, we analyze the security of the proposed ECDSA processor against attacks that exploit the absence of correctness checks on input points and against restart attacks.


Introduction
Since it was proposed in 1992, the Elliptic Curve Digital Signature Algorithm (ECDSA) has attracted considerable research attention. Owing to its advantages over DSA and the Rivest-Shamir-Adleman (RSA) system, namely its small key length and the speed of its signature operations, ECDSA has been recommended by organizations such as NIST [1] and Certicom [2]. The computations needed for ECDSA authentication are the generation of a key pair (private key, public key), the computation of a signature and the verification of a signature. Firstly, key production is based on random or pseudorandom bit sequences. Thus, the first step is to find appropriate algorithms for producing the bit sequences that form the key. Many algorithms exist, but the quality of the generated binary sequences should be tested, and their randomness should be checked. Hence, the appropriate algorithm should be deterministic and should generate these keys reliably and quickly. Secondly, ECDSA uses ECC scalar multiplication, whose security is based on the hardness of the Elliptic Curve Discrete Logarithm Problem (ECDLP), and it offers a smaller key size at the same security level than other asymmetric cryptosystems. Key size offers a significant gain in terms [...] with 256-bit prime fields was done in [11].
The constant-time implementation of the NIST and SECG (Standards for Efficient Cryptography Group) 256-bit prime curve proposed in [11] accelerates and improves the efficiency of perfect forward secrecy TLS (Transport Layer Security) handshakes, which use the two elliptic curve cryptosystems ECDSA and/or ECDHE (Elliptic Curve Diffie-Hellman Ephemeral). In that paper, the authors are only interested in optimizing the public key cryptosystem ECC (i.e., the hash function and the random number generator are not optimized). They performed three optimizations. First, they implement the scalar multiplication by G (the generator) with the windowing method and Booth encoding, using a window of size seven, and avoid MSQR (Montgomery Square) via pre-computation. Second, they speed up point multiplication (for a general point P and also for G) by writing MM (Montgomery Multiplication) and MSQR assembly routines that are specifically optimized for the MF (Montgomery Friendly modulus) prime $p_{256}$. They again use the windowing method with Booth encoding, but with a smaller window size of five. Finally, they use projective coordinates, and the modular inversion is implemented with the method based on Fermat's little theorem. The software patch makes the entire ECDSA sign function constant time, which makes it resistant to timing attacks. The software implementation is written in assembly language. These optimizations accelerate the whole design.
In [12], a low-resource implementation of a 160-bit ECDSA signature generation algorithm over the prime field curve secp160r1 is presented, together with a novel hardware architecture of the Keccak hashing algorithm. Moreover, the authors applied co-Z ECC formulae (which add two projective points sharing the same Z-coordinate), a pipelined multiplication unit and RAM macros, and they evaluated fixed-base comb methods to improve the efficiency of ECDSA on passive tags. Furthermore, their design runs with constant runtime and provides basic resistance against common implementation attacks. In hardware, it requires a total area of 12,448 GEs (or 63,700 µm² in 130 nm) and can generate a message digest within 140 k cycles. It has a power consumption of 42.42 µW at 1 MHz on a low-leakage 130-nm CMOS process technology. However, their design is slower and occupies more area than ours. In the same year, the authors of [13] proposed the design of an elliptic curve cryptography processor as an application for RFID tag chips. The works cited above are vulnerable to attacks. Some techniques can be applied to test and study the design architecture: in [14], the authors studied the hot carrier injection stress effect on a 65-nm low-noise transistor. Furthermore, in [15], the authors examined the effect of temperature and process variability on a p-channel MOSFET voltage multiplier and determined their impact on the voltage multiplier; this method can be applied to the global ECDSA design.
In 2017, Wuqiong Pan et al. [16] developed a high-performance signature server called Guess. It implements ECDSA with a 256-bit key size on a Linux-powered commodity computer, harnessing a desktop Graphics Processing Unit (GPU) as a cryptographic accelerator. Guess is designed to be resistant to timing attacks. Their contribution is a novel, systematic and inclusive implementation of ECDSA, turning cryptographic theory into productivity on off-the-shelf processors. They optimize point multiplication (employing a Pre-Computing Table (PCT) built offline), point addition (using mixed Jacobian-affine addition) and point doubling (in Jacobian coordinates) to accelerate signature generation and signature verification. Besides maximizing computing power, various algorithms are customized and optimized for the platform. Guess readily supports various categories of ECC schemes such as digital signature, key agreement and encryption. It achieves a throughput of 8.71 × 10^6 Operations Per Second (OPS) for signature generation and 9.29 × 10^5 OPS for verification. We note that the authors only optimize the ECC IP; their implementation achieves a high throughput but consumes a large amount of energy. As we can see, every proposed design should be evaluated, and its security should be studied. Moreover, these cryptosystems may be implemented in small devices such as mobile phones, smart cards and RFID tags.
Bi et al. [17] sought to protect Radio-Frequency (RF) designs by applying a split manufacturing method to RF circuit protection and by developing and demonstrating a quantitative security evaluation method that measures the protection level of RF designs under split manufacturing. An important field of research is how to protect an Integrated Circuit (IC) from different attacks and how to find efficient countermeasures against IC piracy. In this context, Alasad et al. [18] proposed inserting Multiplexers (MUXs) in two configurations: firstly, by randomly inserting a number of MUXs equal to half the number of output bits (half MUX insertion), and secondly, by inserting a number of MUXs equal to the number of output bits (full MUX insertion). They adopted the Hamming distance as a security metric, and they measured the delay, power and area overheads of the proposed technique.

ECDSA Processor Methodology and Flow Design
The design flow and the needs analysis are the most important tasks in obtaining a system that meets the requirements specification. To arrive at such a design, the following steps have to be followed:

•
Select the most suitable algorithms for designing the signature IP blocks with respect to the requirements of large-scale applications: a high security level, a minimum area with maximum throughput and low power consumption.

•
Propose RTL (Register Transfer Level) architectural optimizations with the aim of adapting and scaling the signature processor to both the application needs and the platform specifications.

•
Propose a hardware verification approach for the signature processor so as to validate and verify the RTL implementations. In order to accelerate the verification of the overall architecture, verification and validation have to apply different co-simulation methods to the IPs used. This can be done thanks to the interfaces between the higher-level system environments and the high-performance HDL simulators.
Figure 1 summarizes the design flow of the digital signature processor proposed in this paper. Using this design flow to develop a secure digital signature processor, the following strategic points have to be fixed. Focusing on the application's requirements, the different elements needed in the system should be identified and placed into the system requirements analysis process. The main purpose of the needs analysis is to satisfy the application. With respect to the application's requirements and the chosen platform constraints, the appropriate methods are fixed to design the different IPs needed:
1. The most used hash algorithms are the Message Digest algorithm 5 (MD5) and the Secure Hash Algorithms (SHA-0, SHA-1 and SHA-2). MD5, SHA-0 and SHA-1 have the same maximum message size, word size and block size. They differ in digest size (128 bits for MD5 versus 160 bits for SHA-0 and SHA-1) and in the number of steps (64 for MD5 versus 80 for SHA-0 and SHA-1). Collision attacks have been reported against MD5, SHA-0 and SHA-1, with a complexity of about 2^52 operations [19], which makes them unreliable and unsafe, and therefore unsuited to current cryptographic needs. Since its appearance, the SHA-2 family has been the most used hash function thanks to its higher security against attacks, due to its larger digest size, and its speed.
[...] [20], and the cellular automata, which are an important family of stream cipher generators [21]. The appropriate algorithm in our case is grain-128, because it supports a 128-bit key and a 96-bit initialization vector. This cipher is very small and easy to implement in hardware [22]. In addition, its hardware design can easily be sped up. The grain-128 uses an LFSR to ensure good statistical properties; an NLFSR is used with a non-linear filter to introduce non-linearity. The non-linear filter takes its inputs from the two shift registers.
In order to reduce the final cost, area and power consumption of the system, the grain-128, the ECC and the SHA-2 hash function studied earlier are adopted to design the proposed ECDSA processor. The state-of-the-art works will be the subject of the next section.

Proposed Hardware Architecture for ECDSA IPs
In this section, the architectural design of each IP is discussed in order to implement a low-power and low-area ECDSA design. Different optimizations were made to decrease area occupancy and power consumption. Then, the optimized IPs are assembled into an efficient ECDSA design. In order to decide how efficient the design is, we calculate the throughput and the efficiency of each IP.
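The two figures of merit can be written compactly. The formulation below is a commonly used one and is given here as an assumption, not necessarily as the exact definition used in this work; the symbols $N_{bits}$, $f_{max}$, $N_{cycles}$ and $A_{slices}$ are introduced for illustration only.

```latex
\begin{align}
\text{Throughput} &= \frac{N_{bits} \times f_{max}}{N_{cycles}} \quad [\text{bit/s}] \\
\text{Efficiency} &= \frac{\text{Throughput}}{A_{slices}} \quad [\text{bit/s per slice}]
\end{align}
```

Under this assumption, $N_{bits}$ is the number of bits processed per operation, $f_{max}$ the maximum clock frequency, $N_{cycles}$ the number of clock cycles per operation and $A_{slices}$ the occupied area in slices. An optimization that raises throughput and area by the same factor therefore leaves the efficiency unchanged.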

Secure Hash Standard 2 IP
In order to ensure data integrity, different cryptographic hash functions can be used. The SHA-2 family offers higher security and is collision resistant. Further, SHA-256 is the most used member because it is considered cryptographically safe. In this section, the hardware design and implementation of the SHA-2 IP are presented. Then, the hardware performance is evaluated and compared.

SHA-2 Architecture Design
The proposed hardware architecture of the SHA-2 IP is described in Figure 2. As shown in Figure 2, our hash processor is based on four modules, which form the overall architecture. They are:

•
The control unit: this is designed to control the data flow in the design, as well as the data transfer between the digest calculation unit (hash computation unit) and the pre-processing unit (padding unit). An FSM is used for this purpose. The control unit coordinates all system operations. It defines the necessary constants and the length of the operation word. It manages the ROM blocks and controls all algebraic and digital logic functions necessary to calculate the digest.

•
The pre-processing unit (padding unit): its role is to pad the message in order to make it compatible with the hash protocol used.

•
The calculation unit (hash computation unit): this is the digest calculation block. It performs the data transformation functions.

•
The input/output interface unit: this is designed to let the processor communicate with the external environment.
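The roles of the padding unit and of the hash computation unit can be mirrored in software. The following Python sketch is a plain SHA-256 reference model written for illustration under the standard FIPS 180-4 definitions; it is not the hardware architecture of Figure 2, and the function names (`pad`, `compress`, `sha256`) are hypothetical. The round constants are derived from the first 64 primes instead of being typed in as a table.

```python
import math

MASK = 0xFFFFFFFF  # all arithmetic is on 32-bit words


def _primes(count):
    """Return the first `count` primes by trial division."""
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found


def _icbrt(n):
    """Integer cube root: largest r with r**3 <= n (binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


PRIMES = _primes(64)
# Fractional bits of the square roots (initial state) and cube roots (K_t).
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]
K = [_icbrt(p << 96) & MASK for p in PRIMES]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def pad(message: bytes) -> bytes:
    """Pre-processing unit: append 0x80, zeros, then the 64-bit bit length."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + (len(message) * 8).to_bytes(8, "big")


def compress(state, block):
    """Hash computation unit: process one 512-bit block (64 rounds)."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256(message: bytes) -> bytes:
    """Control flow: pad, then feed the blocks to the compression function."""
    state = list(H0)
    data = pad(message)
    for i in range(0, len(data), 64):
        state = compress(state, data[i:i + 64])
    return b"".join(word.to_bytes(4, "big") for word in state)
```

Each round needs several modular additions (`t1`, `t2`, and the updates of `a` and `e`), which is why the adder structure dominates both the speed and the area of a hardware implementation.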

SHA-2 Architecture Optimization
Different hardware optimizations were made in order to speed up the SHA-2 IP and decrease its area use:

•
Addition is the most common operation in calculating the message digest. It requires adding 64-bit operands. Therefore, it is important to optimize it. Thus, in order to have a fast and low-area SHA-2 IP, various time-efficient adder architectures have been developed in VHDL [23], such as the Ripple Carry Adder (RCA), the Carry Look-ahead Adder (CLA), the Carry Save Adder (CSaA) and the Carry Select Adder (CSeA). According to the study in [24], implemented on an FPGA Virtex-II Pro xc2vp7-5ff672, the RCA and the CLA are the best in terms of area occupation on the FPGA platform, whereas the carry select adder is the fastest.

•
In each round of SHA-2, some operations can be calculated independently, while others depend on earlier results. In [24], the authors changed the expressions of A, B, C, D, E, F, G and H in order to optimize the SHA-2 design by condensing two cycles ($t_1$, $t_2$) into a single cycle. Thus, by storing the intermediate values of $W_t$ and $K_t$, the following round can immediately calculate an intermediate value from the available inputs. The data processing time is halved, from N to N/2 cycles, and the processing algorithm changes accordingly.

•
By changing the expressions of A, B, C, D, E, F, G and H, the number of adders increases from 7 to 14. To solve this problem, the dependencies between the hash calculations were analyzed and two methods were proposed: the first uses only two adders and the second uses three adders. In the first method, only two adders are used, and both are active at each stage of the calculation.
In the second method, three adders are required in two states. Both methods have the same speed, so the first method was adopted in order to decrease area occupation.
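To illustrate why the adder structure matters when several operands must be summed in one round, the following Python sketch models carry-save addition: three operands are reduced to a sum word and a carry word without carry propagation, and only one final carry-propagate addition is needed. It is a behavioral illustration under the assumption of 32-bit words, not the adder arrangement adopted in the proposed IP; the names `carry_save` and `add_many` are hypothetical.

```python
def carry_save(a, b, c, width=32):
    """Reduce three operands to (partial sum, carries) with no carry chain."""
    mask = (1 << width) - 1
    partial_sum = (a ^ b ^ c) & mask                      # bitwise sum
    carries = (((a & b) | (a & c) | (b & c)) << 1) & mask  # majority, shifted
    return partial_sum, carries


def add_many(operands, width=32):
    """Sum any number of operands modulo 2**width.

    Every operand is absorbed by a carry-save stage; a single
    carry-propagate addition (the `+` below) finishes the computation.
    """
    mask = (1 << width) - 1
    s, c = 0, 0
    for x in operands:
        s, c = carry_save(s, c, x & mask, width)
    return (s + c) & mask
```

For example, `add_many([h, big_s1, ch, k_t, w_t])` gives the same result as adding the five round operands one after the other modulo 2^32, but only the last step has a carry chain.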

Simulation and Synthesis Results of the SHA-2 IP
The SHA-2 IP was implemented on an FPGA Virtex2 xc2v2000. Table 1 presents its performance in terms of frequency, speed, slice occupancy and power consumption. Before optimization, the processor throughput is about 308 Mbit/s. In order to improve the quality of the integrity service offered by the IP, the SHA-256 was optimized. In this case, the processor throughput increases to 584 Mbit/s. The last column in the table gives the algorithmic efficiency of SHA-256, which is the same before and after optimization. This is because the gain in throughput is offset by the additional tests and the increased number of CLBs. Furthermore, the optimized processor was sped up with only a small variation in FPGA occupation. Similarly, the power consumption has increased but remains low.

Elliptic Curves over $F_{2^m}$ Finite Fields
This section briefly sums up the theory of elliptic curves. An elliptic curve E over $F_{2^m}$ is defined by an equation of the form: $y^2 + xy = x^3 + ax^2 + b$ (3), where $a, b \in F_{2^m}$ and $b \neq 0$. $E(F_{2^m})$ is the set of points P = (x, y) that satisfy Equation (3), together with a point at infinity denoted O. The two main operations on elliptic curves over finite fields are point addition and point doubling. Let $P = (x_1, y_1) \neq O$ be the first point and $Q = (x_2, y_2) \neq O$ be the second point such that $Q \neq -P$; their sum is $P + Q = (x_3, y_3)$. The algebraic formulas of P + Q and 2P are given by Algorithm 1.
Algorithm 1: Point Addition and Point Doubling.
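As an illustration, the standard affine formulas for curves of the form $y^2 + xy = x^3 + ax^2 + b$ can be exercised on a toy field. The Python sketch below assumes the small field $F_{2^4}$ with reduction polynomial $z^4 + z + 1$ and the illustrative parameters $a = z^3$, $b = z^3 + 1$; these values and all function names are chosen here for demonstration and are far too small for real use. Field elements are stored as integers whose bits are the polynomial coefficients.

```python
M = 4
POLY = 0b10011          # z^4 + z + 1 (toy reduction polynomial, assumption)
A, B = 0b1000, 0b1001   # toy curve coefficients a = z^3, b = z^3 + 1
O = None                # point at infinity


def gf_mul(a, b):
    """Shift-and-add multiplication in F_{2^M} with on-the-fly reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:       # degree reached M: reduce
            a ^= POLY
    return r


def gf_inv(a):
    """Inverse via a^(2^M - 2); fine for a toy field, slow for real sizes."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    result = 1
    for _ in range(2 ** M - 2):
        result = gf_mul(result, a)
    return result


def on_curve(P):
    if P is O:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def negate(P):
    return O if P is O else (P[0], P[0] ^ P[1])   # -(x, y) = (x, x + y)


def double(P):
    if P is O or P[0] == 0:                       # x = 0 means P = -P
        return O
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))             # lambda = x1 + y1/x1
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    return (x3, y3)


def add(P, Q):
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2:
        # Same x: either Q = -P (result O) or Q = P (doubling).
        return O if y1 ^ y2 == x1 else double(P)
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))        # lambda = (y1+y2)/(x1+x2)
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Both operations need a field inversion in affine coordinates, which is the cost that the projective coordinates discussed next are meant to avoid.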
In the next section, the architecture design of the ECC encryption scheme is presented. The implementation results are then given and discussed.

ECC IP Architecture Design
In order to avoid modular inversion in point addition and point doubling, the Lopez and Dahab projective transformation is used: (X, Y, Z), with $Z \neq 0$, maps to $(X/Z, Y/Z^2)$ [25]. The security of ECC rests on the difficulty of inverting point multiplication, while its performance depends on how efficiently point multiplication can be computed.
Thus, to perform it, the Montgomery algorithm is adopted in order to exploit the parallelism of point addition and point doubling, which are calculated independently. They are computed at the same time, as shown in Figure 3. These two basic operations are based on modular arithmetic operations. The point addition needs five multiplications and the point doubling needs six multiplications, but only two multipliers are used in each operation. The approach is based on keeping the components busy full time, so in all modules (point conversions and point operations) only two multipliers are used. They are activated and then reactivated in the next step, as shown in Figure 3. Component reuse and full-time use of the components are the main optimizations in the ECC architecture, which presents competitive results compared to the state of the art. Results and comparisons are given in the next section.
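The control structure of the Montgomery algorithm can be sketched independently of the underlying curve arithmetic. The Python sketch below is a generic Montgomery ladder: each key bit triggers exactly one addition and one doubling, and the two do not depend on each other's result, so they can run in parallel in hardware. The function name and its parameters are hypothetical, and the demonstration group (integers modulo 101 under addition) is only a stand-in for the elliptic curve group.

```python
def montgomery_ladder(k, P, add, double, identity):
    """Compute k*P with one `add` and one `double` per scalar bit.

    Invariant: R1 - R0 == P before and after every iteration.
    """
    R0, R1 = identity, P
    for bit in bin(k)[2:]:          # most significant bit first
        if bit == "0":
            # The two calls read the old (R0, R1) and are independent.
            R0, R1 = double(R0), add(R0, R1)
        else:
            R0, R1 = add(R0, R1), double(R1)
    return R0


# Stand-in group for demonstration only: integers modulo 101 under addition.
MODULUS = 101


def toy_add(a, b):
    return (a + b) % MODULUS


def toy_double(a):
    return (2 * a) % MODULUS
```

For instance, `montgomery_ladder(13, 7, toy_add, toy_double, 0)` evaluates 13 × 7 mod 101. Replacing `toy_add` and `toy_double` with elliptic curve point addition and doubling yields the scalar multiplication kP.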

Simulation and Synthesis Results of ECC IP
The ECC IP is based on the Montgomery scalar multiplication over $F_{2^m}$ using projective coordinates. The main operations in the Montgomery algorithm are the point addition and the point doubling shown in Figure 3. They were both optimized in order to use only two reused multipliers. This increases the IP performance by decreasing its power consumption and area occupancy. The ECC implementation results are listed in Table 2. The proposed ECC IP was implemented on an FPGA Virtex 5 ML 50 and gave good results; its performance makes it well suited to the ECDSA implementation. Table 3 compares the results with state-of-the-art implementations. As shown in Table 3, the implementation in [26] has a smaller area than the proposed ECC architecture but requires a considerably longer execution time. The proposed ECC architecture outperforms the implementation in [28] in terms of both area and the time required to compute the scalar multiplication. The implementation of [29] is recent (2015) and uses the Virtex5 platform; our proposed ECC architecture presents better results in all parameters, with the area and time decreased by 6.68% and 49.4%, respectively. The last two columns in the table show the algorithmic throughput and efficiency. In order to compare our design with the other works, Figure 4 plots both the throughput and the efficiency values. Our proposed ECC design has the highest throughput and efficiency among the existing works, because a high throughput requires optimizing the critical path, the area and the number of clock cycles of the design.

Random Number Generator (RNG)
Generally, the security of asymmetric schemes is based on the difficulty of a mathematical problem and on the complexity of the algorithm used. However, security must not rely on these points alone but also on the secrecy of the key, which is a parameter used in each implementation [30]. Several problems appear when generating keys:

•
Finding the appropriate algorithms for random (or pseudorandom) generation of the bit sequences. The period of these algorithms must be long enough.

•
Testing the quality of these binary sequences, which means checking the randomness of the generated bit sequences.

•
Generating these keys through deterministic algorithms in order to achieve higher speed and efficiency.

•
Protecting these pseudorandom generators against mathematical and physical attacks.

•
Implementing and optimizing these key generation algorithms in hardware with respect to the platform's requirements.

There are several methods for hardware key generation; the best known are the Linear Feedback Shift Register (LFSR) and the Non-Linear Feedback Shift Register (NLFSR) generators. LFSRs are used in stream cipher generators such as those used for GSM (A5/1, A5/2 and W7), and the best-known NLFSR-based generator is the grain [20]. Table 4 shows the functional characteristics of the mentioned pseudorandom generators, namely the key length and the initialization vector (IV) length. The grain-128 [31] supports a 128-bit key and an IV of 96 bits. This cipher is very small and easy to implement in hardware [22]. In addition, the speed of its hardware implementation can easily be increased. This is a good characteristic of the grain family compared to other Pseudorandom Number Generators (PRNGs).

Security Analyses of PRNG
The quality of the generated keys is one of the most critical points in configuring a crypto-processor. If the keys are not randomly generated, then an attacker can guess the key. To detect deviations from randomness in binary sequences, the National Institute of Standards and Technology (NIST) provides a statistical test suite for random and pseudorandom number generators for cryptographic applications. The NIST test suite is a statistical package containing 15 different tests that assess the randomness of binary sequences produced by cryptographic random or pseudorandom number generators [32]. In our case, the randomness of the outputs of A5/1, W7, CA and the grain was tested.
A sequence is considered random if the P-value of each test is at least 0.01 (1%). The results of the various tests applied to A5/1, W7, CA and the grain are presented in Table 5. These results show that the standard version of the grain outperforms CA, W7 and A5/1 in terms of security. The number of keys generated by this version that pass the different tests is always greater than the number of keys generated by CA, W7 and A5/1. For example, the number of keys with acceptable Monobit Test results is $0.0109 \times 2^{1024}$ ($\approx 5 \times 0.0022 \times 2^{1024}$ for W7 and $\approx 4 \times 2^{960} \times 0.0026 \times 2^{64}$ for A5/1). When the speed of the grain is increased, the results are less good but still better than those given by W7 and A5/1. To conclude, thanks to its non-linear functions, the grain ensures a higher level of security than CA and W7.
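As an example of how such a test produces a P-value, the following Python sketch implements the Frequency (Monobit) test as defined in NIST SP 800-22: the bits are mapped to ±1, summed, normalized by the square root of the sequence length and passed through the complementary error function. The function names and the 0.01 default threshold argument are illustrative choices; the other 14 tests of the suite are not covered.

```python
import math


def monobit_p_value(bits):
    """Frequency (Monobit) test of NIST SP 800-22.

    `bits` is a sequence of 0/1 integers. A balanced sequence gives a
    P-value close to 1; a strongly biased one gives a value close to 0.
    """
    n = len(bits)
    s_n = sum(1 if b else -1 for b in bits)      # +1 for a one, -1 for a zero
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """Accept the sequence as random when the P-value is at least alpha."""
    return monobit_p_value(bits) >= alpha
```

For the 10-bit sequence 1011010101 (six ones and four zeros), the statistic is $s_{obs} = 2/\sqrt{10} \approx 0.632$ and the P-value is about 0.527, so this short sequence passes at the 1% level.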

Implementation and Synthesis Results of the Pseudorandom Number Generators (PRNG)
The synthesis results of the grain, the A5/1, the W7 and the CA 16 × 16 are presented in Table 6. They were synthesized using the "Synplify Pro" tool and the Virtex2 XC2v2000-6ff896 platform. The A5/1 generator has an acceptable speed with an occupancy rate of 2% and a relatively low power consumption. This generator can be used for GSM. The W7 frequency is lower, while its period is greater than that of the other generators, which ensures a high security level. The occupation of the generators A5/1, W7 and CA 16 × 16 is very small and similar. The speeds of the generators A5/1 and W7 are negligible compared to that of CA. The throughput increases by about 38,113.68 Mbps between CA 16 × 16 and W7. Thus, the higher security level of the W7 comes at a cost in execution time. The W7 and CA 16 × 16 have the maximum efficiency, but they also have the maximum power consumption. All PRNGs have a low power consumption, which makes them suitable for use in constrained environments such as Bluetooth or GSM. In conclusion, the implementation results show, firstly, that the grain generator presents the best trade-off between area, speed and power consumption. Secondly, the grain-128 preserves the advantages of the grain-80 while supporting a 128-bit key and a 96-bit initialization vector.

The Grain IP
From a hardware point of view, the grain is designed to be very small and efficient. It is a bit-oriented synchronous stream cipher that requires an 80-bit key to initialize its input registers. It is based on two shift registers of fixed size (80 bits), an LFSR and an NLFSR, in which the bits are shifted at every clock cycle, and on a nonlinear output function.
Depending on the platform used, the user can estimate or fix the encryption speed. Figure 5 shows the grain algorithm. It is based on three modules, which are:

•
The LFSR module: based on a sequence $(s_i, s_{i+1}, \ldots, s_{i+79})$ and a linear feedback function. It guarantees a minimum period for the key-stream, it can be implemented efficiently and it significantly increases the throughput. The polynomial function of the LFSR (feedback polynomial), denoted f(x), is a primitive polynomial of degree 80. Figure 6 illustrates the operation of the LFSR block, including the register initialization, the polynomial function f(x) and the update function.

•
The NLFSR module: it includes an input masked with the LFSR output in order to balance its state. Together with the nonlinear output function, the NLFSR introduces non-linearity into the cipher. Its feedback is given by a polynomial function g(x).

•
The filter module: a nonlinear output function that introduces non-linearity into the encryption. It is based on a nonlinear filtering function h(x) with five input variables; the filter, of algebraic degree 3, is selected to be well balanced.
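The role of the primitive feedback polynomial in guaranteeing the period can be seen on a toy register. The Python sketch below models a 4-bit Fibonacci LFSR; the tap positions are chosen here for illustration and are not the taps of the grain, whose registers are 80 bits long. With the recurrence $s_{i+4} = s_i + s_{i+1}$ (primitive polynomial $x^4 + x + 1$) every nonzero seed cycles through all 15 nonzero states, whereas a non-primitive choice gives a shorter cycle.

```python
def lfsr_step(state, taps):
    """Shift a Fibonacci LFSR once.

    `state` is a tuple (s_i, ..., s_{i+n-1}); the new bit is the XOR of the
    tapped positions. Returns (output bit, next state).
    """
    feedback = 0
    for t in taps:
        feedback ^= state[t]
    return state[0], state[1:] + (feedback,)


def keystream(seed, taps, length):
    """Return the first `length` output bits produced from `seed`."""
    state, out = seed, []
    for _ in range(length):
        bit, state = lfsr_step(state, taps)
        out.append(bit)
    return out


def period(seed, taps):
    """Number of steps until the register returns to `seed`."""
    state, steps = seed, 0
    while True:
        _, state = lfsr_step(state, taps)
        steps += 1
        if state == seed:
            return steps


PRIMITIVE_TAPS = (0, 1)      # s_{i+4} = s_i + s_{i+1}, i.e. x^4 + x + 1
NON_PRIMITIVE_TAPS = (0, 2)  # s_{i+4} = s_i + s_{i+2}, i.e. (x^2 + x + 1)^2
```

An LFSR alone is linear and therefore easy to predict from a short stretch of output, which is why the grain combines it with an NLFSR and a nonlinear filter.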

Grain Optimizations
The two registers (LFSR and NLFSR) of the grain are synchronized such that one bit is generated in each clock cycle. The grain offers the possibility of increasing the speed by implementing the polynomial functions (f(x) and g(x)) and the filter function (h(x)) several times. To simplify this implementation, the last 15 bits of the two shift registers ($s_i$ and $b_i$, $65 \leq i \leq 79$) are used neither in the f(x) and g(x) functions nor as inputs of the filter function [20]. Thus, the speed can be multiplied by up to 16, which reduces the time required for the initialization phase (160/16 = 10 cycles) and for the key generation (80/16 = 5 cycles). Figure 7 shows a sample implementation in which the functions are duplicated, so that two bits are generated in each clock cycle.

In addition, a generic version of the grain stream cipher was proposed in order to generate a 128-bit key; it allows users to set the speed using a 4-bit signal. This version is designed for constrained environments where resources are limited and power consumption must be kept low. It is based on the same principle as the first 80-bit version of the grain: two shift registers, an LFSR and an NLFSR (each of size 128 bits), with an output function. Supporting a 128-bit key and a 96-bit initialization vector, the grain-128 preserves the advantages of the grain-80: it ensures a high security level, a reduced size and simplicity of implementation. The grain-128 also offers the possibility of increasing the speed by implementing the polynomial functions (f(x) and g(x)) and the filter function (h(x)) several times. The speed can be multiplied by up to 32.

Simulation and Synthesis Results of the Grain IP
The grain was synthesized using the "Synplify Pro" tool and the Virtex2 XC2v2000-6ff896 platform. Figure 8a,b illustrate, respectively, the area occupancy (LUTs) and the speed (Mbps). Various versions of the grain-80 were implemented. The grain Vi denotes version i of the grain (i indicates the speed-up factor). The synthesis of the various versions shows that the speed grows with the occupancy and also with the power consumption, and that the growth becomes increasingly significant from one version to the next. For example, comparing the standard version of the grain (grain-80) to the grain V16, the variation in power consumption is negligible compared to the increase in speed (1652.8 > 7 × 230.9 Mbps). The generic version of the grain, VN (N can be 1, 2, 4, 8, 10, 16), makes it possible to choose the version compatible with a dedicated application, but it presents a loss in speed, frequency and occupation. The grain VN requires an area of about 4500 LUTs, achieves a frequency of 44 MHz and consumes 15.95 mW. Taking the version whose speed factor is equal to 1 as an example, the frequency decreases from 230.9 MHz (grain V1) to 44 MHz (grain VN), while the occupancy reaches 4500 LUTs (≈12 × 355). The grain generator has a simple algorithm that uses a small number of registers and a finite number of iterations to produce a key. Each version of the grain has its own characteristics; the choice of the appropriate version is based on the application's constraints. The grain ensures security, has a reduced size and is simple to implement. For this reason, it is implemented in the ECDSA architecture design, which is detailed in the next section.

ECDSA Architecture for Low-Area Low-Power Computing
The optimizations mentioned earlier, applied to the different IPs, allow a low-area, low-power ECDSA architecture to be designed. In this section, the different design modules are presented.

Proposed ECDSA Processor Design
To perform an ECC digital signature, three algorithms are needed: key pair generation, signature generation and signature verification. They are given, respectively, by Algorithms 2-4.
Algorithm 2 is responsible for the generation of the key pair (public and private); it is based on a good choice of elliptic curve, the random selection of an integer and the computation of the scalar multiplication.
Algorithm 2: Private and Public Key Generation.
1. Select a random integer d in [1, n − 1].
2. Compute Q = dG (the Montgomery scalar multiplication).
Return private key d and public key Q.
To sign a message m, an entity A follows the steps given by Algorithm 3 with the selected domain parameters. It is based on the scalar multiplication and the hash function.
Algorithm 3: ECDSA signature generation.
1. Select a random or pseudorandom integer k in [1, n − 1].
2. Calculate $kG = (x_1, y_1)$.
3. Calculate $r = x_1 \bmod n$. If r = 0, return to step 1.
4. Calculate $k^{-1} \bmod n$.
5. Calculate e = H(m), where H(m) is the cryptographic hash of the message m computed using SHA-1 or SHA-2.
6. Calculate $s = k^{-1}(e + d \cdot r) \bmod n$. If s = 0, return to step 1.
Return (r, s), the signature of the message m.
Algorithm 4 explains how the receiver verifies the signature: it calculates the hash function to obtain the message digest and then uses the public key of the sender on this message.
Algorithm 4: ECDSA signature verification.
Output: signature acceptance or rejection.
1. Verify that the integers r and s are both in [1, n − 1].
2. Calculate e = H(m).
3. Calculate $w = s^{-1} \bmod n$.
4. Calculate $u_1 = e \cdot w \bmod n$ and $u_2 = r \cdot w \bmod n$.
5. Calculate $X = u_1 G + u_2 Q = (x_1, y_1)$ (using the point addition formula on the elliptic curve).
6. If X = O, the signature is rejected. Otherwise, calculate $v = x_1 \bmod n$.
7. The signature is accepted only if v = r.
The ECDSA processor is composed of the following units:
1. The ECC unit: it computes the point scalar multiplication based on the Montgomery algorithm, which was explained in Section 4.2.
2. The SHA-2 function unit: it generates the hash of the message m used in both signature generation and signature verification; it was presented in Section 4.1.
3. The PRNG unit: a random number generator that generates the random numbers used as keys during the signing process; it was detailed in Section 4.3.
4. The intermediate register: used to store the intermediate results.
5.
The controller unit: it generates and sends control signals to all units in order to synchronize them. It is fully responsible for the system management and the data exchange between the different units through the control lines.

The ECC, the SHA-2 and the random generator IPs were explained earlier. The controller unit is one of the main units in the ECDSA architecture: it provides the interface between the different IPs and registers, and it is responsible for the synchronization, the sequential activation of the different components, the flow control, and the data dependencies. It is based on a hard synchronous finite state machine, represented in Figure 10, which provides the synchronization of the different components that an efficient hardware implementation needs. The controller uses two modes: Mode 0 for the key and signature generation and Mode 1 for the signature verification. All blocks use two signals: "Start" to begin the computation and "End" to indicate that the computation has finished. The two modes are explained as follows:

• Mode 0: this mode is used for the key and the signature generation. After the selection of an elliptic curve $E(a,b)$, a point $G \in E(a,b)$ of order $n$ and a cryptographically strong random number $d$ in the interval $[1, n-1]$, which is the private key, the controller computes the scalar multiplication $Q = dG$ using the ECC IP. It activates the ECC IP with the signals "Reset-ECC" (enable) and "Start-ECC" (begin the computation). When the signal "End-ECC" = 1, the outputs of the ECC IP are the coordinates x and y of the public key Q; this is the key generation step. For the signature generation, after selecting a pseudorandom number $k$, the controller reactivates the ECC IP as described above in order to compute $kG = (x_1, y_1)$. Then, it sends the signal "Start-Red" to the pseudo-Mersenne-reduction block to convert $x_1$ to an integer $X_1$-conv, which serves as the signature component $r$. After that, the controller tests whether $X_1$-conv = 0: if so, the random number generation step is repeated; otherwise, the controller activates the inversion block, via the signal "Start-Inv", in order to calculate $I = k^{-1} \bmod n$. The inversion and the Mersenne reduction operations are independent, so they can be done in parallel. On receiving the signal "End-Inv", which indicates the end of the inversion, the controller activates the SHA-2 IP, whose output is the message digest $e$. On receiving the signal "End-SHA2", the controller activates the multiplication block by sending the signal "Start-Mul" in order to compute $T_1 = d \times r$. The Carry Look Ahead Adder (CLA) is used to calculate $T_2 = e + T_1$. The multiplication block is then reactivated to calculate $s = I \times T_2 \bmod n$. After the second multiplication, the controller tests the value of $s$: if $s = 0$, it returns to the random number generation step; otherwise, it returns the signature of the message $m$, which is $(r, s)$. As mentioned, the ECC IP and the multiplication block are each used twice in Mode 0.

• Mode 1: this mode is responsible for the signature verification. The inversion block and the SHA-2 IP are reactivated to calculate $w = s^{-1} \bmod n$ and $e = H(m)$, respectively. On receiving the signals "End-Inv" and "End-SHA2", the controller reactivates the multiplication block to compute first $u_1 = e \times w \bmod n$ and then $u_2 = r \times w \bmod n$. After receiving the signal "End-Mul", the controller sends the signal "Start-AddMong" to the Addition-Montgomery block to calculate $X = u_1 G + u_2 Q$ using the point addition formula on the elliptic curve. The controller tests the value of $X$: if $X = O$ (the point at infinity), the signature is rejected; otherwise, it calculates $v = x_1 \bmod n$, where $x_1$ is the x-coordinate of $X$. Finally, the signature is accepted only if $v = r$.
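The order of operations in the two controller modes can be sketched in software. The following Python sketch is only an illustration under stated assumptions: it replaces the 163-bit binary NIST curve of the paper with a toy prime curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ (generator $G = (5, 1)$ of order 19), uses a plain double-and-add scalar multiplication instead of the Montgomery algorithm, and reduces a SHA-256 digest modulo $n$. All function names are hypothetical and do not correspond to the hardware blocks.

```python
import hashlib
import secrets

# Toy parameters (assumption): y^2 = x^3 + 2x + 2 over F_17, G of prime order 19.
P, A = 17, 2
G, N = (5, 1), 19

def add(p1, p2):
    """Affine point addition; None stands for the point at infinity O."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def mul(k, pt):
    """Double-and-add scalar multiplication (stand-in for the ECC IP)."""
    acc = None
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def digest(msg):
    """Message digest e, reduced mod n for the toy group (stand-in for the SHA-2 IP)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N

def sign(d, msg):
    """Mode 0: retry with a fresh k whenever r = 0 or s = 0."""
    e = digest(msg)
    while True:
        k = secrets.randbelow(N - 1) + 1      # stand-in for the RNG IP
        r = mul(k, G)[0] % N                  # kG = (x1, y1), r from x1
        if r == 0:
            continue
        s = pow(k, -1, N) * (e + d * r) % N   # s = I * (e + d*r) mod n
        if s != 0:
            return r, s

def verify(q, msg, sig):
    """Mode 1: accept only if the x-coordinate of u1*G + u2*Q equals r mod n."""
    r, s = sig
    if not (1 <= r < N and 1 <= s < N):
        return False
    w = pow(s, -1, N)
    u1, u2 = digest(msg) * w % N, r * w % N
    x = add(mul(u1, G), mul(u2, q))
    return x is not None and x[0] % N == r
```

With a private key such as `d = 7` and public key `q = mul(d, G)`, `verify(q, m, sign(d, m))` holds for any message `m`. The group is far too small to offer any security; the sketch only mirrors the sequence of checks and block activations described above.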
As shown in the explanation of the controller finite state machine, the IPs and the other blocks must be activated sequentially. Table 7 presents how the controller activates (ON) and disables (OFF) the different IPs as needed. The Grain generator, the ECC IP, the SHA-2 IP, and the inversion block are called twice in the entire architecture, and the multiplication block is reactivated four times.

As shown in the table below, the proposed design $D_1$ (FPGA implementation) is about 60% faster than [33]. Compared with [2], our results outperform those of B. Panjwani et al. in terms of area and frequency. In addition, our architecture presents a gain of about 22% in area and 57% in execution time: compared with [34], we have decreased the area by 22.12% and the execution time by 57.54%. Implemented on an ASIC platform, our design $D_2$ occupies a smaller area than that reported in [34].
In order to have a relevant performance comparison between our synthesis results and those of related works, we add another parameter, the Area Time Product (ATP). Figure 11 gives the ATP of the mentioned state-of-the-art FPGA implementations. It is clear from Figure 11 that our design $D_1$ and the design in [2] have the minimum ATP values. Our designs ($D_1$, $D_2$) offer a trade-off between area and time and are the most efficient compared to the other designs. In the literature, only a small number of authors have given the power dissipation of the entire ECDSA design. For this reason, in order to study the influence of the power, we have drawn Figure 12, which contains the SPP (Speed Power Product) of all IPs. The SPP is calculated by multiplying the gate propagation delay by the power dissipation. As shown in Figure 12, the ECC IP has the highest SPP value.
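The two figures of merit used above are simple products, as the following minimal Python sketch shows. The function names and the unit conventions (slices times microseconds for the ATP, nanoseconds times milliwatts for the SPP) are assumptions made for illustration; the units actually used in Figures 11 and 12 are not stated in this section.

```python
def area_time_product(area_slices: float, time_us: float) -> float:
    """ATP as area multiplied by execution time (assumed units: slices * microseconds)."""
    return area_slices * time_us

def speed_power_product(delay_ns: float, power_mw: float) -> float:
    """SPP as propagation delay multiplied by power (ns * mW gives picojoules)."""
    return delay_ns * power_mw

# Example with the Virtex-5 figures quoted in the conclusions (18,504 slices, 1.5 us).
atp_d1 = area_time_product(18504, 1.5)

# Purely hypothetical delay and power values, used only to show the arithmetic.
spp_example = speed_power_product(2.0, 10.0)
```

A lower ATP indicates a better area/time trade-off, which is the sense in which the designs are ranked in the comparison above.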

Security Analyses of ECDSA Processor
Several interpretations have been reported of what it means to break a digital signature: retrieving the secret key, creating another signing algorithm with an equivalent secret key, forging a signature for a chosen message, and forging a signature for at least one message [36]. In order to be useful, ECDSA must provide a high level of security. The essential security conditions for ECDSA are [37]:

• The discrete logarithm problem must be hard to solve, so that the secret key cannot easily be obtained.

• The hash function used must be a one-way, collision-resistant hash function.

• The generator for k must be unpredictable, because the secret key can be obtained from k, r, and s when k is predictable.

The Elliptic Curve Discrete Logarithm Problem (ECDLP) is a special case of the discrete logarithm problem. It consists of finding an integer $d$, if it exists, such that $Q = dG$, given points $G, Q \in E(F_q)$. Many attacks against the ECDLP exist, such as the exhaustive search, the Pohlig-Hellman, and the Baby-Step Giant-Step algorithms. Another of these attacks is Pollard's Rho algorithm, which has an expected running time of $\sqrt{\pi n/2}$, where $n$ is the order of the point $G$. Moreover, this algorithm can be parallelized and run on $r$ different processors, so that the running time becomes $\sqrt{\pi n/2}/r$. In this section, two attacks against the ECDSA processor are considered: the fault injection attack and the restart attack.
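To give a sense of scale for the Pollard's Rho estimate, the short Python sketch below evaluates the base-2 logarithm of $\sqrt{\pi n/2}/r$. It assumes $n \approx 2^{162}$, a plausible order of magnitude for a 163-bit binary curve with cofactor 2; this value and the function name are illustrative assumptions, not figures taken from the paper.

```python
import math

def rho_cost_log2(n_bits: float, processors: int = 1) -> float:
    """log2 of the expected number of group operations sqrt(pi*n/2)/r for n ~ 2**n_bits."""
    return 0.5 * (n_bits + math.log2(math.pi / 2)) - math.log2(processors)

single = rho_cost_log2(162)                        # one processor
parallel = rho_cost_log2(162, processors=2**20)    # about a million processors
print(f"single: 2^{single:.2f}, parallel: 2^{parallel:.2f}")
```

Parallelization only divides the work by the number of processors, so even $2^{20}$ processors lower the exponent by just 20.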

Fault Injection Attack: No Correctness Check for Input Points
This attack is applicable when the device checks neither whether an input point $P$ nor whether the result of the computation really is a point on the cryptographically strong elliptic curve $E$, which is a parameter of the system. The no correctness check for input points attack is simple and should not be applicable to a well-designed system; nevertheless, such a "bug" might happen in practice. Let $E = (a_1, a_2, a_3, a_4, a_6)$ be a given cryptographically strong elliptic curve, which is part of the setup of the ECC system. The coefficients $a_i$ are in a field $K$, and $E(K)$ denotes the set of all solutions $(x, y) \in K \times K$, together with the point at infinity $O$. We note that the coefficient $a_6$ is not used when calculating a scalar multiplication. Consider the situation in which the cryptosystem receives a point $P' = (x', y')$ with $x', y' \in K$, where $P'$ is not a point on $E$ but a point on some other elliptic curve $E'$.

We choose the input pair $P' = (x', y')$ carefully, such that with $a_6' = y'^2 + a_1 x' y' + a_3 y' - x'^3 - a_2 x'^2 - a_4 x'$ the tuple $(a_1, a_2, a_3, a_4, a_6')$ defines an elliptic curve $E'$ whose order has a small divisor $r$ and such that $\mathrm{ord}(P') = r$. If $r$ is relatively small, the attacker can solve the discrete logarithm problem in the subgroup of order $r$ and find $k_r = k \bmod r$. We can repeat this procedure with different choices of $P'$ and use the Chinese Remainder Theorem to compute the correct value of $k$.

In particular, an attack is possible by injecting a fault into the x or y coordinate of the point $P$. With stronger assumptions, the attacker can even find the secret $k$ after injecting a fault into both coordinates. This attack is quite efficient if we choose the curve $E'$ first, rather than $P'$, and then compute $P'$. To avoid the fault injection attack in the scalar multiplication, the point $P$ must be checked to be a valid point on the curve, as advised in most ECC protocols.
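The countermeasure mentioned above, checking that an input point lies on the intended curve, can be sketched as follows. The sketch assumes a toy curve over the prime field $F_{17}$ with coefficients $(a_1, a_2, a_3, a_4, a_6) = (0, 0, 0, 2, 2)$ rather than the binary field used by the processor, and the function names are hypothetical. It computes the coefficient $a_6$ implied by a received point and compares it with the $a_6$ of the system curve.

```python
P_MOD = 17                 # toy prime field (assumption; the paper uses a binary field)
CURVE = (0, 0, 0, 2, 2)    # (a1, a2, a3, a4, a6) of the system curve E

def implied_a6(point, curve=CURVE, p=P_MOD):
    """a6 of the curve that the point actually lies on, given a1..a4 of E."""
    a1, a2, a3, a4, _ = curve
    x, y = point
    return (y * y + a1 * x * y + a3 * y - x**3 - a2 * x * x - a4 * x) % p

def is_on_curve(point, curve=CURVE, p=P_MOD):
    """Accept the point only if its implied a6 matches the a6 of E."""
    return implied_a6(point, curve, p) == curve[4] % p

valid = is_on_curve((5, 1))      # (5, 1) satisfies y^2 = x^3 + 2x + 2 mod 17
faulty = is_on_curve((5, 2))     # a faulted y-coordinate lands on a curve with a6 = 5
```

Because the scalar multiplication never uses $a_6$, a device that skips this check would silently compute on the other curve, which is exactly what the attack exploits.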

Restart Attack
In this part, we present our results on the ECDSA signature generation against the restart attack. To break the scheme with signatures of two different messages, we assume that the pseudorandom generator for the key $k$ is deterministic and that one can reset the internal state of the generator. So, if the signer signs $M_1$ by generating $k$ and we can reset the generator so that it produces the same $k$ for $M_2$, we have a signature $(r, s_1)$ for $M_1$ and a signature $(r, s_2)$ for $M_2$. Hence we obtain:

$$x = -\frac{s_2\,\mathrm{SHA2}(M_1) - s_1\,\mathrm{SHA2}(M_2)}{r\,(s_2 - s_1)} \bmod q \qquad (4)$$

where $x$ denotes the private key and $q$ the order of the point $G$ (written $d$ and $n$, respectively, in the description of the controller).

Accordingly, using the chaos-based key generator strengthens the resistance of the proposed ECDSA cryptosystem to the restart attack, because the initial key value of the Grain key generator is modified for each signature generation. In order to improve the sensitivity of the signature processor to the initial key, we maximized the use of the chaos-generated key in the internal functions of the Grain pseudorandom generator. Thus, the initial key value of the Grain generator is not constant; instead, it is formed from the initial chaos-generated key.
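The key-recovery relation of Equation (4) can be checked numerically. The Python sketch below is an illustration under simplifying assumptions: the group order is replaced by the Mersenne prime $2^{127} - 1$, the value $r$ is an arbitrary fixed number standing in for the x-coordinate of $kG$ (no curve arithmetic is performed), and SHA-256 reduced modulo $q$ plays the role of SHA2. All names are hypothetical.

```python
import hashlib

Q = 2**127 - 1   # stand-in prime group order (assumption)

def h(msg: bytes) -> int:
    """SHA-256 digest reduced mod q, standing in for SHA2(M)."""
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % Q

def sign_with_nonce(x: int, k: int, r: int, msg: bytes) -> int:
    """s = k^-1 * (SHA2(M) + x*r) mod q for a given, possibly reused, nonce k."""
    return pow(k, -1, Q) * (h(msg) + x * r) % Q

def recover_key(r: int, s1: int, s2: int, m1: bytes, m2: bytes) -> int:
    """Equation (4): x = -(s2*SHA2(M1) - s1*SHA2(M2)) / (r*(s2 - s1)) mod q."""
    num = (s2 * h(m1) - s1 * h(m2)) % Q
    den = r * (s2 - s1) % Q
    return -num * pow(den, -1, Q) % Q

# The same nonce k (hence the same r) is used for two different messages.
x_secret, k_reused, r_shared = 123456789, 987654321, 55555
s1 = sign_with_nonce(x_secret, k_reused, r_shared, b"message one")
s2 = sign_with_nonce(x_secret, k_reused, r_shared, b"message two")
recovered = recover_key(r_shared, s1, s2, b"message one", b"message two")
```

Here `recovered` equals `x_secret`, which shows why a generator that can be reset to reproduce the same $k$ exposes the private key after only two signatures.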

Conclusions and Future Works
In this paper, an ECDSA signature scheme was implemented. All integrated IPs (ECC, SHA-2, and the Grain generator) were optimized in order to reach a trade-off between area and execution time. Thus, the implementation results, on both Virtex-5 and ASIC, are competitive with those of the state of the art. The signature and verification processor uses 18,504 slices on Virtex-5, achieving a frequency of 107.4 in 1.5 µs. Implemented in an ASIC CMOS 45 nm technology, the design requires a cell area of 0.257 mm², achieves a maximum frequency of 532 MHz and consumes 63.444 mW. The proposed ECDSA implementation is suitable for applications with low-bandwidth communication, low-storage and low-computation environments, such as embedded systems. As stated earlier, the design overhead costs should be reduced to suit applications that require low area resources. As future work, the ECC IP can be further optimized by introducing the Secure Hardware Activation System (SEHAS) procedure and Physically Unclonable Functions (PUF) [38].

Figure 3. Parallelism between the point addition and the point doubling.

Algorithm 3: ECDSA Signature Generation. Input: private key d, message m, domain parameters (n, G

Figure 9. Architecture design of the ECDSA digital signature cryptosystem.

Figure 10. The controller finite state machine.

Figure 12. FPGA implementation SPP of the proposed IPs.

Table 3. ECC implementation results comparison.

Table 4. Functional characteristics of the pseudorandom generator.

Table 5. Statistical Test Analyses for Pseudorandom Number Generators.

Table 6. Synthesis results of the different Pseudorandom Number Generators (PRNG).