Article
Peer-Review Record

Low-Cost, Low-Power FPGA Implementation of ED25519 and CURVE25519 Point Multiplication

Information 2019, 10(9), 285; https://doi.org/10.3390/info10090285
by Mohamad Ali Mehrabi 1,* and Christophe Doche 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 13 August 2019 / Revised: 2 September 2019 / Accepted: 9 September 2019 / Published: 14 September 2019

Round 1

Reviewer 1 Report

The paper presents FPGA implementations of elliptic curve operations for Ed25519 and Curve25519. These are operations with high contemporary relevance as they represent the state-of-the-art of elliptic curve crypto as it is used today in many applications including TLS 1.3. The main contribution of the paper is to present an implementation that doesn't utilize the DSP or BRAM resources of modern FPGAs because previous results (that are faster and/or smaller) are available by using DSPs and BRAMs.

Unfortunately, I don't really see the point in avoiding the use of DSPs or BRAMs. Modern FPGAs have plenty of these resources available and they typically provide good performance and power advantages over implementing similar functions with LUTs and FFs only, as you have done. Consequently, the paper must provide clear motivations where avoiding the use of these resources makes sense because otherwise it is hard to see the advantage over the prior results.

Why did you select the formulas that you chose for Ed25519 from the EFD? As far as I remember, more efficient ones exist, namely the ones by Hisil et al. with the additional T coordinate (T=xy).

You tested with 10000 random test vectors. I suggest also testing several corner cases (e.g., 0, 1, 2, and q-1) because they are the ones that typically fail if any do.

The claim that a DPA attack cannot be successful is clearly too strong. A DPA attack can also observe, for example, the access patterns of registers, which depend on the secret scalar but are not randomized by the use of a random lambda.

EdDSA also requires operations other than elliptic curve operations (that is, a hash function and arithmetic modulo the order of the elliptic curve group). Those don't seem to be supported by your architecture. Not even all elliptic curve operations, such as double scalar multiplication or point compression/decompression, seem to be supported. Is it assumed that these will be performed by external (software) components?

 

Author Response

This project aims to design a low-power, area-efficient hardware coprocessor for elliptic curve cryptography, targeting ASICs or low-cost FPGAs (including antifuse-based FPGAs) for IoT and automotive applications. Multipliers can be implemented in DSP modules in different ways ([9], [10], [12]). Our design does not use multipliers, which makes it more efficient in terms of area. It can be mapped to any FPGA, and mapping to an ASIC is easy. This makes the design more suitable for rapid prototyping. Moreover, the interleaved modular multiplier used in our design is uniform, which makes it suitable for an SCA-resistant design. The paper has been revised to add the motivations for our design.

The Hisil et al. formulae add a T = x*y coordinate. The advantage is that this reduces point addition to 9 modular multiplications (in the case of Z_2 = 1). There are three disadvantages. First, one initial modular multiplication is required to compute T. Second, a hardware implementation requires extra registers to hold the T coordinate and its intermediate values during computation. Third, point doubling needs 8 modular multiplications (or squarings). Considering that our point doubling uses 7 modular multiplications, and that point doubling is the most frequent operation in scalar multiplication, using Hisil's formulae would neither shorten the latency nor improve the area.

The special values 0, 1, 2, and p-1 are included in the 10,000 test points.

We are using a uniform differential Montgomery scalar multiplication algorithm that is immune against SPA.

DPA, on the other hand, uses statistical methods to find the private key. Using a randomised key thwarts DPA, which relies on averaging power traces. This claim is supported by several works in the literature.

1. S. Coron, “Resistance against Differential Power Analysis for Elliptic Curve Cryptosystems”.

2. R. Avanzi, “Countermeasures against Differential Power Analysis for Hyperelliptic Curve Cryptosystems”, CHES 2003.
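The "random lambda" mentioned by the reviewer presumably refers to randomising projective coordinates, one of the DPA countermeasures discussed in the literature cited above. The following Python sketch only illustrates the underlying identity, namely that scaling (X : Z) by a nonzero lambda leaves the affine value X/Z unchanged. The function names are hypothetical and the code is not taken from the authors' design.

```python
import secrets

P = 2**255 - 19  # field prime of Curve25519 / Ed25519


def to_affine(X, Z):
    """Map a projective x-coordinate (X : Z) to the affine value X / Z mod P."""
    return X * pow(Z, P - 2, P) % P


def randomize(X, Z):
    """Scale (X : Z) by a fresh nonzero lambda (assumed countermeasure)."""
    lam = secrets.randbelow(P - 1) + 1  # uniform in [1, P-1]
    return X * lam % P, Z * lam % P


# x = 9 is the x-coordinate of the Curve25519 base point.
X2, Z2 = randomize(9, 1)
assert to_affine(X2, Z2) == 9
```

The intermediate register values (X2, Z2) differ from run to run while the represented point stays the same. As the reviewer notes, this does not randomise which registers are accessed.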

The scope of the project was to implement the point multiplication core and to study side-channel attacks on the actual core; the latter is in progress. The implementation of full EdDSA was not within the scope of this project.

Reviewer 2 Report

The paper deals with the implementation aspects of elliptic-curve-cryptography algorithms on the FPGA hardware. The main contribution is the improved design of the scalar multiplication algorithm and its simulation on the FPGA platform. Furthermore, the implementation results on the Xilinx chips are provided.

The layout, syntax, readability and structure of the paper are adequate, and I have not identified any major weakness.

While the introductory section with theoretical background is sufficient, I suggest improving the sections describing the practical results. In particular, I have the following suggestions for improvement:

1) The authors claim that the ECPD and ECPA algorithms were modified, but there is not enough information about the details and motivations. Furthermore, the paper compares the design results with existing work, but only papers published 4-5 years ago are covered. The improvement in terms of speed and area should be explicitly described and compared with the current state of the art.

2) The frequency used for the comparison is 550 MHz, which can be very difficult to achieve on a concrete device. Are there any practical results using these high frequencies?

3) Fermat's theorem is used for computing the inverses, but is that the fastest approach? What would the results be using the Euclidean algorithm?

4) In a paper describing implementation results, it is important to also provide the code (the VHDL in this case) so that the results can be verified. Is the implementation available anywhere?

In case the above-mentioned questions are sufficiently addressed, I recommend the acceptance of the paper.

Author Response

 

We first designed a new radix-8 modular multiplier based on the existing interleaved modular multiplication. The works in the literature were mainly published in 2010-2014. To the best of our knowledge, no new work in this area has been published in recent years. We have compared our design with the latest similar works in Table 1. We then designed ED25519 and CURVE25519 point multiplication cores based on our proposed modular multiplier and compared the results with similar recent works in the literature in Table 2 and Table 3. The motivation for using a radix-8 modular multiplier is to achieve a low-power, low-area design for IoT applications. Our radix-8 modular multiplier design is very efficient in terms of logic resources. The logic is reduced specifically by exploiting the modulus $2^{255}-19$.

Our implementation achieved 550 MHz. Referring to Figure 6 of the manuscript, the most critical part is the calculation of the ROM1 components. Once calculated, they are stored in memory and do not change until the end of the cycles. Fetching from memory takes at most 0.7 ns. The maximum routing delay from the input of the first CSA to the input of the registers is no more than 1.8 ns. Similar works [17, 19] achieved frequencies of 422 and 437 MHz on the older VIRTEX 5 and VIRTEX 4 devices. Note that only the modular multiplier runs at the 550 MHz clock, which is provided by an internal PLL. The ECC point multiplication core uses a 137.5 MHz clock (1/4 of 550 MHz).

If the Euclidean algorithm were used for modular inversion, we would need to add extra hardware to our design, including multipliers and 255-bit registers. Moreover, the Euclidean algorithm does not run in constant time for every input. Fermat's theorem requires only a modular multiplier (squarer) and a small state machine that implements the chain (12).

The VHDL code is provided. We are using the implementation of this design on an ARTIX-7 evaluation board for our side-channel attack research.
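The inversion trade-off described in the response above can be illustrated with a small software sketch. It is an illustration in Python, not the authors' VHDL or their chain (12): `inv_fermat` uses the fixed exponent p-2, so the sequence of squarings and multiplications is the same for every input, while the hypothetical `inv_euclid` returns the number of loop iterations to show that it varies with the input.

```python
P = 2**255 - 19  # field prime


def inv_fermat(a):
    """Inverse via Fermat's little theorem: a^(P-2) mod P (fixed exponent)."""
    return pow(a, P - 2, P)


def inv_euclid(a):
    """Inverse via the extended Euclidean algorithm.

    Returns (inverse, steps); the step count depends on the input value.
    """
    r0, r1 = P, a % P
    s0, s1 = 0, 1  # invariant: r_i == s_i * a (mod P)
    steps = 0
    while r1:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
        steps += 1
    return s0 % P, steps


a = 12345
assert inv_fermat(a) == inv_euclid(a)[0]
assert a * inv_fermat(a) % P == 1
```

Python's built-in `pow` is not itself a constant-time routine. The sketch only shows that the Fermat exponent is independent of the value being inverted, whereas the Euclidean loop length is not.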

Round 2

Reviewer 1 Report

The resubmission sufficiently addresses my concerns. I still have some reservations about the motivations of the work (minimize the use of DSPs) but I support acceptance, nonetheless.

Reviewer 2 Report

The major weaknesses identified in the first-round review were fixed or well addressed in the response. I recommend acceptance.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

Major.

1. Ref. [5] collects the 'add-2008-bbjlp' formula which is depicted in Figure 2. However I couldn't find the respective formulae for figures 1 and 3. Include the mathematical representation of these formulas or a direct cite to the paper where they are reported.

2. Consider providing a script with a realization of Algorithm 2. SageMath might be a good implementation choice.

3. Table 2 must report the underlying field length used in each work. More concerning yet, I was unable to find the quoted results in the respective references. If you are providing estimates as hinted in line 79, this needs to be stated in the table or the original results must be also provided.

4. As a general comment for all of your tables, complement the performance figures provided with the latency cycles for each realization.

5. Your conclusion in line 111 cannot be drawn from Table 4. Only one of the cited works reports power dissipation, and the underlying platform is not specified. How do we know that your design is "low-power"? Review this particular claim and complement Table 4 with the implementation platform used in each case.

6. Lastly, it is not clear to me that your solution represents an improvement over [9] or [12]. I understand the difficulties for achieving a fair comparison when the other works make use of DSPs/BRAMs, but you could compare your work with scalar multiplication architectures implemented under similar conditions, even if they target different curve models.

Minor.

6. Algorithm 1 has a duplicated reduction step.
7. The wording of multiple sections can be improved. These include but are not limited to: line 66, line 88, paragraph starting in line 96.
8. References 9 and 17 need to be corrected.

Author Response

1. Ref. [5] collects the 'add-2008-bbjlp' formula which is depicted in Figure 2. However I couldn't find the respective formulae for figures 1 and 3. Include the mathematical representation of these formulas or a direct cite to the paper where they are reported.

A. The formulae have a minor change, which is included in the new revision.


2. Consider providing a script with a realization of Algorithm 2. SageMath might be a good implementation choice.

A. MAPLE code for Algorithm 2 has been added in Appendix A.



3. Table 2 must report the underlying field length used in each work. More concerning yet, I was unable to find the quoted results in the respective references. If you are providing estimates as hinted in line 79, this needs to be stated in the table or the original results must be also provided.

A. Noted and edited.


4. As a general comment for all of your tables, complement the performance figures provided with the latency cycles for each realization. 

A. Noted. We used a dual clock: the higher 550 MHz clock for the modular multiplier and a lower rate of (550/4) = 137.5 MHz for the Point Multiplication (P.M.) state machine. The cycle count might therefore not be a good measure when considering the states needed to complete a P.M. 89655 clock cycles are required for CURVE25519 (including modular inversion). For ED25519 point multiplication, the two algorithms that we used are not constant-time and the number of cycles depends on the scalar value, so we preferred to report the average latency instead.


5. Your conclusion in line 111 cannot be drawn from Table 4. Only one of the cited works reports power dissipation, and the underlying platform is not specified. How do we know that your design is "low-power"? Review this particular claim and complement Table 4 with the implementation platform used in each case.

A. We ran the Xilinx Power Estimator for all designs and reflected the results in the paper. We also estimated the area of the recent works by remapping multipliers to LUTs (instead of using DSPs).

We also moved to the ZYNQ FPGA series to provide a fair platform for comparison.


6. Lastly, it is not clear to me that your solution represents an improvement over [9] or [12]. I understand the difficulties for achieving a fair comparison when the other works make use of DSPs/BRAMs, but you could compare your work with scalar multiplication architectures implemented under similar conditions, even if they target different curve models.


A. We estimated the area of the recent works by remapping multipliers to LUTs (instead of using DSPs) and moved to the ZYNQ FPGA series to provide a fair platform for comparison. The results in Table 2, Table 3 and Table 4 show that our design is smaller and consumes less power than [9] and [10]; it is faster than [12] while consuming the same power.




Minor.

6. Algorithm 1 has a duplicated reduction step. 

A. Yes, this algorithm performs at most two subtractions in series. It was presented in [15]; we cited this work without any change.
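Why two subtractions in series are enough can be seen in a software model. The Python sketch below is a textbook radix-2 form of interleaved modular multiplication, written here as an assumption. It is neither Algorithm 1 from [15] as printed in the manuscript nor the authors' radix-8 design. With the running value r < p before each step, 2r + b < 3p, so at most two subtractions of p restore r < p.

```python
P = 2**255 - 19  # field prime


def interleaved_modmul(a, b, p=P, nbits=255):
    """Radix-2 interleaved modular multiplication, scanning a from the MSB.

    Assumes 0 <= a, b < p and p < 2**nbits.
    """
    r = 0
    for i in reversed(range(nbits)):
        r <<= 1                 # r < 2p
        if (a >> i) & 1:
            r += b              # r < 3p
        if r >= p:              # first reduction step
            r -= p
        if r >= p:              # second reduction step
            r -= p
    return r


# Corner cases of the kind suggested in the reviews, checked against a * b % p.
for a in (0, 1, 2, P - 1):
    for b in (0, 1, 2, P - 1):
        assert interleaved_modmul(a, b) == a * b % P
```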


7. The wording of multiple sections can be improved. These include but are not limited to: line 66, line 88, paragraph starting in line 96.

A. Noted.


8. References 9 and 17 need to be corrected.

A. Corrected.


Reviewer 2 Report

The target of this paper seems to be Ed25519, but this is not appropriate.

It is just an implementation of modular arithmetic for a finite field of the size used in elliptic curve cryptography.

In particular, a hardware implementation, such as one on an FPGA, should be resistant to side-channel attacks; however, this is not addressed in the paper, for example with respect to Montgomery ladder scalar multiplication.

I cannot recommend acceptance or conditional acceptance of this paper.

Author Response

It is just an implementation of modular arithmetic for a finite field of the size used in elliptic curve cryptography. In particular, a hardware implementation, such as one on an FPGA, should be resistant to side-channel attacks; however, this is not addressed in the paper, for example with respect to Montgomery ladder scalar multiplication.


A. We respectfully but categorically disagree with the claim that every hardware implementation should be resistant against side-channel attacks. This greatly depends on the type of application, as illustrated by an abundant bibliography.

Furthermore, ED25519 is not immune against side-channel attacks even if we use the Montgomery ladder algorithm to implement point multiplication. In this work we used double-and-add and NAF to have a fair comparison with other works (the Montgomery ladder needs 254 additions and doublings).

Our results support Bernstein's paper [High-speed high-security signatures], available at https://ed25519.cr.yp.to/ed25519-20110926.pdf, where he says:

"Of course, there is much more to say about countermeasures to hardware side channel attacks; we do not claim that any single countermeasure is adequate by itself. The software situation is simpler, since the side channels exposed to an attacker are much more limited."

Figure 1 and Figure 2 in our paper show that point doubling and point addition have nearly the same number of computation levels, which makes power analysis harder but not impossible.

We are working on side-channel attacks on ED25519 that use machine learning to extract the private key; this is a continuation of our published work: https://doi.org/10.3390/app9010064. Our results show that the private key can be extracted even when using the Montgomery ladder or always-double-and-add algorithms.
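To illustrate why the cost of double-and-add and NAF depends on the scalar, the following Python sketch counts the point additions each recoding requires. It is an illustrative model with hypothetical function names, not the authors' implementation. A Montgomery ladder, by contrast, performs one addition and one doubling per scalar bit regardless of the bit values.

```python
def naf(k):
    """Non-adjacent form of k > 0, least significant digit first.

    Digits are in {-1, 0, 1} and no two adjacent digits are both nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def additions_binary(k):
    """Point additions in left-to-right double-and-add (k > 0)."""
    return bin(k).count("1") - 1


def additions_naf(k):
    """Point additions or subtractions when scanning the NAF of k (k > 0)."""
    return sum(1 for d in naf(k) if d) - 1


k = 255  # binary 11111111, NAF 1 0 0 0 0 0 0 0 -1 (most significant first)
assert sum(d * 2**i for i, d in enumerate(naf(k))) == k
assert additions_binary(k) == 7
assert additions_naf(k) == 1
```

Because the number of additions varies with the scalar in both recodings, reporting an average latency, as the authors do for ED25519, is the natural choice for these algorithms.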

Round 2

Reviewer 1 Report

Dear authors, thank you for your changes to the manuscript.

In this new version most of my concerns have been addressed.

A couple additional recommendations:

- Describe the method employed for estimating the power with the XPE tool.

- Table 4 could be merged into Table 3.

- Include your justification for not considering SCA resilience in your designs, as described to the other reviewer.

Author Response

Describe the method employed for estimating the power with the XPE tool.

A. The Xilinx Power Estimator tool takes the FPGA resources used, the clock frequency, and the fan-out and toggle rate of the signals to estimate power consumption. We used the data provided by the works in [9,10,11,12] and [17,18,19], assumed the default average fan-out for signals, and set the toggle rate of DSP signals to 100% and that of other signals to the default value of 12.5%.

Table 4 could be merged into Table 3.

       A. Noted and revised.

Include your justification for not considering SCA resilience in your designs, as described to the other reviewer.  

      A. Section "Side channel attacks considerations" added. 
