High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA
Abstract
1. Introduction
 1. We propose a high-speed, unified ECC processor that is generic for arbitrary prime moduli on Weierstrass curves. To the best of our knowledge, it is the fastest generic implementation in the existing literature.
 2. For the underlying architecture, we propose a novel and fast pipelined Montgomery Modular Multiplier (pMMM), which is constructed from an n-bit pipelined multiplier-accumulator. The speedup comes from combining two existing multiplication algorithms, schoolbook long multiplication and Karatsuba–Ofman multiplication, which enables parallelization of the digit multiplications while preserving low complexity. Moreover, we utilize DSP cores as digit multipliers, resulting in a higher-speed multiplier than other existing methods.
 3. To balance the speed of our fast pMMM, we also propose a unified and pipelined Modular Adder/Subtractor (pMAS) for the underlying field arithmetic operations. In particular, we modify the modular adder/subtractor in [11] to support pipelining and employ an adjustable radix. The proposed design offers better flexibility in adjusting the performance of the ECC processor.
 4. Additionally, we propose a more efficient and compact scheduling of the Montgomery ladder for the ECPM algorithm in [22]. Our implementation requires no additional temporary register, whereas the original algorithm requires one. As a result, one ladder step per scalar bit takes only 97 clock cycles (for a 256-bit modulus).
 5. Since our ECC processor and the underlying field multiplier (i.e., pMMM) are generic for arbitrary prime moduli, we can support multi-curve parameters in a single ECC processor, forming a unified ECC architecture.
 6. Lastly, our architecture performs the ECPM in constant time by employing a time-invariant algorithm for each module, including the use of Fermat’s little theorem to carry out field inversion, which hardens the design against timing-based side-channel attacks.
2. Preliminaries
2.1. Hamburg’s Formula for ECPM with Montgomery Ladder
Algorithm 1 Hamburg’s Montgomery Ladder Formula [22].  
Input: $({X}_{QP},{X}_{RP},{Y}_{Q},{Y}_{R},G)$  
Output: $({X}_{SP},{X}_{TP},{Y}_{S},{Y}_{T},{G}^{\prime})$  


Algorithm 2 Montgomery Ladder. 
Input: $k, q \le {2}^{n}$, $P \in E\left({\mathbb{F}}_{p}\right)$
Rewrite: $k \leftarrow {2}^{n} + (k - {2}^{n} \bmod q)$
Output: $Q = kP$
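The scalar rewriting step of Algorithm 2 can be illustrated in a few lines of Python. This is a minimal sketch, assuming the rewrite is read as $k \leftarrow 2^{n} + ((k - 2^{n}) \bmod q)$ with a non-negative remainder and that $q$ is the order of $P$; the function name `rewrite_scalar` and the toy parameters are hypothetical. Under that reading the rewritten scalar is congruent to $k$ modulo $q$ and always has bit $n$ set, so the ladder can process a fixed number of bits regardless of the value of $k$.

```python
def rewrite_scalar(k: int, q: int, n: int) -> int:
    """Return 2**n + ((k - 2**n) mod q); assumes 0 < q <= 2**n (hypothetical helper)."""
    if not 0 < q <= (1 << n):
        raise ValueError("expected 0 < q <= 2**n")
    # Python's % always yields a value in [0, q) for a positive modulus.
    return (1 << n) + ((k - (1 << n)) % q)


if __name__ == "__main__":
    n, q = 8, 251  # toy parameters, not a real curve order
    for k in (1, 5, 200, 250):
        k2 = rewrite_scalar(k, q, n)
        # Same scalar modulo q, and always exactly n + 1 bits long.
        print(k, k2, k2 % q == k % q, k2.bit_length() == n + 1)
```

Because the result lies in $[2^{n}, 2^{n} + q)$ and $q \le 2^{n}$, its bit length is always $n + 1$.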

2.1.1. Ladder Setup
2.1.2. Ladder Final
2.2. Montgomery Modular Multiplication
Algorithm 3 Montgomery Multiplication.  
Input: an odd modulus $p$ of $n$ bits, $R={2}^{n}$, $\gcd(R,p)=1$, $M=-{p}^{-1} \bmod R$, and $A,B$ with $A,B<p<R$
Output: $AB{R}^{-1} \bmod p$  
1: $x\leftarrow AB$  ▹ 1st multiplication 
2: $s\leftarrow (x \bmod R)\,M \bmod R$  ▹ 2nd multiplication 
3: $t\leftarrow (x+sp)/R$  ▹ 3rd multiplication 
4: $u\leftarrow t-p$  ▹ subtraction 
5: if $u<0$ then  ▹ MSB of u 
6:   return t
7: else
8:   return u
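As a small worked instance of Algorithm 3 (toy values chosen here for illustration, not taken from the design), let $p = 13$, $n = 4$ and $R = 16$, so that $M = -p^{-1} \bmod R = 11$, since $13 \cdot 11 = 143 \equiv -1 \pmod{16}$. For $A = 7$ and $B = 5$:

$$
\begin{aligned}
x &= AB = 35,\\
s &= (x \bmod R)\,M \bmod R = 3 \cdot 11 \bmod 16 = 1,\\
t &= (x + sp)/R = (35 + 13)/16 = 3,\\
u &= t - p = -10 < 0 \;\Rightarrow\; \text{return } t = 3.
\end{aligned}
$$

The result agrees with the definition: $3 \cdot R = 48 \equiv 9 \pmod{13}$ and $AB = 35 \equiv 9 \pmod{13}$, hence $3 = ABR^{-1} \bmod p$.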
3. Proposed Architecture
3.1. Pipelined Montgomery Modular Multiplication (pMMM)
3.1.1. Overview of pMMM
3.1.2. Proposed Pipelined Multiplier-Accumulator
 Stage 1: The two inputs A and B are split into digits according to the radix (digit size), which is 16 bits in our design. Parallel 16-bit ripple-carry adders (RCAs) then compute ${a}_{i}+{a}_{j}$ and ${b}_{i}+{b}_{j}$. At the same time, parallel DSP cores are used as 16-bit digit multipliers to compute ${a}_{k}{b}_{k}$. As shown in Figure 2a, we employ a two-stage pipeline for the DSP cores to achieve better performance, as recommended in [32].
 Stage 2: We again use the DSP cores, this time as 17-bit Multiply-Accumulate (MAC) units, to compute the Karatsuba–Ofman term $({a}_{i}+{a}_{j})({b}_{i}+{b}_{j})-{a}_{i}{b}_{i}$. The operands $({a}_{i}+{a}_{j})$ and $({b}_{i}+{b}_{j})$ are taken from the RCA outputs of the first stage, as shown in Figure 2b.
 Stage 3: The outputs of the 16-bit multipliers, ${a}_{k}{b}_{k}$, are routed to the accumulator inputs of the MAC modules as ${a}_{i}{b}_{i}$.
 Stage 4: The final accumulation for Karatsuba–Ofman is computed by a 34-bit RCA; the expression $({a}_{i}+{a}_{j})({b}_{i}+{b}_{j})-{a}_{i}{b}_{i}-{a}_{j}{b}_{j}$ yields a 33-bit result. At this stage, $mul\_ocrdy$ is set when the CTL value is 3, which indicates that the input $mul\_ic$ is ready to be included in the carry-save adder tree (CSAT) at Stage 5 as the final accumulation of the Montgomery reduction algorithm (Algorithm 3).
 Stage 5: Before being processed by the CSAT, all intermediate values are aligned to reduce both the number of CSAT inputs and the depth of the tree. Alignment is needed because each intermediate value carries one additional bit, i.e., it is 33 bits long instead of 32. Figure 3 shows an example of the alignment process for a four-input CSAT. All aligned intermediate values, including the input $mul\_ic$, are summed by the CSAT, whose compressor components use LUT6_2 primitives, similar to the 3:2 compressor circuit proposed in [11]. However, whereas [11] uses several kinds of compressor circuits (e.g., a 4:2 compressor) to construct the multiplier, we employ a homogeneous 3:2 compressor to achieve balanced performance, as illustrated in Figure 4.
 Stage 6 and 7: The $sum$ and $carry$ outputs of the CSAT are then fed to a carry-select adder to obtain the final product. We use the carry-select adder proposed by Nguyen et al. [33] because of its relatively short propagation delay. In this adder, both options for the carry are computed, and the carry is then resolved similarly to a carry-lookahead adder (CLA). Lastly, the sum output is generated with the final carry for each bit [34].
 Stage 8: A register holds the output $mul\_or$. The outputs $o\_val$ and $o\_ctl$ correspond to the input values $i\_val$ and $i\_ctl$, respectively, which are shifted through the stages via a shift register.
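The arithmetic carried out across these stages can be modeled in software. The following Python sketch is an illustrative model only, not the pipelined hardware: it splits the operands into 16-bit digits, computes the diagonal products $a_k b_k$ directly, and obtains each pair of cross products from the Karatsuba–Ofman identity $a_i b_j + a_j b_i = (a_i + a_j)(b_i + b_j) - a_i b_i - a_j b_j$. The names `split_digits` and `hybrid_multiply` are hypothetical.

```python
DIGIT_BITS = 16  # radix (digit size) used in the text


def split_digits(x: int, count: int) -> list:
    """Split x into `count` little-endian digits of DIGIT_BITS bits each."""
    mask = (1 << DIGIT_BITS) - 1
    return [(x >> (DIGIT_BITS * k)) & mask for k in range(count)]


def hybrid_multiply(a: int, b: int, n_bits: int) -> int:
    """Multiply two n_bits-wide operands with schoolbook layout and Karatsuba cross terms."""
    count = -(-n_bits // DIGIT_BITS)  # ceiling division
    da, db = split_digits(a, count), split_digits(b, count)

    # Diagonal digit products a_k * b_k (the 16-bit multipliers of Stage 1).
    diag = [da[k] * db[k] for k in range(count)]

    acc = 0
    for k in range(count):
        acc += diag[k] << (2 * DIGIT_BITS * k)

    # Cross terms: one 17-bit multiplication replaces two 16-bit ones.
    for i in range(count):
        for j in range(i + 1, count):
            cross = (da[i] + da[j]) * (db[i] + db[j]) - diag[i] - diag[j]
            acc += cross << (DIGIT_BITS * (i + j))
    return acc


if __name__ == "__main__":
    a = 0x1234_5678_9ABC_DEF0
    b = 0x0FED_CBA9_8765_4321
    print(hybrid_multiply(a, b, 64) == a * b)
```

Each cross term is at most $2(2^{16}-1)^2 < 2^{33}$, which matches the 33-bit length mentioned for Stage 4.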
3.1.3. Montgomery Modular Multiplication Using pMMM
 1. The pMMM starts by multiplying the n-bit inputs $pmmm\_ia$ and $pmmm\_ib$, resulting in a $2n$-bit product, which is then stored in a first-in, first-out (FIFO) buffer. This product is used later in the third multiplication. The FIFO buffer uses block RAM (BRAM) to reduce the required number of registers, and its depth depends on the number of multiplication processes that can be executed concurrently.
 2. The n LSBs of the product from Step 1 are multiplied by the precalculated constant $PARAM\_M$.
 3. The n LSBs of the product from Step 2 are then multiplied by the modulus $PARAM\_P$. In this multiplication, the product stored in the FIFO in Step 1 is used as the input $mul\_ic$ and is included in the CSAT of the multiplier module. This avoids the need for an additional $2n$-bit adder.
 4. The n MSBs of the third multiplication result are then evaluated and corrected using the carry-select subtractor, so that the output of pMMM lies in the range $[0, p)$.
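The four steps above can be mirrored in a short software model. The sketch below is a behavioral illustration under stated assumptions, not the hardware design: the function names are hypothetical, `pow(p, -1, R)` (Python 3.8+) stands in for the precalculated constant, and the FIFO is represented simply by keeping the first product in a variable.

```python
def montgomery_constant(p: int, n: int) -> int:
    """Return M = -p^{-1} mod 2**n for an odd modulus p (plays the role of PARAM_M)."""
    r = 1 << n
    return (-pow(p, -1, r)) % r


def pmmm_model(a: int, b: int, p: int, n: int) -> int:
    """Behavioral model of the three multiplications and the final correction."""
    mask = (1 << n) - 1
    m = montgomery_constant(p, n)

    x = a * b                      # Step 1: 2n-bit product, kept for later (FIFO)
    s = ((x & mask) * m) & mask    # Step 2: n LSBs times the constant, keep n LSBs
    t = (s * p + x) >> n           # Step 3: multiply by p, accumulate x, keep the MSBs
    u = t - p                      # Step 4: trial subtraction
    return t if u < 0 else u       # select the value that lies in [0, p)


if __name__ == "__main__":
    p = 2**256 - 2**32 - 977       # a 256-bit prime used only as test data
    a, b, n = 0x1234567, 0x89ABCDEF, 256
    out = pmmm_model(a, b, p, n)
    # out * 2**n must be congruent to a * b modulo p.
    print((out << n) % p == (a * b) % p, 0 <= out < p)
```

Adding `x` inside the third multiply-accumulate corresponds to feeding the stored product into the adder tree rather than using a separate wide adder.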
3.2. Pipelined Modular Adder/Subtractor (pMAS)
3.3. Modular Inversion Implementation
Algorithm 4 Constant-time Field Inversion Algorithm 
Input: $a$ and a prime modulus $p$ of $n$ bits, $0\le a<p$
Output: ${a}^{-1} \bmod p$
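Field inversion by Fermat’s little theorem computes $a^{-1} \equiv a^{p-2} \pmod p$ for a prime $p$ and $a \ne 0$. The sketch below is a generic left-to-right square-and-multiply model with a fixed iteration count; it is an assumption about the general shape of such an algorithm, not a reproduction of the steps of Algorithm 4, and Python integers give no real constant-time guarantee. The name `fermat_inverse` is hypothetical.

```python
def fermat_inverse(a: int, p: int, n: int) -> int:
    """Model of a^(p-2) mod p using exactly n squarings and n multiplications."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no inverse modulo p")
    exponent = p - 2
    if exponent.bit_length() > n:
        raise ValueError("n must cover every bit of p - 2")

    result = 1
    for i in reversed(range(n)):
        result = (result * result) % p     # always square
        candidate = (result * a) % p       # always multiply
        bit = (exponent >> i) & 1
        # Select without skipping work; hardware would use a multiplexer here.
        result = bit * candidate + (1 - bit) * result
    return result


if __name__ == "__main__":
    p = 2**256 - 2**32 - 977               # a 256-bit prime used only as test data
    a = 0xDEADBEEF
    inv = fermat_inverse(a, p, 256)
    print((a * inv) % p == 1)
```

The number of field operations depends only on $n$, not on the value of $a$ or on the bit pattern of the exponent, which is the property a time-invariant inversion relies on.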

3.4. Montgomery Ladder Scheduling
3.5. Generic ECC Architecture
Unified Architecture
4. Hardware Implementation Result and Discussion
4.1. Result and Analysis of Generic Implementation on Weierstrass Curve
4.2. Result and Analysis of Unified ECC Architecture
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ullah, H.; Nair, N.G.; Moore, A.; Nugent, C.; Muschamp, P.; Cuevas, M. 5G communication: An overview of vehicle-to-everything, drones, and healthcare use-cases. IEEE Access 2019, 7, 37251–37268.
2. Park, J.H.; Park, J.H. Blockchain security in cloud computing: Use cases, challenges, and solutions. Symmetry 2017, 9, 164.
3. Suárez-Albela, M.; Fernández-Caramés, T.M.; Fraga-Lamas, P.; Castedo, L. A Practical Performance Comparison of ECC and RSA for Resource-Constrained IoT Devices. In Proceedings of the 2018 Global Internet of Things Summit (GIoTS), Bilbao, Spain, 4–7 June 2018; pp. 1–6.
4. Wood, G. Ethereum: A secure decentralised generalised transaction ledger. arXiv 2014, arXiv:1011.1669v3.
5. Mai, L.; Yan, Y.; Jia, S.; Wang, S.; Wang, J.; Li, J.; Ma, S.; Gu, D. Accelerating SM2 Digital Signature Algorithm Using Modern Processor Features. In Proceedings of the International Conference on Information and Communications Security, Beijing, China, 15–17 December 2019; pp. 430–446.
6. Yang, A.; Nam, J.; Kim, M.; Choo, K.K.R. Provably-secure (Chinese government) SM2 and simplified SM2 key exchange protocols. Sci. World J. 2014, 2014, 825984.
7. Blake-Wilson, S.; Bolyard, N.; Gupta, V.; Hawk, C.; Moeller, B. Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS). RFC 4492, IETF. 2006. Available online: https://tools.ietf.org/html/rfc4492 (accessed on 28 December 2020).
8. National Institute of Standards and Technology. FIPS 186-4–Digital Signature Standard (DSS); National Institute of Standards and Technology: Gaithersburg, MD, USA, 2013.
9. Mehrabi, M.A.; Doche, C.; Jolfaei, A. Elliptic curve cryptography point multiplication core for hardware security module. IEEE Trans. Comput. 2020, 69, 1707–1718.
10. Gallant, R.P.; Lambert, R.J.; Vanstone, S.A. Faster point multiplication on elliptic curves with efficient endomorphisms. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 19–23 August 2001; pp. 190–200.
11. Roy, D.B.; Mukhopadhyay, D. High-speed implementation of ECC scalar multiplication in GF(p) for generic Montgomery curves. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 1587–1600.
12. Costello, C.; Longa, P.; Naehrig, M. Efficient algorithms for supersingular isogeny Diffie-Hellman. In Proceedings of the Annual International Cryptology Conference, Santa Barbara, CA, USA, 14–18 August 2016; pp. 572–601.
13. Miller, V.S. The Weil pairing, and its efficient calculation. J. Cryptol. 2004, 17, 235–261.
14. Asif, S.; Hossain, M.S.; Kong, Y.; Abdul, W. A fully RNS based ECC processor. Integration 2018, 61, 138–149.
15. Bajard, J.C.; Merkiche, N. Double level Montgomery Cox-Rower architecture, new bounds. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Paris, France, 5–7 November 2014; pp. 139–153.
16. Ma, Y.; Liu, Z.; Pan, W.; Jing, J. A High-Speed Elliptic Curve Cryptographic Processor for Generic Curves over GF(p). In Proceedings of the International Conference on Selected Areas in Cryptography, Burnaby, BC, Canada, 14–16 August 2013; pp. 421–437.
17. Shah, Y.A.; Javeed, K.; Azmat, S.; Wang, X. A high-speed RSD-based flexible ECC processor for arbitrary curves over general prime field. Int. J. Circuit Theory Appl. 2018, 46, 1858–1878.
18. Lai, J.Y.; Wang, Y.S.; Huang, C.T. High-performance architecture for elliptic curve cryptography over prime fields on FPGAs. Interdiscip. Inf. Sci. 2012, 18, 167–173.
19. Vliegen, J.; Mentens, N.; Genoe, J.; Braeken, A.; Kubera, S.; Touhafi, A.; Verbauwhede, I. A compact FPGA-based architecture for elliptic curve cryptography over prime fields. In Proceedings of the ASAP 2010-21st IEEE International Conference on Application-Specific Systems, Architectures and Processors, Rennes, France, 7–9 July 2010; pp. 313–316.
20. Hu, X.; Zheng, X.; Zhang, S.; Cai, S.; Xiong, X. A low hardware consumption elliptic curve cryptographic architecture over GF(p) in embedded application. Electronics 2018, 7, 104.
21. Karatsuba, A.A.; Ofman, Y.P. Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk. Russ. Acad. Sci. 1962, 145, 293–294.
22. Hamburg, M. Faster Montgomery and double-add ladders for short Weierstrass curves. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020.
23. Ding, J.; Li, S.; Gu, Z. High-Speed ECC Processor Over NIST Prime Fields Applied With Toom–Cook Multiplication. IEEE Trans. Circuits Syst. Regul. Pap. 2019, 66, 1003–1016.
24. Devlin, B. Blockchain Acceleration Using FPGAs: Elliptic Curves, zkSNARKs, and VDFs; ZCASH Foundation, 2019; Available online: https://github.com/ZcashFoundation/zcashfpga (accessed on 28 December 2020).
25. Alrimeih, H.; Rakhmatov, D. Fast and Flexible Hardware Support for ECC Over Multiple Standard Prime Fields. IEEE Trans. Very Large Scale Integr. Syst. 2014, 22, 2661–2674.
26. Güneysu, T.; Paar, C. Ultra High Performance ECC over NIST Primes on Commercial FPGAs. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2008; pp. 62–78.
27. Fan, J.; Verbauwhede, I. An updated survey on secure ECC implementations: Attacks, countermeasures and cost. In Cryptography and Security: From Theory to Applications; Springer: Berlin/Heidelberg, Germany, 2012; pp. 265–282.
28. Galbally, J. A new Foe in biometrics: A narrative review of side-channel attacks. Comput. Secur. 2020, 96, 101902.
29. Montgomery, P.L. Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. 1987, 48, 243–264.
30. Montgomery, P.L. Modular Multiplication Without Trial Division. Math. Comput. 1985.
31. Xilinx. UG953: Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide. Available online: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug953vivado7serieslibraries.pdf (accessed on 28 December 2020).
32. Xilinx. 7 Series DSP48E1 Slice User Guide. 2018. Available online: https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf (accessed on 28 December 2020).
33. Nguyen, H.D.; Pasca, B.; Preußer, T.B. FPGA-specific arithmetic optimizations of short-latency adders. In Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications, Chania, Greece, 5–7 September 2011; pp. 232–237.
34. Massolino, P.M.C.; Longa, P.; Renes, J.; Batina, L. A Compact and Scalable Hardware/Software Co-design of SIKE. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020.
35. Liskov, M. Fermat’s Little Theorem. In Encyclopedia of Cryptography and Security; Springer: Boston, MA, USA, 2005; p. 221.
36. Kawamura, S.; Koike, M.; Sano, F.; Shimbo, A. Cox-rower architecture for fast parallel montgomery multiplication. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Bruges, Belgium, 14–18 May 2000; pp. 523–538.
37. Qu, M. Sec 2: Recommended Elliptic Curve Domain Parameters; Tech. Rep. SEC2-Ver-0.6; Certicom Res.: Mississauga, ON, Canada, 1999.
38. Hu, X.; Zheng, X.; Zhang, S.; Li, W.; Cai, S.; Xiong, X. A high-performance elliptic curve cryptographic processor of SM2 over GF(p). Electronics 2019, 8, 431.
39. Lochter, M.; Merkle, J. Elliptic Curve Cryptography (ECC) Brainpool Standard Curves and Curve Generation. RFC 5639, IETF. 2010. Available online: https://tools.ietf.org/html/rfc5639 (accessed on 28 December 2020).
40. Amiet, D.; Curiger, A.; Zbinden, P. Flexible FPGA-Based Architectures for Curve Point Multiplication over GF(p). In Proceedings of the 19th Euromicro Conference on Digital System Design, DSD 2016, Limassol, Cyprus, 31 August–2 September 2016.
41. Wu, T.; Wang, R. Fast unified elliptic curve point multiplication for NIST prime curves on FPGAs. J. Cryptogr. Eng. 2019, 9, 401–410.
42. Morales-Sandoval, M.; Diaz-Perez, A. Novel algorithms and hardware architectures for Montgomery Multiplication over GF(p). IACR Cryptol. ePrint Arch. 2015, 2015, 696.
| Designs | Platform | Slices | DSP | BRAM | Max. Freq. (MHz) | Cycles | Time (ms) | Time × Area $^{a}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| This work | Virtex-7 | 6909 | 136 | 15 | 232.3 | 32.3k | 0.139 | 0.96 |
| This work | Kintex-7 | 7115 | 136 | 15 | 234.1 | 32.3k | 0.138 | 0.98 |
| This work | XC7Z020 | 7077 | 136 | 15 | 156.8 | 32.3k | 0.206 | 1.46 |
| Roy et al. [11] | XC7Z020 | 2223 | 40 | 9 | 208.3 | 95.5k | 0.459 | 1.02 |
| Bajard et al. [15] | Kintex-7 | 1630 | 46 | 16 | 281.5 | 172.3k | 0.612 | 1.00 |
| Asif et al. [14] | Virtex-7 | 18.8k (LUT) | - | - | 86.6 | 63.2k | 0.730 | 3.43 $^{b}$ |
| Ma et al. [16] | Virtex-5 | 1725 | 37 | - | 291 | 110.6k | 0.380 | 0.66 |
| Lai et al. [18] | Virtex-5 | 3657 | 10 | 10 | 263 | 226.2k | 0.860 | 3.15 |
| Shah et al. [17] | Virtex-6 | 44.3k (LUT) | - | - | 221 | 143.7k | 0.650 | 7.20 $^{b}$ |
| Vliegen et al. [19] | Virtex-II Pro | 1947 | 7 | 9 | 68.17 | 1074.4k | 15.760 | 30.68 |
| Hu et al. [20] | Virtex-4 | 9370 | - | - | 20.44 | 609.9k | 29.840 | 279.60 |
| Operation | Clock Cycles | Latency @ 234.1 MHz (ns) |
| --- | --- | --- |
| 1 × Input Modular Addition | 5 | 21.36 |
| 3 × Input Modular Addition | 7 | 29.90 |
| 1 × Modular Multiplication | 26 | 111.07 |
| 4 × Modular Multiplication | 29 | 123.89 |
| Modular Inverse | 6911 | 29,523.79 |
| Ladder Setup | 131 | 559.63 |
| One Step Ladder Update | 97 | 414.38 |
| Ladder Finish | 7050 | 30,117.60 |
| One ECC Scalar Multiplication | 32,272 | 137,865.98 |
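The latency column follows from the cycle counts and the clock period. The short Python sketch below reproduces it, assuming a period of 4.272 ns (about 234.1 MHz); that period is inferred here from the tabulated values (21.36 ns for 5 cycles) rather than stated in the text, and the names are hypothetical.

```python
CLOCK_PERIOD_NS = 4.272  # assumption: inferred from 21.36 ns / 5 cycles in the table

CYCLES = {
    "1 x Input Modular Addition": 5,
    "3 x Input Modular Addition": 7,
    "1 x Modular Multiplication": 26,
    "4 x Modular Multiplication": 29,
    "Modular Inverse": 6911,
    "Ladder Setup": 131,
    "One Step Ladder Update": 97,
    "Ladder Finish": 7050,
    "One ECC Scalar Multiplication": 32272,
}


def latency_ns(cycles: int, period_ns: float = CLOCK_PERIOD_NS) -> float:
    """Convert a cycle count into a latency in nanoseconds, rounded to 2 decimals."""
    return round(cycles * period_ns, 2)


if __name__ == "__main__":
    for name, cycles in CYCLES.items():
        print(f"{name}: {cycles} cycles -> {latency_ns(cycles):,.2f} ns")
```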
| Resource | Used | Available | Utilization % |
| --- | --- | --- | --- |
| LUT | 22,736 | 433,200 | 5.25 |
| FF | 12,511 | 866,400 | 1.44 |
| Slice | 6909 | 108,300 | 6.38 |
| DSP48E1 | 136 | 3600 | 3.78 |
| BRAM | 15 | 1470 | 1.02 |
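| Designs | Curve | Modulus Size (Bits) | Slices | DSP | BRAM | Max. Freq. (MHz) | Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| This work | Any | 192 | 7281 | 136 | 15 * | 204.2 | 0.119 |
| This work | Any | 224 | 7281 | 136 | 15 * | 204.2 | 0.138 |
| This work | Any | 256 | 7281 | 136 | 15 * | 204.2 | 0.158 |
| Wu et al. [41] | NIST | 192 | 8411 | 32 | - | 310 | 0.296 |
| Wu et al. [41] | NIST | 224 | 8411 | 32 | - | 310 | 0.389 |
| Wu et al. [41] | NIST | 256 | 8411 | 32 | - | 310 | 0.526 |
| Wu et al. [41] | NIST | 384 | 8411 | 32 | - | 310 | 1.070 |
| Wu et al. [41] | NIST | 521 | 8411 | 32 | - | 310 | 1.860 |
| Amiet et al. [40] | Any | 192 | 6816 (LUT) | 20 | - | 225 | 0.690 |
| Amiet et al. [40] | Any | 256 | 6816 (LUT) | 20 | - | 225 | 1.490 |
| Amiet et al. [40] | Any | 384 | 6816 (LUT) | 20 | - | 225 | 4.080 |
| Amiet et al. [40] | Any | 521 | 6816 (LUT) | 20 | - | 225 | 9.700 |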
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Awaludin, A.M.; Larasati, H.T.; Kim, H. High-Speed and Unified ECC Processor for Generic Weierstrass Curves over GF(p) on FPGA. Sensors 2021, 21, 1451. https://doi.org/10.3390/s21041451