Abstract
Elliptic curve cryptography (ECC) over prime fields relies on scalar point multiplication realized by point addition and point doubling. Point addition and point doubling operations consist of many modular multiplications of large operands (256 bits for example), especially in projective and Jacobian coordinates which eliminate the modular inversion required in affine coordinates for every point addition or point doubling operation. Accelerating modular multiplication is therefore important for high-performance ECC. This paper presents the hardware implementations of modular multiplication algorithms, including (1) interleaved modular multiplication (IMM), (2) Montgomery modular multiplication (MMM), (3) shift-sub modular multiplication (SSMM), (4) SSMM with advance preparation (SSMMPRE), and (5) SSMM with CSAs and sign detection (SSMMCSA) algorithms, and evaluates their execution time (the number of clock cycles and clock frequency) and required hardware resources (ALMs and registers). Experimental results show that SSMM is 1.80 times faster than IMM, and SSMMCSA is 3.27 times faster than IMM. We also present the ECC hardware implementations based on the Secp256k1 protocol in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms, and investigate their cost and performance. Our ECC implementations can be applied to the design of hardware security module systems.
1. Introduction
The use of elliptic curves in cryptography was proposed by Neal Koblitz and Victor S. Miller independently in 1985 [1,2]. Elliptic curve cryptography (ECC) over the finite field of a prime m relies on the fact that the scalar point multiplication Q = dP can be computed efficiently, but it is computationally infeasible to recover the scalar d given only the original point P and the product point Q. Scalar point multiplication can be conducted with point addition (adding two points) and point doubling [3].
In conventional affine coordinates, point addition and point doubling require computing the slope of a line, which involves division. In both projective and Jacobian coordinates, the computation is reorganized so that division is eliminated during scalar point multiplication. One final division is required to convert a point from projective or Jacobian coordinates to affine coordinates to obtain the shared secret key based on the elliptic curve Diffie–Hellman (ECDH) key exchange. ECDH is a variant of the Diffie–Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. A modular inversion algorithm calculating a^(-1) mod m is given in [3]. Based on this, this paper provides a Verilog HDL implementation to calculate c = b * a^(-1) mod m. It can be used in affine coordinates to calculate the line slope, or in projective and Jacobian coordinates to convert points to affine coordinates in the final step for obtaining the shared secret key.
Point addition and point doubling operations consist of many modular multiplications, especially in projective and Jacobian coordinates. Accelerating modular multiplication is therefore important for high-performance ECC. The interleaved modular multiplication (IMM) algorithm [6] eliminates the need for division and has a lower hardware cost. To calculate p = a * b mod m with a, b < m, IMM iterates p = 2p + a * b_i, where b_i is the ith bit of b. To guarantee p < m, we can subtract at most 2m from p since 2p + a * b_i < 3m. To carry this out, the traditional IMM algorithm performs the following calculation twice sequentially: if p >= m, p = p - m. The work in [7] is an FPGA implementation of a processor for elliptic curve point multiplication over prime fields. It uses a modified IMM algorithm with a three-input multiplexer to select one from p, p - m, and p - 2m. Also, it uses a multiplexer to select one from p and p + a for p = p + a * b_i, where b_i is used as the multiplexer selection signal. The design in [8] is similar to [7], but uses two-input "AND" gates to implement a * b_i so that it performs p = p + 0 if b_i is 0, and p = p + a otherwise. The work in [9] is a low hardware consumption elliptic curve cryptographic architecture over prime fields for embedded applications. It performs p = 2p + a * b_i and uses a four-input multiplexer to select the value to be subtracted from 0, m, 2m, and 3m, for the quotient estimates 0, 1, 2, and 3, respectively. In fact, a three-input multiplexer is sufficient because 2p + a * b_i < 3m and hence at most 2m needs to be subtracted. The authors in [10] present an ECC architecture that adopts the modular multiplication algorithm proposed in [9]. The design in [11] implements an ECC processor over the NIST [5] prime fields. It uses two subtractors to prepare p - m and p - 2m and then uses a three-input multiplexer to select one from p, p - m, and p - 2m. It uses two comparison modules to generate the multiplexer selection signals (Figure 3 of paper [11]). This increases hardware costs. In fact, the most significant bits of the 258-bit subtraction results can be directly used as the multiplexer selection signals to obtain the final 256-bit product. The authors in [12] describe an implementation of a dual-field ECC processor. The radix-4 IMM algorithm checks two bits of b in each iteration, which reduces the number of iterations by half. Modular multiplications with a higher radix require fewer iterations, but the precomputation increases exponentially. The work in [13] is an ECC processor over the NIST P-256 [5] elliptic curve for real-time IoT (Internet of Things) applications. It splits the 256-bit inputs into four 64-bit parts and uses the schoolbook multiplication algorithm in a pipelined manner to obtain a 512-bit product. Finally, the P-256 fast modular reduction algorithm is used to realize the modular multiplication. Such an implementation can only support the P-256 curve. The Montgomery modular multiplication (MMM) algorithm [14] performs multiplication in the Montgomery domain and is very efficient for the modular exponentiation used by RSA cryptography. The modular exponentiation can be calculated by repeatedly calling MMM. However, transformations to the Montgomery domain are required before the calculations, and a transformation back to the regular domain is also required to obtain the final result. The domain transformation requires the value q = R^2 mod m, which is calculated by an expensive modulo operation. Using a precomputation of q for fixed R and m speeds up the calculations but reduces the flexibility of using different moduli m. The shift-sub modular multiplication (SSMM) algorithm uses shifts and subtractions to perform modular multiplication. The SSMM algorithm and its use in RSA cryptography are described in [15,16].
An algorithm using CSAs (carry save adders) [17] has also been proposed, but it requires either a modulo computation after the iterations or a precomputed look-up table for a fixed multiplier and modulus.
This paper focuses on radix-2 algorithms and proposes two enhanced versions of SSMM: SSMM with advance preparation (SSMMPRE) and SSMM with CSAs and sign detection (SSMMCSA) algorithms. We also present the ECC implementations based on Secp256k1 [4] protocol which is used in the Ethereum blockchain. The specific contributions of this paper are summarized as follows: (1) Based on SSMM, we propose SSMMPRE (SSMM with advance preparation) and SSMMCSA (SSMM with CSAs and sign detection) algorithms that have lower latency than SSMM. (2) The hardware implementations of IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost (required adaptive logic modules (ALMs) and registers) and performance (the number of clock cycles and clock frequency) are evaluated. (3) The hardware implementations of ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost and performance are evaluated and compared with those proposed in [7,8,9,10]. (4) Some important Verilog HDL source codes and their simulation waveform are included in Appendix A and Appendix B. The experimental results show that SSMM is 1.80 times faster than IMM and SSMMCSA is 3.27 times faster than IMM, and our ECC implementations perform better than those proposed in [7,8,9,10]. Compared to RSA cryptography, ECC provides stronger encryption with shorter key lengths [18]. The ECC implementations described in this paper can support other elliptic curves, such as the NIST P-256 (Secp256r1) [5] curve.
The rest of the paper is organized as follows. Section 2 introduces the background of ECC, including the point addition, point doubling, scalar point multiplication, modular inversion, ECDH key agreement protocol, and affine, projective, and Jacobian coordinates. Section 3 describes the modular multiplication algorithms, including the IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms. Section 4 presents hardware implementations of modular multiplications and ECC and evaluates their cost and performance. Comparisons with [7,8,9,10] are also given in this section. And Section 5 concludes the paper. Verilog HDL codes for SSMM and modular inversion are listed in Appendix A and Appendix B.
2. Elliptic Curve Cryptography Algorithms
This section describes ECC algorithms and ECDH that underlie this work. A top view of the relationship between these algorithms is shown in Figure 1.
Figure 1.
ECDH and ECC algorithms. The ECDH key exchange invokes a scalar point multiplication that uses two computations—point addition and point doubling. Four primitive modular calculations (addition, subtraction, multiplication, and inversion) are used for these two computations.
Modular multiplication and modular inversion are performed iteratively, as will be described later. Modular addition and modular subtraction are calculated in one clock cycle. The Verilog HDL codes for modular addition and modular subtraction used in our ECC implementations are listed below. The most significant bits of the 258-bit subtraction results, s_m[257] (in modadd) and sum[257] (in modsub), are used as the selection signals of the multiplexers.
module modadd (a, b, m, s); // s = (a + b) mod m
input [255:0] a, b, m;
output [255:0] s;
wire [257:0] sum = {2'b00,a} + {2'b00,b};
wire [257:0] s_m = sum - {2'b00,m};
assign s = s_m[257] ? sum[255:0] : s_m[255:0];
endmodule
module modsub (a, b, m, s); // s = (a - b) mod m
input [255:0] a, b, m;
output [255:0] s;
wire [257:0] sum = {2'b00,a} - {2'b00,b};
wire [257:0] s_m = sum + {2'b00,m};
assign s = sum[257] ? s_m[255:0] : sum[255:0];
endmodule
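For reference, a minimal simulation sketch of the two modules above is shown below; the testbench name and the small operand values are illustrative choices of ours and are not taken from the paper.
`timescale 1ns/1ns
module modadd_tb;
reg [255:0] a, b, m;
wire [255:0] s_add, s_sub;
modadd u0 (a, b, m, s_add); // s_add = (a + b) mod m
modsub u1 (a, b, m, s_sub); // s_sub = (a - b) mod m
initial begin
m = 256'd101; // small odd modulus for illustration
a = 256'd90;
b = 256'd25;
#1 $display("(a+b) mod m = %d, (a-b) mod m = %d", s_add, s_sub); // prints 14 and 65
$finish;
end
endmodule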
The interesting computations in ECC are point addition and point doubling. We first introduce these two computations on an elliptic curve over the real numbers. In the real number field, an elliptic curve can be defined in Weierstrass form as
y^2 = x^3 + ax + b (1)
Note that this curve is symmetrical about the x-axis. If point (x, y) is on an elliptic curve, then (x, -y) is also on the same elliptic curve.
2.1. ECC Point Addition and Doubling in Affine Coordinates
Affine coordinates use two coordinates to represent an elliptic curve point, for example, P = (x_p, y_p) and Q = (x_q, y_q). We will see that ECC point addition and point doubling in affine coordinates require expensive divisions.
2.1.1. ECC Point Addition in Affine Coordinates
Figure 2 shows point addition on an elliptic curve. Given two distinct points P = (x_p, y_p) and Q = (x_q, y_q) on the curve, if the line through P and Q intersects the curve at a third point S = (x_s, y_s), then R = P + Q is defined by x_r = x_s and y_r = -y_s. Because y_r = -y_s, the line through S and R is a vertical line.
Figure 2.
Point addition on an elliptic curve in the real number field.
Below we show how to obtain formulas to calculate x_r and y_r based on x_p, y_p, x_q, and y_q for R = P + Q. The formula of the line through P and Q is
y = lambda(x - x_p) + y_p (2)
where lambda = (y_q - y_p)/(x_q - x_p) is the slope of the line PQ. Squaring both the left side and the right side of Equation (2), we obtain y^2 = (lambda(x - x_p) + y_p)^2. Then, replacing y^2 of Equation (1) with this expression, we have
x^3 + ax + b - (lambda(x - x_p) + y_p)^2 = 0 (3)
Because P, Q, and S are three points on the curve, meaning that x_p, x_q, and x_s are three roots of Equation (3), based on Vieta's formulas, we have
(x - x_p)(x - x_q)(x - x_s) = 0 (4)
Expanding Equation (4) gives:
x^3 - (x_p + x_q + x_s)x^2 + (x_p x_q + x_p x_s + x_q x_s)x - x_p x_q x_s = 0 (5)
Then from Equations (3) and (5), comparing the coefficients of x^2, we have x_p + x_q + x_s = lambda^2. That is, x_s = lambda^2 - x_p - x_q and y_s = lambda(x_s - x_p) + y_p. Considering the line slope lambda = (y_q - y_p)/(x_q - x_p), x_r = x_s, and y_r = -y_s, we summarize the formulas for point addition on elliptic curve y^2 = x^3 + ax + b as follows.
lambda = (y_q - y_p)/(x_q - x_p) (6)
x_r = lambda^2 - x_p - x_q (7)
y_r = lambda(x_p - x_r) - y_p (8)
For Q = -P = (x_p, -y_p), the line through P and -P is vertical and does not intersect the elliptic curve at a third point. For this reason, the point O at infinity is included in the elliptic curve group and defined by P + (-P) = O. By this definition, P + O = P.
In practice, a group of elliptic curve points over a finite field GF(m) or GF(2^n) is used, where GF(2^n) contains numbers from 0 to 2^n - 1 and uses n-bit binary numbers. In the case of GF(m), the results of all the above calculations are reduced modulo m, where m is usually a prime number. For example, the Secp256k1 [4] elliptic curve used in the Ethereum blockchain uses a 256-bit prime m. Secp256k1 defines y^2 = x^3 + 7 (that is, a = 0 and b = 7) and gives a base point (x, y) on the elliptic curve as follows.
a = 0x0000000000000000000000000000000000000000000000000000000000000000
b = 0x0000000000000000000000000000000000000000000000000000000000000007
m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
By considering the point O at infinity, we give the point addition algorithm over the finite field GF(m) in Algorithm 1. In our implementation, O is denoted as (0, 0). In the case of P = Q, we perform point doubling (line 6 in the algorithm).
| Algorithm 1 PAA (P, Q, m, a) (point addition in affine coordinates). |
|
inputs: Points and ; m and a in
output:
begin
1 , , , ,
2 if return Q /* */
3 if return P /* */
4 if
5 if return O /* */
6 else return PDA () /* */
7
8
9
10 return /* */
end
|
An example of point addition on the Secp256k1 curve is shown below, where R = P + Q and P, Q, and R are in affine coordinates.
Px = 0xe493dbf1c10d80f3581e4904930b1404cc6c13900ee0758474fa94abe8c4cd13
Py = 0x51ed993ea0d455b75642e2098ea51448d967ae33bfbdfe40cfe97bdc47739922
Qx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Qy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
Rx = 0x2f8bde4d1a07209355b4a7250a5c5128e88b84bddc619ab7cba8d569b240efe4
Ry = 0xd8ac222636e5e3d6d4dba9dda6c9c426f788271bab0d6840dca87d3aa6ac62d6
2.1.2. ECC Point Doubling in Affine Coordinates
Figure 3 shows point doubling on an elliptic curve. Compared to the point addition shown in Figure 2, here we have Q = P and R = 2P. Given a point P = (x_p, y_p) on the curve, if the tangent line through P intersects the curve at S = (x_s, y_s), then R = 2P is defined by x_r = x_s and y_r = -y_s.
Figure 3.
Point doubling on an elliptic curve in the real number field.
Below we show how to obtain formulas to calculate x_r and y_r based on x_p, y_p, and a for R = 2P. The formula of the tangent line through P is
y = lambda(x - x_p) + y_p (9)
where lambda is the slope of the tangent line at P. Squaring both the left side and the right side of Equation (9), we obtain y^2 = (lambda(x - x_p) + y_p)^2. Then, replacing y^2 of Equation (1) with this expression, we have
x^3 + ax + b - (lambda(x - x_p) + y_p)^2 = 0 (10)
Because P, Q (= P), and S are three points on the curve, meaning that x_p, x_p, and x_s are three roots of Equation (10), based on Vieta's formulas, we have
(x - x_p)^2 (x - x_s) = 0 (11)
Expanding Equation (11) gives:
x^3 - (2x_p + x_s)x^2 + (x_p^2 + 2 x_p x_s)x - x_p^2 x_s = 0 (12)
Then from Equations (10) and (12), comparing the coefficients of x^2, we have 2x_p + x_s = lambda^2. That is, x_s = lambda^2 - 2x_p and y_s = lambda(x_s - x_p) + y_p. The slope lambda of the tangent line of y^2 = x^3 + ax + b at P can be obtained by implicit differentiation, 2y dy = (3x^2 + a) dx, as follows.
lambda = (3x_p^2 + a)/(2y_p)
Considering x_r = x_s and y_r = -y_s, we summarize the formulas for point doubling on elliptic curve y^2 = x^3 + ax + b as follows.
lambda = (3x_p^2 + a)/(2y_p)
x_r = lambda^2 - 2x_p
y_r = lambda(x_p - x_r) - y_p
We give the point doubling algorithm over the finite field GF(m) in Algorithm 2. For y_p = 0, the tangent at P is vertical and does not intersect the elliptic curve at any other point. By definition, 2P = O for such a point P (line 2 in the algorithm).
| Algorithm 2 PDA (P, m, a) (point doubling in affine coordinates). |
|
inputs: Point ; m and a in
output:
begin
1 , ,
2 if return O /* vertical tangent */
3
4
5
6 return /* */
end
|
An example of point doubling on the Secp256k1 curve is shown below, where R = 2P and P and R are in affine coordinates.
Px = 0xb91dc87409c8a6b81e8d1be7f5fc86015cfa42f717d31a27d466bd042e29828d
Py = 0xc35b462fb20bec262308f9d785877752e63d5a68e563e898b4f82f47594680fc
Rx = 0x2d4fca9e0dff8dec3476a677d555896a0980ebccc6bc595a23675496dcc33bb5
Ry = 0xcce413eee9496094256e446b22fd234c03d9258330d77fc8b0d318a6aedba8cb
2.2. ECC Point Addition and Doubling in Projective Coordinates
Point addition and point doubling in affine coordinates require modular inversion (division) to calculate the line slope lambda. We can eliminate the expensive modular inversion during the calculations by using projective coordinates. In projective coordinates, a point is defined as (X, Y, Z). Initially, we can convert a point (x, y) in affine coordinates to a point in projective coordinates by (X, Y, Z) = (x, y, 1). Then, we calculate the scalar point multiplication in projective coordinates using formulas that do not contain division. At the very final step, we can obtain the point in affine coordinates with the transformation x = X/Z and y = Y/Z, which requires divisions.
The formulas for ECC point addition and doubling in projective coordinates can be derived based on the formulas in affine coordinates. A point P in projective coordinates is represented by the triple (X, Y, Z), corresponding to the point (X/Z, Y/Z) in affine coordinates. That is, x = X/Z and y = Y/Z. We derive the point doubling formulas for R = 2P in projective coordinates as follows.
From Equation (17), we have
From Equation (18), we have
From Equation (19), we have
Let .
Because , i.e., , then
Because , i.e., , then
Given these results, the formulas for point doubling in projective coordinates are summarized below.
We can use a similar method to derive the formulas for point addition in projective coordinates. The derivation is omitted here but the calculations are shown in the following algorithm. It can be seen that the calculations require many more multiplications. Algorithm 3 gives the algorithm for point addition in projective coordinates. And Algorithm 4 gives the algorithm for point doubling in projective coordinates.
| Algorithm 3 PAP (P, Q, m, a) (point addition in projective coordinates). |
|
inputs: Points and ; m and a in
output:
begin
1 , , , , , ,
2 if return Q /* */
3 if return P /* */
4 if
5 if return O /* */
6 else return PDP () /* */
7 , , /* level 1 calculations */
8 , /* level 2 calculations */
9 /* level 3 calculations */
10 /* level 3 calculations */
11 /* level 3 calculations */
12 return /* */
end
|
| Algorithm 4 PDP (P, m, a) (point doubling in projective coordinates). |
|
inputs: Point ; m and a in
output:
begin
1 , , ,
2 if return O /* vertical tangent */
3 , /* level 1 calculations */
4 , /* level 2 calculations */
5 /* level 3 calculations */
6 /* level 3 calculations */
7 /* level 3 calculations */
8 return /* */
end
|
An example of point addition on the Secp256k1 curve is shown below, where R = P + Q and P, Q, and R are in projective coordinates.
Px = 0x61bac660b055382e5906bd6e56e316542194b799b7bcf5ad05ee2171fd81735a
Py = 0xbe44ac0a2b712ccb6bb3ea933e4db0a4213c139078aef594cf8c2c5c2924d54d
Pz = 0x2b6da6fb02877584dc4d5111c88783772d7be5ac2866cce3707d53913384bf49
Qx = 0x6789c1137724f1f3f585337a1814eebc23ea329a0390fd9b1b9ece7af3e71ce1
Qy = 0xc9563c5035ccafec8673f56185141f720073ab3063bb417bf0e70e9d9128c232
Qz = 0x58990cd022b711912676c0451bdab6be04a06c1871b0139214bdbe81fd965555
Rx = 0xf4e9cb9ba9c18876b7b0ad000ce921b35e23139456f4f6c3f70e2fea149500a0
Ry = 0xc06176a9221b6d8b49a22130fb934b21358a1775df68d93ec308aca3ece072b5
Rz = 0x878fb153f0690416ba0ee136ec663debf8472f3ee92d350f9b3a42b4fd53fb27
An example of point doubling on the Secp256k1 curve is shown below, where R = 2P and P and R are in projective coordinates.
Px = 0x24fd537e9a5125438a02848f6b74725f678723f5c1450b8fb82a68f0c88c9764
Py = 0xe42e83a1d3d7c2241535b5c0ba5f2462c24bd87aaf9f15b05f3775d168b9bf6c
Pz = 0xe00794d20b32e0e94472c36b89cf5e5d6ec769b53dd6c1422e9467090c272305
Rx = 0x24d00c48ac8bbe61ceb0ac5daf5defd913af9220a07650642a3a41cad9030ee6
Ry = 0x75d429714ea6ce1ab3811d9adc16961a219e2812210fa8465042c18ecd5a0de6
Rz = 0x644a5a2964435364b74d7fa79fe0f06a5b1d2782e7f7b8d1e835db6d6b8786bc
2.3. ECC Point Addition and Doubling in Jacobian Coordinates
A point P in Jacobian coordinates is represented by the triple (X, Y, Z), corresponding to the point (X/Z^2, Y/Z^3) [2] (p. 424) in affine coordinates. That is, x = X/Z^2 and y = Y/Z^3. We derive the point doubling formulas for R = 2P in Jacobian coordinates as follows.
From Equation (17), we have
From Equation (18), we have
From Equation (19), we have
Let .
Because , we have
Because , we have
Given these results, the formulas for point doubling in Jacobian coordinates are summarized below.
An example of point doubling on the Secp256k1 curve is shown below, where R = 2P and P and R are in Jacobian coordinates.
Px = 0xe43306185ef298127aef469d577aed78acafaddfc28ad0857491c38ffbedc475
Py = 0x4d83871239769596f65c180546c170a28cffca37bf6393025c457f406f54c517
Pz = 0xdafa620812722dceda7d93a91158dadbe11fee894e71eafa054d5f5fd274377e
Rx = 0x7d7cd6974d7e127a5fdf3f3c9c9eb5dcd9c15e033794466de63bcf2b9548ff85
Ry = 0x8f967b514296945a6dd052bca59ec1418a35cde3c6dd7b269d2e71daa80f851e
Rz = 0xcf89daf8b6c736cc851882b7e85c8ea8f703a9323a3d627909582b7904766035
Based on Equations (6)–(8) and the substitutions x = X/Z^2 and y = Y/Z^3, the formulas for point addition in Jacobian coordinates can be derived; they are omitted here. An example of point addition on the Secp256k1 curve is shown below, where R = P + Q and P, Q, and R are in Jacobian coordinates.
Px = 0xb7bae589ec8a8c722c1ffb2c37fd4bbeda59074675c3eb50f1673ed46bbedfbe
Py = 0x81dee3398bdd718591c10762f61a0e41c4d609dffddcbeeb3894b8c4ce75e027
Pz = 0x51ec57b21350ad3d3466be5a7d28742279fbac1146fb4143767ee368a7dc741e
Qx = 0x0aa20a04dba4788e9b99e10f2e9f4d43b7f53916a5cacf2050dc70bc34c18d21
Qy = 0x60f318d01180b303f4b20a49c2b7e2b498405f88bd423a9a7cb92bab5f1b6abf
Qz = 0x3e1dcb6efa88b113f40b5858ea8c3cb5ddae2277c4683af9487e27023cba690d
Rx = 0x348248c47ad5d3186bd807c382659263840ba7ea13e61128d24337db9b0e5278
Ry = 0xb38573698dd6fef9ec93a9d68ae0a997191b678474cc00a13961defa3ed763e9
Rz = 0x76da2008046aae9901e6b96a7b54c42fd480de5cbc8cef6c0d0a9b3d7086f0f4
The point (x, y) in affine coordinates can be obtained from Jacobian coordinates by x = X/Z^2 and y = Y/Z^3. This is only performed one time at the final step. It can be conducted with the modular inversion which we will introduce next.
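As a sketch (our illustration; the modular inversion is introduced in the next subsection, and the modinv module of Appendix B computes c = b * a^(-1) mod m), the final conversion can be organized as:
Z2 = Z * Z mod m
Z3 = Z2 * Z mod m
x = X * Z2^(-1) mod m (one modinv call with b = X, a = Z2)
y = Y * Z3^(-1) mod m (one modinv call with b = Y, a = Z3)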
2.4. Modular Inversion
In affine coordinates, the point addition and point doubling algorithms must compute the slope of a line. Projective and Jacobian coordinates require the point to be transformed into affine coordinates at the final step to obtain the shared secret key. These calculations or transformations require division and modulo operations.
Generally, the modular inversion calculates c = b * a^(-1) mod m, where m is an n-bit odd number and a, b < m. An example of modular inversion is shown below.
b = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85
a = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8
m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
c = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
Algorithm 2.22 in [3] gives an algorithm to calculate a^(-1) mod m using the extended Euclidean algorithm. Based on this, we give a Verilog HDL implementation of the modular inversion that calculates c = b * a^(-1) mod m in Appendix B. Figure A2 shows the simulation waveform generated with ModelSim.
In affine coordinates, every point addition or point doubling invokes modular inversion to calculate the line slope lambda. In projective or Jacobian coordinates, this modular inversion is removed from the point addition and point doubling calculations, but a final division is required to transform a point from projective or Jacobian to affine coordinates to obtain the shared secret key. This division can be achieved using modular inversion.
2.5. Scalar Point Multiplication
Scalar point multiplication performs Q = dP, where P and Q are elliptic curve points and d is an n-bit scalar. Scalar point multiplication can be conducted with the "double-and-add" method [3]. Algorithm 5 formally gives the algorithm for scalar point multiplication. The algorithm invokes point addition and point doubling. Point doubling can be calculated simultaneously with point addition.
| Algorithm 5 ScaMul (d, P, m, a) (scalar point multiplication). |
|
inputs: n-bit scalar d = (d_{n-1} ... d_1 d_0)_2 and point P; m and a in y^2 = x^3 + ax + b over GF(m)
output: Q = dP
begin
1 Q = O, T = P, i = 0 /* Q accumulates the result and T holds 2^i P */
2 while i <= n - 1
3 if d_i = 1
4 Q = Q + T /* point addition */
5 T = 2T /* point doubling */
6 i = i + 1
7 endwhile
8 return Q /* Q = dP */
end
|
Below we give an example to show the calculation steps of the scalar point multiplication. For a 5-bit scalar d = (d_4 d_3 d_2 d_1 d_0)_2, we calculate Q = dP in 5 steps, as follows.
| Weight | Point Addition | Point Doubling |
| Initial | Q = O | T = P |
| 1 | if d_0 = 1, Q = Q + T | T = 2P |
| 2 | if d_1 = 1, Q = Q + T | T = 4P |
| 4 | if d_2 = 1, Q = Q + T | T = 8P |
| 8 | if d_3 = 1, Q = Q + T | T = 16P |
| 16 | if d_4 = 1, Q = Q + T | T = 32P |
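For instance, with an illustrative 5-bit scalar of our choosing (not taken from the paper), d = 21 = (10101)_2, the point additions are performed in the steps whose weights are 1, 4, and 16:
dP = 1P + 4P + 16P = 21P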
2.6. Elliptic Curve Diffie–Hellman Key Exchange
The elliptic curve Diffie–Hellman (ECDH) algorithm is a variant of the Diffie–Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. Then, this shared secret key can be used by the two parties to encrypt and decrypt subsequent communications using fast symmetric-key cryptography over the insecure network. The ECDH key exchange protocol is shown in Table 1.
Table 1.
Elliptic curve Diffie–Hellman key exchange.
Because Qa = da * G, Qb = db * G, and da * (db * G) = db * (da * G), we have Qab = da * Qb = da * db * G = db * Qa = Qba. Below is an ECDH key exchange example using Secp256k1. We can see that the two parties have the same shared secret key (Qabx = Qbax).
- Alice generates da and exposes Qa = da * G = (Qax, Qay):
da = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595
Qax = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7
Qay = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14
- Bob generates db and exposes Qb = db * G = (Qbx, Qby):
db = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4
Qbx = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda
Qby = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042
- Alice obtains Qb and calculates Qab = da * Qb = (Qabx, Qaby):
Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
- Bob obtains Qa and calculates Qba = db * Qa = (Qbax, Qbay):
Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
- Now, Alice and Bob have the same secret key (Qab = Qba). They can use symmetric-key cryptography for the subsequent communications.
3. Modular Multiplication Algorithms
Point addition and point doubling use many modular multiplications. This section describes interleaved modular multiplication (IMM), Montgomery modular multiplication (MMM), and shift-sub modular multiplication (SSMM) algorithms, and proposes shift-sub modular multiplication with advance preparation (SSMMPRE) and shift-sub modular multiplication with CSAs and sign detection (SSMMCSA) algorithms.
3.1. Interleaved Modular Multiplication Algorithm
The IMM algorithm [6] is formally given in Algorithm 6. It computes p = a * b mod m, where a, b < m and m is an n-bit odd number. That is, the (n-1)th bit and the 0th bit of the n-bit m are 1. IMM begins with checking the (n-1)th bit of the multiplier b. Therefore, in each iteration, the product p is shifted to the left by one bit (line 3). Because p < m and a < m, we have 2p + a < 3m (lines 3 and 4). Therefore, it is enough to subtract at most 2m (lines 5 and 6), ensuring p < m.
| Algorithm 6 IMM (a, b, m) (interleaved modular multiplication). |
inputs: a < m, b < m, m: n-bit odd number
output: p = a * b mod m
begin
1 p = 0 /* product */
2 for i = n - 1 downto 0
3 p = 2p /* shift p to the left by one bit */
4 if b_i = 1, p = p + a /* add multiplicand a to p if b_i = 1 */
5 if p >= m, p = p - m /* subtract m from p */
6 if p >= m, p = p - m /* subtract m from p */
7 return p
end |
Figure 4 shows a possible IMM hardware implementation of Algorithm 6. Red rectangles are registers; the others are combinational circuits. Clearly, the critical path is the right part that computes the new p in each iteration. It consists of three carry propagate adders (CPAs) and three multiplexers. The most significant bit of each adder output (the sign) can be used as the select signal of the corresponding multiplexer. Note that 2p can be realized by wiring.
Figure 4.
Block diagram of interleaved modular multiplication (IMM). It implements Algorithm 6.
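By analogy with the SSMM module in Appendix A, a behavioral Verilog HDL sketch of Algorithm 6 is shown below. The module and signal names, the 256-bit width, and the simplified ready handling are our assumptions for illustration; they are not taken from the paper's source code.
module imm_sketch (clk, rst_n, start, a, b, m, p, ready);
input clk, rst_n, start;
input [255:0] a, b, m;
output [255:0] p;
output reg ready;
reg [257:0] s; // running product with two guard bits
reg [7:0] cnt; // bit index, processed from 255 down to 0
reg busy;
wire bi = b[cnt]; // current multiplier bit b_i
wire [257:0] shifted = {s[256:0],1'b0}; // 2p, realized by wiring
wire [257:0] added = shifted + (bi ? {2'b00,a} : 258'd0); // 2p + a*b_i (lines 3 and 4)
wire [257:0] sub1 = added - {2'b00,m}; // first conditional subtraction (line 5)
wire [257:0] t = sub1[257] ? added : sub1; // keep the sum if the subtraction is negative
wire [257:0] sub2 = t - {2'b00,m}; // second conditional subtraction (line 6)
wire [257:0] new_s = sub2[257] ? t : sub2; // 0 <= new_s < m
assign p = s[255:0]; // valid when ready = 1
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin busy <= 0; ready <= 0; end
else if (start) begin s <= 0; cnt <= 8'd255; busy <= 1; ready <= 0; end
else if (busy) begin
s <= new_s;
if (cnt == 8'd0) begin busy <= 0; ready <= 1; end
else cnt <= cnt - 8'd1;
end
end
endmodule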
3.2. Montgomery Modular Multiplication Algorithm
The MMM algorithm [14] performs modular multiplication in the Montgomery domain. It is very efficient for the modular exponentiation used by RSA cryptography. MMM calculates p = a * b * R^(-1) mod m, where R = 2^n, a, b < m, and m is an n-bit odd number with 2^(n-1) < m < 2^n. It performs reduction during the multiplication because a and b are represented in the Montgomery domain as follows:
a = a' * R mod m, b = b' * R mod m
where a' and b' are the operands in the conventional domain. Such a reduction ensures that the product p is an operand still in the Montgomery domain:
p = a * b * R^(-1) mod m = (a' * R)(b' * R) * R^(-1) mod m = (a' * b') * R mod m
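For clarity, a sketch of how a conventional-domain product a' * b' mod m is obtained with MMM (using the definitions above and the precomputed value R^2 mod m):
a = MMM(a', R^2 mod m) = a' * R mod m (to the Montgomery domain)
b = MMM(b', R^2 mod m) = b' * R mod m (to the Montgomery domain)
p = MMM(a, b) = a' * b' * R mod m (Montgomery-domain product)
MMM(p, 1) = a' * b' mod m (back to the conventional domain)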
MMM is widely used in RSA cryptography, where the modular exponentiation is realized with repeated modular multiplications. A bit-level MMM algorithm is formally given in Algorithm 7. For the reduction, we perform the division by R = 2^n in n iterations, dividing p by 2 in each iteration. Line 3 performs the multiplication. Line 4 makes p even for the reduction, where p_0 is the least significant bit of p. Line 5 shifts p to the right by one bit (p = p/2). Line 6 ensures p < m. The reason is shown below. In the loop body, p is ensured to be less than 2m. Then, p + a < 3m (line 3), p + m < 4m (line 4), and p/2 < 2m (line 5). In the finalization, p - m < m if p >= m (line 6).
| Algorithm 7 MMM (a, b, m) (Montgomery modular multiplication). |
inputs: a < m, b < m, R = 2^n, m: n-bit odd number
output: p = a * b * R^(-1) mod m
begin
1 p = 0 /* product */
2 for i = 0 to n - 1
3 if b_i = 1, p = p + a /* add multiplicand a to p if b_i = 1 */
4 if p_0 = 1, p = p + m /* make p even */
5 p = p/2 /* division by 2: reduction */
6 if p >= m, p = p - m /* subtract m from p in the finalization */
7 return p
end |
Figure 5 shows a possible MMM hardware implementation of Algorithm 7. Red rectangles are registers, and the others are combinational circuits. The critical path is the right part that computes the new p in each iteration. It consists of three CPAs and three multiplexers. The most significant bit of the adder output (the sign) or the least significant bit of p can be used as the select signal of the multiplexers. Note that p/2 can be realized by wiring.
Figure 5.
Block diagram of Montgomery modular multiplication (MMM). It implements Algorithm 7.
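Similarly, a behavioral Verilog HDL sketch of Algorithm 7 is given below, written in the style of the SSMM code in Appendix A. The names, the 256-bit width, and the simplified ready handling are illustrative assumptions; the result is the Montgomery-domain product a * b * 2^(-256) mod m.
module mmm_sketch (clk, rst_n, start, a, b, m, p, ready);
input clk, rst_n, start;
input [255:0] a, b, m;
output [255:0] p;
output reg ready;
reg [257:0] s; // running value, kept below 2m
reg [7:0] cnt; // bit index, processed from 0 up to 255
reg busy;
wire bi = b[cnt]; // current multiplier bit b_i
wire [257:0] t0 = s + (bi ? {2'b00,a} : 258'd0); // p + a*b_i (line 3)
wire [257:0] t1 = t0 + (t0[0] ? {2'b00,m} : 258'd0); // make p even (line 4)
wire [257:0] t2 = {1'b0,t1[257:1]}; // p = p/2, realized by wiring (line 5)
wire [257:0] sub = s - {2'b00,m}; // finalization p - m (line 6)
assign p = sub[257] ? s[255:0] : sub[255:0]; // valid when ready = 1
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin busy <= 0; ready <= 0; end
else if (start) begin s <= 0; cnt <= 8'd0; busy <= 1; ready <= 0; end
else if (busy) begin
s <= t2;
if (cnt == 8'd255) begin busy <= 0; ready <= 1; end
else cnt <= cnt + 8'd1;
end
end
endmodule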
3.3. Shift-Sub Modular Multiplication Algorithm
The SSMM algorithm [15,16] is formally given in Algorithm 8. It calculates p = a * b mod m, where a, b < m and m is an n-bit odd number. It begins with checking the 0th bit of the multiplier b. Therefore, in each iteration, the multiplicand u (initialized to a) is shifted to the left by one bit (line 5). Because p < m and u < m, p + u < 2m (line 3), so it is enough to subtract m from p + u (line 4), ensuring p < m. Because u < m, 2u < 2m (line 5), so it is enough to subtract m from 2u (line 6), ensuring u < m.
| Algorithm 8 SSMM (a, b, m) (shift-sub modular multiplication). |
inputs: a < m, b < m, m: n-bit odd number
output: p = a * b mod m
begin
1 u = a; p = 0 /* multiplicand, product */
2 for i = 0 to n - 1
3 if b_i = 1, p = p + u /* add multiplicand u to p if b_i = 1 */
4 if p >= m, p = p - m /* subtract m from p */
5 u = 2u
6 if u >= m, u = u - m /* subtract m from u */
7 return p
end |
Figure 6 shows a possible SSMM hardware implementation of Algorithm 8. Red rectangles are registers, and the others are combinational circuits. Compared to Figure 4 and Figure 5, here we reduce the critical path to two CPAs and two multiplexers. The bit b_i (b[cnt] in the figure) is used as the selection signal of the last (bottom) multiplexer. If it is a 0, the value in register p is selected (unchanged). The register u and its corresponding combinational circuits are added for performing lines 5 and 6 in Algorithm 8. Note that the implementation is slightly different from the algorithm. Regardless of b_i, we calculate p + u and p + u - m first. If p + u - m is non-negative, we select it; otherwise, we select p + u. Finally, the value selected by b_i is written to register p. Note that instead of using a multiplexer in the bottom, bit b_i (b[cnt] in the figure) can be used as a write enable for register p.
Figure 6.
Block diagram of shift-sub modular multiplication (SSMM). It implements Algorithm 8.
3.4. Shift-Sub Modular Multiplication with Advance Preparation Algorithm
From the implementation of the SSMM algorithm, we can see that the critical path involves the computation based on the multiplicand u. The modulus m does not change during the multiplication, and the multiplicand u simply doubles (modulo m) with each iteration. Therefore, the values derived from u can be prepared in advance in the previous iteration. u is generated as follows: if 2u - m is non-negative, 2u - m is written to register u; otherwise, 2u is written to register u, where u is the value of the previous iteration. Thus, in the current iteration, there are two cases for u. Case 1: u = 2u_prev if 2u_prev - m is negative. Case 2: u = 2u_prev - m if 2u_prev - m is non-negative. Then, we just prepare 2u_prev and 2u_prev - m in the previous iteration and store them in two registers, x and y. That is, if y is negative, u = x; otherwise, u = y. The algorithm SSMMPRE is formally given in Algorithm 9.
| Algorithm 9 SSMMPRE (a, b, m) (shift-sub modular multiplication with preparation). |
inputs: , , , m: n-bit odd number output: begin 1 ; ; ; 2 for to 3 4 if , /* x: prepared in previous clock cycle */ 5 else /* y: prepared in previous clock cycle */ 6 if 7 if , 8 else 9 ; /* prepare for use in next clock cycle */ 10 if , 11 else 12 return p end |
Figure 7 shows a possible SSMMPRE hardware implementation of Algorithm 9. Registers x and y hold 2u_prev and 2u_prev - m, respectively, where u_prev is the value of u in the previous iteration.
Figure 7.
Block diagram of shift-sub modular multiplication with advance preparation (SSMMPRE). It implements Algorithm 9.
3.5. Shift-Sub Modular Multiplication with CSAs and Sign Detection Algorithm
The algorithm SSMMCSA (SSMM with CSAs and sign detection) is formally given in Algorithm 10. Figure 8 shows a possible implementation of SSMMCSA. The two CPAs in Figure 6 are replaced with CSAs. Through our hardware implementation, we find that the part of the CSAs that generates c and s is no longer the critical path. We use another register q to store the value of c + s calculated with a CPA, and generate p from q. The critical path of this circuit contains a CPA and a multiplexer. The output of the CSAs consists of the carry c and the sum s. The result value p will be (c + s) mod m.
| Algorithm 10 SSMMCSA (a, b, m) (shift-sub modular multiplication with CSAs). |
inputs: , , , m: n-bit odd number output: begin 1 ; 2 for to 3 4 if 5 /* add multiplicand u to */ 6 7 if sign (negative) 8 9 else /* subtract m from */ 10 11 if 12 /* subtract m from u */ 13 if 14 /* subtract m from q */ 15 return p end |
Figure 8.
Block diagram of shift-sub modular multiplication with CSAs and sign detection (SSMMCSA). It implements Algorithm 10.
We will later propose an easy way to determine the sign of c + s from c and s without using a CPA to calculate c + s. The upper CSA performs p + u and the other CSA performs p + u - m, corresponding to p + u and p + u - m in Figure 6. One CSA's output will be selected with the "csasign" (CSA's sign) signal and stored in the CS register in the case of b_i = 1 (b[cnt] = 1).
Note that we can obtain -m from m quickly. Usually, -m = ~m + 1, where the increment by 1 needs a CPA. But here, m is an n-bit odd value (m_0 = 1), so we can invert the left n - 1 bits of m and leave the least significant bit unchanged, that is, -m = {~m[n-1:1], 1'b1}.
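A minimal Verilog HDL sketch of this shortcut is shown below; the module name and the 256-bit width are our assumptions for illustration.
module neg_odd_m (m, neg_m);
input [255:0] m; // odd modulus, m[0] == 1
output [255:0] neg_m; // two's complement -m
assign neg_m = {~m[255:1], 1'b1}; // equals ~m + 1 because inverting an odd m clears its LSB
endmodule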
Now, we describe how to generate the "csasign" signal. Figure 9 shows an example of determining the sign of c + s from the 17-bit outputs C and S for a 16-bit m when using CSAs. We divide the 17 bits into four windows, namely W3, W2, W1, and W0. W3 has 5 bits and each of W2, W1, and W0 has 4 bits. We use one 5-bit and three 4-bit CPAs, carry lookahead adders (CLAs) for instance, to perform the additions in the four windows simultaneously, as follows. Note that each of the adders generates a 5-bit result.
| SW3 | = | C(W3) | + | S(W3) | (Add 5-bit, 5-bit sum) |
| SW2 | = | C(W2) | + | S(W2) | (Add 4-bit, 5-bit sum) |
| SW1 | = | C(W1) | + | S(W1) | (Add 4-bit, 5-bit sum) |
| SW0 | = | C(W0) | + | S(W0) | (Add 4-bit, 5-bit sum) |
Figure 9.
Determining the sign of c + s based on the 17-bit outputs C and S of the CSAs. We divide the 17 bits into four windows. There are five bits in the left-most window. Each of the other three windows has four bits. C and S are inputs; the others are outputs of the adders. SW3[4] indicates the sign except for Case 0, Case 1, or Case 2, where the sign is the inverse of SW3[4] when SW3[3:0] == 4'b1111.
We summarize the three cases shown in Figure 9 as follows. The signal "sign_inverse" is true, meaning that the original sign (SW3[4]) is inverted, when SW3[3:0] == 4'b1111.
sign_inverse = SW2[4] /* Case 2 */
| (SW2[3:0] == 4'b1111) & SW1[4] /* Case 1 */
| (SW2[3:0] == 4'b1111) & (SW1[3:0] == 4'b1111) & SW0[4] /* Case 0 */
The sign bit is a 1 if c + s is negative; otherwise, it is 0. When SW3[4] is a 0 and "sign_inverse" is true, the sign will be 1 under the condition SW3[3:0] == 4'b1111. Similarly, when SW3[4] is a 1 and "sign_inverse" is true, the sign will be 0 under the condition SW3[3:0] == 4'b1111. Thus, we have the following expression for determining "csasign".
csasign = SW3[4] ⊕ ((SW3[3:0] == 4'b1111) & sign_inverse)
The symbol ⊕ denotes an exclusive OR (XOR) operation. When SW3[3:0] != 4'b1111, the sign equals SW3[4], regardless of Case 0, Case 1, or Case 2. Note that SW2, SW1, and SW0 are sums of two 4-bit values of c and s, so they do not have the pattern 5'b11111. The maximum sum is 5'b11110, when both c and s have a maximum value of 4'b1111.
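The following behavioral Verilog HDL sketch gives our reading of the sign detection for the 17-bit example of Figure 9; the module and signal names are assumptions. It reports bit 16 of C + S (the two's complement sign) using the windowed adders and the propagate conditions described above.
module csa_sign17 (c, s, csasign);
input [16:0] c, s; // carry and sum outputs of the CSAs
output csasign; // 1 if c + s is negative in 17-bit two's complement
wire [5:0] w3 = c[16:12] + s[16:12]; // left-most window, 5 bits wide
wire [4:0] w2 = c[11:8] + s[11:8];
wire [4:0] w1 = c[7:4] + s[7:4];
wire [4:0] w0 = c[3:0] + s[3:0];
wire p2 = (w2[3:0] == 4'b1111); // a window propagates an incoming carry when its low sum bits are all ones
wire p1 = (w1[3:0] == 4'b1111);
wire cin3 = w2[4] | (p2 & w1[4]) | (p2 & p1 & w0[4]); // carry into the top window (Cases 2, 1, 0)
wire [5:0] w3c = w3 + {5'b0, cin3}; // the carry only flips bit 4 when w3[3:0] == 4'b1111
assign csasign = w3c[4]; // bit 16 of C + S
endmodule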
In the example described above, the window size is 4. We can also let the window size be 2 or 8. Then, we can use eight 2-bit adders or two 8-bit adders to determine the sign of c + s. The former has a shorter latency introduced by the adders, but the logic equation for determining the sign is more complicated. On the other hand, the latter has a simpler logic equation, but the adder has a longer latency.
For a 256-bit ECC, there are more options for the number of adder bits and the number of windows. In order to design a low-cost and high-performance circuit, we investigate the cost (the number of ALMs) and performance (the circuit frequency) of the sign detection circuit for the CSAs used in 256-bit ECC on an FPGA (field-programmable gate array) chip. There are seven configurations. The experimental results are shown in Table 2 and Figure 10.
Table 2.
Cost and performance of sign detection with different configurations. The FPGA chip device is Cyclone V 5CGXFC7D7F31C8.
Figure 10.
Cost and performance of sign detection with different configurations.
In general, decreasing CPA bits increases the frequency. However, in our simulations, the frequency of the last configuration (128 windows and a 2-bit adder) is lower than the configuration with 64 windows and a 4-bit adder. This is because the logic equation for sign detection becomes complicated. We can also see that larger adders require more ALMs.
Based on the experimental results, our 256-bit ECC implementations use 4-bit adders, so there are 64 windows (256/4 = 64). A frequency of 392.46 MHz is measured when the FPGA chip implements only the sign detection circuitry. Implementing both the sign detection circuit and the CSAs results in a frequency of 343.76 MHz, lower than 392.46 MHz. The latency of the CSAs is the same as that of a 1-bit full adder; clearly, it is less than that of the sign detection circuit. These experimental results imply that the frequency decreases as the circuit becomes larger.
Note that the sign detection circuit itself is a combinational circuit. If we want to test its latency, we can add registers on both the input and output sides. These registers are for testing purposes only and should be removed for ECC implementations. Otherwise, there will be a delay of two clock cycles, resulting in incorrect timing.
4. Hardware Implementations of Modular Multiplications and ECC
We have implemented modular multiplication algorithms and ECC in Verilog HDL. Unlike sequential program code in software, hardware modules can operate simultaneously. Considering data dependency, we must handle the synchronization between hardware modules using signals such as “start” and “ready”. This section describes these implementations and evaluates their cost and performance.
4.1. Hardware Implementations of Modular Multiplications
We have implemented the five modular multiplication algorithms described in the previous section. Since we use the 256-bit key proposed in Secp256k1 [4], the Verilog HDL code for modular multiplication also uses 256 bits. As an example, the Verilog HDL code shown in Appendix A calculates p = a * b mod m, where m is a 256-bit odd number and a, b < m, as described in Algorithm 8 (SSMM). The input "start" is a one-clock-cycle active signal that tells the module to start the modular multiplication. The outputs "ready", "busy", and "ready0" are synchronization signals for other modules. The simulation waveform generated with ModelSim can be found in Figure A1.
When developing Verilog HDL code, it is recommended to use continuous assignment to perform calculations outside of “always” statements that use the clock signal. Verilog HDL code for other modular multiplications can be easily developed by referring to the SSMM example code and the corresponding algorithms and figures.
Table 3 gives the cost performance of the five modular multiplication algorithms. The column of clock cycles shows the required number of clock cycles when executing the modular multiplication algorithm. The column of frequency (MHz) shows the frequency in MHz at which the circuit can work. The column of latency (μs) shows the time in microseconds calculated by dividing the clock cycles by the clock frequency. The column of ALMs shows the required number of adaptive logic modules. The column of registers shows the required number of flip-flops. The last column shows the static/dynamic (S/D) power dissipation estimated using the Quartus Prime “Power Analyzer Tool” function when using Intel/Altera Cyclone V 5CGXFC7D7F31C8 FPGA chip.
Table 3.
Comparison of modular multiplication algorithms. SSMM is 1.80 times faster than IMM and SSMMCSA is 3.27 times faster than IMM. The FPGA chip device is Cyclone V 5CGXFC7D7F31C8.
SSMM has a higher frequency than IMM and MMM due to its shorter critical path. IMM only stores the product p, but SSMM needs to store both the product p and the multiplicand u, so SSMM uses about twice as many registers compared to IMM. The number of ALMs used by SSMM is a little bit smaller than that used by IMM. Figure 11 plots the relative cost performance of the modular multiplication algorithms where the cost performance of IMM is set to 1.0, and the data for other implementations are obtained by dividing the corresponding value by the value of IMM. The figure and table show that SSMMCSA provides the highest frequency and lowest latency at the highest hardware cost.
Figure 11.
Cost performance comparison of modular multiplication algorithms (set IMM to 1.0).
4.2. Hardware Implementations of ECC
We have implemented ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA modular multiplication algorithms. Referring to Algorithm 5, in affine coordinates, the module of scalar point multiplication (scalarmult) invokes the module of point addition (addpoints) and the module of point doubling (doublepoint), as follows (simplified).
module scalarmult (clk, rst_n, start, x, y, d, m, a, rx, ry, ready); // … signal declarations addpoints ap (clk, rst_n, start_ap, x1, y1, x2, y2, m, a, apx, apy, ready_ap); doublepoint dp (clk, rst_n, start_dp, x1, y1, m, a, dpx, dpy, ready_dp); always @(posedge clk or negedge rst_n) begin // … to generate start_ap and start_dp and register results // … to check completeness and generate signals for ready end endmodule
Based on the start signal of module scalarmult, we can generate the start signals of start_ap for addpoints and start_dp for doublepoint. The result of scalarmult is calculated in iterations on the scalar d. Note that in each iteration, addpoints and doublepoint can be executed in parallel. An important synchronization is to allow addpoints to start only after the previous iteration’s doublepoint has finished.
Referring to Algorithm 5, we use a 256-bit register k to control the iterations. It is initialized with the scalar d at the start and shifted one bit to the right at each iteration. After 256 iterations, k becomes 0 and the addition point [apx, apy] is the result point [rx, ry]. The modules of scalar point multiplication in projective and Jacobian coordinates invoke the point addition and doubling modules in projective and Jacobian coordinates, respectively. In the addpoints and doublepoint modules, the result point is calculated based on the algorithms of point addition and point doubling, as described in Algorithm 1 (or 3) and Algorithm 2 (or 4). These modules invoke (1) modadd (modular addition), (2) modsub (modular subtraction), (3) modmul (modular multiplication), and (4) modinv (modular inversion). The first two modules are combinational circuits which take one clock cycle. The last two modules, modmul and modinv, are sequential circuits for which we must generate the start signals; a sketch of this handshake is shown below. The Verilog HDL source codes of these two modules are given in Appendix A and Appendix B. Referring to Table 1, in affine coordinates, the x coordinate of da * Qb and db * Qa can be used as the shared secret key by the two parties. In projective and Jacobian coordinates, x must be calculated once by x = X/Z and x = X/(Z^2), respectively. Such calculations can be performed using modmul and modinv, as shown in Appendix A and Appendix B.
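As an illustration of the start/ready synchronization, a minimal Verilog HDL sketch of the handshake used to invoke a sequential submodule (such as modmul or modinv) is given below; the module and signal names are our assumptions and not the paper's source code.
module seq_call_sketch (clk, rst_n, go, sub_ready, sub_start, done);
input clk, rst_n;
input go; // request from the caller (one cycle)
input sub_ready; // ready pulse from the submodule (modmul or modinv)
output reg sub_start; // one-cycle start pulse to the submodule
output reg done; // set when the submodule result is available
reg busy;
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin sub_start <= 0; busy <= 0; done <= 0; end
else begin
sub_start <= 0; // default: keep the start pulse one cycle wide
if (go && !busy) begin
sub_start <= 1; // launch the submodule
busy <= 1;
done <= 0;
end else if (busy && sub_ready) begin // the submodule has finished
busy <= 0;
done <= 1;
end
end
end
endmodule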
Table 4 gives the cost performance of the ECC implementations in affine, projective, and Jacobian coordinates using IMM, the modular multiplication algorithms proposed in [7,8,9,10], SSMM, SSMMPRE, and SSMMCSA. The [7,8,9,10] implementations use three-input multiplexers, while the SSMM implementations use only two-input multiplexers. The implementation in [7] uses a multiplexer to perform p = p + a * b_i: if b_i = 0, p remains unchanged. Instead, the implementation in [8] uses "AND" gates to perform a * b_i, that is, p = p + (a & {256{b_i}}), which is a little slower than the implementation in [7].
Table 4.
Comparison of ECC in three coordinates with modular multiplication algorithms.
As mentioned before, instead of using a multiplexer, bit b_i (b[cnt] in Figure 6) can be used as a write enable for register p. The line labeled "SSMM-WE" shows the cost performance of such an implementation. We can see that it achieves exactly the same cost performance as "SSMM". This is because a DFFE (D flip-flop with enable) is actually implemented using a DFF and a 2-to-1 multiplexer. The line labeled "SSMM-AND" shows the cost performance of an implementation that uses neither multiplexers nor write-enabled registers. Instead, it uses 256 "AND" gates and a CPA to perform p = p + (u & {256{b_i}}). Its performance is lower than using a multiplexer or a write-enabled register. In conclusion, we recommend using a multiplexer for calculating the new p, as shown in Figure 6.
The relative cost performance of the ECC implementations in affine, projective, and Jacobian coordinates are plotted in Figure 12, Figure 13 and Figure 14, respectively, where the cost performance of ECC with IMM is set to 1.0, and the data for other implementations are obtained by dividing the corresponding value by the value of the ECC with IMM.
Figure 12.
Cost performance comparison of ECC in affine coordinates (set ECC with IMM to 1.0).
Figure 13.
Cost performance comparison of ECC in projective coordinates (set ECC with IMM to 1.0).
Figure 14.
Cost performance comparison of ECC in Jacobian coordinates (set ECC with IMM to 1.0).
The ECC implementations in projective coordinates use more ALMs and registers than those in the other two coordinates. The ECC implementations in affine coordinates have a lower hardware cost than those in the other two coordinates. For all coordinates, the ECC implementations using SSMM, SSMMPRE, and SSMMCSA have higher clock frequencies than those with IMM.
As shown in Table 3, the circuit SSMMCSA has the highest frequency, but applying it to the ECC design does not achieve a higher frequency than the other circuits, such as SSMM. This is because, in the ECC design, there are other modules that have a longer latency than the CSAs and sign detection. For example, modadd takes one clock cycle to perform an addition of two operands a and b, a subtraction (subtracting the modulus m from the sum), and a selection using a multiplexer based on the sign of the subtraction result. Another example is modinv. Referring to Appendix B, the calculation of the final result c in modular inversion takes an even longer time than modadd, as shown below, where r is the result register in the source code and m is the modulus.
if r - 2m >= 0
c = r - 2m // c = r - 2m
else if r - m >= 0
c = r - m // c = r - 1m
else if r >= 0
c = r // c = r + 0m
else
c = r + m // c = r + 1m
endif
endif
endif
Such calculations take a longer time than the CSAs and sign detection operations. Meanwhile, the SSMMCSA circuit is larger than SSMM, which also decreases the frequency.
Figure 15 shows the relative latency of the ECC implementations. The value of ECC with IMM is set to 1.0, and the values for the other implementations are obtained by dividing the corresponding latency by the latency of the ECC with IMM. We can see that SSMM in projective coordinates has the lowest latency. This is because SSMMPRE and SSMMCSA use a lot of hardware resources, which reduces the frequency. Interestingly, the latency in projective coordinates is lower than that in Jacobian coordinates. The formulas for point addition and point doubling in Jacobian coordinates look simpler than those in projective coordinates, but some of the output coordinates depend on others, resulting in a sequential execution. On the other hand, the corresponding values in projective coordinates can be calculated in parallel.
Figure 15.
Relative latency of ECC implementations (set the latency of ECC with IMM to 1.0).
The ECC implementations presented in this paper are based on the Secp256k1 curve. It is easy to use the NIST Secp256r1 (P-256) curve [4,19]. We have also implemented the ECC based on the Secp256r1 curve. Our experimental results show that the cost performance of ECC implementations based on both curves is almost the same.
5. Conclusions and Future Work
This paper introduced ECC modular multiplication algorithms and their hardware implementations. Experimental results show that the proposed SSMMPRE (shift-sub modular multiplication with advance preparation) and SSMMCSA (shift-sub modular multiplication with CSAs and sign detection) have lower latencies than SSMM (shift-sub modular multiplication). We proposed a fast and simple method to determine the sign based on the separate output sum and carry of the CSAs (carry save adders). For a 256-bit SSMMCSA, it is recommended to use 64 windows and 4-bit CPAs (carry propagate adders). At the current time, SSMMCSA should be used. We also investigated the ECC implementations in affine, projective, and Jacobian coordinates using the IMM (interleaved modular multiplication), a modified IMM, SSMM, SSMMPRE, and SSMMCSA algorithms. Experimental results show that the ECC implementation in projective coordinates using SSMM has the lowest latency among all the ECC implementations.
The SSMMCSA circuit itself has a higher frequency (116.58 MHz) than SSMM (63.97 MHz), but the ECC circuits using SSMMCSA have the same or even lower frequency than that using SSMM. The next challenge is to find and shorten the critical path to make the SSMMCSA ECC circuit faster than the SSMM ECC circuit. Also, high-radix SSMM algorithms and their use in ECC implementations are worth investigating.
Funding
This research received no external funding.
Data Availability Statement
Not applicable.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Verilog HDL Code of Shift-Sub Modular Multiplication (SSMM)
`timescale 1ns/1ns
module modmul (clk, rst_n, start, a, b, m, p, ready, busy, ready0); // p = a * b mod m
input clk, rst_n;
input start;
input [255:0] a, b, m;
output [255:0] p;
output ready;
output reg busy;
output reg ready0;
reg ready1;
assign ready = ready0 ^ ready1;
reg [257:0] u, s;
reg [7:0] cnt;
wire [7:0] next_cnt = cnt + 8'd1;
wire bi_is_1 = b[cnt];
wire [257:0] plus_u = s + u; // s + u
wire [257:0] minus_m = plus_u - {2'b00,m}; // s + u - m
wire [257:0] new_s = bi_is_1 ? minus_m[257] ? plus_u : minus_m : s; // new s
wire [257:0] two_u = {u[256:0],1'b0}; // 2u
wire [257:0] two_u_m = two_u - {2'b00,m}; // 2u - m
wire [257:0] new_u = two_u_m[257] ? two_u : two_u_m; // new u
assign p = s[255:0];
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
ready0 <= 0;
ready1 <= 0;
busy <= 0;
end else begin
ready1 <= ready0;
if (start) begin
u <= {2'b0,a}; // u <= a
s <= 0; // s <= 0
ready0 <= 0;
ready1 <= 0;
busy <= 1;
cnt <= 0;
end else begin
if (busy) begin
s <= new_s; // s <= new_s;
if (cnt == 8'd255) begin // finished
ready0 <= 1;
busy <= 0;
end else begin // not finished
u <= new_u; // u <= new_u;
cnt <= next_cnt; // cnt++
end
end
end
end
end
endmodule
This code implements Algorithm 8. Its block diagram is shown in Figure 6. The signal “start” is active for one clock cycle to tell the module to start modular multiplication. The signal “ready” remains active for one clock cycle to indicate that the modular multiplication result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. Figure A1 shows the simulation waveform generated with ModelSim.
Figure A1.
Waveform of SSMM that calculates p = a * b mod m. The Verilog HDL code is given in Appendix A.
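A testbench sketch for the modmul module, written by analogy with the modinv testbench in Appendix B, is shown below; the operand values are illustrative choices of ours.
`timescale 1ns/1ns
module modmul_tb;
reg clk, rst_n, start;
reg [255:0] a, b, m;
wire [255:0] p;
wire ready, busy, ready0;
modmul inst (clk, rst_n, start, a, b, m, p, ready, busy, ready0);
initial begin
#0 clk = 1;
#0 rst_n = 0;
#0 start = 0;
#0 a = 256'd123456789;
#0 b = 256'd987654321;
#0 m = 256'hfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f;
#1 rst_n = 1;
#2 start = 1;
#2 start = 0;
wait(ready);
#40 $stop;
end
always #1 clk = !clk;
endmodule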
Appendix B. Verilog HDL Code of Modular Inversion
`timescale 1ns/1ns
module modinv (clk, rst_n, start, b, a, m, c, ready, busy, ready0); // c = b * a^{-1} mod m
input clk, rst_n;
input start;
input [255:0] b, a, m;
output [255:0] c;
output ready, ready0;
output reg busy;
reg ready0, ready1;
assign ready = ready0 ^ ready1;
reg [259:0] u, v, x, y, q, result;
wire [259:0] x_plus_m = x + q; // x + m
wire [259:0] y_plus_m = y + q; // y + m
wire [259:0] u_minus_v = u - v; // u - v
wire [259:0] r_plus_m = result + q; // r + m
wire [259:0] r_minus_m = result - q; // r - m
wire [259:0] r_minus_2m = result - {q[258:0],1'b0}; // r - 2m
assign c = r_minus_2m[259] ? r_minus_m[259] ? result[259] ? r_plus_m[255:0] :
result[255:0] : r_minus_m[255:0] : r_minus_2m[255:0]; // c = b * a^{-1} mod m
always @(posedge clk or negedge rst_n) begin
if (!rst_n) begin
ready0 <= 0;
ready1 <= 0;
busy <= 0;
end else begin
ready1 <= ready0;
if (start) begin
u <= {4'b0,a}; // u <= a
v <= {4'b0,m}; // v <= m
x <= {4'b0,b}; // x <= b
y <= {260'b0}; // y <= 0
q <= {4'b0,m}; // q <= m
ready0 <= 0;
ready1 <= 0;
busy <= 1;
end else begin
if (busy && ((u == 1) || (v == 1))) begin // finished
ready0 <= 1;
busy <= 0;
if (u == 1) begin // if u == 1
if (x[259]) begin // if x < 0
result <= x_plus_m; // c = x + m
end else begin // else
result <= x; // c = x
end
end else begin // else
if (y[259]) begin // if y < 0
result <= y_plus_m; // c = y + m
end else begin // else
result <= y; // c = y
end
end
end else begin // not finished
if (!u[0]) begin // while u & 1 == 0
u <= {u[259],u[259:1]}; // u = u >> 1
if (!x[0]) begin // if x & 1 == 0
x <= {x[259],x[259:1]}; // x = x >> 1
end else begin // else
x <= {x_plus_m[259],x_plus_m[259:1]}; // x = (x + m) >> 1
end
end
if (!v[0]) begin // while v & 1 == 0
v <= {v[259],v[259:1]}; // v = v >> 1
if (!y[0]) begin // if y & 1 == 0
y <= {y[259],y[259:1]}; // y = y >> 1
end else begin // else
y <= {y_plus_m[259],y_plus_m[259:1]}; // y = (y + m) >> 1
end
end
if ((u[0]) && (v[0])) begin // two while loops finished
if (u_minus_v[259]) begin // if u < v
v <= v - u; // v = v - u
y <= y - x; // y = y - x
end else begin // else
u <= u - v; // u = u - v
x <= x - y; // x = x - y
end
end
end
end
end
end
endmodule
This code is developed based on Algorithm 2.22 in [3]. The signal “start” is active for one clock cycle to tell the module to start modular inversion. The signal “ready” remains active for one clock cycle to indicate that the modular inversion result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. The testbench is listed below.
`timescale 1ns/1ns
module modinv_tb;
reg clk, rst_n, start;
reg [255:0] b, a, m;
wire [255:0] c;
wire ready, busy, ready0;
modinv inst (clk, rst_n, start, b, a, m, c, ready, busy, ready0);
initial begin
#0 clk = 1;
#0 rst_n = 0;
#0 start = 0;
#0 b = 256'h9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85;
#0 a = 256'hd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8;
#0 m = 256'hfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f;
#1 rst_n = 1;
#2 start = 1;
#2 start = 0;
wait(ready);
#40 $stop;
end
always #1 clk = !clk;
endmodule
Figure A2 shows the simulation waveform generated with ModelSim.
Figure A2.
Waveform of modular inversion that calculates c = b * a^(-1) mod m. The Verilog HDL code is given in Appendix B.
References
- Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209. Available online: https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866109-5/S0025-5718-1987-0866109-5.pdf (accessed on 15 October 2023). [CrossRef]
- Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology—CRYPTO’85 Proceedings; Springer: Berlin/Heidelberg, Germany, 1986; pp. 417–426. Available online: https://link.springer.com/content/pdf/10.1007/3-540-39799-X_31.pdf?pdf=inline%20link (accessed on 15 October 2023).
- Hankerson, D.; Menezes, A.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer: New York, NY, USA, 2004. [Google Scholar] [CrossRef]
- Certicom_Corp. Standards for Efficient Cryptography. SEC 2: Recommended Elliptic Curve Domain Parameters; Certicom Corp: Mississauga, ON, Canada, 2010; Available online: http://www.secg.org/sec2-v2.pdf (accessed on 15 October 2023).
- Barker, E.; Chen, L.; Roginsky, A.; Vassilev, A.; Davis, R. SP 800-56A Rev. 3, Recommendation for Pair-Wise Key-Establishment Schemes Using Discrete Logarithm Cryptography; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2018. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-56Ar3.pdf (accessed on 15 October 2023).
- Blakely, G.R. A Computer Algorithm for Calculating the Product AB Modulo M. IEEE Trans. Comput. 1983, C-32, 497–500. [Google Scholar] [CrossRef]
- Islam, M.M.; Hossain, M.S.; Hasan, M.K.; Shahjalal, M.; Jang, Y.M. FPGA Implementation of High-Speed Area-Efficient Processor for Elliptic Curve Point Multiplication over Prime Field. IEEE Access 2019, 7, 178811–178826. [Google Scholar] [CrossRef]
- Islam, M.M.; Hossain, M.S.; Shahjalal, M.; Hasan, M.K.; Jang, Y.M. Area-Time Efficient Hardware Implementation of Modular Multiplication for Elliptic Curve Cryptography. IEEE Access 2020, 8, 73898–73906. [Google Scholar] [CrossRef]
- Hu, X.; Zheng, X.; Zhang, S.; Cai, S.; Xiong, X. A Low Hardware Consumption Elliptic Curve Cryptographic Architecture over GF(p) in Embedded Application. Electronics 2018, 7, 104. [Google Scholar] [CrossRef]
- Cui, C.; Zhao, Y.; Xiao, Y.; Lin, W.; Xu, D. A Hardware-Efficient Elliptic Curve Cryptographic Architecture over GF(p). Math. Probl. Eng. 2021, 2021, 8883614. [Google Scholar] [CrossRef]
- Hossain, M.S.; Kong, Y.; Saeedi, E.; Vayalil, N.C. High-performance elliptic curve cryptography processor over NIST prime fields. IET Comput. Digit. Tech. 2017, 11, 33–42. [Google Scholar] [CrossRef]
- Liu, Z.; Liu, D.; Zou, X. An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic Curve Cryptographic Processor. IEEE Trans. Ind. Electron. 2017, 64, 2353–2362. [Google Scholar] [CrossRef]
- Di Matteo, S.; Baldanzi, L.; Crocetti, L.; Nannipieri, P.; Fanucci, L.; Saponara, S. Secure Elliptic Curve Crypto-Processor for Real-Time IoT Applications. Energies 2021, 14, 4676. [Google Scholar] [CrossRef]
- Montgomery, P.L. Modular Multiplication without Trial Division. Math. Comput. 1985, 44, 519–521. Available online: https://www.ams.org/journals/mcom/1985-44-170/S0025-5718-1985-0777282-X/S0025-5718-1985-0777282-X.pdf (accessed on 15 October 2023). [CrossRef]
- Li, Y.; Chu, W. Shift-Sub Modular Multiplication Algorithm and Hardware Implementation for RSA Cryptography. In Hybrid Intelligent Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 541–552. [Google Scholar] [CrossRef]
- Li, Y.; Chu, W. Verilog HDL Implementation for an RSA Cryptography using Shift-Sub Modular Multiplication Algorithm. J. Inf. Assur. Secur. 2022, 17, 113–121. [Google Scholar]
- Bunimov, V.; Schimmler, M. Area and time efficient modular multiplication of large integers. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, The Hague, The Netherlands, 24–26 June 2003; pp. 400–409. [Google Scholar] [CrossRef]
- Gayoso Martínez, V.; Hernández Encinas, L.; Sánchez Ávila, C. A Survey of the Elliptic Curve Integrated Encryption Scheme. J. Comput. Sci. Eng. 2010, 2, 7–13. [Google Scholar]
- Chen, L.; Moody, D.; Regenscheid, A.; Robinson, A.; Randall, K. Recommendations for Discrete Logarithm-based Cryptography: Elliptic Curve Domain Parameters. NIST Special Publication NIST SP 800-186. 2023. Available online: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-186.pdf (accessed on 15 October 2023).