A High-Performance Elliptic Curve Cryptographic Processor of SM2 over GF(p)

Abstract: Elliptic curve cryptography (ECC) is widely used in practical applications because, at the same level of security, it needs far fewer operand bits than other public-key cryptosystems such as RSA. The performance of an ECC processor is usually determined by its modular multiplication (MM) and point multiplication (PM) operations. For the recommended prime field, the MM operation can be composed of a multiplication operation and a fast reduction operation. In this paper, the 256-bit multiplication is implemented with a 129-bit (half-word) multiplier using the Karatsuba-Ofman multiplication algorithm. The fast reduction is a modulo operation that takes the 512-bit product as input and outputs a 256-bit result Z (0 ≤ Z < p). We propose a two-stage fast reduction (TSFR) algorithm over the SCA-256 prime field, which obtains an intermediate result of 0 ≤ Z < 2p instead of 0 ≤ Z < 14p as in the traditional algorithm, avoiding many repetitive subtraction operations. The PM operation is implemented with the width nonadjacent form (NAF) algorithm, and its operational schedules are improved to increase the parallelism of the multiplication and fast reduction operations. Synthesized with a 0.13 µm complementary metal oxide semiconductor (CMOS) standard cell library, the proposed processor costs an area of 280 k gates, and the PM operation takes 0.057 ms at a frequency of 250 MHz. The design is also implemented on a Xilinx Virtex-6 platform, where it consumes 27.655 k LUTs and takes 0.37 ms to perform one 256-bit PM operation, attaining a six-fold speed-up over the state-of-the-art. The processor achieves a favorable tradeoff between area and performance compared with previous designs.


Introduction
Elliptic curve cryptography (ECC) was proposed in 1986 by Miller [1] and Koblitz [2]; its security is based on the difficulty of the elliptic curve discrete logarithm problem (ECDLP). ECC has played an important role in the public key cryptography of information security. SM2 is an ECC scheme that was promulgated by the State Cryptography Administration (SCA) of China in 2010. It was added to ISO/IEC 14888-3/AMD1 in November 2017. The recommended 256-bit prime field of SM2 is a pseudo-Mersenne prime field called SCA-256 [3], and the details of SM2 can be found in [4].
The structure of the MM operation can be classified into two categories [12]: multiplier-based structures and adder-based structures. Specific prime field multiplication and Montgomery multiplication are used in multiplier-based structures [21]. The interleaved multiplication algorithm is usually applied in adder-based structures [22]. The processors in [9,10,12,17,18] are based on adders and aim at low hardware and power consumption. Most high-performance accelerators, as reported in [3,6,7,13,19], are based on multipliers. The architectures in [13] are based on Montgomery multipliers whose sizes range from 8-bit × 8-bit to 64-bit × 64-bit. Area efficiency and low latency are achieved at the sacrifice of performance in their works. The processor in [3] is based on a 256-bit × 256-bit full-word multiplier. Its fast reduction operation in the SCA-256 prime field yields an intermediate result Z (0 ≤ Z < 14p), which costs thirteen subtraction operations in the worst case to reach the final result Z (0 ≤ Z < p). Moreover, the full-word multiplier consumes much more hardware area and brings severe latency.
On the one hand, a multiplier with a small bit width results in low performance, whereas a full-word multiplier leads to large area consumption. On the other hand, traditional fast reduction algorithms are one-stage: they produce an intermediate result Z, such as Z ∈ [0, 14p) in [3] or Z ∈ (−4p, 5p) in [6], which must be followed by many iterative addition or subtraction operations to bring the final result within [0, p).
In this paper, we present a high-performance processor of SM2 over GF(p). The main contributions of this paper are as follows.

• A two-stage fast reduction (TSFR) algorithm in SCA-256 is proposed. TSFR performs fast reduction operations twice and then gets the intermediate result Z ∈ [0, 2p).

The arrangement of this paper is as follows. Section 2 reviews the elliptic curve over GF(p). Section 3 presents a high-performance processor of SM2. The implementation results of the processor are shown in Section 4, followed by the comparison with previous work. Section 5 concludes this paper.

Elliptic Curve
This subsection briefly describes the elliptic curve (EC). For a non-supersingular elliptic curve E over GF(p) with p > 3, the Weierstrass equation [23] is defined as

$$y^2 = x^3 + ax + b \pmod{p},$$

where (x, y) ∈ E, a, b ∈ GF(p) and $4a^3 + 27b^2 \neq 0 \pmod{p}$. The set of points (x, y) that satisfy the Weierstrass equation, together with the point at infinity, forms an abelian group. The elliptic curve PM operation is defined as

$$kP = \underbrace{P + P + \cdots + P}_{k\ \text{times}},$$

where P is a point of the elliptic curve and k is an integer. The width NAF point multiplication algorithm [24] is shown in Algorithm 1. The PM operation consists of a series of point addition (PA) and point doubling (PD) operations.
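As an illustration of the scalar recoding that the width NAF algorithm relies on, the following is a minimal Python sketch of the textbook width-w NAF computation. It is not taken from Algorithm 1 of this paper; the function names `width_naf` and `naf_value` are hypothetical and chosen only for this example.

```python
def width_naf(k: int, w: int) -> list:
    """Return the width-w NAF digits of k, least significant digit first."""
    if k < 0 or w < 2:
        raise ValueError("expects k >= 0 and w >= 2")
    digits = []
    while k > 0:
        if k & 1:
            # Signed residue of k modulo 2^w, in (-2^(w-1), 2^(w-1)).
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d  # k is now divisible by 2^w, so the next w-1 digits are 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def naf_value(digits: list) -> int:
    """Recombine the digits: sum of d_i * 2^i."""
    return sum(d << i for i, d in enumerate(digits))
```

For example, `width_naf(7, 2)` gives `[-1, 0, 0, 1]`, i.e., 7 = 8 − 1. Every nonzero digit is odd, so only the odd multiples of P need to be pre-calculated, and a larger w makes nonzero digits (and hence PA operations) rarer.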
Let $P_1$, $P_2$ and $P_3$ be points of the elliptic curve. The PA operation is defined as $P_3 = P_1 + P_2$ and the PD operation is defined as $P_3 = 2P_1$. To avoid inversion/division, which is a costly operation, projective coordinates are used: mixed affine-Jacobian coordinates yield the fastest PA operation, while Jacobian coordinates yield the fastest PD operation [24]. Accordingly, the PA formulas are expressed in mixed affine-Jacobian coordinates and the PD formulas in Jacobian coordinates.
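To make the coordinate systems concrete, the following Python sketch implements PD in Jacobian coordinates and PA in mixed affine-Jacobian coordinates using the standard textbook formulas for a general coefficient a. Whether the processor uses exactly this form (for instance, a variant specialized to the SM2 coefficient) is an assumption, and the function names are hypothetical. Only the generic cases are handled: the point at infinity, doubling a point with Y = 0, and adding P to ±P are left out.

```python
def jacobian_double(P, a, p):
    """Double P = (X1, Y1, Z1) in Jacobian coordinates (generic case, Y1 != 0)."""
    X1, Y1, Z1 = P
    S = 4 * X1 * Y1 * Y1 % p
    M = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    X3 = (M * M - 2 * S) % p
    Y3 = (M * (S - X3) - 8 * pow(Y1, 4, p)) % p
    Z3 = 2 * Y1 * Z1 % p
    return X3, Y3, Z3


def mixed_add(P, Q, p):
    """Add Jacobian P = (X1, Y1, Z1) and affine Q = (x2, y2), generic case P != +-Q."""
    X1, Y1, Z1 = P
    x2, y2 = Q
    Z1Z1 = Z1 * Z1 % p
    H = (x2 * Z1Z1 - X1) % p           # x2*Z1^2 - X1
    R = (y2 * Z1Z1 * Z1 - Y1) % p      # y2*Z1^3 - Y1
    HH = H * H % p
    X3 = (R * R - HH * H - 2 * X1 * HH) % p
    Y3 = (R * (X1 * HH - X3) - Y1 * HH * H) % p
    Z3 = Z1 * H % p
    return X3, Y3, Z3


def to_affine(P, p):
    """Convert Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3); needs one inversion."""
    X, Y, Z = P
    zinv = pow(Z, -1, p)  # modular inverse, Python 3.8+
    return X * zinv * zinv % p, Y * pow(zinv, 3, p) % p
```

Neither `jacobian_double` nor `mixed_add` contains an inversion; the single inversion in `to_affine` is needed only once, when the final result is converted back to affine coordinates.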

SM2 Architecture
In this section, the SM2 architecture based on one half-word multiplier is presented. The PM operation is made up of PA and PD operations. The PA and PD operations consist of MM, modular addition (MA) and modular subtraction (MS) operations. The MM operation is completed by a full-word multiplication operation followed by a fast reduction operation. The modular inversion (MI) operation is implemented using the binary modular inversion algorithm [5]. The MI operation is used to convert the Jacobian coordinates to affine coordinates at the end of the PM operation.

Modular Multiplication
In SM2, the prime field SCA-256 is defined by the prime $p = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$. In this specific prime field, the MM operation can be achieved by a multiplication operation followed by a fast reduction operation.
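The structure that makes fast reduction possible can be checked numerically. The short Python sketch below builds the SCA-256 prime, verifies that 2^256 is congruent to a sparse sum of powers of two, and gives a reference MM in which Python's `%` operator merely stands in for the hardware fast reduction. The names `P_SCA256`, `FOLD` and `modmul_reference` are hypothetical.

```python
P_SCA256 = 2**256 - 2**224 - 2**96 + 2**64 - 1

# Pseudo-Mersenne structure: 2^256 = 2^224 + 2^96 - 2^64 + 1 (mod p), so the
# high half of a 512-bit product can be folded back with additions and
# subtractions of 32-bit-aligned words.
FOLD = 2**224 + 2**96 - 2**64 + 1
assert pow(2, 256, P_SCA256) == FOLD


def modmul_reference(a: int, b: int) -> int:
    """Reference MM: full product (up to 512 bits) followed by a reduction."""
    z = a * b
    return z % P_SCA256  # placeholder for the fast reduction step
```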

A. Fast-Reduction
An existing fast reduction algorithm [3] over SCA-256 is given in Algorithm 2. It is a one-stage fast reduction operation. After a series of addition and subtraction operations, the intermediate result is Z ∈ [0, 14p). In the worst case, thirteen subtraction operations are needed to bring the final result into [0, p), and these repetitive subtraction operations introduce significant latency.
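The cost of the final correction depends only on the range of the intermediate result. The following Python sketch counts the subtractions needed to bring an intermediate value into [0, p); it models the correction step alone, not the fast reduction algorithms themselves, and `final_correction` is a hypothetical helper name.

```python
P_SCA256 = 2**256 - 2**224 - 2**96 + 2**64 - 1


def final_correction(z: int, p: int = P_SCA256):
    """Bring z >= 0 into [0, p) by repeated subtraction; return (result, count)."""
    count = 0
    while z >= p:
        z -= p
        count += 1
    return z, count


# Worst case for an intermediate result in [0, 14p): thirteen subtractions.
assert final_correction(14 * P_SCA256 - 1)[1] == 13
# Worst case for an intermediate result in [0, 2p): a single subtraction.
assert final_correction(2 * P_SCA256 - 1)[1] == 1
```

With an intermediate result bounded by 2p, the loop degenerates into one conditional subtraction, which is the case targeted by a two-stage reduction.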

B. Multiplication Structure
This subsection introduces the multiplication structure. A multiplier is typically used in traditional high-performance architectures. To avoid the large hardware consumption caused by a full-word multiplier, we split the full-word multiplication into half-word multiplications, at the cost of more clock cycles. There are some works based on half-word multipliers. A cascading multiplier structure is applied to the full-word multiplication in [7]. This structure is designed with four half-word multiplications, $a_1 b_1$, $a_1 b_0$, $a_0 b_1$ and $a_0 b_0$, as shown in the formula below.

$$(a_1 2^{128} + a_0)(b_1 2^{128} + b_0) = a_1 b_1 2^{256} + (a_1 b_0 + a_0 b_1) 2^{128} + a_0 b_0$$

The corresponding Karatsuba-Ofman multiplication algorithm in [15] is shown in Algorithm 4. The three half-word multiplication operations are performed one after another, so only one half-word multiplier is consumed. Figure 1 shows the schedule comparison between the existing structure in [15] and our structure. In [15], the half-word multiplier takes two clock cycles and the multiplication structure requires six, while our multiplication structure consumes five clock cycles. With hardware reuse, the full-word multiplication can be completed in five clock cycles with one 129-bit half-word multiplier and one 512-bit adder (subtraction can be implemented in complement form with the adder).
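The idea behind the three-multiplication scheme can be sketched in a few lines of Python. This is the common additive form of the Karatsuba-Ofman algorithm and is given only as an illustration; the exact formulation of Algorithm 4 and its mapping to clock cycles are not reproduced here, and `karatsuba_256` is a hypothetical name.

```python
HALF = 128
MASK = (1 << HALF) - 1


def karatsuba_256(a: int, b: int) -> int:
    """Multiply two 256-bit integers using three half-word multiplications."""
    a1, a0 = a >> HALF, a & MASK
    b1, b0 = b >> HALF, b & MASK
    hi = a1 * b1                    # 128-bit x 128-bit
    lo = a0 * b0                    # 128-bit x 128-bit
    mid = (a1 + a0) * (b1 + b0)     # each sum can reach 129 bits
    cross = mid - hi - lo           # equals a1*b0 + a0*b1
    return (hi << (2 * HALF)) + (cross << HALF) + lo
```

The sums `a1 + a0` and `b1 + b0` can be one bit wider than a half word, which is consistent with a 129-bit rather than a 128-bit half-word multiplier. The remaining work consists of additions, subtractions and shifts on values of up to 512 bits.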

Point Addition and Point Doubling
Since the PM operation consists of PA and PD operations, an efficient implementation of the modular multiplication operation does not necessarily yield a high-performance PM operation. Algorithm optimization at the point arithmetic layer is also very important.
Algorithm 5 gives the traditional point addition and point doubling algorithms shown in [23]. Each step can only perform one operation at the modular arithmetic layer because these algorithms are not designed to enhance computational parallelism. There are 18 steps including 11 MM operations in the PA algorithm, while there are 17 steps containing eight MM operations in the PD algorithm.
Algorithm 5: Point addition and point doubling algorithms shown in [23]
Input: $P_1 = (X_1, Y_1, Z_1)$ in Jacobian coordinates, $P_2 = (x_2, y_2)$ in affine coordinates
Output:
There are some modified point addition and point doubling algorithms that reduce the number of computation steps, such as Algorithm 6 reported in [3]. To improve performance, each step except the last one should perform an MM operation. There are 13 steps including 12 MM operations in its PA algorithm, while there are nine steps containing eight MM operations in its PD algorithm.
Algorithm 6: Point addition and point doubling algorithms reported in [3]
Input: $P_1 = (X_1, Y_1, Z_1)$ in Jacobian coordinates, $P_2 = (x_2, y_2)$ in affine coordinates
Output:
In this paper, we propose the novel point addition and point doubling algorithms given in Algorithm 7. There are 11 steps including 11 MM operations in total in our PA algorithm, which is fewer steps than in [3,23] and fewer MM operations than in [3]. There are eight steps containing eight MM operations in our PD algorithm, which is fewer steps than in [3,23]. Therefore, our PA and PD algorithms are more efficient than those in [3,23].
Figure 2 gives the detailed pipeline operational schedules. Figure 2a,b shows the internals of the PA and PD operations. The fast reduction, MA and MS operations run completely in parallel with the continuous multiplication. Figure 2c demonstrates the transition from a PA operation to a PD operation, and Figure 2d shows the transition from one PD operation to the next. In these schedules, the multiplication operation runs constantly and is not interrupted by the switch from PA to PD or from PD to PD.
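The benefit of keeping the multiplier constantly busy can be estimated with a simple cycle model. The Python sketch below assumes an idealized pipeline with no data stalls, five clock cycles per full-word multiplication and one clock cycle per fast reduction; it is a back-of-the-envelope model rather than the schedule of Figure 2, and `mm_sequence_cycles` is a hypothetical name.

```python
MULT_CYCLES = 5    # clock cycles per full-word multiplication
REDUCE_CYCLES = 1  # clock cycles per fast reduction


def mm_sequence_cycles(n: int, overlapped: bool = True) -> int:
    """Cycles for n back-to-back MM operations in an idealized pipeline."""
    if n <= 0:
        return 0
    if overlapped:
        # The reduction of one MM runs while the next multiplication starts,
        # so only the last reduction adds to the total.
        return n * MULT_CYCLES + REDUCE_CYCLES
    return n * (MULT_CYCLES + REDUCE_CYCLES)


# 11 MM operations (one PA) and 8 MM operations (one PD):
print(mm_sequence_cycles(11), mm_sequence_cycles(11, overlapped=False))  # 56 66
print(mm_sequence_cycles(8), mm_sequence_cycles(8, overlapped=False))    # 41 48
```

Under these assumptions the average cost per MM approaches five clock cycles as the sequence grows, since the reduction latency is hidden behind the next multiplication.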

SM2 Architecture
In this section, an SM2 architecture is demonstrated. The block diagram of SM2 is given in Figure 3. The Control unit block is a two-level controller. The top level is responsible for the point arithmetic layer, including the PM, PA and PD operations, and the sublevel is in charge of the modular arithmetic layer, including the MA, MS, MM and MI operations. The MM operation is run by the Mult. unit and the NAF unit blocks. The Inversion unit block is used for the coordinate conversion from Jacobian coordinates to affine coordinates at the end of the PM operation. To save hardware area, the Inversion unit block can also perform the MA/MS operations.
Elliptic curve cryptosystems include Elliptic Curve Digital Signature Algorithm (ECDSA) signature generation, ECDSA signature verification, and Elliptic Curve Integrated Encryption Scheme (ECIES) encryption and decryption [23]. The PM operation of the architecture presented here is limited to the SCA-256 prime field. The architecture can also perform the PM operation in another specific prime field if a fast reduction algorithm for that prime field is provided. The proposed architecture mainly focuses on accelerating the PM operation, and it can be configured to perform operations at the modular arithmetic layer as well. Through hardware/software co-design, our hardware module can be used by software in an embedded system to realize ECC encryption and decryption as well as signature generation and verification.

Hardware Implementation Result
The ECC architecture described above was implemented in Verilog HDL. The architecture was synthesized by Synopsys Design Compiler with the SMIC 130-nm CMOS standard cell library. The circuit area was evaluated in units of two-input NAND gates. For better comparison, the architecture was also implemented on different Xilinx FPGA devices, including Virtex-6 xc6vlx760, Virtex-5 xc5vfx130t and Virtex-4 xc4vlx200. Xilinx ISE 14.7 was chosen for synthesis, mapping, and routing.
In Algorithm 1, the results of $P[i] = iP$, $i \in \{1, 3, \ldots, 2^{w-1} - 1\}$, can be pre-calculated. The clock cycle count of the PM operation then depends on the following quantities: m is the length of the scalar k, w is the width of the NAF, N is the number of clock cycles for calculating NAF(k), D and A are the numbers of clock cycles needed to perform the PD and PA operations, and I stands for the number of clock cycles of the MI operation.
Table 1 shows the clock cycles of each EC operation. The MM operation took five clock cycles. The number of PA operations could be reduced by coding the scalar k in width-4 NAF (NAF$_4$) in the PM operation. Over 1000 tests, the PM operation required 14,242 clock cycles on average, including the time of the coordinate conversion, which took two MI operations.
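The relation between the measured cycle count and the reported execution time is a one-line calculation. The Python sketch below converts the average of 14,242 clock cycles into milliseconds at the 250 MHz clock frequency reported for the ASIC implementation; `pm_time_ms` is a hypothetical helper name.

```python
def pm_time_ms(cycles: int, freq_mhz: float) -> float:
    """Execution time in milliseconds for a cycle count and clock frequency."""
    return cycles / (freq_mhz * 1e6) * 1e3


# 14,242 cycles at 250 MHz is about 0.057 ms.
print(round(pm_time_ms(14242, 250), 3))  # 0.057
```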
Table 2 shows the hardware consumption of each block on the ASIC platform. The Mult. unit block consists of one 129-bit multiplier, one 512-bit adder and some registers. As the table shows, the Mult. unit block occupies half of the total area, since the multiplier consumes a lot of hardware resources. Table 3 gives the resource utilization on the FPGA platform.
Table 4 compares the ECC hardware performance of different architectures on the ASIC platform. The processor in [3] is designed with a full-word multiplier and took 0.02 ms to perform a PM operation. Although the full-word multiplier needed only one clock cycle to perform a full multiplication operation, it brought large area consumption and severe latency. That design has a circuit area of 659 k gates, 2.35 times ours, and can run at a frequency of 163.7 MHz, while our design based on a half-word multiplier runs at a frequency of 250 MHz. Our design can run at a higher frequency and has better point operational schedules, and the width of NAF w is set to 4, whereas w is set to 3 in [3]. Therefore, although the multiplication operation of our design took five clock cycles, five times that of the method in [3], the PM operation of our design took only 2.8 times as long as that in [3], rather than 5 times.
In [9,12], the MM and MI operations are based on adders, using the interleaved modular multiplication algorithm and the binary inversion algorithm. The design in [9] has two multiplier units and two inversion units, whereas the design in [12] combines the inversion unit and the multiplier unit into one common unit. Therefore, the processor area of the design in [9] is 2.5 times larger than that of the design in [12]. Because their PA and PD operations are implemented in affine coordinates, both MM and MI operations are required in each PA or PD operation. Since both designs are based on adders, they cost less area but need more run time for the PM operation, reaching 3.01 ms in [9] and 4.07 ms in [12].
The processors proposed in [13,14] employ NAF in the PM operation and do not focus on optimizing the operational schedules of the PM operation. The processor in [13] employs Montgomery multipliers with sizes ranging from 8-bit × 8-bit to 64-bit × 64-bit. A small multiplier results in low hardware consumption but also low performance. As a result, this processor costs an area of 120.26 k gates, 57.05% less than ours, and requires 2.47 ms, 43 times as long as ours. The processor in [14] is based on a systolic arithmetic unit. It can run at a very high frequency of 556 MHz and takes 1.01 ms for one PM operation, 2.65 times faster than the design in [13].
The design we propose is based on a half-word multiplier, and its point operational schedules are optimized to increase efficiency. Synthesized with a 130 nm CMOS standard cell library, the area of our design is 280 k gates, with a PM operation time of 0.057 ms at a frequency of 250 MHz. In terms of the area-time (AT) parameter, our design is smaller than those in [9,12,13,14].
Table 5 shows the performance results on the FPGA platform. The designs in [9,10,12,17,18] are all based on adders and have lower performance than our design. The multiplications reported in [9,12] are implemented with the interleaved modular multiplication algorithm, which is the simplest multiplication algorithm and uses fewer adders. The radix-8 Booth encoded interleaved modular multiplication algorithm is applied to realize multiplication in [10], while the radix-4 Booth encoded interleaved modular multiplication algorithm is adopted in [17,18].
• The TSFR algorithm avoids many iterations of the subtraction operation to get the final result.
• The multiplication operation is implemented with a half-word multiplier using the Karatsuba-Ofman multiplication algorithm and takes five clock cycles. The MM operation includes five clock cycles of multiplication and one clock cycle of fast reduction. With the pipeline design, the MM operation completes in five clock cycles on average.
• A high-performance ECC architecture based on a half-word multiplier is proposed. The PM operation consists of a series of point addition (PA) and point doubling (PD) operations. Novel operational schedules of PA and PD are presented to reduce the MM operations and to improve the parallelism of the multiplication and fast reduction operations.

Figure 1. Comparison of the multiplication operational schedule between the existing structure and our structure.

Figure 2. Operational schedule without idle clock cycles: (a) point addition; (b) point doubling; (c) between point addition and point doubling; and (d) between point doubling and point doubling.

Table 1. The clock cycles of EC operations.

Table 2. Hardware consumption of each block on ASIC platform.

Table 3. Resource utilization on FPGA platform.

Table 4. ECC hardware performance comparison on ASIC platform.