A Low Hardware Consumption Elliptic Curve Cryptographic Architecture over GF(p) in Embedded Application

Abstract: In this paper, a low hardware consumption design of elliptic curve cryptography (ECC) over GF(p) for embedded applications is proposed. An adder-based architecture is explored to reduce the hardware consumption of scalar multiplication (SM). The Interleaved Modular Multiplication Algorithm and the Binary Modular Inversion Algorithm are improved and implemented with two full-word adder units. The full-word register units for data storage are also optimized. The design is based on two full-word adder units and twelve full-word register units in a pipeline structure and was implemented on the Xilinx Virtex-4 platform. Design Compiler is used to synthesize the proposed architecture with a 0.13 µm CMOS standard cell library. For field orders of 160, 192, 224 and 256, the proposed architecture consumes 5595, 7080, 8423 and 9370 slices, respectively, and saves 17.58∼54.93% of slice resources on the FPGA platform compared with other design architectures. The synthesized result uses 35.43 k, 43.37 k, 50.38 k and 57.05 k gates and saves 52.56∼91.34% in gate count in comparison. The design takes 2.56∼4.07 ms to perform an SM operation over the different field orders at a frequency of 150 MHz. The proposed architecture is safe from simple power analysis (SPA). Thus, it is a good choice for embedded applications.


Introduction
Elliptic curve cryptography (ECC) is an asymmetric cryptosystem proposed in 1986 by Miller [1] and Koblitz [2]. The main advantage of ECC is that it uses a smaller key than some other methods, such as the RSA encryption algorithm, to provide a comparable or higher level of security. International standards organizations, such as NIST [3], ANSI [4] and IEEE [5], have standardized ECC.
Many hardware implementations of ECC have been proposed [6–18]. Accelerators for modular multiplication (MM) can be divided into two categories: multiplier-based architectures and adder-based architectures. Multiplier-based architectures include specific prime field multiplication and Montgomery multiplication [19]. Adder-based architectures use the Interleaved Multiplication algorithm [20]. The design in [8] is based on a modified Montgomery multiplication algorithm using an r-bit × r-bit multiplier. The designs in [9,10] are based on a specific prime field and use a full-size n-bit × n-bit multiplier and a fast reduction operation to perform MM. However, multiplier-based architectures consume a large hardware area. Modular inversion (MI) is another costly operation in ECC. The Binary Modular Inversion Algorithm is a well-known adder-based MI algorithm. The MM and MI units in the design of [12] are adder-based, but the two operation units are independent and do not share adders with each other.
In this paper, an adder-based architecture with low hardware consumption over GF(p) is proposed. The main contributions of this paper are as follows:

• The Interleaved Modular Multiplication Algorithm and the Binary Modular Inversion Algorithm are carefully improved to make full use of the adder and register hardware resources. MM and MI are implemented with two full-word adder units and four full-word register units.

• The utilization of registers is optimized to minimize the hardware area. For data registers, modular addition (MA), modular subtraction (MS), MM and MI consume four full-word register units, and the scalar multiplication (SM) operation uses eight full-word register units.

• The architecture is flexible and safe from SPA. The parameters, such as the prime value p, the elliptic curve point P and the scalar value k, can be easily deployed without hardware reconfiguration.
The rest of the paper is arranged as follows. Section 2 reviews elliptic curves (EC) and EC scalar multiplication (SM). Section 3 presents a low hardware consumption architecture over GF(p). Section 4 gives the implementation results, followed by analysis and comparison with other designs. The paper is concluded in Section 5.

Elliptic Curve Over GF(p)
This section provides a brief introduction to elliptic curves over GF(p) and the finite field arithmetic involved. More information about elliptic curve cryptographic primitives can be found in [5,21]. A non-supersingular elliptic curve E over GF(p) for p > 3 can be described by the Weierstrass equation

$$y^2 = x^3 + ax + b \pmod{p} \tag{1}$$

where x, y, a and b are elements of GF(p) and 4a^3 + 27b^2 ≠ 0 (mod p). The set of points (x, y) that satisfy the Weierstrass equation, together with the point at infinity, forms an abelian group.
In affine coordinates, the point addition (PA) and point doubling (PD) operations can be represented as follows. Assuming that P1 = (x1, y1) and P2 = (x2, y2) are on the elliptic curve, the PA formulas for computing P3(x3, y3) = P1(x1, y1) + P2(x2, y2) are:

$$\lambda = \frac{y_2 - y_1}{x_2 - x_1}, \qquad x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1 \tag{2}$$

When P1 = P2, i.e., when a point is added to itself, this special case is called the PD operation, and the formulas become:

$$\lambda = \frac{3x_1^2 + a}{2y_1}, \qquad x_3 = \lambda^2 - 2x_1, \qquad y_3 = \lambda (x_1 - x_3) - y_1 \tag{3}$$

In this paper, only two full-word adder units are needed to perform these operations. In affine coordinates, the PA operation consists of one modular inversion (MI), two modular multiplication (MM) and six modular addition/subtraction (MA/MS) operations, whereas the PD operation needs one more MM and two more MA/MS operations than PA. Optimizing the MM and MI operations is therefore significant both for reducing power dissipation and for the overall performance of the SM operation.
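The affine PA and PD formulas can be exercised with a short Python sketch. The toy curve y^2 = x^3 + 2x + 2 over GF(17), the point (5, 1) and the function names are illustrative assumptions, not parameters from this paper, and Python's built-in `pow(x, -1, p)` merely stands in for the MI unit.

```python
# Toy parameters (assumed for illustration only): y^2 = x^3 + 2x + 2 over GF(17)
P_MOD = 17
A_COEF = 2

def point_add(p1, p2, p=P_MOD):
    """Affine point addition for two distinct points with x1 != x2."""
    (x1, y1), (x2, y2) = p1, p2
    # lambda = (y2 - y1) / (x2 - x1): two subtractions and one modular division
    lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3

def point_double(p1, a=A_COEF, p=P_MOD):
    """Affine point doubling for a point with y1 != 0."""
    x1, y1 = p1
    # lambda = (3*x1^2 + a) / (2*y1)
    lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    x3 = (lam * lam - 2 * x1) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3

if __name__ == "__main__":
    g = (5, 1)
    g2 = point_double(g)
    print(g2, point_add(g, g2))
```

Counting the operations in each function reproduces the costs quoted above when the division is treated as a single MI: one MI, two MM and six MA/MS for PA, and one MI, three MM and eight MA/MS for PD (forming 3x1^2 and 2y1 by repeated addition).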

Elliptic Curve Scalar Multiplication
Scalar multiplication (SM) is a fundamental operation of elliptic curve cryptosystems. Scalar multiplication adds an EC point P to itself k times, denoted kP, where k = (k_{l−1}, ..., k_0) and l is the binary length of k. The scalar multiplication algorithm needs to resist simple power analysis (SPA) attacks. Therefore, it is necessary to perform scalar multiplication as described in Algorithm 1 below, where the PA and PD operations are performed for every bit of the scalar.

Algorithm 1: Elliptic Curve SPA Resistant Scalar Multiplication
Input: scalar k and EC point P
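As an illustration of performing both PA and PD for every scalar bit, the following Python sketch uses the textbook double-and-add-always method. It is an assumption about the general shape of such an algorithm, not a transcription of Algorithm 1; the helper names and the toy curve in the test are hypothetical, and Python-level code is of course not itself power-analysis resistant.

```python
def ec_add(p1, p2, a, p):
    """Affine addition on y^2 = x^3 + ax + b over GF(p); None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P) = point at infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # doubling slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # addition slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3

def scalar_mult_always(k, point, a, p):
    """Left-to-right double-and-add-always: one PD and one PA per scalar bit."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result, a, p)   # PD is always executed
        added = ec_add(result, point, a, p)     # PA is always executed
        result = added if bit == "1" else result  # keep or discard the PA result
    return result

if __name__ == "__main__":
    # Toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumed example values)
    print(scalar_mult_always(3, (5, 1), 2, 17))
```

Because the sequence of point operations does not depend on the bit values, the operation pattern seen by an observer is the same for every scalar of a given length, which is the property the SPA countermeasure relies on.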

Scalar Multiplication Architecture
In this section, a bottom-up optimization over GF(p) is presented, which takes advantage of maximum reuse of the adder units.

Modular Addition/Subtraction
Modular addition (MA) and modular subtraction (MS) operations are each performed as a two-step sequence of addition and subtraction according to Algorithm 2 below. The most significant bit of the subtraction result can be used as the result of comparing the two numbers, for example C[1]_n in the MA operation, where n is the length of p. On an FPGA or ASIC, addition and subtraction can be implemented with almost identical hardware, that is, with the same adder. In the proposed design, MA and MS operations are performed in one cycle, so two full-word adders are needed. The adder is considered the minimum unit.

Algorithm 2: Modular Addition and Subtraction in GF(p)
Modular addition:
Input: A, B, p: 0 ≤ A, B < p, p is the field prime.
Output: R: R = (A + B) mod p

Modular subtraction:
Input: A, B, p: 0 ≤ A, B < p, p is the field prime.
Output: R: R = (A − B) mod p

The architecture of the two full-word adder units used is illustrated in Figure 1. The left diagram is the minimum adder unit, which can easily be modified to perform subtraction using the complement of B and c = 1, as shown on the right.
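The two-adder scheme described above can be sketched in Python as follows. This is a minimal model under stated assumptions: the function names are hypothetical, results are held in (n + 1)-bit two's complement words, and bit n of the second adder's input or output plays the role of the comparison flag. It is not the authors' exact Algorithm 2.

```python
def mod_add(a, b, p, n):
    """(a + b) mod p using two additions; n is the bit length of p."""
    mask = (1 << (n + 1)) - 1          # (n + 1)-bit word
    s0 = (a + b) & mask                # adder 1: A + B (fits, since A + B < 2p < 2^(n+1))
    s1 = (s0 - p) & mask               # adder 2: (A + B) - p in two's complement
    negative = (s1 >> n) & 1           # MSB set means A + B < p
    return s0 if negative else s1

def mod_sub(a, b, p, n):
    """(a - b) mod p using two additions; n is the bit length of p."""
    mask = (1 << (n + 1)) - 1
    d0 = (a - b) & mask                # adder 1: A - B in two's complement
    negative = (d0 >> n) & 1           # MSB set means A < B
    d1 = (d0 + p) & mask               # adder 2: (A - B) + p
    return d1 if negative else d0

if __name__ == "__main__":
    print(mod_add(20, 15, 23, 5), mod_sub(3, 15, 23, 5))
```

In hardware both adders evaluate in the same cycle and the MSB simply drives the output multiplexer, which is why no separate comparator is required.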

Modular Multiplication
Modular multiplication (MM) is an important component in the implementation of the SM operation. Traditional high-speed MM implementations use Montgomery multiplication or modular multiplication over a specific prime field. For affine coordinates, we choose the Standard Interleaved Modular Multiplication shown in Algorithm 3 below. This algorithm has some disadvantages: (1) it needs three additions with carry propagation in steps 3 to 5; (2) the comparisons in steps 4 and 5 require checking the full length of the operands in the worst case. The carry propagation of the additions has a significant influence on latency, and the comparison, which starts from the MSB, must be performed before the addition. These two operations cannot be pipelined without delay. Researchers have previously tried to address these problems, for example in [22], where Algorithm 4 of [22] uses carry-save adders to perform the additions inside the loop and Algorithm 5 of [22], an optimized version, uses a lookup table to reduce both area and time. Both algorithms have high complexity and are unsuitable for the design proposed in this paper.
An improved Interleaved Modular Multiplication Algorithm is given in Algorithm 4. Step 4 of Algorithm 4 replaces steps 4 and 5 of Algorithm 3. This modification addresses the timing latency of the comparison and uses only two adders per iteration. After step 5 in Algorithm 3, because R may be greater than 2p, the computation of (R mod p) needs two clock cycles with one full-word adder. In step 4 of Algorithm 4, (R − (R_{n+1}, R_n) · p) is computed instead of (R mod p), saving one cycle in every iteration compared with Algorithm 3. Here (R_{n+1}, R_n) denotes the two most significant bits of R, with a value of 0∼3.

Algorithm 3: Standard Interleaved Modular Multiplication Algorithm
Input: X, Y, p: 0 ≤ X, Y < p, p is the field prime.
Output: R: R = X * Y mod p
1: R = 0
2: for i = n − 1 downto 0 do

The algorithm is implemented using the architecture shown in Figure 2. The implementation uses only the two full-word adder units shown in Figure 1 and one full-word register unit. For simplicity, Figure 2 omits the output data and the mode-switch multiplexers that select the input data to the adders for MA, MS, MM and MI. The Mult. Counter block, a down-counting iteration counter, generates the control signal that selects the i-th bit of X and the selection signal for the Mux in front of the adders. When the counter reaches 0, the iteration finishes. In Algorithm 4, steps 3 to 5 are performed in one cycle. Thus, the loop in steps 2 to 5 of Algorithm 4 takes n cycles and MM consumes (n + 1) cycles, where n is the field order. Step 6 is required to make sure the result R ∈ (0, 2p).
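For reference, the textbook form of standard interleaved modular multiplication (the setting of Algorithm 3) can be sketched in Python as below. This is the well-known shift, conditional add and two conditional subtractions scheme, written here as an assumption about the general method; it is not the improved Algorithm 4, and the function name is hypothetical.

```python
def interleaved_mod_mul(x, y, p):
    """Return x * y mod p for 0 <= x, y < p by scanning x from its MSB."""
    n = p.bit_length()
    r = 0
    for i in range(n - 1, -1, -1):
        x_i = (x >> i) & 1
        r = 2 * r + x_i * y     # shift left and conditionally add Y; r < 3p here
        if r >= p:              # first conditional subtraction
            r -= p
        if r >= p:              # second conditional subtraction, so r < p again
            r -= p
    return r

if __name__ == "__main__":
    print(interleaved_mod_mul(7, 9, 23))  # 63 mod 23
```

Each iteration contains one addition and up to two compare-and-subtract steps, which is the latency issue that motivates replacing the comparisons with a subtraction selected by the top bits of R.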

Modular Inversion
Modular inversion (MI) is another significant component in the implementation of the SM operation. In order to use the same adder units shown in Figure 1, the Binary Modular Inversion Algorithm presented in [23] is selected. By assigning a instead of 1 to the variable s in step 1 of Algorithm 5 below, the result y satisfies y = a/x. In this way, both the MI and the following MM are achieved in a single run. To guarantee that all operations of each step finish in one cycle, one comparator and three adders would be needed in step 4. The comparator is needed for the comparison of u and v. Two adders are used for the modular subtraction ((r − s) mod p) and one adder is required for the subtraction (v − u).

Algorithm 5: Binary Modular Inversion Algorithm
Input: p, x ∈ (0, p)
Output: y, satisfying xy = 1 mod p
step1: if (s < 0) return s = s + p; else return s. else go to step 2.

Algorithm 5 is the modified version of the Binary Modular Inversion Algorithm, designed to achieve minimum hardware consumption. In step 4, the comparison result of u and v can be pre-calculated in step 2 or step 3, so that this step only completes the calculation of (r − s) and (u − v), or of (s − r) and (v − u). If r is negative after step 4, (r/2 mod p) can still be calculated in step 2 by adding p to r and right shifting whenever r is odd, whether positive or negative. Similar cases are handled in the same way.
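The idea of obtaining a/x directly from a binary inversion can be illustrated with the following Python sketch of the textbook binary extended Euclidean method. It is an assumption-based model, not the authors' Algorithm 5: the variable names u, v, r, s follow the text, but the step structure, the loop condition and the choice to keep r and s reduced modulo p (rather than signed) are simplifications made here.

```python
def binary_mod_div(a, x, p):
    """Return a / x mod p for an odd prime p and 0 < x < p (a = 1 gives the inverse)."""
    u, v = p, x
    r, s = 0, a % p                 # invariants: x*r = a*u and x*s = a*v (mod p)
    while u != 1 and v != 1:
        while u % 2 == 0:           # halve u and r (r/2 mod p)
            u //= 2
            r = r // 2 if r % 2 == 0 else (r + p) // 2
        while v % 2 == 0:           # halve v and s (s/2 mod p)
            v //= 2
            s = s // 2 if s % 2 == 0 else (s + p) // 2
        if u >= v:                  # subtract the smaller from the larger
            u -= v
            r = (r - s) % p
        else:
            v -= u
            s = (s - r) % p
    return r if u == 1 else s

if __name__ == "__main__":
    print(binary_mod_div(1, 5, 23), binary_mod_div(7, 5, 23))
```

Every update is a shift, an addition or a subtraction, so the whole routine maps onto adders only, and seeding s with a instead of 1 removes the separate multiplication that would otherwise follow the inversion.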
Figure 3 gives the architecture implementing Algorithm 5. For simplicity, the output data and the mode-switch multiplexers, which select the input data of the adders for MA, MS, MM and MI, are omitted in Figure 3. The Inversion Ctrl block is a state machine with seven states: 3'b111 for step 1, 3'b000 for step 2, 3'b001 for step 3, 3'b010 and 3'b011 for step 4, 3'b100 for step 5 and 3'b101 for finish. Table 1 shows the data and operators of each step in Algorithm 5. In step 2 or step 3, in addition to performing (r/2 mod p) or (s/2 mod p), (u/2 − v) or (u − v/2) should be executed in order to pre-calculate the comparison result of u and v for step 4.
At each iteration of steps 2 and 3, either u or v is reduced by one bit in length, so the total number of such iterations is 2n, where n is the field order. In the worst case, the number of iterations of step 4 is 0.5n. Therefore, the overall number of iterations is at most 2.5n.

Point Addition and Point Doubling
The MA, MS, MM and MI operations have been introduced above. Because all of these operations use the same adder units and the same register units, they must be performed one after another. Point addition (PA) and point doubling (PD) operations are composed of these operations. Their scheduling in the PA and PD operations of the proposed architecture is given by Algorithm 6. Note that the PA and PD operations need six full-word registers: x1, y1, x2, y2, t1, t2. The remaining two registers are used for the SM scalar k and the prime p.

The Scalar Mult Core block performs the modular operations (MA, MS, MI, MM) and the point operations (PA, PD, SM). The finish-flag signal is set high when all calculations are done and the result is ready for the master system.
The Main Controller block is a state machine that controls the PA, PD and SM operations. It also controls data transfer from the Register Block to the Modular Controller and from the Modular Controller to the Register Block.
The Modular Controller block performs one of the MA, MS, MM and MI operations at a time.

Implementation and Result
The elliptic curve scalar multiplication architecture described above was implemented in Verilog HDL. The design was synthesized using Design Compiler with a 130-nm CMOS standard cell library. The hardware area is evaluated in units of a two-input NAND gate. The architecture was also implemented on the Xilinx Virtex-4 xc4vsx35 FPGA platform, using Modelsim for simulation and Xilinx ISE 14.7 for synthesis, mapping and routing.
Since the MI operation can also perform the computation a/b mod p, the PA operation needs 1 MI, 2 MMs and 6 MA/MSs, and the PD operation needs 1 MI, 3 MMs and 8 MA/MSs in Algorithm 6. The numbers of cycles for the PA and PD operations are given by (4):

$$C_{PA} = I + 2M + 6A, \qquad C_{PD} = I + 3M + 8A \tag{4}$$

where I is the number of cycles of the MI operation, M is the number of cycles of the MM operation and A is the number of cycles of the MA/MS operation. The total number of cycles to perform the SM operation is given by (5).
where n is the prime field order. Table 2 shows the execution cycles of the different operations over the 160∼256 field orders. In 100 tests, a 256-bit EC takes 1066 cycles for the PA operation, 1325 cycles for the PD operation and 610 k cycles for the SM operation. The proposed architecture costs two full-word adder units and twelve full-word register units. Table 3 shows that the registers and adders together occupy 42% of the total area on average. The register figure includes the twelve register units for data storage and the other non-combinational cells. As the bit width increases, the share of hardware consumed by the adder units increases from 13.72% to 15.09%.
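The reported cycle counts can be cross-checked with a few lines of Python arithmetic. The sketch below assumes A = 1 cycle (MA and MS finish in one cycle), M = n + 1 cycles (as stated for MM), and an MI cycle count I that is inferred here from the reported PA figure rather than taken from the paper; the function names are hypothetical.

```python
def pa_cycles(i, m, a):
    """Cycles for point addition: 1 MI, 2 MM, 6 MA/MS."""
    return i + 2 * m + 6 * a

def pd_cycles(i, m, a):
    """Cycles for point doubling: 1 MI, 3 MM, 8 MA/MS."""
    return i + 3 * m + 8 * a

n = 256                      # field order (bit length)
M = n + 1                    # MM cycles, per the text
A = 1                        # MA/MS cycles, per the text
I = 1066 - 2 * M - 6 * A     # MI cycles inferred from the reported PA count (assumption)

sm_cycles = 610_000          # reported SM cycles for the 256-bit case
freq_hz = 150e6              # reported clock frequency
sm_time_ms = sm_cycles / freq_hz * 1e3

if __name__ == "__main__":
    print(I, pa_cycles(I, M, A), pd_cycles(I, M, A), round(sm_time_ms, 2))
```

With the inferred I, the PD formula reproduces the reported 1325 cycles, and 610 k cycles at 150 MHz correspond to the 4.07 ms quoted for the 256-bit field, so the reported figures are mutually consistent under these assumptions.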
The inversion and multiplier units in [12] are implemented using the Binary Inversion Algorithm and the Interleaved Modular Multiplication Algorithm, similar to this work. However, the design in [12] uses two inversion units and two multipliers, whereas the proposed design uses one inversion unit and one multiplier, both based on the same two adder units. As shown in Table 4, for the 256-bit field order, the design in [12] requires 167.5 k gates while this design uses 57.05 k gates, saving 65.94% of hardware resources. For this field order, the AT parameter of this design is 232 and that of [12] is 504, so this design has the better area-time product. The proposed design saves 64.77% to 65.94% of hardware resources compared with [12] over the different field orders.

The design in [8] adopts a word-based Montgomery multiplier with an optimized data bus and an on-the-fly redundant binary converter to boost the throughput of the EC scalar multiplication. Compared with [8] over the different field orders, the proposed design saves 52.56∼69.85% of hardware resources. The design in [10] is based on a full-size 256-bit × 256-bit multiplier and uses 659 k gates. The design in [15] is based on a systolic arithmetic unit and operates at a higher frequency of 556 MHz. The designs in [10,15] have smaller AT parameters than the proposed design. Compared with the designs in [8,10,12,15], the proposed design consumes the least hardware, saving 52.56∼91.34% of hardware resources.

The processor in [16] consists of a CPU, data RAM, program memory and other components. Although it consumes only 11.686 k gates, one scalar multiplication requires up to 1003 k clock cycles, 2.93 times as many as the proposed design. The scalar multiplication execution time and energy of [16] can be computed from its cycles, frequency and power. The total energy to perform one 192-bit scalar multiplication is 18.65 µJ for the proposed design and 114.20 µJ for the processor in [16]. Because of its low frequency of 1.695 MHz, the processor in [16] has a very long scalar multiplication execution time of 592 ms.

The processor in [17] uses five m-bit registers and requires seven multiplications per key bit in GF(2^m). It has a smaller area and lower energy than the proposed design. However, no detailed information about its scalar multiplication execution time or frequency is given, so its SM operation time cannot be compared with that of the proposed design. The design in [18] adopts a register and bit-level multiplier sharing method and consumes 11.571 k gates. The complexity of scalar multiplication differs between GF(p) and GF(2^m); for example, modular addition needs a carry-propagating addition and a modulo-p operation in GF(p), while it needs only an XOR operation in GF(2^m). Modular addition is the fundamental operation in a low-area ECC architecture. Therefore, it is difficult to judge which of [17,18] and the proposed design is better in terms of area consumption and operation time.
Table 5 provides detailed data of the proposed hardware implementation of the EC designs over the 160, 192, 224 and 256 field orders on the given FPGA platform. It consumes 239 k, 342 k, 468 k and 610 k clock cycles and takes 5595, 7080, 8423 and 9370 slices, respectively, to perform the SM operation. For the 160 field order, the SM operation costs 239 k clock cycles with 9199 slice LUTs, 2833 flip-flops and 8 DSP48s.

Table 6 shows the performance data of several existing FPGA implementations of EC scalar multiplication. The architecture in [6] is based on a unified Add/Sub/Mul unit. It consumes 13,158 slices for the 256 field order, while the architecture designed in this paper consumes 9370 slices. On the same platform, the proposed architecture saves 17.58∼28.79% of the used slices compared with [6] over the different field orders. The architectures proposed in [6,12] are capable of resisting SPA. The architectures in [13,14] are designed to implement the Elliptic Curve Digital Signature Algorithm over GF(2^163), and their scalar multiplication implementation data are given in Table 6. Compared with [13] over the 163 field order, the architecture provided in this paper uses 42.14% fewer slices over the 160 field order and 26.78% fewer slices over the 192 field order. The proposed architecture has the lowest hardware consumption among the designs given in Table 6.

Table 1. Inversion state and Mux.

Table 2. Execution cycles for different operations.

Table 3. Hardware consumption of register and adder on ASIC.

Table 4. ECC Hardware Performance Comparison on ASIC.