A Low-Complexity Edwards-Curve Point Multiplication Architecture

Binary Edwards Curves (BEC) are becoming increasingly important compared to other forms of elliptic curves, thanks to their faster operations and resistance against side-channel attacks. This work provides a low-complexity architecture for point multiplication computations using BEC over GF(2^233). There are three major contributions in this article. The first contribution is the reduction of instruction-level complexity for unified point addition and point doubling laws by eliminating multiple operations in a single instruction format. The second contribution is the optimization of hardware resources by minimizing the number of required storage elements. Finally, the third contribution is the reduction of the number of required clock cycles by incorporating a 32-bit finite-field digit-parallel multiplier in the datapath. As a result, the achieved throughput-over-area ratios over GF(2^233) on Virtex-4, Virtex-5, Virtex-6 and Virtex-7 Xilinx FPGA (Field Programmable Gate Array) devices are 2.29, 19.49, 21.5 and 20.82, respectively. Furthermore, on the Virtex-7 device, the required computation time for one point multiplication operation is 18 µs, while the power consumption is 266 mW. This reveals that the proposed architecture is best suited for applications where both the area and throughput parameters must be optimized at the same time.


Introduction
The internet-of-things (IoT) refers to a global network where billions of heterogeneous devices connect to an unsecured internet [1]. The connected devices share information or data with each other. Since most devices in an IoT framework have constrained resources, data are usually stored in the cloud [2]. As a result, users can continuously upload and download data from anywhere using the internet [3]. Due to this enormous volume of communication between IoT devices through a cloud, they are subject to malicious attacks [4]. Security concerns arise, and various threats and attacks may occur, as data owners have no control over the data management [5]. Consequently, the importance of data security and the availability of only limited resources motivate the exploration of recent low-complexity cryptographic schemes [6].
Elliptic-curve cryptography (ECC), a public-key cryptography scheme, has become an attractive approach for many applications such as IoT security [7]. The main motivation behind the widespread adoption of ECC is its ability to provide a similar security level with relatively smaller key sizes [8]. It comprises four layers [9]. The topmost layer (layer four) is the protocol layer, which ensures the encryption and decryption of data. In layer three, the scalar or point multiplication (PM) is computed, which is the most critical operation. For PM computation, the point addition (PA) and point doubling (PD) operations are performed in layer two. Finally, layer one of ECC consists of finite-field (FF) arithmetic operations (addition, multiplication, squaring and inversion). In addition to the layer model of ECC, there are two coordinate systems, i.e., affine and projective. The latter is more frequently employed to achieve a high throughput [10]. Furthermore, two field representations, prime (GF(p)) and binary (GF(2^m)), are commonly available. The prime field representation is generally utilized for software-based implementations, while the binary field representation is preferred for hardware deployments [11].
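To make the layer-one operations concrete, the following is a minimal sketch of binary-field GF(2^m) arithmetic, modeling bit-vectors as Python integers. The reduction trinomial x^233 + x^74 + 1 is the NIST polynomial for the m = 233 binary field; this is an illustrative software model, not the paper's hardware datapath.

```python
# Sketch of GF(2^m) arithmetic with m = 233. Polynomials over GF(2) are
# stored as integers: bit i represents the coefficient of x^i.

M = 233
F = (1 << 233) | (1 << 74) | 1   # NIST reduction trinomial x^233 + x^74 + 1

def gf_add(a, b):
    # Addition over GF(2) is carry-free: a bitwise XOR.
    return a ^ b

def gf_reduce(p):
    # Reduce a polynomial of degree up to 2m - 2 modulo F.
    while p.bit_length() > M:
        p ^= F << (p.bit_length() - 1 - M)
    return p

def gf_mul(a, b):
    # Carry-less (polynomial) multiplication followed by reduction.
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(acc)
```

For example, `gf_mul(1 << 232, 2)` exercises the reduction step, since x^232 · x = x^233 folds back to x^74 + 1.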
As far as the security of ECC is concerned, it offers a variety of implementation models, such as Weierstrass [12], binary Edwards (BEC) [13], Hessian (HC) [14] and binary Huff (BHC) [15]. The Weierstrass model is the fundamental ECC model; however, it may expose the secret key via simple power analysis (SPA) attacks [10]. The vulnerability of the Weierstrass model to SPA attacks is due to the different mathematical formulations of PA and PD used for PM computation [9]. SPA is a type of side-channel attack in which attackers can recover a secret key, in terms of zeros and ones, by inspecting the power traces of the PA and PD computations. The power traces are inspected using various power analysis tools such as a logic analyzer [16]. Among several other choices, one of the solutions to resist SPA attacks in ECC is the use of unified PA and PD laws.
Apart from the Weierstrass model of ECC, all the aforementioned models (HC, BEC, BHC) provide unified mathematical formulations for the computation of PA and PD operations [13][14][15]. Compared to BEC and HC, the BHC model of ECC has a relatively higher mathematical complexity in its unified PA and PD laws. This higher mathematical complexity requires more hardware resources and ultimately consumes more power [15]. Therefore, the use of the BHC model in IoT-related applications is less attractive. On the other hand, the BEC model is usually preferred to achieve a higher throughput while utilizing fewer hardware resources than the HC model of ECC [17]. For the complete mathematical structures of the various unified PA and PD laws of these ECC models, interested readers are referred to [18]. The objective of this article is to propose a low-complexity point multiplication design on an FPGA (Field Programmable Gate Array) platform for the BEC model of ECC. In the following, state-of-the-art BEC implementations on FPGAs and their limitations are discussed.

Related Work for BEC Implementations
The work in [19] presents a re-configurable processor architecture, using point halving and unified point doubling approaches, for the computation of the PM operation. Two architectures are presented in [20]. In the first architecture, a single Gaussian-basis FF multiplier is used, while the second architecture advocates the use of three parallel Gaussian-basis FF multipliers. The larger design with three FF multipliers utilizes more hardware resources (29,255 slices when d = 59) on a Virtex-4 device than the architecture with one FF multiplier (12,403 slices when d = 59). For different values of d (59 and 26), the architectures described in [20] achieve various clock frequencies and require different times (in µs) for the PM computation process.
While the focus in [19,20] is on optimizing throughput, the optimization of the required number of clock cycles is equally important. In order to reduce the total number of clock cycles, the work in [21] employs three digit-serial multipliers. Furthermore, it employs a pipelining technique, which reduces the critical path and improves the clock frequency. Similar to [21], another hardware solution is reported in [22], where a pipelined digit-serial multiplier is used. This digit-serial multiplier results in 31,702 (on Virtex-4 when d = 59), 4987 (on Virtex-5 when d = 26) and 11,494 (on Virtex-5 when d = 59) slices.
For area optimization, an interesting solution on a Virtex-5 device is described in [23]. At first, instruction-level parallelism is employed to optimize the clock frequency (308 MHz), which is comparatively higher than that of the solution provided in [19]. Subsequently, the area (15,804 slices) is reduced by using a single FF arithmetic operator (adder, multiplier, squarer and inverter) in the datapath of the cryptoprocessor. The most area-efficient architecture is implemented on Virtex-6 in [24]. Here, a bit-serial approach for FF multiplication is employed instead of a digit-serial multiplier (as used in [20][21][22]). However, this solution takes 6720 µs for one PM computation.
To optimize different design parameters (i.e., latency, area, power, etc.), the most recent architectures are reported in [25][26][27][28][29]. In [25], a comb PM algorithm is utilized to achieve low-complexity (LC) and low-latency (LL) architectures. The LC architecture provides 62%, 46% and 152% improvements over GF(2^233), GF(2^163) and GF(2^283), respectively. On the other hand, the LL architecture reduces the time required to compute one PM operation. An interesting solution is described in [26], where a new modular reduction method is provided. Rather than applying a full modular reduction, the modular reduction approach of [26] iteratively performs reductions on partial products. As a result, the latency of the modular multiplication operation is reduced. The complete ECC architecture is synthesized for a 180-nm complementary metal-oxide-semiconductor (CMOS) process technology over GF(p) with p = 256 bits. It utilizes 144.8-65.4 k logic gates (or gate counts).
In [27], low-cost as well as fast hardware implementations for BEC are reported. The low-cost implementation is achieved by incorporating one pipelined digit-serial modular multiplier. On the other hand, three parallel multipliers are used to accelerate the PA and PD computations. Similarly, a low-power and low-area implementation is described in [28], where a high-radix interleaved modular multiplication method is utilized. Recently, a high-speed, low-area and SPA-resistant processor architecture was described in [29]. The processor architecture performs a 256-bit PM in 198,715 clock cycles and requires 1.9 ms (on Virtex-7) and 2.13 ms (on Virtex-6). The architecture utilizes only 6543 slices on a Virtex-7 FPGA.

Limitations of Existing Practices
Section 1.1 reveals that various FF multipliers can be employed in the datapath of PM architectures, either to optimize area or throughput. The use of a bit-serial FF multiplier decreases the area (hardware resources) but also decreases the overall throughput of the BEC architecture/design [24]. On the other hand, the use of a digit-serial FF multiplier, as employed in [20][21][22], increases throughput at the expense of higher hardware resources. However, there are various area-constrained applications, such as RFID cards [30] and digital management systems [31], which require a trade-off between throughput, area and power consumption.

Contributions
The contributions of this work are as follows:
• A low-complexity binary Edwards curve PM architecture over GF(2^m) with m = 233 is proposed.
• For the unified PA and PD laws of BEC, the instruction-level complexity is reduced by eliminating multiple operations in a single instruction format (details are provided in Section 3.2.1).
• To save hardware resources and reduce clock cycles, the number of storage elements is minimized through efficient re-use of the required memory addresses. This yields a 7.2% decrease in the total number of required clock cycles and a 28.5% decrease in hardware resources (see Section 3.2.2).
• A 32-bit digit-parallel FF multiplier is employed instead of a bit-serial or digit-serial FF multiplier. For one FF multiplication, the digit-parallel multiplier requires only one clock cycle, while the bit-serial and digit-serial FF multipliers require m and m/v clock cycles, respectively (where m is the field size and v is the digit size). The employed FF multiplier ultimately reduces the required number of clock cycles.
• To speed up the control functionalities, a Finite State Machine (FSM) based controller is used in this work.
The implementation results after synthesis over GF(2^233) on a Virtex-7 FPGA device reveal that the proposed low-complexity architecture takes 3244 clock cycles. Moreover, for different values of d, i.e., 59 and 26, it achieves an operational clock frequency of 179 MHz and utilizes only 2662 slices. The time required for the computation of one PM operation is 18 µs, while the power consumption of the architecture is 266 mW. It has been observed that the proposed architecture provides higher throughput-over-slices figures with respect to the most recent state-of-the-art solutions described in [19][20][21][22][23][24][27].
Compared to the state-of-the-art power-efficient architectures reported in [28,29], the proposed architecture is much faster while consuming relatively more power. The power consumption increases due to the use of the digit-parallel multiplier. The digit-serial multiplier approach, as employed in [28], is more convenient for reducing the area and power parameters while compromising on throughput. On the other hand, the digit-parallel and bit-parallel multiplication methods are useful for optimizing the overall performance (throughput) of the architecture.
The remainder of this paper is structured as follows: In Section 2, the mathematical background pertaining to the computation of PM on BEC model of ECC over GF(2 m ) is presented. The optimization of unified PA and PD operations for BEC is described in Section 3. The proposed low-complexity architecture for the PM computation of BEC is described in Section 4. Subsequently, Section 5 presents implementation results and provides a comparison in terms of various performance attributes. Finally, Section 6 concludes the paper.

Mathematical Background
This section provides a brief mathematical background on BEC over GF(2 m ) in Section 2.1. Subsequently, the unified PA and PD laws of BEC over GF(2 m ) and Montgomery Ladder algorithm for point multiplication are described in Section 2.2 and Section 2.3, respectively.

BEC over GF(2 m )
The Weierstrass equations are generally used to represent different forms of elliptic curves, such as BEC, BHC and Hessian curves [12]. The BEC curves for the PM computation process, using definite group law operations, were initially introduced in [13]. If d1 and d2 are elements of GF(2^m) such that d1 ≠ 0 and d2 ≠ d1(d1 + 1), then the BEC with coefficients d1 and d2 can be expressed mathematically as:

d1(x + y) + d2(x^2 + y^2) = xy + xy(x + y) + x^2y^2    (1)

In Equation (1), x and y are the coordinates of the initial point P, while d1 and d2 are the curve constants/parameters.
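As a sanity check on the curve equation, the following sketch evaluates both sides over a toy field GF(2^4) (reduction polynomial x^4 + x + 1), chosen here only for illustration since the paper works over GF(2^233). The points (0, 0) and (1, 1) satisfy the equation for any valid d1, d2; the specific d1 = 2, d2 = 7 values below are arbitrary choices meeting the d1 ≠ 0 and d2 ≠ d1(d1 + 1) conditions.

```python
# Membership test for the binary Edwards curve equation
#   d1(x + y) + d2(x^2 + y^2) = xy + xy(x + y) + x^2*y^2
# over GF(2^4), with field elements stored as 4-bit integers.

def mul(a, b, m=4, r=0b10011):
    # Carry-less multiply modulo the toy reduction polynomial x^4 + x + 1.
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    while acc.bit_length() > m:
        acc ^= r << (acc.bit_length() - 1 - m)
    return acc

def on_curve(x, y, d1, d2):
    lhs = mul(d1, x ^ y) ^ mul(d2, mul(x, x) ^ mul(y, y))
    xy = mul(x, y)
    rhs = xy ^ mul(xy, x ^ y) ^ mul(mul(x, x), mul(y, y))
    return lhs == rhs

# Both trivial points lie on every valid binary Edwards curve.
assert on_curve(0, 0, d1=2, d2=7)
assert on_curve(1, 1, d1=2, d2=7)
```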

Differential PA and PD Laws for BEC over GF(2 m )
The differential PA and PD formulations over GF(2^m) for BEC [21] are shown in Table 1. The first column of Table 1 shows the total number of required instructions (Inst_i) for PA and PD, while the second column presents the corresponding instructions. It can be observed from Table 1 that there are 7 instructions in total, with a requirement of 11 storage elements. The values e1, e2 and e are curve-dependent constants; for a comprehensive understanding of the mathematical formulation, interested readers are referred to [32]. As shown in Table 1, W1, Z1, W2 and Z2 are the initial projective points, whereas Za, Zd, Wa and Wd are the final projective points in the BEC model. Moreover, A, B and C are the storage elements required to hold the intermediate results. Similarly, w is the rational function for an elliptic curve E over GF(2^m). It is represented by a fraction of polynomials in the coordinate ring of E over GF(2^m) and can be calculated as w(P) = (x + y)/(d1(x + y + 1)) for P = (x, y) on E_{B,d1,d2}. The w-coordinate differential addition and doubling amounts to calculating w(2P1) and w(P1 + P2) from the given values w(P1), w(P2) and w(P1 − P2), where P1 and P2 are points on E over GF(2^m). Table 1. Differential PA and PD formulations for BEC [21].

Instr_i | Complete PA and PD Laws (requires 11 × m memory size)

Point Multiplication on BEC
Point multiplication is the result of adding k copies of the initial point P, i.e., Q = kP = P + P + ... + P (k times). The term Q denotes the resultant point on the defined BEC curve, k is a scalar multiplier and P is the initial point on the selected/defined BEC curve. For this computation, the following Montgomery point multiplication algorithm, known as the Montgomery ladder algorithm (shown as Algorithm 1), is employed.

Algorithm 1: Montgomery point multiplication algorithm [20]
Input: An integer k = (k_{m-1}, ..., k_1, k_0) and an initial point P on the curve E.
Output: Q = kP.
Step-1: Conversion from affine to ω coordinates (initialization).
Step-2: for (i from m-1 down to 0) do: perform the differential doubling and addition (dADD) step, selecting the operands according to the key bit k_i.
Step-3: Conversion from ω coordinates back to affine coordinates.

It can be observed from Algorithm 1 that point multiplication is the process of computing kP for a given point P on the elliptic curve E defined over a finite field GF(2^m) and a given integer k. The algorithm starts with the initialization phase. It is important to note that one addition and one doubling are performed in each step of Algorithm 1, which makes this method invulnerable to side-channel invasions [21]. In general, the primary computation in each step of the Montgomery ladder is the differential doubling and addition law (dADD), as shown in Algorithm 1. In the second step of Algorithm 1, the value k_i is used, where k_i is the i-th bit of the scalar k. Based on the value of k_i, the loop iterations are computed: if k_i = 1, the values W2 and Z2 are set as the inputs of the doubling computation; otherwise, W1 and Z1 are selected.
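The ladder's control flow can be sketched as follows. This is a schematic of the per-bit one-add-one-double pattern only, not the paper's projective w-coordinate datapath: the group operations are passed in as callbacks, and with plain integer add/double the ladder computes k·P on integers, which makes the structure easy to sanity-check.

```python
# Control-flow sketch of the Montgomery ladder (Algorithm 1). Each key bit
# triggers exactly one addition and one doubling, regardless of its value,
# which is the source of the SPA resistance discussed above.

def montgomery_ladder(k, m, add, double, identity, P):
    r0, r1 = identity, P          # the invariant r1 - r0 == P is maintained
    for i in range(m - 1, -1, -1):
        if (k >> i) & 1:
            # k_i = 1: the second register branch feeds the doubling
            r0, r1 = add(r0, r1), double(r1)
        else:
            # k_i = 0: the first register branch feeds the doubling
            r0, r1 = double(r0), add(r0, r1)
    return r0

# With integer stand-ins for the group law, the ladder returns k * P.
assert montgomery_ladder(13, 4, lambda a, b: a + b, lambda x: 2 * x, 0, 5) == 65
```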

Optimization of Differential Addition Law for BEC
The complete differential equations for the PA and PD operations, shown in Table 1, require higher hardware resources for the PM computation process [21]. Therefore, this section provides an optimized version of the differential addition law from the point of view of hardware resource utilization. First, Section 3.1 identifies the limitations in the standard PA and PD laws for BEC. Subsequently, the optimizations that address the identified limitations are presented in Section 3.2.

Limitations of the PA Law for BEC
The limitations of the complete PA and PD operations, shown in Table 1, are listed below:
• There are in total 7 complex mathematical instructions required for the complete PA and PD law (i.e., Instr 1 to Instr 7), as shown in column 2 of Table 1. The term complexity implies that there are many arithmetic operations in a single instruction. These operations are performed in parallel, which results in higher hardware resource usage.
• The total number of storage elements required for the complete PA and PD law of Table 1 is 11 × m. The number 11 indicates the total storage elements, while m determines the width of each element. The requirement of many storage elements ultimately increases hardware resource utilization.
To optimize the utilization of hardware resources, it is necessary to tackle the aforementioned limitations.

Proposed Optimizations
The improvements in the instructions of Table 1 are listed as: (1) Elimination of multiple operations in a single instruction and (2) minimization of storage elements.

Elimination of Multiple Operations in a Single Instruction
By eliminating multiple operations in a single instruction, the complexity of the instructions can be reduced, as shown in Table 2. The first and second columns of Table 2 provide the required number of clock cycles and the instructions for the execution of the differential addition law. The resulting single-operator form of the differential addition law for BEC is shown in column 3. The single-operator form of column 3 reduces the complexity of the instructions by eliminating multiple operations. However, it can be observed that the number of instructions has increased from 7 (shown in Table 1) to 14 (shown in Table 2). Additionally, the number of storage elements required for implementing Algorithm 1 also increases from 11 (shown in Table 1) to 14 (see column 3 of Table 2). Table 2. Optimized form of the complete differential addition law for BEC.

Clock Cycles | Instr_i | Addition Law (from Table 1) | Optimized Addition Law

Minimization of Storage Elements
The mathematical formulations shown in column 3 of Table 2 require a total of 14 memory elements (i.e., A, B, C, Wd, Zd, Wa, Za, W1, W2, Z1, Z2 and T1 to T3). The memory elements W1, Z1, W2 and Z2 are utilized to keep the values of the initial projective points, while Wa, Za, Wd and Zd (shown in red in column 3 of Table 2) are employed to hold the updated values of the final projective point. The remaining elements, A, B, C and T1 to T3, are used to store intermediate results. In order to reduce the overall memory requirements, the number of storage elements in column 3 can be reduced, as shown in column 4 of Table 2. The required memory size for the optimized formulations in column 4 is 10 × m. The storage elements W1, W2, Z1 and Z2 are utilized to keep the values of the initial projective points. Moreover, the storage elements W0, T1, T3 and T4 (shown in blue in column 4) are used to hold the updated values of the final projective point. The remaining storage elements (Z0 and T2) are employed to store intermediate results.
As shown in columns 3 and 4 of Table 2, the operations performed in inst 1 to inst 6 and inst 9 to inst 14 are identical. However, inst 7 (T3 = T2 × T2) and inst 8 of Table 2 can be performed in one clock cycle, as they involve back-to-back multiplications. In order to execute the two multiplications in one clock cycle, an immediate squaring after the multiplication operation can be computed, as given in inst 8 of column 4. This results in a decrease in the total number of instructions from 14 to 13 (see columns 3 and 4 of Table 2). Moreover, there is a decrease in the number of storage elements from 14 × m to 10 × m. Furthermore, it also saves m clock cycles when an m-bit key length is targeted for the PM computation of BEC (see Step-2 of Algorithm 1). To summarize, the proposed optimized formulations (given in column 4 of Table 2) result in a 7.2% (ratio of 13 to 14) decrease in the total number of required clock cycles. Furthermore, a reduction of 28.5% (ratio of 10 to 14) in hardware resources is obtained.
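The quoted savings follow directly from the instruction and register counts above; a quick check:

```python
# Merging inst 7 / inst 8 drops the per-iteration instruction count from
# 14 to 13; register re-use drops storage from 14*m to 10*m bits.

instr_saving = (14 - 13) / 14     # about 7.1% fewer clock cycles per step
storage_saving = (14 - 10) / 14   # about 28.6% fewer storage elements
```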

Proposed Binary Edward Curve Point Multiplication Architecture
The proposed area-optimized PM architecture for BEC comprises three major modules: the Memory Unit (MU), the Datapath Unit (DU) and the Control Unit (CU). Figures 1 and 2 provide the simplified and detailed representations of the architecture, respectively. The initial curve parameters over GF(2^233) are selected from the National Institute of Standards and Technology (NIST) recommended document for binary curves [33]. The descriptions of the aforementioned modules (MU, DU and CU) are presented in Section 4.1, Section 4.2 and Section 4.3, respectively.

Memory Unit
The memory unit consists of a Register File with a size of 10 × m, where 10 is the total number of cells/locations and m determines the size of a particular cell. The objective of the Register File is to keep the initial and final projective points, using 8 MU locations to store W1, Z1, W2, Z2, W0, T1, T3 and T4. The remaining 2 locations are used to keep the intermediate results (Z0 and T2). Similarly, two 10 × 1 multiplexers are used to read data from the Register File using the corresponding control signals (C1 and C2). The outputs of these multiplexers are operand 1 (OP1) and operand 2 (OP2). Furthermore, a 1 × 10 de-multiplexer is used to update the Register File through a control signal C3.

Data Path Unit
The data path unit consists of an Arithmetic Logic Unit (ALU) and three routing networks (MUX1, MUX2 and MUX3).

Routing Networks
The size of MUX1 and MUX2 is 6 × 1 while the size of MUX3 is 3 × 1. The inputs to MUX1 and MUX2 are initial curve parameters (e 1 , e 2 , w), basepoint (x,y) and the operands from MU (OP1 and OP2). The outputs of MUX1 and MUX2 are OP_1 and OP_2, respectively, while the corresponding control signals for these two muxes are C4 and C5, respectively. The outputs OP_1 and OP_2 enter into ALU for further processing. Finally, the inputs to MUX3 can be either from adder (A_out), multiplier (M_out) or squarer (MS_out). Consequently, using the control signal C6, the output from MUX3 is written into Register File through a 1 × 10 de-multiplexer.

Arithmetic Logic Unit
The arithmetic logic unit comprises an adder, a multiplier and a squarer unit. As shown in Figure 2, the adder and multiplier units are connected in parallel, while the squarer unit is connected serially after the multiplier. The adder unit performs polynomial addition using bitwise exclusive-OR (XOR) gates. To perform multiplication over 233-bit polynomials, a digit-parallel multiplier is incorporated inside the ALU, as shown in Figure 3. The multiplier architecture consists of four blocks: (1) splitter, (2) multiplication, (3) concatenation and (4) reduction.
• Splitter block: The splitter block takes one 233-bit polynomial (i.e., B) as an input and outputs the digits (i.e., B1 to B8) of polynomial B. In other words, it is responsible for splitting one of the two required input polynomials (the multiplier takes two operands as inputs and produces one operand as an output). The selection of the operand to split is up to the designer. Figure 3 shows that the input polynomial B is divided into eight digits, i.e., B1 to B8, with a digit size of 32 bits each, except the last digit B8. To perform the multiplication, the required size of the last digit is 9 bits.
• Multiplication block: As shown in Figure 3, the complete multiplication block comprises eight instances of smaller multipliers. Each smaller multiplier instance performs a polynomial multiplication in parallel. It takes two input polynomials: A and one digit of polynomial B. The corresponding lengths of A and each digit of B are 233 and 32 bits, respectively. Consequently, the output of each smaller multiplier instance is d + m − 1 bits, where d represents the size of each created digit of polynomial B (i.e., 32 bits) and m is the length of the input polynomial A (i.e., 233 bits).
• Concatenation block: The concatenation block takes eight inputs, each of length d + m − 1 bits (generated by the multiplication block), and provides an output polynomial of length 2 × m − 1 bits. The length of the output polynomial is not shown in Figure 3. The internal structure of the concatenation block consists of seven arrays of XOR gates, shown as the gray blocks inside the concatenation block of Figure 3. The concatenation operation is performed using shift-and-add operations.
• Reduction block: To produce a resultant polynomial of m bits, a finite-field reduction operation over the polynomial of length 2 × m − 1 bits is performed [11].
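The four blocks above can be sketched in software as follows, with integers modeling bit-vectors. This is an illustrative model of the dataflow in Figure 3, not the Verilog implementation; the reduction step assumes the NIST trinomial x^233 + x^74 + 1.

```python
# Four-stage digit-parallel multiplication: split B into 32-bit digits,
# multiply each digit by the full operand A (in hardware, in parallel),
# recombine the partial products with shift-and-XOR, then reduce.

M, D = 233, 32
F = (1 << 233) | (1 << 74) | 1   # reduction trinomial x^233 + x^74 + 1

def clmul(a, b):
    # Carry-less multiply: models one (m x d)-bit partial multiplier.
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc

def digit_parallel_mul(a, b):
    # Splitter: eight 32-bit digits (the eighth carries only 9 useful bits).
    digits = [(b >> (D * i)) & ((1 << D) - 1) for i in range(8)]
    # Multiplication block: eight partial products of d + m - 1 bits each.
    partials = [clmul(a, dig) for dig in digits]
    # Concatenation block: shift-and-XOR recombination.
    prod = 0
    for i, p in enumerate(partials):
        prod ^= p << (D * i)
    # Reduction block: fold the 2m - 1-bit product back to m bits.
    while prod.bit_length() > M:
        prod ^= F << (prod.bit_length() - 1 - M)
    return prod
```

Since carry-less multiplication distributes over XOR, the recombined product equals a single full-width carry-less multiply of A and B, which is what the test below checks.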
Therefore, a NIST-recommended polynomial reduction algorithm over GF(2^233) is utilized in this work, as employed in [11]. The squarer module is implemented by inserting a 0 after each input data bit, as implemented for the Weierstrass model in [11]. Similar to the multiplication operation, a reduction over a polynomial of length 2 × m − 1 bits is then computed; hence, an instance of the reduction block is utilized inside the squarer module, which results in an m-bit polynomial after squaring. The inversion operation is performed by using the multiplier and squarer modules. Furthermore, the quad-block version of the Itoh-Tsujii algorithm [34] is used for the inversion operation, as it provides better results in terms of clock cycles compared to the square version [11].
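The zero-insertion squarer can be sketched as follows: spreading the input bits apart (bit i moves to position 2i) is exactly the carry-less square over GF(2), after which the same reduction block as the multiplier is applied. As elsewhere in these sketches, the NIST trinomial x^233 + x^74 + 1 is assumed.

```python
# Squaring over GF(2^m) by interleaving zeros, then reducing.

M = 233
F = (1 << 233) | (1 << 74) | 1

def gf_square(a):
    sq = 0
    for i in range(a.bit_length()):
        if (a >> i) & 1:
            sq |= 1 << (2 * i)       # insert a 0 after each data bit
    while sq.bit_length() > M:       # shared reduction block
        sq ^= F << (sq.bit_length() - 1 - M)
    return sq
```

For instance, (x + 1)^2 = x^2 + 1 over GF(2), so `gf_square(0b11)` yields `0b101`; inputs of degree above 116 additionally exercise the reduction.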

Control Unit
To generate the control signals, an FSM-based dedicated control unit is used in this work, as given in Figure 4. The CU incorporates a total of 63 states, described as follows:
• State 0 is an idle state. The reset and start signals determine the start of the execution process.
• Once the start signal becomes 1, the current state (State 0) of the processor switches to the next state (State 1), as shown in Figure 2. Moreover, States 1 to 3 generate the control signals for the affine-to-projective conversions of Algorithm 1.
In the proposed low-complexity architecture, Equation (2) is used to compute the total number of clock cycles (CCs). The term Initial refers to the initialization part of Algorithm 1, whereas m determines the key length. Furthermore, Inv defines the inversion operation in Equation (2). The first column of Table 3 provides the key length. The second column determines the required CCs for the initialization part (Initial) of Algorithm 1, whereas the total CCs for the computations of PA and PD in Algorithm 1 are presented in column three. The fourth column provides the required CCs for the inversion (Inv) part of Algorithm 1. Finally, the last column of Table 3 shows the total CCs for implementing Algorithm 1.
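A back-of-the-envelope check of the cycle count is possible under one assumption that is consistent with, but not explicitly stated in, the text: each of the 13 single-cycle instructions of Table 2 executes once per ladder iteration, so the loop contributes 13·m cycles and the remainder of the reported total covers initialization plus inversion.

```python
# Hypothetical decomposition of the reported cycle count, assuming
# CCs = Initial + 13*m + Inv (the exact split of Initial vs. Inv is
# given in Table 3, which is not reproduced here).

m = 233
loop_cycles = 13 * m          # PA + PD work inside the ladder: 3029
total_reported = 3244         # total CCs from the synthesis results
overhead = total_reported - loop_cycles   # Initial + Inv: 215 cycles
```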

Results and Comparisons
This section first introduces a performance metric (throughput/area) in Section 5.1. Subsequently, experimental results for a binary key length of GF(2 233 ) are provided in terms of area, frequency, time, throughput/area and power (Section 5.2). The obtained results are then compared with state-of-the-art in Section 5.3.

Performance Metric
A throughput-over-area metric is defined, where area refers to the utilized FPGA slices, as given in Equation (3). Another performance metric (throughput over power) could also be defined by considering the throughput and power parameters at the same time. However, throughput-over-power values are not analyzed in this work, as most low-complexity architectures lack power-related information; this fact will become more evident in Section 5.3. The metric and its simplified form are:

throughput/area = (1/latency in s)/slices    (3)

throughput/area = 10^6/(latency in µs × slices)    (4)

In Equations (3) and (4), the term throughput is the ratio of 1 over the time required for the computation of one PM (i.e., Q = kP, in seconds). Similarly, Q is the final point on the BEC curve, k is the scalar multiplier, P is the initial point on the BEC curve and the term slices refers to the used area on the selected FPGA device. The factor 10^6 in Equation (4) simply converts the time (or latency, i.e., the time for one PM) from microseconds to seconds. The time (or latency) mentioned in column 6 of Tables 4 and 5 is calculated using Equation (5):

time (or latency) = required clock cycles (CCs) / operational clock frequency (in MHz)    (5)
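Equations (3) to (5) can be worked through with the Virtex-7 figures quoted in the next subsection (3244 cycles, 179 MHz, 2662 slices):

```python
# Latency and throughput/area for the proposed design on Virtex-7.

ccs, f_mhz, slices = 3244, 179, 2662

latency_us = ccs / f_mhz                    # Equation (5): about 18.1 us
tp_over_area = 1e6 / (latency_us * slices)  # Equation (4): about 20.7
```

The result matches the reported 18 µs latency and a throughput/slices figure close to the 20.82 quoted in the abstract (small differences come from rounding the latency).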

Performance Results
The architecture is modeled in Verilog (HDL) using the Xilinx ISE 14.2 design suite. The synthesis results are provided for a Xilinx Virtex-7 (XC7VX290T) FPGA device, as shown in Table 4. The proposed architecture achieves a clock frequency of 179 MHz when the value of d is either 59 or 26. For the same values of d, it utilizes 2662 slices and takes 18 µs for one PM computation over GF(2^233). The calculated performance metric (ratio of throughput over slices) is 20 and the power consumption is 266 mW.

Performance Comparison with State-of-the-Art
To provide a realistic and reasonable comparison with state-of-the-art solutions, the proposed architecture is synthesized for the same Xilinx FPGA devices used in existing solutions. The synthesis results are given in Table 5. It is important to note that a fair comparison with state-of-the-art solutions requires the use of an identical implementation device (FPGA) and ECC model. As a result, there are various recent solutions for which a fair comparison is not possible, either due to a different implementation platform or a distinct ECC model. For example, the architectures described in [25,26] use the Weierstrass model of ECC, while this work employs the BEC model. Additionally, those architectures utilize a CMOS process technology for synthesis using the Cadence Genus compiler, while the work reported in this article considers an FPGA platform. In Table 5, M denotes the number of utilized multipliers in the architecture, Dyn the dynamic power, D the targeted digit size for the digitized polynomial multipliers and T/slices the throughput/slices; for reference [28], the throughput/slices is calculated using FPGA LUTs.
In Table 5, for Virtex-4, Virtex-5 and Virtex-6 platforms, the selected FPGA devices for synthesis are XC4VLX100, XC5VLX110 and XC6VSX315t, respectively. The first column of Table 5 shows the reference solutions. The second column provides implementation platform. The third column provides the operational clock frequency (in MHz). Columns four and five present the area information in terms of FPGA Slices and LUTs. The time (in µs) required to compute one PM operation is given in column six. The column seven presents the calculated results of defined performance metric (ratio of throughput over area). Finally, the last column presents the power consumption (in W).
Area, Time and Frequency comparison over Virtex-4 FPGA devices: The architecture described in [19] utilizes 21.3% (when only the PA and PD operations are computed) and 23.3% (when point halving is merged with the PA and PD operations) more FPGA slices than the proposed architecture. At the same time, the frequency of the proposed architecture is 2.65 times (165%) higher. For different values of d, two solutions are described in [20]. In the first solution, the authors of [20] use only one Gaussian-basis FF multiplier, while in the second solution three parallel Gaussian-basis FF multipliers are employed. The architecture reported in this article utilizes 1.38 times (38.3%) more slices than the first solution of [20] and consumes 41.3% fewer hardware resources than the second solution of [20]. Moreover, due to pipelining, the solutions of [20] achieve a higher clock frequency. On a Virtex-4 FPGA device, another hardware solution is reported in [22], where a pipelined digit-serial parallel multiplier is used to compute the PA and PD operations. Comparatively, the proposed design utilizes 46% fewer hardware resources when d = 59. The operational clock frequency achieved in [22] is 50.5% higher and, considering the time required to compute one PM operation, the architecture of [22] is 1.9 times faster.
Area, Time and Frequency comparison on Virtex-5 FPGA devices: For d = 59 and d = 26, the proposed architecture utilizes 71% and 51.6% fewer hardware resources, respectively, than the second solution of [20]. The operational clock frequencies achieved in [20] are 2 times (for d = 59) and 2.4 times (for d = 26) higher, and the time required for one PM computation with the presented architecture is 1.7 times (for d = 59) and 1.1 times (for d = 26) longer. Similarly, for d = 59 and d = 26, it utilizes 34% and 1% fewer hardware resources than the first solution of [20], while the required time is reduced by 22% and 63%, respectively. As compared to [21], the proposed architecture utilizes 55% (for d1 = d2 = 59) and 42% (for d1 = 117 and d2 = 59) fewer FPGA slices. Additionally, it requires 62.8% (for d1 = d2 = 59) and 25.7% (for d1 = 117 and d2 = 59) less computational time than the pipelined digit-serial multiplier of [21]. Owing to the bit-serial multiplication architecture of [24], the proposed architecture utilizes 1.9 times more slices but requires much less computational time. As compared to the solution reported in [23], the proposed solution utilizes almost 6 times fewer hardware resources. The low-cost and high-speed architectures of [27], with one and three pipelined digit-serial multipliers, achieve 53% and 58% higher clock frequencies than this work (for d = 59), due to the pipeline registers in the employed multiplier architectures. On the other hand, the proposed architecture (for d = 59) requires 2.72 times less computational time with a similar utilization of FPGA slices (2662).
Area, Time and Frequency comparison over Virtex-6 FPGA devices: As compared to the bit-serial multiplication architecture of [24], the proposed solution utilizes 2.2 times more slices. However, it provides a 43% higher clock frequency thanks to its digit-parallel, rather than bit-serial, multiplier architecture. Moreover, its computational time is roughly 386 times lower (the ratio of 6720 µs to 17.39 µs).
Throughput/Area comparison over Virtex-4, Virtex-5 and Virtex-6 FPGA devices: As shown in column seven of Table 5, the highest throughput over area ratios in state-of-the-art works on Virtex-4, Virtex-5 and Virtex-6 devices are 2.45, 10.9 and 0.119, respectively, whereas the ratios achieved in this article on the same devices are 2.29, 19.49 and 21.5. This implies that, on the Virtex-4 device, the proposed architecture yields only a 6.5% lower throughput over slices ratio than the most efficient solution [20]. On the Virtex-5 and Virtex-6 devices, however, it provides a much higher throughput over slices ratio than the most efficient solutions, reported in [22,24], respectively, as shown in Table 5.
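The relative figures in this comparison are straightforward percentage differences between metric values; a minimal sketch, using the Virtex-4 numbers quoted above (2.29 for this work versus 2.45 for [20]):

```python
# Relative difference between two throughput/slices ratios, expressed as
# a percentage of the better (reference) value.

def percent_lower(proposed: float, best: float) -> float:
    """How much lower `proposed` is than `best`, in percent."""
    return (best - proposed) / best * 100.0

# Virtex-4: proposed ratio 2.29 vs the best state-of-the-art ratio 2.45.
print(round(percent_lower(2.29, 2.45), 1))  # → 6.5
```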
Comparison with most recent GF(p) architectures: A fully realistic comparison between the proposed architecture and the works in [28,29] is not possible due to the different implementation fields: this article targets GF(2^233), while [28,29] target GF(p). The Virtex-6 implementation of the proposed work (for d = 26 and d = 59) achieves a 36% higher clock frequency while utilizing 2.56 times more resources than the work in [28]. Beyond clock frequency and hardware resources, the time required to perform one PM computation is much lower (17.39 µs) than the 511.78 µs of [28]. Similarly, the Virtex-7 implementation achieves a 31.4% higher clock frequency while utilizing 2.83 times more resources than the work in [28]. Furthermore, it consumes 1.54 times more power, but achieves an 80 times higher throughput over slices value (the ratio of 20.82 to 0.26). Similarly, if the number of slices in Equation (4) is replaced with the consumed power, the overall throughput over power ratio is 13.49 for [28] and 208.39 for the proposed architecture. Considering the architecture of [29] for comparison, the proposed architecture utilizes 2.5 times fewer hardware resources and achieves a 293 times higher throughput over slices value (the ratio of 20.82 to 0.071).
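The throughput over power ratio mentioned above can be checked the same way as the throughput over slices metric. This sketch assumes throughput is again PM operations per second and that power is taken in mW, since those units reproduce the order of magnitude of the reported ratios:

```python
# Sketch of the throughput/power ratio, assuming throughput is PM
# operations per second and power is expressed in mW (the units that
# match the magnitude of the ratios reported in the comparison).

def throughput_over_power(latency_us: float, power_mw: float) -> float:
    """Throughput (PM operations/s) divided by power consumption (mW)."""
    throughput = 1.0 / (latency_us * 1e-6)
    return throughput / power_mw

# Proposed Virtex-7 design: 18 us per PM at 266 mW.
print(round(throughput_over_power(18.0, 266.0), 1))  # close to the reported 208.39
```

The small gap between the computed value (about 208.9) and the reported 208.39 is consistent with the latency and power figures being rounded in the tables.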

Conclusions
This article has reported a low-complexity hardware architecture for BEC over the GF(2^233) field that provides a better throughput over slices ratio. To achieve this, the original mathematical formulation of the PA and PD laws for BEC is revised, yielding a simplified formulation in which each instruction contains a single arithmetic operation and resulting in a 28.5% decrease in hardware resources. As compared to alternative bit-serial and digit-serial multiplier architectures, the inclusion of a 32-bit digit-parallel multiplier decreases the number of required clock cycles. Consequently, the implementation results show that the proposed low-complexity hardware architecture provides higher throughput over slices figures than the most recent state-of-the-art architectures. In addition to low-area applications, some architectures particularly focus on low power consumption; the proposed architecture outperforms these power-optimized architectures with only a relatively small increase in power consumption.