Energy/Area-Efficient Scalar Multiplication with Binary Edwards Curves for the IoT

Making Elliptic Curve Cryptography (ECC) available for the Internet of Things (IoT) and related technologies is a recent topic of interest. Modern IoT applications transfer sensitive information which needs to be protected. This is a difficult task due to the processing power and memory availability constraints of the physical devices. ECC mainly relies on scalar multiplication (kP)—which is an operation-intensive procedure. The broad majority of kP proposals in the literature focus on performance improvements and often overlook the energy footprint of the solution. Some IoT technologies—Wireless Sensor Networks (WSN) in particular—are critically sensitive in that regard. In this paper we explore energy-oriented improvements applied to a low-area scalar multiplication architecture for Binary Edwards Curves (BEC)—selected given their efficiency. The design and implementation costs for each of these energy-oriented techniques—in hardware—are reported. We propose an evaluation method for measuring the effectiveness of these optimizations. Under this novel approach, the energy-reducing techniques explored in this work contribute to achieving the scalar multiplication architecture with the most efficient area/energy trade-offs in the literature, to the best of our knowledge.


Introduction
The deployment of Internet of Things (IoT) applications is pushing society to interact with smart environments on a regular basis. Smartphones, buildings, vehicles, roads, home appliances; most new instances of these technologies are being equipped with capabilities for data sensing and internet connectivity [1]. The data retrieved by these systems might be sensitive, since it can be inherently confidential [2] or can be used to infer a user's behavior [3]. Providing security for the IoT is said to be the equivalent of providing security for a conventional network, with the added complexity that the network can be physically reached by attackers [4].
A common characteristic of many IoT nodes is that they suffer from physical constraints, most notably on size and energy [4,5]. To reduce manufacturing costs, the physical size of the devices must be kept small.
For some, one of the most precious resources of a constrained device is energy [6,7]. The reasoning is that, after deployment, some nodes rely on battery systems which cannot be replaced and ought to last for several months or years. That is why "to minimize energy consumption, lightweight Public-key Cryptography (PKC) implementations are a fundamental requirement" [8]. For both constraints, lightweight cryptography can provide an effective solution that (a) is physically small and (b) has low energy consumption.
The rest of the paper is structured as follows. Section 2 briefly enumerates some preliminary notions regarding the topics in this paper. Section 3 describes the energy-oriented improvements applied to ECC architectures in selected works from the literature. The description and implementation for our energy improvements can be found in Section 4. A novel evaluation method for energy improvements is detailed in Section 5. Lastly, our concluding remarks are available in Section 6.

Elliptic Curve Cryptography
An elliptic curve can be described as the set of points that satisfy the Weierstrass model in (1) over the finite field F_q.

E : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, with a_i ∈ F_q (1)

Simplifications of (1) and equivalences are used as the basis for different elliptic curve families: random prime, random binary, Koblitz, Montgomery, Edwards, twisted Edwards, binary Edwards, among others.
The elliptic curve points E, a group operation +, and the point at infinity O form an elliptic curve group E(F_q), which can be used in cryptographic applications. The operation + is the addition of points, and it varies for each elliptic curve family. Thus kP represents the consecutive application of the group operation k times over the base point or generator P: kP = P + P + · · · + P (k terms). In practice kP relies on point addition (P + Q) and doubling (2P), where each is composed of multiple field operations. The complexity of kP depends on the group and field arithmetic definitions. The kP calculation is used in any ECC-based algorithm, hence improving its efficiency is critical.
The BEC family is defined by the model

d_1 (x + y) + d_2 (x^2 + y^2) = xy + xy(x + y) + x^2 y^2

where d_1, d_2 ∈ F_2^m with d_1 ≠ 0 and d_2 ≠ d_1^2 + d_1. These curves are birationally equivalent to generic binary curves [23]. The principal advantages of BECs are that (a) their group operation is complete, so no extra checks are required, and (b) their group operation requires fewer field operations.
In [23] the authors introduced the concept of w coordinates for BEC. By using this point representation it is possible to reduce the number of field operations required to perform kP. Furthermore, the use of projective-w coordinates makes it possible to reduce the number of inversions required; inversions are among the most expensive field operations. Differential addition and doubling formulae can be combined with projective-w coordinates to achieve the smallest requirements in terms of field operations for kP in BECs [24].

Power and Energy
Let the energy (ENE) consumed by a circuit to perform a task be defined as the product of the dissipated power (POW) and the runtime (t):

ENE = POW × t (4)

This approach is employed in multiple works from the literature [17,18,25-27]. We consider the runtime to be directly linked with the performance of the system: a lower runtime equals higher performance and vice versa. The runtime is the product of the latency in clock cycles (LAT) and the inverse of the operational frequency (f):

t = LAT × (1/f)

POW is obtained as the sum of the dynamic (DP) and static or quiescent (SP) powers:

POW = DP + SP

DP is the sum of the powers associated with clocking, signals, logic, IOs, and dedicated blocks; this includes the data-dependent power. SP is dissipated by the whole FPGA fabric and remains somewhat constant regardless of the implemented circuit. For FPGAs the static power tends to be higher than the dynamic part. Each component is usually modeled as

DP = e × f × A and SP = I_s × V_cc (7)

where e is the average energy spent during one clock cycle per area unit, A represents the area of the circuit, I_s is the static current drawn from the power supply, and V_cc is the supply voltage.

The designer has control over the operational frequency, the latency, and the area to influence the energy consumption of the system. The effects of f over ENE are not straightforward. If f is reduced, then t grows and ENE rises, as shown in (4), from the SP component in POW; if f is increased, then POW may grow due to its DP element, see (7), and the increment of ENE follows from (4). Finding the optimal operational frequency for the proposed kP architecture is outside the scope of this work; we do, however, use two operational frequencies (low vs. high) to study this variation.
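The interplay above can be sketched numerically. All constants below (e, A, I_s, V_cc) are illustrative placeholders, not measured values from our implementations:

```python
# Minimal sketch of the energy model: ENE = POW * t, POW = DP + SP,
# DP = e*f*A, SP = Is*Vcc. Constants are illustrative placeholders only.

def runtime(lat, f):
    """t = LAT * (1/f), runtime in seconds."""
    return lat / f

def power(e, f, area, i_s, v_cc):
    """POW = DP + SP, with DP = e*f*A and SP = Is*Vcc."""
    return e * f * area + i_s * v_cc

def energy(lat, f, e, area, i_s, v_cc):
    """ENE = POW * t."""
    return power(e, f, area, i_s, v_cc) * runtime(lat, f)

# Lowering f stretches t, so the static term Is*Vcc accrues for longer;
# raising f inflates the dynamic term e*f*A. Neither direction wins
# unconditionally, which is why two frequencies are studied in this work.
ene_low = energy(lat=832_818, f=100e3, e=1e-12, area=1500, i_s=20e-3, v_cc=1.2)
ene_high = energy(lat=832_818, f=13.56e6, e=1e-12, area=1500, i_s=20e-3, v_cc=1.2)
```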
So, if we seek to reduce ENE we need to find a minimum in the balance between the area and the latency. The former has been the main optimization goal for lightweight cryptography, whereas the latter has generated interest in recent years [28].
Other popular optimizations such as clock gating and datapath insulation aim at mitigating the switching activity of parts of the circuit which are not actively used-these aim at reducing the dynamic power consumption of the circuit.

Percentile Differences
The percentile increment (∆%) is provided whenever new implementation results are presented. These increments are calculated as the difference between the new observation (O_Ci) and the previous observation (O_Ci−1), relative to the previous one:

∆% = ((O_Ci − O_Ci−1) / O_Ci−1) × 100 (8)

In this work we only use percentile differences to assess the area increments and the energy decrements.
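As a minimal sketch, the computation of (8) reads:

```python
# Percentile difference of Eq. (8): positive values are increments
# (e.g. area), negative values are decrements (e.g. energy).

def pct_diff(new, prev):
    """Delta% = (O_Ci - O_Ci-1) / O_Ci-1 * 100."""
    return (new - prev) / prev * 100.0

# e.g. 1000 -> 1070 LUTs is a +7% area increment,
# while 100 -> 93 uJ is a -7% energy decrement.
```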

Evaluation Environment
We used the Xilinx ISE Design Suite 14.3 for the synthesis and configuration of all the architectures described. The designs were described in VHDL and synthesized with Area Reduction as the design goal and strategy. All the results provided in this document were obtained after Place and Route (PAR) unless explicitly stated otherwise.
The power estimations reported in this paper correspond to the sum of dynamic and static power. Since for FPGAs the static part tends to outweigh the dynamic power, in some cases the total power might appear somewhat constant.
These estimations were obtained using the Xilinx XPower Analyzer software. In order to obtain a high overall confidence level we employed the post-PAR design file (ncd), the physical constraints file (pcf) for the specified FPGA, and a simulation activity file (saif). The latter was obtained using the Xilinx Isim software from a post-PAR simulation; each one of the architectures was simulated using actual data for over 10,000 cycles.

Energy Reduction in the Literature
Improving the performance of the system is one of the most common approaches for reducing the energy consumption [16][17][18][19][20][29][30][31][32][33][34]. In [35] it is shown that techniques like pipelining and parallelism can be used to reduce the power consumption. If the computations are completed quickly, a moderate rise in the power required (due to increments in the area and switching activity) can be mitigated by the time reduction. In this regard multiple alternatives have been proposed: using low-latency algorithms, proposing low-latency implementations, exploiting algorithm parallelism, and using dedicated processing units. Nonetheless, just as it is inadequate to say that low-area equals lightweight, it is also flawed to assume that high-performance equals low-energy. As reviewed in the previous section, the relations between energy and performance are not clear-cut. The other strategies for achieving power reduction consider area minimization [21,22] and exploring area/performance tradeoffs [32,36].
From the perspective of security protocols, it can be concluded that low overheads in the number of packets [22,[37][38][39][40] and the number of cryptographic operations [32,38,[41][42][43] are key for low-energy PKC. These nodes are characterized by wireless transmissions, which require considerable amounts of energy to be performed, thus it is opportune to use protocols with low packet count requirements. As mentioned, ECC offers the smallest key sizes for comparable security levels. That property holds for all the group elements, thus contributing to reducing the transmissions overhead.
The implementation platform plays a significant role in the design of an ECC system. Using a generic processor would imply selecting prime curves, since commercial ALUs seldom include binary multipliers. On the other hand, a hardware solution would benefit from using binary curves [17].
Selecting the adequate coordinate representation, the group operations, and the field operations used in the ECC system is of paramount importance. For a software-implementation these choices translate into different routines that are executed by the processor, whilst for a hardware-realization these translate into different hardware modules. Processors benefit from shorter routines, from quick calculations, but also from reduced memory accesses [44]. On the other hand, hardware architectures can exploit the arithmetic of binary fields for performing calculations swiftly.

Methods
In this section we outline the application and evaluation of different energy-reducing techniques over a low-area kP architecture. Throughout the document we use the Binary Edwards Curve BE251 [23] as case study.
We study three architecture-level transformations (field inverter, field multiplier, and field squarer) as well as a circuit-level modification (datapath insulation). We study these strategies in the aforementioned order so that the contribution of each technique can be studied in a way in which it benefits the most from the previous techniques.

Starting Point: Low-Area kP Architecture
In Figure 1 we illustrate the base area-optimized architecture used. This module follows the Montgomery ladder algorithm with differential addition and doubling for binary Edwards curves in mixed-w coordinates, as proposed in [24]. One of the main characteristics of this design is that it offers flexibility of the field, curve, base point, and scalar; all the proposed optimizations ought to preserve this property. The field operations supported by this design are multiplication, addition, and inversion. A bit-serial multiplier is used to reduce the implementation size. Addition is performed by a layer of XOR gates. Field inversion is required to convert the input and output of the system from w to projective-w coordinates and vice versa. This operation is performed using only multiplications, thanks to Fermat's Little Theorem. The particular inversion algorithm used is Wang's [51].
In regards to latency, each inversion requires 2m − 3 m-bit multiplications, which amounts to 125,249 cycles when m = 251. A step in the Montgomery ladder requires 9 × m full multiplications with a latency of 567,009 cycles, m short multiplications with a latency of 14,558 cycles, and 3 × m additions which take 753 cycles. The architecture requires two inversions and an m-bit Montgomery ladder per kP, hence the total latency of the design is 832,818 cycles.
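The accounting above can be reproduced with a short script. The per-operation cycle costs (251 cycles per full multiplication, 58 per short multiplication, 1 per addition) are inferred from the quoted totals rather than stated explicitly:

```python
# Re-deriving the latency figures quoted above for m = 251, assuming a
# full bit-serial multiplication takes m cycles, a short (constant)
# multiplication 58 cycles, and an addition 1 cycle.

m = 251
full_mult = m                        # cycles per m-bit multiplication
inversion = (2 * m - 3) * full_mult  # Wang inversion: 2m - 3 multiplications
ladder = (9 * m) * full_mult + m * 58 + 3 * m  # full mults, short mults, adds
total = 2 * inversion + ladder       # two inversions + m-bit Montgomery ladder
```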
While this design performs well in regards to hardware resources, it requires many latency cycles. This has a negative effect on the performance and energy consumption of the system. In the following we review the application of different optimization strategies devised to reduce the energy footprint.

Modification 1: Inversion Algorithm
Field inversion provides a convenient way to perform divisions in finite fields. Such operations are required in point conversion. The scalar multiplication algorithm selected requires two inversions. The Wang inversion algorithm is used in the C0 architecture. Although this method is simple and flexible, more efficient solutions exist.

Fermat's Little Theorem
Let q be a prime number and let a be an integer satisfying gcd(a, q) = 1; then

a^(q−1) ≡ 1 (mod q) (9)

This result is known as Fermat's Little Theorem [52]. A simple proof for the theorem is provided in [53]. Consider the product (a)(2a)(3a) . . . ((q − 1)a), which can be written as (q − 1)! a^(q−1). The list of terms in the product modulo q is a complete list of the residues from 1 to q − 1, since no two terms in the list are equivalent modulo q. From this, the product can also be written as (q − 1)! mod q. Thus

(q − 1)! a^(q−1) ≡ (q − 1)! (mod q)

and, since gcd((q − 1)!, q) = 1, cancelling (q − 1)! demonstrates (9).

Divisions on Finite Fields
In 1979, MacWilliams and Sloane demonstrated that every element a ∈ F_p^m, where p = 2^n, satisfies the identity a^(p^m) = a. This, together with the demonstration by Wang in 1985 that a non-zero element a ∈ F_p^m has a unique multiplicative inverse a^−1, shows that for all a ∈ F_2^m, a ≠ 0, the inverse a^−1 can be computed as

a^−1 = a^(2^m − 2) (11)

according to a generalization of (9). This requires m − 2 multiplications and m − 1 squarings [54]. Inverses are important in calculating divisions since b/a = b × a^−1. Therefore, it is possible to perform divisions through a series of repeated multiplications and squarings.

Wang Inversion
The naïve approach for computing inversions through Fermat's Little Theorem is denominated Wang Inversion [51]. As presented in Algorithm 1, this operation requires m − 2 multiplications and m − 1 squarings.

Algorithm 1 Wang Inversion Method.
Albeit slow, the Wang method of inversion is capable of solving for any A(x) which has an inverse over F 2 m with m of any length. It is also important to note that only two registers are required in this procedure.
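As a sketch of the procedure, the following exercises Wang inversion over a small stand-in field, GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1, rather than the F_2^251 field targeted in this work; the structure (m − 2 multiplications, m − 1 squarings, two working values) is unchanged:

```python
# Wang inversion over a toy field GF(2^8); the algorithm is identical
# for any m, so GF(2^8) stands in for F_2^251 here.

M = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mult(a, b):
    """Bit-serial style multiplication in GF(2^M), reduced mod f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def wang_inverse(a):
    """a^(2^M - 2) via M - 2 multiplications and M - 1 squarings."""
    b = a                              # invariant: b = a^(2^i - 1), i = 1
    for _ in range(M - 2):
        b = gf_mult(gf_mult(b, b), a)  # square then multiply: i -> i + 1
    return gf_mult(b, b)               # final squaring: a^(2^M - 2)
```

For instance, `wang_inverse(0x53)` yields `0xCA`, the well-known inverse of 0x53 in this field.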

Itoh-Tsujii Inversion Algorithms
In their work [51], Itoh and Tsujii proposed three field inversion algorithms: the first two for inverses over binary fields and the third for inverses over generic fields. The third case, however, relies on subfield inversion.
The first algorithm is applicable in F_2^m such that m = 2^r + 1. It is based on the observation that the exponent 2^m − 2 in (11) can be rewritten as (2^(m−1) − 1) × 2. Thus, if m = 2^r + 1, it follows that

a^−1 = (a^(2^(m−1) − 1))^2 = (a^(2^(2^r) − 1))^2 (13)

where a^(2^(2^r) − 1) can be computed iteratively, since a^(2^(2^(k+1)) − 1) = (a^(2^(2^k) − 1))^(2^(2^k)) × a^(2^(2^k) − 1). From this, Algorithm 2 is obtained. This procedure requires log2(m − 1) multiplications and m − 1 squarings.
Algorithm 2 can be generalized to any value of m, as proposed in [51]. For this, write m − 1 as

m − 1 = 2^(k_1) + 2^(k_2) + · · · + 2^(k_t)

where k_1 > k_2 > . . . > k_t is an addition chain. Then, knowing that

a^(2^(i+j) − 1) = (a^(2^i − 1))^(2^j) × a^(2^j − 1)

and (15), it can be shown that the inverse of A can be solved as in (13). The Itoh-Tsujii inversion for fields of generic length can be computed following two approaches.
Note that by calculating A^(2^(2^(k_1)) − 1), all the previous partial products are also obtained. For posterior use, these must either be stored using additional registers (Algorithm 3) or re-calculated at the cost of additional operations (Algorithm 4).

Algorithm 3 Itoh-Tsujii Inversion for Generic Binary Fields
Where Extra Storage is Used.
Algorithm 4 Itoh-Tsujii Inversion for Generic Binary Fields Where Additional Cycles are Required.
These algorithms perform inverses over fields of generic length. The addition chains used are based on the binary representation of the field length. It is possible to compute optimal addition chains; however, this task is difficult to perform on constrained devices, given that the field length is variable.
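A sketch of the generic Itoh-Tsujii strategy over the same toy field GF(2^8) follows. The addition chain is read off the binary representation of m − 1, and partial results are kept in variables, i.e., the storage-based trade-off of Algorithm 3:

```python
# Generic Itoh-Tsujii inversion over the toy field GF(2^8); the chain
# is derived from the binary representation of m - 1, as in the paper.

M = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mult(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_sqr(a):
    return gf_mult(a, a)

def itoh_tsujii_inverse(a):
    """a^(-1) with floor(log2(m-1)) + HW(m-1) - 1 multiplications."""
    t, e = a, 1                      # invariant: t = a^(2^e - 1)
    for bit in bin(M - 1)[3:]:       # remaining bits of m - 1, MSB first
        s = t
        for _ in range(e):           # s = t^(2^e): e repeated squarings
            s = gf_sqr(s)
        t, e = gf_mult(s, t), 2 * e  # doubling step: a^(2^(2e) - 1)
        if bit == '1':
            t, e = gf_mult(gf_sqr(t), a), e + 1  # increment step
    return gf_sqr(t)                 # (a^(2^(m-1) - 1))^2 = a^(2^m - 2)
```

For m = 251 (m − 1 = 11111010 in binary) this strategy costs ⌊log2 250⌋ + HW(250) − 1 = 12 multiplications, against the 249 required by the Wang method.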

Comparison of the Inversion Methods Reviewed
A summary of the computational and storage costs for the different inversion algorithms reviewed is provided in Table 1, whereas Table 2 reports the latency and storage estimations of the inversion algorithms for security levels close to 128 bits.

Table 1. Inversion algorithms cost over binary fields of variable length. Let v = HW(m − 1) and u_1 . . . u_i be the binary representation of m, where HW(w) represents the Hamming weight of w. (a) Field multiplications (M) and squarings (S) are performed using a bit-serial multiplier. (b) Multiplications are performed using a bit-serial multiplier and squarings are considered to take 1 cycle.

Inv. Field Multiplications Squarings Storage Bits
As it can be noted from Table 2, there is an improvement in the number of underlying operations when the Itoh-Tsujii algorithm is implemented over the Wang inversion method. Recall that kP requires two field inversions, therefore, reducing the latency of this operation by x reduces the latency of the scalar multiplication by 2x.
The alternative in Algorithm 2 only works for fields that satisfy the condition m = 2 r + 1 and thus would limit the elliptic curves that can be used if selected. The alternative in Algorithm 3 works for any m but its implementation requires increased storage space which would be translated into higher hardware usage-four additional m-bit registers if m = 251; this increment can be calculated as described in Table 1. Whereas the inversion method in Algorithm 4 does not offer the same latency advantages as the alternatives, it preserves generality without requiring additional hardware resources. Moreover, when the overall kP latency is considered and a dedicated squaring module can be added, the performance cost is not as significant (3% difference).
The Itoh-Tsujii inversions exploit the fact that squarings over finite fields are faster than multiplications. To achieve further improvements in the energy consumption of the system, it is necessary to improve the multiplication and squaring modules.

Figure 2 reflects the changes in the architectural design compared to the base architecture in Figure 1. For this second architecture it was necessary to include the field length as an additional input; this value is used to control the iterations in the Itoh-Tsujii inversion algorithm. One of the MUXs that feed the field multiplier input had to be re-wired as well.

The implementation results for the kP architectures (comparing C0 and C1) can be found in Table 3.

Table 3. Implementation results for C0 and C1 at frequencies of f1 = 100 KHz and f2 = 13.56 MHz in the xc6slx16 FPGA.

From these results it can be noted that the modification of the inversion algorithm offers an average reduction of 7% in the energy consumption for the different versions of the kP architecture. On the other hand, the hardware usage shows an average increment of 7%. This is consistent with the data in Table 2.

Modification 2: Field Multiplier
The second strategy consists of replacing the bit-serial multiplier with a digit multiplier. The new multiplier should be created with the same ports as the previous one to ease the interconnection; it ought to provide support for fast ×1 operations (which can be used to store data in the registers); and constant multiplications (with reduced length) should also be preserved. The new multiplier also needs to be parameterized in order to function for any digit size and any field length, preserving the generality of the design.

Digit-Based Multiplier
A digit-based multiplier, as presented in Algorithm 5, allows exploring area/latency tradeoffs for different applications. Implementing a digit-based multiplier makes it possible to explore how much hardware can be traded in order to reduce the cycle count of the architecture. If the design is parameterized, then a single architecture can be used for a wide range of applications.
Algorithm 5 Digit Multiplication in F 2 m Where d is the Digit Size [55].
The digit multiplier from Algorithm 5 uses an underlying combinatorial multiplier; the size of this combinatorial multiplier is what determines the hardware cost of the digit multiplier. A combinatorial multiplier can be seen as a matrix of hardware cells where the width is the digit size and the depth is the operand size.
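A behavioral sketch of an LSD-first digit-serial multiplier over the toy field GF(2^8) follows. It mirrors the structure described above (a d × m combinatorial core consumed once per digit), though the exact scheduling of Algorithm 5 in [55] may differ:

```python
# LSD-first digit-serial multiplication over the toy field GF(2^8);
# d is the digit size, so the main loop runs ceil(m/d) times.

M = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def times_x(a):
    """Multiply by x and reduce mod f(x)."""
    a <<= 1
    return a ^ POLY if a & (1 << M) else a

def comb_product(a, digit, d):
    """The d x m combinatorial core: a * digit, left unreduced."""
    r = 0
    for i in range(d):
        if digit & (1 << i):
            r ^= a << i
    return r

def digit_mult(a, b, d):
    c = 0
    for i in range(0, M, d):         # consume one d-bit digit of b per pass
        c ^= comb_product(a, (b >> i) & ((1 << d) - 1), d)
        for _ in range(d):
            a = times_x(a)           # a <- a * x^d mod f(x)
    for j in range(M + d - 2, M - 1, -1):  # fold the d - 1 overflow bits
        if c & (1 << j):
            c ^= POLY << (j - M)
    return c
```

With d = 1 the loop degenerates to the bit-serial case; raising d cuts the iteration count to ⌈m/d⌉ at the price of a wider combinatorial core.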

Implementation of the Digit Multiplier
We designed a digit multiplier based on two combinatorial multipliers. The design was synthesized for the xc6slx16 FPGA. At this point, the number of IO ports in the digit multiplier makes the place-and-route process infeasible; some post-synthesis results are provided in Table 4.

The digit multiplier was integrated into the architecture C0, which uses the Wang inversion algorithm, to generate a new architecture denominated C2. This aims to determine whether the use of the Itoh-Tsujii inversion is cost-effective when a dedicated squaring module is not implemented. In this case, as can be seen in Figure 3, the bit-serial multiplier is replaced with the digit multiplier; only small changes in the input MUXs are required. The multiplier was also merged into architecture C1 to generate the design shown in Figure 4. This architecture has now been modified with the first two proposed optimizations.

The implementation results for C2 and C3, which now use a digit multiplier, can be found in Table 5. Both designs were synthesized for the xc6slx16 FPGA using operational frequencies of 100 KHz and 13.56 MHz. These results are compared against the implementation results for C0 and C1 from Table 3. In this instance we are evaluating the efficiency of the digit multiplier (used in C2 and C3) compared with the bit-serial multiplier (used in C0 and C1).
The use of a digit multiplier enables energy reductions ranging from 51% to 92%, with hardware increments ranging from 6% to 54%. This trend is consistent for both C2 and C3. The main difference between these architectures is that C3 achieves greater energy reductions for small digit sizes, which imply smaller hardware increments. In the long run (d > 16), however, both architectures tend to reach similar energy consumption levels.

Modification 3: Squaring Module
In the base architecture from Figure 1 the squaring operations are realized as multiplications. The selected kP algorithm performs four squarings per ladder step (4m). By including a dedicated squaring module the latency can be reduced since squarings are more efficient than multiplications in hardware. The C1 and C3 architectures (Figure 2 and Figure 4, respectively) can also benefit from this modification since the inversion method used (Algorithm 4) relies heavily on squarings. Although the advantages of a dedicated squaring component are evident, the hardware costs must be evaluated in order to assess its efficiency.
A combinatorial design for squarings was selected in order to maximize the latency reduction. Note that using a squaring module can reduce the latency independently of the field multiplier used. For this reason, we study both the alternative where field multiplication uses a bit-serial approach complemented with dedicated squarings, and the option where the system uses a digit multiplier together with a squaring module.

Field Squarings
A squaring module is a special kind of field multiplier which exploits the fact that both input operands are the same word. The squaring procedure is presented in the following; with this method, the squaring operation is reduced to a field multiplication of an (m − 2)-bit word by a d-bit word.
Let the input element A be represented as a polynomial A(x):

A(x) = a_(m−1) x^(m−1) + · · · + a_1 x + a_0

Thus, the squaring of A(x) in polynomial form is obtained by spreading the polynomial's coefficients to the even positions, generating a polynomial A^2(x) with 2m − 1 terms:

A^2(x) = a_(m−1) x^(2(m−1)) + · · · + a_1 x^2 + a_0

The coefficients in A^2(x) can be divided into two polynomials, A_h(x) and A_l(x), considering that the elements in A_h(x) are shifted m + 1 positions to the left:

A^2(x) = A_h(x) × x^(m+1) + A_l(x)

The coefficients in each of the new polynomials are:

A_l(x) = Σ_(i=0..(m−1)/2) a_i x^(2i), A_h(x) = Σ_(i=(m+1)/2..m−1) a_i x^(2i−(m+1))

Shifting the elements in A_h(x) can be resolved as a multiplication by the element x^(m+1), which can be obtained, reduced, from the finite field's irreducible polynomial f(x):

x^(m+1) mod f(x)

The final multiplication can be performed either using a bit-serial multiplier or a combinatorial multiplier. The latter was used for this work.
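The derivation above can be sketched as follows over the toy field GF(2^8); here a generic reduction loop stands in for the short multiplication by x^(m+1) mod f(x):

```python
# Field squaring by coefficient spreading over the toy field GF(2^8):
# the coefficients of A(x) land on the even positions of A^2(x), and
# the upper part is then folded back below degree m.

M = 8
POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_square(a):
    s = 0
    for i in range(M):               # a_i x^i  ->  a_i x^(2i)
        if a & (1 << i):
            s |= 1 << (2 * i)
    for j in range(2 * M - 2, M - 1, -1):  # reduce the 2m - 1 term result
        if s & (1 << j):
            s ^= POLY << (j - M)
    return s
```

Since the spreading step is fixed wiring and the reduction depends only on f(x), a combinatorial realization amounts to little more than an XOR network, which supports the design choice made above.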

Implementation of the Squaring Module
The squaring module was added to the architectures that use the Itoh-Tsujii inversion algorithm (C1 and C3) to generate the versions C4 and C5 of the architecture, respectively. The architectural designs of C4 and C5 are illustrated in Figure 5. For both designs the main difference is the addition of the squaring module, which affects the MUXs at the input of the field multiplier. The MUXs at the input of the data registers were also updated to store the results of the squaring module.
The designs were implemented for the xc6slx16 FPGA using operational frequencies of 100 KHz and 13.56 MHz. Table 6 provides implementation results for the kP architectures which include a dedicated squaring module.
In the case where a bit-serial multiplier is used, the energy consumption is halved; compare C4 in Table 6 to C1 in Table 3. The addition of the squaring module enables energy reductions ranging from 77% to 96% if a digit multiplier is used (C5). Comparing C5 to C3, where the reduction ranges from 51% to 92% (see Table 5), the improvement is noticeable. The hardware increment of implementing the squaring module is 20%.

Other Strategies
In a design which contains combinatorial logic and data registers, if these registers are not disconnected from the combinatorial logic spurious calculations will be performed. The switching activity on the combinatorial modules translates into power dissipation, and if the data being processed is not useful then it represents energy being wasted. In order to mitigate the spurious calculations, it is a good design practice to insulate the data registers. This applies both for inputs and outputs. If storing the data is not required, then the register writing must be disabled; if the data in the register is not needed, then it should be masked with zeros.
Register insulation is built into the proposed designs. As can be noted from Figures 1-5, the data register outputs are always connected to a MUX element. In all cases, these modules default to GND when the output is not required, effectively insulating the data in the registers from reaching any combinatorial module.
We evaluated the impact of removing the register insulation from our designs. Although the insulation has a hardware cost and contributes to reducing the power dissipation, the variation is not significant: 13% less energy in the best case and 10% more hardware in the worst case. Table 7 provides a summary of all the designs studied in this section.

Energy Savings in Relation to Area Costs
In this section we describe the design and application of a method to evaluate the efficiency of the optimization techniques that were used to create the kP architectures C1-C5. This section comprises two parts: first we describe a novel method for quantifying the efficiency of energy optimizations in regards to area cost, then we use this method to compare our work with other state-of-the-art solutions.
For the analysis provided in this section we consider that the configurations C0, C1, and C4 are equivalent to the configurations C2, C3, and C5 when d = 1, respectively.

Novel Metric for Efficiency of Energy Oriented Optimizations in Regards to Area Costs
Since it is complicated to characterize the efficiency of an optimization technique in terms of area or energy alone, we have developed an evaluation metric which accounts for both magnitudes.
We start from the energy evaluation and area cost of all the hardware implementations. Figure 6 shows the area (FF, LUT, SLC), power (POW), and energy (ENE) results for the different kP architectures under study.
In this work we describe four challenges which must be overcome by an evaluation metric aimed at comparing hardware realizations; these challenges are described in the following.

Selecting the Data
In FPGA implementations it is customary to use SLCs as the area unit. However, as can be seen from Figure 6, the SLC measurements are prone to outliers. This occurs due to the nature of the PAR process, which follows heuristic approaches. In the ideal case the number of SLCs should be correlated with the number of FFs and LUTs placed in the design. For example, for the FPGA used in our experiments, each SLC contains four FFs, four LUTs, and some connection logic.
The number of FFs required by the kP architectures is given by the number of registers allocated. The modifications applied do not alter the number of registers substantially, which is why this value remains almost constant. In contrast, since most of the changes made require combinatorial logic, the number of LUTs varies accordingly. With this reasoning, we propose to use LUTs as the area indicator for the configurations evaluated. The first challenge for the proposed metric is to define whether the LUT results can accurately represent the hardware increment in the design.
In regards to power dissipation and energy consumption, the quiescent component in FPGAs is almost constant and more significant than its dynamic counterpart; this makes it easier to study the energy profile of an architecture.
To measure the area and energy increments we use percentile differences (∆%). The variation, difference, or increment in the measurement of a particular metric (O C i ) for the architecture C i , with regards to a previous observation (O C i−1 ) for the architecture C i−1 can be computed as in (8). Figure 7 shows the area and energy ∆s for the different architectures created. It is important to recall that in the proposed scheme a positive difference implies an increment, like in the case of area, and a negative difference implies a decrement, like in the case of energy consumption.
From the results in Figure 7 we can note that the LUT usage is in fact a close match to the SLC usage in regards to perceived hardware cost, with less impact from outliers. The R-square yields a closeness of 74.54%, 98.21%, and 16.45% for C2, C3, and C5, respectively. Even though the R-square in the case of C5 is not great, the goodness-of-fit achieved for C2 and C3 hints that the LUT measurements can substitute the SLC as area units when the number of FFs remains constant. This solves the first challenge proposed.

Efficiency Metric
Using the LUT and ENE results we propose the efficiency (EFF) metric in (24):

EFF = ∆%ENE / ∆%LUT (24)

What this value conveys is the energy decrement weighted by the area increment associated with the improvement. If the energy savings are high (negative percentages) and the area costs for said improvements are low (in relation to a reference model), then the efficiency metric will yield a large negative result. Results that are less negative imply that the area cost outweighs the energy savings achieved.
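As a sketch, assuming EFF is the ratio of the percentile energy change to the percentile LUT change (our reading of the description above), the metric behaves as follows; all figures are made-up placeholders:

```python
# Sketch of the EFF metric as a ratio of percentile differences (Eq. (8)
# applied to energy and LUTs against a reference design). Assumed form;
# the numbers below are illustrative placeholders, not measured results.

def pct_diff(new, ref):
    return (new - ref) / ref * 100.0

def eff(lut, ene, lut_ref, ene_ref):
    """Large negative EFF: big energy savings bought with little area."""
    return pct_diff(ene, ene_ref) / pct_diff(lut, lut_ref)

good = eff(lut=1100, ene=20, lut_ref=1000, ene_ref=100)  # -80% / +10%
poor = eff(lut=1500, ene=80, lut_ref=1000, ene_ref=100)  # -20% / +50%
```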

Sensitivity to Frequency Variations
For the results in Figure 8 we used an operational frequency of 100 kHz. However, how does the operational frequency affect the proposed evaluation metric? This is the second challenge for our metric, and the question is important for comparing our results with proposals from the literature, since not all works use the same operational frequency. Furthermore, given the relevance of this magnitude to the energy consumption of a design, any evaluation metric should account for frequency variations. Figure 9 illustrates the differences in energy consumption for the kP architectures under evaluation as the operational frequency changes. As can be noted, even though the measured values vary by two orders of magnitude, the consumption models are similar. In fact, both energy measurements can be used for computing the energy increment of each architecture, and subsequently the efficiency evaluation, at both operational frequencies.
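The invariance of the percentile increments under a common frequency scaling can be checked with a small numerical sketch; the energy figures and the scaling factor below are hypothetical:

```python
# Hypothetical per-kP energy figures (nJ) for two architectures at
# 100 kHz. Raising the clock shortens the run time, so the dominant
# quiescent energy shrinks for every architecture; we model that here
# with a single common factor applied to all measurements.
e_100khz = {"C1": 8200.0, "C2": 7000.0}
scale = 0.1                                    # assumed common factor
e_high = {k: v * scale for k, v in e_100khz.items()}

def pct_delta(cur, prev):
    return (cur - prev) / prev * 100.0

d_low = pct_delta(e_100khz["C2"], e_100khz["C1"])
d_high = pct_delta(e_high["C2"], e_high["C1"])
# The common factor cancels, so the percentile increments coincide and
# the efficiency evaluation is unaffected by the frequency choice.
```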
The evaluation of the efficiency metric for two operational frequencies is provided in Figure 10. The results demonstrate that the proposed metric can account for variations in the operational frequency of the implementation. In these results, the R-square for the configurations C2, C3, and C5 indicates goodness values of 99.87%, 99.91%, and 99.62%, respectively. This answers the second challenge, since the metric appears to be able to isolate the strong influence that the frequency has on the static power consumption, while highlighting the improvements that the architectural modifications achieve on the dynamic power consumption.

Sensitivity to Different Curve Sizes
How does the curve length influence the results? This is the third challenge proposed. In this work we use the elliptic curve BE251 as a case study. However, when comparing our work with the literature, it is noteworthy that most existing lightweight elliptic curve proposals target security levels of at most 80 bits. Since our work targets security levels close to 128 bits, it is necessary to account for the difference in field length.
We have used the results provided in [22] to evaluate the sensitivity of the proposed metric to differences in the curve length. That work is relevant because the authors present implementation results for a scalar multiplication architecture using generic elliptic curves of varying length. We took their area and energy results and used them to evaluate our metric. Figure 11 illustrates the area and energy results from [22] for various curve lengths, and Figure 12 presents the evaluation of the proposed metric using those results. In this case the area increments are measured in GEs and the energy increments in µJ. As can be noted from Figure 12, our metric yields similar calculations for the different experiments, which use varying curve lengths. These figures generate R-square evaluations of 99.99% in all three cases, which implies a close match in the values. Based on this experiment, we can conclude that the proposed metric is not sensitive to variations in the curve length, which solves the third challenge presented. This is a significant result, as it implies that when comparing our results with the state of the art it is not necessary to account for variations in the curve length.

Sensitivity to the Implementation Technology
As evidenced in the previous point, not all works in the literature target FPGA technology. Some of them, as in the case of [22], have been developed for ASIC. Even though our metric can be applied to both scenarios, it remains to be determined whether changing the implementation technology impacts the results of the proposed metric for the same architecture. However, our work focuses on FPGA technology, and we have not identified any work in the literature that would allow carrying out this experiment. For now, we consider this fourth question an open challenge.

Applying the Proposed Metric for Comparing Our Work with the State of the Art
All of the reviewed works which propose low-power or low-energy kP architectures use digit multipliers. This is understandable, given that a digit multiplier allows for significant reductions in energy consumption at relatively low hardware cost.
The proposed metric is particularly useful for comparing such works. To begin with, the ability to synthesize a design for varying digit sizes gives the application flexibility. Some scenarios can accommodate greater hardware demands in order to achieve improved performance, whereas others have stricter area bounds. Therefore, an architecture of this type cannot be evaluated solely on its efficiency for a particular digit size. The curves derived from evaluating the proposed efficiency metric as a function of the digit size, in architectures with digit multipliers, make it possible to use the area under the curve as an objective quantifier of efficiency. To this end, several problems need to be addressed.
First, since in this comparison we refer to the efficiency of a particular architecture using a digit multiplier, each series uses as reference the instance of the implementation where d = 1. In this scenario we aim to quantify the efficiency of an individual architecture. The relative percentile increments can be computed as in (25).
Second, to use the area under the curve as a quantifier, the evaluation bounds must coincide for each configuration; that is, all the designs evaluated must provide implementation results for the same digit interval. The case where d = 1 is mandatory, since it is used as the reference, but any d = n can be chosen as the upper bound.
Once the evaluation boundaries have been defined, we note that the majority of works in the literature do not provide results over continuous intervals of the digit space. For instance, the works in [17] and [25] only provide implementation results for the cases where d ∈ {1, 15} and d ∈ {1, 16}, respectively. A solution to this problem is to use interpolation models to obtain the missing data.

Modeling the Data
The area and energy increments are the source for computing the efficiency of a design. These increments are calculated from the raw data of hardware resources and energy consumption. The former can be modeled with a first-degree polynomial fit of the form y = α_1 d + α_2, while the latter can be adjusted to a power model of the form y = α_3 d^{α_4}, where the α_i ∈ R are constants of the model for each configuration and d ∈ Z is the digit size. The model proposed for the efficiency metric is presented in (26).
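The fitting procedure can be sketched as follows; the samples are hypothetical, the area fit uses a first-degree polynomial, and the power model is fitted linearly in log-log space:

```python
import numpy as np

# Hypothetical area (LUTs) and energy (nJ) samples versus digit size d
d = np.array([1, 2, 4, 8, 15], dtype=float)
area = np.array([1500, 1560, 1680, 1920, 2350], dtype=float)
energy = np.array([8200, 4300, 2300, 1250, 760], dtype=float)

# Area: first-degree polynomial y = a1*d + a2
a1, a2 = np.polyfit(d, area, 1)

# Energy: power model y = a3 * d**a4; taking logs gives
# log(y) = a4*log(d) + log(a3), a linear fit in log-log space
slope, intercept = np.polyfit(np.log(d), np.log(energy), 1)
a4, a3 = slope, np.exp(intercept)

def area_model(x):
    return a1 * x + a2

def energy_model(x):
    return a3 * x ** a4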
The use of a mathematical model over the raw data has the additional advantage that the effects of outliers are mitigated. This is practical, since some works from the literature that target FPGAs do not provide LUT results [25], or do provide them but with significant variance in the flip-flop count [17].
In Figure 13 we show the models obtained for the area and energy results of our C2, C3, and C5 architectures. Figure 14 then presents the evaluation of the efficiency metric applied to these data. The precision of the final model can be observed, with R-square evaluations of 93.12% and 93.66% for C2 and C3, respectively. As can be observed in Figure 13, the area in LUTs recorded for configuration C5 where d = 1 is an outlier. When this anomalous reference point is used for evaluating (24), the results are skewed. Modeling the data prevents such erroneous results by removing the outliers; this is the reason for the significant variation exhibited between C5 and C5_m.
With the updated analysis we can note that the most cost effective solution provided in this work, in regards to preserving the implementation area while reducing the energy profile, is the architecture C5. This design consistently outperforms the other configurations for any digit size.

Quantifying the Efficiency
The data in Figure 14 can be used to obtain the area under the curve for each configuration using a trapezoidal rule as shown in Equation (27).
For this evaluation we define n = 15 and ∆d = 1, since d ∈ Z. From this, the configurations C2, C3, and C5 obtain efficiency scores of −77.59, −80.16, and −97.5, respectively. In this evaluation, the configuration C5 is the one that achieves the greatest energy reduction per area cost overall. Table 8 provides implementation results from works in the literature that are defined as "low power" or "low energy" by their authors. Using these data, we have adjusted the coefficients of the area and energy models for each work. These models are used for evaluating the efficiency metric and obtaining the respective efficiency score for each design. In 77% of the non-trivial models we achieved R-square evaluations above 99%, which indicates that the fitted models closely match the reported data. Table 9 provides the coefficients obtained for the model of each configuration, according to the formula in Equation (26). Figure 15 illustrates the evaluation of the efficiency metric for the different works in the state of the art. Finally, the efficiency scores for each configuration evaluated are reported in Figure 16.
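The trapezoidal evaluation can be sketched as follows; the efficiency samples below are made up, and the helper name is illustrative rather than the exact expression in Equation (27):

```python
def efficiency_score(eff_values, delta_d=1.0):
    """Area under the efficiency curve via the trapezoidal rule, in the
    spirit of Equation (27). `eff_values` holds the (modeled) efficiency
    evaluations for d = 1..n at a step of `delta_d`."""
    vals = list(eff_values)
    return sum((vals[i] + vals[i + 1]) * delta_d / 2.0
               for i in range(len(vals) - 1))

# Hypothetical efficiency samples for d = 1..15 (made-up curve shape)
eff = [-3.0 - 0.25 * d for d in range(1, 16)]
score = efficiency_score(eff)   # more negative implies a better trade-off
```

Because every sample of a well-behaved configuration is negative, a curve that stays further below zero across the whole digit interval accumulates a more negative score.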

Limitations of the Proposed Method
The proposed method is sensitive to data outliers. Since the results are provided as percentages, when the measurements are small, variations in area or energy can skew the results. This is mitigated by fitting models to the data.
Conditions that do not fit the proposed models also lead to unexpected results. If the energy consumption is not reduced or the hardware requirements are not increased, the sign of the results will flip and produce spurious evaluations of (27). While such results might have some use, for the purposes intended in this article they are undesired.

Conclusions
In this paper we have studied the reduction of energy consumption in six different scalar multiplication architectures. Starting from a base low-area design, we improved it following energy- and power-reducing strategies. The result of this process is a comprehensive set of designs with gradual optimization levels, exhibiting area/energy trade-offs that range from moderate to substantial. These scalar multiplication modules can be used in key establishment systems with low-area requirements and low energy consumption.
The novel metric proposed can be applied in studying the impact of any modification to a reference architecture, implemented in hardware. In a sense, it represents the energy costs, weighted by the associated hardware costs. The main goal for using this indicator is to demonstrate the effectiveness of any energy-related improvement in a platform with hardware constraints. We have shown that this metric is capable of accounting for differences in the area units, the operational frequency, and the field size; we also provided a way to reduce its sensitivity to unavailable data and outliers. For these reasons we believe that it is adequate for comparing works implemented under heterogeneous conditions.
From the proposed architectures, the configuration C5 exhibits the greatest efficiency. This design employs a digit multiplier, the Itoh-Tsujii inversion algorithm, and a dedicated squaring module, and also implements datapath insulation. Compared against the state of the art, this configuration turned out to be 13.33% more efficient than the closest work. In terms of efficiency, our proposal represents a good candidate for implementation in environments with area and energy constraints such as IoT devices.

Funding: This research was funded by CONACyT Mexico, grant number 336750. The research was also funded by "Fondo Sectorial de Investigación para la Educación", CONACyT Mexico, through project number 281565.