Completing the Complete ECC Formulae with Countermeasures

This work implements and evaluates the recent complete addition formulae for the prime order elliptic curves of Renes, Costello and Batina on an FPGA platform. We implement three different versions: (1) an unprotected architecture; (2) an architecture protected through coordinate randomization; and (3) an architecture with both coordinate randomization and scalar splitting in place. The evaluation is done through timing analysis and test vector leakage assessment (TVLA). The results show that applying an increasing level of countermeasures leads to an increasing resistance against side-channel attacks. This is the first work looking into side-channel security issues of hardware implementations of the complete formulae.


Introduction
Public-key cryptography in constrained embedded devices is usually based on elliptic curve cryptography (ECC) because of the relatively low resource and power consumption compared to other public-key systems [1,2]. Since modern use case scenarios in the Internet of Things usually allow possible attackers to be in the vicinity of the embedded device, the leakage of sensitive information through side-channels becomes a realistic threat. The first step in making an implementation side-channel resistant is to protect it against simple power analysis (SPA) attacks [3]. This can be done by making the execution time and the instantaneous power consumption of the operations independent of the processed data and executed instructions. In the case of ECC, this strategy should be applied, e.g., to the point multiplication algorithm, the point addition and point doubling algorithms and the operations in the underlying finite field. More powerful than SPA are side-channel attacks that use statistics to process many measurements and exploit the correlation between (secret) data and physical leakages. This kind of side-channel attack is usually referred to as differential power analysis (DPA) [3]. An effective way of protecting ECC implementations against DPA attacks at the algorithmic level is to apply randomization countermeasures. Examples are scalar blinding and point blinding, which randomize the point representation and the key bits' evaluation [4].
As the point operations are different in principle, there exist different formulae to compute an addition of two different points or a doubling of one point. This has led to insecure implementations, as the two operations feature different power consumption patterns. Recently, there have been efforts to use a single set of formulae to compute both point addition and doubling [5]. Although this slows down the implementation, it is an important first step in making the implementation side-channel resistant. The formulae derived in [5] are only applicable to a special type of curve, i.e., so-called Edwards curves. Nevertheless, the idea has enabled the balancing of point operations in an intrinsic manner, i.e., without adding dummy operations [6].
In this paper, we evaluate the power analysis resistance of a completely balanced ECC implementation, based on recent advances in the development of complete addition formulae for all prime order Weierstrass curves by Renes et al. [7]. In a second step, we protect the implementation using point randomization techniques, and in a third step, we use random scalar splitting. The architecture is implemented on a SAKURA-GII board containing a Spartan-6 FPGA. This is the first paper that addresses the side-channel security of unprotected and protected hardware implementations of the complete formulas by Renes et al.
The paper is structured as follows. In Section 2, we give background information on the three implementation versions and the type of power analysis attacks we perform. Section 3 gives an overview of related work on protected ECC implementations. In Section 4, the experimental setup is described. Section 5 presents and discusses the measurement results. Finally, Section 6 concludes the paper and gives an outlook on future work.

Background Information
Here, we give some relevant background information on elliptic curves over a prime field, and we recall the complete formulas of Renes et al. [7]. In addition, we discuss some issues in side-channel analysis that are ECC specific.

Formulas
Let $\mathbb{F}_q$ be a finite field of characteristic $p$, i.e., $q = p^n$ for some $n$, and assume that $p \neq 2, 3$. Typically, the curves used in security applications are defined over $\mathbb{F}_p$, so for $n = 1$, but the formulas for the elliptic curve addition/doubling work for any $n$. For arbitrary $a, b \in \mathbb{F}_q$, an elliptic curve $E$ over $\mathbb{F}_q$ is defined as the set of solutions $(x, y)$ to the curve equation $E : y^2 = x^3 + ax + b$ with an additional point $O$, called the point at infinity. Those points $(x, y)$, together with $O$, form a group with $O$ as its identity element. One has to make sure that $a$ and $b$ are chosen to meet the security requirements as defined by ECC standards.
Elliptic curve cryptography [1,2] relies on the difficulty of the elliptic curve discrete logarithm problem (ECDLP). This means that given two points P, Q on an elliptic curve, it is hard to find a scalar $k \in \mathbb{Z}$ such that Q = kP, if it exists. Therefore, the main component of curve-based cryptosystems is the scalar multiplication operation (k, P) → kP. Namely, all ECC protocols are typically based on a few scalar multiplications, i.e., the computations of kP where k is a scalar and P is a known point. The computation of kP is performed via repeated point additions (P + Q) and doublings (P + P = 2P). Both operations can be performed by the use of the same sequence of instructions [7] that consist of several finite field operations, i.e., modular multiplications, additions and inversions. The exact counts of field operations depend on the choice of curves and coordinates; see [8]. Modular multiplications are much more expensive than additions in terms of time, area and memory, but the most expensive are inversions. One way to avoid inversions is to choose a different point representation, i.e., to represent points with projective coordinates.
Using projective coordinates, there are no inversions, but the number of modular multiplications increases, which makes the design of a hardware multiplier crucial for efficient implementations.
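The work of Renes et al. [7] presents addition formulas (to realize the group law) for curves in the short Weierstrass form embedded in the projective plane. They compute the sum of two points $P = (X_1 : Y_1 : Z_1)$ and $Q = (X_2 : Y_2 : Z_2)$ as $P + Q = (X_3 : Y_3 : Z_3)$, where:
$$\begin{aligned}
X_3 &= (X_1 Y_2 + X_2 Y_1)(Y_1 Y_2 - a(X_1 Z_2 + X_2 Z_1) - 3b Z_1 Z_2) - (Y_1 Z_2 + Y_2 Z_1)(a X_1 X_2 + 3b(X_1 Z_2 + X_2 Z_1) - a^2 Z_1 Z_2), \\
Y_3 &= (Y_1 Y_2 + a(X_1 Z_2 + X_2 Z_1) + 3b Z_1 Z_2)(Y_1 Y_2 - a(X_1 Z_2 + X_2 Z_1) - 3b Z_1 Z_2) + (3 X_1 X_2 + a Z_1 Z_2)(a X_1 X_2 + 3b(X_1 Z_2 + X_2 Z_1) - a^2 Z_1 Z_2), \\
Z_3 &= (Y_1 Z_2 + Y_2 Z_1)(Y_1 Y_2 + a(X_1 Z_2 + X_2 Z_1) + 3b Z_1 Z_2) + (X_1 Y_2 + X_2 Y_1)(3 X_1 X_2 + a Z_1 Z_2).
\end{aligned} \qquad (1)$$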
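To make the complete addition law of [7] concrete, the following minimal Python sketch evaluates it on a toy curve and converts the result back to affine coordinates. The curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ (prime order 19, generator (5, 1)) and all function names are illustrative assumptions; they are not the P-256 parameters or the hardware schedule used in this work, and the sketch makes no attempt to optimize the operation count.

```python
# Toy parameters (assumption, not from the paper): y^2 = x^3 + 2x + 2 over F_17,
# a textbook curve of prime order 19 with generator (5, 1).
P_MOD, A, B = 17, 2, 2

def complete_add(P, Q, p=P_MOD, a=A, b=B):
    """One formula for P + Q, 2P, P + O and P + (-P) in projective coordinates."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    b3 = 3 * b
    xy = X1 * Y2 + X2 * Y1
    yz = Y1 * Z2 + Y2 * Z1
    xz = X1 * Z2 + X2 * Z1
    xx, yy, zz = X1 * X2, Y1 * Y2, Z1 * Z2
    t_minus = yy - a * xz - b3 * zz
    t_plus = yy + a * xz + b3 * zz
    u = a * xx + b3 * xz - a * a * zz
    v = 3 * xx + a * zz
    X3 = xy * t_minus - yz * u
    Y3 = t_plus * t_minus + v * u
    Z3 = yz * t_plus + xy * v
    return (X3 % p, Y3 % p, Z3 % p)

def to_affine(P, p=P_MOD):
    """Map (X : Y : Z) to (x, y); None stands for the point at infinity."""
    X, Y, Z = P
    if Z % p == 0:
        return None
    z_inv = pow(Z, p - 2, p)  # inverse via Fermat's little theorem
    return (X * z_inv % p, Y * z_inv % p)

def affine_add(P, Q, p=P_MOD, a=A):
    """Textbook affine addition with explicit special cases (reference only)."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

G = (5, 1, 1)
print(to_affine(complete_add(G, G)))  # doubling uses the same code path as addition
```

Unlike `affine_add`, `complete_add` contains no data-dependent branches, which is the property that makes the formulas attractive from a side-channel perspective.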

Side-Channel Analysis and Countermeasures
In many ECC applications that compute kP, k is a secret key. This implies that this operation has to be protected against all attacks. In particular, many side-channel attacks [3,9] and countermeasures [4] have been proposed. To ensure protection against SPA attacks, it is important to use regular scalar multiplication algorithms, e.g., Montgomery ladder [10] or double-and-add-always [4], executing both an addition and a doubling operation per scalar bit. On the other hand, the regularity is also important for the group operation, so addition and doubling should preferably be executed via an identical sequence of field operations. This suggests a clear preference for the complete formulas. Our first implementation is based on these formulas.
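The regular structure of the Montgomery ladder can be illustrated with a short Python sketch. It uses a toy curve ($y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, order 19) and a plain affine group law, both of which are assumptions made for illustration; the Python `if` on the key bit is of course not side-channel safe and only shows that every iteration performs exactly one addition and one doubling.

```python
P_MOD, A = 17, 2  # toy curve y^2 = x^3 + 2x + 2 over F_17 (assumption)

def add(P, Q, p=P_MOD, a=A):
    """Affine group law; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def montgomery_ladder(k, P, bits):
    """Compute kP with one addition and one doubling per scalar bit."""
    R0, R1 = None, P  # invariant: R1 - R0 = P
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            R0, R1 = add(R0, R1), add(R1, R1)
        else:
            R0, R1 = add(R0, R0), add(R0, R1)
    return R0

print(montgomery_ladder(2, (5, 1), bits=5))
```

With a group law that has no special cases, such as the complete formulas, both branches execute the same sequence of field operations.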

Countermeasures
In order to protect the implementation against DPA attacks, we use projective coordinate randomization in our second implementation. This countermeasure exploits the fact that the Z-coordinate can be chosen randomly [4]. This comes down to choosing a different Z-coordinate for each point multiplication during the conversion of the input point P to projective coordinates.
In a third implementation, we implement further protection mechanisms against DPA attacks by adding randomized scalar splitting. In this countermeasure, the scalar multiplication kP is randomly split into two scalar multiplications, namely rP and (k − r)P, with r a random number.

Test Vector Leakage Assessment Methodology
We evaluate the SPA and DPA leakage of our hardware architecture by running the test vector leakage assessment methodology (TVLA) [11-13]; we follow and extend the TVLA approach for ECC from [14]. TVLA is a testing methodology for side-channel resistance validation that is based on the following rationale: side-channel attacks, such as SPA and DPA, exploit the presence of information about any sensitive intermediates within the traces collected from a device. The approach uses statistical hypothesis testing to detect if one of a number of sensitive intermediates significantly influences the measurement data.
TVLA consists of two phases. The measurement phase is based on the collection of side-channel traces when standardized test vectors are provided as input to the algorithm being tested and establishes requirements for power measurement equipment and setup, data collection, signal alignment and pre-processing. The analysis phase is based on Welch's t-test, which can detect different types of leakages and allows the analyst to identify points in time that deserve further investigation. The testing methodology has so far been applied to AES [11], RSA [12,13] and ECC [13,14].

ECC Hardware Implementations
There are numerous works published on hardware implementations of ECC focusing on various platforms, fields, curves and bases. Even if we narrow our choice to elliptic curves over prime fields only, there are numerous papers in the literature that could be mentioned. Rather than attempting to be exhaustive, we focus on more recent hardware implementations of ECC over prime fields. For those interested more generally in the topic, a recent survey is given by Marzouqi et al. [15].
Hardware implementations allow for different trade-offs of resources (in terms of silicon area, configurable look-up tables, embedded mathematical and memory blocks), operational speed (in terms of latency or throughput) and power or energy consumption. In this work, we concentrate on a low-resource implementation. Publications that use the same approach are by Roy et al. [16], Pöpper et al. [17] and Vliegen et al. [18], where the last one is the basis of this work. All of these implementations have a very small datapath on which the algorithmic operations are executed sequentially. Because of the very restricted amount of resources, the number of cycles can increase quadratically for certain multiplication algorithms. Since all ECC operations are constructed on top of field arithmetic, the latency of an ECC point multiplication easily grows. To compensate for this loss in time, resource-constrained architectures are usually carefully optimized to perform all operations in the fewest possible cycles. For example, Roy et al. [16] optimize the reduction for NIST curves, while Vliegen et al. [18] optimize for Montgomery multiplication.
On the other side of the spectrum, there is the work of Alrimeih and Rakhmatov [19] and Guillermin [20], which focus on a higher speed by utilizing a large amount of resources. This trade-off between area and time resources can only be optimized with an application scenario in mind. Some applications, like servers, will need high throughput, while entrance authorization and vehicle communication need low latency. In IoT use cases, very inexpensive and low power devices are required.
A different field arithmetic implementation is presented by Guillermin [20] and Esmaeildoust et al. [21], based on the residue number system (RNS). Usually, implementations represent the values in one basis, thus requiring a multiplier of the same size as the basis or iterating several steps in a smaller multiplier. RNS-based implementations work with several small bases. Therefore, one can split a single multiplication into several smaller and faster multiplications and combine all results afterwards. Because they work in parallel, RNS implementations require many resources, but also achieve very good timing results.
Research based on different types of curves is performed by Sasdrich and Güneysu [22], Baldwin et al. [23] and Järvinen et al. [24]. These implementations are not based on standard Weierstrass curves, but on curves with more efficient and faster arithmetic. Unfortunately, some applications do not allow the use of those new curves. Nevertheless, given the high efficiency, this scenario is expected to change.
In terms of adding side-channel protection, there is the work of Ghosh et al. [25] and Pöpper et al. [17]. These solutions add both simple and differential power analysis protection. They show that even though most ECC implementations tend to focus on scalar multiplication only, side-channel protection needs to be evaluated on a higher level for real-life applications.

Experimental Setup
This section first elaborates on the hardware architecture that computes the point operations. Subsequently, the setup to perform the actual measurements is discussed.

Hardware Architecture
The hardware architecture is a standalone 32-bit elliptic curve processor (ECP) as presented in [18]. A block diagram of this processor is depicted in Figure 1; it consists of the typical components: an instruction memory, a data memory, a modular arithmetic and logic unit (ALU) and a control unit. The curve on which the processor operates is configurable through the initialization of the data memory and is set to the NIST standardized P-256 curve.
[Figure 2]
A conversion to Montgomery form (and back) is required: A = a × R, with A the Montgomery form of a. As originally published by Montgomery [26], the Montgomery multiplication requires a final subtraction. This can be avoided at the cost of an additional word in the datapath, as presented by Walter in [27]. Therefore, the data memory has a width of 256 + w bits, which, for a 32-bit processor, rounds up to a width of 288 bits.
The computation includes the following steps:
• execution of the Montgomery ladder [28], performing a scalar multiplication of a hard-coded scalar with P;
• calculation of the modular inverse $Q_Z^{-1}$;
• conversion of Q back to affine coordinates and undoing the Montgomery form: $Q(Q_X, Q_Y, Q_Z)$ → scalarP(x, y).
The modular inverse is calculated by using Fermat's little theorem, which states that $m^p \equiv m \pmod p$ for a prime p; when gcd(m, p) = 1, this is equivalent to $m^{p-1} \equiv 1 \pmod p$. Because p is a prime number, the latter condition is met for every m that is not a multiple of p. The modular inverse can hence be computed as $m^{-1} \equiv m^{p-2} \pmod p$. This exponentiation is achieved by a square-and-multiply operation, using the modular multiplication unit in the ALU.
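The inversion by exponentiation can be sketched in a few lines of Python. The function names are hypothetical and the sketch operates on ordinary integers rather than on values in Montgomery form, so it only illustrates the square-and-multiply structure, not the processor's instruction sequence.

```python
# NIST P-256 field prime
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation using only modular multiplications."""
    result = 1
    for i in reversed(range(exponent.bit_length())):
        result = result * result % modulus      # square for every exponent bit
        if (exponent >> i) & 1:
            result = result * base % modulus    # multiply when the bit is set
    return result

def fermat_inverse(m, p):
    """Return m^(p-2) mod p, the inverse of m modulo a prime p."""
    if m % p == 0:
        raise ZeroDivisionError("0 has no modular inverse")
    return square_and_multiply(m, p - 2, p)

z = 0x1234567890ABCDEF
z_inv = fermat_inverse(z, P256)
print((z * z_inv) % P256 == 1)
```

Because the exponent p − 2 is a public constant, the sequence of squarings and multiplications does not depend on secret data.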
The ECP component is wrapped into a top level component. This wrapping provides an interface for the measurement setup, discussed in Section 4.2. Table 1 summarizes the required resources on a Xilinx Spartan-6 LX75 FPGA. This table reports both the occupied resources for the standalone ECP and for the wrapped ECP. Adding countermeasures to the ECP component does not have an impact on the resource occupation, because these modifications only touch the content of the instruction memory in BRAM. The reported number of nine BRAMs in Table 1 breaks down into eight BRAMs for the data memory and one BRAM for the instruction memory. For all versions of the ECP, the instructions fit into one BRAM.

Measurement Setup
The power measurement setup, shown in Figure 3, consists of an oscilloscope, an FPGA board and a commodity PC. The devices are arranged in such a way that the PC can coordinate and store all power measurements/traces. The oscilloscope in our experiment is the Teledyne Lecroy Waverunner 610Zi. It was configured with a sampling rate of $10^8$ samples/s and at most $32 \times 10^6$ samples. This is enough in terms of sample rate and sample size, since the cryptographic core is running at 6 MHz and takes about 300 ms to complete one point multiplication. Furthermore, the oscilloscope has TCP/IP support for both controlling and downloading the measurements, which helps to automate the entire process.
The FPGA board in Figure 3 is the SAKURA-GII board [29]. The board has two FPGAs, one USB/serial controller and two separate power regulators. For our setup, one FPGA was configured to act as an interface [29] between the serial inputs and the other FPGA that contains the ECC core. Although the ECC core can handle a larger clock frequency, we let it operate at a 6-MHz clock, given the 48-MHz frequency of the USB/serial communication. The separation of the two FPGAs, with different power regulators, lowers the amount of noise in the measurements introduced by the USB communication.
The traces were acquired with the following procedure. First, the PC configures the oscilloscope in terms of the number of channels, the trigger event and the number of samples. Then, the elliptic curve parameters are sent via USB to the SAKURA-GII FPGA board. After verifying the correct reception of the parameters, the PC signals the FPGA to start. When the FPGA receives the start signal, it automatically sends a trigger signal to the oscilloscope. After both the power acquisition and the FPGA computation have finished, the PC verifies that the computed value matches the correct output and downloads the power measurements from the oscilloscope. This entire process is repeated until all measurements have been done. Subsequently, the acquired traces are analyzed using Riscure's Inspector software package (http://www.riscure.com/).

Application of the TVLA Methodology to ECC
We apply the TVLA methodology [12,13] to our implementation using the aforementioned measurement setup, following the approach from [14], where the authors apply the TVLA methodology to evaluate an implementation of ECDH-Curve25519. Specifically, we select a set of test vectors to be used for the power measurement phase, which cover normal and special cases for the chosen implementation, as shown in Table 2. Table 3 shows the categories of special values used in Sets 4 and 5. We use a notation that is similar to [14].

Table 2. Sets of test vectors for test vector leakage assessment (TVLA) leakage analysis, where k is the secret scalar and P is the point.

| Set # | Properties | Rationale |
|---|---|---|
| 1 | constant k, constant P | This is the baseline. The tests compare power consumption from the other sets against it. |
| 2 | constant k, varying P | The goal is to detect systematic relationships between the power consumption and the P value. |
| 3 | varying k, constant P | The goal is to detect systematic relationships between the power consumption and the k value. |
| 4 | constant k, special P | Edge cases of the algorithms used. |
| 5 | special k, constant P | Edge cases of the algorithms used. |

Table 3. Categories of special values for k, x and y, where x and y are coordinates of the input point, k is the scalar and l is the subgroup order.

Leakage Analysis
The leakage analysis of our implementation adapts the approach from [14] and is conducted in the following way. Let $\{DS_1, \ldots, DS_5\}$ be the set of power traces corresponding to the selected test vectors. The full test consists of running the (pairwise) tests described in [12,13] for each of the following pairs of datasets: $\{(DS_1, DS_2), \ldots, (DS_1, DS_5)\}$. If any of the previous tests fail, then the top-level test fails, and the implementation is deemed to have failed; otherwise, it is deemed to have passed the tests. We chose the confidence threshold C = 4.5, the same as in [12-14].
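The top-level pass/fail decision can be sketched as follows. The sketch assumes that each pairwise test is run twice on independent trace subsets and that a sample index counts as leaking only when |t| exceeds C in both runs; this repeat criterion, the dictionary layout and the function names are assumptions made for illustration and may differ in detail from the procedure in [12,13].

```python
THRESHOLD = 4.5  # confidence threshold C

def leaking_indices(t_run1, t_run2, threshold=THRESHOLD):
    """Sample indices where |t| exceeds the threshold in both independent runs."""
    return [i for i, (t1, t2) in enumerate(zip(t_run1, t_run2))
            if abs(t1) > threshold and abs(t2) > threshold]

def tvla_passes(pairwise_results, threshold=THRESHOLD):
    """pairwise_results maps a dataset pair, e.g. ("DS1", "DS3"),
    to a tuple (t_run1, t_run2) of t-statistic traces."""
    return all(not leaking_indices(run1, run2, threshold)
               for run1, run2 in pairwise_results.values())

# Made-up t-values, for illustration only.
results = {
    ("DS1", "DS2"): ([1.0, 5.0, 0.3], [4.8, 1.2, 0.1]),  # isolated spikes
    ("DS1", "DS3"): ([0.5, 30.0, 0.2], [0.2, 28.0, 0.4]),  # repeatable spike
}
print(tvla_passes(results))
```

In this toy input, the first pair only shows spikes at different sample indices, whereas the second pair exceeds the threshold at the same index in both runs and therefore makes the top-level test fail.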

Timing Analysis
Timing analysis of the implementation (with and without coordinate randomization) was performed based on the power measurements. We have analyzed 200 measurements with different private keys. The results show that the implementations with and without coordinate randomization are constant time, with respect to the private key.
If the private key is randomized, then the execution time only differs if some of the most significant bits of the scalar are zeroes; this is as expected, since the multiplication iterations are not performed for the most significant zero bits of the scalar.

TVLA Analysis
The leakage analysis methodology was applied to our implementation in three different settings:
• with no countermeasures applied;
• with point randomization;
• with point randomization and scalar randomization.

Unprotected Implementation
First, we test the implementation without any countermeasure implemented. For this test, we assume that T = 200. We only performed the tests for $DS_1$, $DS_2$ and $DS_3$, because we detected a significant leakage already at this stage; we chose T relatively low since we expected to find a significant leakage. Figure 4 shows the t-statistics for a small range of sample indices (time instants), for one run of Welch's t-test for group $A_3$ $(S_{A,1}, S_{A,3})$ of vectors selected from $DS_1$ and $DS_3$ and the same test run over a random grouping $R_3$ $(S_{R,1}, S_{R,3})$. Groups $A_j$ and $R_j$ are partitions of the test vector sets $DS_i$ and $DS_j$ (where i = 1 and j = 3). We compute a t-statistics trace using the following formula:
$$t = \frac{\mu_{S_{A,1}} - \mu_{S_{A,3}}}{\sqrt{\dfrac{\sigma_{S_{A,1}}^2}{N_{S_{A,1}}} + \dfrac{\sigma_{S_{A,3}}^2}{N_{S_{A,3}}}}},$$
where $\mu_x$, $\sigma_x$ and $N_x$ denote, respectively, the average of all the traces, the standard deviation and the number of traces in the partition x. The t-statistics for the group $A_3$ is far above C = 4.5; it even reaches 400. The t-statistics for the group $R_3$ rarely reaches C = 4.5. We do not need to consider negative values because we compute the absolute value of the t-statistics.
We consider values around C = 4.5 to be ghost peaks. These results show that the implementation is vulnerable to DPA, as expected, since no countermeasure against DPA is employed.
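A t-statistics trace of this kind can be computed with the following pure-Python sketch of Welch's t-test, applied independently at every sample index. The synthetic Gaussian traces, the leak position and the helper names are assumptions used only to exercise the function; they are unrelated to the measured data.

```python
import random
from math import sqrt, inf
from statistics import mean, variance

def welch_t_trace(group_a, group_b):
    """Absolute Welch t-statistic per sample index for two lists of traces."""
    t_trace = []
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        diff = abs(mean(col_a) - mean(col_b))
        # variance() is the sample variance (sigma squared) of one partition
        denom = sqrt(variance(col_a) / len(col_a) + variance(col_b) / len(col_b))
        if denom == 0:
            t_trace.append(inf if diff else 0.0)
        else:
            t_trace.append(diff / denom)
    return t_trace

def synthetic_traces(n, length, leak_index=None, offset=0.0, seed=0):
    """Gaussian noise traces with an optional constant offset at one index."""
    rng = random.Random(seed)
    traces = []
    for _ in range(n):
        trace = [rng.gauss(0.0, 1.0) for _ in range(length)]
        if leak_index is not None:
            trace[leak_index] += offset
        traces.append(trace)
    return traces

baseline = synthetic_traces(200, 8, seed=1)
leaky = synthetic_traces(200, 8, leak_index=3, offset=2.0, seed=2)
t = welch_t_trace(baseline, leaky)
print([i for i, value in enumerate(t) if value > 4.5])
```

For real trace sets with millions of samples, a vectorized implementation would be preferable; the per-index loop here is kept for readability.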

Implementation Protected with Coordinate Randomization
Second, we test the implementation with the coordinate randomization enabled. This countermeasure is implemented in the following way: instead of initializing Z to one, we initialize Z randomly; then, we update X and Y by multiplying them by the new random Z.
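The effect of the randomization can be illustrated with a minimal Python sketch, assuming a toy field $\mathbb{F}_{17}$ and hypothetical function names. It shows that every random nonzero Z yields a different representation (X : Y : Z) of the same affine point.

```python
import secrets

P_MOD = 17  # toy field prime (assumption; the processor uses the P-256 prime)

def randomize_coordinates(x, y, p=P_MOD):
    """Lift the affine point (x, y) to (Z*x : Z*y : Z) with a fresh random Z."""
    z = 1 + secrets.randbelow(p - 1)  # uniform in [1, p - 1], never zero
    return (z * x % p, z * y % p, z)

def to_affine(X, Y, Z, p=P_MOD):
    """Undo the projective representation by dividing by Z."""
    z_inv = pow(Z, p - 2, p)
    return (X * z_inv % p, Y * z_inv % p)

for _ in range(3):
    X, Y, Z = randomize_coordinates(5, 1)
    print((X, Y, Z), "->", to_affine(X, Y, Z))
```

All intermediate values of the subsequent scalar multiplication are scaled by powers of the random Z, which is what decorrelates them from the known input point.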
We perform the TVLA analysis similarly to the unprotected implementation, but we set T = 1000 and perform the tests for all DS 1 , . . . DS 5 . Note that one can argue that T = 1000 is a relatively low number of traces for a t-test. However, we need to acquire the whole execution of the scalar multiplication (i.e., 32,000,000 samples) for 13T = 13,000 traces; this acquisition results in a trace set of approximately 300 gigabytes. Therefore, for the sake of efficiency, we decided not to acquire larger trace sets.
The results of the TVLA analysis are as follows:
• the t-statistics values for the groups $A_3$ and $A_5$ (both categories) are far above C = 4.5, as presented in Figure 5; the values reach 30.0 for $A_3$, 9.0 for $A_5$ and Category 3 and 17.0 for $A_5$ and Category 4;
• the t-statistics for the group $A_2$ and the group $A_4$ are almost always less than C = 4.5, and they rarely reach the threshold (never significantly).
We consider the values around 4.5 to be ghost peaks, because the same levels of values are achieved in the corresponding random groupings. Additionally, we perform the analysis for two equal and disjoint parts of A 2 that contain T/2 elements each. The spikes that are above 4.5 do not occur at the same time for both parts. Therefore, we consider these spikes to be false positives. Based on the results presented above, we conclude that the implementation protected with coordinate randomization does not leak the intermediate point values during the scalar multiplication. However, the implementation seems to leak the key bit values; therefore, we suspect that the implementation might be susceptible to attacks similar to address-based DPA [30] or address-based template attacks [31].

Implementation Protected with Coordinate Randomization and Scalar Splitting
Since the coordinate randomization does not protect the scalar itself, we consider an additional countermeasure that hides the scalar: scalar splitting [32,33]. We consider the additive version of splitting: the scalar k is split into two values r and k − r, where r is a random value of the size of k. Subsequently, for an input point P, two scalar multiplications are performed: (1) [r]P and (2) [k − r]P. Then, the two resulting points are added to obtain [k]P.
Observe that for each splitting execution, the random value r is chosen independently at random from the previous random choices. As a result, all scalars used in the first multiplication are independent of each other; the same holds also for the second one. Therefore, since we test for first order leakage, it is sufficient to test a single scalar multiplication [r]P using TVLA.
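Additive scalar splitting can be sketched in Python as follows. The toy curve ($y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$, group order 19), the plain double-and-add multiplication and the reduction of k − r modulo the group order are assumptions made to keep the example short; they do not reflect the ladder or the scalar encoding of the actual implementation.

```python
import secrets

P_MOD, A, ORDER = 17, 2, 19  # toy curve and its prime group order (assumption)

def add(P, Q, p=P_MOD, a=A):
    """Affine group law; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, p - 2, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mul(k, P):
    """Plain double-and-add, used here only as a reference multiplication."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R

def split_scalar_mul(k, P, n=ORDER):
    """Compute [k]P as [r]P + [k - r]P with a fresh random r per call."""
    r = secrets.randbelow(n)
    return add(scalar_mul(r, P), scalar_mul((k - r) % n, P))

G = (5, 1)
print(split_scalar_mul(7, G), scalar_mul(7, G))
```

Each call draws a new r, so neither of the two scalars processed by the multiplier is fixed across executions, even though the final result is always [k]P.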
We perform the TVLA analysis for two groups, $DS_3$ and $DS_5$, acquired during the previous analysis in the following way: we divide each group into two equal non-intersecting sets of size T/2 = 500 and compute the t-value between the sets. The t-statistics values for two sets from $DS_3$ are presented in Figure 6; we obtained similar results for $DS_5$. Again, we consider a single value that is slightly above 4.5 to be a ghost peak, because we achieve a similar value for a random grouping.
Figure 6. T-statistics versus sample index for two groups coming from $DS_3$.

Results and Discussion
As expected, the unprotected implementation is constant time and resistant against SPA attacks. The implementation protected with both coordinate randomization and scalar splitting is resistant against first-order attacks, like SPA and DPA. Applying only the coordinate randomization as a countermeasure, however, is not enough to protect the implementation.
Observe that our analysis is aimed at detecting first-order leakage. We have not evaluated the implementation against higher order attacks, like cross-correlation [34], horizontal cross-correlation [35], single trace template attacks [31] and horizontal cluster attacks [36]; note that the scalar splitting should mitigate the cross-correlation attack. We leave evaluating the implementation against these attacks as future work.

Conclusions
This paper presented the FPGA implementation and side-channel evaluation of three algorithms for ECC point multiplication, all based on the complete addition formulae introduced by Renes, Costello and Batina, with an increasing level of side-channel protection. Based on the TVLA method, the results indicate that only the implementation with all countermeasures in place is resistant against first-order DPA. Further improvements include the integration of countermeasures against higher order attacks.