3. Related Work
The basic vulnerability of scalar multiplication stems from the observation that a secret key bit of 0 requires only a point doubling, whereas a secret key bit of 1 requires a point addition as well as a point doubling. An obvious countermeasure is to replace double-and-add with a variant that performs dummy operations, so that every key bit triggers the same sequence of operations.
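The contrast between the plain method and the dummy-operation variant can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: the function names are hypothetical, and integers modulo a small number stand in for elliptic curve points so that the block stays self-contained. The operation log ("A" for addition, "D" for doubling) plays the role of a side-channel trace.

```python
# Stand-in group (assumption): integers modulo N replace curve points.
N = 1009
add = lambda x, y: (x + y) % N
dbl = lambda x: (2 * x) % N

def rtl_double_and_add(k, P):
    """Plain right-to-left double-and-add; the log depends on the key bits."""
    Q, R, log = 0, P, []
    while k > 0:
        if k & 1:
            Q = add(Q, R)        # addition happens only for 1-bits
            log.append("A")
        R = dbl(R)
        log.append("D")
        k >>= 1
    return Q, "".join(log)

def rtl_double_and_add_always(k, P, nbits):
    """Variant with dummy additions; the log is the same for every key."""
    Q, R, log = 0, P, []
    for i in range(nbits):
        T = add(Q, R)            # always computed
        log.append("A")
        if (k >> i) & 1:
            Q = T                # real addition: keep the result
        # otherwise T is a dummy result and is discarded
        R = dbl(R)
        log.append("D")
    return Q, "".join(log)

plain_result, plain_log = rtl_double_and_add(0b1011, 5)
always_result, always_log = rtl_double_and_add_always(0b1011, 5, 4)
_, other_log = rtl_double_and_add_always(0b0100, 5, 4)
```

In the plain version the log reveals each 1-bit as an extra "A", while the dummy-operation version produces an identical log for both keys. A production implementation would also need to make the selection of `T` constant-time, which this sketch does not attempt.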
A slightly more sophisticated solution is the Montgomery ladder. The Montgomery ladder algorithm [14,15] is essentially similar to the LtR scalar multiplication algorithm, as shown in Algorithm 3.
Algorithm 3: RtL algorithm using the Montgomery ladder technique.
From Algorithm 3, we see that both branches of the IF condition (Line 4 and Line 6) result in the same amount of power consumption and delay. This effectively thwarts any attempt to deduce the secret key bit values.
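The balanced-branch idea can be sketched with the textbook left-to-right form of the Montgomery ladder. This is an illustrative sketch under stated assumptions and is not claimed to reproduce Algorithm 3 line by line: the function name is hypothetical, and integers modulo a small number stand in for curve points.

```python
# Stand-in group (assumption): integers modulo N replace curve points.
N = 1009
add = lambda x, y: (x + y) % N
dbl = lambda x: (2 * x) % N

def montgomery_ladder(k, P):
    """Textbook Montgomery ladder: one addition and one doubling per bit."""
    R0, R1 = 0, P                  # invariant: R1 - R0 == P
    log = []
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            R0 = add(R0, R1)
            R1 = dbl(R1)
        else:
            R1 = add(R0, R1)
            R0 = dbl(R0)
        log.append("AD")           # both branches perform the same operations
    return R0, "".join(log)

result_a, log_a = montgomery_ladder(0b1011, 5)
result_b, log_b = montgomery_ladder(0b1000, 5)
```

Both keys have the same bit length and therefore yield the same operation log, even though their bit patterns differ.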
Longa [16] proposed composite or atomic operations based on point doubling and point addition to remove the SCA vulnerability and to speed up the operations. He used this approach to accelerate both scalar multiplication and the precomputation step in window-based approaches. The atomic operations performed point tripling ($3P$) and point quadrupling ($4P$). In addition, further atomic operations were proposed, such as unified doubling-addition ($2P+Q$) and unified tripling-addition ($3P+Q$).
Sigourou et al. [8] implemented the scalar multiplication operation using Longa’s atomic pattern for point doubling and point addition discussed in the previous paragraph. The atomic blocks used in this work adopted the MNAMNAA pattern, where M denotes field multiplication, N denotes negation, and A denotes addition. With this technique, point doubling and point addition use the same registers and are therefore indistinguishable, which defeats SCA.
Wei et al. [17] considered cluster-based side-channel attacks that use clustering algorithms to analyze power traces, together with principal component analysis to reduce the dimension of the data. They proposed an intelligent framework that combines unsupervised clustering techniques with supervised deep learning. This approach is powerful for mining data for in-depth information and can be used to assess the vulnerability of proposed secure scalar multiplication algorithms.
Klavier and Joye [18] divided the scalar into two parts, one holding the least significant bits and the other holding the most significant bits, so that scalar multiplication is expressed as the sum of two terms. The number of bits in each part is chosen at random. The authors then evaluated one term left-to-right and the other right-to-left. Because the LtR and RtL evaluations of the two parts of the secret key still rely on point doubling and conditional point addition, an SCA attacker is able to detect the presence and the location of the ones in the secret key. The attacker, however, is not able to determine whether a given location is counted from the right of the MSB or from the left of the LSB. In addition, the attacker is not able to determine when processing moves from one part to the other.
Itoh et al. [19] proposed three windowing techniques to defeat differential power analysis (DPA) attacks:
Overlapping window method (O-WM);
Randomized table window method (RT-WM);
Hybrid randomized window method (HR-WM).
Each technique has unique characteristics such as speed and security to suit the target environment. The basic idea is to randomly distribute the secret key bits among the overlapping windows.
Kolagatla and Desalphine [20] considered modular exponentiation for the RSA algorithm, in which the ciphertext is computed by raising the message to the power of the secret or public key, modulo p. Here, p is the prime of the field, and typically p is a quasi-Mersenne prime [3]. We should bear in mind that modular exponentiation in RSA and scalar multiplication in ECC are closely analogous: squaring and multiplication in RSA are replaced with doubling and addition in ECC. Modular exponentiation involves two modular operations: multiplication and squaring. The authors studied the vulnerability of the RSA implementation and proposed countermeasures against SCA to enhance its security. Their approach is to use a random-radix Montgomery ladder algorithm to suppress SCA, and eight radices were considered. At each Montgomery ladder iteration, a radix is chosen at random, and the secret key bits are processed in parallel because they are grouped into digits in a high-radix format.
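The analogy between the two operations can be seen in a minimal sketch of left-to-right square-and-multiply, where squaring takes the place of doubling and multiplication takes the place of addition. The function name and the toy parameters are assumptions for illustration; this is the plain binary method, not the random-radix ladder of [20].

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary modular exponentiation."""
    result = 1
    for i in reversed(range(exponent.bit_length())):
        result = (result * result) % modulus    # squaring   <-> point doubling
        if (exponent >> i) & 1:
            result = (result * base) % modulus  # multiplying <-> point addition
    return result

# Toy values chosen only for illustration.
toy_result = square_and_multiply(7, 0b1011, 1009)
```

As in double-and-add, the multiplication is executed only for 1-bits of the exponent, which is the source of the same side-channel leakage.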
Ding et al. [21] investigated testing methods for mitigation against SCA. Specifically, the authors studied cluster-based SCA and introduced their adjacent distance coefficient to quantify the accuracy of recovering secret key bits. It is crucial for ECC systems to be resistant to the different forms of SCA techniques. The authors claim that their metric outperformed traditional metrics such as silhouette coefficient, membership degree, and information entropy.
Yang et al. [9] used a variable radix system to prevent SCA. Their method transforms the secret key into a variable-radix format. This approach is claimed to de-correlate the secret key bits and the power trace waveforms. The proposed implementation proved immune to simple power analysis, differential power analysis, and timing analysis.
Kido et al. [
22] considered implementing the ECC curve GLS245 represented in
, where
. They targeted memory- and power-limited IoT devices. The authors used several coordinates like
Modified -Jacobian coordinates with X, Z multiplied by ;
-affine coordinates;
Proposed -Jacobian coordinates consisting of three coordinates ().
It should be noted that the authors selected the best coordinates for each calculation.
7. Implementations and Performance Measurement
We start this section with a complexity comparison of the three algorithms discussed in this paper: Standard RtL and the two proposed algorithms, RP and TRP.
7.2. Hardware SCA Monitoring Using Intel’s Tools
All implementations were developed in Python 3.9.13 and executed on the same hardware environment to ensure a fair comparison. The elliptic curve operations, including point addition, doubling, and scalar multiplication, were implemented in Python, providing a uniform baseline for evaluating both performance and side-channel resistance. The Intel® Performance Counter Monitor (PCM) was used on Windows to collect precise hardware metrics [25]. The monitored metrics included Instructions Retired (INST), Active CPU Cycles (ACYC), L3 Cache Misses (L3MISS), Core and Package Energy (Joules), Core Temperature (°C), and Core Frequency (MHz). These metrics are not only indicators of computational performance but also potential leakage sources, as variations in instructions retired, cache behaviour, or energy draw can reveal patterns correlated with secret scalar bits. To evaluate both instantaneous leakage and statistical resistance, each design was executed once to capture single-run traces and then repeated 100 times to analyze averaged and variance-based results. This dual perspective ensures that both single-execution leakage and long-term statistical leakage are addressed.
Python libraries such as PrimeFieldEllipticCurve [26] provide optimized and well-tested routines for point operations over prime fields; however, they are not used in our implementations. The main reason is that our study compares the performance of several custom ECC algorithms (the baseline, Design #1, and Design #2), which involve custom transformations, randomized bit manipulations, and windowed doubling techniques for side-channel resistance. Using a library would abstract away these low-level operations, preventing us from accurately implementing, controlling, and analyzing the effects of these design strategies. Therefore, a manual implementation of elliptic curve operations is preferred, despite the additional complexity and potential performance overhead.
The elliptic curve implemented in all three algorithms is secp256k1, defined over the quasi-Mersenne prime $p = 2^{256} - 2^{32} - 977$, which allows efficient modular arithmetic. Listing 1 shows the parameters defining the elliptic curve: a, b, prime p, generator point G, and scalar k.
Listing 1. Curve secp256k1 parameters used in all experiments.
# Curve parameters
p = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFC2F
a = 0
b = 7
# Generator point
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (Gx, Gy)
# Secret scalar key
k = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF
These values allow direct and fair performance comparison of all algorithms and provide a consistent basis for side-channel analysis.
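As a quick sanity check, the parameters of Listing 1 can be validated with a few lines of Python. The helper name `is_on_curve` and the derived variable names are hypothetical; the sketch only checks the listed values against the curve equation $y^2 = x^3 + ax + b \pmod p$ and reports basic properties of the test scalar.

```python
# Values copied from Listing 1 (underscores only group the hex digits).
p = 0xFFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFF_FFFFFFFE_FFFFFC2F
a = 0
b = 7
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
k = 0x1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF

def is_on_curve(x, y):
    """Check y^2 = x^3 + a*x + b (mod p)."""
    return (y * y - (x * x * x + a * x + b)) % p == 0

prime_has_quasi_mersenne_form = (p == 2**256 - 2**32 - 977)
generator_on_curve = is_on_curve(Gx, Gy)
key_bit_length = k.bit_length()        # position of the most significant 1-bit
key_hamming_weight = bin(k).count("1") # number of 1-bits, i.e., conditional additions
```

The Hamming weight of the scalar determines how many point additions a plain double-and-add loop performs, so it is a useful quantity to record alongside the timing data.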
To select an appropriate sampling interval for PCM measurements, we considered both the execution time of the scalar multiplication algorithm and the number of key bits processed.
The effective interval observed in the CSV is about 0.25 ms, which is longer than the requested period. Two factors explain this. First, the update cadence of Intel’s Running Average Power Limit (RAPL) interface: energy and performance counters are exposed via Model-Specific Registers (MSRs) that the hardware driver updates at a fixed cadence, so user-space sampling cannot reliably exceed that refresh rate even if a finer period is requested. Second, operating-system scheduling overhead: PCM is a user-space process awakened by kernel timers, and timer granularity, context switches, and competing threads stretch the requested period. This does not affect our comparisons, as all designs were recorded under identical PCM settings and we analyze interval-normalized metrics (per-sample power and energy per operation). Consequently, the effective interval changes only the sample density, not the operation results.
The algorithm operates on a 256-bit key, requiring 256 point doublings and 128 point additions, with a total execution time of approximately 35.3 ms. To capture the effect of each individual key bit (0 or 1) in the performance metrics, we targeted at least one PCM sample per bit. Since the cost of processing a “0” bit (doubling only) differs from that of a “1” bit (doubling plus addition), the execution time is not uniformly distributed across all 256 key bits. For the analysis, however, we normalize by the key length and divide the total execution time by the number of key bits, which means the sampling interval should not exceed approximately 138 μs (0.138 ms):
$$\frac{35.3\ \text{ms}}{256\ \text{bits}} \approx 0.138\ \text{ms per bit}$$
In practice, we selected a sampling interval of 50 μs, which balances resolution with manageable data size and provides sufficient detail to visualize operation patterns corresponding to the binary key.
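The arithmetic behind this choice is simple enough to reproduce directly. The sketch below uses only the figures quoted in the text; the variable names are illustrative.

```python
total_time_ms = 35.3           # total execution time quoted in the text
key_bits = 256                 # key length
requested_interval_us = 50.0   # sampling interval selected in practice

# Average time spent per key bit, in microseconds.
time_per_bit_us = total_time_ms * 1000.0 / key_bits

# Average number of samples per key bit at the requested interval.
samples_per_bit = time_per_bit_us / requested_interval_us

print(f"time per bit   : {time_per_bit_us:.1f} us")
print(f"samples per bit: {samples_per_bit:.2f}")
```

At the requested 50 μs interval, this gives between two and three samples per key bit on average.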
9. Experimental Evaluation
The scalar multiplication operation was expressed in terms of the two elliptic curve arithmetic operations: point addition and point doubling. These two elliptic curve arithmetic operations were in turn expressed in terms of basic field operations: modular multiplication, modular addition/subtraction, and finding the multiplicative inverse using the Extended Euclidean Algorithm (EEA).
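This layering can be sketched in a few lines of Python. The block is a minimal illustration in affine coordinates for a curve with a = 0 (as in secp256k1); the function names are hypothetical, the point at infinity and the cases P = ±Q are deliberately not handled, and it is not the code measured in this paper. Note that `mod_inv` reports its iteration count and that `mod_add` and `mod_sub` contain a conditional correction step, both of which make the running time data-dependent.

```python
P_MOD = 2**256 - 2**32 - 977   # secp256k1 field prime

def mod_add(x, y):
    s = x + y
    return s - P_MOD if s >= P_MOD else s    # conditional reduction

def mod_sub(x, y):
    d = x - y
    return d + P_MOD if d < 0 else d         # conditional correction

def mod_mul(x, y):
    return (x * y) % P_MOD

def mod_inv(x):
    """Extended Euclidean Algorithm; returns (inverse, iteration count)."""
    r0, r1, t0, t1, iterations = P_MOD, x % P_MOD, 0, 1, 0
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
        iterations += 1
    if r0 != 1:
        raise ValueError("element has no inverse")
    return t0 % P_MOD, iterations

def point_double(P):
    x, y = P
    inv, _ = mod_inv(mod_add(y, y))
    lam = mod_mul(mod_mul(3, mod_mul(x, x)), inv)   # slope 3x^2 / 2y (a = 0)
    x3 = mod_sub(mod_sub(mod_mul(lam, lam), x), x)
    return x3, mod_sub(mod_mul(lam, mod_sub(x, x3)), y)

def point_add(P, Q):
    (x1, y1), (x2, y2) = P, Q
    inv, _ = mod_inv(mod_sub(x2, x1))
    lam = mod_mul(mod_sub(y2, y1), inv)             # slope (y2 - y1) / (x2 - x1)
    x3 = mod_sub(mod_sub(mod_mul(lam, lam), x1), x2)
    return x3, mod_sub(mod_mul(lam, mod_sub(x1, x3)), y1)

G = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
G2 = point_double(G)
G3 = point_add(G2, G)
G4_by_doubling = point_double(G2)
G4_by_addition = point_add(G3, G)
```

Computing 4G in two different ways and comparing the results is a cheap consistency check on the point formulas.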
There are three sources of delay/power randomness inherent in the scalar multiplication operation:
The number of iterations in the EEA algorithm.
The need to check if a reduction operation is needed after add/subtract operations.
Random delays in wall clock, as opposed to CPU cycles, due to thread stalls and the operating system (OS) starting other processes or threads.
To eliminate the third source of random delays, a dedicated crypto accelerator is typically used, where specialized systolic arrays are used to implement modular multiplication and pipelining is used to implement EEA.
To reduce the effect of the first two sources of noise, multiple traces are recorded using the same key and the average trace is monitored. Typically, between 100 and 1000 traces are used. In this work, we averaged 200 traces.
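The effect of averaging can be illustrated with synthetic data. The sketch below is not based on measured traces: the delay model (a base delay, an extra delay for 1-bits, and Gaussian noise), the parameter values, and the function names are all assumptions chosen for illustration.

```python
import random
from statistics import mean

def simulate_trace(key_bits, rng, base=1.0, extra=1.0, sigma=1.0):
    """Synthetic per-iteration delays: 1-bits cost `extra` more, plus noise."""
    return [base + extra * bit + rng.gauss(0.0, sigma) for bit in key_bits]

def average_trace(traces):
    """Sample-wise mean over traces recorded with the same key."""
    return [mean(samples) for samples in zip(*traces)]

rng = random.Random(1)                       # fixed seed for reproducibility
key_bits = [rng.randint(0, 1) for _ in range(64)]
traces = [simulate_trace(key_bits, rng) for _ in range(200)]

averaged = average_trace(traces)
threshold = 1.5                              # midway between the two delay levels
recovered_bits = [1 if delay > threshold else 0 for delay in averaged]
```

Averaging 200 traces reduces the noise standard deviation by a factor of about 14 (the square root of 200), so a simple threshold on the averaged trace separates the two delay levels even though a single trace is dominated by noise.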
Typically, ECC operations are performed on either general-purpose programmable processors or specialized application-specific processors (ASPs). Most ASPs target specialized applications and operations, such as modular multipliers [27,28], cryptographic processors [29], and telecommunication processors [30]. The area of the processor is independent of the algorithm being implemented. The delay, however, depends strongly on the algorithm implementation, so the figure of merit is the delay. The basic unit of delay in a processor is the adder delay, since this determines the clock period and hence the clock speed. For a processor with a word size of W bits, the delay of the W-bit adder is taken as the unit of delay in our complexity analysis below.
The integer modular addition requires checking the sum result and there is a need to apply a reduction operation if the sum exceeds the modulo. The average delay complexity
is estimated as
where
L is the number of words, assuming the processor word size is
W and the factor 1.5 accounts for the random nature of the addition result. Here, we have
L given by
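To make the word-count reasoning concrete, the following minimal sketch assumes that an m-bit operand occupies $L = \lceil m/W \rceil$ words and that an average modular addition costs $1.5L$ adder delays (one L-word addition plus an L-word reduction that is needed about half of the time). Both expressions and the function names are illustrative assumptions.

```python
def num_words(m_bits, word_size):
    """Assumed word count: L = ceil(m / W), using integer arithmetic."""
    return (m_bits + word_size - 1) // word_size

def avg_mod_add_delay(m_bits, word_size):
    """Assumed average modular-addition delay, in units of one adder delay."""
    return 1.5 * num_words(m_bits, word_size)

words_256 = num_words(256, 64)           # 256-bit operand on a 64-bit processor
delay_256 = avg_mod_add_delay(256, 64)
words_521 = num_words(521, 64)           # operand size not a multiple of W
```

Under these assumptions, a 256-bit operand on a 64-bit processor occupies four words, and operand sizes that are not a multiple of the word size round up to the next whole word.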
The integer modular multiplier delay complexity
is estimated as
where it was assumed that multiplying two m-bit integers is done using one of the systolic multipliers where multiplication and reduction operations are merged [28,31]. This design is much faster than the multiply-then-reduce approach.
The delay complexity of the extended Euclidean algorithm operation
is estimated as
where
is the average estimated number of iterations necessary to find the multiplicative inverse using the extended Euclidean algorithm.
Based on the above basic arithmetic complexity estimates, we can now estimate the complexities of elliptic curve point add and point double operations.
Elliptic curve point add complexity
is estimated as
Elliptic curve point double complexity
is estimated as
Typical elliptic curves defined over a prime field have a prime p whose bit length varies between 192 and 521, which offers security strengths similar to those of RSA with corresponding key sizes between 1536 and 15,360 bits [32].
Figure 4 shows the performance of scalar multiplication with a 512-bit key when executing the RtL or RP scalar multiplication algorithms. Both algorithms produce the same type of trace, except that the RP traces bear no relation to the actual locations of the non-zero bits in the secret key K.
Figure 4a shows the average delay trace for a 512-bit key.
Figure 4b shows the 512-bit key corresponding to the trace in (a).
Figure 4c shows the expanded view of the average delay for the first 64 bits.
Figure 4d shows the expanded view of the key for the first 64 bits.
The figure shows the delay per iteration for the RtL scalar multiplication algorithm listed in Algorithm 1. We see that the increased delay corresponds to the secret key bit of 1.
Using the same data but applying the proposed TRP Algorithm 5, we get the delay trace shown in Figure 5.
The first 512 samples correspond to the average delay trace due to performing the point double operations. The last 254 samples correspond to the average delay trace due to performing the point add operations. It is immediately apparent that distinguishing between the point double trace and the point add trace is very difficult. This is a unique feature of elliptic curves, and there are two reasons for this:
An SCA can detect only that the secret key contains 254 non-zero bits; their actual locations are completely lost.