SCA-Safe Implementation of Modified SaMAL2R Algorithm in FPGA

Cryptographic algorithms (RSA, DSA, and ECC) use modular exponentiation as part of the principal operation. However, Non-profiled Side Channel Attacks such as Simple Power Analysis and Differential Power Analysis compromise cryptographic algorithms that use such operation. In this work, we present a modification of a modular exponentiation algorithm implemented in programmable devices, such as the Field Programmable Gate Array, for which we use Virtex-6 and Artix-7 evaluation boards. It is shown that this proposal is not vulnerable to the attacks mentioned previously. Further, a comparison was made with other related works, which use the same family of FPGAs. These comparisons show that this proposal not only defeats physical attack but also reduces the number of resources. For instance, the present work reduces the Look-Up Tables by 3550 and the number of Flip-Flops was decreased by 62,583 compared with other works. Besides, the number of memory blocks used is zero in the present work, in contrast with others that use a large number of blocks. Finally, the clock cycles (latency) are compared in different programmable devices to perform operations.


Introduction
Nowadays, most of the confidential information and data of people, governments, and companies, are exchanged via electronic media. Electronic media have the advantage of increasing the speed of information exchange. Moreover, they supply an elevated level of security, confidentiality, and data integrity.
Many researchers have developed various methods for encryption and decryption of secret information, including methods with artificial intelligence, as in Ref. [1]. However, attackers have engaged in these techniques.
Encryption and decryption of the information use a secret key. Cryptographic devices such as Smart Cards, Field Programmable Gate Arrays (FPGA), Application-Specific Integrated Circuits (ASICs), microcontrollers, and others operate storing this secret key.
A device that is very interesting to use for various applications is the FPGA. These devices can be used in various fields, such as artificial intelligence [2,3], power electronics [4], control [5], optimization [6], among others. Due to the versatility of these devices, we use them, in the present work, within the area of cryptography.
Kocher, in the 1990s, realized that anyone could get the secret key through physical measurements in cryptographic devices [7,8].
These attacks do not require highly specialized equipment since the attack can be carried out using accessible equipment, such as an oscilloscope and resistors.

Side-Channel Attacks
Since the first physical assault described by Kocher, researchers have developed many attacks against programmable devices. These attacks are known as Side-Channel Attacks (SCA). In the literature, there are two main categories for SCAs: the first category is related to profile attacks, and the second is related to non-profile attacks [11,12].

Profiled SCAs
Profile SCA corresponds to those made with machine learning. For this type of attack, a device with encryption algorithms is required, which allows for characterizing the vulnerabilities presented by it and thus obtaining a model.

Non-Profiled SCAs
Non-profiled SCAs require only the encrypted message. There are two major categories for this type of attack: Active Attacks and Passive Attacks.

Active Attacks
When an adversary performs an attack, it somehow exerts an influence over the behavior of the cryptographic device. Then, the attacker can notice a difference in the signals measured on the cryptographic device. From these waves, an attacker can extract secret information.
However, these attacks are considered invasive attacks. In an invasive attack, the attacker physically modifies the cryptographic device. For example, the attacker can induce a signal in the cryptographic device. This signal gives a change in the memory, inside the computation of the data, or the cryptographic algorithm. Fault Attacks (FA) are examples of these kinds of attacks Due to the nature of these attacks, cryptographic devices are usually irreversibly damaged.

Passive Attacks
On the other hand, passive attacks are not invasive. These kinds of attacks do not interfere with the operation of the device. Hence, the device is not damaged.
Power Analysis Attacks (PAA) [8], Timing Analysis Attacks (TAA) [7], and Electromagnetic Analysis Attacks (EAA) [10] are examples of Passive Attacks In Power Analysis Attacks (PAAs), the attacker takes advantage of the fact that the power consumed is correlated to the algorithm and the data processed by the crypto device.
Besides, the PAA can be classified into two categories: Simple Power Analysis (SPA) and Differential Power Analysis (DPA) [8].
It is possible to get the secret key of a cryptosystem using SPA by making one or a few measurements of the power traces of the power consumption when the device is performing its cryptographic operations.
On the other hand, DPA requires many measurements of the power traces of the power consumption. To get these measures, an attacker or researcher uses different input messages on the cryptographic device. A statistical analysis of the measured power traces is carried out. It should be mentioned that DPA is more powerful and more difficult to prevent than SPA.

Countermeasures to Passive Attacks
Many cryptographic algorithms, such as RSA [13], use modular exponentiation as the core. Modular exponentiation uses the binary representation of the exponent. SCAs use the measured traces to find the binary form of the exponent.
Before Kocher published his work, the Square-and-multiply Algorithm was the most used for modular exponentiation. This approach has two forms Left-to-right (SaML2R) and Right-to-left (SaMR2L). Algorithm 1 shows the Square-and-Multiply, Left to Right algorithm.

Algorithm 1 Square-and-Multiply, Left to Right
Require: m, d = (d n−1 , · · · , d 0 ) 2 , modulus N Ensure: S = m d mod N 1: S ← 1 2: for i from (n − 1) down to 0 do 3: if (d i = 1) then 5: S ← S · m 6: end if 7: end for 8: return S However, this technique is vulnerable to SPA and Timing Analysis Attacks. The SaM algorithm calculates modular multiplication and square if the exponent bit is 1, as shown in steps 4 and 5 in Algorithm 1; or it processes just one square when the exponent bit is 0.
Coron supplied the first technique to defeat the SPA and Timing Analysis Attack [14]. This regular technique is called Square-and-multiply Always (SaMA). Specifically, Algorithm 2 shows the Left to Right version of the SaMA algorithm. This algorithm, in contrast to the Square and Multiply Algorithm, calculates a square followed by a multiplication, independently of the value of the corresponding bit of the exponent. Namely, this algorithm calculates a square followed by a multiplication when the bit of the exponent is 0 or 1.

Algorithm 2 Square-and-Multiply Always, Left to Right
Require: m, d = (d n−1 , · · · , d 0 ) 2 , modulus N Ensure: S = m d mod N 1: S ← 1 2: for i from (n − 1) down to 0 do 3: S ← S[d i ] mod N 6: end for 7: return S Table 1 shows a brief example of the SaMAL2R algorithm execution. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and m as the input message.  On the other side, Mayima et al. proposed an algorithm to defeat DPA [15], called Binary Expansion with the Random Initial Point (BRIP) algorithm. Algorithm 3. This algorithm protects the input message with a random value, and it does not use dummy operations during cryptographic operations. Table 2 shows a brief example of the execution of the BRIP algorithm. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and m as the input message.
Nevertheless, this algorithm requires the calculation of random numbers for its execution, which stands for excessive consumption during implementation in programmable devices such as FPGA.
The previously mentioned algorithms were attacked by a special type of SCA.

Retrieve result
There are many kinds of passive attacks. An engrossing attack within passive attacks is the Chosen-message Attack. The Chosen-message Attack uses a specific input message. This input message produces a particular behavior during the performance of an algorithm. One of these kinds of attacks was developed by Yet et al. [16] and implemented by Miyamoto et al. [17]. This attack is called N − 1 Attack. This technique is used against the modular exponentiation of the SaMA and BRIP algorithms.
This N − 1 Attacks works, considering the two following observations in the RSA context with modulus N:

•
The first observation is that (N − 1) 2 ≡ 1 mod N, this is extended to the fact that (N − 1) j ≡ 1 mod N for any even integer j; • The second observation is similar to the earlier one when (N − 1) k ≡ (N − 1) mod N for any odd integer k.
These two observations in modular exponentiation will give only two values in the intermediate operations: a 1 value when the bit processed of the exponent is 0 and N − 1 when the bit is 1. Hence, an attacker will see only two patterns in the power traces of the measurements. Thus, the secret key, being the exponent, can be obtained.
An attacker or researcher may use the N − 1 Attack against the BRIP algorithm. In general, with this technique, anyone can attack algorithms that use a random value to protect the input message.

Countermeasures to N − 1 Attack
In the literature, there are proposals to defeat N − 1 Attacks. Some of these proposals are mentioned below.
The first proposal is to blind the message. The BRIP algorithm shows this countermeasure. The input message, m, is blind using a random value, r. A second proposal is to block the special input message. In this countermeasure, the input message is compared to the value N − 1 and is closed up when both are equal. However, before the proposals are vulnerable to the Chosen-message Attack, this technique follows the same idea as the N − 1 Attack.
Finally, D. Tinoco [18] made two proposals to countermeasure the N − 1 Attack. These countermeasures do not require blinding the message, and they do not have to block the input message when it is N − 1.
To calculate modular exponentiation, Tinoco follows the property shown in (1).
Tinoco used (1) to apply it in modular exponentiation algorithms. Finally, he presented Equation (2). This equation is the representation that works with intermediate-even exponents.
The proposals by D. Tinoco follow the idea of SaMAL2R and BRIP algorithms, and his Equation (2). Algorithm 4 shows the Modified square-and-multiply always, left to right, while Algorithm 5 shows the Modified BRIP Algorithm.

Algorithm 4 Modified Square-and-Multiply Always, Left to Right
return S 10: else 11: return S · m mod N 12: end if Since the algorithms run from n − 1 to 1, the last square and the last multiplication by m 2d 0 are eliminated. To compute the correct value, an if statement replaces the multiplication by m d 0 .
When the exponentiation is executed by MSaMAL2R or BRIP algorithms, the two patterns are not shown. Conversely, only one regular and masked shape in the power traces is observable.
Although these algorithms are implemented in software by the author, they are not implemented in programmable devices such as FPGAs.

The Proposed Algorithm
Even when it is possible to implement Algorithm 5 in programmable devices such as FPGAs, it is necessary to generate a random value, r, and calculate its inverse, r −1 . At this point, it is possible to create random values in an FPGA. However, a reduction of the resources used by the FPGA is necessary. Moreover, the algorithm requires three registers to store intermediate values used in the loop.
On the other hand, Algorithm 4 does not require generating one random value. This advantage helps saving resources compared with the MBRIP algorithm. As stated in the MBRIP algorithm, three registers are needed to store intermediate values in the loop, going against the compromise to reduce the device's resources.
It should be remembered that the main operation of modular exponentiation algorithms is modular multiplication. Modular multiplication was developed and synthesized in the FPGA.

Montgomery Multiplication
There are several proposals reported in the literature on modular multiplications. P. Montgomery developed an interesting modular multiplication [19]. This proposal, known as Montgomery multiplication, has been implemented in FPGAs [20][21][22]. Besides, there is a technique to improve Montgomery multiplication. This technique avoids the final subtraction of the original algorithm. Algorithm 6 shows the description steps suggested in this work for Montgomery multiplication.

4:
A ← (A + x i · y + u i · N)/b 5: end for 6: return A An engrossing proposal for implementing Montgomery multiplication is called Systolic Architecture [9,23]. Systolic Architecture uses regular blocks called Processing Elements (PEs), which is an advantage for FPGA implementation.
In this approach, the operands are split into words with size radix b because base b is in the power of 2 with k bits.
Taking this into account, Algorithm 7 shows the version of the Montgomery multiplication for implementation in FPGA. An added Montgomery multiplication must be done to retrieve the result from the Montgomery domain. It uses the A value and one value of 1 as the inputs. However, it is inexpensive for modular exponentiation. It is done only once during the exponentiation execution.

Proposed Modular Exponentiation
As previously mentioned, Algorithm 4 has some advantages. First, we consider the requirements for the Montgomery multiplication. Using this way, we propose to change Algorithm 4. That proposal is suitable for implementation on FPGAs. Algorithm 8 shows this new proposal. In the context of RSA, the exponent d is always an odd number. That is, d 0 will always be 1; considering this, step 9 in Algorithm 8 has replaced the last condition in Algorithm 4. Table 3 shows a brief example of the execution of the proposed algorithm. In this example, the parameters are d = 97 10 = 1100001 2 as the exponent and (N − 1) as the input message.  The proposed algorithm requires one extra register compared with the MSaMAL2R algorithm; this is a disadvantage for programmable devices with reduced resources. However, the extra register is not a drawback, due to the number of resources in the FPGA.

Implementation of the Proposed Algorithm in FPGA
The implementation of modular multiplication is essential because, as previously mentioned, it is the core of modular exponentiation. Thus, the implementation of modular multiplication defines the number of resources used by the FPGA. Thus, an optimal design of the implementation can be achieved.

Implementation of Proposed Modular Multiplication for FPGA
First, we implement the Montgomery multiplication into FPGA. We used the Systolic Architecture with a radix of 16 bits; the operands were chosen accordingly to avoid the final subtraction, as previously mentioned.
With the radix of 16-bits and 1024-bit RSA, we split the operands into 64 words; thus, there are 64 PEs. This operation takes only one clock cycle.
All these PEs are regular, which means that only one PE was developed, and this PE was replicated 64 times. Figure 1 shows the block diagram of the Montgomery multiplication architecture implemented in FPGA.
x -shift register The value x is left-shifted 16 bits, and this value is sent to all PEs and the block called u i . The task of this last block is to calculate the value of step 3 in Algorithm 7. As previously mentioned, the least significant bits are the only ones considered. Figure 2 shows the block diagram of the individual PE implemented. In the first PE, the input-carry value was set to zero. In the last PE, the output carry value was used as the final A w−1 . We use Digital Signal Processors (DSPs) embedded in the FPGA to reduce resources for the implementation of modular multiplication. These DSPs have internally one pre-adder, one 18 × 25 bits multiplier, and an accumulator. All PEs were set in a pipeline fashion. The implementation of the DSPs, which we use for the multiplier and the pre-adder, is carried out as recommended by the manufacturer Xilinx in its documentation.
The pre-adder receives operands from the multiplexers 2 and 3, as shown in Figure 2. The inputs of multiplexer 2 are the values y j and u i . The inputs of the multiplexer 3 are the values A j and the carry of the earlier PE, carry j−1 . The inputs of the multiplier are the values from the multiplexer 1 and the pre-adder. The accumulator of the DPS adds the earlier result with the value given for the multiplier. The purpose of this accumulator is to perform the additional operation given in step 5 of Algorithm 7.
A Finite State Machine (FSM) controls the operations of the Systolic Architecture. This FSM has five states to manage all PEs and to set the corresponding values into each PE.
Although five states are used and one state consumes one clock cycle, the first one is for idling, and the last state is for resetting. Thus, three clock cycles are used by the algorithm for performing all the operations required to perform the modular multiplication. So, considering that all the 64 PEs and each PE used three clock cycles for processing the data, plus one clock cycle to set up the beginning of the process and one to return the final result, the total clock cycles running for the modular multiplication is 194.
In the present work, we used a clock of 100 MHz for different platforms. Each PE uses three clock cycles, so the latency of each PE is 30 ns. Considering that we have 64 PEs, we have 1920 ns to operate. However, there are two extra clock cycles, one for idling and one for resetting, then the time for executing one modular multiplication is 1940 ns.

Implementation in FPGA of the Proposed Modular Exponentiation
For the implementation of the modular exponentiation proposed, an FSM, which follows Algorithm 8, was used. Figure 3 shows the block diagram of the proposed method. All registers were set according to 1024-bit RSA and eight registers of 1024-bits were used for this implementation distributed as follows: one register to store the input message m; one register to store the value R mod N; one register to store R 2 mod N; two more registers to store the values of M; two registers to store the intermediate values R; and one register to store the output S.
External RAM blocks were not used to implement the proposed modular exponentiation. Instead of using an external block of RAM, registers were used to store the values.
The FSM controls the steps within the algorithm. This FSM passes the data into all registers of the block Montgomery multiplication. Passing the data into all registers is done considering the d value exponent, as was shown in Algorithm 8. Figure 3 shows the block Montgomery multiplication. Then, the output is stored in the register labeled S in the block diagram. In this proposal, the FSM saves the output value in a register instead of in an external RAM block.

Results
We used two different families of FPGAs. For the first one, we use a development board Nexys 4 DDR of Digilent ® . The FPGA used is the Artix XC7A100T-CSG324 of Xilinx ® , and it was implemented and synthesized using Vivado 2022.1. For the last one, we used the FPGA Virtex-6 XC6VLX75T-2FF484, which was only simulated using the post place-and-route results. For the synthesized one, we used ISE 14.7 Web Pack Edition. In both cases, we used a clock of 100 MHz.
One of the purposes of carrying out the implementation of our proposal using two different families of FPGAs is not only to compare it with other architectures, but also to show that the results obtained are similar in consumption and speed, as well. Table 4 shows all the resources used for the implementation of the modular exponentiation, with N = 1024 bits, proposed in this work using the two families of FPGAs mentioned previously.
In the case of Artix-7, only 9803 Look-Up Tables (LUT) and 10,568 Flip-Flops (FF) were used, which is only 15.46% and 8.33% with respect to the total resources available. However, the most important parameter is the number of DSPs in both platforms. Both implementations used 66 DSPs.
These resources also include those used by the implementation of the Montgomery multiplication. These results show that only part of the FPGA resources is used for modular exponentiation.  Table 5 shows the resources used by modular exponentiation and Montgomery multiplication in the case of Artix-7.  Table 5 shows that the implementation of the modular multiplication used 9714 LUT, hence, only 89 extra LUT were needed for the implementation of the modular exponentiation. Similarly, the modular multiplication used 5338 FF, so, 5231 extra FF were needed for the implementation of the modular exponentiation.
As was previously mentioned, the Montgomery multiplication takes 194-clock cycles. Modular multiplication is performed by the proposed modular exponentiation. Algorithm 8 takes 2052 modular multiplications. The total multiplication takes 398,088 clock cycles. However, step 7 of Algorithm 8 requires 1 clock cycle to store the data into S. Besides, the entire loop of encryption or decryption of data takes a total of 399,113 clock cycles.
The value of 399,113 clock cycles is for the modular exponentiation. Nevertheless, the number of clock cycles needed to store and read the data from external communication was not considered. Table 6 shows a comparison of the State-of-the-Art Modular Exponentiation Architectures. The exponent has a length of 1024 bits in all implementations.  Table 6 compares the different architectures used to perform modular exponentiation, which includes different platforms. All of them have their advantages and disadvantages. The results show that our work has a reduction in the number of LUT compared with the rest of the works.
Our architecture, using Virtex-6, has a reduction of 3550 LUTs compared to Ref. [24] and 115,528 compared to Ref. [21]. In the case of Artix-7, there is a reduction of 4.31% of LUTs compared to Ref. [9].
The present work requires a smaller number of FFs. In the case of Virtex-6, there is a reduction of 62,583 of FFs compared to Ref. [21]. In the case of Artix-7, it requires a small increase of FFs, only of 1.38% compared to Ref. [9].
An advantage of the present work is that it does not require the use of memory blocks (BRAM), in contrast with the rest of the works, since they require a large number of memory blocks. For Virtex-6, the work presented in Ref. [24] requires 32 blocks of 32 kb and 2 blocks of 18 kb; for the case of the work presented in Ref. [21] it requires 132 BRAMs from the total that the FPGA has. For Artix-7, our architecture does not require the use of BRAM, while the work presented by Somnath requires 7.81% of BRAM.
Our proposal, in the case of the Virtex-6 platform, requires a slightly larger number of slices, using only 779, compared with those presented in Ref. [24], however, it has a reduction of 32,885 slices compared with the architectures presented in Ref. [21].
As previously mentioned, the number of DSPs is an important parameter to take into account. Our proposed implementation used only 66 DSP.
The latency or the number of clock cycles that our architecture takes to perform modular exponentiation is greater than those shown in other works [21,24]. However, our proposal shows a reduction in the number of resources.
Finally, earlier results show a great reduction in resources used for both families of FPGA. The foregoing, together with the proposal made in Algorithm 8, produces an increase in security.

Discussion
In the present work, we developed an implementation of the MSaMAL2R for FPGA, which is effective against non-profiled side channel attacks.
This implementation requires less consumption of resources used by the FPGA for its operation, as mentioned in the previous section. However, it requires more clock cycles to perform the modular exponentiation. To speed-up the performance, it is necessary to improve the architecture of the Montgomery modular multiplication presented in this work.
Our proposal not only requires a smaller number of DSPs but also a smaller number of slices. This reduction in resources is helpful due to the lower energy consumption during its operation, which makes it less susceptible to non-profiled SCA.
Although it has shown a reduction in resources for the encryption or decryption of information, in addition to being less vulnerable to side-channel attacks, it is necessary to study other types of attacks on this implementation. Some of these attacks may be profiled SCA.
Because we use VHDL for the hardware description, it is possible to scale or use the same code in other types of FPGAs from the same manufacturer, however, it can also be used in FPGAs from other manufacturers with the same characteristics as the FPGA used in the present work. To support the above statement, we implemented the same proposal in two different FPGAs families using the same VHDL code. The first one used a high-cost FPGA such as the Virtex and the second one implemented a low-cost FPGA such as the Artix.

Conclusions
This work presents a proposal to defeat the N − 1 Attack without cost penalties. One extra register is required, and no generations of random values are needed. Thus, it allows reducing the complexity of the application and reduces the resources in the implementation.
The resources shown in the results section demonstrated that the proposed implementation is a worthy possibility to be implemented in hardware using programmable devices such as FPGA. The proposal made in this paper shows a reduction in resource consumption; specifically, a consumption of LUTs by 3550 LUTs and 62,583 of FFs.
As was mentioned previously, we implemented our proposal using two different families of FPGAs. These two families are Virtex-6 and Artix-7. The former is one of the costliest in price and the last is one of the cheaper options. Both are from the Xilinx family. The implementation of our proposal in both types of FPGA presents similar results in terms of the number of resources used. This shows that the proposal can be implemented in low-cost and low-resource devices, as was the case with Artix-7. This allows it to be used in portable devices such as those used in IoT. It even allows other approaches such as Neural Networks to be implemented in the case of using high-resource FPGAs such as Virtex.
The running time of the modular exponentiation mostly depends on the modular multiplication. Even though there are implementations of modular multiplication with a faster running time than this proposal, nevertheless, the resources needed are higher than in our work. Table 6 shows different techniques to implement modular exponentiation. All of them are designed to defeat SCAs. However, our proposal has the advantage that fewer resources are needed. This is an advantage over the rest, thus increasing the security due to requiring less power consumption.