1. Introduction
Smart technologies such as the Internet of Things (IoT), mobile computing, big data, blockchain, robotic systems, digital forensics, industrial control systems, and connected and automated vehicles (CAVs), together with the vital integration of cybersecurity [1], have attracted considerable interest in recent years. In the field of IoT, the demand for resource-limited devices such as sensor nodes, Radio Frequency Identification (RFID) tags, and actuators is increasing day by day [
2]. These devices have extremely limited resources. For example, a standard RFID tag has roughly 10,000 gate equivalents (GE), of which only about 2000 GE are available for security. CAVs, in turn, must transfer data seamlessly and in real time while protecting the security of the data [
3]. Due to the limitations of resources and real-time responsiveness, traditional cryptography cannot meet the demand, while lightweight cryptography can solve this problem well. Lightweight cryptography has the characteristics of low power consumption, small area and low latency. Lightweight ciphers include a class of low-latency ciphers such as PRINCE [
4], MANTIS [
5] and QARMA [
6]. Owing to their low latency, these ciphers can be implemented in an unrolled architecture, which completes all operations within a single clock cycle and therefore gives better real-time responsiveness. According to [6], the delay-area (or delay-power) product of AES [7] is approximately 40 times larger than that of PRINCE. Under an area-optimized implementation, AES occupies about 817% of the area of PRINCE and consumes 766% of its power. At a frequency of 100 kHz, the throughput of PRINCE is 533.3 kbps, while that of AES is only 56.64 kbps [2]. Traditional cryptography therefore can no longer meet the requirements of low latency, small area, and low power consumption, which makes the study of lightweight cryptography necessary.
In practical applications, cryptography must also include countermeasures against Side-Channel Attacks (SCA). Researchers have successfully mounted SCA on hardware implementations of traditional cryptography [8,9,10,11,12,13]. These attacks target the instants at which registers are updated. Traditional cryptography is generally implemented in a loop architecture because of its long path delays, which forces circuit designers to use many registers to store intermediate data, and register updates are closely tied to the clock. Registers draw a large amount of dynamic power when they are updated, which is easily captured and exploited by attackers. The unrolled architecture differs in two ways. First, since it lacks clocks and registers, it is difficult to pinpoint an exact attack location on the power consumption curve. Second, it has a long critical path with many glitches along it, which gives the collected power traces a low signal-to-noise ratio (SNR). In [
14], unrolled DES [15] is shown to have some resistance to Differential Power Analysis (DPA) [16] and Correlation Power Analysis (CPA), provided that the datapath is cleared after each encryption. In [17], unrolled MAC-PHOTON [18] is shown to resist a first-order CPA attack.
Compared with SCA on the loop architecture, SCA on the unrolled architecture has received less attention. However, studies have shown that it also has leakage points. In [19,20], the authors improved the efficiency of the CPA attack by using a t-test [21] to locate the Points-of-Interest (POI) in the power consumption curve. In [22], the authors implemented DPA on the first round of unrolled PRINCE. In [23], the authors proposed an improved correlation frequency analysis (CFA) [24] attack, which makes it feasible to extract first-order side-channel leakage from combinational logic in the initial rounds of unrolled datapaths. In [25], the authors provided a method for selecting plaintexts to recover the key through side-channel analysis. However, this attack succeeds only for a limited number of rounds, because the algorithm masks the input difference, which can no longer be observed after a certain number of rounds. In [26], the authors proposed a leakage model based on differential inputs. All of these techniques aim to increase the SNR of the power consumption or the effectiveness and precision of the analysis phase. Nevertheless, none of them explains why deeper rounds cannot be attacked.
In [27], researchers also investigated the difficulty of implementing countermeasures against SCA on the unrolled architecture. Unlike the traditional loop architecture, the unrolled architecture has no registers. Therefore, it cannot use the traditional Threshold Implementation (TI) of [28,29,30]. Moreover, the computing logic of each round is independent, which means that each round must be protected separately. In [27], the experiments showed that when TI is implemented on the first and last rounds of PRINCE, the critical-path delay increases to 147%, the area increases to 441%, the throughput decreases to 68%, and the power consumption increases to 255%. In [31], the authors implemented DPA attacks on the first four rounds of GIFT [32] and evaluated the cost of TI. According to their experimental findings, TI for an unrolled GIFT increases the area to 3157% of the original value and lowers the frequency to 62.8% of the original value. The cost of protection schemes is therefore high for lightweight cryptography, and the number of rounds that need to be protected must be chosen with care.
In [33], unrolled combinational hardware implementations of six lightweight block ciphers were compared. We choose PRINCE [4], which was designed specifically as a low-latency cipher, as our case study. To increase the number of rounds through which a difference propagates, we propose an optimized chosen-input attack that prevents the algorithm from masking the input difference in the first few rounds, so that the difference can still be detected in a deeper round. In the case of PRINCE, our method ensures that the difference is not enlarged and masked by the algorithm in the first round and that it remains easily discernible in the fourth round, whereas the difference of [25] is almost completely masked by the fourth round and cannot be distinguished. We also implemented PRINCE with TI. The experimental findings show the enormous area cost of TI for the unrolled architecture, highlighting the need for a thorough investigation of the maximum number of rounds that can be attacked.
The main contributions of this paper are as follows:
- We propose an optimized method for a chosen-input attack that can effectively increase the number of rounds of differential propagation. 
- We implement CPA on the fourth round of PRINCE implemented in the unrolled architecture. 
- We evaluate the resource costs associated with achieving various degrees of TI for PRINCE. 
The remainder of this paper is organized as follows. Section 2 reviews the research on lightweight cryptography and, in particular, introduces the PRINCE algorithm, a typical low-latency cipher implemented in the unrolled architecture. Section 3 introduces the new chosen-input attack considered in this paper: we introduce the power consumption model used for the attack, describe our leakage model and leakage point in detail, describe the chosen-input attack method together with CPA and DPA, and finally present countermeasures for PRINCE. In Section 4, we demonstrate the feasibility and limitations of the above attacks through five sets of experiments and examine the hardware overhead and protection effectiveness of the countermeasures. Section 5 summarizes our results and discusses directions for future work.
  2. Related Work
In this section, we briefly describe PRINCE, a typical low-latency block cipher proposed by Borghoff et al. at ASIACRYPT 2012 [4]. It has the following characteristics:
- Encryption and decryption can be realized in a single clock cycle. 
- The low latency of the hardware circuit allows it to operate in high clock-frequency environments. 
- Hardware cost is low (much lower than the unrolled version of AES or PRESENT [34]). 
- Encryption and decryption share a set of hardware circuits. 
PRINCE has a very low hardware implementation cost and latency and can be widely used in resource-constrained environments. It is a 64-bit block cipher with a 128-bit key. It has 12 rounds, each of which includes a key addition, an S-box layer, a linear layer, and the addition of a round constant. PRINCE implemented in the unrolled architecture is shown in Figure 1, and the algorithm flow is shown in Table 1.
Let $E$ and $D$ denote the encryption and decryption operations, respectively; their definitions are found in Table 1. The following expressions apply:
We find that $D_{(k_0 \| k_0' \| k_1)}(\cdot) = E_{(k_0' \| k_0 \| k_1 \oplus \alpha)}(\cdot)$ by Equations (1) and (2), and $RC_i \oplus RC_{11-i} = \alpha$, where $\alpha$ is the 64-bit constant $\alpha$ = 0xc0ac29b7c97c50dd. Thus, decryption only requires a very cheap change to the key, after which the exact same circuit can be reused. In this paper, we only analyze the encryption process.
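The key change needed for decryption can be sketched in a few lines. This is a minimal illustration, assuming the whitening-key derivation k0' = (k0 >>> 1) XOR (k0 >> 63) and the reflection property as recalled from the PRINCE specification [4]; the function names are hypothetical and the cipher core itself is not implemented here.

```python
MASK64 = (1 << 64) - 1
ALPHA = 0xC0AC29B7C97C50DD  # 64-bit constant quoted in the text


def k0_prime(k0: int) -> int:
    """Derive k0' = (k0 rotated right by 1) XOR (k0 >> 63), per the PRINCE specification."""
    rotated = ((k0 >> 1) | (k0 << 63)) & MASK64
    return rotated ^ (k0 >> 63)


def encryption_key(k0: int, k1: int):
    """Key triple (input whitening, output whitening, core key) for encryption."""
    return k0, k0_prime(k0), k1


def reflect(key):
    """Swap the whitening keys and add alpha to the core key.

    Feeding the reflected triple to the encryption circuit makes it decrypt.
    """
    w_in, w_out, core = key
    return w_out, w_in, core ^ ALPHA


# Hypothetical example key halves, used for illustration only.
enc_key = encryption_key(0x0011223344556677, 0x8899AABBCCDDEEFF)
dec_key = reflect(enc_key)
```

Applying `reflect` twice returns the original triple, which mirrors the fact that encryption and decryption share one circuit.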
The data analysis methods used in this paper have been successfully applied to traditional cryptography. In [35,36], the authors implemented CPA and DPA attacks on DES [15]. In [37], Lu et al. demonstrated first- and second-order differential power analysis on AES. These attacks on traditional cryptography fix the input in a particular way and then calculate the intermediate values in the inner loop, but they cannot be applied directly to the unrolled architecture. Because the unrolled architecture has no registers or clocks, the SNR of the power consumption traces is low and their correlation with the operation sequence of the algorithm is weak, so it is difficult to implement CPA or DPA directly. Algebraic side-channel analysis (ASCA) [38] and soft analytical side-channel analysis (SASCA) [39] have been reported to attack the inner rounds of block ciphers. However, those attacks target a microcontroller that processes one byte at a time, whereas the unrolled architecture processes 64-bit data in parallel and completes twelve rounds in one cycle. Such an implementation does not allow an attacker to obtain the distribution of any independent byte function.
The unrolled architecture performs more than one round in a single cycle, so the key is diffused deeply all at once. Stronger multi-bit assumptions are therefore needed to discover sensitive information. Experimental results show that correlation power analysis with Hamming distance (HD) and Hamming weight (HW) models can be resisted if the datapath is cleared after each operation [14]. However, in [27], the authors performed a first-order SCA using a non-specific t-test (also known as a fixed versus random t-test) [40] and found a fairly strong first-order leak on the unprotected PRINCE implemented in the unrolled architecture. The t-test can only detect the presence of leakage; it does not indicate whether the leak is exploitable. In [19,20], the authors significantly reduced the number of power consumption traces required for CPA by selecting Points-of-Interest (POI) within the power traces of the unrolled architecture. In [25], the authors proposed an extended attack with partially fixed input values to improve the SNR between the first and second rounds of the power traces, but the depth of the CPA attack is limited to the second round. In [23], the authors proposed an improved CFA [24] attack, which makes it feasible to extract first-order side-channel leakage from combinational logic in the initial rounds of unrolled datapaths. In [26], the authors deepened the attack by using the intermediate values of the first round (i.e., the difference in switching), which show up as side-channel leakage during the processing of the inner rounds. However, they recovered the full key only in the third round, and only 1/16 of the key in the fourth round.
The chosen-input attack proposed in this paper can generate a difference of only 1 bit at the MixColumn output of the first round. Compared with [26], the minimum difference in the first round decreases from 3 nibbles to 1 nibble, so more rounds are needed to completely mask it. Since only the input of the first round is used as the computational element, we are able to obtain the complete key information from the fourth round regardless of whether the first three rounds are protected. We performed five sets of experiments to validate our approach. In the next section, we present a detailed case study of PRINCE implemented in the unrolled architecture to illustrate why this leakage exists and how many rounds are affected.
  3. Side Channel Leakage on PRINCE
In this section, we introduce our attack principle and experimental methods in detail. We first introduce the power consumption model, the leakage model, and the leakage point. Then we describe the attack method, which is a chosen-input attack.
  3.1. Leakage Model
Common power consumption models are mainly the HW model and the HD model. The HW model, denoted as 
, is the number of bits at “1” in the internal data structure. The HD model counts the number of different bits in the two data structures, denoted as 
. The HD model maps well onto ASICs because the register is the basic storage element of an ASIC. Changing data generate large dynamic power consumption when registers are updated, while unchanged data generate only small static power consumption. Therefore, the HD is a good approximation of the power consumption of an ASIC. Moreover, our attack is based on the difference between input plaintexts, not on the plaintexts themselves. Therefore, we choose the HD model as the power model. The following expression is obtained:
        in which 
 is the power consumption, 
p is plain text, 
 and 
 is the pair of Input-Differential data (the construction method is described in detail in 
Section 3.2), 
r is the round number of the observed S-box, 
i is the count flag of plain texts, and 
j is the attack position. Equation (
3) estimates the HD associated with the leakage model.
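The two power models can be written down directly. The sketch below is only an illustration of the standard HW and HD definitions given above; the function names are hypothetical, and it does not reproduce Equation (3), which additionally depends on the round, the plaintext index, and the attack position.

```python
def hamming_weight(x: int) -> int:
    """Number of bits set to 1 in x (the HW model)."""
    return bin(x).count("1")


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ (the HD model)."""
    return hamming_weight(x ^ y)


# Two 64-bit inputs that differ only in their lowest 12 bits (difference 0xf50).
p = 0x1867FFC04CE52BAB
p_diff = 0x1867FFC04CE524FB
distance = hamming_distance(p, p_diff)  # 6 bits differ
```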
In PRINCE, 4 bits are taken as a unit and denoted as a nibble, and four consecutive nibbles form a halfword (16 bits). Therefore, there are sixteen nibbles and four halfwords in the data structure of each round, which are recorded as nibble
 and halfword
, respectively, where 
 and 
, see 
Figure 2 for details.
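As a small illustration of this data layout, the following sketch splits a 64-bit state into sixteen nibbles and four halfwords. It assumes that index 0 refers to the most significant nibble or halfword; the actual numbering is fixed by Figure 2, so this ordering is an assumption, and the function names are hypothetical.

```python
def nibbles(state: int) -> list:
    """Split a 64-bit state into 16 nibbles, index 0 = most significant (assumed order)."""
    return [(state >> (60 - 4 * i)) & 0xF for i in range(16)]


def halfwords(state: int) -> list:
    """Split a 64-bit state into 4 halfwords of 16 bits, index 0 = most significant (assumed order)."""
    return [(state >> (48 - 16 * i)) & 0xFFFF for i in range(4)]


state = 0x1867FFC04CE52BAB
state_nibbles = nibbles(state)
state_halfwords = halfwords(state)
```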
Because the proposed method attacks three nibbles at a time, the 64-bit data is divided into eight attack positions; see Table 2 for details.
For the attack point of the leakage model, we choose the output of the S-box in each round. The S-box is the only nonlinear component in the entire algorithm, and its nonlinearity gives it uneven differential properties: different Input-Differential data produce clearly distinguishable Output-Differential data at the S-box, which facilitates the analysis.
The power consumption of an actual hardware circuit consists mainly of dynamic and static power consumption. Static power consumption is related to the state held by the circuit: if the state of each device remains unchanged, the static power consumption remains unchanged. Dynamic power consumption occurs when the state of a device changes. In the loop architecture, a large amount of dynamic power is consumed when registers are updated. In the unrolled circuit, which has no registers or clock, part of the Input-Differential data is held fixed and only the bits in the attack position change, so part of the circuit stays in a “static” state. The circuit therefore generates only the dynamic power caused by the differential bits [
25]. For example, we construct a 64-bit random plaintext 
 at first, put it to the device under test, and keep the input data unchanged. At this time, the circuit is in a stable state and has no dynamic power consumption. Then we construct a differential input 
 (assuming that only the nibble [13:15] is changed) and feed it into the device under test. Since the other thirteen nibbles are unchanged, the corresponding thirteen S-boxes and their associated logic in the first round generate no dynamic power consumption; only S-boxes No. 13 to No. 15 and their associated logic switch, generating large dynamic power consumption (see 
Figure 3). The difference propagates to the next round, but it is progressively masked by the diffusion of the algorithm in later rounds. The differential changes in later rounds gradually converge to the average value (32 bits), and their correlation with the input data gradually weakens (see Figure 4 and Figure 5). Figure 4 shows the HD of the S-box in each round when the difference of the Input-Differential data is 1 bit [26]. It shows that the differential deviation of [26] has already converged to the mean value (32 bits) in the fourth round, so the key information cannot be obtained from it.
In PRINCE, RC-add (RC[r]), KEY-add (K) and Shift Row (SR) do not affect the differential characteristics. Therefore, we focus on the SubCell and MixColumn modules. The cipher uses a single 4-bit S-box, whose definition in hexadecimal notation is given in Table 3.
In the entire algorithm, the S-box is the most important module for determining the differential path. Table 3 shows that the S-box of PRINCE is bijective, which means that every input is mapped to a unique output and vice versa. In [4], the authors emphasize the differential properties considered in the design of the S-box, which keep the maximum differential probability within 1/4. Our attack, however, does not depend on the quality of the differential properties of the S-box itself; it only uses the fact that the S-box is bijective. We cannot calculate the first-round S-box outputs exactly because we do not know the key. However, since the key is fixed, we can traverse all first-round S-box outputs by changing the input plaintext.
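Both properties mentioned above (bijectivity and a maximum differential probability of 1/4) can be checked mechanically. The sketch below assumes the S-box values as recalled from the PRINCE specification [4] (hexadecimal string BF32AC916780E5D4); they should be compared against Table 3. The helper names are hypothetical.

```python
# PRINCE S-box as recalled from the specification (assumption: verify against Table 3).
SBOX = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
        0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]


def is_bijective(sbox) -> bool:
    """True if every output value occurs exactly once."""
    return sorted(sbox) == list(range(len(sbox)))


def ddt(sbox):
    """Difference distribution table: table[dx][dy] counts x with S(x) ^ S(x ^ dx) == dy."""
    n = len(sbox)
    table = [[0] * n for _ in range(n)]
    for dx in range(n):
        for x in range(n):
            table[dx][sbox[x] ^ sbox[x ^ dx]] += 1
    return table


def max_differential_probability(sbox) -> float:
    """Largest DDT entry over all nonzero input differences, as a probability."""
    table = ddt(sbox)
    return max(max(row) for row in table[1:]) / len(sbox)
```

Because the S-box is bijective, iterating a 4-bit plaintext nibble over all sixteen values also iterates the unknown S-box output over all sixteen values, whatever the fixed key nibble is.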
In the MixColumn layer, the 64-bit state is multiplied by a $64 \times 64$ matrix M. We recall from the specification of PRINCE [4] that the 64-bit linear transformation M is defined as in Equation (4), and the definitions of $\hat{M}^{(0)}$ and $\hat{M}^{(1)}$ are given in Equation (5). Equation (6) gives the definitions of $M_0$, $M_1$, $M_2$ and $M_3$. In hardware, this matrix multiplication is implemented with the rerouting and XOR layer shown in Figure 2.
The MixColumn of PRINCE is modeled on the MixColumn of AES, but to stay lightweight it does not use AES-like multiplication. As a result, each output bit of the MixColumn layer depends only on the outputs of three S-boxes in the preceding layer. In particular, if the outputs of three S-boxes within the same halfword each change in exactly 1 bit at the same position, only 1 bit of the 16-bit MixColumn output of that halfword changes. Since the attack scans 12-bit subsets of the plaintext (three S-boxes at a time), there is a chance that the outputs collide at the MixColumn layer and produce this expected 1-bit change; see Figure 2.
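The 1-bit collision can be reproduced with a small model of one 16-bit MixColumn block. The sketch assumes the block structure recalled from the PRINCE specification [4]: the 16 x 16 matrix is built from 4 x 4 blocks M_k, each an identity matrix with the k-th diagonal entry cleared, arranged so that block (i, j) is M_((i + j + offset) mod 4). The bit ordering inside a nibble and the function names are assumptions made for illustration.

```python
def m_hat(nibbles, offset=0):
    """Apply a PRINCE-style 16x16 binary matrix to one halfword given as four nibbles.

    Block (i, j) drops one bit position k = (i + j + offset) % 4 and passes the
    other three, so every output bit is the XOR of three input nibbles' bits at
    the same position. Position 0 is taken as the MSB of a nibble (assumption).
    """
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            k = (i + j + offset) % 4
            keep = 0xF ^ (0x8 >> k)  # mask with bit position k cleared
            acc ^= nibbles[j] & keep
        out.append(acc)
    return out


def hd_nibbles(a, b) -> int:
    """Hamming distance between two lists of nibbles."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Flip the same bit in three of the four S-box outputs of one halfword.
base = [0x3, 0xA, 0x5, 0xC]
flipped = [base[0], base[1] ^ 0x4, base[2] ^ 0x4, base[3] ^ 0x4]
output_change = hd_nibbles(m_hat(base), m_hat(flipped))  # exactly 1 bit changes
```

The reason is visible in the loop: the one output nibble that excludes the untouched input at that bit position sees three flips (a net change), while every other output nibble sees two flips that cancel.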
        
We analyze the depth of such differential propagation. As seen in Figure 5, when three S-box inputs in the first round are controlled, the difference still deviates significantly from the mean at the fourth-round S-box (the mean value is about 27 bits). Compared with the method of [26], the method in this paper effectively increases the depth of difference propagation.
The proposed attack may be difficult to apply to standard ciphers, because their S-boxes are generally 8 bits wide and their MixColumn modules not only use multiplication but also operate on more elements than those of lightweight ciphers. If multiple S-boxes are controlled at the same time, the hypothetical key space becomes too large for the analysis. Lightweight ciphers do not have this problem. To achieve low latency and small area, their S-boxes are generally small and their MixColumn operation is relatively simple. Even if three S-boxes are controlled at the same time, only 12 bits of data need to be analyzed. Because the space of a single hypothetical key is only $2^{12}$, our attack method is feasible.
  3.2. Chosen-Input Attack
In this section, we describe our attack method in detail, including how to build a differential pair and how to implement CPA and DPA. The attack is divided into two phases: data collection and data analysis. In the data collection phase, only one attack position (j) needs to be controlled, and the data in the remaining positions do not matter. However, to improve the HD observation, we suggest using random data in these positions. The data collection phase is described as follows:
- Generate random 64-bit plaintext . 
- Generate 12-bit random differential values . Then it is placed in the attack position (j), and the rest of the positions are supplemented with 0 for 64-bit data. For example, when , then  = _0000_0000_, (x means random number). 
- Generate the corresponding differential plaintext . 
- Input  to the device under test, and input  after the device is stabilized. Record this power consumption  and corresponding plaintext differential pairs  and . 
- Repeat steps 1–4 multiple times for the same attack position j. 
- Change the attack position (j) and repeat steps 1–5. 
As shown in 
Figure 3 (
, 
), the random plaintext is 
 = 0x1867_ffc0_4ce5_2bab, the random difference value is 
 = 0xf50, the free position is filled with 0 to form the input difference 
 = 0x0000_0000_0000_0f50, and calculate 
 = 0x1867_ffc0_4ce5_24fb. Because we need to measure the circuit’s power consumption following the input of two consecutive differential plaintexts, we must first input 
 and wait for the circuit to become stable before inputting 
. We input 
 at time 
, followed by 
 at time 
, and then we record the power consumption during the time interval T1.
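The construction of one differential pair can be sketched as follows. The mapping from an attack position j to a bit offset is given in Table 2 and is not reproduced here, so the `shift` parameter below is a hypothetical stand-in for it; drawing a nonzero difference is also an assumption of this sketch.

```python
import random


def differential_partner(p: int, delta: int, shift: int) -> int:
    """Place the 12-bit difference at bit offset `shift` and XOR it into the plaintext."""
    return p ^ ((delta & 0xFFF) << shift)


def make_pair(shift: int, rng=random):
    """Draw a random 64-bit plaintext and a random nonzero 12-bit difference."""
    p = rng.getrandbits(64)
    delta = rng.randrange(1, 1 << 12)  # nonzero so that the input actually changes
    return p, differential_partner(p, delta, shift)


# Reproduces the worked example from the text (difference 0xf50 in the lowest 12 bits).
partner = differential_partner(0x1867FFC04CE52BAB, 0xF50, shift=0)
```

In a measurement, the first plaintext of the pair is applied and held until the circuit is stable, then the second is applied and the power drawn during the transition is recorded.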
In the data analysis phase, we focus on CPA and DPA. It is difficult to apply DPA to the unrolled architecture directly: because all data are processed in one cycle, the correlation between the power traces and the power model is very poor, which makes DPA perform unsatisfactorily. CPA makes better use of the power consumption traces by calculating the correlation between each point of the traces and the power consumption model, so it works well on the unrolled architecture. However, CPA is computationally more expensive, and its cost grows with the number of sampling points. Next, we describe in detail how DPA and CPA are implemented. DPA is described as follows:
- Determine the distinguisher. In this paper, we use the mean value (6 bits) as the distinguisher, since the HD ranges over 12 bits. 
- Determine the attack position (j). 
- Make a 12-bit hypothetical key . 
- Calculate the HD of the S-box with the hypothetical key and the input plaintext pairs in the first round. The calculation formula is as follows:
             - 
            in which S represents SubCell. Equation (7) allows us to calculate the HD at the output of the first S-box. 
- Calculate the mean value of the corresponding single power consumption trace, and if , add the mean value of the power consumption traces to the set TH, otherwise add it to the set TL. 
- Repeat steps 4–5 over all difference pairs to obtain the sets TH and TL, subtract the mean values of the two sets, and record the absolute value as the difference value of the hypothetical key. 
- Repeat steps 3–6 over all hypothetical keys to obtain their difference values. The key of the DPA attack at position j is the hypothetical key with the maximum difference value. 
- Repeat steps 2–7 over all attack positions and combine the results of each attack position to obtain the final key of the DPA. 
As described above, DPA analyzes the whole power consumption trace, but its power consumption model is limited to the HD of the first round. The model therefore cannot represent the whole power consumption of the unrolled architecture well, and DPA requires more power traces than CPA.
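The DPA steps above can be sketched as a difference-of-means attack on one 12-bit attack position. This is a simplified illustration, not the authors' implementation: the S-box values are recalled from the PRINCE specification, the partition rule `HD >= 6` is an assumption (the exact comparison is not stated), `powers` holds one mean value per trace, and `simulate_pairs` is a hypothetical helper that replaces real measurements.

```python
import random

SBOX = [0xB, 0xF, 0x3, 0x2, 0xA, 0xC, 0x9, 0x1,
        0x6, 0x7, 0x8, 0x0, 0xE, 0x5, 0xD, 0x4]


def sub_cells12(x: int) -> int:
    """Apply the 4-bit S-box to each of the three nibbles of a 12-bit value."""
    return (SBOX[(x >> 8) & 0xF] << 8) | (SBOX[(x >> 4) & 0xF] << 4) | SBOX[x & 0xF]


def hd_first_round(p: int, q: int, k: int) -> int:
    """HD between the first-round S-box outputs of a plaintext pair under key guess k."""
    return bin(sub_cells12(p ^ k) ^ sub_cells12(q ^ k)).count("1")


def dpa_score(pairs, powers, k, threshold=6) -> float:
    """Absolute difference between the mean power of the high-HD and low-HD sets."""
    th, tl = [], []
    for (p, q), w in zip(pairs, powers):
        (th if hd_first_round(p, q, k) >= threshold else tl).append(w)
    if not th or not tl:
        return 0.0
    return abs(sum(th) / len(th) - sum(tl) / len(tl))


def dpa_attack(pairs, powers, candidates=range(1 << 12)):
    """Return the key guess with the largest difference value."""
    return max(candidates, key=lambda k: dpa_score(pairs, powers, k))


def simulate_pairs(n: int, seed: int = 0):
    """Hypothetical helper: random 12-bit plaintext pairs with a nonzero difference."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        p = rng.getrandbits(12)
        pairs.append((p, p ^ rng.randrange(1, 1 << 12)))
    return pairs
```

With measured traces, `powers` would be the per-trace mean of the recorded power, and `dpa_attack` would be run once per attack position over all 4096 key guesses.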
Before introducing CPA, we need to define the gain function. There is a correlation between the output of the MixColumn in the first round and the output of the S-box in the following rounds. Since the HD of the MixColumn layer takes only twelve values (ignoring HD = 0), we established the relationship between these twelve HD values of the MixColumn and the HD of the S-box in the following rounds through a large number of tests. The results are shown in Figure 6 and Table 4. Equation (8) defines the gain function.
        
        in which 
r is a round number and 
, but only 
 of the first five rounds are taken into account in this paper, 
 is the HD average of the MixColumn in the first round, 
 is the HD average of the S-boxes in round 
r.
CPA is described as follows:
- Determine the attack position (j). 
- Make a 12-bit hypothetical key . 
- Calculate the HD of MixColumn with the hypothetical key and the input plaintext pairs in the first round. The calculation formula is as follows:
             - 
            in which S represents SubCell and M represents MixColumn. Equation (9) allows us to calculate the HD at the output of the first MixColumn. 
- Using the gain function, calculate the HD of the S-box of each round (HD). 
- Calculate the correlation $\rho$ between the hypothetical HD values $h_i$ and the corresponding power consumption samples $w_i$. We use the Pearson correlation coefficient to quantify this correlation. The calculation formula is as follows:
$$\rho = \frac{\sum_{i}(h_i - \bar{h})(w_i - \bar{w})}{\sqrt{\sum_{i}(h_i - \bar{h})^2 \sum_{i}(w_i - \bar{w})^2}} \quad (10)$$
- Repeat steps 2–5 to obtain the correlation coefficients of all hypothetical keys; the key with the largest correlation coefficient is the CPA key at position j. 
- Repeat steps 1–6 over all attack positions and combine the results of each attack position to obtain the final CPA key. 
In Equation (10), $\bar{h}$ is the average value of $h_i$ and $\bar{w}$ is the average value of $w_i$. From the above, it can be seen that CPA performs a correlation calculation for each point of the power consumption trace. Therefore, CPA attacks more effectively than DPA, but its computational complexity is also higher.
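The correlation step of CPA can be sketched with a plain Pearson coefficient evaluated at every sample point. This is a minimal sketch under stated assumptions: `hyp` is the list of hypothetical HD values for one key guess (after the gain function), `traces[i][t]` is sample t of trace i, and taking the maximum absolute correlation over all sample points as the score of a key guess is an assumption of this illustration.

```python
from math import sqrt


def pearson(h, w) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(h)
    mean_h = sum(h) / n
    mean_w = sum(w) / n
    cov = sum((a - mean_h) * (b - mean_w) for a, b in zip(h, w))
    var_h = sum((a - mean_h) ** 2 for a in h)
    var_w = sum((b - mean_w) ** 2 for b in w)
    if var_h == 0 or var_w == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / sqrt(var_h * var_w)


def cpa_score(hyp, traces) -> float:
    """Largest absolute correlation between the hypothesis and any sample point."""
    n_samples = len(traces[0])
    return max(abs(pearson(hyp, [trace[t] for trace in traces]))
               for t in range(n_samples))
```

The per-sample loop is what makes CPA more expensive than DPA: the cost of scoring one key guess grows linearly with the number of sample points per trace.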
We actually get 
 when successfully using CPA or DPA to obtain the key. At this time, if we want to get the 128-bit original key 
K, we have two solutions. (1) Perform SCA on the decryption stage in the same way. Similarly, we can obtain 
, and we can easily obtain separate 
 and 
, because 
 and 
 is known. (2) Attack 
 in each round using the known 
 to restore the original key 
K [
25].
  3.3. Countermeasures
This section introduces mask-based countermeasures and evaluates their hardware overhead and level of protection. In 2006, Nikova et al. proposed a countermeasure based on secret sharing and multi-party computation, known as Threshold Implementation (TI) [29]. The TI approach has been proven secure even in the presence of glitches. To defend against higher-order DPA attacks, Bilgin et al. implemented higher-order TI in [41].
The TI scheme used in this paper for PRINCE follows [27,42]. Refs. [43,44] provide a thorough classification of 3-bit and 4-bit S-boxes, in which the S-box of PRINCE has an algebraic degree of 3. This implies that a directly shared TI scheme for PRINCE needs at least 3 + 1 = 4 components (shares). We have implemented two types of TI. The first one, shown in Figure 7, directly uses four components in the computation, but it does not satisfy the uniformity property of TI, so random numbers must be added.
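The share count can be illustrated with plain Boolean sharing of a single nibble. The sketch only shows how a value is split into d + 1 shares and recombined; it does not implement the shared S-box functions of Figure 7, and the function names are hypothetical.

```python
import secrets


def min_shares(algebraic_degree: int) -> int:
    """Lower bound on the number of shares for a direct TI of a degree-d function."""
    return algebraic_degree + 1


def share(x: int, n_shares: int, width: int = 4) -> list:
    """Split x into n_shares random values whose XOR equals x."""
    shares = [secrets.randbits(width) for _ in range(n_shares - 1)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def unshare(shares) -> int:
    """Recombine the shares by XOR."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc


# A degree-3 S-box such as that of PRINCE needs at least four shares.
nibble_shares = share(0xB, min_shares(3))
```

Each share is processed by its own copy of the logic, which is one reason the area grows so quickly when more rounds are protected.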
In [45], the authors decompose the S-box of PRESENT to lower its algebraic degree from 3 to 2, which greatly decreases the TI cost but raises the number of shared functions from 1 to 3. Registers must be inserted between the shared functions to avoid glitches. Based on the TI approaches in [27,42,45], we designed the TI protection architecture shown in Figure 8. This scheme satisfies the uniformity property of TI, so it needs no extra random numbers. For detailed information, the interested reader is referred to the original articles [27,29,45].
To evaluate whether the countermeasures are effective, we use the non-specific t-test [46,47], a technique that was shown in [27,48,49] to be effective in observing the extent of leakage.
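A non-specific (fixed versus random) t-test compares, sample point by sample point, the traces recorded for a fixed input with those recorded for random inputs. The sketch below uses Welch's t statistic; the threshold of 4.5 is the value commonly used in leakage assessment and is an assumption here, as are the function names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b) -> float:
    """Welch's t statistic for two samples (both need nonzero total variance)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_trace(fixed_traces, random_traces):
    """t statistic at every sample point; traces[i][t] is sample t of trace i."""
    n_samples = len(fixed_traces[0])
    return [welch_t([tr[t] for tr in fixed_traces],
                    [tr[t] for tr in random_traces])
            for t in range(n_samples)]


def leaks(t_values, threshold: float = 4.5) -> bool:
    """True if any sample point exceeds the (assumed) detection threshold."""
    return any(abs(t) > threshold for t in t_values)
```

As noted earlier in the paper, a large t-value only indicates that leakage is present; it does not show that the leakage can be exploited to recover the key.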
  4. Experiment
The experiments in this paper are carried out with Electronic Design Automation (EDA) tools; the experimental setup is shown in Figure 9 and Table 5.
To show that the leakage point proposed in this paper can still be detected when the first three rounds are protected, five sets of experiments are set up, as shown in Figure 10:
- Group A: All twelve rounds calculated by hardware. 
- Group B: Pre-calculate the first round by software and pass the pre-calculated result to the hardware to complete the next eleven rounds. 
- Group C: Pre-calculate the first two rounds by software and pass the pre-calculated result to the hardware to complete the next ten rounds. 
- Group D: Pre-calculate the first three rounds by software and pass the pre-calculated result to the hardware to complete the next nine rounds. 
- Group E: Pre-calculate the first four rounds by software and pass the pre-calculated result to the hardware to complete the next eight rounds. 
Computing the earlier rounds in software completely hides their side-channel information, so any observed leakage must come from the unprotected hardware rounds. The experimental steps are as follows:
- Generate the netlist, the Standard Delay Format (SDF) file and the Synopsys Design Constraints (SDC) file from the Register Transfer Level (RTL) code with Design Compiler (DC), using the SMIC 55 nm process. 
- Simulate the netlist with the differential pairs and the SDF file in Verilog Compiled Simulator (VCS) to generate the Fast Signal DataBase (FSDB) waveform file. 
- Simulate the power consumption with the FSDB and SDC files in PrimeTime PX (PTPX) to obtain the power consumption traces. 
- Repeat steps 2–3 to obtain a sufficient number of power consumption traces and differential pairs. 
- Run CPA or DPA on the power consumption traces and differential pairs to obtain the key. 
- Change the RTL code, repeat steps 1–5, and observe the experimental results of each RTL version. 
The results of the five groups of experiments are shown in Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16. For group A, both DPA and CPA were performed; for the remaining four groups, only CPA was performed. In group A, both DPA and CPA succeeded, and the trace of the correct key stands out clearly from those of the other hypothetical keys. In groups B, C and D, the trace of the correct key is significantly higher than those of the other hypothetical keys. In group E, the correct key does not rank prominently and the correlations of all hypothetical keys are below 0.2, which indicates that the bias caused by the input difference is already difficult to observe in the fifth round. To verify the security of the fifth round, we performed CPA with different numbers of power consumption traces. Figure 16 shows that the ranking of the correct key does not improve as the number of traces increases, which means that the correct key cannot be recovered no matter how many traces are used.
To demonstrate the benefit of the method in this paper, we added a set of tests that uses the same experimental settings as group D but the methods of [25,26]. Figure 17 shows the results: the approaches of [25,26] cannot extract the key in the fourth round, since the ranking of the true key does not improve as the number of tests grows. The experiments also show that the number of power consumption traces required for an attack increases with the number of attacked rounds. As the attack moves to deeper rounds, the leakage of key information gradually decreases, which is in line with the trend in Figure 6.
To determine the effect of the countermeasure on the unrolled architecture, we assessed the hardware overhead and performance of the two TI schemes, as shown in Table 6. Scheme 1 is the architecture shown in Figure 7, and Scheme 2 is the architecture shown in Figure 8.
Table 6 shows that the hardware overhead of the circuit increases significantly as the number of protected rounds grows. Scheme 1 has a substantially higher throughput than Scheme 2, but it needs a much larger area and more random numbers. Because we prioritized area, a t-test was carried out for Scheme 2. The test results are displayed in 
Figure 18. The red horizontal lines represent the points where the range of 
. The result indicates that the 
t-value during the masked rounds stays within the range of 
 and that later rounds result in larger 
t-values. However, as discussed above, the power consumption of rounds 5 to 10 can no longer be exploited, so protecting only the first four rounds offers a better cost-security trade-off.
To summarize, Table 7 lists the SCA results on PRINCE in the unrolled architecture reported in recent years.