Template Attack of LWE/LWR-Based Schemes with Cyclic Message Rotation

The side-channel security of lattice-based post-quantum cryptography has gained extensive attention since the standardization of post-quantum cryptography began. Based on the leakage mechanism in the decapsulation stage of LWE/LWR-based post-quantum cryptography, a message recovery method that uses templates and cyclic message rotation to target the message decoding operation was proposed. The templates were constructed for the intermediate state based on the Hamming weight model, and cyclic message rotation was used to construct special ciphertexts. Using the power leakage during operation, secret messages in the LWE/LWR-based schemes were recovered. The proposed method was verified on CRYSTAL-Kyber. The experimental results demonstrated that this method could successfully recover the secret messages used in the encapsulation stage, thereby recovering the shared key. Compared with existing methods, the numbers of power traces required for template construction and for the attack were both reduced. The success rate was significantly increased under low SNR, indicating better performance with lower recovery cost. The message recovery success rate could reach 99.6% with sufficient SNR.


Introduction
The threat of quantum computing targeting traditional public key cryptography has generated great interest around the world in actively researching post-quantum cryptography (PQC). In December 2016, NIST launched a global standardization project for PQC algorithms [1]. With small public key sizes, small ciphertext/signature sizes, fast computation, and diverse functions, lattice-based PQC has received much attention. The selected PQC candidates have certain requirements in terms of security and performance, among which resistance to side-channel attack (SCA) is particularly emphasized. SCA was first proposed by Kocher [2] and includes timing attacks, power analysis, and fault attacks. Power analysis is widely employed in SCA due to its low cost and simple principle. It exploits the leakage generated by the cryptographic equipment during operation, including power consumption, electromagnetic (EM) radiation, and other information, and combines it with mathematical analysis methods to obtain the secret information. With the advent of standardization, the side-channel vulnerabilities of implementations of the PQC algorithms need to be urgently explored.
It is pertinent to study the power analysis of the learning with errors/learning with rounding (LWE/LWR)-based schemes [3,4], as most lattice-based PQC schemes are constructed based on these two mathematical problems. In recent years, the SCAs of LWE/LWR-based schemes have fallen into two categories. One category involves obtaining the private key used over the long term, and the other involves recovery of the secret message and shared key used in the encryption process.
In relation to the first category, Refs. [5,6] chose to attack the number theoretic transform (NTT) and used a single EM trace to recover the private key. However, this approach was

1. Considering the vulnerability proposed in [11], we present a message recovery attack on LWE/LWR-based schemes. Our method targets the decoding operation in the decapsulation procedure and recovers the secret message, as well as the shared key, using the cyclic message rotation property in template style.
2. We use the Hamming weight (HW) model to construct a classifier for the templates and construct specific ciphertexts using cyclic message rotation to reduce the number of power traces needed in the template-matching phase.
3. We provide details of the specific attack and implement the message recovery attack for CRYSTAL-Kyber on an ARM Cortex-M4 microprocessor. Compared with previous results, the number of power traces required for constructing the templates was reduced and the success rate for recovery of the message was greatly improved at the same signal-to-noise ratio (SNR), indicating better performance at lower cost.
4. The main findings of this paper are summarized and compared to the existing literature. We also briefly illustrate the feasibility and validity of applying our message recovery attack to other schemes.
The remainder of the paper is organized as follows: Section 2 provides an introduction to the relevant concepts. Section 3 analyzes the leakage mechanism of the vulnerability which forms the basis of the attack. Section 4 presents a detailed method for the message recovery attack, using CRYSTAL-Kyber as an example. Section 5 assesses the proposed methods with CRYSTAL-Kyber and evaluates the accuracy and efficiency of this method. Section 6 concludes the paper.

Notations
Let $\mathbb{Z}$ be the integer ring and $R_q = \mathbb{Z}_q[x]/\phi(x)$ be the ring of integer polynomials modulo $\phi(x)$ and $q$, where $\phi(x)$ is a cyclotomic polynomial over $\mathbb{Z}$ and $q$ is an integer. We use bold lowercase letters (a) for polynomials and bold uppercase letters (A) for vectors or matrices. Let $\beta_\mu$ be a centered binomial distribution with parameter $\mu$. We write $x \leftarrow \chi_\sigma$ to denote sampling $x$ at random from a distribution $\chi_\sigma$ with standard deviation $\sigma$. We denote the $i$th coefficient of polynomial a as a[i] and the byte array of length $k$ as $B^k$. For $m \in B^k$, we use m[i] to denote the $i$th byte of m, $m_i$ to denote the $i$th bit of m, and m [i]

LWE/LWR Problem
The LWE problem was first introduced by Regev [3] and underlies the security of most lattice-based PQC schemes. Let $n$ and $q$ be positive integers. For a given $s \in \mathbb{Z}_q^{n \times l}$, a standard LWE instance is denoted as a tuple $(A, t) = (A, (A \times s + E) \bmod q)$, where $A \in \mathbb{Z}_q^{k \times n}$ is chosen uniformly at random and $E \in \mathbb{Z}_q^{k \times l}$ is sampled from a distribution $\chi$. The LWR problem proposed by Banerjee et al. [4] is a variant of the LWE problem in which the error is generated by the rounding remainder of $(a \times s)$. We denote the scaled rounding as $\lfloor \cdot \rceil$, and an LWR instance with $p < q$ is defined as $(a, b) = (a, \lfloor (p/q) \times (a \times s) \rceil)$, where a is chosen uniformly at random and $s \leftarrow \beta_\mu(\mathbb{Z}_q^n)$. Among the NIST PQC candidates, FrodoKEM is the only candidate based on the standard LWE problem. Some schemes, such as NewHope and Round5, are built on the Ring-LWE/Ring-LWR problem, while others, such as CRYSTAL-Kyber and Saber, are built on the Module-LWE/Module-LWR (MLWE/MLWR) problem, using polynomial vectors or matrices to operate on $R_q^k$, where $k$ represents the rank of the module. MLWE/MLWR reduces the computational load and the bandwidth relative to the standard LWE problem, providing a tradeoff between cost and security [15].
A simplified version of the LWE/LWR-based public key encryption (PKE) is presented in Algorithms 1 and 2, which is proven to be secure in the indistinguishability under chosen plaintext attack (IND-CPA) security model [16]. Encode is an encoding function that converts a byte array to a polynomial, while Decode is its inverse, converting a polynomial to a byte array. As shown in Algorithm 1, the IND-CPA PKE encryption uses the public key pk and the random seed r to encrypt the message m, and the ciphertext c is formed by concatenating the ciphertext segments $c_1$ and $c_2$. In Algorithm 2, the IND-CPA PKE decryption uses the long-term private key sk to decrypt the received ciphertext c and yields the decrypted message m′. The message m′ is re-encrypted with the public key pk to obtain the re-encrypted ciphertext c′. A chosen-ciphertext attack (CCA) is detected by comparing c′ with c. If c is invalid, i.e., c′ ≠ c, the adversary will not be able to obtain any information about the decrypted message, and the CCA is thwarted.

Test Vector Leakage Assessment (TVLA)
TVLA is a conformance-based method commonly used in both academia and industry to evaluate the side-channel security of cryptographic implementations [18]. It evaluates the data dependence and operational dependence of power consumption during encryption on devices through hypothesis testing. The collected power traces are divided into two groups, and hypothesis testing is used to determine whether there is a significant difference in power consumption between these two groups. If there is a difference, then the device is likely to have data dependence and operational dependence, indicating that the device has power leakage. The accuracy of the assessment is closely related to the hypothesis-testing method, among which Welch's t-test is the most widely used.
The formulation of TVLA over two sets of power measurements $T_r$ and $T_f$ is given by:

$$\mathrm{TVLA} = \frac{X_r - X_f}{\sqrt{\dfrac{\sigma_r^2}{N_r} + \dfrac{\sigma_f^2}{N_f}}} \qquad (1)$$

where $X_r$, $\sigma_r$, and $N_r$ (resp. $X_f$, $\sigma_f$, and $N_f$) represent the mean, sample standard deviation, and size of $T_r$ (resp. $T_f$). The null and alternative hypotheses ($H_0$ and $H_1$, resp.) of Welch's t-test are shown below:

$$H_0: X_r = X_f, \qquad H_1: X_r \neq X_f$$

$H_0$ is rejected with a confidence of 99.9999% if, and only if, the absolute value of the TVLA is greater than the pass-fail criterion of 4.5. Rejecting $H_0$ indicates a considerable discrepancy between the two measurement sets, which may lead to a leakage of side-channel information.
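The test in Equation (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch on synthetic data: the function names `welch_t` and `leaky_points`, the trace dimensions, and the injected offset at sample 20 are assumptions made for the example, not part of the experiments in this paper.

```python
import numpy as np

def welch_t(traces_r, traces_f):
    """Pointwise Welch's t-statistic between two trace sets (one trace per row)."""
    mean_r, mean_f = traces_r.mean(axis=0), traces_f.mean(axis=0)
    var_r = traces_r.var(axis=0, ddof=1)  # sample variance of set r
    var_f = traces_f.var(axis=0, ddof=1)  # sample variance of set f
    return (mean_r - mean_f) / np.sqrt(var_r / len(traces_r) + var_f / len(traces_f))

def leaky_points(traces_r, traces_f, threshold=4.5):
    """Indices of sample points whose |t| exceeds the pass-fail criterion."""
    return np.flatnonzero(np.abs(welch_t(traces_r, traces_f)) > threshold)

# Synthetic example (assumed data): 500 traces per set, 50 samples per trace.
rng = np.random.default_rng(0)
set_r = rng.normal(0.0, 1.0, size=(500, 50))
set_f = rng.normal(0.0, 1.0, size=(500, 50))
set_f[:, 20] += 1.0  # hypothetical data-dependent offset at sample 20

pois = leaky_points(set_r, set_f)
```

In this toy setting, sample 20 is expected to appear in `pois` because the two sets differ only there.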

Normalized Inter-Class Variance (NICV)
NICV is a univariate analysis of variance (ANOVA) F-test [19], defined as the ratio between the variance of the class-conditioned leakage mean and the total leakage variance. It does not require knowledge of the implementation or the secret parameters of the cryptographic scheme, but only the public parameters of the encryption and the plaintexts or ciphertexts of each execution. Both NICV and TVLA can be used as side-channel evaluation metrics, but TVLA is usually used to distinguish two different classes, while NICV can distinguish two or more classes simultaneously.
We denote the classes of a variable $X$ as $C(X)$ and the measured leakage of $X$ as $T$; then the NICV is computed as follows:

$$\mathrm{NICV} = \frac{\mathrm{Var}\left[\mathrm{E}[T \mid C(X)]\right]}{\mathrm{Var}[T]}$$

where $\mathrm{E}[\cdot]$ and $\mathrm{Var}[\cdot]$ represent the univariate mean and variance. Although there is no exact NICV threshold, the higher the NICV value at a given point, the greater the difference in leakage among the classes.
In this paper, we use TVLA as a leakage-detecting tool, while NICV is the feature-selecting tool for constructing different templates for each class.
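As a rough illustration of the NICV definition above, the following sketch computes the variance of the class means divided by the total variance at every sample point. The function name `nicv`, the synthetic traces, and the choice of sample 5 as the leaking point are assumptions for this example only.

```python
import numpy as np

def nicv(traces, labels):
    """Per-sample NICV: variance of class-conditional means over total variance."""
    traces = np.asarray(traces, dtype=float)
    labels = np.asarray(labels)
    total_mean = traces.mean(axis=0)
    between = np.zeros(traces.shape[1])
    for c in np.unique(labels):
        cls = traces[labels == c]
        # Weight each class mean by the number of traces in that class.
        between += len(cls) * (cls.mean(axis=0) - total_mean) ** 2
    between /= len(traces)
    return between / traces.var(axis=0)

# Synthetic example (assumed data): class labels are Hamming weights 0..8.
rng = np.random.default_rng(1)
labels = rng.integers(0, 9, size=900)
traces = rng.normal(0.0, 1.0, size=(900, 16))
traces[:, 5] += labels  # hypothetical HW-dependent leakage at sample 5

scores = nicv(traces, labels)
```

Because more than two classes are handled in one pass, the same routine can serve any number of HW classes.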

Vulnerability in Message Decoding of LWE/LWR-Based KEM
In general, the operations that are closely related to the plaintexts or keys are chosen as the attack point in power analysis. Ravi et al. [11] described the Single_Bit_Update vulnerability of the decoding function (Decode in Algorithm 2, the red module in CPA PKE Decryption of Figure 1), which exists in most LWE/LWR-based PKEs/KEMs. This vulnerability exploits the leakage generated when single-bit information of the decrypted message is stored in memory, which enables the complete recovery of the secret message. We chose CRYSTAL-Kyber as an example for a brief analysis of this vulnerability; detailed information can be found in [20]. In CRYSTAL-Kyber, the function poly2msg is used to convert polynomials to message bytes. Refer to Algorithm 5 for the C code snippet of poly2msg, which is taken from the pqm4 library [21]. All the experiments in this paper are based on this open-source library. We used the arm-none-eabi-gcc compiler for the ARM Cortex-M4 processor to compile the above code and generated the assembly code for further analysis. When i, j = 0, the assembly code snippet corresponding to poly2msg is as shown in Figure 2. We can see that, after a series of calculations, the intermediate value t in register r3 is stored at the memory address held in r2 through the STRB instruction (see line 9, Figure 2). The conversion of coefficients to message byte m[i] completes after eight iterations. The execution of STRB causes power consumption that is related to the HW of the stored intermediate value.
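The effect of the single-bit update on the stored value can be illustrated with a small Python model. This is a sketch only: it mirrors the decoding rule of the public Kyber reference code with assumed parameters n = 256 and q = 3329, it is not the pqm4 snippet referred to as Algorithm 5, and the helper names are hypothetical.

```python
KYBER_N = 256   # assumed polynomial length
KYBER_Q = 3329  # assumed modulus

def poly2msg_with_intermediates(coeffs):
    """Decode coefficients to bytes and record every value written to msg[i]."""
    msg = bytearray(KYBER_N // 8)
    stored = []  # stored[i][j] is the value of msg[i] after iteration j
    for i in range(KYBER_N // 8):
        byte_states = []
        for j in range(8):
            # One bit per coefficient: 1 if the coefficient is closer to q/2 than to 0.
            t = (((coeffs[8 * i + j] << 1) + KYBER_Q // 2) // KYBER_Q) & 1
            msg[i] |= t << j            # single-bit update of the message byte
            byte_states.append(msg[i])  # value that a byte store would write
        stored.append(byte_states)
    return bytes(msg), stored

def hamming_weight(x):
    return bin(x).count("1")

# Example input: bits of m[0], least significant first, give the byte 0x8D.
coeffs = [0] * KYBER_N
for j, bit in enumerate([1, 0, 1, 1, 0, 0, 0, 1]):
    coeffs[j] = bit * ((KYBER_Q + 1) // 2)

msg, stored = poly2msg_with_intermediates(coeffs)
hw_trace = [hamming_weight(v) for v in stored[0]]  # [1, 1, 2, 3, 3, 3, 3, 4]
```

The HW of the stored value never decreases and grows by one exactly when the newly decoded bit is 1, which is the dependence that the templates in the following sections rely on.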

Message Recovery Attack Method
According to the analysis of the Single_Bit_Update in Section 3, the message byte m[i] can be fully updated after eight iterations and each message byte is updated in the same way. Therefore, we consider our message recovery attack targeting a single message byte at a time. We construct templates for a target message byte and cycle the given ciphertext to move the remaining message bytes to the target position; then the remaining message bytes can be recovered using the constructed templates. The attack process is performed in two stages: the data preprocessing stage and the template matching stage. In this section, we introduce our message recovery attack based on the templates and cyclic message rotation and then analyze its feasibility.

Data Preprocessing
In this section, the data preprocessing method is introduced, which includes leakage detection and template construction. First, we detect the power leakage with TVLA and build sets of points of interest (PoIs) by decapsulating ciphertexts that contain different messages and collecting the corresponding power traces. Then, we use NICV to distinguish the HW of the different message intermediate values and establish the corresponding reduced templates for classifying the HW value of the messages.

Leakage Detection
Since the same leakage mechanism applies to each message byte, we take the first message byte m[0] as an example and use Welch's t-test for leakage detection. We denote the number of message bytes as n. First, we build two ciphertext sets, denoted $CT_0$ and $CT_1$, each containing $l$ random ciphertexts. [...], which can be measured by power consumption. The process can be described as follows:

1. Collect the power traces. Collect two sets of $l$ power traces for $CT_0$ and $CT_1$, denoted as $T_0$ and $T_1$, respectively, with $T = T_0 \cup T_1$.
2. Normalize the measured power traces. The influence of the environment is reduced by removing the mean of each trace in the measurement sets, i.e., each trace $t \in T$ is replaced by $t - \bar{t}$, where $\bar{t}$ is the mean of its samples.
3. Identify the PoIs of the measurement sets. Use Equation (1) to calculate the TVLA between the two measurement sets. If the absolute value of the calculated TVLA is greater than the threshold $Th_{sel}$, then there is a considerable discrepancy between the two measurement sets at this point, which may indicate leakage.

Template Construction
We use the PoIs in each approximate time window $W_j$ for $j \in [0, 7]$, in which each intermediate value is updated as identified in Section 4.1.1, to construct the templates. The process is as follows:

1. [...]
2. Calculate the NICV over $T^k_{(0,j)}$ to distinguish different HW(m[0, j]), and select the points in $W_j$ whose NICV value is greater than a certain threshold as PoIs, denoted as $p_{(0,j)}$.
3. Construct the reduced trace sets $T^k_{(0,j)}$ according to $p_{(0,j)}$ and calculate the mean of $T^k_{(0,j)}$, denoted as $rt^k_{(0,j)}$, which is the reduced template of each class, so $(j + 2)$ templates will be constructed at the $j$th iteration.
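The PoI selection and reduced-template steps above can be sketched as follows. This is a minimal sketch on synthetic traces: the function name `build_reduced_templates`, the NICV threshold of 0.5, the window bounds, and the leaking sample are assumptions for illustration, since the thresholds used in the experiments are not stated here.

```python
import numpy as np

def build_reduced_templates(traces, hw_labels, window, nicv_threshold=0.5):
    """Select PoIs inside one time window by NICV and average each HW class there."""
    traces = np.asarray(traces, dtype=float)
    hw_labels = np.asarray(hw_labels)
    start, stop = window
    segment = traces[:, start:stop]
    classes = np.unique(hw_labels)
    class_means = np.array([segment[hw_labels == k].mean(axis=0) for k in classes])
    weights = np.array([np.mean(hw_labels == k) for k in classes])
    # NICV inside the window: weighted variance of class means over total variance.
    nicv = (weights[:, None] * (class_means - segment.mean(axis=0)) ** 2).sum(axis=0)
    nicv /= segment.var(axis=0)
    pois = start + np.flatnonzero(nicv > nicv_threshold)
    # Reduced template of class k: mean of the class traces restricted to the PoIs.
    templates = {int(k): traces[hw_labels == k][:, pois].mean(axis=0) for k in classes}
    return pois, templates

# Synthetic example for iteration j = 1, where HW(m[0, j]) can be 0, 1, or 2.
rng = np.random.default_rng(0)
j = 1
labels = np.repeat(np.arange(j + 2), 100)      # 100 assumed traces per HW class
traces = rng.normal(0.0, 0.3, size=(labels.size, 30))
traces[:, 12] += labels                         # hypothetical leakage inside window (10, 20)

pois, templates = build_reduced_templates(traces, labels, window=(10, 20))
```

For iteration j the dictionary holds j + 2 entries, one per possible HW of the intermediate value.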

Template Matching
In this section, we first introduce the cyclic message rotation and then the procedure for matching the special ciphertexts with constructed templates.

Cyclic Message Rotation
Most lattice-based PQC schemes are constructed based on the LWE/LWR problem and its variants. $R_q$ has different properties depending on the choice of the cyclotomic polynomial $\phi(x)$. For example, Round5 and its variants operate over a ring whose modulus $\phi(x)$ is a reducible polynomial, making $R_q$ a cyclic polynomial ring [14]. The multiplication of polynomial a and $f_t(x) = x^t$ in $R_q$ results in $a_t[i] = \mathrm{Rotr}(a, t)[i]$, indicating that the coefficients of a rotate $t$ positions cyclically. The Rotr(·) function is defined as [8]:

$$\mathrm{Rotr}(a, t)[i] = a[(i - t) \bmod n], \quad 0 \le i < n$$

Some other schemes, such as CRYSTAL-Kyber, Saber, LAC, and NewHopeKEM, utilize an anti-cyclic polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$, where $(x^n + 1)$ is an irreducible polynomial. So the product of a and $f_t(x)$ in the anti-cyclic polynomial ring is $a_t[i] = \mathrm{Anti\_Rotr}(a, t)[i]$, indicating an anti-cyclic rotation of a by $t$ positions. The Anti_Rotr(·) function is defined as:

$$\mathrm{Anti\_Rotr}(a, t)[i] = \begin{cases} a[i - t], & t \le i < n \\ -a[n + i - t], & 0 \le i < t \end{cases}$$

We further analyze this property on CRYSTAL-Kyber, although it also applies to other schemes. In CRYSTAL-Kyber, the message bit $m_i$ depends on only one message polynomial coefficient x[i], which is computed from the ciphertext c and the private key sk in the decryption phase and decoded in poly2msg (see Algorithm 5). The ciphertext c consists of two polynomials denoted as u and v, so the decoding operation on the first bit of the message, $m_0$, can be expressed as:

$$m_0 = \mathrm{Decode}(x[0]) = \mathrm{Decode}\left((v - u \cdot s)[0]\right)$$

where s denotes the secret contained in sk and Decode(·) determines whether $m_i$ is 0 or 1 based on the distance between x[i] and the center of the ring. We then create special ciphertexts $c_i = (u_i, v_i)$, where $u_i = \mathrm{Anti\_Rotr}(u, i)$ and $v_i = \mathrm{Anti\_Rotr}(v, i)$. For $i \in [1, n - 1]$, the first bit of the message with cyclic message rotation, $m'_0$, is given as:

$$m'_0 = \mathrm{Decode}\left((v_i - u_i \cdot s)[0]\right) = \mathrm{Decode}\left(\mathrm{Anti\_Rotr}(x, i)[0]\right) = \mathrm{Decode}(-x[k]) = m_k$$

where $k = (n - i) \bmod n$. The sign change does not affect the result because Decode depends only on the distance to the center of the ring. Thus, we can simply change $i$ to rotate the given ciphertext and obtain special ciphertexts, and the complete message can be recovered with the templates constructed in the preprocessing stage.
Although these special ciphertexts are invalid, meaning that they cannot pass the final polynomial comparison, they can still be decapsulated on the device, creating the possibility of power analysis.
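The rotation property can be checked numerically on the message polynomial itself. The sketch below is a toy model under stated assumptions: n = 256 and q = 3329 are assumed Kyber-like parameters, the noisy polynomial `x` stands in for the value computed during decryption, and ciphertext compression is ignored. The names `anti_rotr`, `decode_bit`, and `first_byte` are hypothetical.

```python
import random

N = 256   # assumed polynomial length
Q = 3329  # assumed modulus

def anti_rotr(a, t):
    """Coefficients of a(x) * x^t in Z_q[x]/(x^n + 1), for 0 <= t < n."""
    n = len(a)
    return [a[i - t] % Q if i >= t else (-a[n + i - t]) % Q for i in range(n)]

def decode_bit(c):
    """1 if c is closer to q/2 than to 0 (mod q), else 0."""
    return (((2 * c) + Q // 2) // Q) & 1

def first_byte(poly):
    """Decode the first eight coefficients into message byte 0."""
    return sum(decode_bit(poly[j]) << j for j in range(8))

# Assumed message polynomial: each bit encoded near 0 or q/2 with small noise.
rng = random.Random(1)
bits = [rng.randrange(2) for _ in range(N)]
x = [(b * ((Q + 1) // 2) + rng.randrange(-100, 101)) % Q for b in bits]
msg = [sum(bits[8 * i + j] << j for j in range(8)) for i in range(N // 8)]

# Rotating by 8*i positions brings message byte (N/8 - i) to the target position 0.
recovered = [first_byte(anti_rotr(x, 8 * i)) for i in range(1, N // 8)]
expected = [msg[N // 8 - i] for i in range(1, N // 8)]
```

Rotating by multiples of eight keeps the bit order inside each byte, so templates built for byte 0 can in principle be reused for every other byte.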

Template Matching
The same public-private key pair as that in the preprocessing stage is not required in the template-matching stage, since we construct the templates for the possible HW values of the messages. The special ciphertexts are constructed with the method in Section 4.2.1, and the message is recovered using the templates constructed in Section 4.1.2. The matching process is described as follows:

1. Decapsulate the given ciphertext c and collect the corresponding power trace, denoted as tr. Normalize tr as in the template-construction process (see Step 2 in Section 4.1.1) and establish the reduced traces, denoted as $tr_j$, according to $p_{(0,j)}$ for $j \in [0, 7]$.
2. Calculate the sum of squared differences (SOSD) between $tr_j$ and the reduced template of each class $rt^k_{(0,j)}$, denoted as $\mathrm{SOSD}_k$:

$$\mathrm{SOSD}_k = \sum_{p \in p_{(0,j)}} \left(tr_j[p] - rt^k_{(0,j)}[p]\right)^2$$

We can assign HW(m[0, j]) = k based on the smallest value of $\mathrm{SOSD}_k$ and then derive $m_j$ according to Equation (4).
3. Construct different ciphertexts, denoted as $ct_i$ for $i \in [1, n - 1]$, from a given valid ciphertext ct with cyclic message rotation, and repeat Steps 1 and 2 to obtain HW(m[i, j]) and then derive m[i].
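The matching steps can be sketched as below. Two assumptions are made loudly here: the templates and observed traces are synthetic, and the bit derivation uses the rule that bit j equals HW(m[0, j]) minus HW(m[0, j − 1]), which is our reading of the single-bit update and may differ in form from Equation (4). All function names are hypothetical.

```python
import numpy as np

def match_hw(reduced_trace, templates):
    """Return the HW class whose reduced template has the smallest SOSD."""
    sosd = {k: float(np.sum((reduced_trace - rt) ** 2)) for k, rt in templates.items()}
    return min(sosd, key=sosd.get)

def bits_from_hw(hw_sequence):
    """Assumed rule: bit j is the increase in HW between iterations j-1 and j."""
    bits, prev = [], 0
    for hw in hw_sequence:
        bits.append(hw - prev)
        prev = hw
    return bits

def byte_from_bits(bits):
    return sum(b << j for j, b in enumerate(bits))

# Synthetic example: the true byte is 0x8D, whose intermediate HWs are listed below.
rng = np.random.default_rng(0)
true_hw = [1, 1, 2, 3, 3, 3, 3, 4]
hw_guess = []
for j, hw in enumerate(true_hw):
    # Iteration j has j + 2 classes; each toy template has three PoIs.
    templates = {k: np.full(3, float(k)) for k in range(j + 2)}
    observed = np.full(3, float(hw)) + rng.normal(0.0, 0.1, size=3)
    hw_guess.append(match_hw(observed, templates))

recovered = byte_from_bits(bits_from_hw(hw_guess))
```

A practical implementation would also need to handle a misclassified HW, for example when the difference between consecutive guesses is not 0 or 1.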

Experiments and Evaluation
In this section, we verify the proposed message recovery attack with CRYSTAL-Kyber and evaluate the accuracy and efficiency.

Experimental Setup
Our experimental setup is shown in Figure 3. The device under test (DUT) was an STM32F3 target board equipped with an ARM Cortex-M4 microcontroller, plugged into a ChipWhisperer 308 UFO board [22]. The PC sent and received plaintexts/ciphertexts, while the ChipWhisperer-Lite controlled the communication between the DUT and the PC. A LeCroy 9404 oscilloscope was used to collect and save the power traces at a sampling rate of 29.48 MS/s. The implementation of CRYSTAL-Kyber, optimized for the Cortex-M4 microcontroller, was taken from pqm4, an open-source library of PQC schemes for the ARM Cortex-M4 microcontroller. We used arm-none-eabi-gcc to compile the implementation with the compiler options -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 and the highest compiler optimization level -O3, as it is the hardest to break by SCA. The STM32F303 target board ran at 7.37 MHz. Triggers were added before and after the target operation to help align the power traces.

Leakage Detection
According to the analysis in Section 3, the STRB instruction leaks information about the intermediate value of the message bytes in the decoding phase, so the first step in the message recovery attack is to identify corresponding features of decoding in traces. Figure  4 shows a partial power trace of decapsulating CRYSTAL-Kyber. We can roughly identify the different features corresponding to different operations during the decapsulation phase and then locate the time window containing the target operation. The target operation poly2msg corresponds to ⑩ and we only consider this part of the trace in the following experiments.

We performed leakage detection on CRYSTAL-Kyber KEM according to the process presented in Section 4.1.1 and chose 4.5 as the threshold to reduce the influence of other irrelevant instructions. We collected and normalized two measurement sets and calculated the TVLA of these measurement sets according to Equation (1). Since the measurement sets have different m[0], some sample points are expected to exceed the threshold, indicating leakage from storing m[0]. Refer to Figure 5a for the TVLA result, where eight obvious peaks exceed the threshold of 4.5. These peaks correspond to the storage of m[0, j] for $j \in [0, 7]$; based on these peaks, we can identify the time window $W_j$ in which each intermediate value is updated. We also repeated the same detection with ciphertext sets $CT_0$ (m[0] = 0) and $CT_2$ (m[0] = 2) for validation; the corresponding results are shown in Figure 5b, which shows only seven obvious peaks. Compared with the result in Figure 5a, the peak in $W_0$ is missing since m[0, 0] = 0 for both $CT_0$ and $CT_2$. Thus, no significant difference can be found in the decoding operation between these two ciphertext sets in the first iteration.

Template Construction and Matching
After identifying the time window of each iteration, we constructed ciphertext sets denoted as $CT_k$ for $k \in [0, 8]$, where the ciphertexts in set $CT_k$ correspond to messages m that satisfy HW(m[0, j]) = k for $j \in [0, 7]$. We chose the ciphertexts corresponding to messages that satisfied $m_0 = 0$ and $m_k = 2m_{k-1} + 1$ for $k \in [1, 8]$ in our experiments; in this way, the template construction could be performed with fewer power traces, and we only needed to construct nine ciphertext sets in total. We collected 100 power traces for each ciphertext set; a total of 900 power traces was sufficient to complete the construction of the templates required to recover m[0].

We then constructed templates for HW(m[0, j]) for j ∈ [0, 7] according to Section 4.1.2; the partial results of NICV between each class are shown in Figure 6, where Figure 6a
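The message choice $m_0 = 0$, $m_k = 2m_{k-1} + 1$ can be checked with a short calculation. The sketch assumes that $m_k$ is the value of the byte m[0] in set $CT_k$ and that m[0, j] is the low $j + 1$ bits of m[0], i.e., the value stored after iteration j; both readings are our assumptions.

```python
def hw(x):
    return bin(x).count("1")

# Byte values generated by m_0 = 0, m_k = 2 * m_{k-1} + 1 for k in [1, 8].
values = [0]
for k in range(1, 9):
    values.append(2 * values[-1] + 1)   # 0, 1, 3, 7, 15, 31, 63, 127, 255

def intermediate_hw(byte, j):
    """HW of the value stored after iteration j (assumed: the low j + 1 bits)."""
    return hw(byte & ((1 << (j + 1)) - 1))

# For every iteration j, list which HW classes the nine sets cover.
coverage = {j: sorted({intermediate_hw(v, j) for v in values}) for j in range(8)}
```

Under these assumptions, the nine sets cover all j + 2 HW classes in every window $W_j$, which is consistent with needing only nine ciphertext sets.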

Experimental Results
In power analysis, SNR is an important factor affecting the success rate of the attack. There are many ways to boost the SNR, such as using high-precision probes or analog/digital filters. We averaged multiple repeated measurements as an SNR-boosting technique, which depends on the use of an oscilloscope. The experimental results of our method and some previous implementations are shown in Table 1. In Table 1, FIA refers to fault injection attack, and the results of the EM method are for recovering the secret message bit-by-bit and byte-by-byte, respectively. We repeated the EM experiments in [11] for comparison.
Traces needed in the preprocessing stage: A total of 900 power traces were needed in the preprocessing phase for constructing templates using our method, which was more than the traces needed in the bit-by-bit method of [11] but much less than needed by other methods. Since the preprocessing stage is one-time, the preprocessing cost of our method is acceptable.
Traces needed in the attacking stage: The number of power traces used in the attacking phase was 32, which was slightly larger than the number needed in [12]. We recovered one message byte at a time and CRYSTAL-Kyber has 32 message bytes in total, while [12] divided a single attack trace into 32 sub-segments and performed template matching with 256 templates for each sub-segment to recover the whole message. Our method can also retrieve the entire message from a single power trace as long as templates are constructed for all message bytes at the same time. Although the bit-by-bit method in [11] has an advantage at the preprocessing stage, it needs 256 power traces for attacking, which is eight times as many as our method.
The success rate of recovering message: Without SNR enhancement, the success rate of our method reached 71.4%, and it quickly grew to 96.5% with only four averaged measurements. The final success rate was about 99.6%, as shown in Figure 8. We could achieve a complete message recovery by a brute-force attack on the wrong message bits with a complexity of $2^1$ (256 × 0.004 ≈ 1). The success rate of our message recovery attack was higher than that of [6] and [12] and almost the same as that of [11]. With lower SNR (fewer traces for averaging), our method had a higher success rate than the byte-by-byte method in [11], implying better performance with lower recovery cost. Compared with [11], our method with NICV can focus on each input bit or byte. From [18], we know that NICV = 1/(1 + 1/SNR), indicating that a higher SNR will result in a larger NICV. The closer the value of NICV is to one, the easier it is to implement SCA.
Although we studied the instantiation of CRYSTAL-Kyber, the proposed attack could be applied to other LWE/LWR-based schemes. This is because the Single_Bit_Update vulnerability exists in the vast majority of LWE/LWR-based schemes and most LWE/LWR-based schemes satisfy the cyclic message rotation property, so the construction of special ciphertexts can be realized through this property. Therefore, the message recovery attack proposed in this paper has good generality in terms of LWE/LWR-based schemes.
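The link between averaging, SNR, and NICV quoted above can be illustrated with a small simulation. This sketch rests on assumptions that are not taken from the experiments: the signal is modeled as the HW of a uniformly random byte, the noise is Gaussian with an arbitrary standard deviation, and the function names are hypothetical.

```python
import numpy as np

def snr_after_averaging(n_avg, noise_sigma=2.0, n_traces=20000, seed=0):
    """Estimate the SNR of one leaking sample after averaging n_avg measurements."""
    rng = np.random.default_rng(seed)
    # Assumed signal: Hamming weight of a uniformly random byte.
    signal = rng.binomial(8, 0.5, size=n_traces).astype(float)
    # Averaging n_avg independent noisy measurements reduces the noise variance.
    noise = rng.normal(0.0, noise_sigma, size=(n_traces, n_avg)).mean(axis=1)
    return signal.var() / noise.var()

def nicv_from_snr(snr):
    """Relation quoted in the text: NICV = 1 / (1 + 1 / SNR)."""
    return 1.0 / (1.0 + 1.0 / snr)

snr_by_avg = {n: snr_after_averaging(n) for n in (1, 4, 16)}
nicv_by_avg = {n: nicv_from_snr(s) for n, s in snr_by_avg.items()}

# Arithmetic behind the brute-force remark: expected wrong bits among 256.
expected_wrong_bits = 256 * (1 - 0.996)
```

In this model the SNR grows roughly in proportion to the number of averaged measurements, and the predicted NICV moves toward one accordingly.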

Possible Countermeasures
In the previous sections, we established that the proposed attack is feasible, indicating that countermeasures are needed to prevent similar attacks. We therefore consider possible countermeasures against our attack. These include:
• Masking: Masking splits the secret information into multiple independent variables to achieve security. Masking the decapsulation stage can protect against our attack. However, masking the decapsulation stage may be costly in performance, so low-cost, but efficient, masking strategies are needed.
• Shuffling: Shuffling uses a random permutation of a finite sequence to scramble the order of processing, which removes the linear correlation between the processing sequence and time.
• Dummy Steps or Random Jitter: Adding dummy steps or random jitter disturbs the alignment of PoIs and thus increases the cost of the attack.
• Combination of the above methods: A combination of methods increases the trace requirement for the attack and may result in a better protection effect.
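As an illustration of the shuffling idea only, the sketch below decodes the coefficients in a freshly randomized order while producing the same message bytes. It is a hypothetical model with assumed parameters n = 256 and q = 3329, not a countermeasure evaluated in this paper, and whether shuffling alone is sufficient against the proposed attack is not established here.

```python
import secrets

KYBER_N = 256   # assumed polynomial length
KYBER_Q = 3329  # assumed modulus

def poly2msg_shuffled(coeffs):
    """Decode message bits in a random order drawn anew for every call."""
    order = list(range(KYBER_N))
    # Fisher-Yates shuffle driven by the operating system CSPRNG.
    for idx in range(KYBER_N - 1, 0, -1):
        swap = secrets.randbelow(idx + 1)
        order[idx], order[swap] = order[swap], order[idx]
    msg = bytearray(KYBER_N // 8)
    for pos in order:
        t = (((coeffs[pos] << 1) + KYBER_Q // 2) // KYBER_Q) & 1
        msg[pos >> 3] |= t << (pos & 7)   # same final bytes, different update order
    return bytes(msg)

# Example input: every third coefficient encodes a 1 bit.
coeffs = [(KYBER_Q + 1) // 2 if i % 3 == 0 else 0 for i in range(KYBER_N)]
shuffled_msg = poly2msg_shuffled(coeffs)
```

Because the bits of a byte are no longer written in a fixed sequence, the time window of a given intermediate value changes from run to run.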

Conclusions
This paper proposes a template attack based on cyclic message rotation aimed at message decoding for LWE/LWR-based schemes. We constructed templates for the possible Hamming weight of the intermediate value in decoding during the decapsulation stage and applied cyclic message rotation to construct special ciphertexts to recover the message and shared key, which are suitable for most LWE/LWR-based schemes. We compared our results with other findings in the literature and provided targeted explanations. Our method reduced the power traces used for data preprocessing and needed 32 attack power traces to recover the CRYSTAL-Kyber message. With sufficient SNR, the success rate for recovering the message can reach 99.6%, which is very advantageous for the preprocessing stage and for balancing the success rate and recovery cost.
