Improving the Performance of RLizard on Memory-Constraint IoT Devices with 8-Bit ATmega MCU

Abstract: We propose an improved RLizard implementation method that enables the RLizard key encapsulation mechanism (KEM) to run in a resource-constrained Internet of Things (IoT) environment with an 8-bit microcontroller unit (MCU) and 8–16 KB of SRAM. Existing research has shown that RLizard can function in a relatively high-end IoT environment, but the existing implementation cannot be applied to our environment because of the insufficient SRAM space. We improve the implementation of the RLizard KEM by utilizing electrically erasable, programmable, read-only memory (EEPROM) and flash memory, which all 8-bit ATmega MCUs provide. In addition, to prevent the slowdown that their use would otherwise cause, we improve the multiplication of polynomials by exploiting the special property of the second multiplicand in each algorithm of the RLizard KEM. Thus, we reduce the number of MCU clock cycles required. The results show that, compared to the existing code submitted to the National Institute of Standards and Technology (NIST) PQC standardization competition, the number of required MCU clock cycles is reduced by an average of 52%, and the memory used is reduced by approximately 77%. In this way, we verified that the RLizard KEM works well in our low-end IoT environments.


Introduction
In Internet of Things (IoT) environments, devices utilize cryptographic algorithms to communicate securely with each other. To do this, they must share a common key so that they can perform encryption efficiently with neighboring nodes. Among the established cryptographic algorithms, the key encapsulation mechanism (KEM) enables devices that communicate with each other to generate a shared key. It is suitable for IoT environments because key sharing is possible at low cost.
Owing to recent developments in the field of quantum computing, existing standard encryption algorithms, such as RSA, Diffie-Hellman, and elliptic curve cryptography, are expected to become insecure in the near future. This is because the underlying problems of these algorithms can be solved efficiently using quantum computing [1]. For this reason, we need a safe KEM based on hard problems that are not easily breakable, even with quantum computers.
To find a new cryptographic algorithm that will be used after the advent of quantum computers, the National Institute of Standards and Technology (NIST) is in the process of standardizing post-quantum cryptography (PQC) algorithms. Unfortunately, even though 8-bit microcontroller-based […]
Electronics 2020, 9, 1549
[…] and decryption processes are decreased by 74%, 77%, and 78%, respectively. Further, compared with other KEM algorithms implemented in an 8-bit MCU environment, the proposed method is more efficient both in terms of the execution time and the required SRAM size.
The remainder of this paper is structured as follows. Section 2 explains the prior knowledge required to understand the paper, and Section 3 discusses the related studies. Section 4 introduces the methods proposed for improving the performance of RLizard. In Section 5, the experimental results are compared with those of other KEMs. Finally, the conclusion is presented in Section 6.

Preliminary
In this section, we explain the prior knowledge and notation used throughout this paper. We explain the notation in Section 2.1, and the RLizard KEM is introduced in Section 2.2.

Notation
The notations used throughout the rest of this paper are provided in Table 1.
Table 1. Notation.
D^n: When D is a finite set and n ≥ 1, D^n denotes the n-fold product space of D.
n: Dimension of the LWE samples, a positive integer.
Φ_d(X): For an integer d, the d-th cyclotomic polynomial of degree n = φ(d), where φ(·) is Euler's totient function, which denotes the number of positive integers below the input that are coprime to it. In our implementation, Φ_d(X) is X^n + 1.
R, R_q: Cyclotomic ring and its residue ring modulo an integer q.
[a, b], [a, b), (a, b], (a, b): Indicate the ranges from a to b, a to b−1, a+1 to b, and a+1 to b−1, respectively. We deal with only integers.

RLizard
RLizard KEM is a ring version of Lizard KEM, which is a Korean standard for lattice-based post-quantum cryptography [8]. The security of RLizard is based on the hardness of the RLWE problem [6,7] and the RLWR problem [9], and it is realized by applying the Fujisaki-Okamoto transformation [10] to the RLizard encryption scheme.
RLizard [8] achieves fast encryption by using a deterministic rounding operation in place of error sampling. In addition, the public and secret keys required for the encryption and decryption processes are compact, occupying only a few KB of storage. Because the keys are small, RLizard is suitable for applications involving IoT endpoint devices whose memory (SRAM) is of the order of a few KB. RLizard KEM consists of a setup step and three algorithms. The setup step, which is referred to as RLizard.KEM.Setup, sets up the parameters. The key generation algorithm, which is called RLizard.KEM.KeyGen, generates the key pair of an entity participating in the KEM protocol. The key encapsulation algorithm, which is referred to as RLizard.KEM.Encaps, generates a ciphertext from which the shared key can be extracted if decapsulation is performed with the correct private key. The final decapsulation algorithm, which is referred to as RLizard.KEM.Decaps, decapsulates the ciphertext to extract the shared key.
Algorithms 1-4 present details of the four steps mentioned above.
1: Generate polynomial a ←$ R_q, the first element of pk, and polynomial s ←$ D_s^n, the first element of sk.
2: Generate the second element of sk, k ∈ R_q, through random sampling.

Related Work
This section describes the related studies. We focus on research related to the implementation of PQC algorithms in the IoT environment.
In 2014, the authors in [11] implemented an authentication method in a very limited environment, targeting smart cards, with the NTRU algorithm and the LP-LWE [12] algorithm on ARM7TDMI and AVR ATmega128. However, only lattice-based authentication is covered, not KEM. In 2015, Zhe Liu et al. [13] presented an efficient implementation of Regev's RLWE-based encryption scheme on the ATxmega128A1 processor, which is an 8-bit CPU environment. To accelerate the reduction operation, they applied the shift-add-multiply-subtract-subtract (SAMS2) method, and to minimize the execution time of the discrete Gaussian sampler based on the Knuth-Yao random walk algorithm [14], they applied a byte-scanning technique. However, that work did not address memory optimization, and KEM was not covered. In 2017, Oscar et al. [15] presented a method for improving the performance of the NTRUEncrypt algorithm on ARM Cortex-M0. To speed up polynomial multiplication, which consumes the most CPU cycles among all operations, they applied the product form [16] and obtained faster operation than conventional algorithms. However, an 8-bit MCU environment was not considered. Angshuman et al. [17] implemented both memory-optimized and speed-optimized versions of the SABER scheme in the ARM Cortex-M4 and Cortex-M0 environments. Again, the 8-bit environment was not considered. In 2018, James et al. [18] implemented the FrodoKEM algorithm in the ARM Cortex-M4 environment. By also designing field-programmable gate array (FPGA) implementations for fast calculation, they reduced the required clock cycles to approximately 45% of those of the previous FrodoKEM implementation. However, for the same level of security, approximately 300 million cycles are required to run the algorithms, which is over 90 times that of Kyber [19]. In 2018, Saarinen et al. [20] implemented the Round5 algorithm targeting the embedded environment. However, they only targeted the Cortex-M4 environment, and the 8-bit environment was not considered. In 2019, Cheng et al. [21] implemented a hash function optimized in assembly language for NTRU Prime KEM, which exhibited improved performance in an 8-bit AVR ATmega1284 environment. However, the maximum memory usage of the algorithm is 11,478 bytes, and an environment where the memory size is approximately 8 KB is not considered. In 2020, Shahriar et al. [22] implemented RLWE encryption algorithms on the AVR ATxmega128A1 and ARM Cortex-M0 microprocessors, both limited environments, using the binary Ring-LWE algorithm. However, binary Ring-LWE requires many more bits than Ring-LWE with ternary secrets, and it is therefore not suitable for memory-constrained devices.
Many studies have implemented and optimized KEMs using lattice-based encryption in IoT environments. However, most of them target the 32-bit ARM Cortex-M series and do not consider more limited environments such as 8-bit microprocessors. The market for 8-bit IoT devices is also growing, and improving KEM performance for these devices is therefore important [2]. Consequently, it is necessary to provide an efficient KEM by studying how to increase speed and reduce memory usage in constrained environments.

Proposed Methods
This section shows how to reduce the number of required clock cycles and the memory usage in the ATmega 8-bit environment. To support the 128-bit level of security, we assume that the parameters in Table 2 are used in our implementation.
We first describe the data structure used for multiplication in the target implementation [7]. Because the proposed method uses the same structure, understanding it is essential. Polynomial multiplication is used in the fourth step of the key generation algorithm (Algorithm 2) described in Section 2, the third step of the encapsulation algorithm (Algorithm 3), and the third step of the decapsulation algorithm (Algorithm 4).
In these polynomial multiplications, one polynomial (s in Algorithm 2 and r in Algorithms 3 and 4) has a special form. We explain this case using r. Figure 1 shows r_idx, which is an array representing polynomial r. This array stores the degrees of the terms whose coefficient is 1 starting from the front, in order from the largest to the smallest. Conversely, the degrees of the terms whose coefficient is −1 are stored in the opposite direction, starting from the last element. When the polynomial r is created, the number of terms with coefficient 1 or −1 is fixed at H_r, so the size of the array can be set to H_r. In addition, the index of the array at which the terms with the coefficient −1 start is stored in a variable called neg_start.
In the key generation process, the s_idx array is created for the polynomial s in the same way; the size of the array is fixed to H_s, and a variable called neg_start is set with the same meaning as above.
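The index-array layout described above can be sketched as follows. This is a minimal illustration and not the code of [7]: the toy sizes N_TOY and HR_TOY and the helper name build_r_idx are assumptions made for the example, and the input is assumed to contain exactly HR_TOY nonzero coefficients.

```c
#include <assert.h>
#include <stdint.h>

#define N_TOY  8   /* toy ring dimension, illustrative only */
#define HR_TOY 4   /* toy number of nonzero coefficients, illustrative only */

/*
 * Pack a ternary polynomial r (r[d] is the coefficient of X^d, each in
 * {-1, 0, +1}, exactly HR_TOY of them nonzero) into an index array.
 * Degrees of +1 terms fill r_idx from the front, largest degree first;
 * degrees of -1 terms fill it from the back. Returns neg_start, the
 * index at which the -1 terms begin.
 */
static uint8_t build_r_idx(const int8_t r[N_TOY], uint16_t r_idx[HR_TOY])
{
    uint8_t front = 0;       /* next free slot for a +1 term */
    uint8_t back = HR_TOY;   /* one past the next free slot for a -1 term */

    for (int d = N_TOY - 1; d >= 0; d--) {
        if (r[d] == 1) {
            r_idx[front++] = (uint16_t)d;
        } else if (r[d] == -1) {
            r_idx[--back] = (uint16_t)d;
        }
    }
    assert(front == back);   /* holds when exactly HR_TOY terms are nonzero */
    return back;
}
```

For r = 1 − X^2 + X^5 − X^6, the sketch stores the degrees 5 and 0 at the front and the degrees 2 and 6 at the back, so the terms with coefficient −1 start at index 2.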

Proposed Methods for Improving the Speed of RLizard
In this subsection, we propose two ways to reduce the required MCU clock cycles. Before discussing the details of the description, we explain the meaning of the symbols used in this and subsequent subsections.
a: A polynomial with n coefficients (degree at most n − 1), whose coefficients are sampled from a uniform distribution over the integers between −q/2 + 1 and q/2 (inclusive).
c1: An array representing the polynomial that holds the result of a × r. Depending on the algorithm used, it ends up as a ciphertext or as information used to verify the ciphertext in decapsulation.
The above two arrays store coefficients as values, and the index is the degree. Thus, their size is n.
H_r: The number of nonzero terms of r; this is HR in our code.
The "original algorithm" in Figure 2 represents the implementation of [7], and the "improved algorithm" depicts the proposed implementation. Because the previous implementation did not use neg_start, it was not known in advance whether the branch variable would hold 1 or −1. Therefore, every coefficient was multiplied by the branch value, as in the code of the existing implementation. We eliminate this multiplication by dividing the loop over the terms of r into two and replacing the branch value with a constant: with neg_start, the sign is known before the inner loop starts. This removes n × H_r multiplications, which amounts to 131,072 multiplication operations for the parameters described above. Moreover, because a multiplication generally requires three times as many cycles as an addition, the effect of removing them is very large.
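The first method can be illustrated with a small sketch in the ring Z[X]/(X^n + 1). It is written from the description above and is not taken from [7]: the function names, the toy sizes, and the way the sign is obtained in the multiplying variant are assumptions. Coefficients are kept in uint16_t and wrap modulo 2^16, which stands in for the reduction modulo q.

```c
#include <assert.h>
#include <stdint.h>

#define N_TOY  8   /* toy ring dimension (the text uses n = 1024) */
#define HR_TOY 4   /* toy number of nonzero terms of r */

/* Multiplying variant: one multiplication by the sign per inner iteration. */
static void mul_with_sign(const uint16_t a[N_TOY], const uint16_t r_idx[HR_TOY],
                          uint8_t neg_start, uint16_t c1[N_TOY])
{
    assert(neg_start <= HR_TOY);
    for (int k = 0; k < N_TOY; k++) c1[k] = 0;
    for (int i = 0; i < HR_TOY; i++) {
        int sign = (i < neg_start) ? 1 : -1;   /* stands in for the branch value */
        int deg = r_idx[i];
        for (int j = 0; j < N_TOY - deg; j++)          /* no wrap-around */
            c1[deg + j] = (uint16_t)(c1[deg + j] + sign * a[j]);
        for (int j = N_TOY - deg; j < N_TOY; j++)      /* X^n = -1 flips the sign */
            c1[deg + j - N_TOY] = (uint16_t)(c1[deg + j - N_TOY] - sign * a[j]);
    }
}

/* Split variant: the sign is a compile-time constant in each outer loop. */
static void mul_split(const uint16_t a[N_TOY], const uint16_t r_idx[HR_TOY],
                      uint8_t neg_start, uint16_t c1[N_TOY])
{
    assert(neg_start <= HR_TOY);
    for (int k = 0; k < N_TOY; k++) c1[k] = 0;
    for (int i = 0; i < neg_start; i++) {              /* terms with coefficient +1 */
        int deg = r_idx[i];
        for (int j = 0; j < N_TOY - deg; j++) c1[deg + j] += a[j];
        for (int j = N_TOY - deg; j < N_TOY; j++) c1[deg + j - N_TOY] -= a[j];
    }
    for (int i = neg_start; i < HR_TOY; i++) {         /* terms with coefficient -1 */
        int deg = r_idx[i];
        for (int j = 0; j < N_TOY - deg; j++) c1[deg + j] -= a[j];
        for (int j = N_TOY - deg; j < N_TOY; j++) c1[deg + j - N_TOY] += a[j];
    }
}
```

Both functions compute the same product; the split variant simply trades a little code size for the removal of the per-iteration multiplication.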
Here, we describe the second method. As can be seen in lines 2 to 7 of the "improved algorithm" in Figure 2, there are two inner loops: the first executes the 4th-5th lines with the index value iterating over the range [0, n−deg), and the second executes the 6th-7th lines with the index value iterating over the range [n−deg, n).
Of the two inner loops, we shorten the longer one by the number of iterations of the shorter one. Then, we add the body of the longer loop to the shorter loop to compensate for the iterations removed from the longer loop. Because the bodies of both loops are small, this works well without consuming a significant amount of memory. Figure 3 describes how the code changes from the form obtained with the first proposed method to the form obtained with the second proposed method, while preserving its function.
First, part (1.a) in the figure is divided into parts (2.a) and (2.b) on the one hand and part (2.c) on the other, so that the two cases can be processed differently depending on whether deg is less than or equal to n/2. This distinction matters because which of the two inner loops in (1.a) has more iterations depends on the value of deg.
For convenience, we focus on parts (2.a) and (2.b) of the figure and explain only the case where deg is less than or equal to n/2. The loop in (2.a) can be further divided into two loops, i.e., (3.a) and (3.b). We then transform the loop (3.b) into (3.c), preserving its function by simply modifying the range of the iteration variable and adjusting the formulas used as the indices of arrays c1 and a. Finally, (4.b) is constructed by adding the body of (3.c) to the body of the existing (2.b) loop. In conclusion, (4.a) and (4.b) perform the same operation as (2.a) and (2.b), but the number of loop iterations is decreased by deg.
Based on the above improvement, if deg is less than n/2, the inner loops (4.a) and (4.b), which together iterate n times in the original implementation, iterate only n − deg times; if deg is greater than n/2, the inner loops iterate deg times, as in (4.c). Because deg is selected from a uniform distribution over [0, n−1] and thus has an average value of about n/2, the average number of executions of the inner loop bodies is reduced to (3n/4) × H_r, considering the outer loop. By reducing the number of iterations in this way, we speed up the multiplication of polynomials.
Algorithm 5 describes the final version of the proposed implementation. All of the ideas for computational efficiency are applied to the pseudocode given in Algorithm 5. The r_idx array and the variable neg_start are explained in Section 4.1.
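The 3n/4 average stated above can be checked with a short calculation. The sketch assumes that n is even, that deg is uniform over {0, …, n−1}, and that a term of degree deg costs max(deg, n − deg) inner iterations, as described for the two cases.

```latex
\[
\mathbb{E}\bigl[\max(\mathit{deg},\, n-\mathit{deg})\bigr]
  = \frac{1}{n}\left(\sum_{d=0}^{n/2} (n-d) \;+\; \sum_{d=n/2+1}^{n-1} d\right)
  = \frac{1}{n}\left(\frac{3n(n+2)}{8} + \frac{3n(n-2)}{8}\right)
  = \frac{3n}{4}.
\]
```

Multiplying by the H_r iterations of the outer loop gives (3n/4) × H_r. For n = 1024 this is 768 inner iterations per term on average, compared with 1024 before the second method is applied.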
The algorithm consists of nested loops. By applying the idea of Figure 2, the multiplication is eliminated because the coefficient of polynomial r is 1 in lines 2-15 and −1 in lines 16-29. The outer loop is divided based on the coefficient of r, and the degree of the current term of r is read in lines 3 and 17 and saved in deg. In addition, by applying the idea of Figure 3, some of the computations of the longer loop are moved into the shorter loop. Thus, the polynomial multiplication is performed in the inner loops in lines 5-9, 11-15, 19-23, and 25-29. As a result, the product of the polynomials computed by the nested loops is stored in the c1 array.
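A compact C sketch that combines both ideas is given below. It follows the description above rather than the exact line structure of Algorithm 5, so the helper names add_term, sub_term, and mul_improved, the toy sizes, and the use of helper functions instead of fully inlined loops are all assumptions. Arithmetic wraps modulo 2^16 in place of the reduction modulo q.

```c
#include <assert.h>
#include <stdint.h>

#define N_TOY  8   /* toy ring dimension, illustrative only */
#define HR_TOY 4   /* toy number of nonzero terms of r */

/* c1 += a * X^deg in Z[X]/(X^n + 1), using max(deg, n - deg) inner iterations. */
static void add_term(const uint16_t a[N_TOY], int deg, uint16_t c1[N_TOY])
{
    if (2 * deg <= N_TOY) {
        for (int j = 0; j < N_TOY - 2 * deg; j++)
            c1[deg + j] += a[j];               /* shortened non-wrapping loop */
        for (int j = N_TOY - deg; j < N_TOY; j++) {
            c1[j] += a[j - deg];               /* rest of the non-wrapping part */
            c1[j + deg - N_TOY] -= a[j];       /* wrapped part, X^n = -1 */
        }
    } else {
        for (int j = 0; j < N_TOY - deg; j++) {
            c1[deg + j] += a[j];               /* non-wrapping part */
            c1[j] -= a[j + N_TOY - deg];       /* start of the wrapped part */
        }
        for (int j = 2 * (N_TOY - deg); j < N_TOY; j++)
            c1[j + deg - N_TOY] -= a[j];       /* shortened wrapped loop */
    }
}

/* c1 -= a * X^deg: the same loops with the signs exchanged. */
static void sub_term(const uint16_t a[N_TOY], int deg, uint16_t c1[N_TOY])
{
    if (2 * deg <= N_TOY) {
        for (int j = 0; j < N_TOY - 2 * deg; j++)
            c1[deg + j] -= a[j];
        for (int j = N_TOY - deg; j < N_TOY; j++) {
            c1[j] -= a[j - deg];
            c1[j + deg - N_TOY] += a[j];
        }
    } else {
        for (int j = 0; j < N_TOY - deg; j++) {
            c1[deg + j] -= a[j];
            c1[j] += a[j + N_TOY - deg];
        }
        for (int j = 2 * (N_TOY - deg); j < N_TOY; j++)
            c1[j + deg - N_TOY] += a[j];
    }
}

/* c1 = a * r, with r given by r_idx and neg_start. */
static void mul_improved(const uint16_t a[N_TOY], const uint16_t r_idx[HR_TOY],
                         uint8_t neg_start, uint16_t c1[N_TOY])
{
    assert(neg_start <= HR_TOY);
    for (int k = 0; k < N_TOY; k++) c1[k] = 0;
    for (int i = 0; i < neg_start; i++)        /* coefficient +1 */
        add_term(a, r_idx[i], c1);
    for (int i = neg_start; i < HR_TOY; i++)   /* coefficient -1 */
        sub_term(a, r_idx[i], c1);
}
```

Each merged loop body performs two updates per iteration, so the amount of arithmetic is unchanged; the saving comes from the smaller number of loop-control steps (counter update, comparison, branch).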

SRAM Usage Improvement in 8-Bit ATmega Environment
The RLizard [7] implementation submitted to NIST uses up to 22 KB of memory. This is not a problem when running in a desktop environment. However, if we only have a few KB of SRAM, the code cannot be executed as is. Therefore, it is important to secure more SRAM space that can be used while running RLizard in order for it to run in a constrained environment where the SRAM size is about 8-16 KB.

Algorithm 5. The proposed implementation.
Description: Multiplication of two polynomials a, which is stored in the array a, and r, which is stored in the array r_idx. Please refer to Figure 1 to obtain the information about r_idx.
Input: the arrays a and r_idx
Output: the array c1 that contains the result of multiplication.
Procedure:
01: for i := 0 to neg_start − 1 // Refer to Figure 1
The pk generated in the key generation process of RLizard KEM (Algorithm 2) is used to generate a ciphertext containing the shared key in the key encapsulation process of RLizard KEM (Algorithm 3). pk consists of two polynomials (a, b), and each of their coefficients has a length of 9 bits. Therefore, when n = 1024, pk occupies slightly more than 2 KB of memory. This is very large for an environment where the SRAM size is assumed to be in the range of 8-16 KB. Fortunately, we have found that all ATmega 8-bit environments have EEPROMs of sufficient size. Table 3 presents a list of the products mentioned [23].
Table 3. 8-bit AVR products whose SRAM size is 8-16 KB.

We store pk in the EEPROM and read it from there. Because the value of pk does not change during the execution of the algorithm, it is well suited to EEPROM, whose write operations are very slow. In this way, we free 2 KB or more of additional SRAM.
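Keeping pk in EEPROM could look like the sketch below. It uses the avr-libc EEPROM interface (EEMEM, eeprom_update_block, eeprom_read_block) when compiled for AVR and simple memcpy stand-ins elsewhere so that it can also be tried on a host. The packed 9-bit layout, the names LWE_N, pk_store, and pk_load_chunk, and the idea of reading pk in small chunks are assumptions of this example; the text does not state how the authors lay out or access pk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __AVR__
#include <avr/eeprom.h>
#else
/* Host-side stand-ins so that the sketch also compiles off-target. */
#define EEMEM
static void eeprom_update_block(const void *src, void *dst, size_t n) { memcpy(dst, src, n); }
static void eeprom_read_block(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
#endif

#define LWE_N      1024u                        /* n, as stated in the text */
#define COEF_BITS  9u                           /* bits per pk coefficient */
#define POLY_BYTES (LWE_N * COEF_BITS / 8u)     /* 1152 bytes per polynomial */
#define PK_BYTES   (2u * POLY_BYTES)            /* a and b together: 2304 bytes */

static uint8_t pk_eeprom[PK_BYTES] EEMEM;       /* placed in EEPROM, not in SRAM */

/* Write the packed public key once, for example after key generation. */
static void pk_store(const uint8_t pk[PK_BYTES])
{
    eeprom_update_block(pk, pk_eeprom, PK_BYTES);
}

/* Copy len bytes of pk, starting at offset, into a small SRAM buffer. */
static void pk_load_chunk(uint16_t offset, uint8_t *buf, uint16_t len)
{
    assert((uint32_t)offset + len <= PK_BYTES);
    eeprom_read_block(buf, pk_eeprom + offset, len);
}
```

With these assumed sizes, the two polynomials occupy 2 × 1024 × 9 / 8 = 2304 bytes, which matches the "slightly more than 2 KB" given above. Only the chunk buffer has to reside in SRAM while pk is being used.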
In addition, 640 bytes of memory were saved by storing all constants used in the implementation in flash memory. Furthermore, by optimizing the bit length of the random seed used in the Gaussian sampling employed in [4], a total of 2 KB of SRAM usage can be saved.
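Placing constants in flash on AVR is commonly done with the avr-libc PROGMEM attribute and the pgm_read_* accessors, as in the sketch below. The table demo_table, its values, and the function demo_lookup are hypothetical and are not the constants of the RLizard implementation; the non-AVR branch only provides stand-ins so that the example compiles on a host.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __AVR__
#include <avr/pgmspace.h>
#else
/* Host-side stand-ins: ordinary memory plays the role of flash. */
#define PROGMEM
#define pgm_read_word(addr) (*(const uint16_t *)(addr))
#endif

/* Hypothetical constant table kept in flash instead of SRAM. */
static const uint16_t demo_table[4] PROGMEM = { 3, 17, 112, 509 };

/* Read one entry; on AVR this goes through the program-memory accessor. */
static uint16_t demo_lookup(uint8_t i)
{
    assert(i < 4);
    return pgm_read_word(&demo_table[i]);
}
```

On AVR, a table declared without PROGMEM is copied into SRAM at startup, so moving read-only tables to flash frees the corresponding SRAM at the cost of a slightly slower read.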
In conclusion, we reduced the SRAM required to run the proposed method to 6576 bytes. Therefore, RLizard can work well in the 8-16 KB SRAM environment that we target.

Performance Evaluation
We analyzed the performance of the proposed implementation. We used ATmega2560 [24,25], which is an 8-bit CPU environment, as the implementation environment, to prove the suitability of RLizard in a more restrictive environment, unlike [6], where the performance was evaluated on 32-bit Cortex-M3. To obtain the required clock cycles correctly, we ran each algorithm 10,000 times, and then averaged their required clock cycles. In addition, the maximum usage of SRAM was also measured.
We compared our implementation with that submitted to the NIST PQC competition [7]. Unfortunately, because of its high SRAM usage, it did not work in our environment. Therefore, the performance and SRAM usage of the existing implementation were measured in another environment, a 32-bit Cortex-M0+. The details of the environment used for the performance analysis are shown in Table 4. The performance analysis results are shown in Figure 4: compared to the implementation in [7], the proposed implementation requires 39%, 55%, and 17% fewer MCU clock cycles in key generation, encapsulation, and decapsulation, respectively. As shown in Figure 5, the maximum SRAM usage is decreased in the proposed implementation to only 6248 bytes, 6576 bytes, and 6462 bytes in key generation, encapsulation, and decapsulation, respectively. The required SRAM is small enough for use even in environments where the SRAM size is 8 KB.
Table 5 shows the comparison with the implementations of other KEMs in an 8-bit MCU environment. As shown in the table, our implementation performs best compared with those in [11,21] in terms of both the required clock cycles and the SRAM usage. In addition, the implementations in [11,21] cannot be used in our target environment, whose SRAM size is 8-16 KB, because the required SRAM is much greater than 8 KB. The execution times of the improved algorithms are 118.0 ms (KeyGen), 117.8 ms (Enc.), and 377.8 ms (Dec.) in the same environment as Figure 4. Since they are all less than a second, the computation time seems tolerable.

Conclusions
With the advent of the IoT era, there is a critical need to address security issues. In this study, we propose some methods with which IoT devices with a small amount of computational power and available SRAM can use KEM for data security. Focusing on the fact that 8-bit ATmega MCUs have an EEPROM of sufficient size, we proposed a method that allows the RLizard KEM to operate even in low-spec IoT devices with an SRAM size of 8-16 KB by maximizing the use of EEPROM and flash memory. In addition, the execution time of the RLizard algorithm has been improved to overcome the performance limitations of low-end IoT MCUs. Furthermore, by performing experiments, it was confirmed that the proposed method works efficiently in an environment with 8 KB SRAM. We hope that the results of this study can contribute to improving the security of low-cost IoT devices.

Conflicts of Interest:
The authors declare no conflict of interest.