Efficient Parallel Implementation of CTR Mode of ARX-Based Block Ciphers on ARMv8 Microcontrollers

With the advancement of 5G mobile telecommunication, various IoT (Internet of Things) devices communicate massive amounts of data over wireless networks. Since this wireless communication is vulnerable to attackers through data leakage, the transmitted data should be encrypted with block ciphers. In addition, to encrypt massive amounts of data securely, it is essential to apply a secure mode of operation; among these, CTR (CounTeR) mode is the most widely used in industrial applications. However, IoT devices have limited computing and memory resources compared to typical computers, making it challenging to process computation-intensive cryptographic algorithms on them at high speed. Thus, cryptographic algorithms must be optimized for IoT devices. In other words, optimizing cryptographic operations on these devices is a basic and essential effort in building secure IoT-based service systems. Even though several ARX (Add-Rotate-XOR)-based ciphers have been proposed for efficient encryption on IoT devices, it is still necessary to improve encryption performance for smooth and secure IoT services. In this article, we propose the first parallel implementations of CTR mode of ARX-based ciphers: LEA (Lightweight Encryption Algorithm), HIGHT (high security and light weight), and revised CHAM on the ARMv8 platform, a popular microcontroller in various IoT applications. For the parallel implementation, we propose an efficient data parallelism technique and register scheduling that maximizes the usage of vector registers. Through the proposed techniques, we process the maximum number of encryptions simultaneously by utilizing all vector registers. Namely, in the case of HIGHT and revised CHAM-64/128 (resp. LEA, revised CHAM-128/128, and CHAM-128/256), we execute 48 (resp. 24) encryptions simultaneously.
In addition, we optimize the CTR mode process by pre-computing and reusing the intermediate values of the initial rounds, exploiting the property that the nonce part of the CTR mode input is fixed across encryptions. Through the pre-computation table, CTR mode is optimized up until round 4 in LEA, round 5 in HIGHT, and round 7 in revised CHAM. With the proposed parallel processing technique, our software provides performance improvements of 3.09%, 5.26%, and 9.52% in LEA, HIGHT, and revised CHAM-64/128, respectively, compared to the existing parallel works on ARM-based MCUs. Furthermore, with the proposed CTR mode optimization technique, our software provides further improvements of 8.76%, 8.62%, and 15.87% in LEA-CTR, HIGHT-CTR, and revised CHAM-CTR, respectively. To the best of our knowledge, this work is the fastest implementation of CTR mode on the ARMv8 architecture.


Introduction
Since the fourth industrial revolution, IoT (Internet of Things) devices have been developing into 'intelligent IoT' in various applications through the convergence of technologies such as big data, artificial intelligence, and cloud services. IoT devices have expanded beyond their previously limited functions and communicate massive amounts of data with other IoT devices and servers. However, a large portion of this data communication travels over wireless networks that are exposed to attackers, so the transmitted data must be protected. Our contributions are as follows:

1. First parallel implementation of CTR mode on embedded devices using the ARMv8 architecture

Until now, CTR mode optimization has been conducted only on the 8-bit AVR, Intel Core i2, and Intel Core i7 [4-8]. However, there are many types of IoT platforms. Among them, ARMv8 is widely used in various IoT devices, and it supports a NEON engine capable of efficient parallel processing. Therefore, we present the first parallel implementation of CTR mode with ARX-based block ciphers on the ARMv8 architecture. For the parallel implementation, we present not only efficient data parallelism, but also register scheduling that maximizes the use of vector registers to process multiple encryptions simultaneously. Finally, we propose the parallel implementation of CTR mode by applying the proposed parallelism technique to the CTR mode optimization, which pre-computes the initial few rounds using the fixed nonce of the input block. The proposed parallel implementation of CTR mode can easily be applied to other parallel environments such as Single Instruction Multiple Data (SIMD) and Advanced Vector Extensions (AVX2).

Proposing an Efficient Data Parallelism Technique for ARX-Based Block Ciphers
We propose parallel implementations of ARX-based block ciphers (LEA, HIGHT, and revised CHAM) by utilizing the NEON engine of the ARMv8 architecture. The proposed parallel technique is more efficient than the existing data-parallel processing techniques in [9,10]. In LEA and revised CHAM, we eliminate the transpose operations required for data parallelism through the LD4 and ST4 instructions when loading data from memory into four vector registers and storing data from four vector registers back to memory. In HIGHT, since round operations are processed in 8-bit units, it is difficult to apply data parallelism without additional cost; thus, we present an optimized transpose operation for data parallelism in HIGHT. Furthermore, to perform as many encryptions as possible simultaneously, we present an efficient vector register scheduling. Through the proposed data parallelism techniques, 24 encryptions are performed simultaneously in LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions are performed simultaneously in HIGHT and revised CHAM-64/128. In the case of HIGHT and revised CHAM, more encryptions are performed simultaneously than in the previous works. As a result, the proposed data parallelism outperforms the previous works by 3.09%, 5.26%, and 9.52%, respectively. The proposed data parallelism techniques can also be applied to other lightweight ciphers such as SIMECK [11] and SKINNY [12].

Presenting the first parallel implementation of CTR mode
We apply the proposed parallel techniques to the CTR mode implementation. The existing works on CTR optimization were conducted on the 8-bit AVR MCU [6,7]. They utilized the property of CTR mode that the input block consists of a nonce part and a counter part, where the nonce part stays the same throughout encryption. Thus, the initial few rounds that depend only on the fixed nonce part can be precomputed. In other words, by precomputing the operations related to the nonce part, performance is improved by using precomputed values instead of computing the round operations up to round 4 in LEA, round 5 in HIGHT, and round 7 in revised CHAM. We extend this optimization concept of the existing works to the proposed parallel implementation of CTR mode. Moreover, in the case of LEA, the input position of the nonce in CTR mode is not fixed; by changing the position of the nonce used in the existing CTR mode optimization, we can precompute one more round than the previous work [7]. In addition, with the proposed data parallelism, the maximum number of encryptions is performed simultaneously in CTR mode, the same number as in the plain data-parallel implementation. Through the parallel implementation of CTR mode optimization, we achieve performance improvements of 8.76%, 8.62%, and 15.87% in LEA-CTR, HIGHT-CTR, and revised CHAM-CTR, respectively, compared with the previous works.
The rest of this paper is structured as follows. Section 2 explains the NEON engine and the target block ciphers. Section 3 reviews existing studies on the NEON engine and CTR mode optimization. Section 4 presents the efficient data parallelism and the parallel implementation of CTR mode optimization. Section 5 compares the performance of the proposed implementations with previous works. Finally, Section 6 concludes this paper.

NEON Engine and ASIMD Instructions
Starting from the ARMv7 architecture, the NEON engine, a parallel processing unit, has been supported to maximize performance. The NEON engine in ARMv7 provides 16 vector registers (q0-q15) of 128-bit width for parallel processing, whereas ARMv8, the target architecture of this paper, provides 32 128-bit vector registers (v0-v31). In addition, the NEON engine can process data in parallel in 64-, 32-, 16-, and 8-bit units within a 128-bit vector register. Table 1 shows the Advanced SIMD (ASIMD) instructions used in this paper and the clock cycles of each instruction. Since the target block ciphers are based on the ARX structure, ASIMD instructions for addition, rotation, and XOR are required. The ADD instruction processes additions in parallel per lane. The SRI and SHL instructions together process rotate operations in parallel and are commonly used to implement rotations on the NEON engine. REV16 can process a ROL8 operation on 16-bit units more efficiently than the SRI and SHL pair. The EOR instruction processes XOR operations in parallel per lane. The TRN1 and TRN2 instructions efficiently transpose vector registers; their operation is shown in Figure 1. ST4 and LD4 store data from four vector registers to memory and load data from memory into four vector registers, respectively, applying a transpose (de-interleave) automatically. Using this property, we present a method to apply data parallelism without additional cost via the ST4 and LD4 instructions in the ARX-based ciphers LEA and revised CHAM, but not HIGHT. As Table 1 shows, memory access operations require more cycles than simple register operations. Table 1. Summary of ASIMD instructions used in this paper [13].
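As a concrete illustration of how the SHL/SRI pair composes a rotation, the following scalar C sketch models the two instructions on a single 32-bit lane (the function names are ours, not part of any API):

```c
#include <stdint.h>

/* Scalar model of how the NEON SHL + SRI pair realizes a left rotation.
 * SHL writes x << n into the destination; SRI then shifts x right by
 * (32 - n) and inserts the result into the vacated low bits. */
static uint32_t rol32(uint32_t x, unsigned n) {
    uint32_t shl = x << n;          /* SHL v_dst, v_src, #n      */
    uint32_t sri = x >> (32u - n);  /* SRI v_dst, v_src, #(32-n) */
    return shl | sri;
}

/* A right rotation is the same composition with the amounts swapped. */
static uint32_t ror32(uint32_t x, unsigned n) {
    return rol32(x, 32u - n);
}
```

On the NEON engine, each of these C statements corresponds to one vector instruction applied to every lane at once, which is why a rotation costs two instructions per register unless a special case such as REV16 applies.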

LEA Block Cipher

LEA is an ARX-based block cipher that provides high-speed encryption on common software platforms. The LEA structure not only provides high-speed encryption with a small code size, but also effectively resists well-known attacks on block ciphers, such as differential cryptanalysis and linear cryptanalysis. Furthermore, LEA was included in ISO/IEC 29192-2 [14] in 2019, which attests to its security and efficiency. It is classified into three variants according to its parameters, as shown in Table 2. LEA supports key lengths of 128, 192, and 256 bits and consists of 24, 28, and 32 rounds, respectively, depending on the key length. The key schedule of LEA uses constant values called delta, derived from the square root of 766965, the concatenation of the ASCII codes of 'L', 'E', and 'A'. The delta constants efficiently generate round keys through ROL operations and modular addition by 2^32. Each round function of LEA requires a total of six 32-bit round keys; only LEA-128 requires four 32-bit round keys. The round function of LEA computes in 32-bit units, which suits the 32-bit and 64-bit processors widely used today, and consists of ARX operations (Addition, Rotation, and XOR). Figure 2 gives the round function of LEA. It is composed of six XOR operations, three modular additions by 2^32, and three rotate operations (ROR3, ROR5, and ROL9). After repeating this round function for the required number of rounds, the encryption of LEA is complete.

HIGHT Block Cipher

In CHES'06 [2], Hong et al. proposed the HIGHT block cipher to provide confidentiality in environments where only limited computation and power are available, such as RFID tags and sensors. HIGHT is a lightweight block cipher with a Feistel structure that encrypts a 64-bit block under a 128-bit secret key.
HIGHT's round function is composed of lightweight arithmetic such as modular addition by 2^8, rotations (the F0 and F1 functions), and XOR operations in 8-bit units, so it is well suited to low-end processors with limited resources. Furthermore, HIGHT was included in ISO/IEC 18033-3 [15] in 2010. In the key schedule of HIGHT, the delta value is updated through a Linear-Feedback Shift Register (LFSR), and subkeys are generated by modular addition by 2^8 of one byte of the master key and the delta value. The LFSR uses the connection polynomial x^7 + x^3 + 1, so the delta has a period of 127. By repeating this process, a total of 128 one-byte subkeys are generated from the delta values. In addition to the 128 subkeys, an 8-byte part of the master key is required for the encryption process. The encryption process of HIGHT is composed of an initial transformation, the round functions, and a final transformation. The initial and final transformations perform byte-wise XOR and modular addition by 2^8. The round function performs the F0 and F1 functions, XOR, and modular addition by 2^8. The F0 and F1 functions, which have the largest computational overhead in the round function, perform rotate and XOR operations. After each round, the result words are rotated one byte to the left, except in the last round. Figure 3 shows the round function of HIGHT. A total of 32 round functions are performed to encrypt a 64-bit plaintext, after which the final transformation is applied to the result; the encryption of HIGHT is then complete.
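For reference, the two auxiliary functions can be written as the following scalar C sketch (rotation amounts per the HIGHT specification; the helper names are illustrative):

```c
#include <stdint.h>

/* 8-bit left rotation used inside HIGHT's auxiliary functions. */
static uint8_t rol8(uint8_t x, unsigned n) {
    return (uint8_t)((x << n) | (x >> (8u - n)));
}

/* Auxiliary functions of the HIGHT round function: each one XORs
 * three fixed rotations of a single byte. */
static uint8_t F0(uint8_t x) { return rol8(x, 1) ^ rol8(x, 2) ^ rol8(x, 7); }
static uint8_t F1(uint8_t x) { return rol8(x, 3) ^ rol8(x, 4) ^ rol8(x, 6); }
```

Because each function is three rotations and two XORs per byte, F0 and F1 dominate the round cost, which is why the parallel implementation later spends most of its instruction budget on them.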

Revised CHAM Block Cipher
In ICISC'17 [16], the CHAM family was proposed for efficient encryption in resource-constrained environments such as 8-bit and 16-bit processors. The CHAM family is a set of ARX-based ultra-lightweight block ciphers that use no S-box. The key schedule of CHAM does not update round keys across rounds, effectively reducing the memory needed to store them. Furthermore, the CHAM family has been shown to be more efficient than other lightweight block ciphers in both H/W and S/W implementations. In H/W, it can be implemented with a smaller area than the SIMON [17] block cipher, which shows the best H/W performance; in S/W, it is more efficient than the SPECK [17] block cipher, which shows the best S/W performance. The CHAM family comprises three block ciphers according to its parameters, as shown in Table 3. The key schedule of CHAM is composed of ROL1, ROL8, ROL11, and XOR operations. Through these operations, 2k/w round keys are generated, far fewer than the number of rounds of each CHAM block cipher. The round function of CHAM consists of modular addition, rotations (ROL1 and ROL8), and XOR operations, performed on only one word per round. Thus, its round function uses very light operations compared to other lightweight block ciphers. Figure 4 shows the round function of CHAM. Recently, in ICISC'19 [3], the designers of the original CHAM revealed new differential characteristics of the existing CHAM family using a SAT solver, and proposed a revised CHAM family to restore a sufficient security margin. Using the SAT solver, differential characteristics of the existing CHAM-64/128 and CHAM-128/k were found for 39 and 62 rounds, respectively, and related-key differential characteristics up to round 47 were found in the CHAM family.
The revised CHAM family is otherwise identical to the original, but, to provide a sufficient margin against these differential characteristics, the number of rounds is increased from 80 to 88 in CHAM-64/128, from 80 to 112 in CHAM-128/128, and from 96 to 120 in CHAM-128/256. Despite the increased number of rounds, the H/W implementation area of the revised CHAM family remains the same as that of the existing CHAM family, and its S/W performance remains comparable to SPECK, which is known to have the best performance in S/W.
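To make the round structure concrete, a scalar C sketch of one revised CHAM-64/128 round is given below. The naming is ours, and rounds are counted from 0 here, so the paper's 1-based odd rounds correspond to even `i` in this sketch:

```c
#include <stdint.h>

static uint16_t rol16(uint16_t x, unsigned n) {
    return (uint16_t)((x << n) | (x >> (16u - n)));
}

/* One round of (revised) CHAM-64/128 on 16-bit words, per the CHAM
 * specification: rounds with even i use ROL1 inside and ROL8 outside,
 * rounds with odd i swap the two rotation amounts.  `i` is the round
 * counter XORed into the first word, `rk` the round key of this round. */
static void cham64_round(uint16_t x[4], uint16_t rk, unsigned i) {
    uint16_t t;
    if ((i & 1) == 0)
        t = rol16((uint16_t)((x[0] ^ i) + (rol16(x[1], 1) ^ rk)), 8);
    else
        t = rol16((uint16_t)((x[0] ^ i) + (rol16(x[1], 8) ^ rk)), 1);
    x[0] = x[1]; x[1] = x[2]; x[2] = x[3]; x[3] = t;  /* rotate state words */
}
```

Only `x[0]` and `x[1]` are touched per round; the other two words simply shift positions, which is what makes the per-round cost of CHAM so low.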

Related Works: Parallel Implementation of Block Ciphers on the NEON Engine and CTR Mode Optimization
In this section, we describe related work on parallel implementation using the NEON engine and on CTR mode optimization. The ARMv7 architecture, in addition to the ARM core, supports the NEON engine, which can process data in parallel in 64-, 32-, 16-, and 8-bit lanes within a 128-bit vector register. Until now, several works have optimized various block ciphers in parallel with the NEON engine on the ARMv7 and ARMv8 architectures [9,10,18-20]. Table 4 shows the number of simultaneously performed encryptions on the NEON engine in these related works. In WISA'16 [18], taking advantage of the independent ARM and NEON execution units, the authors proposed an optimization that hides the cycles of ARM instructions under the cycles of NEON instructions by interleaving the two instruction streams. On the NEON engine, 12 encryptions were performed simultaneously using 16 vector registers, and one further encryption was performed efficiently on the ARM core using the barrel shifter. For the parallel implementation, the authors presented the overall process on the NEON engine, which shows that a transpose operation was required for parallel processing. Furthermore, using OpenMP, the authors expanded the single-core optimization to four cores so that a total of 52 encryptions were performed simultaneously. In WISA'18 [19], the authors applied the above optimization method to CHAM-64/128. Taking advantage of the fact that CHAM-64/128 uses 16-bit words, the proposed method processed 24 encryptions simultaneously on the NEON engine and four encryptions simultaneously on the ARM core. As before, the NEON parallel process required the transpose operation.
Furthermore, the authors interleaved ARM and NEON instructions in CHAM-64/128 as in the above optimization method, and used multiple cores through OpenMP. Recently, various results have been introduced for the ARMv8 architecture, the latest version of the ARM core. Unlike ARMv7, the ARMv8 architecture supports 32 rather than 16 128-bit vector registers. In the Journal of the Korea Institute of Information and Communication Engineering'17 [9], the author presented the first efficient parallel implementation of LEA utilizing the NEON engine on the ARMv8 architecture. The proposed data parallelism processed 24 encryptions simultaneously, but required the transpose operation in the same way as the above methods. In ICISC'19 [20], the authors optimized AES using the ASIMD instruction set on ARMv8 and introduced a 4-way transposed MixColumns to process MixColumns efficiently over four encryptions, achieving improved MixColumns performance. In IEEE Access'20 [10], the authors proposed secure and fast implementations of HIGHT and CHAM utilizing the NEON engine on ARMv8 platforms. In the fast implementations of HIGHT and revised CHAM, data and task parallelism were effectively applied to process multiple operations and data simultaneously. Through the proposed data parallelism, 24 encryptions were performed simultaneously in HIGHT, and 16, 10, and 8 encryptions were performed simultaneously in revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively. However, the authors did not maximize the vector register scheduling, so the maximum number of encryptions was not processed simultaneously.
In MDPI Electronics'20 [6], Kwon et al. proposed the first CHAM-CTR mode optimization on an 8-bit AVR processor. The authors utilized the property of CTR mode that the nonce part is always the same during encryption. Through this property, they optimized the round operations up until round 7 in CHAM-CTR. The overall process of the proposed CHAM-CTR mode [6] is shown in Figure 5. In MDPI Electronics'20 [7], Kim et al. proposed the first LEA-CTR and HIGHT-CTR mode optimizations on an 8-bit AVR processor. Utilizing the same CTR mode property, the authors optimized the round operations up until round 3 in LEA-CTR and round 5 in HIGHT-CTR. The overall process of the HIGHT-CTR optimization is shown in Figure 6. Figure 5. Optimization of revised CHAM-64/128-CTR mode through partial pre-computation for the initial seven rounds in previous work [6].
When encrypting in real applications, massive amounts of data are processed, so applying a mode of operation is essential. The modes of operation most widely used in industry are the CBC and CTR modes. However, studies until now have tended to focus on optimizing the block cipher rather than the mode of operation. Thus, the objective of this paper is to present the first parallel implementation of CTR mode optimization on the ARMv8 architecture. For the parallel implementation, we introduce efficient register scheduling and data parallelism techniques. For the CTR mode optimization, LEA-CTR uses a method that optimizes one round more than the previous study [7], while HIGHT-CTR and revised CHAM-CTR use the same method as the previous CTR mode optimizations [6,7].

Proposed Data Parallelism Technique
In this section, we present efficient data parallelism for LEA, HIGHT, and revised CHAM. Our data parallelism uses the following techniques. First, the transpose operation is eliminated from LEA and revised CHAM through the ST4 and LD4 instructions. However, since HIGHT encrypts in units of 8-bit words, it is difficult to eliminate its transpose operation; we therefore present an optimized transpose operation so that data parallelism can be applied efficiently to HIGHT. Second, we propose an efficient register scheduling technique that maximizes the number of parallel encryptions. Through the proposed register scheduling, 24 encryptions are performed simultaneously for LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions are performed simultaneously for HIGHT and revised CHAM-64/128 on the NEON engine. Furthermore, to process multiple encryptions efficiently, we interleave the NEON instructions to minimize pipeline stalls, as shown in Figure 7. In other words, while processing NEON instructions 1, 2, and 3, the pipeline stall is minimized by inserting the same instructions for other blocks using the Temp registers. Finally, we access memory with the maximum number of vector registers per instruction, effectively reducing the number of memory accesses.

LEA Optimization
We introduce an efficient data parallelism technique for LEA. Since LEA performs round operations in units of 32-bit words, the lanes of the 128-bit vector registers are set to be processed in parallel in 32-bit units. Previously [9], an explicit transpose operation was performed to apply data parallelism; we eliminate this overhead because the transpose is applied automatically by the LD4 and ST4 instructions when loading data from memory into four vector registers and storing it back. In addition, all vector registers are utilized to perform 24 encryptions simultaneously. Figure 8 shows the register scheduling of the proposed parallel implementation of LEA. The vector registers v0-v23 store 24 plaintexts in transposed form. The vector registers v24-v27 are used as temporary registers to process multiple plaintexts simultaneously. Finally, the vector registers v28-v31 store the round keys. Algorithm 1 represents one round of the parallel implementation of LEA-128 and shows the data-parallel processing of four plaintexts; the other plaintexts are processed with the remaining Temp registers by interleaving NEON instructions. Our contribution in Algorithm 1 is to process the transpose operation without computational overhead; the other data parallelism techniques are the same as in [9].
Step 1 loads the plaintexts into four vector registers via the LD4 instruction, which applies the transpose automatically while loading the plaintexts from memory.
Step 2 loads the round keys required for one round function into four vector registers. Steps 3-5, Steps 8-10, and Steps 13-15 perform the XOR operations and modular additions by 2^32. Steps 6-7 perform the ROR3 operation on the word to which modular addition was applied; likewise, Steps 11-12 perform the ROR5 operation, and Steps 16-17 perform the ROL9 operation. Repeating this process 24 times completes the encryption of LEA-128. Afterwards, the transpose is applied automatically by the ST4 instruction as the results are stored to memory, effectively removing the computational overhead of the transpose operation. As with LEA-128, the proposed parallel implementations of LEA-192 and LEA-256 also perform 24 encryptions simultaneously. However, both LEA-192 and LEA-256 need more vector registers to store round keys: the v26-v31 vector registers store the round keys, and the remaining v24-v25 vector registers are used as Temp registers.
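As a scalar reference for the dataflow of the steps above, one LEA round can be sketched in C as follows (the step ordering in the NEON code differs, but each lane computes the same values; `rk[]` holds the six 32-bit round-key words of the round):

```c
#include <stdint.h>

static uint32_t rol(uint32_t x, unsigned n) { return (x << n) | (x >> (32u - n)); }
static uint32_t ror(uint32_t x, unsigned n) { return rol(x, 32u - n); }

/* Scalar reference of one LEA round per the LEA specification.  The
 * parallel version in Algorithm 1 computes this on many 32-bit lanes
 * at once: six XORs, three modular additions, three rotations. */
static void lea_round(uint32_t x[4], const uint32_t rk[6]) {
    uint32_t t = x[0];
    x[0] = rol((x[0] ^ rk[0]) + (x[1] ^ rk[1]), 9);   /* ROL9 after add */
    x[1] = ror((x[1] ^ rk[2]) + (x[2] ^ rk[3]), 5);   /* ROR5 */
    x[2] = ror((x[2] ^ rk[4]) + (x[3] ^ rk[5]), 3);   /* ROR3 */
    x[3] = t;                                          /* word rotation  */
}
```

In the vector implementation, each line maps to an EOR, an ADD, and an SHL/SRI pair over all 24 blocks' lanes, with the word rotation realized by register renaming rather than data movement.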

HIGHT Optimization
In HIGHT, since the round function operates in 8-bit units, the lanes of the vector registers are set to 16 bits to efficiently parallelize the byte-wise operations within the 128-bit registers. In this way, 16 encryptions can be performed simultaneously in a set of eight vector registers. In the existing work [10], only 24 encryptions were processed simultaneously on the NEON engine, but we process 48 encryptions simultaneously by utilizing all vector registers. Figure 9 shows the register scheduling of the proposed parallel implementation of HIGHT. The v0-v23 vector registers store 48 plaintexts, 16 plaintexts per group of eight vector registers. The v24-v27 vector registers are Temp registers that store intermediate values. Finally, the v28-v31 vector registers hold the round keys.
Unlike LEA and revised CHAM, HIGHT operates byte-wise, so it is difficult to remove the transpose operations from the overall process; instead, we present an optimized transpose operation. Algorithm 2 shows the optimized transpose operations used in the proposed parallel implementation of HIGHT; for the rest of the encryptions, we interleave the instructions to minimize pipeline stalls. Steps 1-14 are the transpose performed before encryption, and Steps 15-28 are the transpose performed after encryption. In Steps 1-2, the partially transposed plaintexts are loaded into four vector registers through the LD4 instruction. For example, the v0 vector register holds the 0th and 4th index words of 8 of the 16 plaintexts, and the v4 vector register holds the 0th and 4th index words of the remaining eight plaintexts.
Step 3 stores the intermediate value, and, in Step 4, the 0th word of all 16 plaintexts is gathered into the v0 vector register through the TRN1 instruction. In Step 5, the 4th word of all 16 plaintexts is gathered into the v4 vector register through the TRN2 instruction. Three more iterations of this process complete the transpose for the 16 plaintexts. From Step 15, the transpose is performed after encryption is completed and before storing the ciphertexts in memory, proceeding in the reverse order of Steps 1-14. After the transpose is completed, the ciphertexts are stored in memory through the ST4 instruction.
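The lane shuffles performed by TRN1 and TRN2 can be modeled on 16-bit lanes as follows (a scalar sketch of the instruction semantics, not NEON code; eight lanes stand in for one 128-bit register):

```c
#include <stdint.h>
#include <stddef.h>

/* Lane-level model of the ASIMD TRN1/TRN2 instructions on 16-bit lanes
 * (8 lanes per 128-bit register).  TRN1 interleaves the even-numbered
 * lanes of both sources, TRN2 the odd-numbered lanes. */
static void trn1_8h(uint16_t d[8], const uint16_t a[8], const uint16_t b[8]) {
    for (size_t i = 0; i < 4; i++) { d[2*i] = a[2*i];   d[2*i+1] = b[2*i];   }
}
static void trn2_8h(uint16_t d[8], const uint16_t a[8], const uint16_t b[8]) {
    for (size_t i = 0; i < 4; i++) { d[2*i] = a[2*i+1]; d[2*i+1] = b[2*i+1]; }
}
```

Applied to a register pair such as v0/v4, TRN1 collects one byte position from all 16 blocks into a single register while TRN2 collects the other, which is exactly the gathering step described above.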

Algorithm 3 is a round function to which the data parallelism technique is applied in HIGHT. In the HIGHT optimization of [10], both task and data parallelism were applied, whereas Algorithm 3 applies only data parallelism; this data parallelism technique for HIGHT is our contribution. Algorithm 3 shows one round function processing 16 encryptions; the rest of the encryptions are processed using the Temp registers by interleaving the NEON instructions.

Algorithm 2. Efficient transpose operation for HIGHT.
Step 1 loads the round keys required for one round function into the vector registers. Steps 3-11 and Steps 27-35 perform the F1 function. Steps 12-14 and Steps 37-38 perform an XOR operation followed by a modular addition by 2^8, while Steps 25-26 and Steps 49-50 perform a modular addition by 2^8 followed by an XOR operation. Finally, Steps 15-23 and Steps 39-47 perform the F0 function.

Revised CHAM Optimization
We introduce an efficient data-parallel implementation of the revised CHAM family. We describe our approach based on revised CHAM-64/128, because revised CHAM-128/128 and revised CHAM-128/256 are almost identical apart from the lane width, as their word size differs from that of revised CHAM-64/128. In the existing parallel implementation of revised CHAM [19], a transpose step was required to apply data parallelism; in contrast, we let the transpose be applied automatically when loading data from memory into four vector registers and storing data from four vector registers back to memory through the LD4 and ST4 instructions. In addition, the maximum number of encryptions is performed simultaneously by utilizing all vector registers: 48 encryptions in revised CHAM-64/128 and 24 encryptions in the remaining revised CHAM variants. The vector register scheduling for processing the maximum number of encryptions in revised CHAM-64/128 is shown in Figure 10. The register scheduling of the remaining revised CHAM variants is similar to that of LEA; only the v27 vector register additionally needs to be set as a counter. Algorithm 4 shows the parallel implementation of an odd and an even round with the data parallelism technique of revised CHAM-64/128. Algorithm 4 may look similar to the HIGHT implementation of [10], but that implementation applied both task and data parallelism, whereas Algorithm 4 applies only data parallelism. Our contribution in Algorithm 4 is the data parallelism technique for revised CHAM-64/128, in which the transpose operation required for data parallelism is processed with the LD4 and ST4 instructions without computational overhead.
Moreover, since we apply only data parallelism to revised CHAM-64/128, no transpose operation is needed inside the round function, unlike in the revised CHAM-64/128 implementation of [10]. With the data parallelism technique applied, Algorithm 4 processes eight encryptions per round function, and the remaining plaintexts are encrypted by interleaving instructions efficiently through the Temp registers. Since revised CHAM-64/128 performs round operations on 16-bit words, the lane arrangement, the unit of parallel processing in the vector registers, is set to 8H (eight 16-bit halfwords). In Step 1, the plaintexts are transposed automatically by the LD4 instruction as they are loaded from memory into four vector registers.
Step 2 loads the round keys for processing four rounds into the four vector registers. In Step 3, one word of the plaintexts and the counter value are XORed. In Steps 4-5, the ROL 1 operation is performed, and, in Step 6, XOR operations are performed on the above result and round key. After that, modular addition is performed between the results of Step 3 and Step 6. In Step 8, ROL 8 operation, the last operation of odd round, is performed, and, in 16-bit parallel processing, rotate shift operation can be efficiently processed with only REV16 instruction. Steps 9-10 increase counter value because one round is over. From Step 11, it means even round, and the counter value and one word of the plaintext perform XOR operation again.
Step 12 performs the ROL8 operation, and Step 13 XORs the round key with the result of Step 12. After that, Step 14 performs a modular addition by 2^16 between the resulting values of Steps 11 and 13. In Steps 15-16, the even round is completed through the ROL1 operation. Finally, Steps 17-18 add 1 to the counter value. The encryption process of revised CHAM-64/128 is completed by repeating the above process 44 times, after which the ciphertexts are stored in memory, with the transpose operation applied automatically through the ST4 instruction. Through efficient vector register scheduling, we process more encryptions simultaneously than previous work [10].
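The per-lane arithmetic of the odd/even round pair above can be sketched in scalar C (one lane of the 8H vectors; the helper name is ours, and the word rotation that the vector code realizes by register renaming is written out explicitly):

```c
#include <stdint.h>
#include <assert.h>

static uint16_t rol16(uint16_t x, unsigned r) {
    return (uint16_t)((x << r) | (x >> (16 - r)));
}

/* Scalar sketch of one odd/even round pair of revised CHAM-64/128
 * (round counter i, 0-indexed, i even).  Each statement maps to one
 * NEON instruction operating on 8 lanes in the vectorized Steps 3-18. */
void cham64_round_pair(uint16_t x[4], uint16_t i, const uint16_t rk[2]) {
    /* odd round in the text (i even): ROL1 on x[1], final ROL8 */
    uint16_t t = (uint16_t)((x[0] ^ i) + (rol16(x[1], 1) ^ rk[0]));
    t = rol16(t, 8);                 /* a single REV16 in 16-bit lanes */
    x[0] = x[1]; x[1] = x[2]; x[2] = x[3]; x[3] = t;
    i++;                             /* Steps 9-10: increment counter */
    /* even round: ROL8 on x[1], final ROL1 */
    t = (uint16_t)((x[0] ^ i) + (rol16(x[1], 8) ^ rk[1]));
    t = rol16(t, 1);
    x[0] = x[1]; x[1] = x[2]; x[2] = x[3]; x[3] = t;
}
```

This is a per-lane sketch of the round structure, not the paper's assembly; in the vector code the shift of words happens implicitly through register scheduling.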

Parallel Implementation of CTR Mode of Operation on the NEON Engine
In this section, we propose a parallel implementation of the CTR mode optimization. The CTR mode optimization works as follows. The input block of CTR mode is composed of a nonce and a counter; the nonce part is fixed, while the counter part varies. Since the nonce part is fixed for every encryption in CTR mode, some operations of the initial few rounds can be pre-computed. Therefore, when performing CTR mode encryption, the computational overhead can be reduced effectively by looking up a pre-computation table instead of computing those operations for the initial few rounds. We use the existing CTR mode optimizations for HIGHT and revised CHAM [6,7], and, in the case of LEA, we optimize one more round than the existing work [7]. Through the CTR mode optimization, the round function operations are reduced effectively by looking up the pre-computation table for up to four rounds in LEA, five rounds in HIGHT, and seven rounds in revised CHAM. Furthermore, we apply the proposed parallel implementation to the CTR mode optimization to encrypt multiple plaintexts simultaneously. By combining the CTR mode optimization with massively parallel encryption, we effectively improve the performance of CTR mode on the ARMv8 architecture.
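The caching idea can be illustrated with a deliberately simplified toy cipher (not LEA/HIGHT/CHAM; all names here are illustrative) in which every word is updated independently, so the nonce words' post-round values can be cached once per nonce:

```c
#include <stdint.h>
#include <assert.h>

/* Toy 2-round "cipher" whose 32-bit words are updated independently.
 * This only makes the CTR caching idea visible; in the real ciphers,
 * diffusion between words limits how many rounds can be pre-computed
 * (4 for LEA, 5 for HIGHT, 7 for revised CHAM). */
static uint32_t f(uint32_t x, uint32_t rk) {
    return (x ^ rk) * 2654435761u + 1u;
}

/* Full path: encrypt the block {counter, nonce0, nonce1, nonce2}. */
void full_block(uint32_t w[4], const uint32_t rk[2]) {
    for (int r = 0; r < 2; r++)
        for (int i = 0; i < 4; i++) w[i] = f(w[i], rk[r]);
}

/* Pre-compute the nonce words once per nonce ... */
void ctr_precompute(const uint32_t nonce[3], const uint32_t rk[2],
                    uint32_t cache[3]) {
    for (int i = 0; i < 3; i++)
        cache[i] = f(f(nonce[i], rk[0]), rk[1]);
}

/* ... and, per block, recompute only the counter word. */
void fast_block(uint32_t ctr, const uint32_t rk[2],
                const uint32_t cache[3], uint32_t out[4]) {
    out[0] = f(f(ctr, rk[0]), rk[1]);
    out[1] = cache[0]; out[2] = cache[1]; out[3] = cache[2];
}
```

The fast path produces the same state as the full path for every counter value while doing a quarter of the round work, which is the effect the table lookups achieve in the real implementations.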

LEA-CTR Optimization
We describe how to efficiently process multiple encryptions simultaneously by applying the proposed LEA-CTR mode optimization. In CTR mode, the positions of the counter and nonce within the block are not fixed by the specification. The size of the counter is also not fixed; for security, counters on 32-bit processors are typically chosen to have a period of 2^32. Unlike previous work [7], we change the counter position so that the proposed technique can be applied to one more round. Figure 11 shows the optimization of the LEA-CTR mode, which can bypass some round operations of the initial four rounds by using the fixed nonce value. To describe the round functions up to the optimized rounds in detail, we write X_{i,j} (i ∈ [0, 4], j ∈ [0, 3]) for word j after round i. The X_{0,0} word of the input block is the variable counter, and the remaining X_{0,1}, X_{0,2}, and X_{0,3} words are the fixed nonce. In Round 1, all words except X_{0,0} are nonce words, so their round operations can be skipped efficiently through pre-computation. In Round 2, the X_{1,2} and X_{1,3} words are still results of the fixed nonce, so their round operations are bypassed efficiently by looking up the pre-computation table. The remaining X_{1,0} and X_{1,1} words are affected by the variable counter, so round operations are required; these are performed using pre-computed values derived from the fixed nonce. In Round 3, round operations are required for the X_{2,0} and X_{2,1} words, affected by the variable counter in Rounds 2 and 3, and for the X_{2,2} word, affected by the counter from the previous rounds. Some of these operations are performed efficiently by looking up the pre-computation table rather than computing them.
In Round 4, all words except the X_{3,0} word are affected by the variable counter, so the round operations of Round 4 are partially reduced through pre-computation on the X_{3,0} word.
The parallel implementation of LEA-CTR mode simultaneously processes the same number of encryptions as the proposed parallel implementation of LEA. To do so, its register scheduling is similar to that of the proposed parallel implementation of LEA, with the following differences: the v26-v28 vector registers store the pre-computation table during rounds 1-4, and the remaining v29-v31 vector registers store the round keys. After round 4, the same register scheduling as the parallel implementation of LEA is applied. Algorithm 5 shows the optimization of the initial four rounds in LEA-CTR mode for four plaintexts. The rest of the plaintexts are processed by efficiently interleaving NEON instructions using the Temp registers. In Steps 1-2, the round keys and the pre-computation table for rounds 1-2 are loaded into vector registers, respectively. In Steps 3-6, which perform Round 1, only the X_{0,0} word, the variable counter, requires round operations. By bypassing the round operations for the remaining words, the computational overhead of Round 1 is reduced efficiently.
Figure 11. Proposed LEA-CTR mode optimization through pre-computation for the initial four rounds.
Steps 7-14 show the round operations of Round 2. Steps 7-10 are the round operations for the X_{1,0} word, which is affected by the variable counter in Round 2. Steps 11-14 perform the round operations of the word containing the variable counter; the remaining X_{1,2} and X_{1,3} words are nonce words, so no round operations are required. In Steps 15-16, the round keys and pre-computation table for Round 3 and Round 4 are loaded into vector registers. Steps 17-29 represent the round operations of Round 3. Steps 17-20 are the round operations for the X_{2,0} word, affected by the variable counter in Round 3, and Steps 21-25 perform the round operations for the word affected by the variable counter in Round 2. In Steps 26-29, the word containing the counter performs its round operations. In Step 30, the round keys for one round are loaded into the vector registers. In Steps 31-32, one round operation is reduced by looking up the last pre-computation table. After Round 4, the implementation is the same as the parallel implementation of LEA. LEA-192 and LEA-256 differ from LEA-128 only in the number of rounds, so their CTR modes can be optimized in the same way.
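For reference, one LEA round on 32-bit words is sketched below in scalar C (the standard LEA round; word indexing in Figure 11 may follow a different convention). Each new word depends only on a word and its right neighbor, plus the plain copy of word 0 into word 3, which is why nonce-only words stay counter-independent for a few rounds and can be pre-computed:

```c
#include <stdint.h>
#include <assert.h>

static uint32_t rol32(uint32_t x, unsigned r) { return (x << r) | (x >> (32 - r)); }
static uint32_t ror32(uint32_t x, unsigned r) { return (x >> r) | (x << (32 - r)); }

/* One LEA round: 32-bit words, six round-key words per round. */
void lea_round(uint32_t x[4], const uint32_t rk[6]) {
    uint32_t t0 = x[0];
    x[0] = rol32((x[0] ^ rk[0]) + (x[1] ^ rk[1]), 9);
    x[1] = ror32((x[1] ^ rk[2]) + (x[2] ^ rk[3]), 5);
    x[2] = ror32((x[2] ^ rk[4]) + (x[3] ^ rk[5]), 3);
    x[3] = t0;   /* word 0 is copied, not mixed */
}
```

Note that the new words 1 and 2 never read word 0, so their values for a fixed nonce can be tabulated regardless of the counter; how long such counter-independent words survive determines how many rounds the pre-computation covers.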

HIGHT-CTR Optimization
For the HIGHT-CTR mode, we propose a parallel implementation that simultaneously processes 48 encryptions by applying the proposed data parallelism technique to the HIGHT-CTR mode optimization. For the optimization method itself, we use the CTR mode optimization proposed in existing work [7], shown in Figure 6. The 64-bit block of HIGHT consists of a 32-bit variable counter and a 32-bit fixed nonce. As in the LEA-CTR optimization, some round operations for up to five rounds are bypassed efficiently by pre-computing the round operations on the fixed nonce part. To implement the HIGHT-CTR mode optimization in parallel, we use register scheduling similar to the proposed parallel implementation of HIGHT. The round function of HIGHT requires more TEMP registers than the other block ciphers (LEA and CHAM) to store intermediate values. Thus, when loading the RK and table values from memory into registers, three vector registers are used for each, and the remaining vector registers are utilized as TEMP registers.
Algorithm 6 shows the parallel implementation of the initial rounds 1-5 of the optimized HIGHT-CTR mode, which looks up the pre-computed values for the fixed nonce part. After round 5, the implementation is the same as the proposed parallel implementation of HIGHT, and the encryptions of the remaining plaintexts are performed by efficiently interleaving NEON instructions using the Temp registers. In Round 1, only the four counter words (X_{0,0}-X_{0,3}) need round operations; the remaining words (X_{0,4}-X_{0,7}) are already pre-computed from the fixed part, so no round operations are required. In Round 2, since the X_{1,1} and X_{1,5} words derive from the variable counter, their round operations are performed by looking up the pre-computation table. Steps 3-13 show the round function for the X_{1,1} and X_{1,5} words. In Step 3, the round operation for the X_{1,1} word is performed efficiently with only an ADD instruction, using the pre-computed value. Steps 4-13 are the round function for the X_{1,5} word, in which the cached pre-computed value is used.
In Round 3, the X_{2,7} word is affected by the counter, so its intermediate value can no longer be pre-computed and a round operation is required. In Steps 15-25, to perform the round function for the X_{2,7} word, a modular addition is performed between the pre-computed X_{2,7} word and the X_{2,6} word after applying the F0 function and an XOR operation. In Round 4, round operations are required for the X_{3,1} and X_{3,3} words affected by the counter. Steps 27-36 perform the round function for the X_{3,1} word, and Steps 37-38 process the round function for the X_{3,3} word with only an XOR instruction by loading the pre-computation table into vector registers. In Round 5, all words are affected by the counter, so pre-computation is no longer possible and all words require round operations. In Steps 40-48, the last pre-computed value is used to perform the round function of the X_{4,3} word. Through the HIGHT-CTR mode optimization, some round operations up to the initial five rounds are reduced to simple table lookups. After round 5, encryption proceeds in the same way as the proposed parallel implementation of HIGHT.
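The F0 and F1 functions referenced above are fixed XORs of rotations of an 8-bit word, per the HIGHT specification. Being linear over XOR, their values on fixed nonce bytes can be pre-computed once and reused for every counter value:

```c
#include <stdint.h>
#include <assert.h>

static uint8_t rol8(uint8_t x, unsigned r) {
    return (uint8_t)((x << r) | (x >> (8 - r)));
}

/* HIGHT's auxiliary functions on 8-bit words, used in every round. */
uint8_t F0(uint8_t x) { return rol8(x, 1) ^ rol8(x, 2) ^ rol8(x, 7); }
uint8_t F1(uint8_t x) { return rol8(x, 3) ^ rol8(x, 4) ^ rol8(x, 6); }
```

Because F0(a ^ b) = F0(a) ^ F0(b) (and likewise for F1), XOR-masked nonce bytes behave predictably under these functions, which is what makes the table lookups in Steps 15-25 and 37-38 possible.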

Revised CHAM-CTR Optimization
In this section, we present a parallel implementation of the revised CHAM-CTR mode optimization. Since revised CHAM-128/128 and CHAM-128/256 are almost the same as revised CHAM-64/128 except for the word size, the proposed parallel implementation of CTR mode can be applied to revised CHAM-128/128 and CHAM-128/256 simply by changing the lane configuration of the vector registers. Thus, our explanation is based on revised CHAM-64/128. Like HIGHT, revised CHAM-64/128 encrypts 64-bit blocks, and our target platform is a 64-bit processor, so the CTR mode block typically consists of a 32-bit variable counter and a 32-bit fixed nonce. The revised CHAM-CTR mode optimization is the same as in existing work [6] and is shown in Figure 5. The CHAM-CTR optimization proposed in [6] reduces the computational overhead of the round functions during encryption by pre-computing some operations of the initial few rounds using the fixed nonce part, in the same way as the HIGHT-CTR and LEA-CTR optimizations. To process as many encryptions simultaneously as possible, the proposed parallel implementation of the revised CHAM-CTR mode uses the same register scheduling as the proposed parallel implementation of revised CHAM, except that the v28-v31 vector registers store the RK and pre-computation tables.
Algorithm 7 shows the parallel implementation of the initial seven rounds of the CTR mode optimization of revised CHAM-64/128. After these seven rounds, the round functions are performed in the same way as the proposed parallel implementation of revised CHAM-64/128. Algorithm 7 shows the first seven rounds for eight plaintexts; the remaining plaintexts are encrypted by interleaving NEON instructions using the Temp registers. As shown in Figure 5, a block of the revised CHAM-CTR mode consists of the X_{0,2} and X_{0,3} words, the fixed nonce part, and the X_{0,0} and X_{0,1} words, the variable counter part. In Step 1, the round keys for Round 1 are loaded into vector registers. In Round 1, round operations on the X_{0,0} and X_{0,1} words are required; since both are variable counter words, pre-computation is impossible, and the round function is performed normally for them. In Step 2, the pre-computed values for Round 2 and Round 4 are loaded into vector registers. In Round 2, the round function is performed for the X_{1,0} word with the X_{1,1} word. Here, the pre-computed value is used, removing the computational overhead of the ROL8 and XOR operations that would otherwise be performed. In Steps 3-8, the round operations for the X_{1,0} word are processed using the pre-computed value. In the intermediate rounds, the words affected by the counter perform round operations normally. In Round 6, the round function for the X_{5,0} word is performed with the pre-computed X_{5,1} word. In Steps 20-25, the round function is completed with only the ROL1, modular addition, and XOR operations, using the pre-computed X_{5,1} words. In Round 7, the round function is performed using the last pre-computed value. From then on, the implementation is the same as the proposed parallel implementation of revised CHAM-64/128.
Through the proposed parallel implementation of the revised CHAM-CTR mode optimization, some round operations up to seven rounds are bypassed, and the maximum number of encryptions is processed simultaneously.
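The single-word caching that Algorithm 7 relies on (e.g., Step 2's pre-computed values removing the ROL8-and-XOR work) can be sketched in scalar C; the helper names are ours, and (a, b) stand for the rotation amounts of the round parity:

```c
#include <stdint.h>
#include <assert.h>

static uint16_t rol16(uint16_t x, unsigned r) {
    return (uint16_t)((x << r) | (x >> (16 - r)));
}

/* Full revised-CHAM-style round output for word x0 paired with word x1:
 * t = ROL_b((x0 ^ i) + (ROL_a(x1) ^ rk) mod 2^16). */
uint16_t round_full(uint16_t x0, uint16_t x1, uint16_t i,
                    uint16_t rk, unsigned a, unsigned b) {
    return rol16((uint16_t)((x0 ^ i) + (rol16(x1, a) ^ rk)), b);
}

/* Cached form: when x1 is a fixed nonce word, inner = ROL_a(x1) ^ rk is
 * counter-independent and can be pre-computed once per nonce. */
uint16_t round_cached(uint16_t x0, uint16_t i, uint16_t inner, unsigned b) {
    return rol16((uint16_t)((x0 ^ i) + inner), b);
}
```

Both forms produce the same word for every counter value, so the rotate-and-XOR half of the round collapses into a table entry whenever the paired word is nonce-derived.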

Evaluation
We evaluate the proposed data parallelism technique and CTR mode optimization on a Raspberry Pi 4B [21], one of the most popular embedded devices. The Raspberry Pi 4B supports up to 4 GB of internal memory, an upgrade over previous models, and uses the ARM Cortex-A72, a 64-bit processor, as its microcontroller unit. We compiled with aarch64-linux-gnu-gcc in an Ubuntu 19.10 environment, and VSCodium was used as the development environment. To our knowledge, there are no prior studies on the optimization of CTR mode for LEA, HIGHT, and revised CHAM on the ARMv8 architecture. Thus, we compare the performance of our work with the existing optimizations [9,10] of LEA, HIGHT, and revised CHAM on ARMv8. Table 5 shows the performance comparison of the LEA parallel implementations. Seo et al. [9] presented a parallel implementation of LEA utilizing the NEON engine on the Apple A7 and Apple A9. A total of 24 encryptions were performed simultaneously using all vector registers, but a transpose step was required to apply data parallelism. We ported Seo et al.'s implementation to our target device, the ARM Cortex-A72, for a fair comparison. Our Work 1 is the proposed parallel implementation of LEA and processes 24 encryptions simultaneously, as in [9]. We reduce the computational overhead by applying the transpose operation automatically when loading data from memory into vector registers and storing data from vector registers to memory. By removing the explicit transpose step, Our Work 1 shows 3.09%, 2.77%, and 2.63% performance improvements in LEA-128, LEA-192, and LEA-256, respectively, over the ported results of [9]. The results so far concern optimization of the block ciphers themselves. IoT devices in the real world send massive amounts of data, so a mode of operation must be applied during encryption; among them, CTR mode is the most widely used in industry. Our Work 2 is the proposed CTR mode optimization technique.
Our Work 2 performs 24 encryptions simultaneously. In particular, with the CTR mode optimization applied, the round operations of the initial few rounds are handled by simply looking up the pre-computation table, reducing the computational overhead of the round operations efficiently. Moreover, by pre-computing one more round than the previous work [7], Our Work 2 achieved 8.76%, 6.39%, and 5.82% performance improvements in LEA-128, LEA-192, and LEA-256, respectively, over [9]. Our Work 2 is thus faster than both the previous study [9] and Our Work 1. Table 6 shows the performance comparison of the HIGHT and revised CHAM parallel implementations. Song et al. [10] presented secure and fast implementations of HIGHT and revised CHAM using the NEON engine on the ARMv8 platform. In the fast implementation, task and data parallelism were applied to compute multiple operations and data simultaneously. In addition, utilizing all vector registers, 24 encryptions in HIGHT, 16 encryptions in revised CHAM-64/128, 10 encryptions in revised CHAM-128/128, and 8 encryptions in revised CHAM-128/256 were processed simultaneously. In the secure implementation, Song et al. [10] optimized random shuffling, the core operation of fault attack countermeasures, using the NEON engine. Our Work 1 for HIGHT is a parallel implementation with the efficient transpose operation and the proposed parallelization. It processes 48 encryptions simultaneously, more than before, by scheduling all vector registers more efficiently than previous work [10]. Through these advantages, Our Work 1 for HIGHT achieved about a 5.26% performance improvement over the existing fast implementation of HIGHT [10]. Our Work 2 for HIGHT is a parallel implementation of the HIGHT-CTR mode and, like Our Work 1, processes 48 encryptions simultaneously.
Furthermore, by applying the CTR mode optimization, the computational overhead is reduced by efficiently pre-computing some round operations of up to five rounds. Through the proposed parallel techniques and CTR mode optimization, it shows an 8.62% performance improvement over previous work [10]. Our Work 1 for revised CHAM is our proposed parallel implementation. Through efficient vector register scheduling, we process 48, 24, and 24 encryptions simultaneously in revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively, more than in previous work, and likewise apply the transpose operation without additional cost when loading data from memory into four vector registers and storing four vector registers back to memory. Through the proposed optimization techniques, Our Work 1 for the revised CHAM family shows improvements of 9.52%, 1.52%, and 4.02% compared to the previous work [10]. Finally, Our Work 2 for revised CHAM is a parallel implementation of the CTR mode optimization. It processes the same number of encryptions simultaneously as Our Work 1. In addition, through the CTR mode optimization, some round operations up to the initial seven rounds are pre-computed to reduce the computational overhead of the round functions. As a result, performance improvements of 15.87%, 2.94%, and 5.36% in revised CHAM-64/128, CHAM-128/128, and CHAM-128/256, respectively, were achieved compared to previous work [10]. Our Work 2 yields the fastest CTR mode results on the ARMv8 architecture.

Conclusions
In this article, we have presented parallel implementations of CTR mode for ARX-based block ciphers (LEA, HIGHT, and revised CHAM) for data security on embedded devices with the ARMv8 architecture. For the parallel implementations, we have presented data parallelism techniques. In LEA and revised CHAM, we eliminated the transpose operation required to apply data parallelism, and, in HIGHT, we proposed an optimized transpose operation. In addition, to process the maximum number of encryptions simultaneously, we presented register scheduling that uses all vector registers efficiently. As a result, 24 encryptions are performed simultaneously in LEA, revised CHAM-128/128, and revised CHAM-128/256, and 48 encryptions in HIGHT and revised CHAM-64/128; HIGHT and revised CHAM perform more simultaneous encryptions than previous work. Through the data parallelism technique and register scheduling, performance improved by 3.09%, 5.26%, and 9.52% over the previous studies of LEA, HIGHT, and revised CHAM, respectively. Most studies to date have optimized the block ciphers themselves; since massive amounts of data are encrypted, a mode of operation must be applied, and among them CTR mode is widely used in industrial applications. We applied the proposed parallel technique to the CTR mode optimization to process multiple encryptions efficiently. In addition, exploiting the structure of CTR mode, the initial round operations related to the nonce part were pre-computed. In particular, we optimized one more round than the previous work by changing the position of the nonce part in LEA-CTR. Through the proposed optimization techniques, the LEA-, HIGHT-, and revised CHAM-64/128-CTR modes showed 8.76%, 8.62%, and 15.87% performance improvements, respectively, compared to the previous best results. The proposed CTR mode implementation achieved faster performance than both the previous work and the proposed data parallelism technique alone.
In future work, we will study applying the proposed parallel implementation of CTR mode to other lightweight ciphers such as SIMECK, SPECK, and SKINNY. Our work contributes to fast encryption on ARMv8-based IoT devices through the proposed parallel implementations of the CTR mode of ARX-based ciphers. We believe that our software can be a cornerstone for building secure IoT services and applications such as smart cities, smart factories, and autonomous driving.