High-Speed Implementation of PRESENT on AVR Microcontroller

: We propose the compact PRESENT on embedded processors. To obtain high-performance, PRESENT operations, including an add-round-key, a substitute layer and permutation layer operations are efﬁciently implemented on target embedded processors. Novel PRESENT implementations support the Electronic Code Book (ECB) and Counter (CTR). The implementation of CTR is improved by using the pre-computation for one substitute layer, two diffusion layer, and two add-round-key operations. Finally, compact PRESENT on target microcontrollers achieved 504.2, 488.2, 488.7, and 491.6 clock cycles per byte for PRESENT-ECB, 16-bit PRESENT-CTR (RAM-based implementation), 16-bit PRESENT-CTR (ROM-based implementation), and 32-bit PRESENT-CTR (ROM-based implementation) modes of operation, respectively. Compared with former implementation, the execution timing is improved by 62.6%, 63.8%, 63.7%, and 63.5% for PRESENT-ECB, 16-bit PRESENT-CTR (RAM based implementation), 16-bit PRESENT-CTR (ROM-based implementation), and 32-bit PRESENT-CTR (ROM-based implementation) modes of operation, respectively.


Introduction
Lightweight cryptography is getting more important than ever due to the emergence of the Internet of Things. The lightweight cryptography supports encryption in resourceconstrained environments, such as sensor network, health care, and surveillance systems. Therefore, the implementation of lightweight cryptography aims at optimizing certain criteria, such as energy consumption, execution time, memory footprint, and chip size.
We propose a number of implementation techniques for well-known lightweight cryptography, namely PRESENT, and its Electronic Code Book (ECB) and Counter (CTR) on low-end embedded processors, where ECB encrypts the plaintext directly with the master key and CTR encrypts the counter value with the master key and then the result of encryption is XORed with the plaintext. In order to achieve optimal results on target microcontrollers, we used processor-specific optimizations for PRESENT block ciphers. Furthermore, the compact counter mode of PRESENT and its bit-slicing-based implementation are also presented. Novel implementation techniques for PRESENT block cipher can be extended to other lightweight cryptography algorithms and other platforms.

Optimal Implementation of PRESENT Block Cipher on Embedded Processors
We implemented the PRESENT block cipher on low-end microcontrollers. The Alf and Vegard's RISC (AVR) processor is a resource-constrained device that is used extensively in low-end Internet of Things (IoT) applications, such as Arduino UNO and Arduino MEGA. The PRESENT-ECB implementation is optimized in terms of execution timing and other factors (e.g., code size and RAM). The word size of general purpose registers in the target AVR microcontroller is 8-bit wise. All 16-bit wise PRESENT operations are optimized for 8-bit word and instruction set. Compared with the former implementation of PRESENT-ECB for a 128-bit security level on AVR microcontrollers, the proposed work improved the execution timing by 62.6% [1].

Pre-Computation for PRESENT with CTR
CTR is utilized in real applications and services, such as Transport Layer Security (TLS) and Virtual Private Network (VPN). CTR receives the input consisting of two parts, including constant nonce and variable counter. Since the nonce part is the constant variable, the constant nonce value is repeated several times throughout computations. For this reason, some computations of PRESENT block cipher can be optimized through precomputation. By exploiting this feature, we further improved the execution timing of PRESENT-CTR. The method is a generic algorithm and can be implemented with other processors. Compared with the state-of-art implementation,the proposed works on embedded processors that have obtained performance enhancements by 63.8%, 63.7%, and 63.5% for 16-bit PRESENT-CTR (RAM), 16-bit PRESENT-CTR (ROM), and 32-bit PRESENT-CTR (ROM), respectively.

Open Source
The proposed PRESENT implementation is a public domain and full source codes are available at https://github.com/solowal/PRESENT_AVR (accessed on 7 January 2021). Source codes were written in (mixed) AVR assembly language (core algorithm) and C language (function call). Codes support four 128-bit PRESENT implementations, including PRESENT-ECB, PRESENT-CTR16 (RAM based implementation), PRESENT-CTR16 (ROM based implementation), and PRESENT-CTR32 (ROM based implementation). Projects were created and evaluated with Atmel Studio 7.0 framework. Researchers can evaluate and re-create the result with the available source codes.

PRESENT Block Cipher
PRESENT block cipher was introduced in CHES'07 [2]. PRESENT block cipher supports two parameters (i.e., PRESENT-64/80 and PRESENT-64/128). PRESENT block cipher requires 31 rounds and the Substitution-Permutation-Network (SPN) structure is adopted. PRESENT requires three computations including the substitution layer, permutation layer, and add-round-key.
The add-round-key operation performs exclusive-or computations with plaintext and round keys. Round keys (roundkey = (roundkey 1 , roundkey 2 , ..., roundkey 32 )) are generated from the key schedule. In particular, roundkey 32 is used for post-whitening. PRESENT block cipher uses a 4-bit substitution layer. The inner state of PRESENT block cipher (S 63 , ..., S 0 ) can be seen as 16 4-bit words (w 15 ... w 0 ), where one w word consists of four states (i.e., ). The 4-bit substitution layer can be represented in Boolean operations for the bitslicing implementation. The PRESENT 4-bit S-box is designed for higher hardware efficiency and compact implementation. PRESENT block cipher uses a bit of permutation for the linear diffusion layer. The permutation layer performs bit permutation in the intermediate result. Each bit state (x) is permutated through P(x).

Target Processor
The AVR microcontroller finds many interesting applications in embedded systems, such as sensor networks, surveillance systems, and health care. The number of available registers is only 32 8-bit long. Basic arithmetic instructions take a single clock cycle. The memory load/store instruction requires two clock cycles. The microcontroller supports an 8-bit instruction set, 128 KB of FLASH memory, 8 MHz of working frequency, two-stage pipeline design, and 4 KB of RAM (e.g., ATmega128). Among them, 6 registers (i.e., R26∼R31) are reserved for address pointers, and the remaining registers can be utilized for general purpose registers by a programmer. In particular, the R1 register is the ZERO register that should be cleared before function returns.
Many works are also devoted to improve the execution timing of AES on embedded processors [18][19][20][21][22]. In [23], the compact implementation of ARIA on low-end microcontrollers was proposed.
In CHES'17, optimized PRESENT implementation on embedded ARM CPUs was presented by using a novel decomposition of permutation layers (see Listing 1.2 of [24]), and bitsliced for the S-boxes [24]. A description of PRESENT is detailed in Algorithm 2 of [24]. Unlike a traditional PRESENT algorithm, it performs the permutation layer before the substitution layer. This order of computation is beneficial for bit-slicing-based substitution layer implementation.
In this paper, we presented the compact PRESENT implementation on AVR microcontrollers. We re-designed the PRESENT implementation for 8-bit architecture. Then, we also suggested the PRESENT-CTR. The CTR implementation technique optimizes 2 add-round-key, 2 permutation, and 1 substitution operations with a 1 look-up table operation.

Optimization of PRESENT-ECB
For the efficient implementation of PRESENT block cipher, add-round-key, substituion, and permutation layers are optimized.
In Algorithm 1, add-round-key operation is described in a source code level. The computation is performed with XOR operations with round keys where XOR operation represents logical bitwise exclusive-or operation. The memory access for round keys is performed with the incremental memory pointer mode.

Algorithm 1:
Add-round-key operation in assembly language.
Output: Output results (reg0∼7). The efficient implementation of permutation (P0) is described in Algorithm 2. A 16-bit wise rotation operations are performed with LSR, ROR, LSL, and ROL instructions. Exclusive-or and logical and operations are performed with EOR and ANDI instructions. Similar to the P0 operation, the permutation (P1) is implemented, efficiently.
//t=(X0⊕(ROR_u16(X1,1)))&0x5555 The bitslicing substitution operation is performed with Boolean operations. Detailed descriptions are given in Algorithm 3. Boolean operations, such as logical XOR, AND, OR, and one's complement are performed with EOR, AND, OR, and COM instructions. To move two adjacent registers in a single instruction, MOVW instruction is utilized.

Optimization of PRESENT-CTR
For high-end IoT devices, such as 32-bit ARM-based processors, the size of the counter is fixed at 32-bit [20,25]. However, in an 8-bit ATmega processor, the memory size is limited to at least 2KB depending on the ATmega model (e.g., ATtiny). For this reason, block cipher encryption is usually performed by 2 16 times [26]. From the security perspective of CTR mode, the attacker can pre-compute and collect ciphertext information relied on the IV. When the initial CTR mode is operated, the counter of IV (Initial Vector) is initialized to zero. If there is an unpredictable n-bit input in the encryption process other than the master key, the effective key size for Time-Memory Trade Off (TMTO) attack and Key Collision (KC) attacks increases by n-bit [27]. For an 8-bit AVR microcontroller with a small memory footprint, it is suitable to use a 16-bit counter. For general cases, a 32-bit counter is also widely used in practice. In this section, we present both PRESENT-CTR mode implementations with 16-bit and 32-bit counter modes of operation on the ATmega128 microcontroller.
PRESENT-CTR with a 16-bit counter is described in Figure 1. We represent the bit in square form. Since PRESENT block cipher performs 64-bit block-wise encryption, 64 squares are utilized (i.e., 64-bit data). The most left square and the most right square represent the first and last bit, respectively. Colored squares represent a counter part. The remaining white squares represent nonce part. The computation is performed from top to bottom. The 4-bit data for each color of the initial 16-bit counter is distributed to the 16-bit data through permutation P0 and the bitslicing-substitution process. After the permutation P1 process is done, 16-bit data for each color is gathered regularly in the color (green, red, blue, and yellow) order of the initial counter. Through this, it is possible to predict 16-bit data through 4-bit of the initial counter. During the encryption process up to permutation P1, there is no interference between each color. For four 4-bit counter data, four 16-bit data can be pre-computed, independently. The required look-up table size is 128 bytes (4 colors × 2 4 counters × 16-bit size of data). A detailed description of look-up table generation is given in Algorithm 4. It generates 16 16-bit data with a counter divided into 4-bit data and repeats this process 4 times. The cost of generating a look-up table is less than performing PRESENT-ECB encryption by 4 times. We computed the pre-computation table in a parallel way, which generates four look-up tables at once. Four index parts (1∼4-th bits, 5∼8-th bits, 9∼12-th bits, and 13∼16-th bits) generate four pre-computed outputs (1∼16-th bits, 17∼32-th bits, 33∼48-th bits, and 49∼64-th bits). This ensures the generation of pre-computation is independent of each other. The computation of a look-up table on AVR requires only 4022 clock cycles. This is roughly one time of PRESENT-ECB encryption. The look-up table can be stored in RAM or ROM. If we allocate the look-up table to RAM, we can access to the data with the LD instruction in 2 clock cycles. Otherwise, we can store it to ROM and access to the data with the LPM instruction in 3 clock cycles. The encryption process of PRESENT-CTR mode can be optimized away from the operation up-to the second add-round-key operation by using the created look-up table. Overall, this approach replaces the two permutation layers, two add-round-key, and one substitution layer to one look-up table accesses. C ← P 1 (C) 10:

S S S S S S S S S S S S S S S S
end for 13: end for 14: return LUT16 0 , LUT16 1 , LUT16 2 , LUT16 3 Algorithm 5 shows the proposed PRESENT-CTR16 implementation using a 16-bit counter. In steps 2-5, look-up table access with 16-bit counter is performed. Afterward, the remaining PRESENT computations are performed. Listing 1 shows the AVR assembly code for the 16-bit data look-up. In order to improve performance, 16-bit LUT is performed with two 8-bit memory accesses. The memory access for 16-bit data is 9 clock cycles. This process is repeated 4 times. PRESENT encryption is optimized at the cost of just 36 clock cycles. Output: 64-bit ciphertext C.
1: roundkey = (roundkey 1 , roundkey 2 , ..., roundkey 32 ) ← keySchedule(K) PRESENT-CTR with 32-bit counter is described in Figure 2. The 1-th to 16-th counters are indicated by a colored square. The 17-th to 32-th counters are indicated by symbol squares. During the encryption process, the colored symbol square, which can be shown in Permutation P1, represents part of being affected by a color square and symbol square.   The 8-bit data for each 4-bit color and 4-bit symbol parts of the initial 32-bit counter is distributed to the 16-bit data through permutation P0 and bitslicing-substitution process.

S S S S S S S S S S S S S S S S
Unlike the 16-bit counter case, the counter-part represented by the colored square and the counter-part represented by the symbol square interfere with each other during the bitslicing-substitution process. This can be seen in detail in Figure 2. When permutation (P1) is completed, the 16-bit data mixed by color and symbol is gathered in the color and symbolic order of the initial counter. This allows the pre-computation of 16-bit data through the 8-bit (4-bit color and 4-bit symbol) of the initial counter. At this time, the required lookup table size is 2048 bytes (= 4 color and symbol × 2 8 counter × 16-bit size of data). Unlike the 16-bit PRESENT-CTR implementation, 32-bit PRESENT-CTR implementation requires a huge look-up table (i.e., 2048). We placed a look-up table in ROM instead of RAM. The manufacture of AVR provides secure memory-based architecture (i.e., CryptoMemory; https: //www.microchip.com/design-centers/security-ics/mature-products/cryptomemory accessed on 7 January 2021). For real world implementation, we can utilize this technology. A detailed description of look-up table generation is given in Algorithm 6.
Algorithm 6: Generation of look-up tables for proposed PRESENT-CTR32 encryption.
Output: Look-up tables for 32-bit counter (LUT32 0 , LUT32 1 , LUT32 2 , LUT32 3 ). Similarly to the 16-bit counter, the encryption process of PRESENT-CTR mode can be optimized from the operation up-to the second add-round-key by using the created look-up table. Overall, this approach replaces the two permutation layers, two add-round-key, and one substitution layer to one look-up table accesses.
Algorithm 7 shows the proposed PRESENT-CTR32 implementation using a 32-bit counter. In Steps 2∼5, 16-bit data look-up with 8-bit (4-bit color and 4-bit symbol) counter is performed. Listing 2 shows the AVR assembly code for the 32-bit data look-up. The cost of looking-up 16-bit data is 10 clock cycles. This process is repeated 4 times. This is optimized at the cost of just 40 clock cycles.

Evaluation
In CHES'17, bitslicing-based PRESENT implementation was proposed [24]. It has been theoretically and practically proven that the bitslicing technique shows the best results in 32-bit or higher processors. However, bitslicing-implementation in an 8-bit AVR environment has not been explored before. In embedded devices, bitslicing optimizes the memory access for the substitution layer but it requires Boolean operations. The AVR microcontroller has 8-bit wise 32 general-purpose registers and it should be carefully optimized to achieve high performance in bitslicing implementation. We evaluated PRESENT-ECB and PRESENT-CTR implementations and compared them with former works. ATmega128 is selected as a microcontroller, which is one of the most popular AVR microcontrollers in wireless sensor networks. In the case of CTR mode, 16-bit counter and 32-bit counter versions are evaluated. The software was evaluated with Atmel Studio 7 and -Os option. Benchmarks are checked in clock cycles per byte which occurs when each mode of operation is called once.  [1,28]. On the other hand, the proposed PRESENT-ECB implementation uses almost the same RAM as the existing implementation, but only requires 504.2 clock cycles per byte. For the code size, the proposed implementation utilized two permutation operations (P0, P1). The code size is bigger than former works. Since the proposed PRESENT-CTR implementation is optimized further by utilizing pre-computation, the proposed PRESENT-CTR mode achieved a higher performance than the existing PRESENT-ECB mode. The code size of the CTR mode of operation is bigger than the ECB mode of operation, but it achieved 488.2, 488.7, and 491.6 CPB, for 16-bit counter (RAM), 16-bit counter (ROM), and 32-bit counter (ROM). In Table 2, the comparison of execution timing depending on the message size is given. The RAM based 16-bit counter mode of operation requires look-up table generation online. For this reason, performance is lower than the ROM-based 16-bit counter mode of operation. However, the RAM-based implementation outperforms when the length is over 8192 bytes. PRESENT implementations are publicly available at: https://github.com/solowal/PRESENT_AVR (accessed on 7 January 2021), where anyone can access PRESENT implementations. In Table 3, a comparison with other lightweight block cipher implementations on target-embedded processors is given. On the 8-bit AVR environment, previous PRESENT implementation using 128-bit key shows a lower performance than other lightweight cryptographic algorithms [28], since substitution and permutation layers of the PRESENT algorithm incurs considerable overheads in an 8-bit AVR environment. We achieved the execution timing improvement of target block cipher implementation to 504 clock cycles per byte in an 8-bit AVR environment. Therefore, we believe that our optimization results are not only actually usable from real 8-bit AVR microcontrollers but can be applied to various cryptographic application algorithms. Table 3. Comparison of other implementations on target embedded processors (AVR) in terms of timing (cycles per byte), RAM (bytes), and code size (bytes).

Algorithm
Plaintext

Conclusions
We presented compact ECB and CTR for PRESENT on embedded processors. The ECB mode of operation was efficiently implemented in an optimization of diffusion layer, substitute layer, and add-round-key operations. The operation was accelerated with precomputation in CTR. This new approach optimized away PRESENT operations by the substitution layer of second round. Finally, PRESENT block cipher on target microcontrollers consumed 504.2, 488.2, 488.7, and 491.6 CPB for ECB, 16-bit CTR (RAM-based implementation), 16-bit CTR (ROM-based implementation), and 32-bit CTR (ROM-based implementation) modes of operation, respectively.