Chaining Optimization Methodology: A New SHA-3 Implementation on Low-End Microcontrollers

Abstract: Since the Keccak algorithm was selected by the US National Institute of Standards and Technology (NIST) as the standard SHA-3 hash algorithm to replace the currently used SHA-2 algorithm in 2015, various optimization methods have been studied in parallel and hardware environments. However, in a software environment, the SHA-3 algorithm is much slower than the existing SHA-2 family; therefore, SHA-3 is rarely used in constrained environments built on embedded devices, such as Wireless Sensor Network (WSN) environments. In this article, we propose a software optimization method that can be used generally to break through the speed limit of SHA-3. We combine the θ, π, and ρ processes into one, reducing memory access to the internal state more efficiently than conventional software methods. In addition, we present a new SHA-3 implementation for the proposed method in the most constrained environment, the 8-bit AVR microcontroller. This new implementation method, which we call the chaining optimization methodology, implicitly performs the π process of the f-function while minimizing memory access to the internal state of SHA-3. Through this, it achieves up to 26.1% performance improvement compared to the previous implementation on an AVR microcontroller and narrows the performance gap with the SHA-2 family as far as possible. Finally, we apply our SHA-3 implementation to the Hash_Deterministic Random Bit Generator (Hash_DRBG), one of the higher-level algorithms built on a hash function, to demonstrate the applicability of our chaining optimization methodology on 8-bit AVR MCUs.


Introduction
Due to the increase in the use of IoT devices, many applications have been designed for them, so many algorithms are implemented on low-powered devices [1]. However, due to the weaknesses of communication in Wireless Sensor Networks (WSNs), the security of transmitted messages is easily compromised. Since a cryptographic hash function can guarantee the integrity of a transmitted message, we can use it to secure communication in WSNs. Moreover, the function can be used as a core technique in security applications such as PBKDF2, HMAC, DRBG, and digital signature algorithms (including DSA, ECDSA, and RSA-PSS). Recently, some weaknesses of standard hash functions (e.g., SHA-1 or SHA-2) have been discovered. Therefore, the US National Institute of Standards and Technology (NIST) recommends not using SHA-1 [2][3][4][5][6]. Generally applicable attacks concerning the SHA-2 family were introduced before 2008. The two functions SHA-1 and SHA-2 have different structures, but they share the same underlying algorithm (i.e., SHA) [7,8]. Therefore, they share the same vulnerability. In some scenarios, the security of SHA-2 is the same as that of SHA-1, except that SHA-2 uses larger inputs and outputs [7,8]. A number of preimage attacks on SHA-2 have been found [9][10][11]. Fortunately, before the weakness of SHA-2 against preimage attacks was found, NIST held the SHA-3 competition. The Keccak algorithm was selected as SHA-3, the new standard hash function [12].

The Contribution of This Paper Is as Follows:
• Proposing an efficient reduced memory access method for fast SHA-3 implementation
In this article, we analyze the SHA-3 algorithm in detail from an implementation point of view. We propose a new method to reduce memory access to the internal state of SHA-3 by calculating the number of memory accesses and analyzing the characteristics of each process while the f-function is operating. Our technique of combining the three processes performs the f-function efficiently without breaking the security of SHA-3. Moreover, our method is a generic method that can be applied to both low-end and high-end processors such as 8-bit AVR, 32-bit ARM, CPU, and GPGPU.

• Presenting the chaining optimization methodology on 8-bit AVR MCUs
We present a new SHA-3 implementation methodology called the chaining optimization methodology. Through the chaining optimization methodology, the θ, π, and ρ processes of the f-function are combined into one, and, by using this, the memory access to the internal state of SHA-3 is reduced as far as possible. Based on the chaining optimization methodology, our SHA-3 software implements the π process implicitly by using registers efficiently. This shows a performance improvement of up to 26.1% over the previous best implementation. In addition, as far as we know, our software achieves the fastest performance on an 8-bit AVR microcontroller.

• Presenting an optimized Hash_DRBG in the 8-bit AVR environment
We demonstrate the applicability of our SHA-3 software by using it in Hash_DRBG on 8-bit AVR MCUs. In addition, we propose an optimized implementation of Hash_DRBG to reduce the performance gap with the existing SHA-2 implementation in the AVR environment. Because the same input data are used repeatedly in the derivation function of Hash_DRBG, three operations are omitted from the first round of the f-function. The lookup table used for this is only 160 bytes and has the advantage that it can be created during SHA-3 operation. By using this, the proposed Hash_DRBG implementation achieves the fastest performance among Hash_DRBG implementations using SHA-3.

Hash Function for Service Sustainability in Embedded Systems
As the development of 5G communication accelerates and 6G communication is being developed, the communication-based Information Technology (IT) industry is responsible for developing fundamental technologies in a wide range of fields such as economic growth, social integration, and environmental preservation. In order to design sustainable development based on close exchanges between these fields, the IT engineering-based industry must be solid. Security technology is the basis of the IT industry, and so far it has supported the security of the IT industry through encryption technology in various network communications. When parties communicate with each other, verifying the integrity of data is essential for both data storage and transmission. The hash function is an algorithm that verifies integrity, and it is used in all cryptographic protocols. As the Internet of Things (IoT) environment develops, the embedded-based communication industry is becoming more active. Therefore, the hash function must be used for service sustainability in embedded systems. For integrity verification, mainly SHA-1 and SHA-2 have been used so far. However, as vulnerabilities in SHA-1 have been discovered, the National Institute of Standards and Technology (NIST) has recommended using SHA-3. Up to now, the migration to SHA-3 in crypto-engineering and industry has been inadequate due to the limitations of its software performance. The SHA-3 software optimization method proposed in this article can be applied for service sustainability in various fields in the future. In addition, with the development of the quantum computing environment, NIST has held a competition for the standardization of Post Quantum Cryptography (PQC). Since most of the algorithms currently submitted in Round 3 use the SHA-3 algorithm, our research is future-oriented and can contribute to the crypto-based communication industry [21][22][23][24].

Extended Version of ICISC'20
In this paper, we extend our previous work published in ICISC'20 [25]. In ICISC'20, it was difficult to describe the proposed method and implementation methodology in detail due to page limitations. In this article, we have modified our software and re-established the implementation methodology through detailed explanations. Compared to ICISC'20, the algorithm of the implementation methodology has been completely modified, and the modified code gives a performance improvement over ICISC'20. We also added Hash_DRBG to show that our implementation methodology can be applied to it.
The remainder of this paper is organized as follows. In Section 2, we explain the background of SHA-3 and Hash_DRBG. Section 3 contains an analysis of related work concerning SHA-3's effects on the AVR environment. Section 4 proposes a new optimization method for SHA-3 and Hash_DRBG. Section 6 evaluates our software. Finally, Section 7 concludes the article.

Overview of SHA-3
In 1993, NIST published Secure Hash Algorithm 0 (SHA-0). Subsequently, two standard hash functions, SHA-1 and SHA-2, were proposed. However, Stevens et al. broke the security of SHA-1 by finding a collision [2]. Therefore, NIST has proposed a new hash function standard and redefined the Keccak algorithm of Bertoni et al. as SHA-3, a new standard function [12]. Differently from SHA-2, the Keccak algorithm has a new type of structure, the so-called sponge structure. As a result, the Keccak algorithm is not vulnerable to known attacks that are applied to SHA-2 [12].

Sponge Structure
In Figure 1, we describe the structure of SHA-3. SHA-3 uses a sponge structure composed of two processes, the absorbing process and the squeezing process. The first process compresses the input value using the permutation function f. For each iteration, an exclusive-OR (XOR) operation is applied to some part of the output of f and the padded input. The digest is computed in the squeezing process. In SHA-3, b is the size of the state in bits, so f is a b-bit permutation, and b is fixed with b ∈ {25, 50, 100, 200, 400, 800, 1600}. Here, b is composed of the bitrate r and the capacity c, and satisfies b = r + c. If the digest length is longer than r, then f is applied to change the internal state. In this paper, we use the following parameters: b = 1600, r = 1088, and c = 512, and the digest length is 256 bits. Note that these values are widely used [12].
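The relation b = r + c and its effect on the number of f-function calls can be illustrated with a short Python sketch. The helper names `sponge_parameters` and `permutation_calls` are hypothetical, and the block count assumes a byte-aligned message whose padding always occupies at least one byte, so it is an illustration rather than part of the proposed implementation.

```python
def sponge_parameters(b: int, c: int) -> dict:
    """Derive the bitrate r from the state size b and the capacity c (b = r + c)."""
    r = b - c
    return {"b": b, "r": r, "c": c, "rate_bytes": r // 8}


def permutation_calls(message_len: int, rate_bytes: int) -> int:
    """Number of f-function calls needed to absorb a byte-aligned message.

    Padding adds at least one byte, so a message that exactly fills
    a block still needs one extra block.
    """
    return message_len // rate_bytes + 1


# Parameters used in this paper: b = 1600 and c = 512 give r = 1088 bits.
params = sponge_parameters(b=1600, c=512)
calls = {n: permutation_calls(n, params["rate_bytes"]) for n in (50, 100, 136, 500)}
print(params)
print(calls)
```

With these parameters one call of f absorbs 136 bytes, and a 256-bit digest fits in a single squeezed block, so no further calls are needed for the output.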

State of SHA-3
The state is used as the input of the function f. Recall that f is the core component of SHA-3. Therefore, it is important to understand the structure of the state in order to understand SHA-3. In Figure 2, the structure of the state is described. The state is a three-dimensional array of size x × y × z with |x| = |y| = 5. The state consists of 25 lanes. The length of each lane is determined by b, w, and l. According to b, the state is composed of 5 × 5 × w bits. Some parameter settings are shown in Table 1. As seen in the table, $l = \log_2(b/25)$ and $w = 2^l$. In SHA-3, the function f is repeated $n_r$ times, where $n_r = 12 + 2l$. The f-function is a b-bit permutation which consists of five processes: θ, π, ρ, χ, and ι. Based on these processes, the state is updated by repeating the round of the f-function $n_r$ times.
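As a quick check of these relations, the following Python sketch derives w, l, and n_r from b for every permitted state size. The function name `state_parameters` is an assumption made for illustration.

```python
def state_parameters(b: int) -> tuple:
    """Return (w, l, n_r) for a state of b bits: w = b/25, l = log2(w), n_r = 12 + 2l."""
    w = b // 25
    l = w.bit_length() - 1      # exact log2, since w is a power of two
    n_r = 12 + 2 * l
    return w, l, n_r


for b in (25, 50, 100, 200, 400, 800, 1600):
    w, l, n_r = state_parameters(b)
    print(f"b={b:4d}  w={w:2d}  l={l}  n_r={n_r}  state_bytes={b / 8}")
```

For b = 1600 this gives 64-bit lanes, l = 6, and 24 rounds, which are the values used in the rest of the paper.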
In the θ process, each bit in the state is XORed with the parities of two columns in the array. Let us consider the case where the θ process is performed for the bit $(x_0, y_0, z_0)$. In this case, two columns, of which the x-coordinates are $(x_0 + 1) \bmod 5$ and $(x_0 - 1) \bmod 5$, are used. In this process, the column with x-coordinate $(x_0 - 1) \bmod 5$ keeps the z-coordinate $z_0$, while $(z_0 - 1) \bmod w$ is assigned to the z-coordinate of the column of which the x-coordinate is $(x_0 + 1) \bmod 5$ [12]. We describe the procedure in Algorithm 1. In Line 2, D[x, z] is computed. Note that this part of the procedure is called the initial θ.

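Algorithm 1. The θ process [12]
1: C[x, z] ← A[x, 0, z] ⊕ A[x, 1, z] ⊕ A[x, 2, z] ⊕ A[x, 3, z] ⊕ A[x, 4, z]
2: D[x, z] ← C[(x − 1) mod 5, z] ⊕ C[(x + 1) mod 5, (z − 1) mod w]
3: A[x, y, z] ← A[x, y, z] ⊕ D[x, z]
4: return A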
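The θ process can be sketched in Python on a 5 × 5 array of 64-bit lanes. This is a minimal illustration following the FIPS 202 definition; the names `rol`, `initial_theta`, and `theta` are hypothetical, and the lane indexing A[x][y] is an assumption of this sketch.

```python
MASK = (1 << 64) - 1


def rol(v: int, n: int) -> int:
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def initial_theta(A):
    """Compute D[x] from the column parities C[x] (the initial theta)."""
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    # Bit z of D[x] combines bit z of C[x-1] with bit z-1 of C[x+1].
    return [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]


def theta(A):
    """Apply theta: every lane in column x is XORed with D[x]."""
    D = initial_theta(A)
    return [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]


# A single set bit spreads to the two neighbouring columns.
A = [[0] * 5 for _ in range(5)]
A[0][0] = 1
changed_bits = sum(bin(lane).count("1") for column in theta(A) for lane in column)
print(changed_bits)
```

Splitting the computation into `initial_theta` and the final XOR mirrors the distinction between the initial θ and the rest of the θ process used in the memory access analysis.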
In Figure 3, the π process is described. In the process, the values of lanes are rearranged. Here, S[i] (i ∈ [0, 24]) is the i-th lane of the state. Note that S[12] is the lane at (x = 0, y = 0) in the state [12]. In the ρ process, each lane is right-rotated. The size of the rotation is called the offset. The offset is determined by the x and y coordinates as seen in Table 2 [12]. The z-coordinate is modified by adding the offset, where the lane size is used as the modulus. The effect of the χ process is to XOR each bit with a nonlinear function of two other bits in its row [12]. Note that the difference between the χ process and the other processes (i.e., the θ, π, and ρ processes) is that the χ process operates on rows and must be implemented accordingly.
The ι process executes an XOR operation for the lane of (x = 0, y = 0) of the state and constants RC [12]. Since the ι process operates only one lane, in most implementations, the χ and ι processes are combined into a single process.
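To make the five processes concrete, the following Python sketch implements the full permutation for b = 1600 in the order θ, π ∼ ρ, χ ∼ ι and checks it against `hashlib.sha3_256` from the standard library. It follows the FIPS 202 definitions rather than the AVR code of this paper; the function names and the A[x][y] lane layout are assumptions of the sketch.

```python
import hashlib

MASK = (1 << 64) - 1


def rol(v: int, n: int) -> int:
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def keccak_f1600(A):
    """24 rounds on a 5x5 array of 64-bit lanes, indexed A[x][y]."""
    lfsr = 1
    for _ in range(24):
        # theta: column parities, then XOR D[x] into every lane of column x
        C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]
        A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # pi ~ rho: rotate each lane and move it to its new position
        x, y = 1, 0
        current = A[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, A[x][y] = A[x][y], rol(current, (t + 1) * (t + 2) // 2)
        # chi: nonlinear step, applied row by row
        for y in range(5):
            row = [A[x][y] for x in range(5)]
            for x in range(5):
                A[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR the round constant into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                A[0][0] ^= 1 << ((1 << j) - 1)
    return A


def sha3_256(msg: bytes) -> bytes:
    rate = 136                                    # r = 1088 bits
    padded = bytearray(msg) + bytearray(rate - len(msg) % rate)
    padded[len(msg)] ^= 0x06                      # SHA-3 suffix and first pad bit
    padded[-1] ^= 0x80                            # last pad bit
    A = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), rate):       # absorbing process
        block = padded[off:off + rate]
        for i in range(rate // 8):
            A[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        A = keccak_f1600(A)
    # squeezing process: 256 bits fit in the first four lanes
    return b"".join(A[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


message = bytes(range(250)) * 2                   # a 500-byte test message
print(sha3_256(message) == hashlib.sha3_256(message).digest())
```

The ι constants are generated here with a small shift register instead of a table of round constants, purely to keep the sketch short.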

Overview of Hash_DRBG
The security of the cryptographic algorithms used in a cryptographic protocol is proved on the premise that a perfect random number generator is used. However, since it is practically impossible to implement an ideal random number generator in embedded devices such as 8-bit AVR microcontrollers and the 32-bit ARM Cortex series, pseudo-random numbers are generated by using a Deterministic Random Bit Generator (DRBG). Among DRBGs, there are two types that use a hash function: Hash_DRBG and HMAC_DRBG. In the case of HMAC_DRBG, the Hash-based Message Authentication Code (HMAC), an application algorithm of the hash function, is used as the core algorithm; therefore, it is inherently slower than Hash_DRBG and uses more memory [26]. Therefore, for cryptographic protocols, it is recommended to use Hash_DRBG in the constrained 8-bit AVR environment [27]. Figure 4 shows an overview of Hash_DRBG, and Table 3 shows the specification of the parameters used in Hash_DRBG. Hash_DRBG extracts random bits while updating the operational status consisting of V and C. The length of V and C is the seedlen of each hash function. The initial setting of the operational status is completed by using the derivation function in the instantiate function. Afterward, random bits are extracted using the operational status in the generate function. After the random bits are extracted, the value of V is updated through the hash function. Figure 5 shows the structure of the derivation function. The derivation function extracts V and C for the initial instance setting. It updates V and C by running the hash function len_seed times on the input data. For instance, in the case of Hash_DRBG using SHA3-512, len_seed is 3, so the hash function is run three times on the input data. The reseed function updates the operational status when the generate function has been called multiple times, and it has the same structure as the derivation function.
The extraction function is the process of extracting the actual random bits. It receives V as an input and extracts as many random bits as the user wants using the hash function. At this time, when the hash function is used multiple times, the value of V increases by 1. Finally, when random bit extraction is finished, operational status is updated.
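The derivation and extraction functions can be sketched in Python in the style of Hash_df and Hashgen from NIST SP 800-90A. This is an illustration under stated assumptions: SHA3-256 from `hashlib` is used as the hash function, seedlen is taken as 440 bits (the value SP 800-90A lists for 256-bit hash functions, which may differ from Table 3), and the input strings are placeholders.

```python
import hashlib


def hash_df(input_string: bytes, bits_to_return: int, hash_fn=hashlib.sha3_256):
    """Derivation function: hash (counter || length || input) repeatedly.

    Returns the derived bytes and, for illustration only, the list of
    inputs that were fed to the hash function.
    """
    outlen = hash_fn().digest_size * 8
    calls = -(-bits_to_return // outlen)          # ceiling division
    hash_inputs, temp = [], b""
    for counter in range(1, calls + 1):
        data = bytes([counter]) + bits_to_return.to_bytes(4, "big") + input_string
        hash_inputs.append(data)
        temp += hash_fn(data).digest()
    return temp[:bits_to_return // 8], hash_inputs


def hashgen(V: bytes, requested_bytes: int, hash_fn=hashlib.sha3_256) -> bytes:
    """Extraction function: hash V, V + 1, V + 2, ... until enough bytes exist."""
    data = int.from_bytes(V, "big")
    out = b""
    while len(out) < requested_bytes:
        out += hash_fn(data.to_bytes(len(V), "big")).digest()
        data = (data + 1) % (1 << (8 * len(V)))
    return out[:requested_bytes]


SEEDLEN_BITS = 440                                # assumed value, see lead-in
seed_material = b"entropy-input" + b"nonce" + b"personalization"
V, hash_inputs = hash_df(seed_material, SEEDLEN_BITS)
random_bytes = hashgen(V, 100)
print(len(V), len(hash_inputs), len(random_bytes))
```

In this sketch the successive inputs of the derivation function differ only in the leading counter byte, which is the repeated-input property that the optimized Hash_DRBG exploits.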

Overview of 8-Bit AVR MCUs
The 8-bit AVR microcontroller is an embedded device made into a single integrated circuit by adding memory and I/O to the microprocessor. Recently, the microcontroller has been widely used in many applications for WSNs [1]. The 8-bit AVR microcontroller's commands consist of operation codes and an operand. In Table 4, we summarize some commands and related information for the commands that are used in this paper. We focus on the ATmega128 since it is widely used for sensor nodes [28]. The target device has the following resources: 128 KB of flash memory, 4 KB of SRAM, and 4 KB of EEPROM. The device supports a throughput of 16 MIPS at 16 MHz and operates between 4.5 and 5.5 volts [29]. The AVR MCU has 32 8-bit general-purpose registers, which are used for various purposes, e.g., basic private operations and bit operations. Specifically, the R26-R31 registers can be combined and used as three 16-bit registers, i.e., the X, Y, and Z registers. These registers (X, Y, and Z) are used as pointers to indirectly specify a 16-bit address for data memory. The status register (SREG) shows the status and result after Arithmetic Logic Unit (ALU) calculations.

Related Works of SHA-3 Implementation on 8-Bit AVR Environment
The Keccak algorithm has been widely implemented in various environments, including embedded processors, since its selection as the SHA-3 standard in 2012. It is widely known that the hardware implementation of SHA-3 has the advantage of faster execution than that of SHA-1 and SHA-2 [32]. However, with respect to software implementation, SHA-3 is much slower than existing hash algorithms, including SHA-2 [13][14][15][16][17]. Currently, existing SHA-3 software on a variety of IoT devices, including 8-bit AVR MCUs, is implemented according to the pseudo-code of the NIST SHA-3 standard [13][14][15][16]. The existing implementations following the pseudo-code of the standard compute π and ρ in a combined way as π ∼ ρ rather than computing them separately, because the rotate operation (ρ) can be embedded in the process of the π computation. Note that software implementations following the standard execute the f-function in the following order: θ process → π ∼ ρ process → χ ∼ ι process.
This execution time of SHA-3 is almost seven times that of the SHA-2 software implemented by Otte et al. [15]. Thus, we focused on analyzing Balasch et al.'s SHA-3 implementation, written with a combination of C and assembly code, because it is the fastest SHA-3 implementation on 8-bit AVR MCUs [16]. Balasch et al.'s SHA-3 software (the version producing a 256-bit digest) takes 716,483 clock cycles when computing the digest of a 500-byte message, which corresponds to a hash rate of 1432 clock cycles per byte (CPB).
Balasch et al. introduced a new shift-rotation strategy for a faster π ∼ ρ process. Note that we need shift-rotations by 1 bit to the left (ROL) and to the right (ROR). In SHA-3, a single lane is 64 bits in length when b = 1600 and w = 64. In the ρ process, we need eight registers to rotate a 64-bit lane. Let LSL be a 1-bit logical left-shift, ADC be an addition with carry, BST be a 1-bit store to T in SREG, and BLD be a 1-bit load from T in SREG. Generally, in an 8-bit AVR MCU, a 1-bit left-rotation is implemented using an LSL followed by seven ROLs and an ADC [33]. Similarly, a 1-bit right-rotation is implemented using a BST followed by eight RORs and a BLD. For 1 < n < 8, an n-bit rotation of 64-bit data can be performed by repeating the 1-bit rotation of the 64-bit data n times. However, the cost of a dedicated operation for n-bit rotation is not equal to the cost of an n-bit rotation implemented by repeating 1-bit rotations. Therefore, we need a dedicated method for n-bit rotation or (64 − n)-bit rotation for efficient implementation [33]. Note that, for n > 8, the execution time of n-bit shift-rotations can be reduced to 40 clock cycles or fewer if x = (x ≫ n) operations are replaced by x = (x ≫ n%8). In this process, the x = (x ≫ n/8) part is realized by directly choosing the position at which each byte is stored in memory. When storing to memory (the ρ process), the implementation of Balasch et al. combines the π and ρ processes into a single process (π ∼ ρ). Note that Balasch et al. implemented SHA-3 based on the order of the standard implementation [12].
While Balasch et al.'s technique gives better performance on 8-bit AVR MCUs, SHA-2 is still much faster than SHA-3. This is caused by the fact that accessing memory is more expensive than arithmetic operations in low-end processors. Moreover, the state in SHA-3 (for b = 1600) requires at least 200 bytes. Note that this memory requirement is very heavy compared with ordinary symmetric ciphers, where only 128 bits are required for the state. Therefore, it is important to minimize the amount of memory access to the state.
In Table 5, we summarize all memory accesses to the state when SHA-3 is implemented as recommended by NIST. Here, π ∼ ρ means that π and ρ are combined, and θ ∼ ρ means that θ and ρ are combined. The χ and ι processes are also combined with the same logic. The goal of the initial θ is to create D[x, z] in Algorithm 1. Note that the initial θ is a part of the θ process. In Table 5, we can see that the state is accessed 3, 2, and 2 times during the θ, π ∼ ρ, and χ ∼ ι processes, respectively. Hence, the well-known SHA-3 implementation needs seven memory accesses to the state per round. For b = 1600, w = 64, and l = 6, the length of the state is 200 bytes (= 25 × 64/8). In addition, all processes are repeated 24 times (12 + 2 × 6) in the f-function. Hence, each execution of the f-function requires 168 (= 24 × 7) memory accesses. In low-powered embedded devices, frequent memory accesses can cause low performance. Hence, in the following section, we introduce a new strategy to reduce the number of memory accesses without additional computation or lookup tables.
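The split of an n-bit rotation into a bit part (n mod 8) and a byte part (n div 8) can be modelled in Python on eight byte-sized "registers". This is a behavioural sketch, not AVR code: the function names are hypothetical, the rotation direction is taken as left, and the threshold of four 1-bit steps is an assumption based on the 9 and 10 cycle costs of the two 1-bit rotations.

```python
MASK = (1 << 64) - 1


def rol1_bytes(regs):
    """1-bit left rotation: LSL on the lowest byte, ROL on the rest, ADC at the end."""
    out, carry = [], 0
    for b in regs:                        # regs[0] is the least significant byte
        v = (b << 1) | carry
        carry = v >> 8
        out.append(v & 0xFF)
    out[0] |= carry                       # the ADC folds the final carry into bit 0
    return out


def ror1_bytes(regs):
    """1-bit right rotation: BST saves bit 0, eight RORs shift, BLD restores it."""
    t = regs[0] & 1                       # BST
    out, carry = list(regs), 0
    for i in range(7, -1, -1):            # ROR from the most significant byte down
        v = out[i]
        out[i] = (v >> 1) | (carry << 7)
        carry = v & 1
    out[7] = (out[7] & 0x7F) | (t << 7)   # BLD
    return out


def rol64_bytes(regs, n):
    """Rotate left by n bits using at most four 1-bit steps plus a byte renaming."""
    byte_shift, bit_shift = divmod(n % 64, 8)
    if bit_shift <= 4:
        for _ in range(bit_shift):
            regs = rol1_bytes(regs)
    else:                                 # fewer steps in the opposite direction
        for _ in range(8 - bit_shift):
            regs = ror1_bytes(regs)
        byte_shift += 1
    # A rotation by a multiple of 8 bits only changes which register is which.
    return [regs[(i - byte_shift) % 8] for i in range(8)]


lane = 0x0123456789ABCDEF
rotated = rol64_bytes(list(lane.to_bytes(8, "little")), 45)
print(hex(int.from_bytes(bytes(rotated), "little")))
```

The final list comprehension corresponds to choosing the store position of each byte, which is why the byte part of the rotation costs no extra instructions.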

Related Works of DRBG Implementation on 8-Bit AVR Environment
As far as we know, Hash_DRBG and HMAC_DRBG, which are based on hash functions, have not been studied in 8-bit AVR environments. However, research on the block-cipher-based CounTeR_DRBG (CTR_DRBG) has been actively conducted recently [27,34]. Based on the fact that the nonce, which is the input data of CTR mode, is used repeatedly when a session is generated, a method was proposed that optimizes CTR_DRBG by generating a lookup table for the common parts of the initial few rounds of the block cipher. In addition, the instantiate function of the DRBG is efficiently compressed by using the characteristic of CTR_DRBG that the initial operational status is zero [27].

Proposed Technique for Efficient SHA-3 Implementations in 8-Bit AVR MCUs
In this section, we propose an efficient implementation technique for optimized SHA-3 execution on 8-bit AVR MCUs. As mentioned in Section 3, memory accesses to the state take more clock cycles than arithmetic operations on 8-bit AVR MCUs. Thus, it is important to arrange the use of the available general-purpose registers efficiently in order to optimize state and memory accesses during the computation of SHA-3.

Main Idea
The main idea is to have the π process executed implicitly at minimum cost and to integrate the θ and ρ processes into a single process (the θ ∼ ρ process) so as to reduce the number of memory accesses to the state. Figure 6 depicts the overview of the proposed SHA-3 implementation technique. From the figure, it is noticeable that, since each lane of the state is processed individually in the θ process, the ρ process can be applied to each lane. Namely, after computing D[x, z] through the initial θ, we can compute the remaining θ and ρ processes together (the remaining θ after the initial θ is computed with ρ as θ ∼ ρ). When computing the θ ∼ ρ process, two memory accesses (load and store) are required to update the state. The π process can be executed implicitly while executing the θ ∼ ρ process because it only changes the position of each lane in the state without updating any values. In summary, our implementation computes the π process implicitly when storing the updated state in memory. Table 6 compares the number of memory accesses to the state between the proposed method and the previous implementation following the standard method when computing the f-function. The proposed method reduces the number of memory accesses per round from 7 to 5, which is a saving of almost 28.57%. Each memory access covers the whole 200-byte state, and the f-function with the proposed method requires 120 (= 24 × 5) memory accesses; our proposed technique therefore saves 48 memory accesses in total compared with the previous implementation methods (168 = 24 × 7).
Implementations of SHA-3 on 8-bit AVR MCUs depend on the value of b. The value b of the SHA-3 configuration determines the type of implementation on 8-bit AVR MCUs. For example, if b is less than 200 bits (namely, w is less than seven), the registers on the MCU can hold all lanes of the state, whereas, in the case of 400, 800, or 1600 bits (where a single lane is 2, 4, and 8 bytes, respectively), it is hard to hold all lanes of the state in the 8-bit AVR's general-purpose registers. Therefore, how well memory access to the state is optimized with the available registers determines the overall performance of a SHA-3 implementation when the lane is larger than 8 bits. Our implementation focuses on optimizing the performance of SHA-3 when b = 1600 because it is the most common configuration value (in the Korean Cryptographic Module Validation Program (KCMVP) [35], 1600 bits is used for b).
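The access counts above can be reproduced with a few lines of Python. The per-process split for the proposed method (1 for the initial θ, 2 for θ ∼ ρ with implicit π, 2 for χ ∼ ι) is inferred from the description, and the cycle figure is only a lower bound that assumes two clock cycles per byte accessed.

```python
ROUNDS = 24                  # n_r for b = 1600
STATE_BYTES = 200            # 25 lanes of 8 bytes
MIN_CYCLES_PER_BYTE = 2      # assumed lower bound for one load or store

# Passes over the whole state per round.
standard = {"theta": 3, "pi~rho": 2, "chi~iota": 2}
proposed = {"initial theta": 1, "theta~rho (pi implicit)": 2, "chi~iota": 2}

standard_total = ROUNDS * sum(standard.values())
proposed_total = ROUNDS * sum(proposed.values())
saved = standard_total - proposed_total
saving_percent = round(100 * saved / standard_total, 2)
min_cycles_saved = saved * STATE_BYTES * MIN_CYCLES_PER_BYTE

print(standard_total, proposed_total, saved, saving_percent, min_cycles_saved)
```

Under these assumptions, each call of the f-function avoids at least 19,200 clock cycles of load and store instructions; the measured gain can be larger because address arithmetic around each access is saved as well.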

Figure 6. Overview of the proposed SHA-3 implementation technique: θ process (initial θ), θ process (load), π process (implicitly), and ρ process (store).
From a crypto-engineering point of view, the advantage of our method is that it escapes the typical trade-off relationship. Until now, research has been conducted to optimize and implement various ciphers in various environments. These implementations were accelerated by generating lookup tables for skipping specific rounds and repetitive parts. However, since this increases the usage of flash memory or SRAM, it does not escape the trade-off relationship; therefore, it is necessary to consider whether it can be used from the perspective of the actual crypto-industry. Our SHA-3 optimization technique increases performance without using additional computational tables. This fundamental method of reducing memory access does not affect the security of SHA-3 from a theoretical point of view, as it does not change the input/output of the hash function and implements the SHA-3 mechanism of the standard document. In addition, unlike the existing and standard implementations, our implementation does not use tables to run the π and ρ processes. In other words, our implementation uses neither the lookup table that the existing implementations inevitably used nor any additional table, while still raising the performance to the top. In terms of code size, there is a difference of about 1 KB, but since it occupies only about 4% of our target platform, the ATmega128, it hardly affects the actual overall performance.

Proposed Assembly Code on 8-Bit AVR MCUs
In this section, we present algorithms and code which implement our proposed method in an 8-bit AVR environment. There are some things to consider in order to integrate the θ ∼ ρ process into one for the implicit execution of the π process in the AVR environment. As mentioned in Section 4.1, in order to minimize memory accesses, once a lane is loaded into registers, the θ ∼ ρ process needs to be completed before the lane is stored back into memory. In other words, we apply the θ ∼ ρ process to the 25 lanes and execute the π process through the choice of store address rather than through an actual operation. For example, one lane is loaded into the registers to execute the θ ∼ ρ process, and the result is stored at the memory location given by the π process. However, if the lane to which the θ ∼ ρ(π) process has been applied is immediately stored in memory, the lane originally at that location is overwritten, and the θ ∼ ρ process can no longer be applied to it. Thus, the lane at the destination needs to be loaded into registers before storing. For efficiency, our implementation processes the lanes in the order given by the position index of the π process. That is, when storing the lane to which the θ ∼ ρ process has been applied at the index given by the π process, the next lane is first loaded into the registers. Therefore, it is possible to implicitly perform the π process while performing the θ process for the lanes in turn. We call this implementation technique the "chaining optimization methodology". This new SHA-3 implementation shows that our proposed optimization method can be effectively applied on 8-bit AVR microcontrollers. The chaining optimization methodology efficiently alternates between two lanes and executes the π process without an additional lookup table or additional operations, whereas the previous π process implementation made use of a table containing position information and of additional operations.
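The chaining order can be modelled in Python, independently of the AVR assembly, to check that one load and one store per lane reproduce the result of separate θ, ρ, and π passes. The sketch below is a behavioural model under stated assumptions: lanes are Python integers indexed A[x][y], the two-element list `bank` stands in for the two register banks that are used alternately, the rotation offsets are the FIPS 202 values reduced modulo 64, and all function names are hypothetical.

```python
MASK = (1 << 64) - 1
# rho offsets modulo 64, indexed OFFSETS[x][y] (FIPS 202 values)
OFFSETS = [[0, 36, 3, 41, 18], [1, 44, 10, 45, 2], [62, 6, 43, 15, 61],
           [28, 55, 25, 21, 56], [27, 20, 39, 8, 14]]


def rol(v, n):
    n %= 64
    return ((v << n) | (v >> (64 - n))) & MASK


def initial_theta(A):
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    return [C[(x - 1) % 5] ^ rol(C[(x + 1) % 5], 1) for x in range(5)]


def separate_steps(A):
    """Reference: theta, rho, and pi as three full passes over the state."""
    D = initial_theta(A)
    T = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    R = [[rol(T[x][y], OFFSETS[x][y]) for y in range(5)] for x in range(5)]
    return [[R[(x + 3 * y) % 5][x] for y in range(5)] for x in range(5)]


def chained(A):
    """theta ~ rho with implicit pi: one load and one store per lane, in place."""
    D = initial_theta(A)
    A[0][0] ^= D[0]                        # lane (0, 0): offset 0, not moved by pi
    x, y = 1, 0
    bank = [A[x][y] ^ D[x], 0]             # theta is applied while loading
    for t in range(24):
        nx, ny = y, (2 * x + 3 * y) % 5    # where pi sends lane (x, y)
        cur, nxt = t % 2, (t + 1) % 2
        bank[nxt] = A[nx][ny] ^ D[nx]      # load the next lane before it is overwritten
        A[nx][ny] = rol(bank[cur], OFFSETS[x][y])   # rho on store, pi via the address
        x, y = nx, ny
    return A


state = [[(0x9E3779B97F4A7C15 * (x + 5 * y + 1)) & MASK for y in range(5)]
         for x in range(5)]
expected = separate_steps(state)
result = chained([column[:] for column in state])
print(result == expected)
```

Because the π mapping links the 24 lanes other than (0, 0) into a single cycle, following it from lane (1, 0) visits each lane exactly once, so no position table is needed in this model.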
Algorithm 2 shows the execution of the θ ∼ ρ process for the first five lanes, in the index order of the π process, among the 25 lanes. Note that the π process is implicitly executed. We make use of some macro functions in order to efficiently integrate the θ and ρ processes. Algorithm 3 is the macro function LD_state used in Algorithm 2, and it executes the θ process.
Table 7 lists the macro codes that perform the ρ process for one lane stored in registers. The ρ process performs a right rotate-shift operation according to the offset in Table 2. When b = 1600, the size of a lane is 64 bits. Thus, the offset of the rotation-shift is less than 64. Since AVR's rotation instruction only supports a single register, a combination of instructions is required to conduct the 1-bit rotation operation for one 64-bit lane held in eight registers. Rotating 64-bit data by 1 bit to the right (resp. left) incurs 10 (resp. 9) clock cycles. Therefore, when rotating right with an n-bit (0 < n ≤ 3) offset, it is effective to use ROR_1bit_state n times, and, when using an m-bit (4 ≤ m < 8) offset, it is effective to use ROL_1bit_state m times. In 8-bit AVR MCUs, the 8-bit rotate-shift operation can be conducted at no cost by directly allocating the position of the register rather than by actual shift arithmetic. Therefore, when updating the value of the lane in memory, rotate-shift operations for offsets that are a multiple of eight can be efficiently computed by changing the register order. The macro codes storing the result in memory are ST_RORkbytes_state (k ∈ [0, 7]), and, like Algorithm 3, they include code that changes the address value. Algorithm 2 is the assembly code that implements the chaining optimization methodology in the AVR environment by efficiently using the previously presented macro functions and codes. The X registers (R24:R25) are used as the address offset for the state and D[i, z], so that the desired data are accessed during load and store operations.
The θ process is applied to a lane while it is loaded from memory into the registers, and the ρ process is applied when it is stored. At this time, the next lane in the order of the π process is loaded into the registers before storing. As shown in Figure 7, each lane is alternately held in either R8-R15 or R16-R23. The proposed chaining optimization methodology greatly reduces the number of memory accesses to the internal state. Since the memory access cost for 8-bit data is at least two clock cycles in the AVR microcontroller, removing two memory accesses per round over the 200 bytes of the internal state of SHA-3 yields a substantial performance improvement. Thus, compared to previous implementations, we can save 48 memory accesses to the internal state per call of the f-function of SHA-3 (two per round over 24 rounds). Our chaining optimization methodology is an optimization method that can always be expected to improve performance, regardless of the SHA-3 parameters, and it is not restricted by the platform.

Proposed Technique for Efficient Hash_DRBG Implementations Using SHA-3
In this section, we propose an efficient implementation technique for optimized Hash_DRBG execution on 8-bit AVR MCUs. First, we analyzed the various functions that make up Hash_DRBG. Initially, V and C must be updated with the instantiate function; thus, the optimization method that uses the fixed initial operational status of CTR_DRBG is difficult to apply to Hash_DRBG [27]. In addition, in the case of the generate function in Hash_DRBG, the V and C values cannot be inferred in advance; therefore, we analyzed the SHA-3 optimization factor in the derivation function of Hash_DRBG. Since the input data of the instantiate function are the same as the input data of the derivation function, duplicate parts exist between the input data of the hash function calls in the derivation function. Therefore, we chose a strategy of inferring the state of each SHA-3 call for the shared part.
As mentioned in Section 2.2, when generating an initial instance, the derivation function runs SHA-3 len_seed times, according to the security level of SHA-3. The input data of the derivation function consist of the entropy, the nonce, and the personalization string. The entropy and nonce require at least half of the security bits of SHA-3, and the personalization string cannot exceed 35 bits [26]. In the case of b = 1600, which is our target parameter, the input data for the SHA-3 hash function of the derivation function are less than 136 bytes, corresponding to the r of the sponge structure. Therefore, the initial states of the hash function, which is run len_seed times in the derivation function, differ by only 1 byte between SHA-3 calls. While the first SHA-3 of the derivation function is operating, a lookup table can be generated for the duplicated part; thus, part of the first-round operation of the f-function for the next SHA-3 can be omitted. Figure 8 shows the optimization strategy of Hash_DRBG, applying the chaining optimization methodology. When executing the second SHA-3 of the derivation function, the spread of the 1-byte difference in the first round of the f-function occurs independently within each lane before the χ ∼ ι process. Therefore, after applying the chaining optimization methodology to one round of the f-function on 8-bit AVR MCUs, 20 lanes have the same values as in the previous SHA-3 state. By using this, if the first SHA-3 of the derivation function generates a lookup table for 20 lanes of the state while performing one round of the f-function, the second SHA-3 in the derivation function can omit the θ ∼ ρ process for those 20 lanes in one round.

Performance Analysis of SHA-3 Implementation
In this section, we compare the proposed implementation of the chaining optimization methodology with the existing SHA-3 implementations. We use SHA-3 with the general parameters b = 1600, r = 1088, and c = 512, which are applicable in practice; therefore, the internal state of SHA-3 has 200 bytes, and up to 136 bytes can be hashed per f -function call by using the characteristics of the sponge structure. In our software, the padding function of the sponge structure uses the same method as the standard implementation. As the target environment, we chose the ATmega 128, one of the most widely used devices in the WSN environment. For accurate implementation and comparison, the ATmega 128 was simulated in Atmel Studio 7, and the -O2 option was applied when compiling the code. The performance of each hash function is measured in clock cycles per byte (CPB) so that the relationship between the number of hashed bytes and performance can be expressed fairly. Table 8 compares our new SHA-3 implementation in an 8-bit AVR MCU with several hash functions implemented in an AVR environment. Our SHA-3 software applying the chaining optimization methodology implicitly implements the π process and minimizes memory access to the internal state; thus, high performance can be expected compared to the existing SHA-3 implementations. Our software achieved 2646, 1326, and 1066 CPB when hashing 50, 100, and 500 bytes, respectively. The proposed implementation not only shows up to 26.1% performance improvement over the existing best performance, Balasch et al.'s SHA-3 implementation in the AVR environment, but also achieves the highest performance among existing SHA-3 software at every message length. In addition, compared to the previous SHA-3 implementation, the difference in performance with the SHA-2 family has been reduced by almost a factor of two in the 8-bit AVR microcontroller [15,16].
Table 8. Performance of the proposed SHA-3 implementation when hashing messages of various lengths in an 8-bit AVR microcontroller; hash rates are given in cycles per byte (CPB) [15,16].
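The relationship between message length, the number of f -function calls, and CPB can be sketched in C as follows. This is a minimal illustration, assuming byte-aligned messages and the standard SHA-3 padding, which always adds at least one byte; the function names sha3_absorb_calls and cycles_per_byte are hypothetical and do not come from the measured software.

```c
#include <stddef.h>

#define SHA3_RATE_BYTES 136u /* r = 1088 bits */

/* Number of f-function calls needed to absorb msg_len bytes.
 * Padding adds at least one byte, so a message whose length is a
 * multiple of the rate still needs one extra block. */
size_t sha3_absorb_calls(size_t msg_len)
{
    return msg_len / SHA3_RATE_BYTES + 1;
}

/* Cycles per byte, rounded down; returns 0 for an empty message. */
unsigned long cycles_per_byte(unsigned long total_cycles, size_t msg_len)
{
    if (msg_len == 0) {
        return 0;
    }
    return total_cycles / (unsigned long)msg_len;
}
```

Under these assumptions, 50-byte and 100-byte messages both need a single f -function call, while a 500-byte message needs four. This is consistent with the reported figures: 2646 CPB at 50 bytes and 1326 CPB at 100 bytes both correspond to roughly 132,000 cycles in total, and 1066 CPB at 500 bytes corresponds to about 533,000 cycles, or roughly 133,000 cycles per call.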

Table 9 shows the performance of Hash_DRBG in clock cycles for various numbers of extracted random bytes. Since there are no Hash_DRBG software results for Balasch et al. and Otte et al. [15,16], we implemented the remainder of Hash_DRBG ourselves, apart from the core hash function, in order to compare the performance improvement. Because all functions of Hash_DRBG other than the core hash function are configured identically, the performance improvement of the chaining optimization methodology can be measured fairly. Since the nonce requires at least half the size of the security bits of the hash function and the personalization string cannot exceed 35 bits, we set the input data of the derivation function to 64 bytes, which is an appropriate size in the AVR environment [26]. Our Hash_DRBG software, which combines the chaining optimization methodology with the optimized technique for the derivation function, shows a performance improvement of 26.1%-26.5% compared to Hash_DRBG built on the implementations of Balasch et al. and Otte et al., which are the existing SHA-3 implementations in 8-bit AVR MCUs. The performance improvement of Hash_DRBG decreases slightly as the number of extracted random bytes increases because the first-round optimization of the f -function in the proposed derivation function yields a fixed saving, whereas the number of hash function calls grows with the number of extracted bytes. Therefore, our Hash_DRBG software achieves its best performance when generating fewer than 100 random bytes. In addition, our software reduces the gap with Hash_DRBG based on the existing SHA-2 family to four times or less in the AVR environment.
Table 9. Performance of the proposed Hash_DRBG implementation by number of extracted random bytes in the 8-bit AVR microcontroller; performance is measured in clock cycles [15,16].
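The amortization effect described above can be made concrete with a small C sketch. It assumes the generate step calls the hash function once per output block, that is, the requested length divided by the digest size, rounded up. The second function is only a toy cost model with hypothetical parameters, not the authors' measurement method.

```c
#include <stddef.h>

/* Hash calls needed to produce requested_bytes of output when each
 * call yields outlen_bytes (assumed: one call per output block). */
size_t hashgen_calls(size_t requested_bytes, size_t outlen_bytes)
{
    if (outlen_bytes == 0) {
        return 0;
    }
    return (requested_bytes + outlen_bytes - 1) / outlen_bytes;
}

/* Toy model (hypothetical): the share of the total cost covered by a
 * fixed, one-time saving. The total is a setup cost plus a per-call
 * cost, so the share shrinks as the number of calls grows. */
double fixed_saving_share(double fixed_saving, double setup_cycles,
                          double per_call_cycles, size_t calls)
{
    double total = setup_cycles + per_call_cycles * (double)calls;
    return total > 0.0 ? fixed_saving / total : 0.0;
}
```

With a 32-byte digest, for example, extracting 100 bytes takes four hash calls while extracting 32 bytes takes one, so any fixed saving from the derivation function is spread over more work as the request grows.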

Concluding Remarks
In this article, we presented a new SHA-3 implementation, based on what we call the chaining optimization methodology, for the 8-bit AVR microcontroller. We proposed a new optimization method for SHA-3, which has been limited in speed in software environments compared to the SHA-2 family. Memory access to the internal state of SHA-3 is reduced as much as possible by combining the processes of the f -function that can be calculated independently for each lane of the internal state. Through efficient register scheduling and chained operations on the internal state, the performance load of the π process is reduced to a minimum in the 8-bit AVR microcontroller. Our chaining optimization methodology does not depend on a specific platform, such as a parallel environment, because it fundamentally reduces memory access to the state. Our SHA-3 optimization method can therefore be applied in a variety of cryptographic fields. Unlike technologies that can only be applied to specific situations or devices, our methodology is applicable to all platforms and application algorithms without compromising the security of SHA-3. This means that our software can completely replace the previous SHA-3 implementation in an 8-bit AVR environment. From a memory access point of view, the proposed method is applicable to all platforms. We also applied it to an application algorithm used in the crypto industry by building a SHA-3-based DRBG: using SHA-3 in Hash_DRBG, one of the most widely used algorithms, demonstrated the effectiveness of the proposed technique. Therefore, the new SHA-3 implementation can be widely used and is particularly effective in limited environments such as embedded devices. In addition, as most of the candidates in the NIST Post-Quantum Cryptography (PQC) competition use the SHA-3 algorithm, we believe that our proposed method can be applied to PQC.