Compact Hardware Architectures of Enocoro-128v2 Stream Cipher for Constrained Embedded Devices

Lightweight cryptography is a vital and fast-growing field in today's world, where billions of constrained devices interact with each other. In this paper, two novel compact architectures of the Enocoro-128v2 stream cipher are presented. The Enocoro-128v2 is part of the ISO/IEC 29192-3 standard. The first architecture has an 8-bit datapath while the second one has a 4-bit datapath. The proposed architectures were implemented on the BASYS3 board (Artix 7 XC7A35T) using the Verilog hardware description language. The hardware implementation of the proposed 8-bit architecture runs at a 189 MHz clock and reaches a throughput of 302 Mbps, while it utilizes only 254 Look-up Tables (LUTs) and 330 Flip-flops (FFs). Each round of computations requires 5 clock cycles. The 4-bit implementation has an operating frequency of 204 MHz and reaches a throughput of 181 Mbps, with each round requiring 9 clock cycles. The 4-bit implementation utilizes 249 LUTs and 343 FFs. To our knowledge, this is the first time that such implementations of the Enocoro-128v2 are presented. Both implementations utilize a very low number of resources (only 78 FPGA slices are required for the 8-bit architecture and only 83 for the 4-bit one) and the results demonstrate that they are suitable for area-constrained embedded devices.


Introduction
In today's world, wireless communication is an essential part of our life, present in a wide field of our everyday affairs, such as health care, entertainment, or the exchange of information. Personal data are constantly exchanged between devices that are connected with each other, creating a system of interrelated computing devices known as the Internet of things (IoT). In addition, with the development of the fifth generation (5G) technology standard for cellular networks, the amount of the exchanged data, the speed of this exchange, and the number of the simultaneously connected devices are expected to be significantly increased.
As the number of connected IoT devices rises, their vulnerability to cyber-attacks increases as well, with information being the most common target, through privilege abuse [1,2]. The connected devices, referred to as Cyber-Physical Systems (CPS), combine both hardware and software. This combination and the required abstraction between hardware and software lead to safety and security problems and concerns [3]. Therefore, the need for security, through dedicated security modules, is one of the main factors that must be taken into consideration during the design of a device. These devices, however, must be small in size and, because they usually run on some type of battery, must have low power dissipation. This leads to resource-constrained devices where the available resources

Specifications of Enocoro-128v2 Stream Cipher
In general, the term Enocoro refers to a family of security algorithms, developed by Hitachi Ltd. in Tokyo, Japan, that are aimed at hardware implementation in constrained environments. The latest member of this family of stream ciphers is called Enocoro-128v2. It was proposed in [8] and [9] as a new concrete 128-bit security algorithm. It is included in ISO/IEC 29192 [10], an international standard for lightweight cryptography. Specifically, it is included in part three of this standard, which is dedicated to stream ciphers. It is important to note that Enocoro-128v2 can perform encryption and decryption up to 10 times faster than the lightweight implementation of AES-128, while maintaining the same level of security, according to the International Electrotechnical Commission (IEC) [11].
Enocoro-128v2 is a PANAMA-like keystream generator (PKSG) [12], a subclass of pseudorandom number generators (PRNGs). It takes two inputs, a 128-bit secret Key and a 64-bit public initialization vector (IV), and produces an 8-bit output on every round. It has a 256-bit storage buffer (consisting of 8-bit subparts, denoted b0 to b31), a 16-bit internal state (split into two bytes, denoted a0 and a1, with the byte a1 also being the output of the algorithm), and an 8-bit counter, denoted ctr. The buffer is initially filled with the Key, the IV, and eight constants that are defined in the algorithm's specifications. Its contents, as well as the contents of the internal state and of the counter, are updated on every computational round. Their initial values are shown in Table 1.
Table 1. Initial values of the Enocoro-128v2 according to the specifications.
Element       Initial value
ctr           0x01
a0            0x88
a1            0x4c
b0 to b15     KEY
b16 to b23    IV
b24 to b31    0x66, 0xe9, 0x4b, 0xd4, 0xef, 0x8a, 0x2c, 0x3b
The most important part of the algorithm is its two update functions, denoted as ρ and λ. The function ρ consists of two parts: first, an 8-bit substitution box, named S8, that performs a permutation on the input bits and, second, a linear transformation called L (also referred to as Liner). The S8-box consists of smaller 4-bit substitution boxes, named S4, which are connected through a new linear transformation. The computations of Enocoro-128v2 are defined over two finite fields, GF(2^8) and GF(2^4). GF(2^8) is defined by the irreducible polynomial:
φ8(x) = x^8 + x^4 + x^3 + x^2 + 1
while GF(2^4) is defined by the irreducible polynomial:
φ4(x) = x^4 + x + 1
On each round, both update functions are executed and the values of both the storage buffer and the internal state are updated. It is important to note that the execution of the algorithm is a two-step process. The first step is called the initialization step and lasts for 96 rounds; the second step lasts until the completion of the algorithm. The only difference between these steps is that during every round of the initialization step the value of the ctr counter is updated through multiplication by 0x02 in the previously defined finite field GF(2^8), while during the second step the value of the ctr counter is permanently set to zero. The complete architecture of Enocoro-128v2 is shown in Figure 1.
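The counter behaviour across the two steps can be illustrated with a minimal Python sketch. It is a software model for illustration only, not the Verilog implementation; the names `xtime` and `ctr_schedule` are hypothetical, and the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is assumed from the Enocoro-128v2 specification.

```python
# Assumed reduction polynomial for GF(2^8): x^8 + x^4 + x^3 + x^2 + 1.
POLY = 0x11D


def xtime(value):
    """Multiply a byte by 0x02 in GF(2^8) modulo POLY."""
    value <<= 1
    if value & 0x100:          # a bit left the byte, so reduce
        value ^= POLY
    return value & 0xFF


def ctr_schedule(init_rounds=96, extra_rounds=4):
    """Return the ctr value seen in each round (hypothetical helper)."""
    ctr = 0x01                 # initial value of the counter
    values = []
    for _ in range(init_rounds):
        values.append(ctr)
        ctr = xtime(ctr)       # updated once per initialization round
    values.extend([0x00] * extra_rounds)   # held at zero afterwards
    return values
```

Under these assumptions the counter doubles until the top bit is shifted out and the reduction is applied, and it stays at zero once the 96 initialization rounds are over.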

Related Work
The Enocoro-128v2 stream cipher was first proposed in [8] and updated in [9]. In those two works, the specifications of the algorithm were presented. The definition and an architecture for the S8-box were given, as well as the mathematical definition of the linear transformation called L (also referred to as Liner) through a 2-by-2 matrix.
Due to the fact that there are not many Enocoro-128v2 FPGA implementations, we also compared our implementations with FPGA implementations of other lightweight stream ciphers. The primary aim of every lightweight stream cipher design is the creation of an architecture that utilizes as few resources as possible while at the same time has a throughput that is high enough for the target application.
In [13] an implementation of DECIM v2 is proposed. This implementation reaches a frequency of 185 MHz and achieves a throughput equal to 46 Mbps, while it utilizes 80 slices and has a T/#Slices factor of 0.58. Additionally, in [13], an implementation of DECIM-128 that can run at a frequency of 174 MHz and has a throughput of 43 Mbps is presented. It utilizes 89 slices and has a T/#Slices factor of 0.49. An implementation of E0 is presented in [14]. It has a maximum frequency of 187 MHz and a maximum throughput equal to 187 Mbps, while it utilizes 140 slices and has a T/#Slices factor of 1.33.
Four implementations of Grain v1 are presented in [13][14][15][16]. The first one, in [13], has a frequency of 196 MHz, a throughput of 196 Mbps, it utilizes 44 slices and has a 4.45 T/#Slices factor. The second one, in [14], reaches a frequency of 177 MHz, a throughput of 177 Mbps, it uses 318 slices and has a T/#Slices factor of 0.55. The third one, in [15], has a frequency of 193 MHz, a throughput of 193 Mbps, it utilizes 122 slices and has a 1.58 T/#Slices factor. The fourth one, in [16], reaches a throughput of 105 Mbps, it uses 48 slices and has a T/#Slices factor of 2.19. Two implementations of Grain-128 are presented in [13,17]. The one in [13] has a frequency of 196 MHz, a throughput of 196 Mbps, it uses 50 slices and has a 3.92 T/#Slices factor. The one in [17] has a frequency of 181 MHz, a throughput of 181 Mbps, it uses only 48 slices and has a 3.77 T/#Slices factor.
In [13,14], two MICKEY 2.0 implementations are proposed. The implementation in [13] reaches a frequency of 233 MHz, a throughput of 233 Mbps, it utilizes 115 slices and has a T/#Slices factor of 2.03. The implementation in [14] reaches a frequency of 250 MHz, a throughput of 250 Mbps, it utilizes 98 slices and has a T/#Slices factor of 2.55. In [13,15,17,18] four implementations of MICKEY-128 2.0 are proposed. The one in [13] has a frequency of 223 MHz, a throughput of 223 Mbps, it utilizes 176 slices and has a 1.27 T/#Slices factor. The one in [15] reaches a frequency of 156 MHz, a throughput of 156 Mbps, it uses 261 slices and has a T/#Slices factor of 0.60. The implementation in [17] reaches a maximum frequency of 200 MHz, a throughput of 200 Mbps, it uses 190 slices and has a 1.05 T/#Slices factor. Finally, the one in [18] has a frequency of 170 MHz, reaches a throughput of 170 Mbps, it uses 167 slices and has a T/#Slices factor of 1.02.
In [13] and [19], two implementations of Moustique are proposed. The one in [13] has a frequency of 225 MHz, a throughput of 225 Mbps, it uses 278 slices and has a 0.81 T/#Slices factor. The one in [19] has a throughput of 369 Mbps, it uses 252 slices and has a 1.46 T/#Slices factor. In [16] a Mosquito implementation is presented. It reaches a throughput of 137 Mbps, it uses 298 slices and has a T/#Slices factor of 0.46.
Five implementations of Trivium are presented in [13][14][15][16][17]. The first one, in [13], has a frequency of 240 MHz, a throughput of 240 Mbps, it utilizes 50 slices and has a 4.80 T/#Slices factor. The second one, in [14], reaches a frequency of 326 MHz, a throughput of 326 Mbps, it uses 149 slices and has a T/#Slices factor of 2.18. The third one, in [15], has a frequency of 201 MHz, a throughput of 201 Mbps, it utilizes 188 slices and has a 1.07 T/#Slices factor. The fourth one, in [16], reaches a throughput of 102 Mbps, it uses 40 slices and has a T/#Slices factor of 2.55. The fifth one, in [17], has a frequency of 207 MHz, a throughput of 207 Mbps, it utilizes 41 slices and has a 5.05 T/#Slices factor.
The work in [20] includes two FPGA implementations of the Enocoro-128v2 stream cipher that were optimized based on the throughput/slice (T/#Slices) metric. The first achieves a maximum frequency of 118 MHz while it utilizes 292 slices and has a T/#Slices factor of 0.40. The second reaches a maximum frequency of 149 MHz, requires 442 slices, and has a T/#Slices factor of 0.33.
For the 8-bit architecture of the Enocoro-128v2 that is presented in our paper, we followed the architecture that is presented in [8] and [9] for the S8-box. However, for our 4-bit Enocoro-128v2 architecture we designed and implemented a novel architecture for the S8-box that has a 4-bit datapath. For both architectures, we designed and implemented new architectures for the Liner subpart. Our primary aim was to take advantage of the symmetry in the underlying matrix and reduce the elements that are needed for the arithmetic calculations. While the elemental data size of the Enocoro-128v2 is a byte, we designed a Liner with a 4-bit datapath for the 4-bit architecture that performs correct calculations while utilizing minimal resources. In addition, we designed a new architecture for the multiplication by 0x02 that has a 4-bit datapath and, again, requires minimal resources. Finally, for both architectures, we optimized the control logic and the connections between the architecture's subparts in order to keep the resource usage to a minimum.

Proposed 8-Bit Architecture of Enocoro-128v2 Stream Cipher
The proposed 8-bit architecture of the Enocoro-128v2 stream cipher is shown in Figure 2. The Control subsystem is responsible for the correct operation of the circuit: it generates the control signals for the multiplexers. This is done primarily by two counters, one that counts how many rounds have been completed and one that counts how many clock cycles have passed during the current round. The completion of a round requires 5 clock cycles.
In order to store the values that are computed on each round, a storage scheme that follows the algorithm's specifications is required. We opted for a dynamic storage scheme, a buffer named b, instead of a static one. On each round a byte-wise right rotation of the stored values is required. Instead of performing this rotation in a single clock cycle, we perform a rotation on every clock cycle in a way that leads to the same result. Specifically, we rotate the values that are stored in the storage buffer by 13 bytes to the right on each of the round's 5 clock cycles. Since 5 × 13 = 65 = 2 × 32 + 1, the contents of the 32-byte buffer end up rotated by exactly one byte to the right at the start of the next round. We made this choice, of constantly changing the contents of the buffer, in order to increase the security of our architecture against side-channel attacks (mainly Power and Electromagnetic analysis attacks), as measurements that can provide helpful information to an attacker become more difficult to obtain.
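The equivalence between five 13-byte rotations and a single one-byte rotation can be checked with a short Python sketch. It models only the pure rotation of a 32-entry buffer, not the values written back by the update functions; the helper names are hypothetical.

```python
def rotate_right(buf, k):
    """Rotate a list to the right by k positions."""
    k %= len(buf)
    return buf[-k:] + buf[:-k] if k else list(buf)


def rotate_round(buf, step=13, cycles=5):
    """Apply the per-clock-cycle rotation for one full round."""
    for _ in range(cycles):
        buf = rotate_right(buf, step)
    return buf


buffer_b = list(range(32))     # stand-in values for b0..b31
# Five 13-byte rotations equal one byte-wise right rotation.
assert rotate_round(buffer_b) == rotate_right(buffer_b, 1)
```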
In addition to the buffer b, the architecture contains three more core components: an internal state named a and two functions named λ and ρ, respectively. The internal state consists of two bytes, a0 and a1, that are updated on every round. The byte a1 is the output of the algorithm. The proposed architecture for the two functions is described in the following paragraphs.
Prior to the first round of computations, an initialization step is necessary. Following the algorithm's specifications, the registers b24 to b31 are initialized with the specified constants, the registers b16 to b23 with the public initialization vector IV, and the registers b0 to b15 with the secret input Key. The input Key and the IV are fed to the algorithm byte by byte. After all the above values are stored in the buffer b, in their corresponding positions, the first round of the algorithm can begin.
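The resulting buffer layout can be summarised with a minimal Python sketch. The function name `init_state` is hypothetical, the placement of key byte i in register b(i) is an assumption about byte ordering, and the values of a0 and a1 are read from the order of the entries in Table 1.

```python
# Constants for b24..b31, in the order listed in Table 1.
CONSTANTS = [0x66, 0xE9, 0x4B, 0xD4, 0xEF, 0x8A, 0x2C, 0x3B]


def init_state(key, iv):
    """Build the initial buffer b, state a, and counter ctr."""
    if len(key) != 16 or len(iv) != 8:
        raise ValueError("expected a 16-byte key and an 8-byte IV")
    # Assumed ordering: key -> b0..b15, IV -> b16..b23, constants -> b24..b31.
    b = list(key) + list(iv) + CONSTANTS
    a = [0x88, 0x4C]           # a0, a1 (assumed from Table 1)
    ctr = 0x01
    return b, a, ctr
```

For example, `init_state(bytes(16), bytes(8))` returns a 32-entry buffer whose last eight entries are the constants.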

Function λ of the 8-Bit Architecture
The λ function is responsible for the computation of four XORings, according to the algorithm's specifications in [9]. These XORings are computed in consecutive clock cycles, using the same XOR gate for the first three (e.g., for the input pairs [b2, b6], [b7, b15], and [b16, b28] in the initial round). For the fourth XORing an additional XOR gate is required, because the result of XORing a0 with the rightmost byte of the buffer (b31 in the initial state) must also be XORed with the value of the ctr counter. All the final results of the XORings are fed back and stored in the appropriate registers of buffer b. The ctr counter is one byte wide and its initial value is 0x01. Its value is updated through multiplication by 0x02. The result of this multiplication is stored in the corresponding register in the final clock cycle of the current round, so that the ctr counter has the correct value at the beginning of the next round. After the completion of 96 rounds, the value 0x00 is stored in the ctr counter and remains unchanged thereafter.
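A byte-wise software model of the four XORings might look like the sketch below. It is an illustration under stated assumptions rather than the hardware datapath: the function name `lambda_update` is hypothetical, the tap positions (b2, b6), (b7, b15), and (b16, b28) and their destinations are assumed from the Enocoro-128v2 specification, and the whole round is collapsed into a single step instead of five clock cycles.

```python
def lambda_update(b, a0, ctr):
    """Return the buffer after one lambda step (b is a list of 32 bytes)."""
    # Byte-wise right rotation; the new b0 is the feedback of b31, a0, and ctr.
    new_b = [b[31] ^ a0 ^ ctr] + b[:31]
    # Three further XORings (tap positions assumed from the specification).
    new_b[3] = b[2] ^ b[6]
    new_b[8] = b[7] ^ b[15]
    new_b[17] = b[16] ^ b[28]
    return new_b
```

With `ctr` set to 0x00, as after the 96 initialization rounds, the fourth XORing reduces to b31 XOR a0.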

Update Function ρ of the 8-Bit Architecture
The update function ρ consists of two parts: the S8-box part, implemented according to [9], and the Linear Transformation (Liner) part. The output of the S8-box for each input byte is produced in the same clock cycle. The input byte is split into two 4-bit parts, and each part is fed into an S4-box. Each of the two outputs is used in two ways: it is multiplied by the coefficient e = 0x04 (which is defined over the chosen finite field GF(2^4)) and it is also fed into an XOR gate. The gate's second input is the other S4-box output multiplied by e (the circuit for the multiplication by e is shown in Figure 2). The results of the XOR gates are then fed into two new XOR gates along with the coefficients 0xA and 0x5, respectively. The two outputs are fed into two S4-boxes. Finally, the two results are concatenated, and the new byte is rotated by one bit to the left and then driven to the output. The output of the S8-box is then XORed with the appropriate state byte (a0 or a1) and is fed to the Liner. This procedure continues for the next bytes, according to the algorithm's specifications.
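The structure of the S8-box can be sketched in Python as follows. This is a structural model only: the S4 lookup table is passed in as a parameter because its values are not listed in this text and should be taken from the specification, the names `mul_e` and `s8` are hypothetical, and the GF(2^4) reduction polynomial x^4 + x + 1 (0x13) and the assignment of 0xA to the upper nibble and 0x5 to the lower nibble are assumptions.

```python
def mul_e(nibble):
    """Multiply a 4-bit value by e = 0x4 (x^2) in GF(2^4), assumed x^4 + x + 1."""
    for _ in range(2):             # two multiplications by x
        nibble <<= 1
        if nibble & 0x10:
            nibble ^= 0x13         # reduce modulo x^4 + x + 1
    return nibble & 0xF


def s8(byte, s4):
    """Apply the S8 structure using a caller-supplied 16-entry S4 table."""
    hi, lo = byte >> 4, byte & 0xF
    s_hi, s_lo = s4[hi], s4[lo]                # first S4 layer
    y_hi = s4[s_hi ^ mul_e(s_lo) ^ 0xA]        # mix, add constant, second S4
    y_lo = s4[mul_e(s_hi) ^ s_lo ^ 0x5]
    y = (y_hi << 4) | y_lo
    return ((y << 1) | (y >> 7)) & 0xFF        # rotate left by one bit
```

Because the mixing step is invertible, any permutation used as `s4` yields a permutation of the 256 byte values, which is a convenient sanity check for a model like this.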
The Liner has a latency of one clock cycle and its operation lasts for three clock cycles. In the first clock cycle, the first byte (B1) is fed into the Liner and is stored in register L1. In the second clock cycle, the second byte (B2) is fed into the Liner. Through the left multiplexer, B1 is driven to the XOR gate, while through the right multiplexer B2 is also driven to the XOR gate. The two bytes are XORed and the result is driven to the output. In the same clock cycle, B1 is also moved from L1 to register L2. B2 is also multiplied by 0x02 (which is defined over the chosen finite field GF(2^8)) and the result is stored in register R1. The circuit for the multiplication is shown in Figure 2. Finally, in the third clock cycle, the contents of the registers L2 and R1 are driven to the XOR gate by the left and right multiplexer, respectively. The operation's result is driven to the output. At the appropriate clock cycles, the Liner's output is then XORed with the corresponding S8-box output according to the algorithm's specifications. The results which correspond to the new a0 and a1 are driven to the respective registers in the state a subpart. The byte a1, as already noted, is also the output of the algorithm.
Electronics 2020, 9, 1505
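The two values that the Liner produces for an input pair (B1, B2) can be written as a short Python sketch. The names `xtime` and `liner` are hypothetical, and the GF(2^8) reduction polynomial 0x11D is assumed from the Enocoro-128v2 specification; the model ignores the register timing and returns both outputs at once.

```python
POLY = 0x11D   # assumed: x^8 + x^4 + x^3 + x^2 + 1


def xtime(value):
    """Multiply a byte by 0x02 in GF(2^8) modulo POLY."""
    value <<= 1
    if value & 0x100:
        value ^= POLY
    return value & 0xFF


def liner(b1, b2):
    """Return the two Liner outputs in the order they leave the circuit."""
    first = b1 ^ b2            # second clock cycle: B1 XOR B2
    second = b1 ^ xtime(b2)    # third clock cycle: L2 XOR R1
    return first, second
```

The single multiplication by 0x02 is the only field operation needed, which is consistent with the small register and multiplexer count described above.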

Proposed 4-Bit Architecture of Enocoro-128v2 Stream Cipher
The proposed 4-bit architecture of the Enocoro-128v2 stream cipher is shown in Figure 3. The 4-bit architecture can be split into the same subparts as the 8-bit architecture. These subparts, however, differ greatly in their internal architecture. Again, the same initialization step is required before the first round of computations. The only difference is that in the 4-bit architecture each byte of data is fed into the buffer in 4-bit groups, beginning with the four least significant bits. Following the same procedure as in the 8-bit architecture, all the necessary bytes are present in buffer b after 48 clock cycles. The storage buffer is based on the same principles as the one in the 8-bit architecture. In order to achieve the necessary byte-wise right rotation of the buffer, each 4-bit group is rotated by 50 places (25 bytes) to the right at each clock cycle (each round requires 9 clock cycles). Since 9 × 50 = 450 = 7 × 64 + 2, the 64 4-bit groups end up rotated by two places, i.e., by exactly one byte, at the end of each round.
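The nibble-wise loading and rotation can be checked with a small Python sketch. It models only the data movement (no update-function writes); the helper names are hypothetical.

```python
def rotate_right(items, k):
    """Rotate a list to the right by k positions."""
    k %= len(items)
    return items[-k:] + items[:-k] if k else list(items)


def to_nibbles(data):
    """Split bytes into 4-bit groups, least significant group first."""
    nibbles = []
    for byte in data:
        nibbles.append(byte & 0xF)
        nibbles.append(byte >> 4)
    return nibbles


def rotate_round(nibbles, step=50, cycles=9):
    """Apply the per-clock-cycle nibble rotation for one full round."""
    for _ in range(cycles):
        nibbles = rotate_right(nibbles, step)
    return nibbles
```

Loading the 24 key and IV bytes this way takes 48 nibble transfers, and nine 50-place rotations of the 64 nibbles give the same result as one byte-wise right rotation of the 32 bytes.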

Function λ of the 4-Bit Architecture
The λ function in the 4-bit architecture works in the same way as the one in the 8-bit architecture. The only difference is that now the XORings are computed on 4-bit parts instead of complete bytes. Each XORing begins with the 4 LSBs of the specified byte. The ctr counter consists of two 4-bit registers, ctr_L and ctr_H, with initial values 0x0 and 0x1, respectively. One after the other, the values of those registers are driven to the multiplication subunit in order for the overall counter's value to be multiplied by 0x02. The result of this multiplication is stored in the corresponding registers in the final clock cycles of the current round, so that the ctr counter's registers have the correct values at the beginning of the next round. After the completion of 96 initialization rounds, the value 0x0 is stored in both of the ctr counter's registers, which then remain unchanged.
The multiplication by 0x02 subunit has a latency of one clock cycle and its operation lasts for three clock cycles for every complete multiplication. In the first clock cycle, the 4 LSBs of the byte that is to be multiplied enter the circuit and are stored in registers M0, M3, M2, and M1. In the second clock cycle, the 4 MSBs of the byte also enter the circuit. The MSB of the four is separately XORed with the contents of M3 and M2. The results, along with the MSB and the contents of register M1, are driven to the output and represent the four LSBs of the multiplication's result. In parallel, the MSB is stored in register M0 and the previous contents of M0 are moved to M5. The rest of the incoming bits are stored in registers M3, M2, and M1. In the third clock cycle, the contents of M0 and M5 are XORed. The result, along with the contents of M3, M2, and M1, is driven to the output, as these represent the four MSBs of the multiplication's result.
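The three-cycle operation can be modelled bit by bit in Python and compared with a byte-wise multiplication. This is a sketch under assumptions: the mapping of input bits to registers (bit 3 to M0, bit 2 to M3, bit 1 to M2, bit 0 to M1) is inferred from the description above, the reduction polynomial 0x11D is assumed from the specification, and the function names are hypothetical.

```python
def xtime(value):
    """Byte-wise reference: multiply by 0x02 modulo x^8 + x^4 + x^3 + x^2 + 1."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
    return value & 0xFF


def xtime_nibble_serial(lo, hi):
    """Model the 4-bit multiplier; returns (low nibble, high nibble)."""
    # Cycle 1: store the 4 LSBs (assumed bit-to-register mapping).
    m1, m2, m3, m0 = lo & 1, (lo >> 1) & 1, (lo >> 2) & 1, (lo >> 3) & 1
    # Cycle 2: the 4 MSBs arrive; the MSB feeds the two XOR gates.
    msb = (hi >> 3) & 1
    out_lo = msb | (m1 << 1) | ((m2 ^ msb) << 2) | ((m3 ^ msb) << 3)
    m5, m0 = m0, msb                       # old M0 moves to M5
    m1, m2, m3 = hi & 1, (hi >> 1) & 1, (hi >> 2) & 1
    # Cycle 3: one more XOR produces the lowest bit of the high nibble.
    out_hi = (m0 ^ m5) | (m1 << 1) | (m2 << 2) | (m3 << 3)
    return out_lo, out_hi
```

Under the assumed bit mapping, the two output nibbles agree with the byte-wise result for all 256 input values, which suggests the description is consistent with the assumed polynomial.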

Update Function ρ of the 4-Bit Architecture
The update function ρ of the 4-bit architecture consists of the same two parts as the one in the 8-bit architecture, the S 8 -box part, and the Linear Transformation (Liner) part. For both these parts, novel internal 4-bit architectures were designed and implemented.
The S8-box has a latency of two clock cycles and its operation, for each byte, lasts for four clock cycles. In the first clock cycle, the 4 LSBs of the specified byte enter the S8-box and are driven to the S4-box subpart. The output is used in two distinct ways: it is stored in register S1 and it is multiplied by the coefficient e = 0x04 (which is defined over the chosen finite field GF(2^4)). The result of the multiplication is stored in register E1. In the second clock cycle, the remaining 4 MSBs of the specified byte enter the S8-box and are treated in the same way. The previous output that was stored in S1, along with the new multiplication result, is driven to the first XOR gate. The result is driven to the second XOR gate along with the coefficient 0x5. The gate's output is fed into the second S4-box and the four output bits are stored in registers R1, R2, R3, and R4. Additionally, the contents of E1 are moved to E2. In the third clock cycle, the contents of S1 are driven to the first XOR gate along with the contents of E2. The gate's output is then driven to the second XOR gate along with the coefficient 0xA. The result is driven to the second S4-box. The MSB of the output and the contents of R2, R3, and R4 are concatenated (as shown in Figure 3) and driven to the output. The three remaining bits of the S4-box output are stored in R2, R3, and R4, while the contents of R1 are moved to register D. In the fourth and final clock cycle, the contents of registers D, R2, R3, and R4 are concatenated and driven to the output. The output of the S8-box is XORed with the appropriate four state bits (a0_L, a0_H, a1_L, or a1_H) and is fed to the Liner. This procedure continues for all the incoming bytes, in 4-bit parts, according to the algorithm's specifications.
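The way the final one-bit left rotation is folded into the two output cycles can be illustrated in Python. The sketch assumes that R1 holds the MSB of the lower result nibble and that the concatenations place the carried-over bit in the least significant position; the function name `rotl1_nibbles` is hypothetical.

```python
def rotl1_nibbles(y_lo, y_hi):
    """Rotate the byte (y_hi, y_lo) left by one bit, nibble by nibble."""
    # Third cycle: three low bits of y_lo (R2..R4) plus the MSB of y_hi.
    out_lo = ((y_lo << 1) & 0xF) | (y_hi >> 3)
    # Fourth cycle: three low bits of y_hi plus the saved MSB of y_lo (D).
    out_hi = ((y_hi << 1) & 0xF) | (y_lo >> 3)
    return out_lo, out_hi
```

Joining the two returned nibbles gives the same value as rotating the full byte left by one bit, so no 8-bit rotation stage is needed in the 4-bit datapath.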
The Liner has a latency of two clock cycles and its operation lasts for six clock cycles. In the first clock cycle, the 4 LSBs of the first byte (Byte1 [3:0]) are fed into the Liner and are stored in register L1. In the second clock cycle, Byte1 [3:0] is moved to register L2 and the remaining 4 MSBs of the first byte (Byte1 [7:4]) are fed into the Liner and stored in L1. In the third clock cycle, the 4 LSBs of the second byte (Byte2 [3:0]) enter the Liner and are fed into the multiplication circuit (this multiplication by 0x02 circuit has already been described in the previous part of this paper). Additionally, through the right multiplexer, Byte2 [3:0] is driven to the XOR gate, while through the left multiplexer Byte1 [3:0] is also driven to the XOR gate. The result is then driven to the output. In the same clock cycle, Byte1 [3:0] is also moved to register L3 and Byte1 [7:4] is moved to L2. In the fourth clock cycle, the 4 MSBs of the second byte (Byte2 [7:4]) enter the Liner and are fed into the multiplication circuit. Again, through the right multiplexer, Byte2 [7:4] is driven to the XOR gate, while through the left multiplexer Byte1 [7:4] is also driven to the XOR gate. The result is driven to the output. Following the previous procedure, Byte1 [3:0] is moved to register L4 and Byte1 [7:4] is moved to L3. Additionally, the output of the multiplication circuit is stored in register R1. In the fifth clock cycle, the contents of L4 and R1 are XORed and the result is driven to the output. Byte1 [7:4] is moved to L4 and the new output of the multiplication circuit is stored in register R1. In the sixth and final clock cycle, the contents of L4 and R1 are XORed and the result is driven to the output. At the appropriate clock cycles, the Liner's output is XORed with the corresponding output of the S8-box, according to the algorithm's specifications. The results which correspond to the new a0_L, a0_H, a1_L, and a1_H are driven to the respective registers that form the internal state a. The bits that correspond to a1_L and a1_H are also the output of the algorithm.

Results
The design suite that was selected for the synthesis and the implementation of the proposed 8-bit and 4-bit architectures of the Enocoro-128v2 stream cipher was Xilinx's Vivado 2019.2, using the Verilog hardware description language. The implementation was evaluated (placed and routed) on the Basys 3 Artix-7 (XC7A35T) FPGA board. The implementation results of both the proposed architectures are shown in Table 2. The 8-bit implementation utilizes 254 LUTs (1.22% of the available LUTs) and 330 FFs (0.79% of the available FFs) in a total of 78 slices. Each round of the algorithm requires 5 clock cycles for its completion, while the maximum reached frequency is 189 MHz, leading to a throughput of 302 Mbps. The achieved throughput per number of slices for the 8-bit implementation is 3.8. On the other hand, the 4-bit architecture achieves a clock frequency of 204 MHz and a throughput of 181 Mbps with a latency per round of 9 clock cycles, while it utilizes only 343 FFs (0.82% of the available FFs) and 249 LUTs (1.19% of the available LUTs) in a total of 83 slices. The throughput per number of slices for the 4-bit implementation is 2.2. In addition, we estimated the on-chip power using the Vivado Report Power feature. For accurate and reliable estimation, the Switching Activity Interchange Format (SAIF) file, which is generated from Post-Implementation Timing Simulation, was used. The SAIF file mainly contains the signals' toggle counts (number of changes). In order to achieve good toggle coverage of the design's internal signals (which leads to a more accurate estimation), a very long simulation time was necessary, e.g., 100,000 ns (with the clock period being 5 ns). Additionally, the inputs for the implementations were generated by a random number generator. We estimated that the 8-bit implementation has a dynamic power dissipation of 41 mW, while the dynamic power dissipation of the 4-bit architecture is 40 mW.
As is well known, the static power consumption mainly depends on the total resources of the FPGA rather than on the number of resources that are utilized by an implementation. Therefore, the static power dissipation of the device is estimated to be 70 mW for both architectures.
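The reported throughput figures follow from the clock frequency and the number of clock cycles per round, since each round produces one output byte. A minimal Python sketch of this arithmetic, with hypothetical function names, is given below.

```python
def throughput_mbps(freq_mhz, cycles_per_round, bits_per_round=8):
    """Throughput in Mbps for one output byte every cycles_per_round cycles."""
    return freq_mhz * bits_per_round / cycles_per_round


def throughput_per_slice(throughput, slices):
    """Throughput-per-area metric (T/#Slices)."""
    return throughput / slices


t8 = throughput_mbps(189, 5)   # 8-bit architecture: about 302.4 Mbps
t4 = throughput_mbps(204, 9)   # 4-bit architecture: about 181.3 Mbps
```

Dividing the reported 302 Mbps and 181 Mbps by 78 and 83 slices gives roughly 3.87 and 2.18, in line with the T/#Slices values of 3.8 and 2.2 quoted above.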

Discussion
In Table 3, some comparison metrics regarding area and performance with previously published compact FPGA implementations of lightweight stream ciphers are given. These implementations are the closest ones that can be used for comparisons with the proposed implementations, because implementations of the Enocoro-128v2 stream cipher are very few. The comparisons are focused on the Frequency (Freq.), the Throughput (Mbps), the Area (#Slices) and, finally, the Throughput per area (T/#Slices), which measures the hardware resource cost associated with the implementation's throughput (a higher value is better). Because the stream ciphers that were used for comparisons were implemented on different FPGAs, which can lead to a potentially different maximum frequency and hardware resource utilization, we believe that the best metric for comparison is the Throughput per area (T/#Slices), because it can mitigate these potential differences. We have denoted the proposed 8-bit architecture as Enocoro8 and the proposed 4-bit architecture as Enocoro4 for simplicity. As already discussed, in [13] an implementation of DECIM v2 and an implementation of DECIM-128 are proposed. Both our Enocoro implementations achieve higher maximum frequencies, higher throughputs, and higher T/#Slices factors than both the DECIM ones. Our 8-bit Enocoro implementation utilizes fewer slices than both the DECIM ones, while our 4-bit Enocoro implementation utilizes 6 slices fewer than the DECIM-128 implementation and only 3 more than the DECIM v2 one. Compared to the implementation of E0 that is presented in [14], our 8-bit Enocoro implementation achieves a higher frequency and a higher throughput, utilizes fewer slices, and has a better T/#Slices factor. Our 4-bit Enocoro implementation achieves a higher frequency, utilizes fewer slices, and has a better T/#Slices factor, but it reaches a slightly lower throughput (6 Mbps less).
As for the two implementations of Grain-128 presented in [13,17], our 8-bit Enocoro implementation achieves a higher frequency than the one in [17] but a lower one than [13], while it achieves a higher throughput than both Grain-128 implementations. However, it utilizes more slices than both of them. This leads to a better T/#Slices factor than [17] but a worse one than [13]. On the other hand, our 4-bit implementation has a higher frequency than both [13,17] and a throughput equal to that of [17] but lower than that of [13]. It utilizes more slices than both Grain-128 implementations, and its T/#Slices factor is lower than both [13,17].
In comparison to the two MICKEY 2.0 implementations in [13,14], our 8-bit Enocoro implementation reaches a lower frequency than both of them, but it has a higher throughput than both. Additionally, it utilizes fewer slices and has a better T/#Slices factor than both the MICKEY 2.0 implementations. Our 4-bit Enocoro implementation reaches a lower frequency and a lower throughput than both of the MICKEY 2.0 implementations. However, at the same time, it utilizes fewer slices than both of them and it has a T/#Slices factor that is higher than [13] but lower than [14].
Moreover, compared to the four implementations of MICKEY-128 2.0 proposed in [13,15,17,18], our 8-bit Enocoro implementation achieves a higher operating frequency than [15,18] but a lower one than [13,17]. Its throughput is higher than that of all MICKEY-128 2.0 implementations. Additionally, our 8-bit Enocoro implementation utilizes fewer slices and has a better T/#Slices factor than all MICKEY-128 2.0 implementations. On the other hand, our 4-bit implementation has a higher operating frequency than [15,17,18] but a lower one than [13]. Regarding the throughput, the 4-bit implementation achieves a higher throughput than [15,18] but a lower one than [13,17]. Again, our 4-bit Enocoro implementation utilizes fewer slices and has a better T/#Slices factor than all MICKEY-128 2.0 implementations.
As already presented, two implementations of Moustique are proposed in [13,19]. The Moustique in [13] achieves a higher frequency than both our Enocoro implementations, and its throughput is lower than that of our 8-bit implementation but higher than that of our 4-bit one. However, both our implementations utilize significantly fewer slices and have much better T/#Slices factors. The Moustique in [19] achieves a higher throughput than both our Enocoro implementations but, at the same time, it utilizes significantly more slices and has a much worse T/#Slices factor than both our implementations. Against the Mosquito implementation in [16], both our Enocoro implementations perform better on all the corresponding metrics.
Finally, compared to the two implementations of the Enocoro-128v2 presented in [20], both our Enocoro-128v2 implementations achieve higher maximum frequencies, utilize fewer slices, and have higher T/#Slices factors.
From the previous comparisons of our 8-bit and 4-bit Enocoro implementations with a large variety of stream ciphers, it can be seen that our implementations compare very well against them. Even against implementations that follow a different design philosophy, like the compact Grain and Trivium, our implementations exhibit very good results. Lightweight stream ciphers are designed for area-constrained embedded devices. That is why they generally consume fewer hardware resources and achieve better performance compared to conventional block ciphers such as the AES, the Triple Data Encryption Algorithm (3DES), etc. In addition, constrained embedded devices have limited processing resources in their CPU or in their memory and, more importantly, they have strict low-power requirements.
Therefore, efficiency in hardware means a balance between the previously mentioned factors. We strongly believe that the proposed architectures achieve a very good level of efficiency in terms of these factors because they do not utilize any memory, they require a low number of hardware resources and, therefore, do not consume a lot of power, while simultaneously achieving a very good level of time efficiency.
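As a rough illustration of how the throughput and the T/#Slices figures for the two proposed architectures relate to each other, the following minimal Python sketch recomputes them from the reported clock frequencies, cycles per round, and slice counts. It assumes that each round of Enocoro-128v2 produces 8 keystream bits, which is consistent with the reported 302 Mbps and 181 Mbps. The function and variable names are hypothetical and chosen only for this example, and the computed values may differ from those in Table 3 by rounding.

```python
# Sketch: recompute throughput and throughput-per-area for the two designs.
# Assumption: one round yields 8 keystream bits (one byte per round).
BITS_PER_ROUND = 8

# Figures reported in this paper for the two proposed architectures.
designs = {
    "Enocoro8": {"freq_mhz": 189, "cycles_per_round": 5, "slices": 78},
    "Enocoro4": {"freq_mhz": 204, "cycles_per_round": 9, "slices": 83},
}


def throughput_mbps(freq_mhz, cycles_per_round, bits_per_round=BITS_PER_ROUND):
    """Throughput in Mbps: bits per round times rounds per microsecond."""
    return freq_mhz * bits_per_round / cycles_per_round


def throughput_per_slice(throughput, slices):
    """Throughput-per-area metric (T/#Slices); higher is better."""
    return throughput / slices


results = {}
for name, d in designs.items():
    t = throughput_mbps(d["freq_mhz"], d["cycles_per_round"])
    results[name] = (t, throughput_per_slice(t, d["slices"]))
    print(f"{name}: {t:.1f} Mbps, T/#Slices = {results[name][1]:.2f}")
```

The sketch makes the trade-off between the two datapaths explicit: the 4-bit design runs at a higher clock, but its extra cycles per round lower both its throughput and its T/#Slices value relative to the 8-bit design.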

Conclusions and Future Works
Two very compact architectures of the lightweight stream cipher Enocoro-128v2, along with their implementations, are proposed in this paper. The Enocoro-128v2 stream cipher is part of the ISO/IEC 29192-3 standard. The 8-bit architecture achieves a throughput equal to 302 Mbps @ 189 MHz, while the 4-bit architecture achieves a throughput equal to 181 Mbps @ 204 MHz. Both architectures utilize a very low number of hardware resources, which also results in very low dynamic power consumption: up to 41 mW for the 8-bit one and 40 mW for the 4-bit one. Comparisons in terms of operating frequency, area, and, most importantly, throughput per area with the most well-known lightweight stream ciphers show that the proposed architectures are very good candidates for the security aspect of area-constrained embedded devices.
Future work will aim to improve the proposed architectures at the architectural design level and also to explore different implementation options. Regarding the implementation, our goal is to implement the proposed architectures in silicon and in miniature FPGAs in order to further reduce the power consumption to the µW range and enable efficient integration into even more compact IoT nodes and constrained embedded systems in general. At the architectural level, we will aim to further explore FPGA low-power techniques (such as glitch reduction or signal gating) in order to, again, further reduce the power consumption. In this regard, a very important technique for power reduction is Guarded Evaluation [21], because it can stop the input switching activity from propagating when the outputs are not used. This situation is common in feedback logic designs like our proposed architectures. In addition, because embedded applications now tend to use parallel computing architectures, algorithms like Enocoro-128v2 that already have a high degree of parallelism are a primary option for the devices' security module. Finally, additional features against side-channel analysis attacks will be designed and included in the proposed architectures, such as countermeasures based on secure logic styles, hiding countermeasures, and masking countermeasures.

Funding:
We acknowledge support of this work by the project "I3T-Innovative Application of Industrial Internet of Things (IIoT) in Smart Environments" (MIS 5002434) which is implemented under the "Action for the Strategic Development on the Research and Technological Sector", funded by the Operational Programme "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).