10 Clock-Periods Pipelined Implementation of AES-128 Encryption-Decryption Algorithm up to 28 Gbit/s Real Throughput by Xilinx Zynq UltraScale+ MPSoC ZCU102 Platform

: The security of communication and computer systems is an increasingly important issue, nowadays pervading all areas of human activity (e


Introduction
With the advance of information technology (IT) applications involving sensitive data, network security is an ever-current topic for business activities; therefore, robust, computationally light and efficient encryption algorithms are required to support the continuous increase of data volume and throughput in IoT applications, as well as in video streaming, real-time communications, mobile transmissions and so forth [1][2][3][4]. The Advanced Encryption Standard (AES) remains the preferred encryption standard for governments, banks and high-security systems around the world. It is the most widespread encryption algorithm, for instance, [...] = 28.16 Gbit/s). Besides, a fast algorithm for the key expansion has been implemented, based on the combination, by GF operations, of the previous sub-key with the current sub-key modified by the Sbox; in this way, the key expansion is completed in only 174.55 ns, obtaining the 44 words from the main key. These results have also been obtained thanks to the latest-generation FPGA platform (Xilinx ZCU102 board) used to implement the encryption/decryption system, which features high performance, large memory capacity and a wide set of peripherals [19]. Furthermore, the developed encryption/decryption block implements all the control signals required to synchronize its operation with the data generator block and the modem, placed upstream and downstream of it, respectively. Several blocks have also been implemented to test the operation of the encryption/decryption block, by deterministically inserting an error into a known-plaintext data packet and detecting the error in the encrypted/decrypted packet, which is notified by a dedicated error signal. Finally, a mechanism to change the encryption key during the operation of the encryption system has been implemented, resulting in just three packets lost during the replacement process, as described in Section 3.
The remainder of the paper is organized as follows: Section 2 includes a literature analysis of high-throughput implementations of encryption/decryption algorithms based on FPGA platforms; furthermore, the demonstrator of the Wireless Connector, based on the Xilinx ZCU102 platform, is described, demonstrating its proper operation. In Section 3, the VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuits) blocks used to implement and test the encryption and decryption algorithms are described in detail; also, the performance of the proposed custom AES-128 encryption/decryption algorithm is presented, in terms of encryption/decryption time, resource utilization and complexity. In Section 4, the obtained results are discussed, also addressing clock routing issues; furthermore, in this section, the tests of the combined encryption/decryption system, after its loading on the ZCU102 board, are presented. Finally, the proposed AES-128 implementation is compared with similar works reported in the scientific literature.

Fundamentals of the AES-128 Encryption/Decryption Algorithm
The AES algorithm is a block cipher at the bit level, like the Data Encryption Standard (DES), where the block length is fixed to 128 bits, whereas the key length can be equal to 128, 192 or 256 bits [20]. Each 128-bit data block is partitioned into 16 bytes, mapped on a 4 × 4 array named state, and each byte of the state corresponds to an element of the Galois Field GF($2^8$), which has $2^8$ elements. Based on the key length, the algorithm includes n iterations, called rounds, where n is 10, 12 or 14 when the key length is 128, 192 or 256 bits, respectively. Each round of the encryption process, except for the last one, consists of four operations: Substitute Bytes, Shift Rows, Mix Columns and Add Round Key. All the operations are carried out sequentially within each round, except for the initial Add Round Key, which precedes the first round; in the last round, the Mix Columns operation is not performed (Figure 1).
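The mapping of a data block onto the state array can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the column-by-column filling order is an assumption taken from the FIPS-197 convention, which the text above does not state explicitly.

```python
# Number of rounds for each AES key length in bits, as listed in the text.
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}


def block_to_state(block: bytes) -> list:
    """Arrange a 16-byte block as a 4 x 4 state, filled column by column."""
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes (128 bits)")
    # Assumption (FIPS-197 convention): state[row][col] = block[row + 4 * col].
    return [[block[row + 4 * col] for col in range(4)] for row in range(4)]


def state_to_block(state: list) -> bytes:
    """Inverse mapping: read the state back column by column."""
    return bytes(state[row][col] for col in range(4) for row in range(4))


block = bytes(range(16))
state = block_to_state(block)
```

With this ordering, the first row of the state holds bytes 0, 4, 8 and 12 of the block, so the row-wise and column-wise steps act on interleaved bytes of the input.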
The Substitute Bytes step is a non-linear transformation, where each byte in the state array is replaced with the entry of a fixed 8-bit Substitution Box (Sbox), implemented as a lookup table with $2^8$ words of 8 bits each and used to hide the relationship between the key and the cipher-text. The Sbox is derived from the multiplicative inverse over GF($2^8$), combined with an invertible affine transformation to avoid attacks based on simple algebraic properties, obtaining a 16 × 16 byte table (Figure 2). The substitution is obtained by addressing the Sbox locations with the most significant nibble and the least significant nibble of the 8-bit input data.
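As a rough software illustration of how such a table can be derived, the sketch below builds the Sbox from the multiplicative inverse followed by the affine transformation. The reduction polynomial 0x11B and the affine constant 0x63 are taken from the AES standard and are not stated in the text above; all function names are hypothetical, and this is not the LUT implementation used on the FPGA.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # AES reduction polynomial (assumption from the standard)
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); zero is mapped to zero."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 since a^255 = 1
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """Inverse in GF(2^8) followed by the affine transformation."""
    table = []
    for value in range(256):
        inv = gf_inv(value)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()


def sbox_lookup(byte: int) -> int:
    """Address the 16 x 16 table with the high nibble (row) and low nibble (column)."""
    row, col = byte >> 4, byte & 0x0F
    return SBOX[16 * row + col]
```

Addressing the flat table as `16 * row + col` is equivalent to indexing it directly with the input byte; the nibble split only mirrors the 16 × 16 layout of Figure 2.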
The Shift Rows step operates on the rows of the state array, circularly shifting the bytes in each row by a given offset. The first row is left unchanged, whereas each byte of the second row is shifted one position to the left; likewise, the third and fourth rows are shifted by two and three bytes to the left, respectively. The Mix Columns step is a linear transformation mixing the columns of the state array; each column, treated as a polynomial over GF($2^8$), is multiplied, modulo $z^4 + 1$, with a fixed polynomial (i.e., $c(z) = 03z^3 + 01z^2 + 01z + 02$). Both the Shift Rows and the Mix Columns operations are needed to hide the relationship between the cipher-text and the plain text.
In the Add Round Key step, the key is combined with the state array to make the cipher safer; for each round, a subkey is derived from the expansion of the main key, obtaining, at the end of the encryption process, an expanded key of 176 bytes, arranged in a linear array of 44 words (using a key length of 128 bits). After the initialization of the word array ($W[i]$ for $0 \le i \le 43$), by inserting the key in the first four words, the other ones are obtained using the following relation:
$$W[i] = W[i-1] \oplus W[i-4] \qquad (1)$$
A particular exception is made for the words whose index is a multiple of four, for which a non-linear relationship, different from the plain bit-to-bit XOR, is used:
$$W[i] = \mathrm{Subword}(\mathrm{Rotword}(W[i-1])) \oplus \mathrm{Rcon}[i/4] \oplus W[i-4] \qquad (2)$$
where the Subword sub-function replaces each byte of the word, provided as argument of the function, using the Sbox, whereas the Rotword sub-function cyclically shifts the word one byte to the left; furthermore, Rcon[j], with $j = i/4$, is a round constant, represented by the word array $[x^{j-1}, \{00\}, \{00\}, \{00\}]$, where $x^{j-1}$ is the $(j-1)$-th power of $x$ in GF($2^8$).
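The key expansion described above can be sketched in software as follows. This is a minimal reference sketch, not the pipelined hardware key expansion of the paper: the function names are hypothetical, the Sbox is rebuilt from its algebraic definition (reduction polynomial 0x11B and affine constant 0x63, both assumptions taken from the AES standard), and the example key is the one from Appendix A.1 of FIPS-197 rather than a key used by the authors.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox() -> list:
    """Sbox = affine transformation of the multiplicative inverse."""
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF

    table = []
    for value in range(256):
        inv = 1 if value else 0
        for _ in range(254 if value else 0):  # value^254 is the inverse
            inv = gf_mul(inv, value)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


SBOX = build_sbox()


def sub_word(word: list) -> list:
    """Replace each byte of the word using the Sbox."""
    return [SBOX[b] for b in word]


def rot_word(word: list) -> list:
    """Cyclically shift the word one byte to the left."""
    return word[1:] + word[:1]


def expand_key(key: bytes) -> list:
    """Return the 44 four-byte words W[0..43] for a 128-bit key."""
    if len(key) != 16:
        raise ValueError("this sketch handles 128-bit keys only")
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01  # x^0, first byte of Rcon[1]
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp))
            temp = [temp[0] ^ rcon] + temp[1:]  # XOR with [x^(j-1), 00, 00, 00]
            rcon = gf_mul(rcon, 0x02)           # next power of x
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return words


example_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
expanded = expand_key(example_key)
```

The 44 words amount to 176 bytes, that is, the initial key plus one 128-bit sub-key for each of the 10 rounds.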
The encryption and decryption procedures employ two different algorithms; nevertheless, each operation in the encryption process corresponds to an equivalent inverse one in the decryption process. However, both are arranged in 10 rounds and both perform the Add Round Key step in the same way. Thus, each round of the decryption process consists of the following operations: Inverse Shift Rows, Inverse Substitute Bytes, Add Round Key and Inverse Mix Columns. A further difference between the encryption and decryption processes is the order of the functions performed within a single round; in the decryption process, the first step is the Inverse Shift Rows, followed by Inverse Substitute Bytes, Add Round Key and finally Inverse Mix Columns. In particular, the Inverse Shift Rows step cyclically shifts each row by the same offset as the Shift Rows step but in the opposite direction, that is, to the right.
To invert the Mix Columns operation, the Inverse Mix Columns step employs the corresponding inverse matrix. The 4-byte columns of the state array are multiplied by the inverse 4 × 4 matrix, which has constant entries, to produce the output bytes; all operations involved in the matrix multiplication are performed in GF($2^8$). Equivalently, each column is multiplied, modulo $z^4 + 1$, with a fixed polynomial $b(z) = 0Bz^3 + 0Dz^2 + 09z + 0E$, where $b(z) = c(z)^{-1} \bmod (z^4 + 1)$ and $c(z)$ is the polynomial used in the Mix Columns step of the encryption.
The Inverse Substitute Bytes function is carried out similarly to the Substitute Bytes but using a different Sbox (Figure 3), obtained by applying the inverse of the affine transformation followed by the multiplicative inverse in GF($2^8$).
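The relation between $c(z)$ and $b(z)$ can be checked numerically. The sketch below multiplies four-term polynomials with GF($2^8$) coefficients modulo $z^4 + 1$; the helper names are hypothetical, the reduction polynomial 0x11B is assumed from the AES standard, and the sample column is a commonly used Mix Columns test vector rather than data from the paper.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def poly_mul_mod(a: list, b: list) -> list:
    """Multiply two polynomials modulo z^4 + 1.

    Coefficients are listed from degree 0 to degree 3. Since z^4 = 1
    modulo z^4 + 1, the exponents simply wrap around modulo 4.
    """
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf_mul(a[i], b[j])
    return out


C_POLY = [0x02, 0x01, 0x01, 0x03]  # c(z) = 03 z^3 + 01 z^2 + 01 z + 02
B_POLY = [0x0E, 0x09, 0x0D, 0x0B]  # b(z) = 0B z^3 + 0D z^2 + 09 z + 0E

product = poly_mul_mod(B_POLY, C_POLY)      # expected to be the constant 1

column = [0xDB, 0x13, 0x53, 0x45]           # one state column, top to bottom
mixed = poly_mul_mod(C_POLY, column)        # Mix Columns
restored = poly_mul_mod(B_POLY, mixed)      # Inverse Mix Columns
```

Because the product of the two polynomials is the constant 01, applying Inverse Mix Columns after Mix Columns returns the original column.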
Electronics 2020, 9, x FOR PEER REVIEW 6 of 30
Since the encryption and decryption operations are not the same, the implementation suffers a significant disadvantage; however, there is an equivalent version of the decryption algorithm, which applies the inverse functions in the same order as the encryption algorithm.
In particular, since the Inverse Shift Rows step changes the sequence of the bytes of the state array, leaving their content unchanged, whereas the Inverse Substitute Bytes step changes the content but not the sequence of the bytes, their order is irrelevant and the two steps can be exchanged. Moreover, the Add Round Key and Inverse Mix Columns transformations, considering the key as a sequence of words, both operate on the state array column by column; since Inverse Mix Columns is linear with respect to the XOR, it can be applied to the round key $W[i]$ before adding it to the current state array, thus obtaining the data packet decryption with the same sequence of operations as the encryption algorithm.
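Both properties used to derive the equivalent decryption algorithm can be verified on random data. The sketch below is only a numerical check under stated assumptions: the substitution table is a random stand-in permutation (not the real inverse Sbox), the helper names are hypothetical, and the reduction polynomial 0x11B is assumed from the AES standard.

```python
import random


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


INV_MIX = [0x0E, 0x09, 0x0D, 0x0B]  # b(z) coefficients, degree 0 to 3


def inv_mix_column(col: list) -> list:
    """Multiply one column by b(z) modulo z^4 + 1."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf_mul(INV_MIX[i], col[j])
    return out


def inv_shift_rows(state: list) -> list:
    """Rotate row r of the state to the right by r positions."""
    return [row[4 - r:] + row[:4 - r] for r, row in enumerate(state)]


def substitute(state: list, table: list) -> list:
    """Apply a byte-wise substitution table to every state entry."""
    return [[table[b] for b in row] for row in state]


rng = random.Random(0)
table = list(range(256))
rng.shuffle(table)  # stand-in for the inverse Sbox (any byte-wise table works)
state = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]
column = [rng.randrange(256) for _ in range(4)]
key_word = [rng.randrange(256) for _ in range(4)]

# 1) Inverse Shift Rows and the byte-wise substitution commute.
steps_commute = (substitute(inv_shift_rows(state), table)
                 == inv_shift_rows(substitute(state, table)))

# 2) Inverse Mix Columns is linear: mixing (column XOR key) equals
#    mixing the column and the key word separately and XOR-ing the results.
mixed_after_add = inv_mix_column([s ^ k for s, k in zip(column, key_word)])
added_after_mix = [a ^ b for a, b in
                   zip(inv_mix_column(column), inv_mix_column(key_word))]
is_linear = mixed_after_add == added_after_mix
```

The second property is what allows the transformed round keys to be computed once, so that the decryption rounds can follow the same step order as the encryption rounds.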

Implementation of Encrypting/Decrypting Algorithms With FPGA Platforms
In Reference [21], the authors proposed an inexpensive, low area and high throughput hardware implementation of the Advanced Encryption Standard algorithm (AES) for low-cost embedded applications, using a 128-bit key for both encryption and decryption, employing parallel operation in the folded architecture. The hardware selected for the implementation is the Virtex-6 XC6VLX75T FPGA device. In the folded architecture, the 128-bit blocks of input data are divided into four sub-blocks of 32-bit each and all the operations are performed sequentially. Due to the inefficiency of this method, along with the folded architecture, parallel computing is required to speed-up the algorithm execution. The experimental results reveal that the algorithm can achieve a 37.1 Gbit/s data throughput with a maximum clock frequency of 505.5 MHz.
C. Guzmán et al. proposed a hardware implementation of the AES 128-bit algorithm with a pipelined architecture, working in two non-feedback modes of operation, namely Electronic Codebook (ECB) and Counter (CTR), using a Xilinx Virtex 5 FPGA platform [22]. They compared the two operating modes in terms of resource utilization, throughput and robustness. The results revealed that the CTR mode is more convenient than the ECB one in terms of security level and area efficiency. The proposed architecture reaches a clock frequency of 272.59 MHz, corresponding to a throughput of 34.89 Gbit/s. A triple-key AES algorithm is presented in ref. [23]; such a framework requires a 128-bit plaintext input and three keys to produce the ciphertext. The keys are combined by a common XOR function, and the resulting key is provided along with the plaintext to an "add round key" block, where they are combined by a XOR function; afterward, the obtained data block and the combined key are sent as input to an AES-128 encryption block to obtain the cipher data (Figure 4). The proposed algorithm was optimized, thus obtaining 867.34 Mbit/s maximum throughput, with a resource utilization of 3402 Configurable Logic Blocks (CLBs), 27,787 LUTs and 385 Input/Output Blocks (IOBs). Compared with implementations on other Xilinx devices, a 15% increase in throughput and a lower processing delay were obtained.
In ref. [24], A. Gopalan et al. described the development and implementation of an AES algorithm on an FPGA platform (Xilinx XC6SLX16), comparing the designed infrastructure with a corresponding software implementation. The developed encryption block required 10 clock cycles to process each data block, thus corresponding to a 100 ns processing time since the clock frequency is equal to 100 MHz. Instead, the decryption block took 11 cycles (i.e., 110 ns), since an initial overhead is required.
The two main figures of merit used to evaluate the algorithm performance are the throughput (Equation (3)) and the latency (Equation (4)). In these equations, block_per_cycle is one for a fully-unrolled architecture, whereas it becomes greater than one if the round output is re-used to elaborate the single input; stages_per_round is the number of clock cycles needed to process a single round.
C. P. Fan et al. described a high-speed 128-bit AES encryption module in both sequential and fully pipelined architectures, including a Content Addressable Memory (CAM)-based architecture used to realize pipelined high-speed SubBytes and InvSubBytes blocks, a hardware-sharing solution to carry out a high-speed MixColumns operation and a real-time key generation scheme to realize the AddRoundKey block [25]. The latter generates 128-bit keys for both the encryption and decryption processes from the encryption key, segmented into four 32-bit blocks and stored in different registers. The last register output (named d register) is dispatched to the ROT (shift of bytes), S-box and RCON (XOR operation) blocks. The SubBytes and InvSubBytes operations were implemented by applying a CAM-based architecture, which provides as output the data line values obtained from several register arrays, placed both between and inside the AES round computations, according to the matched address line data. Also, the MixColumns and InvMixColumns operations were carried out by means of row mapping permutations, based on two corresponding polynomial matrices. To reduce the resource utilization, the InvMixColumns polynomial matrix was decomposed into three different matrices to highlight the hardware sharing between the two operations; in this way, a high-speed shared circuit implementing the two transformations was derived.
The AES module, implemented on the Xilinx XC2V3000-6 FPGA platform, reached in the sequential architecture a data throughput value up to 0.876 Gbit/s with a clock frequency of 75.3 MHz, both in the encryption and decryption phases. Instead, the proposed fully pipelined AES architecture obtained 28.4 Gbit/s throughput with an operating frequency of 222.2 MHz in the encryption phase.
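The throughput figures quoted in this section can be cross-checked with a simple calculation. The formula below (block size times clock frequency, divided by the number of clock cycles spent on each block) is the usual textbook relation and is an assumption here, since the throughput and latency equations are not reproduced in this text; the function names are hypothetical, and the value of 11 cycles per block for the sequential architecture is likewise an assumption, chosen because it reproduces the quoted 0.876 Gbit/s.

```python
def throughput_gbit_s(clock_mhz: float, cycles_per_block: int = 1,
                      block_bits: int = 128) -> float:
    """Throughput in Gbit/s for one block completed every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block / 1000.0


def processing_time_ns(cycles: int, clock_mhz: float) -> float:
    """Time in ns needed to spend the given number of clock cycles."""
    return cycles * 1000.0 / clock_mhz


# Fully pipelined designs deliver one 128-bit block per clock cycle.
fan_pipelined = throughput_gbit_s(222.2)          # ref. [25], about 28.4 Gbit/s
guzman_pipelined = throughput_gbit_s(272.59)      # ref. [22], about 34.89 Gbit/s

# Sequential design of ref. [25]: 11 cycles per block is an assumption.
fan_sequential = throughput_gbit_s(75.3, cycles_per_block=11)

# Ref. [24]: 10 cycles at 100 MHz for encryption, 11 cycles for decryption.
gopalan_encrypt_ns = processing_time_ns(10, 100.0)
gopalan_decrypt_ns = processing_time_ns(11, 100.0)
```

Under these assumptions the computed values agree with the throughput and processing times reported for refs. [22], [24] and [25], which suggests that the quoted pipelined throughputs correspond to one block per clock cycle.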
In ref. [26], the authors proposed a high-performance hardware implementation of the Data Encryption Standard (DES) encryption algorithm, with a 16-stage pipelined architecture, operating in CTR mode, on a Xilinx Virtex XCV1000-4 BG560 FPGA platform. In the proposed architecture, an initial delay of 16 clock cycles is required to instantiate the functional block, which includes the key expansion function, the Sbox function and the Pbox function; then, at each clock cycle, fixed-length clusters of data are loaded into this block along with different keys, thus allowing the use of multiple keys, one for each of the 16 rounds of the DES algorithm. The major contribution was a parameterizable key scheduling method, where the sub-keys are pre-computed and distributed to the functional blocks of each round; furthermore, a skew core controls the availability over time of the sub-keys to the different functional blocks, delaying their generation by the needed amount of time. The results showed that the proposed architecture achieves an encryption rate of 3.87 Gbit/s, guaranteeing a low area utilization with only 6446 CLB slices used.
P. Chodowiec et al. proposed a compact implementation of the 128-bit AES algorithm on the inexpensive Xilinx Spartan-II XC2S30 FPGA, using a folded architecture and achieving good performance and low area utilization [27]. The folded architecture described is the same as that reported in ref. [21]; nevertheless, the authors introduced a new approach for implementing the MixColumns and InvMixColumns functions using shared logic resources. This architecture requires only 222 CLB slices and 3 blocks of RAM, supporting a maximum throughput of 166 Mbit/s.
In ref. [28], the authors developed and evaluated hardware implementations, based on various FPGA devices, of the DES encryption algorithm, introducing several pipelined architectures that stand out for power consumption, resource utilization and throughput; the most significant ones are an 8-stage pipelined architecture and a 37-stage pipelined architecture. In the first one, two rounds at a time are collapsed into one stage and the output is saved into two intermediate registers of the next stage, up to a total of 8 stages; instead, in the second solution, the authors developed a 37-stage pipelined DES architecture, previously reported in ref. [29] but optimized by reducing the utilization of resources, joining the logical operations by means of a processing block with 4 inputs and 1 output. The second architecture was further improved by removing the redundant E (Expansion) and R (Reduction) boxes from the original design. With such modifications, the authors were able to increase the throughput by a factor of 1.1 compared to the original design, reaching 40 Gbit/s using a Kintex-7 platform. Regarding the proposed 8-stage pipelined implementation, a significant reduction of resource utilization (by a factor of 0.75) and power consumption (by a factor of 0.65) compared to a similar 16-stage pipelined design was demonstrated.
As described above, a LUT-based solution has been used to implement the Sbox in the proposed AES algorithm; this is not an optimal solution for area-limited hardware but offers better performance in terms of data throughput compared to other solutions, for instance those based on combinatorial logic and aimed at minimizing the resource utilization, as demonstrated in Reference [30]. In this context, in Reference [31], the author provided an overview of the different strategies to implement a compact Sbox function, based on both polynomial and normal bases. Furthermore, a compact Sbox implementation was introduced, based on a multi-level representation of GF operations, obtained by properly selecting a particular basis (isomorphism) and making appropriate improvements to the circuit solution. The proposed solution demonstrated improvements of 20% compared to the most compact Sbox implementation reported in Reference [32]. Besides, T. Good et al. proposed two new FPGA implementations of the AES algorithm [33]; the first one, implemented on a Xilinx Spartan-III (XC3S2000) FPGA, relies on a fully parallel loop-unrolled architecture, reaching a 25 Gbit/s data throughput. The second one, implemented on a Spartan-II (XC2S15) FPGA, is based on state data and LUTs to carry out the AES operations, such as Substitute Bytes and Mix Columns, combined into a single matrix called "T-box"; this implementation features low area utilization, achieving 2.2 Mbit/s maximum throughput. The Sbox implementations proposed in these works could be applied to our solution to significantly reduce the used hardware resources, but probably at the cost of a lower maximum throughput, which is the main requirement of the Wireless Connector communication system.

Description of the "Wireless Connector" System's Demonstrator and Relative Communication Tests
In this sub-section, the preliminary demonstrator of the Wireless Connector system is presented. It includes two PEM003 RF radio modules (manufactured by Pasternack, Irvine, CA, USA, highlighted by the red box), interfaced with the base-band hardware, which consists of a Zynq UltraScale+ MPSoC ZCU102 platform (manufactured by Xilinx, San Jose, CA, USA, highlighted by the yellow box), an ADFMCDAQ2 acquisition board (manufactured by Analog Devices, Norwood, MA, USA, highlighted by the purple box), four power splitter/combiners (model ZFSCJ-2-4+, manufactured by Mini-Circuits, Brooklyn, NY, USA, highlighted by the green box), four pre-DAC anti-alias low-pass filters (model VLFX-300, manufactured by Mini-Circuits, highlighted by the orange box) and a personal computer for the system management (highlighted by the blue box).
The ADFMCDAQ2 acquisition board includes a dual ADC (model AD9680, manufactured by Analog Devices), featuring 14-bit resolution, 1.0 Gsps sample rate and a JESD204B interface, and a quad DAC (model AD9144, manufactured by Analog Devices), featuring 16-bit resolution, 2.8 Gsps sample rate and a JESD204B interface; furthermore, the on-board clock generator is a 14-output AD9523-1 low-jitter IC (manufactured by Analog Devices), placed along with the components for the power management.
The Pasternack PEM003 development kit consists of a transmitter (Tx) and a receiver (Rx) module, operating in a frequency band around 60 GHz and supporting complex modulations through a pair of modulation signals I and Q. Each module is equipped with a USB interface, used to set the main parameters from a PC and also to provide the power supply. The baseband I and Q signals are applied to the input of the Tx module or are available at the output of the Rx module, through the Micro Coaxial (MCX) connectors placed on the back of each board; the signals are in differential format (i.e., I+ and I−, Q+ and Q−). The 60 GHz section terminates with the Tx and Rx antennas, connected to the UG-385/U flange, which acts as an interface with the WR-15 waveguide. A reference design based on an embedded microprocessor system (Xilinx MicroBlaze) has been used to characterize the ADC/DAC devices. The embedded MicroBlaze processor is generated, using the internal logic resources of the FPGA device, by means of the Vivado/SDK design tool. The drivers for the management of the Ethernet protocol, a UART interface for the information exchange and system management, and an external DDR memory for the management of user data are directly connected to the MicroBlaze processor.
The limited performance of the acquisition board and the logic resources of the FPGA device allow the installation of a Quadrature Phase-Shift Keying (QPSK) modulator-demodulator with low-performance Forward Error Correction (FEC) modules (e.g., Reed-Solomon). In Figure 5, a functional scheme of the system described above is provided, whereas Figure 6 shows the experimental setup realized to perform the 60 GHz communication tests, supporting a data rate up to 3 Gbit/s, a constraint previously defined for the whole 5G communication system.
The QPSK modulation is carried out by generating the I(t) and Q(t) coefficients, to be sent to the two quadrature mixers, where I(t) is mixed with the carrier signal and Q(t) with the carrier phase-shifted by 90°, both produced by the modem block. The modulated signals can be represented on the complex plane, obtaining four symbols which constitute the QPSK constellation (as shown in Figure 7). The QPSK demodulation is based on the principle of coherent demodulation, which requires an appropriate reconstruction of the base-band symbols. The ADC included on the ADFMCDAQ2 acquisition board provides the samples of the received I(t) and Q(t) signals, subsequently processed by First-In First-Out (FIFO) buffers that rearrange the data flow on 64-bit registers at a 500 MHz frequency. These data are sent to a threshold decision block to reconstruct the symbols on the receiver side. The constellation of received symbols (Figure 7), with a 500 Ms/s symbol rate corresponding to 1 Gbit/s (but extendable up to 3 Gbit/s, as previously reported), has been displayed through a software application (Analog Devices IIO Oscilloscope) for extrapolating the transmitted symbols, processed by the FPGA and sent to the PC via the UART port.
Figure 6. Picture of the experimental setup using the Zynq UltraScale+ MPSoC ZCU102 platform baseband system interconnected to the PEM003 radio system, operating in QPSK modulation.
The QPSK modulation is carried out by generating the I(t) and Q(t) coefficients, which are sent to the two quadrature mixers; there, I(t) is mixed with the carrier signal and Q(t) with the carrier phase-shifted by 90°, both produced by the modem block. The modulated signals can be represented on the complex plane, obtaining the four symbols that constitute the QPSK constellation (as shown in Figure 7). The QPSK demodulation is based on the principle of coherent demodulation, which requires an appropriate reconstruction of the baseband symbols. The ADC component included on the ADFMCDAQ2 acquisition board provides the samples of the received I(t) and Q(t) signals, which are subsequently processed by First-In First-Out (FIFO) systems that rearrange the data flow onto 64-bit registers at a 500 MHz frequency. These data are sent to a threshold decision-maker block that reconstructs the symbols on the receiver side. The constellation of received symbols (Figure 7), with a 500 Msymbol/s symbol rate corresponding to 1 Gbit/s (but extendable up to 3 Gbit/s, as previously reported), has been displayed through a software application (Analog Devices IIO Oscilloscope) for extrapolating the transmitted symbols, processed by the FPGA and sent to the PC via the UART port.
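The mapping and the threshold decision described above can be illustrated with a minimal Python sketch. The Gray-coded bit-to-symbol assignment, the function names and the toy channel distortion are assumptions made for illustration; they are not taken from the modem firmware.

```python
# Illustrative QPSK mapping and sign-threshold decision.
# The bit-to-symbol assignment below is an assumption, not the modem's actual one.

def qpsk_modulate(bits):
    """Map each pair of bits to (I, Q) coefficients in {-1, +1}."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    return [(1 - 2 * bits[k], 1 - 2 * bits[k + 1]) for k in range(0, len(bits), 2)]

def qpsk_decide(samples):
    """Threshold decision: the sign of each received I/Q sample yields one bit."""
    bits = []
    for i, q in samples:
        bits.append(0 if i >= 0 else 1)
        bits.append(0 if q >= 0 else 1)
    return bits

def bit_rate(symbol_rate_hz, bits_per_symbol=2):
    """QPSK carries two bits per symbol."""
    return symbol_rate_hz * bits_per_symbol

tx_bits = [0, 0, 0, 1, 1, 0, 1, 1]
symbols = qpsk_modulate(tx_bits)
# Toy channel: attenuation plus a small offset that does not cross the threshold.
received = [(0.8 * i + 0.1, 0.8 * q - 0.1) for i, q in symbols]
rx_bits = qpsk_decide(received)
```

With two bits per symbol, `bit_rate(500_000_000)` reproduces the 1 Gbit/s figure quoted for the 500 Msymbol/s constellation.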

Description of the VHDL Blocks Implemented for the AES Encryption/Decryption Algorithm
The VHDL block developed for implementing the encryption algorithm is shown in Figure 8 (red box); it accepts the plaintext at its input and provides the ciphertext at its output, both arranged into 128-bit blocks, via the AXI Stream bus. The source files implementing the encryption block are shown in Figure 9. The first four files (AES_AXIS_KEY_v1, AES_AXIS_KEY_v1_0_S00_AXIS_inst, AES_AXIS_KEY_v1_0_S01_AXI_inst and AES_AXIS_KEY_v1_0_M00_AXIS_inst) implement the communication between blocks through the AXI Stream and AXI Lite buses; the last two files, namely aes_encoding_block and cipher_key_expansion_block, contain the code that performs the encryption algorithm.
The portion of the firmware dealing with the expansion of the key, contained in cipher_key_expansion_block, is shown in Figure 10; it performs the operations needed to obtain the 44 words of the expanded key (the four words of the main key followed by those of the 10 sub-keys used during the encryption rounds). The key expansion routine is started through the expansion_key_start signal whenever a new key is validated by the processor; only 174.55 ns are required to obtain all the 44 words of the expanded key.
As can be seen, the 44 words constituting the sub-keys are obtained by carrying out XOR operations between the 32-bit sections of the sub-key of the previous round.
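For reference, the standard AES-128 key schedule (FIPS 197) that yields these 44 words can be sketched in Python as follows. This is a software model of the standard algorithm, not the authors' VHDL; the S-box is computed from its GF(2^8) definition only to keep the example self-contained, and all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):          # x^254 is the inverse of x (and 0 maps to 0)
            inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):     # XOR of the four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")

def expand_key(key):
    """Return the 44 32-bit words of the AES-128 expanded key."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = ((t << 8) | (t >> 24)) & 0xFFFFFFFF   # RotWord
            t = sub_word(t) ^ (rcon << 24)            # SubWord and round constant
            rcon = gf_mul(rcon, 2)
        w.append(w[i - 4] ^ t)                        # XOR with the word 4 positions back
    return w

words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

The key used here is the example key of FIPS 197, Appendix A.1, so the resulting words can be compared against the published expansion.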
An Sbox matrix is employed to expand the key (Figure 11); as mentioned above, each element of the Sbox consists of 32 bits instead of 8 bits, which allows the algorithm to perform the related operations, and thus to obtain the encrypted data packets, in a shorter time interval, at the cost of greater resource utilization of the FPGA device. This LUT-based solution was preferred over solutions that implement the Sbox through GF operations, such as those reported in References [31,33], because the main requirement of the Wireless Connector is operating speed rather than hardware resource utilization, given the large memory capability of the employed FPGA platform. As is known, LUT-based Sbox solutions offer better performance in terms of processing time to the detriment of area occupation, as demonstrated in Reference [30]; this affects the Substitute Bytes step, the most critical operation in the AES algorithm, and also the key expansion step in the proposed implementation.
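One common way to obtain 32-bit Sbox entries is the so-called T-table construction, in which each entry stores the S-box output already multiplied by the Mix Columns coefficients. The Python sketch below illustrates that construction under the assumption that the 32-bit tables of the paper follow the same idea; the actual table contents are not given in the text, and the names `TE0`, `column_lut` and `column_direct` are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """AES S-box computed from its definition (inverse plus affine transform)."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

# 32-bit entry: bytes (2*S[x], S[x], S[x], 3*S[x]), i.e. S-box and Mix Columns merged.
TE0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3) for s in SBOX]

def ror32(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def column_lut(col):
    """Substitute Bytes + Mix Columns on one column using four lookups and XORs."""
    word = 0
    for k, b in enumerate(col):
        word ^= ror32(TE0[b], 8 * k)
    return word

def column_direct(col):
    """Same column computed step by step with GF(2^8) arithmetic."""
    s = [SBOX[b] for b in col]
    out = [gf_mul(s[r], 2) ^ gf_mul(s[(r + 1) % 4], 3) ^ s[(r + 2) % 4] ^ s[(r + 3) % 4]
           for r in range(4)]
    return int.from_bytes(bytes(out), "big")

sample = [0x19, 0x3D, 0xE3, 0xBE]
```

The lookup version replaces every GF multiplication of a round with table reads and XORs, which is the speed-for-area trade-off discussed above.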
Once the 10 sub-keys are obtained, the algorithm carries out the 10 rounds required by AES-128, implemented in the aes_encoding_block source file, which receives the sub-keys and the plaintext as input and carries out the steps required to encrypt the plaintext (Figure 1).
In the first round, the XOR operation between the plaintext and the cipher_key_table, which contains the unexpanded encryption key, is carried out (round_0 in Figure 12). Afterward, using the intermediate data generated by the first round, the algorithm proceeds with the following 9 rounds required by AES-128, performing in each round the Substitute Bytes, Shift Rows, Mix Columns and Add Round Key operations (Figure 13a). These operations are iteratively applied to the intermediate data obtained from the previous round, updated until the ninth round (Figure 13b). The data obtained after this iteration, called intermediate_data(9), is provided to round 10 for the last Add Round Key operation, and the resulting ciphered data is stored in the 128-bit out_cipher_data packet (Figure 14).
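The round sequence can be cross-checked against a compact software model of standard AES-128 (FIPS 197). The sketch below is such a reference model, not the authors' VHDL: it follows the standard, in which the final round applies Substitute Bytes, Shift Rows and Add Round Key without Mix Columns, and the way the hardware distributes these steps between round 9 and round 10 may differ. All names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()

def round_keys(key):
    """Expand a 16-byte key into 11 round keys of 16 bytes each."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]   # RotWord, then SubWord
            t[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        w.append([x ^ y for x, y in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]

def add_round_key(state, rk):
    return [s ^ k for s, k in zip(state, rk)]

def sub_bytes(state):
    return [SBOX[b] for b in state]

def shift_rows(state):
    # Byte i sits in row i % 4, column i // 4; row r is rotated left by r columns.
    return [state[(i + 4 * (i % 4)) % 16] for i in range(16)]

def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out

def encrypt_block(plaintext, key):
    rk = round_keys(key)
    state = add_round_key(list(plaintext), rk[0])            # round 0
    for rnd in range(1, 10):                                 # rounds 1 to 9
        state = add_round_key(mix_columns(shift_rows(sub_bytes(state))), rk[rnd])
    state = add_round_key(shift_rows(sub_bytes(state)), rk[10])  # round 10
    return bytes(state)

ciphertext = encrypt_block(bytes.fromhex("00112233445566778899aabbccddeeff"),
                           bytes(range(16)))
```

The key and plaintext are those of the FIPS 197 Appendix C.1 example, so a table such as the one held by a verification block can be populated from the same model.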
By saving the results of each round (i.e., intermediate_data(i), i = 1, …, 9), a pipelined implementation can be obtained, in which the 10 rounds are carried out simultaneously on successive data packets, so that the processing of a new packet can start as soon as the round's processing on the previous ones is completed. Simultaneous processing of multiple packets allows better exploitation of the hardware resources and thus a higher data throughput. As reported below, the proposed AES implementation takes only one clock period to complete each round's processing, so that an encrypted data packet is provided at each clock cycle.
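The effect of the pipeline on latency and output rate can be seen with a small cycle-level model. The Python sketch below only tracks which packet occupies which round register on each clock edge; it does not perform any encryption, and the function name and packet labels are illustrative.

```python
def run_pipeline(packets, stages=10):
    """Clock a `stages`-deep pipeline and return (cycle, packet) for each output.

    One register per round: on every clock edge each stage hands its content
    to the next one and the first stage accepts a new packet, if available.
    """
    regs = [None] * stages
    feed = list(packets)
    outputs = []
    cycle = 0
    while feed or any(r is not None for r in regs):
        cycle += 1
        regs = [feed.pop(0) if feed else None] + regs[:-1]
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs

results = run_pipeline(["P0", "P1", "P2", "P3"])
# The first packet leaves after 10 clock periods, then one packet per clock.
```

In this model N packets need N + 9 clock periods in total, instead of the 10·N periods of a non-pipelined implementation with the same round time.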
To test the correct behaviour of the implemented algorithm, a word generator called Data_Generator (green box in Figure 8) has been included, using the Vivado IP Integrator tool, to simulate the presence of the Ethernet module; it provides 128-bit data packets at the input of the encryption block every 42.67 ns via the AXI Stream bus. To insert and store the key, an external block called Key_generator (purple box in Figure 8) and a memory block with four 32-bit registers have been employed, connected via the AXI Lite bus, thus allowing the user to update the key at any time. A Key_to_write block (orange box in Figure 8) writes the four 32-bit words of the key into the four registers created during the AXI Lite bus implementation phase, asynchronously with respect to the processor, allowing the encryption key to be replaced during the normal operation of the algorithm. If the key in the registers is not changed, the algorithm keeps performing the data encryption; if it differs from the current key, the expansion_key routine starts and the 10 sub-keys of the new main key are generated.
The switch to a new key is enabled, when the processor deems it appropriate, by setting a bit of an additional byte transmitted via the AXI Lite bus and stored in an additional 32-bit register named key_valid. The algorithm queries this bit every 42.7 ns; if the bit is set high, the algorithm reads the key stored in the registers and starts the key expansion routine. At the same time, the enabling bit is reset to indicate the change of the encryption key to the decryption block. The developed key substitution mechanism is an important functionality for the Wireless Connector, since a periodic key change is required to guarantee the security of the data exchanged between the two mobile stations constituting the communication system.
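The polling handshake just described can be modelled behaviourally as follows. This Python sketch is only an interpretation of the text: the class names, the choice of bit 0 as the enabling bit and the counter standing in for the key expansion routine are assumptions.

```python
class KeyRegisters:
    """Stand-in for the four 32-bit key registers plus the key_valid register."""

    def __init__(self):
        self.words = [0, 0, 0, 0]
        self.key_valid = 0


class EncryptionCore:
    """Behavioural model of the key-change polling (no encryption performed)."""

    def __init__(self, regs, initial_key):
        self.regs = regs
        self.active_key = list(initial_key)
        self.expansions = 0

    def poll(self):
        """Called at each polling instant (every 42.7 ns in the text)."""
        if self.regs.key_valid & 1:
            self.active_key = list(self.regs.words)  # latch the new key
            self.regs.key_valid &= ~1                # reset the enabling bit
            self.expansions += 1                     # placeholder for key expansion
            return True
        return False


regs = KeyRegisters()
core = EncryptionCore(regs, [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C])

regs.words = [1, 2, 3, 4]        # new key written, but not yet validated
before = core.poll()             # nothing happens: key_valid is still low
regs.key_valid = 1               # processor validates the key
after = core.poll()              # key is latched and the bit is cleared
```

Separating the key write from its validation lets the four registers be updated one at a time without the core ever reading a partially written key.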
The correctness of the encrypted data packets is verified by a Pattern_Verificator block (blue box in Figure 8), connected to the encryption block via the AXI Stream bus; this block simulates the presence of the modem and contains a table with the encrypted data packets corresponding to the plaintext data packets provided at the input of the AES_TEST_AXI block by the Data_Generator block. It compares the packets received from the AES_TEST_AXI block with those contained in the table; if the received data matches the table, the encryption has been successful, otherwise an error has occurred. To verify the correct operation of the algorithm, an Insert_Error block (pink box in Figure 8) has been implemented to change a bit in the 128-bit plaintext packet, so that the ability of the Pattern_Verificator to detect errors can be checked. When an error is detected, the Pattern_Verificator sets the error_sig bit in correspondence with the encrypted data packet that does not match the ones stored in its table. The CLOCK block (yellow box in Figure 8) provides a 350 MHz clock to all the blocks. To synchronize the Pattern_Verificator with the encryption block, an impulse is generated to indicate the end of encryption and the availability of a new encrypted data packet at the output of the AES_AXIS_KEY block; this signal is associated with the m00_axis_tvalid pin of the AXI Stream bus. Besides, two other signals have been implemented, namely a support flag and a signal indicating which packet of the Data_Generator table is provided at the input of the encryption block, allowing the Pattern_Verificator to keep track of the packets sent and to associate them with the corresponding entries in its internal table.
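The test principle (known plaintexts, a table of expected outputs, and a deliberately flipped bit) can be sketched in Python. To keep the example short, a truncated SHA-256 digest is used as a stand-in for the encryption core; it is not AES and serves only as a deterministic 128-bit mapping. The function names mirror the block names of the text but are otherwise hypothetical.

```python
import hashlib

def stand_in_cipher(block):
    """Placeholder for the AES core (NOT AES): any deterministic 128-bit map works here."""
    return hashlib.sha256(block).digest()[:16]

def insert_error(block, bit_index):
    """Flip one bit of a 128-bit packet, as the Insert_Error block does."""
    value = int.from_bytes(block, "big") ^ (1 << bit_index)
    return value.to_bytes(16, "big")

def pattern_verificator(received, expected_table):
    """Return one flag per packet (1 = mismatch), playing the role of error_sig."""
    return [int(r != e) for r, e in zip(received, expected_table)]

plaintexts = [bytes([n]) * 16 for n in range(4)]
expected_table = [stand_in_cipher(p) for p in plaintexts]

corrupted = list(plaintexts)
corrupted[2] = insert_error(corrupted[2], 5)     # single-bit error in packet 2
flags = pattern_verificator([stand_in_cipher(p) for p in corrupted], expected_table)
```

Only the packet carrying the flipped bit is flagged, which is the behaviour expected from the error_sig signal.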
Furthermore, an external signal called s00_axis_tvalid has been defined, which indicates to the encryption block, through an impulse, the availability of packets at the input, enabling the immediate acceptance of new data packets. To optimize the algorithm and reduce the execution time, the plaintext packets are acquired on the falling edge of the clock signal, which allows the encryption process to start in advance, gaining 1.428 ns, corresponding to half of the 2.856 ns clock period.
The modem, located downstream of the encryption block and simulated by the Pattern_Verificator block, works with 64-bit data packets; therefore, each encrypted packet has to be serialized into 64-bit packets, which causes some latency and timing problems. For this reason, the m00_axis_tready signal, provided by the AXI Stream bus, has been implemented to indicate to the encryption block when the Pattern_Verificator is available to accept new packets. When the signal is set, the algorithm accepts the packets at its input and performs the encryption process; otherwise, if it is reset, the algorithm stops and waits for the signal to return high. The m00_axis_tvalid signal, instead, indicates to the Pattern_Verificator that a new encrypted data packet is available at the output.
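A minimal Python sketch of the 128-bit to 64-bit serialization with back-pressure is given below. The word ordering (upper half first) and the cycle-by-cycle `tready_pattern` argument are assumptions made for illustration; the text does not specify them.

```python
MASK64 = (1 << 64) - 1

def split_128(packet):
    """Split a 128-bit packet into two 64-bit words, upper half first (ordering assumed)."""
    return [(packet >> 64) & MASK64, packet & MASK64]

def stream(packets, tready_pattern):
    """Emit 64-bit words only on the cycles where the downstream tready is high.

    Returns the list of (cycle, word) transfers and the words still pending.
    """
    words = [w for p in packets for w in split_128(p)]
    sent = []
    for cycle, ready in enumerate(tready_pattern):
        if ready and words:
            sent.append((cycle, words.pop(0)))
    return sent, words

packet = 0x00112233445566778899AABBCCDDEEFF
sent, pending = stream([packet], [1, 0, 1])
```

When the downstream block deasserts tready, the pending words simply wait, which corresponds to the encryption block stalling until the signal returns high.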
In Figure 15, the temporal trends of the signals involved in the developed encryption algorithm are shown; the s00_axis_tvalid signal is generated on the falling edge of the clock, whereas the encryption of plaintext data packets starts on each rising edge of the clock. Also, to consider the case in which the s00_axis_tvalid signal is set to zero, the last two packets are made available using two randomly spaced impulses. The algorithm accepts the plaintext data packets to be encrypted and provides the corresponding encrypted data packets after an interval of 28.560 ns. From the above considerations, the developed algorithm can supply 128-bit encrypted data packets every 2.856 ns (equal to the clock period), thus obtaining, for a 350 MHz operating frequency, a throughput of 44.8 Gbit/s.
Figure 15. Temporal trends of the signals involved in the encryption algorithm, with the plaintext data packets encrypted on each rising edge of the clock, as indicated by the s00_axis_tvalid signal set to one and the last two packets made available using two impulses spaced randomly over time (yellow box); each encrypted data packet is available at the output after 28.560 ns (10 system clock periods at 350 MHz frequency), as indicated by the m00_axis_tvalid signal (orange box).
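The timing figures quoted above follow directly from the clock frequency and the block size, as the short Python calculation below shows. The constant names are illustrative; note that 1/350 MHz is about 2.857 ns, which the text reports as 2.856 ns.

```python
CLOCK_HZ = 350e6        # operating frequency used in the post-synthesis simulation
BLOCK_BITS = 128        # AES block size
ROUNDS = 10             # pipeline depth: one round per clock period

period_ns = 1e9 / CLOCK_HZ                        # about 2.857 ns
latency_ns = ROUNDS * period_ns                   # about 28.57 ns
throughput_gbps = BLOCK_BITS * CLOCK_HZ / 1e9     # one block per clock period

# Input rate of the simulated Ethernet source: one 128-bit packet every 42.67 ns.
# Bits per nanosecond are numerically equal to Gbit/s.
source_gbps = BLOCK_BITS / 42.67
```

The pipelined core therefore offers a large margin with respect to the roughly 3 Gbit/s delivered by the data source.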
The temporal trends related to the expansion of the key are shown in Figure 16; the key is previously stored in the registers at an instant chosen by the user and validated through a signal provided by the processor. During the key expansion, which lasts 174.55 ns, the m00_axis_tvalid signal is reset, indicating that no valid encrypted packets are provided by the encryption block. The error_sig signal is set in correspondence with the key change because the table related to the new key is already considered, whereas the packets are still obtained with the old key; the signal returns to zero as soon as the encrypted packets are obtained with the new key.
The implementation of all the control and synchronization signals, above described, is one of the main contributions provided by the proposed work, fundamental for the correct operation of the whole encryption/decryption system, ensuring correct interoperability of the developed encryption/decryption block with the other sections of the Wireless Connector system.

Afterward, the VHDL block implementing the corresponding decryption algorithm, called AES_128_DEC (red box in Figure 17), has been developed, along with the blocks employed to test it, reproducing an operating scenario similar to that of the final application. Like the encryption algorithm, the decryption algorithm has been implemented by parallelizing many logical instructions on each rising edge of the clock; similarly, an Sbox matrix consisting of 256 elements of 32 bits each was used. In Figure 18, the source files used to implement the code performing the decryption are shown; the first four files (AES_128_DEC_v1_0, AES_128_DEC_v1_0_S00_AXIS_inst, AES_128_DEC_v1_0_S01_AXI_inst and AES_128_DEC_v1_0_M00_AXIS_inst) implement the communication between blocks by means of the AXI Stream and AXI Lite buses. The files containing the code developed to perform the AES-128 decryption algorithm are the last two shown in Figure 18, named aes_decoding_block and cipher_key_expansion_block.
The cipher_key_expansion_block file contains the code implementing the key expansion, which generates the 10 sub-keys employed during the decryption process; to expand the key, the same matrix used for the encryption process (called sbox_encoding_4) is deployed. The 10 rounds for obtaining the plaintext data packets are implemented within the aes_decoding_block source file, and the rounds are developed in the same way as for the encryption operation.
In each round, the InvSubBytes, InvShiftRows and InvMixColumns operations are combined to obtain the plaintext data packets. These operations are carried out by using four 16 × 16 matrices of 32-bit elements, called sbox_decoding_0, sbox_decoding_1, sbox_decoding_2 and sbox_decoding_3, which are equivalent to the operations of the AES decryption algorithm; in particular, XOR operations are carried out between the intermediate data obtained during the different rounds of the decryption algorithm and the elements of these matrices (Figure 19). These matrices allow the plaintext data packets to be obtained in only 10 clock periods, considerably reducing the time needed to perform the decryption process; however, greater resource utilization of the FPGA device is required.
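A possible construction of four such 256-entry, 32-bit tables is the inverse T-table scheme, in which each entry holds the inverse S-box output already multiplied by the InvMixColumns coefficients. The Python sketch below shows that scheme for a single column, assuming the sbox_decoding_0 to sbox_decoding_3 matrices follow the same idea; their actual contents are not given in the text, and the names `TD`, `inv_column_lut` and `inv_column_direct` are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
INV_SBOX = [0] * 256
for x, s in enumerate(SBOX):
    INV_SBOX[s] = x

def ror32(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

# Entry bytes: (0E, 09, 0D, 0B) times the inverse S-box output.
TD0 = [(gf_mul(s, 0x0E) << 24) | (gf_mul(s, 0x09) << 16)
       | (gf_mul(s, 0x0D) << 8) | gf_mul(s, 0x0B) for s in INV_SBOX]
# The other three tables are byte rotations of the first one.
TD = [[ror32(t, 8 * k) for t in TD0] for k in range(4)]

def inv_column_lut(col):
    """InvSubBytes + InvMixColumns on one column: four lookups XORed together."""
    return TD[0][col[0]] ^ TD[1][col[1]] ^ TD[2][col[2]] ^ TD[3][col[3]]

def inv_column_direct(col):
    """Same column computed with explicit GF(2^8) multiplications."""
    a = [INV_SBOX[b] for b in col]
    coef = [0x0E, 0x0B, 0x0D, 0x09]
    out = []
    for r in range(4):
        byte = 0
        for k in range(4):
            byte ^= gf_mul(a[(r + k) % 4], coef[k])
        out.append(byte)
    return int.from_bytes(bytes(out), "big")

sample = [0x69, 0xC4, 0xE0, 0xD8]
```

Using the tables inside a full decryption also requires the round ordering and round keys of the equivalent inverse cipher, which this single-column sketch does not cover.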
The developed decryption algorithm provides the plaintext data packets at the output of the AES_128_DEC block in just 28.560 ns for a 350 MHz clock frequency, thus obtaining the same maximum data rate as the encryption algorithm (i.e., 44.8 Gbit/s). Similarly to the encryption algorithm, the s00_axis_tvalid signal, provided by the Cipher_Data_Generator block (green box in Figure 17), indicates that encrypted data packets are available for decryption; the availability of plaintext data packets at the output of the decryption block is signaled by the m00_axis_tvalid signal, as depicted in Figure 20.
The developed decryption block receives the new key and stores it in four 32-bit registers; the key is validated by setting the dedicated bit of the key_valid register. The algorithm checks this bit every 85.6 ns; if it is set, the new key is acquired and the flag bit is reset, thus communicating to the processor that the key change has been received, and afterward the key expansion routine starts. During this process, the m00_axis_tvalid signal is reset, indicating that no valid decrypted packets are available. This operation requires 205 ns, 177 ns more than the 28.56 ns needed to provide the first valid decrypted packet for the new validated key.
To verify the correctness of the decrypted data packets and to check the sensitivity of the implemented algorithm in detecting errors, an Insert_Error block (pink block in Figure 17) has been implemented, similar to that used for the encryption algorithm; the sig_error signal triggers the change of a single bit in the input word, in order to verify that the Pattern_Verificator block (blue block in Figure 17) detects the error. When the error is detected, the error_sig bit is set in correspondence with the decrypted data packet that does not match the word set stored in its internal table.
Finally, the s00_axis_tready signal has been configured to indicate the availability of the decryption block to accept a new encrypted data packet. As discussed above, the implemented algorithm can accept encrypted data packets, and thus perform the decryption, on every rising edge of the system clock; therefore, it is always ready to accept new encrypted data packets. Consequently, the s00_axis_tready signal is reset only if m00_axis_tready is reset, namely if the block downstream of the decryption block cannot accept decrypted data packets.

Post-Synthesis Simulation Results: Resources Utilization of the Encryption/Decryption Systems
In this sub-section, the simulations performed to determine the resource utilization of the developed AES-128 algorithm on the ZCU102 FPGA platform are reported. First, the post-synthesis simulations have been performed on both the encryption and decryption blocks, with data packets provided on each rising edge of the 350 MHz clock signal; afterward, the simulation has been repeated by modifying the Data_Generator block to provide data packets at the input of the encryption/decryption block every 42.7 ns, thus verifying that the hardware usage remains unchanged.

Figure 20. Temporal trends related to the decryption phase (incoming encrypted data packets and outgoing plaintext data packets); the time interval required to obtain the plaintext data packet from the encrypted data packet is highlighted by time markers applied to the s00_axis_tvalid (yellow box) and m00_axis_tvalid (orange box) signals.
The FPGA resource utilization of the encryption algorithm is reported in Table 1; for the complete encryption system, in both cases discussed above, the hardware occupation is 5.48% for LUTs and 0.78% for FFs. Afterward, the encryption system has been simulated after removing all the blocks used to verify the correct behavior of the algorithm, leaving only the blocks involved in the encryption algorithm; the resulting hardware utilization is 4.76% for LUTs and 0.71% for FFs, i.e., a reduction of 0.72% for LUTs compared to the previous case including all the blocks.
For the decryption algorithm, the post-synthesis simulations have been performed both when the data packets are provided on each rising edge of the 350 MHz clock signal and when the Data_Generator provides encrypted packets every 42.7 ns (i.e., 23.4 MHz packet rate, Table 2). In the first case, the hardware utilization of the FPGA device is 10.62% for LUTs, 0.79% for FFs and 0.25% for the Global Buffers (BUFG); in the latter case, it is 10.64% for LUTs, 0.79% for FFs and 0.25% for BUFGs. Finally, the post-synthesis simulation of the decryption system has been performed after removing all the blocks used to verify the correct behavior of the algorithm, leaving only the block involved in the decryption algorithm. This configuration shows a hardware utilization of 10.11% for LUTs, 0.71% for FFs and 0.25% for BUFGs, with a reduction in hardware occupation of 0.53% for LUTs and 0.08% for FFs compared to the complete decryption scheme (Table 2).
Table 2. Hardware resource utilization related to the complete decryption scheme, including all the blocks to test the decryption algorithm, both when the encrypted packets are received in input on each rising edge of the 350 MHz clock signal and when they are provided every 42.7 ns (i.e., 23.4 MHz); the resource utilization of the decryption block alone is also reported.
As can be seen from Tables 1 and 2, which show the use of hardware resources for both the encryption and decryption systems, the LUTs used on the FPGA by the latter are 2.12× those used by the encryption algorithm when considering only the blocks that perform the decryption (10.11% vs. 4.76%) and 1.94× when also considering the blocks needed to test it (10.62% vs. 5.48%). This is due to the implementation of four matrices (sbox_decoding_0, sbox_decoding_1, sbox_decoding_2 and sbox_decoding_3) in the decryption algorithm, each containing 32-bit elements deriving from the Inverse SubBytes, Inverse Shift Rows and Inverse Mix Columns operations. In particular, the greater hardware resource consumption is attributable to the multiplications of the Inverse Mix Columns operation carried out in the decryption block, because they involve constants such as 0x09090909, 0x0B0B0B0B, 0x0D0D0D0D and 0x0E0E0E0E; such multiplicative constants require numerous intermediate values to be stored inside the LUTs, occupying more hardware resources and consuming more power [34]. For this reason, several strategies have been proposed in the scientific literature for reducing resource utilization and power consumption [35,36]. However, since the area occupation requirement is not as stringent as the encryption/decryption speed for the specifications of the developed project, the implementation choice fell on obtaining the data packets in the shortest possible time at the expense of a greater chip area occupation.
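The LUT ratios between the decryption and encryption systems can be re-derived from the post-synthesis percentages quoted in the text. The following is only a worked check; the dictionary keys `core` and `with_test` are labels introduced here for the two configurations (algorithm blocks only, and complete system with test blocks at 350 MHz).

```python
# LUT utilization (%) quoted in the text for the two configurations.
encryption_lut = {"core": 4.76, "with_test": 5.48}
decryption_lut = {"core": 10.11, "with_test": 10.62}

# Ratio of decryption to encryption LUT usage for each configuration.
lut_ratio = {
    name: decryption_lut[name] / encryption_lut[name]
    for name in encryption_lut
}

for name, ratio in lut_ratio.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratio is larger for the algorithm blocks alone because the test blocks add a similar absolute amount of logic to both systems, which dilutes the relative difference.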

Discussion
In this section, the results of the post-implementation simulations carried out on the combined system, constituted by the cascade of the encryption and decryption systems, are reported, to verify that the resulting performance is adequate for the correct operation of the algorithm once the project is loaded on the FPGA-ZCU102 platform.

Post-Implementation Simulations: Clock Routing Issues and Overall Performances of the Combined Encryption/Decryption System
The post-implementation simulations represent the closest emulation to downloading a design to a device, providing useful indications about the functional and timing requirements of the developed system.
After setting the appropriate parameters and using synthesizable blocks, such as the Clocking Wizard for the system clock, and after mapping the clock signal and the error_sig signal provided by the Pattern_Verificator to interface pins on the board, the post-implementation simulation of the encryption system has been carried out; the simulation results indicated a timing problem related to the propagation of the signals within the FPGA-ZCU102 chip. In particular, for a 350 MHz system clock, a Worst Negative Slack (WNS) equal to −1.014 ns has been obtained, indicating excessive delays in the propagation of the digital signals inside the FPGA chip and thus incorrect scheduling of the performed tasks; a non-negative WNS is therefore required to ensure the proper operation of the developed encryption/decryption systems.
To overcome this problem, several post-implementation simulations with a lower system clock frequency have been carried out, obtaining an improvement of the WNS parameter (Table 3); in particular, a WNS equal to 0 ns was obtained with a 190 MHz system clock, and a WNS equal to 0.056 ns with a 180 MHz one. Furthermore, to support a higher system clock frequency, it is possible to use the implementation strategies provided by the Vivado tool; a common strategy suitable for both the encryption and decryption blocks has been chosen, since the final simulations have been carried out on the combined system. The post-implementation simulation has been performed by setting a 220 MHz system clock frequency and adopting the Explore strategy, thus obtaining a WNS equal to 0.005 ns for the encryption system and 0.008 ns for the decryption one.
The area utilization resulting from the post-implementation simulations remains unchanged compared to the results of the post-synthesis simulations, showing a resource utilization of 5% of LUTs, 1% of FFs, 1% of I/O ports and 1% of BUFGs for the encryption system, and of 10% of LUTs, 1% of FFs, 1% of I/O ports and 1% of BUFGs for the decryption system; finally, for both systems, there is a 25% area utilization relative to the IP Clocking Wizard block used to generate the system clock during the post-synthesis and post-implementation simulations. Before performing the post-implementation simulations of the whole system including the encryption and decryption blocks, behavioral simulations with a 220 MHz clock frequency have been carried out.
In Figure 21, the temporal trends of the signals are shown, obtained by providing the plaintext data packets (red box) to the encryption/decryption system every 40.86 ns; this data rate derives from the 220 MHz clock frequency, corresponding to a 4.54 ns clock period, and has been chosen to comply with the 3 Gbit/s throughput required by the specifications of the Wireless Connector system, as calculated below in Equation (5).
The encryption of the data packets is performed in 9.5 clock periods (white box); in fact, the encrypted packets are provided at the output of the encryption block on the falling edge of the clock, after exactly 9.5 clock periods, and are then acquired by the decryption block on the next rising edge. Afterward, the data packet is decrypted in 9.5 clock periods (blue box) and provided at the output of the decryption block after 10 clock periods overall. Therefore, the encryption and decryption operations together last 20 clock periods, equal to 90.8 ns for a system clock frequency of 220 MHz (Figure 21).

Figure 21. Temporal trends related to the behavioral simulations where plaintext data packets are provided every 9 system clock periods at 220 MHz, corresponding to a data rate of 3.123 Gbit/s.

In Figure 22, the temporal trends related to the behavioral simulations are shown, with the plaintext data packets provided to the system in each clock period (220 MHz frequency). The s00_axis_tvalid signal is constantly set, indicating to the receiving block that a new data packet is available at the input; encrypted packets are therefore provided on each rising edge of the clock, allowing a data rate equal to 28.16 Gbit/s (220 MHz × 128 bit = 28.16 Gbit/s).
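The throughput figures quoted in the text follow directly from the clock frequency and the 128-bit packet size. A minimal sketch of this arithmetic is given below; the helper `throughput_gbps` and its parameter names are introduced here only for illustration.

```python
BLOCK_BITS = 128  # size of one AES data packet in bits


def throughput_gbps(clock_mhz, clocks_per_packet=1):
    """Data rate in Gbit/s when one packet is accepted every N clock periods."""
    packets_per_second = clock_mhz * 1e6 / clocks_per_packet
    return packets_per_second * BLOCK_BITS / 1e9


# One packet per clock period at the two frequencies discussed in the text.
rate_220 = throughput_gbps(220)  # post-implementation clock frequency
rate_350 = throughput_gbps(350)  # post-synthesis clock frequency
print(f"{rate_220:.2f} Gbit/s at 220 MHz, {rate_350:.1f} Gbit/s at 350 MHz")
```

The same helper can be used with `clocks_per_packet` greater than one to estimate the rate when packets are supplied less frequently than once per clock period.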
Finally, the post-implementation simulation of the overall system, constituted by the cascade of the encryption (red box) and decryption (blue box) blocks, has been carried out (Figure 23a), using the Explore implementation strategy provided by the Vivado tool. The screenshots of the Project Manager obtained after the post-implementation simulation are shown in Figure 23; a positive WNS, equal to 0.056 ns, is obtained (Figure 23b), and the hardware utilization of the overall encryption/decryption system is reported in Figure 23c. In particular, the hardware resource utilization is 15% of LUTs, 1% of FFs, 1% of I/O ports and 1% of BUFGs, plus a 25% area utilization relative to the IP Clocking Wizard block used to generate the system clock.
Besides, the total on-chip power (sum of the static FPGA power and the design power) of the combined encryption/decryption system has been estimated from the post-implementation simulation with plaintext data packets provided in each clock period; it is equal to 1.77 W, with a 26.7 °C chip temperature, ensuring a thermal margin equal to 73.3 °C (i.e., temperature limit equal to 90 °C).
Furthermore, post-implementation simulations have been carried out on the encryption and decryption systems individually, obtaining a total on-chip power consumption equal to 1.17 W and 0.99 W, with chip temperatures of 26.5 °C and 26.1 °C, respectively. By providing the plaintext data packets in input to the encryption block every 40.86 ns, the post-implementation simulation of the combined encryption/decryption system indicates a power consumption of only 365 mW, with a 25.5 °C chip temperature.

Testing of the Developed Encryption/Decryption Algorithm on ZCU102 Evaluation Board
After the generation of the bitstream file of the developed project, including the cascade of the encryption and decryption blocks, the file has been loaded on the FPGA-ZCU102 evaluation board. To monitor the signals of interest, the Integrated Logic Analyzer (ILA) IP has been added to the Block Design. Also, to verify the correctness of the decrypted packets provided by the cascade of the encryption and decryption blocks, only a single encryption key has been used during the test phase, initially loaded into four 32-bit registers and subsequently automatically validated; therefore, the error_sig signal produced by the Pattern_Verificator block is expected to remain low, indicating the absence of errors in the comparison between the packets provided by the decryption block and those contained in the Pattern_Verificator table.
The tests carried out on the board confirmed the proper operation of both the encryption and decryption algorithms, in agreement with the post-implementation simulations reported in the previous paragraph. In Figure 24, the temporal trends related to the complete encryption/decryption system are shown, in which the plaintext data packets, provided every 9.5 clock periods, are accepted by the encryption block (red box) and the encrypted packets are delivered to the decryption block (white box), thereby obtaining the decrypted packets downstream (blue box in Figure 24). As expected, the error_sig signal remains low throughout the observation period, indicating that the packets are processed correctly, namely that the packets leaving the decryption block are equal to those provided at the input of the encryption block.
Figure 24. Temporal trends showing the plaintext data packets entering the system (red-dashed box), the encrypted ones delivered by the encryption block to the decryption block (white-dashed box) and, finally, the decrypted packets provided in output by the decryption block (blue-dashed box); as evident, the plaintext data packets provided in input to the system are equal to those provided by the decryption block (as indicated by the red arrow), as also demonstrated by the error_sig signal, which remains low throughout the observation period (yellow-dashed box).
In Figure 25, the temporal trends related to the complete encryption/decryption system are shown, in which the plaintext data packets are provided in input on each rising edge of the clock signal; as can be noticed, in this case too the error_sig signal remains low, indicating the proper operation of the encryption/decryption system.

Comparison of the Proposed AES-128 Implementation with Other Works Reported in the Literature
For the Zynq SoC, just like other FPGAs, the PL section is constituted by CLBs arranged in a matrix structure; each CLB contains two slices, each including four LUTs and eight FFs, and a configurable switch matrix [37]. Therefore, from the results shown in Tables 1 and 2, the numbers of CLBs and slices employed by the developed AES-128 encryption and decryption blocks are 1631/3262 and 3464/6928, respectively. Table 4 compares the proposed implementation of the AES-128 encryption algorithm with other pipelined implementations previously reported in the scientific literature, similar in operating frequency and supported throughput; the platforms employed to develop the reported implementations are also indicated, since the FPGA technology affects the performance of encryption and decryption. The figure of merit chosen for comparing the different implementations is the efficiency, defined as:
$$\text{Efficiency} = \frac{\text{Throughput [Mbit/s]}}{\text{Number of slices}}$$
In particular, this quantity is representative of how efficiently the FPGA hardware resources are used to support a given output throughput.
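As a worked check of the efficiency figure of merit, the sketch below recomputes it for the encryption block from the values quoted in the text (1631 CLBs, two slices per CLB, 28.16 Gbit/s). It assumes the efficiency is the throughput in Mbit/s divided by the number of slices, consistent with the Mbps/slice unit used in the text; the function names are hypothetical.

```python
SLICES_PER_CLB = 2  # as stated in the text for the considered device


def slices_from_clbs(clbs):
    """Number of slices corresponding to a given number of CLBs."""
    return clbs * SLICES_PER_CLB


def efficiency_mbps_per_slice(throughput_gbps, slices):
    """Throughput in Mbit/s supported per occupied slice."""
    return throughput_gbps * 1000 / slices


encryption_slices = slices_from_clbs(1631)
encryption_efficiency = efficiency_mbps_per_slice(28.16, encryption_slices)
print(f"{encryption_slices} slices, {encryption_efficiency:.2f} Mbps/slice")
```

For the encryption block this gives about 8.63 Mbps/slice.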

As evident from the results reported in Table 4, the proposed solution can reach high data throughput values (up to 28.16 Gbit/s) with considerably lower utilization of the hardware resources compared to other works, thus allowing higher efficiency. Compared with the best-performing implementation, reported in Reference [42], our solution obtains a slightly lower maximum data throughput (−5.3%) but also employs far fewer FPGA hardware resources (−39.7%), thus resulting in a higher efficiency value (+56.9%). Also, comparing our solution implementing both the encryption and decryption operations with the one reported in Reference [33], a clear superiority of the former is evident, as indicated by its higher efficiency value (+92.9%).
As mentioned above, it must be considered that the comparison shown in Table 4 is made between solutions implemented on platforms that differ in technology, architecture and maximum clock frequency; therefore, the enhanced performance of our solution is also attributable to the advanced features and complex architecture of the platform used, but mainly to the implemented solutions aimed at speeding up the encryption/decryption process. Such advanced specifications are required to comply with the constraints imposed by the Wireless Connector system, which are also related to the other functionalities included in the developed communication system. Finally, the platform typology must be considered as a parameter of the reported analysis to obtain a fair comparison.

Conclusions
In this research work, we have proposed a high-speed implementation of the well-known AES-128 algorithm developed for a custom, very short-range and high-frequency communication system, called Wireless Connector; specifically, the latter supports high-throughput data transmission in a frequency range around 60 GHz between two mobile stations located at short range (1-10 m). The core of the communication system is constituted by a Xilinx ZCU102 FPGA platform, which manages all the base-band operations, including the encryption and decryption of the data packets; a prototype of the Wireless Connector was realized, demonstrating its proper operation. In particular, a pipelined approach has been applied to the round-based processing typical of the AES algorithm, allowing multiple successive plaintext packets to be processed simultaneously, with a new packet accepted in each clock period, and thus reaching higher data throughput values; furthermore, a 32-bit 16 × 16 Sbox matrix was employed to speed up the Substitute Byte step compared to the classic 8-bit implementation.
Encryption and decryption VHDL blocks have been developed on the Xilinx ZCU102 FPGA platform, carrying out multiple elaborations of the incoming data packets to comply with the 3 Gbit/s data rate constraint required by the Wireless Connector application. The developed encryption system can operate at a 220 MHz maximum clock frequency, supporting an encryption time of just 10 clock periods. Thanks to the pipelined elaboration, the proposed implementation is able to process and provide an encrypted packet in each clock period (namely, $4.54\ \text{ns} = 1/(220\ \text{MHz})$), reaching a maximum data throughput higher than 28 Gbit/s (i.e., $128\ \text{bit/packet} \,/\, 4.54\ \text{ns} = 28.16\ \text{Gbit/s}$). Similarly, the decryption system employs just 10 clock periods to obtain the plaintext data packets.
Furthermore, the developed AES-128 encryption implementation features a higher efficiency (8.63 Mbps/slice) compared to similar solutions operating in the same frequency range, requiring just 1631 CLBs, 13043 LUTs and 3877 FFs. However, the decryption implementation requires higher resource utilization than the encryption one (3464 CLBs, 27713 LUTs, 3912 FFs and 1 BUFG), due to the four matrices derived from the Inverse SubBytes, Inverse Shift Rows and Inverse Mix Columns operations, each containing 32-bit elements; the greater resource utilization is associated with the Inverse Mix Columns operation, given the multiplicative constants involved in its matrix representation and its LUT-based implementation inside the FPGA, as detailed in Section 3.2.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.