A Low Area High Speed FPGA Implementation of AES Architecture for Cryptography Application

: Nowadays, a huge amount of digital data is frequently changed among different embedded devices over wireless communication technologies. Data security is considered an important parameter for avoiding information loss and preventing cyber-crimes. This research article details the low power high-speed hardware architectures for the efﬁcient ﬁeld programmable gate array (FPGA) implementation of the advanced encryption standard (AES) algorithm to provide data security. This work does not depend on the Look-Up Table (LUTs) for the implementation the SubBytes and InvSubBytes stages of transformations of the AES encryption and decryption; this new architecture uses combinational logical circuits for implementing SubBytes and InvSubBytes transformation. Due to the elimination of LUTs, unwanted delays are eliminated in this architecture and a subpipelining structure is introduced for improving the speed of the AES algorithm. Here, modiﬁed positive polarity reed muller (MPPRM) architecture is inserted to reduce the total hardware requirements, and comparisons are made with different implementations. With MPPRM architecture introduced in SubBytes stages, an efﬁcient mixcolumn and invmixcolumn architecture that is suited to subpipelined round units is added. The performances of the proposed AES-MPPRM architecture is analyzed in terms of number of slice registers, ﬂip ﬂops, number of slice LUTs, number of logical elements, slices, bonded IOB, operating frequency and delay. There are ﬁve different AES architectures including LAES, AES-CTR, AES-CFA, AES-BSRD, and AES-EMCBE. The LUT of the AES-MPPRM architecture designed in the Spartan 6 is reduced up to 15.45% when compared to the AES-BSRD.


Introduction
Information is treated as the most valuable asset in present day since about billions of items of information are processed and shared per second in this technological world. This large amount of information must be processed and shared in a very secure manner. Since information is treated as a most valuable thing, there is also a very high chance of piracy or alteration of this information by an unintended user [1][2][3][4][5]. Unauthorized change and availability of the original information also form the major goal of the researchers to keep the information secure. Researchers use cryptographic algorithms to keep information secret while transmitting and receiving information on an insecure channel. Cryptographic algorithms perform encryption operations on the original plain text before transmitting and perform decryption operation after receiving the encrypted text from the insecure the inner pipelining, outer pipelining, and loop-unrolling were used for obtaining a higher throughput. A new-affine-transformation, which was a combination of inverse isomorphic and affine transformation, was used to minimize the area of the Sub-Bytes. Moreover, the operation of the mix-column was separated into two stages for obtaining an equal latency among the stages. However, the execution of the mix column operation in on clock cycle was resulted in higher latency.
Arul Murugan, C et al. [49] presented iterative architecture of AES-128 encryption for enhancing the security. The LUT was used in the composite field arithmetic (CFA) to perform the multiplication in the S-box. Subsequently, the area was minimized by using the updated architecture of the S-box in the AES. Additionally, the Vedic multiplier was used in the operation of mix columns for minimizing the hardware resources of the AES. Here, three different FPGA devices such as Virtex-4, Virtex-5, and Spartan 3 devices were used to design the AES algorithm. The utilization of the CFA resulted in higher speeds in the process of sub-byte transformation. However, the architecture of the mix column transformation was highly complex when compared to the architecture of the shift-row transformation. Wegener, F et al. [50] developed the extended version of the AES in Spartan 6 for compromising the requirements of area and latency. The Bit-Serial Rotational Design (BSRD) was used instead of the S-box in the AES architecture. Moreover, the nonlinear proportion was significantly decreased by using the rotational symmetry technique for implementing the S-box of the AES algorithm. The linear operation was replicated and executed independently in the AES algorithm, which increased its area.
Madhavapandian and Maruthu Pandi [51] presented the design of secure architecture for the Transmission Control Protocol/Internet Protocol (TCP/IP). The protection over the TCP/IP protocol was achieved by using the 128 bit of AES algorithm. Next, the AES architecture was optimized for reducing the power consumption and cost. Moreover, an improved compressed structure for the mix columns was used along with the efficient mix column Boolean expression (EMCBE) for optimizing the structure of the AES. Here, the AES algorithm was designed and implemented using the Virtex 6 algorithm. However, the technique of gate replacement used for minimizing the delay affected the scalability of this AES architecture. Zodpe and Sapkal [52] developed the pseudo-random noise (PN) sequence generator for generating the values of the S-box. The distinct sequence of random numbers was generated by using the PN generator based on the feedback taps and initial seed value. An efficiency of the cryptosystem was improved based on the features of the PN sequence generator. Here, the requirement for applying the external key was avoided by using an initial key generation technique. However, the utilization of hardware resources of the AES was high, when it was designed by using the non-pipelined stages.
Sikka et al. [53] presented the high throughput FPGA implementation for the AES algorithm. Here, the AES algorithm was designed for the automotive applications. Specifically, the Vivado high-level synthesis (HLS) tool was used to design the AES with a 128-bit of key size and block size of 10 rounds. This HLS was used mainly based on the application-specific bit widths that used to design the FPGA. The implementation of AES using the HLS was used to improve the speed. However, the delay of this AES algorithm was increased because of the frequent re-calculation of the signal width for each input, output and intermediate value.
The existing methods are analyzed and explained as follows: for a better implementation of the AES, the hardware resources (e.g., slices and LUT) are required less during the encryption/decryption process. But, the requirements of the multiplexers are increased, when mix column operation is minimized in the LAES [47]. The replication of linear operation increases the area of the AES in encryption/decryption processes [50]. Moreover, the hardware resources of the AES are improved when it is designed by using the nonpipelined stages [52]. The latency is increased in the AES operation in the mix column operation [48]. Next, the delay of the AES is increased due to the frequent re-calculation of the signal width [53]. In order to overcome the aforementioned issues, an MPPRM based SubBytes transformation is designed to reduce the hardware resources used during the encryption/decryption process. Additionally, the delay element added in the AND gate of the MPPRM is used to minimize the delay caused in the encryption process.
Generally, the SubBytes transformation processed in the AES is considered as bottleneck during the encryption process, since the transformation calculates the most resource demanding multiplicative inverse function at the finite field arithmetic operation. The SubBytes transformation designed by using the read only memory creates an inappropriate delay in the AES encryption/decryption [54]. From the modifications of pipelining, subpipelining, and loop-unrolling, the subpipelining is used to achieve maximum speed with minimal area occupancy. LUT based approaches used in the SubBytes and InvSubBytes transformation stages causes much delay than the total delay of all of the other transformations. Therefore, in this work LUT based transformation is replaced with the minimal delay combinational circuits. Composite field architecture is one such architecture that provides the required logics. Also, composite field arithmetic structure has been implemented with subpipelinining architecture in this work. However, delays can also be reduced by eliminating the delays occurring due to dynamic hazards that take place in combination circuits. This is achieved by implementing SubBytes and InvSubBytes transformation in MPPRM architecture. Here, both the architectures are explored and hardware utilization comparisons are made to explore the most suitable architecture. Algorithms used for secure communication must withstand the cryptanalytic attacks which aim to find the secret key based on the techniques and mathematical operations used in converting plain text to cipher text. With the aim of optimizing the AES structure, a MPPRM based SubBytes transformation is accomplished for achieving the less area and lesser hardware resources during the FPGA implementation.
The most important contributions of this research work are given as follows: This work uses composite field arithmetic structure both SubBytes and InvSubBytes transformations with the speed efficient subpiplining structure. Earlier approach of implementing CFA is to decompose GF(28) as GF(((22)2)2). Where in this work GF(28) is decomposed as GF( (24)2).
MPPRM architecture is used for both SubBytes and InvSubBytes transformations for minimizing the AND and XOR gates to minimize the hardware resources used in the encryption/decryption process.
In this work, key expansion architecture is used to provide 128-bit key for subpipelined structure. A speed efficient AES-MPPRM architecture is achieved, because of the delay element added in the AND gate output.
The organization of the paper is given as follows: The information about the conventional AES algorithm is provided in Section 2. Section 3 provides the detailed explanation about the AES-MPPRM architecture. The modified new architecture in key expansion module is detailed in Section 4. Section 5 gives the results and discussion of the AES-MPPRM architecture. The conclusion of this research paper is provided in Section 6.

Related Works
The information about the conventional AES encryption and decryption process is provided in this section. This section contains the process of substitution, shift rows, mix columns, added round keys, key expansion architecture, subpipelined architecture, and composite field arithmetic operation.

AES Algorithm
Advanced encryption standard (AES) is a symmetric cryptographic algorithm, in which a single secret key is used for both encryption and decryption operations. AES has three different versions and each version operates at different bit levels of secret key. Based on the number of bits in the secret key, AES can be classified as AES-128, AES-192, or AES-256. Number of rounds of encryption and decryption operation for each version also varies from one another and it is given in Table 1.  13 15 Every round in the AES algorithm uses different subkeys, which are generated from the main original key. Although key sizes are different for each version, no of bits from the original data to be communicated securely remains same for all the versions. Data block size is 128 bits for all of the three versions of the AES algorithm. Number of rounds denotes the number of encryption operation takes places for a single data block with different subkeys obtained from the key expansion process. Diffusion and confusion process for each round remains same. Initially, the main key is utilized for pre round transformation and other subkeys are used in each round. So number of keys required will be always higher than the amount of rounds in AES versions.

AES Encryption and Decryption
AES is generally a non-Fiestel block cipher cryptographic algorithm that encrypts plain original text into encrypted cipher text with the help of secret key. Secret key lengths differ for different versions of the AES algorithm, as mentioned in Table 1. The same algorithm which is utilized to transform the plain text into cipher text can also be used to get back the plain text from the cipher text by using a secret key used in the encryption process. Except for the last round in all of the AEs versions, all of the other rounds will have the following four operations in both the encryption and decryption processes. Before proceeding to the encryption process, 128-bit plain text is grouped into a state matrix of order 4 × 4 with each element in the matrix is represented as a word (2 bytes).

Substitution Transformation
Substitution is a process in which a byte value is replaced by another byte. Only non-linear process involved in the AES algorithm is substitution.
Each byte in the state matrix is considered as a polynomial in the Galois Field 2 8 . Next, the affine transformation and matrix multiplication are used to transform the byte. Substitution can also be conducted directly using the RijndaelS-box. In decryption, inverse S-box is used for the substitution process. Substitution step introduces the actual confusion to the original data, forms the important step in preserving the original data from unauthorized access.

Shift Rows
The rows of state array are periodically shifted in the shift row phase. Here, the second, third, and fourth rows are shifted left by one, two and, three times, respectively, where the first row remains unaffected in the given input. Moreover, the rows are shifted right for accomplishing the decryption operation. Shift row is the first process of diffusion in the AES algorithm. This shift row process is invertible.

Mix Column Transformation
Similar to the operation of the shift row, the mix column operation is carried out in the column level. Here, the matrix multiplication is used to perform the operation of the mix column. The mix column and inv mix column operation are expressed in the following Equations (1) and (2).  0b  0d  09  09  0e  0b  0d  0d  09  0e  0b  0b 0d 09 0e where S 1, S 2, S 3 and S 4 are the output obtained after mix column operation; S1, S2, S3 and S4 are the input for mix column operation obtained as the output of Shift row operation.

Add Round Key
The important operation in the AES algorithm is add round key, since all of the earlier three steps are invertible. Without this operation, AES encryption becomes meaningless. Function of secret key takes place at the add round key operation. In this operation the original secret key or several subkeys are first arranges as a state matrix of 4 × 4 order. Matrix addition between keys and plain text is the operation in the add round key. As a result of this add round key operation, we get the cipher text.
All the above four operations are repeated for n number of times, where n denotes the number of rounds in the AES version. For every iteration, a different subkey is used. Therefore, for n number of rounds, n number of subkeys must be derived. Derivation of several subkeys from single main secret key is referred as key expansion process. Operation in the key expansion process or key schedule process must be carefully implemented, since the relation between subkeys should not be exploited easily. The mathematical way is followed by the key expansion process for deriving the different keys from the single way. In AES-MPPRM architecture, the time consumption is minimized for obtaining the subkeys from main key.

Key Expansion Architecture
AES-128 key expansion process is discussed in this section. In the key expansion process, the main secret key of 128 bit is divided into four words, each with 32-bit length. For AES-128, 10 subkeys must be derived from main original secret key. These subkeys are used in process of producing cipher text from plain text. Figure 1 clearly explains the process of key scheduling or key expansion. Mathematical operations are performed with the main key to derive the subkeys. Interdependency with the previous subkey makes the key expansion process efficient. Mathematical operation takes place at the word level and consecutive four words are combined together to form a subkey. For example, W4, W5, W6, and W7 are combined to form first subkey. Key expansion algorithm generates a total of 4(Nr + 1) words. Mathematical operation used here is just an EXOR operation between a previous word and a word from the previous subkey.

Generation of Temporary Word
Temporary word is the most important thing in key expansion process, since the only non-linear process involved in generation of subkeys is in the process of generation of temporary word. The first word of every subkey is generated as a result of EXOR operation between its corresponding temporary word and a word from a previous subkey. Mathematical operation takes place at the word level and consecutive four words are combined together to form a subkey. For example, W 4 , W 5 , W 6 , and W7 are combined to form first subkey. Key expansion algorithm generates a total of 4 (Nr + 1) words. Mathematical operation used here is just an EXOR operation between a previous word and a word from the previous subkey.

Generation of Temporary Word
Temporary word is the most important thing in key expansion process, since the only non-linear process involved in generation of subkeys is in the process of generation of temporary word. The first word of every subkey is generated as a result of EXOR operation between its corresponding temporary word and a word from a previous subkey. Three operations are performed in a word to create a temporary word. Substitution, Rotation and EXOR are the three operations which are performed in the process of creating a temporary word. For AES-128, temporary words are necessary to create W 4 , W 8 , W 12 , W 16 , W 20 , W 24 , W 28 , W 32 , W 36 , W 40 . Simply for i = 4Nr, temporary words are not needed. Equation (3) represents the generation of the temporary word.
where, 'i' denotes the word number and 'Nr' denotes the round number in AES. The rotword routine is identical to the Shift row routine, however it is used for the only one row. Moreover, the subword is similar to the Sub byte routine. The value of R constant differed for each round is provided in the Table 2.

Subpipelined Architecture
As stated in the first section subpipelining architectural optimization achieves maximum speed and great throughput than that of pipelining and folded architectures. To implement subpipelining, approach registers are inserted between each stage of every round and registers are also inserted between the rounds to store the intermediate results.
One of the main advantages of subpipelining over folder structures is that it can simultaneously operate multiple blocks of data. Subpipelining involves splitting each round of AES operation into n substages, as shown in Figure 2. The total delay in each substages involves set up delay, propagation delay, multiplexer delay, and combinational circuits delay (which are very small). Here, each substages within a round are divided with equal delay, with extra registers included within and between the rounds adds up more area to the architecture. It is noted that when the number of substages with equal delay are increased for each round, improvement in speed can be achieved. But dividing the rounds into substages does not improve the speed of the architecture due to the delay caused by adding a greater number of registers. Some of the component of rounds may give minimal delay when compared to the delay provided by the substages of that component. This is entirely because of the accumulated delay due to the additional registers. In those cases it is better not to divide those components into substages. Although subpipelining makes the blocks to be processed simultaneously, the average number of clock cycles for processing a block remains same. This is also the reason for not dividing each round after particular parts. As we know that that individual logic gates delay is the highest undividable delay of any combinational logics in non LUT based design. Based on this result, rounds can be divided into substages. entirely because of the accumulated delay due to the additional registers. In those cases it is better not to divide those components into substages. Although subpipelining makes the blocks to be processed simultaneously, the average number of clock cycles for processing a block remains same. This is also the reason for not dividing each round after particular parts. As we know that that individual logic gates delay is the highest undividable delay of any combinational logics in non LUT based design. Based on this result, rounds can be divided into substages.

Composite Field Arithmetic
As stated in the previous section, composite field arithmetic is implemented to increase speed as well as throughput of the AES encyptor and AES decryptor. For SubByte transformation, a byte must undergo a multiplicative inversion process and affine transformation. Composite field arithmetic is used to find the multiplicative inverse of a byte without much complexity overload. CF is denoted as GF ((2 n ) m ) and Gf (2 k ) is isomorphic to GF ((2 n ) m ) when K = nm. It is employed by mapping lower order GF to higher order GF.
The GF (2 8 ) is built from mapping GF by using the Equation (4).

Composite Field Arithmetic
As stated in the previous section, composite field arithmetic is implemented to increase speed as well as throughput of the AES encyptor and AES decryptor. For SubByte transformation, a byte must undergo a multiplicative inversion process and affine transformation. Composite field arithmetic is used to find the multiplicative inverse of a byte without much complexity overload. CF is denoted as GF ((2 n ) m ) and Gf (2 k ) is isomorphic to GF ((2 n ) m ) when K = nm. It is employed by mapping lower order GF to higher order GF.
The GF (2 8 ) is built from mapping GF by using the Equation (4).
The values of F and L are chosen as 10 and 1100, respectively, to make the above three equations as an irreducible equation. Isomorphic mapping function f(x) = δ × x and its inverse f(x) −1 are to be mapped to its composite field element and its inverse field element, where δ is a matrix with binary digits as its elements based on the GF (2 8 ) elements and its composite field elements. Suitable matrix for composite field and Galois field is expressed in the Equation (5). Due to isomorphic mapping, transformation stages in an AES algorithm are not suited for composite field implementation. Mix column and inverse mix column transformation must be provided some modification to make it suitable for the subpipelining architecture. In mixcolumn operations, {02} 16 16 , and {57} 16, respectively, in the composite field. These modifica-tions increase the hardware complexity on implementation of mixcolumn and inverse mixcolumn operations. In case of shift rows, Inverse Shift Rows and Key expansion stages operations are independent of Galios field representations. Therefore, no modifications are needed for these stages to implement Composite Field Arithmetic structure suitable for subpipelining architecture.

A Novel Proposed Multi-Stage Positive Polarity Reed Muller Architecture
In the proposed AES-MPPRM architecture, the MPPRM structure is used to the design the SubBytes/InvSubBytes transformation. Next, this SubBytes/InvSubBytes transformation is used to design an effective subpipelining structure. Therefore, the design of MPPRM structure used in the AES is used to minimize the hardware resources as well as it used to improve the design of the overall AES architecture. The detailed process of designing the AES-MPPRM architecture is described in the following sections. Figure 3  • Finally, the operation of XOR is performed over the current state matrix's bytes along with the round key bytes to obtain the encryption output. Composite field arithmetic architecture for the subbyte transformation is efficient in terms of power consumption, but power consumption can still be reduced by carefully designing some components of the CFA structure into two level logic architecture which is suitable for subpipelining architecture. Subbyte transformation operation can be splitted into three stages (pre-inversion stage, inversion stage, and post-inversion stage) to achieve better power efficiency. Splitting the subbyte transformation into further more Composite field arithmetic architecture for the subbyte transformation is efficient in terms of power consumption, but power consumption can still be reduced by carefully designing some components of the CFA structure into two level logic architecture which is suitable for subpipelining architecture. Subbyte transformation operation can be splitted into three stages (pre-inversion stage, inversion stage, and post-inversion stage) to achieve better power efficiency. Splitting the subbyte transformation into further more stages increase the power consumption rather than decreasing it. Splitting the entire architecture into two stages also does not show any significance in power consumption. So it is wise to divide the CFA architecture for Subbyte and Invsubbyte transformation into three stages. All these stages are implemented with AND and XOR gate Arrays.

Modified PPRM Architecture
In AES-MPPRM architecture, the MPPRM is designed for the SubByte Transformation. Conventionally, the SubByte Transformation is implemented with three stage PPRM architecture as shown in Figure 4. PPRM architecture is modified into MPPRM with reduced area and improved time efficiency, by introducing a minute change in the second stage of PPRM architecture. In this MPPRM modification is introduced only in second stage, without altering any structural logic in stage-1 and stage-3. The sample modified architecture from the PPRM to the MPPRM is shown in Figure 5. Figure 5 details the modification introduced in the stage-2 of standard PPRM architecture. By analyzing the above two logic structures, it is clear that new architecture has two EXOR gates and a single AND gate whereas the existing architecture has three EXOR gates and four AND gates. Output of both the architectures remains same irrespective of the gate used. A delay element has to be added to the output of AND gate in the second structure, since the processing time of AND gate is faster than the XOR gate. Main aim of PPRM architecture is to reduce the power consumption due to dynamic hazards is preserved in this new logical structure also. The logical expression of Figure 5 can be expressed as follows in Equation (6).
The difference between the conventional PPRM and MPPRM are clearly represented by using the mathematical expressions which are given in the following sections. The difference between the conventional PPRM and MPPRM are clearly represented by using the mathematical expressions which are given in the following sections.    3 = x5 + x6 + x2 + x1 2 = x6 1 = x7 + x5 + x3 + x2 + x6 + x4 + x1 0 = x7 + x5 + x3 + x2 + x6 + x0 From the three 4-bit values (a, b & c) of stage-1, the value of c0 − c3 is taken as input for stage-2. Similar to stage-1, stage-2 is also processed with AND and XOR arrays. This stage-2 provides the one 4-bit value i.e., d0 − d3 as shown in the Equation (10).
From Figure 6, it is known that the PPRM structure requires 17 AND gates and 16 XOR gates whereas the MPPRM requires only 6 AND gates and 13 XOR gates. Therefore, the changes proposed in the stage-1 are used to reduce the hardware utilization of the MPPRM architecture. Meanwhile, the operating frequency of the AES-MPPRM architecture is minimized by increasing the speed of the overall AES architecture. From Figure 6, it is known that the PPRM structure requires 17 AND gates and 16 XOR gates whereas the MPPRM requires only 6 AND gates and 13 XOR gates. Therefore, the changes proposed in the stage-1 are used to reduce the hardware utilization of the MPPRM architecture. Meanwhile, the operating frequency of the AES-MPPRM architecture is minimized by increasing the speed of the overall AES architecture.

Detailed Hardware Implementation Architectures
Optimized architectures for all the stages of AES encryption and decryption are presented in detailed nature which increases speed and reduce the total area occupied by the existing architecture. An efficient key expansion, mixcolumn and invmixcolumn structure is also provided in this section.

Subbytes/Invsubbytes Transformation
In Subbyte and Invsubbyte transformation stage, multiplicative inversion operation takes a large number of gates to give the correct output. It takes approximately 620 gates to find the multiplicative inverse in GF(2 8 ). As it is known that multiplicative inverse in

Detailed Hardware Implementation Architectures
Optimized architectures for all the stages of AES encryption and decryption are presented in detailed nature which increases speed and reduce the total area occupied by the existing architecture. An efficient key expansion, mixcolumn and invmixcolumn structure is also provided in this section.

Subbytes/Invsubbytes Transformation
In Subbyte and Invsubbyte transformation stage, multiplicative inversion operation takes a large number of gates to give the correct output. It takes approximately 620 gates to find the multiplicative inverse in GF(2 8 ). As it is known that multiplicative inverse in GF(2 8 ) can be found out in GF((2 4 ) 2 ), then it can be decomposed into GF(2 2 ) and then finally into GF (2). Multiplicative inverse in GF(2) is a simple logical AND operation. An efficient method to find the multiplicative inverse is to find the inverse of each bit i.e., {y3, y2, y1, y0} = {y 3 −1 , y 2 −1 , y 1 −1 , y 0 −1 }. It is illustrated in the following Equation (13).

Mixcolumns/Invmixcolumns
Since subpipelining is employed in subbyte transformation, a suitable architecture for the mixcolumn and inverse mixcolumn that shares common structure is proposed. In mixcolumn transformation it is essential to calculate multiplication of 02 and 03 operations. Generally, mixcolumn operation is modified as shown in Equation (14).
By the above expressions, 03 multiplications are avoided and the following Figure 7 shows the 02 multiplication process in GF.

Mixcolumns/Invmixcolumns
Since subpipelining is employed in subbyte transformation, a suitable architecture for the mixcolumn and inverse mixcolumn that shares common structure is proposed. In mixcolumn transformation it is essential to calculate multiplication of 02 and 03 operations. Generally, mixcolumn operation is modified as shown in Equation (14).
where = 0, + 1, + 2, + 3, By the above expressions, 03 multiplications are avoided and the following Figure 7 shows the 02 multiplication process in GF. From the above Figure 7, it is clear that 02 multiplications in GF can be implemented with only 3 XOR gates. Like mixcolum, invmixcolumn structure can also be modified to a structure that suits better for subpipelining architecture. Matrix used in the inverse mixcolumn is rewritten as shown in Equation (15). From the above Figure 7, it is clear that 02 multiplications in GF can be implemented with only 3 XOR gates. Like mixcolum, invmixcolumn structure can also be modified to a structure that suits better for subpipelining architecture. Matrix used in the inverse mixcolumn is rewritten as shown in Equation (15).
In the invmixcolumn, it is enough to find the multiplication operations of 02 and 04. 04 can multiplication can be computed by cascading 02 multiplication process or by the process shown in the following Figure 8.  + 02 04   01  01  01  01  01  01  01  01  01  01  01  01  01  01  01  01   + 04   01  00  01  00  00  01  00  01  01  00  01  00  00  01  00  01 In the invmixcolumn, it is enough to find the multiplication operations of 02 and 04. 04 can multiplication can be computed by cascading 02 multiplication process or by the process shown in the following Figure 8. From the above structure, 5 XOR gates are needed to implement 04 multiplications. From equation it is clear that it is possible to implement invmixcolumn by adding some extra logics with the mixcolumn structure. Therefore, the same architecture can be shared with mixcolumn and inverse mixcolumn operations.

Results and Discussion
The proposed fully subpipelined AES architecture for 128-bit key is implemented on Xilinx FPGA. For synthesis, timing, and power analysis, Xilinx ISE 5.1 is used. For each clock cycle, data blocks of 128 bits are accepted in the fully subpipelined encryptor. Next, a 128-bit output data is taken for each input block as the cipher text in an each clock cycle. A hardware description language (HDL) is used to design and implement the AES- From the above structure, 5 XOR gates are needed to implement 04 multiplications. From equation it is clear that it is possible to implement invmixcolumn by adding some extra logics with the mixcolumn structure. Therefore, the same architecture can be shared with mixcolumn and inverse mixcolumn operations.

Results and Discussion
The proposed fully subpipelined AES architecture for 128-bit key is implemented on Xilinx FPGA. For synthesis, timing, and power analysis, Xilinx ISE 5.1 is used. For each clock cycle, data blocks of 128 bits are accepted in the fully subpipelined encryptor. Next, a 128-bit output data is taken for each input block as the cipher text in an each clock cycle. A hardware description language (HDL) is used to design and implement the AES-MPPRM architecture. The designed AES-MPPRM architecture is used to process the 128 bit of cryptography.

Performance Analysis
The performance of the AES-MPPRM architecture is analyzed in five different FPGA devices such as Kintex 7, Spartan 3, Spartan 6, Virtex 5, and Virtex 6. The performances of the AES-MPPRM architecture are analyzed in terms of number of slice registers, flip flops, number of slice LUTs, number of logical elements, slices, bonded IOB, operating frequency, and delay. Moreover, the performance of the proposed AES-MPPRM architecture is analyzed with the AES-PPRM architecture. Tables 3-5 shows the hardware utilization analysis for the AES-MPPRM architecture of Kintex 7, Spartan 3, Spartan 6, Virtex 5, and Virtex 6 FPGA devices. Additionally, the frequency and delay analysis of the AES-MPPRM architecture is given in Table 6. Here, the results of the AES-MPPRM architecture are evaluated with the AES-PPRM architecture. The results of the AES-MPPRM architecture for the encryption and decryption processes are taken for the 128-bit of cryptography. The LUT, slices, and flip flops for the AES-MPPRM architecture designed in the Spartan 6 FPGA device are 197, 79, and 88, which are less when compared to the AES-PPRM architecture. From the performance analysis, it is concluded that the AES-MPPRM architecture achieves the better performance than the AES-PPRM architecture. The reason for the AES-PPRM architecture with higher hardware resource utilization is that it used high amount of AND and XOR gates during the Sub Bytes transformation. However, the AES-MPPRM architecture uses only less amount of AND and XOR gates in the Sub Bytes transformation. More specifically, the AES-MPPRM architecture designed in the Kintex 7, Spartan 3, Spartan 6, Virtex 5, and Virtex 6 FPGA devices consumes only less than or equal to the 70% of resources. Therefore, the proposed AES-MPPRM architecture consumes less hardware resources based on the MPPRM Sub Bytes transformation and its subpipelinining architecture.

Comparative Analysis
This section shows the comparative analysis of the AES-MPPRM architecture with existing AES architectures. There are five AES architectures, including LAES [47], AES-CTR [48], AES-CFA [49], AES-BSRD [50], and AES-EMCBE [51], which are used to justify the effectiveness of the AES-MPPRM architecture. The comparison between the AES-MPPRM with existing AES architectures are given as follows: The comparison of the AES-MPPRM architecture with LAES [47], AES-CTR [48], AES-CFA [49], AES-BSRD [50], and AES-EMCBE [51] is shown in Tables 7-11, respectively. Additionally, the LUT comparison graph among the existing AES architectures and AES-MPPRM is presented in Figure 9. From the analysis, it is concluded that the AES-MPPRM architecture outperforms well than the LAES [47], AES-CTR [48], AES-CFA [49], AES-BSRD [50], and AES-EMCBE [51]. For example, the number of slice LUT is 197 for the AES-MPPRM with Spartan 6 FPGA, and it is even less when compared to the AES-BSRD [50]. The area of AES-BSRD [50] is increased due to its replication of linear operation during the encryption/decryption process. The hardware utilization of the AES-MPPRM architecture is reduced mainly due to the fact that it doesn't depend on the LUT while designing the SubBytes and InvSubBytes stages. Here, the combinational logical circuit is used to design the SubBytes and InvSubBytes transformation. The MPPRM structure is used to minimize the hardware utilization during the encryption and decryption processes. Here, the combination of the subpipelining structure with MPPRM structure is used to minimize the delay.

Case Study
Cryptographic Process on X-ray Image of a Disjoint in Human Shoulder This AES-MPPRM architecture is implemented for the secure transmission of the human shoulder X-ray image. This secure transmission happens as an image encryption and decryption using modified AES algorithm.
In the image encryption, pixel value of the X-ray image which is to be transmitted securely is obtained with the MATLAB 6.5 programming software. These pixel values act as one of the input data to the proposed AES algorithm. The input data can be from 0 to

Case Study Cryptographic Process on X-ray Image of a Disjoint in Human Shoulder
This AES-MPPRM architecture is implemented for the secure transmission of the human shoulder X-ray image. This secure transmission happens as an image encryption and decryption using modified AES algorithm.
In the image encryption, pixel value of the X-ray image which is to be transmitted securely is obtained with the MATLAB 6.5 programming software. These pixel values act as one of the input data to the proposed AES algorithm. The input data can be from 0 to 255 (i.e., Pixel value) and the input pixel values are {27, 25, 26, 28, 27, . . . c, f f }. Figure 10 shows the input and histogram image, where the input pixel values are converted into binary form using the command of dec2bin. Subsequently, these binary data are given as the input and it is encrypted using the AES-MPPRM architecture. The output is taken in the form of hexadecimal values which is in 8 pher text values which can be transmitted across the world securely. With this cipher text value, unauthorized persons cannot obtain the original information without knowledge of the secret key. The encrypted image and its histogram are shown in the Figure 11. At the receiving end, obtained cipher text values are once again converted back into pixel values of the original images by the using the same proposed architecture. Then this same process is used for all the pixel values of the image. Then these pixel values are used by MATLAB to produce the original image at the receiver end. Decrypted image is obtained with the MATLAB by the process of converting encrypted pixel values to the images. From the encrypted results and histogram images, it is clear that the AES-MPPRM architecture achieves higher security.

Conclusions
In this paper, the SubBytes/InvSubBytes transformation of the AES architecture was designed using the MPPRM structure. Subsequently, the designed SubBytes/InvSubBytes transformation was used to develop an effective subpipelining architecture. This same subpipeline architecture can be implemented for other versions of the AES such as AES-192 and AES-256 algorithms, by making slight modifications in the key expansion unit. pher text values which can be transmitted across the world securely. With this cipher text value, unauthorized persons cannot obtain the original information without knowledge of the secret key. The encrypted image and its histogram are shown in the Figure 11. At the receiving end, obtained cipher text values are once again converted back into pixel values of the original images by the using the same proposed architecture. Then this same process is used for all the pixel values of the image. Then these pixel values are used by MATLAB to produce the original image at the receiver end. Decrypted image is obtained with the MATLAB by the process of converting encrypted pixel values to the images. From the encrypted results and histogram images, it is clear that the AES-MPPRM architecture achieves higher security.

Conclusions
In this paper, the SubBytes/InvSubBytes transformation of the AES architecture was designed using the MPPRM structure. Subsequently, the designed SubBytes/InvSubBytes transformation was used to develop an effective subpipelining architecture. This same subpipeline architecture can be implemented for other versions of the AES such as AES-192 and AES-256 algorithms, by making slight modifications in the key expansion unit. Decrypted image is obtained with the MATLAB by the process of converting encrypted pixel values to the images. From the encrypted results and histogram images, it is clear that the AES-MPPRM architecture achieves higher security.

Conclusions
In this paper, the SubBytes/InvSubBytes transformation of the AES architecture was designed using the MPPRM structure. Subsequently, the designed SubBytes/InvSubBytes transformation was used to develop an effective subpipelining architecture. This same subpipeline architecture can be implemented for other versions of the AES such as AES-192 and AES-256 algorithms, by making slight modifications in the key expansion unit. Hence, the MPPRM architecture used in the AES helped to reduce the amount of hardware resources. Moreover, the delay of the overall circuit was minimized using the integration of subpipelining and MPPRM structure. Accordingly, the lesser delay leads to the higher operating frequency for the AES-MPPRM architecture. The LUT of the AES-MPPRM architecture designed in the Spartan 6 is 197, which is much less compared to the AES-BSRD. In future, the modified key expansion architecture can be developed for minimizing the time consumption during the key generation. It can also be extended by employing the cryptanalysis process to check the security of the algorithm by introducing some attacks (such as a side channel attack).
Author Contributions: The paper investigation, resources, data curation, writing-original draft preparation, writing-review and editing, and visualization were undertaken by T.M.K. and K.S.R. The paper conceptualization, software, validation, formal analysis, methodology were done by K.A. Supervision, project administration, and final approval of the version to be published were conducted by S.R. and B.D.P. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.