Design and Implementation of an AES Hardware Encryption Core

Kumar, Aayush; Mehfuz, Shabana; Urooj, Shabana

doi:10.3390/sym18060897


by Aayush Kumar ¹, Shabana Mehfuz ¹ and Shabana Urooj ²,*

¹ Department of Electrical Engineering, Jamia Millia Islamia (A Central University), New Delhi 110001, India
² Department of Electrical Engineering, College of Engineering, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
* Author to whom correspondence should be addressed.

Symmetry 2026, 18(6), 897; https://doi.org/10.3390/sym18060897

Submission received: 3 April 2026 / Revised: 9 May 2026 / Accepted: 15 May 2026 / Published: 25 May 2026

(This article belongs to the Special Issue Applications Based on Symmetry in Cryptography and Information Security)


Abstract

The Advanced Encryption Standard with a 128-bit key (AES-128) is widely used in high-speed secure communication systems, which require efficient ASIC-oriented hardware implementations. This work presents an AES-128 hardware implementation in SystemVerilog, verified for functional correctness and coverage in an environment based on the Universal Verification Methodology (UVM). Post-synthesis evaluation using Cadence Genus provides realistic timing, area, and power metrics. Pipelining techniques were employed to reduce the critical path delay and raise the maximum operating frequency, while maintaining competitive area utilisation and power efficiency. The proposed architecture demonstrates improved timing efficiency compared with representative FPGA-based implementations. The resulting fully pipelined, ASIC-oriented AES-128 core achieves a combinational critical path delay of 300 ps and an operating frequency of 1.39 GHz.

Keywords: AES-128; ASIC design; hardware security; pipelined architecture; UVM verification

1. Introduction

The explosive growth of digital multimedia and communication networks has led to unprecedented demand for secure, high-speed data transmission. Applications such as high-definition streaming, real-time communication, and IoT sensing generate large volumes of encrypted data. In these scenarios, any encryption-related bottleneck can severely degrade system performance or user experience. To meet this demand, hardware-based cryptographic modules have become indispensable, offering orders-of-magnitude improvements in throughput and energy efficiency over software-only solutions. Hardware accelerators perform encryption and decryption without creating bottlenecks in data pipelines, making them especially attractive for high-bandwidth networks and resource-constrained IoT devices [1,2].

Among symmetric-key encryption algorithms, the Advanced Encryption Standard (AES), standardised by the National Institute of Standards and Technology (NIST), has emerged as the de facto cipher for securing digital information [3]. In particular, the 128-bit variant (AES-128) is widely used in practice due to its strong security properties and relatively efficient design [4]. AES-128 processes 128-bit data blocks through ten rounds of substitution and permutation operations. Its well-defined transformations, namely byte-wise substitution (SubBytes), row shifts (ShiftRows), and column mixing (MixColumns), are particularly amenable to parallel and pipelined hardware implementation. Consequently, AES-128 is embedded in numerous communication and security standards (e.g., TLS/SSL, IPsec, and wireless protocols), underlining the importance of high-performance AES cores.

However, combining low latency with high throughput in an ASIC implementation of AES-128 remains challenging. A straightforward AES round combines several nonlinear and linear transformations in sequence, resulting in a long critical path that limits the maximum clock frequency. Techniques such as loop unrolling and deep pipelining can significantly increase throughput by breaking the computation into parallel stages, but these approaches often incur considerable area and power overhead due to additional registers and duplicated logic. Conversely, compact designs that minimise hardware resources generally cannot achieve comparable clock speeds. As a result, AES-128 hardware designers face an inherent trade-off between performance (latency and throughput) and implementation cost (area and power).

Unlike conventional AES hardware implementations that rely primarily on coarse inter-round pipelining or full loop unrolling, the proposed architecture introduces a balanced intra-round and inter-round pipelining strategy that specifically targets the MixColumns operation, the dominant contributor to the critical path. Pipeline registers are inserted not only between AES rounds but also within critical transformation blocks, particularly around MixColumns.

This fine-grained pipeline placement significantly reduces the combinational delay without introducing redundant logic or excessive register overhead.

Furthermore, the design ensures proper synchronisation between the datapath and the key expansion pipeline, thereby avoiding the round-key misalignment issues commonly observed in deeply pipelined AES architectures. This balanced strategy enables a high operating frequency while maintaining efficient area and power utilisation.

Post-synthesis results indicate that the proposed AES-128 core operates at a clock frequency of 1.39 GHz in a 65 nm standard-cell CMOS technology, demonstrating suitability for high-speed secure communication systems [5,6]. The architecture is fully pipelined, with registers strategically placed between the SubBytes, MixColumns, and AddRoundKey transformations to reduce the critical path delay [5]. This design achieves a per-round encryption latency on the order of a few hundred picoseconds in a 65 nm process. A Universal Verification Methodology (UVM) testbench is developed to provide comprehensive functional verification and coverage analysis. The design is synthesised using the Cadence Genus (2019.2) tool suite to obtain realistic timing, area, and power estimates. Experimental results demonstrate that the AES-128 core reaches this operating frequency while maintaining a compact area footprint, highlighting its suitability for secure, high-speed communication and resource-constrained embedded systems [6].

In summary, the main contributions of this work are:

A focused pipeline placement methodology is proposed that combines intra-round and inter-round pipelining, with particular optimisation of the MixColumns transformation, which is recognised as the largest contributor to the critical path in AES datapaths.
The datapath and the key expansion module are kept synchronised, which prevents round-key misalignment in heavily pipelined designs.
A fully modular SystemVerilog RTL design of a 128-bit AES encryption core is created for scalable ASIC design and hardware reuse.
A UVM-based verification environment is developed to verify the correctness of the pipeline, key expansion, and data processing for each round.
An ASIC-oriented design flow with the Cadence Genus synthesis tool is adopted to obtain realistic post-synthesis timing, area, and power metrics, with a maximum clock frequency of 1.39 GHz and a critical path delay of around 300 ps.
A performance comparison with existing FPGA- and ASIC-based AES designs shows better timing efficiency and suitability for high-speed secure communication applications.

All frequency values are expressed in hertz-based SI units, and all delay values are expressed in seconds.

The rest of this paper is structured as follows. Section 2 reviews the related literature. Section 3 describes the design and implementation of the proposed AES-128 architecture, its pipelining approach, and its verification plan. Section 4 reports the post-synthesis evaluation outcomes in terms of timing, area, and power, and compares them with previous work. Finally, Section 5 concludes the paper and provides suggestions for future work.

2. Literature Review

Recent AES hardware research can be broadly classified into three categories:

(i) high-throughput FPGA implementations,
(ii) compact low-power ASIC cores, and
(iii) reliability-oriented secure architectures.

However, only a limited number of studies address balanced pipelining strategies targeting critical path reduction in ASIC flows.

Several studies have prioritised minimising area and power for resource-constrained applications [7,8]. Iterative AES architectures and reduced-width datapaths (for instance, 8-bit or byte-serial designs) have been proposed to lower logic usage and energy consumption. These compact implementations, while saving silicon area, typically achieve only moderate throughput and often omit a complete functional verification stage [9]. In parallel, reliability and testability enhancements have been explored: for example, hybrid pipelining with built-in error detection and on-chip BIST mechanisms has been integrated to improve robustness [10]. Such designs focus on fault tolerance but frequently introduce overhead and still do not provide a unified ASIC synthesis or PPA evaluation.

More recent efforts emphasise ASIC-oriented evaluation [8]. A few works have employed complete ASIC design flows (e.g., synthesising with Cadence Genus) and adopted industry-standard verification methodologies like the Universal Verification Methodology (UVM). Nevertheless, most AES hardware publications continue to focus on either FPGA prototyping or isolated aspects of design. For instance, an ASIC BIST-enabled AES processor might report some timing metrics but omit power and area figures, whereas a high-speed FPGA implementation might attain impressive clock rates but lack any ASIC post-synthesis data. In general, very few existing works combine a full RTL description of AES-128, UVM-based validation, and post-synthesis ASIC PPA evaluation in a single study.

Although prior works report high throughput or compact area, most designs either rely on FPGA-centric evaluation or do not present detailed post-synthesis timing closure with balanced pipeline placement. Therefore, a gap exists in ASIC-oriented AES architectures that simultaneously address timing, verification, and scalable RTL modularity.

Although AES is the focus of most contemporary work, other approaches have also been examined in the literature, including multimedia and chaos-based encryption methods that aim to secure image and video data. For instance, "Re-cropping Framework: A Grid Recovery Method for Quantisation Step Estimation in Non-aligned Recompressed Images" [11] provides an effective approach for image integrity analysis and quantisation estimation. Similarly, the multi-layer and multi-directional image encryption algorithm based on the hyperchaotic 3D Xin-She Yang map [12] introduces a robust encryption mechanism using chaotic systems and multi-dimensional transformations. The EAS framework employs neuron-inspired chaotic dynamics for multimedia video encryption, improving randomness and key sensitivity, but without ASIC-oriented timing or hardware implementation analysis.

In terms of performance evaluation, chaos-based and multimedia encryption techniques generally provide high security owing to complex nonlinear dynamics, high key sensitivity, and large key spaces. However, these methods involve significantly higher computational complexity and are not well suited to efficient hardware implementation. In contrast, AES-based architectures rely on well-defined substitution–permutation operations that map efficiently onto digital hardware. As a result, AES offers a balanced trade-off between security, computational efficiency, and hardware feasibility, making it more suitable for ASIC-based high-throughput applications. The present work optimises AES for high-frequency ASIC implementation, prioritising timing performance and efficiency.

Table 1 summarises representative AES-128 hardware implementations from the recent literature, indicating their platform, focus, key techniques, and limitations. It highlights that while many designs excel in a particular metric (such as throughput or area efficiency), they often lack comprehensive ASIC evaluation or exhaustive verification. To the best of our knowledge, no prior work presents a complete end-to-end ASIC-ready AES-128 implementation with detailed RTL design, rigorous UVM verification, and full post-synthesis PPA analysis. This gap motivates the need for an integrated ASIC-oriented design flow, which the present study addresses.

As shown in Table 2, for a broader comparison with modern multimedia and chaos-based encryption frameworks, the proposed AES-128 architecture is evaluated on several aspects, including the structure of the encryption, computational complexity, feasibility of hardware implementation, optimisation of computation time, and the possibility of implementing the architecture in an application-specific integrated circuit (ASIC).

The comparison indicates that multimedia-oriented chaotic encryption methods prioritise statistical security metrics, whereas the proposed AES-128 architecture focuses on deterministic hardware optimisation, timing closure, and synthesisable high-throughput ASIC implementation.

EAS, 3D-NDHC, and MLMD-IE are recent chaos-based and multimedia encryption schemes that mainly enhance statistical security through hyperchaotic diffusion, multidimensional permutation, and nonlinear sequence generation. The MLMD-IE and Re-cropping frameworks focus on multidimensional diffusion security and forensic image analysis, respectively, but do not address high-frequency pipelined ASIC-oriented cryptographic hardware implementation. These techniques are usually assessed using entropy, NPCR, UACI, correlation, and Lyapunov exponent analysis. Most of these schemes, however, are software-based and do not report ASIC-oriented parameters such as timing closure, post-synthesis delay, pipeline feasibility, or hardware resource usage. Conversely, the proposed AES-128 architecture is designed for ASIC implementation, with deterministic substitution–permutation operations and balanced pipelining for high throughput. The proposed work also includes verification using UVM and post-synthesis evaluation using Cadence Genus, which allows direct analysis of timing performance and hardware feasibility.

3. AES-128 Architecture and Implementation

3.1. Overview of AES Cryptographic Algorithm

AES-128 is a symmetric-key block cipher operating on 128-bit data blocks and 128-bit keys [10]. It follows a fixed 4 × 4 byte-state substitution–permutation network (SPN).

The AES state is represented as a 4 × 4 matrix of bytes (16 bytes total), arranged in four columns and four rows. Under the FIPS-197 standard, a 128-bit key requires 10 rounds [10]. Prior to encryption, a key schedule expands the cipher key into 11 separate 128-bit round keys (one for the initial round and one for each of the 10 rounds). Each round applies a sequence of byte-wise and word-wise operations to introduce confusion and diffusion in the state. In particular, every round (except the last) consists of four transformations executed in order [13].

Figure 1 illustrates the overall AES-128 encryption process along with the 4 × 4 state matrix representation.

SubBytes: SubBytes performs nonlinear byte substitution using a fixed GF(2⁸)-based S-box to introduce confusion in the AES state. The AES standard defines this S-box as having no fixed points and providing strong nonlinearity [13].

ShiftRows: A byte-wise permutation that cyclically rotates the last three rows of the 4 × 4 state by offsets of 1, 2, and 3 bytes (the first row is unchanged). This step spreads byte differences across columns [13].

MixColumns: A linear mixing operation on each column: the four bytes of a column are treated as a polynomial over GF(2⁸) and multiplied by a fixed MDS matrix. Each input byte thus influences all four output bytes, providing complete diffusion within the column.

AddRoundKey: A simple bitwise XOR of the state with the round’s 128-bit subkey. This injects key material each round.

These four transformations are applied in each of the first nine rounds; the tenth (final) round omits the MixColumns step. In other words, the cipher begins by XOR-ing the plaintext with the initial round key, then performs nine iterations of (SubBytes, ShiftRows, MixColumns, AddRoundKey), and concludes with a final SubBytes, ShiftRows, and AddRoundKey sequence. This well-defined loop structure (a “crypto-permutation” of S-boxes, shifts, linear mixing, and key additions) ensures high security: the S-box provides nonlinearity (confusion), while ShiftRows and MixColumns together provide diffusion across bytes.


3.2. Pipelined AES Hardware Architecture

To accelerate AES in dedicated hardware (ASIC or FPGA), designers typically employ pipelined implementations that unroll the round loop and insert registers between stages. A fully unrolled AES-128 core can be arranged as a deep pipeline of stages, enabling a new plaintext block to enter each cycle [14,15]. In this design, once the pipeline is full, and assuming continuous input data and no pipeline stalls, the architecture produces one 128-bit ciphertext block per clock cycle at the output stage, while intermediate blocks simultaneously occupy different pipeline stages [14]. The resulting peak throughput is given by

$$T = 128 \times f_{\mathrm{clk}} \tag{1}$$

where $T$ denotes the throughput in bits per second and $f_{\mathrm{clk}}$ represents the clock frequency in hertz. At $f_{\mathrm{clk}}$ = 1.39 GHz, the achievable throughput is approximately 178 Gbps [12]. In practice, the maximum achievable clock frequency is strongly dependent on the target technology, placement and routing constraints, and the depth of intra-round pipelining, and may be lower than the nominal value assumed in this example [14,15].

It should be noted that (1) represents the theoretical peak throughput, assuming a fully unrolled and fully inter-round-pipelined architecture with no pipeline stalls and continuous data availability.
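As a quick numerical check of Equation (1), the following Python sketch evaluates the peak throughput at the 1.39 GHz clock frequency quoted in the text. The function name `peak_throughput_bps` is illustrative and not part of the original design; the calculation assumes that one 128-bit block is accepted every clock cycle with no stalls.

```python
def peak_throughput_bps(f_clk_hz, block_bits=128):
    """Peak throughput (bits/s) of a pipeline that accepts one block per cycle."""
    return block_bits * f_clk_hz


f_clk = 1.39e9  # clock frequency quoted in the text, in hertz
throughput = peak_throughput_bps(f_clk)
print(f"{throughput / 1e9:.2f} Gbps")  # prints "177.92 Gbps", i.e. about 178 Gbps
```

Any stall or gap in the input stream lowers the sustained rate below this peak value.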

Inter-round (round-level) pipelining: The entire 10-round sequence is unrolled and partitioned into pipeline stages by inserting registers between rounds (or groups of rounds). This allows successive blocks to be processed in parallel across different rounds. For example, a fully inter-round-pipelined AES yields one block per cycle after the pipeline fill, dramatically increasing throughput.

Intra-round (transformation-level) pipelining: Each round’s internal transformations can themselves be partitioned. For instance, the S-box computation, row shifts, and MixColumns logic may be split into sub-stages with registers in between. By pipelining within each round, the critical path is reduced to that of the slowest sub-transformation. In FPGAs or ASICs, this often means pipelining the SubBytes, ShiftRows, and MixColumns units so that no single stage is excessively long. Together with inter-round pipelining, this approach balances the logic and can allow much higher clock frequencies [16].

While pipelining greatly raises throughput, it incurs performance trade-offs. The area overhead of the extra registers is typically small relative to the total logic, but each pipeline stage adds latency: the encryption of a single block spans as many clock cycles as there are pipeline stages, so end-to-end latency is high. As one study notes, “heavily pipelined configurations will have extremely long latencies when compared to the base iterative version of AES-128” [14]. Consequently, deeply pipelined architectures are best suited to streaming or bulk-data encryption scenarios rather than latency-critical applications. Designers must balance these factors: a fully pipelined design yields one block per cycle but may require about 10 cycles of delay, whereas a non-pipelined loop has lower latency but much lower block throughput.

In hardware cryptographic implementations, pipelining is favoured because modern ASICs and FPGAs have abundant registers and the target applications demand extreme data rates (often multi-Gbps). Empirically, pipelined AES cores on FPGA/ASIC achieve tens of gigabits per second of throughput by unrolling the rounds and inserting registers. The choice of pipelining depth therefore represents a design trade-off among throughput, latency, and area, and practical implementations must balance these parameters based on application requirements and target hardware constraints. In the proposed core, the overall AES encryption datapath is organised as a fully pipelined architecture.
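The throughput and latency trade-off can be illustrated with a simple cycle-count model. The Python sketch below is an illustration under stated assumptions, not a model of the proposed core: it assumes a pipeline depth of 10 stages (matching the roughly 10 cycles of delay mentioned in the text) and, for comparison, a hypothetical iterative core that needs 10 cycles per block and cannot overlap blocks. All function names are illustrative.

```python
def pipelined_cycles(num_blocks, depth):
    """Cycles to finish num_blocks back-to-back blocks in a depth-stage pipeline.

    The first block needs `depth` cycles to fill the pipeline; every further
    block completes one cycle after its predecessor.
    """
    if num_blocks <= 0:
        return 0
    return depth + (num_blocks - 1)


def iterative_cycles(num_blocks, cycles_per_block):
    """Cycles for a non-overlapping iterative core (hypothetical baseline)."""
    return max(num_blocks, 0) * cycles_per_block


DEPTH = 10             # assumed pipeline depth
ROUNDS_PER_BLOCK = 10  # assumed cycles per block for the iterative baseline

for n in (1, 10, 1000):
    print(n, pipelined_cycles(n, DEPTH), iterative_cycles(n, ROUNDS_PER_BLOCK))
```

In this model both cores take the same number of cycles for a single block, while for long streams the pipelined core approaches one block per cycle. The model ignores the difference in achievable clock period between the two styles.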

Figure 2 shows the block-level architecture of the fully pipelined AES-128 encryption core, illustrating the sequential flow of SubBytes, ShiftRows, MixColumns, and AddRoundKey operations across multiple rounds, with pipeline registers inserted to reduce the critical path.

3.3. Algorithmic Design and RTL Architecture of the Proposed AES-128 Core

3.3.1. AES Main Encryption Flow

AES-128’s key expansion process derives eleven 128-bit round keys from the cipher key using RotWord, SubWord, and Rcon functions. The first four words are derived directly from the cipher key, with the remaining words being recursively computed using XOR-based transformations. The round keys are fed to each pipeline stage to ensure correctness in key alignment during the encryption process.

The overall functional flow of the proposed AES-128 encryption core is summarised in Algorithm 1.

Algorithm 1: AES_Main Encryption Process

Input: 128-bit plaintext P, 128-bit cipher key K
Output: 128-bit ciphertext C

1. Generate round keys K₀, K₁, …, K₁₀ using the AES-128 key expansion

2. S ← AddRoundKey(P, K₀)

3. For i = 1 to 9 do
S ← AES_Round(S, K_i)

4. C ← AES_FinalRound(S, K₁₀)
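Algorithm 1 can be exercised in software with a compact reference model. The following Python sketch is a plain behavioural model, not the authors' SystemVerilog RTL or golden model, and all function names are illustrative. It assumes the FIPS-197 byte ordering (state byte n sits in row n mod 4 and column n div 4) and derives the S-box from the GF(2⁸) inverse and affine map instead of storing a table. The final assertion uses the AES-128 example vector from FIPS-197 Appendix C.1.

```python
def xtime(a):
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def build_sbox():
    """S-box = affine transform of the multiplicative inverse (0 maps to 0)."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gmul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Return the 11 round keys as lists of 16 bytes each."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]  # SubWord(RotWord(.))
            t[0] ^= rcon
            rcon = xtime(rcon)
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def aes128_encrypt_block(plaintext, key):
    rk = expand_key(key)                         # step 1
    s = add_round_key(list(plaintext), rk[0])    # step 2
    for i in range(1, 10):                       # step 3: rounds 1..9
        s = add_round_key(mix_columns(shift_rows(sub_bytes(s))), rk[i])
    s = add_round_key(shift_rows(sub_bytes(s)), rk[10])  # step 4: final round
    return bytes(s)


pt = bytes.fromhex("00112233445566778899aabbccddeeff")
key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
assert aes128_encrypt_block(pt, key).hex() == "69c4e0d86a7b0430d8cdb78070b4c55a"
```

A model of this kind computes one block at a time and says nothing about pipeline timing; it is only useful for checking functional results.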

3.3.2. Key Expansion Architecture

AES-128 employs a 128-bit cipher key, which is expanded into 44 words, each of 32 bits (4 bytes), forming 11 round keys of 128 bits each [17]. The expansion works as follows:

The first four words come directly from the original key.

For each new word W[i] where i ≥ 4:

$$W[i] = \begin{cases} W[i-4] \oplus \mathrm{SubWord}(\mathrm{RotWord}(W[i-1])) \oplus \mathrm{Rcon}[i/4], & \text{if } i \bmod 4 = 0 \\ W[i-4] \oplus W[i-1], & \text{otherwise} \end{cases} \tag{2}$$

After all 44 words are generated, every four consecutive words form a complete 128-bit round key. In hardware, this logic is typically implemented within a dedicated key expansion module integrating SubWord, RotWord, and Rcon operations [18]. Designers may choose to compute all keys ahead of time or generate keys on the fly each round [19]. Since the operations are simple XOR and byte transformations, the module is compact and does not significantly impact timing. Verification typically involves comparing the generated keys with a reference software model. Figure 3 illustrates the hardware architecture of the AES-128 key expansion module. The RotWord and SubWord transformations provide nonlinear key updates, while the injection of Rcon provides variation that depends on the rounds. The generated keys are then passed to the encryption datapath, which is pipelined.

The generation of round keys from the original cipher key follows the recursive AES key scheduling procedure, which is described in Algorithm 2.

For the case where i mod 4 = 0, the word is computed as:

$$W[i] = W[i-4] \oplus \mathrm{SubWord}(\mathrm{RotWord}(W[i-1])) \oplus \mathrm{Rcon}[i/4] \tag{3}$$

Algorithm 2 is implemented as a dedicated hardware key expansion module that supplies round keys to the encryption datapath either precomputed or on-the-fly.

Algorithm 2: AES-128 Key expansion

Input: 128-bit cipher key K
Output: Round keys K₀, K₁, …, K₁₀ (each 128-bit)

1. Initialise words w₀, w₁, w₂, w₃ from cipher key K

2. For i = 4 to 43 do
If (i mod 4 = 0) then
w_i ← w_(i−4) ⊕ SubWord(RotWord(w_(i−1))) ⊕ Rcon[i/4]
Else
w_i ← w_(i−4) ⊕ w_(i−1)

3. Group every four consecutive words to form round keys K₀ to K₁₀

4. Return all round keys
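The recurrence in Algorithm 2 can be sketched in Python on 32-bit words. This is an illustrative software model, not the hardware key expansion module; the function names are assumptions, the S-box is derived arithmetically for self-containment, and the round constants are generated by repeated doubling in GF(2⁸). The check values are taken from the key expansion example in FIPS-197 Appendix A.1.

```python
def xtime(a):
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gmul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r


def build_sbox():
    """Affine transform of the multiplicative inverse, as in FIPS-197."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gmul(inv, x)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def sub_word(w):
    """Apply the S-box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def expand_key_words(key):
    """Return the 44 key-schedule words w[0..43] as 32-bit integers."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = sub_word(rot_word(t)) ^ (rcon << 24)  # Rcon in the top byte
            rcon = xtime(rcon)
        w.append(w[i - 4] ^ t)
    return w


words = expand_key_words(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(" ".join(f"{x:08x}" for x in words[4:8]))  # round key K1
```

Round key K_r is the concatenation of words w[4r] to w[4r+3].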

3.3.3. SubBytes

The SubBytes phase is responsible for the single nonlinear step in AES. In this phase, each byte of the 4 × 4 state matrix is substituted by another byte according to a predefined 256-element substitution table (S-box). SubBytes creates substantial confusion and makes sure that the result is a complex and nonlinear function of both the input and the key [20].

The SubBytes stage is commonly implemented as a set of 16 parallel S-boxes in hardware, enabling processing of all 16 bytes of the 128-bit state in a single clock cycle [21]. An S-box can be implemented as either (i) a small ROM-based lookup table or (ii) a combinational logic network built upon GF(2⁸) operations [22].

In this work, a ROM-based parallel architecture is adopted, in which 16 parallel S-boxes substitute all bytes of the state in one clock cycle. This approach was chosen for its simplicity and regularity, as well as its high speed and suitability for pipelined designs [23]. Compared with a combinational logic implementation, a ROM-based solution is faster and simpler to integrate into a design [24], which is important when implementing AES in an ASIC.

As SubBytes consists only of combinational logic, it is easy to place in a pipelined architecture without increasing latency [25]. Parallelism enables high throughput with only moderate area overhead, offering a good compromise between the two. For instance, the input byte 0x19 is replaced with the corresponding S-box value 0xD4. This substitution is performed independently for all 16 bytes of the state before the ShiftRows phase starts. Figure 4 illustrates the SubBytes transformation, in which each byte of the AES state is replaced with a value from a nonlinear S-box, introducing confusion in the encryption process.

Algorithm 3 shows how SubBytes replaces each byte of the AES state using a nonlinear S-box lookup. The transformation increases security by adding confusion and allows parallel hardware execution.

Algorithm 3: AES SubBytes Transformation

Input: 128-bit state array S
Output: Substituted state array S′

1. For i = 0 to 15 do
b_i′ ← Sbox(b_i)

2. Combine all substituted bytes to form the new state:
S′ = {b0′, b1′, b2′, …, b15′}

3. Return S′.
End
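The two S-box options mentioned above (lookup table versus GF(2⁸) logic) are related: the table contents can be generated from the field arithmetic. The Python sketch below, with illustrative function names, builds the 256-entry table from the multiplicative inverse followed by the affine map defined in FIPS-197, then applies it to a 16-byte state as in Algorithm 3. It is a software illustration only and makes no claim about how the ROM in the proposed design was generated.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(x):
    """Multiplicative inverse in GF(2^8) via x^254; 0 is mapped to 0."""
    if x == 0:
        return 0
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    return inv


def affine(b):
    """AES affine map: b ^ rotl(b,1) ^ rotl(b,2) ^ rotl(b,3) ^ rotl(b,4) ^ 0x63."""
    s = b
    for shift in range(1, 5):
        s ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
    return s ^ 0x63


SBOX = [affine(gf_inv(x)) for x in range(256)]  # contents of one S-box ROM


def sub_bytes(state):
    """Substitute all 16 state bytes; each lookup is independent of the others."""
    return [SBOX[b] for b in state]


print(hex(SBOX[0x19]))  # 0xd4, the example value given in the text
```

Because each of the 16 lookups is independent, the list comprehension in `sub_bytes` corresponds to 16 S-box instances operating in parallel in hardware.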

3.3.4. ShiftRows

The ShiftRows transformation enhances the diffusion of AES by cyclically permuting the rows of the state matrix [26]. Viewing the state as a 4 × 4 byte array, the first row is left unchanged, and the remaining rows are rotated to the left by one, two, and three byte positions, respectively. This operation does not modify the bit patterns of the bytes; it only changes their positions in the state.

ShiftRows redistributes bytes across columns, which allows the subsequent MixColumns step to combine data that originated in different columns. This interaction propagates input differences to the entire state over several rounds, strengthening the diffusion of the cipher [27].

From a hardware perspective, ShiftRows is inexpensive: because it only repositions bytes, it can be realised with fixed wiring, without arithmetic logic or memory access [24]. The module is therefore entirely combinational and adds negligible delay.

In a pipelined AES implementation, ShiftRows is generally placed between the SubBytes and MixColumns steps and does not need a dedicated pipeline register [28]. Registers are instead placed around MixColumns or at round boundaries to help satisfy timing constraints, and ShiftRows remains off the critical path. This enables the entire AES round, including SubBytes, ShiftRows, MixColumns, and AddRoundKey, to be performed efficiently in only one clock cycle. Figure 5 highlights that ShiftRows is purely a permutation-based operation, which enables an efficient hardware implementation using fixed wiring without additional logic overhead.

Algorithm 4 shows that ShiftRows cyclically shifts the rows of the AES state to provide diffusion between columns.

The operation only reorders byte positions, making it lightweight and efficient for hardware implementation.

Algorithm 4: AES ShiftRows Transformation

Input: 128-bit state array S
Output: Row-shifted state array S′

1. Arrange the input state S into a 4 × 4 byte matrix.

2. Perform cyclic left shifts on each row:
  Row 0 → no shift
  Row 1 → shift left by 1 byte
  Row 2 → shift left by 2 bytes
  Row 3 → shift left by 3 bytes

3. Rearrange the shifted bytes to form the new state S′.

4. Return S′.
End
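Since ShiftRows is a fixed permutation, it reduces to a constant index table, which is the software analogue of fixed wiring. The Python sketch below assumes the FIPS-197 byte ordering, in which byte n of the 16-byte state occupies row n mod 4 and column n div 4; the names `SHIFT_ROWS_PERM` and `shift_rows` are illustrative.

```python
# Output byte at (row r, column c) is taken from input (row r, column (c + r) mod 4).
# With column-major storage, position (r, c) has flat index 4*c + r.
SHIFT_ROWS_PERM = [4 * ((c + r) % 4) + r for c in range(4) for r in range(4)]


def shift_rows(state):
    """Reorder the 16 state bytes; no byte value is modified."""
    return [state[i] for i in SHIFT_ROWS_PERM]


state = list(range(16))   # byte n holds the value n, to make the moves visible
print(shift_rows(state))  # [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]
```

Row 0 (indices 0, 4, 8, 12) stays in place, while the other rows rotate left by 1, 2, and 3 positions. Applying the permutation four times returns the original state.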

3.3.5. MixColumns

The MixColumns transformation is the primary linear diffusion step of the AES round, combining the state bytes at the bit level using finite-field arithmetic over GF(2⁸). Each column of the 4 × 4 byte matrix is processed individually, so that every output byte depends on all four input bytes of the same column [29].

Combined with the ShiftRows permutation, MixColumns ensures that data dependencies spread quickly throughout the entire state over multiple rounds. Small changes in the input are thereby amplified, which enhances the diffusion property of the cipher and produces the avalanche effect on which the security of the encryption relies [25].

From a hardware perspective, MixColumns is less hardware-friendly than the other AES transformations because of its arithmetic complexity. The required constant multiplications are usually implemented with XOR networks and conditional shifts, which add significant combinational delay. Consequently, MixColumns can easily become a significant part of the critical path of the AES datapath [27].

Figure 6 illustrates the column-wise diffusion process of the MixColumns transformation, highlighting how input bytes are linearly combined using Galois Field arithmetic.

The column-wise linear diffusion operation of AES is defined by the MixColumns transformation, which is described in Algorithm 5.

Algorithm 5: MixColumns Transformation

Input: 128-bit state S
Output: 128-bit transformed state S′

1. Divide S into four 32-bit columns C₀, C₁, C₂, C₃

2. For each column C_i do
Compute Ci′ using GF(2⁸) multiplications by constants {02} and {03}

3. Concatenate C₀′, C₁′, C₂′, C₃′ to form S′

4. Return S′

Algorithm 5 highlights the computational complexity of the MixColumns operation, which motivates the insertion of pipeline registers around this block to reduce the critical path delay.
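The constant multiplications in Algorithm 5 can be written with the shift-and-conditional-XOR primitive usually called xtime. The Python sketch below is a software illustration with illustrative function names, not the RTL of the proposed core: multiplication by {02} is one xtime, and multiplication by {03} is xtime followed by an XOR with the original byte. The test column db 13 53 45 is a commonly used MixColumns check vector.

```python
def xtime(a):
    """Multiply by {02}: shift left, then reduce by 0x11B if bit 8 was set."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def mul3(a):
    """Multiply by {03} = {02} + {01}."""
    return xtime(a) ^ a


def mix_column(col):
    """Multiply one 4-byte column by the fixed matrix with rows 02 03 01 01 (rotated)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte state."""
    out = []
    for c in range(4):
        out += mix_column(state[4 * c:4 * c + 4])
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Every output byte is an XOR of several terms, some of which pass through xtime first, which gives a sense of why this block has a deeper logic cone than ShiftRows or AddRoundKey.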

3.4. Data Flow and Module Integration

The encryption process begins with an initial AddRoundKey operation, in which the plaintext is combined with the cipher key. Next, in rounds 1 to 9, the data passes sequentially through the SubBytes, ShiftRows, MixColumns, and AddRoundKey modules.

The 10th round skips the MixColumns step. The key expansion unit continuously provides the appropriate round keys. The round modules are connected in series and are controlled by the clock and an enable signal through gating logic. This modularity makes the design easier to verify and simplifies synthesis.
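The data flow described above can be mirrored by a compact software model. The following Python sketch is an illustrative reference model only, not the SystemVerilog implementation: all function names are assumptions, the S-box is derived from the GF(2⁸) inverse and the affine transformation rather than stored as a table, and the state is assumed to be held as 16 bytes in column order. It follows the same sequence: an initial AddRoundKey, nine full rounds, and a final round without MixColumns.

```python
def gmul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p


def _sbox_entry(x: int) -> int:
    # Multiplicative inverse (0 maps to 0) followed by the AES affine transformation.
    inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key: bytes):
    """Return the 11 round keys (16 bytes each) of AES-128."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = t[1:] + t[:1]                # RotWord
            t = [SBOX[b] for b in t]         # SubWord
            t[0] ^= rcon                     # Rcon
            rcon = gmul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [bytes(sum(w[4 * r:4 * r + 4], [])) for r in range(11)]


def add_round_key(s: bytes, k: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(s, k))


def sub_bytes(s: bytes) -> bytes:
    return bytes(SBOX[b] for b in s)


def shift_rows(s: bytes) -> bytes:
    # Byte index is row + 4 * column; row r is rotated left by r positions.
    return bytes(s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


def mix_columns(s: bytes) -> bytes:
    out = bytearray()
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return bytes(out)


def encrypt_block(plaintext: bytes, key: bytes) -> bytes:
    rk = expand_key(key)
    s = add_round_key(plaintext, rk[0])                  # initial AddRoundKey
    for rnd in range(1, 10):                             # rounds 1 to 9
        s = add_round_key(mix_columns(shift_rows(sub_bytes(s))), rk[rnd])
    return add_round_key(shift_rows(sub_bytes(s)), rk[10])  # round 10: no MixColumns


if __name__ == "__main__":
    key = bytes(range(16))
    pt = bytes.fromhex("00112233445566778899aabbccddeeff")
    print(encrypt_block(pt, key).hex())
```

The key and plaintext in the usage example are those of the AES-128 example vector in FIPS-197, so the printed value can be checked against the ciphertext listed there.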

3.5. Pipelining Strategy

To meet high-speed requirements, the design applies pipelining at two levels.

Round-level pipelining: registers placed between rounds decouple the round modules, so that multiple blocks can be processed at the same time.

Intra-round pipelining: by inserting pipeline registers around the MixColumns transformation, the critical path delay was reduced from 719 ps to 300 ps, allowing the design to achieve a maximum clock frequency of 1.39 GHz. All timing values are reported after post-synthesis static timing analysis.

Despite the introduction of additional pipeline registers, the area overhead remains moderate due to efficient module reuse.
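The effect of a fully pipelined datapath on latency and throughput can be illustrated with a small behavioural model. The Python sketch below is only an abstraction: `run_pipeline` is a hypothetical helper, the stage contents are opaque tokens rather than AES state, and the default depth of 11 is taken from the latency listed in Table 7. It shows that the first block appears after the pipeline depth, while every further block follows one cycle later.

```python
from collections import deque


def run_pipeline(blocks, depth=11):
    """Clock a `depth`-stage register pipeline and record when each block leaves it.

    Returns a list of (cycle, block) pairs. Blocks must not be None, because
    None is used here to mark an empty pipeline stage (a bubble).
    """
    stages = [None] * depth
    pending = deque(blocks)
    outputs = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Clock edge: a new block (or a bubble) enters, everything else moves one stage on.
        new = pending.popleft() if pending else None
        stages = [new] + stages[:-1]
        if stages[-1] is not None:
            outputs.append((cycle, stages[-1]))
    return outputs


if __name__ == "__main__":
    for cycle, block in run_pipeline(["B0", "B1", "B2"]):
        print(f"cycle {cycle}: {block} done")
```

In this model, n blocks finish after depth + n − 1 cycles, which is why a deep pipeline costs latency for a single block but sustains one block per cycle for long streams.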

4. Verification and Results

4.1. UVM-Based Verification Environment

The proposed AES-128 core was tested using a UVM-based environment that included a driver, monitor, scoreboard, and coverage collector. Over 1000 directed and constrained-random test cases were run, including corner cases such as all-zero and all-one inputs, as well as random key-data combinations. Functional coverage of about 95% was achieved across all AES transformations (SubBytes, ShiftRows, MixColumns, and AddRoundKey). The scoreboard compared the outputs against a software AES golden reference model, and no mismatches were found. In addition, correctness was verified using the standard AES known-answer test vectors (FIPS-197). Assertion-based checks were used to verify correct pipeline synchronisation and round-key alignment. Waveform analysis further confirmed proper pipelining, datapath operation, and synchronisation between the datapath and key expansion stages.
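The scoreboard principle described above can be sketched outside UVM as well. The following Python fragment is a simplified analogy rather than the verification environment used in this work: `make_stimulus`, `scoreboard`, `toy_golden`, and `toy_faulty_dut` are hypothetical names, keys and blocks are modelled as 128-bit integers, and the placeholder functions are deliberately not AES so that the sketch stays self-contained.

```python
import random

BLOCK_BITS = 128
ALL_ZERO = 0
ALL_ONE = (1 << BLOCK_BITS) - 1


def make_stimulus(n_random, seed=1):
    """Directed corner cases followed by seeded random (key, plaintext) pairs."""
    rng = random.Random(seed)
    directed = [(k, p) for k in (ALL_ZERO, ALL_ONE) for p in (ALL_ZERO, ALL_ONE)]
    rand = [(rng.getrandbits(BLOCK_BITS), rng.getrandbits(BLOCK_BITS))
            for _ in range(n_random)]
    return directed + rand


def scoreboard(dut, golden, stimulus):
    """Compare DUT and golden-model outputs; return the mismatching transactions."""
    mismatches = []
    for key, pt in stimulus:
        expected = golden(key, pt)
        actual = dut(key, pt)
        if actual != expected:
            mismatches.append((key, pt, expected, actual))
    return mismatches


def toy_golden(key, pt):
    """Placeholder reference function so the sketch runs on its own; NOT AES."""
    return key ^ pt


def toy_faulty_dut(key, pt):
    """Placeholder DUT with an injected bug on all-one plaintext inputs."""
    return (key ^ pt) ^ 1 if pt == ALL_ONE else key ^ pt


if __name__ == "__main__":
    stim = make_stimulus(996)                      # 4 directed + 996 random cases
    print(len(stim), "transactions")
    print("clean DUT mismatches :", len(scoreboard(toy_golden, toy_golden, stim)))
    print("faulty DUT mismatches:", len(scoreboard(toy_faulty_dut, toy_golden, stim)))
```

The injected bug is only triggered by the all-one plaintext, which illustrates why directed corner cases are run alongside random stimulus.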

4.2. Performance Comparison with Existing Works

We developed an AES-128 encryption core in SystemVerilog and synthesised it in an industrial ASIC design environment. The design operates at a frequency of 1.39 GHz in a 65 nm technology. In the non-pipelined design, the critical path was 719 ps. With the proposed pipelining scheme, applied in particular around the MixColumns stage and at the boundaries of each round, the critical path delay decreased to 300 ps, a more than two-fold timing improvement with a negligible area increase. According to the Cadence Genus report, the design occupies a silicon area of 327,836.520 µm², or approximately 327k standard cell equivalents. Although the design incorporates pipeline registers, the modular RTL design allows efficient hardware mapping without replication of logic. The power analysis indicates that the design strikes a reasonable compromise between speed and power. While the design primarily targets high-speed applications, switching activity remains stable under post-synthesis conditions.

While many previous studies have focused on FPGA-based AES-128 designs or partial optimisation, this research offers a fully integrated pipelined AES-128 architecture, including RTL design, verification, synthesis, and performance analysis.

The performance comparison of the proposed AES-128 core with prior works is presented in Table 3. To ensure a fair comparison, the reported implementations are categorised based on their hardware platform (ASIC or FPGA), and comparisons are interpreted accordingly.

As observed from Table 3, most of the existing implementations are FPGA-based and exhibit limited frequency scaling. In contrast, the proposed design achieves a significantly higher operating frequency and reduced critical path delay. This demonstrates that the proposed architecture is better suited for high-performance ASIC-based cryptographic applications.

It should be noted that the comparison in Table 3 covers implementations in both FPGA and ASIC technologies. The results are therefore interpreted in terms of design trends such as critical path delay, maximum frequency, and design efficiency rather than as a like-for-like ranking.

In particular, compared with the other ASIC-based AES-128 designs, the proposed work has a better timing performance and comparable area. This shows that the proposed intra-round and inter-round pipelining approach is effective for high-speed ASIC designs. A more rigorous comparison with closely related ASIC-based AES implementations is considered for future work.

4.3. Post-Synthesis Results

4.3.1. Timing Analysis

Table 4 presents the critical path timing analysis after synthesis. The original datapath has a maximum combinational path of about 719 ps, dominated by the MixColumns logic because of its GF(2⁸) multiplier XOR-shift network. Adding fine-grained pipeline registers around this stage reduced the critical path delay to approximately 300 ps, reflecting a focused pipeline optimisation rather than a traditional homogeneous pipelining strategy. This targeted register placement differentiates the proposed architecture from conventional fully pipelined AES designs, where registers are typically inserted only between rounds.

The reduction of the critical path to 300 ps indicates that the register insertion was performed in a non-redundant manner, which is essential for achieving ASIC-level timing closure without architectural over-engineering. The more than two-fold reduction in path delay places the engine in the GHz range of cryptographic accelerators, beyond iterative ASIC-based AES engines. As shown in Table 4, the post-synthesis timing report identifies the critical path and confirms a maximum combinational delay of 300 ps at a clock frequency of 1.39 GHz under typical process conditions.

Table 4 presents the post-synthesis timing results of the proposed AES-128 architecture, highlighting the reduction in critical path delay after pipelining.

As shown in Table 4, the original datapath exhibits a critical path delay of 719 ps, primarily due to the MixColumns stage and its GF(2⁸) arithmetic operations. After applying the proposed pipelining strategy, the critical path delay is reduced to 300 ps, enabling a maximum operating frequency of 1.39 GHz. This demonstrates that the pipelining approach effectively improves timing performance while maintaining efficient hardware utilisation.

It should be noted that the reported 300 ps is the reduced combinational critical path delay obtained with balanced pipelining. The maximum operating frequency of 1.39 GHz corresponds to a clock period of approximately 719 ps, as determined by post-synthesis timing analysis. This clock period also accounts for other timing constraints such as setup time, clock uncertainty, and routing effects. Thus, the maximum achievable clock frequency is lower than the value suggested by the combinational critical path delay alone. The synthesis-reported value of ~719 ps represents the effective clock period after timing closure, not the isolated combinational delay.
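The relation between the reported frequency, the clock period, and the combinational delay can be checked with a short calculation. The Python sketch below uses only the values quoted above; the names `period_ps`, `margin_ps`, and `f_ideal_ghz` are illustrative, and the "margin" is simply the difference between the clock period and the combinational delay, without any attempt to split it into setup time, clock uncertainty, or routing.

```python
def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


F_MAX_GHZ = 1.39      # reported maximum operating frequency
T_COMB_PS = 300.0     # reported pipelined combinational critical path

t_clk_ps = period_ps(F_MAX_GHZ)        # effective clock period after timing closure
margin_ps = t_clk_ps - T_COMB_PS       # part of the period not spent in combinational logic
f_ideal_ghz = 1000.0 / T_COMB_PS       # bound if only the combinational delay mattered

if __name__ == "__main__":
    print(f"clock period        : {t_clk_ps:.1f} ps")
    print(f"non-combinational   : {margin_ps:.1f} ps")
    print(f"combinational bound : {f_ideal_ghz:.2f} GHz")
```

Under these assumptions, the clock period comes to about 719 ps, while a 300 ps path on its own would allow roughly 3.3 GHz; the reported 1.39 GHz therefore reflects the full timing budget rather than the combinational delay alone.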

4.3.2. Hardware Resource Utilisation

Table 5 summarises the hardware resource utilisation of the proposed design in 65 nm CMOS technology.

The area is reported in terms of physical silicon area (µm²), while the equivalent gate count is provided separately for design comparison. As shown in Table 5, the proposed design has a total cell area of 327,836.520 µm² in 65 nm CMOS technology. The dominance of combinational logic is mainly due to the MixColumns and substitution operations. The pipelined architecture ensures efficient mapping without logic duplication, making the design suitable for high-performance ASIC implementations. Note that the reported 327k figure is in units of standard cell equivalents (gate count), whereas 327,836.520 µm² is the physical silicon area obtained at the end of synthesis. The two metrics capture distinct aspects of hardware utilisation and are reported separately for clarity. Reporting both gate count and physical area supports reproducibility and fair comparison across different technology nodes.

4.3.3. Power Analysis

The post-synthesis dynamic power distribution of the proposed AES-128 architecture is summarised in Table 6. The logic block has the largest internal power owing to the GF(2⁸) arithmetic and the XOR-based MixColumns logic, while the register and clock networks show moderate switching activity. The balanced pipeline placement suppresses glitches, which leads to stable dynamic power consumption across all pipeline stages. The synchronous pipelined structure also improves implementation robustness by limiting glitch propagation along long paths and by reducing unnecessary combinational switching activity through timing optimisation. Furthermore, the modular RTL architecture and the stage-wise verification process improve fault observability during simulation and gate-level verification, which supports reliable implementation of these cryptographic functions in an ASIC-based system.

These results demonstrate that the proposed architecture supports a high operating frequency with manageable switching overhead. The reported power consumption of 544.33 mW is consistent with high-frequency pipelined ASIC implementations, in which switching activity and deep pipelining add to the dynamic power. The design targets high-speed secure communication and prioritises speed, so it consumes more power than cores designed for very low power; its consumption is comparable to that reported for similar high-speed designs. Deep pipelining improves timing performance and throughput but introduces extra registers and clock-switching activity, which increase power consumption. This is a well-known trade-off between speed and power in high-performance hardware design.

The power values were obtained from post-synthesis analysis in Cadence Genus under standard operating conditions. The analysis assumes a nominal supply voltage and the typical switching activity factors provided by the synthesis tool, and the results correspond to the typical process corner and ambient temperature. The reported values were determined by vectorless estimation, which gives an approximate estimate of the power consumption at the operating frequency. Table 6 shows the detailed dynamic power distribution across the hardware components of the AES architecture.

Table 6 shows that the logic block consumes the highest power due to the GF(2⁸) calculations and the intensive use of XOR gates in the MixColumns operation. The registers and clocking network consume moderately high switching power, whereas the memory and latch categories have a negligible impact on power consumption. The pipelined structure suppresses glitch propagation effectively, which keeps the dynamic power consumption stable across the pipeline stages.

It should be noted that the reported dynamic power consumption corresponds to a high-frequency operating condition (1.39 GHz) in a fully pipelined architecture. The design is primarily optimised for high-throughput applications rather than ultra-low-power scenarios. Therefore, although the power consumption is relatively higher, it is consistent with high-speed ASIC implementations. For low-power applications such as IoT systems, techniques such as clock gating, operand isolation, and resource sharing can be incorporated in future work to significantly reduce power consumption.
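The per-category breakdown can be made concrete by summing the entries of Table 6. The Python sketch below copies the non-zero rows of that table; the dictionary layout and helper names are illustrative. With the values as listed, the category totals add up to about 554.33 mW, which is close to, but not identical with, the 544.33 mW figure quoted in the text, so the shares computed here should be read as approximate.

```python
# (leakage, internal, switching) in mW, copied from the non-zero rows of Table 6
POWER_MW = {
    "Register": (0.0037, 80.555, 12.869),
    "Logic": (0.0289, 209.520, 251.353),
}


def category_total(name: str) -> float:
    """Total power of one category: leakage + internal + switching."""
    return sum(POWER_MW[name])


def grand_total() -> float:
    """Sum over all listed categories."""
    return sum(category_total(name) for name in POWER_MW)


def share(name: str) -> float:
    """Fraction of the summed power attributed to one category."""
    return category_total(name) / grand_total()


if __name__ == "__main__":
    for name in POWER_MW:
        print(f"{name:8s}: {category_total(name):8.3f} mW ({share(name):.1%})")
    print(f"{'Sum':8s}: {grand_total():8.3f} mW")
```

On these numbers, combinational logic accounts for roughly five sixths of the total, which is consistent with the observation that the logic block dominates the power budget.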

The comparison between sequential and combinational power components of the proposed AES-128 architecture is illustrated in Figure 7.

The distribution of dynamic power consumption across different hardware components of the proposed AES-128 architecture is illustrated in Figure 8.

4.3.4. Gate-Level Functional Verification

The detailed waveform analysis corresponding to the verification process is presented in Figure 9.

Figure 9 shows that the plaintext message and key are fed into the AES-128 encryption core. The data goes through several pipelined stages associated with AES rounds such as SubBytes, ShiftRows, MixColumns, and AddRoundKey.

The waveform shows correct pipelining, with round results generated sequentially in successive clock cycles. After all rounds have been processed, the output ciphertext matches the golden reference AES model, which indicates that the AES design is functionally correct.

Moreover, from the waveform analysis, it can be seen that the pipeline is working properly, as there is proper synchronisation between the datapath and key expansion stages.

4.3.5. Performance Summary of Proposed AES-128 Design

To provide a comprehensive evaluation of the proposed AES-128 core, key performance metrics, including throughput, latency, area, and energy efficiency, are summarised in Table 7. These metrics are particularly important for pipelined architectures, where both throughput and latency must be considered.
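The derived metrics follow from the reported clock frequency, power, area, and latency by simple arithmetic. The Python sketch below is a back-of-the-envelope check under the assumption of one 128-bit block per clock cycle; the constant names are illustrative, and the input values are those reported for the design (1.39 GHz, 544.33 mW, 327,836 µm², 11 cycles).

```python
BLOCK_BITS = 128          # AES block size
F_CLK_HZ = 1.39e9         # reported clock frequency
POWER_W = 0.54433         # reported power (544.33 mW)
AREA_UM2 = 327_836        # reported cell area in square micrometres
LATENCY_CYCLES = 11       # reported pipeline latency

# One block leaves the pipeline per cycle, so throughput is bits per block times frequency.
throughput_bps = BLOCK_BITS * F_CLK_HZ
energy_pj_per_bit = POWER_W / throughput_bps * 1e12
latency_ns = LATENCY_CYCLES / F_CLK_HZ * 1e9
area_mm2 = AREA_UM2 / 1e6
throughput_gbps_per_mm2 = throughput_bps / 1e9 / area_mm2

if __name__ == "__main__":
    print(f"throughput      : {throughput_bps / 1e9:.1f} Gbps")
    print(f"energy per bit  : {energy_pj_per_bit:.2f} pJ/bit")
    print(f"latency         : {latency_ns:.2f} ns")
    print(f"area efficiency : {throughput_gbps_per_mm2:.0f} Gbps/mm^2")
```

With these inputs the throughput is about 178 Gbps, the energy is about 3.06 pJ/bit, and the latency of a single block is just under 8 ns. Normalised to area, the throughput is about 543 Gbps/mm², which is equivalent to about 0.54 Mbps/µm².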

5. Conclusions

This paper has described a highly pipelined, ASIC-friendly AES-128 encryption core implemented in modular SystemVerilog and verified with a UVM-based environment. The design adopts an efficient intra-round and inter-round pipelining approach to shorten the critical path delay of the MixColumns operation. The post-synthesis results obtained with Cadence Genus, together with the simulation results, confirm a maximum clock frequency of 1.39 GHz with a critical path delay of about 300 ps, which demonstrates the effectiveness of the proposed architecture. Gate-level simulation and constrained-random verification establish the design's correctness by comparing the ciphertext results with the standard AES-128 test vectors.

The proposed design exhibits better timing results than previous FPGA- and ASIC-based designs, with acceptable area and dynamic power consumption. However, the current implementation focuses on achieving a higher clock frequency and does not include specific countermeasures against side-channel attacks such as differential power analysis (DPA) or against fault injection, which could limit its use in security-sensitive applications. Moreover, low-power design techniques (e.g., clock gating, resource sharing) were not applied in this design, which leads to higher dynamic power consumption.

The design is realised in 65 nm CMOS technology, but the modular RTL architecture is scalable to smaller technology nodes (e.g., 45 nm and 28 nm). Since the design is RTL-based, it remains largely technology-independent and can be synthesised across different process nodes with appropriate standard-cell libraries; this will be investigated in future work to assess performance, power, and area trade-offs.

Because it is fully pipelined, the proposed AES-128 core can process each 128-bit plaintext block independently, as in ECB mode. In practical cryptographic applications, CBC operation can be implemented by means of external XOR feedback and control logic, which does not change the AES round structure. The main emphasis of the present work is the optimisation of the forward encryption datapath, with the aim of achieving minimum critical path delay and maximum operating frequency. A unified encryption/decryption datapath would require extra hardware for the inverse transformations, reverse round-key scheduling, and additional control multiplexing, and thus might increase the timing overhead and decrease the maximum frequency of the proposed ASIC-oriented pipelined architecture.

The primary objective of this work is a high-speed ASIC-oriented AES implementation focused on timing optimisation and throughput. Therefore, advanced side-channel protection mechanisms were not considered in the current design, and integrating low-power techniques and side-channel countermeasures (e.g., masking, hiding, and fault detection) remains an important direction towards secure and energy-efficient cryptographic systems. In contrast to other AES implementations, the proposed work presents a targeted pipeline optimisation approach, based on the critical path associated with MixColumns, which leads to a considerable reduction in critical path delay.

Author Contributions

Conceptualisation, A.K. and S.M.; methodology, A.K.; software, A.K.; validation, A.K. and S.M.; formal analysis, A.K. and S.U.; investigation, S.M.; resources, S.M. and S.U.; data curation, A.K.; writing—original draft preparation, A.K.; writing—review and editing, S.M. and S.U.; visualisation, S.M.; supervision, S.M. and S.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2026R79), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data Availability Statement

The data presented in this study are available within the article. No new external datasets were generated.

Acknowledgments

Authors are thankful to Princess Nourah bint Abdulrahman University Researchers Supporting Project Number (PNURSP2026R79), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AES	Advanced Encryption Standard
ASIC	Application Specific Integrated Circuit
FPGA	Field Programmable Gate Array
UVM	Universal Verification Methodology
RTL	Register Transfer Level
CMOS	Complementary Metal Oxide Semiconductor
IoT	Internet of Things
S-Box	Substitution Box
BIST	Built In Self Test
DfT	Design for Testability
NIST	National Institute of Standards and Technology

References

1. Sheikhpour, S.; Mahani, A.; Bagheri, N. Reliable advanced encryption standard hardware implementation: 32-bit and 64-bit data-paths. Microprocess. Microsyst. 2021, 81, 103740.
2. Malal, A.; Tezcan, C. FPGA-friendly compact and efficient AES-like 8 × 8 S-box. Microprocess. Microsyst. 2024, 105, 105007.
3. Mestiri, H.; Kahri, F.; Bouallegue, B.; Machhout, M. A high-speed AES design resistant to fault injection attacks. Microprocess. Microsyst. 2016, 41, 47–55.
4. Bedoui, M.; Mestiri, H.; Bouallegue, B.; Hamdi, B.; Machhout, M. An improvement of both security and reliability for AES implementations. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 9844–9851.
5. Nitaj, A.; Susilo, W.; Tonien, J. Enhanced S-boxes for the AES with maximal periodicity and better avalanche property. Comput. Stand. Interfaces 2024, 87, 103769.
6. Ahmad, N.; Hasan, S.M.R. A new ASIC implementation of an advanced encryption standard (AES) crypto-hardware accelerator. Microelectron. J. 2021, 117, 105255.
7. Soltani, A.; Sharifian, S. An ultra-high throughput and fully pipelined implementation of AES algorithm on FPGA. Microprocess. Microsyst. 2015, 39, 480–493.
8. Zhang, X.; Parhi, K.K. High-speed VLSI architectures for the AES algorithm. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2004, 12, 957–967.
9. Lavanya, R.; Karpagam, M. Enhancing the security of AES through small scale confusion operations. Microprocess. Microsyst. 2020, 75, 103041.
10. FIPS PUB 197; Advanced Encryption Standard (AES). National Institute of Standards and Technology (NIST): Gaithersburg, MD, USA, 2001.
11. Cheng, X.; Wang, H.; Luo, X.; Guan, Q.; Ma, B.; Wang, J. Re-Cropping Framework: A Grid Recovery Method for Quantization Step Estimation in Non-Aligned Recompressed Images. IEEE Trans. Circuits Syst. Video Technol. 2026, 36, 4771–4785.
12. Erkan, U.; Toktas, F.; Toktas, A.; Lai, Q.; Zhou, S.; Lin, Y.; Gao, S. Multi-layer and multi-directional image encryption algorithm based on hyperchaotic 3D Xin-She Yang map. Expert Syst. Appl. 2026, 304, 130808.
13. Abulibdeh, E.; Saleh, H.; Mohammad, B.; Alqutayri, M. Computational-Based Advanced Encryption Standard (AES) Accelerator. In 2023 International Conference on Microelectronics (ICM); IEEE: Abu Dhabi, United Arab Emirates, 2023; pp. 64–69.
14. Gokul, R.; Swarnalatha, A. Pipelined AES-128 encryption and decryption design using Verilog HDL. Int. J. Multidiscip. Res. (IJFMR) 2025, 7, 1–9.
15. Malal, A.; Tezcan, C. First fully pipelined high throughput FPGA implementation of wider AES. Res. Sq. 2025.
16. Ajmi, H.; Zayer, F.; Fredj, A.H.; Belgacem, H.; Mohammad, B.; Werghi, N.; Dias, J. Efficient and lightweight in-memory computing architecture for hardware security. J. Parallel Distrib. Comput. 2024, 190, 104898.
17. Azzouzi, O.; Anane, M.; Ghanem, M.C.; Himeur, Y.; Wojtczak, D. Flexible and area-efficient codesign implementation of AES on FPGA. Cryptography 2025, 9, 78.
18. Selvapriya, E.S.; Suganthi, L. Design and implementation of low power AES cryptocore utilizing dynamic pipelined asynchronous model. Integration 2023, 93, 102057.
19. Pradeep, A.; Mohanty, V.; Subramaniam, A.M.; Rebeiro, C. Revisiting AES SBox Composite Field Implementations for FPGAs. IEEE Embed. Syst. Lett. 2019, 11, 85–88.
20. Rashmi, R.; Mohan, A. Implementation of AES S-Boxes using combinational logic. In 2008 IEEE International Symposium on Circuits and Systems (ISCAS); IEEE: Seattle, WA, USA, 2008; pp. 3294–3297.
21. Lin, S.-H.; Lee, J.-Y.; Chuang, C.-C.; Lee, N.-Y.; Chen, P.-Y.; Chin, W.-L. Hardware implementation of high-throughput S-box in AES. IEEE Access 2023, 11, 59049–59058.
22. Bazgir, O.; Gali, S.; Nikoubin, T. Area-power and energy efficient S-box in AES. In GLSVLSI 2024 Proceedings; ACM: New York, NY, USA, 2024; pp. 263–267.
23. Sumio, M.; Akashi, S. An Optimized S-Box Circuit Architecture for Low Power AES Design. In Cryptographic Hardware and Embedded Systems—CHES 2002; Springer: Berlin/Heidelberg, Germany, 2002; Volume 2523, pp. 172–186.
24. Thanikodi, M.K. AES algorithm for power-efficient and high-speed applications. Wirel. Pers. Commun. 2025, 140, 225–239.
25. EL Makhloufi, A.; EL Adib, S.; Raissouni, N. Hardware pipelined AES architecture for satellite data. e-Prime 2024, 8, 100548.
26. Yu, F.; Xu, S.; Xiao, X.; Yao, W.; Huang, Y.; Cai, S.; Li, Y. Dynamic analysis and FPGA implementation of chaotic system. Integration 2024, 96, 102129.
27. Prayitno, R.H.; Latifah; Sudiro, S.A.; Madenda, S.; Harmanto, S. A modified MixColumn-InversMixColumn in AES algorithm suitable for hardware implementation using FPGA device. Commun. Sci. Technol. 2023, 8, 198–207.
28. Visconti, P.; Capoccia, S.; Venere, E.; Velázquez, R.; Fazio, R.d. 10 Clock-Periods Pipelined Implementation of AES-128 Encryption-Decryption Algorithm up to 28 Gbit/s Real Throughput by Xilinx Zynq UltraScale+ MPSoC ZCU102 Platform. Electronics 2020, 9, 1665.
29. Mohamed, H.A.A.; Yakout, M.A. An Efficient AES Design and Implementation Using FPGA. Int. J. Emerg. Sci. Eng. 2025, 13, 21–26.
30. Liu, Q.; Xu, Z.; Yuan, Y. High Throughput and Secure Advanced Encryption Standard on Field Programmable Gate Array with Fine Pipelining and Enhanced Key Expansion. IET Comput. Digit. Tech. 2015, 9, 175–184.
31. Dixit, N.K. Advanced FPGA Implementation of AES Algorithm. Int. J. Emerg. Trends Eng. Res. 2021, 9, 4.
32. Algredo-Badillo, I.; Ramírez-Gutiérrez, K.A.; Morales-Rosales, L.A.; Pacheco Bautista, D.; Feregrino-Uribe, C. Hybrid Pipeline Hardware Architecture Based on Error Detection and Correction for AES. Sensors 2021, 21, 5655.

Figure 1. AES-128 encryption process showing 4 × 4 state matrix representation and round transformations.

Figure 2. Fully Unrolled AES-128 Pipeline Showing Inter-Round Register Placement.

Figure 3. Hardware architecture of the AES-128 key expansion module showing RotWord, SubWord, and Rcon operations.

Figure 4. SubBytes operation showing nonlinear byte substitution using the AES S-box.

Figure 5. ShiftRows operation.

Figure 6. MixColumns transformation showing column-wise diffusion using Galois Field arithmetic.

Figure 7. Sequential vs. Combinational Power.

Figure 8. Dynamic Power Distribution Across Hardware Components of the Proposed AES-128 Architecture.

Figure 9. Gate-level simulation waveform of the proposed AES-128 pipelined encryption core showing plaintext input, key expansion, round transformations, and final ciphertext generation.

Table 1. Summary of Existing AES-128 Hardware Implementations and Limitations.

Year	Platform	Focus	Key Technique(s)	Max Frequency
2019	FPGA	High throughput	Loop unrolling with deep pipelining	500 MHz
2020	FPGA	Area minimization	Iterative AES-128 architecture	800 MHz
2021	ASIC	DfT/BIST	On-chip BIST for AES processor	700 MHz
2021	FPGA	Low-area/high-speed	Resource sharing and RTL optimisation	650 MHz
2021	ASIC	Compact IoT core	8-bit datapath AES	400 MHz
2021	Mixed HW	Reliability	Hybrid pipeline with error detection	–
2022	General HW	Security	Hardened AES techniques	–
2022	ASIC/FPGA	Area efficiency	Composite-field S-Box design	750 MHz
2023	ASIC	Ultra-low power	Energy-optimised AES core	300 MHz
2024	ASIC (65 nm)	Compact S-Box	Walsh–Hadamard-based S-Box	–

Table 2. Comparative Evaluation of AES-128 and Chaos-Based Multimedia Encryption Frameworks in Terms of Hardware Feasibility, Timing Optimisation, and Computational Complexity.

Evaluation Metric	Proposed AES-128 ASIC	EAS	3D-NDHC	MLMD-IE	Re-Cropping Framework
Application Domain	Secure communication	Video encryption	Image encryption	Image encryption	Image forensics
Encryption Structure	SPN-based AES rounds	Temporal chaotic segmentation	Dynamic 3D S-box	Multi-layer diffusion	JPEG grid recovery
Security Mechanism	Deterministic substitution–permutation	Chaotic neuron map	Hyperchaotic diffusion	Hyperchaotic multidirectional diffusion	Quantisation analysis
Computational Complexity	Moderate	High	High	High	Moderate
Pipeline Suitability	High	Limited	Limited	Limited	Not targeted
ASIC Orientation	Yes	No	No	No	No
Timing Optimisation	Balanced intra-round and inter-round pipelining	Not reported	Not reported	Not reported	Not applicable
Evaluation Basis	ASIC timing + UVM	Statistical randomness	Entropy + Lyapunov	NPCR/UACI	Quantisation recovery
Post-Synthesis Metrics	1.39 GHz, 300 ps	Not reported	Not reported	Not reported	Not reported
Verification/Evaluation Style	UVM + Cadence Genus ASIC evaluation	Statistical security and randomness analysis	Entropy, Lyapunov exponent, and cryptanalysis analysis	Entropy, NPCR, UACI, and correlation analysis	JPEG quantisation and forensic analysis
Hardware Implementation Focus	ASIC-oriented high-speed encryption	Software-oriented multimedia encryption	Multimedia privacy protection	Multimedia image encryption	Image forensic recovery
Main Complexity Source	MixColumns critical path	Chaotic sequence generation	Dynamic hyperchaotic S-box operations	Multi-directional recursive diffusion	DCT grid recovery and estimation
Throughput Orientation	High-throughput hardware architecture	Resource-efficient video protection	Security-oriented image encryption	Multi-layer security-oriented encryption	Forensic reconstruction framework
Timing Closure Support	Yes	No	No	No	No

Table 3. Performance Comparison of the Proposed AES-128 Core with Prior Works.

Work (Year)	Platform	Max Frequency	Delay	Area Utilisation	Remark
Liu et al. (2015) [30]	FPGA	501 MHz	2 ns	High LUT usage	High throughput but FPGA-only
Dixit (2021) [31]	FPGA	813 MHz	1.2 ns	Medium	Focus on pipelining
Ahmad and Hasan (2021) [6]	ASIC	100 MHz	10 ns	Compact	Low area, modest speed
Algredo-Badillo et al. (2021) [32]	HW Arch.	–	–	Medium	Reliability-focused design
Proposed Work	ASIC	1.39 GHz	300 ps (pipelined)	327k	High performance, ASIC-ready

Table 4. Synthesis Results of AES Encryption Design Using Cadence Genus.

Parameter	Value
Maximum Frequency	1.39 GHz
Critical Path	Start-point: r4_a2_data_out_reg[110]/CK; End-point: r5_a1_s_data_out_reg[100]/D
Delay (non-pipelined)	719 ps
Delay (pipelined)	300 ps

Table 5. Area Utilisation Analysis.

Parameter	Value
Technology	65 nm CMOS
Total cell area	327,836.520 µm²
Design type	Fully Pipelined AES-128
Dominant resource	Combinational Logic

Table 6. Dynamic Power Analysis.

Category	Leakage (mW)	Internal (mW)	Switching (mW)
Memory	0.000	0.000	0.000
Register	0.0037	80.555	12.869
Latch	0.000	0.000	0.000
Logic	0.0289	209.520	251.353
Bbox/Clock/Pad/PM	0.000	0.000	0.000
Total	0.0326	290.075	264.222

Table 7. Summary of Proposed AES-128 Design.

Metric	Value
Clock Frequency	1.39 GHz
Throughput	178 Gbps
Latency	11 clock cycles
Cycles to First Output	11 cycles
Architecture	Fully pipelined (1 block/cycle)
Area	327,836 µm²
Power	544.33 mW
Energy per bit	~3.05 pJ/bit
Area Efficiency	~0.543 Mbps/µm² (≈543 Gbps/mm²)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
