Adaptive FPGA-Based Mixed-Radix NTT Architectures with Classical and Quantum Evaluation for CRYSTALS-Kyber

AlKurdi, Yaser; Abu Al-Haija, Qasem; Alghuried, Ahod

doi:10.3390/app16126183

by Yaser AlKurdi ¹, Qasem Abu Al-Haija ²,* and Ahod Alghuried ³

¹ Cybersecurity Department, Princess Sumaya University for Technology (PSUT), Amman 11941, Jordan

² Cybersecurity Department, Jordan University of Science and Technology (JUST), Irbid 22110, Jordan

³ Computer Science Department, Prince Sattam Bin Abdulaziz University, Al-Kharj 16273, Saudi Arabia

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6183; https://doi.org/10.3390/app16126183

Submission received: 9 May 2026 / Revised: 4 June 2026 / Accepted: 15 June 2026 / Published: 18 June 2026

(This article belongs to the Special Issue Recent Progress of Information Security and Cryptography)

Abstract

The imminent threat of large-scale quantum computers motivates the deployment of post-quantum cryptography (PQC). CRYSTALS-Kyber, a leading lattice-based Key Encapsulation Mechanism, relies heavily on Number Theoretic Transform (NTT) operations, which remain a major performance and resource bottleneck. This paper presents a cross-platform NTT evaluation framework for CRYSTALS-Kyber, centered on an adaptive FPGA-based mixed-radix accelerator supporting radix-2, radix-4, and radix-8 configurations, together with comparative classical implementations and exploratory quantum-circuit prototypes. Classical evaluations show that an iterative Cooley–Tukey implementation outperforms a matrix-based baseline (≈3.6× faster for the forward NTT, ≈6.3× faster for the inverse NTT). Quantum prototypes implemented in Qiskit demonstrate proof-of-concept QFT-based NTT constructions under classical simulation environments, highlighting circuit-depth growth and noise sensitivity rather than practical hardware acceleration. The proposed FPGA design, based on a Xilinx Virtex UltraScale+ platform, employs an adaptive radix controller, LUT-based twiddle management, and Montgomery/Barrett modular arithmetic. Montgomery reduction provides superior timing and area trade-offs, with an estimated $F_{\max}$ of up to 231.48 MHz and only 5 DSPs for radix-2. At the same time, radix-2 offers the best resource/performance balance with a latency of approximately 32,804 cycles. The hybrid approach strikes a balance between near-term FPGA practicality and long-term quantum potential while preserving Kyber’s MLWE-based security. Experimental results and comparative analysis indicate that the adaptive design substantially reduces resource usage and timing overhead compared to recent HLS-based NTT accelerators.

Keywords:

post-quantum cryptography; CRYSTALS-Kyber; NTT; Quantum Fourier Transform; FPGA; lattice cryptography

1. Introduction

Public-key cryptography (PKC) has long provided the foundation for secure digital communications by employing asymmetric key pairs to achieve confidentiality, authentication, and integrity. Widely deployed algorithms such as Rivest–Shamir–Adleman (RSA) [1] and elliptic curve cryptography (ECC) [2] derive their security from the infeasibility of solving computationally hard mathematical problems, such as integer factorization and discrete logarithms. However, this assumption no longer holds in the era of quantum computing. Shor’s algorithm [3] enables efficient factorization and discrete logarithm computation, reducing attacks against RSA and ECC from exponential to polynomial time. Consequently, these widely used cryptographic systems become vulnerable once scalable quantum computers become available. Recognizing this imminent threat, the National Institute of Standards and Technology (NIST) launched a multi-round competition in 2016 to standardize post-quantum cryptography (PQC) [4]. After three evaluation rounds, NIST announced candidate algorithms for standardization [5]. Among them, CRYSTALS-Kyber [6]—the primary key encapsulation mechanism (KEM)—was selected for its efficiency, scalability, and resistance to quantum adversaries. Kyber relies on lattice-based cryptography (LBC) [7], specifically on the hardness of the module learning with errors (MLWE) problem [8], which is widely regarded as one of the most secure and versatile foundations for PQC [9].

Despite these advances, Kyber suffers from a critical computational bottleneck: the Number Theoretic Transform (NTT). NTT is used to accelerate polynomial multiplications in Kyber by reducing the complexity from O(n²) (schoolbook multiplication) to O(n log n). However, its practical realization remains challenging. First, computational complexity persists despite the use of FFT-inspired algorithms such as Cooley–Tukey (CT) [10] and Gentleman–Sande (GS) [11], which serve as the building blocks of NTT. Second, hardware implementations of NTT on Field Programmable Gate Arrays (FPGAs) consume substantial resources, including digital signal processing (DSP) slices, BRAMs, and LUTs [12]. This resource inefficiency restricts scalability for real-world cryptosystems. Third, although quantum computing offers theoretical acceleration opportunities via algorithms such as the Quantum Fourier Transform (QFT) [13], mapping modular arithmetic into quantum circuits introduces non-trivial challenges related to circuit depth, qubit coherence, and error propagation [14,15].

These limitations motivate a comparative investigation of NTT implementation strategies across software, hardware, and emerging quantum-computing domains. Such an evaluation provides insight into performance, resource utilization, scalability, and long-term implementation trade-offs relevant to CRYSTALS-Kyber. This research addresses this gap by proposing a cross-platform NTT framework focused primarily on an adaptive FPGA-based mixed-radix architecture, while additionally providing classical and quantum implementations for comparative evaluation and future heterogeneous computing exploration. The main objectives are:

Optimization of Classical NTT—Development of radix-mixed and algorithmic adaptations of CT and GS architectures to reduce computational overhead.
Quantum NTT Exploration—Implementation of gate-based and QFT-based polynomial arithmetic in Qiskit, evaluating the feasibility of implementing modular arithmetic within quantum circuits.
Efficient FPGA Acceleration—Design of adaptive radix-mixed NTT architectures optimized for resource utilization, achieving significant reductions in DSP, BRAM, and LUT consumption.
Cross-Platform Evaluation—Comparative analysis of NTT implementations across classical, quantum, and hardware domains, emphasizing performance metrics such as latency, throughput, and scalability.
Integration with CRYSTALS-Kyber—Ensuring compatibility with Kyber’s polynomial arithmetic requirements, thereby providing a practical and secure pathway toward PQC deployment.

The principal contribution of this work is the adaptive FPGA-based NTT accelerator capable of mixed-radix configurability (radix-2/radix-4/radix-8) combined with resource-aware modular arithmetic optimization using Barrett and Montgomery reductions.

Classical and quantum implementations are included primarily as comparative baselines and feasibility studies within the proposed evaluation framework.

Figure 1 illustrates the evaluation workflow adopted in this study. The same NTT problem is examined through classical, quantum, and FPGA implementation paths, after which the resulting performance characteristics, resource requirements, and implementation constraints are comparatively analyzed. The framework is intended as a comparative evaluation methodology rather than a runtime-integrated heterogeneous architecture.

2. Related Work

The NTT has been widely recognized as the computational bottleneck in lattice-based PQC, including CRYSTALS-Kyber. Over the past decade, researchers have proposed numerous optimizations to improve its performance across software, hardware, and hybrid domains. This section reviews prior work on NTT implementations, categorizing it into high-level synthesis (HLS)-based methods, FPGA- and Application-Specific Integrated Circuit (ASIC)-based accelerations, parallel and vectorized designs, memory-optimized architectures, and emerging quantum approaches.

2.1. Early High-Level Synthesis and Hybrid Architectures

One of the earliest efforts to accelerate NTT using HLS was presented in [16], where optimization directives such as loop unrolling, pipelining, and inlining were applied to enhance performance. Their analysis showed that unrolling nested loops significantly reduced latency, while modular multipliers based on Barrett reduction further minimized computational overhead. The authors in [17] introduced a hybrid hardware–software NewHope design that combines the NTT transformation with hash generation. Their FPGA-based design achieved notable speedups using pre-calculated twiddle factors but still incurred latency overhead in non-precomputation modes. They also employed soft error injection to mitigate reliability issues and improve robustness against Silent Data Corruption (SDC).

2.2. Parallelization and Butterfly Unit Optimizations

A key focus of subsequent research has been the parallelization of butterfly operations, the core building block of NTT. In [18], the authors proposed a Multi-Path Delay Feedback (MDF) architecture that combines features of multi-path delay commutators and single-path delay feedback units. This design enabled efficient multiplication via addition and shift operations, thereby reducing reliance on costly multipliers. The design in [19] eliminates DSP usage altogether, thereby enabling better parallelism in butterfly operations. By storing twiddle factors in ROM, the authors reduced area and latency. Similarly, ref. [20] implemented a unified butterfly design in which the same hardware supported both NTT and inverse NTT, relying on bit-reversal to maintain correctness. While this reduced hardware cost, the bit-reversal step added latency overhead. Later works extended parallelism by introducing vectorized butterfly architectures [21,22]. These designs used vector coprocessors or SIMD/NEON instruction sets for modular arithmetic. Although effective in boosting throughput, they required high memory bandwidth per cycle, often exceeding the constraints of resource-limited FPGAs.

2.3. Modular Reduction Improvements

Efficient modular reduction is critical for resource utilization. Early FPGA implementations favored Barrett or Montgomery reduction [23], but more advanced schemes were later introduced. The authors in [24] proposed K2RED, a two-step reduction method that combines CT and GS butterflies into a unified 2 × 2 architecture. This approach reduced memory accesses by reusing twiddle factors but required additional control logic [25]. The authors in [25] further enhanced modular arithmetic by combining Proth prime-based reduction with modified KRED, offering low-cost multiplications via repeated constant multiplications. More recently, ref. [26] introduced MUX-controlled arithmetic to streamline addition, subtraction, and multiplication operations, enabling a controller-based reduction scheme with lower latency. Lookup table (LUT)-based methods also emerged; for example, ref. [27] introduced FLUT (Fast LUT), which reduced the results to the range $(-2^{12}, 2^{12})$, enabling high-speed signed operations. While effective, these approaches increased memory requirements for LUT storage, making them less practical for constrained FPGA devices.

2.4. Memory and Pipelining Optimizations

Since NTT computations are memory-intensive, several works have targeted data movement reduction. Ref. [28] achieved a 30% cycle reduction by leveraging multiply-accumulate (MACC) instructions in modular operations. Ref. [29] improved performance by converting subtraction into negation–addition operations and introduced a DIV2 unit to replace multiplications in inverse NTT. Ping-pong memory access schemes [30] allowed simultaneous read–write operations, while [31] proposed banked memory layouts with Lazy-Last-Layer tricks to improve bandwidth utilization. Ref. [32] further reduced storage by reusing a single butterfly across all NTT levels, achieving nearly 100% hardware utilization but at the cost of longer critical paths. Recent works also explored domain-specific co-designs. Ref. [33] used OpenCL-based MPSoC platforms, dividing computation into NTT and pointwise multiplication (PWM) stages to reduce memory bottlenecks. Ref. [34] introduced KyberMat, a polyphase decomposition architecture inspired by FIR filters that reduces computational and memory overhead.

2.5. Advanced Radix and Flexible Architectures

Radix-based optimizations have been widely studied. Ref. [35] proposed Radix-2² architectures, enabling high-throughput designs with four parallel paths, though at a higher resource cost. Ref. [36] introduced radix-4 with adjacent coefficient packing, reducing memory conflicts. Ref. [37] presented mixed-radix architectures capable of runtime switching between radix-2 and radix-4, improving flexibility. Among flexible architectures, ref. [38] proposed a Bi-Core design that compresses polynomial and twiddle-factor memory while supporting multiple cryptosystems (Kyber, Dilithium, Falcon). Ref. [39] further developed technology-independent NTT accelerators that are portable across FPGA platforms.

2.6. Emerging GPU and Cloud Implementations

Beyond FPGAs, GPU-based accelerations have gained attention. Ref. [40] explored GPU kernel optimizations, introducing techniques such as sliced-layer merging and depth-first scheduling, which reduced memory accesses. Microsoft researchers [41] proposed a cloud-based Kyber design that decomposes NTT into a butterfly core, stages, and input/output (I/O) steps, with pipeline sharing to improve scalability.

2.7. Quantum NTT Approaches

While most research has focused on classical and hardware domains, several studies have explored quantum NTT (QNTT). Gate-based modular arithmetic and QFT-based approaches offer theoretical acceleration but face challenges. Specifically, circuit depth, qubit coherence, and mapping modular arithmetic into quantum gates remain unresolved bottlenecks [14,15]. These limitations underscore the need for hybrid frameworks that bridge quantum and classical techniques, ensuring scalability in the future while remaining practical on near-term devices. Table 1 summarizes state-of-the-art NTT implementations.

Although numerous NTT accelerators have been reported in the literature, most existing studies focus on a specific radix organization, a particular hardware optimization strategy, or a single implementation domain. In contrast, this work provides a unified evaluation framework that examines radix-2, radix-4, and radix-8 NTT implementations within a common FPGA design methodology and investigates their corresponding software and quantum-circuit realizations. The resulting analysis enables a systematic assessment of implementation trade-offs, including latency, resource utilization, arithmetic complexity, and scalability considerations. Therefore, the contribution of this work lies not in proposing a new NTT algorithm but in providing a comparative evaluation of alternative implementation strategies and their implications for NTT acceleration in CRYSTALS-Kyber.

2.8. Limitations of Existing Works

Despite notable progress, we can summarize several limitations:

Latency and Throughput Bottlenecks: Even with optimized butterfly and modular reduction schemes, long critical paths can limit achievable throughput.
Resource Utilization Issues: Designs that achieve high throughput often do so at the expense of substantial DSP, BRAM, and LUT usage, making them unsuitable for constrained FPGAs.
Scalability Challenges: Many designs are tailored for specific platforms (e.g., Xilinx or Intel FPGAs), limiting portability and broader applicability.
Quantum Circuit Depth: Quantum NTT remains largely theoretical; existing designs suffer from excessive depth and qubit requirements, making them impractical on current quantum hardware.

These challenges motivate the hybrid framework proposed in this paper, which integrates classical optimizations, FPGA accelerations, and quantum exploration to provide a more scalable and resource-efficient NTT solution for CRYSTALS-Kyber.

2.9. Primary Research Gaps

Despite the significant progress achieved in NTT acceleration for lattice-based cryptography, several limitations remain in the existing literature. First, many reported architectures focus on a single radix organization and therefore provide limited insight into the implementation trade-offs associated with alternative radix decompositions. Second, most studies prioritize a specific optimization objective, such as throughput, latency, or area efficiency, making it difficult to compare architectural behavior under a common implementation framework. Third, existing investigations are typically confined to a single computational domain, such as software, FPGA hardware, or quantum computing, without providing a broader perspective on how NTT implementations behave across different computational paradigms. To address these limitations, this work presents a cross-platform evaluation framework centered on FPGA-based mixed-radix NTT acceleration for CRYSTALS-Kyber. The proposed approach investigates radix-2, radix-4, and radix-8 implementations within a unified FPGA design methodology while also examining corresponding software and quantum-circuit realizations. This enables systematic evaluation of implementation trade-offs, resource utilization, latency characteristics, and scalability considerations across multiple NTT implementation strategies.

3. Theoretical Background

LBC provides the mathematical foundation for CRYSTALS-Kyber, which is the focus of this study. Its security derives from the hardness of computational problems such as the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP). The CVP involves identifying the nearest lattice point to an arbitrary target point when only a poor-quality, or “bad,” basis is known. Solving CVP efficiently is hard and underpins the cryptographic strength of lattice schemes. Encryption in such systems encodes a message into a lattice structure, introduces a small random error vector, and produces a perturbed point near the lattice. Decryption requires the corresponding “good” basis to recover the original message by mapping the perturbed point to the nearest lattice point. Figure 2 illustrates a lattice space with non-lattice points, where rounding is performed to the nearest lattice vector. Decryption succeeds if the added error is small; otherwise, decoding errors occur.

The SVP, a related problem, requires finding the shortest non-zero vector in a lattice. CVP and SVP are proven to be hard under worst-case complexity assumptions, ensuring security against classical and quantum adversaries. The learning with errors (LWE) problem generalizes the classical learning parity with noise problem. It introduces small random errors into linear equations, making the recovery of secret vectors computationally hard. LWE is the problem of recovering the secret vector from pairs $(a, b)$, where $a$ is chosen uniformly at random and $b = \langle s, a \rangle + e \pmod{q}$, with $s$ as the secret vector and $e$ sampled from an error distribution (often discrete Gaussian). The search-LWE problem involves finding the secret vector given many LWE samples. In contrast, the decision-LWE variant asks whether a set of samples is drawn from an LWE distribution or from a uniform distribution. Variants such as Ring-LWE and Module-LWE extend the problem to polynomial rings and modules. Ring-LWE operates on single polynomials, while Module-LWE generalizes to vectors of polynomials, providing greater flexibility and efficiency in cryptographic construction, as illustrated in Figure 3.
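The LWE relation above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the dimension `n = 8` is a toy value, the modulus is borrowed from Kyber, and the error is drawn uniformly from a small interval rather than from a discrete Gaussian. The helper names `lwe_sample` and `centered` are hypothetical.

```python
import random

def lwe_sample(s, q, eta, rng):
    """Return one LWE pair (a, b) with b = <s, a> + e (mod q)."""
    a = [rng.randrange(q) for _ in s]      # a is uniform in Z_q^n
    e = rng.randint(-eta, eta)             # assumption: bounded uniform error, not Gaussian
    b = (sum(si * ai for si, ai in zip(s, a)) + e) % q
    return a, b

def centered(x, q):
    """Map x mod q to its representative in (-q/2, q/2]."""
    x %= q
    return x - q if x > q // 2 else x

rng = random.Random(0)
q, n, eta = 3329, 8, 2                     # toy dimension; q borrowed from Kyber
s = [rng.randrange(q) for _ in range(n)]   # secret vector
a, b = lwe_sample(s, q, eta, rng)

# Anyone who knows s can strip the inner product and is left with the small error.
residual = centered(b - sum(si * ai for si, ai in zip(s, a)), q)
```

Without `s`, the value `b` looks uniform in $Z_{q}$, which is the intuition behind the decision variant.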

CRYSTALS is a cryptographic suite that includes Kyber, a KEM, and Dilithium, a digital signature scheme. Both rely on the hardness of MLWE. In Kyber, public keys consist of polynomial matrices and vectors derived from uniform sampling, while secret keys are sampled from centered binomial distributions. Kyber provides security against chosen-plaintext (IND-CPA) and chosen-ciphertext (IND-CCA2) adversaries. IND-CPA security is achieved through lattice-based encryption, while the Fujisaki–Okamoto transformation [18,42] extends it to IND-CCA2 by introducing re-encryption and verification steps. Figure 3 illustrates the key establishment mechanism. The algorithm operates over the polynomial ring $R_{q} = Z_{q}[X] / (X^{n} + 1)$. Noise terms are sampled from a centered binomial distribution $\beta_{\eta}$, which ensures correctness while maintaining security. Parameters are selected to strike a balance between efficiency and cryptographic strength. Table 2 lists Kyber’s parameter sets, including values of $n$, $q$, and $k$, which correspond to security levels equivalent to the Advanced Encryption Standard (AES) variants AES-128 (Kyber512), AES-192 (Kyber768), and AES-256 (Kyber1024). The parameters $\eta_{1}$, $\eta_{2}$, $d_{u}$, $d_{v}$, and $\delta$ govern error generation, coefficient sizes, compression, and KEM failure probability.
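As a rough illustration of the centered binomial distribution $\beta_{\eta}$, the sketch below draws each coefficient as the difference of two sums of $\eta$ random bits. It assumes Python's `random` module as the bit source, whereas Kyber derives these bits from a SHAKE-based pseudorandom function; the function names are hypothetical.

```python
import random

def sample_cbd(eta, rng):
    """Draw one coefficient from the centered binomial distribution beta_eta."""
    a = sum(rng.getrandbits(1) for _ in range(eta))
    b = sum(rng.getrandbits(1) for _ in range(eta))
    return a - b                           # always lies in [-eta, eta]

def sample_noise_poly(n, eta, rng):
    """Draw n independent coefficients, i.e. one noise polynomial."""
    return [sample_cbd(eta, rng) for _ in range(n)]

# Assumption: a seeded PRNG stands in for Kyber's PRF output.
rng = random.Random(1)
noise = sample_noise_poly(256, 2, rng)
```

Small $\eta$ keeps the coefficients tightly bounded, which is what allows decryption to tolerate the accumulated error.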

Cryptographic primitives used in Kyber include SHA3-512, SHAKE-128, and SHAKE-256 for matrix generation, key derivation, and noise sampling. Auxiliary functions handle the encoding, decoding, and compression of polynomials. Polynomial multiplication is the fundamental operation in LBC. Several approaches exist, varying in efficiency and complexity. The classical schoolbook method multiplies two degree-$(n - 1)$ polynomials with $O(n^{2})$ complexity. Each coefficient is computed by summing pairwise products across the polynomial terms.

Cyclic convolution keeps the result at $n$ coefficients by reducing it modulo a polynomial. In Positive Wrapped Convolution (PWC), reduction is performed with $(X^{n} - 1)$, while in Negative Wrapped Convolution (NWC), reduction uses $(X^{n} + 1)$. Both methods retain $O(n^{2})$ complexity but enable compatibility with ring structures used in lattice cryptography. The NTT is an analog of the Discrete Fourier Transform (DFT), operating over the finite field $Z_{q}$. It reduces the complexity of polynomial multiplication to $O(n \log n)$. The transform maps input coefficients into the NTT domain, where multiplication becomes pointwise, followed by an inverse NTT (INTT) to recover the result. The existence of primitive roots of unity modulo a prime $q$ enables the NTT. These roots, referred to as twiddle factors, allow butterfly operations that combine and reduce coefficients efficiently. Figure 4 shows the integration of NTT and INTT across Kyber’s key generation, encapsulation, and decapsulation processes. When combined with NWC, Kyber avoids explicit reductions modulo $(X^{n} + 1)$. Pre-scaling the input coefficients by powers of $\zeta$, a square root of the primitive $n$-th root of unity used by the transform, and post-scaling the results by powers of $\zeta^{-1}$ achieves efficient negative cyclic convolution. Figure 5 illustrates the procedure. Efficient NTT evaluation employs recursive decomposition. Two methods dominate: decimation-in-time (DIT) using the CT butterfly and decimation-in-frequency (DIF) using the GS butterfly.

In CT, inputs $(a, b)$ are transformed into outputs: $(a + b \cdot ζ_{i}, a - b \cdot ζ_{i})$ .
In GS, inputs $(a, b)$ yield outputs: $(a + b, (a - b) \cdot ζ^{- i})$ .

Here, $\zeta_{i}$ denotes the twiddle factor at stage $i$. Figure 6 depicts both butterfly units (BUs). These operations are reused recursively, forming the computational backbone of forward and inverse NTT. Table 3 compares the complexity of different multiplication methods. Schoolbook and cyclic convolutions remain quadratic, while NTT reduces the cost to quasi-linear.
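A minimal sketch of how the two butterflies compose into iterative transforms is given below. It follows a commonly used in-place formulation with bit-reversed twiddle tables and is not claimed to be the authors' implementation. The toy parameters ($n = 8$, $q = 17$, root `psi = 3` with $3^{8} \equiv -1 \pmod{17}$) and all function names are assumptions.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def twiddle_tables(n, psi, q):
    """Powers of psi and psi^-1 stored in bit-reversed order."""
    bits = n.bit_length() - 1
    psi_inv = pow(psi, -1, q)
    fwd = [pow(psi, bit_reverse(i, bits), q) for i in range(n)]
    inv = [pow(psi_inv, bit_reverse(i, bits), q) for i in range(n)]
    return fwd, inv

def ntt_ct(a, fwd, q):
    """Forward negacyclic NTT built from Cooley-Tukey butterflies."""
    a, n = list(a), len(a)
    t, m = n, 1
    while m < n:
        t //= 2
        for i in range(m):
            zeta = fwd[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * zeta % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q    # CT: (a + b*zeta, a - b*zeta)
        m *= 2
    return a

def intt_gs(a, inv, q):
    """Inverse negacyclic NTT built from Gentleman-Sande butterflies."""
    a, n = list(a), len(a)
    t, m = 1, n
    while m > 1:
        h = m // 2
        for i in range(h):
            zeta_inv = inv[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * zeta_inv % q  # GS: (a + b, (a - b)*zeta^-1)
        t *= 2
        m = h
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]

q, n, psi = 17, 8, 3                       # toy values, not Kyber's parameters
fwd, inv = twiddle_tables(n, psi, q)
a = [1, 2, 3, 4, 5, 6, 7, 8]
roundtrip = intt_gs(ntt_ct(a, fwd, q), inv, q)
```

Because the forward transform leaves its output in bit-reversed order and the inverse consumes that order directly, this pairing needs no explicit bit-reversal pass.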

Kyber incorporates several optimizations to tailor NTT to its module structure:

Consistent use of NTT domain: All core operations, including matrix–vector and pointwise multiplications, are performed in the NTT domain to minimize repeated transforms.
NWC pre-/post-scaling: Implicit use of $ζ$ enables multiplication in $R_{q} = Z_{q} [X] / (X^{n} + 1)$ .
Fixed parameters: Kyber uses $n = 256$ and $q = 3329$ , which ensure the presence of primitive roots for efficient butterfly computation.
Module structure: Parallel NTT operations handle multiple polynomials when $k = 2, 3, 4$ , depending on the security level.
Resource efficiency: In-place operations and reduced memory movement enhance suitability for constrained hardware.
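The claim about fixed parameters can be checked numerically. The sketch below tests which power-of-two roots of unity exist modulo $q = 3329$ and searches for the smallest primitive 256th root; the value $\zeta = 17$ referenced in the comments is the root used in the Kyber specification, and the helper name is hypothetical.

```python
Q = 3329
N = 256

def is_primitive_root_of_unity(z, order, q):
    """True if z has multiplicative order exactly `order` (a power of two) mod q."""
    return pow(z, order, q) == 1 and pow(z, order // 2, q) != 1

# A root of unity of a given order exists only if that order divides q - 1.
has_256th_root = (Q - 1) % N == 0
has_512th_root = (Q - 1) % (2 * N) == 0

# Exhaustive search for the smallest primitive 256th root of unity modulo Q.
smallest_root = next(z for z in range(2, Q) if is_primitive_root_of_unity(z, N, Q))

# zeta = 17 is the root used in the Kyber specification.
zeta_ok = is_primitive_root_of_unity(17, N, Q)
```

Since $q - 1 = 2^{8} \cdot 13$, primitive 256th roots exist while 512th roots do not, which constrains how the negative wrapped transform can be organized for these parameters.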

Quantum computing merges physics, computer science, and mathematics to process information beyond classical limits. It leverages quantum mechanical phenomena such as superposition and interference to accelerate the solution of problems that are computationally infeasible for classical machines [14,43].

Superposition allows qubits to exist in a linear combination of basis states $|0\rangle$ and $|1\rangle$. This property enables intrinsic parallelism, as quantum processors can evaluate many states simultaneously [44]. Entanglement creates correlations between qubits such that the measurement of one qubit reveals information about another, regardless of distance. This phenomenon enhances computational power in quantum circuits [45]. Decoherence remains a key challenge. Qubits are highly sensitive to environmental noise, causing fragile quantum states to collapse into classical states [15]. Techniques such as ion-trap and superconducting processors attempt to mitigate decoherence through error correction and improved isolation.

Quantum technologies have a dual role in cryptography.

Quantum Key Distribution (QKD): Protocols such as BB84 and E91 use quantum states (single-photon polarization states in BB84, entangled pairs in E91) to exchange symmetric keys securely [46,47]. Any eavesdropping attempt alters the quantum state, revealing the presence of an adversary.
PQC: Conversely, quantum computers threaten classical cryptosystems. Shor’s algorithm breaks RSA, Diffie–Hellman, and ECC by enabling efficient factorization and discrete logarithm computation [48,49]. Grover’s algorithm reduces the security level of symmetric schemes by half, requiring longer keys for equivalent security strength [50]. Lattice-based cryptography (LBC) remains resistant to known quantum algorithms, though research on hidden subgroup problems continues.

The fundamental unit of quantum information is the qubit. Unlike classical bits, qubits may occupy superpositions $\alpha |0\rangle + \beta |1\rangle$, where $|\alpha|^{2} + |\beta|^{2} = 1$. Measurement collapses this state to a basis element according to the Born rule. Multi-qubit states grow exponentially, with an $n$-qubit system represented in a $2^{n}$-dimensional space. For example, simulating 100 qubits requires storage of $2^{100}$ amplitudes, which is infeasible for classical hardware. Quantum gates manipulate qubit states and form the building blocks of quantum circuits.

Identity Gate (I): Preserves the quantum state.
Hadamard Gate (H): Creates superpositions, rotating states on the Bloch sphere [13].
Pauli Gates (X, Y, Z): Represent bit-flip (X), phase-flip (Z), and combined bit- and phase-flip (Y) transformations of qubit states [13].
Controlled Gates (CX, CCX): Enable conditional operations by flipping a target qubit depending on control qubits. These gates are essential for constructing composite algorithms such as the QFT.
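The action of these gates on amplitudes can be sketched with a tiny state-vector simulator written in plain Python. This is an illustration only, not Qiskit code; the helper names `apply_1q` and `apply_cx` and the bit ordering (qubit 0 is the least significant bit of the basis index) are assumptions.

```python
import math

SQRT_HALF = 1 / math.sqrt(2)
H = [[SQRT_HALF, SQRT_HALF], [SQRT_HALF, -SQRT_HALF]]   # Hadamard
X = [[0, 1], [1, 0]]                                     # Pauli-X (bit flip)

def apply_1q(gate, state, target, n):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector."""
    out = list(state)
    for i in range(2 ** n):
        if not (i >> target) & 1:          # i has target bit 0, j has target bit 1
            j = i | (1 << target)
            a0, a1 = state[i], state[j]
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[j] = gate[1][0] * a0 + gate[1][1] * a1
    return out

def apply_cx(state, control, target, n):
    """Flip `target` on every basis state whose `control` bit is 1."""
    out = list(state)
    for i in range(2 ** n):
        if (i >> control) & 1 and not (i >> target) & 1:
            j = i | (1 << target)
            out[i], out[j] = state[j], state[i]
    return out

# Start in |00>, put qubit 0 in superposition, then entangle it with qubit 1.
state = [1.0, 0.0, 0.0, 0.0]
state = apply_1q(H, state, 0, 2)
state = apply_cx(state, 0, 1, 2)
probs = [abs(amp) ** 2 for amp in state]   # Born rule: probability = |amplitude|^2
```

The vector holds $2^{n}$ amplitudes, so the same approach becomes infeasible well before 100 qubits.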

Quantum circuits are constructed by chaining these gates, analogous to logic gates in classical digital design. Current quantum programming is performed at the gate level, as high-level abstractions remain limited in their scope.

4. Proposed Methodology

The framework follows a sequential evaluation methodology rather than a tightly coupled runtime architecture. Classical implementations establish algorithmic baselines, quantum implementations assess future feasibility, and FPGA implementations provide the primary deployment-oriented contribution. Results from these domains are integrated through comparative analysis rather than direct runtime interaction.

This section presents the methodological framework adopted for designing, implementing, and evaluating adaptive NTT architectures across classical, quantum, and hardware domains. The approach integrates algorithmic modeling, simulation, and hardware prototyping to construct a unified foundation for post-quantum cryptographic operations. The study began with the algorithmic foundations of polynomial convolution. Linear, positive-wrapped, and negative-wrapped convolutions were implemented in Python to validate their mathematical correctness. These implementations established a baseline for subsequent transformations and introduced cyclic reduction techniques required in LBC.
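The three convolutions mentioned above can be written directly from their definitions. The following is a minimal sketch rather than the authors' prototype: it assumes both inputs have the same length $n$ and uses the toy modulus $q = 17$; the function names are hypothetical.

```python
def linear_conv(a, b, q):
    """Schoolbook product: 2n - 1 coefficients, O(n^2) multiplications."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % q
    return c

def positive_wrapped(a, b, q):
    """Reduce the linear product modulo X^n - 1: the upper half is added back."""
    n, c = len(a), linear_conv(a, b, q)
    return [(c[k] + (c[k + n] if k + n < len(c) else 0)) % q for k in range(n)]

def negative_wrapped(a, b, q):
    """Reduce the linear product modulo X^n + 1: the upper half is subtracted."""
    n, c = len(a), linear_conv(a, b, q)
    return [(c[k] - (c[k + n] if k + n < len(c) else 0)) % q for k in range(n)]

a, b, q = [1, 2, 3, 4], [5, 6, 7, 8], 17   # toy inputs
lin = linear_conv(a, b, q)
pwc = positive_wrapped(a, b, q)
nwc = negative_wrapped(a, b, q)
```

These quadratic routines are useful mainly as reference results against which faster transforms can be checked.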

Classical NTT implementations were examined through two approaches. A matrix-based method was developed by computing intermediate values using modular exponentiation, with coefficients transformed according to the derived equations. This method requires that the sequence length be a power of two and that the twiddle factor be a valid primitive root of unity modulo $q$. While computationally expensive, it provided fine-grained insight into the transformation process.
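A matrix-based transform of this kind might look as follows. The sketch is an assumption about the general shape of such a baseline, not the authors' code: it builds the matrix with entries $\omega^{ij} \bmod q$ through modular exponentiation and checks the two preconditions stated above. The toy parameters ($n = 8$, $q = 17$, $\omega = 9$) are illustrative.

```python
def is_primitive_root(omega, n, q):
    """True if omega has multiplicative order exactly n modulo q."""
    return pow(omega, n, q) == 1 and all(pow(omega, k, q) != 1 for k in range(1, n))

def check_inputs(a, omega, q):
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    if not is_primitive_root(omega, n, q):
        raise ValueError("omega must be a primitive n-th root of unity mod q")

def matrix_ntt(a, omega, q):
    """Forward transform as a matrix-vector product with entries omega^(i*j) mod q."""
    check_inputs(a, omega, q)
    n = len(a)
    matrix = [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]
    return [sum(row[j] * a[j] for j in range(n)) % q for row in matrix]

def matrix_intt(a_hat, omega, q):
    """Inverse transform: same matrix built from omega^-1, scaled by n^-1."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in matrix_ntt(a_hat, pow(omega, -1, q), q)]

q, omega = 17, 9                           # toy values: 9 has order 8 modulo 17
a = [1, 2, 3, 4, 5, 6, 7, 8]
a_hat = matrix_ntt(a, omega, q)
recovered = matrix_intt(a_hat, omega, q)
```

Every output coefficient touches every input, which makes the $O(n^{2})$ cost explicit.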

In parallel, an iterative NTT design was implemented using CT butterflies for the forward transform and GS butterflies for the inverse transform. Twiddle factors were dynamically computed, and modular inverses were derived using the Extended Euclidean Algorithm. This iterative approach emphasized efficiency, reversibility, and alignment with the requirements of lattice-based cryptographic schemes.
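The modular inverses needed by the inverse transform (for example $n^{-1}$ and the inverse twiddle factors) can be obtained with the Extended Euclidean Algorithm. A minimal sketch, with hypothetical function names, is:

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        k = old_r // r
        old_r, r = r, old_r - k * r
        old_x, x = x, old_x - k * x
        old_y, y = y, old_y - k * y
    return old_r, old_x, old_y

def mod_inverse(a, q):
    """Inverse of a modulo q; raises ValueError when gcd(a, q) != 1."""
    g, x, _ = extended_gcd(a % q, q)
    if g != 1:
        raise ValueError("element is not invertible modulo q")
    return x % q

q = 3329
n_inv = mod_inverse(256, q)                # scaling factor for a 256-point inverse transform
zeta_inv = mod_inverse(17, q)              # inverse of the root used in the Kyber specification
```

For a prime modulus every non-zero element is invertible, so the error branch is reached only for multiples of $q$.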

Quantum arithmetic modules were constructed using Qiskit due to its maturity, community adoption, and integration with real quantum hardware. Fundamental arithmetic operations were realized through both gate-based and QFT-based designs. Binary addition circuits employed X, CX, and CCX gates to simulate carry propagation, with ancillary qubits restored to maintain reversibility.

Figure 7 illustrates the implementation. The QFT was employed to design addition and subtraction circuits, in which controlled phase rotations replaced classical carry chains, as shown in Figure 8 and Figure 9. Multiplication was achieved through repeated QFT-based additions, using a decrementing multiplier register and an accumulator register to complete the operation, as depicted in Figure 10.
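The arithmetic idea behind QFT-based addition can be illustrated without a quantum SDK. The sketch below is a plain state-vector calculation, not a Qiskit circuit and not the authors' design: it applies a discrete Fourier transform to the basis state $|a\rangle$, multiplies amplitude $k$ by the phase $e^{2\pi i b k / 2^{m}}$ (the net effect of the phase rotations), and transforms back. All function names are hypothetical.

```python
import cmath
import math

def qft(amps, inverse=False):
    """Dense quantum Fourier transform of a state vector (O(dim^2), toy sizes only)."""
    dim = len(amps)
    sign = -1 if inverse else 1
    return [sum(amps[src] * cmath.exp(sign * 2j * cmath.pi * src * dst / dim)
                for src in range(dim)) / math.sqrt(dim)
            for dst in range(dim)]

def qft_add(a, b, num_qubits):
    """Compute (a + b) mod 2^num_qubits using phases applied in the Fourier basis."""
    dim = 2 ** num_qubits
    state = [0j] * dim
    state[a % dim] = 1                     # basis state |a>
    state = qft(state)
    # Phase rotations replace the carry chain: amplitude k picks up e^(2*pi*i*b*k/dim).
    state = [amp * cmath.exp(2j * cmath.pi * b * k / dim) for k, amp in enumerate(state)]
    state = qft(state, inverse=True)
    return max(range(dim), key=lambda k: abs(state[k]))   # most probable outcome

def qft_multiply(a, b, num_qubits):
    """Multiply by repeated addition: decrement a counter while accumulating a."""
    acc, counter = 0, b
    while counter > 0:
        acc = qft_add(acc, a, num_qubits)
        counter -= 1
    return acc

total = qft_add(5, 6, 4)                   # 4-qubit register, arithmetic modulo 16
difference = qft_add(3, -5, 4)             # subtraction uses the opposite phase sign
product = qft_multiply(3, 4, 4)
```

Subtraction reuses the same structure with negated phases, which matches the observation that addition and subtraction circuits differ only in rotation direction.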

The quantum NTT was then implemented using recursive CT butterflies incorporating QFT-based modular addition, subtraction, and multiplication. Twiddle factor multiplication was realized through controlled modular operations, and a simplified two-stage pipeline was demonstrated in Figure 11. Qiskit Aer noise models were integrated to emulate decoherence and assess circuit stability under near-term quantum computing conditions, accounting for hardware limitations. The quantum experiments were conducted entirely using classical simulation backends provided by Qiskit Aer. Accordingly, the reported execution characteristics reflect simulator complexity and circuit modeling overhead rather than the performance of actual quantum hardware.

Due to the current limitations of quantum hardware and quantum circuit simulators, the quantum experiments were intentionally conducted using reduced toy parameters and small polynomial vectors. These experiments were designed solely to demonstrate the feasibility of QFT-based modular arithmetic and recursive butterfly construction rather than to implement the full CRYSTALS-Kyber parameter sets (e.g., n = 256 and q = 3329).

An adaptive radix-mixed NTT architecture was proposed and developed to achieve hardware efficiency. This design enabled configurable selection among radix-2, radix-4, and radix-8 butterfly organizations for design-space exploration and resource-performance evaluation, depending on the polynomial size, thereby balancing throughput and resource utilization. Twiddle factors were precomputed and stored in LUT-based memory, while modular multiplications were optimized using either Barrett or Montgomery reduction.

The architecture was organized into several modules. An adaptive radix controller orchestrated butterfly scheduling, while dedicated BUs implemented radix-specific transformations. The memory hierarchy utilized BRAM with cyclic partitioning to minimize read–write conflicts, and a modular arithmetic co-processor performed reductions using either the Barrett or Montgomery methods.
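The index mapping behind cyclic partitioning can be sketched as follows, assuming the usual HLS convention that element i of an array partitioned with factor F is placed in bank i mod F at offset ⌊i/F⌋. The partition factor and helper names are illustrative and are not taken from the synthesized design.

```python
def cyclic_bank(index, factor):
    """Map a flat array index to (bank, offset) under cyclic partitioning."""
    return index % factor, index // factor


def conflict_free(indices, factor):
    """True if all indices accessed in the same cycle fall in distinct banks."""
    banks = [cyclic_bank(i, factor)[0] for i in indices]
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    factor = 4
    print(conflict_free([0, 1, 2, 3], factor))   # consecutive coefficients
    print(conflict_free([0, 128], factor))       # butterfly pair with stride 128
```

Under this mapping, F consecutive coefficients always fall in distinct banks, whereas two operands separated by a stride that is a multiple of F share a bank. The partitioning factor and the butterfly access schedule therefore have to be chosen together.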

HLS directives were applied to optimize FPGA performance, including loop unrolling, pipelining, and dataflow. Resource partitioning techniques further reduced access bottlenecks and improved concurrency.

The methodology can be summarized as an integration of Python-based prototypes for correctness verification, iterative CT and GS implementations as classical baselines, Qiskit circuits for quantum feasibility studies under near-term quantum conditions, together with FPGA-based architectural designs for hardware evaluation.

The proposed framework supports multiple radix organizations, including radix-2, radix-4, and radix-8 implementations. For an N-point Number Theoretic Transform (NTT), the number of computational stages required for a radix-r decomposition is given by:

S_{r} = \log_{r} N

(1)

where N denotes the transform size and r represents the selected radix. Accordingly, when N is an exact power of r, the transform length may be expressed as:

N = r^{S_{r}}

(2)

Higher radix values reduce the number of computational stages required to complete the transform. For example, for N = 256, radix-2 requires eight stages and radix-4 requires four stages. Because 256 is not an exact power of 8 (\log_{8} 256 ≈ 2.67), a radix-8 organization requires three stages, at least one of which must use a smaller radix (for example, 8 × 8 × 4). However, the reduction in stage count is accompanied by increased butterfly complexity, additional arithmetic operations, more demanding scheduling requirements, and potentially higher resource utilization. The configurable radix framework was therefore designed to evaluate the implementation trade-offs associated with different radix organizations under a common FPGA architecture. Rather than assuming that a higher radix always yields superior performance, the proposed approach enables comparative assessment of latency, initiation interval, resource utilization, and implementation complexity across alternative radix decompositions.
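The stage counts quoted above follow from Equation (1). The sketch below shows one plausible way to obtain a three-stage radix-8 schedule for N = 256, assuming that the stage which cannot be filled by a full radix-8 butterfly falls back to a smaller power-of-two radix. This fallback rule and the function name `stage_plan` are assumptions made for illustration; the source does not specify how the radix-8 configuration is decomposed.

```python
def stage_plan(n, radix):
    """Split an n-point transform into stages of the given radix.

    n must be a power of two. If n is not an exact power of `radix`,
    the last stage uses the remaining (smaller) power-of-two radix.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if radix not in (2, 4, 8):
        raise ValueError("radix must be 2, 4, or 8")
    stages = []
    remaining = n
    while remaining >= radix:
        stages.append(radix)
        remaining //= radix
    if remaining > 1:
        stages.append(remaining)      # leftover factor smaller than the radix
    return stages


if __name__ == "__main__":
    for r in (2, 4, 8):
        plan = stage_plan(256, r)
        print(f"radix-{r}: {len(plan)} stages {plan}")
```

For N = 256 this yields eight radix-2 stages, four radix-4 stages, and three stages (8, 8, 4) for the radix-8 organization.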

5. Results and Analysis

5.1. Classical NTT Evaluation

The comparative study of classical NTT implementations focused on an iterative Cooley–Tukey (CT) implementation and a naive matrix-based transform used as a conceptual baseline for algorithmic complexity comparison. The direct matrix-based formulation requires O(n²) arithmetic operations because each output coefficient depends on all input coefficients. In contrast, the iterative Cooley–Tukey NTT reduces the computational cost to O(n log n) through recursive butterfly decomposition. The proposed FPGA implementation preserves the same asymptotic O(n log n) complexity while improving practical execution efficiency through hardware parallelism, pipelining, and optimized modular arithmetic. The quantum implementation serves as a proof-of-concept realization and, therefore, is evaluated primarily from a feasibility perspective rather than practical complexity advantage under current hardware constraints.

Experiments were conducted on a Python platform running on an Intel Core i7 processor (2.9 GHz) with 16 GB RAM. Performance was measured using execution time. Figure 12 summarizes the results and illustrates the time comparison.
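The two formulations compared in this subsection can be sketched as follows. This is an illustrative reconstruction, not the code used for the reported measurements: it implements a plain cyclic NTT over the Kyber modulus q = 3329 using ζ = 17, a primitive 256th root of unity modulo q, and omits the negacyclic structure of the actual Kyber transform. Function names are hypothetical.

```python
Q = 3329          # Kyber modulus
ZETA = 17         # primitive 256th root of unity modulo Q


def root_of_unity(n):
    """Primitive n-th root of unity mod Q for n a power of two dividing 256."""
    if n < 1 or 256 % n:
        raise ValueError("n must be a power of two dividing 256")
    return pow(ZETA, 256 // n, Q)


def ntt_naive(a):
    """Direct matrix-vector form: O(n^2) modular multiplications."""
    n = len(a)
    w = root_of_unity(n)
    return [sum(a[j] * pow(w, j * k, Q) for j in range(n)) % Q
            for k in range(n)]


def bit_reverse_permute(a):
    """Return a copy of a with indices bit-reversed."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, value in enumerate(a):
        rev = 0
        for b in range(bits):
            if (i >> b) & 1:
                rev |= 1 << (bits - 1 - b)
        out[rev] = value
    return out


def ntt_ct(a, inverse=False):
    """Iterative radix-2 Cooley-Tukey NTT: O(n log n) multiplications."""
    n = len(a)
    w_n = root_of_unity(n)
    if inverse:
        w_n = pow(w_n, -1, Q)
    a = bit_reverse_permute(a)
    length = 2
    while length <= n:
        w_len = pow(w_n, n // length, Q)   # twiddle step for this stage
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w % Q
                a[start + j] = (u + v) % Q          # butterfly sum
                a[start + j + half] = (u - v) % Q   # butterfly difference
                w = w * w_len % Q
        length *= 2
    if inverse:
        n_inv = pow(n, -1, Q)
        a = [x * n_inv % Q for x in a]
    return a


if __name__ == "__main__":
    x = [(7 * i + 3) % Q for i in range(256)]
    print(ntt_ct(x) == ntt_naive(x), ntt_ct(ntt_ct(x), inverse=True) == x)
```

Both routines compute the same transform. The difference lies in the operation count: the naive form evaluates every output as a full inner product, whereas the iterative form reuses partial results across log₂ n butterfly stages.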

The Python implementations were developed primarily for correctness verification, algorithmic illustration, and relative complexity analysis. They were not intended to represent optimized cryptographic software implementations, such as AVX2-accelerated or assembly-level Kyber libraries used in practical deployments.

The CT method achieved a forward NTT time of 752.8 μs compared to 2728.5 μs for the matrix approach (≈3.6× faster) and an inverse NTT time of 432.2 μs vs. 2723.4 μs (≈6.3× faster). This performance difference reflects the nested loop structure of the matrix-based implementation, which incurs quadratic, O(n²), complexity.

Both approaches were tested using polynomial sizes ranging from 16 to 200 coefficients to illustrate relative algorithmic scaling behavior and verify implementation correctness across multiple input dimensions. These experiments were intended as methodological complexity studies rather than benchmarks of practical CRYSTALS-Kyber deployments, which employ a fixed polynomial degree of (n = 256). The execution time for the CT method exhibited near-linear growth with respect to the input size, whereas the matrix-based approach grew quadratically. These findings confirm the well-established algorithmic efficiency advantages of structured iterative NTT approaches over direct matrix formulations. The presented Python-based experiments are intended primarily for correctness verification, complexity illustration, and baseline validation within the proposed cross-platform evaluation framework. Consequently, the results should be interpreted as conceptual and methodological comparisons rather than benchmarks against production-grade cryptographic software implementations or as claims of algorithmic novelty.
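The scaling trend described above can be made concrete by counting modular multiplications rather than measuring time. The sketch below uses an idealized model, assuming power-of-two sizes and one twiddle multiplication per radix-2 butterfly; it is an operation count, not a reproduction of the reported timings, and the function names are illustrative.

```python
def naive_multiplications(n):
    """Matrix-based transform: one multiplication per matrix entry."""
    return n * n


def ct_multiplications(n):
    """Radix-2 Cooley-Tukey: n/2 butterflies in each of log2(n) stages."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = n.bit_length() - 1
    return (n // 2) * stages


if __name__ == "__main__":
    for n in (16, 32, 64, 128, 256):
        naive = naive_multiplications(n)
        fast = ct_multiplications(n)
        print(f"n={n:3d}  naive={naive:6d}  CT={fast:5d}  ratio={naive / fast:.1f}")
```

In this model the ratio between the two counts is 2n / log₂ n, so the gap widens as the transform size grows. Measured speedups in an interpreted language are typically smaller because loop and interpreter overheads are not captured by the count.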

In CRYSTALS-Kyber, all standardized parameter sets use (n = 256). The adaptive FPGA architecture proposed in this work is designed with this target size in mind, while the software-based scalability experiments were conducted to demonstrate comparative growth characteristics between alternative NTT formulations. Finally, hardware memory requirements of the proposed accelerator are represented through BRAM utilization and synthesis statistics reported in Table 4 and Table 5.

5.2. Quantum NTT Evaluation

Quantum experiments were conducted using the Qiskit qasm simulator backend to evaluate the structural feasibility of QFT-based NTT circuits under controlled simulation conditions. The obtained timing and memory observations therefore represent classical simulation costs rather than deployable quantum hardware performance. Due to current limitations in qubit availability, circuit depth, coherence time, and simulator scalability, the quantum experiments were restricted to small proof-of-concept polynomial vectors (e.g., [1,2,3,4]) and simplified modular settings. These reduced parameters were selected exclusively for a feasibility demonstration of QFT-based arithmetic and recursive NTT construction, rather than for full-scale implementation of Kyber parameters (n = 256, q = 3329). Results showed that quantum NTT was significantly slower and consumed more memory than the classical CT implementation.

Scaling the presented proof-of-concept circuits to the CRYSTALS-Kyber polynomial degree (n = 256) would require substantially larger quantum resources, including increased logical qubit counts, deeper arithmetic circuits, and significantly greater error-correction overhead. Consequently, the present implementation should be viewed as a feasibility study of quantum NTT construction rather than a practical realization of Kyber-scale processing.

Execution overhead stemmed from limited qubit availability, the high circuit depth introduced by recursive butterfly operations, and gate fidelity issues in multi-qubit operations. Additional burdens included simulation overhead, the exponential growth in representing quantum states, and the need for auxiliary registers. Noise-aware simulations highlighted the sensitivity of quantum NTT to decoherence. Increased circuit depth degraded accuracy, especially in modular exponentiation. Although Qiskit Aer allowed partial mitigation through optimized circuit design, error rates remained high.
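The exponential cost of state-vector simulation mentioned above can be estimated with a short calculation. The sketch assumes double-precision complex amplitudes (16 bytes each) and a plain binary encoding with one register per coefficient; ancilla qubits are ignored, and the toy modulus 17 is chosen here only for illustration. None of these values are taken from the reported experiments.

```python
def data_register_qubits(n_coeffs, modulus):
    """Qubits needed to hold n_coeffs values in [0, modulus) in binary."""
    bits_per_coeff = (modulus - 1).bit_length()
    return n_coeffs * bits_per_coeff


def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for a dense state vector with 2**n_qubits amplitudes."""
    return (1 << n_qubits) * bytes_per_amplitude


if __name__ == "__main__":
    toy = data_register_qubits(4, 17)          # e.g. the vector [1, 2, 3, 4]
    kyber = data_register_qubits(256, 3329)    # full Kyber polynomial
    print(toy, statevector_bytes(toy) / 2**20, "MiB")
    print(kyber, "qubits for the Kyber data registers alone")
```

Under these assumptions, a four-coefficient toy vector already needs 20 data qubits (16 MiB of amplitudes), while a full Kyber polynomial would need 3072 data qubits before any ancillas are counted, which is far beyond dense state-vector simulation.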

These findings emphasize that quantum NTT is currently impractical for large-scale cryptographic workloads. However, the experiments demonstrate proof-of-concept feasibility and highlight the potential of quantum parallelism for future implementations once hardware matures.

5.3. FPGA-Based Adaptive Radix NTT

The adaptive radix-mixed NTT was synthesized and evaluated on a Xilinx Virtex UltraScale+ FPGA (xcvu9p-fsgd2104-3-e) through the Amazon AWS cloud platform. Development was performed using the Vivado Design Suite (2020.2) with HLS optimizations. The synthesis results are summarized in Table 4. The table reports the performance of radix-2, radix-4, and radix-8 configurations using Barrett and Montgomery reductions. Montgomery reduction consistently achieved lower clock periods (4.32–5.55 ns) than Barrett reduction (6.83–8.64 ns), resulting in higher maximum operating frequencies of up to 231.48 MHz. Latency analysis showed that radix-2 achieved 32,804 cycles with Montgomery reduction, compared to 65,571 cycles with Barrett reduction. Resource utilization was also favorable: Montgomery consistently required fewer DSPs, LUTs, and FFs than Barrett across all radices, while BRAM usage remained constant at four blocks. From a design perspective, radix-2 provided the most efficient trade-off between performance and resource utilization, particularly for constrained FPGA environments. Although higher-radix NTT algorithms reduce the number of theoretical transform stages, the synthesized radix-4 and radix-8 implementations exhibited higher cycle latency in this design because of increased butterfly complexity, resource sharing, twiddle-factor scheduling, and higher initiation intervals. Specifically, the initiation interval increased from 1 for radix-2 to 3 for radix-4 and 6 for radix-8, which offset the theoretical stage reduction. Therefore, radix-2 achieved the best latency and resource-efficiency trade-off in the implemented HLS architecture. These findings are consistent with the expected scaling trade-offs in mixed-radix architectures.
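The relationship between the reported clock period, maximum frequency, and cycle latency can be checked with a short calculation. The helper names below are illustrative, and the wall-clock latency is a value derived here from the reported figures rather than a number stated in the synthesis reports.

```python
def fmax_mhz(clock_period_ns):
    """Maximum operating frequency implied by a clock period."""
    return 1000.0 / clock_period_ns


def latency_us(cycles, clock_period_ns):
    """Wall-clock latency of a run of `cycles` cycles at the given period."""
    return cycles * clock_period_ns / 1000.0


if __name__ == "__main__":
    # Radix-2 with Montgomery reduction: 4.32 ns clock period, 32,804 cycles.
    print(round(fmax_mhz(4.32), 2))            # 231.48 (MHz)
    print(round(latency_us(32804, 4.32), 2))   # 141.71 (microseconds)
```

Because both the cycle count and the clock period enter this product, a configuration with fewer theoretical stages can still be slower in wall-clock terms if its initiation interval or clock period is larger.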

The latency results in Table 4 should be interpreted in the context of the synthesized HLS implementation rather than ideal radix complexity alone. In this implementation, radix-4 and radix-8 require more complex butterfly scheduling and additional modular arithmetic operations, while resource sharing limits the number of operations that can be executed concurrently. This increases the initiation interval and introduces memory-access and control overhead. Consequently, the theoretical reduction in NTT stages for higher-radix designs is not directly reflected in the measured cycle latency.

Because CRYSTALS-Kyber uses a fixed polynomial degree of (n = 256), the mixed-radix controller should be interpreted primarily as a configurable design-space exploration mechanism rather than a required runtime adaptation feature. The synthesis results indicate that the radix-2 Montgomery configuration provides the most favorable resource-latency trade-off for the evaluated Kyber-oriented implementation.

5.4. Comparative Analysis

To evaluate the competitiveness of the proposed architecture, a comparison with state-of-the-art NTT implementations was conducted. The results are summarized in Table 5. The proposed design required only five DSPs, significantly fewer than the 36 DSPs reported in [52] and the 26 DSPs reported in [37]. LUT and FF usage were also markedly lower than those reported in [37,53]. In terms of clock performance, the design achieved a clock period of 4.322 ns and an estimated frequency of 231.48 MHz, outperforming many contemporary FPGA-based NTT accelerators. These results confirm the resource efficiency and strong competitiveness of the adaptive design.

Beyond resource utilization and timing metrics, the proposed architecture differs from prior NTT accelerators in several respects. Unlike fixed-radix architectures reported in [29,37], the proposed framework supports adaptive radix operation through configurable radix-2, radix-4, and radix-8 processing modes. This flexibility enables exploration of different resource-performance trade-offs depending on implementation constraints and application requirements. Furthermore, the architecture evaluates both Barrett and Montgomery reduction strategies within a unified framework, allowing systematic assessment of arithmetic-resource trade-offs. The design also incorporates LUT-based twiddle-factor management and resource-aware scheduling techniques intended to reduce DSP utilization while maintaining competitive operating frequencies.

For the FPGA implementation, memory utilization is assessed through BRAM resource consumption rather than software memory allocation metrics.

While several prior works achieve higher throughput through aggressive parallelization and dedicated datapath replication, such approaches often incur substantial DSP and logic overhead. The proposed design prioritizes resource efficiency and configurability, achieving competitive operating frequencies while maintaining significantly lower DSP utilization. This design philosophy differs from throughput-oriented accelerators and targets practical deployment scenarios where hardware resources are constrained.

The hardware evaluation focuses on resource utilization, latency, and operating frequency metrics obtained from FPGA synthesis and implementation reports. Detailed power characterization was outside the scope of the present study.

Recent FPGA accelerators for CRYSTALS-Kyber have increasingly focused on highly specialized optimization techniques, including merged-NTT architectures, memory-efficient butterfly scheduling, pipelined modular arithmetic units, and architecture-specific throughput enhancements. These designs often prioritize maximum performance for fixed Kyber parameters and specific deployment scenarios. In contrast, the proposed implementation emphasizes evaluating configurable radix organizations within a unified FPGA framework. While this design objective may differ from that of highly optimized application-specific accelerators, it enables a systematic investigation of the trade-offs among radix decomposition, implementation complexity, resource utilization, and latency. Consequently, the presented results should be interpreted as an architectural evaluation of alternative NTT implementation strategies rather than as a direct replacement for highly specialized Kyber acceleration architectures.

5.5. Discussion

The hybrid methodology demonstrates clear advantages. Classical CT implementations establish efficiency and scalability, FPGA acceleration provides practical deployment through resource-efficient designs, and quantum circuits offer a vision of future computational potential. This combination ensures that current performance requirements are met while also maintaining long-term scalability.

Limitations remain evident. Quantum NTT currently remains impractical for Kyber-scale deployment due to excessive circuit depth, large qubit requirements, error propagation, and simulator complexity. Consequently, the quantum experiments in this work were limited to reduced proof-of-concept parameter settings and should be interpreted as exploratory demonstrations rather than production-scale implementations.

From a security perspective, the proposed design maintains Kyber’s cryptographic strength. All operations conform to the mathematical structure of Module-LWE and preserve the IND-CPA and IND-CCA2 security guarantees. Introducing adaptive radix and hybrid implementation strategies does not alter the underlying hardness assumptions. Instead, the work demonstrates that Kyber can be efficiently supported across heterogeneous platforms while preserving its cryptographic security guarantees.

The quantum experiments presented in this work should not be interpreted as a practical deployment model involving runtime interaction between FPGA accelerators and remote quantum simulators. Rather, the quantum component serves as an exploratory feasibility study, while the FPGA architecture constitutes the primary deployment-oriented contribution. Consequently, the proposed framework is intended for comparative evaluation and future research exploration rather than immediate heterogeneous execution.

Although quantum NTT algorithms offer theoretical asymptotic advantages, practical deployment for Kyber-scale parameters remains constrained by current hardware limitations. Scaling the proposed proof-of-concept circuits to n = 256 would substantially increase circuit depth, qubit requirements, and fault-tolerant resource overhead. A comprehensive resource-estimation study for full CRYSTALS-Kyber parameter sets remains an important direction for future work.

A limitation of the current adaptive radix implementation is that radix-4 and radix-8 were synthesized under resource-sharing constraints, which increased initiation intervals and cycle latency. Future work will investigate fully parallel radix-4/radix-8 butterfly pipelines and optimized memory banking to better exploit the theoretical latency advantages of higher-radix NTT algorithms.

While the proposed architecture preserves the underlying cryptographic security assumptions of CRYSTALS-Kyber, practical hardware implementations may remain vulnerable to implementation-level side-channel attacks, including timing, power, and electromagnetic analysis. The present study focuses on architectural optimization and performance evaluation and therefore does not include a dedicated side-channel assessment. Investigation of leakage characteristics, constant-time implementation properties, and countermeasures against side-channel attacks represents an important direction for future work.

Recent Kyber-oriented accelerator architectures have explored additional optimization techniques including NTT merging, polynomial decomposition strategies, and hybrid arithmetic structures. These approaches generally prioritize throughput maximization through specialized datapath optimization, whereas the present work focuses on configurable mixed-radix operation and resource-efficient implementation trade-offs.

6. Conclusions

A hybrid Number Theoretic Transform (NTT) framework for CRYSTALS-Kyber has been presented. The framework integrates optimized classical algorithms, Qiskit-based quantum prototypes, and an adaptive radix-mixed FPGA accelerator. The design objectives were to reduce NTT latency, lower hardware resource consumption, and explore quantum acceleration while maintaining Kyber’s security properties. Experimental evaluation demonstrated clear benefits. Classical iterative Cooley–Tukey NTT was shown to outperform a naive matrix-based reference implementation in conceptual Python-based experiments intended for algorithmic comparison. FPGA synthesis on a Xilinx Virtex UltraScale+ confirmed that the adaptive radix architecture yields strong trade-offs in terms of timing and area. Using Montgomery reduction, the radix-2 implementation achieved an estimated maximum F_{\max} of 231.48 MHz, a clock period of 4.32 ns, low latency (≈32,804 cycles), and minimal DSP usage (5 DSPs).

Quantum prototypes implemented with Qiskit provided a simulation-based proof of concept for QFT-based NTT circuit construction and feasibility analysis. These experiments were intentionally performed using reduced parameter sizes because current quantum hardware and simulation environments cannot yet efficiently support Kyber-scale modular arithmetic. Even at these reduced sizes, the quantum circuits incurred substantial circuit depth and simulation overhead under noise models.

Several limitations were identified. Quantum NTT remains constrained by qubit counts, circuit depth, and decoherence, which limit its applicability on near-term quantum hardware. Because the experiments were conducted on classical simulators rather than physical quantum hardware, the reported metrics should not be interpreted as practical deployment performance. FPGA implementations encounter scalability challenges as polynomial degrees and modulus sizes grow; larger parameter sets will increase DSP, BRAM, and logic requirements and may exceed the capacity of mid-range devices.

Security analysis confirmed that the proposed engineering changes do not alter the underlying Module-LWE hardness assumptions. The adaptive design and configurable radix selection preserve Kyber’s mathematical structure and the IND-CPA and IND-CCA2 security properties described in the paper.

Future work will pursue three directions. First, quantum circuit optimizations will be investigated to reduce depth and ancilla usage, including the use of approximate arithmetic and hybrid quantum-classical subroutines. Second, multi-FPGA and heterogeneous CPU+FPGA mappings will be explored to scale the adaptive architecture to larger Kyber parameter sets. Third, a formal side-channel and implementation security assessment will be performed to verify that performance optimizations remain constant-time and resistant to practical attacks. These efforts aim to transition the hybrid NTT framework from prototype to deployable implementations without weakening cryptographic assurances. The hybrid approach strikes a balance between the immediate practicality of FPGA and a forward-looking quantum research agenda. The result is a resource-efficient, high-performance NTT solution compatible with CRYSTALS-Kyber and amenable to further optimization as quantum and FPGA technologies evolve.

Future research will investigate resource-estimation models for Kyber-scale quantum NTT implementations, including logical-qubit requirements, circuit-depth growth, and fault-tolerant execution costs.

Additionally, future work will extend the hardware evaluation through detailed power characterization and Power–Area–Timing (PAT) analysis to further quantify implementation trade-offs across alternative radix configurations and modular reduction strategies.

Author Contributions

Conceptualization, Q.A.A.-H.; Methodology, Y.A.; Software, Y.A.; Validation, Q.A.A.-H. and A.A.; Formal analysis, Y.A.; Investigation, Y.A., Q.A.A.-H. and A.A.; Resources, Q.A.A.-H. and A.A.; Writing—original draft, Y.A. and Q.A.A.-H.; Writing—review & editing, Y.A., Q.A.A.-H. and A.A.; Visualization, Y.A.; Supervision, Q.A.A.-H.; Project administration, Q.A.A.-H.; Funding acquisition, Q.A.A.-H. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by funding from Prince Sattam bin Abdulaziz University, project number PSAU/2026/R/1447.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors acknowledge and thank DSR at Prince Sattam bin Abdulaziz University for its technical and financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kocher, P.C. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. In Proceedings of the Advances in Cryptology—CRYPTO ’96, 16th Annual International Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 1996, Proceedings; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1996; Volume 1109, pp. 104–113.
2. Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology—CRYPTO ’85, Santa Barbara, CA, USA, 18–22 August 1985, Proceedings; Williams, H.C., Ed.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1985; Volume 218, pp. 417–426.
3. Shor, P.W. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Comput. 1997, 26, 1484–1509.
4. Liu, Y.K.; Moody, D. Post-quantum cryptography and the quantum future of cybersecurity. Phys. Rev. Appl. 2024, 21, 040501.
5. National Institute of Standards and Technology (NIST). PQC Standardization Process: Announcing Four Candidates to Be Standardized, Plus Fourth Round Candidates; NIST Computer Security Resource Center: Gaithersburg, MD, USA, 2022.
6. Avanzi, R.; Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D.; et al. CRYSTALS-Kyber algorithm specifications and supporting documentation. NIST PQC Round 2019, 2, 1–43.
7. Regev, O. Lattice-based cryptography. In Proceedings of the Annual International Cryptology Conference; Springer: Berlin/Heidelberg, Germany, 2006; pp. 131–141.
8. Albrecht, M.R.; Orsini, E.; Paterson, K.G.; Peer, G.; Smart, N.P. Tightly secure ring-LWE based key encapsulation with short ciphertexts. In Proceedings of the European Symposium on Research in Computer Security; Springer: Berlin/Heidelberg, Germany, 2017; pp. 29–46.
9. Nejatollahi, H.; Dutt, N.; Ray, S.; Regazzoni, F.; Banerjee, I.; Cammarota, R. Post-quantum lattice-based cryptography implementations: A survey. ACM Comput. Surv. (CSUR) 2019, 51, 1–41.
10. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
11. Gentleman, W.M.; Sande, G. Fast Fourier Transforms: For fun and profit. In Proceedings of the American Federation of Information Processing Societies: Proceedings of the AFIPS ’66 Fall Joint Computer Conference, San Francisco, CA, USA, 7–10 November 1966; AFIPS Conference Proceedings; AFIPS/ACM/Spartan Books: Washington, DC, USA, 1966; Volume 29, pp. 563–578.
12. Scott, M. A Note on the Implementation of the Number Theoretic Transform. In Proceedings of the Cryptography and Coding—16th IMA International Conference, IMACC 2017, Oxford, UK, 12–14 December 2017, Proceedings; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10655, pp. 247–258.
13. Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information; Cambridge University Press: Cambridge, UK, 2010.
14. Montanaro, A. Quantum Algorithms: An Overview. npj Quantum Inf. 2016, 2, 15023.
15. Shiddiq, M.; Komijani, D.; Duan, Y.; Gaita-Ariño, A.; Coronado, E.; Hill, S. Enhancing coherence in molecular spin qubits via atomic clock transitions. Nature 2016, 531, 348–351.
16. Nguyen, D.T.; Dang, V.B.; Gaj, K. A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms. In Proceedings of the International Conference on Field-Programmable Technology, FPT 2019, Tianjin, China, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 371–374.
17. Fritzmann, T.; Sharif, U.; Müller-Gritschneder, D.; Reinbrecht, C.; Schlichtmann, U.; Sepúlveda, J. Towards Reliable and Secure Post-Quantum Co-Processors based on RISC-V. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, Italy, 25–29 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1148–1153.
18. Duong-Ngoc, P.; Kim, Y.; Lee, H. Efficient k-Parallel Pipelined NTT Architecture for Post Quantum Cryptography. In Proceedings of the International SoC Design Conference, ISOCC 2020, Yeosu, Republic of Korea, 21–24 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 212–213.
19. Ma, L.; Wu, X.; Bai, G. A low cost high performance polynomial multiplier design for FPGA implementation. In Proceedings of the 2020 IEEE 3rd International Conference on Electronics Technology (ICET); IEEE: Piscataway, NJ, USA, 2020; pp. 83–86.
20. Xu, J.; Wang, Y.; Liu, J.; Wang, X. A general-purpose number theoretic transform algorithm for compact RLWE cryptoprocessors. In Proceedings of the 2020 IEEE 14th International Conference on Anti-Counterfeiting, Security, and Identification (ASID); IEEE: Piscataway, NJ, USA, 2020; pp. 1–5.
21. Xin, G.; Han, J.; Yin, T.; Zhou, Y.; Yang, J.; Cheng, X.; Zeng, X. VPQC: A domain-specific vector processor for post-quantum cryptography based on RISC-V architecture. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 2672–2684.
22. Li, C.; Liu, L. A high speed NTT accelerator for lattice-based cryptography. In Proceedings of the 2021 International Conference on Communications, Information System and Computer Engineering (CISCE); IEEE: Piscataway, NJ, USA, 2021; pp. 85–89.
23. Nejatollahi, H.; Shahhosseini, S.; Cammarota, R.; Dutt, N.D. Exploring Energy Efficient Quantum-resistant Signal Processing Using Array Processors. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1539–1543.
24. Bisheh-Niasar, M.; Azarderakhsh, R.; Kermani, M.M. High-Speed NTT-based Polynomial Multiplication Accelerator for Post-Quantum Cryptography. In Proceedings of the 28th IEEE Symposium on Computer Arithmetic, ARITH 2021, Lyngby, Denmark, 14–16 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 94–101.
25. Itabashi, Y.; Ueno, R.; Homma, N. Efficient Modular Polynomial Multiplier for NTT Accelerator of Crystals-Kyber. In Proceedings of the 25th Euromicro Conference on Digital System Design, DSD 2022, Maspalomas, Spain, 31 August–2 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 528–533.
26. Guo, W.; Li, S.; Kong, L. An Efficient Implementation of KYBER. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 1562–1566.
27. Li, M.; Tian, J.; Hu, X.; Cao, Y.; Wang, Z. High-Speed and Low-Complexity Modular Reduction Design for CRYSTALS-Kyber. In Proceedings of the IEEE Asia Pacific Conference on Circuit and Systems, APCCAS 2022, Shenzhen, China, 11–13 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
28. Chen, Z.; Ma, Y.; Chen, T.; Lin, J.; Jing, J. Towards Efficient Kyber on FPGAs: A Processor for Vector of Polynomials. In Proceedings of the 25th Asia and South Pacific Design Automation Conference, ASP-DAC 2020, Beijing, China, 13–16 January 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 247–252.
29. Yaman, F.; Mert, A.C.; Öztürk, E.; Savas, E. A Hardware Accelerator for Polynomial Multiplication Operation of CRYSTALS-KYBER PQC Scheme. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, DATE 2021, Grenoble, France, 1–5 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1020–1025.
30. Zhao, L.; Zhang, J.; Huang, J.; Liu, Z.; Hancke, G.P. Efficient Implementation of Kyber on Mobile Devices. In Proceedings of the 27th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2021, Beijing, China, 14–16 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 506–513.
31. Guo, W.; Li, S. Highly-Efficient Hardware Architecture for CRYSTALS-Kyber With a Novel Conflict-Free Memory Access Pattern. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 4505–4515.
32. Gao, X.; Tian, Z.; Xue, L. Hardware design and optimization of polynomial multiplication for post-quantum cryptography algorithm based on ntt. In Proceedings of the 2023 5th International Conference on Electronic Engineering and Informatics (EEI); IEEE: Piscataway, NJ, USA, 2023; pp. 304–308.
33. El-Kady, A.; Fournaris, A.P.; Paliouras, V. Invited Paper: Dilithium Hardware-Accelerated Application Using OpenCL-Based High-Level Synthesis. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, 28 October–2 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7.
34. Tan, W.; Lao, Y.; Parhi, K.K. KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, 28 October–2 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–9.
35. Kundi, D.; Zhang, Y.; Wang, C.; Khalid, A.; O’Neill, M.; Liu, W. Ultra High-Speed Polynomial Multiplications for Lattice-Based Cryptography on FPGAs. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1993–2005.
36. Zeng, Q.; Li, Q.; Zhao, B.; Jiao, H.; Huang, Y. Hardware Design and Implementation of Post-Quantum Cryptography Kyber. In Proceedings of the IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, 19–23 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
37. Duong-Ngoc, P.; Lee, H. Configurable Mixed-Radix Number Theoretic Transform Architecture for Lattice-Based Cryptography. IEEE Access 2022, 10, 12732–12741.
38. Li, G.; Chen, D.; Mao, G.; Dai, W.; Sanka, A.I.; Cheung, R.C.C. Algorithm-Hardware Co-Design of Split-Radix Discrete Galois Transformation for KyberKEM. IEEE Trans. Emerg. Top. Comput. 2023, 11, 824–838.
39. Dang, V.B.; Mohajerani, K.; Gaj, K. High-Speed Hardware Architectures and FPGA Benchmarking of CRYSTALS-Kyber, NTRU, and Saber. IEEE Trans. Comput. 2023, 72, 306–320.
40. Ji, X.; Dong, J.; Deng, T.; Zhang, P.; Hua, J.; Xiao, F. HI-Kyber: A Novel High-Performance Implementation Scheme of Kyber Based on GPU. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 722–736.
41. Bisheh-Niasar, M.; Lo, D.; Parthasarathy, A.; Pelton, B.; Pillilli, B.; Kelly, B. PQC Cloudization: Rapid Prototyping of Scalable NTT/INTT Architecture to Accelerate Kyber. In 2023 IEEE Physical Assurance and Inspection of Electronics (PAINE); IEEE: Piscataway, NJ, USA, 2023; p. 1038.
42. Fujisaki, E.; Okamoto, T. Secure Integration of Asymmetric and Symmetric Encryption Schemes. J. Cryptol. 2013, 26, 80–101.
43. Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2018, 2, 79.
44. Soiguine, A. Superposition of Wave Functions in the G-Qubit Theory. Int. J. Appl. Sci. 2022, 5, 8–14.
Wang, Y.; Li, Y.; Yin, Z.q.; Zeng, B. 16-qubit IBM universal quantum computer can be fully entangled. npj Quantum Inf. 2018, 4, 46. [Google Scholar] [CrossRef]
Bennett, C.H.; Brassard, G. Quantum key distribution protocols: A review. Theor. Comput. Sci. 2014, 560, 7–11. [Google Scholar] [CrossRef]
Njorbuenwu, M.; Swar, B.; Zavarsky, P. A Survey on the Impacts of Quantum Computers on Information Security. In Proceedings of the 2nd International Conference on Data Intelligence and Security, ICDIS 2019, South Padre Island, TX, USA, 28–30 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 212–218. [Google Scholar] [CrossRef]
Shor, P.W. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings 35th Annual Symposium on Foundations of Computer Science; IEEE: Piscataway, NJ, USA, 1994. [Google Scholar]
Lauterbach, F.; Burdiak, P.; Richter, F.; Voznak, M. Performance analysis of post-quantum algorithms. In Proceedings of the 2021 29th Telecommunications Forum (TELFOR); IEEE: Piscataway, NJ, USA, 2021; pp. 1–4. [Google Scholar]
Brassard, G.; Høyer, P.; Tapp, A. Quantum counting. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 1998; pp. 820–831. [Google Scholar]
Nguyen, T.; Kieu-Do-Nguyen, B.; Pham, C.; Hoang, T. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium. IEEE Access 2024, 12, 34918–34930. [Google Scholar] [CrossRef]
Hirner, F.; Mert, A.C.; Roy, S.S. Proteus: A Pipelined NTT Architecture Generator. IEEE Trans. Very Large Scale Integr. Syst. 2024, 32, 1228–1238. [Google Scholar] [CrossRef]
Zhao, Y.; Xie, R.; Xin, G.; Han, J. A High-Performance Domain-Specific Processor With Matrix Extension of RISC-V for Module-LWE Applications. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 2871–2884. [Google Scholar] [CrossRef]
Li, M.; Tian, J.; Hu, X.; Wang, Z. Reconfigurable and High-Efficiency Polynomial Multiplication Accelerator for CRYSTALS-Kyber. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 42, 2540–2551. [Google Scholar] [CrossRef]
Ni, Z.; Khalid, A.; Liu, W.; O’Neill, M. Towards a Lightweight CRYSTALS-Kyber in FPGAs: An Ultra-lightweight BRAM-free NTT Core. In Proceedings of the IEEE International Symposium on Circuits and Systems, ISCAS 2023, Monterey, CA, USA, 21–25 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
Hu, X.; Li, M.; Tian, J.; Wang, Z. DARM: A Low-Complexity and Fast Modular Multiplier for Lattice-Based Cryptography. In Proceedings of the 32nd IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2021, Virtual Conference, USA, 7–9 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 175–178. [Google Scholar] [CrossRef]

Figure 1. Workflow of the proposed cross-platform NTT evaluation framework showing algorithm development, validation, quantum feasibility assessment, FPGA implementation, and comparative performance analysis.

Figure 2. Lattice space with non-lattice points.

Figure 3. General structure of the Kyber Key Encapsulation Mechanism (KEM).

Figure 4. Use of the NTT and INTT across all Kyber KEM procedures.

Figure 5. Polynomial multiplications using NTT and NWC.

Figure 6. Cooley–Tukey unit (CT BFU) and Gentleman–Sande unit (GS BFU).

Figure 7. Qiskit circuit for addition using X, CX, and CCX gates.

Figure 8. Qiskit circuit for addition using QFT.

Figure 9. Qiskit circuit for subtraction using QFT.

Figure 10. Qiskit circuit for multiplication using QFT.

Figure 11. Workflow of proposed quantum NTT based on QFT.

Figure 12. Execution-time comparison of matrix-based and Cooley–Tukey-based NTT.

Table 1. Comparative summary of state-of-the-art NTT implementations.

Ref.	Platform	Approach	Modular Reduction	Key Features	Limitations
[16]	FPGA (Virtex-7)	HLS with unrolling & pipelining	Barrett	Loop unrolling reduces latency	Limited scalability, high DSP use
[17]	RISC-V/FPGA	Hybrid HW/SW NewHope	Barrett	Precomputed twiddle factors, error injection	Latency in non-precomputed mode
[18]	FPGA (Virtex-7)	MDF parallel design	Shift-add multipliers	Parallel butterfly paths	Complexity in control logic
[20]	STM32 MCU	Unified butterfly	Barrett/Montgomery	Single butterfly for NTT & INTT	Bit-reversal overhead
[24]	FPGA (Artix-7)	K2RED unified 2 × 2 butterfly	Modified KRED	Reduced memory usage	Extra control cost
[30]	RISC-V	Ping-pong memory + LUT	Barrett	Dual memory access for high bandwidth	Higher LUT storage cost
[27]	FPGA (Virtex-7)	FLUT modular arithmetic	FLUT	High-speed signed reduction	Increased memory requirement
[36]	FPGA UltraScale+	Radix-4 + coefficient packing	Barrett	Reduced memory conflict	Higher design complexity
[32]	FPGA (Artix-7)	Single butterfly reuse	Barrett	∼100% utilization	Longer critical path
[40]	GPU	Kernel merging & scheduling	–	Reduced redundant memory	Still memory-intensive
[41]	Cloud FPGA	Multi-stage pipelining	Modified KRED	Shared resources, scalable	Overhead in cloud deployment
[14]	Quantum circuits	QFT-based	–	Theoretical parallelism	Circuit depth, qubit errors

Table 2. Parameter sets of CRYSTALS-Kyber (key and ciphertext sizes in bytes).

NIST Security Level	Kyber Version	$n$	$k$	$q$	$\eta_{1}, \eta_{2}$	$(d_{u}, d_{v})$	$\delta$	pk (bytes)	sk (bytes)	ct (bytes)
1 (128-bit)	Kyber512	256	2	3329	3, 2	(10, 4)	$2^{-139}$	800	1632	768
3 (192-bit)	Kyber768	256	3	3329	2, 2	(10, 4)	$2^{-164}$	1184	2400	1088
5 (256-bit)	Kyber1024	256	4	3329	2, 2	(11, 5)	$2^{-174}$	1568	3168	1568
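
The key and ciphertext sizes in the last three columns follow from the other parameters. A minimal Python sketch of that arithmetic is given here, assuming the round-3 Kyber encoding (12-bit packed coefficients, a 32-byte seed in the public key, and a 32-byte hash plus a 32-byte rejection value appended to the secret key) and assuming $(d_{u}, d_{v}) = (10, 4)$ for Kyber512, which is the pair that reproduces the listed 768-byte ciphertext. The names `kyber_sizes`, `SEED_BYTES`, and `PARAMS` are illustrative and are not taken from the paper.

```python
# Sketch: derive pk/sk/ct sizes (in bytes) from the Kyber parameters.
N = 256          # polynomial degree n
SEED_BYTES = 32  # assumed length of the seed, hash, and rejection value

def kyber_sizes(k, du, dv, n=N):
    """Return (pk, sk, ct) sizes in bytes for module rank k and compression (du, dv)."""
    poly_vec = 12 * k * n // 8            # k polynomials, 12 bits per coefficient
    pk = poly_vec + SEED_BYTES            # packed vector t plus the seed rho
    sk = poly_vec + pk + 2 * SEED_BYTES   # packed s, a copy of pk, H(pk), and z
    ct = (du * k + dv) * n // 8           # compressed u (du bits) and v (dv bits)
    return pk, sk, ct

# (k, du, dv) per parameter set; the Kyber512 pair is an assumption, see lead-in.
PARAMS = {"Kyber512": (2, 10, 4), "Kyber768": (3, 10, 4), "Kyber1024": (4, 11, 5)}

for name, (k, du, dv) in PARAMS.items():
    print(name, kyber_sizes(k, du, dv))
```

Under these assumptions the sketch returns the same pk, sk, and ct values as the table for all three security levels.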

Table 3. Complexity comparison of polynomial multiplication methods.

Method	Description	Asymptotic Complexity	Used in	Notes
Schoolbook Multiplication	Naive nested-loop product	$O(n^{2})$	Educational examples, small $n$	Simple but inefficient for large degrees
NTT-Based Multiplication	Forward transform, pointwise multiplication, and inverse transform using roots of unity	$O(n \log n)$	General lattice cryptography	Requires careful choice of modulus and primitive root
NTT + NWC (Kyber-specific)	NTT with $\zeta$-scaling for the $(X^{n} + 1)$ ring	$O(n \log n)$	Kyber KEM	Reduction modulo $X^{n} + 1$ is absorbed into the transform, so no zero-padding to length $2n$ is needed
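
The first and third rows can be compared directly in a few lines of Python. The sketch below is illustrative only and is not the paper's implementation. It assumes a length of $n = 128$ so that a full negative wrapped convolution exists for $q = 3329$, using $\psi = 17$ as a primitive 256th root of unity. Kyber itself uses $n = 256$ with an incomplete NTT, which this sketch does not model. All function names are hypothetical.

```python
# Sketch: schoolbook vs. NTT-based multiplication in Z_q[X]/(X^n + 1).
Q = 3329                 # Kyber modulus
N = 128                  # illustrative length (assumption, see lead-in)
PSI = 17                 # primitive 2N-th root of unity mod Q, so PSI**N == -1
OMEGA = PSI * PSI % Q    # primitive N-th root of unity

def schoolbook_negacyclic(a, b):
    """O(n^2) product with wrap-around terms negated (X^n == -1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % Q
    return c

def ntt(a, omega):
    """Recursive radix-2 Cooley-Tukey transform, O(n log n)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % Q)
    odd = ntt(a[1::2], omega * omega % Q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q            # butterfly: upper output
        out[k + n // 2] = (even[k] - t) % Q   # butterfly: lower output
        w = w * omega % Q
    return out

def ntt_negacyclic(a, b):
    """Negative wrapped convolution: scale by psi^i, transform, multiply, undo."""
    n = len(a)
    psi_pows = [pow(PSI, i, Q) for i in range(n)]
    a_hat = ntt([x * p % Q for x, p in zip(a, psi_pows)], OMEGA)
    b_hat = ntt([x * p % Q for x, p in zip(b, psi_pows)], OMEGA)
    c_hat = [x * y % Q for x, y in zip(a_hat, b_hat)]
    c_scaled = ntt(c_hat, pow(OMEGA, -1, Q))  # inverse transform, still unnormalised
    n_inv = pow(n, -1, Q)
    psi_inv = pow(PSI, -1, Q)
    return [x * n_inv * pow(psi_inv, i, Q) % Q for i, x in enumerate(c_scaled)]

a = [(3 * i + 1) % Q for i in range(N)]
b = [(i * i + 7) % Q for i in range(N)]
print(schoolbook_negacyclic(a, b) == ntt_negacyclic(a, b))
```

Both routines compute the same product. The schoolbook version performs $n^{2}$ coefficient multiplications, while the transform-based version performs three length-$n$ transforms and $n$ pointwise multiplications.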

Table 4. Performance comparison of radix-2, radix-4, and radix-8 NTT implementations using Barrett and Montgomery reductions.

	Barrett			Montgomery
Metric	Radix-2	Radix-4	Radix-8	Radix-2	Radix-4	Radix-8
Clock Period (ns)	6.83	8.64	8.64	4.32	5.55	5.55
Estimated Fmax (MHz)	146.41	115.74	115.74	231.48	180.18	180.18
Latency (cycles)	65,571	196,674	393,277	32,804	196,676	393,279
Pipeline Interval (II)	1	3	6	1	3	6
DSP Utilization	9	19	34	5	9	15
BRAM Utilization	4	4	4	4	4	4
LUT Utilization	3494	7398	11,668	3011	6407	11,120
FF Utilization	3519	7233	11,915	3135	6617	10,765
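
The estimated Fmax row is the reciprocal of the clock period, and a cycle count can be converted to wall-clock latency by multiplying it by the period. The following Python sketch illustrates this conversion for the two radix-2 columns. The helper names are hypothetical, and the wall-clock figures it prints are derived here for illustration and are not reported in the paper.

```python
# Sketch: relate clock period, Fmax, and wall-clock latency for Table 4 entries.

def fmax_mhz(period_ns):
    """Maximum clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / period_ns

def latency_us(cycles, period_ns):
    """Wall-clock latency in microseconds for a cycle count at the given period."""
    return cycles * period_ns / 1000.0

# (clock period in ns, latency in cycles) taken from the radix-2 columns.
radix2 = {"Barrett": (6.83, 65_571), "Montgomery": (4.32, 32_804)}

for name, (period_ns, cycles) in radix2.items():
    print(f"{name}: Fmax = {fmax_mhz(period_ns):.2f} MHz, "
          f"latency = {latency_us(cycles, period_ns):.1f} us")
```

The same two helpers apply unchanged to the radix-4 and radix-8 columns.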

Table 5. Comparative analysis with previous NTT implementations.

Ref.	Clock Period (ns)	Fmax (MHz)	Latency (Cycles)	DSP	BRAM	LUT	FF
[51]	-	245	-	-	-	4821	4261
[52]	1.4	150	21	36	2	5200	2800
[53]	-	-	32	64	6	25,674	3137
[54]	0.31	273	84	16	8	4619	4166
[55]	1.5	300	45	2	0	1154	1031
[37]	0.24	265	64	26	0	3918	4292
[29]	-	182	24	4	9	2543	-
[56]	4.5	169	74	24	16	5200	-
[23]	1.1	182.98	18	58	29	3140	3007
This work	4.32	231.48	32,804	5	4	3011	3135

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
