1. Introduction
Public-key cryptography (PKC) has long provided the foundation for secure digital communications by employing asymmetric key pairs to achieve confidentiality, authentication, and integrity. Widely deployed algorithms such as Rivest–Shamir–Adleman (RSA) [1] and elliptic curve cryptography (ECC) [2] derive their security from the infeasibility of solving computationally hard mathematical problems, such as integer factorization and discrete logarithms. However, this assumption no longer holds in the era of quantum computing. Shor’s algorithm [3] enables efficient factorization and discrete logarithm computation, reducing attacks against RSA and ECC from exponential to polynomial time. Consequently, these widely used cryptographic systems become vulnerable once scalable quantum computers become available. Recognizing this imminent threat, the National Institute of Standards and Technology (NIST) launched a multi-round competition in 2016 to standardize post-quantum cryptography (PQC) [4]. After three evaluation rounds, NIST announced candidate algorithms for standardization [5]. Among them, CRYSTALS-Kyber [6], the primary key encapsulation mechanism (KEM), was selected for its efficiency, scalability, and resistance to quantum adversaries. Kyber relies on lattice-based cryptography (LBC) [7], specifically on the hardness of the module learning with errors (MLWE) problem [8], which is widely regarded as one of the most secure and versatile foundations for PQC [9].
Despite these advances, Kyber suffers from a critical computational bottleneck: the Number Theoretic Transform (NTT). NTT is used to accelerate polynomial multiplications in Kyber by reducing the complexity from $O(n^2)$ (schoolbook multiplication) to $O(n \log n)$. However, its practical realization remains challenging. First, computational complexity persists despite the use of fast Fourier transform (FFT)-inspired algorithms such as Cooley–Tukey (CT) [10] and Gentleman–Sande (GS) [11], which serve as the building blocks of NTT. Second, hardware implementations of NTT on Field Programmable Gate Arrays (FPGAs) consume substantial resources, including digital signal processing (DSP) slices, BRAMs, and LUTs [12]. This resource inefficiency restricts scalability for real-world cryptosystems. Third, although quantum computing offers theoretical acceleration opportunities via algorithms such as the Quantum Fourier Transform (QFT) [13], mapping modular arithmetic into quantum circuits introduces non-trivial challenges related to circuit depth, qubit coherence, and error propagation [14,15].
These limitations motivate a comparative investigation of NTT implementation strategies across software, hardware, and emerging quantum-computing domains. Such an evaluation provides insight into performance, resource utilization, scalability, and long-term implementation trade-offs relevant to CRYSTALS-Kyber. This research addresses this gap by proposing a cross-platform NTT framework focused primarily on an adaptive FPGA-based mixed-radix architecture, while additionally providing classical and quantum implementations for comparative evaluation and future heterogeneous computing exploration. The main objectives are:
Optimization of Classical NTT—Development of radix-mixed and algorithmic adaptations of CT and GS architectures to reduce computational overhead.
Quantum NTT Exploration—Implementation of gate-based and QFT-based polynomial arithmetic in Qiskit, evaluating the feasibility of implementing modular arithmetic within quantum circuits.
Efficient FPGA Acceleration—Design of adaptive radix-mixed NTT architectures optimized for resource utilization, achieving significant reductions in DSP, BRAM, and LUT consumption.
Cross-Platform Evaluation—Comparative analysis of NTT implementations across classical, quantum, and hardware domains, emphasizing performance metrics such as latency, throughput, and scalability.
Integration with CRYSTALS-Kyber—Ensuring compatibility with Kyber’s polynomial arithmetic requirements, thereby providing a practical and secure pathway toward PQC deployment.
The principal contribution of this work is an adaptive FPGA-based NTT accelerator capable of mixed-radix configurability (radix-2/radix-4/radix-8) combined with resource-aware modular arithmetic optimization using Barrett and Montgomery reductions.
Classical and quantum implementations are included primarily as comparative baselines and feasibility studies within the proposed evaluation framework.
Figure 1 illustrates the evaluation workflow adopted in this study. The same NTT problem is examined through classical, quantum, and FPGA implementation paths, after which the resulting performance characteristics, resource requirements, and implementation constraints are comparatively analyzed. The framework is intended as a comparative evaluation methodology rather than a runtime-integrated heterogeneous architecture.
2. Related Work
The NTT has been widely recognized as the computational bottleneck in lattice-based PQC, including CRYSTALS-Kyber. Over the past decade, researchers have proposed numerous optimizations to improve its performance across software, hardware, and hybrid domains. This section reviews prior work on NTT implementations, categorizing it into high-level synthesis (HLS)-based methods, FPGA- and Application-Specific Integrated Circuit (ASIC)-based accelerations, parallel and vectorized designs, memory-optimized architectures, and emerging quantum approaches.
2.1. Early High-Level Synthesis and Hybrid Architectures
One of the earliest efforts to accelerate NTT using HLS was presented in [16], where optimization directives such as loop unrolling, pipelining, and inlining were applied to enhance performance. Their analysis showed that unrolling nested loops significantly reduced latency, while modular multipliers based on Barrett reduction further minimized computational overhead. The authors in [17] introduced a hybrid hardware–software NewHope design that combines the NTT transformation with hash generation. Their FPGA-based design achieved notable speedups using pre-calculated twiddle factors but still incurred latency overhead in non-precomputation modes. They also employed soft error injection to mitigate reliability issues and improve robustness against Silent Data Corruption (SDC).
2.2. Parallelization and Butterfly Unit Optimizations
A key focus of subsequent research has been the parallelization of butterfly operations, the core building block of NTT. In [
18], the authors proposed a Multi-Path Delay Feedback (MDF) architecture that combines features of multi-path delay commutators and single-path delay feedback units. This design enabled efficient multiplication via addition and shift operations, thereby reducing reliance on costly multipliers. The elimination of DSP usage altogether, thereby enabling better parallelism in butterfly operations, is presented in [
19]. By storing twiddle factors in ROM, they reduced area and latency. Similarly, ref. [
20] implemented a unified butterfly design in which the same hardware supported both NTT and inverse NTT, relying on bit-reversal to maintain correctness. While this reduced hardware cost, the bit-reversal step added latency overhead. Later works extended parallelism by introducing vectorized butterfly architectures [
21,
22]. These designs used vector coprocessors or SIMD/NEON instruction sets for modular arithmetic. Although effective in boosting throughput, they required memory bandwidth per cycle, often exceeding the constraints of resource-limited FPGAs.
2.3. Modular Reduction Improvements
Efficient modular reduction is critical for resource utilization. Early FPGA implementations favored Barrett or Montgomery reduction [
23], but more advanced schemes were later introduced [
24]. They proposed K2RED, a two-step reduction method that combines CT and GS butterflies into a unified 2 × 2 architecture. This approach reduced memory accesses by reusing twiddle factors but required additional control logic [
25]. The authors in [
25] further enhanced modular arithmetic by combining Proth prime-based reduction with modified KRED, offering low-cost multiplications via repeated constant multiplications. More recently, ref. [
26] introduced MUX-controlled arithmetic to streamline addition, subtraction, and multiplication operations, enabling a controller-based reduction scheme with lower latency. Lookup table (LUT)-based methods also emerged; for example, ref. [
27] introduced FLUT (Fast LUT), which reduced the results to the range
, enabling high-speed signed operations. While effective, these approaches increased memory requirements for LUT storage, making them less practical for constrained FPGA devices.
2.4. Memory and Pipelining Optimizations
Since NTT computations are memory-intensive, several works have targeted data movement reduction. Ref. [28] achieved a 30% cycle reduction by leveraging multiply-accumulate (MACC) instructions in modular operations. Ref. [
29] improved performance by converting subtraction into negation–addition operations and introduced a DIV2 unit to replace multiplications in inverse NTT. Ping-pong memory access schemes [
30] allowed simultaneous read–write operations, while [
31] proposed banked memory layouts with Lazy-Last-Layer tricks to improve bandwidth utilization. Ref. [
32] further reduced storage by reusing a single butterfly across all NTT levels, achieving nearly 100% hardware utilization but at the cost of longer critical paths. Recent works also explored domain-specific co-designs. Ref. [
33] used OpenCL-based MPSoC platforms, dividing computation into NTT and pointwise multiplication (PWM) stages to reduce memory bottlenecks. Ref. [
34] introduced KyberMat, a polyphase decomposition architecture inspired by FIR filters that reduces computational and memory overhead.
2.5. Advanced Radix and Flexible Architectures
Radix-based optimizations have been widely studied. Ref. [
35] proposed radix-$2^2$ architectures, enabling high-throughput designs with four parallel paths, though at a higher resource cost. Ref. [
36] introduced radix-4 with adjacent coefficient packing, reducing memory conflicts. Ref. [
37] presented mixed-radix architectures capable of runtime switching between radix-2 and radix-4, improving flexibility. Flexible architectures such as [
38] proposed a Bi-Core design that compresses polynomial and twiddle-factor memory while supporting multiple cryptosystems (Kyber, Dilithium, Falcon). Ref. [
39] further developed technology-independent NTT accelerators that are portable across FPGA platforms.
2.6. Emerging GPU and Cloud Implementations
Beyond FPGAs, GPU-based accelerations have gained attention. Ref. [40] explored GPU kernel optimizations, introducing techniques such as sliced-layer merging and depth-first scheduling, which reduced memory accesses. Microsoft researchers [
41] proposed a cloud-based Kyber design that decomposes NTT into a butterfly core, stages, and input/output (I/O) steps, with pipeline sharing to improve scalability.
2.7. Quantum NTT Approaches
While most research has focused on classical and hardware domains, several studies have explored quantum NTT (QNTT). Gate-based modular arithmetic and QFT-based approaches offer theoretical acceleration but face challenges. Specifically, circuit depth, qubit coherence, and mapping modular arithmetic into quantum gates remain unresolved bottlenecks [
14,
15]. These limitations underscore the need for hybrid frameworks that bridge quantum and classical techniques, ensuring scalability in the future while remaining practical on near-term devices. Table 1 summarizes state-of-the-art NTT implementations.
Although numerous NTT accelerators have been reported in the literature, most existing studies focus on a specific radix organization, a particular hardware optimization strategy, or a single implementation domain. In contrast, this work provides a unified evaluation framework that examines radix-2, radix-4, and radix-8 NTT implementations within a common FPGA design methodology and investigates their corresponding software and quantum-circuit realizations. The resulting analysis enables a systematic assessment of implementation trade-offs, including latency, resource utilization, arithmetic complexity, and scalability considerations. Therefore, the contribution of this work lies not in proposing a new NTT algorithm but in providing a comparative evaluation of alternative implementation strategies and their implications for NTT acceleration in CRYSTALS-Kyber.
2.8. Limitations of Existing Works
Despite notable progress, we can summarize several limitations:
Latency and Throughput Bottlenecks: Even with optimized butterfly and modular reduction schemes, long critical paths can limit achievable throughput.
Resource Utilization Issues: Designs that achieve high throughput often do so at the expense of substantial DSP, BRAM, and LUT usage, making them unsuitable for constrained FPGAs.
Scalability Challenges: Many designs are tailored for specific platforms (e.g., Xilinx or Intel FPGAs), limiting portability and broader applicability.
Quantum Circuit Depth: Quantum NTT remains largely theoretical; existing designs suffer from excessive depth and qubit requirements, making them impractical on current quantum hardware.
These challenges motivate the hybrid framework proposed in this paper, which integrates classical optimizations, FPGA accelerations, and quantum exploration to provide a more scalable and resource-efficient NTT solution for CRYSTALS-Kyber.
2.9. Primary Research Gaps
Despite the significant progress achieved in NTT acceleration for lattice-based cryptography, several limitations remain in the existing literature. First, many reported architectures focus on a single radix organization and therefore provide limited insight into the implementation trade-offs associated with alternative radix decompositions. Second, most studies prioritize a specific optimization objective, such as throughput, latency, or area efficiency, making it difficult to compare architectural behavior under a common implementation framework. Third, existing investigations are typically confined to a single computational domain, such as software, FPGA hardware, or quantum computing, without providing a broader perspective on how NTT implementations behave across different computational paradigms. To address these limitations, this work presents a cross-platform evaluation framework centered on FPGA-based mixed-radix NTT acceleration for CRYSTALS-Kyber. The proposed approach investigates radix-2, radix-4, and radix-8 implementations within a unified FPGA design methodology while also examining corresponding software and quantum-circuit realizations. This enables systematic evaluation of implementation trade-offs, resource utilization, latency characteristics, and scalability considerations across multiple NTT implementation strategies.
3. Theoretical Background
LBC provides the mathematical foundation for CRYSTALS-Kyber, which is the focus of this study. Its security derives from the hardness of computational problems such as the Shortest Vector Problem (SVP) and the Closest Vector Problem (CVP). The CVP involves identifying the nearest lattice point to an arbitrary target point when only a poor-quality, or “bad,” basis is known. Solving CVP efficiently is hard and underpins the cryptographic strength of lattice schemes. Encryption in such systems encodes a message into a lattice structure, introduces a small random error vector, and produces a perturbed point near the lattice. Decryption requires the corresponding “good” basis to recover the original message by mapping the perturbed point to the nearest lattice point.
Figure 2 illustrates a lattice space with non-lattice points, where rounding is performed to the nearest lattice vector. Decryption succeeds if the added error is small; otherwise, decoding errors occur.
The SVP, a related problem, requires finding the shortest non-zero vector in a lattice. Both CVP and SVP are conjectured to remain hard in the worst case for classical and quantum adversaries, which underpins the security of lattice-based schemes. The learning with errors (LWE) problem generalizes the classical learning parity with noise problem. It introduces small random errors into linear equations, making the recovery of secret vectors computationally hard. Formally, LWE is the problem of recovering the secret vector $s$ from pairs $(a, b)$, where $a$ is chosen uniformly at random and $b = \langle s, a \rangle + e \pmod{q}$, with $e$ sampled from an error distribution (often a discrete Gaussian). The search-LWE problem asks for the secret vector given many LWE samples. In contrast, the decision-LWE variant asks whether a set of samples is drawn from the LWE distribution or from the uniform distribution. Variants such as Ring-LWE and Module-LWE extend the problem to polynomial rings and modules. Ring-LWE operates on single polynomials, while Module-LWE generalizes to vectors of polynomials, providing greater flexibility and efficiency in cryptographic constructions, as illustrated in Figure 3.
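The LWE sample definition above can be made concrete with a minimal Python sketch. The function name `lwe_sample`, the toy dimension, and the bounded uniform error (used here in place of a discrete Gaussian) are illustrative assumptions and are not taken from the implementations evaluated in this work.

```python
import random

def lwe_sample(secret, q, rng, err_bound=1):
    """Return one LWE pair (a, b) with b = <s, a> + e (mod q).

    Assumption: the error e is drawn uniformly from [-err_bound, err_bound],
    a simplification of the discrete Gaussian mentioned in the text.
    """
    a = [rng.randrange(q) for _ in secret]          # uniform public vector
    e = rng.randint(-err_bound, err_bound)          # small error term
    b = (sum(s_i * a_i for s_i, a_i in zip(secret, a)) + e) % q
    return a, b

def centered_residual(secret, a, b, q):
    """Recover e from (a, b) when the secret is known, as a value in (-q/2, q/2]."""
    d = (b - sum(s_i * a_i for s_i, a_i in zip(secret, a))) % q
    return d - q if d > q // 2 else d

rng = random.Random(2024)        # fixed seed for reproducibility only
q = 3329                         # Kyber modulus, used here as a toy choice
secret = [rng.randrange(q) for _ in range(4)]
samples = [lwe_sample(secret, q, rng) for _ in range(5)]
```

With knowledge of the secret, the residual of every sample is small; without it, each `b` is intended to look uniformly random, which is the decision-LWE setting.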
CRYSTALS is a cryptographic suite that includes Kyber, a KEM, and Dilithium, a digital signature scheme. Both rely on the hardness of MLWE. In Kyber, public keys consist of polynomial matrices and vectors derived from uniform sampling, while secret keys are sampled from centered binomial distributions. Kyber provides security against chosen-plaintext (IND-CPA) and chosen-ciphertext (IND-CCA2) adversaries. IND-CPA security is achieved through lattice-based encryption, while the Fujisaki–Okamoto transformation [
18,
42] extends it to IND-CCA2 by introducing re-encryption and verification steps.
Figure 3 illustrates the key establishment mechanism. The algorithm operates over the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. Noise terms are sampled from a centered binomial distribution $B_\eta$, which ensures correctness while maintaining security. Parameters are selected to strike a balance between efficiency and cryptographic strength.
Table 2 lists Kyber’s parameter sets, including values of $n$, $q$, and $k$, which correspond to security levels equivalent to the Advanced Encryption Standard (AES) variants AES-128 (Kyber512), AES-192 (Kyber768), and AES-256 (Kyber1024). The remaining parameters govern error generation, coefficient sizes, compression, and KEM failure probability.
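As an illustration of the centered binomial noise sampling mentioned above, the following sketch draws a coefficient as the difference of two sums of $\eta$ random bits. The function name `sample_cbd` is hypothetical, and Python's `random` module is used only for demonstration; it is not a cryptographically secure source and does not reproduce Kyber's SHAKE-based sampling.

```python
import random

def sample_cbd(eta, rng):
    """Sample one value from a centered binomial distribution with parameter eta.

    The result is (sum of eta bits) - (sum of eta bits), so it lies in [-eta, eta]
    and is symmetric around zero.
    """
    plus = sum(rng.getrandbits(1) for _ in range(eta))
    minus = sum(rng.getrandbits(1) for _ in range(eta))
    return plus - minus

rng = random.Random(7)                               # demo seed, not secure randomness
noise_poly = [sample_cbd(2, rng) for _ in range(256)]  # one toy noise polynomial
```

Small values of $\eta$ keep the noise narrow, which is what allows decryption to succeed while still hiding the secret.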
Cryptographic primitives used in Kyber include SHA3-512, SHAKE-128, and SHAKE-256 for matrix generation, key derivation, and noise sampling. Auxiliary functions handle the encoding, decoding, and compression of polynomials. Polynomial multiplication is the fundamental operation in LBC. Several approaches exist, varying in efficiency and complexity. The classical schoolbook method multiplies two degree-$(n-1)$ polynomials with $O(n^2)$ complexity. Each coefficient is computed by summing pairwise products across the polynomial terms.
Cyclic convolution folds the product back into $n$ coefficients by reducing results modulo a polynomial. In Positive Wrapped Convolution (PWC), reduction is performed with $x^n - 1$, while in Negative Wrapped Convolution (NWC), reduction uses $x^n + 1$. Both methods retain $O(n^2)$ complexity but enable compatibility with ring structures used in lattice cryptography. The NTT is an analog of the Discrete Fourier Transform (DFT), operating over finite fields $\mathbb{Z}_q$. It reduces the complexity of polynomial multiplication to $O(n \log n)$. The transform maps input coefficients into the NTT domain, where multiplication becomes pointwise, followed by an inverse NTT (INTT) to recover the result. The existence of primitive roots of unity modulo a prime $q$ enables the NTT. Powers of these roots, referred to as twiddle factors, allow butterfly operations that combine and reduce coefficients efficiently.
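The schoolbook, positive-wrapped, and negative-wrapped products described above can be sketched as follows. This is a minimal illustration with toy parameters ($n = 4$, $q = 17$); the function names are assumptions introduced here.

```python
def schoolbook_mul(a, b, q):
    """Linear convolution of two coefficient lists modulo q, O(n^2) operations."""
    res = [0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            res[i + j] = (res[i + j] + a_i * b_j) % q
    return res

def wrapped_mul(a, b, q, negative=True):
    """Reduce the linear product modulo x^n + 1 (NWC) or x^n - 1 (PWC)."""
    n = len(a)
    full = schoolbook_mul(a, b, q)
    res = full[:n]
    for k in range(n, len(full)):
        # x^n = -1 in the negative-wrapped ring, x^n = +1 in the positive-wrapped ring
        sign = -1 if negative else 1
        res[k - n] = (res[k - n] + sign * full[k]) % q
    return res

a, b, q = [1, 2, 3, 4], [5, 6, 7, 8], 17
linear = schoolbook_mul(a, b, q)
nwc = wrapped_mul(a, b, q, negative=True)
pwc = wrapped_mul(a, b, q, negative=False)
```

The only difference between the two wrapped variants is the sign applied to the coefficients that overflow past degree $n - 1$.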
Figure 4 shows the integration of NTT and INTT across Kyber’s key generation, encapsulation, and decapsulation processes. When combined with NWC, Kyber avoids explicit reductions modulo $x^n + 1$. Pre-scaling input coefficients by powers of $\psi$, a square root of the primitive $n$-th root of unity $\omega$ (so that $\psi^2 \equiv \omega \pmod{q}$), and post-scaling results by powers of $\psi^{-1}$ achieves efficient negative cyclic convolution.
Figure 5 illustrates the procedure. Efficient NTT evaluation employs recursive decomposition. Two methods dominate: decimation-in-time (DIT) using the CT butterfly and decimation-in-frequency (DIF) using the GS butterfly.
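In CT, inputs $(a, b)$ are transformed into outputs $(a + \omega_i b,\ a - \omega_i b)$.
In GS, inputs $(a, b)$ yield outputs $(a + b,\ (a - b)\,\omega_i)$.
Here, $\omega_i$ denotes the twiddle factor at stage $i$, and all arithmetic is performed modulo $q$.
Figure 6 depicts both butterfly units (BUs). These operations are reused recursively, forming the computational backbone of forward and inverse NTT.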
Table 3 compares the complexity of different multiplication methods. Schoolbook and cyclic convolutions remain quadratic, while NTT reduces the cost to quasi-linear.
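To make the two butterflies concrete, the sketch below builds a forward NTT from CT butterflies (decimation-in-time) and an inverse NTT from GS butterflies (decimation-in-frequency). The toy parameters $n = 8$, $q = 17$, and $\omega = 2$ are assumptions chosen so that the result can be checked by hand; this is not the accelerator design evaluated later.

```python
def bit_reverse(a):
    """Return a copy of a with indices permuted in bit-reversed order."""
    n, bits = len(a), len(a).bit_length() - 1
    out = [0] * n
    for i in range(n):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = a[i]
    return out

def ct_butterfly(a, b, w, q):
    t = (w * b) % q
    return (a + t) % q, (a - t) % q

def gs_butterfly(a, b, w, q):
    return (a + b) % q, ((a - b) * w) % q

def ntt_ct(a, omega, q):
    """Forward NTT: bit-reversed input order, natural output order."""
    n, a = len(a), bit_reverse(a)
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)       # root of unity for this stage
        for start in range(0, n, length):
            w = 1
            for j in range(length // 2):
                lo, hi = start + j, start + j + length // 2
                a[lo], a[hi] = ct_butterfly(a[lo], a[hi], w, q)
                w = w * w_len % q
        length *= 2
    return a

def intt_gs(a, omega, q):
    """Inverse NTT: natural input order, bit-reversal and 1/n scaling at the end."""
    n, a = len(a), list(a)
    omega_inv = pow(omega, -1, q)                # modular inverse (Python 3.8+)
    length = n
    while length >= 2:
        w_len = pow(omega_inv, n // length, q)
        for start in range(0, n, length):
            w = 1
            for j in range(length // 2):
                lo, hi = start + j, start + j + length // 2
                a[lo], a[hi] = gs_butterfly(a[lo], a[hi], w, q)
                w = w * w_len % q
        length //= 2
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in bit_reverse(a)]

q, omega = 17, 2                                 # 2 is a primitive 8th root of unity mod 17
coeffs = [1, 2, 3, 4, 5, 6, 7, 8]
transformed = ntt_ct(coeffs, omega, q)
recovered = intt_gs(transformed, omega, q)
```

Both routines perform $\log_2 n$ stages of $n/2$ butterflies, which is where the $O(n \log n)$ operation count comes from.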
Kyber incorporates several optimizations to tailor NTT to its module structure:
Consistent use of NTT domain: All core operations, including matrix–vector and pointwise multiplications, are performed in the NTT domain to minimize repeated transforms.
NWC pre-/post-scaling: Implicit use of $\psi$ enables multiplication in $\mathbb{Z}_q[x]/(x^n + 1)$.
Fixed parameters: Kyber uses $n = 256$ and $q = 3329$, which ensure the presence of primitive roots for efficient butterfly computation.
Module structure: Parallel NTT operations handle multiple polynomials when $k > 1$, depending on the security level.
Resource efficiency: In-place operations and reduced memory movement enhance suitability for constrained hardware.
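The pre-/post-scaling idea in the list above can be sketched as follows. A direct $O(n^2)$ transform is used for brevity, and the toy parameters $n = 4$, $q = 17$, $\psi = 2$ (a primitive 8th root of unity modulo 17) are assumptions for illustration only; `nwc_mul_ntt` is a hypothetical helper name.

```python
def ntt_direct(a, omega, q):
    """Direct evaluation A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q for k in range(n)]

def nwc_mul_ntt(a, b, psi, q):
    """Multiply a and b in Z_q[x]/(x^n + 1) using psi-scaling around a cyclic NTT."""
    n = len(a)
    omega = psi * psi % q                        # psi^2 = omega, a primitive n-th root
    psi_inv, n_inv = pow(psi, -1, q), pow(n, -1, q)
    # Pre-scale coefficient i by psi^i
    a_s = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b_s = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    # Pointwise product in the NTT domain
    c_hat = [x * y % q for x, y in zip(ntt_direct(a_s, omega, q), ntt_direct(b_s, omega, q))]
    # Inverse transform, then post-scale coefficient i by psi^(-i) and by 1/n
    c_s = ntt_direct(c_hat, pow(omega, -1, q), q)
    return [c * n_inv * pow(psi_inv, i, q) % q for i, c in enumerate(c_s)]

product = nwc_mul_ntt([1, 2, 3, 4], [5, 6, 7, 8], psi=2, q=17)
wraparound = nwc_mul_ntt([0, 1, 0, 0], [0, 0, 0, 1], psi=2, q=17)   # x * x^3 = x^4 = -1
```

The second call shows the negative wrap directly: the product $x \cdot x^3$ comes back as $-1 \equiv 16 \pmod{17}$ in the constant term.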
Quantum computing merges physics, computer science, and mathematics to process information beyond classical limits. It leverages quantum mechanical phenomena such as superposition and interference to accelerate the solution of problems that are computationally infeasible for classical machines [
14,
43].
Superposition allows qubits to exist in a linear combination of basis states $|0\rangle$ and $|1\rangle$. This property enables intrinsic parallelism, as quantum processors can evaluate many states simultaneously [
44]. Entanglement creates correlations between qubits such that the measurement of one qubit reveals information about another, regardless of distance. This phenomenon enhances computational power in quantum circuits [
45]. Decoherence remains a key challenge. Qubits are highly sensitive to environmental noise, causing fragile quantum states to collapse into classical states [
15]. Techniques such as ion-trap and superconducting processors attempt to mitigate decoherence through error correction and improved isolation.
Quantum technologies have a dual role in cryptography.
Quantum Key Distribution (QKD): Protocols such as BB84 and E91 use quantum states (entangled pairs in the case of E91) to exchange symmetric keys securely [
46,
47]. Any eavesdropping attempt alters the quantum state, revealing the presence of an adversary.
PQC: Conversely, quantum computers threaten classical cryptosystems. Shor’s algorithm breaks RSA, Diffie–Hellman, and ECC by enabling efficient factorization and discrete logarithm computation [
48,
49]. Grover’s algorithm reduces the security level of symmetric schemes by half, requiring longer keys for equivalent security strength [
50]. Lattice-based cryptography (LBC) remains resistant to known quantum algorithms, though research on hidden subgroup problems continues.
The fundamental unit of quantum information is the qubit. Unlike classical bits, qubits may occupy superpositions $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$, where $|\alpha|^2 + |\beta|^2 = 1$. Measurement collapses this state to a basis element according to the Born rule. Multi-qubit states grow exponentially, with an $n$-qubit system represented in a $2^n$-dimensional space. For example, simulating 100 qubits requires storage of $2^{100}$ amplitudes, which is infeasible for classical hardware. Quantum gates manipulate qubit states and form the building blocks of quantum circuits.
Identity Gate (I): Preserves the quantum state.
Hadamard Gate (H): Creates superpositions, rotating states on the Bloch sphere [
13].
Pauli Gates (X, Y, Z): Represent bit-flip (X), phase-flip (Z), and combined bit-and-phase-flip (Y) transformations of qubit states [13].
Controlled Gates (CX, CCX): Enable conditional operations by flipping a target qubit depending on control qubits. These gates are essential for constructing composite algorithms such as the QFT.
Quantum circuits are constructed by chaining these gates, analogous to logic gates in classical digital design. Current quantum programming is performed at the gate level, as high-level abstractions remain limited in their scope.
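As a small illustration of gate chaining, the following plain-Python statevector sketch applies a Hadamard gate followed by a CX gate to prepare an entangled two-qubit state. It is a hand-written emulation for explanatory purposes, not Qiskit code; the helper names and the little-endian qubit indexing are assumptions.

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a statevector (bit `target` of the index)."""
    new = [0j] * len(state)
    for idx, amp in enumerate(state):
        bit = (idx >> target) & 1
        for out_bit in (0, 1):
            j = (idx & ~(1 << target)) | (out_bit << target)
            new[j] += gate[out_bit][bit] * amp
    return new

def apply_cx(state, control, target):
    """Flip qubit `target` on every basis state whose `control` bit is 1."""
    new = [0j] * len(state)
    for idx, amp in enumerate(state):
        j = idx ^ (1 << target) if (idx >> control) & 1 else idx
        new[j] += amp
    return new

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]

state = [1 + 0j, 0j, 0j, 0j]                   # |00>
state = apply_single_qubit_gate(state, H, 0)   # superposition on qubit 0
state = apply_cx(state, control=0, target=1)   # entangle qubit 1 with qubit 0
probabilities = [abs(amp) ** 2 for amp in state]
```

The list `state` holds $2^n$ amplitudes, so the memory cost doubles with every added qubit, which is the scaling limit noted above for classical simulation.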
4. Proposed Methodology
The framework follows a sequential evaluation methodology rather than a tightly coupled runtime architecture. Classical implementations establish algorithmic baselines, quantum implementations assess future feasibility, and FPGA implementations provide the primary deployment-oriented contribution. Results from these domains are integrated through comparative analysis rather than direct runtime interaction.
This section presents the methodological framework adopted for designing, implementing, and evaluating adaptive NTT architectures across classical, quantum, and hardware domains. The approach integrates algorithmic modeling, simulation, and hardware prototyping to construct a unified foundation for post-quantum cryptographic operations. The study began with the algorithmic foundations of polynomial convolution. Linear, positive-wrapped, and negative-wrapped convolutions were implemented in Python to validate their mathematical correctness. These implementations established a baseline for subsequent transformations and introduced cyclic reduction techniques required in LBC.
Classical NTT implementations were examined through two approaches. A matrix-based method was developed by computing intermediate values using modular exponentiation, with coefficients transformed according to derived equations, provided that the sequence length was a power of two and the twiddle factor was a valid primitive root of unity modulo q; while computationally expensive, this method provided fine-grained insight into the transformation process.
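A matrix-style transform of the kind described above might be sketched as follows, including the two validity checks mentioned in the text (power-of-two length and a valid primitive root of unity). The function names and the toy parameters are assumptions; the authors' own implementation may differ.

```python
def ntt_matrix(a, omega, q):
    """Direct O(n^2) NTT: A[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    if n < 2 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    # For power-of-two n, omega is a primitive n-th root iff omega^n = 1 and omega^(n/2) != 1
    if pow(omega, n, q) != 1 or pow(omega, n // 2, q) == 1:
        raise ValueError("omega is not a primitive n-th root of unity mod q")
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q for k in range(n)]

def intt_matrix(a_hat, omega, q):
    """Inverse transform: same matrix with omega^(-1), scaled by n^(-1) mod q."""
    n = len(a_hat)
    n_inv = pow(n, -1, q)                      # modular inverse (Python 3.8+)
    return [x * n_inv % q for x in ntt_matrix(a_hat, pow(omega, -1, q), q)]

q, omega = 17, 4                               # 4 is a primitive 4th root of unity mod 17
impulse_hat = ntt_matrix([1, 0, 0, 0], omega, q)
round_trip = intt_matrix(ntt_matrix([3, 1, 4, 1], omega, q), omega, q)
```

Each of the $n$ outputs touches all $n$ inputs, which makes the quadratic cost of this formulation explicit.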
In parallel, an iterative NTT design was implemented using CT butterflies for the forward transform and GS butterflies for the inverse transform. Twiddle factors were dynamically computed, and modular inverses were derived using the Extended Euclidean Algorithm. This iterative approach emphasized efficiency, reversibility, and alignment with the requirements of lattice-based cryptographic schemes.
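The modular inverses required by the inverse transform (for example $n^{-1}$ and $\omega^{-1}$ modulo $q$) can be obtained with the Extended Euclidean Algorithm, sketched below. The function names are illustrative, and the value 17 is used as an example root modulo the Kyber prime $q = 3329$.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y = g = gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_x, x = x, old_x - quotient * x
        old_y, y = y, old_y - quotient * y
    return old_r, old_x, old_y

def mod_inverse(a, q):
    """Return a^(-1) mod q, or raise ValueError if a is not invertible."""
    g, x, _ = extended_gcd(a % q, q)
    if g != 1:
        raise ValueError(f"{a} has no inverse modulo {q}")
    return x % q

q = 3329
n_inv = mod_inverse(256, q)        # scaling factor for the inverse transform
omega_inv = mod_inverse(17, q)     # inverse of an example twiddle base
```

Python 3.8 and later also provide `pow(a, -1, q)` for the same purpose; the explicit version is shown because it mirrors the method named in the text.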
Quantum arithmetic modules were constructed using Qiskit due to its maturity, community adoption, and integration with real quantum hardware. Fundamental arithmetic operations were realized through both gate-based and QFT-based designs. Binary addition circuits employed X, CX, and CCX gates to simulate carry propagation, with ancillary qubits restored to maintain reversibility.
Figure 7 illustrates the implementation. The QFT was employed to design addition and subtraction circuits, in which controlled phase rotations replaced classical carry chains, as shown in
Figure 8 and
Figure 9. Multiplication was achieved through repeated QFT-based additions, in which the multiplier register was decremented and the accumulator register updated until the operation completed, as depicted in
Figure 10.
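The two addition styles described above can be emulated classically to show the underlying logic. The sketch below is plain Python rather than Qiskit: the first part tracks computational-basis bits through CX and CCX gates for a one-bit full adder, and the second part models QFT-based addition as phase rotations applied in the Fourier basis. All function names are assumptions, and the gate ordering is one common construction that may differ from the circuits in Figures 7 and 8.

```python
import cmath

def cx(bits, control, target):
    bits[target] ^= bits[control]

def ccx(bits, c1, c2, target):
    bits[target] ^= bits[c1] & bits[c2]

def full_adder(a, b, cin):
    """Gate-based full adder on basis states; returns (sum, carry, restored b)."""
    bits = {"a": a, "b": b, "cin": cin, "cout": 0}
    ccx(bits, "a", "b", "cout")      # cout = a AND b
    cx(bits, "a", "b")               # b = a XOR b
    ccx(bits, "b", "cin", "cout")    # cout = carry-out of a + b + cin
    cx(bits, "b", "cin")             # cin now holds the sum bit
    cx(bits, "a", "b")               # restore b so only the outputs changed
    return bits["cin"], bits["cout"], bits["b"]

def qft(state, inverse=False):
    """Dense quantum Fourier transform on a statevector of length N."""
    size, sign = len(state), (-1 if inverse else 1)
    return [
        sum(state[x] * cmath.exp(sign * 2j * cmath.pi * x * k / size) for x in range(size))
        / size ** 0.5
        for k in range(size)
    ]

def qft_add(a, b, n_qubits):
    """Compute (a + b) mod 2^n by adding phases in the Fourier basis."""
    size = 1 << n_qubits
    state = [0j] * size
    state[a] = 1
    fourier = qft(state)
    # The controlled phase rotations of the adder amount to this phase on component k
    fourier = [amp * cmath.exp(2j * cmath.pi * b * k / size) for k, amp in enumerate(fourier)]
    out = qft(fourier, inverse=True)
    probs = [abs(amp) ** 2 for amp in out]
    return max(range(size), key=probs.__getitem__)
```

In the phase-based version no carry chain appears; the wrap-around modulo $2^n$ arises from the periodicity of the phases.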
The quantum NTT was then implemented using recursive CT butterflies incorporating QFT-based modular addition, subtraction, and multiplication. Twiddle factor multiplication was realized through controlled modular operations, and a simplified two-stage pipeline was demonstrated in
Figure 11. Qiskit Aer noise models were integrated to emulate decoherence and assess circuit stability under near-term quantum computing conditions, accounting for hardware limitations. The quantum experiments were conducted entirely using classical simulation backends provided by Qiskit Aer. Accordingly, the reported execution characteristics reflect simulator complexity and circuit modeling overhead rather than the performance of actual quantum hardware.
Due to the current limitations of quantum hardware and quantum circuit simulators, the quantum experiments were intentionally conducted using reduced toy parameters and small polynomial vectors. These experiments were designed solely to demonstrate the feasibility of QFT-based modular arithmetic and recursive butterfly construction rather than to implement the full CRYSTALS-Kyber parameter sets (e.g., n = 256 and q = 3329).
An adaptive radix-mixed NTT architecture was proposed and developed to achieve hardware efficiency. This design enabled configurable selection among radix-2, radix-4, and radix-8 butterfly organizations for design-space exploration and resource-performance evaluation, depending on the polynomial size, thereby balancing throughput and resource utilization. Twiddle factors were precomputed and stored in LUT-based memory, while modular multiplications were optimized using either Barrett or Montgomery reduction.
The architecture was organized into several modules. An adaptive radix controller orchestrated butterfly scheduling, while dedicated BUs implemented radix-specific transformations. The memory hierarchy utilized BRAM with cyclic partitioning to minimize read–write conflicts, and a modular arithmetic co-processor performed reductions using either the Barrett or Montgomery methods.
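The two reduction methods handled by the modular arithmetic co-processor can be illustrated in software. The sketch below assumes the Kyber modulus $q = 3329$, a 24-bit Barrett shift, and a Montgomery radix of $2^{16}$; these constants and the function names are illustrative choices, not the parameters of the hardware design.

```python
Q = 3329
BARRETT_SHIFT = 24                          # 2^24 > Q^2, so any product of residues fits
BARRETT_M = (1 << BARRETT_SHIFT) // Q       # precomputed floor(2^24 / Q)

def barrett_reduce(x):
    """Return x mod Q for 0 <= x < 2^24 using a multiply, a shift, and one correction."""
    t = (x * BARRETT_M) >> BARRETT_SHIFT    # estimate of x // Q, low by at most 1
    r = x - t * Q
    return r - Q if r >= Q else r

R_BITS = 16
R = 1 << R_BITS
Q_NEG_INV = (-pow(Q, -1, R)) % R            # -Q^(-1) mod R (Python 3.8+)

def montgomery_reduce(t):
    """Return t * R^(-1) mod Q for 0 <= t < Q * R."""
    m = ((t & (R - 1)) * Q_NEG_INV) & (R - 1)
    u = (t + m * Q) >> R_BITS               # exact division: low 16 bits are zero
    return u - Q if u >= Q else u

def montgomery_mul(a, b):
    """Multiply two residues by converting to Montgomery form and back."""
    a_m, b_m = (a << R_BITS) % Q, (b << R_BITS) % Q
    prod_m = montgomery_reduce(a_m * b_m)   # equals a*b*R mod Q
    return montgomery_reduce(prod_m)        # strip the remaining factor R
```

Both routines replace a division by $q$ with multiplications and shifts, which is why they map well onto DSP slices and LUT logic.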
HLS directives were applied to optimize FPGA performance, including loop unrolling, pipelining, and dataflow. Resource partitioning techniques further reduced access bottlenecks and improved concurrency.
The methodology can be summarized as an integration of Python-based prototypes for correctness verification, iterative CT and GS implementations as classical baselines, Qiskit circuits for quantum feasibility studies under near-term quantum conditions, together with FPGA-based architectural designs for hardware evaluation.
The proposed framework supports multiple radix organizations, including radix-2, radix-4, and radix-8 implementations. For an $N$-point NTT, the number of computational stages $S$ required for a radix-$r$ decomposition is given by:
$$S = \log_r N$$
where $N$ denotes the transform size and $r$ represents the selected radix. Accordingly, the transform length may be expressed as:
$$N = r^S$$
Higher radix values reduce the number of computational stages required to complete the transform. For example, for $N = 256$, radix-2 requires eight stages, radix-4 requires four stages, and radix-8 requires approximately three stages (since $\log_8 256 \approx 2.67$, one stage must use a lower radix, e.g., $256 = 8 \cdot 8 \cdot 4$). However, the reduction in stage count is accompanied by increased butterfly complexity, additional arithmetic operations, more demanding scheduling requirements, and potentially higher resource utilization. The configurable radix framework was therefore designed to evaluate the implementation trade-offs associated with different radix organizations under a common FPGA architecture. Rather than assuming that a higher radix always yields superior performance, the proposed approach enables comparative assessment of latency, initiation interval, resource utilization, and implementation complexity across alternative radix decompositions.
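The stage counts quoted above can be reproduced with a short calculation. The helper names below are hypothetical, and the greedy mixed-radix plan is only one possible way to factor the transform length; it is not claimed to be the scheduling policy of the adaptive radix controller.

```python
def stage_count(n, radix):
    """Smallest number of radix-r stages whose combined span covers n points."""
    stages, span = 0, 1
    while span < n:
        span *= radix
        stages += 1
    return stages

def mixed_radix_plan(n, radices=(8, 4, 2)):
    """Greedily factor n into stages, preferring the largest available radix."""
    plan = []
    for radix in radices:
        while n > 1 and n % radix == 0:
            plan.append(radix)
            n //= radix
    if n != 1:
        raise ValueError("transform length is not a product of the given radices")
    return plan

counts = {r: stage_count(256, r) for r in (2, 4, 8)}
plan_256 = mixed_radix_plan(256)
```

For $N = 256$ the greedy plan uses two radix-8 stages followed by one radix-4 stage, which is why the radix-8 figure is described as approximate.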
5. Results and Analysis
5.1. Classical NTT Evaluation
The comparative study of classical NTT implementations focused on an iterative Cooley–Tukey (CT) implementation and a naive matrix-based transform used as a conceptual baseline for algorithmic complexity comparison. The direct matrix-based formulation requires $O(n^2)$ arithmetic operations because each output coefficient depends on all input coefficients. In contrast, the iterative Cooley–Tukey NTT reduces the computational cost to $O(n \log n)$ through recursive butterfly decomposition. The proposed FPGA implementation preserves the same asymptotic $O(n \log n)$ complexity while improving practical execution efficiency through hardware parallelism, pipelining, and optimized modular arithmetic. The quantum implementation serves as a proof-of-concept realization and, therefore, is evaluated primarily from a feasibility perspective rather than for a practical complexity advantage under current hardware constraints. Experiments were conducted on a Python platform running on an Intel Core i7 processor (2.9 GHz) with 16 GB RAM. Performance was measured using execution time.
Figure 12 summarizes the results and illustrates the time comparison.
The Python implementations were developed primarily for correctness verification, algorithmic illustration, and relative complexity analysis. They were not intended to represent optimized cryptographic software implementations, such as AVX2-accelerated or assembly-level Kyber libraries used in practical deployments.
The CT method achieved a forward NTT time of 752.8 μs compared to 2728.5 μs for the matrix approach and an inverse NTT time of 432.2 μs vs. 2723.4 μs. This performance difference reflects the nested loop structure of the matrix-based implementation, which incurs quadratic complexity.
Both approaches were tested using polynomial sizes ranging from 16 to 200 coefficients to illustrate relative algorithmic scaling behavior and verify implementation correctness across multiple input dimensions. These experiments were intended as methodological complexity studies rather than benchmarks of practical CRYSTALS-Kyber deployments, which employ a fixed polynomial degree of n = 256. The execution time for the CT method exhibited near-linear growth with respect to the input size, whereas the matrix-based approach grew quadratically. These findings confirm the well-established algorithmic efficiency advantages of structured iterative NTT approaches over direct matrix formulations. Within the proposed cross-platform evaluation framework, the results should therefore be interpreted as conceptual and methodological comparisons that provide baseline validation, not as benchmarks against production-grade cryptographic software implementations or as claims of algorithmic novelty.
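The two formulations compared above can be sketched in Python as follows. This is a minimal illustration, not the code used in the experiments: it assumes a plain cyclic NTT over Z_q with q = 3329 and a power-of-two length, and it derives ω from 17, a primitive 256th root of unity modulo 3329. The function names are hypothetical, and the sketch does not reproduce Kyber's exact (incomplete, negacyclic) transform.

```python
Q = 3329        # Kyber modulus
ROOT_256 = 17   # primitive 256th root of unity modulo Q


def naive_ntt(a, omega, q=Q):
    """Direct O(n^2) evaluation: X[k] = sum_j a[j] * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def ct_ntt(a, omega, q=Q):
    """Iterative Cooley-Tukey NTT, O(n log n); len(a) must be a power of two."""
    a = list(a)
    n = len(a)
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) stages of radix-2 butterflies.
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)   # primitive `length`-th root
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % q
                a[start + k] = (u + v) % q
                a[start + k + half] = (u - v) % q
                w = w * w_len % q
        length *= 2
    return a


n = 16
omega = pow(ROOT_256, 256 // n, Q)   # primitive n-th root of unity
coeffs = list(range(n))
print(ct_ntt(coeffs, omega) == naive_ntt(coeffs, omega))
```

The naive version performs n modular multiply-accumulates for each of the n outputs, whereas the iterative version performs n/2 butterflies in each of log₂ n stages, which is the source of the scaling gap reported above.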
In CRYSTALS-Kyber, all standardized parameter sets use n = 256. The adaptive FPGA architecture proposed in this work is designed with this target size in mind, while the software-based scalability experiments were conducted to demonstrate comparative growth characteristics between alternative NTT formulations. Finally, hardware memory requirements of the proposed accelerator are represented through BRAM utilization and synthesis statistics reported in Table 4 and Table 5.
5.2. Quantum NTT Evaluation
Quantum experiments were conducted using the Qiskit qasm simulator backend to evaluate the structural feasibility of QFT-based NTT circuits under controlled simulation conditions. The obtained timing and memory observations therefore represent classical simulation costs rather than deployable quantum hardware performance. Due to current limitations in qubit availability, circuit depth, coherence time, and simulator scalability, the quantum experiments were restricted to small proof-of-concept polynomial vectors (e.g., [1,2,3,4]) and simplified modular settings. These reduced parameters were selected exclusively for a feasibility demonstration of QFT-based arithmetic and recursive NTT construction, rather than for full-scale implementation of Kyber parameters (n = 256, q = 3329). Results showed that quantum NTT was significantly slower and consumed more memory than the classical CT implementation.
Scaling the presented proof-of-concept circuits to the CRYSTALS-Kyber polynomial degree (n = 256) would require substantially larger quantum resources, including increased logical qubit counts, deeper arithmetic circuits, and significantly greater error-correction overhead. Consequently, the present implementation should be viewed as a feasibility study of quantum NTT construction rather than a practical realization of Kyber-scale processing.
Execution overhead stemmed from limited qubit availability, the high circuit depth introduced by recursive butterfly operations, and gate fidelity issues in multi-qubit operations. Additional burdens included simulation overhead, the exponential growth in representing quantum states, and the need for auxiliary registers. Noise-aware simulations highlighted the sensitivity of quantum NTT to decoherence. Increased circuit depth degraded accuracy, especially in modular exponentiation. Although Qiskit Aer allowed partial mitigation through optimized circuit design, error rates remained high.
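The exponential growth in state representation can be made concrete with a short Python sketch. It assumes a dense state-vector simulation that stores one double-precision complex amplitude (16 bytes) per basis state; whether the simulator backend used in the experiments stores states this way is an assumption, and the qubit counts below are illustrative rather than the register sizes of the reported circuits.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector: 2**num_qubits complex amplitudes."""
    return (2 ** num_qubits) * bytes_per_amplitude


for qubits in (4, 16, 30):
    print(f"{qubits} qubits: {statevector_bytes(qubits):,} bytes")
```

Each additional qubit doubles the memory requirement, so auxiliary registers for modular arithmetic add directly to the exponent.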
These findings emphasize that quantum NTT is currently impractical for large-scale cryptographic workloads. However, the experiments demonstrate proof-of-concept feasibility and highlight the potential of quantum parallelism for future implementations once hardware matures.
5.3. FPGA-Based Adaptive Radix NTT
The adaptive radix-mixed NTT was synthesized and evaluated on a Xilinx Virtex UltraScale+ FPGA (xcvu9p-fsgd2104-3-e) through the Amazon AWS cloud platform. Development was performed using the Vivado Design Suite (2020.2) with HLS optimizations. The synthesis results are summarized in Table 4. The table reports the performance of radix-2, radix-4, and radix-8 configurations using Barrett and Montgomery reductions. Montgomery reduction consistently achieved lower clock periods (4.32–5.55 ns) than Barrett reduction (6.83–8.64 ns), resulting in higher maximum operating frequencies of up to 231.48 MHz. Latency analysis showed that radix-2 achieved 32,804 cycles with Montgomery reduction, compared to 65,571 cycles with Barrett reduction. Resource utilization was also favorable: Montgomery consistently required fewer DSPs, LUTs, and FFs than Barrett across all radices, while BRAM usage remained constant at four blocks. From a design perspective, radix-2 provided the most efficient trade-off between performance and resource utilization, particularly for constrained FPGA environments. Although higher-radix NTT algorithms reduce the number of theoretical transform stages, the synthesized radix-4 and radix-8 implementations exhibited higher cycle latency in this design because of increased butterfly complexity, resource sharing, twiddle-factor scheduling, and higher initiation intervals. Specifically, the initiation interval increased from 1 for radix-2 to 3 for radix-4 and 6 for radix-8, which offset the theoretical stage reduction. Therefore, radix-2 achieved the best latency and resource-efficiency trade-off in the implemented HLS architecture. These findings are consistent with the expected scaling trade-offs in mixed-radix architectures.
The latency results in Table 4 should be interpreted in the context of the synthesized HLS implementation rather than ideal radix complexity alone. In this implementation, radix-4 and radix-8 require more complex butterfly scheduling and additional modular arithmetic operations, while resource sharing limits the number of operations that can be executed concurrently. This increases the initiation interval and introduces memory-access and control overhead. Consequently, the theoretical reduction in NTT stages for higher-radix designs is not directly reflected in the measured cycle latency.
Because CRYSTALS-Kyber uses a fixed polynomial degree of (n = 256), the mixed-radix controller should be interpreted primarily as a configurable design-space exploration mechanism rather than a required runtime adaptation feature. The synthesis results indicate that the radix-2 Montgomery configuration provides the most favorable resource-latency trade-off for the evaluated Kyber-oriented implementation.
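For reference, the two modular reduction strategies compared in this subsection can be modeled in software as follows. This Python sketch illustrates the textbook Barrett and Montgomery techniques for q = 3329; it is not the HLS code of the proposed accelerator. The Montgomery radix R = 2^16, the Barrett shift of 24 bits, and all function names are assumptions chosen for illustration.

```python
Q = 3329                           # Kyber modulus
R_BITS = 16                        # assumed Montgomery radix R = 2**16
R = 1 << R_BITS
Q_NEG_INV = (-pow(Q, -1, R)) % R   # -q^{-1} mod R (requires Python 3.8+)

BARRETT_SHIFT = 24                 # assumed; 2**24 exceeds q*q
BARRETT_MUL = (1 << BARRETT_SHIFT) // Q


def barrett_reduce(t: int) -> int:
    """Return t mod q for 0 <= t < 2**24 using one multiply and one shift."""
    quotient = (t * BARRETT_MUL) >> BARRETT_SHIFT   # floor(t/q) or one less
    r = t - quotient * Q                            # lies in [0, 2q)
    return r - Q if r >= Q else r


def montgomery_reduce(t: int) -> int:
    """Return t * R^{-1} mod q for 0 <= t < q * R."""
    m = (t * Q_NEG_INV) & (R - 1)   # makes t + m*q divisible by R
    u = (t + m * Q) >> R_BITS       # exact division by R, result in [0, 2q)
    return u - Q if u >= Q else u


def to_montgomery(a: int) -> int:
    """Map a to its Montgomery form a * R mod q."""
    return (a * R) % Q


def montgomery_mul(a_mont: int, b: int) -> int:
    """Multiply a (in Montgomery form) by b (plain form); returns a*b mod q."""
    return montgomery_reduce(a_mont * b)


a, b = 1234, 2345
print(barrett_reduce(a * b), montgomery_mul(to_montgomery(a), b), (a * b) % Q)
```

In an NTT, twiddle factors can be stored in Montgomery form ahead of time, so each butterfly multiplication needs only the single reduction in `montgomery_mul`.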
5.4. Comparative Analysis
To evaluate the competitiveness of the proposed architecture, a comparison with state-of-the-art NTT implementations was conducted. The results are summarized in Table 5. The proposed design required only five DSPs, significantly fewer than the 36 DSPs reported in [52] and the 26 DSPs reported in [37]. LUT and FF usage were also markedly lower than those reported in [37,53]. In terms of clock performance, the design achieved a clock period of 4.322 ns and an estimated frequency of 231.48 MHz, outperforming many contemporary FPGA-based NTT accelerators. These results confirm the resource efficiency and strong competitiveness of the adaptive design.
Beyond resource utilization and timing metrics, the proposed architecture differs from prior NTT accelerators in several respects. Unlike fixed-radix architectures reported in [29,37], the proposed framework supports adaptive radix operation through configurable radix-2, radix-4, and radix-8 processing modes. This flexibility enables exploration of different resource-performance trade-offs depending on implementation constraints and application requirements. Furthermore, the architecture evaluates both Barrett and Montgomery reduction strategies within a unified framework, allowing systematic assessment of arithmetic-resource trade-offs. The design also incorporates LUT-based twiddle-factor management and resource-aware scheduling techniques intended to reduce DSP utilization while maintaining competitive operating frequencies.
For the FPGA implementation, memory utilization is assessed through BRAM resource consumption rather than software memory allocation metrics.
While several prior works achieve higher throughput through aggressive parallelization and dedicated datapath replication, such approaches often incur substantial DSP and logic overhead. The proposed design prioritizes resource efficiency and configurability, achieving competitive operating frequencies while maintaining significantly lower DSP utilization. This design philosophy differs from throughput-oriented accelerators and targets practical deployment scenarios where hardware resources are constrained.
The hardware evaluation focuses on resource utilization, latency, and operating frequency metrics obtained from FPGA synthesis and implementation reports. Detailed power characterization was outside the scope of the present study.
Recent FPGA accelerators for CRYSTALS-Kyber have increasingly focused on highly specialized optimization techniques, including merged-NTT architectures, memory-efficient butterfly scheduling, pipelined modular arithmetic units, and architecture-specific throughput enhancements. These designs often prioritize maximum performance for fixed Kyber parameters and specific deployment scenarios. In contrast, the proposed implementation emphasizes evaluating configurable radix organizations within a unified FPGA framework. While this design objective may differ from that of highly optimized application-specific accelerators, it enables a systematic investigation of the trade-offs among radix decomposition, implementation complexity, resource utilization, and latency. Consequently, the presented results should be interpreted as an architectural evaluation of alternative NTT implementation strategies rather than as a direct replacement for highly specialized Kyber acceleration architectures.
5.5. Discussion
The hybrid methodology demonstrates clear advantages. Classical CT implementations establish efficiency and scalability, FPGA acceleration provides practical deployment through resource-efficient designs, and quantum circuits offer a vision of future computational potential. This combination ensures that current performance requirements are met while also maintaining long-term scalability.
Limitations remain evident. Quantum NTT currently remains impractical for Kyber-scale deployment due to excessive circuit depth, large qubit requirements, error propagation, and simulator complexity. Consequently, the quantum experiments in this work were limited to reduced proof-of-concept parameter settings and should be interpreted as exploratory demonstrations rather than production-scale implementations.
From a security perspective, the proposed design maintains Kyber’s cryptographic strength. All operations conform to the mathematical structure of Module-LWE and preserve the IND-CPA and IND-CCA2 security guarantees. Introducing adaptive radix and hybrid implementation strategies does not alter the underlying hardness assumptions. Instead, the work demonstrates that Kyber can be efficiently supported across heterogeneous platforms while preserving its cryptographic security guarantees.
The quantum experiments presented in this work should not be interpreted as a practical deployment model involving runtime interaction between FPGA accelerators and remote quantum simulators. Rather, the quantum component serves as an exploratory feasibility study, while the FPGA architecture constitutes the primary deployment-oriented contribution. Consequently, the proposed framework is intended for comparative evaluation and future research exploration rather than immediate heterogeneous execution.
Although quantum NTT algorithms offer theoretical asymptotic advantages, practical deployment for Kyber-scale parameters remains constrained by current hardware limitations. Scaling the proposed proof-of-concept circuits to n = 256 would substantially increase circuit depth, qubit requirements, and fault-tolerant resource overhead. A comprehensive resource-estimation study for full CRYSTALS-Kyber parameter sets remains an important direction for future work.
A limitation of the current adaptive radix implementation is that radix-4 and radix-8 were synthesized under resource-sharing constraints, which increased initiation intervals and cycle latency. Future work will investigate fully parallel radix-4/radix-8 butterfly pipelines and optimized memory banking to better exploit the theoretical latency advantages of higher-radix NTT algorithms.
While the proposed architecture preserves the underlying cryptographic security assumptions of CRYSTALS-Kyber, practical hardware implementations may remain vulnerable to implementation-level side-channel attacks, including timing, power, and electromagnetic analysis. The present study focuses on architectural optimization and performance evaluation and therefore does not include a dedicated side-channel assessment. Investigation of leakage characteristics, constant-time implementation properties, and countermeasures against side-channel attacks represents an important direction for future work.
Recent Kyber-oriented accelerator architectures have explored additional optimization techniques including NTT merging, polynomial decomposition strategies, and hybrid arithmetic structures. These approaches generally prioritize throughput maximization through specialized datapath optimization, whereas the present work focuses on configurable mixed-radix operation and resource-efficient implementation trade-offs.
6. Conclusions
A hybrid Number Theoretic Transform (NTT) framework for CRYSTALS-Kyber has been presented. The framework integrates optimized classical algorithms, Qiskit-based quantum prototypes, and an adaptive radix-mixed FPGA accelerator. The design objectives were to reduce NTT latency, lower hardware resource consumption, and explore quantum acceleration while maintaining Kyber’s security properties. Experimental evaluation demonstrated clear benefits. Classical iterative Cooley–Tukey NTT was shown to outperform a naive matrix-based reference implementation in conceptual Python-based experiments intended for algorithmic comparison. FPGA synthesis on a Xilinx Virtex UltraScale+ confirmed that the adaptive radix architecture yields strong trade-offs in terms of timing and area. Using Montgomery reduction, the radix-2 implementation achieved an estimated maximum of 231.48 MHz, a clock period of 4.32 ns, low latency (≈32,804 cycles), and minimal DSP usage (5 DSPs).
Quantum prototypes implemented with Qiskit provided a simulation-based proof of concept for QFT-based NTT circuit construction and feasibility analysis. These experiments were intentionally performed using reduced parameter sizes because current quantum hardware and simulation environments cannot yet efficiently support Kyber-scale modular arithmetic. However, quantum circuits incurred substantial circuit depth and simulation overhead under noise models. Several limitations were identified. Quantum NTT remains constrained by qubit counts, circuit depth, and decoherence, which limit its applicability on near-term quantum hardware. Because the experiments were conducted on classical simulators rather than physical quantum hardware, the reported metrics should not be interpreted as practical deployment performance. FPGA implementations encounter scalability challenges as polynomial degrees and modulus sizes grow; larger parameter sets will increase DSP, BRAM, and logic requirements and may exceed the capacity of mid-range devices. Security analysis confirmed that the proposed engineering changes do not alter the underlying Module-LWE hardness assumptions. The adaptive design and runtime radix selection preserve Kyber’s mathematical structure and the IND-CPA and IND-CCA2 security properties described in the paper.
Future work will pursue three directions. First, quantum circuit optimizations will be investigated to reduce depth and ancilla usage, including the use of approximate arithmetic and hybrid quantum-classical subroutines. Second, multi-FPGA and heterogeneous CPU+FPGA mappings will be explored to scale the adaptive architecture to larger Kyber parameter sets. Third, a formal side-channel and implementation security assessment will be performed to verify that performance optimizations remain constant-time and resistant to practical attacks. These efforts aim to transition the hybrid NTT framework from prototype to deployable implementations without weakening cryptographic assurances. The hybrid approach strikes a balance between the immediate practicality of FPGA and a forward-looking quantum research agenda. The result is a resource-efficient, high-performance NTT solution compatible with CRYSTALS-Kyber and amenable to further optimization as quantum and FPGA technologies evolve.
Future research will investigate resource-estimation models for Kyber-scale quantum NTT implementations, including logical-qubit requirements, circuit-depth growth, and fault-tolerant execution costs.
Additionally, future work will extend the hardware evaluation through detailed power characterization and Power–Area–Timing (PAT) analysis to further quantify implementation trade-offs across alternative radix configurations and modular reduction strategies.