Review

Lattice-Based Cryptographic Accelerators for the Post-Quantum Era: Architectures, Optimizations, and Implementation Challenges

1 Department of Computer Science and Computer Information Systems, Auburn University at Montgomery, Montgomery, AL 36117, USA
2 Department of Computer Science and Engineering, Santa Clara University, 500 El Camino Real, Santa Clara, CA 95053, USA
* Author to whom correspondence should be addressed.
Electronics 2026, 15(2), 475; https://doi.org/10.3390/electronics15020475
Submission received: 17 December 2025 / Revised: 13 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026

Abstract

The imminent threat of large-scale quantum computers to modern public-key cryptosystems has led to extensive research into post-quantum cryptography (PQC). Lattice-based schemes have emerged as the leading candidates among existing PQC schemes due to their strong security guarantees, versatility, and relatively efficient operations. However, the computational cost of lattice-based algorithms, including arithmetic operations such as the Number Theoretic Transform (NTT), polynomial multiplication, and sampling, poses considerable performance challenges in practice. This survey offers a comprehensive review of hardware acceleration for lattice-based cryptographic schemes, specifically the architectural and implementation details of the standardized algorithms CRYSTALS-Kyber, CRYSTALS-Dilithium, and FALCON (Fast Fourier Lattice-Based Compact Signatures over NTRU). It examines optimization measures at various levels, such as algorithmic optimization, arithmetic unit design, memory hierarchy management, and system integration. The paper compares the performance measures (throughput, latency, area, and power) of Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations. We also address major implementation issues: side-channel resistance, resource constraints within IoT (Internet of Things) devices, and the trade-offs between performance and security. Finally, we point out new research opportunities and existing challenges, with implications for hardware accelerator design in the post-quantum cryptographic environment.

1. Introduction

Large-scale quantum computers, once realized, will pose an existential threat to the cryptographic infrastructure on which today’s digital security is built. Shor’s algorithm [1], when executed on a sufficiently large quantum computer, can solve the integer factorization and discrete logarithm problems in polynomial time, rendering public-key cryptosystems such as RSA (Rivest–Shamir–Adleman) and ECC (Elliptic Curve Cryptography) insecure. With recent advances in quantum computing, such as Google’s demonstration of quantum supremacy and IBM’s push toward 1000+ qubit systems, the need for quantum-resistant security mechanisms has become urgent. To mitigate the quantum threat, (i) the National Security Agency (NSA) requested that all National Security Systems implement post-quantum cryptography (PQC) by 2035, and (ii) the National Institute of Standards and Technology (NIST) published formal standards for the first quantum-resistant algorithms in August 2024 [2].
Among the various PQC approaches, including code-based, hash-based, multivariate-based, and isogeny-based cryptography, lattice-based schemes are the prevailing paradigm. The NIST PQC standardization process, which started in 2016 and completed its fourth round in 2025, selected three lattice-based algorithms for standardization: CRYSTALS-Kyber, renamed ML-KEM (Module-Lattice-Based Key Encapsulation Mechanism) and published in Federal Information Processing Standards (FIPS) 203 for key encapsulation; CRYSTALS-Dilithium, renamed ML-DSA (Module-Lattice-Based Digital Signature Algorithm) and published in FIPS 204 for digital signatures; and FALCON (Fast Fourier Lattice-Based Compact Signatures over NTRU), due to be published in FIPS 206 as an alternative signature scheme. These schemes combine strong security guarantees, grounded in worst-case hardness assumptions, with efficient operations.
However, the computational demands of lattice-based cryptographic operations are substantial and hinder real-world adoption. Key operations in the underlying mathematical framework, such as the Number Theoretic Transform (NTT), polynomial multiplication over structured rings, modular arithmetic with large moduli, and Gaussian sampling, impose a much greater computational burden than classical elliptic curve operations [3]. For example, a single CRYSTALS-Kyber key encapsulation requires around 10,000 modular multiplications over 256-coefficient polynomials [4,5], and CRYSTALS-Dilithium signature generation relies on rejection sampling, resulting in variable execution time [6]. Despite improvements from algorithmic optimization and Single Instruction Multiple Data (SIMD) instruction exploitation in software implementations on general-purpose processors, these implementations remain inadequate for latency-sensitive applications such as secure network communications (e.g., Transport Layer Security (TLS) and Datagram Transport Layer Security (DTLS) handshakes), real-time cryptographic protocols, and resource-constrained Internet of Things (IoT) devices [7]. The performance gap is amplified further in high-throughput settings: in a modern data center handling millions of TLS connections per second, cryptographic operations typically must complete within microseconds, a goal that is often impossible to achieve in software.
An enabling technology for closing this performance gap and facilitating practical adoption is hardware acceleration of lattice-based PQC. Dedicated hardware accelerators, built on Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs), exploit the parallelism inherent in polynomial operations, optimize datapath widths for modular arithmetic, and employ specialized memory architectures to eliminate bottlenecks in coefficient access patterns. FPGA implementations offer reconfigurability that supports a fast design and development pace, while ASIC implementations provide better performance and energy efficiency for high-volume processing. Moreover, cryptographic systems deployed in adversarial environments are vulnerable to side-channel attacks (SCAs). Hardware implementations enable constant-time execution, making them resistant to timing side-channel attacks, and allow the integration of countermeasures against physical attacks.
However, existing evaluations of PQC implementation, though helpful, are often narrow in scope. The surveys of PQC are very broad and address multiple cryptographic families, but few studies provide details about hardware-specific optimizations for lattice-based schemes. Although side-channel analysis-related surveys cover security issues, they often focus less on assessing performance-driven architectural innovations. Surveys of software-based polynomial multiplication techniques inform algorithmic designs but are not directly applicable to hardware limitations such as memory bandwidth, Digital Signal Processing (DSP) block utilization, and routing congestion. To the authors’ best knowledge, no survey has concentrated on works published in 2022–2025 that cover hardware acceleration architectures for NIST-standardized lattice-based schemes. To fill this gap, we summarize our contribution as follows.
This survey offers the first comprehensive, systematically organized review of hardware acceleration techniques for lattice-based post-quantum cryptographic schemes, with a particular focus on NIST-standardized algorithms and their hardware realizations. Our specific contributions include the following:
  • Comprehensive Taxonomy of Hardware Architectures: We develop a multi-dimensional classification framework for lattice-based cryptographic accelerators. We categorize implementations across multiple axes, including (a) algorithmic improvements, (b) integration approaches such as standalone accelerators, coprocessors, and instruction-set extensions, (c) implementation platforms (FPGA, ASIC), and (d) optimization goals (throughput-oriented, area-constrained, energy-efficient, side-channel-resistant). This taxonomy enables systematic comparison and highlights architectural trends in the field.
  • Systematic Analysis of Optimization Strategies: We decompose and differentiate the optimization strategies applied to the reviewed implementations across various abstraction levels, including NTT, polynomial multiplication, modular reduction, memory architecture, and arithmetic units.
  • Assessment of the Post-Round-3 Selection Landscape: We analyze how the transition from Round 3 candidates to the official standards (CRYSTALS-Kyber → ML-KEM, CRYSTALS-Dilithium → ML-DSA) affects hardware deployment. We investigate parameter changes that require architectural modifications as well as the adaptability of current designs. We also examine the emerging trend of single accelerators supporting multiple algorithms (driven by the standardization of complementary schemes [ML-KEM for encryption, ML-DSA for signatures]), and assess the performance losses and trade-offs related to cryptographic flexibility.
  • Identifying Research Gaps and Pointing to the Future: After a systematic review of the literature, specific gaps are identified: (a) limited ASIC implementations with side-channel protection (only three works are found), (b) inadequate exploration of FALCON hardware acceleration compared to Kyber/Dilithium (four times fewer publications), (c) lack of system-level integration studies addressing TLS/IPsec (Internet Protocol Security) acceleration, and (d) minimal research on emerging technologies such as processing-in-memory architectures for lattice cryptography. We recommend straightforward, actionable research avenues grounded in quantitative gap analysis.
The survey focuses on hardware accelerators for NIST-standardized and finalist lattice-based schemes: CRYSTALS-Kyber/ML-KEM, CRYSTALS-Dilithium/ML-DSA, FALCON, and closely related schemes (like Saber and NTRU) that inform design principles. We mainly examine FPGA and ASIC implementations published between 2020 and 2025, with regard to post-NIST-Round-3 and Round-4 research. Pure software implementations are excluded unless they directly influence hardware design decisions. We also exclude other PQC families (code-based, hash-based, multivariate, and isogeny-based) except where comparative analyses provide relevant context. Implementation security primarily targets side-channel and fault attacks; we do not extensively cover cryptanalytic security, which is mainly explored in dedicated cryptography surveys.

2. Background

In this section, we introduce the mathematical foundations for lattice-based cryptography, provide the foundational knowledge necessary to understand hardware acceleration, and discuss the advantages and trade-offs of two primary hardware implementation platforms, FPGAs and ASICs, for post-quantum cryptographic acceleration.

2.1. Lattice-Based Cryptography Fundamentals

Lattice-based cryptography bases its security on the computational intractability of well-studied problems defined over high-dimensional lattices. Unlike classical public-key algorithms such as RSA, whose security depends on integer factorization, or ECC, which depends on the discrete logarithm problem, lattice-based constructions are thought to withstand both traditional and quantum attacks.

2.1.1. Lattice Problems and Hardness Assumptions

This subsection introduces the fundamental lattice structure and the key hardness assumptions—the Short Vector Problem (SVP) and Learning With Errors (LWE)—that ultimately shape the computations that hardware must accelerate.
Lattice. Mathematically, a lattice is the set of all integer linear combinations of a set of basis vectors embedded in Euclidean space. Formally, a lattice is the discrete additive subgroup generated by integer linear combinations of basis vectors $b_1, b_2, \ldots, b_m \in \mathbb{R}^n$:
$$\Lambda(B) = \{ Bz : z \in \mathbb{Z}^m \} = \{ z_1 b_1 + z_2 b_2 + \cdots + z_m b_m : z_i \in \mathbb{Z} \},$$
where $B = [\, b_1 \mid b_2 \mid \cdots \mid b_m \,]$ is the basis matrix. The dimension of the lattice is $n$, and its rank is $m$ [8].
Shortest Vector Problem (SVP). The security of modern lattice-based cryptography relies primarily on a computational problem named the shortest vector problem: given a lattice basis $B$, find a non-zero lattice vector $v \in \Lambda(B)$ with minimal Euclidean norm $\|v\|$. The decision version of SVP is NP-hard under randomized reductions [9], and no polynomial-time quantum algorithm is known for SVP in general lattices [10]. These hardness results explain why lattice schemes can base their security on geometric problems while still admitting efficient arithmetic implementations.
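To make this concrete, the following Python sketch brute-forces the shortest nonzero vector of a two-dimensional lattice given by a deliberately "bad" basis (an illustrative toy instance, not a cryptographic one); the exponential cost of such enumeration in higher dimensions is precisely what SVP-based security relies on.

```python
# Toy SVP instance: the basis (5, 8), (8, 13) has determinant 1, so it
# generates all of Z^2, yet both basis vectors are long. Brute-force
# enumeration of small integer combinations recovers the shortest nonzero
# vector (norm 1). Feasible only because the dimension is 2.
b1, b2 = (5, 8), (8, 13)   # illustrative "bad" basis; det = 5*13 - 8*8 = 1

shortest = min(
    ((z1 * b1[0] + z2 * b2[0], z1 * b1[1] + z2 * b2[1])
     for z1 in range(-30, 31) for z2 in range(-30, 31)
     if (z1, z2) != (0, 0)),
    key=lambda v: v[0] ** 2 + v[1] ** 2,
)
```

The enumeration window is large enough to contain, e.g., $(z_1, z_2) = (8, -5)$, which yields the unit vector $(0, -1)$, far shorter than either basis vector.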
Learning With Errors (LWE). Introduced by Regev [11], the Learning With Errors (LWE) problem provides a foundation for constructing cryptographic primitives with worst-case to average-case hardness reductions. Parameterized by the dimension $n$, modulus $q$, and error distribution $\chi$, LWE asks the solver to distinguish between two distributions: given many samples, decide whether they come from a “noisy linear equation” distribution or from the uniform distribution. Concretely, the task is to distinguish between the following:
  • Samples of the form $(a_i,\, b_i = \langle a_i, s \rangle + e_i \bmod q)$, where $a_i \xleftarrow{\$} \mathbb{Z}_q^n$ is uniform, $s \in \mathbb{Z}_q^n$ is a fixed secret vector, and $e_i \xleftarrow{\$} \chi$ is a small error drawn from a discrete Gaussian or similar distribution.
  • Uniformly random samples $(a_i, b_i) \xleftarrow{\$} \mathbb{Z}_q^n \times \mathbb{Z}_q$.
Standardized Module-LWE schemes operate over structured rings or modules to enable efficient polynomial arithmetic (often via NTT). Consequently, these mathematical structures directly determine the hardware kernels that implementations must accelerate.

2.1.2. Structured Lattice Problems: Ring-LWE and Module-LWE

Although standard Learning With Errors (LWE) offers strong and well-understood security guarantees, its direct instantiations incur substantial computational and storage overheads. In particular, public keys in standard LWE-based constructions typically require $O(n^2)$ elements, which poses a significant challenge for practical deployments [12]. To address these limitations, many NIST-standardized lattice-based schemes adopt structured variants of LWE that leverage algebraic structure to substantially improve efficiency while preserving provable security foundations.
Ring Learning With Errors (Ring-LWE). Ring-LWE, introduced by Lyubashevsky, Peikert, and Regev [13], reformulates the LWE problem over polynomial rings. Specifically, computations are carried out in the quotient ring
$$R_q = \mathbb{Z}_q[x] / \langle x^n + 1 \rangle,$$
where $n$ is typically chosen as a power of two (e.g., $n = 256, 512, 1024$) and $q$ denotes a prime modulus. Elements of $R_q$ are polynomials of degree at most $n - 1$ with coefficients in $\mathbb{Z}_q$. The Ring-LWE problem is defined as the task of distinguishing between the following two distributions:
  • Samples of the form $(a_i,\, b_i = a_i \cdot s + e_i)$, where $a_i \xleftarrow{\$} R_q$, $s \in R_q$ is a fixed secret polynomial, and $e_i \xleftarrow{\$} \chi$ is drawn from an error distribution (all operations in $R_q$).
  • Uniformly random samples $(a_i, b_i) \xleftarrow{\$} R_q \times R_q$.
By exploiting the ring structure, Ring-LWE reduces public key sizes from $O(n^2)$ to $O(n)$ elements while retaining worst-case to average-case hardness reductions for problems on ideal lattices [13,14]. Moreover, the polynomial representation naturally enables efficient arithmetic through algorithms such as the NTT, which is a key enabler for high-performance implementations, as discussed in Section 3.1.
Module Learning With Errors (Module-LWE). Module-LWE further generalizes Ring-LWE by operating over modules of the form $R_q^k$, where $k$ is a small positive integer, typically $k \in \{2, 3, 4\}$ [15]. This problem formulation underlies several NIST-standardized schemes, including CRYSTALS-Kyber and CRYSTALS-Dilithium. Concretely, the Module-LWE problem requires distinguishing between the following:
  • Samples of the form $(A,\, b = A \cdot s + e)$, where $A \xleftarrow{\$} R_q^{k \times k}$ is uniformly random and $s, e \in R_q^k$ are the secret and error vectors, respectively (all operations over $R_q$).
  • Uniformly random samples $(A, b) \xleftarrow{\$} R_q^{k \times k} \times R_q^k$.
Module-LWE occupies an intermediate position between standard LWE (which can be viewed as the case k = n ) and Ring-LWE (corresponding to k = 1 ). This formulation enables a flexible trade-off between security and efficiency by tuning the module rank k, while maintaining hardness reductions to well-defined problems on module lattices [16]. As a result, Module-LWE-based schemes can conveniently support multiple NIST security levels—namely, Levels 1, 3, and 5—corresponding approximately to 128-, 192-, and 256-bit classical security, respectively, through appropriate parameter selection [17,18].

2.1.3. Key Operations in Lattice-Based Schemes

Efficient hardware realizations of lattice-based cryptographic schemes must support a small set of core arithmetic and probabilistic operations that dominate overall performance and resource utilization [3,19]. These operations are briefly summarized below.
Polynomial Multiplication. Polynomial multiplication over the ring $R_q$ constitutes the primary computational bottleneck in Module-LWE-based schemes. A straightforward schoolbook implementation incurs a quadratic complexity of $O(n^2)$ coefficient multiplications. In contrast, NTT-based techniques exploit the underlying ring structure to reduce the complexity to $O(n \log n)$ [20]. For typical parameter choices such as $n = 256$, this asymptotic improvement translates into an approximate $18\times$ reduction in the number of arithmetic operations, making NTT-based multiplication indispensable for high-performance implementations.
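The $O(n \log n)$ route can be sketched as follows. The modulus here is an illustrative assumption: we use the FALCON-style NTT-friendly prime $q = 12{,}289$ (for which $2n \mid q - 1$ at $n = 256$), since Kyber's $q = 3329$ only admits an incomplete NTT that needs extra handling.

```python
# NTT-based negacyclic multiplication in Z_q[x]/(x^n + 1), sketched with the
# NTT-friendly prime q = 12289 (illustrative choice; not Kyber's modulus).
q, n = 12289, 256

def ntt(a, w):
    """Recursive radix-2 cyclic NTT; w is a primitive len(a)-th root mod q."""
    m = len(a)
    if m == 1:
        return a[:]
    even, odd = ntt(a[0::2], w * w % q), ntt(a[1::2], w * w % q)
    out, wk = [0] * m, 1
    for k in range(m // 2):
        t = wk * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + m // 2] = (even[k] - t) % q
        wk = wk * w % q
    return out

# psi: 2n-th root of unity with psi^n = -1, enabling the negacyclic "twist"
psi = next(p for g in range(2, q)
           for p in [pow(g, (q - 1) // (2 * n), q)]
           if pow(p, n, q) == q - 1)
psi_inv = pow(psi, q - 2, q)
w, w_inv, n_inv = psi * psi % q, pow(psi * psi % q, q - 2, q), pow(n, q - 2, q)

def negacyclic_mul(a, b):
    """a * b mod (x^n + 1, q) in O(n log n): twist, NTT, pointwise, inverse."""
    at = ntt([x * pow(psi, i, q) % q for i, x in enumerate(a)], w)
    bt = ntt([x * pow(psi, i, q) % q for i, x in enumerate(b)], w)
    ct = ntt([x * y % q for x, y in zip(at, bt)], w_inv)      # un-scaled iNTT
    return [c * n_inv % q * pow(psi_inv, i, q) % q for i, c in enumerate(ct)]
```

Compared with the $n^2 = 65{,}536$ coefficient products of the schoolbook method, the transform-based route performs a few thousand multiplications (three transforms of $(n/2)\log_2 n$ butterflies each, plus $n$ pointwise products), consistent with the $\approx 18\times$ savings quoted above.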
Modular Reduction. All coefficient arithmetic in lattice-based schemes is performed modulo a scheme-specific modulus q, necessitating efficient modular reduction mechanisms. To facilitate optimized hardware implementations, NIST-standardized schemes deliberately select moduli having favorable arithmetic properties:
  • CRYSTALS-Kyber ($q = 3329$): a relatively small modulus that enables efficient 16-bit arithmetic operations [5].
  • CRYSTALS-Dilithium ($q = 8{,}380{,}417 \approx 2^{23}$): an NTT-friendly prime that supports fast modular multiplication and reduction [6].
  • FALCON ($q = 12{,}289$): a modulus tailored for FFT-based polynomial arithmetic [21].
Commonly adopted reduction techniques include Barrett reduction, which relies on precomputed constants; Montgomery reduction, which exploits power-of-two representations; and Plantard arithmetic, which eliminates the need for double-width multiplications [22,23,24].
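Two of the techniques above can be sketched directly (a minimal Python model; the precision parameter K and radix R are illustrative choices for Kyber's modulus, not values mandated by any standard):

```python
# Barrett and Montgomery reduction sketches for Kyber's q = 3329.
q = 3329

K = 24                       # Barrett precision (illustrative); inputs < 2^24
M = (1 << K) // q            # precomputed constant floor(2^K / q)

def barrett_reduce(a):
    """Reduce 0 <= a < 2^24 mod q using a mul, a shift, and one correction."""
    r = a - ((a * M) >> K) * q          # r lands in [0, 2q)
    return r - q if r >= q else r

R = 1 << 16                  # Montgomery radix matching a 16-bit datapath
QINV = pow(-q, -1, R)        # precomputed -q^{-1} mod R

def montgomery_reduce(a):
    """Given 0 <= a < q*R, return a * R^{-1} mod q without dividing by q."""
    m = (a * QINV) & (R - 1)            # low 16 bits only
    t = (a + m * q) >> 16               # exact: a + m*q is divisible by R
    return t - q if t >= q else t
```

Note that Montgomery reduction returns $a \cdot R^{-1} \bmod q$, so operands are typically kept in "Montgomery form", whereas Barrett reduction works on plain representatives; Plantard arithmetic, also cited above, is omitted from this sketch.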
Sampling. Random sampling from carefully defined distributions is another critical component of lattice-based cryptographic protocols. In particular, implementations must efficiently support the following:
  • Uniform sampling over $\mathbb{Z}_q$ for generating public matrices $A$.
  • Centered binomial distributions (CBD) for error sampling in CRYSTALS-Kyber and CRYSTALS-Dilithium [5,6].
  • Discrete Gaussian sampling for signature generation in FALCON [21].
  • Rejection sampling mechanisms employed during CRYSTALS-Dilithium signature generation [6].
These sampling procedures often impose stringent requirements on randomness quality, timing behavior, and hardware efficiency.
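As an illustration of the centered binomial case, the sketch below derives CBD samples from a SHAKE-128 output stream in the style of Kyber; the direct seed-to-stream wiring is a simplification of the schemes' actual PRF domain separation.

```python
import hashlib

# Centered binomial sampler CBD_eta: each coefficient is the difference of
# two eta-bit popcounts, yielding values in [-eta, eta]. Using shake_128(seed)
# directly as the PRF is an illustrative simplification.
def cbd(seed: bytes, eta: int, n: int = 256):
    stream = hashlib.shake_128(seed).digest(eta * n // 4)   # 2*eta bits/coeff
    bits = [(byte >> i) & 1 for byte in stream for i in range(8)]
    coeffs = []
    for j in range(n):
        chunk = bits[2 * eta * j : 2 * eta * (j + 1)]
        coeffs.append(sum(chunk[:eta]) - sum(chunk[eta:]))  # popcount diff
    return coeffs
```

The sampler is deterministic in the seed and branch-free, both desirable properties for the timing behavior and hardware efficiency requirements noted above.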
Compression and Encoding. To reduce communication bandwidth and storage requirements, lattice-based schemes incorporate coefficient compression techniques that map values from $\lceil \log_2 q \rceil$ bits to a smaller representation of $d$ bits, where $d$ is typically 10 or 11 in CRYSTALS-Kyber [17]. While this compression introduces controlled information loss, it significantly improves practical efficiency. Consequently, hardware implementations must seamlessly integrate compression and decompression logic alongside the core arithmetic operations.
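The compression map and its round-trip error can be sketched as follows (rounding conventions follow the description above; the worst-case error checked below, roughly $q/2^{d+1}$, is the standard bound):

```python
# Kyber-style coefficient compression: x in Z_q -> d bits and back.
q = 3329

def compress(x, d):
    # round(2^d * x / q) mod 2^d, with round-half-up integer arithmetic
    return ((x << (d + 1)) + q) // (2 * q) % (1 << d)

def decompress(y, d):
    # round(q * y / 2^d)
    return (y * q + (1 << (d - 1))) >> d
```

For $d = 10$ the centered round-trip error never exceeds 2, which is what allows the schemes to absorb compression noise into their existing error terms.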

2.2. NIST-Standardized Lattice-Based Algorithms

The NIST PQC standardization effort, initiated in 2016 and culminating in the announcement of Round 3 selections in 2022, identified three lattice-based schemes for standardization [25]. This subsection reviews these algorithms from an architectural perspective, highlighting their operational characteristics and emphasizing computational aspects that are most relevant to hardware implementation.

2.2.1. CRYSTALS-Kyber/ML-KEM

CRYSTALS-Kyber, standardized as the Module-Lattice-Based Key Encapsulation Mechanism (ML-KEM) in FIPS 203 [17], is an IND-CCA2-secure KEM (indistinguishable under adaptive chosen-ciphertext attacks, with decapsulation-oracle access excluding the challenge ciphertext) intended for general-purpose key establishment in widely deployed protocols such as TLS, IPsec, and SSH (Secure Shell).
Parameter Sets. ML-KEM specifies three parameter sets corresponding to increasing NIST security levels [17]:
  • ML-KEM-512 (NIST Level 1): $(k, \eta_1, \eta_2) = (2, 3, 2)$, with $q = 3329$ and $n = 256$.
  • ML-KEM-768 (NIST Level 3): $(k, \eta_1, \eta_2) = (3, 2, 2)$, with $q = 3329$ and $n = 256$.
  • ML-KEM-1024 (NIST Level 5): $(k, \eta_1, \eta_2) = (4, 2, 2)$, with $q = 3329$ and $n = 256$.
Here, the parameter $k$ determines the module rank, while $\eta_1$ and $\eta_2$ define the centered binomial distributions used for sampling secret and error polynomials.
Key Operations. The ML-KEM construction (Algorithm 1) consists of three core algorithms [5,18]: key generation KeyGen(), encapsulation Encaps( p k ), and decapsulation Decaps( s k , c).
Algorithm 1 ML-KEM Construction: KeyGen, Encaps, and Decaps.
Note: $\mathrm{CBD}_\eta$ denotes the centered binomial distribution (for sampling small secret/error polynomials); KDF denotes a key-derivation function.
KeyGen():
Input: None. Output: Public key $pk = (\rho, t)$, secret key $sk = s$.
1. Sample seed $\rho \xleftarrow{\$} \{0, 1\}^{256}$
2. Generate $A \in R_q^{k \times k}$ from $\rho$ using SHAKE-128
3. Sample $s, e \xleftarrow{\$} \mathrm{CBD}_{\eta_1}^k$
4. Compute $t \leftarrow As + e$
5. Return $(pk, sk) \leftarrow ((\rho, t), s)$
Encaps($pk$):
Input: Public key $pk = (\rho, t)$. Output: Ciphertext $c$ and shared key $K$.
1. Sample random message $m \xleftarrow{\$} \{0, 1\}^{256}$
2. Derive auxiliary randomness $r \leftarrow G(m, H(pk))$
3. Regenerate $A \in R_q^{k \times k}$ from seed $\rho$
4. Sample $s', e' \xleftarrow{\$} \mathrm{CBD}_{\eta_2}^k$ and $e'' \xleftarrow{\$} \mathrm{CBD}_{\eta_2}$
5. Compute $u \leftarrow A^T s' + e'$
6. Compute $v \leftarrow t^T s' + e'' + \mathrm{Decompress}(\mathrm{Encode}(m))$
7. Compute $c \leftarrow (\mathrm{Compress}(u), \mathrm{Compress}(v))$
8. Compute $K \leftarrow \mathrm{KDF}(m, H(pk))$
9. Return $(c, K)$
Decaps($sk$, $c$):
Input: Secret key $sk = s$, ciphertext $c$. Output: Shared key $K$ or rejection symbol $\bot$.
1. Recover $(u, v)$ via decompression
2. Compute $m' \leftarrow \mathrm{Decode}(v - s^T u)$
3. Recompute $(c', K') \leftarrow \mathrm{Encaps}(pk; m')$ ▹ to verify ciphertext integrity
4. Return $K \leftarrow \mathrm{KDF}(m', H(pk))$ if $c' = c$, else $\bot$
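To make the dataflow of Algorithm 1 concrete, the toy Python sketch below implements the underlying LPR-style encryption core for a single message bit over $\mathbb{Z}_q[x]/\langle x^n + 1 \rangle$ with deliberately tiny, illustrative parameters ($n = 8$, ternary noise, ring rank 1); all helper names are hypothetical, and the sketch omits compression, hashing, the KDF, and the Fujisaki–Okamoto-style re-encryption check of the real scheme.

```python
import random

# Toy LPR-style core of the KeyGen/Encaps/Decaps flow above, for one bit.
# Parameters and names are illustrative; this is NOT ML-KEM.
q, n = 3329, 8

def polymul(a, b):
    """Schoolbook multiplication in Z_q[x]/(x^n + 1) (negacyclic wrap)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

def add(a, b): return [(x + y) % q for x, y in zip(a, b)]
def sub(a, b): return [(x - y) % q for x, y in zip(a, b)]
def small():   return [random.randint(-1, 1) % q for _ in range(n)]

def keygen():
    A = [random.randrange(q) for _ in range(n)]
    s, e = small(), small()
    return (A, add(polymul(A, s), e)), s          # pk = (A, t = A*s + e), sk = s

def encrypt_bit(pk, m_bit):
    A, t = pk
    r, e1, e2 = small(), small(), small()
    u = add(polymul(A, r), e1)                    # u = A*r + e1
    v = add(polymul(t, r), e2)                    # v = t*r + e2 + m * (q // 2)
    v[0] = (v[0] + m_bit * (q // 2)) % q
    return u, v

def decrypt_bit(s, u, v):
    w = sub(v, polymul(s, u))[0]                  # m*(q/2) + small noise
    return 1 if q // 4 < w < 3 * q // 4 else 0    # decode by rounding
```

Because every noise term is ternary and $n = 8$, the accumulated error stays far below $q/4$ and decoding is exact; real parameter sets make the analogous bound hold with overwhelming probability.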
Computational Profile. For the ML-KEM-768 parameter set, key generation entails [26] the following:
  • One matrix–vector product $As$, corresponding to $k^2 = 9$ polynomial multiplications.
  • 9 forward NTTs, each operating on 256 coefficients.
  • $9 \times 256 = 2304$ coefficient-wise modular multiplications.
  • Three inverse NTT operations for transforming results back to the coefficient domain.
Each size-256 NTT requires on the order of
$$\frac{n}{2} \times \log_2(n) = \frac{256}{2} \times \log_2(256) = 1024$$
modular multiplications. Taking into account that multiple NTT/iNTT (inverse Number Theoretic Transform) operations are required in one encapsulation, together with the coefficient-wise products, the total number of modular multiplications is consistent with the claim that ML-KEM involves tens of thousands of arithmetic operations per execution.
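A quick back-of-the-envelope check of these counts (estimates only; the "ten transform passes per encapsulation" figure is an illustrative assumption):

```python
import math

# Operation-count estimate for ML-KEM-768 (k = 3, n = 256): a radix-2 NTT
# performs (n/2)*log2(n) butterfly multiplications, and the matrix-vector
# product costs k*k pointwise polynomial products of n coefficients each.
n, k = 256, 3
per_ntt = (n // 2) * int(math.log2(n))   # 128 * 8 = 1024 multiplications
pointwise = k * k * n                    # 9 * 256 = 2304 multiplications
total_estimate = 10 * per_ntt + pointwise
```

The estimate lands above 12,000 modular multiplications before sampling, additions, and hashing are counted, in line with the claim above.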
Hardware Implications. Choosing a power-of-two polynomial degree n = 256 enables efficient NTT implementations. The small modulus q = 3329 further simplifies implementation by allowing compact 16-bit datapaths. Furthermore, the matrix-vector structure allows parallel processing of multiple polynomials. Nevertheless, the compression and decompression logic introduces additional architectural overhead that can impact area, latency, and design complexity. These factors must be carefully considered in resource-constrained implementations.

2.2.2. CRYSTALS-Dilithium/ML-DSA

CRYSTALS-Dilithium, standardized as the Module-Lattice-Based Digital Signature Algorithm (ML-DSA) in FIPS 204 [18], provides digital signature functionality based on the Fiat–Shamir-with-aborts paradigm [27].
Parameter Sets. ML-DSA defines three security levels [18]:
  • ML-DSA-44 (Level 2): $(k, \ell) = (4, 4)$, $q = 8{,}380{,}417$, $n = 256$.
  • ML-DSA-65 (Level 3): $(k, \ell) = (6, 5)$, $q = 8{,}380{,}417$, $n = 256$.
  • ML-DSA-87 (Level 5): $(k, \ell) = (8, 7)$, $q = 8{,}380{,}417$, $n = 256$.
Compared to ML-KEM, Dilithium employs significantly larger rectangular matrices of dimension $k \times \ell$.
Key Operations. The core algorithms of ML-DSA construction (Algorithm 2) are key generation KeyGen(), signing Sign( s k , M), and verifying Verify( v k , M, σ ).
Algorithm 2 ML-DSA Construction: KeyGen, Sign, and Verify.
KeyGen():
Input: None. Output: Verification key $vk$ and signing key $sk$.
1. Sample a uniformly random matrix $A \in R_q^{k \times \ell}$.
2. Sample secret vectors $s_1 \in R_q^{\ell}$ and $s_2 \in R_q^{k}$ from bounded distributions.
3. Compute $t = A s_1 + s_2$.
4. Return verification key $vk = (A, t)$ and signing key $sk = (s_1, s_2)$.
Sign($sk$, $M$):
Input: Signing key $sk = (s_1, s_2)$, message $M$ (and public matrix $A$/parameters). Output: Signature $\sigma$.
1. Sample a random masking vector $y \in R_q^{\ell}$.
2. Compute $w = A y$ and derive the challenge $c = H(w, M)$.
3. Compute the response $z = y + c s_1$.
4. Apply rejection sampling to enforce norm bounds.
5. Return the signature $\sigma = (c, z, \mathrm{hint})$.
Verify($vk$, $M$, $\sigma$):
Input: Verification key $vk = (A, t)$, message $M$, signature $\sigma$. Output: Decision bit $b \in \{\mathrm{accept}, \mathrm{reject}\}$.
1. Parse $\sigma$ (e.g., $\sigma = (c, z, \mathrm{hint})$) and reconstruct the intermediate value $w'$.
2. Check the consistency condition involving $A z - c t$ and verify that the derived challenge matches $c$.
3. Accept if all verification conditions are satisfied; otherwise return reject.
Computational Profile. For the ML-DSA-65 parameter set [6]:
  • $k \times \ell = 6 \times 5 = 30$ polynomial multiplications are required for computing $A y$.
  • $\ell = 5$ additional polynomial multiplications are needed for $c s_1$.
  • The average number of signing iterations due to rejection sampling is approximately 4.5, leading to multiple signing attempts.
Each iteration thus involves on the order of 35 NTT-based polynomial multiplications.
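A small Monte-Carlo sanity check of this profile (illustrative: the acceptance probability is modeled as exactly 1/4.5 and iterations as independent):

```python
import random

# Expected signing cost under rejection sampling: if each iteration is
# accepted with probability p = 1/4.5, the iteration count is geometric
# with mean 4.5, and each iteration costs ~35 polynomial multiplications.
random.seed(0)
p_accept = 1 / 4.5

def signing_iterations():
    it = 1
    while random.random() > p_accept:   # reject and restart
        it += 1
    return it

trials = 100_000
mean_it = sum(signing_iterations() for _ in range(trials)) / trials
expected_polymuls = mean_it * 35        # ~ 4.5 * 35 = 157.5 per signature
```

The variable iteration count is exactly what complicates constant-time hardware: total latency varies from signature to signature even when each individual iteration is constant-time.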
Hardware Implications. Compared to ML-KEM, ML-DSA’s larger modulus $q \approx 2^{23}$ necessitates wider (typically 32-bit) datapaths, which increases both area and power consumption. Moreover, the rectangular matrix dimensions ($k \times \ell$) complicate memory organization and access patterns. The signing procedure also relies on rejection sampling, which introduces challenges for constant-time hardware designs [28]. As a result of these combined factors, the overall computational cost is approximately twice that of ML-KEM-768.

2.2.3. FALCON

FALCON [21], to be standardized as FIPS 206, offers the smallest signature sizes among the NIST-selected schemes but relies on fundamentally different computational primitives.
Parameter Sets.
  • FALCON-512 (Level 1): n = 512 , q = 12,289 .
  • FALCON-1024 (Level 5): n = 1024 , q = 12,289 .
Algorithmic Distinctions. In contrast to Module-LWE-based schemes, FALCON employs the following:
  • NTRU lattices rather than Module-LWE constructions [29].
  • Fast Fourier Transform (FFT) operations over $\mathbb{C}$ instead of NTTs over $\mathbb{Z}_q$ [30].
  • High-precision floating-point arithmetic for Gaussian sampling [21].
  • Gram–Schmidt orthogonalization during key generation [31].
Key Operations: The core algorithms of FALCON construction (Algorithm 3) are key generation KeyGen(), signing Sign( s k , M), and verifying Verify( v k , M, σ ).
Algorithm 3 FALCON Construction: KeyGen, Sign, and Verify.
KeyGen():
Input: Parameters $(n, q)$ and the ring $\mathbb{Z}[x]/\langle x^n + 1 \rangle$. Output: Verification key $vk$ and secret key $sk$.
1. Sample short polynomials $f, g \in \mathbb{Z}[x]/\langle x^n + 1 \rangle$.
2. Compute $h \leftarrow g / f \bmod q$ using FFT-based polynomial inversion.
3. Derive a trapdoor basis (e.g., via Gram–Schmidt orthogonalization).
4. Return verification key $vk \leftarrow h$ and secret key $sk \leftarrow (f, g, \mathrm{trapdoor})$.
Sign($sk$, $M$):
Input: Secret key $sk$ (trapdoor, e.g., $(f, g, \ldots)$), message $M$, modulus $q$ (and scheme parameters). Output: Signature $\sigma$.
1. Compute the hash/challenge $c \leftarrow H(M) \in R_q$.
2. Perform trapdoor sampling to obtain a short vector $s$ such that $s \equiv c \cdot f^{-1} \pmod{q}$.
3. Sample $s$ using FFT-based discrete Gaussian sampling with high numerical precision [32].
4. Return the signature $\sigma \leftarrow s$.
Verify($vk$, $M$, $\sigma$):
Input: Verification key $vk = h$, message $M$, signature $\sigma = s$. Output: Decision bit $b \in \{\mathrm{accept}, \mathrm{reject}\}$.
1. Compute $c \leftarrow H(M)$ and verify $s \cdot h \equiv c \pmod{q}$.
2. Check that the Euclidean norm $\|s\|_2$ satisfies the prescribed bounds.
3. Accept if both checks pass; otherwise return reject.
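To contrast with the integer NTT used by ML-KEM/ML-DSA, the sketch below performs negacyclic polynomial multiplication with a complex FFT, the transform family FALCON builds on, and rounds the result back to integers mod $q$ (illustrative parameters; real FALCON additionally manages floating-point precision very carefully for its Gaussian sampler):

```python
import cmath

# Negacyclic multiplication in Z_q[x]/(x^n + 1) via a complex FFT. The twist
# by psi = e^{i*pi/n} turns reduction mod x^n + 1 into an ordinary cyclic
# convolution. Parameters are illustrative (n = 64, q = 12289).
q, n = 12289, 64

def fft(a):
    m = len(a)
    if m == 1:
        return a[:]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * m
    for k in range(m // 2):
        t = cmath.exp(-2j * cmath.pi * k / m) * odd[k]
        out[k], out[k + m // 2] = even[k] + t, even[k] - t
    return out

def ifft(a):
    conj = fft([x.conjugate() for x in a])
    return [x.conjugate() / len(a) for x in conj]

def fft_negacyclic_mul(a, b):
    psi = cmath.exp(1j * cmath.pi / n)              # primitive 2n-th root of 1
    at = fft([x * psi ** i for i, x in enumerate(a)])
    bt = fft([x * psi ** i for i, x in enumerate(b)])
    ct = ifft([x * y for x, y in zip(at, bt)])
    return [round((c * psi ** -i).real) % q for i, c in enumerate(ct)]
```

The final rounding step highlights the precision issue flagged above: hardware must guarantee that accumulated floating-point error stays below 1/2 so that rounding recovers the exact integer coefficients.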
Hardware Implications. FALCON requires complex-valued FFT engines rather than the integer NTT units employed in ML-KEM and ML-DSA. Supporting floating-point arithmetic further mandates specialized datapaths and control logic to preserve numerical accuracy. In addition, the larger polynomial degrees ($n = 512, 1024$) increase transform depth and latency compared to the $n = 256$ used in ML-KEM and ML-DSA. Moreover, Gram–Schmidt computations during key generation and sampling incur substantial computational overhead. Consequently, owing to this architectural divergence, relatively few FALCON hardware implementations exist compared to its Module-LWE counterparts [33,34].

2.2.4. Comparative Analysis

Table 1 summarizes the key characteristics and hardware implementation challenges of NIST-standardized schemes.

2.3. Hardware Implementation Platforms

Hardware acceleration of lattice-based cryptographic primitives is predominantly realized on two classes of implementation platforms: FPGAs and ASICs. Each platform exhibits distinct architectural features, design methodologies, and performance trade-offs, which directly influence their suitability for PQC acceleration. This subsection provides a comparative overview with emphasis on hardware-relevant considerations.

2.3.1. FPGA Characteristics and Design Flow

FPGAs represent the dominant platform for lattice-based cryptographic accelerator research due to their reconfigurability and rapid prototyping capabilities. Below, we summarize FPGA-specific hardware resources and discuss the timing-closure challenges that must be addressed when targeting high-performance lattice-based designs.
Technology Overview. FPGAs are reconfigurable devices composed of configurable logic blocks (CLBs), programmable routing networks, embedded memory resources such as block RAMs (BRAMs), and specialized arithmetic components including DSP slices. Contemporary FPGA platforms from major vendors such as AMD (Xilinx) and Intel (formerly Altera) further integrate system-level peripherals, including high-speed serial transceivers, PCIe (Peripheral Component Interconnect Express) interfaces, and embedded processor cores [35]. These heterogeneous resources make FPGAs particularly attractive for prototyping and accelerating lattice-based cryptographic schemes, where algorithmic flexibility and rapid design iteration are critical.
Timing Closure. Achieving timing closure on FPGA platforms remains one of the primary design challenges, especially for high-performance PQC accelerators. Lattice-based implementations often employ wide datapaths and intensive polynomial arithmetic, such as NTT-based multiplication, which can lead to significant routing congestion and long critical paths. Furthermore, irregular memory access patterns associated with transform-based computations exacerbate placement and routing complexity. Consequently, meeting target clock frequencies typically requires careful pipeline insertion, balanced resource utilization, and iterative refinement of timing constraints, placement, and routing strategies [36].

2.3.2. ASIC Characteristics and Design Flow

Compared to FPGAs, ASICs are attractive for high-volume post-quantum cryptographic deployments due to the superior performance and energy efficiency they offer. We will summarize the key characteristics of ASIC implementation and power consumption for lattice-based cryptography.
Technology Overview. ASICs are custom-fabricated integrated circuits designed for a fixed functionality and optimized for specific performance, power, and area objectives. Unlike FPGAs, which rely on programmable interconnects and configuration memory, ASICs are manufactured using dedicated semiconductor process technologies offered by foundries such as TSMC and Samsung. These processes span a wide range of technology nodes, from mature and cost-effective nodes (e.g., 180 nm) to advanced, high-performance nodes (e.g., 7 nm), enabling designers to tailor implementations to application-specific requirements [37].
Power Considerations. Power consumption characteristics differ substantially between FPGA and ASIC implementations. On FPGAs, static power associated with configuration memory and programmable routing dominates overall energy usage, particularly in large designs. In contrast, ASIC power consumption is primarily dynamic, arising from signal switching activity. This fundamental distinction allows ASIC-based implementations to achieve significantly higher energy efficiency for lattice-based cryptographic workloads, especially in high-throughput, low-latency, or energy-constrained deployment scenarios [38].

2.3.3. FPGA Versus ASIC Trade-Offs for Lattice-Based Cryptography

FPGAs and ASICs represent two fundamentally distinct design points for accelerating lattice-based cryptographic workloads. Each platform offers unique advantages and limitations in terms of development cost, performance, power efficiency, and deployment flexibility. Table 2 summarizes a comparative analysis of these platforms across dimensions that are particularly relevant to PQC.
Design Effort Comparison. FPGA-centric development flows, commonly supported by vendor toolchains such as Xilinx Vivado and Intel Quartus Prime, enable rapid design iteration with synthesis and timing-analysis turnaround times typically measured in hours rather than days [39,40]. This short feedback cycle significantly facilitates architectural exploration, parameter tuning, and incremental optimization. In contrast, ASIC design methodologies based on industrial-grade Electronic Design Automation (EDA) tool suites—such as Cadence Genus/Innovus or Synopsys Design Compiler (DC)/IC Compiler (ICC)—require substantially longer iteration cycles. These flows involve complex stages including detailed placement and routing, parasitic RC (resistance–capacitance) extraction, and multi-corner multi-mode (MCMM) sign-off verification, often resulting in design iterations spanning several days to weeks before timing closure and functional correctness are achieved [39,41,42].
Performance Analysis. Recent comparative studies on lattice-based cryptographic accelerators [26,43,44] consistently report the following performance characteristics:
  • FPGA implementations typically achieve approximately 60–70% of the peak clock frequency attainable by functionally equivalent ASIC designs [26,36]. This gap is largely attributed to configurable routing overheads and longer interconnect delays inherent to reconfigurable fabrics.
  • The performance advantage of ASIC implementations becomes increasingly evident as architectural complexity grows, benefiting from optimized standard-cell placement, dedicated routing resources, and reduced parasitic capacitance [45].
  • Contemporary DSP-rich FPGA platforms (e.g., AMD Xilinx UltraScale+ devices equipped with DSP48E2 slices) can partially mitigate this disparity for multiplication-intensive workloads, such as NTT-based polynomial arithmetic, by mapping critical operations onto hardened multiplier blocks [36,39].
When to Choose an FPGA. FPGAs are particularly well suited to the following deployment scenarios [26]:
  • Research prototyping and early-stage architectural evaluation of post-quantum cryptographic algorithms.
  • Moderate production volumes (typically below 100,000 units), where non-recurring engineering (NRE) costs dominate overall system cost [46].
  • Evolving standards, including parameter adjustments during the NIST PQC standardization process and the transition from Round 3 candidates to finalized FIPS specifications.
  • Cryptographic agility, allowing multiple PQC primitives (e.g., ML-KEM and ML-DSA) to be supported within a single reconfigurable platform.
  • Applications demanding rapid deployment and frequent post-deployment updates in response to security advisories or protocol changes [47].
When to Choose an ASIC. ASIC-based implementations are generally preferable in the following contexts:
  • High-volume deployments (exceeding one million units per year), such as Internet-of-Things (IoT) sensor nodes, smartcards, and secure elements in mobile devices [48].
  • Power- and energy-constrained environments, where battery lifetime or thermal dissipation is critical, benefiting from ASIC implementations that typically achieve 5–10× lower energy per operation [43].
  • Stable and fully standardized algorithms following formal FIPS publication (e.g., FIPS 203, 204, and 206), where specification changes are unlikely [17,18].
  • Cost-sensitive products in which amortized per-unit manufacturing cost outweighs non-recurring engineering expenses, generally at production volumes above 500,000 units [46].

3. Hardware Acceleration Fundamentals

This section establishes the technical foundations necessary for understanding hardware implementations of lattice-based cryptographic schemes. We examine the computational structure of core arithmetic operations (Section 3.1), then discuss architectural considerations related to memory organization (Section 3.2).

3.1. Core Arithmetic Operations and Architectural Approaches

Efficient hardware acceleration of lattice-based cryptography critically depends on the optimization of three fundamental computational primitives: polynomial multiplication implemented via the NTT, modular arithmetic, and probabilistic sampling operations.

3.1.1. Number Theoretic Transform (NTT)

The NTT is fundamental to efficient polynomial multiplication in both ML-KEM and ML-DSA, and its implementation often determines an accelerator’s performance. We will present the algorithmic foundations of NTT, the Cooley–Tukey algorithm commonly used in hardware, hardware architectures ranging from serial to fully pipelined designs, and coefficient reordering strategies that impact memory access patterns.
Algorithmic Foundation. In the polynomial ring used by ML-KEM/ML-DSA, coefficients are reduced modulo a scheme-specific prime q and polynomials are reduced modulo x^n + 1. This ring structure enables the NTT to convert polynomial convolution, which is the dominant computational bottleneck, into efficient coefficient-wise multiplication. The NTT enables efficient polynomial multiplication over the ring
R_q = Z_q[x]/(x^n + 1)
by reducing the naive schoolbook complexity of O(n^2) to O(n log n) through coefficient-wise multiplication in the transform domain [49].
Consider a polynomial a(x) ∈ R_q represented by its coefficient vector (a_0, …, a_{n−1}). The NTT transforms this coefficient vector into a domain where polynomial multiplication becomes a coefficient-wise product that can be computed efficiently and then mapped back. Given a polynomial
a(x) = Σ_{i=0}^{n−1} a_i x^i,
the forward NTT is the discrete Fourier transform analogue over Z_q and computes
â[j] = NTT(a)[j] = Σ_{i=0}^{n−1} a_i · ω^{ij} mod q,
where ω is a primitive n-th root of unity modulo q, satisfying ω^n ≡ 1 (mod q) and ω^k ≢ 1 (mod q) for all 0 < k < n. Although this definition suggests evaluating each point independently, hardware accelerators achieve this computation via staged butterfly networks.
The iNTT reconstructs the original polynomial coefficients, with a final scaling by n^{−1} modulo q (a constant for fixed n and q), as
a[j] = n^{−1} · Σ_{i=0}^{n−1} â[i] · ω^{−ij} mod q,
where n^{−1} denotes the multiplicative inverse of n modulo q [3,50].
Together, the forward and inverse transforms enable efficient convolution-based polynomial multiplication while preserving exact arithmetic in Z q .
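As a concrete reference, the forward and inverse formulas above can be modeled directly in software. The sketch below uses the ML-KEM modulus q = 3329 with ω = 17, a primitive 256-th root of unity modulo 3329, and implements the plain cyclic NTT matching the formulas above; the negacyclic twist that ML-KEM applies on top of this (weighting coefficients by powers of a 2n-th root) is omitted for clarity, so pointwise products here correspond to cyclic, not negacyclic, convolution.

```python
# Toy reference model of the forward/inverse NTT formulas above (cyclic variant).
# q = 3329 (ML-KEM modulus); 17 is a primitive 256-th root of unity mod 3329
# (17^128 = -1 mod 3329). Direct O(n^2) evaluation, for clarity rather than speed.
Q, N, OMEGA = 3329, 256, 17

def ntt(a, n=N, q=Q, w=OMEGA):
    # a_hat[j] = sum_i a[i] * w^(i*j) mod q
    return [sum(a[i] * pow(w, i * j, q) for i in range(n)) % q for j in range(n)]

def intt(a_hat, n=N, q=Q, w=OMEGA):
    # a[j] = n^(-1) * sum_i a_hat[i] * w^(-i*j) mod q
    n_inv, w_inv = pow(n, -1, q), pow(w, -1, q)
    return [n_inv * sum(a_hat[i] * pow(w_inv, i * j, q) for i in range(n)) % q
            for j in range(n)]
```

A quick round trip, intt(ntt(a)) == a, confirms exact arithmetic in Z_q, and multiplying transforms pointwise realizes the convolution theorem for the cyclic ring.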
Radix-2 Cooley–Tukey Algorithm. Most hardware implementations adopt the decimation-in-time (DIT) radix-2 Cooley–Tukey algorithm, which recursively decomposes an n-point transform into two (n/2)-point sub-transforms [51]:
NTT(a)[k] = NTT(a_even)[k] + ω^k · NTT(a_odd)[k],  NTT(a)[k + n/2] = NTT(a_even)[k] − ω^k · NTT(a_odd)[k].
Each radix-2 stage consists of n/2 independent butterflies. A butterfly reads two coefficients, multiplies one by a twiddle factor ω^k, and then produces two updated outputs using modular addition/subtraction operations. Each radix-2 butterfly operation consists of the following modular arithmetic steps:
t_k = a_{k+n/2} · ω^k mod q,  b_k = (a_k + t_k) mod q,  b_{k+n/2} = (a_k − t_k) mod q.
For the commonly used parameter n = 256 , as adopted in CRYSTALS-Kyber and CRYSTALS-Dilithium, the NTT computation comprises log 2 ( 256 ) = 8 stages, each containing 128 butterfly operations. This results in a total of 1024 butterfly computations per forward or inverse NTT [26], which directly determines the throughput and resource requirements of NTT-based hardware accelerators.
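The staged butterfly schedule can be illustrated with an iterative software model instrumented to count butterflies. The parameters again follow the cyclic toy setting (q = 3329, n = 256, ω = 17), not the exact ML-KEM negacyclic transform; the stage/butterfly structure is what carries over to hardware.

```python
# Iterative radix-2 DIT Cooley-Tukey schedule, instrumented to count butterflies.
# Cyclic toy setting: q = 3329, n = 256, primitive 256-th root 17.
Q, N, OMEGA = 3329, 256, 17

def bit_reverse_copy(a):
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = a[i]
    return out

def ntt_ct(a, q=Q, w=OMEGA):
    a = bit_reverse_copy(a)              # DIT consumes bit-reversed input
    n = len(a)
    butterflies = 0
    length = 2
    while length <= n:                   # log2(n) stages
        w_len = pow(w, n // length, q)   # primitive length-th root of unity
        for start in range(0, n, length):
            tw = 1
            for k in range(length // 2):
                u = a[start + k]
                t = a[start + k + length // 2] * tw % q   # t_k
                a[start + k] = (u + t) % q                # b_k
                a[start + k + length // 2] = (u - t) % q  # b_{k+len/2}
                tw = tw * w_len % q
                butterflies += 1
        length *= 2
    return a, butterflies
```

For n = 256 this executes exactly 8 stages of 128 butterflies each, i.e., the 1024 butterflies cited above, and reproduces the direct O(n^2) definition.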
Hardware Architectures. Hardware realizations of the NTT span a wide design space that trades parallelism for area and power efficiency [36,52]:
  • Serial Architecture: A single butterfly processing element executes one operation per clock cycle, resulting in a total latency of n log 2 ( n ) / 2 cycles to complete an n-point NTT (e.g., 1024 cycles for n = 256 ). This architecture minimizes hardware cost, typically occupying on the order of 2k LUTs (lookup tables) with little or no DSP utilization, making it attractive for resource-constrained designs [53].
  • Parallel Architecture: Multiple butterfly units, commonly with a parallelism factor p { 2 , 4 , 8 } , operate concurrently to reduce overall latency to n log 2 ( n ) / ( 2 p ) cycles. For instance, with p = 4 , a Kyber NTT can be completed in 256 cycles at the expense of approximately 4 × higher DSP usage compared to the serial design [26,53].
  • Fully Pipelined Architecture: In this approach, each NTT stage is deeply pipelined with an initiation interval of one clock cycle, enabling a sustained throughput of one NTT every n / p cycles after an initial pipeline fill latency. Implementations on modern FPGA platforms, such as Xilinx UltraScale+ devices, report latencies as low as 224 cycles using p = 4 butterfly units while operating at clock frequencies up to 450 MHz [54].
Coefficient Reordering. The forward NTT inherently produces output coefficients in bit-reversed order, necessitating an explicit permutation step prior to inverse transformation or coefficient-wise multiplication [55]. Common hardware techniques to address this requirement include the following:
  • Dual-port RAMs (DPRAM) with bit-reversed addressing schemes [36].
  • Dedicated reordering networks implemented using barrel shifters or permutation networks [56].
  • In-place butterfly scheduling strategies that implicitly avoid explicit reordering [57].
The choice of reordering mechanism has a direct impact on memory access regularity and bandwidth requirements, as further discussed in Section 3.2.
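As a small illustration of the first technique, bit-reversed addressing can be modeled by reversing the low-order address bits of each index; because bit reversal is an involution, applying the same address mapping twice restores natural order, which is why a DPRAM can absorb the permutation into its address generator.

```python
# Minimal sketch of bit-reversed address generation (the DPRAM technique above):
# one port walks addresses in natural order while the other uses the
# bit-reversed address, avoiding a separate permutation pass.
def bitrev(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the LSB of i into r
        i >>= 1
    return r

n, bits = 8, 3
order = [bitrev(i, bits) for i in range(n)]
# For n = 8 the generated order is [0, 4, 2, 6, 1, 5, 3, 7]; applying the
# mapping twice restores natural order, since bit reversal is an involution.
```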

3.1.2. Modular Arithmetic

The efficiency of modular arithmetic operations directly impacts the overall performance of an accelerator. This subsection focuses on the primary modular reduction mechanisms employed in hardware, as well as the simpler modular addition and subtraction operations.
Modular Multiplication. NTT butterfly computations require modular multiplication followed by efficient reduction modulo q. Three primary reduction techniques are commonly employed in hardware implementations [22,23,24]:
  • Barrett Reduction: This method precomputes a constant μ = ⌊2^k/q⌋ for k ≥ log_2 q and approximates division by q as ⌊x/q⌋ ≈ ⌊x·μ/2^k⌋ [22]. For CRYSTALS-Kyber with q = 3329, selecting k = 16 enables single-cycle reduction using a 16-bit datapath. The required hardware includes one 16 × 16 → 32-bit multiplier, two shifters, and a subtractor.
  • Montgomery Reduction: Montgomery arithmetic maps values into a transformed domain and computes x·R^{−1} mod q, where R = 2^r > q, allowing reductions to be realized primarily through shifts and additions [22]. Although conversion to and from the Montgomery domain introduces additional multiplications per polynomial, this technique is well suited for larger moduli, such as that used in CRYSTALS-Dilithium (q ≈ 2^23) [58].
  • Plantard Arithmetic: Plantard reduction exploits moduli of the form q = 2^k − c to replace multiplication-intensive reduction with addition-based operations, thereby avoiding double-width arithmetic [24]. However, its applicability is limited and does not directly extend to the moduli adopted by current NIST-standardized lattice-based schemes.
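Hedged software models of the first two reduction methods are sketched below for q = 3329. The Barrett parameter k = 26 and the Montgomery radix R = 2^16 are illustrative choices for this sketch, not the parameters of any specific published accelerator, and the trailing correction loop stands in for the one or two conditional subtractions a hardware datapath would use.

```python
# Illustrative software models of Barrett and Montgomery reduction for q = 3329.
# k = 26 and R = 2^16 are example parameters chosen for this sketch.
Q = 3329

# --- Barrett: t = floor(x * mu / 2^k) approximates floor(x / q) ---
K = 26
MU = (1 << K) // Q                 # precomputed constant mu = floor(2^k / q)

def barrett_reduce(x):
    t = (x * MU) >> K              # multiplier + shifter estimate of x // q
    r = x - t * Q                  # subtract the estimated multiple of q
    while r >= Q:                  # hardware: one or two conditional subtracts
        r -= Q
    return r

# --- Montgomery: REDC computes x * R^(-1) mod q for 0 <= x < q*R ---
R_BITS, R = 16, 1 << 16
QINV_NEG = (-pow(Q, -1, R)) % R    # precomputed -q^(-1) mod R

def mont_reduce(x):
    m = (x * QINV_NEG) & (R - 1)   # low half only
    t = (x + m * Q) >> R_BITS      # exact division by R (x + m*q is 0 mod R)
    return t - Q if t >= Q else t  # single conditional subtract
```

To multiply a by b in the Montgomery domain, convert b once via b_mont = b·R mod q; then mont_reduce(a * b_mont) yields a·b mod q.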
Modular Addition and Subtraction. Modular addition and subtraction compute ( a ± b ) mod q using conditional correction logic. A typical implementation first computes c = a ± b , followed by
c′ = c − q, if c ≥ q; c′ = c, otherwise.
This operation requires a ( log 2 q + 1 ) -bit adder, a comparator, and a 2:1 multiplexer, and it can be completed within a single clock cycle in most hardware realizations [59].
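This conditional-correction datapath maps directly to a few lines of reference code; the sketch below mirrors the adder-comparator-multiplexer structure just described, with q = 3329 as an example modulus.

```python
# Direct model of the conditional-correction adder/subtractor described above
# (single cycle in hardware: one adder, one comparator, one 2:1 multiplexer).
def mod_add(a, b, q=3329):
    c = a + b                      # (log2(q) + 1)-bit sum
    return c - q if c >= q else c  # conditional correction via the mux

def mod_sub(a, b, q=3329):
    c = a - b
    return c + q if c < 0 else c   # symmetric correction for subtraction
```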

3.1.3. Sampling Operations

In addition to arithmetic, lattice-based schemes rely on sampling operations, probabilistic procedures that can affect both performance and side-channel robustness. This subsection examines three primary sampling mechanisms encountered in NIST-standardized algorithms.
Centered Binomial Distribution (used in Kyber/Dilithium). CRYSTALS-Kyber and CRYSTALS-Dilithium generate error polynomials by sampling each coefficient from a centered binomial distribution CBD_η [6,17]. Each coefficient is computed as
e = Σ_{i=1}^{η} (b_i − b_i′),
where b_i and b_i′ are independently and uniformly sampled binary values. In hardware, this sampling procedure is commonly realized using bit population count (popcount) logic:
  • Generate 2η uniformly random bits per coefficient.
  • Partition the bits into two groups of size η .
  • Compute popcount(group 1) − popcount(group 2).
For the frequently used parameter η = 2 , as in ML-KEM-768, each coefficient requires four random bits. Modern FPGA architectures can efficiently implement popcount operations using 6-input lookup tables (LUTs), while ASIC designs typically rely on parallel full-adder trees to minimize both latency and area overhead [60].
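The three steps above can be sketched as follows. The Python PRNG is a placeholder for the SHAKE-derived bitstream that the standards actually specify, and the popcounts are modeled with simple bit sums.

```python
# Sketch of the popcount-based CBD_eta sampler enumerated above. The Python
# PRNG stands in for the SHAKE-derived bitstream specified by the standards.
import random

def sample_cbd(eta, n, q, rng):
    coeffs = []
    for _ in range(n):
        bits = [rng.getrandbits(1) for _ in range(2 * eta)]  # 2*eta fresh bits
        grp1, grp2 = bits[:eta], bits[eta:]                  # two groups of eta
        coeffs.append((sum(grp1) - sum(grp2)) % q)           # popcount difference
    return coeffs
```

For η = 2 each centered value lies in {−2, …, 2}, i.e., in {q−2, q−1, 0, 1, 2} after reduction modulo q.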
Discrete Gaussian Sampling (used in FALCON). FALCON signature generation requires sampling from a discrete Gaussian distribution with double-precision floating-point accuracy, corresponding to a 53-bit mantissa [21]. A widely adopted approach is the cumulative distribution table (CDT) method, in which values of the Gaussian cumulative distribution function (CDF) are precomputed and a binary search is performed using a uniformly random input [61].
From a hardware implementation standpoint, CDT-based sampling introduces substantial challenges. These include the need for large read-only memory (on the order of 20 kB for σ = 1.55 ) and high-precision floating-point comparison units. To mitigate this complexity, alternative designs have explored hybrid integer–Gaussian approximations that replace floating-point arithmetic with integer-only operations, trading modest increases in signature size for significantly simplified hardware architectures [62].
Rejection Sampling (used in Dilithium). In CRYSTALS-Dilithium, signature generation employs rejection sampling to enforce norm bounds on the response vector z . In simplified form, acceptance requires
‖z‖_∞ < γ_1 − β,
where γ_1 and β are scheme parameters, as specified in the standard [6]. For the ML-DSA-65 parameter set, the expected number of iterations before acceptance is approximately 4.5.
Consequently, hardware implementations must support the following operations:
  • Computation of the infinity norm, i.e., max_i |z_i|.
  • Comparison against the threshold γ_1 − β.
  • Control logic to restart the signing procedure and resample randomness upon rejection.
The data-dependent number of iterations introduced by rejection sampling complicates constant-time hardware design and poses additional challenges for side-channel resistance, necessitating careful architectural countermeasures [28].
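The restart structure of the three operations above can be modeled as follows. This is an illustrative sketch, not ML-DSA arithmetic: the candidate vector z is drawn uniformly at random rather than computed from the message and secret key, and the parameters used in the test are toy values.

```python
# Toy model of the Dilithium-style rejection loop: compute the infinity norm
# of a candidate, compare against gamma1 - beta, and restart on rejection.
# The candidate z is random here, standing in for z = y + c*s1 in the scheme.
import random

def inf_norm(z, q):
    # centered representative of each coefficient, then the max absolute value
    def centered(c):
        c %= q
        return c - q if c > q // 2 else c
    return max(abs(centered(c)) for c in z)

def sign_with_rejection(q, gamma1, beta, n, rng):
    attempts = 0
    while True:
        attempts += 1
        z = [rng.randrange(q) for _ in range(n)]   # stand-in candidate vector
        if inf_norm(z, q) < gamma1 - beta:         # norm-bound acceptance test
            return z, attempts                     # data-dependent iteration count
```

The returned attempt count is exactly the data-dependent quantity that complicates constant-time hardware control.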

3.1.4. Alternative Polynomial Multiplication Algorithms

While NTT-based multiplication dominates lattice-based PQC implementations due to its O ( n log n ) complexity and natural alignment with Module-LWE schemes, alternative algorithms offer advantages in specific deployment scenarios. We will examine two complementary approaches relevant to hardware design: Karatsuba multiplication [63] and Toom–Cook [64] multiplication variants for resource-constrained or hybrid architectures.
Karatsuba Multiplication. The Karatsuba algorithm [63] applies a recursive decomposition of large polynomials into smaller subproblems. A toy example is the multiplication between two degree-1 polynomials. The decomposition requires three multiplications instead of four for degree-1 polynomials, as shown in the following equation:
(a_0 + a_1 x)(b_0 + b_1 x) = a_0 b_0 + [(a_0 + a_1)(b_0 + b_1) − a_0 b_0 − a_1 b_1] x + a_1 b_1 x^2
Recursively applying this decomposition to degree-(n−1) polynomials yields a complexity of O(n^{log_2 3}) ≈ O(n^{1.585}) [3].
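The recursive decomposition can be sketched as follows for plain (unreduced) polynomial multiplication over the integers, instrumented to count base multiplications; reduction modulo q and x^n + 1 would be applied afterwards in a real lattice implementation, and power-of-two lengths are assumed.

```python
# Recursive Karatsuba sketch following the degree-1 identity above, generalized
# to length-n coefficient vectors (n a power of two) and instrumented to count
# base multiplications. Plain integer-coefficient multiplication, no reduction.
def karatsuba(a, b, count):
    n = len(a)
    if n == 1:
        count[0] += 1
        return [a[0] * b[0]]
    h = n // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    z0 = karatsuba(a0, b0, count)                          # a0*b0
    z2 = karatsuba(a1, b1, count)                          # a1*b1
    z1 = karatsuba([x + y for x, y in zip(a0, a1)],
                   [x + y for x, y in zip(b0, b1)], count) # (a0+a1)(b0+b1)
    res = [0] * (2 * n - 1)
    for i, v in enumerate(z0):
        res[i] += v
    for i, v in enumerate(z2):
        res[i + n] += v
    for i in range(len(z1)):
        res[i + h] += z1[i] - z0[i] - z2[i]                # middle term at x^h
    return res
```

For n = 256, full recursion performs 3^8 = 6561 base multiplications in place of the 65,536 required by schoolbook multiplication.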
Toom–Cook Multiplication. Toom–Cook generalizes Karatsuba by splitting polynomials into k parts (Toom-k), achieving complexity O(n^{log_k(2k−1)}). Toom-3 (splitting into three parts) requires five multiplications and yields O(n^{1.465}) complexity, while Toom-4 achieves O(n^{1.404}) at the cost of seven multiplications [64].
Hardware Implementation Characteristics: Karatsuba designs trade multiplier count for increased adder complexity and intermediate storage. For Kyber’s n = 256 polynomials, a fully recursive (8-level) Karatsuba decomposition reduces the 65,536 coefficient multiplications of 256 × 256 schoolbook multiplication to 3^8 = 6561 base multiplications, but requires managing 7 intermediate addition/subtraction stages per recursion level [3].
Toom–Cook variants demand more complex evaluation and interpolation stages than Karatsuba. For Toom-3, the algorithm evaluates polynomials at five points (typically 0, 1, −1, 2, and ∞), performs five multiplications, then interpolates the result through division by small constants (2, 3, 6).
Deployment Scenarios: Toom–Cook and Karatsuba multiplications are rarely used in isolation for lattice-based PQC hardware but appear in hybrid contexts. When DSP blocks are limited but logic fabric is abundant, Karatsuba and Toom–Cook reduce multiplier demand at an acceptable control-complexity cost. Example: low-end FPGAs or ASIC implementations targeting minimal silicon area [65]. For software-hardware co-design, CPU-based software implementations use Toom–Cook for large polynomial multiplication [4], while hardware accelerators handle other operations.

3.2. Memory Architecture and Access Patterns

As polynomial operations require substantial storage for coefficients, intermediate values, and public parameters, the performance and efficiency of lattice-based cryptographic accelerators are significantly impacted by memory organization. This subsection examines memory requirements, on-the-fly matrix generation techniques that trade computation for storage, coefficient storage schemes, and multi-port memory utilization.
Memory Requirements. Memory footprint constitutes a critical design constraint for hardware accelerators targeting lattice-based cryptographic schemes. Both the size of polynomial operands and the dimensionality of public matrices directly influence on-chip memory demand and data movement overhead. Table 3 summarizes representative storage requirements for NIST-standardized schemes instantiated at NIST Level 3 security parameters.
On-the-Fly Matrix Generation. Explicit storage of the public matrix A is generally impractical in lattice-based cryptographic implementations due to its large dimensionality. Instead, standardized schemes derive the coefficients of A on demand from a compact public seed ρ using an extendable-output function (XOF), typically SHAKE-128 [17,66].
From a hardware perspective, this approach introduces a trade-off between the throughput of the hash core, which bounds the rate of polynomial generation, and the availability of on-chip memory for buffering generated coefficients. A widely adopted solution leverages dual-port block RAM (BRAM) structures, where one port services NTT operand reads while the second port writes freshly generated coefficients. This organization enables overlapping matrix generation with arithmetic computation, thereby improving overall throughput [26,67].
Coefficient Storage Schemes. NTT-based accelerators typically adopt one of two coefficient storage organizations, each with distinct implications for memory access patterns and control complexity [36,68]:
  • Sequential (Natural Order): Coefficients are stored in natural order as [a_0, a_1, …, a_{n−1}]. This layout simplifies data loading and software interfacing but necessitates an explicit bit-reversal or permutation step following the forward NTT. Memory accesses are sequential during early NTT stages and become increasingly strided, with stride 2^stage, in later stages.
  • Bit-Reversed Order: Coefficients are arranged as [a_0, a_{n/2}, a_1, a_{n/2+1}, …]. Under this organization, the NTT can be executed in-place without an additional reordering phase, at the cost of a more complex input arrangement. This layout is particularly advantageous in parallel architectures, as it can substantially reduce memory port conflicts [56].
Multi-Port Memory Utilization. Parallel NTT architectures require multiple coefficient accesses per clock cycle to sustain high throughput. On FPGA platforms, BRAMs typically support true dual-port operation, enabling two concurrent reads or writes. In contrast, ASIC SRAM (Static Random Access Memory) macros often provide single-port or pseudo-dual-port access modes. To support p parallel butterfly units under these constraints, designers commonly employ one or more of the following techniques [26,69]:
  • Memory Banking: Distribute coefficients across p independent memory banks to allow simultaneous accesses.
  • Time Multiplexing: Share limited memory ports among butterfly units across multiple cycles, trading increased latency for reduced area.
  • Register Files: Cache intermediate values in local registers to alleviate pressure on shared memory resources.
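As an illustration of the banking technique, a classic two-bank scheme assigns each address to a bank by the parity (XOR-fold) of its address bits. Because the two operands of every radix-2 butterfly differ in exactly one address bit, every butterfly reads one operand from each bank without conflict; this is a well-known construction sketched here for illustration, not a description of any particular accelerator.

```python
# Two-bank memory banking sketch for radix-2 NTT butterflies: the two operands
# of every butterfly differ in exactly one address bit, so an XOR-fold (parity)
# of the address bits always maps them to different banks.
def bank_of(addr):
    return bin(addr).count("1") & 1   # parity of address bits selects the bank

def check_conflict_free(n=256):
    length = 2
    while length <= n:                # every radix-2 stage
        half = length // 2
        for start in range(0, n, length):
            for k in range(half):
                # butterfly pair (start+k, start+k+half) must hit both banks
                if bank_of(start + k) == bank_of(start + k + half):
                    return False
        length *= 2
    return True
```

Generalizations to p banks use similar XOR-based address hashing to keep all p butterfly operands conflict-free per cycle.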

4. Architectural Approaches and Implementation Landscape

This section presents a structured taxonomy of hardware acceleration architectures for lattice-based cryptography and surveys the current implementation landscape. We classify existing designs according to their degree of specialization (Section 4.1), system-level integration strategy (Section 4.2), and primary optimization objectives (Section 4.3). This section leads to a detailed review of representative implementations reported in the literature (Section 5).

4.1. Taxonomy of Architectural Approaches

Hardware accelerators for lattice-based PQC occupy a broad design space defined by competing goals of performance, flexibility, and resource efficiency. Within this space, we identify three principal architectural dimensions that collectively characterize most existing designs.

4.1.1. Algorithm Specialization

The degree of algorithm specialization, from algorithm-specific designs optimized for a single scheme, to unified multi-algorithm architectures, to configurable parameter-agile designs, represents a fundamental architectural decision that balances design flexibility and performance optimization. Table 4 compares three specialization levels across key design dimensions.
Algorithm-Specific Designs. Algorithm-specific accelerators are tailored to a single lattice-based scheme, such as CRYSTALS-Kyber or CRYSTALS-Dilithium. By fixing algorithm parameters at design time, these architectures can optimize datapath widths, memory organization, and control logic to achieve maximal performance and area efficiency. For instance, Kyber-specific designs exploit the relatively small modulus q = 3329 through 16-bit arithmetic units and customize memory access patterns to the k × k module matrix structure inherent to ML-KEM [26].
The principal limitation of this approach lies in its lack of flexibility: accommodating parameter changes or supporting additional algorithms typically requires substantial redesign and re-verification.
Unified Multi-Algorithm Designs. Unified architectures aim to support multiple lattice-based schemes—most commonly Kyber and Dilithium—within a single hardware instance by sharing core computational blocks such as NTT engines, modular arithmetic units, and memory subsystems. Differences in algorithm parameters are managed through runtime configuration and control logic [44].
A key challenge in this design paradigm is handling heterogeneous moduli. Supporting both q = 3329 (Kyber) and q = 8,380,417 (Dilithium) often necessitates uniform 32-bit datapaths, even when executing Kyber operations. As a result, unified designs typically incur a performance penalty on the order of 30–40% relative to fully algorithm-specific accelerators [70].
Configurable Parameter-Agile Designs. Parameter-agile architectures further extend flexibility by supporting multiple security levels (e.g., NIST Levels 1, 3, and 5) and varying polynomial degrees through synthesis-time or runtime reconfiguration [54,71]. This capability enables cryptographic agility, allowing a single accelerator to adapt to evolving standards and diverse deployment requirements.
However, this increased generality introduces additional control complexity, configuration logic, and multiplexing overhead. These factors can adversely affect both area efficiency and maximum achievable clock frequency, necessitating careful architectural trade-offs.

4.1.2. Integration Approaches

Integration choices between a cryptographic accelerator and the host system often determine design complexity and overall system performance. We will compare three primary integration approaches and discuss the associated system-level trade-offs involved. Table 5 summarizes the characteristics of the three integration approaches.
Standalone Accelerators. Standalone accelerators are implemented as self-contained cryptographic cores that communicate with a host processor through standard on-chip interconnects, such as AXI or Avalon [26,67]. This architectural separation cleanly decouples cryptographic functionality from the host system, facilitating independent verification, reuse across heterogeneous platforms, and simplified system integration. However, this approach incurs non-negligible communication overhead, as polynomial coefficients and intermediate results must be explicitly transferred between the host and the accelerator.
Tightly Coupled Coprocessors. Tightly coupled coprocessors integrate lattice-based cryptographic accelerators directly into the processor microarchitecture, often sharing register files, caches, and memory hierarchies with the core [52]. This close integration substantially reduces data movement overhead and lowers latency relative to standalone accelerators. The primary drawback is increased microarchitectural complexity, as modifications to pipeline stages, register interfaces, or memory consistency mechanisms complicate design, verification, and validation.
Instruction-Set Extensions (ISA Extensions). Instruction-set extensions augment existing processor architectures, such as RISC-V or ARM, by introducing custom instructions targeting computational bottlenecks, including NTT butterfly operations, modular arithmetic, and sampling primitives [72]. This approach strikes a balance between flexibility and acceleration: software retains control over protocol-level logic, while hardware accelerates performance-critical kernels. As an illustrative example, a RISC-V processor augmented with 29 PQC-specific instructions achieves a 9.6× speedup for Kyber relative to a purely software-based implementation [73].

4.1.3. Optimization Objectives

Depending on target application requirements, hardware accelerators for lattice-based cryptography are optimized toward different primary design goals, which in turn lead to different architectural choices. We categorize implementations by four optimization objectives and highlight how each objective reshapes design decisions. Table 6 presents a comparison of the four primary optimization objectives across key dimensions.
Throughput-Oriented Designs. Throughput-oriented architectures prioritize maximizing operations per second by employing aggressive pipelining, multiple parallel butterfly units (BFUs), and high clock frequencies, often in the range of 400–500 MHz on modern FPGA platforms [26,54]. Such designs are well suited for server-class applications, including TLS endpoints required to sustain millions of concurrent connections. The associated trade-off is high resource utilization, with ML-KEM-768 implementations commonly exceeding 20k LUTs alongside substantial DSP and BRAM consumption.
Area-Constrained Designs. Area-constrained architectures target minimal logic and memory footprints for deployment in resource-limited environments such as IoT devices and smartcards. Common techniques include sequential processing, extensive resource sharing, and on-the-fly computation of intermediate values to reduce storage requirements [65,74]. These designs can be realized using fewer than 5k LUTs, albeit with a latency increase of approximately 5–10× compared to high-throughput architectures.
Energy-Efficient Designs. Energy-efficient accelerators focus on minimizing energy per cryptographic operation through architectural and circuit-level techniques such as clock gating, operand isolation, and voltage–frequency scaling [47]. These optimizations are particularly important for battery-powered and energy-constrained systems. In practice, ASIC-based implementations achieve approximately 5–25 μ J per ML-KEM-768 encapsulation, whereas FPGA-based designs typically consume 50–200 μ J for the same operation [43,44].
Side-Channel-Resistant Designs. Side-channel-resistant architectures incorporate dedicated countermeasures, including masking, shuffling, and redundancy, to mitigate power analysis and fault injection attacks [75,76]. While effective, these protections impose substantial overheads: first-order countermeasures commonly increase area by a factor of 1.5–4, latency by 2–3×, and overall energy consumption by approximately 25–40% [77].

4.2. Parallelism and Pipelining Strategies

Effective exploitation of parallelism and pipelining is essential for high-performance lattice-based cryptographic acceleration. We examine these strategies at three levels of granularity. Table 7 provides a structured comparison of the three fundamental parallelization and pipelining strategies employed in lattice-based cryptographic hardware accelerators.
Butterfly-Level Parallelism. NTT accelerators commonly exploit butterfly-level parallelism by instantiating p concurrent butterfly processing units, where p { 1 , 2 , 4 , 8 } . This approach enables multiple coefficient pairs to be processed in parallel within each NTT stage. For a polynomial degree of n = 256 , increasing parallelism from a serial design ( p = 1 ) to p = 4 reduces the NTT latency from 1,024 cycles to 256 cycles [54]. However, scaling beyond p = 8 often yields diminishing performance returns due to memory bandwidth constraints and routing congestion, particularly on FPGA fabrics with limited interconnect resources [36].
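The latency figures above follow from a simple cycle-count model: log2(n) stages of n/2 butterflies each, divided by the number of parallel units. The sketch below is an idealized model (it assumes no memory stalls or pipeline overhead), not a description of any cited design:

```python
from math import log2

def ntt_cycles(n: int, p: int) -> int:
    """Idealized NTT latency: log2(n) stages, n/2 butterflies per stage,
    p butterfly units operating in parallel (no memory stalls assumed)."""
    stages = int(log2(n))
    return stages * ((n // 2) // p)

assert ntt_cycles(256, 1) == 1024  # serial design, as cited in the text
assert ntt_cycles(256, 4) == 256   # 4 parallel butterfly units
assert ntt_cycles(256, 8) == 128   # further scaling hits bandwidth limits in practice
```

In real fabrics, the model breaks down beyond p = 8 because memory ports and routing, not butterfly count, become the bottleneck.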
Polynomial-Level Parallelism. Polynomial-level parallelism arises naturally in matrix-vector multiplications of the form A s , where multiple independent polynomial products can be computed concurrently. For example, ML-KEM-768 with module dimension k = 3 supports three-way parallelism, allowing three polynomial multiplications to be executed simultaneously [26]. While this technique substantially improves throughput, it proportionally increases demands on memory bandwidth, coefficient storage, and DSP resources.
Operation-Level Pipelining. Deep pipelining further enhances throughput by overlapping NTT computation with coefficient loading and result write-back. Once the pipeline is filled, such designs can achieve an initiation interval of one operation every n / p cycles [54]. As a representative data point, an 8-stage pipelined NTT architecture operating at 450 MHz achieves a throughput of approximately 1.76 million ML-KEM-768 encapsulations per second, highlighting the effectiveness of aggressive pipelining for server-class workloads [26].
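As a back-of-envelope check, steady-state throughput of a filled pipeline is just clock frequency divided by initiation interval; the ~256-cycle interval below is inferred from the cited 450 MHz and 1.76 M ops/s figures, not reported directly:

```python
def pipelined_throughput(f_clk_hz: float, initiation_interval: int) -> float:
    """Steady-state throughput of a filled pipeline: one result every
    initiation-interval cycles, independent of pipeline depth."""
    return f_clk_hz / initiation_interval

# 450 MHz with a ~256-cycle initiation interval reproduces the cited
# ~1.76 M ops/s figure (interval inferred from the reported numbers)
ops = pipelined_throughput(450e6, 256)
assert abs(ops - 1.7578125e6) < 1.0
```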

4.3. Memory Hierarchy Optimization

Because lattice-based PQC algorithms involve large data sizes, intensive polynomial arithmetic, and complex access patterns that increase storage and bandwidth demands, efficient use of the memory hierarchy is crucial for lattice-based PQC hardware accelerators. Table 8 compares the three key memory hierarchy optimization strategies discussed here.
Coefficient Banking. Coefficient banking partitions polynomial coefficients across multiple memory banks to enable simultaneous access by parallel butterfly units. For a design with p = 4 parallel butterflies, coefficients are interleaved across four banks such that bank [ i ] stores coefficients { a i , a i + 4 , a i + 8 , } [26,56]. This strategy effectively eliminates memory port conflicts but introduces additional routing complexity and non-trivial address generation logic.
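The interleaving rule described above reduces to simple modular address arithmetic. The sketch below illustrates the bank/address mapping and the conflict-freedom condition it is designed to satisfy (an illustrative model, with helper names of our own choosing):

```python
def bank_map(i: int, p: int = 4) -> tuple:
    """Interleaved banking for p parallel butterflies: coefficient a_i
    is stored in bank i mod p at local address i // p."""
    return i % p, i // p

# bank[1] holds {a_1, a_5, a_9, ...} for p = 4, as described above
assert [bank_map(i)[0] for i in (1, 5, 9)] == [1, 1, 1]

def conflict_free(indices, p: int = 4) -> bool:
    """Simultaneous accesses complete in one cycle iff each targets a
    distinct bank (no port conflicts)."""
    banks = [i % p for i in indices]
    return len(set(banks)) == len(banks)

assert conflict_free([0, 1, 2, 3])   # four distinct banks
assert not conflict_free([0, 4])     # both indices map to bank 0
```

The address generation logic mentioned in the text is exactly what computes `i % p` and `i // p` per access, and its complexity grows with the stage-dependent stride patterns of the NTT.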
On-Chip SRAM Versus BRAM Trade-offs. On FPGA platforms, BRAMs—typically 36 kbit dual-port memories—provide convenient on-chip storage but are available only in finite quantities (e.g., 1820 BRAMs on Xilinx UltraScale+ XCVU9P devices). Memory-intensive designs, such as ML-DSA-65 with k = 6 and ℓ = 5, require approximately 45 kB of working memory and can exhaust available BRAM resources. In such cases, designers must resort to off-chip DRAM (Dynamic Random-Access Memory), incurring latency penalties on the order of 10–50× [7,67].
ASIC implementations, by contrast, leverage SRAM compilers to generate application-specific memories with tailored word widths and port configurations, avoiding the rigid block granularity constraints inherent to FPGA BRAMs and enabling more efficient memory utilization [43].
Hybrid On-the-Fly and Caching Strategies. In NIST-standardized lattice-based schemes, the public matrix A is deterministically generated from a compact seed ρ using the SHAKE extendable-output function, trading computation for storage. Hardware implementations typically adopt one of three strategies [66]:
  • Fully on-the-fly generation, which minimizes memory usage but is constrained by the throughput of the SHAKE accelerator.
  • Full caching of matrix coefficients, which minimizes latency at the expense of substantial memory consumption.
  • Hybrid approaches that cache partial matrices, balancing computational overhead and storage requirements.
The optimal choice depends on the relative performance of the hash accelerator and the available on-chip memory resources.
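The compute/storage trade-off can be illustrated with a simplified sketch. The expansion below mimics, but does not exactly reproduce, the FIPS 203 SampleNTT routine (the SHAKE-128 byte packing and domain separation are simplified for clarity), and an `lru_cache` stands in for the full-caching strategy:

```python
import hashlib
from functools import lru_cache

Q, N = 3329, 256  # ML-KEM modulus and polynomial degree

def expand_poly(rho: bytes, i: int, j: int) -> list:
    """Expand one matrix polynomial A[i][j] from seed rho via SHAKE-128
    with rejection sampling. Simplified stand-in for FIPS 203 SampleNTT:
    the real byte packing and domain separation differ (illustrative only)."""
    stream = hashlib.shake_128(rho + bytes([j, i])).digest(3 * N * 2)
    coeffs, pos = [], 0
    while len(coeffs) < N:
        d = int.from_bytes(stream[pos:pos + 2], "little") & 0x0FFF
        pos += 2
        if d < Q:  # rejection sampling keeps accepted values uniform mod q
            coeffs.append(d)
    return coeffs

@lru_cache(maxsize=None)  # "full caching": trade on-chip memory for latency
def expand_poly_cached(rho: bytes, i: int, j: int) -> tuple:
    return tuple(expand_poly(rho, i, j))

# ML-KEM-768: k = 3, so A has 9 polynomials of 256 coefficients each
rho, k = b"\x00" * 32, 3
A = [[expand_poly_cached(rho, i, j) for j in range(k)] for i in range(k)]
assert all(0 <= c < Q for row in A for poly in row for c in poly)
```

Calling `expand_poly` on demand corresponds to the fully on-the-fly strategy (bounded by SHAKE throughput); keeping `A` resident corresponds to full caching; a bounded cache size would realize the hybrid approach.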

5. Implementation Survey and Bibliometric Analysis

To characterize the current state of the art, we surveyed 20 representative publications published between 2020 and 2025, categorizing each work according to contribution type, target algorithm, implementation platform, and reported performance characteristics. Table 9 presents a comprehensive bibliometric analysis following established methodologies for systematic literature reviews [78].

5.1. Research Methodology

To ensure our findings are transparent and reproducible, we detail our search strategy for identifying relevant publications, the categorization framework for classifying each publication, and the quality assessment criteria applied to weight publications, as follows.
Search Strategy. Relevant publications were identified through a multi-stage literature search process:
  • Queries to IEEE Xplore, the ACM Digital Library, and Springer databases using the keywords “lattice-based cryptography” AND “hardware” AND (“FPGA” OR “ASIC”).
  • Targeted filtering of the IACR ePrint Archive for hardware implementation studies.
  • Backward and forward citation tracking starting from seminal works in the field [26,44,75].
Inclusion criteria required publications to appear between 2020 and 2025, focus on NIST-standardized or finalist schemes, and report quantitative performance metrics.
Categorization Framework. Each publication was classified along the following dimensions:
  • Contribution Type: architectural proposal, optimization technique, survey, or security analysis.
  • Algorithm Focus: Kyber, Dilithium, FALCON, unified architectures, or general lattice-based schemes.
  • Platform: FPGA, ASIC, hardware/software co-design, or analysis-only.
  • Innovation Area: NTT optimization, memory architecture, side-channel protection, or system integration.
Quality Assessment. Publications were weighted based on venue prestige (with IEEE Transactions ranked above conferences and workshops), citation counts obtained from Google Scholar (as of January 2025), and technical depth, including the presence of comparative benchmarks, design-space exploration, and ablation studies.

Analysis of Implementation Trends

The bibliometric data collected from our surveyed publications reveal several notable trends in hardware implementations for lattice-based cryptography.
Temporal Distribution. The volume of published hardware implementations for lattice-based cryptography exhibits near-exponential growth over the surveyed period. Specifically, two works appeared in 2020, followed by four publications during 2021–2022, and a sharp increase to fourteen publications in 2023–2024. This surge closely follows the conclusion of NIST PQC Round 3 in July 2022 and reflects intensified research activity as standardization outcomes became clearer. The pronounced peak in 2023–2025 aligns temporally with the finalization of FIPS 203 and FIPS 204, underscoring the transition from exploratory research to standard-driven design.
Algorithm Coverage. The surveyed literature is dominated by CRYSTALS-Kyber (45%) and CRYSTALS-Dilithium (30%), reflecting their selection as NIST standards for key encapsulation and digital signatures, respectively. Unified Kyber–Dilithium architectures account for approximately 15% of publications, followed by FALCON (5%) and general or survey-oriented works (5%). The comparatively limited presence of FALCON implementations is primarily attributable to the architectural complexity of FFT-based arithmetic relative to NTT-based designs, as discussed in Section 2.2.3. Notably, the proportion of unified architectures increased from roughly 5% during 2020–2022 to nearly 15% in 2023–2025, indicating a growing emphasis on cryptographic agility in both academic prototypes and industrial designs [44,70].
Platform Distribution. FPGA-based implementations constitute the majority of surveyed works, accounting for approximately 65% of the literature. This dominance reflects the rapid prototyping capabilities and short design cycles afforded by reconfigurable platforms. ASIC-based implementations represent roughly 15% of the publications and are typically motivated by energy efficiency or high-volume deployment requirements [43]. Hybrid hardware/software co-design approaches comprise approximately 10% of the surveyed studies, offering an intermediate point between programmability and acceleration [73,79].
Performance Evolution. Across the surveyed period, normalized throughput—measured as operations per second per LUT—improved by approximately 3.2× for FPGA-based Kyber implementations. This trend can be attributed to several factors: (1) refinements in NTT algorithms and scheduling strategies [54,56], (2) improved memory architectures and access optimizations [26,67], and (3) increasing maturity of FPGA design tools, including broader adoption of Vivado high-level synthesis (HLS)-based workflows [39]. Over the same interval, reported ASIC implementations achieved an estimated 2.1× improvement in energy efficiency, reflecting advances in architectural optimization and process technology [43].
Growth of Side-Channel Research. Security-oriented contributions have grown substantially, increasing from approximately 10% of publications during 2020–2022 to nearly 35% in 2023–2025. This shift signals a transition from proof-of-concept accelerators toward deployment-ready designs. A comprehensive survey [75] identifies more than 40 distinct side-channel and fault-based attack vectors, motivating the development of increasingly sophisticated and integrated countermeasures [77,80].
Emerging Paradigms. Several emerging architectural paradigms have begun to appear in the recent literature. Processing-in-memory (PIM) approaches [81] and the repurposing of AI (artificial intelligence) accelerators for polynomial arithmetic [82] report potential throughput gains of 10–30×. Despite their promise, these techniques remain largely experimental due to technology readiness challenges, including ReRAM (Resistive Random-Access Memory) endurance limitations and restricted access to tensor cores. Consequently, their practical impact is more likely to materialize in post-2025 designs.

5.2. Design Trade-Off Analysis

We organize the key trade-offs observed across the surveyed implementations along four axes, as follows.
Performance versus Area. The relationship between performance and area forms a clear Pareto frontier in lattice-based hardware design. High-performance accelerators achieving more than 1 M operations per second typically require in excess of 15k LUTs [26]. In contrast, highly area-efficient designs occupying fewer than 5k LUTs achieve throughput in the range of 50–100k operations per second [74]. For IoT-class devices, a practical operating point lies in the range of 8–12k LUTs, delivering approximately 200–400k operations per second [48,65].
Flexibility versus Efficiency. Unified architectures that support multiple algorithms incur a performance overhead of approximately 25–35% relative to algorithm-specific designs [44,70]. This cost is often justified in scenarios requiring support for multiple cryptographic protocols (e.g., ML-KEM and ML-DSA for complete public-key infrastructures), resilience to future parameter updates, or product differentiation through cryptographic agility.
Security versus Performance. Side-channel countermeasures introduce significant overheads. First-order masking typically increases area by 80–120%, latency by 50–100%, and energy consumption by 30–50% [75,77]. Higher-order masking schemes exhibit roughly quadratic growth in overhead, limiting their practicality in many hardware contexts [76]. Shuffling-based countermeasures offer a lighter-weight alternative, incurring 20–40% overhead at the expense of weaker formal security guarantees [80].
Throughput versus Latency. Aggressively pipelined architectures optimize throughput by minimizing initiation intervals but often increase end-to-end latency due to deeper pipelines. This trade-off is well suited to throughput-dominated workloads, such as TLS servers handling large numbers of concurrent connections [26]. In contrast, latency-sensitive applications, including embedded authentication, favor shallower pipelines or more sequential architectures [65].

5.3. Synthesis and Recommendations

We now synthesize the implementation and trade-off analysis into practical recommendations for different deployment scenarios.
Research Prototyping. For academic research and early-stage prototyping, FPGA-based and algorithm-specific accelerators are generally preferred, as they maximize performance per LUT and enable rapid design iteration. A representative example is the Kyber-768 architecture by Dang et al. [26], which achieves high throughput using approximately 18k LUTs at 450 MHz.
Commercial Deployment. In high-volume deployment scenarios exceeding one million units, ASIC implementations are typically the most suitable option. Unified Kyber–Dilithium accelerators [44] enable complete post-quantum public-key infrastructures within a single core. Representative target design points include areas below 2 mm2 at 65 nm, active power consumption under 50 mW, and integration of first-order side-channel protections [43,77].
IoT and Edge Devices. For resource-constrained IoT and edge platforms, area-optimized FPGA or ASIC designs occupying fewer than 5k LUTs or approximately 0.5 mm2 are appropriate. Lightweight side-channel defenses based on shuffling [80] may be employed, accepting a 5–10× increase in latency relative to high-performance designs [65,74].
Standardization Transition. During the transitional period following PQC standardization, configurable and parameter-agile architectures capable of supporting both Round 3 parameter sets and finalized FIPS 203/204 parameters are recommended [54,71]. Such designs facilitate smooth migration during the anticipated 2025–2026 deployment window.
Table 9. Bibliometric analysis of lattice-based cryptographic hardware implementations (2020–2025).
| Reference | Year | Venue Type | Algorithm(s) | Platform | Contribution Focus | Key Innovation | Performance Highlight | Sec. Level | Citations |
|---|---|---|---|---|---|---|---|---|---|
| Dang et al. [26] | 2023 | Journal (IEEE TC) | Kyber | FPGA | High-speed architecture | Parallel NTT with optimized memory access | 450 MHz, 1.76 M ops/s (Kyber-768) | L1, 3, 5 | 92 |
| Mao et al. [79] | 2023 | Journal (ACM TRETS) | Dilithium | FPGA/SW | HW/SW co-design | Rejection sampling acceleration | 237 μs sign (Dilithium-3) | L2, 3, 5 | 23 |
| Sun et al. [54] | 2023 | Conference (ICTA) | Kyber, Dilithium, FALCON | FPGA | Unified accelerator | Configurable radix-2 NTT (4-parallel) | 224 cycles (Kyber), 512 (Dilithium) | L1, 3, 5 | 5 |
| Carril et al. [7] | 2024 | Journal (ACM TRETS) | Kyber, Dilithium | FPGA | Batch processing | DMA-optimized server workload | 9.06× decapsulation speedup | L1, 3, 5 | 8 |
| Roy et al. [43] | 2020 | Journal (IACR TCHES) | Kyber, Dilithium, Saber | ASIC | ASIC optimization | Full 65 nm ASIC evaluation | 6.4 μJ (Kyber-512), 1.94 mm2 | L1, 3, 5 | 131 |
| Aikata et al. [44] | 2023 | Journal (IEEE TCAS-I) | Kyber + Dilithium | FPGA | Unified architecture | Single KaLi core for KEM + DSA | 200 MHz, 18.4k LUTs | L2, 3, 5 | 106 |
| Matteo et al. [70] | 2024 | Journal (IEEE Access) | Kyber, Dilithium | FPGA | Memory unification | Shared RAM layout | 250 MHz, configurable k, ℓ | L1, 3, 5 | 23 |
| Gupta et al. [74] | 2023 | Journal (IEEE TCAS-I) | Dilithium | FPGA | Lightweight design | Instruction shuffling | <8k LUTs, SCA-resistant | L2, 3 | 64 |
| Antognazza et al. [71] | 2024 | Journal (SN Comp. Sci.) | Kyber, NTRU, Saber | FPGA | Multischeme support | Unified KEM multiplier | Agility across 4 schemes | L1, 3, 5 | 5 |
| Hwang [3] | 2024 | Survey (ePrint) | General lattice | Analysis | Algorithmic survey | SW polynomial optimization | ARM/AVX2 comparison | N/A | 2 |
| Karakaya & Ulu [83] | 2024 | Survey (WIREs) | General PQC | Analysis | IoT security | PQC threat model survey | N/A | N/A | 39 |
| Wang et al. (2023) [84] | 2023 | Survey (CAMB) | General lattice | Analysis | Foundations | Lattice crypto overview | N/A | N/A | 42 |
| Bandaru et al. [85] | 2024 | Journal (Elsevier CEE) | NIST finalists | Mixed | Comparative evaluation | HW/SW benchmarking | Comprehensive metrics | L1, 3, 5 | 3 |
| Ravi et al. [75] | 2024 | Journal (IACR TCHES) | Kyber, Dilithium | Analysis | Side-channel survey | Attack taxonomy | N/A | L3 | 209 |
| Wang et al. (2024) [76] | 2024 | Conference (ACNS) | Kyber | Analysis | Masking attack | Breaks 3-share masking | N/A | L3 | 11 |
| Ravi et al. [86] | 2024 | Journal (IACR TCHES) | Kyber | Analysis | Countermeasure attack | Valid-ciphertext-only SCA | N/A | L3 | 6 |
| Xu et al. [80] | 2025 | Journal (IEEE TCAS-II) | Kyber | FPGA | Shuffling countermeasure | HW-friendly shuffling | 1.38× overhead | L3 | 6 |
| Tosun et al. [87] | 2025 | ePrint (IACR) | General lattice | Analysis | Higher-order SCA | Template-free attacks | N/A | L3, 5 | 0 |
| Lee et al. [88] | 2023 | Journal (MDPI Sensors) | CKKS (HE) | FPGA | FHE acceleration | Configurable ENC/DEC | 23.7× faster than SEAL | N/A | 19 |
| Nejatollahi et al. [81] | 2020 | Conference (DAC) | NTT-based | PIM | In-memory computing | ReRAM crossbar NTT | 31× FPGA throughput | N/A | 89 |
Legend: Citations are approximate Google Scholar counts as of December 2025. Abbreviations: AVX2 = Advanced Vector Extensions 2; FHE = fully homomorphic encryption; HE = homomorphic encryption.

6. Discussion and Critical Analysis

This section synthesizes the key findings from the architectural taxonomy (Section 4) and the bibliometric evidence (Table 9) to address four recurring design questions: (i) when to prefer FPGA versus ASIC implementations (Section 6.1), (ii) how algorithmic structure shapes architectural choices (Section 6.2), (iii) the concrete overheads associated with cryptographic agility (Section 6.3), and (iv) the security–performance tension induced by side-channel countermeasures (Section 6.4).

6.1. FPGA Versus ASIC: Quantitative Decision Framework

Selecting between FPGA and ASIC implementations extends beyond a simple cost–volume comparison. In practice, this decision is shaped by performance predictability, schedule risk, and time-to-market requirements, all of which are amplified during the post-quantum transition as specifications stabilize and deployments scale.
Performance Comparison from the Literature. Across comparable implementations, several consistent patterns emerge. Dang et al. report an ML-KEM-768 FPGA design operating at 450 MHz (1.76 M ops/s, 18.4k LUTs, 12 DSPs) on Virtex UltraScale+ [26]. In contrast, Imran’s 65 nm ASIC realization of comparable functionality reaches 400 MHz while occupying only 0.89 mm2 and consuming 15.2 mW [43]. Normalized energy efficiency highlights the central differentiator: approximately 120 μJ/op for the FPGA versus 6.4 μJ/op for the ASIC, corresponding to an 18.7× advantage for ASIC, consistent with the broader FPGA–ASIC energy gap reported in prior studies [38]. However, reconfigurability remains a decisive FPGA advantage in periods of specification churn. When NIST finalized the ML-KEM parameters (with minor deltas relative to Round 3 Kyber), FPGA designs could be updated via bitstream revisions on the order of weeks, whereas an ASIC redesign and re-fabrication typically requires 12–18 months.
Cost–Volume Break-Even Analysis. Using standard cost models [46], consider a representative FPGA unit cost of $150 (UltraScale+ -2 speed grade), an ASIC non-recurring engineering (NRE) cost of $800k (65 nm, 2 mm2 die, moderate complexity), and an ASIC unit cost of $5 at 100k volume. The break-even volume is
N_BE = NRE / (C_FPGA − C_ASIC) = 800,000 / (150 − 5) ≈ 5517 units,
i.e., on the order of ∼5500 units under these assumptions. Importantly, this simplified estimate omits the opportunity cost of schedule: an 18-month ASIC timeline versus a 6-month FPGA deployment can dominate business outcomes when standards finalize mid-development. Under a risk-adjusted assumption of a 30% probability of parameter change during the design cycle, the effective break-even volume can shift upward (e.g., toward ∼25,000 units) due to re-spin risk and delayed time-to-market.
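The break-even calculation can be sketched as follows. The `respin_prob` adjustment is a naive illustration of our own (it simply inflates effective NRE and does not capture the delayed-time-to-market penalty that drives the larger risk-adjusted figure discussed above):

```python
def break_even_units(nre: float, c_fpga: float, c_asic: float,
                     respin_prob: float = 0.0) -> float:
    """Break-even volume at which ASIC NRE is amortized by the per-unit
    saving. respin_prob naively inflates effective NRE to reflect the
    risk of a parameter change forcing a re-spin; it does not capture
    the time-to-market penalty discussed in the text."""
    effective_nre = nre * (1.0 + respin_prob)
    return effective_nre / (c_fpga - c_asic)

assert round(break_even_units(800_000, 150, 5)) == 5517          # baseline
assert round(break_even_units(800_000, 150, 5, 0.30)) == 7172    # naive risk adjustment
```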
Post Round-3 Selection Landscape Shift. After publication of FIPS 203/204, parameter stability increases materially, changing the decision calculus. FPGA implementations remain attractive for near-term deployments that hedge against errata and minor revisions, whereas ASIC becomes increasingly favorable for products having stable requirements and sufficient shipment volume (e.g., >50k units). A notable exception is ultra-low-power IoT endpoints requiring <10 mW active power, where ASIC development may be justified even under residual re-spin risk [48].
Hybrid Strategies. A pragmatic emerging strategy combines FPGA flexibility with ASIC-like efficiency by hardening stable arithmetic kernels while preserving reconfigurable capacity for protocol handling and updates. For example, RISC-V SoCs (systems-on-chip) that integrate custom PQC acceleration alongside reconfigurable fabric can retain update flexibility while offloading bottleneck kernels into dedicated hardware [72].

6.2. Algorithm-Specific Implementation Characteristics

We now compare the three NIST-standardized lattice-based PQC schemes discussed above in terms of the algorithm-specific implementation characteristics that shape design decisions and determine hardware complexity.
ML-KEM as the “Hardware-Friendly” Baseline. ML-KEM exhibits several design choices that align closely with efficient hardware realization: a small modulus ( q = 3329 ), power-of-two polynomial degree ( n = 256 ), and centered binomial sampling. These parameters enable compact 16-bit datapaths with sufficient headroom for intermediate values, while the symmetric k × k structure simplifies control relative to rectangular matrices. Consistent with these properties, Table 4 indicates that ML-KEM designs achieve approximately 2.1 × higher throughput-per-LUT than ML-DSA counterparts [26,79], providing a plausible explanation for ML-KEM’s dominance (45% of surveyed implementations).
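The claim that q = 3329 enables compact 16-bit datapaths can be made concrete with a modular-reduction sketch. Barrett reduction is one technique commonly used for this modulus; the constant and shift below are one standard choice, not taken from any specific cited design:

```python
Q = 3329  # ML-KEM modulus, fits in 12 bits

def barrett_reduce(x: int) -> int:
    """Barrett reduction mod q = 3329 for products of two values in [0, q),
    the kind of operation a 16-bit ML-KEM datapath performs after each
    multiplication. m = floor(2**26 / q); one conditional subtraction
    suffices because the approximate quotient errs by at most 1."""
    m = 20158                  # floor(2**26 / 3329)
    t = (x * m) >> 26          # approximate quotient floor(x / q)
    r = x - t * Q              # remainder in [0, 2q)
    return r - Q if r >= Q else r

# Deterministic check over a grid of coefficient products
for a in range(0, Q, 7):
    for b in range(0, Q, 13):
        assert barrett_reduce(a * b) == (a * b) % Q
```

In hardware, the multiply-and-shift maps to a single DSP slice plus a subtractor, which is why the small Kyber modulus is so area-friendly compared with ML-DSA's 23-bit modulus.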
ML-DSA: Complexity from Scale and Rejection Sampling. ML-DSA introduces architectural pressure along multiple axes. The larger modulus ( q 2 23 ) pushes arithmetic toward 32-bit datapaths, increasing area and often reducing maximum frequency relative to ML-KEM. Rectangular matrices ( k × , e.g., 6 × 5 for Level 3) complicate addressing and diminish symmetry in memory access patterns. Most critically, rejection sampling induces variable execution time: ML-DSA-65 signing averages approximately 4.5 iterations [6]. Hardware implications include restart-capable control logic, sustained SHAKE-256 throughput across retries, and constant-time enforcement that may require provisioning for worst-case cycle budgets to avoid timing leakage [28]. These factors plausibly contribute to ML-DSA’s smaller share (30%) despite being co-standardized alongside ML-KEM.
FALCON as an Outlier. Only 5% of surveyed implementations target FALCON, largely due to architectural divergence. FALCON relies on (i) FFT over C rather than NTT over Z q , (ii) a high-precision (53-bit mantissa) Gaussian sampler rather than integer CBD sampling, and (iii) Gram–Schmidt orthogonalization during key generation [21]. These primitives map poorly onto typical fixed-point datapaths and DSP fabrics, and prior work reports substantially higher implementation complexity than ML-KEM at comparable security levels [33]. Consequently, adoption remains cautious as the hardware ecosystem matures.
Implications for Unified Designs. Unification of ML-KEM and ML-DSA is architecturally tractable because both can share NTT engines, broadly similar memory structures, and integer arithmetic units. Reported overheads commonly fall in the 25–35% range relative to algorithm-specific designs, driven by wider datapaths and additional configurability in control and addressing logic [44,70]. In contrast, unifying ML-KEM with FALCON is inherently more challenging and would likely require heterogeneous datapaths (NTT + FFT and integer + floating-point), motivating a separate specialized FALCON core alongside an NTT-based engine.

6.3. The True Cost of Cryptographic Agility

Cryptographic agility—supporting multiple algorithms on a single hardware platform—is frequently cited as a transition necessity, yet the quantitative costs are often underreported. Here, we consolidate the evidence from the unified architectures summarized in Table 9.
Area Overhead. In KaLi [44], adding ML-DSA support to an ML-KEM-capable baseline increases resource usage to 18,406 LUTs versus approximately 12,200 LUTs for an ML-KEM-only reference design [26], corresponding to a 50.8% overhead. A plausible breakdown includes the requirement for 32-bit datapaths (thereby executing ML-KEM less efficiently), configurable control logic, and additional memory capacity to accommodate larger ML-DSA matrix dimensions. CRYPHTOR [70] reports lower overhead (approximately 35%) through aggressive sharing and multiplexing, albeit with reduced clock frequency due to routing congestion.
Performance Degradation. Unified accelerators commonly exhibit lower peak frequency. For example, KaLi operates at 200 MHz versus 450 MHz for a specialized ML-KEM design [26], a 55.6% reduction. Contributing factors include wider multiplexers on critical paths, more complex address generation for variable matrix geometries, and shared-resource contention that increases routing delay. Consequently, throughput in “unified ML-KEM mode” can fall well below the algorithm-specific baseline.
When Agility Justifies the Overhead. Unified designs are typically justified in three scenarios:
  • Dual-function PKI (Public Key Infrastructure): platforms requiring both ML-KEM and ML-DSA (e.g., TLS servers and VPN gateways), where a single accelerator reduces BOM (Bill of Materials) cost and board area.
  • Transition hedging: support for multiple parameter sets during a migration window to enable field updates.
  • Product differentiation: enterprise deployments that value cryptographic flexibility for compliance across multiple standards regimes.
Against Agility. For volume- and area-constrained endpoints (e.g., IoT sensors and smartcards), algorithm-specific accelerators are often more appropriate because the workload typically fixes on a single protocol, and a 50% area increase translates directly into higher per-unit silicon cost at scale.

6.4. Side-Channel Protection: Quantifying the Security–Performance Trade-Off

Side-channel attacks constitute a dominant practical threat to PQC implementations, yet robust protections remain expensive. In this section, we organize our analysis by threat category, examining specific attack vectors and their corresponding countermeasures with quantified overhead costs.
Threat Model. Power analysis attacks exploit correlations between intermediate cryptographic values and instantaneous power consumption or electromagnetic (EM) emissions [75]. In lattice-based schemes, vulnerable operations include NTT coefficient processing, the use of secret keys during decapsulation/signing, and rejection sampling decisions. Differential Power Analysis (DPA) recovers secret polynomials by analyzing power traces across multiple operations, while single-trace attacks target individual executions [76,86].
Timing attacks exploit variable execution time to leak secret-dependent information. Critical vulnerabilities in lattice schemes include: (i) Dilithium’s rejection sampling, whose iteration count is key-dependent (an average of 4.5 iterations for ML-DSA-65, with a distribution that depends on the secret key [6]); (ii) conditional branches in polynomial reduction or coefficient comparison; and (iii) early-abort optimizations in verification [28].
Fault injection attacks induce computational errors via voltage glitching, clock manipulation, or laser/EM fault injection to extract secrets or forge signatures. A concrete example in Dilithium is skipping the rejection sampling checks so that invalid signatures are accepted, revealing the secret key [75].
Countermeasures. Boolean masking is an effective countermeasure against power analysis: each sensitive variable is separated into d shares that individually appear random but jointly reconstruct the secret. First-order masking ( d = 2 shares) provides protection against first-order DPA. However, empirical attacks on low-order masked designs indicate that basic masking can be insufficient under strong adversarial models [76]. Extending protection to higher orders (e.g., third-order) can induce multi-fold increases in area and latency due to quadratic gadget scaling [89]. The absence of third-order (and beyond) ML-KEM implementations in the surveyed set suggests that cost remains a prohibitive barrier outside high-assurance deployments.
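The share-splitting principle can be sketched with arithmetic (additive) masking mod q, the variant typically applied to lattice coefficients; the helper names are our own:

```python
import secrets

Q = 3329

def mask(s: int) -> tuple:
    """Split a sensitive coefficient into d = 2 additive shares mod q.
    Each share alone is uniformly random; their sum recovers s."""
    s1 = secrets.randbelow(Q)
    return s1, (s - s1) % Q

def masked_add(x_shares: tuple, y_shares: tuple) -> tuple:
    """Linear operations are computed share-wise without recombining,
    so no intermediate value correlates with the secret."""
    return tuple((a + b) % Q for a, b in zip(x_shares, y_shares))

def unmask(shares: tuple) -> int:
    return sum(shares) % Q

x, y = 123, 3000
assert unmask(masked_add(mask(x), mask(y))) == (x + y) % Q
```

Linear steps such as the one shown are nearly free; the reported overheads stem from non-linear steps (compression, sampling, hashing), which require masked gadgets with O(d²) cost and a steady supply of fresh randomness.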
Implementation Overhead. In a representative masked ML-KEM design [77], the area increases from approximately 18k LUTs (unprotected baseline [26]) to 32k LUTs (a 77.8% increase). Latency rises from 45 μs to 68 μs (51.1% slowdown), while energy increases from 58 μJ to 89 μJ (53.4%). Overheads are driven by share-management logic, the need for fresh randomness (often thousands of random bits per operation), and masked conversion gadgets whose complexity scales as O ( d 2 ) in the number of shares. These empirical observations align with theoretical expectations for first-order masking costs [90].
Shuffling randomizes execution order to reduce exploitable leakage correlation. A representative shuffling-based design [80] reports an area increase from 18k LUTs to 22k LUTs (22.2% overhead) with modest latency impact. Although shuffling generally offers weaker guarantees than masking and can be vulnerable to multi-trace analysis under sophisticated adversaries [91], it remains a practical option for moderate-security deployments with tight resource budgets.
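A minimal software sketch of the shuffling idea, assuming a Fisher–Yates permutation over coefficient indices (real designs use a hardware PRNG and often permute at coarser granularity):

```python
import secrets

def fisher_yates(n: int) -> list:
    """Draw a uniformly random processing order for n coefficients.
    Illustrative only; hardware designs generate the permutation with
    a dedicated PRNG and dedicated index logic."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def shuffled_pointwise_mul(a: list, b: list, q: int = 3329) -> list:
    """Process coefficient products in randomized order, so power traces
    no longer align with fixed coefficient indices across executions."""
    out = [0] * len(a)
    for i in fisher_yates(len(a)):
        out[i] = (a[i] * b[i]) % q
    return out

a, b = list(range(256)), list(range(256, 512))
assert shuffled_pointwise_mul(a, b) == [(x * y) % 3329 for x, y in zip(a, b)]
```

The result is unchanged; only the temporal order of leakage varies per execution, which raises the number of traces an attacker needs rather than removing the leakage outright.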
Practical Recommendations. A pragmatic three-tier deployment model is as follows:
  • Consumer IoT: no protection or lightweight shuffling, prioritizing cost.
  • Enterprise/financial: first-order masking to balance security and performance.
  • Critical infrastructure/government: higher-order masking and/or physical countermeasures (e.g., tamper resistance), accepting substantial overheads.
Based on Table 9, most existing implementations emphasize performance over physical security; this may be appropriate under current deployment assumptions but could prove insufficient as long-term post-quantum adoption expands.

6.5. Open Challenges and Research Gaps

The surveyed literature reveals two critical research gaps that persist in the hardware acceleration landscape for lattice-based cryptography.
Lack of System-Level Integration Studies. The surveyed literature largely optimizes cryptographic kernels (KeyGen/Encaps/Sign) in isolation, with limited quantification of end-to-end protocol overhead. In practice, protocol-level performance depends on additional components—e.g., certificate chain validation, state-machine handling, DMA transfers, and network round-trips in TLS 1.3—of which accelerator latency is only one contributor. A key research gap is the design and evaluation of complete protocol accelerators that integrate ML-KEM and ML-DSA with protocol state machines, DMA engines, and network interfaces.
Insufficient ASIC Implementation Evidence at Scale. While ASIC results demonstrate compelling energy advantages, the surveyed set contains comparatively few ASIC implementations, and even fewer that report comprehensive, apples-to-apples comparisons across process nodes, operating corners, and realistic system constraints. More open, reproducible ASIC studies—covering memory macros, I/O, physical design closure, and side-channel countermeasures under sign-off assumptions—would materially strengthen the evidence base for deployment-oriented decision making.

7. Conclusions

This survey provides an extensive, hardware-oriented review of lattice-based post-quantum cryptographic accelerators, synthesizing representative FPGA and ASIC implementations published between 2020 and 2025 in the context of NIST standardization. Our bibliometric analysis of 20 key works reveals the rapid development of the discipline: about 70% of the surveyed works appeared in 2023–2025, coinciding with the issuance of FIPS 203 and FIPS 204. This concentration reflects a transition from proof-of-concept designs to deployment-oriented hardware architectures.
Performance Evolution and Algorithmic Trends. Algorithm coverage shows clear preferences: CRYSTALS-Kyber leads with 45% of implementations, followed by CRYSTALS-Dilithium at 30%, while FALCON accounts for only around 5% despite its standardization path. This contrast stems from architectural compatibility: Kyber’s NTT-based integer arithmetic maps easily onto ordinary FPGA and ASIC datapaths, whereas FALCON’s FFT-based floating-point arithmetic and high-precision sampling impose specialized requirements. Over the survey period, FPGA-based Kyber accelerators achieved an average 3.2× improvement in normalized throughput, driven by advances in NTT scheduling, memory layout, and toolchain maturity. Nevertheless, fundamental trade-offs persist between high-frequency designs (400–500 MHz, 15–25k LUTs) and area-constrained IoT implementations that sacrifice 5–10× in throughput to fit sub-5k LUT budgets.
Trade-offs: Flexibility, Security, and Cost. The share of unified multi-algorithm accelerators grew from approximately 5% in 2020–2022 to 15% in 2023–2025, in line with the demand for cryptographic agility and complete post-quantum public-key infrastructures on a single device. This flexibility comes with measurable costs: unified designs typically show 25–55% performance degradation and 35–50% area overhead relative to algorithm-optimized implementations. At the same time, security-oriented research has grown substantially, increasing from 10% to 35% of publications. Strong physical protections nevertheless remain costly: first-order masking generally introduces ∼78% area and ∼51% latency overhead, while higher-order masking remains largely unaddressed in practical hardware implementations.
Outlook. Key open issues include (a) advanced-node ASIC implementations suitable for SoC integration, (b) formal verification mechanisms that jointly address functional correctness and side-channel security, (c) system-level accelerators for TLS/IPsec rather than isolated kernels, and (d) hardware-efficient FALCON architectures. In the short term, FPGA-based solutions remain appealing for early deployment and specification flexibility. As standards stabilize and production volumes expand, ASIC implementations are expected to become more prevalent, especially after around 2026, and particularly for products exceeding ∼50,000 units, where energy-efficiency gains of up to 18.7× have been reported. Overall, this survey offers a reference point and a decision-making framework for designing secure, efficient, and deployable hardware accelerators in the post-quantum era.

Author Contributions

Conceptualization, H.Y. and P.H.; methodology, H.Y., L.W. and P.H.; validation, H.Y.; formal analysis, H.Y., L.W. and P.H.; investigation, H.Y.; resources, H.Y.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y., L.W., Q.S. and P.H.; project administration, P.H.; funding acquisition, P.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Auburn University at Montgomery’s Grant-in-Aid program.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shor, P.W. Algorithms for quantum computation: Discrete logarithms and factoring. In Proceedings 35th Annual Symposium on Foundations of Computer Science; IEEE: Piscataway, NJ, USA, 1994; pp. 124–134. [Google Scholar]
  2. PQC Standardization Process: Announcing Four Candidates to be Standardized, Plus Fourth Round Candidates. Available online: https://csrc.nist.gov/News/2022/pqc-candidates-to-be-standardized-and-round-4 (accessed on 30 September 2025).
  3. Hwang, V. A Survey of Polynomial Multiplications for Lattice-Based Cryptosystems. Cryptol. ePrint Arch. 2023. Available online: https://eprint.iacr.org/2023/1962 (accessed on 30 September 2025).
  4. Kannwischer, M.J.; Rijneveld, J.; Schwabe, P.; Stoffelen, K. PQM4: Post-Quantum Crypto Library for the ARM Cortex-M4. 2018. Available online: https://github.com/mupq/pqm4 (accessed on 30 September 2025).
  5. Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber: A CCA-secure module-lattice-based KEM. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P); IEEE: Piscataway, NJ, USA, 2018; pp. 353–367. [Google Scholar]
  6. Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Dilithium: A Lattice-Based Digital Signature Scheme. In IACR Transactions on Cryptographic Hardware and Embedded Systems; 2018; pp. 238–268. Available online: https://tches.iacr.org/index.php/TCHES/article/view/839 (accessed on 30 September 2025).
  7. Carril, X.; Kardaris, C.; Ribes-González, J.; Farràs, O.; Hernandez, C.; Kostalabros, V.; González-Jiménez, J.U.; Moreto, M. Hardware Acceleration for High-Volume Operations of CRYSTALS-Kyber and CRYSTALS-Dilithium. ACM Trans. Reconfigurable Technol. Syst. 2024, 17, 1–26. [Google Scholar] [CrossRef]
  8. Micciancio, D.; Goldwasser, S. Complexity of Lattice Problems: A Cryptographic Perspective; Springer Science & Business Media: Cham, Switzerland, 2002; Volume 671. [Google Scholar]
  9. Ajtai, M. The shortest vector problem in L2 is NP-hard for randomized reductions (extended abstract). In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 24–26 May 1998; STOC ’98. pp. 10–19. [Google Scholar] [CrossRef]
  10. Aaronson, S.; Arkhipov, A. The computational complexity of linear optics. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, New York, NY, USA, 6–8 June 2011; STOC ’11. pp. 333–342. [Google Scholar] [CrossRef]
  11. Regev, O. On lattices, learning with errors, random linear codes, and cryptography. J. ACM 2009, 56, 1–40. [Google Scholar] [CrossRef]
  12. Brakerski, Z.; Vaikuntanathan, V. Efficient fully homomorphic encryption from (standard) LWE. SIAM J. Comput. 2014, 43, 831–871. [Google Scholar] [CrossRef]
  13. Lyubashevsky, V.; Peikert, C.; Regev, O. On Ideal Lattices and Learning with Errors over Rings. J. ACM 2013, 60, 1–35. [Google Scholar] [CrossRef]
  14. Peikert, C.; Waters, B. Lossy trapdoor functions and their applications. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 17–20 May 2008; STOC ’08. pp. 187–196. [Google Scholar] [CrossRef]
  15. Langlois, A.; Stehlé, D. Worst-case to average-case reductions for module lattices. Des. Codes Cryptogr. 2015, 75, 565–599. [Google Scholar] [CrossRef]
  16. Micciancio, D.; Peikert, C. Hardness of SIS and LWE with Small Parameters. In Annual Cryptology Conference; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar] [CrossRef]
  17. Module-Lattice-Based Key-Encapsulation Mechanism Standard. Available online: https://csrc.nist.gov/pubs/fips/203/final (accessed on 30 September 2025).
  18. Module-Lattice-Based Digital Signature Standard. Available online: https://csrc.nist.gov/pubs/fips/204/final (accessed on 30 September 2025).
  19. Pöppelmann, T.; Ducas, L.; Güneysu, T. Enhanced lattice-based signatures on reconfigurable hardware. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2014; pp. 353–370. [Google Scholar]
  20. Barrett, P. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Conference on the Theory and Application of Cryptographic Techniques; Springer: Berlin/Heidelberg, Germany, 1986; pp. 311–323. [Google Scholar]
  21. Fouque, P.A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Prest, T.; Ricosset, T.; Seiler, G.; Whyte, W. FALCON: Fast-Fourier Lattice-based Compact Signatures over NTRU. In NIST Post-Quantum Cryptography Project; Submission to the NIST PQC Standardization Process; 2020; Available online: https://www.di.ens.fr/~prest/Publications/falcon.pdf (accessed on 30 September 2025).
  22. Montgomery, P.L. Modular multiplication without trial division. Math. Comput. 1985, 44, 519–521. [Google Scholar] [CrossRef]
  23. Bos, J.W. Constant time modular inversion. J. Cryptogr. Eng. 2014, 4, 275–281. [Google Scholar] [CrossRef]
  24. Plantard, T.; Susilo, W.; Zhang, Z. LLL for ideal lattices: Re-evaluation of the security of Gentry–Halevi’s FHE scheme. Des. Codes Cryptogr. 2015, 76, 325–344. [Google Scholar] [CrossRef]
  25. Alagic, G.; Apon, D.; Cooper, D.; Dang, Q.; Dang, T.; Kelsey, J.; Liu, Y.K.; Miller, C.; Moody, D.; Peralta, R.; et al. Status Report on the Third Round of the NIST Post-Quantum Cryptography Standardization Process; Nist Interagency/Internal Report; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2022. [Google Scholar]
  26. Dang, V.B.; Mohajerani, K.; Gaj, K. High-speed hardware architectures and FPGA benchmarking of CRYSTALS-Kyber, NTRU, and Saber. IEEE Trans. Comput. 2022, 72, 306–320. [Google Scholar] [CrossRef]
  27. Fiat, A.; Shamir, A. How to Prove Yourself: Practical Solutions to Identification and Signature Problems. In Advances in Cryptology—CRYPTO 1986; Springer: Berlin/Heidelberg, Germany, 1986; pp. 186–194. [Google Scholar]
  28. Saarinen, M.J.O. Arithmetic Coding and Blinding Countermeasures for Lattice Signatures: Engineering a Side-Channel Resistant Post-Quantum Signature Scheme with Compact Signatures. J. Cryptogr. Eng. 2018, 8, 71–84. [Google Scholar] [CrossRef]
  29. Hoffstein, J.; Pipher, J.; Silverman, J.H. NTRU: A Ring-Based Public Key Cryptosystem. In International Algorithmic Number Theory Symposium; Springer: Berlin/Heidelberg, Germany, 1998; pp. 267–288. [Google Scholar]
  30. Ducas, L.; Nguyen, P.Q. Learning a Zonotope and More: Cryptanalysis of NTRUSign Countermeasures. In Advances in Cryptology—ASIACRYPT 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 433–450. [Google Scholar]
  31. Post-Quantum Cryptography Round 3 Submissions. Available online: https://csrc.nist.gov/projects/post-quantum-cryptography/post-quantum-cryptography-standardization/round-3-submissions (accessed on 30 September 2025).
  32. Ducas, L.; Nguyen, P.Q. Faster Gaussian lattice sampling using lazy floating-point arithmetic. In International Conference on the Theory and Application of Cryptology and Information Security; Springer: Berlin/Heidelberg, Germany, 2012; pp. 415–432. [Google Scholar]
  33. Karabulut, E.; Aysu, A. Falcon down: Breaking falcon post-quantum signature scheme through side-channel attacks. In 2021 58th ACM/IEEE Design Automation Conference (DAC); IEEE: Piscataway, NJ, USA, 2021; pp. 691–696. [Google Scholar]
  34. Howe, J.; Prest, T.; Apon, D. SoK: How (not) to design and implement post-quantum cryptography. In Cryptographers’ Track at the RSA Conference; Springer: Berlin/Heidelberg, Germany, 2021; pp. 444–477. [Google Scholar]
  35. Xilinx. UltraScale Architecture and Product Data Sheet: Overview. Available online: https://docs.amd.com/api/khub/documents/dGU6Y~1b8XPqDFk5ulti6g/content (accessed on 30 September 2025).
  36. Mert, A.C.; Öztürk, E.; Savaş, E. Design and implementation of a fast and scalable NTT-based polynomial multiplier architecture. In 2019 22nd Euromicro Conference on Digital System Design (DSD); IEEE: Piscataway, NJ, USA, 2019; pp. 253–260. [Google Scholar]
  37. Rabaey, J.M.; Chandrakasan, A.; Nikolic, B. Digital Integrated Circuits; Prentice Hall: Englewood Cliffs, NJ, USA, 2002; Volume 2. [Google Scholar]
  38. Roy, K.; Prasad, S.C. Low-Power CMOS VLSI Circuit Design; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  39. Xilinx. Vivado Design Suite User and Reference Guides. Available online: https://docs.amd.com/r/2021.2-English/ug949-vivado-design-methodology/Vivado-Design-Suite-User-and-Reference-Guides (accessed on 30 September 2025).
  40. Kilts, S. Advanced FPGA Design: Architecture, Implementation, and Optimization; John Wiley & Sons: Hoboken, NJ, USA, 2007. [Google Scholar]
  41. Synopsys. Design White Papers: A Holistic Approach to Energy-Efficient System-on-Chip (SoC) Design. Available online: https://www.synopsys.com/content/dam/synopsys/solutions/documents/a-holistic-approach-to-energy-efficient-soc-design-wp.pdf (accessed on 30 September 2025).
  42. Cadence. Best Full-Flow PPA. Available online: https://www.cadence.com/en_US/home/resources/white-papers/best-full-flow-ppa-wp.html (accessed on 30 September 2025).
  43. Roy, S.S.; Basso, A. High-Speed Instruction-Set Coprocessor for Lattice-Based Key Encapsulation Mechanism: Saber in Hardware. In IACR Transactions on Cryptographic Hardware and Embedded Systems; 2020; Volume 2020, Issue 4, Available online: https://tches.iacr.org/index.php/TCHES/article/view/8690 (accessed on 30 September 2025).
  44. Aikata, A.; Mert, A.C.; Imran, M.; Pagliarini, S.; Roy, S.S. KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 70, 747–758. [Google Scholar] [CrossRef]
  45. Weste, N.H.E.; Harris, D. CMOS VLSI Design: A Circuits and Systems Perspective; Pearson Education India: Noida, India, 2015. [Google Scholar]
  46. Kuon, I.; Rose, J. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2006; pp. 21–30. [Google Scholar]
  47. Howe, J.; Moore, C.; O’Neill, M.; Regazzoni, F.; Güneysu, T.; Beeden, K. Lattice-Based Encryption over Standard Lattices in Hardware. In Proceedings of the 53rd Annual Design Automation Conference, Austin, TX, USA, 5–9 June 2016; pp. 1–6. [Google Scholar]
  48. Senor, J.; Portilla, J.; Mujica, G. Analysis of the NTRU Post-Quantum Cryptographic Scheme in Constrained IoT Edge Devices. IEEE Internet Things J. 2022, 9, 18778–18790. [Google Scholar] [CrossRef]
  49. Montgomery, P.L. Speeding the Pollard and elliptic curve methods of factorization. Math. Comput. 1987, 48, 243–264. [Google Scholar] [CrossRef]
  50. Harvey, D.; Van Der Hoeven, J. Integer multiplication in time O(n log n). Ann. Math. 2021, 193, 563–617. [Google Scholar] [CrossRef]
  51. Cooley, J.W.; Tukey, J.W. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
  52. Roy, S.S.; Vercauteren, F.; Mentens, N.; Chen, D.D.; Verbauwhede, I. Compact Ring-LWE Cryptoprocessor. In Cryptographic Hardware and Embedded Systems—CHES 2014; Batina, L., Robshaw, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 371–391. [Google Scholar]
  53. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-Speed NTT-Based Polynomial Multiplication Accelerator for CRYSTALS-Kyber Post-Quantum Cryptography. Cryptol. ePrint Arch. 2021. Available online: https://eprint.iacr.org/2021/563 (accessed on 30 September 2025).
  54. Sun, J.; Bai, X.; Kang, Y. An FPGA-Based Efficient NTT Accelerator for Post-Quantum Cryptography CRYSTALS-Kyber. In 2023 IEEE International Conference on Integrated Circuits, Technologies and Applications (ICTA); IEEE: Piscataway, NJ, USA, 2023; pp. 142–143. [Google Scholar]
  55. Agarwal, R.C.; Cooley, J. New Algorithms for Digital Convolution. IEEE Trans. Acoust. Speech Signal Process. 2003, 25, 392–410. [Google Scholar] [CrossRef]
  56. Fritzmann, T.; Sepúlveda, J. Efficient and Flexible Low-Power NTT for Lattice-Based Cryptography. In 2019 IEEE International Symposium on Hardware Oriented Security and Trust (HOST); IEEE: Piscataway, NJ, USA, 2019; pp. 141–150. [Google Scholar]
  57. Turan, F.; Verbauwhede, I. Compact and Flexible FPGA Implementation of Ed25519 and X25519. ACM Trans. Embed. Comput. Syst. 2019, 18, 1–21. [Google Scholar] [CrossRef]
  58. Liu, Z.; Seo, H.; Roy, S.S.; Großschädl, J.; Kim, H.; Verbauwhede, I. Efficient Ring-LWE Encryption on 8-Bit AVR Processors. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2015; pp. 663–682. [Google Scholar]
  59. Göttert, N.; Feller, T.; Schneider, M.; Buchmann, J.; Huss, S. On the Design of Hardware Building Blocks for Modern Lattice-Based Encryption Schemes. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2012; pp. 512–529. [Google Scholar]
  60. Land, G.; Sasdrich, P.; Güneysu, T. A Hard Crystal—Implementing Dilithium on Reconfigurable Hardware. In International Conference on Smart Card Research and Advanced Applications; Springer International Publishing: Chem, Switerland, 2021; pp. 210–230. [Google Scholar]
  61. Peikert, C. An Efficient and Parallel Gaussian Sampler for Lattices. In Advances in Cryptology—CRYPTO 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 80–97. [Google Scholar]
  62. Roy, S.S.; Reparaz, O.; Vercauteren, F.; Verbauwhede, I. Compact and Side Channel Secure Discrete Gaussian Sampling. Cryptol. ePrint Arch. 2014. Available online: https://eprint.iacr.org/2014/591 (accessed on 30 September 2025).
  63. Karatsuba, A. Multiplication of Multidigit Numbers on Automata. Soviet Physics Doklady. 1963. Available online: https://www.researchgate.net/publication/234346907_Multiplication_of_Multidigit_Numbers_on_Automata#fullTextFileContent (accessed on 30 September 2025).
  64. Cook, S.A.; Aanderaa, S.O. On the minimum computation time of functions. Trans. Am. Math. Soc. 1969, 142, 291–314. [Google Scholar] [CrossRef]
  65. Pöppelmann, T.; Güneysu, T. Towards Practical Lattice-Based Public-Key Encryption on Reconfigurable Hardware. In International Conference on Selected Areas in Cryptography; Springer: Berlin/Heidelberg, Germany, 2013; pp. 68–85. [Google Scholar]
  66. Dworkin, M.J. SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions; Technical Report; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2015. [Google Scholar]
  67. Beckwith, L.; Nguyen, D.T.; Gaj, K. High-Performance Hardware Implementation of CRYSTALS-Dilithium. In 2021 International Conference on Field-Programmable Technology (ICFPT); IEEE: Piscataway, NJ, USA, 2021; pp. 1–10. [Google Scholar]
  68. Bos, J.W.; Costello, C.; Naehrig, M.; Stebila, D. Post-Quantum Key Exchange for the TLS Protocol from the Ring Learning with Errors Problem. In 2015 IEEE Symposium on Security and Privacy; IEEE: Piscataway, NJ, USA, 2015; pp. 553–570. [Google Scholar]
  69. Zhang, N.; Yang, B.; Chen, C.; Yin, S.; Wei, S.; Liu, L. Highly Efficient Architecture of NewHope-NIST on FPGA Using Low-Complexity NTT/INTT. In IACR Transactions on Cryptographic Hardware and Embedded Systems; International Association for Cryptologic Research: Bellevue, WA, USA, 2020; Volume 2. [Google Scholar]
  70. Matteo, S.D.; Sarno, I.; Saponara, S. CRYPHTOR: A Memory-Unified NTT-Based Hardware Accelerator for Post-Quantum CRYSTALS Algorithms. IEEE Access 2024, 12, 25501–25511. [Google Scholar] [CrossRef]
  71. Antognazza, F.; Barenghi, A.; Pelosi, G.; Susella, R. Performance and Efficiency Exploration of Hardware Polynomial Multipliers for Post-Quantum Lattice-Based Cryptosystems. SN Comput. Sci. 2024, 5, 212. [Google Scholar] [CrossRef]
  72. Fritzmann, T.; Van Beirendonck, M.; Basu Roy, D.; Karl, P.; Schamberger, T.; Verbauwhede, I.; Sigl, G. Masked Accelerators and Instruction Set Extensions for Post-Quantum Cryptography. In IACR Transactions on Cryptographic Hardware and Embedded Systems; International Association for Cryptologic Research (IACR): Bellevue, WA, USA, 2022; pp. 414–460. Available online: https://tches.iacr.org/index.php/TCHES/article/view/9303 (accessed on 30 September 2025).
  73. Nannipieri, P.; Di Matteo, S.; Zulberti, L.; Albicocchi, F.; Saponara, S.; Fanucci, L. A RISC-V Post-Quantum Cryptography Instruction Set Extension for Number Theoretic Transform to Speed-Up CRYSTALS Algorithms. IEEE Access 2021, 9, 150798–150808. [Google Scholar] [CrossRef]
  74. Gupta, N.; Jati, A.; Chattopadhyay, A.; Jha, G. Lightweight Hardware Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 70, 3234–3243. [Google Scholar] [CrossRef]
  75. Ravi, P.; Roy, S.S.; Chattopadhyay, A.; Bhasin, S. Generic Side-Channel Attacks on CCA-Secure Lattice-Based PKE and KEMs. In IACR Transactions on Cryptographic Hardware and Embedded Systems; International Association for Cryptologic Research (IACR): Bellevue, WA, USA, 2020; pp. 307–335. Available online: https://tches.iacr.org/index.php/TCHES/article/view/8592 (accessed on 30 September 2025).
  76. Wang, R.; Brisfors, M.; Dubrova, E. A Side-Channel Attack on a Higher-Order Masked CRYSTALS-Kyber Implementation. In International Conference on Applied Cryptography and Network Security; Springer Nature: Cham, Switzerland, 2024; pp. 301–324. [Google Scholar]
  77. Jati, A.; Gupta, N.; Chattopadhyay, A.; Sanadhya, S.K. A Configurable CRYSTALS-Kyber Hardware Implementation with Side-Channel Protection. ACM Trans. Embed. Comput. Syst. 2024, 23, 1–25. [Google Scholar] [CrossRef]
  78. Keele, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Version 2.3; EBSE Technical Report; EBSE: Rio de Janeiro, Brazil, 2007. [Google Scholar]
  79. Mao, G.; Chen, D.; Li, G.; Dai, W.; Sanka, A.I.; Koç, Ç.K.; Cheung, R.C.C. High-Performance and Configurable SW/HW Co-Design of Post-Quantum Signature CRYSTALS-Dilithium. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–28. [Google Scholar] [CrossRef]
  80. Xu, D.; Wang, K.; Tian, J. A Hardware-Friendly Shuffling Countermeasure against Side-Channel Attacks for Kyber. IEEE Trans. Circuits Syst. II Express Briefs 2025, 72, 504–508. [Google Scholar] [CrossRef]
  81. Nejatollahi, H.; Gupta, S.; Imani, M.; Simunic Rosing, T.; Cammarota, R.; Dutt, N. CryptoPIM: In-Memory Acceleration for Lattice-Based Cryptographic Hardware. In 2020 57th ACM/IEEE Design Automation Conference (DAC); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  82. Wan, L.; Zheng, F.; Lin, J. TESLAC: Accelerating Lattice-Based Cryptography with AI Accelerator. In International Conference on Security and Privacy in Communication Systems; Springer International Publishing: Cham, Switzerland, 2021; pp. 249–269. [Google Scholar]
  83. Karakaya, A.; Ulu, A. A Survey on Post-Quantum Based Approaches for Edge Computing Security. Wiley Interdiscip. Rev. Comput. Stat. 2024, 16, e1644. [Google Scholar] [CrossRef]
  84. Wang, X.; Xu, G.; Yu, Y. Lattice-Based Cryptography: A Survey. Chin. Ann. Math. Ser. B 2023, 44, 945–960. [Google Scholar] [CrossRef]
  85. Bandaru, M.; Mathe, S.E.; Wattanapanich, C. Evaluation of Hardware and Software Implementations for NIST Finalist and Fourth-Round Post-Quantum Cryptography KEMs. Comput. Electr. Eng. 2024, 120, 109826. [Google Scholar] [CrossRef]
  86. Ravi, P.; Paiva, T.; Jap, D.; D’anvers, J.P.; Bhasin, S. Defeating Low-Cost Countermeasures against Side-Channel Attacks in Lattice-Based Encryption. IACR Transactions on Cryptographic Hardware and Embedded Systems. 2024. Available online: https://eprint.iacr.org/2023/1627 (accessed on 30 September 2025).
  87. Tosun, T.; Oswald, E.; Savaş, E. Non-Profiled Higher-Order Side-Channel Attacks against Lattice-Based Post-Quantum Cryptography. Cryptology ePrint Archive. 2025. Available online: https://eprint.iacr.org/2025/1257 (accessed on 30 September 2025).
  88. Lee, J.; Duong, P.N.; Lee, H. Configurable Encryption and Decryption Architectures for CKKS-Based Homomorphic Encryption. Sensors 2023, 23, 7389. [Google Scholar] [CrossRef]
  89. Chari, S.; Jutla, C.S.; Rao, J.R.; Rohatgi, P. Towards Sound Approaches to Counteract Power-Analysis Attacks. In Advances in Cryptology—CRYPTO 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 398–412. [Google Scholar]
  90. Ishai, Y.; Sahai, A.; Wagner, D. Private circuits: Securing hardware against probing attacks. In Annual International Cryptology Conference; Springer: Berlin/Heidelberg, Germany, 2003; pp. 463–481. [Google Scholar]
  91. Prouff, E.; Rivain, M. Masking against Side-Channel Attacks: A Formal Security Proof. In Advances in Cryptology—EUROCRYPT 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 142–159. [Google Scholar]
Table 1. Comparison of NIST-standardized lattice-based cryptographic schemes.
Property | ML-KEM (CRYSTALS-Kyber) | ML-DSA (CRYSTALS-Dilithium) | FALCON
NIST Standard | FIPS 203 | FIPS 204 | FIPS 206 (draft)
Primitive Type | Key Encapsulation Mechanism (KEM) | Digital Signature Algorithm (DSA) | Digital Signature Algorithm (DSA)
Underlying Hardness | Module-LWE | Module-LWE | NTRU lattice
Algebraic Structure | Module over R_q^k | Module over R_q^{k×ℓ} | NTRU lattice over Z[x]/(x^n + 1)
Ring/Modulus | R_q = Z_q[x]/(x^n + 1) | R_q = Z_q[x]/(x^n + 1) | Z_q[x]/(x^n + 1)
Typical Parameters | n = 256, q = 3329, k ∈ {2, 3, 4} | n = 256, q = 8,380,417, (k, ℓ) ∈ {(4, 4), (6, 5), (8, 7)} | n ∈ {512, 1024}, q = 12,289
Security Levels | Levels 1, 3, 5 | Levels 2, 3, 5 | Levels 1, 5
Dominant Operation | NTT-based polynomial multiplication | NTT-based polynomial multiplication | FFT-based polynomial multiplication
Transform Domain | NTT over Z_q | NTT over Z_q | FFT over C
Arithmetic Type | Integer (16-bit) | Integer (32-bit) | Floating-point (double precision)
Sampling Techniques | Centered binomial distribution (CBD) | Uniform + rejection sampling | Discrete Gaussian (FFT-based)
Matrix Structure | Square (k × k) | Rectangular (k × ℓ) | None (trapdoor basis)
Key Generation Cost | Low | Moderate | High (Gram–Schmidt + FFT)
Signature/Ciphertext Size | Ciphertext: ≈768–1568 bytes | Signature: ≈2420–4595 bytes | Signature: ≈660–1280 bytes
Constant-Time Challenges | Minimal | Rejection sampling | Floating-point arithmetic
Hardware Maturity | High (widely implemented) | High (widely implemented) | Low (few implementations)
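The dominant operation shared by ML-KEM and ML-DSA in Table 1—polynomial multiplication in R_q = Z_q[x]/(x^n + 1) via the NTT—can be illustrated with a toy sketch. The parameters below (n = 8, q = 17, ψ = 3) are illustrative only, not the standardized values: Kyber’s q = 3329 admits only an incomplete NTT, so a small prime with a full 2n-th root of unity is used here, and the transform is written as a direct O(n²) evaluation rather than a butterfly network. The result is checked against schoolbook negacyclic multiplication.

```python
# Toy negacyclic NTT-based multiplication in R_q = Z_q[x]/(x^n + 1).
# ILLUSTRATIVE parameters (n = 8, q = 17, psi = 3), not the standardized ones.
n, q = 8, 17
psi = 3                       # primitive 16th root of unity mod 17; psi^n = -1

def ntt(a):
    """Forward negacyclic NTT: evaluate a(x) at the odd powers psi^(2i+1),
    which are exactly the roots of x^n + 1 mod q."""
    return [sum(aj * pow(psi, (2 * i + 1) * j, q) for j, aj in enumerate(a)) % q
            for i in range(n)]

def intt(A):
    """Inverse transform: a_j = n^(-1) * psi^(-j) * sum_i A_i * psi^(-2ij)."""
    inv_n, inv_psi = pow(n, -1, q), pow(psi, -1, q)
    return [(inv_n * pow(inv_psi, j, q)
             * sum(Ai * pow(inv_psi, 2 * i * j, q) for i, Ai in enumerate(A))) % q
            for j in range(n)]

def negacyclic_mul(a, b):
    """Schoolbook reference: multiply mod x^n + 1 (wrap-around flips sign)."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            sign = 1 if k < n else -1   # x^n = -1 in this ring
            res[k % n] = (res[k % n] + sign * ai * bj) % q
    return res

# Transform, multiply pointwise, transform back -- must match schoolbook.
a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
c = intt([(x * y) % q for x, y in zip(ntt(a), ntt(b))])
assert c == negacyclic_mul(a, b)
```

Hardware accelerators replace the quadratic evaluation above with log n stages of butterfly units, but the pointwise-multiply-in-transform-domain structure is the same.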
Table 2. FPGA vs. ASIC comparison for lattice-based cryptographic acceleration.
Dimension | FPGA | ASIC
Development Time | 3–6 months (design to hardware) | 12–24 months (design to silicon)
Development Cost | $50k–$200k (tools, engineering effort) | $500k–$5M (NRE, masks, fabrication)
Unit Cost (1k volume) | $50–$500 per device | $2–$20 per device
Performance (typical) | 200–450 MHz for NTT cores | 300–600 MHz for equivalent logic
Power Consumption (Kyber-768) | 50–200 mW | 10–50 mW
Energy per Operation | 50–200 μJ (FPGA overhead) | 5–25 μJ (2–10× improvement)
Area Efficiency | ∼20k LUTs (Kyber, moderate design) | 0.5–2 mm² @ 65 nm (4–8× denser)
Reconfigurability | Full (bitstream update) | None (fixed at fabrication)
Time to Market | Fast (weeks for design update) | Slow (re-fabrication required)
Cryptographic Agility | Excellent (supports multiple algorithms) | Limited (algorithm fixed at design time)
Side-Channel Leakage | Higher (irregular routing capacitance) | Lower (predictable physical layout)
Primary Suitability | Prototyping, moderate volume, evolving standards | High-volume, power-critical, stable standards
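The cost rows of Table 2 imply a break-even production volume above which the ASIC’s higher NRE is recovered by its lower unit cost. The sketch below is a back-of-envelope calculation using illustrative midpoints of the table’s cost ranges (actual NRE and unit costs are project-specific), not a rigorous cost model.

```python
# Back-of-envelope FPGA-vs-ASIC break-even volume.
# Inputs are ILLUSTRATIVE midpoints of the cost ranges in Table 2.
def break_even_volume(fpga_nre, fpga_unit, asic_nre, asic_unit):
    """Smallest unit count at which total ASIC cost drops below total FPGA cost."""
    assert fpga_unit > asic_unit, "ASIC must be cheaper per unit to ever break even"
    delta_nre = asic_nre - fpga_nre          # extra up-front ASIC investment
    per_unit_saving = fpga_unit - asic_unit  # recovered on every unit shipped
    return -(-delta_nre // per_unit_saving)  # ceiling division

# Midpoints: FPGA ~$125k NRE / $275 per device; ASIC ~$2.75M NRE / $11 per device.
volume = break_even_volume(125_000, 275, 2_750_000, 11)
```

Under these assumed midpoints the crossover sits around ten thousand units, consistent in order of magnitude with the deployment guidance in the conclusions; shifting any input within its Table 2 range moves the crossover accordingly.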
Table 3. Memory requirements for parameter sets of lattice-based NIST-standardized algorithms across all security levels.
Scheme | Security Level | Polynomial Storage | Matrix A | Secrets | Working Memory | Total RAM
ML-KEM-512 | Level 1 | 2 × 256 × 12 bit | 4 polynomials (generated from ρ) | 4 polynomials | 4 polynomials (NTT) | ∼12 kB
ML-KEM-768 | Level 3 | 3 × 256 × 12 bit | 9 polynomials (generated from ρ) | 6 polynomials | 4 polynomials (NTT) | ∼18 kB
ML-KEM-1024 | Level 5 | 4 × 256 × 12 bit | 16 polynomials (generated from ρ) | 8 polynomials | 4 polynomials (NTT) | ∼25 kB
ML-DSA-44 | Level 2 | 4 × 256 × 23 bit | 16 polynomials (generated from ρ) | 8 polynomials | 6 polynomials (NTT) | ∼32 kB
ML-DSA-65 | Level 3 | 5 × 256 × 23 bit | 30 polynomials (generated from ρ) | 11 polynomials | 6 polynomials (NTT) | ∼45 kB
ML-DSA-87 | Level 5 | 7 × 256 × 23 bit | 56 polynomials (generated from ρ) | 15 polynomials | 6 polynomials (NTT) | ∼65 kB
FALCON-512 | Level 1 | 512 × 14 bit | Trapdoor tree structure | 3 polynomials | 8 polynomials (FFT) | ∼25 kB
FALCON-1024 | Level 5 | 1024 × 14 bit | Trapdoor tree structure | 3 polynomials | 8 polynomials (FFT) | ∼50 kB
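The “Polynomial Storage” column of Table 3 follows directly from the polynomial count, degree, and coefficient width of each parameter set. The small helper below reproduces that raw coefficient storage; it is illustrative only and deliberately ignores matrix A expansion, working buffers, and word alignment, which dominate the “Total RAM” column.

```python
# Raw coefficient storage from Table 3:
# (number of polynomials) x (coefficients per polynomial) x (bits per coefficient).
# ILLUSTRATIVE: ignores matrix A, working buffers, and word alignment.
def poly_storage_bytes(num_polys, n, coeff_bits):
    return num_polys * n * coeff_bits // 8

schemes = {
    "ML-KEM-512":  poly_storage_bytes(2, 256, 12),   # 768 bytes
    "ML-KEM-768":  poly_storage_bytes(3, 256, 12),   # 1152 bytes
    "ML-DSA-87":   poly_storage_bytes(7, 256, 23),   # 5152 bytes
    "FALCON-1024": poly_storage_bytes(1, 1024, 14),  # 1792 bytes
}
```

In hardware, these byte counts are typically rounded up to BRAM or SRAM macro granularity, which is one reason the total RAM figures exceed the raw storage by a wide margin.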
Table 4. Algorithm specialization overview.
Dimension | Algorithm-Specific | Unified Multi-Algorithm | Configurable Parameter-Agile
Algorithms | Single scheme (ML-KEM/ML-DSA/FALCON) | Multiple schemes (typically ML-KEM + ML-DSA) | Multiple schemes + security levels
Key Characteristics | Fixed parameters at design time; relatively small modulus (q = 3329) handled by 16-bit arithmetic units | Shared computational blocks (NTT engines, modular arithmetic units, memory subsystems); runtime configuration; control logic | Additional control complexity and configuration logic; adversely affects maximum achievable clock frequency; requires careful architectural trade-offs
Datapath Optimization | Optimized datapath widths | Uniform 32-bit datapath width | Configurable or worst-case provisioned
Memory Organization | Memory access patterns customized to the k × k module matrix structure | Shared memory subsystems | Flexible allocation
Flexibility | None (parameter changes require redesign and re-verification) | Moderate (differences in algorithm parameters managed through runtime configuration and control logic) | High (supports multiple security levels and varying polynomial degrees through synthesis-time and runtime reconfiguration)
Performance | High (maximized performance) | Reduced (30–40% slower than algorithm-specific accelerators due to unified design) | Variable
Area Efficiency | High (maximized area efficiency) | Moderate | Low (additional multiplexing overhead adversely affects area efficiency)
Redesign Effort | Substantial for any change | Moderate for new algorithms | Minimal, reconfigurable
Verification Effort | Single configuration testing | Multiple algorithm mode testing | Comprehensive parameter coverage
Table 5. Integration approach overview.
Dimension | Standalone Accelerators | Tightly Coupled Coprocessors | ISA Extensions
Architecture Type | Self-contained core with standard on-chip interconnections | Integrated into the processor microarchitecture | Custom instruction extensions enhancing an existing processor (RISC-V or ARM)
Key Characteristics | Loosely coupled (architectural separation cleanly decouples cryptographic functionality from the host system) | Reuse across heterogeneous platforms; simplified system integration | Balance between software flexibility and hardware acceleration (software retains control over protocol-level logic; hardware accelerates performance-critical kernels)
Datapath | Explicit transfers between host and accelerator | Shared registers/cache | Register operands
Memory Access | Standard on-chip interconnects (AXI or Avalon) | Memory hierarchy shared with the core | Processor memory system
Primary Limitation | Non-negligible communication overhead (polynomial coefficients and intermediate results transferred explicitly between host and accelerator) | Increased microarchitectural complexity (modifications to pipeline stages, register interface, and memory consistency mechanisms complicate design, verification, and validation) | Acceleration confined to targeted computational bottlenecks (NTT butterfly operations, modular arithmetic, sampling primitives)
Table 6. Optimization objective overview.
Table 6. Optimization objective overview.
| | Throughput-Oriented | Area-Constrained | Energy-Efficient | Side-Channel-Resistant |
|---|---|---|---|---|
| Architecture Characteristics | Aggressive pipelining; multiple parallel units; high clock frequencies | Sequential processing with shared, minimal resources | Clock gating, operand isolation, voltage–frequency scaling | First-/higher-order masking, shuffling, redundancy, constant-time execution |
| Performance | Maximizes operations per second; 400–500 MHz (FPGA) | Roughly 5–10× the latency of high-throughput architectures | Minimal logic; roughly 5–10× the latency of high-throughput architectures | Reduced performance (2–3× latency, 1.5–4× area increase) |
| Target Application | Server-class applications, TLS endpoints (millions of concurrent connections) | IoT devices, smartcards | Battery-powered systems | Security-critical applications |
| Primary Goal | Maximize operations per second | Minimize logic and memory footprint | Minimize energy per operation | Mitigate physical attacks |
| Primary Techniques | Aggressive pipelining, parallel BFUs, high clock frequencies | Sequential processing, resource sharing, on-the-fly computation | Clock gating, operand isolation, voltage–frequency scaling | Masking, shuffling, redundancy |
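Of the side-channel countermeasures listed, constant-time execution is the simplest to illustrate in software. The helper below is a generic constant-time select, not code from any surveyed accelerator: it replaces a secret-dependent branch with bitwise arithmetic so the executed instruction sequence is identical for both outcomes.

```c
#include <stdint.h>

/* Constant-time select: returns a when cond == 1 and b when cond == 0,
 * with no secret-dependent branch. Negating cond yields an all-ones or
 * all-zeros mask, so timing does not depend on the secret condition. */
static uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
    uint32_t mask = (uint32_t)0 - cond; /* 0xFFFFFFFF or 0x00000000 */
    return (a & mask) | (b & ~mask);
}
```

The same masking idea underlies hardware operand isolation: both operands are always "touched", and selection happens purely in the data path.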
Table 7. Parallelism and pipelining.
| | Butterfly-Level Parallelism | Polynomial-Level Parallelism | Operation-Level Pipelining |
|---|---|---|---|
| Architecture Details | Multiple coefficient pairs processed in parallel within each NTT stage | Parallelism across polynomials in matrix–vector multiplications | Deep pipelining across successive operations |
| Granularity | Fine (coefficient pairs) | Medium (polynomials) | Coarse (operations) |
| Speedup Mechanism | Parallel BFU processing (multiple coefficient pairs per cycle) | Concurrent polynomial operations (parallel matrix–vector multiplications) | Overlapped execution (continuous operation stream) |
| Primary Benefit | Linear speedup, predictable scaling, well understood | Natural fit for matrix–vector operations; independent computations | Maximum throughput, continuous processing, efficient resource utilization |
| Primary Limits | Memory bandwidth limits, routing congestion (FPGA), area increase | Proportional resource increase, memory bandwidth demands | Pipeline fill latency, complex hazard management, high area |
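The independence that butterfly-level parallelism exploits is visible in the butterfly itself. The sketch below shows one Cooley–Tukey stage over Kyber's modulus q = 3329, simplified to plain modular arithmetic (a real BFU would use Montgomery or Barrett reduction) and to a single twiddle factor per stage (a real NTT indexes a fresh zeta per block). Each pair (a[j], a[j+len]) touches disjoint data, so P butterfly units can process P pairs per cycle until memory bandwidth saturates.

```c
#include <stdint.h>
#include <stddef.h>

#define Q 3329 /* Kyber modulus */

/* One Cooley-Tukey butterfly on the pair (*lo, *hi) with twiddle zeta. */
static void ct_butterfly(int32_t *lo, int32_t *hi, int32_t zeta) {
    int32_t t = (int32_t)(((int64_t)zeta * *hi) % Q);
    *hi = ((*lo - t) % Q + Q) % Q; /* lo - zeta*hi  (mod Q) */
    *lo = (*lo + t) % Q;           /* lo + zeta*hi  (mod Q) */
}

/* One NTT stage with stride len: the loop body has no cross-iteration
 * dependencies, so the pairs map directly onto parallel BFUs. */
static void ntt_stage(int32_t *a, size_t n, size_t len, int32_t zeta) {
    for (size_t start = 0; start < n; start += 2 * len)
        for (size_t j = start; j < start + len; j++)
            ct_butterfly(&a[j], &a[j + len], zeta);
}
```

Unrolling the inner loop by a factor of P and instantiating P butterfly units is exactly the "multiple coefficient pairs per cycle" mechanism in the table.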
Table 8. Memory hierarchy optimization.
| | Coefficient Banking | On-Chip SRAM vs. BRAM | Hybrid On-the-Fly and Caching |
|---|---|---|---|
| Primary Goal | Enable parallel memory access | Efficient memory utilization | Balance storage and computation |
| Primary Benefit | Conflict-free parallel access | Platform-appropriate memory utilization | Optimized memory–computation trade-off |
| Primary Trade-off | Routing complexity vs. address generation logic | Resource limits (BRAM quantity on FPGAs) vs. flexibility (design effort on ASICs) | Throughput vs. area |
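One common way to obtain conflict-free coefficient banking for a radix-2 NTT is bit-parity bank assignment; the sketch below illustrates that general idea and is not the scheme of any specific surveyed design. The two operands of a butterfly pair (j, j + 2^k) differ in exactly one address bit, so assigning each coefficient to a bank by the XOR of its index bits always places the pair in different banks.

```c
/* Bit-parity banking for two memory banks: bank 0 holds indices with an
 * even number of set bits, bank 1 the rest. A butterfly pair (j, j + 2^k)
 * differs in exactly one bit, so its two operands never collide and can
 * be fetched in the same cycle from dual single-port banks. */
static unsigned bank_of(unsigned i) {
    unsigned parity = 0;
    while (i) {
        parity ^= i & 1u;
        i >>= 1;
    }
    return parity;
}
```

The address-generation logic that computes this mapping is the cost side of the "routing complexity vs. address generation logic" trade-off in the table.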