Article

Accelerating Post-Quantum Cryptography: A High-Efficiency NTT for ML-KEM on RISC-V

1
Department of Computer and Network Engineering, University of Electro-Communications (UEC), Tokyo 182-8585, Japan
2
Faculty of Radio-Electronics Engineering, Le Quy Don Technical University (LQDTU), Hanoi 10000, Vietnam
3
Faculty of Electronics and Telecommunications, The University of Science, Vietnam National University Ho Chi Minh City, Ho Chi Minh City 700000, Vietnam
*
Authors to whom correspondence should be addressed.
Electronics 2026, 15(1), 100; https://doi.org/10.3390/electronics15010100
Submission received: 12 November 2025 / Revised: 17 December 2025 / Accepted: 23 December 2025 / Published: 24 December 2025
(This article belongs to the Special Issue Recent Advances in Quantum Information)

Abstract

Post-quantum cryptography (PQC) is rapidly being standardized, with key primitives such as Key Encapsulation Mechanisms (KEMs) and Digital Signature Algorithms (DSAs) moving into practical applications. While initial research focused on pure software and hardware implementations, attention is shifting toward flexible, high-efficiency solutions suitable for widespread deployment. A system-on-chip (SoC) is a viable option because it can flexibly coordinate hardware and software; its main drawback, however, is the latency of exchanging data during computation. Moreover, most SoCs to date are implemented on FPGAs, and few have been realized as ASICs. This paper introduces a complete RISC-V SoC design in an ASIC for the Module-Lattice-Based KEM (ML-KEM). Our system features a RISC-V processor tightly integrated with a high-efficiency Number Theoretic Transform (NTT) accelerator, which leverages custom instructions to accelerate cryptographic operations. Our research achieved the following results: (1) the accelerator provides a speedup of up to 14.51× for NTT and 16.75× for inverse NTT operations compared to other RISC-V platforms; (2) this leads to end-to-end performance improvements for ML-KEM of up to 56.5% for security level I, 50.9% for level III, and 45.4% for level V; (3) the ASIC is fabricated in a 180 nm CMOS process with a maximum operating frequency of 118 MHz and an area overhead of 8.7%. The chip achieves a minimum power consumption of 5.913 μW at 10 kHz with a 0.9 V supply voltage.

1. Introduction

As the volume of data shared across devices like computers, smartphones, and IoT gadgets explodes, so does the risk of data leakage. Communication channels are vulnerable to eavesdropping, making robust security crucial. Traditional data protection often relies on symmetric encryption in which the sender and receiver share the same secret key. However, this method has a critical weakness: if that single key is compromised, the entire communication is at risk. To solve this, the Key Encapsulation Mechanism (KEM) was developed to securely exchange these shared keys.
The rise of quantum computing presents a significant threat to these traditional KEMs. Quantum computers could break them in polynomial time. In response, the National Institute of Standards and Technology (NIST) initiated a program to standardize new cryptographic algorithms that are secure against quantum attacks. This field is known as post-quantum cryptography (PQC).
For KEMs, the official PQC standards include CRYSTALS-Kyber (FIPS 203) [1] and Hamming Quasi-Cyclic (HQC); for digital signatures, they comprise CRYSTALS-Dilithium (FIPS 204) [2], FALCON [3], and SPHINCS+ (FIPS 205) [4]. FIPS 203 is the first standardized KEM, making it a primary focus for research and optimization. ML-KEM's security is based on the hardness of the Shortest Vector Problem in high-dimensional lattices. A major computational challenge for ML-KEM is its reliance on polynomial multiplication, which dominates the algorithm's runtime.
Several studies have focused on accelerating polynomial multiplication for ML-KEM across various hardware platforms [5,6,7,8]. Some have explored pure hardware solutions, while others have targeted System-on-Chip (SoC) platforms [9,10,11,12], a growing trend for PQC implementation [13]. In an SoC, hardware accelerators can be loosely connected to system buses or tightly integrated into the CPU. Loosely coupled accelerators are simple and easy to implement, but the latency of loading and storing data to and from memory limits the achievable acceleration; tightly coupled designs offer superior acceleration efficiency. Most SoCs are implemented on FPGA platforms, and there are currently very few practical on-chip implementations with which to evaluate system-level performance, creating a significant research gap.
We designed and fabricated an SoC that directly addresses this gap. Our design features an accelerator for forward/inverse Number Theoretic Transform (NTT/INTT), central to ML-KEM’s polynomial multiplication. The accelerator employs a dual butterfly unit (BU) architecture which is tightly integrated into a 64-bit RISC-V CPU via the Rocket Custom Coprocessor (RoCC) interface. To our knowledge, this is the first physical chip implementation of an SoC with RoCC custom instructions for ML-KEM.
Our results demonstrate significant performance gains:
  • Performance: The accelerator provides a speedup of up to 14.51× for NTT and 16.75× for inverse NTT operations compared to other RISC-V platforms.
  • Efficiency: We achieved speedup efficiencies of 56.5%, 50.9%, and 45.4% for security levels I, III, and V, respectively. This performance comes with a minimal area overhead of 8.7%.
  • ASIC Fabrication: The complete chip was fabricated using 180 nm CMOS technology, with a total area of 297 k gate equivalents (GE), consuming a minimum of 5.913 μW at an operation frequency of 10 kHz and a VDD of 0.9 V. The SoC achieved a maximum frequency of 118 MHz at a supply voltage of 2.0 V.
The rest of this paper is organized as follows: Section 2 provides the background and related work. Section 3 details the proposed system, focusing on the NTT accelerator and SoC architecture. Section 4 presents the implementation results and evaluations. Finally, Section 5 concludes the paper.

2. Background

ML-KEM is a module lattice-based KEM built upon public-key encryption principles. The key establishment process is illustrated in Figure 1. The sender generates a pair of encryption (ek) and decryption (dk) keys, then sends ek to the receiver over an unsecured channel. The receiver uses ek to generate a shared secret key K and encapsulates it by encrypting a message m to produce a ciphertext c. This ciphertext is sent back to the sender over a public channel. Upon receiving c, the sender uses dk to recover the message m, which is then used with ek to derive the shared secret key K. This shared key is subsequently used as the secret key for data transmission with symmetric encryption.
The functions of ML-KEM are based on the multiplication of polynomials in the ring Z_q[X]/(X^n + 1), with polynomials of the form f(X) = f_0 + f_1·X + ⋯ + f_255·X^255 and integer coefficients f_i in Z_q for all i, where q = 3329 and n = 256. The operations in the ring include addition, subtraction, and multiplication performed modulo X^n + 1. The NTT is employed to accelerate polynomial multiplication. Given polynomials f and g, their representations in the NTT domain are denoted f̂ = NTT(f) and ĝ = NTT(g). For ML-KEM, let ĥ be the product of the two polynomials in the NTT domain; then ĥ is calculated by the pairwise multiplication described in Equation (1).
ĥ_{2i} = f̂_{2i}·ĝ_{2i} + f̂_{2i+1}·ĝ_{2i+1}·ζ^{2·br(i)+1}
ĥ_{2i+1} = f̂_{2i}·ĝ_{2i+1} + f̂_{2i+1}·ĝ_{2i}    (1)
where the indices 2i and 2i+1 denote, respectively, the even and odd coefficients of the polynomials in the multiplication, and br(i) is the integer obtained by bit-reversing the 7-bit unsigned value i ∈ {0, …, 127}. Finally, the product of the two polynomials in the normal domain is obtained as h = INTT(ĥ).
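As a concrete illustration of Equation (1), the basecase multiplication of one even/odd coefficient pair can be sketched in C. This is an illustrative sketch, not the fabricated accelerator's datapath: it assumes ML-KEM's parameters q = 3329 and ζ = 17 (the primitive 256th root of unity modulo q specified in FIPS 203), inputs already reduced mod q, and the helper names modpow and br7 are ours.

```c
#include <stdint.h>

#define Q 3329u
#define ZETA 17u /* primitive 256th root of unity mod Q (FIPS 203) */

/* Square-and-multiply modular exponentiation: base^e mod Q. */
static uint32_t modpow(uint32_t base, uint32_t e) {
    uint32_t r = 1;
    for (base %= Q; e; e >>= 1) {
        if (e & 1) r = (r * base) % Q;
        base = (base * base) % Q;
    }
    return r;
}

/* br(i): bit-reversal of the 7-bit value i in {0, ..., 127}. */
static uint32_t br7(uint32_t i) {
    uint32_t r = 0;
    for (int b = 0; b < 7; b++)
        r |= ((i >> b) & 1u) << (6 - b);
    return r;
}

/* Equation (1): product of one even/odd pair in the NTT domain.
   f, g, h hold the coefficients with indices 2i (element 0) and
   2i+1 (element 1); all coefficients are assumed reduced mod Q. */
static void basemul(uint32_t h[2], const uint32_t f[2],
                    const uint32_t g[2], uint32_t i) {
    uint32_t zeta = modpow(ZETA, 2 * br7(i) + 1);
    h[0] = (f[0] * g[0] % Q + (f[1] * g[1] % Q) * zeta % Q) % Q;
    h[1] = (f[0] * g[1] + f[1] * g[0]) % Q;
}
```

All intermediate products stay below 2^32, so plain uint32_t arithmetic suffices here; a hardware BU would instead reduce each product with the Barrett unit described in Section 3.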
However, polynomial multiplication remains a computational bottleneck in the execution of ML-KEM, and the overall efficiency depends on how effectively the NTT is implemented. This challenge becomes more significant at higher security levels, where the number of polynomials involved in matrix–vector multiplications increases. The two most commonly adopted BU structures are Cooley–Tukey (CT) [14] and Gentleman–Sande (GS) [15], which differ in the ordering of transformation outputs.
The CT and GS structures are shown in Figure 2a,b, respectively. The two architectures differ fundamentally in where they place the coefficient multiplication with the twiddle factors. Figure 2c,d show an example of the NTT and INTT for a polynomial of degree 8; the arrowheads indicate the direction of the data during the transformation. The output order of CT matches the input order of GS, so using CT for the NTT and GS for the INTT together allows the pre-processing step before the transformation to be skipped. In our approach, the coefficients are halved at each INTT stage, so the final result does not need to be divided by 2^s (with s the number of stages) afterward, as is usually done. This allows the NTT and INTT to execute in constant time.
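The complementary behavior of the two butterflies, together with the per-stage halving just described, can be sketched as follows. This is a behavioral sketch under our assumptions (q = 3329, unsigned 32-bit modular arithmetic; the function names are ours): a CT butterfly followed by the halving GS butterfly with the inverse twiddle returns the original coefficient pair.

```c
#include <stdint.h>

#define Q 3329u

/* base^e mod Q by square-and-multiply. */
static uint32_t modpow(uint32_t base, uint32_t e) {
    uint32_t r = 1;
    for (base %= Q; e; e >>= 1) {
        if (e & 1) r = (r * base) % Q;
        base = (base * base) % Q;
    }
    return r;
}

/* Cooley-Tukey butterfly: twiddle multiplication before add/subtract.
   (a, b) -> (a + zeta*b, a - zeta*b) mod Q. */
static void ct_butterfly(uint32_t *a, uint32_t *b, uint32_t zeta) {
    uint32_t t = (*b * zeta) % Q;
    uint32_t u = (*a + t) % Q;
    *b = (*a + Q - t) % Q;
    *a = u;
}

/* Gentleman-Sande butterfly with the per-stage halving of our approach:
   add/subtract first, twiddle after, then scale both outputs by 2^{-1}. */
static void gs_butterfly_half(uint32_t *a, uint32_t *b, uint32_t zeta_inv) {
    const uint32_t INV2 = (Q + 1) / 2; /* 1665 = 2^{-1} mod Q */
    uint32_t u = (*a + *b) % Q;
    uint32_t v = ((*a + Q - *b) % Q) * zeta_inv % Q;
    *a = u * INV2 % Q;
    *b = v * INV2 % Q;
}
```

Because each GS stage already divides by 2, applying seven such stages absorbs the usual final division by 2^7, which is what makes the constant-time pairing of CT (forward) and GS (inverse) possible.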
Twiddle factors are precomputed and stored in shared ROMs and dedicated ROMs for each BU to accommodate the dual-BU architecture. The twiddle factors are read by the stage index in the Control Unit. For INTT, twiddle factors are recomputed from NTT twiddle factors according to the corresponding stage index.
Our target is an embedded system that balances execution flexibility and computational speed. RISC-V offers an ideal solution in such contexts by combining the performance benefits of hardware acceleration with the adaptability of software control. RISC-V is an open instruction set architecture designed with modularity, allowing developers to tailor processor architectures to specific application requirements.
This modular approach makes RISC-V highly versatile and suitable for various applications, from lightweight IoT devices to high-performance computing systems. As an open and modular instruction set architecture, RISC-V enables us to customize processor designs to meet specific application requirements. Integrating a hardware accelerator directly with the CPU can substantially enhance system performance while preserving software flexibility. This integration method helps solve the large latency problem caused by data exchange on SoCs. Custom instructions facilitate efficient control of the accelerator, enabling tight hardware-software integration and improved overall system efficiency.

3. Proposed Architecture

3.1. Rocket RISC-V System-on-Chip

The proposed SoC architecture is illustrated in Figure 3. The system is built on a Rocket Core configured with the RV64IMAC variant, featuring a 1 kB data cache, 4 kB of on-chip RAM, and an external memory interface supporting up to 256 MB. The NTT accelerator is tightly coupled to the CPU via the RoCC interface [16]. This accelerator functions as a co-processor, receiving commands from the CPU and accessing memory directly through the cache. Custom instructions are defined to transmit control information and addresses to the accelerator. The accelerator loads the required data from the provided input address, performs the NTT or INTT transformation based on the control information, and finally stores the output result directly in memory. The basic peripherals are connected to the system via the peripheral bus: a UART for printing results, an SPI for loading firmware from an SD card, and a JTAG for system debugging.
Figure 4 illustrates the interface between the CPU and the accelerator. The CMD and RESP lines correspond to the command and response of the CPU with the accelerator. The busy and interrupt signals are used for communication control purposes. Data transfer is carried out via MEM_REQ and MEM_RESP. This tightly coupled interaction enables efficient, low-latency control and data transfer between the CPU and the accelerator.
The custom instruction architecture adheres to the standard RISC-V instruction format. It includes the fields opcode, rd, rs1, rs2, and funct. The opcode specifies the custom group, from 0 to 3. The rd field is the destination register, while rs1 and rs2 are the source registers. The accelerator uses the funct field to differentiate between instructions within the same group. We define two custom instructions under the custom_0 group (opcode = 0001011) specifically for the NTT and INTT operations in ML-KEM. The structure of these custom instructions is illustrated in Figure 5.
When a RoCC instruction is fetched and decoded, the presence of the custom_0 routes it to the accelerator via the RoCC command router. The accelerator interprets the remaining instruction fields to determine the appropriate control signals and begins processing accordingly. The first custom instruction, identified by funct = 0, loads the source and destination addresses into the accelerator. The second custom instruction with funct = 1 provides the data length and specifies the operation mode, where mode = 0 indicates an NTT operation and mode = 1 indicates an INTT operation. The funct = 1 also signals that all necessary parameters have been received and the accelerator can begin execution. The result is automatically stored to the specified address upon completion. The CPU can then read the result directly from the declared memory region.
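To make the encoding concrete, the 32-bit instruction word can be assembled in software. This encoder is an illustrative sketch based on the standard RoCC R-type layout (funct7 | rs2 | rs1 | xd | xs1 | xs2 | rd | opcode); the function name and argument order are ours and are not part of the fabricated design.

```c
#include <stdint.h>

/* Assemble a 32-bit RoCC custom_0 instruction word.
   R-type layout: funct7[31:25] rs2[24:20] rs1[19:15]
                  xd[14] xs1[13] xs2[12] rd[11:7] opcode[6:0]. */
static uint32_t rocc_encode(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                            uint32_t xd, uint32_t xs1, uint32_t xs2,
                            uint32_t rd) {
    const uint32_t CUSTOM_0 = 0x0Bu; /* opcode 0001011 */
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (xd << 14) | (xs1 << 13) | (xs2 << 12) |
           (rd << 7) | CUSTOM_0;
}
```

In firmware, such words would typically be emitted with the assembler's `.insn` directive or as raw `.word` constants when the toolchain lacks mnemonics for the custom instructions.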
Figure 6 illustrates the operation of the accelerator after receiving instructions from the CPU with funct = 1. The accelerator sends data request instructions to the source address in memory via MEM_REQ. The system then returns the requested data. This data is temporarily stored in the FIFO. After sufficient data is in the FIFO, it is sent to the BUs and forward- or inverse-transformed as requested. This process is pipelined until the final result is written back to the FIFO. At that time, the accelerator sends a write request to the destination address via MEM_REQ, along with the result retrieved from the FIFO. After the last data in the FIFO is successfully stored, the accelerator reports that the execution is complete and returns to the IDLE state.

3.2. NTT Accelerator Architecture

The proposed NTT accelerator architecture comprises a computational element consisting of two parallel execution BUs, as shown in Figure 7. Modular reduction is performed using the Barrett method [17]. To save area, the accelerator implements the NTT using an iterative procedure. The input polynomial is stored in a simple FIFO of size 64 × 32. Thanks to the feedback shift register-based reordering unit (RU) architecture, the reordering process does not incur additional delays at each stage as in typical multi-path delay commutator (MDC) architectures.
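The Barrett method referenced above can be sketched as follows. This is the generic textbook formulation for q = 3329 with m = ⌊2²⁴/q⌋ = 5039, not the exact datapath of the fabricated BU: the quotient a/q is approximated by a shift-and-multiply, leaving at most a couple of conditional subtractions.

```c
#include <stdint.h>

#define Q 3329u

/* Barrett reduction of a < Q*Q, with k = 24 and m = floor(2^24 / Q) = 5039.
   The product a*m needs 64-bit intermediate precision. */
static uint32_t barrett(uint32_t a) {
    const uint32_t M = 5039u;
    uint32_t t = (uint32_t)(((uint64_t)a * M) >> 24); /* approximates a / Q */
    uint32_t r = a - t * Q;
    while (r >= Q) r -= Q; /* a few conditional subtractions at most */
    return r;
}
```

A hardware implementation replaces the division-free quotient estimate with a constant multiplier and a shifter, which is why Barrett reduction maps well onto the BU's fixed modulus.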
The accelerator is built on a 64-bit system, so 16 bits are used for each polynomial coefficient, processing four coefficients on a single data call. At the start of the operation, each group of 4 coefficients of the polynomial is automatically loaded and temporarily stored in the FIFO. These coefficients are stored sequentially in each FIFO compartment until they are full. Since the NTT implementation for ML-KEM is performed for each group of even and odd coefficients, a minimum FIFO size of 64 × 32 bits is required to store all 128 coefficients. Since 4 input coefficients can be provided at a time, 2 BUs can be used at the same time to process the transformation. After the transformation, the BU output must be passed through the RU to correct the order for the next BU operation.
Here, we use an RU based on a shift register combined with a feedback register, which allows the pipeline to execute continuously without additional delay. It also keeps the FIFO at its minimum size, whereas some designs use a ping-pong architecture to store intermediate results, which requires a larger buffer. A further advantage is that the delay is constant. The results of each stage are always stored in the FIFO before being read out in the next stage, without causing memory conflicts between coefficients. Each even- or odd-coefficient sequence requires 128/4 = 32 clocks per stage, and seven stages are needed to complete the transformation of 128 coefficients, so transforming a polynomial requires 32 × 7 × 2 = 448 CCs.
Figure 8 shows an example of a feedback RU for n = 16. It works like a normal RU in typical MDC architectures, but with an additional feedback path to the input registers. The purpose of this feedback is to use the input registers to delay the coefficients until the corresponding coefficient appears in the order of the next stage. The outputs are then concatenated into a bit string and buffered in the FIFO. Therefore, the FIFO write pointer always trails the read pointer by exactly 16 clock cycles. This helps the FIFO avoid read-after-write (RAW) conflicts while maintaining a minimum size, whereas several previous designs required more memory than that needed to store the polynomials [18,19,20].
A finite state machine manages control and synchronization between the accelerator and the RoCC interface, allowing the accelerator to operate independently of the CPU. This setup enables efficient NTT/INTT execution via custom instructions, speeding up the overall algorithm.

4. Implementation Results and Discussion

The proposed SoC was implemented using 180 nm CMOS technology. Figure 9 shows the chip micrograph, with a die area of 1950 μm × 1950 μm, corresponding to approximately 297 kGE. The chip reaches a maximum operating frequency of 118 MHz, and a minimum power consumption of 5.913 μW at 10 kHz and VDD = 0.9 V.
Table 1 shows the area utilization on a Xilinx Artix-7 FPGA (xc7a100tcsg324-1) using Xilinx Vivado 2022.2. The values in bold represent the total area of the SoC and the area required for the NTT accelerator, respectively. When the NTT accelerator is added, the system incurs around 8.7% area overhead; however, the gain is significant, with a great improvement in execution time. The results demonstrate that tightly coupled acceleration is substantially more efficient than loosely coupled approaches. In typical RISC-V platforms, the main bottleneck in NTT/INTT acceleration lies in memory access latency. Our design overcomes this limitation by interfacing directly with the data cache, significantly reducing memory access time.
The efficiency of the accelerator is attributed to its compact architecture, which outperforms previous designs in both area and speed. The streamlined architecture and control logic enable seamless SoC integration with two custom instructions. This simplicity drives the speedup efficiency in NTT/INTT execution across various RISC-V platforms, as shown in Table 2. The upward arrows indicate how much higher the latency is compared to our design. A comparison with 32-bit systems is feasible because the execution speed of the NTT/INTT is dominated by the accelerator architecture rather than the processor bit-width. In the reference C program, polynomial coefficients are represented using 16-bit integers, while other variables use 32-bit integers; consequently, the program exhibits comparable execution performance on 64-bit and 32-bit systems. As a result, the proposed architecture achieves notable speedups of up to 14.51× for NTT and 16.75× for INTT compared to other accelerators on different RISC-V platforms. With a maximum operating frequency of 118 MHz, the accelerator achieves a throughput of approximately 77.94 kNTT/s, or 239.4 Mbps. This throughput corresponds to the 180 nm technology node used in our implementation.
The bar graph in Figure 10 illustrates the speedup efficiency achieved over the baseline C implementation. By offloading the NTT/INTT computations to the hardware accelerator, overall execution time is significantly reduced. This improvement in NTT/INTT overhead contributes to an overall system speedup from 37% to 56.5%.
In this experiment, we first execute the reference software (C baseline) provided by the CRYSTALS project [29]. The software is compiled using the RISC-V toolchain to generate an executable binary, which is then loaded onto the CPU to determine the number of cycles required to complete the KeyGen, Encaps, and Decaps procedures across different security levels I, III, and V, corresponding to k = 2 , 3 , 4 . The cycle count is obtained by reading the cycle read-only register at the beginning and end of each operation, thereby yielding the total number of execution cycles. Each measurement is repeated 1000 times, and the average value is reported.
We then repeat the same procedure using our implementation of NTT operations accelerated through custom instructions. Finally, we assess the performance improvement enabled by these instructions and summarize the results in the chart. Each bar represents the average number of clock cycles consumed by the CPU to complete the corresponding process.
Figure 11 presents the speedup efficiency of the complete ML-KEM algorithm implementation. Given that different RISC-V platforms vary in architecture and baseline performance, the comparison is made in terms of relative speedup efficiency, i.e., the performance improvement of each implementation relative to its own C reference. This provides a fair basis for evaluating the impact of the accelerator across heterogeneous CPU types. The method used to compute acceleration efficiency is given in Equation (2).
Eff[%] = (Ref[CCs] − Acc[CCs]) / Ref[CCs] × 100    (2)
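Equation (2) is straightforward to apply; for instance, with hypothetical cycle counts of 1000 (reference) and 435 (accelerated), the efficiency is 56.5%. A minimal helper, with a name of our choosing:

```c
/* Speedup efficiency of Equation (2), in percent. */
static double speedup_eff(double ref_ccs, double acc_ccs) {
    return (ref_ccs - acc_ccs) / ref_ccs * 100.0;
}
```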
Our proposed design outperforms those in [23,25,26], which achieve efficiency gains between 7% and 35%. While these prior works enhance performance through custom arithmetic instructions, their overall acceleration efficiency remains limited. This limitation stems from the simplicity of the instructions, which must be invoked many times to complete the full algorithm. As a result, the reduction in total execution time is modest compared to our tightly coupled and more integrated acceleration approach.
Several designs implement various acceleration techniques beyond NTT, such as hash function acceleration and sampling, achieving efficiency improvements ranging from 42% to 59%. While our focus is solely on accelerating NTT/INTT, our design still achieves a significant efficiency gain of 37% to 56.5%, comparable to more complex architectures like those in [10,22,28,30].
Figure 12 illustrates the power consumption of the SoC at various frequencies and supply voltages. Figure 12a shows the power consumption of the chip at VDD = 2.0 V over the operating frequency range from 10 kHz to 118 MHz. The power is measured in three states: active power, when the SoC executes the algorithm; idle power, when the chip is in reset; and sleep power, which represents leakage. The system consumes 307.9 mW at 118 MHz, while the leakage power is minimal across the entire frequency range, with a maximum of approximately 1.65 μW. The energy per NTT operation can be computed as P[W]/f[Hz] × NTT[CCs] ≈ 3.95 μJ per NTT. Figure 12b shows the power consumption of the system at 10 kHz over a supply range from 0.9 V to 2.0 V. The results show a minimum power consumption of 5.913 μW, making the SoC suitable for low-power systems.
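These figures are mutually consistent: 307.9 mW at 118 MHz gives about 2.609 nJ per cycle, and a per-NTT cycle count of roughly 1514 CCs (our inference from the reported 77.94 kNTT/s throughput, not a figure stated in the text) reproduces the ≈3.95 μJ value:

```c
/* Energy per clock cycle: P / f. */
static double energy_per_cycle(double p_watts, double f_hz) {
    return p_watts / f_hz;
}

/* Energy per operation: (P / f) x cycle count of the operation. */
static double energy_per_op(double p_watts, double f_hz, double ccs) {
    return p_watts / f_hz * ccs;
}
```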
Table 3 summarizes the scaling of power, energy, and area from the 180 nm process down to the 65 nm and 32 nm nodes using the scaling equations in [31]. Few works are implemented on actual ASICs; most are evaluated on FPGAs or rely solely on synthesis tools. The chip achieves an energy consumption of 2.609 nJ per cycle.
The scaled results are therefore intended to reflect the expected performance trend rather than absolute post-layout metrics. Although the reference designs are implemented on FPGAs in 65 nm, 55 nm, and 28 nm nodes, the comparison remains meaningful at the architectural level, as the designs target the same functionality and workload. Nevertheless, differences arising from cell libraries, interconnect characteristics, and process-specific optimizations may still affect the absolute values. Hence, the comparison should be interpreted as an indicative evaluation of relative efficiency rather than a definitive silicon-level equivalence.
After technology-equivalent scaling, this energy level is lower than that of the SoC reported in [22]. Furthermore, the equivalent operating frequency significantly outperforms the other designs. Some works do not report maximum frequency or area, and only provide cycle counts (CCs). It is important to note that these results are obtained from direct measurements on a physical chip, whereas the other compared works rely on simulation or FPGA-based evaluation.
Table 4 presents the energy per process and the SoC throughput when executing ML-KEM at different security levels. After converting to an equivalent technology node, the energy is reduced to approximately one-third of that reported in [22].
Throughput[Op/s] = f[Hz] / Operation[CCs]    (3)
Throughput is defined as the number of operations completed per second (Op/s), computed using Equation (3). It can be observed that the throughput of our system is significantly higher, 2 to 30 times, than that of the SoCs presented in [22,28]. Notably, when normalized to the same 65 nm process, the throughput increases due to the higher equivalent operating frequency.
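Applying Equation (3) with the measured 118 MHz clock and a per-NTT cost of about 1514 CCs (our inference from the reported 77.94 kNTT/s, not a stated figure) recovers the reported throughput:

```c
/* Throughput of Equation (3): operations completed per second. */
static double throughput_ops(double f_hz, double op_ccs) {
    return f_hz / op_ccs;
}
```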
These results demonstrate that accelerating the SoC through a coprocessing approach is highly effective. This methodology can be readily extended to other lattice-based algorithms such as Dilithium and FALCON. It is also applicable to algorithms that demand high computational speed and large workloads, including convolutional neural networks (CNNs) and various signal-processing tasks. The wide operating frequency and voltage range make this SoC applicable to a wide range of applications, which opens up a promising future for this efficient design approach as we continue to optimize other system processes.

5. Conclusions

Our research presents a RISC-V SoC featuring an integrated NTT accelerator controlled via custom instructions. Fabricated in 180 nm CMOS technology, the SoC consumes a minimum of 5.913 μW at a 0.9 V supply voltage and a 10 kHz clock frequency, and achieves a maximum operating frequency of 118 MHz at VDD = 2.0 V.
Our SoC architecture outperforms the baseline C implementation by 56.5% at security level I, 50.9% at level III, and 45.4% at level V. This efficiency level is superior to other similar RISC-V platforms. It demonstrates a key insight: a tightly integrated coprocessor is a highly effective way to accelerate complex computational algorithms dramatically.

Author Contributions

Conceptualization, C.-K.P.; methodology, D.-T.D.; investigation, K.-D.N.; original draft preparation, D.-T.D.; review and editing, D.-H.L. and C.-K.P.; visualization, K.-D.N.; supervision, C.-K.P. and D.-T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by JST NEXUS, Japan, under grant number JPMJNX25D4.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The VLSI chip in this paper was fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC) at the University of Tokyo, with the collaboration of Rohm Corporation and Toppan Printing Corporation.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. National Institute of Standards and Technology. Module-Lattice-Based KeyEncapsulation Mechanism Standard; Federal Information Processing Standards Publication NIST FIPS 203; Department of Commerce: Washington, DC, USA, 2024. [CrossRef]
  2. National Institute of Standards and Technology. Module-Lattice-Based Digital Signature Standard; Federal Information Processing Standards Publication NIST FIPS 204; Department of Commerce: Washington, DC, USA, 2024. [CrossRef]
  3. Fouque, P.-A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Pornin, T.; Prest, T.; Ricosset, T.; Seiler, G.; Whyte, W.; Zhang, Z. Falcon: Fast-Fourier Lattice-Based Compact Signatures Over NTRU (v1.2). NIST PQC Round 2020, 1–67. [Google Scholar]
  4. Aumasson, J.-P.; Bernstein, D.-J.; Beullens, W.; Dobraunig, C.; Eichlseder, M.; Fluhrer, S.; Gazdag, S.-L.; Hülsing, A.; Kampanakis, P.; Kölbl, S.; et al. SPHINCS+—Submission To the 3rd Round of the NIST Post-quantum Project, v3.1. NIST PQC Round 2022, 1–63. [Google Scholar]
  5. Nguyen, T.-H.; Dam, D.-T.; Duong, P.-P.; Kieu-Do-Nguyen, B.; Pham, C.-K.; Hoang, T.-T. Efficient Hardware Implementation of the Lightweight CRYSTALS-Kyber. IEEE Trans. Circ. Syst. I Regul. Pap. 2025, 72, 610–622. [Google Scholar] [CrossRef]
  6. Cui, Y.; Chen, J.; Ni, Z.; Zhang, Z.; Wang, C.; Liu, W. Instruction-Based High-Performance Hardware Controller of CRYSTALS-Kyber With Balanced Resource Utilization. IEEE Trans. Circ. Syst. I Regul. Pap. 2025, 72, 2394–2407. [Google Scholar] [CrossRef]
  7. Kim, H.; Jung, H.; Satriawan, A.; Lee, H. A Configurable ML-KEM/Kyber Key-Encapsulation Hardware Accelerator Architecture. IEEE Trans. Circ. Syst. II Express Briefs 2024, 71, 4678–4682. [Google Scholar] [CrossRef]
  8. Nguyen, T.-H.; Dang, T.-K.; Dam, D.-T.; Nguyen, K.-D.; Duong, P.-P.; Pham, C.-K.; Hoang, T.-T. An Area-Time Efficient Hardware Architecture for ML-KEM Post-Quantum Cryptography Standard. IEEE Access 2025, 13, 103834–103847. [Google Scholar] [CrossRef]
  9. Gewehr, C.; Luza, L.; Moraes, F.G. Hardware Acceleration of Crystals-Kyber in Low-Complexity Embedded Systems With RISC-V Instruction Set Extensions. IEEE Access 2024, 12, 94477–94495. [Google Scholar] [CrossRef]
  10. Wang, T.; Zhang, C.; Zhang, X.; Gu, D.; Cao, P. Optimized Hardware-Software Co-Design for Kyber and Dilithium on RISC-V SoC FPGA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2024, 2024, 99–135. [Google Scholar] [CrossRef]
  11. Ye, Z.; Song, R.; Zhang, H.; Chen, D.; Cheung, R.C.-C.; Huang, K. A Highly-efficient Lattice-based Post-Quantum Cryptography Processor for IoT Applications. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2024, 2024, 130–153. [Google Scholar] [CrossRef]
  12. Dam, D.-T.; Nguyen, T.-H.; Kieu-Do-Nguyen, B.; Hoang, T.-T.; Pham, C.-K. RISC-V SoC with NTT-Blackbox for CRYSTALS-Kyber Post-Quantum Cryptography. In Proceedings of the 9th International Conference on Integrated Circuits, Design, and Verification, Hanoi, Vietnam, 6–8 June 2024; pp. 49–54. [Google Scholar]
  13. Dam, D.-T.; Tran, T.-H.; Hoang, V.-P.; Pham, C.-K.; Hoang, T.-T. A Survey of Post-Quantum Cryptography: Start of a New Race. Cryptography 2023, 7, 40. [Google Scholar] [CrossRef]
  14. Cooley, J.W.; Tukey, J.W. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
  15. Gentleman, W.M.; Sande, G. Fast Fourier Transforms: For Fun and Profit. In Proceedings of the Fall Joint Computer Conference (AFIPS), San Francisco, CA, USA, 7–10 November 1966; pp. 563–578. [Google Scholar]
  16. Chipyard. Rocketchip–Version: Stable. 2024. Available online: https://chipyard.readthedocs.io/en/stable/Generators/Rocket-Chip.html (accessed on 1 December 2024).
  17. Barrett, P. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Proceedings of the Annual International Cryptology Conference (CRYPTO), Santa Barbara, CA, USA, 17–21 August 1986; pp. 311–323. [Google Scholar]
  18. Di Matteo, S.; Sarno, I.; Saponara, S. CRYPHTOR: A Memory-Unified NTT-Based Hardware Accelerator for Post-Quantum CRYSTALS Algorithms. IEEE Access 2024, 12, 25501–25511. [Google Scholar] [CrossRef]
  19. Sun, J.; Bai, X. A High-Speed Hardware Architecture of NTT Accelerator for CRYSTALS-Kyber. Integr. Circuits Syst. 2024, 1, 92–102. [Google Scholar] [CrossRef]
  20. Liu, S.-H.; Kuo, C.-Y.; Mo, Y.-N.; Su, T. An Area-Efficient, Conflict-Free, and Configurable Architecture for Accelerating NTT/INTT. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2023, 32, 519–529. [Google Scholar] [CrossRef]
  21. Miteloudi, K.; Bos, J.; Bronchain, O.; Fay, B.; Renes, J. PQ.V.ALU.E: Post-Quantum RISC-V Custom ALU Extensions on Dilithium and Kyber. In Proceedings of the International Conference on Smart Card Research and Advanced Applications, Amsterdam, The Netherlands, 14–16 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 190–209. [Google Scholar]
  22. Huang, J.; Zhao, H.; Zhang, J.; Dai, W.; Zhou, L.; Cheung, R.C.C.; Koç, Ç.K.; Chen, D. Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices. IEEE Trans. Inf. Forensics Secur. 2024, 19, 3800–3813. [Google Scholar] [CrossRef]
  23. Fritzmann, T.; Sigl, G.; Sepúlveda, J. RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 239–280. [Google Scholar] [CrossRef]
  24. Dam, D.-T.; Nguyen, T.-H.; Tran, T.-H.; Le, D.-H.; Hoang, T.-T.; Pham, C.-K. High-Efficiency Multi-Standard Polynomial Multiplication Accelerator on RISC-V SoC for Post-Quantum Cryptography. IEEE Access 2024, 12, 195015–195031. [Google Scholar] [CrossRef]
  25. Dolmeta, A.; Valpreda, E.; Martina, M.; Masera, G. Implementation and integration of NTT/INTT accelerator on RISC-V for CRYSTALS-Kyber. In Proceedings of the ACM International Conference on Computing Frontiers: Workshops and Special Sessions, Ischia, Italy, 7–9 May 2024; pp. 59–62. [Google Scholar]
  26. Alkim, E.; Evkan, H.; Lahr, N.; Niederhagen, R.; Petri, R. ISA Extensions for Finite Field Arithmetic: Accelerating Kyber and NewHope on RISC-V. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2020, 2020, 219–242. [Google Scholar]
  27. Ji, X.; Dong, J.; Huang, J.; Yuan, Z.; Dai, W.; Xiao, F.; Lin, J. ECO-CRYSTALS: Efficient Cryptography CRYSTALS on Standard RISC-V ISA. IEEE Trans. Comput. 2025, 74, 401–413. [Google Scholar] [CrossRef]
  28. Li, L.; Qin, G.; Yu, Y.; Wang, W. Compact Instruction Set Extensions for Kyber. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 756–760. [Google Scholar] [CrossRef]
  29. PQ-Crystals. Kyber: Post-Quantum Key-Encapsulation Library. 2025. Available online: https://github.com/pq-crystals/kyber (accessed on 8 December 2025).
  30. Nannipieri, P.; Di Matteo, S.; Zulberti, L.; Albicocchi, F.; Saponara, S.; Fanucci, L. A RISC-V Post Quantum Cryptography Instruction Set Extension for Number Theoretic Transform to Speed-up CRYSTALS Algorithms. IEEE Access 2021, 9, 150798–150808. [Google Scholar] [CrossRef]
  31. Stillmaker, A.; Baas, B. Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm. Integration 2017, 58, 74–81. [Google Scholar] [CrossRef]
Figure 1. The key establishment using ML-KEM.
Figure 2. The CT (a) and GS (b) structures and corresponding mapping to (c) forward NTT and (d) inverse NTT.
Figure 3. The proposed SoC architecture with NTT accelerator.
Figure 4. Interface between CPU and NTT accelerator.
Figure 5. The RISC-V custom instruction architecture.
Figure 6. Timing diagram of the accelerator execution flow.
Figure 7. The tightly coupled NTT accelerator architecture.
Figure 8. The re-ordering unit with feedback and a simple FIFO.
Figure 9. CMOS 180 nm chip micrograph.
Figure 10. Speedup of the algorithm implementations relative to the C baseline.
Figure 11. Speedup of the improved system relative to other platforms. Speedup was evaluated by comparing the algorithm's performance on the improved system against the C baseline execution time on each platform. The comparison covers PQRISCV, E310, and ARM-M3 in [22]; Polar [10]; RISQ-V [23]; X-HEEP [25]; VexRiscv [26]; E203 [28]; and RI5CY [21].
Figure 12. SoC power consumption: (a) at VDD = 2.0 V over operating frequencies from 10 kHz to 118 MHz and (b) at 10 kHz over supply voltages from 0.9 V to 2.0 V.
Table 1. FPGA resource utilization of the proposed SoC.

| Component | LUTs | FFs | Slices | BRAMs | DSPs |
|---|---|---|---|---|---|
| SoC | 17,010 | 8944 | 5315 | 10 | 12 |
| Bus, Peripheral | 9800 | 5429 | 2531 | 0 | 0 |
| ROM | 321 | 0 | 116 | 0 | 0 |
| RAM | 96 | 149 | 59 | 4 | 0 |
| Rocket CPU | | | | | |
| — Core | 3800 | 1697 | 1382 | 0 | 10 |
| — NTT | 1314 | 1012 | 548 | 1 | 2 |
| — D-Cache | 1562 | 559 | 632 | 4 | 0 |
| — I-Cache | 117 | 98 | 47 | 1 | 0 |
Table 2. The NTT/INTT latency (in clock cycles) compared to other works.

| Works | Platform | NTT CCs | NTT Latency | INTT CCs | INTT Latency | Area Overhead |
|---|---|---|---|---|---|---|
| Ours | Rocket | 1514 | - | 1413 | - | 8.7% |
| [21] | RI5CY | 2577 | ↑ 1.70× | 3851 | ↑ 2.72× | 16% |
| [22] | M3 | 8026 | ↑ 5.30× | 8594 | ↑ 6.08× | - |
| [22] | E310 | 15,888 | ↑ 10.49× | 15,719 | ↑ 11.12× | - |
| [22] | PQRISCV | 21,975 | ↑ 14.51× | 23,666 | ↑ 16.75× | - |
| [23] | RISQ-V | 1935 | ↑ 1.28× | 1930 | ↑ 1.37× | 60% |
| [24] | Rocket | 4156 | ↑ 2.75× | 4172 | ↑ 2.95× | 12.93% |
| [25] | X-HEEP | 1531 | ↑ 1.01× | 1531 | ↑ 1.08× | 32.64% |
| [26] | Vex-Riscv | 6868 | ↑ 4.54× | 6367 | ↑ 4.51× | 6% |
| [27] | SiFive U74 | 8845 | ↑ 5.84× | 10,262 | ↑ 7.26× | - |
| | | 5700 | ↑ 3.76× | 5618 | ↑ 3.98× | - |
| [28] | E203 | 4302 | ↑ 2.84× | 3426 | ↑ 2.42× | 4.3% |
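The relative-latency columns in Table 2 are each platform's cycle count divided by ours (1514 for NTT, 1413 for INTT); the worst case, PQRISCV, yields the headline 14.51×/16.75× figures. A quick check with values transcribed from the table:

```python
# Cycle counts transcribed from Table 2; relative latency = other platform / ours.
OUR_NTT, OUR_INTT = 1514, 1413

others = {
    "PQRISCV [22]": (21975, 23666),
    "RISQ-V [23]":  (1935, 1930),
    "E203 [28]":    (4302, 3426),
}

speedups = {name: (round(ntt / OUR_NTT, 2), round(intt / OUR_INTT, 2))
            for name, (ntt, intt) in others.items()}
print(speedups["PQRISCV [22]"])  # → (14.51, 16.75)
```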
Table 3. Normalized Power, Energy, and Area after technology scaling from 180 nm.

| Platforms | Type | Process | Power (mW) | Energy (nJ/cycle) | VDD (V) | f_max (MHz) | Area (mm²) |
|---|---|---|---|---|---|---|---|
| This work | ASIC | 180 nm | 307.9 | 2.609 | 2 | 118 | 3.8025 |
| | | 65 nm * | 4.471 | 0.175 | 1.2 | 546 | 0.3169 |
| | | 32 nm * | 0.461 | 0.038 | 0.9 | 1150 | 0.0951 |
| RISQ-V [23] | FPGA | 65 nm | 2.57 | 0.257 | 1.2 | 10 | 0.1432 |
| X-HEEP [25] | FPGA | 65 nm | - | - | - | - | 0.5067 |
| E203 [28] | FPGA | 55 nm | 0.309 | 0.009 | - | 32.9 | - |
| RI5CY [21] | FPGA | 28 nm | - | - | - | 100 | 0.0158 |
* Scaled from 180 nm using scaling equations in [31].
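For the measured rows of Table 3, energy per cycle follows directly from average power and clock frequency (E = P/f); the 65 nm and 32 nm rows are instead derived via the scaling equations in [31], so they do not reduce to this simple ratio. With power in mW and frequency in MHz, the quotient is already in nJ/cycle:

```python
# E = P / f: mW divided by MHz gives nJ per cycle (measured operating points only).
measured = {
    "This work (180 nm)": (307.9, 118),   # (power in mW, frequency in MHz)
    "RISQ-V":             (2.57,  10),
    "E203":               (0.309, 32.9),
}

energy_nj = {name: round(p / f, 3) for name, (p, f) in measured.items()}
print(energy_nj)
# → {'This work (180 nm)': 2.609, 'RISQ-V': 0.257, 'E203': 0.009}
```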
Table 4. Comparison of Energy and Throughput for ML-KEM across different SoC implementations.

| | | k = 2 (ML-KEM-512) | | | k = 3 (ML-KEM-768) | | | k = 4 (ML-KEM-1024) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Metric | Platform | KeyGen | Encaps | Decaps | KeyGen | Encaps | Decaps | KeyGen | Encaps | Decaps |
| Energy (μJ) | This work | 1894.13 | 2841.20 | 3253.42 | 3699.56 | 4717.07 | 5494.55 | 6394.66 | 7860.92 | 8672.32 |
| | This work * | 127.05 | 190.58 | 218.23 | 248.15 | 316.40 | 368.55 | 428.93 | 527.28 | 581.70 |
| | RISQ-V [23] | 494.98 | 685.93 | 526.34 | 797.73 | 968.38 | 877.14 | 1256.73 | 1465.93 | 1350.28 |
| | E203 [28] | 5.60 | 7.07 | 6.42 | 8.89 | 11.13 | 10.20 | 13.89 | 16.66 | 15.47 |
| Throughput (Op/s) | This work | 163 | 108 | 95 | 83 | 65 | 56 | 48 | 39 | 35 |
| | This work * | 752 | 501 | 438 | 385 | 302 | 259 | 223 | 181 | 164 |
| | RISQ-V [23] | 5 | 4 | 5 | 3 | 3 | 3 | 2 | 2 | 2 |
| | E203 [28] | 53 | 42 | 46 | 33 | 27 | 29 | 21 | 18 | 19 |
* Scaled to 65 nm [31].
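The energy and throughput rows of Table 4 are mutually consistent: energy per operation is average power divided by throughput. Using this work's 307.9 mW figure from Table 3 reproduces the lowest-security-level (first column group) energies to within the rounding of the integer throughput column:

```python
POWER_W = 0.3079  # average power of this work at 118 MHz, from Table 3

# (throughput in Op/s, reported energy in uJ) for this work, lowest security level
level1 = {"KeyGen": (163, 1894.13), "Encaps": (108, 2841.20), "Decaps": (95, 3253.42)}

for op, (ops, reported) in level1.items():
    estimated = POWER_W / ops * 1e6  # J/op converted to uJ/op
    assert abs(estimated - reported) / reported < 0.01  # agrees within 1%
    print(f"{op}: {estimated:.0f} uJ estimated vs {reported} uJ reported")
```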

Share and Cite

Dam, D.-T.; Nguyen, K.-D.; Le, D.-H.; Pham, C.-K. Accelerating Post-Quantum Cryptography: A High-Efficiency NTT for ML-KEM on RISC-V. Electronics 2026, 15, 100. https://doi.org/10.3390/electronics15010100
