Review Reports - Accelerating Post-Quantum Cryptography: A High-Efficiency NTT for ML-KEM on RISC-V

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a hardware implementation of a RISC-V System-on-Chip (SoC) integrated with an NTT module for accelerating ML-KEM. The topic is of significant importance and is highly appealing to potential readers. The design methodology and implementation details are well demonstrated, and a series of experiments have been conducted to assess the performance and power consumption. Nevertheless, several issues need to be addressed before acceptance.

- The authors compared the speedup efficiency presented in Figure 8. It is necessary to explain in the text how the reference data in Fig. 8 were acquired.

- The primary contribution of this research lies in the hardware implementation. In this regard, hardware overhead, frequency, and area are among the most crucial indicators for evaluating the work. However, these aspects are inadequately addressed in the paper. Additionally, a comparison with related works in terms of hardware area or FPGA resource consumption should be conducted.

Author Response

1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: The authors compared the speedup efficiency presented in Figure 8. It is necessary to explain in the text how the reference data in Fig. 8 were acquired.
Response 1: Thank you for your suggestion. We have reviewed the manuscript and added the necessary explanations in this updated version. In this revision, the former Figure 8 has been renumbered as Figure 10. The corresponding explanations are highlighted in yellow on page 9. We have clarified the execution steps as well as the meaning of each column and the values presented in the chart.
Comments 2: The primary contribution of this research lies in the hardware implementation. In this regard, hardware overhead, frequency, and area are among the most crucial indicators for evaluating the work. However, these aspects are inadequately addressed in the paper. Additionally, a comparison with related works in terms of hardware area or FPGA resource consumption should be conducted.
Response 2: Thank you for your suggestion. We have added Table 1 on page 7, along with the corresponding explanations on page 8. These updates, highlighted in yellow, report the FPGA resource utilization during the development of our system. The system was initially designed and validated on an FPGA before proceeding to ASIC layout. It can be observed that integrating the NTT accelerator introduces approximately 8.7% additional overhead. However, this overhead is justified by the significant performance improvement achieved, as demonstrated in Figure 10. We have also compared the overhead of our design with other reported works in Table 2.
3. Additional clarifications
In addition to addressing the reviewers’ comments, we have updated the tables, figures, and explanations in this revised manuscript. We focused on resolving the shortcomings of the previous version, particularly regarding architectural description, analysis, comparisons, and presentation clarity. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents a RISC-V SoC with a tightly coupled NTT/INTT accelerator for ML-KEM, implemented and fabricated in a 180 nm CMOS process. The work addresses an important problem in post-quantum cryptography and demonstrates meaningful performance gains. The contribution is promising, but several aspects require clarification and strengthening to improve the technical rigor and presentation quality.

Clarify Novel Contributions
- The claim of being “the first physical chip implementation of an SoC for ML-KEM” is significant and should be supported more clearly. Please explain how your work differs from prior ASIC or MPW implementations and explicitly articulate what aspects (e.g., full SoC, RoCC-based integration, NTT-only acceleration) are novel.
Improve Architectural Description
- The architecture section would benefit from a more cohesive and detailed description of the dataflow.
- Please expand on how polynomial data moves through memory → FIFO → butterfly units → memory, and justify design choices such as FIFO size, dual-BU configuration, and the 448-cycle latency.
- The reordering unit and memory-conflict avoidance mechanism require further explanation.
NTT/INTT Design Rationale
- More detail is needed on the adoption of Cooley–Tukey for NTT and Gentleman–Sande for INTT.
- Please explain how twiddle factors are indexed and stored, how bit-reversal is handled, and why this combination eliminates pre/post-processing.
Evaluation Section Needs Additional Depth
- The results focus primarily on cycle counts. To enable a more complete comparison, please include:
  - throughput (e.g., NTT operations per second),
  - energy per NTT/INTT,
  - normalization per MHz or per gate equivalent,
  - a breakdown of execution time showing the fraction spent in NTT/INTT vs. other components.
- It is unclear whether the “C baseline” runs on the same Rocket core under identical conditions; please clarify.
Power Measurements Need Interpretation
- The power results would be more meaningful with additional analysis, such as dynamic vs. leakage power, trend explanations, and energy-per-operation calculations.
- Please discuss the implications of the wide voltage/frequency operating range.

Comments on the Quality of English Language

Some sentences are difficult to follow, and terminology is inconsistently used (e.g., BU, BFU, butterfly unit).
Figures 3–5 contain small or unclear text; larger font or simplified diagrams would improve readability.
Consider carefully proofreading the manuscript for grammar, clarity, and consistency.

Author Response

1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: Clarify Novel Contributions The claim of being “the first physical chip implementation of an SoC for ML-KEM” is significant and should be supported more clearly. Please explain how your work differs from prior ASIC or MPW implementations and explicitly articulate what aspects (e.g., full SoC, RoCC-based integration, NTT-only acceleration) are novel.
Response 1: Thank you for your suggestion. We have reviewed the related literature, and most existing works are implemented using simulation, FPGA prototypes, or up to post-layout analysis. Only a small number of studies report results on physical chips, and these typically focus on implementing the algorithm rather than a complete SoC. Therefore, to provide clearer context, we have added a statement in the revised manuscript noting that this is the first chip to implement a full SoC with an NTT accelerator based on RoCC custom instructions. This addition is highlighted in cyan on page 2.
Comments 2: Improve Architectural Description The architecture section would benefit from a more cohesive and detailed description of the dataflow. Please expand on how polynomial data moves through memory → FIFO → butterfly units → memory, and justify design choices such as FIFO size, dual-BU configuration, and the 448-cycle latency. The reordering unit and memory-conflict avoidance mechanism require further explanation.
Response 2: Thank you for your suggestion. We have added detailed explanations regarding the dataflow, the rationale behind choosing the FIFO size, and how it works in conjunction with the dual-butterfly architecture. These clarifications also justify why the NTT/INTT transformation for a single polynomial requires 448 clock cycles. The corresponding explanations are highlighted in cyan on page 6. In addition, we have included a new Figure 8 to illustrate the operation of the feedback-based reordering unit and how it cooperates with the FIFO during iterative processing.
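One plausible accounting for the 448-cycle figure can be sketched as follows. This is an illustration, not the derivation given in the manuscript: it assumes the standard ML-KEM NTT structure (256 coefficients, 7 butterfly layers of 128 butterflies each) and that each of the two butterfly units completes one butterfly per clock cycle without stalls. The function name and parameters are hypothetical.

```python
def ideal_ntt_cycles(n=256, layers=7, butterfly_units=2):
    """Ideal cycle count for one NTT/INTT pass.

    Assumption (not from the manuscript): one butterfly per unit per
    cycle, no pipeline fill, no memory stalls.
    """
    butterflies_per_layer = n // 2                      # 128 for n = 256
    cycles_per_layer = butterflies_per_layer // butterfly_units
    return layers * cycles_per_layer


print(ideal_ntt_cycles())  # 7 layers * 64 cycles = 448
```

Under the same assumptions, a single butterfly unit would need 896 cycles, which is one way to read the motivation for the dual-BU configuration.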
Comments 3: NTT/INTT Design Rationale More detail is needed on the adoption of Cooley–Tukey for NTT and Gentleman–Sande for INTT. Please explain how twiddle factors are indexed and stored, how bit-reversal is handled, and why this combination eliminates pre/post-processing.
Response 3: Thank you for your suggestion. We have added Figure 2 to illustrate the CT and GS structures, along with examples of how they are applied in forward and inverse NTT transformations. We also provide an explanation of the rationale behind combining the CT–GS architectures for forward and inverse NTT, which is highlighted in cyan on page 3. In addition, we describe how the twiddle factors are stored and utilized, as detailed on page 4.
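The CT/GS pairing discussed above can be illustrated with a short software model. This is a minimal sketch following the standard ML-KEM parameters (q = 3329, primitive 256th root of unity 17, twiddle factors stored in 7-bit bit-reversed order); it is not the authors' hardware design, and the helper names are hypothetical. The forward transform uses Cooley–Tukey butterflies and leaves its output in bit-reversed order; the inverse uses Gentleman–Sande butterflies and consumes that order directly, so no separate bit-reversal permutation is needed between them.

```python
Q = 3329          # ML-KEM modulus
ROOT = 17         # primitive 256th root of unity mod Q
INV_128 = 3303    # 128^-1 mod Q, final scaling of the inverse transform


def bitrev7(i):
    """Reverse the 7-bit binary representation of i."""
    return int(format(i, "07b")[::-1], 2)


# Twiddle table in bit-reversed order, so both loops read it sequentially.
ZETAS = [pow(ROOT, bitrev7(i), Q) for i in range(128)]


def ntt(f):
    """Forward NTT with Cooley-Tukey butterflies (natural order in)."""
    f = list(f)
    i, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[i]
            i += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


def intt(f):
    """Inverse NTT with Gentleman-Sande butterflies (natural order out)."""
    f = list(f)
    i, length = 127, 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[i]
            i -= 1
            for j in range(start, start + length):
                t = f[j]
                f[j] = (t + f[j + length]) % Q
                f[j + length] = zeta * (f[j + length] - t) % Q
        length *= 2
    return [x * INV_128 % Q for x in f]


poly = [(7 * k + 3) % Q for k in range(256)]
assert intt(ntt(poly)) == poly  # round trip recovers the input
```

The forward loop walks the twiddle table upward and the inverse loop walks it downward, which is why a single bit-reversed table can serve both directions.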
Comments 4: Evaluation Section Needs Additional Depth The results focus primarily on cycle counts. To enable a more complete comparison, please include: throughput (e.g., NTT operations per second); energy per NTT/INTT; normalization per MHz or per gate equivalent; and a breakdown of execution time showing the fraction spent in NTT/INTT vs. other components. It is unclear whether the “C baseline” runs on the same Rocket core under identical conditions; please clarify.
Response 4: Thank you for your suggestion. We have conducted a more in-depth evaluation of the system throughput, including the NTT operations per second and the energy consumption for each process. These additions are highlighted in cyan on pages 8, 9, and 11. The term C baseline refers to the original C reference implementation released by the CRYSTALS development team. We use this reference software running on our system as the baseline for all subsequent comparisons. This clarification has been added to the manuscript at line 232, page 9.
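The throughput and energy-per-operation metrics requested here follow from cycle count, clock frequency, and average power. The sketch below shows the arithmetic only; the 448-cycle default comes from the discussion above, while the frequency and power values in the example are placeholders chosen for illustration and are not measurements from the manuscript. Function names are hypothetical.

```python
def ntt_throughput(freq_hz, cycles_per_ntt=448):
    """NTT operations per second at a given clock frequency."""
    return freq_hz / cycles_per_ntt


def energy_per_ntt(power_w, freq_hz, cycles_per_ntt=448):
    """Energy in joules for one NTT: average power times execution time."""
    return power_w * cycles_per_ntt / freq_hz


# Placeholder operating point (NOT from the manuscript): 50 MHz, 100 mW.
freq = 50e6
power = 0.1
print(f"{ntt_throughput(freq):.0f} NTT/s")
print(f"{energy_per_ntt(power, freq) * 1e6:.3f} uJ per NTT")
```

Energy per cycle is the same calculation with `cycles_per_ntt=1`, which makes results at different voltage and frequency points directly comparable.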

Comments 5: Power Measurements Need Interpretation The power results would be more meaningful with additional analysis, such as dynamic vs. leakage power, trend explanations, and energy-per-operation calculations. Please discuss the implications of the wide voltage/frequency operating range.
Response 5: Thank you for your comment. We have conducted a more detailed analysis of the power measurement results. Tables 3 and 4 have been added to normalize the reported values across different process technologies, enabling a fairer comparison with existing works. Additional explanations have also been included and are highlighted in cyan on pages 10 and 11.
Comments 6: Comments on the Quality of English Language Some sentences are difficult to follow, and terminology is inconsistently used (e.g., BU, BFU, butterfly unit). Figures 3–5 contain small or unclear text; larger font or simplified diagrams would improve readability. Consider carefully proofreading the manuscript for grammar, clarity, and consistency.
Response 6: Thank you for your comment. We have revised and standardized all terminology and abbreviations throughout the manuscript. We have also adjusted the font sizes in the figures to ensure consistency and readability. Moreover, we carefully reviewed the grammar and clarified the presentation where necessary, so that readers can easily follow the flow of our ideas.

3. Additional clarifications
In addition to addressing the reviewers’ comments, we focused on clarifying the architectural structure of the design, the evaluation methodology, and the analysis of the measurement results. We have also included additional data to further demonstrate the effectiveness of the proposed system. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript proposes a RISC-V SoC with a tightly coupled NTT/INTT accelerator implemented as an ASIC in 180-nm CMOS. The accelerator integrates via RoCC, uses a dual-butterfly architecture, and claims significant speedups in ML-KEM operations. An ASIC is fabricated and evaluated for area, power, and performance. However, there are some concerns that need to be addressed. More specifically, my comments for the authors are as follows:

The FIFO structure (page 5–6) is essential to the iterative NTT pipeline, yet the manuscript lacks timing diagrams, occupancy analysis, or stall handling descriptions.
The bit-reversal / FSR-based reordering unit is not well elaborated. A figure that clarifies dataflow ordering, memory conflicts, or latency per stage would be interesting.
Many compared works are 32-bit CPUs vs. the authors’ 64-bit Rocket Core. Authors should normalise bit-width, frequency, technology node, memory hierarchy, or functional scope. Otherwise, speedup numbers such as 14.51x and 16.75x are potentially misleading.
The evaluation of energy efficiency is incomplete because the manuscript reports absolute power consumption at different frequencies and voltages but does not quantify energy per NTT, energy per coefficient, or energy per ML-KEM encapsulation or decapsulation.
Latency numbers seem not to account for memory hierarchy differences.
The manuscript should include details about testbench structure, functional verification coverage, co-simulation between the RISC-V software and the RTL accelerator, or validation of custom instructions against corner cases and hazard scenarios.

Minor editorial comments

Acronyms (RU, BU) should be defined once and used consistently.
Parts of the paper contain grammatical errors, informal language, and unclear explanations.

Author Response

1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: The FIFO structure is essential to the iterative NTT pipeline, yet the manuscript lacks timing diagrams, occupancy analysis, or stall handling descriptions.
Response 1: Thank you for your suggestion. We have added Figure 8 on page 7 to provide a clearer illustration of how the FIFO operates in our design, along with additional explanations highlighted in pink on pages 6 and 7.
Comments 2: The bit-reversal / FSR-based reordering unit is not well elaborated. A figure that clarifies dataflow ordering, memory conflicts, or latency per stage would be interesting.
Response 2: Thank you for your suggestion. We have added Figure 8 to illustrate an example of the feedback-based reordering unit for the case of n = 16. This unit operates similarly to a conventional reorder buffer used in typical NTT architectures but includes an additional feedback path to the input registers. The purpose of this feedback mechanism is to utilize the input registers to temporarily hold coefficients until their corresponding counterparts appear at the next stage. The outputs are then concatenated into a bitstream and buffered into the FIFO. As shown, the output stream always maintains a fixed latency—in our design, this latency is 16 clock cycles. Consequently, within the FIFO, the write pointer consistently trails the read pointer by exactly 16 clock cycles. This ensures that the FIFO avoids RAW conflicts while keeping its size minimal. These explanations are highlighted in cyan on page 6.
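The fixed-latency argument above can be checked with a toy cycle-level model. This is only a behavioural sketch under stated assumptions, not the authors' RTL: the buffer depth of 64, the "+1" stand-in for the butterfly and reorder pipeline, the write-before-read ordering within a cycle, and all names are hypothetical; only the 16-cycle latency comes from the response. Each cycle reads one slot, and the result is written back to that same slot exactly 16 cycles later, so a slot is never read again before its update has landed as long as the depth is at least the latency.

```python
from collections import deque


def simulate_fifo(depth=64, latency=16, passes=3):
    """Toy model: in-place circular buffer with a fixed write-back latency."""
    buf = list(range(depth))
    in_flight = deque()   # (slot, value) pairs inside the processing pipeline
    pending = set()       # slots that were read but not yet written back
    total_reads = depth * passes
    for t in range(total_reads + latency):
        # Write back the item that entered the pipeline `latency` cycles ago.
        if t >= latency:
            slot, value = in_flight.popleft()
            buf[slot] = value + 1      # stand-in for one processing stage
            pending.discard(slot)
        # Read the next slot while input remains.
        if t < total_reads:
            slot = t % depth
            assert slot not in pending, "RAW hazard: read before write-back"
            in_flight.append((slot, buf[slot]))
            pending.add(slot)
    return buf


result = simulate_fifo()
assert result == [i + 3 for i in range(64)]  # every slot updated once per pass
```

Calling `simulate_fifo(depth=8)` trips the hazard assertion, which illustrates why the buffer cannot be made smaller than the pipeline latency in this model.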
Comments 3: Many compared works are 32-bit CPUs vs. the authors’ 64-bit Rocket Core. Authors should normalise bit-width, frequency, technology node, memory hierarchy, or functional scope. Otherwise, speedup numbers such as 14.51x and 16.75x are potentially misleading.
Response 3: Thank you for your constructive comment. We have incorporated these explanations into the revised manuscript, highlighted in pink on page 8. The reference software implementation we use employs only 32-bit variables, with each polynomial coefficient represented using 16-bit data types. Therefore, the performance evaluation on both systems is fundamentally comparable.
Comments 4: The evaluation of energy efficiency is incomplete because the manuscript reports absolute power consumption at different frequencies and voltages but does not quantify energy per NTT, energy per coefficient, or energy per ML-KEM encapsulation or decapsulation.
Response 4: Thank you for your suggestion. In the manuscript, we measure the power consumption of the chip while it executes the KeyGen, Encaps, and Decaps procedures sequentially. The reported value represents the average over 100 measurement iterations. We have also added evaluations of the energy per NTT, energy per cycle, and energy per process. These results are presented in Tables 3 and 4, with the corresponding explanations highlighted in cyan on pages 10 and 11.
Comments 5: Latency numbers seem not to account for memory hierarchy differences.
Response 5: Thank you for your comment. The latency figures reported here already include the clock cycles required for memory read and write operations. In practice, the latency is measured from the moment the instruction carrying the input polynomial’s memory address is issued until the signal indicating that the results have been written back to memory is asserted. Other studies in the literature follow a similar evaluation methodology.
Comments 6: The manuscript should include details about testbench structure, functional verification coverage, co-simulation between the RISC-V software and the RTL accelerator, or validation of custom instructions against corner cases and hazard scenarios.
Response 6: Thank you for your comment. During the design process, we continuously validated the hardware results against the reference software implementation. For simulation, we used Verilator to emulate the full SoC behavior and verify the correctness of the design. We then deployed the RTL on an FPGA platform to confirm functional correctness and obtain preliminary performance estimates. Finally, the RTL was synthesized and taped out as an ASIC for fabrication. After fabrication, we evaluated the functionality of the physical chip through the measurement procedures described in the paper. The verification followed the sequence of KeyGen, Encapsulation, and Decapsulation at the security levels corresponding to k = 2, 3, and 4. The outputs of the reference software served as the golden reference against which the results produced by the chip were compared. These explanations are highlighted in yellow on page 9.
Comments 7: Minor editorial comments: Acronyms (RU, BU) should be defined once and used consistently. Parts of the paper contain grammatical errors, informal language, and unclear explanations.
Response 7: Thank you for your constructive comments. We have updated the definitions of all abbreviations at their first occurrence. We have also carefully reviewed and corrected grammatical issues and clarified several explanations throughout the manuscript.

3. Additional clarifications
In addition to addressing the reviewers’ comments, we focused on clarifying the architectural structure of the design, the evaluation methodology, and the analysis of the measurement results. We have also included additional data to further demonstrate the effectiveness of the proposed system. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The paper presents a valuable and meaningful hardware contribution and demonstrates promising PQC acceleration results using an ASIC-based RISC-V SoC.

Strengths

The work addresses post-quantum cryptography acceleration, which is highly pertinent as PQC standardization progresses.
The paper presents an ASIC-based RISC-V SoC featuring a tightly integrated NTT accelerator—an important step forward since most existing designs rely on FPGA platforms.
Significant NTT and inverse NTT speedups (14×–16×).
Meaningful end-to-end ML-KEM improvements across all security levels.
Real silicon results (fabricated 180 nm ASIC), which greatly strengthen the credibility of the work.
Integrating the accelerator through architectural extensions is an important and practical design aspect that improves usability and software-hardware integration.

Areas for Improvement

While raw speedup is shown, the paper does not deeply analyze throughput, latency bottlenecks, memory interactions, or bus congestion in the SoC.
The NTT accelerator appears optimized for ML-KEM, but it is unclear whether it supports other lattice schemes, other moduli, or other NTT sizes.

Author Response

1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: While raw speedup is shown, the paper does not deeply analyze throughput, latency bottlenecks, memory interactions, or bus congestion in the SoC.
Response 1: Thank you for your suggestion. We have revised the manuscript and conducted additional in-depth analyses. The analyses, explanations, and comparisons regarding memory operations, dataflow, and the rationale behind design parameter choices have been highlighted in cyan and pink on page 6. We have also added new figures and tables presenting detailed evaluations of system resources, throughput, and energy consumption on pages 8, 9, 10, and 11.
Comments 2: The NTT accelerator appears optimized for ML-KEM, but it is unclear whether it supports other lattice schemes, other moduli, or other NTT sizes.
Response 2: Thank you for your constructive comment. Our design is optimized specifically for ML-KEM. As demonstrated, even with only two simple custom instructions, the performance improvement over the pure-software implementation is significant. The resulting efficiency is comparable to other platforms that employ a larger set of customized instruction groups. This highlights the potential of our approach for other computationally intensive algorithms such as Dilithium and FALCON, particularly those requiring flexible NTT dimensions or variable input sizes. This direction also represents our planned future research. The corresponding points have been highlighted in orange on page 11 of the revised manuscript.
3. Additional clarifications
In addition to addressing the reviewers’ comments, we have updated the tables, figures, and explanations in this revised manuscript. We focused on resolving the shortcomings of the previous version, particularly regarding architectural description, analysis, comparisons, and presentation clarity. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thanks for the reply.

Most of my previous concerns have been addressed.

I recommend that the authors refrain from using scaled data in the comparative analysis, as these data have not been validated directly, as indicated by the recent additions to Tables 3 and 4 in the revised manuscript.

Author Response

Response to Reviewer 1 Comments
1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: I recommend that the authors refrain from using scaled data in the comparative analysis, as these data have not been validated directly, as indicated by the recent additions to Tables 3 and 4 in the revised manuscript.
Response 1: Thank you for your suggestion. The number of published implementations on physical chips is very limited, and most existing designs are implemented on different FPGA platforms. Furthermore, each design uses a different process technology. We therefore converted each design to its equivalent in the process closest to that of the reference design for comparison. This comparison is intended only to describe trends and to compare the architectures of different designs, not to serve as an absolute comparison. This has been explained in more detail in the section highlighted in yellow on pages 10 and 11.
3. Additional clarifications
In addition to addressing the reviewers’ comments, we have updated the explanations in this revised manuscript. We focused on resolving the shortcomings of the previous version, particularly regarding architectural description, analysis, comparisons, and presentation clarity. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have significantly revised the manuscript improving the overall quality. I suggest acceptance.

Author Response

Response to Reviewer 3 Comments
1. Summary
Thank you very much for taking the time to carefully review this manuscript. We sincerely appreciate your valuable feedback. We have thoroughly addressed each of your comments and provided detailed responses below. The corresponding highlighted changes can be found in the re-submitted files. We hope that our revisions meet your expectations and further enhance the quality of this article.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: The authors have significantly revised the manuscript improving the overall quality. I suggest acceptance.
Response 1: We appreciate your support. We are delighted that our response has resolved your questions about the previous version of the manuscript. Thank you for taking the time to read the manuscript and for your valuable comments.
3. Additional clarifications
In addition to addressing the reviewers’ comments, we focused on clarifying the architectural structure of the design, the evaluation methodology, and the analysis of the measurement results. We have also included additional data to further demonstrate the effectiveness of the proposed system. Once again, we sincerely thank you for your careful reading and supportive evaluation of our work.