An Efficient and Low-Cost Design of Modular Reduction for CRYSTALS-Kyber
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes an optimized version of modular reduction for Kyber based on a bitwise algorithm. The paper is well written, and the proposed implementation and results seem slightly more efficient than the state of the art. I suggest some minor improvements:
1) The title of the paper can be improved. The term "reducer" seems not appropriate. Maybe the author can substitute "reducer" with "modular reduction", such as: An efficient and low-cost design of modular reduction for CRYSTALS-Kyber.
2) Figures 1, 2, and 3 are difficult to understand. The authors can improve the description.
3) Also the description of Figure 5 can be improved.
4) The quality of Figure 7 can be improved.
5) The authors can reference more works on hardware implementation of Kyber with different modular reduction techniques:
Montgomery: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10439161
Author Response
Comments 1: the title of the paper can be improved. The term "reducer" seems not appropriate. Maybe the author can substitute "reducer" with "modular reduction", such as: An efficient and low-cost design of modular reduction for CRYSTALS-Kyber.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have revised the title of the paper by replacing the term "reducer" with "modular reduction". The new title is now "An efficient and low-cost design of modular reduction for CRYSTALS-Kyber". This change can be found on the title page of the revised manuscript. In addition, we have replaced the term "reducer" with "reduction unit" throughout the entire text.
Comments 2: Figure 1, 2 and 3 are difficult to understand. The authors can improve the description.
Response 2: Thank you for your valuable feedback. We agree with this comment. Therefore, we have re-optimized the presentation of Figures 1, 2, and 3, and improved the descriptions of these figures. The changes can be found on pages 3 and 4, where the revised parts are marked in red.
Comments 3: Also the description of Figure 5 can be improved.
Response 3: Thank you for your insightful comment. We are in full agreement. Therefore, we have re-optimized the presentation of Figure 5 and improved its description. The changes can be found on page 6, where the revised parts are marked in red.
Comments 4: The quality of Figure 7 can be improved.
Response 4: We appreciate your valuable feedback. Therefore, we have replaced Figure 7 with a higher-resolution image to improve its quality. The updated figure can be found on page 8.
Comments 5: The authors can reference more works on hardware implementation of Kyber with different modular reduction techniques:
Montgomery: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10439161
Response 5: Thank you for your insightful comment. We are in full agreement. Therefore, we have added more references on the hardware implementation of Kyber with different modular reduction techniques, including the work mentioned above by the reviewer, which is cited in the revised manuscript as reference [5]. The changes can be found on page 1, in paragraph 2, line 35.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript reports an efficient, low-cost bitwise modular reduction architecture based on a Dadda tree compression array, targeting CRYSTALS-Kyber post-quantum cryptographic schemes. While the topic is timely and relevant to ongoing hardware optimization efforts for lattice-based cryptography, I regret to say that I do not recommend this paper for publication in its current form. My main concerns are as follows:
- Limited Novelty
The proposed design is an incremental improvement over prior bitwise modular reduction schemes such as those by Yaman and Guo. While the authors do show improved performance metrics (e.g., LUT count, ATP), the core methodology (bit-tree compression and modular decomposition) is not substantially novel, and the architectural improvements seem mostly implementation-level optimizations.
- Lack of System-Level Validation
The authors evaluate only the modular reduction unit in isolation. There is no validation at the system level, e.g., how the design impacts full Kyber encryption/decryption or end-to-end performance. Without such context, the practical significance of the improvement remains unclear.
- Readability and Presentation Issues
The paper contains multiple grammar issues and lacks editorial polish. Moreover, the presentation of Figures 1–3 (bit trees) is dense, and explanations are difficult to follow for readers not deeply familiar with low-level digital arithmetic. No abstraction or guiding diagrams are offered to help readers understand the contribution in simpler terms.
- Evaluation Scope
The comparison table is appreciated, but the discussion could benefit from more detailed metrics, such as power consumption or scalability. Also, DSP usage is briefly mentioned but not fully explored, and the impact on latency in real deployments is not discussed.
- Unclear Motivation for Proposed Design Choices
It is not well explained why a Dadda tree is preferred over other compression strategies in this context. A more detailed analysis of the design trade-offs (e.g., compression speed vs. regularity vs. resource usage) would strengthen the work.
Author Response
Comments 1: Limited Novelty
The proposed design is an incremental improvement over prior bitwise modular reduction schemes such as those by Yaman and Guo. While the authors do show improved performance metrics (e.g., LUT count, ATP), the core methodology (bit-tree compression and modular decomposition) is not substantially novel, and the architectural improvements seem mostly implementation-level optimizations.
Response 1: Thank you for your thorough review and valuable comments. We fully acknowledge your assessment regarding the "incremental improvement" nature of this work, and would like to further elaborate on its methodological and engineering contributions:
- We propose an efficient modular reduction algorithm for q=3329 in CRYSTALS-Kyber. By rederiving the bit tree of the bitwise modular algorithm and proposing three universal methods for eliminating redundant bits, we achieve a more streamlined approach that can be flexibly transferred to other modular parameters.
- We design a reduction unit based on Dadda tree compression arrays. This design significantly reduces resource consumption and latency, offering a more efficient solution for modular reduction.
- We evaluate the modular reduction design on an FPGA, and the experimental results show that our design achieves excellent ATP performance. We also conducted an evaluation at the level of polynomial operations, where our design effectively reduces the area overhead and increases the operating frequency.
While we understand your expectation for greater theoretical breakthroughs, we believe the above contributions align well with the journal's aim "Our aim is to encourage scientists to publish their experimental and theoretical results". We sincerely hope you could reconsider our contributions in this context.
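Editorial illustration (not taken from the manuscript): the bitwise modular reduction idea referred to in the first contribution above can be sketched in software as follows. The sketch assumes a 24-bit input (the product of two 12-bit coefficients), keeps the low 12 bits unchanged, and replaces each set high bit 2^i by its precomputed residue 2^i mod 3329. The names `RESIDUES` and `bitwise_reduce` are hypothetical, and the final correction by repeated subtraction is a simplification; it does not reproduce the authors' bit tree, redundant-bit elimination, or compression array.

```python
Q = 3329  # Kyber modulus

# Precomputed residues of 2^i mod Q for the high bits 12..23 (hypothetical table name).
RESIDUES = {i: (1 << i) % Q for i in range(12, 24)}


def bitwise_reduce(c: int) -> int:
    """Reduce a 24-bit value modulo Q by summing per-bit residues.

    Simplified software model: hardware designs sum these residues with a
    compression tree instead of a sequential loop.
    """
    assert 0 <= c < (1 << 24)
    acc = c & 0xFFF  # bits 0..11 are kept as they are
    for i in range(12, 24):
        if (c >> i) & 1:
            acc += RESIDUES[i]  # replace 2^i by (2^i mod Q)
    # acc is at most 4095 + 12 * 3328, so a few subtractions finish the job.
    while acc >= Q:
        acc -= Q
    return acc
```

For example, `bitwise_reduce(4096)` returns 767, since 2^12 mod 3329 = 767.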
Comments 2: Lack of System-Level Validation
The authors evaluate only the modular reduction unit in isolation. There is no validation at the system level e.g., how the design impacts full Kyber encryption/decryption or end-to-end performance. Without such context, the practical significance of the improvement remains unclear.
Response 2: Thank you for your constructive suggestion. We did consider evaluating the improvement our design brings to the whole Kyber system. However, the Kyber algorithm consists mainly of sampling operations and polynomial operations, which account for more than 50% of the calculations in the whole algorithm [R1]. We therefore conducted an experiment on polynomial modular multiplication using our design, and the results are given in Section 4.3 (e.g., Figure 7). From this evaluation, the impact of our design on the whole Kyber system can be roughly inferred. This paper focuses solely on the modular reduction unit at the circuit level. We are currently building our Kyber encryption system, which involves several improvements at the circuit and system levels in addition to this modular reduction unit, and these innovations cannot all be detailed in a single paper. In this paper, we therefore report only the performance improvement for polynomial modular multiplication obtained with our modular reduction design.
[R1] Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-Speed NTT-Based Polynomial Multiplication Accelerator for Post-Quantum Cryptography. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH); IEEE: Lyngby, Denmark, June 2021; pp. 94–101.
Comments 3: Readability and Presentation Issues
The paper contains multiple grammar issues and lacks editorial polish. Moreover, the presentation of Figures 1–3 (bit trees) is dense, and explanations are difficult to follow for readers not deeply familiar with low-level digital arithmetic. No abstraction or guiding diagrams are offered to help readers understand the contribution in simpler terms.
Response 3: Thank you for your valuable comments. We have carefully checked the grammar and polished the full text. In addition, we have standardized the terminology, such as replacing the term "reducer" with "reduction unit", and changed symbols that could cause confusion. For example, we have replaced the symbol for the compressed carry with the letter "d" to avoid ambiguity. Finally, as the reviewer suggested, we have also re-optimized the presentation of Figures 1, 2, and 3 and improved the descriptions of these figures. The changes can be found on pages 3 and 4, where the revised parts are marked in red.
Comments 4: Evaluation Scope
The comparison table is appreciated, but the discussion could benefit from more detailed metrics, such as power consumption or scalability. Also, DSP usage is briefly mentioned but not fully explored, and the impact on latency in real deployments is not discussed.
Response 4: Thank you for your helpful suggestion. In response to the reviewer’s comments, we have made the following revisions:
- Power Consumption and Latency Metrics: We have included power consumption and latency as additional metrics in Table 2 to provide a more comprehensive evaluation.
- Scalability Discussion: Although our current work is specifically tailored to the fixed modulus of 3329 in Kyber, we have highlighted that our method is transferable and can be adapted to other parameters by re-deriving the bit tree and customizing the compressor accordingly.
- DSP Usage Explanation: Many works in the literature use only the Area-Time Product (ATP) metric for comparison, which already accounts for DSP usage. To address the reviewer’s concern, we have added the DSP resource usage and provided a detailed explanation of why DSP resources are utilized in certain designs on page 8, lines 222-223.
Comments 5: Unclear Motivation for Proposed Design Choices
It is not well explained why a Dadda tree is preferred over other compression strategies in this context. A more detailed analysis of the design trade-offs (e.g., compression speed vs. regularity vs. resource usage) would strengthen the work.
Response 5: We apologize for not clearly explaining our motivation in the previous version of the manuscript. In the new version, we have added a detailed explanation on page 5, in Section 3.1, paragraph 1, lines 145-147, to clarify why we chose the Dadda tree over other compression strategies. We have elaborated on the design trade-offs, including resource overhead, compression speed, and regularity.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript presents a new hardware architecture for bitwise modular reduction in the CRYSTALS-Kyber cryptographic algorithm. The authors propose an optimization that integrates a Dadda Tree Hybrid Compression Array (DTHCA) to improve area efficiency and operating frequency over existing methods.
They provide an architectural hardware representation of the proposed method and implementation using an FPGA card.
Compared to existing works, the proposed method outperforms them in terms of Area-Time Efficiency, Parallelism, and cost-effectiveness for low-resource environments.
Overall, the manuscript is comprehensive, focused, and the method is clearly explained with sound scientific methods to validate it.
Author Response
Response: Thank you very much for your positive and constructive comments on our manuscript. We are pleased to hear that you find our work comprehensive, focused, and clearly explained. Your recognition of the significance of our proposed optimization using the Dadda Tree Hybrid Compression Array (DTHCA) and its potential advantages in terms of Area-Time Efficiency, Parallelism, and cost-effectiveness is highly appreciated. We will continue to refine our work and ensure that it meets the higher standards of scientific rigor and practical applicability. Thank you again for your valuable feedback and support.
Reviewer 4 Report
Comments and Suggestions for Authors
In this paper, a bit-level modular reduction hardware architecture based on the Dadda Tree is proposed. The design concept is clear, the theoretical derivation is sound, and the performance improvement is validated on an FPGA platform. This research holds promising prospects and engineering value in the field of post-quantum cryptographic acceleration. To further improve the quality and completeness of the paper, the author is advised to revise and enhance the following aspects:
- The current evaluation only includes performance metrics such as LUT utilization, frequency, and area-time product. It is recommended to supplement power analysis (e.g., static power, dynamic power, or power-performance ratio) to provide a more comprehensive assessment of hardware efficiency.
- It is suggested to include a discussion on the impact of different compression array parameters (such as tree depth and compression unit selection) to demonstrate the architecture’s tunability in balancing resource usage and performance.
- The hardware architecture diagrams presented in the paper (e.g., Dadda Tree structure, compression path) lack essential annotations and signal labels. It is recommended to refine the graphical representations to improve readability and expressiveness.
- The current experimental comparisons focus on Barrett and Yaman methods. It is suggested to incorporate more recent optimized structures, such as modular reduction based on the Plantard algorithm or low-bit-width compressor designs like RFET, to enhance the representativeness of the comparisons.
- In the mathematical derivations, some symbols are not clearly defined or are used inconsistently across different contexts (e.g., bit-level modular tree expressions). It is advised to standardize the terminology and provide a symbol table for clarity.
- There are inconsistencies in the formatting of references—for instance, the author name in reference [1] is written as “P.W. Shor,” while in [11] it appears as “C.K. Koc.” Some entries use mixed initials and full names, and spacing issues are present (e.g., “C. K.” is not a standard format). Reference formatting should be normalized throughout.
- It is recommended to standardize the placement and formatting of figures and tables for consistency and improved presentation.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Comments 1: The current evaluation only includes performance metrics such as LUT utilization, frequency, and area-time product. It is recommended to supplement power analysis (e.g., static power, dynamic power, or power-performance ratio) to provide a more comprehensive assessment of hardware efficiency.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have added the total power consumption information in Table 2 and included a related conclusion description in the last line of the second paragraph in Section 4.2. The changes can be found on page 7, in the second paragraph of Section 4.2, line 208.
Comments 2: It is suggested to include a discussion on the impact of different compression array parameters (such as tree depth and compression unit selection) to demonstrate the architecture’s tunability in balancing resource usage and performance.
Response 2: Thank you for your suggestion. We have carefully considered it, but unfortunately, we cannot accept this comment. As mentioned in the manuscript, our work is specifically derived for the fixed modulus of 3329 in Kyber. After the bit tree is fully simplified, the depth of the compression array is determined and cannot be adjusted. While our method is transferable, it requires re-derivation of the bit tree and customization of the compressor for other parameters. Regarding the selection of compression units, we have already discussed in Section 3.2, paragraph 2, line 173, why only 3-2 and 2-2 compressions are used.
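Editorial illustration (not taken from the manuscript): the role of 3-2 and 2-2 compressors in a Dadda-style column compression can be sketched as below. A 3-2 compressor is a full adder and a 2-2 compressor is a half adder. The sketch follows the standard Dadda height sequence (2, 3, 4, 6, 9, ...) and reduces every column of bits to at most two rows while preserving the represented value. The function names and the column representation (a list of bit lists, index i holding bits of weight 2^i) are assumptions, and the scheduling is simplified compared with a hardware design such as the one in the paper.

```python
def full_adder(a, b, c):
    """3-2 compressor: three bits of weight 2^i -> sum (2^i) and carry (2^(i+1))."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """2-2 compressor: two bits of weight 2^i -> sum (2^i) and carry (2^(i+1))."""
    return a ^ b, a & b


def dadda_targets(max_height):
    """Dadda stage heights below max_height, largest first (e.g. 7 -> [6, 4, 3, 2])."""
    heights = [2]
    while heights[-1] * 3 // 2 < max_height:
        heights.append(heights[-1] * 3 // 2)
    return heights[::-1]


def compress(columns):
    """Reduce every column to at most two bits using only 3-2 and 2-2 compressors."""
    cols = [list(c) for c in columns]
    for target in dadda_targets(max(len(c) for c in cols)):
        i = 0
        while i < len(cols):
            while len(cols[i]) > target:
                if i + 1 == len(cols):
                    cols.append([])  # room for a carry out of the top column
                if len(cols[i]) == target + 1:
                    s, carry = half_adder(cols[i].pop(0), cols[i].pop(0))
                else:
                    s, carry = full_adder(cols[i].pop(0), cols[i].pop(0), cols[i].pop(0))
                cols[i].append(s)
                cols[i + 1].append(carry)
            i += 1
    return cols


def value(cols):
    """Integer represented by a bit array: each bit in column i has weight 2^i."""
    return sum(bit << i for i, col in enumerate(cols) for bit in col)
```

After `compress`, the two remaining rows would be summed by a conventional adder; the number of stages depends only on the tallest input column, which is consistent with the authors' point that the depth is fixed once the bit tree is fixed.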
Comments 3: The hardware architecture diagrams presented in the paper (e.g., Dadda Tree structure, compression path) lack essential annotations and signal labels. It is recommended to refine the graphical representations to improve readability and expressiveness.
Response 3: Thank you for your insightful comment. We are in full agreement. Therefore, we have optimized the presentation of Figure 1, Figure 2, Figure 3, Figure 4, and Figure 5 to include essential annotations and signal labels. Additionally, we have clarified that Figure 5 illustrates the detailed compression process of the reduction unit shown in Figure 4. The changes can be found on pages 3, 4, 4, 5, and 6, where the revised figures are presented.
Comments 4: The current experimental comparisons focus on Barrett and Yaman methods. It is suggested to incorporate more recent optimized structures, such as modular reduction based on the Plantard algorithm or low-bit-width compressor designs like RFET, to enhance the representativeness of the comparisons.
Response 4: We appreciate your valuable feedback. We initially intended to make the comparison you suggest. However, as mentioned in Section 4.2, paragraph 1, algorithms such as Plantard, Montgomery, K-RED, and K2-RED introduce a coefficient into the reduction result, and eliminating it requires additional post-processing operations. Although this can be avoided in the NTT and INTT by pre-multiplying the twiddle factors by the inverse of the coefficient, it cannot be completely avoided in polynomial point-wise multiplication, especially with a modulus of 3329, so extra post-processing is still needed. In other words, the comparison would involve many other circuits (e.g., pre-processing and post-processing circuits) whose details cannot be obtained from the referenced papers. Therefore, to keep the comparison fair, we only compare modular reduction algorithms that do not require post-processing.
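Editorial illustration (not taken from the manuscript): the extra coefficient mentioned in this response can be seen in a textbook Montgomery reduction. The sketch below assumes R = 2^16 and q = 3329; the function names are hypothetical. The reduction returns a * R^(-1) mod q rather than a mod q, so a further multiplication by R^2 mod q and a second reduction are needed to recover the plain residue.

```python
Q = 3329
R = 1 << 16                    # Montgomery radix (assumption: 16-bit datapath)
Q_NEG_INV = (-pow(Q, -1, R)) % R  # -Q^(-1) mod R, requires Python 3.8+
R2 = (R * R) % Q               # R^2 mod Q, used to remove the extra factor


def montgomery_reduce(a: int) -> int:
    """Return a * R^(-1) mod Q for 0 <= a < R * Q."""
    m = (a * Q_NEG_INV) % R    # chosen so that a + m * Q is divisible by R
    t = (a + m * Q) >> 16      # exact division by R
    return t - Q if t >= Q else t


def remove_factor(t: int) -> int:
    """Post-processing step: turn t = a * R^(-1) mod Q back into a mod Q."""
    return montgomery_reduce(t * R2)
```

For a product `a` of two coefficients below q, `montgomery_reduce(a)` generally differs from `a % Q`, while `remove_factor(montgomery_reduce(a))` equals `a % Q`. That second step is the kind of post-processing the response refers to.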
Comments 5: In the mathematical derivations, some symbols are not clearly defined or are used inconsistently across different contexts (e.g., bit-level modular tree expressions). It is advised to standardize the terminology and provide a symbol table for clarity.
Response 5: Thank you for pointing this out. We have checked and revised the whole paper according to your comments. The main revisions are as follows:
- Replaced the symbol for the compressed carry with the letter "d" to avoid confusion.
- Provided explanations for the input "C" in Section 2.
- Adjusted the symbols in lines 97-113 on page 3, line 158 on page 6, and lines 201-213 on page 7.
- Made corresponding adjustments to the symbols in Figures 1 to 5.
Comments 6: There are inconsistencies in the formatting of references—for instance, the author name in reference [1] is written as “P.W. Shor,” while in [11] it appears as “C.K. Koc.” Some entries use mixed initials and full names, and spacing issues are present (e.g., “C. K.” is not a standard format). Reference formatting should be normalized throughout.
Response 6: Thank you for your attention to the formatting of the references. We have carefully reviewed the reference list and would like to clarify that the formatting of the author names is consistent with the requirements of the journal. Specifically, the format "Last Name, First Initial." (e.g., "Shor, P.W." and "Koc, C.K.") is the standard format prescribed by the journal, and we have adhered to this consistently throughout the manuscript. We have also ensured that there are no spacing issues in the references. To address any potential confusion, we double-checked the reference list and confirmed that all entries follow the prescribed format.
Comments 7: It is recommended to standardize the placement and formatting of figures and tables for consistency and improved presentation.
Response 7: Thank you for your comments on improving the presentation. We have adjusted the placement and formatting of all figures and tables to ensure consistency and improved presentation. All figures and tables are now aligned with the main text. For oversized figures and tables that could not fit within the standard margins, we have aligned them to the right to maintain readability.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
I recommend that the authors provide a more detailed and thorough presentation of their work to enhance clarity and strengthen the overall contribution. However, the current version meets the minimum standards for publication.
Author Response
Comments: I recommend that the authors provide a more detailed and thorough presentation of their work to enhance clarity and strengthen the overall contribution. However, the current version meets the minimum standards for publication.
Response: Thank you very much for your valuable comments on our manuscript. We appreciate your suggestion to provide a more detailed and thorough presentation of our work to enhance clarity and strengthen the overall contribution. We have taken your feedback seriously and have revised the manuscript accordingly. Specifically, we have expanded the sections related to the algorithm complexity, area efficiency, and system-level improvements to provide a more comprehensive understanding of our contributions. These revisions can be found on page 2, lines 67-78, where we detail the enhancements made using the Dadda Tree Hybrid Compression Array (DTHCA). We believe these changes will make our work clearer and more impactful. Thank you again for your insightful comments and for giving us the opportunity to improve our manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
Accept the paper.
Author Response
Comments: Accept the paper.
Response: Thank you very much for your positive evaluation and decision to accept our paper. We are delighted to hear that our work has met the standards for publication. Your feedback throughout the review process has been invaluable in helping us refine and improve our manuscript. We truly appreciate your time, effort, and constructive comments, which have significantly contributed to the quality of our research presentation.