# Optimizing Hardware Resource Utilization for Accelerating the NTRU-KEM Algorithm


## Abstract


## 1. Introduction

- **Efficient full-functionality implementation:** Unlike previous studies that focused only on encapsulation and decapsulation functions, this work aims to efficiently implement the entire NTRU-KEM algorithm, including the key generation (KEYGEN) function, using a hardware and software co-design methodology.
- **Optimized use of hardware resources:** Our work addresses the challenge of high latencies and underutilized hardware resources (like bus interfaces) observed in previous work [18]. Our approach achieves this by developing a new scheduling technique that maximizes hardware resource usage, particularly for cross-functional operations that can be executed in parallel.
- **Minimized area for registers:** To run concurrent operations without data overwriting, we design an integrated register array. This design allows for efficient reuse of registers throughout NTRU-KEM processes, minimizing hardware size while maintaining performance.
- **Enhanced performance efficiency:** Our implementation demonstrates a significant improvement in hardware resource efficiency, showing a performance increase of 1.82 to 51.38 times per area compared to previous work.

## 2. Background

#### 2.1. Lattice-Based Cryptography

#### 2.2. NTRU Cryptosystem

#### 2.3. NTRU-KEM Algorithm

`KEYGEN`, encapsulation, and decapsulation.

`KEYGEN` generates a key pair $K_{priv} = (f, f_p, h_q)$ and $K_{pub} = h$. The process starts by sampling the four polynomials and then computing $f_p = f^{-1} \bmod (3, \Phi_n)$ and $f_q = f^{-1} \bmod (q, \Phi_n)$. The public key $h$ is a polynomial in $R_q$ calculated from $h = (3 \cdot g \cdot f_q) \bmod (q, x^n - 1)$. The last component of the private key, $h_q$, is a precomputed polynomial $h_q = h^{-1} \bmod (q, \Phi_n)$. Using the polynomials above, encapsulation and decapsulation are computed as depicted in Figure 1.
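As a quick consistency check on these definitions (a sketch, assuming the usual NTRU convention that $\Phi_n = 1 + x + \dots + x^{n-1}$ divides $x^n - 1$, which this section does not state explicitly), multiplying the public key by $f$ recovers $3 \cdot g$ modulo $(q, \Phi_n)$:

$$
h \cdot f \;\equiv\; 3 \cdot g \cdot f_q \cdot f \;\equiv\; 3 \cdot g \pmod{(q, \Phi_n)}, \qquad \text{since } f_q \cdot f \equiv 1 \pmod{(q, \Phi_n)}.
$$

The first congruence holds because any multiple of $x^n - 1$ is also a multiple of $\Phi_n$, so reducing modulo $(q, x^n - 1)$ first does not change the residue modulo $(q, \Phi_n)$.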

## 3. Design Overview

`KEYGEN`, encapsulation and decapsulation in Figure 1. Our hardware design methodology focuses on area efficiency and hardware utilization, including the memory bus and ALU.

## 4. Implementation

`poly_*_mul`), polynomial inversion (`poly_*_inv`), integer sorting (`crypto_sort_int32`), and binary conversion (`2B_conversion`). These FSMs were designed based on their respective common arithmetic patterns.

`poly_*_mul` includes the `poly_Rq_mul`, `poly_S3_mul`, `poly_Sq_mul`, and `poly_R2_inv_to_Rq_inv` sub-functions for polynomial multiplication with a specific ring modulus, and `poly_*_inv` includes the `poly_R2_inv` and `poly_S3_inv` sub-functions for polynomial inversion. This approach is similar to our previous work [18]. However, we observed that while our previous work was efficient in terms of area, its use of resources was not optimal. Uneven allocation of resources led to time-related bottlenecks. For example, the bus bandwidth was underutilized, using less than half of its capacity when the ALU was busy, which negatively affected the overall performance and efficiency. To address such issues, we adopt a different approach that aims to maximize resource utilization while still maintaining the design's compact area efficiency.

#### 4.1. FSM-Oriented Hardware Design

`poly_Rq_mul` includes multiplications between polynomials on a ring $R_q$. To ease the complexity of such multiplications, conventional methods employ the fast Fourier transform (FFT) or number theoretic transform (NTT) [19,20,21,22,23,24,25]. However, NTRU does not use these transforms for two major reasons. First, the quotient rings used in NTRU do not directly support FFT/NTT, as such approaches require polynomials whose degrees are powers of two [26,27], whereas NTRU variants use polynomials whose degrees are prime numbers. Previous research [28] extends the polynomials to higher power-of-two degrees, but this results in more computation than naive multiplication. Second, the predefined moduli used in NTRU are rather simple and can be computed efficiently using specialized operation sequences. For instance, NTRU frequently uses modulo 3, whose coefficients are only 1, 0, and −1, so any computation between such values can be easily managed by zeroization and bit-flip operations.
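The role of the simple coefficients can be illustrated with a small software model. The sketch below is not the hardware design described here; it assumes a cyclic ring $\mathbb{Z}_q[x]/(x^n - 1)$ and a ternary operand, and the function names `ternary_mul` and `schoolbook_mul` are hypothetical. A coefficient of 0 skips the term (zeroization), while −1 is handled as a sign flip, modeled here as a modular subtraction.

```python
def ternary_mul(a, t, q):
    """Multiply a (coefficients mod q) by ternary t (coefficients in {-1, 0, 1})
    in Z_q[x]/(x^n - 1) using only additions, subtractions, and skips."""
    n = len(a)
    if len(t) != n:
        raise ValueError("operands must have the same length")
    out = [0] * n
    for j, tj in enumerate(t):
        if tj == 0:
            continue  # zeroization: this term contributes nothing
        for i, ai in enumerate(a):
            k = (i + j) % n  # wrap-around because x^n = 1
            if tj == 1:
                out[k] = (out[k] + ai) % q
            else:
                out[k] = (out[k] - ai) % q  # sign flip for coefficient -1
    return out


def schoolbook_mul(a, b, q):
    """Reference cyclic convolution with general multiplications."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] = (out[(i + j) % n] + a[i] * b[j]) % q
    return out


a = [1, 2, 3, 4, 5]
t = [1, 0, -1, 0, 1]
print(ternary_mul(a, t, 2048) == schoolbook_mul(a, t, 2048))
```

The comparison with `schoolbook_mul` shows that no general multiplier is needed for this operand type, which is the property the text relies on.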

#### 4.2. Latency Profiling Sub-Functions for Acceleration

`poly_Rq_mul`, `poly_S3_mul`, and `poly_Sq_mul` account for between 83 and 97% of the total runtime. Accordingly, the aforementioned existing NTRU-KEM accelerators that only accelerated these two functions [13,14,15] aimed to boost the performance of the polynomial multiplication. However, in the case of `KEYGEN`, other sub-functions, `poly_Rq_inv` and `poly_S3_inv`, account for the biggest portion, over 86%, while `poly_Rq_mul` only accounts for 12%.

`poly_Rq_inv` and `poly_S3_inv` also need to be taken into consideration when designing the hardware. It is worth noting that all the `poly_*_inv` and `poly_*_mul` sub-functions inherently involve parallelism, since each term in the resulting polynomial can be computed independently of the others. In addition, `crypto_sort_int32` in encapsulation and `KEYGEN` is also selected for hardware implementation since it can be processed in a parallel manner. Conversely, the `randombytes` sub-function within the encapsulation and `KEYGEN` functions consists predominantly of sequential operations. Because its contribution to the overall execution time is also relatively minor, it has not been implemented in hardware. The `randombytes` sub-function is used for the seed generation before the `KEYGEN` and encapsulation functions. The `SHA3-hash` sub-function is also left to the software part because of its small execution time on the CPU. The optimizations that we have devised to efficiently process the two selected kinds of polynomial operations are presented in the following subsections.
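The shares quoted in this subsection can be re-derived from the latency-profile table. The following is a minimal sketch using the HPS2048509 key-generation measurements from that table; the dictionary layout and the helper name `share_percent` are illustrative choices, not part of the original tooling.

```python
# Key-generation latencies for HPS2048509, in milliseconds (from the latency table).
KEYGEN_TOTAL_MS = 43.8088
keygen_ms = {
    "poly_Rq_inv": 20.7199,
    "poly_S3_inv": 17.3077,
    "poly_Rq_mul": 5.4949,
    "crypto_sort_int32": 0.1359,
    "randombytes": 0.0407,
}


def share_percent(latency_ms, total_ms):
    """Return the share of total runtime as a percentage."""
    return 100.0 * latency_ms / total_ms


inversion_share = share_percent(
    keygen_ms["poly_Rq_inv"] + keygen_ms["poly_S3_inv"], KEYGEN_TOTAL_MS
)
multiplication_share = share_percent(keygen_ms["poly_Rq_mul"], KEYGEN_TOTAL_MS)

print(f"inversion: {inversion_share:.2f}%  multiplication: {multiplication_share:.2f}%")
```

For this parameter set the two inversion sub-functions together exceed 86% of the key-generation time, while polynomial multiplication stays near 12.5%.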

#### 4.3. Optimizing Bus Utilization

`poly_Rq_mul` sub-function using 1-ALU.

#### 4.4. Parallel Sub-Function Scheduling

#### 4.4.1. Key Generation Scheduling

The `KEYGEN` function in NTRU-KEM includes a series of `poly_inv` and `poly_mul` sub-functions. NTRU utilizes an efficient almost-inverse algorithm [29], which computes an almost inverse of a polynomial and yields an exact result across all NTRU-KEM functions. This inverse calculation primarily consists of iterating a swap operation based on the highest degree of a non-zero coefficient in each polynomial. Notably, the `poly_inv` sub-function does not require the multiplication or addition arrays, because the polynomial inverse sub-functions in the $R_2$ and $S_3$ ring moduli operate on coefficients that are only 1 or 2 bits in size. For the polynomial inverse in the $R_q$ ring modulus, the `poly_Rq_inv` result is obtained by combining the `poly_R2_inv` and `poly_R2_inv_to_Rq_inv` sub-functions. In these processes, we primarily use bit operations, `reg_array`, and the swap operation within the FSM.
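To make the swap-and-shift structure concrete, here is a software sketch of an almost-inverse computation over GF(2). It is not the FSM described here and is not constant-time. It assumes the modulus $\Phi_n = 1 + x + \dots + x^{n-1}$ and stores polynomials as Python integers (bit $i$ holds the coefficient of $x^i$). The names `clmul`, `polymod`, and `almost_inverse_r2` are hypothetical.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as bit masks."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out


def polymod(a, m):
    """Remainder of a modulo m in GF(2)[x]."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a


def almost_inverse_r2(a, n):
    """Invert a modulo Phi_n over GF(2) with the almost-inverse iteration."""
    phi = (1 << n) - 1  # Phi_n = 1 + x + ... + x^(n-1)
    k, b, c = 0, 1, 0
    f, g = polymod(a, phi), phi
    if f == 0:
        raise ValueError("polynomial is not invertible modulo Phi_n")
    while True:
        while (f & 1) == 0:  # strip factors of x from f
            f >>= 1
            c <<= 1
            k += 1
        if f == 1:
            break
        if f.bit_length() < g.bit_length():  # compare degrees, then swap
            f, g = g, f
            b, c = c, b
        f ^= g  # only XORs: no multiplier or adder array needed
        b ^= c
        if f == 0:
            raise ValueError("polynomial is not invertible modulo Phi_n")
    # Here a * b = x^k (mod Phi_n). Since x^n = 1 (mod Phi_n),
    # multiplying by x^((-k) mod n) removes the extra power of x.
    return polymod(clmul(b, 1 << ((-k) % n)), phi)


n = 11
a = 0b1011001
inv = almost_inverse_r2(a, n)
print(polymod(clmul(a, inv), (1 << n) - 1))
```

The loop body uses only shifts, XORs, a degree comparison, and a swap, which matches the observation that no multiplication or addition arrays are required for 1-bit coefficients. The final correction by a power of $x$ is what turns the "almost" inverse into an exact one.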

#### 4.4.2. Encapsulation Scheduling

`poly_mul` and `crypto_sort_int32` sub-functions. Initially, the function receives seed data $(r, m)$ and public key data $pk$. The `poly_mul` sub-function works with $r$ and $pk$, while `crypto_sort_int32` processes $m$, making it feasible for these sub-functions to operate simultaneously. However, efficient parallelization is challenging due to the limitations of the bus interface. Both sub-functions have idle time in the bus interface: the bus utilization rates are 50% for `poly_mul` and 80% for `crypto_sort_int32`. To manage their scheduling, the intervals of bus usage for `poly_mul` are spaced out in two-cycle units. Although the ALU operates every cycle, the output is stored in memory every four cycles, with one cycle for loading new data and another for storing results.

`poly_mul` and `crypto_sort_int32` sub-functions. This was performed by creating two-cycle intervals in which the bus interface is idle, during which the `crypto_sort_int32` sub-function operates. The bus interface for `crypto_sort_int32` was designed to match these two-cycle intervals of the `poly_mul` sub-function for optimal scheduling. By thoroughly analyzing the bus interface requirements of both sub-functions, we achieved efficient parallel scheduling. This led to a bus utilization rate of approximately 97% for the encapsulation function using the HPS4096821 parameter set. This optimization significantly improved our design's performance and eliminated the need for extra resources to increase the bus interface's bit-width.
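The interleaving idea can be sketched as a toy bus timeline. This is a simplified model, not the actual controller: it assumes a four-cycle period in which `poly_mul` occupies the first two bus cycles (one load, one store) and the sorter may use the remaining two. The slot ordering, the transfer counts, and the names `interleave_bus` and `utilization` are all assumptions for illustration.

```python
def interleave_bus(mul_periods, sort_transfers, period=4, mul_slots=2):
    """Return the per-cycle bus owner: 'M' (poly_mul), 'S' (sort) or '-' (idle)."""
    timeline = []
    remaining = sort_transfers
    for _ in range(mul_periods):
        timeline.extend("M" * mul_slots)  # load + store cycles of poly_mul
        for _ in range(period - mul_slots):  # idle window left by poly_mul
            if remaining > 0:
                timeline.append("S")
                remaining -= 1
            else:
                timeline.append("-")
    return timeline


def utilization(timeline):
    """Fraction of cycles in which the bus carries a transfer."""
    busy = sum(1 for owner in timeline if owner != "-")
    return busy / len(timeline)


alone = interleave_bus(mul_periods=4, sort_transfers=0)
shared = interleave_bus(mul_periods=4, sort_transfers=7)
print("".join(alone), utilization(alone))
print("".join(shared), utilization(shared))
```

With no sort traffic the model reproduces the 50% figure for `poly_mul` alone; filling the idle windows raises utilization without widening the bus. The exact 97% figure reported for HPS4096821 depends on the real transfer counts, which this sketch does not model.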

#### 4.4.3. Decapsulation Scheduling

`poly_Rq_mul`, `poly_S3_mul`, and `poly_Sq_mul` sub-functions are scheduled in a parallel and pipelined manner. Each of these sub-functions, which perform polynomial multiplication, has a bus interface utilization rate of about 50%. A key challenge is simultaneously executing two polynomial multiplication sub-functions, as this not only requires bus utilization but also additional ALU resources.

`poly_S3_mul` sub-function to be executed using bitwise operators, eliminating the need for an ALU. Using this knowledge, we parallelized the `poly_S3_mul` operation with the `poly_Rq_mul` and `poly_Sq_mul` sub-functions. Despite sharing the same bus utilization rate, the unique characteristics of `poly_S3_mul` allowed for efficient parallel scheduling.

`poly_S3_mul` sub-function become available once half of the `poly_Rq_mul` sub-function has completed, allowing for immediate input to the subsequent `poly_S3_mul` sub-function. Similarly, once half of the `poly_S3_mul` sub-function has completed, the input data for the `poly_Sq_mul` sub-function can be obtained. We applied a pipelined approach to coordinate these three sub-functions effectively.
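A rough latency model helps to see what this pipelining buys. The sketch below assumes three multiplication stages of equal duration, each starting as soon as its predecessor is half done; both assumptions are simplifications, and the helper names are hypothetical. The sequential figure used in the example is the 2-ALU decapsulation latency for HPS2048509 from the optimization table (v2).

```python
def sequential_latency(stage_latencies):
    """Total latency when stages run strictly one after another."""
    return sum(stage_latencies)


def pipelined_latency(stage_latencies, start_fraction=0.5):
    """Total latency when each stage starts once the previous stage
    has completed `start_fraction` of its work."""
    start = 0.0
    finish = 0.0
    for latency in stage_latencies:
        finish = max(finish, start + latency)
        start += start_fraction * latency
    return finish


# Split the v2 (2-ALU, unscheduled) decapsulation latency evenly over
# poly_Rq_mul, poly_S3_mul, and poly_Sq_mul (an idealization).
v2_decap_ms = 0.0497
stages = [v2_decap_ms / 3] * 3

print(f"sequential: {sequential_latency(stages):.4f} ms")
print(f"pipelined:  {pipelined_latency(stages):.4f} ms")
```

Under these assumptions the pipelined schedule takes two stage durations instead of three, i.e. two thirds of the sequential latency. That estimate of roughly 0.0331 ms is consistent with the decapsulation latency listed for the scheduled 2-ALU variant (v4) in the same table, although the model ignores the non-multiplication work.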

#### 4.5. Optimizing `Reg_Array` Utilization with Combined Structure

`KEYGEN` and encapsulation functions, the polynomial inversion and sorting sub-functions are never executed simultaneously as they are both scheduled alongside polynomial multiplication sub-functions. Thus, combining the control registers for these sub-functions does not hinder performance.

`KEYGEN` for polynomial inversion and multiplication, in encapsulation for sorting and polynomial multiplication, and in decapsulation for two parallel polynomial multiplication sub-functions, fewer registers were used (44, 40, and 56, respectively). Consequently, we were able to combine the control registers for two sub-functions into these 12 unused register arrays. This approach not only reduced the total number of registers required in the entire accelerator but also improved the utilization rate of the existing registers. Through this integrated management, we achieve a decrease in the design area without compromising performance, aligning with the goal of an area-efficient design.

## 5. Evaluation

#### 5.1. Impact of Our Optimization Methods

`poly_mul`, making it difficult to perform other tasks since all tasks need data to pass the bus interface. Consequently, parallelizing the sub-function scheduling becomes infeasible and leaves several registers unused, which, in turn, leads to reduced resource utilization.

`reg_array` technique and reached our optimized design (v5). While the overall latency remains similar, the area is reduced, thus resulting in better performance in terms of time × area compared to all the other variants.

#### 5.2. Performance Comparison with Prior Work

`x-net`, which is a performance-centric design set, and `comba`, one that focuses on area efficiency. For each set, these authors designed separate accelerators for each of the four parameters; thus, single designs cannot be used for a generic purpose (i.e., supporting all parameters). Their design areas for each parameter of NTRU-KEM ranged from $99.99 \times 10^{3}\ \mu\mathrm{m}^{2}$ to $762.17 \times 10^{3}\ \mu\mathrm{m}^{2}$ when combining the encapsulation module and the decapsulation module together.

`KEYGEN`. All the above extra features are supported in a single design with an efficient area usage of $39.60 \times 10^{3}\ \mu\mathrm{m}^{2}$. Note that even when compared to the area-efficient version, `comba`, our design uses at least 2.87× less area, while demonstrating, on average, 17.66× faster performance in terms of latency. Compared to the `x-net` version, our accelerator exhibited an average performance that was 7.69 times slower but achieved an average design area that was 15.98 times smaller. In terms of the time × area, our accelerator exhibits superior efficiency, surpassing that of both the `x-net` and `comba` accelerator versions by average factors of 2.09 and 48.66 across all parameters, respectively. It is worth noting that the mitigation of side-channel attacks (SCA) is beyond the scope of this work. However, future research could explore applying and optimizing the methods mentioned in [30] to mitigate SCA. Given our design's efficiency in terms of the time × area metric compared to previous works, we believe that our design will outperform others even after incorporating methods to mitigate SCA.
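The time × area figures can be reproduced from the comparison table. The sketch below assumes that the metric is (encapsulation latency + decapsulation latency) × area, which is inferred from the tabulated values rather than stated explicitly; the helper name `time_area` is hypothetical. Latencies are in ms and areas in $10^{3}\ \mu\mathrm{m}^{2}$, for the HPS2048509 parameter set.

```python
def time_area(encap_ms, decap_ms, area_kum2):
    """Time x area product: summed encap/decap latency times design area."""
    return (encap_ms + decap_ms) * area_kum2


ours = time_area(0.0313, 0.0538, 39.60)
x_net = time_area(0.0062, 0.0071, 460.25)
comba = time_area(0.3492, 1.0471, 102.97)

print(f"ours:  {ours:.2f}")
print(f"x-net: {x_net:.2f}  ratio {x_net / ours:.2f}x")
print(f"comba: {comba:.2f}  ratio {comba / ours:.2f}x")
```

The resulting ratios, about 1.82× for `x-net` and close to the tabulated 42.65× for `comba`, show how a design that is slower in absolute latency can still come out ahead once area is taken into account. Small differences in the last digits come from working with the rounded table entries.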

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Nielsen, M.A.; Chuang, I.L. Quantum Computation and Quantum Information: 10th Anniversary Edition; Cambridge University Press: Cambridge, UK, 2011.
2. Shor, P.W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. **1999**, 41, 303–332.
3. Kumar, M.; Pattnaik, P. Post quantum cryptography (PQC)-An overview. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 22–24 September 2020; pp. 1–9.
4. Raheman, F. The future of cybersecurity in the age of quantum computers. Future Internet **2022**, 14, 335.
5. Shinohara, N.; Moriai, S. Trends in Post-Quantum Cryptography: Cryptosystems for the Quantum Computing Era. The Magazine of New Breeze, 2019, pp. 9–11. Available online: https://www.ituaj.jp/wp-content/uploads/2019/01/nb31-1_web-05-Special-TrendsPostQuantum.pdf (accessed on 14 November 2023).
6. Yaman, F.; Mert, A.C.; Öztürk, E.; Savaş, E. A hardware accelerator for polynomial multiplication operation of CRYSTALS-KYBER PQC scheme. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 1020–1025.
7. Hoffstein, J.; Pipher, J.; Silverman, J.H. NTRU: A ring-based public key cryptosystem. In Algorithmic Number Theory, Proceedings of the Third International Symposium, ANTS-III, Portland, OR, USA, 21–25 June 1998; Springer: Berlin/Heidelberg, Germany, 2006; pp. 267–288.
8. Rivest, R.L.; Shamir, A.; Adleman, L. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM **1978**, 21, 120–126.
9. Diffie, W.; Hellman, M.E. New directions in cryptography. In Democratizing Cryptography: The Work of Whitfield Diffie and Martin Hellman; Morgan & Claypool: San Rafael, CA, USA, 2022; pp. 365–390.
10. Dang, V.B.; Farahmand, F.; Andrzejczak, M.; Gaj, K. Implementing and benchmarking three lattice-based post-quantum cryptography algorithms using software/hardware codesign. In Proceedings of the 2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China, 9–13 December 2019; pp. 206–214.
11. Kannwischer, M.J.; Rijneveld, J.; Schwabe, P. Faster multiplication in $\mathbb{Z}_{2^m}[x]$ on Cortex-M4 to speed up NIST PQC candidates. In Proceedings of the International Conference on Applied Cryptography and Network Security, Bogota, Colombia, 5–7 June 2019; pp. 281–301.
12. He, P.; Tu, Y.; Khalid, A.; O'Neill, M.; Xie, J. HPMA-NTRU: High-Performance Polynomial Multiplication Accelerator for NTRU. In Proceedings of the 2022 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Austin, TX, USA, 19–21 October 2022; pp. 1–6.
13. Qin, Z.; Tong, R.; Wu, X.; Bai, G.; Wu, L.; Su, L. A compact full hardware implementation of PQC algorithm NTRU. In Proceedings of the 2021 International Conference on Communications, Information System and Computer Engineering (CISCE), Beijing, China, 14–16 May 2021; pp. 792–797.
14. Farahmand, F.; Dang, V.B.; Nguyen, D.T.; Gaj, K. Evaluating the potential for hardware acceleration of four NTRU-based key encapsulation mechanisms using software/hardware codesign. In Proceedings of the Post-Quantum Cryptography: 10th International Conference, PQCrypto 2019, Chongqing, China, 8–10 May 2019; pp. 23–43.
15. Antognazza, F.; Barenghi, A.; Pelosi, G.; Susella, R. A Flexible ASIC-oriented Design for a Full NTRU Accelerator. In Proceedings of the 28th Asia and South Pacific Design Automation Conference, Tokyo, Japan, 16–19 January 2023; pp. 591–597.
16. Kostalabros, V.; Ribes-González, J.; Farràs, O.; Moretó, M.; Hernandez, C. HLS-based HW/SW co-design of the post-quantum Classic McEliece cryptosystem. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 52–59.
17. Schöffel, M.; Feldmann, J.; Wehn, N. Code-based Cryptography in IoT: A HW/SW Co-Design of HQC. arXiv **2023**, arXiv:2301.04888.
18. Lee, Y.; Nam, K.; Joo, Y.; Kim, J.; Oh, H.; Paek, Y. Area-Efficient Accelerator for the Full NTRU-KEM Algorithm. In Proceedings of the International Conference on Computational Science and Its Applications, Athens, Greece, 3–6 July 2023; pp. 186–201.
19. Riazi, M.; Laine, K.; Pelton, B.; Dai, W. HEAX: An Architecture for Computing on Encrypted Data. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 16–20 March 2020.
20. Nam, K.; Oh, H.; Moon, H.; Paek, Y. Accelerating N-Bit Operations over TFHE on Commodity CPU-FPGA. In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, San Diego, CA, USA, 30 October–3 November 2022; pp. 1–9.
21. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; pp. 409–437.
22. Chillotti, I.; Gama, N.; Georgieva, M.; Izabachène, M. TFHE: Fast Fully Homomorphic Encryption over the Torus. J. Cryptol. **2020**, 33, 34–91.
23. Prest, T.; Fouque, P.A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Pornin, T.; Ricosset, T.; Seiler, G.; Whyte, W.; Zhang, Z. Falcon. Post-Quantum Cryptography Project of NIST. 2020. Available online: https://csrc.nist.gov/projects/post-quantum-cryptography/selected-algorithms-2022 (accessed on 14 November 2023).
24. Zhang, N.; Yang, B.; Chen, C.; Yin, S.; Wei, S.; Liu, L. Highly efficient architecture of NewHope-NIST on FPGA using low-complexity NTT/INTT. In IACR Transactions on Cryptographic Hardware and Embedded Systems; IACR: Santa Barbara, CA, USA, 2020; pp. 49–72. Available online: https://ches.iacr.org/2020/index.php (accessed on 14 November 2023).
25. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 14–16 June 2021; pp. 94–101.
26. Cooley, J.W.; Lewis, P.A.W.; Welch, P. The fast Fourier transform algorithm: Programming considerations in the calculation of sine, cosine and Laplace transforms. J. Sound Vib. **1970**, 12, 315–337.
27. Becoulet, A.; Verguet, A. A depth-first iterative algorithm for the conjugate pair fast Fourier transform. IEEE Trans. Signal Process. **2021**, 69, 1537–1547.
28. Chung, C.M.M.; Hwang, V.; Kannwischer, M.J.; Seiler, G.; Shih, C.J.; Yang, B.Y. NTT multiplication for NTT-unfriendly rings: New speed records for Saber and NTRU on Cortex-M4 and AVX2. In IACR Transactions on Cryptographic Hardware and Embedded Systems; IACR: Santa Barbara, CA, USA, 2021; pp. 159–188. Available online: https://ches.iacr.org/2021/index.php (accessed on 14 November 2023).
29. Schroeppel, R.; Orman, H.; O'Malley, S.; Spatscheck, O. Fast key exchange with elliptic curve systems. In Proceedings of the Advances in Cryptology, CRYPTO '95: 15th Annual International Cryptology Conference, Santa Barbara, CA, USA, 27–31 August 1995; Springer: Berlin/Heidelberg, Germany, 2001; pp. 43–56.
30. Standaert, F.X. Introduction to side-channel attacks. In Secure Integrated Circuits and Systems; Springer: Berlin/Heidelberg, Germany, 2010; pp. 27–42.

**Figure 4.** Parallel sub-function scheduling between `poly_Rq_mul` and `crypto_sort_int32` with 2-ALU.

Parameter (per NTRU variant) | HPS2048509 | HPS2048677 | HPS4096821 | HRSS701 |
---|---|---|---|---|
N | 509 | 677 | 821 | 701 |
q | 2048 | 2048 | 4096 | 8192 |

Function | HPS2048509 (ms) | HPS2048509 (%) | HPS2048677 (ms) | HPS2048677 (%) | HPS4096821 (ms) | HPS4096821 (%) | HRSS701 (ms) | HRSS701 (%) |
---|---|---|---|---|---|---|---|---|
**Key Generation** | 43.8088 | 100 | 77.0217 | 100 | 114.3290 | 100 | 82.5875 | 100 |
poly_Rq_inv | 20.7199 | 47.30 | 36.4755 | 47.36 | 53.7846 | 47.04 | 39.3617 | 47.66 |
poly_R2_inv | . | . | . | . | . | . | . | . |
poly_R2_inv_to_Rq_inv | . | . | . | . | . | . | . | . |
poly_S3_inv | 17.3077 | 39.51 | 30.4974 | 39.60 | 44.8878 | 39.26 | 32.7177 | 39.61 |
poly_Rq_mul | 5.4949 | 12.54 | 9.7996 | 12.72 | 14.2652 | 12.48 | 10.3971 | 12.59 |
crypto_sort_int32 | 0.1359 | 0.31 | 0.1999 | 0.26 | 0.2534 | 0.22 | - | - |
randombytes | 0.0407 | 0.09 | 0.0493 | 0.06 | 0.0605 | 0.05 | 0.0236 | 0.03 |
**Encapsulation** | 1.3186 | 100 | 2.2319 | 100 | 3.2032 | 100 | 2.1644 | 100 |
poly_Rq_mul | 1.0974 | 83.23 | 1.9365 | 86.76 | 2.8308 | 88.37 | 2.0700 | 95.64 |
crypto_sort_int32 | 0.1333 | 10.11 | 0.1985 | 8.89 | 0.2520 | 7.87 | - | - |
randombytes | 0.0374 | 2.84 | 0.0483 | 2.16 | 0.0586 | 1.83 | 0.0217 | 1.00 |
SHA3-hash | 0.0076 | 0.57 | 0.0092 | 0.41 | 0.0098 | 0.30 | 0.0094 | 0.43 |
etc. | 0.0430 | 3.26 | 0.0395 | 1.77 | 0.0520 | 1.62 | 0.0633 | 2.92 |
**Decapsulation** | 3.3900 | 100 | 5.9193 | 100 | 8.6410 | 100 | 6.3532 | 100 |
poly_Rq_mul | 1.0936 | 32.26 | 1.9678 | 33.24 | 2.8410 | 32.88 | 2.0849 | 32.82 |
poly_S3_mul | 1.1082 | 32.69 | 1.9434 | 32.83 | 2.8354 | 32.81 | 2.0797 | 32.73 |
poly_Sq_mul | 1.1032 | 32.54 | 1.9404 | 32.78 | 2.8310 | 32.76 | 2.0727 | 32.62 |
SHA3-hash | 0.0207 | 0.64 | 0.0269 | 0.45 | 0.0282 | 0.33 | 0.0294 | 0.46 |
etc. | 0.0644 | 1.90 | 0.0408 | 0.69 | 0.1054 | 1.22 | 0.0866 | 1.36 |

# | Description | Parameters | Register (16-bit) | Area ($10^{3}\ \mu\mathrm{m}^{2}$) | $t_{Keygen}$ (ms) | $t_{Enc}$ (ms) | $t_{Dec}$ (ms) | Time × Area |
---|---|---|---|---|---|---|---|---|
v1 [18] | Reference (1-ALU) | HPS2048509 | 28 | 29.82 | 0.6671 | 0.0507 | 0.0993 | 4.47 |
v1 [18] | Reference (1-ALU) | HPS2048677 | 28 | 29.82 | 1.1994 | 0.0845 | 0.1768 | 7.79 |
v1 [18] | Reference (1-ALU) | HPS4096821 | 28 | 29.82 | 1.7418 | 0.1187 | 0.2587 | 11.25 |
v1 [18] | Reference (1-ALU) | HRSS701 | 28 | 29.82 | 1.2598 | 0.0624 | 0.1873 | 7.45 |
v2 | 2-ALU | HPS2048509 | 56 | 45.15 | 0.4555 | 0.0341 | 0.0497 | 3.78 |
v2 | 2-ALU | HPS2048677 | 56 | 45.15 | 0.8228 | 0.0551 | 0.0884 | 6.48 |
v2 | 2-ALU | HPS4096821 | 56 | 45.15 | 1.1908 | 0.0755 | 0.1294 | 9.25 |
v2 | 2-ALU | HRSS701 | 56 | 45.15 | 0.8609 | 0.0312 | 0.0936 | 5.64 |
v3 | 4-ALU | HPS2048509 | 112 | 60.48 | 0.3497 | 0.0258 | 0.0248 | 3.07 |
v3 | 4-ALU | HPS2048677 | 112 | 60.48 | 0.6346 | 0.0403 | 0.0442 | 5.11 |
v3 | 4-ALU | HPS4096821 | 112 | 60.48 | 0.9152 | 0.0540 | 0.0647 | 7.18 |
v3 | 4-ALU | HRSS701 | 112 | 60.48 | 0.6615 | 0.0156 | 0.0468 | 3.77 |
v4 | 2-ALU + Parallel Sub-Function Scheduling | HPS2048509 | 56 | 45.15 | 0.4452 | 0.0238 | 0.0331 | 2.57 |
v4 | 2-ALU + Parallel Sub-Function Scheduling | HPS2048677 | 56 | 45.15 | 0.8044 | 0.0367 | 0.0589 | 4.32 |
v4 | 2-ALU + Parallel Sub-Function Scheduling | HPS4096821 | 56 | 45.15 | 1.1638 | 0.0486 | 0.0862 | 6.09 |
v4 | 2-ALU + Parallel Sub-Function Scheduling | HRSS701 | 56 | 45.15 | 0.8414 | 0.0312 | 0.0624 | 4.23 |
v5 | 2-ALU + Parallel Sub-Function Scheduling + Combined Register | HPS2048509 | 56 | 39.60 | 0.4452 | 0.0238 | 0.0331 | 2.25 |
v5 | 2-ALU + Parallel Sub-Function Scheduling + Combined Register | HPS2048677 | 56 | 39.60 | 0.8044 | 0.0367 | 0.0589 | 3.79 |
v5 | 2-ALU + Parallel Sub-Function Scheduling + Combined Register | HPS4096821 | 56 | 39.60 | 1.1638 | 0.0486 | 0.0862 | 5.34 |
v5 | 2-ALU + Parallel Sub-Function Scheduling + Combined Register | HRSS701 | 56 | 39.60 | 0.8414 | 0.0312 | 0.0624 | 3.71 |

Design | Parameters | Encap latency (ms) | Decap latency (ms) | Area ($10^{3}\ \mu\mathrm{m}^{2}$) | Time × Area | Time × Area ratio (vs. ours) |
---|---|---|---|---|---|---|
Ours (SW-SHA3 + HW-Ours-v5) | HPS2048509 | 0.0313 | 0.0538 | 39.60 | 3.37 | - |
Ours (SW-SHA3 + HW-Ours-v5) | HPS2048677 | 0.0458 | 0.0858 | 39.60 | 5.21 | - |
Ours (SW-SHA3 + HW-Ours-v5) | HPS4096821 | 0.0583 | 0.1145 | 39.60 | 6.84 | - |
Ours (SW-SHA3 + HW-Ours-v5) | HRSS701 | 0.0443 | 0.0918 | 39.60 | 5.39 | - |
x-net [15] | HPS2048509 | 0.0062 | 0.0071 | 460.25 | 6.12 | 1.82× |
x-net [15] | HPS2048677 | 0.0084 | 0.0094 | 580.47 | 10.33 | 1.98× |
x-net [15] | HPS4096821 | 0.0102 | 0.0115 | 728.85 | 15.82 | 2.31× |
x-net [15] | HRSS701 | 0.0041 | 0.0117 | 762.17 | 12.04 | 2.24× |
comba [15] | HPS2048509 | 0.3492 | 1.0471 | 102.97 | 143.78 | 42.65× |
comba [15] | HPS2048677 | 0.6161 | 1.8476 | 101.04 | 245.62 | 47.75× |
comba [15] | HPS4096821 | 0.9047 | 2.7135 | 99.99 | 361.79 | 52.87× |
comba [15] | HRSS701 | 0.6604 | 1.9834 | 104.69 | 276.78 | 51.38× |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lee, Y.; Youn, J.; Nam, K.; Oh, H.; Paek, Y.
Optimizing Hardware Resource Utilization for Accelerating the NTRU-KEM Algorithm. *Computers* **2023**, *12*, 259.
https://doi.org/10.3390/computers12120259
