Compact and Low-Latency FPGA-Based Number Theoretic Transform Architecture for CRYSTALS Kyber Postquantum Cryptography Scheme

Binh Kieu-Do-Nguyen; Nguyen The Binh; Cuong Pham-Quoc; Huynh Phuc Nghi; Ngoc-Thinh Tran; Trong-Thuc Hoang; Cong-Kha Pham

doi:10.3390/info15070400

,

and

¹

Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet St., Dist. 10, Ho Chi Minh City 740050, Vietnam

²

Computer Engineering Department, Vietnam National University—Ho Chi Minh City (VNU-HCM), Thu Duc, Ho Chi Minh City 700000, Vietnam

³

Department of Computer and Network Engineering, University of Electro-Communications (UEC), Tokyo 182-8585, Japan

^*

Author to whom correspondence should be addressed.

Information2024, 15(7), 400;https://doi.org/10.3390/info15070400

This article belongs to the Special Issue Software Engineering and Green Software

Version Notes

Order Reprints

Abstract

In the modern era of the Internet of Things (IoT), especially with the rapid development of quantum computers, the implementation of postquantum cryptography algorithms in numerous terminals allows them to defend against potential future quantum attack threats. Lattice-based cryptography can withstand quantum computing attacks, making it a viable substitute for the currently prevalent classical public-key cryptography technique. However, the algorithm’s significant time complexity places a substantial computational burden on the already resource-limited chip in the IoT terminal. In lattice-based cryptography algorithms, the polynomial multiplication on the finite field is well known as the most time-consuming process. Therefore, investigations into efficient methods for calculating polynomial multiplication are essential for adopting these quantum-resistant lattice-based algorithms on a low-profile IoT terminal. Number theoretic transform (NTT), a variant of fast Fourier transform (FFT), is a technique widely employed to accelerate polynomial multiplication on the finite field to achieve a subquadratic time complexity. This study presents an efficient FPGA-based implementation of number theoretic transform for the CRYSTAL Kyber, a lattice-based public-key cryptography algorithm. Our hybrid design, which supports both forward and inverse NTT, is able run at high frequencies up to 417 MHz on a low-profile Artix7-XC7A100T and achieve a low latency of

1.10 μ

s while achieving state-of-the-art hardware efficiency, consuming only 541-LUTs, 680 FFs, and four 18 Kb BRAMs. This is made possible thanks to the newly proposed multilevel pipeline butterfly unit architecture in combination with employing an effective coefficient accessing pattern.

Keywords:

field-programmable gate array; postquantum cryptography; number theoretic transform; CRYSTALS Kyber; lightweight design

1. Introduction

Security is a crucial concern in computer systems for safeguarding confidential data from the ever-changing landscape of cyber attacks. The conventional encryption methods, which depend on intricate mathematical issues that pose challenges for classical computers to answer, effectively fulfill their purpose of safeguarding sensitive data from malevolent entities. Quantum computers, which rely on the principles of quantum physics, can disrupt the equilibrium between defenders and attackers by significantly accelerating some calculations. By implementing Shor’s algorithm [] and Grover’s algorithm [] on quantum computers, standard public key cryptography (PKC) systems such as RSA [] and elliptic curve cryptography (ECC) [] may be vulnerable. Therefore, replacing them with postquantum cryptography (PQC) is necessary. CRYSTALS Kyber is a cryptographic algorithm commonly referred to as Kyber. It is documented in the source []. The designation of Kyber as a postquantum key encapsulation mechanism (KEM) protocol is due to its classification within the family of lattice-based cryptography (LBC). The fundamental challenge that Kyber addresses is module learning with error (M-LWE). Kyber is a cryptographic system built on lattices and relies on the M-LWE problem. Kyber’s security is based on the premise that addressing learning with error (LWE) problems is challenging, even for quantum computers. The hardness of the material contributes to its resistance to quantum cryptanalysis. Generally, systems based on the M-LWE offer enhanced security [] and improved performance compared to LWE schemes []. In this method, polynomials are typically defined over the ring

Z_{q} [x] / (x^{n} + 1)

, with the value n indicating the size of the public-key matrix and the security levels. Kyber provides three sets of parameters: Kyber-512, Kyber-768, and Kyber-1024. These factors define the dimensions of the mathematical entities employed and impact the algorithm’s security and efficiency. The Kyber-768 algorithm provides an optimal equilibrium between robustness and effectiveness.

Performing polynomial multiplication within the ring is an expensive calculation, both in terms of hardware and software implementation. This holds true not only for the Kyber cryptography algorithm but also for other cryptographic algorithms. Polynomial multiplication can be computed using several algorithms, including the schoolbook algorithm [,,], Karatsuba algorithm [,], Toom–Cook algorithm [,], and the number theoretictTransform (NTT) algorithm [,,]. Of these algorithms, the schoolbook algorithm is the most basic and straightforward. However, it has a quadratic complexity of

𝒪 (n^{2})

, where n represents the length of the polynomials. The Karatsuba algorithm employs the “divide-and-conquer” approach by splitting the original polynomial into two sections, resulting in a computational complexity of

𝒪 (n^{1.58})

. The Toom–Cook algorithm is an extension of the Karatsuba algorithm. The Toom–Cook algorithm differs from the Karatsuba algorithm by dividing the original polynomial into k parts. This division results in a complexity of

𝒪 (n^{r})

, where

r = {log}_{k} (2 k - 1)

. The discrete Fourier transform (DFT) can be employed for polynomial multiplication. However, the computational complexity of directly computing the DFT is

𝒪 (n^{2})

, which is comparable to that of the schoolbook technique.

Over the past few decades, numerous efficient algorithms for calculating the DFT have been introduced, which are currently referred to as the fast Fourier transform (FFT). The concept of the FFT was initially formulated by Carl Friedrich Gauss in an unpublished manuscript in 1805 []. The FFT gained significant recognition in 1965 when Cooley and Tukey independently introduced the Cooley–Tukey method, a rapid technique for DFT. The NTT is a specific instance of DFT that operates on a finite field []. Similarly, the majority of FFT approaches can be utilized for NTT, leading to the development of efficient algorithms for number theoretic transformations. From an implementation perspective, the FFT and NTT are the most efficient techniques for calculating polynomial multiplication of large degrees. This is because they have a quasilinear complexity of

𝒪 (n log n)

, which gives them a major advantage over other multiplication algorithms. However, due to the fact that the FFT is carried out in the complex domain, the floating point operations involved in the computation may introduce inaccuracies in rounding precision when multiplying two polynomials with integer coefficients. Conversely, the calculations involved in NTT computing exclusively utilize integers, circumventing any potential precision issues.

Most of the polynomials used in lattice-based schemes are created using integer coefficients. It is evident that NTT is particularly well suited for implementing these schemes. Most schemes that utilize lattices with algebraic structures tend to favor NTT-based multiplication because of its numerous benefits, in addition to its quasilinear complexity and utilization of integer arithmetic. Nevertheless, certain lattice-based techniques are unable to utilize NTT directly due to the parameter requirements imposed by NTT. NTT utilizes negative wrapped convolution (NWC). The current settings of Kyber, which is the sole KEM specified by the National Institute of Standards and Technology (NIST), only partially satisfy the condition

q \equiv 1 (\mod n)

, therefore preventing the full utilization of the NTT based on the number theoretic cryptography approach.

The process of encrypting and decrypting in CRYSTALS Kyber is significantly dependent on performing polynomial multiplication within a certain mathematical ring. In order to address these problems, Kyber employs NTT to carry out a crucial operation with optimal efficiency. NTT functions as an optimization tool. It converts polynomials from their standard form into an alternative representation that allows for significantly faster multiplication. Using NTT greatly enhances the velocity and effectiveness of polynomial multiplication in CRYSTALS Kyber. This results in accelerated key generation, encapsulation, and decapsulation procedures. The term “improve NTT” is used to enhance the overall performance of the CRYSTALS Kyber PQC scheme. In this study, we introduced a new NTT architecture to expedite the polynomial matrix–vector multiplication in Kyber. The supplied accelerator achieves reduced latency and resource consumption compared to current works on the same platform.

1.1. Related Work

The first challenge with implementing iterative FFT or NTT, the Cooley–Tukey FFT algorithm [] to be specific, is handling bit-reversed order preprocessing efficiently, which has a non-negligible impact on performance. Some studies, such as [], addressed this problem by using a ping-pong memory access scheme between two BRAMs and were able to achieve a

3.95 \times

faster and at least

1.5 \times

smaller area time product (ATP) than the state-of-the-art methods. Another more recent work [] approached the problem by arranging the coefficients into three FIFOs and is also able to achieve a BRAM-free NTT architecture. Most notably, a recent work [] used registers to store the coefficients and used multiplexers to index them while completely eliminated both the use of BRAMs and DSPs. However, since the final goal of these transforms is to merely perform polynomial multiplication, there are some tricks that we can exploit. The reference implementation of CRYSTALS Kyber [], which is implemented in software, suggests that we can directly perform the Cooley–Tukey FFT on the natural order in the forward pass, obtaining the result in bit-reversed order and performing pointwise multiplication in that order, and finally performing the Gentleman–Sande algorithm in the inverse to get the multiplication result in natural order.

The second challenge of implementing NTT is the butterfly unit, or efficiently handling modular multiplication or modular reduction of the product between a coefficient and the root of unity. The reference implementation [] suggests that we replace all modular multiplications with Montgomery multiplication, basically performing all operations in an intermediate, reduction-friendly form and performing postprocessing in the end to obtain the result in standard form. A recent work [] also utilized the traditional Montgomery multiplication approach and achieved a low cycle count for the NTT operation at only about 128 clock cycles. On the other hand, a work [] developed a K-RED algorithm that can perform modular reduction efficiently and was optimized by [] along with the three FIFOs strategy to achieve 16.7% higher hardware efficiency than the state-of-the-art method. Moreover, authors [] proposed

K^{2} - R E D

, which is based on K-RED and achieves better hardware resource utilization.

In general, all studies so far have mainly focused on finding the optimal data access pattern of the iterative FFT and optimizing the modular reduction on the butterfly unit to make the whole NTT core faster and/or more compact. In this paper, we propose a high-frequency, fully pipelined, butterfly-unit NTT accelerator architecture for CRYSTALS Kyber version 3, with

n = 256

and

q = 3329

, for the FPGA platform, to achieve lower latency in the critical exchange process to which these operations belong.

1.2. Key Contributions

The key contributions in this work can be summarized as follows:

1.: We developed a new data accessing pattern on the NTT algorithm as well as appropriate reordering in order to reduce the BRAM to just the LUT-RAM of the FPGA Architecture, which can support shallow-depth and long-width requirements for unrolling.
2.: We also proposed two novel butterfly units that are DSP-free and have low resource utilization. The most expensive operation of the butterfly unit in NTT, in terms of resources and time, is the modular multiplication of the coefficient with the root of unities. The first approach we developed utilizes the fact that, with parameter $q = 3329 < 2^{12}$ , which means the coefficient is up to $12 b i t s$ long, we can split the parameter to the sum of multiple precomputed products with the roots of unity and can completely eliminate the full multiplication and as well as the need for dedicated storage for the root of unitie.
3.: The second butterfly unit we constructed utilizes quarter square multiplication to perform modular multiplication $x y = {(x + y)}^{2} / 4 - {(x - y)}^{2} / 4$ , by realizing the fact that although a $12 \times 12$ multiplication typically requires a $2^{24}$ -depth look-up table, $12 - b i t$ squares only requires $2^{12}$ . Therefore, by replacing multiplication by squares with additional processing, we can fit the quarter squares on a single dual-port ROM that fits neatly onto one BRAM. This approach also avoids the need for the full $12 \times 12$ product to be computed and reduced.
4.: We developed a detailed pipelined butterfly unit based on the proposed approach with a short critical path that can thus operate at a higher frequency.
5.: Finally, we synthesized place and route (PnR) and verified its functionality on an Xilinx Artix-7 FPGA, which can run at up to 417 MHz.

2. Background Knowledge

FFT traditionally operates on the vector of complex numbers and can be utilized to perform fast convolution with subquadratic time complexity. Fortunately, the same algorithm can be applied on an

n - 1

-degree polynomial with coefficients in any finite field

Z_{q}

with the nth root of unity to achieve fast multiplication. The variant of the FFT that operates on the finite field is called the NTT.

For CRYSTALS Kyber version 2 and later, the parameters are

q = 3329

and

n = 256

, and the polynomial is operated on the ring

Z_{3329} [x] / (x^{256} + 1)

. However, since

Z_{3329}

only has up to 256 primitive roots of unity, namely,

ζ = 17

, an NTT-based multiplication can only give us the result modulo

x^{256} - 1

. Therefore, the modulus is first factored as

x^{256} + 1 = \prod_{i = 0}^{127} (x^{2} - ζ^{2 {br}_{7} (i) + 1})

and the 256-coefficient polynomial is transformed to a vector of 128 degree-1 polynomial modulo

x^{2} - ζ^{2 {br}_{7} (i) + 1}

. The pointwise multiplication process now becomes “degree-1-wise” multiplication.

Furthermore, the traditional iterative FFT algorithm (Cooley–Tukey) requires bit-reversal preprocessing on both the forward and inverse transform; this introduces non-negligible impacts on hardware utilization and performance. Therefore, in the reference implementation of CRYSTALS Kyber, the forward transform (forward NTT) is performed using the Cooley–Tukey algorithm and outputs the result in reverse order, which then is used to perform pointwise multiplication in the same reversed order, and finally the Gentleman–Sande algorithm, which takes in reversed order vector, is used. The final result is output in natural order, as expected. The forward and inverse NTT algorithms are described in Algorithm 1 and Algorithm 2, respectively.

Algorithm 1: Cooley–Tukey forward NTT

Algorithm 2: Gentleman–Sande inverse NTT

3. Proposed NTT Architecture

The proposed NTT core design performs a number theoretic transform with parameters

n = 256, q = 3329

(CRYSTALS Kyber Version 3). Thus, each polynomial being transformed is a vector of 256 12-bit numbers. To improve the throughput, the core utilizes a single, simple dual-port SRAM, which has one read port and one write port, with a depth of 64 and a width of 48 bits; each entry stores four 12-bit polynomial coefficients to perform operations on two pairs of coefficients at a time.

On the Xilinx 7-series FPGA, this SRAM is inferred to multiple instances of LUT-RAM, and this design requires no BRAM usage on the device. Since the butterfly operation requires fetching the coefficient in a nonlinear manner, a special access pattern from the control/address generation and realignment from the realign buffer is required and is handled both in pre and postbuffer operation.

The butterfly calculation pipeline consists of fetching the data from outside (the first NTT iteration) or coefficient SRAM, going through the realign buffer to permute the data in the correct order and then feeding the butterfly unit to perform the butterfly calculation. After the operation, the coefficient is converted to either write back and write back the next iteration or stream out of the core in the last iteration of the NTT, as shown in Figure 1.

Figure 1. Control/data flow of the proposed NTT core.

For I/O operations, the core takes in a stream of four 12-bit coefficients with a ready signal to indicate that it is ready to accept data, and when it is finished, the core outputs a stream of four 12-bit data with an associated data valid indication.

3.1. Butterfly Unit

In the forward NTT operatio described in Algorithm 1, the inner loop unit performs the operation (1), which is called a butterfly calculation.

\{\begin{matrix} r [i] & = (r [i] + r [i + l e n] ζ [k]) mod q \\ r [i + l e n] & = (r [i] - r [i + l e n] ζ [k]) mod q \end{matrix}

(1)

where k is an integer from 1 to 127; the table of

ζ

is precomputed and stored in ROM.

Similarly, the butterfly calculation of the inverse NTT (INTT) is defined in Equation (2).

\{\begin{matrix} r [i] & = (r [i + l e n] + r [i]) 2^{- 1} mod q \\ r [i + l e n] & = (r [i + l e n] - r [i]) ζ [k] 2^{- 1} mod q \end{matrix}

(2)

The most significant bottleneck of the butterfly unit is the modular multiplication of

ζ [k]

. If we perform this operation naïvely, the steps involve a 12 × 12 multiplication, which is most efficiently achieved by using a DSP on FPGA and then somehow reducing this 24-bit product mod q.

On the reference implementation of Kyber, which is optimized for AVX2 (software) version 2.0, this is achieved by performing the NTT and PWM in Montgomery form, replacing all modular multiplications with Montgomery multiplications, and then converting back to normal form at the INTT stage.

In this work, we developed two solutions with different hardware tradeoffs to perform this particular modular multiplication efficiently.

3.1.1. Multiplicationless Method

This approach does not require the full multiplication of 12 × 12 for

r [i + l e n] ζ [k]

but instead uses a look-up table to achieve the modular multiplication. The multiplication of a 12-bit unsigned number

y [11 : 0]

by

ζ [k]

,

t = y [11 : 0] ζ [k]

, can be expanded as shown in Equation (3).

t = y [11 : 9] ζ [k] 2^{9} + y [8 : 6] ζ [k] 2^{6} + y [5 : 3] ζ [k] 2^{3} + y [2 : 0] ζ [k]

(3)

Every term in Equation (3) has 1024 possible outcomes that are 12-bit each. Therefore, we can precompute all these terms and store them as four 1024 × 12 ROMs indexed by

{i [0 : 2], k}

, namely, luts9, luts6, luts3, and luts0, with the value, as described in Equation (4). Equation (3) now becomes Equation (5).

\{\begin{matrix} luts 9 [{i [0 : 2], k}] : = i \times ζ [k] 2^{9} mod q \\ luts 6 [{i [0 : 2], k}] : = i \times ζ [k] 2^{6} mod q \\ luts 3 [{i [0 : 2], k}] : = i \times ζ [k] 2^{3} mod q \\ luts 0 [{i [0 : 2], k}] : = i \times ζ [k] 2^{0} mod q \end{matrix}

(4)

t = luts 9 [{y [11 : 9], k}] + luts 9 [{y [8 : 6], k}] + luts 9 [{y [5 : 3], k}] + luts 0 [{y [2 : 0], k}]

(5)

Therefore, by using the precomputed table, we eliminate the need to perform 12 × 12 multiplication and the need to store the zeta array explicitly since it is already embedded in the multiplication look-up table. Equation (5) yields a 14-bit result, where

0 \leq t \leq 4 (q - 1)

; to reduce it to mod

q = 3329

, we simply perform conditional subtraction by

2 q

and then by q. Thus, this variant of the butterfly unit is implemented as shown in Figure 2, where the modular multiplication is handled by a multiplicationless zeta[k] submodule implemented as shown in Figure 3; the luts0, luts3, luts6, and luts9 ROMs store the values described in Equation (4).

Figure 2. Multiplicationless butterfly unit.

Figure 3. Mult-less zeta[k] submodule.

This method requires either four single-port 2 KB ROMs per butterfly unit or four dual-port 2 KB ROMs per two butterfly units. Since Xilinx’s BRAM does have the capability of being dual-port ROMs, this work used the latter forms to achieve higher throughput and the maximum utilization of Xilinx’s BRAMs, where the two butterfly units share the same ROMs as the look-up table. In our testing, the synthesis tool was able to automatically fuse two identical single-port ROMs inside two instances of the butterfly unit into a single dual-port ROM without any explicit RTL code change.

3.1.2. Quarter Square Multiplication Method

All 12 × 12 modular multiplications,

a b mod q

, could be naïvely stored and looked up by using a

2^{24} \times 12

ROM, which is very space-inefficient. On the other hand, a 12-bit modular square,

x^{2} mod q

, only requires a

2^{12} \times 12

ROM, realizing the quarter square multiplication identity (6) of two numbers

x, y

.

\frac{{(x + y)}^{2}}{4} - \frac{{(x - y)}^{2}}{4} = \frac{(x^{2} + 2 x y + y^{2}) - (x^{2} - 2 x y + y^{2})}{4} = x y

(6)

We can algebraically transform a multiplication to a difference of two quarter squares using the identity (6). Since

q = 3329

is an odd prime, the inverse of 4 modulo q exists, and the same identity (6) can be applied to perform modular multiplication (7) in the finite field

Z_{3329}

.

a b mod q = {(a + b)}^{2} 4^{- 1} - {(a - b)}^{2} 4^{- 1} mod q

(7)

Therefore, the 12 × 12 modular multiplication

a b mod q

can be realized using a single

2^{12} \times 12

quarter square ROM using the identity (7). Furthermore, the quarter squares for both sum

a + b

and difference

a - b

can be looked up simultaneously in a single clock cycle using a dual-port ROM, as shown in Equation (8).

\{\begin{matrix} {lut}_{x^{2} 4^{- 1} mod q} (n) & : = n^{2} 4^{- 1} mod q, 0 \leq n < 2^{12} \\ a b mod q & = {lut}_{x^{2} 4^{- 1} mod q} (a + b) - {lut}_{x^{2} 4^{- 1} mod q} (a - b) mod q \end{matrix}

(8)

Note that the depth of the ROM can be reduced by four by performing preprocessing and postprocessing. The first reduction factor of 2 is achieved by taking advantage of the symmetry of squares

x^{2} = {(- x)}^{2}

; in the modular field, this means

n^{2} = {(q - n)}^{2} = min {(n, q - n)}^{2} mod q

. Thus, we only need to look up the quarter square of

min (n, q - n)

, which is only up to

(q - 1) / 2

and is 11 bits wide.

The second reduction factor of 2 is achieved by further realizing the fact that the table saves both the even

{(2 n)}^{2} / 4

and odd

{(2 n + 1)}^{2} / 4

entries; the odd entries

{(2 n + 1)}^{2} / 4

can be computed using the even entries

{(2 n)}^{2} / 4

by

{(2 n + 1)}^{2} / 4 = {(2 n)}^{2} / 4 + n + 4^{- 1}

. Therefore, the tables only need to store the even quarter squares,

x^{2}

for x up to

q / 4

(10-bit); to compute

x^{2} 4^{- 1} mod q

, we only look up the truncated number

n = x [10 : 1]

and then conditionally adding

n + 4^{- 1}

if x is odd, as shown in Equation (9).

\{\begin{matrix} {lut}_{x^{2} mod q} (n) & : = n^{2} mod q, 0 \leq n < 2^{10} \\ x {[10 : 0]}^{2} 4^{- 1} mod q & = {lut}_{x^{2} mod q} (x [10 : 1]) + x [0] (x [10 : 1] + 4^{- 1}) mod q \end{matrix}

(9)

Finally, with the preprocessing and postprocessing in place, we only need a single 1024 × 12 dual-port ROM to store the quarter squares, which fits perfectly in a single 2 KB true dual-port BRAM of Xilinx FPGA. This variant of the butterfly unit is implemented as shown in Figure 4. The modular multiplication is handled by a quarter square mul-mod submodule, implemented as whon in Figure 5. The wuarter square dual-port ROM stores the value of

{lut}_{x^{2} mod q}

described in Equation (9). For simplicity, all trivial reductions modulo q are omitted. This method requires a single true dual-port 2 KB ROM per butterfly unit.

Figure 4. Quarter square butterfly unit.

Figure 5. Quarter square Mul-mod submodule.

3.1.3. Hybrid Butterfly Unit

Both the proposed butterfly units shown in Figure 2 and Figure 4 only focus on the forward calculation in Equation (1). A slight modification can be made to allow the butterfly unit to support both the forward butterfly calculation in Equation (1) and the inverse butterfly calculation in Equation (2). This modified butterfly unit is called a hybrid butterfly unit and is implemented as shown in Figure 6.

Figure 6. Hybrid butterfly unit.

The zeta[k] multiplier submodule can be any modular multiplication module, such as the ones in Figure 3 or Figure 5. With intt_sel signal, when asserted, the butterfly unit acts as an inverse NTT butterfly unit, as in Equation (10).

\{\begin{matrix} x_{out} & = (y_{in} + x_{in}) 2^{- 1} \\ y_{out} & = (y_{in} - x_{in}) ζ [k] 2^{- 1} \\ i n t t_s e l & = 1 \end{matrix}

(10)

Similarly, when intt_sel is deasserted, the butterfly unit acts as a forward NTT butterfly unit, as in Equation (11).

\{\begin{matrix} x_{out} & = x_{in} + y_{in} ζ [k] \\ y_{out} & = x_{in} - y_{in} ζ [k] \\ i n t t_s e l & = 0 \end{matrix}

(11)

Note that the multiplications by the inverse of 2 modulo q in Equation (10),

a 2^{- 1}

, can simply be achieved by realizing that, if a is even, then

a 2^{- 1} \equiv a ≫ 1 (\mod q)

, where

a ≫ 1

is the result of the unsigned binary presentation of a shifted right by 1. It can also be defined as

a ≫ 1 : = ⌊ \frac{a}{2} ⌋

. When a is odd, then

a + q

is even, and thus

a 2^{- 1} \equiv a 2^{- 1} + q 2^{- 1} \equiv (a + q) 2^{- 1} \equiv (a + q) ≫ 1 (mod q)

.

Therefore, all modular multiplications by

2^{- 1}

are efficient, as shown in Equation (12).

a 2^{- 1} mod q = (a + q a [0]) ≫ 1 mod q

(12)

3.2. Realign Buffer

The butterfly unit requires the data to be accessed in a special way; for instance, we need to access the coefficient at offset j and offset

j + l e n

to perform a single butterfly operation. Special handling is necessary when working with a single SRAM with a single read port. To handle this without compromising the throughput of the butterfly unit, the core’s SRAM must support fetching the double-width coefficient or 24 bits. The core fetches the data at position j,

j + 1

in the odd cycles and

j + l e n

,

j + l e n + 1

in the very next even cycles; the data are reordered to interleaved form

r [j], r [j + l e n]

and output to the butterfly unit, as shown in Figure 7. Since each butterfly unit requires a pair of coefficient and the read port supplies two coefficients on each cycle, with proper data ordering, a pipelined butterfly unit is fully utilized. The same principle can be applied to the output of the butterfly unit and the write port of the SRAM; the data are realigned to contiguous form

r^{'} [j], r^{'} [j + 1]

and written back to the SRAM.

Figure 7. Proposed data access pattern.

4. Experimental Results

4.1. Experimental Setup

To verify the functionality and the performance of the NTT core under ideal conditions, we created a test bench with the core top as the unit test and interfaced with it using an AXI-like handshake mechanism. For the input side, after reasserting reset, we provided the core with a continuous stream of input data as the coefficient of the polynomial on which we needed the core to operate. Eventually, the core was finished and a data_out_valid signal was asserted to provide a stream of output data as the transformed result. The operation latency was as the number of cycles between the first in_data_valid from the test bench assertion up until the last assertion of out_data_valid (the previous result data). Furthermore, the result was then checked with the expected result to make sure the core was working correctly; this was particularly handy when we were in the process of rapidly tweaking the core to fix the violated path to achieve higher frequency.

The top verification module is shown in Figure 8. A UART transceiver was used receive the test data and send back the result, which was then checked against the expected result. The other end of the UART communication was a desktop computer running a UART test driver software that transmitted test polynomials randomly and then checked for the FPGA’s NTT result for any error. The test driver algorithm is described in Algorithm 3.

Figure 8. Verification module.

Algorithm 3: UART software driver

4.2. Results and Comparison

In this work, we had three variants of the NTT core. The first two were the NTT-only cores with each proposed butterfly unit, namely,

α

and

β

, respectively; these cores only support the forward NTT operation. However, since there are not many method that only support forward NTT, to make our result more comparable, we adopted the hybrid butterfly unit (Figure 6) for the multiplicationless strategy to support both inverse and forward NTT on the fly via an intt_sel switch. This is called the

γ

core; to be more specific, this modification involved the addition of two stages to the pipeline to switch between CT (Algorithm 1) and GS (Algorithm 2) butterfly.

The utilization result is reported after being synthesized (with directive AreaOptimized_high) with PnR (with directive Explore) on NIST-recommended xc7a100tfgg676-3 target device on Vivado 2023.2. The number of cycles was recorded from the experiment test bench and scaled with the maximum frequency on which the core could run to provide the wall-clock time latency

Time (μ s) : = \frac{Cycles}{Freq (M H z)}

of each NTT operation. The INTT wall-clock time was omitted for simplicity since it differs from NTT by a few clock cycles at most. The cycles reported for “Both NTT and INTT” designs are in NTT/INTT format (i.e., 126/127 means 126 clock cycles for NTT and 127 clock cycles for INTT), unless both of them are the same or the work did not report it clearly. The ATP metric, where a smaller value indicates better hardware utilization efficiency, is calculated as the product of time in

μ s

and the respective resource utilized, for instance, LUT ATP :=

Time (μ s) \times LUTs

and FF ATP :=

Time (μ s) \times FFs

. For the methods that have a configurable number of parallel butterfly units, the notation [c]

\times n

indicates the n-parallel butterfly units design of the method [c].

In Table 1, for the NTT-only cores, the

α

core uses 429 LUTs, 538 FFs, and 4 BRAMs with two parallel butterfly units (unroll factor 2), achieving a hardware efficiency of 441.87 LUT ATP and 554.14 FF ATP with 459 cycles of latency. The

β

core, with a single butterfly, using only 379 LUTs, 414 FFs, with 910 cycles latency, achieved 792.11 LUT ATP and 865.25 FF ATP, which is worse than

α

but only uses a single BRAM as the ROM quarter square look-up table for the butterfly unit. Note that it is possible to not unroll the design

α

and run with just a single butterfly unit; however, this still requires four BRAMs due to the nature of the approach, that is, four ROMs shared between two butterflies via two different ports. In the

β

design, each butterfly has a dedicated ROM; hence, there are more rooms to scale in the quarter square method. And, finally, the hybrid core

γ

, which supports both NTT and INTT and is based on the

α

design, only requires slightly more resources at 541 LUTs (595.10 ATP), 680 FFs (748.00 ATP,) and two-cycle latency penalty, where both INTT and NTT take 461 cycles to finish.

Table 1. Result comparison with other lightweight NTT-core methods.

For the NTT standalone cores, although a recent method [] (FNTT) successfully eliminated the usage of any DSP or BRAM, the approach heavily hinders the LUT and FF utilization and frequency, which can only run at about 100 MHz even on the higher-end Virtex-7 platform. Our

α

design is able to achieve more than 10× lower latency and more than 200× smaller ATPs than [] (FNTT). Furthermore, even the

β

design has almost 7× better latency and more than 100× better ATPs than the FNTT method [] with the cost of only a single additional 18 Kb BRAM.

For the hybrid design (NTT- and INTT-capable) comparison, thanks to the highly pipelined nature of the butterfly unit, our

γ

design can run at a higher frequency (417 MHz) on the Artix-7 platform than the other methods, including [] (300 MHz), which uses a much more recent and faster Zynq Ultrascale+ platform. The newly proposed modular multiplication methods can achieve better latency (1.10

μ

s) than all other methods except the fastest ×16 configuration in [] (0.40

μ

s). However, [] uses nearly

17.57 \times

more LUTs and

17.5 \times

more BRAMs than ours while being only

2.75 \times

faster. Therefore, we achieved a

6.39 \times

more efficient LUT ATP and

1.44 \times

more efficient FF ATP than [] ×16.

Compared to [], which employs the same parallel butterflies strategy, our method achieved

3.71 \times

and

1.37 \times

better LUT and FF ATP in the ×1 configuration and

5.10 \times

and

1.60 \times

better LUT and FF ATP in the ×2 configuration.

Our design also achieved about

2 \times

better LUT ATP and

1.5 \times

better FF ATP than [,]. In [], the authors used two fewer BRAMs but at the cost of four DSPs and much higher LUTs and FFs usage, and latency was higher. The same apply for [,], which use fewer BRAMs at the expense of significantly more LUTs/FFs and two DSPs; hence, these methods are less efficient than our work in LUT and FF ATP.

In [], the authors did not report the exact maximum frequency that their design is able to run. Assuming that it could run at the same frequency as our

γ

design at 417 MHz, which is a highly unlikely scenario given the nearly 5000 LUTs and 35 BRAMs utilized, the method in [] would have 3.66× better latency than

γ

while using 9.18× more LUTs and 2.38× more FFs and thus would be more efficient in terms of FFs ATP but significantly worse in term of LUTs ATP.

5. Conclusions

In this paper, we proposed a new NTT architecture with a unique data access pattern along with a reorder buffer, making it suitable for a single, shallow-depth, simple and dual-port SRAM; in the FPGA architecture, this was mapped to abundant LUT-RAM resources. This allowed us to unroll the inner loop of the NTT by a factor of two and beyond (×4, ×8) to achieve lower latency without using significantly more resources in the control path. Furthermore, we proposed two new look-up table-based butterfly units, namely, multiplicationless and quarter square multiplication, both with different tradeoffs in LUT/BRAM usage and eliminate the need for computing the full

12 \times 12

product and subsequently the need for DSP utilization. Thanks to the new butterfly units, our whole design runs at a higher frequency (417 MHz), has good latency while maintaining excellent hardware usage efficiency (LUT/FF ATP), and is suitable for embedding in high-performance, lightweight CRYSTALS Kyber accelerators for more efficient polynomial multiplication.

Author Contributions

Conceptualization, B.K.-D.-N.; Validation, B.K.-D.-N., N.T.B. and H.P.N.; Investigation, B.K.-D.-N., N.T.B. and H.P.N.; Writing—original draft, C.P.-Q.; Writing—review & editing, C.P.-Q., N.-T.T. and T.-T.H.; Supervision, C.P.-Q., N.-T.T., T.-T.H. and C.-K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number NCM2021-20-02. We acknowledge Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, for supporting this study.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Shor, P.W. Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. SIAM J. Sci. Statist. Comput. 1997, 26, 1484–1509. [Google Scholar] [CrossRef]
Grover, L.K. A Fast Quantum Mechanical Algorithm for Database Search. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, Philadelphia, PA, USA, 22–24 May 1996; pp. 212–219. [Google Scholar]
Rivest, R.L.; Shamir, A.; Adleman, L. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Commun. ACM 1978, 21, 120–126. [Google Scholar] [CrossRef]
Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology (CRYPTO), Santa Barbara, CA, USA, 18–22 August 1985; pp. 417–426. [Google Scholar]
Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehle, D.D. CRYSTALS—Kyber: A CCA-Secure Module-Lattice-Based KEM. In Proceedings of the European Symposium on Security and Privacy (EuroS&P), London, UK, 24–26 April 2018; pp. 353–367.
Lyubashevsky, V.; Peikert, C.; Regev, O. On Ideal Lattices and Learning with Errors Over Rings. In Proceedings of the Advances in Cryptology (EUROCRYPT), Santa Barbara, CA, USA, 15–19 August 2010; pp. 1–23. [Google Scholar]
Lindner, R.; Peikert, C. Better Key Sizes (and Attacks) for LWE-Based Encryption. In Proceedings of the Topics in Cryptology (CT-RSA), San Francisco, CA, USA, 14–18 February 2011; pp. 319–339. [Google Scholar]
Das, M.; Jajodia, B.B. Hardware Design of Optimized Large Integer Schoolbook Polynomial Multiplications on FPGA. In Proceedings of the International SoC Design Conference (ISOCC), Gangneung-si, Republic of Korea, 19–22 October 2022; pp. 65–66. [Google Scholar]
Zhang, Y.; Cui, Y.; Ni, Z.; Kundi, D.; Liu, D.; Liu, W. A Lightweight and Efficient Schoolbook Polynomial Multiplier for Saber. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 2251–2255. [Google Scholar]
Birgani, Y.A.; Timarchi, S.; Khalid, A. Area-Time-Efficient Scalable Schoolbook Polynomial Multiplier for Lattice-Based Cryptography. IEEE Trans. Circ. Syst. II Express Briefs (TCAS-II) 2022, 69, 5079–5083. [Google Scholar] [CrossRef]
Yang, S.; Liu, D.; Hu, A.; Li, A.; Zhang, J.; Li, X.; Lu, J.; Mo, C. An Instruction-configurable Post-quantum Cryptographic Processor Towards NTRU. In Proceedings of the Asian Hardware Oriented Security and Trust Symposium (AsianHOST), Singapore, 14–16 December 2022; pp. 1–6. [Google Scholar]
Wong, Z.Y.; Wong, D.C.K.; Lee, W.K.; Mok, K.M.; Yap, W.S.; Khalid, A. KaratSaber: New Speed Records for Saber Polynomial Multiplication Using Efficient Karatsuba FPGA Architecture. IEEE Trans. Comput. 2023, 72, 1830–1842. [Google Scholar] [CrossRef]
Ghosh, A.; Mera, J.M.B.; Karmakar, A.; Das, D.; Ghosh, S.; Verbauwhede, I.; Sen, S. A 334 μW 0.158 mm² ASIC for Post-Quantum Key-Encapsulation Mechanism Saber With Low-Latency Striding Toom–Cook Multiplication. IEEE J. Solid-State Circ. 2023, 58, 2383–2398. [Google Scholar] [CrossRef]
Wang, J.; Yang, C.; Zhang, F.; Meng, Y.; Su, Y. TCPM: A Reconfigurable and Efficient Toom-Cook-Based Polynomial Multiplier Over Rings Using a Novel Compressed Postprocessing Algorithm. IEEE Trans. Very Large Scale Integr. (Vlsi) Syst. 2023, 31, 1153–1166. [Google Scholar] [CrossRef]
Ye, Z.; Cheung, R.; Huang, K. PipeNTT: A Pipelined Number Theoretic Transform Architecture. IEEE Trans. Circ. Syst. II Express Briefs (TCAS-II) 2022, 69, 4068–4072. [Google Scholar] [CrossRef]
Xin, M.; Xu, C.; Huang, K.; Yu, H.; Yao, H.; Jiang, X.; Liu, D. Implementation of Number Theoretic Transform Unit for Polynomial Multiplication of Lattice-based Cryptography. In Proceedings of the International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 323–327. [Google Scholar]
Nguyen, T.H.; Binh, K.D.N.; Pham, C.K.; Hoang, T.T. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium. IEEE Access 2024, 12, 34918–34930. [Google Scholar] [CrossRef]
Gauss, C.F. Nachlass: Theoria Interpolationis Methodo Nova Tractata. Carl Friedrich Gauss 1866, 3, 265–303. [Google Scholar]
Pollard, J.M. The Fast Fourier Transform in a Finite Field. Math. Comput. 1971, 25, 365–374. [Google Scholar] [CrossRef]
Cooley, J.W.; Tukey, J.W. An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
Zhang, C.; Liu, D.; Liu, X.; Zou, X.; Niu, G.; Liu, B.; Jiang, Q. Towards Efficient Hardware Implementation of NTT for Kyber on FPGAs. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 22–28 May 2021. [Google Scholar]
Ni, Z.; Khalid, A.; Liu, W.; O’Neill, M. Towards a Lightweight CRYSTALS-Kyber in FPGAs: An Ultra-lightweight BRAM-free NTT Core. In Proceedings of the IEEE International Symposium on Circuits and Systems 2023, Monterey, CA, USA, 21–25 May 2023. [Google Scholar]
Imran, M.; Khan, S.; Khalid, A.; Rafferty, C.; Shah, Y.A.; Pagliarini, S.; Rashid, M.; O’Neill, M. Evaluating NTT/INTT Implementation Styles for Post-Quantum Cryptography. IEEE Embed. Syst. Lett. 2024; early access. [Google Scholar] [CrossRef]
Yang, H.; Chen, R.; Wang, Q.; Wu, Z.; Peng, W. Hardware Acceleration of NTT-Based Polynomial Multiplication in CRYSTALS-Kyber. In Information Security and Cryptology; Ge, C., Yung, M., Eds.; Springer: Singapore, 2024; pp. 111–129. [Google Scholar]
Longa, P.; Naehrig, M. Speeding Up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography. Cryptology ePrint Archive, Paper 2016/504, 2016. Available online: https://eprint.iacr.org/2016/504 (accessed on 7 July 2024).
Niasar, M.B.; Azarderakhsh, R.; Kermani, M.M. Instruction-Set Accelerated Implementation of CRYSTALS-Kyber. IEEE Trans. Circ. Syst. I Regul. Pap. (TCAS-I) 2021, 68, 4648–4659. [Google Scholar] [CrossRef]
Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. High-Speed NTT-based Polynomial Multiplication Accelerator for Post-Quantum Cryptography. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 14–16 June 2021. [Google Scholar]
Yaman, F.; Mert, A.C.; Öztürk, E.; Savas, E. A Hardware Accelerator for Polynomial Multiplication Operation of CRYSTALS-KYBER PQC Scheme. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021. [Google Scholar]
Xing, Y.; Li, S. A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 328–356. [Google Scholar] [CrossRef]
Matteo, S.D.; Sarno, I.; Saponara, S. CRYPHTOR: A Memory-Unified NTT-Based Hardware Accelerator for Post-Quantum CRYSTALS Algorithms. IEEE Access 2024, 12, 25501–25511. [Google Scholar] [CrossRef]

Figure 1. Control/data flow of the proposed NTT core.

Figure 2. Multiplicationless butterfly unit.

Figure 3. Mult-less zeta[k] submodule.

Figure 4. Quarter square butterfly unit.

Figure 5. Quarter square Mul-mod submodule.

Figure 6. Hybrid butterfly unit.

Figure 7. Proposed data access pattern.

Figure 8. Verification module.

Table 1. Result comparison with other lightweight NTT-core methods.

	Mode	LUTs	FFs	DSPs	BRAMs (18 Kb)	Freq (MHz)	Cycles	Time ( $μ$ s)	Platform	LUT ATP	FF ATP
This ( $α$ ) ×2	◐	429	538	0	4	446	459	1.03	Artix-7	441.87	554.14
This ( $β$ ) ×1 (quarter sq.)	◐	379	414	0	1	435	910	2.05	Artix-7	792.11	865.26
[] (FNTT)	◐	9187	9328	0	0	100	1410	14.10	Virtex-7	129,536.7	131,524.8
This ( $γ$ ) ×2	$⚫$	541	680	0	4	417	461	1.10	Artix-7	595.10	748.00
[]	$⚫$	810	717	4	2	222	324	1.46	Artix-7	1182.60	1046.82
[]	$⚫$	609	640	2	4	257	490	1.91	Artix-7	1163.19	1222.40
[]	$⚫$	1154	1031	2	0	300	456	1.52	Zynq US+	1754.08	1567.12
[] ×1	$⚫$	360	145	3	2	115	940	8.17	Artix-7	2941.20	1184.65
[] ×2	$⚫$	737	290	6	4	115	474	4.12	Artix-7	3036.44	1194.80
[] ×1	$⚫$	948	325	1	5	190	904	4.76	Artix-7	4512.48	1547.00
[] ×4	$⚫$	2543	792	4	18	182	232/233	1.27	Artix-7	3229.61	1005.84
[] ×16	$⚫$	9508	2684	16	70	172	69/71	0.40	Artix-7	3803.20	1073.60
[]	$⚫$	1737	1167	2	3	161	512	3.18	Artix-7	5523.66	3711.06
[] (UNTT)	$⚫$	9298	9402	0	0	20	1410	70.50	Virtex-7	655,509.0	662,841.0
[]	$⚫$	4969	1616	9	35	N/A	126/127	N/A	Artix-7	N/A	N/A
[]	$⚫$	1243	562	11	7	118	933	7.91	Artix-7	9832.13	4445.42

◐: Only NTT,

⚫

: both NTT and INTT

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Compact and Low-Latency FPGA-Based Number Theoretic Transform Architecture for CRYSTALS Kyber Postquantum Cryptography Scheme

Abstract

1. Introduction

1.1. Related Work

1.2. Key Contributions

2. Background Knowledge

3. Proposed NTT Architecture

3.1. Butterfly Unit

3.1.1. Multiplicationless Method

3.1.2. Quarter Square Multiplication Method

3.1.3. Hybrid Butterfly Unit

3.2. Realign Buffer

4. Experimental Results

4.1. Experimental Setup

4.2. Results and Comparison

5. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics