1. Introduction
In 2016, the National Institute of Standards and Technology (NIST) initiated the Post-Quantum Cryptography (PQC) standardization process. This project aims to develop, standardize, and deploy new post-quantum cryptosystems before any large-scale quantum computer comes into being. In July 2022, NIST announced the results of the third round, with four candidates to be standardized for Public Key Encryption (PKE) and Digital Signature Algorithm (DSA) [1]. Most are Lattice-Based Cryptographic (LBC) algorithms: CRYSTALS-Kyber [2] for PKE, and CRYSTALS-Dilithium and FALCON [3,4] for DSA.
Lattice-based cryptographic constructions are primarily based on the hardness of the Learning-With-Errors (LWE) problem and its variants (CRYSTALS-Kyber, Dilithium) or on NTRU lattices (FALCON). Implementing an LBC cryptosystem requires polynomial multiplication, the most hardware-intensive operation. There are two ways to perform it: schoolbook polynomial multiplication and multiplication based on the Number Theoretic Transform (NTT). Schoolbook multiplication is inefficient because it has $O\left({N}^{2}\right)$ complexity. The NTT is a special case of the Discrete Fourier Transform over a finite field. NTT-based multiplication reduces the $O\left({N}^{2}\right)$ complexity to quasi-linear complexity $O(N\cdot \log N)$. NTT optimization is therefore necessary to improve the efficiency of an LBC cryptosystem. Furthermore, all NTT operations are performed modulo a prime q, including modular addition, subtraction, and multiplication. Modular multiplication is the most intricate of these and demands significant hardware resources. Thus, improving modular multiplication can enhance the performance of NTT/INTT and of the entire cryptosystem. For this reason, optimizing the modular reduction that follows multiplication has garnered significant attention in research on hardware implementations of LBC post-quantum cryptography.
Accelerator proposals for NTT/INTT mainly focus on optimizing the modular reduction applied to the product of two integers. Montgomery [5] and Barrett [6] are two commonly used constant-time modular reduction algorithms. The Montgomery method has received less attention because its operands must be converted into and out of the “Montgomery domain”. The Barrett method, on the other hand, is more efficient and is used more frequently in LBC cryptosystems [7]. Barrett reduction uses precomputed values to approximate division by the modulus. Modular multiplication based on the Barrett method requires three multiplications: one for the two input coefficients and two by constants. A variation of the Barrett algorithm in [8], called Shift-Add-Multiply-Subtract-Subtract (SAMS2), replaces the constant multipliers with simple bit shifts, additions, and subtractions, which are less expensive than multiplication and division operations. Studies [9,10,11] have applied the SAMS2 method for the parameters q = 7681 and 12,289.
In study [12], Plantard introduced a novel constant-time modular reduction algorithm. Like the Montgomery and Barrett algorithms, Plantard multiplication uses precomputed values and requires three multiplications for modular multiplication. However, when performing NTT/INTT with precomputed twiddle factors, the number of multiplications can be reduced by one. In another study [13], J. Huang et al. enhanced the Plantard method to accept signed integers as input and narrowed the output range to $\left(-\frac{q}{2},\frac{q}{2}\right)$.
Another effective method for modular reduction in LBC systems is to exploit the special structure of the prime q [14,15]. This approach enables modular calculations through lightweight operations, including bitwise operations, addition, and subtraction. An alternative and more straightforward method for reducing the high-order bits modulo q is to use lookup tables, as demonstrated in [16].
In a different study [17], Longa et al. proposed a method called KRED, which exploits a special form of NTT-applicable primes, $q=k\cdot {2}^{m}+1$. This method primarily consists of a multiplication by a small coefficient (k) and a subtraction, resulting in significantly lower computational cost than other methods. The input product $c=a\cdot b$ is reduced to the signed integer $r\equiv k\cdot c \pmod{q}$. In study [18], Bisheh-Niasar et al. built on the KRED method and introduced the ${K}^{2}$RED method, which applies KRED twice for CRYSTALS-Kyber. Furthermore, the accumulated factors, $k/{k}^{2}$, can be eliminated by merging ${k}^{-1}/{k}^{-2}$ into the twiddle factor $\omega /{\omega}^{-1}$ in the NTT/INTT processes. In particular, in study [19], Li et al. used the property $k\cdot {2}^{m}\equiv -1 \pmod{q}$, which allows the KRED method to be applied to the Point-Wise Multiplication (PWM) process. Additionally, the multiplication by the factor ${N}^{-1}$ is removed from the INTT process.
Nevertheless, as far as we know, existing studies optimize only the reduction step and take the product of the input multiplication as given. This study aims to design a DSP for modular multiplication by developing a multiplication unit for two integer inputs and combining it with advanced modular reduction techniques. In particular, the Karatsuba algorithm is used to halve the width of the multiplications. Partial multiplications are performed using the Vertical and Crosswise algorithm, which speeds up computation. The result of the multiplication is reduced, using a precomputed lookup table, to the bit width required by the KRED method. Finally, the KRED method is applied to calculate the result of the modulo operation, $a\cdot b \pmod{q}$.
The major contributions of this paper are as follows:
Proposes the first specialized DSP that performs modular multiplication for the CRYSTALS-Kyber PQC algorithm, called Kyber-DSP (KDSP).
The KDSP performs the multiplication of two input integers and the modular reduction for the prime q = 3329. The architecture reaches a frequency of 283 MHz with an area of only 77 SLICEs, equivalent to 77% of a typical DSP. This result outperforms traditional modular multiplication methods that rely on a DSP.
The proposed Lattice-DSP (LDSP) configuration optimizes the Butterfly Unit (BU) in NTT/INTT.
In addition to saving hardware resources, the proposed LDSP eliminates the ${N}^{-1}$ multiplication in the INTT process. As a result, the BU architecture requires minimal hardware resources, and choosing between Decimation-In-Time (DIT), Decimation-In-Frequency (DIF), or both for an NTT accelerator becomes more flexible. In CRYSTALS-Kyber, the BU architecture reaches a frequency of 283 MHz while occupying an area equivalent to one DSP.
Designs a KDSP-based Point-Wise Matrix Multiplication (PWMM) architecture for CRYSTALS-Kyber.
PWM calculation in CRYSTALS-Kyber is more complicated than in other LBC algorithms, requiring at least four multiplications for two PWM results. This study introduces a specific PWM structure for CRYSTALS-Kyber that uses KDSP. Furthermore, the accumulation required by matrix multiplication is combined with PWM while maintaining the same hardware cost for all three Kyber security levels (1, 3, and 5). The architecture that implements PWMM in the NTT domain includes PWM and Point-Wise Addition (PWA). The proposed PWMM reaches an operating frequency of 275 MHz with a hardware area of 386 SLICEs, equivalent to nearly four DSPs.
Extends the LDSP design to the primes q = 7681 and 12,289.
The proposed DSP design method suits NTT-friendly algorithms with a prime of the form $q=k\cdot {2}^{m}+1$. Applying this design to q = 7681 and 12,289 shows that the method still achieves high operating frequencies of 272 MHz and 256 MHz while using 87 SLICEs and 101 SLICEs of hardware resources, respectively.
The remainder of the paper is organized as follows. Section 2 introduces the theoretical background of LBC, specifically the CRYSTALS-Kyber algorithm, and describes NTT-based polynomial multiplication. Section 3 discusses existing implementation studies of modular reduction in more detail. Section 4 presents the DSP design for modular multiplication and the BU and PWMM architectures built upon it. Section 5 compares the performance of the proposed DSP and the designs built on it with state-of-the-art reference implementations on Field-Programmable Gate Arrays (FPGAs). Finally, Section 6 concludes the paper.
3. Related Works
LBC operations are performed on the ring ${\mathbf{R}}_{q}={\mathbb{Z}}_{q}\left[X\right]/({X}^{N}+1)$, where q is a prime number and N is a power of two. Modular multiplication is the most time-consuming operation in NTT and can be expressed as follows:
Several classical algorithms are available to enhance the efficiency of modular reduction, such as Montgomery reduction and Barrett reduction. The Montgomery method is used infrequently because of the resources consumed by conversions into and out of the “Montgomery domain” [25,26]. In contrast, the Barrett method is widely adopted and has many improved variants. The basic idea behind Barrett’s algorithm is to precompute an approximation of the inverse of the modulus q and to use simple bit shifting and multiplication instead of costly division. Algorithm 1 uses Barrett reduction to compute the product of two integers modulo q.
Algorithm 1 Modular Multiplication by Barrett Reduction [7]
Input: $a,b,q\in \mathbb{Z}$
Output: $a\times b \pmod{q}$
Precomputation
1: $k=\lceil \log_{2} q\rceil$;
2: $r={2}^{k}$;
3: $\mu =\lfloor \frac{{r}^{2}}{q}\rfloor$;
Multiplication
4: $z=a\times b$;
Barrett reduction
5: ${m}_{1}=\lfloor \frac{z}{r}\rfloor$;
6: ${m}_{2}={m}_{1}\times \mu$;
7: ${m}_{3}=\lfloor \frac{{m}_{2}}{r}\rfloor$;
8: $t=z-{m}_{3}\times q$;
9: if $t\ge q$ then
10: return $t-q$
11: else
12: return t
13: end if
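The steps of Algorithm 1 can be sketched in Python for the Kyber modulus. This is a minimal illustration, not the implementation of [7]: the names `barrett_mul`, `K_BITS`, `R`, and `MU` are hypothetical, and the final correction is written as a loop rather than a single conditional subtraction, as a conservative assumption about how far the estimate can be off.

```python
Q = 3329                  # Kyber modulus (assumed parameter for this sketch)
K_BITS = Q.bit_length()   # k = ceil(log2 q) = 12, valid because q is not a power of two
R = 1 << K_BITS           # r = 2^k
MU = (R * R) // Q         # mu = floor(r^2 / q), precomputed once


def barrett_mul(a: int, b: int) -> int:
    """Return a*b mod Q for 0 <= a, b < Q using only shifts and multiplications."""
    z = a * b                 # step 4: full product
    m1 = z >> K_BITS          # step 5: floor(z / r)
    m2 = m1 * MU              # step 6
    m3 = m2 >> K_BITS         # step 7: estimate of floor(z / q), never too large
    t = z - m3 * Q            # step 8: non-negative remainder candidate
    while t >= Q:             # steps 9-13, written as a loop to be conservative
        t -= Q
    return t
```

Because the quotient estimate never exceeds the true quotient, `t` is never negative, so only downward corrections are needed.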

In LBC algorithms, the value of q is fixed, allowing k, r, and $\mu$ to be precomputed. Barrett-based modular multiplication commonly employs DSPs for multiplying the input coefficients and for multiplying by the constant $\mu$, while multiplication by q is efficiently achieved using bitwise shifts and additions. In study [27], Dang et al. applied the variation of Barrett reduction in [28] to select parameter values ($\alpha$, $\beta$) and designed a single-constant multiplier for multiplying by the constant $\mu$. LBC schemes whose NTT implementation uses signed integers can eliminate the modular addition at each butterfly unit [25,29,30]. The optimized Barrett reduction algorithm for signed integer inputs was further enhanced in study [31] to narrow its output range to $\left(-\frac{q}{2},\frac{q}{2}\right)$. This improvement effectively limits the growth of coefficients after each butterfly unit, resulting in better performance. The SAMS2 method replaces multiplication with bit shifting, addition, and subtraction [8,9,10,11]. This significantly reduces the hardware area but increases latency because of the multiple subtractions. A lookup table can be used to speed up the computation, but it is inefficient in terms of area.
Huang et al. enhanced the Plantard algorithm for a larger range of inputs and a smaller range of outputs [
13]. The improved Plantard method saves one multiplication compared to the latest Montgomery and Barrett methods. However, the drawback of this method is that it still requires three multiplications when calculating the PWM and necessitates doubling the width of the precomputed intermediate twiddle factors.
Several methods have been proposed for modular reduction to optimize the area and speed of NTT accelerators for a specific parameter q. Study [14] exploited the form $q={2}^{{l}_{1}}\pm {2}^{{l}_{2}}\pm \dots \pm 1$ to replace the multiplications in Barrett reduction with bit shifts, additions, and subtractions. Aikata et al. implemented this technique for the Kyber case q = 3329 [32]. Some recent studies reduce the higher-order bits of the product of two 12-bit integers, $c[23:0]=a\times b$, modulo q. In studies [15,33], the property ${2}^{12}\equiv {2}^{9}+{2}^{8}-1 \pmod{3329}$ is used to gradually reduce the higher-order bits to an arithmetic combination of smaller bit arrays. Similarly, in study [34], the bit width of the product is reduced from 24 to 15 before the Barrett algorithm is applied, which reduces the multiplication size during Barrett reduction. Additionally, a different form, $q=\delta \cdot {2}^{e}+1$, is used to propose a modular reduction algorithm for Kyber [35]. This algorithm divides the product c into two parts, $c={c}_{1}\cdot {2}^{e}+{c}_{0}$, and replaces the large modulus q with the smaller modulus $\delta$. A simpler and more efficient alternative is introduced in [16]. Zhang et al. used precomputed lookup tables to store the values of $c[23:20]\cdot {2}^{20} \pmod{3329}$, $c[19:16]\cdot {2}^{16} \pmod{3329}$, and $c[15:12]\cdot {2}^{12} \pmod{3329}$. The higher-order bits are used as the input addresses of the lookup tables, and the outputs are the corresponding reduced values. Finally, the reduction of the product $c[23:0]$ is obtained by adding the four numbers (the three table outputs and $c[11:0]$) on the ring ${\mathbf{R}}_{3329}$.
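The lookup-table reduction described for [16] can be illustrated with a short Python sketch. The table layout (three 16-entry tables addressed by 4-bit slices) follows the description above; the names `LUT20`, `LUT16`, `LUT12`, and `lut_reduce` are hypothetical, and the final modular addition is modeled with repeated conditional subtraction, which is an assumption about how the sum is brought back into range.

```python
Q = 3329  # Kyber modulus

# One 16-entry table per 4-bit slice of c[23:12]; entry v stores v * 2^shift mod Q.
LUT20 = [(v << 20) % Q for v in range(16)]
LUT16 = [(v << 16) % Q for v in range(16)]
LUT12 = [(v << 12) % Q for v in range(16)]


def lut_reduce(c: int) -> int:
    """Reduce a 24-bit product c modulo Q with table lookups and additions only."""
    assert 0 <= c < (1 << 24)
    total = (LUT20[(c >> 20) & 0xF]      # c[23:20] * 2^20 mod Q
             + LUT16[(c >> 16) & 0xF]    # c[19:16] * 2^16 mod Q
             + LUT12[(c >> 12) & 0xF]    # c[15:12] * 2^12 mod Q
             + (c & 0xFFF))              # c[11:0] passes through unchanged
    # total is at most 3*(Q-1) + 4095, so a few subtractions suffice.
    while total >= Q:
        total -= Q
    return total
```

The tables hold 48 twelve-bit entries in total, which is the area cost that this method trades for the absence of multipliers in the reduction.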
Another modular reduction approach, KRED, is proposed in study [17]. The KRED method exploits the characteristics of Proth numbers of the form $q=k\cdot {2}^{m}+1$, where k is a small number and m is a natural number. This approach provides two functions, KRED and KRED2X, which take any integer c and return an integer d such that $d\equiv k\cdot c \pmod{q}$ and $d\equiv {k}^{2}\cdot c \pmod{q}$, respectively. The KRED method performs one multiplication by the constant factor k and one subtraction, as described in Algorithm 2. Multiplying by k is achieved with bit shifting and addition, significantly reducing computational cost compared to other methods. However, to correct the reduction result, the output of KRED must be multiplied by the factor ${k}^{-1}/{k}^{-2}$. Bisheh-Niasar et al. [18] followed this scheme and proposed ${K}^{2}$RED, which applies KRED twice in CRYSTALS-Kyber with the constants k = 13 and m = 8. The multiplication by ${k}^{-1}/{k}^{-2}$ can be merged into the precomputed twiddle factor $\omega$ in NTT/INTT for faster computation with fewer hardware resources. In study [36], Ni et al. segmented the product into two parts: the lookup table method is employed for the bits exceeding 20, while the remaining portion undergoes the KRED technique. This method is simple but requires one more adder. ${K}^{2}$RED is extended to ${K}^{l}$RED for different NTT parameters, where $l=\lceil t/m\rceil$ is the number of loops and t is the bit length of the input coefficients [37]. However, KRED is not appropriate for PWM calculations, in which both multipliers are arbitrary and no precomputed twiddle factor is available to absorb the factor k. Fortunately, this drawback can be resolved by using the property $k\cdot {2}^{m}\equiv -1 \pmod{q}$, as demonstrated in study [19]. Li et al. applied the KRED method with modifications to the input value and the output calculation. In particular, the input product c is multiplied by two, and the sign of the subtraction in KRED is reversed. Therefore, the output is calculated as,
with $(k,N)=$ (13,256) and $13\cdot 256\equiv -1 \pmod{3329}$ for the case of Kyber, Equation (9) is equivalent to $r\equiv {128}^{-1}\cdot c \pmod{3329}$. During the INTT process, multiplying by ${128}^{-1}$ is considered postprocessing. As a result, applying this modified KRED to the PWM process eliminates the need for that postprocessing in Kyber.
Algorithm 2 KRED Modular Reduction Algorithm [17]
Input: c, parameters: m, k.
Output: $d\equiv k\cdot c \pmod{q}$
1: ${d}_{0}=c \bmod {2}^{m}$;
2: ${d}_{1}=\lfloor c/{2}^{m}\rfloor$;
3: Return ($k\cdot {d}_{0}-{d}_{1}$)
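Algorithm 2 and the sign-reversed variant attributed to [19] can be checked numerically with a small Python sketch for the Kyber parameters (k = 13, m = 8). The function names `kred`, `kred_flipped`, and `merged_twiddle` are hypothetical, and the sketch returns unnormalized signed integers, leaving any final range correction out of scope. It relies on `pow` with a negative exponent, which requires Python 3.8 or later.

```python
Q, K, M = 3329, 13, 8           # q = k * 2^m + 1 = 13 * 256 + 1
MASK = (1 << M) - 1


def kred(c: int) -> int:
    """Algorithm 2: return a signed d with d congruent to k*c mod Q."""
    d0 = c & MASK               # c mod 2^m
    d1 = c >> M                 # floor(c / 2^m)
    return K * d0 - d1          # uses k * 2^m = -1 (mod Q)


def kred_flipped(c: int) -> int:
    """Variant on the doubled input with the subtraction reversed.

    The result is congruent to -2*k*c, which equals 128^-1 * c mod Q
    because 128 * 26 = 3328 = -1 (mod Q).
    """
    c2 = 2 * c
    d0 = c2 & MASK
    d1 = c2 >> M
    return d1 - K * d0


def merged_twiddle(w: int, passes: int = 1) -> int:
    """Fold k^(-passes) into a twiddle factor so the factor left by KRED cancels."""
    return w * pow(K, -passes, Q) % Q
```

For example, applying `kred` twice to `b * merged_twiddle(w, 2)` leaves a value congruent to `b * w`, which mirrors how the accumulated factor is absorbed by the precomputed twiddle factors.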

4. Proposed Hardware Design
The modular reduction in LBC designs typically starts from a product computed by a DSP-based multiplier, but small multiplication widths result in suboptimal area usage. Therefore, in this section, we present the KDSP architecture as the basis for implementing BUs in NTT/INTT processes. Further, we propose a PWMM unit for Kyber using the KDSP and extend the LDSP design method to other LBC cases with q = 7681 and 12,289.
4.1. KDSP
The KDSP performs modular multiplication of two 12-bit coefficients, $a[11:0]$ and $b[11:0]$, to produce a 12-bit result $c[11:0]$ on the ring ${\mathbf{R}}_{3329}$. The proposed KDSP architecture is shown in Figure 1 and has three computational stages. In the first stage, the Karatsuba algorithm partitions the 12-bit multiplication into three components: two 6-bit multiplications, ${a}_{H}\cdot {b}_{H}$ and ${a}_{L}\cdot {b}_{L}$, and one 7-bit multiplication, $({a}_{H}+{a}_{L})\cdot ({b}_{H}+{b}_{L})$. In the second stage, these partial products are summed with the help of a lookup table, which reduces the bit width of the product from 24 bits to 20 bits. Finally, the KRED method is used to reduce the 20-bit product modulo q.
The choice of multiplier design significantly impacts the speed and area of the proposed DSP. This study selects the Vedic multiplier, based on the Vertical and Crosswise technique, for the 6-bit and 7-bit multipliers because its critical path is shorter than that of conventional array multipliers [38]. In a Vedic multiplier, the partial products are generated in parallel and added together by two or three levels of adders [39]. Figure 2 shows the 3-bit multiplier architecture for two numbers $a[{a}_{2},{a}_{1},{a}_{0}]$ and $b[{b}_{2},{b}_{1},{b}_{0}]$. The architecture comprises nine AND gates, three full adders, and three half adders. Figure 3 depicts the proposed 6-bit multiplier architecture based on four 3-bit multipliers and three 6-bit adder units. The adder units are carry-save adders, which perform the additions in parallel and improve speed [40]. A 4-bit Vedic multiplier is also designed. In the KDSP architecture, the 3-bit and 4-bit Vedic multipliers serve as the base multipliers for building the 6-bit and 7-bit multipliers.
According to the Karatsuba algorithm, the multiplication is carried out with three products, ${a}_{H}\cdot {b}_{H}$, ${a}_{L}\cdot {b}_{L}$, and $({a}_{H}+{a}_{L})\cdot ({b}_{H}+{b}_{L})$; subtracting ${a}_{H}\cdot {b}_{H}$ and ${a}_{L}\cdot {b}_{L}$ from $({a}_{H}+{a}_{L})\cdot ({b}_{H}+{b}_{L})$ yields the middle term of the product. A lookup table computes $c[23:18]\cdot {2}^{18} \pmod{3329}$, with the input address being the bits of ${a}_{H}\cdot {b}_{H}$ that correspond to bits [23:18] of the product. The final product is thereby reduced to a bit width of 20 bits.
The prime parameter in Kyber is $q=13\cdot {2}^{8}+1$, so the factor values are k = 13 and m = 8. The accumulated factor k is eliminated differently depending on whether the KDSP is implemented in a BU or in a PWM unit. Consequently, the KRED-based modular reduction part can be regarded as a specialized submodule, which is discussed further in the following subsections.
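The first two KDSP stages can be modeled in software to confirm the 24-bit to 20-bit claim. The following Python sketch is a behavioral model under stated assumptions, not the hardware description: the 6-bit split, the 64-entry table `LUT18`, and the function name `karatsuba_fold` are hypothetical choices that follow the text, and ordinary integer multiplication stands in for the Vedic multipliers.

```python
Q = 3329
# Table for the top six product bits: entry v stores v * 2^18 mod Q.
LUT18 = [(v << 18) % Q for v in range(64)]


def karatsuba_fold(a: int, b: int) -> int:
    """Return p < 2^20 with p congruent to a*b mod Q, for 12-bit a and b."""
    a_h, a_l = a >> 6, a & 0x3F
    b_h, b_l = b >> 6, b & 0x3F

    # Stage 1: three small products (two 6-bit, one 7-bit).
    t0 = a_h * b_h
    t1 = a_l * b_l
    t2 = (a_h + a_l) * (b_h + b_l)
    mid = t2 - t0 - t1                    # equals a_h*b_l + a_l*b_h

    # Stage 2: product bits [23:18] are t0[11:6]; replace them by a table value.
    p0 = LUT18[t0 >> 6]
    p1 = ((t0 & 0x3F) << 12) | t1         # remaining t0 bits concatenated with t1
    return p0 + p1 + (mid << 6)
```

The returned value is not yet reduced to 12 bits; it is the 20-bit input that the KRED stage expects.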
4.2. Butterfly Unit
The Negative Wrapped Convolution (NWC) technique is used in NTT/INTT to avoid doubling the size of the polynomials being multiplied. To compute $\mathbf{c}=\mathbf{a}\times \mathbf{b}$ in the ring ${\mathbf{R}}_{q}$ with NWC, the polynomials $\mathbf{a}$ and $\mathbf{b}$ must be scaled by a factor $\phi$ before applying NTT (referred to as preprocessing). Subsequently, the polynomial product $\mathbf{c}$ is scaled by a factor ${N}^{-1}\cdot {\phi}^{-1}$ after INTT (referred to as postprocessing), where $\phi$ is a primitive $2N$-th root of unity. Two methods for calculating the NTT are DIT and DIF, corresponding to the Cooley-Tukey (CT) [41] and Gentleman-Sande (GS) [42] butterfly configurations. Given a pair of coefficients $(a,b)$ and a twiddle factor $\omega$, the CT and GS butterfly calculations produce $(a+b\cdot \omega ,a-b\cdot \omega )$ and $(a+b,(a-b)\cdot \omega )$, respectively, as depicted in Figure 4a,b.
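The two butterfly forms can be written directly in Python. This is a plain arithmetic sketch with the Kyber modulus assumed; the function names are hypothetical, and the `%` operator stands in for whichever modular reduction the hardware uses.

```python
Q = 3329  # Kyber modulus (assumed for this sketch)


def ct_butterfly(a: int, b: int, w: int) -> tuple[int, int]:
    """Cooley-Tukey (DIT): multiply first, then add and subtract."""
    t = (b * w) % Q
    return (a + t) % Q, (a - t) % Q


def gs_butterfly(a: int, b: int, w: int) -> tuple[int, int]:
    """Gentleman-Sande (DIF): add and subtract first, then multiply."""
    return (a + b) % Q, ((a - b) * w) % Q
```

Feeding the CT outputs into a GS butterfly with the inverse twiddle factor returns `(2a, 2b)`, which shows where the per-stage factor of two in the inverse transform comes from.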
In studies [43,44], the $\phi /{\phi}^{-1}$ multiplications are merged into the NTT and INTT processes. These methods involve separate butterfly operations: CT for NTT with $\phi$ and GS for INTT with ${\phi}^{-1}$. A unified BU architecture that supports both GS and CT computation was proposed in study [14], as shown in Figure 4c. In the GS calculation, the factor ${N}^{-1}$ can be precomputed together with the twiddle factor $\omega$ for the $(a-b)\cdot \omega$ operation. For the $(a+b)$ operation, Zhang et al. [45] proposed replacing the multiplication by ${N}^{-1}$ after INTT with a multiplication by ${2}^{-1}$ in each butterfly operation, that is, computing $(a+b)/2$. The hardware for multiplying by ${2}^{-1}$ in ${\mathbf{R}}_{q}$ is achieved simply by implementing $a/2=(a\gg 1)+a\left[0\right]\cdot \frac{q+1}{2}$.
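The halving formula can be verified exhaustively for the Kyber modulus. The Python sketch below assumes inputs already reduced to the range 0 to q-1; the name `half_mod` is hypothetical.

```python
Q = 3329
HALF_Q_PLUS_1 = (Q + 1) // 2    # (q + 1) / 2 = 1665, the value of 2^-1 mod Q


def half_mod(a: int) -> int:
    """Return a * 2^-1 mod Q for 0 <= a < Q using a shift and a conditional add."""
    # Even a: a >> 1 is exact. Odd a: (a - 1)/2 + (q + 1)/2 = (a + q)/2.
    return (a >> 1) + (a & 1) * HALF_Q_PLUS_1
```

The result always stays below q, so no further reduction is needed after the addition.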
Butterfly operations are all modular arithmetic. The choice among CT, GS, or both depends on the design of the NTT accelerator. The iterative NTT architecture uses a unified BU for the butterfly operations in all stages, such as NTT, INTT, and, in the Kyber case, PWM. The latency of iterative NTT increases with the number of butterfly cores and becomes more complex when handling high-order polynomials. On the other hand, the pipelined NTT architecture allows flexibility in selecting BU configurations, with the number of butterfly cores proportional to the number of NTT stages [11,46,47]. In all cases, modular multiplication is consistently the most hardware-intensive operation and lies on the critical delay path.
In this study, a BU tailored for Kyber is implemented. This architecture employs the KDSP and comprises one modular addition, one modular subtraction, and one modular multiplication. The output is selected by mux units, allowing the BU to operate in either CT or GS mode, as depicted in Figure 4d. The classical KRED method is applied in the modular reduction part of the KDSP. The accumulated factor k = 13 is removed by merging the inverse ${k}^{-1}$ into the twiddle factor $\omega$. Additionally, the multiplication by ${N}^{-1}$ for the GS calculation is eliminated because it is performed at the PWM stage (further details are provided in the subsequent section). Because the BUs use KDSP units, the size is reduced and the efficiency of parallel or pipelined architectures with multiple BUs is improved.
4.3. Point-Wise Matrix Multiplication Unit
Calculating the ciphertext $\mathbf{u}=INTT({\hat{\mathbf{A}}}^{T}\circ \hat{\mathbf{r}})+{\mathbf{e}}_{\mathbf{1}}$ is the most intricate operation in Kyber, where $\mathbf{u}$, $\mathbf{r}$, and ${\mathbf{e}}_{\mathbf{1}}$ are polynomial vectors and $\mathbf{A}$ is a polynomial matrix. The specific mathematical expression for the PWMM is shown in Equation (10) for the case of module rank k = 2.
The PWMM requires two point-wise operations: PWM and PWA. In Kyber, the NTT of a 256-term polynomial $a\left(x\right)=({a}_{0},{a}_{1},\dots ,{a}_{254},{a}_{255})$ is computed using two 128-point NTTs, one for the even part (${a}_{2i}\left(x\right)=({a}_{0},{a}_{2},\dots ,{a}_{254})$) and one for the odd part (${a}_{2i+1}\left(x\right)=({a}_{1},{a}_{3},\dots ,{a}_{255})$). Thus, the PWM in Kyber multiplies polynomials of the form ${\hat{a}}_{2i}+{\hat{a}}_{2i+1}\cdot X$. In study [48], Xing et al. proposed using the Karatsuba method to reduce the number of multiplications required to calculate $\hat{h}=\hat{f}\circ \hat{g}$ from five to four, as follows:
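The saving from five to four multiplications can be illustrated with a Python sketch of one degree-one product modulo $X^{2}-\zeta$, which is the standard form of the Kyber base multiplication. The function names and the argument `zeta` (standing for the per-index power of $\zeta$) are hypothetical, and `%` stands in for the hardware reduction.

```python
Q = 3329


def pwm_schoolbook(f0, f1, g0, g1, zeta):
    """(f0 + f1*X) * (g0 + g1*X) mod (X^2 - zeta): five multiplications."""
    h0 = (f0 * g0 + f1 * g1 * zeta) % Q
    h1 = (f0 * g1 + f1 * g0) % Q
    return h0, h1


def pwm_karatsuba(f0, f1, g0, g1, zeta):
    """Same result with four multiplications via the Karatsuba identity."""
    m0 = (f0 * g0) % Q
    m1 = (f1 * g1) % Q
    m2 = ((f0 + f1) * (g0 + g1)) % Q     # shared product for the cross terms
    h0 = (m0 + m1 * zeta) % Q            # fourth multiplication: by zeta
    h1 = (m2 - m0 - m1) % Q              # f0*g1 + f1*g0 without extra products
    return h0, h1
```

The cross term costs two additions and two subtractions instead of two multiplications, which is the trade the PWM architecture exploits.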
The previous section mentioned that the KRED method can be customized to change the output value. To remove the postprocessing of the INTT stage, the output values of the PWM calculation should be ${128}^{-1}\cdot {\hat{h}}_{2i}$ and ${128}^{-1}\cdot {\hat{h}}_{2i+1}$, which can be achieved by applying the property $13\cdot {2}^{8}\equiv -1 \pmod{3329}$. The PWM architecture for Kyber is detailed in Figure 5. Initially, the calculations ${\hat{f}}_{2i}\cdot {\hat{g}}_{2i}$, ${\hat{f}}_{2i+1}\cdot {\hat{g}}_{2i+1}$, and $({\hat{f}}_{2i}+{\hat{f}}_{2i+1})\cdot ({\hat{g}}_{2i}+{\hat{g}}_{2i+1})$ all use KDSPs without the KRED unit, producing results with a bit width of 20. The calculation of $2\cdot {\hat{h}}_{2i+1}$ is performed using modular addition, modular subtraction, and bit shifting. Subsequently, the KRED unit is used to obtain the ${128}^{-1}\cdot {\hat{h}}_{2i+1}$ result. For the $2\cdot {\hat{h}}_{2i}$ calculation, the product ${\hat{f}}_{2i+1}\cdot {\hat{g}}_{2i+1}$ is first passed through KRED to bring it into the ring ${\mathbf{R}}_{3329}$. The accumulated factor k is eliminated from this product by the precomputed factor ${k}^{-1}\cdot {\zeta}^{2br\left(i\right)+1}$. Finally, KRED is used again to determine the ${128}^{-1}\cdot {\hat{h}}_{2i}$ result.
For the PWA calculation, the architecture in Figure 5 shows a simple and efficient way to perform PWA using a shift register (SHR) with feedback and a modular adder. During the initial calculation stage, the SHR takes in the PWM output ${\hat{a}}_{00}\circ {\hat{r}}_{0}$ added to the value ${12}^{\prime}b0$. In the following step, the SHR takes in the result of the previous addition, which is then added to the PWM output ${\hat{a}}_{10}\circ {\hat{r}}_{1}$. This architecture requires no change in hardware cost when implementing the different security levels of Kyber, k = 2, 3, and 4.
4.4. LDSPs
NTT-based polynomial multiplication is a widely used and efficient method for implementing LBC systems, offering quasi-linear complexity of $O(N\cdot \log N)$. To compute the NTT quickly using NWC, the prime q is chosen to satisfy the condition $q\equiv 1 \pmod{2N}$. This guarantees the existence of primitive N-th and $2N$-th roots of unity (denoted as $\omega$ and $\zeta$) in the ring ${\mathbf{R}}_{q}$. Consequently, the parameters of LBC algorithms combine a relatively large N with a relatively small modulus of the form $q=k\cdot {2}^{m}+1$, where $2N \mid {2}^{m}$ and $k\ge 3$ is a small integer.
This study analyzes and proposes DSPs (referred to as LDSPs) for the commonly used parameter sets $(N,q)$ of LBC algorithms listed in Table 2. The Kyber team has adopted the parameter pair $(256,3329)$ since round 2 of the NIST competition. It was chosen to address the increased bandwidth requirements resulting from the removal of public key compression. The advantage of q = 3329 is that NTT-based polynomial multiplication can be performed quickly, with smaller noise. One drawback of this parameter set is that direct PWM calculation is not possible.
Alternatively, $(256,7681)$ is the smallest parameter pair that fully supports fast NTT computation while ensuring high security and allowing direct PWM calculation. Because of this advantage, many current hardware implementations have adopted the prime q = 7681 to optimize their systems [11,46,49]. The remaining set, (512/1024, 12,289), is used in the latest version of FALCON, where the polynomial degree N varies with the desired security level of the system.
The primary operations that use the DSP for modular multiplication are the butterfly and PWM. In the BU architecture, a coefficient is multiplied by a precomputed twiddle factor ($\omega$). The accumulated factor k can be merged into $\omega$ as ${k}^{-1}\cdot \omega$. In the case of FALCON, the value of N does not affect the KRED architecture in the LDSP. Algorithm 3 outlines the steps for implementing the LDSP for butterfly computation.
Algorithm 3 LDSP for Butterfly Unit
Input: n-bit integers: b, ${\omega}^{\prime}={k}^{-1}\cdot \omega$, prime q, and small integers k, m.
Output: $r\equiv b\cdot \omega \pmod{q}$
Stage 1: Karatsuba, Vedic multiplier
1: ${t}_{0}={b}_{H}\cdot {\omega}_{H}^{\prime}$;
2: ${t}_{1}={b}_{L}\cdot {\omega}_{L}^{\prime}$;
3: ${t}_{2}=({b}_{H}+{b}_{L})\cdot ({\omega}_{H}^{\prime}+{\omega}_{L}^{\prime})$;
Stage 2: Calculate the product
4: ${p}_{0}={\left({t}_{0}[n-1,\dots ,n+2-m]\right)}_{LUT}$;
5: ${p}_{1}=\{{t}_{0},{t}_{1}\}$;
6: ${p}_{2}={t}_{2}-{t}_{1}-{t}_{0}$;
7: $p={p}_{0}+{p}_{1}+{p}_{2}$;
Stage 3: KRED Reduction
8: ${d}_{0}=p[m-1,\dots ,0]$;
9: ${d}_{1}=p[n+m-1,\dots ,m]$;
10: Return ($k\cdot {d}_{0}-{d}_{1}$)
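Algorithm 3 can be exercised end to end with a behavioral Python model for the Kyber case (n = 12, m = 8, k = 13). The sketch below is an illustration under assumptions: the bit slices are instantiated for these parameters only, the names `ldsp_butterfly_mul` and `scaled_twiddle` are hypothetical, and the output is left as an unnormalized signed integer. The value 17 used in the usage note is the primitive 256th root of unity from the Kyber specification.

```python
Q, K, M = 3329, 13, 8
LUT = [(v << 18) % Q for v in range(64)]       # t0[11:6] * 2^18 mod Q


def scaled_twiddle(w: int) -> int:
    """Precompute w' = k^-1 * w mod Q (needs Python 3.8+ for pow with -1)."""
    return w * pow(K, -1, Q) % Q


def ldsp_butterfly_mul(b: int, w_scaled: int) -> int:
    """Return a signed r congruent to b * w mod Q, where w_scaled = k^-1 * w."""
    # Stage 1: Karatsuba split into 6-bit halves.
    b_h, b_l = b >> 6, b & 0x3F
    w_h, w_l = w_scaled >> 6, w_scaled & 0x3F
    t0 = b_h * w_h
    t1 = b_l * w_l
    t2 = (b_h + b_l) * (w_h + w_l)

    # Stage 2: fold the top bits through the table and sum to a 20-bit value.
    p0 = LUT[t0 >> 6]
    p1 = ((t0 & 0x3F) << 12) | t1
    p2 = (t2 - t1 - t0) << 6
    p = p0 + p1 + p2

    # Stage 3: KRED on the 20-bit value; the factor k cancels k^-1 in w_scaled.
    d0 = p & 0xFF
    d1 = p >> 8
    return K * d0 - d1
```

A call such as `ldsp_butterfly_mul(b, scaled_twiddle(17))` returns a value congruent to `17 * b`; mapping it into the range 0 to q-1 would take at most a couple of conditional additions of q in this model.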

In the PWM operation, one crucial step is multiplying the output by ${N}^{-1}$ so that the postprocessing in the INTT is eliminated. Notably, as the polynomial degree N changes, it consistently satisfies the condition $2N \mid {2}^{m}$, which ensures that the output can be computed directly as,
with $z={2}^{m}/N$ and $k\cdot {2}^{m}\equiv -1 \pmod{q}$. Equation (13) is equivalent to $r\equiv {N}^{-1}\cdot c \pmod{q}$. Additionally, since z is a power of two, the multiplication by z is implemented by bit-shifting the input of the KRED part. Algorithm 4 provides a detailed outline of the LDSP implementation for the PWM operation.
For multiplying coefficients with larger bit widths, an efficient multiplier can be built using the 3-bit and 4-bit Vedic multiplier circuits as the basic building blocks.
Algorithm 4 LDSP for PWM Unit
Input: n-bit integers: a, b, prime q, and small integers k, m, $z={2}^{i}$, degree N.
Output: $r\equiv {N}^{-1}\cdot a\cdot b \pmod{q}$
Stage 1: Karatsuba, Vedic multiplier
1: ${t}_{0}={a}_{H}\cdot {b}_{H}$;
2: ${t}_{1}={a}_{L}\cdot {b}_{L}$;
3: ${t}_{2}=({a}_{H}+{a}_{L})\cdot ({b}_{H}+{b}_{L})$;
Stage 2: Calculate the product
4: ${p}_{0}={\left({t}_{0}[n-1,\dots ,n+2-m-i]\right)}_{LUT}$;
5: ${p}_{1}=\{{t}_{0},{t}_{1}\}$;
6: ${p}_{2}={t}_{2}-{t}_{1}-{t}_{0}$;
7: $p=({p}_{0}+{p}_{1}+{p}_{2})\ll i$;
Stage 3: KRED Reduction
8: ${d}_{0}=p[m-1,\dots ,0]$;
9: ${d}_{1}=p[n+m-1,\dots ,m]$;
10: Return (${d}_{1}-k\cdot {d}_{0}$)
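A behavioral Python model of Algorithm 4 for Kyber illustrates how the reversed subtraction yields the ${128}^{-1}$ scaling. The sketch assumes n = 12, m = 8, k = 13, and i = 1 (so z = 2, matching the 128-point transforms); the 128-entry table `LUT17` and the name `ldsp_pwm_mul` are hypothetical. As a simplification, `d1` is taken as all bits above bit m, so in this model it may be one bit wider than the slice written in the algorithm.

```python
Q, K, M, I = 3329, 13, 8, 1
LUT17 = [(v << 17) % Q for v in range(128)]    # t0[11:5] * 2^17 mod Q


def ldsp_pwm_mul(a: int, b: int) -> int:
    """Return a signed r congruent to 128^-1 * a * b mod Q for 12-bit a and b."""
    # Stage 1: Karatsuba split into 6-bit halves.
    a_h, a_l = a >> 6, a & 0x3F
    b_h, b_l = b >> 6, b & 0x3F
    t0 = a_h * b_h
    t1 = a_l * b_l
    t2 = (a_h + a_l) * (b_h + b_l)

    # Stage 2: fold the top bits through the table, then scale by z = 2^I.
    p0 = LUT17[t0 >> 5]
    p1 = ((t0 & 0x1F) << 12) | t1
    p2 = (t2 - t1 - t0) << 6
    p = (p0 + p1 + p2) << I

    # Stage 3: KRED with reversed subtraction, giving -k * p = -2k * a * b.
    d0 = p & 0xFF
    d1 = p >> M                 # model simplification: not truncated to 12 bits
    return d1 - K * d0
```

Since 128 times 26 equals 3328, which is congruent to -1 modulo 3329, the factor -2k = -26 is exactly ${128}^{-1}$, so no separate scaling step is needed after the inverse transform in this model.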

5. Implementation Results
This study introduces the proposed modular multiplication LDSPs and applies them in the BU and PWMM units of an NTT-based accelerator for LBC cryptosystems. These architectures are synthesized, placed, and routed using the Xilinx Vivado 2021.2 suite. The widely used Xilinx Artix-7 FPGA platform (part number XC7A100tfgg676-3) is selected to ensure a fair comparison with state-of-the-art hardware implementations. We introduce the hardware efficiency ($Eff.$) for a comparative analysis with previous works. A higher $Eff.$ value is desirable, and it is calculated as follows,
Study [47] showed that a standard DSP is equivalent to 100 SLICEs or 400 LUTs. We use this ratio for area comparison with other studies.
Table 3 reports the speed and area of the proposed KDSP architecture for modular multiplication in CRYSTALS-Kyber. The KDSP performs both integer multiplication and modular reduction, operates at a frequency of 283 MHz, and occupies an area of only 77 SLICEs, equivalent to 77% of a DSP. It is worth noting that all other studies listed in the comparison table use a DSP for coefficient multiplication. Among the modular reduction methods, the KRED method gives better implementation results than the others. Notably, in study [36], combining the KRED and LUT methods reached an operating frequency of 300 MHz with an equivalent area of 50 (+400) LUTs. In [46], heavy multiplications are replaced with compact bitwise operations and additions/subtractions based on an optimized Barrett algorithm. That design achieved an operating frequency of 265 MHz and occupied an equivalent area of 81 (+400) LUTs. Study [33] uses a bit-reduction method that is complex and costly in hardware. These results demonstrate that the proposed KDSP architecture further optimizes modular multiplication, improving the $Eff.$ metric by $1.86\times$, $2.25\times$, $2.46\times$, $2.57\times$, $4.1\times$, and $4.2\times$ compared to studies [18,33,36,46,48,50], respectively.
Table 4 presents the results of the KDSP-based BU implementation for Kyber. The proposed BU architecture comprises one KDSP, an adder, and a subtractor for CT and GS butterfly operations. It reaches an operating frequency of 283 MHz and occupies 104 SLICEs, roughly the area of one DSP. The conventional unified BU architecture uses two DSPs for distinct multiplications, plus adders and subtractors for the CT and GS calculations. Studies [33, 48] used this architecture and recorded low operating frequencies and high hardware resources: 159 MHz with 774 (+800) LUTs and 161 MHz with 647 (+800) LUTs, respectively. Other studies have built a reduced architecture using only one DSP for multiplication. With this downsized BU architecture, study [46] reaches an operating frequency of 265 MHz and consumes 186 (+400) LUTs. Note that the multiplication itself affects both speed and area. In study [51], a standard implementation of Kyber's reference code, modular multiplication consumed more DSPs because it applies both the Barrett and Montgomery algorithms. The proposed KDSP-based BU architecture significantly improves the $Eff.$ index compared to studies [18, 27, 33, 46, 48, 50], by $2.06\times$, $2.52\times$, $3.01\times$, $3.26\times$, $8.39\times$, and $9.22\times$, respectively.
Table 5 shows the effectiveness of the DSP design method for the core operations, modular multiplication and butterfly, in the NTT accelerator of the LBC cryptosystem. We have developed LDSP and BU architectures for primes of the form $q = k \cdot 2^{m} + 1$, specifically 3329, 7681, and 12,289. The LDSPs operate at 283 MHz, 272 MHz, and 256 MHz for q = 3329, 7681, and 12,289, respectively, and occupy 77%, 87%, and 101% of a DSP, that is, approximately one DSP or less. The BUs occupy 104%, 120%, and 136% of a DSP for the same q values and operate at 283 MHz, 260 MHz, and 250 MHz, respectively. All LDSP and BU architectures use fewer than 400 LUTs, the equivalent of one DSP.
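
To make the form $q = k \cdot 2^{m} + 1$ concrete, the sketch below factors $q - 1$ into an odd part k and a power of two for the three primes. The helper name is hypothetical, and taking the largest possible m is an assumption; the section does not restate which (k, m) pair each LDSP uses.

```python
def decompose(q):
    """Return (k, m) with q == k * 2**m + 1 and k odd."""
    k = q - 1
    m = 0
    while k % 2 == 0:      # strip factors of two from q - 1
        k //= 2
        m += 1
    return k, m


if __name__ == "__main__":
    for q in (3329, 7681, 12289):
        k, m = decompose(q)
        print(q, k, m, k * (1 << m) + 1 == q)
```

A small k keeps the multiplication by k in the reduction step cheap, since it can be built from a few shifts and additions.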
Table 6 presents the results of implementing the PWMM architecture in Kyber. Other studies use BRAM [46] or FIFO [47] to temporarily store the accumulation results, which consumes additional hardware resources for the PWA operation. To optimize the NTT accelerator pipelines, our architecture is designed to handle both PWM and PWA operations. Two architectures, one-PWMM and two-PWMM, are designed; they achieve an operating frequency of up to 275 MHz, with 128 Clks/1123 LUTs and 64 Clks/2297 LUTs, respectively. More precisely, they occupy 386 and 797 SLICEs, fewer than the 400 and 800 SLICEs equivalent to 4 and 8 DSPs, respectively. In studies [34, 35, 36], PWM is executed on a hardware architecture shared with the BUs. The highest operating frequencies are achieved in studies [35, 36], reaching 300 MHz and 303 MHz, respectively, primarily due to efficient modular reduction modules. In study [46], an architecture for two-PWM calculation was proposed for Kyber, using 8 DSPs for the multiplications; it operates at 265 MHz and consumes 749 (+3200) LUTs. Our architecture significantly reduces hardware resources and improves efficiency. The proposed PWMM design reduces the area-time product (ATP) by 33.8%, 47.8%, 67.3%, and 71.2% compared to [34, 35, 36, 46], respectively.
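
For reference, an area-time product can be computed along the following lines. This is a sketch under stated assumptions: ATP is taken as area in LUTs multiplied by latency in microseconds (cycles divided by frequency in MHz), which may differ from the exact area metric used for Table 6, and the function names are illustrative. The inputs are the cycle counts, LUT counts, and frequency quoted above for the two PWMM designs.

```python
def area_time_product(area_luts, cycles, freq_mhz):
    """ATP as area * latency, with latency in microseconds (assumed definition)."""
    latency_us = cycles / freq_mhz
    return area_luts * latency_us


def reduction_percent(atp_reference, atp_proposed):
    """Relative ATP reduction of a proposed design against a reference, in percent."""
    return 100.0 * (atp_reference - atp_proposed) / atp_reference


if __name__ == "__main__":
    one_pwmm = area_time_product(1123, 128, 275.0)
    two_pwmm = area_time_product(2297, 64, 275.0)
    print(one_pwmm, two_pwmm)
```

Under this definition the two designs have similar ATP: the two-PWMM variant halves the cycle count at roughly twice the area.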
6. Conclusions
In this paper, we present a method for designing compact and efficient specialized hardware for modular multiplication in LBC systems. Using the proposed approach, we have designed and implemented core architectures within NTT accelerators for polynomial multiplication. The BU architecture is optimized to eliminate the need for post-processing in INTT entirely. Consequently, the BU architecture, which executes both NTT and INTT, has a footprint equivalent to that of a conventional BU architecture. Additionally, we propose a hardware design for implementing PWMM for the Kyber algorithm. The proposed architecture can perform both PWM and PWA calculations for all security levels without requiring additional temporary memory, such as RAM or FIFO buffers.
Furthermore, we have demonstrated the effectiveness of the proposed method by applying it to common prime parameters q used in various LBC algorithms. FPGA-based implementation results show that the KDSP, LDSP, BU, and PWMM architectures outperform existing implementations in hardware efficiency. The findings of this paper can further optimize NTT accelerators with pipelined or iterative configurations. Therefore, the proposed architectures represent an important step toward designing compact and high-performance post-quantum lattice-based cryptography systems on hardware platforms.