1. Introduction
With the explosion of Internet-of-Things-based data and the widespread use of machine learning (ML) as a cloud-based service, securing private user data during ML inference has become a pressing concern for cloud-service providers. Fully homomorphic encryption (FHE) is a promising solution for preserving sensitive information in cloud computing because it provides strong defense mechanisms and enables direct computation on encrypted data (ciphertext) while preserving confidentiality [1,2]. However, the requirement for high degrees of security leads to complex parameter settings, resulting in expensive computation on large ciphertexts, which limits the practical realization of FHE-based applications. Cloud-side analytics can be resource-intensive and time-consuming, making it necessary to develop cryptographic accelerators to facilitate the deployment of real-world applications. Cryptographic accelerators are designed to reduce the computational overhead of homomorphic functions, thus enabling faster and more efficient computation on encrypted data. The development of such accelerators is crucial to unlock the full potential of FHE-based solutions, making them more accessible to a wider range of users and supporting the secure processing of sensitive data in real-world settings.
Figure 1 illustrates an end-to-end FHE-based cryptosystem with the primary homomorphic operations performed in the cloud server.
FHE cryptographic protocols typically involve integer- and lattice-based schemes. The most efficient lattice-based schemes rely on the ring learning with errors (RLWE) problem, which provides strong security guarantees and the desired performance [3]. In RLWE-based FHE protocols, the input messages are encrypted by adding noise, and the generated ciphertexts are composed of two polynomial rings. The growth of noise through homomorphic computations limits the circuit depth, and the selection of FHE parameters must balance the security requirements with computational complexity [4]. Parameter selection primarily involves the polynomial degree N and the integer modulus Q, and at least 128-bit security is typically required to guard against unpredictable attacks [5]. To support greater multiplicative depth, N increases proportionally. High-circuit-depth FHE schemes inevitably have the drawback of large ciphertexts, which leads to expensive computations, high-bandwidth data movement, and large storage-space requirements.
Primary homomorphic operations involve addition, multiplication, and permutation of ciphertexts. Homomorphic multiplication between ciphertexts is often computationally expensive because of the convolution of polynomial coefficients.
Figure 2 shows a general diagram of the multiplication between two ciphertexts, which dominates homomorphic operations. Initially, a ciphertext consists of two component polynomials. The ciphertext multiplication results in a tuple of three polynomials, making further computation challenging. Thus, an operation is required to revert the ciphertext to its original form. An expensive operation known as key switching is required to relinearize the ciphertext. However, key switching is computationally intensive, with number theoretic transform (NTT) and inverse NTT (INTT) operations being dominant. Therefore, developing key switching hardware accelerators is essential for speeding up homomorphic multiplication and realizing FHE-based applications.
1.1. Related Works
While FHE holds potential, its primary limitation is inefficiency, which stems from two factors: complex polynomial operations and time-consuming ciphertext management. To tackle the computational and memory demands of homomorphic functions, various optimization and acceleration efforts are underway.
Table 1 presents FHE accelerators, highlighting the hardware utilized and the features of the accelerators. Initially, FHE acceleration depended on general-purpose hardware features. However, CPUs lack the capacity to effectively harness FHE's inherent parallelism [6]. GPU-based implementations tap into this parallelism, but the GPU's extensive floating-point units remain underused because FHE tasks mainly involve integer operations [7,8,9]. Furthermore, neither CPUs nor GPUs offer sufficient main-memory bandwidth to cope with the data-intensive nature of FHE workloads.
To enhance FHE scheme performance, researchers have been exploring custom hardware accelerators using ASIC and FPGA technologies. ASIC solutions [10,11,12,13] show promise, as they surpass CPU/GPU implementations and bridge the performance gap between plaintext and ciphertext computations. However, to accommodate large on-chip memory, expensive advanced technology nodes such as 7 nm or 12 nm are required for ASIC implementations. Furthermore, designing and fabricating these ASIC proposals demands significant engineering time and high non-recurring costs. Since FHE algorithms are not standardized and continue to evolve, any changes would necessitate major ASIC redesign efforts. Conversely, FPGA solutions are more cost-effective than ASICs, offer rapid prototyping and design updates, and are better equipped to adapt to future FHE algorithm modifications.
Several studies have proposed FPGA-accelerated architecture designs for FHE [14,15,16,17,18,19]. Notably, Riazi et al. introduced HEAX, a hardware architecture that accelerates CKKS-based HE on Intel FPGA platforms and supports low parameter sets [14]. However, the architecture requires high input/output and memory interface bandwidths, as well as costly internal memory, making it difficult to place and route multiple cores on the target FPGA platform. Han et al. proposed coxHE, an FPGA acceleration framework for FHE kernels using the high-level synthesis (HLS) design flow [16]. Targeting key switching operations, coxHE examined data dependence to minimize interdependence between data, maximizing parallel computation and algorithm acceleration. Mert et al. proposed Medha, a programmable instruction-set architecture that accelerates cloud-side RNS-CKKS operations [17]. Medha featured seven residue polynomial arithmetic units (RPAUs), a memory-conservative design, and support for multiple parameter sets in a single hardware accelerator using a divide-and-conquer technique. However, these three FPGA-based implementations only support small parameter sets, which are insufficient for bootstrapping. Recently, Yang et al. proposed Poseidon, an FPGA-based FHE accelerator supporting bootstrapping on the modern Xilinx U280 FPGA [18]. Poseidon employed several optimization techniques to enhance resource efficiency. Similarly, Agrawal et al. presented FAB, an FPGA-accelerated design that balances memory and computing consumption for bootstrapping with large homomorphic parameters [19]. FAB accelerates CKKS bootstrapping using a carefully designed datapath for key switching, taking full advantage of 43 MB of on-chip storage. However, the design's extensive parallelism consumes numerous logic elements, especially with larger parameter sets. Additionally, inefficient scheduling can result in redundant resource consumption and complex workflow synchronization, leading to suboptimal performance. In this work, we adopt a pipelined KeySwitch design to simplify scheduling and target a high-throughput implementation. Our design method leverages the FPGA fabric's programmable logic elements and enhances on-chip memory utilization.
1.2. Our Main Contributions
This study presents a comprehensive hardware architecture for the KeySwitch accelerator design, which operates in a highly pipelined manner to speed up CKKS-based FHE schemes. Built on compact NTT and INTT engines [20], the KeySwitch module efficiently employs on-chip resources. Importantly, our design approach significantly reduces internal memory consumption, allowing on-chip memory to hold temporary data. The design executes subfunctions concurrently in a pipelined and parallel manner to boost throughput. We demonstrate an example design supporting a three-level parameter set. The proposed KeySwitch module was evaluated on the Xilinx UltraScale+ XCU250 FPGA platform, and we provide an in-depth discussion of the design methodology and area breakdown for a better understanding of the key operations. Compared with the most closely related study, our KeySwitch module achieves a 1.6x higher throughput rate and superior hardware efficiency.
The remainder of this paper is organized as follows: Section 2 provides an overview of the underlying operations of RLWE-based HE schemes. Section 3 describes the key switching algorithm in detail, and Section 4 presents the design of our KeySwitch module. Section 5 presents the experimental results, compares our approach with related works, and discusses our findings. Finally, Section 6 concludes the study.
2. Background
CKKS-based HE schemes have been extensively studied to perform meaningful computations on encrypted data of real and complex numbers. In the encrypted data domain, the ciphertext often consists of two N-degree polynomials, and each coefficient is an integer modulo Q. Therefore, the underlying homomorphic operations in RLWE-based HE schemes share similarities, enabling the development of a single hardware accelerator that can support multiple HE instances. Our study primarily focuses on accelerating CKKS-based homomorphic encryption; however, the operations described at the ciphertext level are broadly applicable to almost all lattice-based homomorphic encryption schemes.
2.1. Residue Number System
The Chinese remainder theorem (CRT) enables a polynomial in ${R}_{Q}$ to be represented as an RNS decomposition with smaller pairwise coprimes such that $Q={\prod}_{i=0}^{L}{q}_{i}$ [21]. This enables a polynomial $\mathit{a}$ in ${R}_{Q}$ to be represented in RNS channels as a set of polynomial components. For instance, considering an RNS representation with three pairwise coprime moduli ${q}_{0},{q}_{1},{q}_{2}$, the polynomial $\mathit{a}$ can be represented as a set of three polynomials: $\mathit{a}\equiv ({\mathit{a}}_{0},{\mathit{a}}_{1},{\mathit{a}}_{2})\ \mathrm{mod}\ ({q}_{0},{q}_{1},{q}_{2})$, where each ${\mathit{a}}_{i}$ is a polynomial in ${R}_{{q}_{i}}$. This technique can significantly reduce the magnitude of coefficients and improve the performance of arithmetic operations in HE.
We denote a polynomial component as an element of the ring ${R}_{{q}_{i}}={Z}_{{q}_{i}}\left[X\right]/({X}^{N}+1)$.
Thus, arithmetic operations on large integer coefficients can be performed for each smaller modulus without any loss of precision.
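As a minimal illustration, the RNS decomposition and its CRT reconstruction can be sketched in Python with toy scalar moduli standing in for polynomial coefficients (the helper names are ours, not from any HE library):

```python
from math import prod

def to_rns(a, moduli):
    """Decompose an integer a mod Q into its RNS residues."""
    return [a % q for q in moduli]

def from_rns(residues, moduli):
    """CRT reconstruction: recover a mod Q from its residues."""
    Q = prod(moduli)
    a = 0
    for r, q in zip(residues, moduli):
        g = Q // q                      # g_i = Q / q_i
        a += r * g * pow(g, -1, q)      # r_i * g_i * [g_i^{-1}]_{q_i}
    return a % Q

moduli = [97, 101, 103]                 # pairwise coprime q_0, q_1, q_2
a, b = 123456, 654321
# arithmetic is performed channel-wise on the small residues
c_rns = [(x * y) % q for x, y, q in zip(to_rns(a, moduli), to_rns(b, moduli), moduli)]
assert from_rns(c_rns, moduli) == (a * b) % prod(moduli)
```

In hardware, this channel independence is what allows each residue polynomial to be processed by its own arithmetic unit in parallel.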
2.2. Gadget Decomposition
Let $q$ be the modulus and $\mathit{g}=({g}_{0},{g}_{1},\dots ,{g}_{d-1})\in {Z}^{d}$ be a gadget vector. A gadget decomposition [22], denoted by ${\mathit{g}}^{-1}:{Z}_{q}\to {Z}^{d}$, maps an integer $a\in {Z}_{q}$ into a vector $\overline{\mathit{a}}={\mathit{g}}^{-1}\left(a\right)\in {Z}_{q}^{d}$ such that $\langle {\mathit{g}}^{-1}\left(a\right),\mathit{g}\rangle =a$ (mod $q$). By extending the domain of the gadget decomposition ${\mathit{g}}^{-1}$ from ${Z}_{q}$ to ${R}_{q}$, we can apply it to a polynomial $\mathit{a}={\sum}_{i\in \left[N\right]}{a}_{i}\cdot{X}^{i}$ in ${R}_{q}$ by mapping each coefficient ${a}_{i}$ to a vector ${\mathit{g}}^{-1}\left({a}_{i}\right)\in {Z}_{q}^{d}$ and then replacing ${a}_{i}$ with ${\mathit{g}}^{-1}\left({a}_{i}\right)\cdot{X}^{i}$ in the polynomial expression (${\mathit{g}}^{-1}:{R}_{q}\to {R}^{d}$ with $\mathit{a}={\sum}_{i\in \left[N\right]}{a}_{i}\cdot{X}^{i}\to {\sum}_{i\in \left[N\right]}{\mathit{g}}^{-1}\left({a}_{i}\right)\cdot{X}^{i}$). This extension was proposed by [23].
RNS representation can also be integrated with prime decomposition, as exemplified in [24]. An element $\mathit{a}\in {R}_{Q}$ can be represented in RNS form as ${({\left[\mathit{a}\right]}_{{q}_{i}})}_{0\le i\le l}\in {\prod}_{i=0}^{l}{R}_{{q}_{i}}$. The inverse mapping, which allows the retrieval of the original element $\mathit{a}$ from its RNS form, is defined by the formula $\mathit{a}={\sum}_{i=0}^{l}{\mathit{a}}_{i}\cdot{g}_{i}\cdot{\left[{g}_{i}^{-1}\right]}_{{q}_{i}}$ (mod $Q$), where ${g}_{i}=\frac{Q}{{q}_{i}}$ [14].
2.3. Key Generation
The client begins by generating a secret key $\mathit{sk}$, which is a polynomial in ${R}_{Q}$. Then, they generate a uniformly random polynomial $\mathit{r}$ from $U\left({R}_{Q}\right)$ and an error or noise polynomial $\mathit{e}$ from a distribution $\chi $. The corresponding public key is generated as $\mathit{pk}=(\mathit{b},\mathit{r})\in {R}_{Q}^{2}$, where $\mathit{b}$ is obtained by negating the inner product of $\mathit{r}$ and the fixed secret vector $\mathit{s}$ and adding the error polynomial $\mathit{e}$, that is, $\mathit{b}=-\langle \mathit{r},\mathit{s}\rangle +\mathit{e}$.
Let ${\mathit{sk}}^{\prime}$ be a different key. We sample ${\mathit{D}}_{1}\leftarrow U\left({R}_{Q}^{L}\right)$ and $\mathit{e}\leftarrow {\chi}^{L}$. Using the gadget vector $\mathit{g}$, we compute ${\mathit{D}}_{0}=-{\mathit{sk}}^{\prime}\cdot{\mathit{D}}_{1}+\mathit{sk}\cdot\mathit{g}+\mathit{e}$ (mod $Q$) and return a switching key (SwK) as $\mathbf{SwK}=({\mathit{D}}_{0,j}\,|\,{\mathit{D}}_{1,j})$, in which ${\mathit{D}}_{j}$ is a vector of polynomials ${d}_{i}\in {\prod}_{i=0}^{l}{R}_{{q}_{i}}$ [23].
2.4. Encryption and Decryption
CKKS encodes a vector of at most $N/2$ real values into a plaintext polynomial $\mathit{m}$ of $N$ coefficients, modulo $q$. Using the generated public key $\mathit{pk}=(\mathit{b},\mathit{r})$, the client encrypts an input message and produces a noisy ciphertext $\mathit{ct}=({\mathit{c}}_{0},{\mathit{c}}_{1})=(\mathit{b}\cdot{\mathit{r}}_{1}+{\mathit{e}}_{0}+\mathit{m},\ \mathit{r}\cdot{\mathit{r}}_{1}+{\mathit{e}}_{1})\in {R}_{Q}^{2}$, where ${\mathit{r}}_{1}$ is another random vector and ${\mathit{e}}_{0}$ and ${\mathit{e}}_{1}$ are other noise vectors. After homomorphic computations on ciphertexts, the client obtains the results in the encrypted form ${\mathit{ct}}^{\prime}=({\mathit{c}}_{0}^{\prime},{\mathit{c}}_{1}^{\prime})$ and uses the secret key to recover the desired information. Decryption is performed using ${\mathit{m}}^{\prime}={\mathit{c}}_{0}^{\prime}+{\mathit{c}}_{1}^{\prime}\cdot\mathit{sk}\approx \mathit{m}+{\mathit{e}}^{\prime}$ with a small error.
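To make the flow concrete, the following Python sketch runs a stripped-down RLWE-style encrypt/decrypt cycle with toy, insecure parameters under the convention $b=-r\cdot sk+e$ and ${m}^{\prime}={c}_{0}+{c}_{1}\cdot sk$; the ternary noise distribution, small random $r_1$, and helper names are our illustrative choices, not the CKKS specification:

```python
import random

N, Q = 8, 12289                 # toy ring degree and modulus; no real security

def poly_mul(a, b):
    """Negacyclic convolution in Z_Q[X]/(X^N + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] = (c[i + j] + a[i] * b[j]) % Q
            else:                # X^N = -1 wraps coefficients with a sign flip
                c[i + j - N] = (c[i + j - N] - a[i] * b[j]) % Q
    return c

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def small_poly():                # ternary secret/noise sample
    return [random.choice([-1, 0, 1]) % Q for _ in range(N)]

# key generation: b = -r*sk + e
sk = small_poly()
r = [random.randrange(Q) for _ in range(N)]
b = poly_add([(-x) % Q for x in poly_mul(r, sk)], small_poly())

# encryption of a small message: ct = (b*r1 + e0 + m, r*r1 + e1)
m = [5, 0, 3, 0, 0, 0, 0, 0]
r1 = small_poly()
c0 = poly_add(poly_add(poly_mul(b, r1), small_poly()), m)
c1 = poly_add(poly_mul(r, r1), small_poly())

# decryption: m' = c0 + c1*sk = m + (small noise terms)
dec = poly_add(c0, poly_mul(c1, sk))
```

Expanding the decryption shows ${m}^{\prime}=m+e\cdot{r}_{1}+{e}_{0}+{e}_{1}\cdot sk$, so each recovered coefficient sits within a small distance of the plaintext coefficient.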
2.5. Homomorphic Operations
Homomorphic addition: Taking ciphertexts $\mathit{a}=({\mathit{a}}_{0},{\mathit{a}}_{1})$ and $\mathit{b}=({\mathit{b}}_{0},{\mathit{b}}_{1})$ for example, their homomorphic addition is computed by coefficient-wise adding their co-pairs of RNS-element polynomials: $\mathit{a}+\mathit{b}=({\mathit{a}}_{0}+{\mathit{b}}_{0},\ {\mathit{a}}_{1}+{\mathit{b}}_{1})$.
Homomorphic multiplication: For ciphertexts $\mathit{a}=({\mathit{a}}_{0},{\mathit{a}}_{1})$ and $\mathit{b}=({\mathit{b}}_{0},{\mathit{b}}_{1})$, their homomorphic multiplication is performed by multiplications between their RNS elements: $\mathit{a}\cdot\mathit{b}=({\mathit{a}}_{0}\cdot{\mathit{b}}_{0},\ {\mathit{a}}_{0}\cdot{\mathit{b}}_{1}+{\mathit{a}}_{1}\cdot{\mathit{b}}_{0},\ {\mathit{a}}_{1}\cdot{\mathit{b}}_{1})$. This dyadic multiplication produces a special ciphertext component ${\mathit{a}}_{1}\cdot{\mathit{b}}_{1}$ for a different secret key (that is, ${\mathit{sk}}^{2}$). Subsequently, key switching is performed to relinearize the quadratic form of the homomorphic multiplication result and obtain a linear ciphertext of the original form.
Key switching: RLWE ciphertexts can be transformed from one secret key to another using a key switching computation with SwK. This method enables the transformation of a ciphertext decryptable by $\mathit{sk}$ into a new ciphertext under a different secret key ${\mathit{sk}}^{\prime}$ with an additional error ${\mathit{e}}_{KS}$. The SwK can be considered as $d$ encryptions of $\mathit{sk}\cdot{g}_{i}$ under the different secret key ${\mathit{sk}}^{\prime}$, that is, $\mathbf{SwK}\cdot(1,{\mathit{sk}}^{\prime})\approx \mathit{sk}\cdot\mathit{g}$ (mod $Q$) [23].
Key switching ($\mathit{ct},\mathbf{SwK}$) returns ${\mathit{ct}}^{\prime}=({\mathit{c}}_{0},0)+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot\mathbf{SwK}$ (mod $Q$), where $\mathit{ct}=({\mathit{c}}_{0},{\mathit{c}}_{1})$ and $\mathbf{SwK}=({\mathit{D}}_{0}\,|\,{\mathit{D}}_{1})$. In detail:
${\mathit{ct}}^{\prime}=({\mathit{c}}_{0},0)+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot\mathbf{SwK}$
$=({\mathit{c}}_{0},0)+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot({\mathit{D}}_{0},{\mathit{D}}_{1})$
$=(({\mathit{c}}_{0}+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot{\mathit{D}}_{0}),({\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot{\mathit{D}}_{1}))=({\mathit{c}}_{0}^{\prime},{\mathit{c}}_{1}^{\prime})$
→ ${\mathit{m}}^{\prime}={\mathit{c}}_{0}^{\prime}+{\mathit{c}}_{1}^{\prime}\cdot{\mathit{sk}}^{\prime}$
$={\mathit{c}}_{0}+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot{\mathit{D}}_{0}+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot{\mathit{D}}_{1}\cdot{\mathit{sk}}^{\prime}$
$={\mathit{c}}_{0}+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot({\mathit{D}}_{0}+{\mathit{sk}}^{\prime}\cdot{\mathit{D}}_{1})$
$={\mathit{c}}_{0}+{\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right)\cdot(\mathit{sk}\cdot\mathit{g}+\mathit{e})$
$=\langle \mathit{ct},(1,\mathit{sk})\rangle +{\mathit{e}}_{KS}$, where ${\mathit{e}}_{KS}=\langle {\mathit{g}}^{-1}\left({\mathit{c}}_{1}\right),\mathit{e}\rangle $
→ ${\mathit{m}}^{\prime}=\mathit{m}+{\mathit{e}}_{KS}$.
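This derivation can be checked numerically in a stripped-down scalar setting (Python; the noise is set to zero so the identity holds exactly, and a base-B gadget is used — all names and parameter values here are illustrative):

```python
import random

q, base, d = 2**16, 2**4, 4                 # toy sizes; q = base**d
sk, sk_new = 12345, 54321                   # old / new secret keys as scalars
g = [base**i for i in range(d)]             # gadget vector (1, B, ..., B^{d-1})

# Noise-free switching key: D0 = -sk_new*D1 + sk*g (mod q), so that
# D0 + sk_new*D1 = sk*g exactly.
D1 = [random.randrange(q) for _ in range(d)]
D0 = [(-sk_new * D1[i] + sk * g[i]) % q for i in range(d)]

c0, c1 = 111, 22222                          # "ciphertext" decrypting to c0 + c1*sk

digits, t = [], c1                           # g^{-1}(c1): base-B digits of c1
for _ in range(d):
    digits.append(t % base)
    t //= base

c0_new = (c0 + sum(x * y for x, y in zip(digits, D0))) % q
c1_new = sum(x * y for x, y in zip(digits, D1)) % q

# the switched ciphertext decrypts under sk_new to the same value
assert (c0_new + c1_new * sk_new) % q == (c0 + c1 * sk) % q
```

With real noise, the recovered value differs by the inner product of the small digits with the noise vector, which is exactly the ${\mathit{e}}_{KS}$ term above.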
3. Key Switching Algorithm
Algorithm 1 provides a detailed description of the homomorphic multiplication with a key switching operation, which is a crucial building block of the SEAL HE library [6]. One remarkable feature of homomorphic multiplication is that the NTT is a linear transformation, and optimized HE implementations typically store polynomials in the NTT form across operations instead of in their coefficient form. Therefore, the first phase of homomorphic multiplication involves dyadic multiplication. Moreover, the use of the Karatsuba algorithm, a fast multiplication technique, can reduce the total number of coefficient-wise multiplications from four to three. Dyadic multiplication produces a tuple of polynomials ($c{t}_{0,i}$, $c{t}_{1,i}$, $c{t}_{2,i}$), where $c{t}_{2,i}$ is a special ciphertext component encrypted under the square of the secret key; that is, ($1,\mathit{s},{\mathit{s}}^{2}$). To recombine the homomorphic products and obtain a linear ciphertext in the form ($1,\mathit{s}$), key switching is required to make $c{t}_{2,i}$ decryptable with the original secret key. The homomorphic multiplication is computed as $\mathit{c}=(c{t}_{0},c{t}_{1})+\mathrm{KeySwitch}(c{t}_{2},\mathbf{SwK})$, which involves key switching using SwK.
Key switching is a computationally intensive operation that typically dominates the cost of homomorphic multiplication. The key switching operation requires two inputs: the polynomial component $\mathit{c}{\mathit{t}}_{2,i}$ and the key switching key matrix $\mathbf{SwK}$. The polynomial component $\mathit{c}{\mathit{t}}_{2,i}$ is represented in RNS form as $(l+1)$ residue polynomials, whereas the key switching key matrix $\mathbf{SwK}=({\mathit{D}}_{0,j}\,|\,{\mathit{D}}_{1,j})$ is a tensor of $(l+1)$ matrices of $(L+2)$ residue polynomials. RNS decomposition is used to enable fast key switching with a highly parallel and pipelined implementation.
Algorithm 1 shows that key switching involves $(l+1)$ INTT and ${(l+1)}^{2}$ NTT operations for raising the modulus, and two INTT and $2(l+1)$ NTT operations for modulus switching. Thus, key switching dominates the homomorphic multiplication process in terms of computational cost. However, at the $l$-depth level, the main costs are memory expense and data movement. To illustrate the efficient utilization of the on-chip resources on the FPGA platform, we used a parameter set of five prime moduli as a running example. The implementation results indicate that the proposed approach maximizes the utilization of hardware resources.
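Counting directly from the loop bounds of Algorithm 1 (indices running from 0 to $l$ inclusive), the transform counts can be tabulated with a small helper (a sketch; the function name is ours):

```python
def keyswitch_transform_counts(l):
    """NTT/INTT operation counts implied by the key switching loop structure."""
    intt_raise = l + 1                # one INTT per input residue polynomial
    ntt_raise = (l + 1) * (l + 1)     # per i: l NTTs for j != i, plus one for q_sp
    intt_switch = 2                   # one INTT per output component (k = 0, 1)
    ntt_switch = 2 * (l + 1)          # per component: one NTT per modulus q_i
    return {"INTT_raise": intt_raise, "NTT_raise": ntt_raise,
            "INTT_switch": intt_switch, "NTT_switch": ntt_switch}

# running example with l = 3 (four ordinary moduli plus the special modulus)
counts = keyswitch_transform_counts(3)
```

The quadratic NTT count in the modulus-raising phase is why the transforms dominate the key switching cost as the depth grows.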
Algorithm 1 Homomorphic multiplication algorithm with a key switching operation [6]
Input: $\mathit{a}=({a}_{0},{a}_{1})$ and $\mathit{b}=({b}_{0},{b}_{1})\in {\left({\prod}_{i=0}^{l}{q}_{i}\right)}^{2}$, $\mathbf{SwK}=({\mathit{D}}_{0,j}\,|\,{\mathit{D}}_{1,j})\in {\left({q}_{sp}{\prod}_{j=0}^{L}{q}_{j}\right)}^{2}$ where ${\mathit{D}}_{\mathit{j}}={d}_{i}\in {\prod}_{i=0}^{l}{q}_{i}$
Output: $\mathit{c}=({c}_{0},{c}_{1})\in {\left({\prod}_{i=0}^{l}{q}_{i}\right)}^{2}$
1: /* Dyadic multiplication */
2: for $i=0$ to $l$ do
3:   $c{t}_{0,i}={a}_{0,i}\odot {b}_{0,i}$
4:   $c{t}_{1,i}={a}_{0,i}\odot {b}_{1,i}+{a}_{1,i}\odot {b}_{0,i}$
5:   $c{t}_{2,i}={a}_{1,i}\odot {b}_{1,i}$
6: end for
7: /* Key switching */
8: for $i=0$ to $l$ do ▹ Modulus raising
9:   $\tilde{a}\leftarrow $ INTT${}_{{q}_{i}}\left(c{t}_{2,i}\right)$
10:   for $j=0$ to $l$ do
11:     if $i\ne j$ then
12:       $\tilde{b}\leftarrow $ Mod$(\tilde{a},{q}_{j})$
13:       $\overline{b}\leftarrow $ NTT${}_{{q}_{j}}\left(\tilde{b}\right)$
14:     else
15:       $\overline{b}\leftarrow c{t}_{2,i}$
16:     end if
17:     ${\overline{c}}_{0,j}\leftarrow {\overline{c}}_{0,j}+\overline{b}\odot {d}_{0,i,j}$ (mod ${q}_{j}$)
18:     ${\overline{c}}_{1,j}\leftarrow {\overline{c}}_{1,j}+\overline{b}\odot {d}_{1,i,j}$ (mod ${q}_{j}$)
19:   end for
20:   $\tilde{b}\leftarrow $ Mod$(\tilde{a},{q}_{sp})$
21:   $\overline{b}\leftarrow $ NTT${}_{{q}_{sp}}\left(\tilde{b}\right)$
22:   ${\overline{c}}_{0,l+1}\leftarrow {\overline{c}}_{0,l+1}+\overline{b}\odot {d}_{0,i,L+1}$ (mod ${q}_{sp}$)
23:   ${\overline{c}}_{1,l+1}\leftarrow {\overline{c}}_{1,l+1}+\overline{b}\odot {d}_{1,i,L+1}$ (mod ${q}_{sp}$)
24: end for
25: for $k=0$ to 1 do ▹ Modulus switching
26:   $\tilde{r}\leftarrow $ INTT${}_{{q}_{sp}}\left({\overline{c}}_{k,l+1}\right)$
27:   for $i=0$ to $l$ do
28:     $r\leftarrow $ Mod$(\tilde{r},{q}_{i})$
29:     $\overline{r}\leftarrow $ NTT${}_{{q}_{i}}\left(r\right)$
30:     ${c}_{k,i}^{\prime}\leftarrow {\overline{c}}_{k,i}-\overline{r}$ (mod ${q}_{i}$)
31:     ${c}_{k,i}\leftarrow {\left[{q}_{sp}^{-1}\right]}_{{q}_{i}}\cdot{c}_{k,i}^{\prime}+c{t}_{k,i}$ (mod ${q}_{i}$)
32:   end for
33: end for
34: return $\mathit{c}=({c}_{0},{c}_{1})$

4. KeySwitch Hardware Architecture
Figure 3 illustrates the pipelined architecture of the KeySwitch module with an initial depth of $L=3$. The KeySwitch module consumes the third component of the dyadic multiplication result and generates the relinearized ciphertext. The KeySwitch design is divided into two functional modules with a pipelined connection: ModRai and ModSwi. The two modules have similar structures, and we number their sequential operations for clarity, making it easier to track the description of their operations.
The key switching operation is computationally intensive, with NTT and INTT operations being dominant. In an FHE setting, ciphertext polynomials are represented in the NTT form by default to reduce the number of NTT/INTT conversions. However, this format is not compatible with the rescaling operation that occurs during modulus switching. Therefore, the key switching process involves performing INTT and NTT operations before and after rescaling, respectively. Consequently, the primary computational costs associated with key switching are those of the NTT and INTT operations. Conventionally, NTT and INTT units consume a large amount of internal memory to store precomputed twiddle factors (TFs). In this study, the proposed KeySwitch module employs in-place NTT and INTT hardware designs that aim to reduce on-chip memory usage [20]. In particular, each NTT and INTT unit stores several TF bases of the associated modulus and utilizes a built-in twiddle factor generator (TFG) to generate all other factors. Based on the design method of [20] and an exploration of the key switching execution, we designed different NTT modules for the associated moduli through pipeline stages. By adopting this approach, the proposed KeySwitch module utilizes hardware resources more efficiently.
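For reference, the butterfly structure that such NTT/INTT units implement can be sketched in software (a plain Python radix-2 implementation over $Z_q$; it computes cyclic transforms and omits the negacyclic pre/post-twisting and all hardware-specific details such as the TFG):

```python
def bit_reverse(a):
    """In-place bit-reversal permutation for a power-of-two-length list."""
    n, j = len(a), 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    return a

def ntt(a, q, w):
    """Iterative radix-2 Cooley-Tukey NTT over Z_q; w is a primitive
    n-th root of unity mod q."""
    a = bit_reverse(list(a))
    n = len(a)
    length = 2
    while length <= n:
        wl = pow(w, n // length, q)        # stage twiddle factor
        for start in range(0, n, length):
            wn = 1
            for i in range(start, start + length // 2):
                u, v = a[i], a[i + length // 2] * wn % q
                a[i] = (u + v) % q          # butterfly add
                a[i + length // 2] = (u - v) % q  # butterfly subtract
                wn = wn * wl % q
        length <<= 1
    return a

def intt(a, q, w):
    """Inverse NTT: forward transform with w^{-1}, then scale by n^{-1}."""
    n = len(a)
    a = ntt(a, q, pow(w, -1, q))
    n_inv = pow(n, -1, q)
    return [x * n_inv % q for x in a]
```

Each inner butterfly corresponds to one hardware butterfly unit; the stage twiddle factors `wl`/`wn` are exactly the values a TFG produces on the fly instead of reading them from memory.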
In the ModRai module, the first INTT operation transforms a sequence of ($l+1$) input polynomials under the associated moduli (op ①). The next stage performs MOD operations on the previous INTT results for the ($l+2$) moduli. Because the operations on the individual ($l+2$) moduli are independent under RNS decomposition, we can perform the ($l+2$) MODs in parallel (op ②) to efficiently pipeline the computation. Modular multiplication (ModMul) also requires the original input polynomial, which reduces the number of MODs on the ($l+2$) moduli to ($l+1$) MODs at a time. Figure 4 shows the selectable MOD outputs. Subsequently, the ($l+1$) NTT modules run in parallel for the subsequent NTT computations (op ③). Once the NTT computations are complete, the ModMul module performs modular multiplications with the SwK following Algorithm 1. To simultaneously generate two relinearized vectors, we deployed $2\times $($l+2$) ModMul modules (op ④). After the ModMul products, the results are stored in the following memory banks (ops ⑤ and ⑥, respectively). We used two banks of Ultra RAM (URAM), a large, high-speed memory element, to store two polynomials with five RNS components. After accumulating ($l+1$) polynomials in the URAMs, the ModRai module transfers the temporary data to the ModSwi module memory and continues accumulating the next polynomials. The combined multiply-accumulate operation after the NTT is denoted MAR, and its detailed structure is shown in Figure 5.
The ModSwi module performs the second part of the key switching operation after the ($l+1$) iterations. In this step, the temporary data from ModRai are received and stored in RAM banks (op ⑦). The following INTT unit transforms only the two polynomials associated with the special modulus ${q}_{sp}$ (op ⑧). The ModSwi module then performs the flooring operation with ($l+1$) MR units and ($l+1$) NTT computations (ops ⑨ and ⑩, respectively). For the MOD operation onto the 51-bit moduli, the coefficients are compared with half of ${q}_{sp}$, and the subtraction with the residue of ${q}_{sp}$ modulo ${q}_{i}$ is then determined [6]. At the end of the flooring, subtraction from the ModRai outputs and subsequent multiplication by the inverse value of the special prime are performed for the two polynomials of RNS components in parallel (ops ⑪ and ⑫, respectively). Op ⑬ adds the remaining two components of the homomorphic multiplication results to the outputs of the flooring operation and generates the relinearized ciphertext. The output of the key switching operation consists of two polynomials of RNS components, which are referred to as $c_{0}$ and $c_{1}$ of the key-switched ciphertext $c$.
The pipeline timing of the key switching operation is shown in Figure 6, where each pipeline stage comprises a series of consecutive operations separated by a few cycles. Each square block represents the approximate delay of a one-polynomial NTT computation. The ModRai unit raises the modulus in a highly pipelined manner, with the results stored in RAM until all input moduli are transformed (op ⑥). Subsequently, the ModSwi module performs the modulus switching operation only for the two polynomials associated with the special modulus. In a pipelined operation, modulus switching has a timing delay of two square blocks. However, the delay gap between consecutive key switching operations depends on the number of prime moduli, which affects the accumulation latency in the ModRai module.
In this configuration of KeySwitch with $l=3$ and $N=64$ K, Figure 7 shows the tensor form of SwK. In the RNS domain, the component polynomials are 480 KB ($=\frac{65536\times 60\ \mathrm{bit}}{1024\times 8}$) for the 60-bit ${q}_{0}$ and ${q}_{sp}$, and 408 KB ($=\frac{65536\times 51\ \mathrm{bit}}{1024\times 8}$) for the 51-bit ${q}_{i}$. Each ciphertext polynomial is 1704 KB ($=\frac{65536\times(60\ \mathrm{bit}+3\times 51\ \mathrm{bit})}{1024\times 8}$), and each ciphertext is 3408 KB. The SwK matrix dominates, accounting for 17,472 KB ($=2\times 4\times \frac{65536\times(2\times 60\ \mathrm{bit}+3\times 51\ \mathrm{bit})}{1024\times 8}$). The same SwK matrices can be reused for all homomorphic multiplication operations at a specific level. However, these matrices are often too large to be stored in the on-chip memory, leading to a significant data movement overhead and a bottleneck in the overall performance of the cryptosystem. Thus, reducing data movement between the on-chip and external memory is critical for improving the efficiency of the system.
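The storage figures above follow directly from the parameter set and can be reproduced with a few lines of arithmetic (Python; sizes in KB):

```python
N = 65536  # polynomial degree (64 K)

def poly_kb(width_bits):
    """One residue polynomial: N coefficients of width_bits bits, in KB."""
    return N * width_bits // (1024 * 8)

assert poly_kb(60) == 480 and poly_kb(51) == 408

# a ciphertext polynomial carries residues for the 60-bit q0 and three 51-bit qi
ct_poly_kb = poly_kb(60) + 3 * poly_kb(51)            # 1704 KB
ct_kb = 2 * ct_poly_kb                                # 3408 KB per ciphertext

# SwK = (D0 | D1): two tensors of (l + 1) = 4 rows, each row holding residues
# for q0, q1..q3, and the special modulus qsp
swk_kb = 2 * 4 * (2 * poly_kb(60) + 3 * poly_kb(51))  # 17,472 KB
```

The roughly 5x gap between one ciphertext and the SwK tensor is what makes the SwK transfers, rather than the ciphertexts themselves, the dominant data movement cost.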