Article

GPU Acceleration for KLSS Key Switching in Fully Homomorphic Encryption

by Shutong Jin and Ray C. C. Cheung *
Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong SAR, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3809; https://doi.org/10.3390/math13233809
Submission received: 6 November 2025 / Revised: 20 November 2025 / Accepted: 25 November 2025 / Published: 27 November 2025

Abstract

Fully Homomorphic Encryption (FHE) enables privacy-preserving computation but is hindered by high computational overhead, with the key-switching operation being a primary performance bottleneck. This paper introduces the first CUDA-optimized GPU implementation of the Kim, Lee, Seo, and Song (KLSS) key-switching algorithm for three leading FHE schemes: BGV, BFV, and CKKS. Our solution achieves significant performance gains, delivering speedups of up to 181× over the original CPU implementation. Furthermore, we analyze the critical trade-off between single- and double-decomposition key-switching techniques on GPUs. Our work thus provides both a high-performance tool and clear guidelines on balancing latency against hardware memory constraints.

1. Introduction

Fully Homomorphic Encryption (FHE) represents a paradigm shift in cryptography by enabling computation directly on encrypted data without prior decryption. This capability allows for the processing of sensitive information by untrusted third parties, such as cloud service providers, while guaranteeing data confidentiality. Consequently, FHE is an immensely attractive technology for a wide range of privacy-preserving applications, including secure cloud computing, private machine learning inference, and federated learning [1,2,3,4,5].
Despite this promise, the practical deployment of FHE has been hindered by significant performance challenges. The substantial computational overhead associated with current FHE schemes remains a critical barrier to their widespread adoption in real-world systems. Among the various homomorphic operations, key switching is a particularly demanding primitive. It is a fundamental building block for other essential operations, such as ciphertext relinearization after multiplication and automorphisms. Given its frequent invocation and resource-intensive nature, the efficiency of the key-switching procedure has a profound impact on the overall performance of an FHE system.
This bottleneck has drawn considerable attention from the research community, leading to continuous efforts on developing more efficient key-switching algorithms. Among these, the technique by Kim, Lee, Seo, and Song (KLSS) [6] is particularly noteworthy. It reduces the complexity, measured in the number of Number Theoretic Transform (NTT) operations, from $O(\ell^2)$ to $O(\ell)$ with respect to the ciphertext level $\ell$.
While the theoretical advantages of the KLSS algorithm are clear, its practical benefits depend heavily on an optimized implementation that can fully exploit modern hardware architectures. To date, however, implementations have often been scheme-specific or tailored to CPU environments.
In this work, we bridge this gap by presenting the first unified, GPU-accelerated implementation of the KLSS key-switching algorithm for three of the most prominent FHE schemes: BGV [7], BFV [8,9,10], and CKKS [11]. Our contributions are threefold:
  • A CUDA-optimized GPU implementation of KLSS key switching for the BGV, BFV, and CKKS schemes.
  • A comprehensive benchmark against CPU and prior state-of-the-art GPU implementations, demonstrating significant speedups.
  • An in-depth analysis of the trade-off between single- and double-decomposition for key switching, specifically within the context of GPU-accelerated homomorphic encryption.
The remainder of this paper is organized as follows. Section 2 reviews the necessary background on FHE and GPU acceleration. Section 3 details the specifics of the FHE schemes and the key-switching procedure relevant to our work. Section 4 presents the architecture of our unified GPU implementation. Section 5 describes our experimental setup and performance results. Section 6 provides a detailed discussion of our findings. Finally, Section 7 concludes the paper and discusses potential directions for future work.

2. Related Works

2.1. Fully Homomorphic Encryption

Since Gentry’s blueprint in 2009 [12], the field of FHE has largely been built on lattice-based cryptography. The security of these systems is rooted in the computational hardness of the Learning with Errors (LWE) and Ring-LWE (RLWE) problems [7,9,10,12,13,14,15,16]. The schemes based on the RLWE problem, such as BGV [7], BFV [8,9,10], and CKKS [11], have gained prominence due to their high computational efficiency. This efficiency stems from their ability to pack multiple data values into a single ciphertext, enabling parallel processing via Single Instruction, Multiple Data (SIMD) operations. The BGV and BFV schemes are designed for exact computations over finite fields, while the CKKS scheme is tailored for approximate arithmetic on real and complex values. This makes CKKS particularly well-suited for applications in privacy-preserving machine learning [2,11,17,18,19,20,21]. The practical implementation and academic study of these schemes are supported by many open-source libraries, including SEAL [22], HElib [23], and OpenFHE [24].

2.2. Research in Key Switching

Previous research on key switching [25] produced two primary techniques with distinct trade-offs. The Gentry, Halevi, and Smart (GHS) method is highly efficient, requiring only a linear number of NTTs [16]. However, this efficiency comes at the cost of either halving the modulus size or doubling the polynomial dimension to preserve security. In contrast, the Brakerski and Vaikuntanathan (BV) technique is more direct to implement but incurs a higher computational cost, involving a quadratic number of NTTs [10].
Hybrid key switching was developed to combine the digit decomposition technique from BV with the larger modulus strategy from GHS to strike a better balance [26]. More recent research [27] has continued to build upon the hybrid approach to further minimize the computational overhead. For instance, some work focuses on optimizing the selection of decomposition parameters for different ciphertext levels [28]. A more recent advancement, now considered the state-of-the-art, is the KLSS technique, which significantly reduces the computational workload by further optimizing the use of NTTs [6].

2.3. GPU-Accelerated FHE

To mitigate the substantial computational demands of FHE, researchers have turned to Graphics Processing Units (GPUs) for acceleration. The inherent parallelism of FHE, especially the SIMD batching, makes GPUs a natural fit for this task, and numerous studies have confirmed their effectiveness in accelerating homomorphic operations [29,30,31,32,33,34,35].
The research landscape for GPU acceleration can be divided based on the type of FHE scheme. For bit-wise schemes, the open-source library cuFHE was one of the first to provide a CUDA-based implementation for TFHE [36]. Subsequent work on FHEW and TFHE has introduced advanced parallelization strategies and memory-aware optimizations tailored to their unique computational patterns [37,38,39].
Simultaneously, a significant body of research has focused on accelerating word-type schemes like CKKS, BFV, and BGV, which are critical for arithmetic-heavy applications. The central aim is to reduce the computational overhead that impedes practical FHE deployment. To achieve this, researchers have developed optimizations at multiple levels, from algorithmic enhancements in key switching to software strategies like kernel fusion that reduce memory latency [40,41,42]. Other efforts have targeted hardware-specific features, such as leveraging Tensor Cores for further acceleration [43,44]. Most recently, the Cheddar library was introduced, demonstrating state-of-the-art performance by integrating these multi-level optimization techniques [45].
Most open-source libraries, such as HEonGPU [42], provide implementations of earlier key-switching variants (BV [10], GHS [16], and hybrid [26]). To the best of our knowledge, no existing GPU acceleration library offers an implementation of KLSS key switching or any other double-decomposition-based method, a gap that directly motivated this work.

3. Preliminaries

The security of the Learning With Errors (LWE) problem stems from the computational hardness of recovering a secret vector from a system of linear equations that contains a small amount of noise. Formally, for a given modulus $q$ and dimension $n$, an LWE instance is a pair $(\mathbf{a}, b) \in \mathbb{Z}_q^{n+1}$. Here, $\mathbf{a}$ is a vector chosen uniformly at random, and $b$ is derived as the inner product of $\mathbf{a}$ and a secret vector $\mathbf{s} \in \mathbb{Z}_q^n$, perturbed by a small error term $e$:

$$b = \langle \mathbf{a}, \mathbf{s} \rangle + e \pmod{q}.$$

The error $e$ is typically drawn from a discrete Gaussian distribution.
The Ring Learning With Errors (RLWE) problem is the ring-based analogue of LWE, offering improved efficiency by operating over polynomial rings. Within a ring $R_q = \mathbb{Z}_q[x]/(f(x))$, the challenge is to find a secret polynomial $s \in R_q$. This is done using samples $(a, b) \in R_q^2$, where $a$ is a uniformly random polynomial and $b$ is computed as

$$b = a \cdot s + e \pmod{q},$$

with $e$ being an error polynomial with small coefficients.
In this work, we operate within the polynomial ring $R = \mathbb{Z}[X]/(X^N + 1)$, where $N$ is a power of two. A ciphertext, denoted $\mathrm{ct}$, is a pair of polynomials $(c_0, c_1)$. The correctness of decryption relies on the relationship

$$c_0 + c_1 \cdot s = m + e,$$

where $s$ is the secret key, $m$ is the encoding of the plaintext $\mu$, and $e$ is a small error term.
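To make the decryption relation concrete, the following is a minimal host-side sketch (our own illustration with toy parameters, not the library code) that recovers the noisy message $m + e = c_0 + c_1 \cdot s$ by schoolbook negacyclic multiplication; production implementations replace the $O(N^2)$ loop with the NTT.

// Toy check of c0 + c1*s = m + e in R_q = Z_q[X]/(X^N + 1).
#include <cstdint>
#include <vector>

using Poly = std::vector<uint64_t>;

// (a * b) mod (X^N + 1, q), schoolbook O(N^2) version.
Poly negacyclic_mul(const Poly& a, const Poly& b, uint64_t q) {
    size_t N = a.size();
    Poly out(N, 0);
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j) {
            uint64_t prod = (uint64_t)(((unsigned __int128)a[i] * b[j]) % q);
            size_t k = i + j;
            if (k < N)  out[k] = (out[k] + prod) % q;               // X^k term
            else        out[k - N] = (out[k - N] + q - prod) % q;   // X^k = -X^{k-N}
        }
    return out;
}

// Recover the noisy message m + e; the final rounding step is scheme-specific.
Poly noisy_message(const Poly& c0, const Poly& c1, const Poly& s, uint64_t q) {
    Poly m = negacyclic_mul(c1, s, q);
    for (size_t i = 0; i < m.size(); ++i) m[i] = (m[i] + c0[i]) % q;
    return m;
}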

3.1. BGV, BFV, CKKS

We now summarize the core functionalities of three RLWE-based FHE schemes: BGV [7], BFV [8,9,10], and CKKS [11].
  • Setup: The process begins by establishing the cryptographic environment based on a security parameter $\lambda$. This involves selecting public parameters $pp = (N, t, Q, P, \chi_{key}, \chi_{err})$, which define the polynomial ring dimension $N$, plaintext modulus $t$, ciphertext modulus $Q$, and the distributions for key and error sampling ($\chi_{key}$ and $\chi_{err}$).
  • Key Generation: A secret key $\mathrm{sk}$ is created by sampling a polynomial $s$ from the key distribution $\chi_{key}$. The corresponding public key $\mathrm{pk}$ is a pair $(a, as + e)$, where $a$ is a uniformly random polynomial in $R_Q$ and $e$ is a small error from $\chi_{err}$. Additionally, evaluation keys, such as relinearization keys $\mathrm{ksk}_{relin}$ and automorphism keys $\mathrm{ksk}_{auto}$, are generated for homomorphic operations.
  • Encryption: To encrypt a plaintext $m$, it is first encoded into a polynomial $m^*$. The specific encoding method depends on the scheme:
      – BFV: the message is scaled into the most significant bits: $m^* = \lfloor Q/t \rfloor \cdot [m]_t$.
      – BGV: a correction factor $\mu$ is applied: $m^* = [\mu m]_t$.
      – CKKS: the message is directly encoded as a polynomial: $m^* = m$.
    The final ciphertext $\mathrm{ct} = (c_0, c_1)$ is then formed by adding this encoded message to a fresh encryption of zero, computed as $\mathrm{Enc}_{pk}(0) = [r \cdot (a, b) + t \cdot (e_0, e_1)]_Q$.
  • Decryption: The original message is recovered from a ciphertext using the secret key $s$:
      – BFV: $m = \lfloor \frac{t}{Q} \cdot [c_0 + c_1 \cdot s]_Q \rceil$
      – BGV: $m = [\mu^{-1} \cdot [c_0 + c_1 \cdot s]_Q]_t$
      – CKKS: $m = [c_0 + c_1 \cdot s]_Q$
  • Homomorphic Operations: These schemes support computations directly on encrypted data.
      – Addition: the sum of two ciphertexts $\mathrm{ct}$ and $\mathrm{ct}'$ is simply their component-wise sum: $\mathrm{ct}_{add} = \mathrm{ct} + \mathrm{ct}'$.
      – Multiplication: the product of two ciphertexts is computed, followed by a relinearization step using $\mathrm{ksk}_{relin}$ to produce $\mathrm{ct}_{mult}$.
      – Automorphism: an automorphism (e.g., rotation) is applied to a ciphertext $\mathrm{ct}$ using a corresponding key $\mathrm{ksk}_{auto}$ to yield the transformed ciphertext $\mathrm{ct}_{auto}$.

3.2. Key Switching

Key switching is a generic technique for changing the secret key under which a ciphertext is encrypted, from an initial key $s_A$ to a target key $s_B$, without altering the underlying plaintext. A switching key, $\mathrm{ksk}_{s_A \to s_B}$, is defined as

$$\mathrm{ksk}_{s_A \to s_B} = \big([-a \cdot s_B + s_A + t e]_Q,\; a\big) \in R_Q^2,$$

where $a$ is a uniformly random polynomial in $R_Q$ and $e$ is a small error sampled from $\chi_{err}$.
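To see why this key works, the following derivation (a sketch that deliberately omits the digit decomposition and the $P$-scaling used in practice) multiplies a single component $c$ encrypted under $s_A$ by both halves of the key and decrypts the result under $s_B$:

$$\begin{aligned}
(\tilde{c}_0, \tilde{c}_1) &= c \cdot \mathrm{ksk}_{s_A \to s_B} = \big(c \cdot [-a \cdot s_B + s_A + t e]_Q,\; c \cdot a\big),\\
\tilde{c}_0 + \tilde{c}_1 \cdot s_B &= -c\,a\,s_B + c\,s_A + t\,c\,e + c\,a\,s_B = c \cdot s_A + t\,c\,e \pmod{Q}.
\end{aligned}$$

The plaintext contribution $c \cdot s_A$ survives, while the raw error $t\,c\,e$ would be far too large on its own; keeping it small is precisely the job of the digit decomposition and the extension modulus $P$ described in Section 4.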

3.2.1. Relinearization

Relinearization is a specific application of key switching used after a homomorphic multiplication. The multiplication of two linear ciphertexts results in a quadratic ciphertext of degree 2, effectively encrypted under $(1, s, s^2)$. Relinearization uses a key-switching key $\mathrm{ksk}_{relin}$ to convert this quadratic ciphertext back into a linear one under the original basis $(1, s)$. Our implementation unifies the key-switching module for both relinearization and other operations (such as automorphisms). For non-relinearization inputs, we structure the input as $[c_0^*, 0, c_1^*]$, where the term to be switched, $c_1^*$, is treated as the third component ($c_2$) to maintain a consistent interface.
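A minimal sketch of this unified entry point follows (type and function names are our own illustration, not the library's actual API):

#include <cstdint>
#include <vector>

using Poly = std::vector<uint64_t>;

// Unified input: c2 always holds the component to be key-switched.
struct SwitchInput {
    Poly c0, c1, c2;
};

// Relinearization: pass the degree-2 ciphertext (d0, d1, d2) directly.
SwitchInput from_relin(Poly d0, Poly d1, Poly d2) {
    return {std::move(d0), std::move(d1), std::move(d2)};
}

// Automorphisms and other callers: pack [c0*, 0, c1*] as described above.
SwitchInput from_other(Poly c0_star, Poly c1_star, size_t N) {
    return {std::move(c0_star), Poly(N, 0), std::move(c1_star)};
}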

3.2.2. Modulus Switching

Modulus switching is a technique for managing noise growth. It reduces the modulus of a ciphertext from a larger value $Q$ to a smaller one $Q'$ while preserving the plaintext. This is done by scaling the ciphertext coefficients by the ratio $Q'/Q$ and rounding to the nearest integer:

$$\mathrm{ct}' = \left\lfloor \frac{Q'}{Q} \cdot \mathrm{ct} \right\rceil \pmod{Q'}.$$
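As a sanity check on the formula, a direct (non-RNS) coefficient-wise version is a one-liner; we sketch it below assuming 64-bit moduli, whereas deployed code performs the equivalent rescaling residue-wise in RNS form.

#include <cstdint>

// round(c * Qp / Q) mod Qp for one coefficient c in [0, Q).
// The 128-bit intermediate keeps c * Qp exact; adding Q/2 before the
// division implements round-to-nearest.
uint64_t mod_switch_coeff(uint64_t c, uint64_t Q, uint64_t Qp) {
    unsigned __int128 num = (unsigned __int128)c * Qp + Q / 2;
    return (uint64_t)(num / Q) % Qp;
}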

4. GPU Acceleration for KLSS

4.1. The Double-Decomposition Approach

The KLSS key-switching technique [6] builds upon prior state-of-the-art methods, such as hybrid key switching, by introducing a double-decomposition strategy. As illustrated in Figure 1, after being raised to $QP$, both the input ciphertext $\mathrm{ct}_A$ and the key-switching key $\mathrm{ksk}$ are additionally converted to an external base $T$ for the external product. In Section 4.2, we restructure the original Algorithm 1 to align with the massively parallel architecture of GPUs, thereby maximizing computational throughput.

4.2. GPU Acceleration

To better leverage the massively parallel architecture of modern GPUs, we restructured the original algorithm into a two-stage process. The first stage is an offline preprocessing step, where all necessary data is precomputed and loaded into GPU global memory. The second stage performs the online calculation of the key-switching ciphertext based on a given input. These two stages are detailed in Algorithms 2 and 3, respectively.

4.2.1. Stage 1: Key Generation (Offline)

This module is executed once. All intermediate and final key data are generated and stored in the GPU’s global memory (Device memory) to avoid costly data transfers during the online phase.

4.2.2. Stage 2: GPU-Accelerated Key Switching (Online)

This module leverages the precomputed key $\mathrm{ksk}_{double}$ resident in GPU memory. The input $c_i$ is first transferred to the GPU, all computations are performed on the device, and only the final result is transferred back.
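The following sketch shows this host-side flow with the CUDA runtime API (kernel names are placeholders for the batched kernels of Algorithm 3, not the actual interface):

#include <cuda_runtime.h>
#include <cstdint>

// Online stage: one transfer in, all arithmetic on device, one transfer out.
void key_switch_online(const uint64_t* h_ci, uint64_t* h_out,
                       size_t in_words, size_t out_words,
                       const uint64_t* d_ksk_double /* resident since Stage 1 */) {
    uint64_t *d_ci = nullptr, *d_out = nullptr;
    cudaMalloc(&d_ci,  in_words  * sizeof(uint64_t));
    cudaMalloc(&d_out, out_words * sizeof(uint64_t));

    cudaMemcpy(d_ci, h_ci, in_words * sizeof(uint64_t), cudaMemcpyHostToDevice);
    // launch_decompose_base_ext(d_ci, ...);            // Step 1 of Algorithm 3
    // launch_ntt_mult_reduce(d_ci, d_ksk_double, ...); // Step 2
    // launch_base_conv_mod_down(d_out, ...);           // Steps 3-4
    cudaMemcpy(h_out, d_out, out_words * sizeof(uint64_t), cudaMemcpyDeviceToHost);

    cudaFree(d_ci);
    cudaFree(d_out);
}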
Algorithm 1 KLSS Key Switching
Input: Context parameters: bases $Q$, $P$, $T$; decompositions $d$, $d_{double}$.
Input: Offline: secret keys $s_A$, $s_B$.
Input: Online: ciphertext component $c_i \in R_Q$.
Output: Switched ciphertext parts $(\tilde{c}_0, \tilde{c}_1) \in R_Q$.
       STAGE 1. Generate $\mathrm{ksk}_{double}$ and store it.
  1:  for $l \leftarrow 0$ to $d - 1$ do
  2:        Sample $a_l, e_l$. Compute raw parts:
  3:        $\mathrm{ksk}_{0,l} \leftarrow [-a_l \cdot s_B + t e_l + P \cdot \mathcal{B}(s_A)_l]_{QP}$
  4:        $\mathrm{ksk}_{1,l} \leftarrow [a_l]_{QP}$
       // BaseConvert (map from $QP$ to $T$)
  5:  for $j \in \{0, 1\}$, $l \in [0, d-1]$, $m \in [0, d_{double} - 1]$ do
  6:        $poly \leftarrow \tilde{\mathcal{D}}(\mathrm{ksk}_{j,l})_m$ ▹ Extract digit
  7:        $\mathrm{ksk}_{double,j,l,m} \leftarrow \mathrm{NTT}_{fwd}(\mathrm{BaseExt}(poly, \tilde{Q}_m, T))$ ▹ Store in NTT form in the ring $R_T$ for fast online dot products.
       STAGE 2. Transform $c_i$ using $\mathrm{ksk}_{double}$.
       1. Input preparation (Decompose & BaseConvert)
  8:  for $l \leftarrow 0$ to $d - 1$ do
  9:        $poly \leftarrow \mathcal{D}(c_i)_l$ ▹ Decompose input $c_i$
 10:        $c_{ext,l} \leftarrow \mathrm{NTT}_{fwd}(\mathrm{BaseExt}(poly, \tilde{q}_l, T))$ ▹ Move input parts to $R_T$ and transform to NTT form.
       2. Inner product
 11:  for $j \in \{0, 1\}$ do
 12:        Initialize accumulator $acc_j \leftarrow 0 \in R_T$.
 13:        for $m \leftarrow 0$ to $d_{double} - 1$ do
 14:              $temp \leftarrow 0$
 15:              for $l \leftarrow 0$ to $d - 1$ do
 16:                    $temp \leftarrow temp + c_{ext,l} \cdot \mathrm{ksk}_{double,j,l,m}$ ▹ Pointwise mult-add
 17:              $acc_j \leftarrow acc_j + temp \cdot \mathrm{NTT}_{fwd}(\tilde{B}_m)$ ▹ Reconstruct in $R_T$
 18:        $c_j \leftarrow \mathrm{NTT}_{inv}(acc_j)$ ▹ Inverse NTT in the ring $R_T$
       3. Post-processing (BaseConvert & ModDown)
 19:  for $j \in \{0, 1\}$ do
 20:        $c_j \leftarrow \mathrm{BaseExt}(c_j, T, QP)$ ▹ Convert back to the large modulus
 21:        $\delta_j \leftarrow t \cdot \mathrm{BaseExt}([\mathrm{NTT}_{inv}(t^{-1} c_j)]_P, P, Q)$ ▹ Correction factor
 22:        $\tilde{c}_j \leftarrow [P^{-1}(\mathrm{NTT}_{inv}(c_j) + \delta_j)]_Q$ ▹ Scale down to $Q$
 23:  return $(\tilde{c}_0, \tilde{c}_1)$
Algorithm 2 GPU-Accelerated Key Generation for KLSS
Input: New secret key $s_B \in R_{QP}$, old secret key $s_A \in R_{QP}$ (on device).
Output: Double-decomposition key $\mathrm{ksk}_{double}$ stored in GPU device memory.
       // Step 1: Generate standard key parts in a batched kernel.
  1:  parallel for $l \in [0, d-1]$ do ▹ Launch a kernel to process all $d$ parts in parallel.
  2:  Sample a batch of $d$ polynomials $a_l$ and $e_l$ on the GPU. ▹ All polynomial arithmetic is coefficient-wise and parallelized within the kernel.
  3:   $\mathrm{ksk}_{0,l} \leftarrow [-a_l \cdot s_B + t e_l + P \cdot \mathcal{B}(s_A)_l]_{QP}$
  4:   $\mathrm{ksk}_{1,l} \leftarrow [a_l]_{QP}$
       // Step 2: Apply the second decomposition and base-extend in a single, large-scale kernel.
  5:  parallel for $(j, l, m) \in \{0,1\} \times [0, d-1] \times [0, d_{double}-1]$ do ▹ A massively parallel kernel; each thread block is assigned one or more $\mathrm{ksk}_{double,j,l,m}$.
  6:  Decompose $\mathrm{ksk}_{j,l}$ to get $\tilde{\mathcal{D}}(\mathrm{ksk}_{j,l})_m$. ▹ RNS decomposition is coefficient-wise.
  7:  Let $poly = \tilde{\mathcal{D}}(\mathrm{ksk}_{j,l})_m$.
  8:   $\mathrm{ksk}_{double,j,l,m} \leftarrow \mathrm{BaseExt}(poly, \tilde{Q}_m, T)$. ▹ Batched BaseExt with error correction.
  9:  return $\mathrm{ksk}_{double}$ (residing in GPU device memory).
Algorithm 3 GPU-Accelerated Double-Decomposition KLSS Key Switching
Input: Ciphertext component $c_i \in R_Q$ (on host), key $\mathrm{ksk}_{double}$ (on device).
Output: Switched ciphertext components $(\tilde{c}_0, \tilde{c}_1) \in R_Q^2$ (on host).
  1:  Transfer $c_i$ from host to GPU device memory.
       // Step 1: Batched input extension.
  2:  Decompose $c_i$ into $(\mathcal{D}(c_i)_0, \ldots, \mathcal{D}(c_i)_{d-1})$ on the GPU.
  3:  parallel for $l \in [0, d-1]$ do ▹ Launch a kernel for batched base extension.
  4:   $c_{ext,l} \leftarrow \mathrm{BaseExt}(\mathcal{D}(c_i)_l, \tilde{q}_l, T)$. ▹ Batched BaseExt
       // Step 2: Batched dot product in $R_T$.
  5:  parallel for $(j, l, m) \in \{0,1\} \times [0, d-1] \times [0, d_{double}-1]$ do ▹ Batched NTTs and multiplications.
  6:   $temp_{j,l,m} \leftarrow \mathrm{NTT}_{fwd}(c_{ext,l}) \odot \mathrm{NTT}_{fwd}(\mathrm{ksk}_{double,j,l,m})$. ▹ $\odot$ is the coefficient-wise product.
  7:  parallel for $(j, m) \in \{0,1\} \times [0, d_{double}-1]$ do ▹ Parallel reduction over index $l$.
  8:   $c_{j,m} \leftarrow \sum_{l=0}^{d-1} temp_{j,l,m}$. ▹ Kernel using parallel reduction.
  9:  parallel for $j \in \{0,1\}$ do ▹ Parallel CRT reconstruction (another reduction).
 10:   $c_j \leftarrow \sum_{m=0}^{d_{double}-1} c_{j,m} \odot \mathrm{NTT}_{fwd}(\tilde{B}_m)$.
       // Step 3: Batched base conversion from $R_T$ back to $R_{QP}$.
 11:  parallel for $j \in \{0,1\}$ do
 12:   $c_j \leftarrow \mathrm{BaseExt}(\mathrm{NTT}_{inv}(c_j), T, QP)$. ▹ Batched $\mathrm{NTT}_{inv}$ and BaseExt.
       // Step 4: Batched final scaling.
 13:  parallel for $j \in \{0,1\}$ do
 14:   $\delta_j \leftarrow t \cdot \mathrm{BaseExt}([\mathrm{NTT}_{inv}(t^{-1} c_j)]_P, P, Q)$. ▹ Batched kernels for all steps.
 15:   $\tilde{c}_j \leftarrow [P^{-1}(\mathrm{NTT}_{inv}(c_j) + \delta_j)]_Q$.
 16:  Transfer $(\tilde{c}_0, \tilde{c}_1)$ from GPU device to host memory.
 17:  return $(\tilde{c}_0, \tilde{c}_1)$.
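To illustrate the batching in Step 2, the kernel below is a simplified sketch of the fused multiply-and-reduce of lines 5-8 (our illustration, not the production code): it assumes the inputs are already in NTT form, uses a single modulus $q$ where the actual kernels index a per-residue modulus table with Barrett reduction, and keeps $temp_{j,l,m}$ in registers instead of global memory.

#include <cstdint>

// Grid: gridDim.y = 2 * d_double (one (j, m) pair per y-block);
//       the x-dimension covers the R = N * |T| residue coefficients.
__global__ void inner_product_kernel(
        const uint64_t* __restrict__ c_ext, // [d][R]              NTT(c_ext,l)
        const uint64_t* __restrict__ ksk,   // [2][d][d_double][R] NTT(ksk_double)
        uint64_t* __restrict__ out,         // [2][d_double][R]    accumulators c_{j,m}
        int d, int d_double, int R, uint64_t q) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int jm  = blockIdx.y;
    if (idx >= R || jm >= 2 * d_double) return;
    int j = jm / d_double, m = jm % d_double;

    uint64_t acc = 0;
    for (int l = 0; l < d; ++l) {
        uint64_t a = c_ext[(size_t)l * R + idx];
        uint64_t b = ksk[(((size_t)j * d + l) * (size_t)d_double + m) * R + idx];
        // __int128 stands in for Barrett/Montgomery reduction here
        // (device-side __int128 requires a recent CUDA toolkit).
        acc = (uint64_t)(((unsigned __int128)a * b + acc) % q);
    }
    out[((size_t)j * d_double + m) * (size_t)R + idx] = acc;
}

Launching this kernel with gridDim.y = 2·$d_{double}$ and enough x-blocks to cover the residues performs the entire reduction over $l$ in one pass over the key material.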

4.3. Unified Framework for BFV, BGV, and CKKS

While the original KLSS paper focused primarily on the CKKS scheme, we extended the implementation to encompass all word-type schemes, including BFV and BGV. By designing the framework illustrated in Figure 2, we developed a unified implementation that achieves comparable efficiency across all three schemes, as detailed in Table 1 and Table 2. Additionally, we adapted our implementation to align with the original Phantom library [41], in which ciphertexts for BGV and CKKS are stored in the NTT domain.

5. Results

5.1. Experimental Setup

Our performance evaluation was conducted on two distinct hardware platforms: a consumer-grade NVIDIA RTX 4090 GPU and a datacenter-class NVIDIA A100 GPU. The A100 was specifically required for evaluating large parameter sets, as the substantial memory footprint of the associated key-switching keys exceeded the memory capacity of the consumer-grade RTX 4090. For our CPU performance baseline, we reference the results reported in [6]. Those experiments were performed on a system running Ubuntu 20.04.3 LTS, equipped with 192 GB of RAM and an Intel Xeon Platinum 8268 CPU (2.90 GHz), with each test utilizing a single thread. The specific parameters used for all benchmarks are detailed in Table 3.

5.2. GPU Acceleration Performance

As shown in Table 4, our GPU implementation achieves significant speedups, ranging from 84× to 150× on the NVIDIA A100 and from 114× to 181× on the NVIDIA RTX 4090 (including host-device data transfer and kernel initialization). This performance difference stems from the distinct architectural trade-offs between the two devices. While the A100 offers higher memory capacity and bandwidth, the RTX 4090 has a substantially greater number of CUDA cores (16,384 versus 6912 on the A100). The higher speedup on the RTX 4090 indicates that our algorithm is predominantly compute-bound and thus benefits more from the larger count of parallel processors than from the A100's superior memory bandwidth.

5.3. Comparison with Hybrid Key Switching

A subject of ongoing discussion in the field is the relative performance of hybrid key-switching techniques (single decomposition) versus the KLSS algorithm (double decomposition) [46]. We conduct a direct performance evaluation, comparing our implementation of the KLSS algorithm against the state-of-the-art, GPU-optimized hybrid method by Yang et al. [41]. To ensure a fair and direct comparison, we applied an identical GPU acceleration strategy to both algorithms. The results, presented in Table 1 and Table 2, indicate that neither method demonstrates universal superiority.
Our analysis shows that the double-decomposition KLSS method, even when configured with optimal decomposition parameters, does not consistently outperform the simpler, more memory-efficient single-decomposition strategy. Specifically, the performance advantage of KLSS diminishes as the parameter $r$ increases or as the multiplication depth decreases. This observation aligns with the theoretical insights of [46], although their experimental validation was confined to CPU architectures.
Conversely, the KLSS method demonstrates superior performance for parameter sets with a large multiplication depth and a small RNS decomposition word size $r$ for the single-decomposition part. Under these favorable conditions, such as those of parameter Set #14, the KLSS approach achieves a speedup of up to 1.49× over the single-decomposition strategy. A more detailed discussion of these trade-offs is presented in Section 6.

6. Discussion

6.1. Single- or Double-Decomposition? A Trade-Off Analysis

The choice between single and double decomposition represents a fundamental trade-off between computational latency and available multiplicative depth. Mono and Güneysu [46] recently contended that the prior comparisons against hybrid key switching in [6] were inequitable. They demonstrated that for latency-optimized scenarios, a single-decomposition approach configured with a large gadget vector dimension $r$ (and thus a single decomposition group, $d = 1$) can match or even surpass the performance of double decomposition. Our own experiments, detailed in Table 1 and Table 2, corroborate this finding in a GPU-accelerated context: for shallow circuits where raw speed is paramount, single decomposition is indeed the superior choice.
However, this latency-centric perspective does not capture the full picture. We argue that the double-decomposition method remains a crucial tool in scenarios where maximizing multiplicative depth is the primary constraint. This is particularly relevant for applications involving deep arithmetic circuits where bootstrapping is either unavailable or prohibitively expensive. In such cases, the number of available levels, $\ell$, is a fixed resource that dictates the complexity of the computable functions.
The advantage of double decomposition lies in its management of the modulus chain. The total modulus is composed of $\ell$ primes for the ciphertext modulus $Q$ and $k$ primes for the special modulus $P$ used in key switching. The number of levels, $\ell$, is determined by the number of primes composing $Q$. To achieve low latency with single decomposition, one must use a large $r$, which in turn requires a large $P$ constructed from multiple prime moduli. These primes are consumed by $P$ and are unavailable for $Q$, effectively reducing the maximum achievable depth $\ell$.
In contrast, double decomposition allows for a much smaller special modulus $P$. By setting the inner gadget dimension $r = 1$, $P$ can be constructed from a single prime of the modulus chain. This minimizes the resources allocated to key switching and maximizes the number of primes available for the ciphertext modulus $Q$. Consequently, for a given set of cryptographic parameters, double decomposition enables a greater multiplicative depth $\ell$. Therefore, the recommendation of [46] to use $d = 1$ is not universally optimal. The choice of decomposition scheme is not absolute but rather a strategic decision dictated by application-specific requirements: single decomposition for speed in shallow circuits, and double decomposition for depth in complex, non-bootstrapped computations.
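A small worked example makes the level accounting explicit (the numbers are chosen for exposition, not taken from Table 3; the pattern $k = r$ matches the parameter sets used in our benchmarks). With a chain of $\tilde{\ell} = 48$ primes and $k$ primes reserved for $P$:

$$\tilde{\ell} = \ell + k: \qquad
\text{single decomp., } k = r = 8 \;\Rightarrow\; \ell = 48 - 8 = 40,
\qquad
\text{double decomp., } r = 1,\ k = 1 \;\Rightarrow\; \ell = 48 - 1 = 47.$$

The seven extra primes translate directly into seven additional multiplicative levels.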

6.2. GPU Key Size Constraints

Beyond the latency-focused discussion in [46], a critical and often decisive factor for practical GPU implementations is the strictly limited nature of on-chip GPU memory (VRAM). To achieve the high throughput promised by GPU acceleration, the entirety of the evaluation keys must reside in VRAM to avoid performance-killing data transfers over the PCIe bus. The choice of decomposition scheme directly impacts the feasibility of this requirement.
This trade-off between decomposition parameters and memory consumption is illustrated in Figure 3, which plots the key-switching key size against the required multiplication depth $\ell$ for various single- and double-decomposition configurations. Several clear trends emerge:
  • Impact of Gadget Dimension: The gadget vector dimension $r$ (and $r_{double}$ for double decomposition) has a dramatic effect on the total key size. In the single-decomposition case, increasing the dimension from $r = 2$ to $r = 8$ results in a significant increase in memory footprint.
  • The Cost of Double Decomposition: The penalty is even more pronounced with double decomposition. As shown in the figure, moving from Double-Decomp ($r = 3$, $r_{double} = 2$) to ($r = 3$, $r_{double} = 8$) causes a steep rise in key size. Crucially, for comparable parameters, double decomposition consistently requires more memory than single decomposition.
The key sizes shown, reaching up to 0.5 GB, are already substantial, yet they understate the practical requirement. Figure 3 accounts only for the key-switching (relinearization) key; a complete application additionally requires a full set of Galois keys for operations such as vector rotations, and the total size of these keys can easily be an order of magnitude larger than the relinearization key alone. Table 5 shows the maximum size of a single key-switching key under the different techniques: with double decomposition, a single key can demand up to 145 GB in the worst case. Figure 4 further illustrates the dramatic growth of key size with $r_{double}$, while the corresponding efficiency gains fail to keep pace.
When these scaling factors are considered, a key size of 0.5 GB can easily inflate to 20–40 GB or more. This brings the memory requirement into direct conflict with the VRAM capacity of many common GPUs (e.g., 24 GB on an NVIDIA RTX 4090). If the total key size exceeds available VRAM, the system is forced to swap key data between system RAM and VRAM, creating an I/O bottleneck that negates the computational advantages of the GPU. Therefore, while double decomposition may offer greater multiplicative depth, its associated memory cost can render it impractical for GPU acceleration, making single decomposition the only viable option in memory-constrained environments.
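For a quick feasibility check against a given VRAM budget, key sizes can be estimated directly from the decomposition parameters. The helper below is our own back-of-the-envelope approximation (8 bytes per coefficient, as in Figure 3; exact sizes depend on implementation details), not the library's accounting:

#include <cstddef>

// Single decomposition: 2 polynomials per digit, d digits, each stored
// over the l_tilde residues of the mod-up base QP.
size_t single_decomp_key_bytes(size_t N, size_t l_tilde, size_t d) {
    return 2 * d * l_tilde * N * 8;
}

// Double decomposition: each of the d digits is split again into d_double
// digits, each stored over the t_primes residues of the external base T.
size_t double_decomp_key_bytes(size_t N, size_t t_primes, size_t d, size_t d_double) {
    return 2 * d * d_double * t_primes * N * 8;
}

As a consistency check, $N = 2^{16}$, $\tilde{\ell} = 60$, and $d = 59$ give roughly 3540 MB for a single-decomposition key, matching the worst case in Table 5.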

7. Conclusions and Future Work

In this work, we addressed the growing need for high-performance and flexible GPU acceleration in FHE. We bridged a significant gap in the existing ecosystem by presenting the first unified, GPU-accelerated implementation of the KLSS key-switching algorithm for the BGV, BFV, and CKKS schemes.
Our extensive benchmarks confirmed that our CUDA-optimized implementation delivers significant speedups over both the single-threaded CPU baseline and prior state-of-the-art GPU implementations, reinforcing the critical role of GPUs in advancing practical FHE. More importantly, our work provides a detailed analysis of the trade-offs in key-switching design for GPU architectures. We experimentally validated the finding of Mono and Güneysu [46] that single decomposition offers superior latency, but we also demonstrated that this comes at the cost of reduced multiplicative depth. Our analysis highlights that double decomposition, despite its higher latency and memory footprint, remains an essential tool for depth-constrained applications where its efficient use of the modulus chain is paramount. This establishes a clear framework for practitioners: the choice is a strategic balance between latency, circuit depth, and the stringent memory limitations of GPU hardware.
Looking ahead, our work opens several promising avenues for future research. The unified framework and optimized kernels presented here are intentionally designed for extensibility. An immediate and natural extension is the integration of bootstrapping for all three schemes. Our proposed kernels can be directly leveraged for this purpose, as efficient key switching is a core component of the bootstrapping algorithm. Therefore, our work serves as a critical first step toward a complete, GPU-accelerated bootstrapping solution, which would enable computation of arbitrary depth. Further performance gains could be realized by exploring more advanced CUDA features, optimizing memory transfer patterns, and investigating the use of specialized hardware units like Tensor Cores for polynomial arithmetic. Finally, extending our unified architecture to support other parallel platforms, such as FPGAs or other GPU vendors via APIs like SYCL, would further broaden the impact and accessibility of high-performance homomorphic encryption.

Author Contributions

Conceptualization, methodology, software, validation, S.J.; writing—original draft preparation, visualization, S.J.; writing—review and editing, supervision, R.C.C.C.; project administration, R.C.C.C.; funding acquisition, R.C.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by CityUHK research grants 9239077 and 9239083.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to intellectual property considerations.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

$N$  The polynomial ring degree, a power of two.
$R_m$  The ring $\mathbb{Z}_m[X]/(X^N + 1)$.
$Q$  The ciphertext modulus, $Q = \prod_{i=1}^{\ell} q_i$, composed of $\ell$ co-prime moduli.
$P$  The key-switching extension modulus, $P = \prod_{j=1}^{k} P_j$, composed of $k$ co-prime moduli.
$\tilde{\ell}$  The total number of co-prime moduli in the mod-up ring $R_{QP}$; $\tilde{\ell} = \ell + k$.
$T$  The double-decomposition modulus, $T = \prod_{i=1}^{r} T_i$, composed of $r$ co-prime moduli.
$s_A, s_B$  The old secret key ($s_A$) and the new secret key ($s_B$). For relinearization, $s_A = s_B^2$.
$d$  The number of decomposition groups for the first decomposition over $Q$; the groups are denoted $\tilde{q}_l$.
$d_{double}$  The number of decomposition groups for the second decomposition over $QP$; the groups are denoted $\tilde{Q}_l$.
$r$  The RNS word width of the first decomposition.
$r_{double}$  The RNS word width of the second decomposition.
$\mathcal{D}, \mathcal{B}$  The first decomposition function and its reconstruction basis over $Q$; for a polynomial $c$, $c = \langle \mathcal{D}(c), \mathcal{B} \rangle$.
$\tilde{\mathcal{D}}, \tilde{\mathcal{B}}$  The second decomposition function and its reconstruction basis over $QP$.
$c_i$  The input ciphertext polynomial component to be switched, e.g., $c_2$ from a multiplication result.
$\mathrm{ksk}$  The standard (single-decomposition) key-switching key.
$\mathrm{ksk}_{double}$  The double-decomposition key-switching key.

References

  1. Zhang, Q.; Yang, L.T.; Chen, Z. Privacy Preserving Deep Computation Model on Cloud for Big Data Feature Learning. IEEE Trans. Comput. 2016, 65, 1351–1362. [Google Scholar] [CrossRef]
  2. Chabanne, H.; Wargny, A.; Milgram, J.; Morel, C.; Prouff, E. Privacy-preserving classification on deep neural network. Cryptol. ePrint Arch. 2017. Available online: https://ia.cr/2017/035 (accessed on 30 October 2025).
  3. Jia, H.; Cai, D.; Huo, Z.; Wang, C.; Zhang, S.; Zhang, S.; Li, X.; Yang, S. Evaluation of Activation Functions in Convolutional Neural Networks for Image Classification Based on Homomorphic Encryption. In Proceedings of the 13th International Conference on Computer Engineering and Networks, Wuxi, China, 3–5 November 2023; Lecture Notes in Electrical Engineering. Springer Nature: Singapore, 2024; Volume 1127, pp. 343–355. [Google Scholar] [CrossRef]
  4. Jiang, S.; Yang, H.; Xie, Q.; Ma, C.; Wang, S.; Xing, G. Lancelot: Towards Efficient and Privacy-Preserving Byzantine-Robust Federated Learning Within Fully Homomorphic Encryption. arXiv 2024, arXiv:2408.06197. [Google Scholar] [CrossRef]
  5. Asiri, M.; Khemakhem, M.A.; Alhebshi, R.M.; Alsulami, B.S.; Eassa, F.E. Decentralized Federated Learning for IoT Malware Detection at the Multi-Access Edge: A Two-Tier, Privacy-Preserving Design. Future Internet 2025, 17, 475. [Google Scholar] [CrossRef]
  6. Kim, M.; Lee, D.; Seo, J.; Song, Y. Accelerating HE operations from key decomposition technique. In Proceedings of the Annual International Cryptology Conference (CRYPTO 2023), Santa Barbara, CA, USA, August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 70–92. [Google Scholar]
  7. Brakerski, Z.; Gentry, C.; Vaikuntanathan, V. (Leveled) fully homomorphic encryption without bootstrapping. ACM Trans. Comput. Theory TOCT 2014, 6, 309–325. [Google Scholar] [CrossRef]
  8. Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. Cryptol. ePrint Arch. 2012. Available online: https://eprint.iacr.org/2012/144 (accessed on 30 October 2025).
  9. Brakerski, Z.; Vaikuntanathan, V. Efficient fully homomorphic encryption from (standard) LWE. SIAM J. Comput. 2014, 43, 831–871. [Google Scholar] [CrossRef]
  10. Brakerski, Z.; Vaikuntanathan, V. Fully homomorphic encryption from Ring-LWE and security for key dependent messages. In Proceedings of the Annual Cryptology Conference, Santa Barbara, CA, USA, 14–18 August 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 505–524. [Google Scholar]
  11. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of the Advances in Cryptology—ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Springer: Berlin/Heidelberg, Germany, 2017. Proceedings, Part I 23. pp. 409–437. [Google Scholar]
  12. Gentry, C. A Fully Homomorphic Encryption Scheme. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2009. [Google Scholar]
  13. Brakerski, Z.; Vaikuntanathan, V. Lattice-based FHE as secure as PKE. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS’14, New York, NY, USA, 12–14 January 2014; pp. 1–12. [Google Scholar] [CrossRef]
  14. Gentry, C.; Sahai, A.; Waters, B. Homomorphic encryption from learning with errors: Conceptually-simpler, asymptotically-faster, attribute-based. In Proceedings of the Advances in Cryptology—CRYPTO 2013: 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, 18–22 August 2013; Springer: Berlin/Heidelberg, Germany, 2013. Proceedings, Part I. pp. 75–92. [Google Scholar]
  15. Naehrig, M.; Lauter, K.; Vaikuntanathan, V. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, Chicago, IL, USA, 21 October 2011; pp. 113–124. [Google Scholar]
  16. Gentry, C.; Halevi, S.; Smart, N.P. Homomorphic Evaluation of the AES Circuit. Cryptol. ePrint Arch. 2012. Available online: https://eprint.iacr.org/2012/099 (accessed on 30 October 2025).
  17. Aslett, L.; Esperança, P.; Holmes, C. Encrypted statistical machine learning: New privacy preserving methods. arXiv 2015, arXiv:1508.06845. [Google Scholar] [CrossRef]
  18. Aono, Y.; Hayashi, T.; Wang, L.; Moriai, S. Privacy-preserving deep learning via additively homomorphic encryption. IEEE Trans. Inf. Forensics Secur. 2017, 13, 1333–1345. [Google Scholar] [CrossRef]
  19. Chen, H.; Cammarota, R.; Valencia, F.; Regazzoni, F.; Koushanfar, F. Ahec: End-to-end compiler framework for privacy-preserving machine learning acceleration. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC), Virtual, 20–24 July 2020; IEEE: New York, NY, USA, 2020; pp. 1–6. [Google Scholar]
  20. Juvekar, C.; Vaikuntanathan, V.; Chandrakasan, A. GAZELLE: A Low Latency Framework for Secure Neural Network Inference. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, USA, 15–17 August 2018; pp. 1651–1669. [Google Scholar]
  21. Ao, W.; Boddeti, V.N. AutoFHE: Automated Adaption of CNNs for Efficient Evaluation over FHE. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–18 August 2024; pp. 2173–2190. [Google Scholar]
  22. Chen, H.; Laine, K.; Player, R. Simple encrypted arithmetic library-SEAL v2.1. In Proceedings of the Financial Cryptography and Data Security: FC 2017 International Workshops, WAHC, BITCOIN, VOTING, WTSC, and TA, Sliema, Malta, 7 April 2017; Springer: Berlin/Heidelberg, Germany, 2017. Revised Selected Papers 21. pp. 3–18. [Google Scholar]
  23. Halevi, S.; Shoup, V. Design and implementation of a homomorphic-encryption library. IBM Res. 2013, 6, 8–36. [Google Scholar]
  24. Al Badawi, A.; Bates, J.; Bergamaschi, F.; Cousins, D.B.; Erabelli, S.; Genise, N.; Halevi, S.; Hunt, H.; Kim, A.; Lee, Y.; et al. OpenFHE: Open-source fully homomorphic encryption library. In Proceedings of the 10th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, Los Angeles, CA, USA, 7 November 2022; pp. 53–63. [Google Scholar]
  25. Kim, A.; Polyakov, Y.; Zucca, V. Revisiting Homomorphic Encryption Schemes for Finite Fields. In Proceedings of the Advances in Cryptology—ASIACRYPT 2021, Singapore, 6–10 December 2021; pp. 608–639. [Google Scholar] [CrossRef]
  26. Han, K.; Ki, D. Better Bootstrapping for Approximate Homomorphic Encryption. Cryptol. ePrint Arch. 2019. Available online: https://eprint.iacr.org/2019/688 (accessed on 30 October 2025).
  27. Zhou, L.; Huang, R.; Wang, B. Enhancing Multi-Key Fully Homomorphic Encryption with Efficient Key Switching and Batched Multi-Hop Computations. Appl. Sci. 2025, 15, 5771. [Google Scholar] [CrossRef]
  28. Hwang, I.; Seo, J.; Song, Y. Optimizing HE operations via Level-aware Key-switching Framework. In Proceedings of the 11th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, Copenhagen, Denmark, 26 November 2023; pp. 59–67. [Google Scholar]
  29. Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Accelerating fully homomorphic encryption using GPU. In Proceedings of the 2012 IEEE Conference on High Performance Extreme Computing, Waltham, MA, USA, 10–12 September 2012; IEEE: New York, NY, USA, 2012; pp. 1–5. [Google Scholar]
  30. Wang, W.; Chen, Z.; Huang, X. Accelerating leveled fully homomorphic encryption using GPU. In Proceedings of the 2014 IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne, Australia, 1–5 June 2014; IEEE: New York, NY, USA, 2014; pp. 2800–2803. [Google Scholar]
  31. Dai, W.; Sunar, B. cuHE: A homomorphic encryption accelerator library. In Proceedings of the Cryptography and Information Security in the Balkans: Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, 3–4 September 2015; pp. 169–186. [Google Scholar]
  32. Al Badawi, A.; Veeravalli, B.; Mun, C.F.; Aung, K.M.M. High-performance FV somewhat homomorphic encryption on GPUs: An implementation using CUDA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2018, 70–95. [Google Scholar] [CrossRef]
  33. Al Badawi, A.; Veeravalli, B.; Lin, J.; Xiao, N.; Kazuaki, M.; Mi, A.K.M. Multi-GPU design and performance evaluation of homomorphic encryption on GPU clusters. IEEE Trans. Parallel Distrib. Syst. 2020, 32, 379–391. [Google Scholar] [CrossRef]
  34. Goey, J.Z.; Lee, W.K.; Goi, B.M.; Yap, W.S. Accelerating number theoretic transform in GPU platform for fully homomorphic encryption. J. Supercomput. 2021, 77, 1455–1474. [Google Scholar] [CrossRef]
  35. Alves, P.G.M.; Ortiz, J.N.; Aranha, D.F. Faster homomorphic encryption over GPGPUs via hierarchical DGT. In Proceedings of the International Conference on Financial Cryptography and Data Security, Virtual, 1–5 March 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 520–540. [Google Scholar]
  36. vernamlab. cuFHE; GitHub. 2018. Available online: https://github.com/vernamlab/cuFHE (accessed on 5 November 2025).
  37. Xiao, Y.; Liu, F.H.; Ku, Y.T.; Ho, M.C.; Hsu, C.F.; Chang, M.C.; Hung, S.H.; Chen, W.C. GPU Acceleration for FHEW/TFHE Bootstrapping. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2025, 2025, 314–339. [Google Scholar] [CrossRef]
  38. Shen, S.; Yang, H.; Liu, Z.; Liu, Y.; Lu, X.; Dai, W.; Zhou, L.; Zhao, Y.; Cheung, R.C.C. VeloFHE: GPU Acceleration for FHEW and TFHE Bootstrapping. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2025, 2025, 81–114. [Google Scholar] [CrossRef]
  39. Jin, S.; Shen, S.; Yang, H.; Chen, D.; Dai, W.; Cheung, R.C.C. CuFDFB: Fast and Private Computation on Non-Linear Functions Using FHE. Cryptol. ePrint Arch. 2025, 14, 1–13. Available online: https://eprint.iacr.org/2025/1096 (accessed on 30 October 2025).
  40. Jung, W.; Kim, S.; Ahn, J.H.; Cheon, J.H.; Lee, Y. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 2021, 114–148. [Google Scholar] [CrossRef]
  41. Yang, H.; Shen, S.; Dai, W.; Zhou, L.; Liu, Z.; Zhao, Y. Phantom: A CUDA-accelerated Word-Wise Homomorphic Encryption Library. IEEE Trans. Dependable Secur. Comput. 2024, 21, 4895–4906. [Google Scholar] [CrossRef]
  42. Ozcan, A.S.; Savas, E. HEonGPU: A GPU-Based Fully Homomorphic Encryption Library 1.0. Cryptol. ePrint Arch. 2024. Available online: https://eprint.iacr.org/2024/1543 (accessed on 30 October 2025).
  43. Fan, S.; Wang, Z.; Xu, W.; Hou, R.; Meng, D.; Zhang, M. TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 922–934, ISSN 2378-203X. [Google Scholar] [CrossRef]
  44. Fan, G.; Zhang, M.; Zheng, F.; Fan, S.; Zhou, T.; Deng, X.; Tang, W.; Kong, L.; Song, Y.; Yan, S. WarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores. In Proceedings of the 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), Las Vegas, NV, USA, 1–5 March 2025; pp. 1187–1200, ISSN 2378-203X. [Google Scholar] [CrossRef]
  45. Kim, J.; Choi, W.; Ahn, J.H. Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs. arXiv 2024, arXiv:2407.13055. [Google Scholar] [CrossRef]
  46. Mono, J.; Güneysu, T. A New Perspective on Key Switching for BGV-like Schemes. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2025, 2025, 763–794. [Google Scholar] [CrossRef]
Figure 1. The computational workflow for key switching using the KLSS technique. Decomp: digit decomposition; IP: inner product.
Figure 2. The unified framework accommodating GPU-accelerated KLSS key switching for BFV, BGV, and CKKS.
Figure 3. Key-switching key size growth vs. multiplication depth ($\ell$), with $N = 2^{16}$ and 8 bytes per coefficient.
Figure 4. Runtime for one key-switching operation vs. key-switching key size trade-off for varying $r_{double}$, with $N = 16{,}384$. (a) $\ell = 16$. (b) $\ell = 24$. (c) $\ell = 32$. (d) $\ell = 48$.
Table 1. Per-ciphertext key-switching latency for the Hybrid and KLSS (Ours) methods for the BFV, BGV, and CKKS schemes on an NVIDIA A100 GPU. All latency values are in milliseconds (ms).

Set # | BFV Hybrid [26,41] | BFV Ours | Speedup | BGV Hybrid [26,41] | BGV Ours | Speedup | CKKS Hybrid [26,41] | CKKS Ours | Speedup
------|------|------|-------|------|------|-------|------|------|-------
1  | 0.94  | 0.85 | 1.10× | 1.03  | 0.97 | 1.06× | 1.00  | 0.95 | 1.06×
2  | 0.61  | 0.68 | 0.89× | 0.67  | 0.74 | 0.91× | 0.66  | 0.72 | 0.92×
3  | 0.49  | 0.71 | 0.69× | 0.55  | 0.81 | 0.67× | 0.53  | 0.73 | 0.73×
4  | 1.36  | 1.19 | 1.15× | 1.49  | 1.36 | 1.10× | 1.48  | 1.32 | 1.12×
5  | 0.86  | 0.79 | 1.09× | 0.96  | 0.89 | 1.08× | 0.92  | 0.88 | 1.04×
6  | 0.64  | 0.79 | 0.80× | 0.75  | 0.85 | 0.89× | 0.71  | 0.87 | 0.81×
7  | 1.86  | 1.51 | 1.23× | 1.93  | 1.63 | 1.18× | 1.94  | 1.67 | 1.17×
8  | 1.14  | 1.03 | 1.11× | 1.20  | 1.13 | 1.07× | 1.17  | 1.12 | 1.05×
9  | 0.80  | 0.92 | 0.88× | 0.89  | 1.03 | 0.87× | 0.91  | 1.02 | 0.89×
10 | 5.58  | 3.30 | 1.69× | 5.66  | 3.73 | 1.52× | 5.67  | 3.62 | 1.57×
11 | 3.46  | 2.96 | 1.17× | 3.55  | 3.23 | 1.10× | 3.56  | 3.17 | 1.12×
12 | 2.48  | 2.29 | 1.09× | 2.70  | 2.58 | 1.05× | 2.70  | 2.53 | 1.07×
13 | 1.91  | 2.02 | 0.95× | 2.11  | 2.26 | 0.94× | 2.04  | 2.30 | 0.89×
14 | 8.62  | 4.41 | 1.95× | 8.62  | 4.77 | 1.81× | 8.54  | 4.72 | 1.81×
15 | 5.23  | 4.13 | 1.27× | 5.39  | 4.53 | 1.19× | 5.32  | 4.44 | 1.20×
16 | 3.90  | 2.87 | 1.36× | 4.07  | 3.23 | 1.26× | 4.01  | 3.29 | 1.22×
17 | 2.89  | 2.63 | 1.10× | 3.09  | 2.91 | 1.06× | 3.06  | 2.92 | 1.05×
18 | 12.00 | 5.81 | 2.06× | 12.12 | 6.29 | 1.93× | 12.17 | 6.27 | 1.94×
19 | 7.41  | 5.27 | 1.41× | 7.52  | 5.68 | 1.32× | 7.55  | 5.68 | 1.33×
20 | 5.22  | 3.73 | 1.40× | 5.51  | 4.15 | 1.33× | 5.38  | 4.15 | 1.30×
21 | 4.15  | 3.14 | 1.32× | 4.33  | 3.62 | 1.20× | 4.26  | 3.62 | 1.17×
Table 2. Per-ciphertext key-switching latency and performance comparison for the Hybrid and KLSS (Ours) methods across the BFV, BGV, and CKKS schemes on an NVIDIA RTX 4090 GPU. All latency values are in milliseconds (ms).

Set # | BFV Hybrid [26,41] | BFV Ours | Speedup | BGV Hybrid [26,41] | BGV Ours | Speedup | CKKS Hybrid [26,41] | CKKS Ours | Speedup
------|------|------|-------|------|------|-------|------|------|-------
1  | 0.65 | 0.60 | 1.07× | 0.71 | 0.67 | 1.06× | 0.71 | 0.67 | 1.05×
2  | 0.37 | 0.45 | 0.83× | 0.41 | 0.51 | 0.80× | 0.40 | 0.50 | 0.80×
3  | 0.29 | 0.41 | 0.71× | 0.34 | 0.47 | 0.72× | 0.33 | 0.46 | 0.72×
4  | 0.91 | 0.85 | 1.07× | 0.98 | 0.92 | 1.06× | 0.97 | 0.91 | 1.06×
5  | 0.54 | 0.54 | 1.00× | 0.58 | 0.60 | 0.97× | 0.57 | 0.59 | 0.98×
6  | 0.39 | 0.47 | 0.83× | 0.44 | 0.53 | 0.82× | 0.42 | 0.53 | 0.79×
7  | 1.20 | 1.15 | 1.04× | 1.29 | 1.24 | 1.04× | 1.28 | 1.23 | 1.04×
8  | 0.73 | 0.70 | 1.04× | 0.79 | 0.77 | 1.02× | 0.78 | 0.76 | 1.03×
9  | 0.51 | 0.56 | 0.91× | 0.57 | 0.64 | 0.90× | 0.56 | 0.62 | 0.91×
10 | 3.84 | 3.03 | 1.27× | 4.03 | 3.23 | 1.25× | 4.04 | 3.21 | 1.26×
11 | 2.37 | 2.59 | 0.92× | 2.52 | 2.75 | 0.92× | 2.51 | 2.73 | 0.92×
12 | 1.74 | 1.88 | 0.93× | 1.88 | 2.03 | 0.93× | 1.86 | 2.02 | 0.92×
13 | 1.34 | 1.58 | 0.85× | 1.46 | 1.76 | 0.83× | 1.45 | 1.72 | 0.84×
14 | 6.18 | 4.15 | 1.49× | 6.38 | 4.42 | 1.44× | 6.38 | 4.40 | 1.45×
15 | 3.83 | 3.85 | 0.99× | 3.90 | 4.07 | 0.96× | 3.89 | 4.06 | 0.96×
16 | 2.86 | 2.56 | 1.12× | 2.94 | 2.78 | 1.06× | 2.91 | 2.77 | 1.05×
17 | 2.19 | 2.22 | 0.99× | 2.26 | 2.43 | 0.93× | 2.24 | 2.42 | 0.93×
Table 3. Summary of parameters used for log N = 15 and log N = 16.

log N | Set # | $\tilde{\ell}$ | $\ell$ | r | Optimal $r_{double}$
------|-------|------|------|---|---
15 | 1  | 16 | 15 | 1 | 4
15 | 2  | 16 | 14 | 2 | 5
15 | 3  | 16 | 13 | 3 | 5
15 | 4  | 20 | 19 | 1 | 4
15 | 5  | 20 | 18 | 2 | 5
15 | 6  | 20 | 17 | 3 | 5
15 | 7  | 24 | 23 | 1 | 4
15 | 8  | 24 | 22 | 2 | 5
15 | 9  | 24 | 21 | 3 | 7
16 | 10 | 32 | 31 | 1 | 5
16 | 11 | 32 | 30 | 2 | 3
16 | 12 | 32 | 29 | 3 | 3
16 | 13 | 32 | 28 | 4 | 6
16 | 14 | 40 | 39 | 1 | 5
16 | 15 | 40 | 38 | 2 | 3
16 | 16 | 40 | 37 | 3 | 5
16 | 17 | 40 | 36 | 4 | 6
16 | 18 | 48 | 47 | 1 | 5
16 | 19 | 48 | 46 | 2 | 3
16 | 20 | 48 | 45 | 3 | 5
16 | 21 | 48 | 44 | 4 | 6
Table 4. Per-ciphertext key-switching latency of the CPU vs. GPU implementation (this work).

Set # | CPU [6] (ms) | A100 Time (ms) | A100 Speedup | RTX 4090 Time (ms) | RTX 4090 Speedup
------|------|------|---------|------|---------
1  | 84  | 0.95 | 88.59×  | 0.67 | 124.66×
2  | 62  | 0.72 | 86.68×  | 0.50 | 123.04×
3  | 66  | 0.73 | 90.35×  | 0.46 | 143.87×
4  | 112 | 1.32 | 85.01×  | 0.91 | 122.69×
5  | 85  | 0.88 | 96.15×  | 0.59 | 144.52×
6  | 87  | 0.87 | 99.61×  | 0.53 | 164.61×
7  | 140 | 1.67 | 84.07×  | 1.23 | 113.68×
8  | 110 | 1.12 | 98.13×  | 0.76 | 144.39×
9  | 101 | 1.02 | 99.26×  | 0.62 | 162.05×
10 | 451 | 3.62 | 124.75× | 3.21 | 140.57×
11 | 352 | 3.17 | 110.95× | 2.73 | 128.96×
12 | 335 | 2.53 | 132.63× | 2.02 | 165.50×
13 | 301 | 2.30 | 131.01× | 1.72 | 174.67×
14 | 604 | 4.72 | 127.92× | 4.40 | 137.34×
15 | 508 | 4.44 | 114.31× | 4.06 | 125.14×
16 | 479 | 3.29 | 145.43× | 2.77 | 172.64×
17 | 438 | 2.92 | 149.91× | 2.42 | 181.35×
Table 5. Summary of maximum key sizes (MB) for different polynomial degrees N.

Polynomial Degree N | $\tilde{\ell}$ | Max Single Decomposition | Max Double Decomposition
--------------------|------|------|--------
$2^{12}$ | 60 | 221  | 9071
$2^{13}$ | 60 | 443  | 18,143
$2^{14}$ | 60 | 885  | 36,285
$2^{15}$ | 60 | 1770 | 72,570
$2^{16}$ | 60 | 3540 | 145,140
