Abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation but incurs high computational overhead, with the key-switching operation being a primary performance bottleneck. This paper introduces the first CUDA-optimized GPU implementation of the Kim, Lee, Seo, and Son (KLSS) key-switching algorithm for three leading FHE schemes: BGV, BFV, and CKKS. Our implementation achieves speedups of up to 181× over the original CPU implementation. We also analyze the trade-off between single- and double-decomposition key-switching methods on GPUs. The result is a high-performance tool together with guidelines for choosing between the two methods based on latency requirements and hardware memory constraints.