Article

Efficient Quantization and Data Access for Accelerating Homomorphic Encrypted CNNs

by Kai Chen 1,2, Xinyu Wang 1, Yuxiang Fu 1,* and Li Li 1,*

1 School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China
2 Jiangsu Huachuang Microsystem Company Limited, Nanjing 211800, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(3), 464; https://doi.org/10.3390/electronics14030464
Submission received: 11 December 2024 / Revised: 13 January 2025 / Accepted: 22 January 2025 / Published: 23 January 2025

Abstract

Due to the ability to perform computations directly on encrypted data, homomorphic encryption (HE) has recently become an important branch of privacy-preserving machine learning (PPML). Nevertheless, existing implementations of HE-based convolutional neural network (HCNN) applications remain unsatisfactory in inference latency and area efficiency compared to their unencrypted counterparts. In this work, we first improve the additive powers-of-two (APoT) quantization method for HCNNs to achieve a better tradeoff between the complexity of modular multiplication and the network accuracy. An efficient multiplicationless modular multiplier–accumulator (M-MAC) unit is designed accordingly. Furthermore, a batch-processing HCNN accelerator with M-MACs is implemented, in which we propose an advanced data partition scheme to avoid multiple moves of the large ciphertext polynomials. Compared to the latest FPGA design, our accelerator achieves an 11× resource reduction of the M-MAC and a 2.36× speedup in inference latency for the widely used CNN-11 network processing 8K images. The speedup of our design is also significant compared to the latest CPU and GPU implementations of batch-processing HCNN models.

1. Introduction

Fully homomorphic encryption (FHE) [1] is a promising solution for privacy-preserving machine learning (PPML) [2] because it allows computation to be performed on encrypted data without decryption. After encryption, the inputs are converted to high-degree polynomials, and the HE evaluation is executed over a polynomial ring with a large modulus. Therefore, inference on large machine learning models such as the widely used convolutional neural networks (CNNs) suffers from much higher computational complexity and memory usage than the unencrypted version [3].
As the first framework enabling HE-based CNN (HCNN) inference, CryptoNets [4] adopted the Chinese remainder theorem (CRT) to pack the pixels at the same position in a batch of images into a ciphertext polynomial. This packing approach is friendly to batch-processing scenarios and has been adopted in GPU [5] and FPGA [6] implementations. Different packing methods [7,8,9] have been proposed recently to reduce the inference latency of a single image, but they introduce expensive homomorphic rotation operations [10]. Several pruning schemes have been introduced to further reduce the computation effort, including powers-of-two weight quantization [11], standard pruning [6,11] and packing-aware pruning [12]. Faster CryptoNets [11] showed a significant reduction in operations, but the integer encoder it employed is much less efficient than the batch encoder. The FPGA accelerator [6] first introduced weight sparsity and focused on the dataflow optimization of processing 8K images simultaneously, but its inference latency cannot be reduced when processing small batches of images. The packing-aware pruning method in [12] is based on the packing method of GAZELLE [7], so the rotation operation is inevitable.
To improve the inference efficiency, we propose an efficient modular multiplicationless convolution architecture. The specific contributions are summarized as follows:
  • An HE-friendly additive powers-of-two (APoT) quantization method is adopted and improved to reduce the cost of modular multiplication in HCNN inference, achieving negligible accuracy loss compared with floating-point CNNs.
  • A corresponding multiplicationless modular multiplier–accumulator (M-MAC) unit is proposed, achieving an 11× area reduction compared to the standard M-MAC unit adopted by the latest FPGA accelerator [6].
  • An HCNN accelerator with an M-MAC array is designed to implement widely used CNNs with a moderate batch size. Repeated transmission of the input and output polynomials is avoided by our proposed data access strategy. When processing 8K images of the CIFAR-10 dataset, our FPGA design is 2.36× and 3.95× faster than recent batch-processing FPGA and GPU implementations, respectively.

2. Preliminaries

2.1. Homomorphic Encryption

Most modern HE instantiations rely on the hardness of the Ring-LWE (RLWE) problem and operate over a polynomial ring. In this paper, we adopt the BFV scheme [10] for implementation, but the proposed techniques are also applicable to other RLWE-based schemes such as CKKS [13]. The HE scheme can be described as follows. The plaintext (pt) message is first encoded into a polynomial representation and then encrypted as ciphertext (ct) polynomials. The ct polynomials support the following homomorphic computations: ct–pt addition, ct–ct addition, ct–pt multiplication, ct–ct multiplication and ct rotation. The noise grows as homomorphic computations are performed, and the decryption function returns the correct result only when the noise does not exceed a bound. There are three essential parameters in the BFV scheme: the polynomial degree n, the plaintext modulus t and the ciphertext modulus q. The coefficients of the ct and pt polynomials can be viewed as elements of Z_q and Z_t, respectively.

2.2. Encrypted CNN Linear Layers

Given the characteristics of FHE, linear layers of a CNN such as convolution, fully connected and average-pooling layers can be directly implemented using homomorphic computation. Nevertheless, nonlinear layers such as activation functions and max-pooling operations need to be modified to accommodate homomorphic evaluation [4] or realized via multi-party computation (MPC) [7]. Here, we focus on the acceleration of the linear layers with encrypted inputs and unencrypted weights, as in [4,5,6,7,8,9]. There are many different data representations of input features and weights; this paper supports three of those used in LoLa [8]:
  • Sparse representation: Each element v_i in v is represented by a message in which every coordinate is equal to v_i.
  • SIMD representation: The i-th item of each of the vectors {v_0, …, v_{j−1}} (j ≤ n) is sequentially mapped to an item of a message.
  • Convolution representation: The img2col technique [14] is introduced to flatten the input images. Each column after flattening is mapped to a message, which is multiplied by the same weight.
For a convolution layer, the input X_in and output X_out are 3D tensors of size I_c × I_h × I_w and O_c × O_h × O_w, where (·)_w, (·)_h and (·)_c denote the width, height and channel number, respectively. The corresponding weight matrix W is a 4D tensor of size O_c × I_c × d × d, where d denotes the kernel size, and the stride is set to S. Generally, the input features are represented by the SIMD or convolution representation, and the weights are converted into the sparse representation, since only element-wise modular multiplication and addition operations are needed. The representation of a fully connected layer with I inputs and O outputs is similar, except that the input features can only use the SIMD representation. Average-pooling layers can be regarded as convolution layers with all-one weights.
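To make the convolution representation concrete, the following NumPy sketch (our own illustration, not code from the paper; the helper name img2col and the orientation of the flattening are assumptions) shows how a convolution layer becomes a sequence of element-wise scalings and additions of messages, which is exactly the operation pattern accelerated later by the M-MAC array.

```python
import numpy as np

def img2col(x, d, stride=1):
    """Gather, for each kernel position (ki, kj), the input pixels that the
    weight w[ki, kj] multiplies, flattened over all output positions."""
    ih, iw = x.shape
    oh, ow = (ih - d) // stride + 1, (iw - d) // stride + 1
    rows = np.empty((d * d, oh * ow), dtype=x.dtype)
    for ki in range(d):
        for kj in range(d):
            rows[ki * d + kj] = x[ki:ki + oh * stride:stride,
                                  kj:kj + ow * stride:stride].reshape(-1)
    return rows, (oh, ow)

# Toy single-channel example: 6x6 image, 3x3 kernel, stride 1.
x = np.arange(36, dtype=np.int64).reshape(6, 6)
w = np.arange(9, dtype=np.int64).reshape(3, 3)
rows, (oh, ow) = img2col(x, d=3)

# Convolution representation: each flattened slice becomes one message, and the
# whole message is multiplied by a single weight (sparse representation of w),
# so the layer reduces to element-wise scalings and additions of messages.
out_msg = sum(w.reshape(-1)[k] * rows[k] for k in range(9))

# Cross-check against an ordinary (unencrypted) convolution.
ref = np.zeros((oh, ow), dtype=np.int64)
for i in range(oh):
    for j in range(ow):
        ref[i, j] = (x[i:i + 3, j:j + 3] * w).sum()
print(np.array_equal(out_msg.reshape(oh, ow), ref))   # True
```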

3. Efficient Modular Processing Element Based on APoT Quantization Method

3.1. Optimized APoT Quantization Method

Since the BFV scheme only supports integer computation, the floating-point inputs and weights need to be quantized to fixed-point integers. CryptoNets [4] and HCNN [5] adopted the uniform quantization method. Faster CryptoNets [11] adopted powers-of-two weight quantization to reduce the complexity of modular multiplication. Nevertheless, powers-of-two quantization provides a lower resolution than uniform quantization under the same bit width, which causes a noticeable accuracy loss.
To achieve a better tradeoff between inference accuracy and the complexity of modular multiplication, we first introduce the additive powers-of-two (APoT) quantization method [15] into HCNN training and inference. For a δ-bit weight x, the quantized value x̂ is the sum of two powers-of-two numbers:

x̂ = sign(x) · ((φ_1 << k_1) + (φ_2 << k_2)),

where sign(x) = 1 if x ≥ 0 and −1 otherwise; 0 ≤ k_1, k_2 < δ − 1; and φ_1, φ_2 ∈ {0, 1}. Note that, when φ_1 = 0, the APoT scheme reduces to the PoT scheme. We count the weight distribution of the third convolutional layer in CNN-11; the histogram is shown in Figure 1. Compared with the original APoT scheme, which uses δ to denote the bit width required to store k_1 and k_2, i.e., k_1, k_2 < 2^(δ−1) − 1, our scheme adopts a tighter shift range and allows the quantized weights to be distributed in a range closer to zero. To enable backpropagation during training, the straight-through estimator (STE) [16] is adopted. The proposed APoT quantization strategy is advantageous for HCNN inference since it slows down the growth in bit width during computing. Because the BFV scheme cannot support division operations, the bit width of the results keeps growing during HCNN inference. A tighter shift range not only reduces the bit width of the shifted additions but also significantly reduces the parameter size of the BFV scheme [5]. To ensure the correctness of the final decryption, the value of the final results must not exceed t; however, a larger value of t results in a larger ciphertext size and more computational effort.
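As a concrete illustration of this quantization rule, the PyTorch sketch below (a minimal sketch of our own, not the training code used in the paper; the per-tensor scale and the tensor shapes are assumptions) quantizes a weight tensor to the two-term powers-of-two form with the tighter shift range 0 ≤ k_1, k_2 < δ − 1 and uses the straight-through estimator so that gradients pass through during training.

```python
import torch

def apot_levels(delta: int) -> torch.Tensor:
    """Candidate integer magnitudes phi1*2^k1 + phi2*2^k2 with 0 <= k1, k2 < delta-1
    (the tighter shift range) and phi1, phi2 in {0, 1}."""
    shifts = [0] + [1 << k for k in range(delta - 1)]   # phi = 0 contributes a 0 term
    mags = sorted({a + b for a in shifts for b in shifts})
    return torch.tensor(mags, dtype=torch.float32)

def apot_quantize(w: torch.Tensor, delta: int = 4) -> torch.Tensor:
    """Quantize w to signed APoT levels; straight-through estimator for gradients."""
    levels = apot_levels(delta).to(w.device)
    scale = w.abs().max() / levels.max()                 # per-tensor scale (assumption)
    mag = (w.abs() / scale).clamp(max=float(levels.max()))
    idx = torch.argmin((mag.unsqueeze(-1) - levels).abs(), dim=-1)  # nearest APoT magnitude
    q = torch.sign(w) * levels[idx] * scale
    return w + (q - w).detach()                          # STE: forward q, backward identity

# Example: quantize a conv weight tensor during training.
w = torch.randn(32, 3, 3, 3, requires_grad=True)
wq = apot_quantize(w, delta=4)
loss = (wq ** 2).sum()
loss.backward()                                          # gradients flow to w via the STE
```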

3.2. Multiplicationless Modular Multiplier–Accumulator

Since the computation of a linear layer can be expressed as a series of ct polynomials multiplied by the corresponding weight scalars and added together over Z_q, the core of the processing element (PE) of an HCNN accelerator is the modular multiplier–accumulator (M-MAC). The modular multiplication between the coefficients of a ct polynomial and the quantized weight can be simplified by adopting the APoT quantization method. The modular multiplier is implemented by a multiplier followed by a modular reduction unit. We replace the multiplier with shifts and additions, and adopt the improved Barrett reduction method [17] to reduce the resource consumption. The modular adder is implemented in the same way as in [18]. A six-stage pipelined architecture is designed to implement the low bit-width M-MAC; the architecture is shown in Figure 2.
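The following Python snippet is a bit-accurate software model of what one M-MAC operation computes (our own sketch, not the RTL; it uses the textbook Barrett reduction rather than the specific improved variant of [17], and the example modulus is chosen only for illustration): the weight is applied with two shifts and an addition, followed by modular reduction and modular accumulation.

```python
def barrett_reduce(x: int, q: int, k: int, mu: int) -> int:
    """Standard Barrett reduction of x (x < q^2) modulo q, with mu = floor(4^k / q)
    precomputed for a k-bit modulus q."""
    t = (x * mu) >> (2 * k)
    r = x - t * q
    while r >= q:              # at most a couple of correction subtractions
        r -= q
    return r

def mmac(acc: int, coeff: int, sign: int, k1: int, k2: int,
         phi1: int, phi2: int, q: int, k: int, mu: int) -> int:
    """Multiplicationless modular multiply-accumulate:
    acc <- acc + coeff * sign * ((phi1 << k1) + (phi2 << k2))  (mod q)."""
    shifted = phi1 * (coeff << k1) + phi2 * (coeff << k2)   # shift-and-add, no multiplier
    shifted = barrett_reduce(shifted, q, k, mu)
    if sign < 0:
        shifted = (q - shifted) % q                         # modular negation
    return (acc + shifted) % q                              # modular addition

# Illustrative 60-bit modulus (not claimed to be one of SEAL's primes).
q = (1 << 60) - 2 ** 14 + 1
k = q.bit_length()
mu = (1 << (2 * k)) // q
acc = 0
acc = mmac(acc, coeff=123456789, sign=+1, k1=2, k2=0, phi1=1, phi2=1, q=q, k=k, mu=mu)
print(acc == (123456789 * 5) % q)   # weight = +(1 << 2) + (1 << 0) = 5 -> True
```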

4. Transmission-Efficient Homomorphic CNN Accelerator with M-MACs

4.1. Overall Architecture

The implementation of homomorphic CNNs differs from that of unencrypted CNNs in the following respects. First, the required data size grows by more than four orders of magnitude, because the inputs and outputs are all sets of ct polynomials represented by the Chinese remainder theorem (CRT), each containing 2·r polynomials, where r denotes the number of moduli after splitting. Second, since all operations in the linear layers after encryption are performed over Z_q, the resulting frequent modular operations significantly increase the hardware resource overhead. In this section, we aim to implement a homomorphic CNN accelerator to demonstrate the efficiency of the proposed M-MAC. Furthermore, an advanced data partition and transfer strategy is proposed to reduce the on-chip data transfer time. The accelerator supports sparse-representation weights and convolution-representation inputs, which facilitates the processing of small batches of images. Therefore, the computation of the linear layers is transformed into matrix multiplication.
The overall architecture is shown in Figure 3. The input and output controllers are responsible for distributing data to the buffers and aggregating data from the buffers to the data interface, respectively. The computing array consists of v clusters. Within each cluster, there are u M-MACs, a modular adder, a PE controller, a weight buffer and a sparse weight index buffer. After weighing the computation time against the data transmission time, we implement an array of moderate size with u = v = 16. By introducing a ping-pong strategy, the majority of the data transmission time can be hidden by the computing time.
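The effect of the ping-pong strategy can be illustrated with a simple timing model (our own back-of-the-envelope sketch; the tile count and the per-tile load/compute times are made-up numbers): with double buffering, each tile costs max(load, compute) instead of load + compute, so most of the transfer time is hidden.

```python
def single_buffered(tiles, t_load, t_compute):
    # Loading and computing strictly alternate on a single buffer.
    return tiles * (t_load + t_compute)

def double_buffered(tiles, t_load, t_compute):
    # While the array computes on buffer A, the next tile streams into buffer B,
    # so each step after the first load costs max(t_load, t_compute).
    return t_load + tiles * max(t_load, t_compute)

# Hypothetical numbers: 64 tiles, 1.0 ms to load a tile, 1.4 ms to compute on it.
tiles, t_load, t_compute = 64, 1.0, 1.4
print(single_buffered(tiles, t_load, t_compute))   # 153.6 ms
print(double_buffered(tiles, t_load, t_compute))   # 90.6 ms: transfer time largely hidden
```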

4.2. Implementation Details

In this subsection, we present the details of the data transfer and computing process. The computation of the linear layers is converted to matrix multiplication for fast execution. Here, we use a convolution layer as an example; the same scheme also applies to the fully connected and average-pooling layers. We use poly to denote the size of a ciphertext input, which is equal to 2·r·n·log2(q_i) bits (q = q_1 · q_2 ⋯ q_r). Assuming that O_h × O_w is less than n, the memory required for ⌊n/(O_h × O_w)⌋ batches of inputs and outputs is d^2·I_c × poly and O_c × poly, respectively. After introducing the weight-pruning technique, the memory required for the weights is d^2 × I_c × O_c × (1 − sparsity) × δ bits. As in [5], no bias is used in the linear layers.
The data size of the weights is relatively small, so the weights of each layer can be stored on-chip before computing. To reduce the storage area, only the non-zero weights and an index buffer indicating whether the weight at each position is non-zero are stored. Nevertheless, it is impractical to store all the encrypted input data (∼GB) on the chip. The entire input data are first stored in the off-chip DDR and transferred to the on-chip memory segment by segment as required. Since the polynomials over different Z_{q_i} in a ciphertext are independent, only one residue polynomial of each ciphertext is processed first, and the remaining polynomials are processed after this set of operations is finished. The data of the convolution layer are partitioned as shown in Figure 4, where the numbers in the circles indicate the order of access. Here, d^2 × 16 input polynomial tiles and the corresponding weights for a unit of 16 output channels are loaded for one calculation. Since the quantity of weights is much smaller than that of inputs, the weights rather than the inputs are read repeatedly. Each input polynomial tile loaded from the DRAM is discarded after all the computations involving it have been completed. The non-zero weights are decomposed by a lookup table (LUT) into the input format required by the M-MACs, i.e., sign, k_1, k_2, φ_1, φ_2. The 16-way parallel inputs and weights are broadcast to the corresponding M-MACs, and the outputs are pipelined to the modular adders. The results are added to the intermediate results of the previous round to achieve accumulation along the I_c dimension. An M-MAC is gated (stops toggling) when it receives a zero-valued weight, but the latency is not reduced, to simplify the control logic. The advantage of our proposed strategy is that, regardless of the amount of data to be computed, both the input and the output polynomials only need to be transmitted once, without any overhead of repeated transmission.
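The loop nest below sketches the access order we infer from Figure 4 (a Python sketch of our own reading of the text, not the authors' control logic; the tile counts are hypothetical): weights are cheap and therefore re-read for every input tile, whereas each input polynomial tile is fetched from DRAM exactly once and each output chunk is written back exactly once.

```python
# Hypothetical tile counts for illustration only.
R, CHUNKS, IC_TILES, OC_TILES = 2, 8, 4, 8   # residues, coefficient chunks, Ic/16, Oc/16

dram_reads, dram_writes, weight_reads = 0, 0, 0
for res in range(R):                          # residue polynomials are independent
    for c in range(CHUNKS):                   # one coefficient chunk of every ciphertext
        out_chunk = {}                        # partial sums for the output channels stay on chip
        for ic in range(IC_TILES):
            dram_reads += 1                   # input tile (d*d x 16 channels) loaded exactly once
            for oc in range(OC_TILES):
                weight_reads += 1             # weights are small, so they are re-read per input tile
                out_chunk[oc] = out_chunk.get(oc, 0) + 1   # accumulate along the Ic dimension
        dram_writes += OC_TILES               # each output chunk written back exactly once

print(dram_reads, dram_writes, weight_reads)
# input reads: R*CHUNKS*IC_TILES, output writes: R*CHUNKS*OC_TILES,
# weight reads: R*CHUNKS*IC_TILES*OC_TILES (weights, not ciphertexts, are re-read)
```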

5. Experimental Results

5.1. Experimental Setup

For a fair comparison, we target the two most commonly studied homomorphic CNNs, CNN-6 (for MNIST) and CNN-11 (for CIFAR-10) [4,5,6,7,8], for deployment. The architectures of the two CNNs are shown in Table 1. The pooling operation is executed before the activation function to reduce the number of inputs to the activation function. The networks are trained, quantized and pruned using the PyTorch v1.7 framework, and then implemented with the SEAL library [19] to verify the accuracy.
The HE parameters are chosen as in [6], i.e., n = 8192, and the bit width of q is 218 bits and 304 bits for CNN-6 and CNN-11, respectively. Note that both n and q could be smaller because the required multiplication depth is very shallow under MPC. The modulus q consists of several small moduli selected in SEAL [19], each with a maximum bit width of 60 bits.
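As a rough sizing check using the parameters just quoted (our own arithmetic; assuming, for simplicity, that every residue polynomial is stored in 60-bit words), the sketch below estimates the number of CRT moduli r and the size of a single ciphertext for each network.

```python
import math

def ciphertext_size_bytes(n, logq_bits, max_modulus_bits=60):
    """A fresh BFV ciphertext holds 2 polynomials; after CRT/RNS splitting each
    becomes r residue polynomials of n coefficients, one small modulus each."""
    r = math.ceil(logq_bits / max_modulus_bits)
    bits = 2 * r * n * max_modulus_bits    # assumes every residue uses 60-bit words
    return r, bits // 8

for name, logq in [("CNN-6", 218), ("CNN-11", 304)]:
    r, size = ciphertext_size_bytes(8192, logq)
    print(f"{name}: r = {r}, one ciphertext is about {size / 1024:.0f} KiB")
# CNN-6:  r = 4, about 480 KiB per ciphertext
# CNN-11: r = 6, about 720 KiB per ciphertext
```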

5.2. Network Accuracy

We train the two networks, test the inference results of the proposed APoT quantization method, and compare them with the floating-point and uniformly quantized versions. The results for two activation functions, ReLU and a polynomial approximation, are given in Table 2. CNN-6 and CNN-11 adopt x^2 and ax^2 + bx + c as activation functions, respectively. The input images are both scaled by 255. The proposed APoT quantization method is also applicable to HCNN models that use a polynomial approximation as the activation function. Although the use of the polynomial approximation decreases the accuracy of CNN-11, the quantization-aware training method avoids significant accuracy degradation and outperforms the post-training uniform quantization method.
To further compress the model size, we evaluate several pruning strategies. Among them, structured pruning not only compresses the weights but also reduces the size of the input and output polynomials, which efficiently reduces the computational effort of the hardware accelerator. Therefore, we adopt the structured pruning supported by PyTorch and achieve 50.4% and 57.3% weight sparsity for CNN-6 and CNN-11, respectively.
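A minimal example of this kind of structured pruning is sketched below (the layer shape, norm and pruning ratio are illustrative rather than the exact settings that produced the reported sparsity); it uses PyTorch's built-in torch.nn.utils.prune module.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3, bias=False)

# Remove 50% of the output channels by L2 norm (dim=0 is the output-channel axis).
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)

# Make the pruning permanent and inspect the resulting weight sparsity.
prune.remove(conv, "weight")
sparsity = (conv.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")   # about 50%, with whole output channels zeroed
```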

5.3. Implementation Results of Modular Multiplication

The HCNN accelerator is implemented and evaluated on a Xilinx Virtex UltraScale XCVU440-FLGA2892 FPGA board (Xilinx, San Jose, CA, USA). The FPGA is connected to the CPU via a PCIe Gen3 x8 interface to transmit data. The measured transmission bandwidth of the 8 GB DRAM attached to the FPGA is about 12 GB/s. When evaluating the end-to-end implementation, we find that the communication overhead of an MPC implementation between the client and the server is quite high. Therefore, we choose the server-only multi-batch HCNN implementation described in [4,5,6] for evaluation. By adopting SIMD encoding, 8192 images can be inferred simultaneously. The activation function based on polynomial multiplication is performed on an Intel Xeon Gold 6154 CPU (36 threads) running at 3.0 GHz.
The performance of our accelerator and of the most similar end-to-end HCNN accelerator [6] is listed in Table 3. After employing the simplified M-MAC architecture, our proposed M-MAC consumes only three DSPs, which is 11× fewer than [6]. Because there is no need to transfer a complete polynomial to the on-chip memory, the memory size of our design is also smaller. The remaining reduction in resource consumption mainly comes from the reduced datapath width. The end-to-end inference latency is the sum of the off-chip data transfer latency, the latency of the on-chip linear-layer operations and the latency of the CPU-side activation-function operations. The off-chip data transfer latency is limited by two factors: the PCIe and DRAM bandwidth of this design is lower than that of [6], and the convolution representation naturally requires more data to be transferred than the SIMD representation. The linear-layer operations are the focus of this design. When executing CNN-6, the latency reduction is insignificant because [6] achieves 90% sparsity. When executing CNN-11, the execution time of the linear layers is reduced by 3.96×, and the execution time of the activation function is halved because pooling is executed before the activation function. In total, our design achieves 1.18× and 2.36× speedups over [6] in the end-to-end implementation. The speedup of our design is also significant compared to the latest CPU and GPU implementations of multi-batch HCNN models: CryptoNets [4] takes 10.251 s to process 8K images using CNN-6 on the same CPU platform, while the inference latency of the latest GPU implementation, HCNN [5], is 5.1 s and 304 s when processing 8K images using CNN-6 and CNN-11, respectively.

6. Conclusions

An HCNN accelerator with improved M-MACs based on the APoT quantization method is proposed in this paper. An efficient data tiling strategy is designed to minimize data movement. To the best of the authors' knowledge, the proposed FPGA accelerator is the fastest end-to-end implementation of batch-processing HCNN models.

Author Contributions

K.C. explored the APoT quantization method and designed the accelerator; K.C. performed the experiments with support from X.W.; K.C. analyzed the experimental results; K.C. and X.W. contributed to task decomposition and the corresponding implementations; K.C. wrote the paper; L.L. and Y.F. supervised the project. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFB2806802, in part by the Joint Funds of the National Nature Science Foundation of China under Grant U21B2032, in part by the National Nature Science Foundation of China under Grant 62104098 and in part by the National Key Research and Development Program of China under Grant 2021YFB3600104.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author Kai Chen was employed by the company Jiangsu Huachuang Microsystem Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the STOC ’09: Symposium on Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009; pp. 169–178. [Google Scholar]
  2. Tanuwidjaja, H.C.; Choi, R.; Baek, S.; Kim, K. Privacy-preserving deep learning on machine learning as a service—A comprehensive survey. IEEE Access 2020, 8, 167425–167447. [Google Scholar] [CrossRef]
  3. Aharoni, E.; Drucker, N.; Ezov, G.; Shaul, H.; Soceanu, O. Complex encoded tile tensors: Accelerating encrypted analytics. IEEE Secur. Priv. 2022, 20, 35–43. [Google Scholar] [CrossRef]
  4. Gilad-Bachrach, R.; Dowlin, N.; Laine, K.; Lauter, K.; Naehrig, M.; Wernsing, J. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, New York, NY, USA, 20–22 June 2016; pp. 201–210. [Google Scholar]
  5. Badawi, A.A.; Jin, C.; Lin, J.; Mun, C.F.; Jie, S.J.; Tan, B.H.M.; Nan, X.; Aung, K.M.M.; Chandrasekhar, V.R. Towards the AlexNet moment for homomorphic encryption: HCNN, the first homomorphic CNN on encrypted data with GPUs. IEEE Trans. Emerg. Topics Comput. 2020, 9, 1330–1343. [Google Scholar] [CrossRef]
  6. Yang, Y.; Kuppannagari, S.R.; Kannan, R.; Prasanna, V.K. FPGA accelerator for homomorphic encrypted sparse convolutional neural network inference. In Proceedings of the 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), New York, NY, USA, 15–18 May 2022; pp. 1–9. [Google Scholar]
  7. Juvekar, C.; Vaikuntanathan, V.; Chandrakasan, A. GAZELLE: A low latency framework for secure neural network inference. In Proceedings of the 27th USENIX Conference on Security Symposium, Baltimore, MD, USA, 15–17 August 2018; pp. 1651–1669. [Google Scholar]
  8. Brutzkus, A.; Gilad-Bachrach, R.; Elisha, O. Low latency privacy preserving inference. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 812–821. [Google Scholar]
  9. Reagen, B.; Choi, W.-S.; Ko, Y.; Lee, V.T.; Lee, H.-H.S.; Wei, G.-Y.; Brooks, D. Cheetah: Optimizing and accelerating homomorphic encryption for private inference. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 26–39. [Google Scholar]
  10. Fan, J.; Vercauteren, F. Somewhat Practical Fully Homomorphic Encryption. Cryptol. ePrint Arch. 2012. Available online: https://eprint.iacr.org/2012/144 (accessed on 22 March 2012).
  11. Chou, E.; Beal, J.; Levy, D.; Yeung, S.; Haque, A.; Fei-Fei, L. Faster cryptonets: Leveraging sparsity for real-world encrypted inference. arXiv 2018, arXiv:1811.09953. [Google Scholar]
  12. Cai, Y.; Zhang, Q.; Ning, R.; Xin, C.; Wu, H. Hunter: HE-friendly structured pruning for efficient privacy-preserving deep learning. In Proceedings of the 2022 ACM Asia Conference on Computer and Communications Security (ASIA CCS ’22), Nagasaki, Japan, 30 May–3 June 2022; pp. 931–945. [Google Scholar]
  13. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of the 23rd International Conference on the Theory and Applications of Cryptology and Information Security (ASIACRYPT), Hong Kong, China, 3–7 December 2017; pp. 409–437. [Google Scholar]
  14. Chellapilla, K.; Puri, S.; Simard, P. High performance convolutional neural networks for document processing. In Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR), La Baule, France, 23–26 October 2006. [Google Scholar]
  15. Li, Y.; Dong, X.; Wang, W. Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. arXiv 2019, arXiv:1909.13144. [Google Scholar]
  16. Yin, P.; Lyu, J.; Zhang, S.; Osher, S.; Qi, Y.; Xin, J. Understanding straight-through estimator in training activation quantized neural nets. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  17. Kong, Y. Optimizing the improved Barrett modular multipliers for public-key cryptography. In Proceedings of the International Conference on Computational Intelligence and Software Engineering (CiSE), Wuhan, China, 10–12 December 2010; pp. 1–4. [Google Scholar]
  18. Banerjee, U.; Ukyab, T.S.; Chandrakasan, A.P. Sapphire: A configurable crypto-processor for post-quantum lattice-based protocols. In Proceedings of the IACR Transactions on Cryptographic Hardware and Embedded Systems, Atlanta, GA, USA, 25–28 August 2019; pp. 17–61. [Google Scholar]
  19. Microsoft SEAL (Release 3.2); Microsoft Research: Redmond, WA, USA, 2019; Available online: https://github.com/Microsoft/SEAL (accessed on 16 January 2020).
Figure 1. Comparison between different quantization schemes for the third convolution layer in CNN-11.
Figure 2. M-MAC architecture.
Figure 3. Overall architecture of the CNN accelerator.
Figure 4. Dataflow of the convolution layer.
Table 1. Network architectures.

CNN-6 | Input Size | Description
Conv-1 | 1 × 28 × 28 | filter: 5 × 1 × 5 × 5, stride: 2, activ
Conv-2 | 5 × 13 × 13 | filter: 50 × 5 × 5 × 5, stride: 2, activ
Fc-1 | 50 × 4 × 4 | filter: 100 × 1250
Fc-2 | 1 × 100 | filter: 10 × 100

CNN-11 | Input Size | Description
Conv-1 | 3 × 32 × 32 | filter: 32 × 3 × 3 × 3, stride: 1
Pool-1 | 32 × 32 × 32 | average, 2 × 2, stride: 2, activ
Conv-2 | 32 × 16 × 16 | filter: 64 × 32 × 3 × 3, stride: 1
Pool-2 | 64 × 16 × 16 | average, 2 × 2, stride: 2, activ
Conv-3 | 64 × 8 × 8 | filter: 128 × 64 × 3 × 3, stride: 1
Pool-3 | 128 × 8 × 8 | average, 2 × 2, stride: 2, activ
Fc-1 | 128 × 4 × 4 | filter: 128 × 2048
Fc-2 | 1 × 128 | filter: 10 × 128
Table 2. Accuracy results.

Model | Quant. Method | δ † | Accuracy (ReLU) | Accuracy (Poly)
CNN-6 | Float | 32 | 98.94% | 98.99%
CNN-6 | Uniform | 4 | 98.83% | 98.8% ‡
CNN-6 | Ours (APoT) | 4 | 98.93% | 98.85%
CNN-11 | Float | 32 | 82.74% | 78.49%
CNN-11 | Uniform | 8 | 82.37% | 77.1% ‡
CNN-11 | Ours (APoT) | 8 | 82.15% | 78.36%

† δ denotes the actual bit width of the weights. ‡ The accuracy is obtained from [6].
Table 3. Performance comparison between the two FPGA accelerators.

Design | Accel-L [6] | Our Design
FPGA Device | Xilinx U200 | Xilinx XCVU440
Frequency (MHz) | 175 | 166.7
LUT/FF/DSP | 360K/424K/2320 | 194K/158K/768
BRAM/URAM | 698/264 | 103.5/0

Inference Latency (s) of CNN-6 (8K images)
CPU to FPGA | 0.140 | 0.462
Linear Layers | 0.639 | 0.314
Activ. Layers | 2.706 | 2.175
Total | 3.485 | 2.951

Inference Latency (s) of CNN-11 (8K images)
CPU to FPGA | 2.771 | 15.323
Linear Layers | 98.384 | 24.846
Activ. Layers | 80.131 | 36.729
Total | 181.286 | 76.898

Notes: The latency of converting the encrypted data to the convolution representation and rearranging the result polynomials is included. The latency from DRAM to the accelerator is included. The latency is obtained on the same CPU platform as our design.