1. Introduction
Deep neural networks (DNNs) have been widely used in various areas, such as image classification, object detection, and natural language processing, due to their strong predictive abilities [1]. However, DNNs require a large amount of complex computation, and users may have limited computing power or lack the expertise needed to deploy these models. As a result, many users turn to cloud services to deploy their models, giving rise to deep learning as a service (DLaaS), which allows users to access predictive services on a pay-as-you-go basis. While DLaaS offers great convenience, it also poses a potential security risk for sensitive data (e.g., medical and financial information) if cloud servers are not trusted [2,3].
We focus on using homomorphic encryption [4] to address the problem of data privacy protection in cloud services. Homomorphic encryption allows operations to be performed directly on encrypted data, yielding the same outcome as if the operations were performed on the plaintext. In our solution (shown in Figure 1), users encrypt their data and upload it to the cloud server. The server then performs inference on the encrypted data without learning the original information, protecting the user's data privacy [5]. Our solution does not reveal any data to other users in the cloud and incurs essentially zero communication overhead: the user simply uploads data to the cloud and receives the results. In contrast, protocols based on secure multi-party computation (MPC) require two servers to be online at the same time, which entails more communication overhead [6,7,8].
However, current mainstream homomorphic encryption schemes only support addition and multiplication, and thus cannot evaluate the nonlinear activation layers in neural networks. Dowlin et al. [9] proposed CryptoNets, the first convolutional neural network (CNN) to perform privacy-preserving inference on homomorphically encrypted ciphertexts. CryptoNets uses the square function instead of standard activation functions, but this results in some accuracy degradation. Subsequent work used polynomials obtained by approximating the standard activation function to achieve higher model accuracy, but this approach is only applicable to shallow models [10,11,12,13]. Recently, Lee et al. [14] used minimax approximate polynomials [13] to achieve an accurate approximation of the ReLU activation function, and implemented the first inference of the deep residual network ResNet-20 on encrypted data. However, their approximate polynomial degree is as high as 27, which leads to extremely high inference latency. Therefore, applying homomorphic encryption to deep neural networks while maintaining both high accuracy and low inference latency has become an urgent problem.
In this paper, our goal is to reduce the inference latency of DNNs on homomorphically encrypted data while maintaining their inference accuracy. To achieve this goal, we propose a low-degree Hermite deep neural network framework (called LHDNN). Unlike existing works that use fixed-coefficient polynomials to replace the ReLU function, LHDNN uses a set of low-degree trainable Hermite polynomials (called LotHps) as the activation layers in DNNs. The degree of LotHps is two, and this low-degree property ensures low inference latency on encrypted data. Specifically, LotHps is a linear combination of the first three terms of the Hermite polynomials, with three trainable weight parameters that are learned by the backpropagation algorithm during model training. Compared with fixed-coefficient polynomials, the parameterized activation function LotHps has stronger expressive power, which raises the upper limit of model accuracy. In addition, LHDNN combines a novel weight initialization and regularization module with LotHps to ensure more stable model training and a stronger generalization ability. To further improve the accuracy of LotHps-based models, we propose a variable-weighted difference training (VDT) strategy, which uses an existing ReLU-based model to guide the training of the LotHps-based model. Specifically, the difference between the activation layer outputs of the LotHps-based model and the ReLU-based model, as well as the difference between their final layer outputs, are added to the loss of the LotHps-based model, with a weight function p(t) that smoothly transitions between the two terms. This strategy enables the LotHps-based model to achieve higher accuracy in the early stages of training while preventing it from overfitting to the ReLU-based activation layer outputs, which would otherwise lower the final accuracy.
In summary, our contributions are as follows:
We propose a low-degree Hermite deep neural network framework (called LHDNN), which employs a set of low-degree trainable Hermite polynomials (referred to as LotHps) as activation layers in the DNNs. In addition, LHDNN integrates a novel weight initialization and regularization module with LotHps, ensuring a more stable training process and a stronger model generalization ability.
We propose a variable-weighted difference training (VDT) strategy that uses the original ReLU-based model to guide the training of the LotHps-based model, thereby improving the accuracy of the LotHps-based model.
Our extensive experiments on benchmark datasets MNIST, Skin-Cancer, and CIFAR-10 validated the superiority of LHDNN in inference speed and accuracy on encrypted data.
The rest of this paper is organized as follows. We discuss related work in Section 2. Background knowledge on homomorphic encryption is introduced in Section 3. In Section 4, we present the proposed LHDNN and the variable-weighted difference training (VDT) strategy. Relevant experiments are conducted in Section 5 and Section 6. Finally, we conclude the paper in Section 7.
2. Related Work
Solutions for applying homomorphic encryption to deep neural networks can be divided into two categories depending on the underlying homomorphic encryption scheme: schemes based on the learning with errors (LWE) problem and schemes based on the ring learning with errors (RLWE) problem.
With the first class of homomorphic encryption schemes, the nonlinear operations in the activation function can be implemented using a lookup table. While this approach can evaluate the activation function accurately and quickly, it does not support batched (SIMD) operations, leading to inefficiency in the other steps (e.g., matrix multiplication in the convolutional layers). FHE-DiNN [15] and TAPAS [16] use binarized weights and sparsification techniques to achieve faster computation on complex models, but their inference accuracy drops by about 3–6.2%, even on the small MNIST dataset. Lou and Jiang [17] implemented privacy-preserving inference for the ResNet-18 model on the CIFAR-10 dataset using a leveled version of the Torus homomorphic encryption (TFHE) scheme. Folkerts et al. [18] used a ternary neural network to optimize privacy-preserving inference based on TFHE; it is 1.7 to 2.7 orders of magnitude slower than plaintext inference, and its accuracy on the MNIST dataset is only 93.1%. DOREN [19] proposed a low-depth batched neuron that can evaluate multiple ReLU functions simultaneously without approximation; its amortized runtime is about 20 times faster than Lou and Jiang's approach. Meftah et al. [20] reexamined and improved the framework proposed in DOREN, achieving a 6–34 times speedup on some CNN architectures on the CIFAR-10 dataset.
The second type of homomorphic encryption scheme supports SIMD operation, i.e., packing multiple plaintexts into one ciphertext, which can significantly improve the efficiency of ciphertext operations; however, its inability to evaluate the nonlinear activation functions in neural networks is the biggest limitation of such solutions. Dowlin et al. [9] used the square activation function instead of standard ones to achieve ciphertext inference for a model with only two activation layers, achieving 98.95% accuracy on the MNIST dataset. Chabanne et al. [10] implemented a neural network with six nonlinear layers by using a Taylor expansion to approximate the Softplus activation function combined with a batch normalization (BN) layer, achieving 99.30% accuracy on the MNIST dataset, slightly lower than the 99.59% of the original ReLU-based model. Hesamifard et al. [12] approximated the derivative of the ReLU activation function with a degree-2 polynomial and then replaced the ReLU activation function with the degree-3 polynomial obtained through integration, further improving the accuracy on the MNIST dataset, but losing about 2.7% absolute accuracy when used in a deeper model on the CIFAR-10 dataset. Alsaedi et al. [21] approximated the ReLU function using Legendre polynomials and achieved a plaintext accuracy of 99.80% on the MNIST dataset, but did not evaluate their model on encrypted data. Yagyu et al. [22] improved model accuracy by pretraining the polynomial approximation coefficients of the Mish activation function; their ciphertext accuracy on the MNIST dataset was 0.01% higher than their plaintext accuracy, but their encrypted accuracy on CIFAR-10 was only 67.20%. Lee et al. [14] utilized minimax approximate polynomials to achieve the best activation function approximation and successfully implemented ResNet-20 on the RNS-CKKS homomorphic encryption scheme for the first time. Although their method achieved 92.43% ± 2.65% inference accuracy on the CIFAR-10 dataset, the degree of their polynomial is very high, which results in high inference latency because ciphertext multiplication is very expensive. In addition, a large number of bootstrapping operations are needed to refresh the ciphertext noise, which may cause decryption errors.
Although solutions based on the first type of homomorphic encryption have an advantage in ciphertext inference speed in the activation layers of DNNs, their lack of support for batch processing results in slower inference in the non-activation layers. The second type of solution can achieve fast inference in the non-activation layers, but currently has limited methods for handling the activation layers: low-degree polynomials only enable privacy-preserving inference in shallow networks, and applying them to deeper networks causes a significant decrease in model accuracy, whereas high-degree polynomials achieve high model accuracy at the cost of very high ciphertext inference latency. Therefore, efficient privacy-preserving inference of deeper DNNs with FHE solutions is an important open research problem. To address the limitations of current research, we propose a low-degree Hermite deep neural network framework (called LHDNN). LHDNN uses a set of low-degree trainable Hermite polynomials (referred to as LotHps) as activation layers in the DNN. LotHps contains three weight parameters that can be learned during model training through the backpropagation algorithm. By combining a novel weight initialization and regularization module with LotHps, we ensure a more stable training process and stronger model generalization. Furthermore, we propose a variable-weighted difference training (VDT) strategy that uses the original ReLU-based model to guide the training of the LotHps-based model, thereby improving its accuracy.
4. The Proposed Method
In this section, we introduce the proposed low-degree Hermite deep neural network (LHDNN), shown in Figure 2, which includes the LotHps activation layer and the weight initialization and regularization modules. In addition, we introduce the variable-weighted difference training (VDT) strategy.
4.1. Low-Degree Trainable Hermite Polynomials (LotHps) Activation Layer
As previously mentioned, the ciphertext produced by homomorphic encryption only supports addition and multiplication operations. Therefore, the standard ReLU activation function used in deep neural networks does not work properly in this context. To address this issue, we need to use a homomorphic-friendly polynomial as our activation function. Furthermore, performing a single multiplication on the ciphertext produced by homomorphic encryption is computationally expensive, so we want to minimize the number of multiplication operations. To achieve this, we need to design a low-degree polynomial activation function. In this section, we will discuss the important properties of Hermite orthogonal polynomials and how to use them as our activation layer.
Hermite polynomials: The Hermite orthogonal polynomials are defined as
$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}, \quad n = 0, 1, 2, \ldots$$
They have been widely used in various fields due to their many excellent properties [26]. Here, we only introduce the orthogonality property that we use. Specifically, for any two distinct non-negative integers $m$ and $n$, the Hermite polynomials $H_m(x)$ and $H_n(x)$ are orthogonal under the weight function $e^{-x^2}$, i.e.:
$$\int_{-\infty}^{+\infty} H_m(x) H_n(x)\, e^{-x^2}\, dx = 2^n n! \sqrt{\pi}\, \delta_{mn},$$
where $\delta_{mn} = 1$ when $m = n$, otherwise $\delta_{mn} = 0$. Additionally, the Hermite polynomials satisfy the three-term recurrence relation:
$$H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x), \qquad H_0(x) = 1, \quad H_1(x) = 2x.$$
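For concreteness, the orthogonality relation and the closed forms $H_0(x) = 1$, $H_1(x) = 2x$, $H_2(x) = 4x^2 - 2$ can be checked numerically with Gauss–Hermite quadrature; the following minimal sketch (ours, added for illustration) uses NumPy's physicists' Hermite utilities:

```python
# Numerical sanity check of Hermite orthogonality under the weight e^{-x^2}.
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval

def H(n, x):
    # Evaluate the physicists' Hermite polynomial H_n at the points x.
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    return hermval(x, coeffs)

# Gauss-Hermite nodes/weights: integrate p(x)*e^{-x^2} exactly for deg(p) <= 39.
nodes, weights = hermgauss(20)

for m in range(3):
    for n in range(3):
        integral = np.sum(weights * H(m, nodes) * H(n, nodes))
        expected = 2**n * math.factorial(n) * math.sqrt(math.pi) if m == n else 0.0
        assert np.isclose(integral, expected), (m, n, integral)
print("H_0, H_1, H_2 are pairwise orthogonal under e^{-x^2}")
```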
LotHps based on Hermite polynomials: Based on the orthogonality of the Hermite polynomials, we construct a low-degree trainable Hermite polynomial (LotHps) activation function. In order to keep the multiplication depth low, we only use the lowest-degree terms $H_0(x) = 1$, $H_1(x) = 2x$, and $H_2(x) = 4x^2 - 2$ of the Hermite polynomials. The proposed LotHps function can be expressed as:
$$f(x) = w_0 H_0(x) + w_1 H_1(x) + w_2 H_2(x),$$
where $w_0$, $w_1$, and $w_2$ are learnable parameters whose values are adjusted adaptively during neural network training. Specifically, during backward propagation, the gradients of the weights in the LotHps activation layer can be derived using the chain rule. Assuming that $\mathcal{L}$ represents the objective function, the gradients of the parameters in the LotHps activation layer are:
$$\frac{\partial \mathcal{L}}{\partial w_i} = \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial y_c}\, H_i(x_c), \qquad i = 0, 1, 2,$$
where $C$ represents the number of input channels, $x_c$ represents the input value of the $c$-th channel of the Hermite activation layer, $y_c = f(x_c)$ is the corresponding output, and $\partial \mathcal{L} / \partial y_c$ represents the gradient back-propagated from the deeper layers. With these gradients, we can update the values of $w_0$, $w_1$, and $w_2$ through optimization algorithms such as stochastic gradient descent [27] to minimize the loss function.
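As an illustration, the LotHps layer is straightforward to express in a deep learning framework, where automatic differentiation reproduces the gradients above. The following PyTorch sketch is ours; the class name, default initial weights, and per-layer weight sharing are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LotHps(nn.Module):
    """Degree-2 trainable Hermite activation: f(x) = w0*H0(x) + w1*H1(x) + w2*H2(x)."""
    def __init__(self, w_init=(0.5, 0.25, 0.05)):
        # In practice w_init comes from the ReLU approximation of Section 4.2;
        # the default values here are placeholders.
        super().__init__()
        self.w = nn.Parameter(torch.tensor(w_init, dtype=torch.float32))

    def forward(self, x):
        # H0(x) = 1, H1(x) = 2x, H2(x) = 4x^2 - 2; overall degree 2, so the
        # encrypted evaluation needs only one ciphertext-ciphertext multiplication.
        return self.w[0] + self.w[1] * 2.0 * x + self.w[2] * (4.0 * x * x - 2.0)

# Autograd yields dL/dw_i = sum_c (dL/dy_c) * H_i(x_c), matching the gradients above.
x = torch.randn(8, 16, requires_grad=True)
act = LotHps()
act(x).sum().backward()
print(act.w.grad)  # gradients w.r.t. (w0, w1, w2)
```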
As for why we use the Hermite polynomials instead of similar families such as the Legendre, Chebyshev, and Laguerre polynomials: only the Hermite polynomials are orthogonal over the interval $(-\infty, +\infty)$. This means that no matter how large the output value of the batch normalization (BN) layer in a DNN is, it always lies in the orthogonal interval of the Hermite polynomials. Although we could satisfy the orthogonality condition for other orthogonal polynomials by scaling the input values, this would undoubtedly introduce more weights to train [26].
4.2. Weight Initialization and Regularization Module
Weight initialization of LotHps: To reduce the uncertainty of random weight initialization for the LotHps activation layer, we propose a novel weight initialization method that makes the error in the initial stage of model training smaller. Specifically, we use the weight coefficients obtained by approximating the ReLU function as the initial weights of the LotHps activation layer, which provides a good starting point. The approximation method we propose is as follows:
Assume that $\{h_i(x)\}_{i=0}^{n}$ is a family of functions orthogonal with respect to a weight over the point set $\{x_j\}_{j=0}^{m}$. In our case, we use the family of Hermite orthogonal polynomials, where $h_i(x)$ refers specifically to the Hermite polynomial $H_i(x)$. The approximation function built from this family of orthogonal functions takes the form:
$$S(x) = \sum_{i=0}^{n} w_i h_i(x).$$
The conventional approximation method is to minimize the sum of squared errors, as shown in the following equation:
$$S^{*}(x) = \arg\min_{S(x)} \sum_{j=0}^{m} \left[ S(x_j) - f(x_j) \right]^2,$$
where $S^{*}(x)$ represents the best approximation polynomial, and $f(x_j)$ represents the sample points, in this case specifically points on the ReLU function.
The conventional method only provides the best fit to the original function values, which is effective for the forward propagation process of the neural network model. However, the gradient of the approximating function may differ substantially from the gradient of the original function, leading to a large error in the backward propagation process. To address this, we add the error between the two derivative functions to the objective. Additionally, since the output values of the batch normalization layer follow a normal distribution, they are mostly concentrated around 0. We therefore use a weight function $\rho(x)$ to better approximate the function values and derivative values around 0, resulting in the final approximation objective:
$$S^{*}(x) = \arg\min_{S(x)} \sum_{j=0}^{m} \rho(x_j) \left\{ \left[ S(x_j) - f(x_j) \right]^2 + \left[ S'(x_j) - f'(x_j) \right]^2 \right\},$$
where $S'(x)$ represents the derivative of the approximating function, $f'(x)$ represents the derivative of the approximated function, and $[-B, B]$ represents the approximation interval from which the sample points $x_j$ are drawn.
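This objective is a weighted linear least-squares problem in $(w_0, w_1, w_2)$ and can be solved in closed form. A minimal NumPy sketch follows; the Gaussian weight $\rho(x) = e^{-x^2}$, the sampling grid, and $B = 23.2$ are illustrative choices consistent with the text, not values prescribed here:

```python
import numpy as np

B = 23.2                                   # approximation interval [-B, B] (Section 5.3)
x = np.linspace(-B, B, 2001)               # sample points x_j
f = np.maximum(x, 0.0)                     # ReLU values f(x_j)
df = (x > 0).astype(float)                 # ReLU derivative f'(x_j)

# First three physicists' Hermite polynomials and their derivatives.
h = np.stack([np.ones_like(x), 2.0 * x, 4.0 * x**2 - 2.0], axis=1)
dh = np.stack([np.zeros_like(x), 2.0 * np.ones_like(x), 8.0 * x], axis=1)

rho = np.exp(-x**2)                        # weight function, concentrates near 0
sw = np.sqrt(rho)[:, None]                 # sqrt-weights for least squares

# Stack value residuals and derivative residuals into one linear system.
A = np.vstack([sw * h, sw * dh])
b = np.concatenate([sw[:, 0] * f, sw[:, 0] * df])
w_init, *_ = np.linalg.lstsq(A, b, rcond=None)
print("initial LotHps weights (w0, w1, w2):", w_init)
```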
Weight regularization of LotHps: During DNN model training, we found that significant changes in the magnitudes or signs of the LotHps activation layer weights caused large fluctuations in the model loss. To prevent weight instability during training, we build on the aforementioned weight initialization technique and propose a novel weight regularization module that also improves the generalization ability of the model.
Because each weight $w_i$ corresponds to a different Hermite polynomial $H_i$, the weights $w_0$, $w_1$, and $w_2$ are not on the same scale, so we first divide each $w_i$ by its initial value to obtain the dimensionless quantity $w_i / w_i^{(0)}$. Let $\mathbf{w} = (w_0, w_1, w_2)$ represent the weights of a LotHps activation layer during training, and $\mathbf{w}^{(0)} = (w_0^{(0)}, w_1^{(0)}, w_2^{(0)})$ represent its initial weights. We use the relative Euclidean distance between $\mathbf{w}$ and $\mathbf{w}^{(0)}$ as a regularization term to constrain $\mathbf{w}$. The relative Euclidean distance for a single LotHps activation layer is:
$$d = \sqrt{\sum_{i=0}^{2} \left( \frac{w_i - w_i^{(0)}}{w_i^{(0)}} \right)^{2}}.$$
The distances $d_k$ of all LotHps activation layers of the model are subsequently averaged and multiplied by a regularization parameter $\lambda$ that controls the strength of the regularization, to obtain the LotHps weight regularization term:
$$\mathcal{L}_{reg} = \frac{\lambda}{K} \sum_{k=1}^{K} d_k,$$
where $K$ denotes the number of LotHps activation layers and $d_k$ represents the relative Euclidean distance of the $k$-th LotHps activation layer.
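In code, this regularization term is only a few lines; the PyTorch sketch below (function and argument names are ours) would simply be added to the training loss:

```python
import torch

def lothps_regularizer(layer_weights, init_weights, lam=1e-3):
    """LotHps weight regularization: mean relative Euclidean distance, scaled by lambda.

    layer_weights, init_weights: lists with one (3,)-tensor per LotHps layer,
    holding (w0, w1, w2) and (w0_init, w1_init, w2_init), respectively.
    """
    dists = []
    for w, w0 in zip(layer_weights, init_weights):
        rel = (w - w0) / w0                     # dimensionless per-coefficient drift
        dists.append(torch.linalg.vector_norm(rel))
    return lam * torch.stack(dists).mean()

# Usage inside a training step (illustrative):
# loss = task_loss + lothps_regularizer([m.w for m in lothps_layers], init_ws)
```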
4.3. Variable-Weighted Difference Training (VDT) Strategy
The proposed LotHps activation layer shows good performance on shallower DNNs, but its accuracy degrades slightly when applied to deeper networks. Inspired by knowledge distillation techniques, we propose a variable-weighted difference training (VDT) strategy designed to reduce the difference between DNNs using LotHps functions (called LotHps-based models) and DNNs using ReLU functions (called ReLU-based models), thereby improving the accuracy of LotHps-based models.
In Figure 3, we illustrate the VDT strategy, which leverages the original ReLU-based model as a teacher to supervise the training of the LotHps-based model. The proposed VDT strategy consists of two loss terms: the first corresponds to the activation loss between the LotHps-based model and the ReLU-based model, while the second comprises the output loss between the two models and the cross-entropy between the LotHps-based model's output and the true label. Notably, we utilize a weight function, denoted $p(t)$ in Equation (16), to achieve a smooth transition between the two loss terms.
Activation Loss: To quantify the activation loss between the two models, we utilize the Kullback–Leibler (KL) divergence. Specifically, we assume that the activation layer output distributions of the LotHps-based and ReLU-based models are represented as $P$ and $Q$, respectively. The KL divergence of $P$ and $Q$ can be expressed as:
$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}.$$
Let $N$ be the number of activation layers, and $D_{KL}(P_i \,\|\, Q_i)$ be the KL divergence of the $i$-th activation layer. Then, the first loss term is:
$$\mathcal{L}_1 = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(P_i \,\|\, Q_i),$$
where $P_i$ and $Q_i$ are the $i$-th activation layer output distributions of the LotHps-based model and the ReLU-based model, respectively.
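A minimal PyTorch sketch of this loss term follows; normalizing each activation map into a distribution with softmax is our assumption, since the normalization is not pinned down above:

```python
import torch
import torch.nn.functional as F

def activation_loss(student_acts, teacher_acts):
    """L1: mean KL(P_i || Q_i) over the N activation layers.

    student_acts: activation outputs of the LotHps-based model (P_i).
    teacher_acts: activation outputs of the ReLU-based model (Q_i).
    """
    losses = []
    for s, t in zip(student_acts, teacher_acts):
        p = F.softmax(s.flatten(1), dim=1)          # P_i
        log_q = F.log_softmax(t.flatten(1), dim=1)  # log Q_i
        # F.kl_div(input=log_q, target=p) computes KL(p || q).
        losses.append(F.kl_div(log_q, p, reduction="batchmean"))
    return torch.stack(losses).mean()
```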
Output Loss: For the loss term on the final output distributions of the two models, we utilize the approach suggested in [28], a response-based knowledge distillation method with the soft-target technique:
$$q_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)},$$
where $q_i$ is the soft-target version of the class-$i$ prediction, $T$ represents the temperature, and $z_i$ is the logit of the original prediction. The second loss term is obtained by combining the output distribution difference and the hard-label loss as follows:
$$\mathcal{L}_2 = \alpha \cdot CE\!\left(q^{R}, q^{L}\right) + \beta \cdot CE\!\left(y, q^{L}\right),$$
where $q^{L}$ and $q^{R}$ represent the soft-target outputs of the LotHps-based and ReLU-based models, respectively, $y$ represents the true label, and $CE$ represents the cross-entropy function. $\alpha$ and $\beta$ are two hyperparameters that control the relative magnitudes of the two cross-entropy losses.
Loss Smooth Transition: We do not simply combine the two aforementioned loss terms with fixed arithmetic weights. Instead, we use a smooth transition function $p(t)$ as a dynamic weighting function to achieve a smooth transition from $\mathcal{L}_1$ to $\mathcal{L}_2$. In this way, during the initial training phase, the main goal of the LotHps-based model is to reduce the difference between its activation output distributions and those of the ReLU-based model; in the later stages of training, the LotHps-based model adjusts itself according to the final learning objective. The smooth transition function $p(t)$ is expressed as:
$$p(t) = \frac{1}{2}\left[ 1 + \cos\left( \pi \cdot \frac{t - t_0}{t_1 - t_0} \right) \right], \qquad t_0 \le t \le t_1,$$
where $t_0$ represents the initial moment of training, $t_1$ is an adjustable parameter representing the complete transition moment (with $p(t) = 0$ for $t > t_1$), and the midpoint $(t_0 + t_1)/2$ represents the moment when the two loss terms have equal priority. Ultimately, the total loss of the LotHps-based model can be expressed as:
$$\mathcal{L} = p(t)\, \mathcal{L}_1 + \left( 1 - p(t) \right) \mathcal{L}_2 + \mathcal{L}_{reg}.$$
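Putting the pieces together, a sketch of the total VDT loss is shown below. The cosine ramp mirrors the transition function above; the hyperparameter defaults and function names are illustrative, not prescribed values:

```python
import math
import torch
import torch.nn.functional as F

def transition_weight(t, t0, t1):
    # p(t): 1 at t0 (favor activation loss L1), 1/2 at the midpoint,
    # 0 from t1 onward (favor output loss L2).
    u = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * u))

def vdt_loss(student_logits, teacher_logits, labels, act_loss,
             t, t0, t1, alpha=0.7, beta=0.3, T=4.0):
    # L2: soft-target cross-entropy against the teacher plus hard-label loss.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_ce = -(soft_teacher * log_soft_student).sum(dim=1).mean()
    hard_ce = F.cross_entropy(student_logits, labels)
    out_loss = alpha * soft_ce + beta * hard_ce

    p = transition_weight(t, t0, t1)
    return p * act_loss + (1.0 - p) * out_loss  # + lothps_regularizer(...)
```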
5. Implementation Details
The implementation is divided into two parts: model training on unencrypted data, and privacy-preserving inference on encrypted data. We built three models on three different datasets. The first is a network with five nonlinear activation layers (named CNN-6) for the MNIST dataset. The second is an AlexNet model for the Skin-Cancer dataset. The third is a ResNet-20 model for the CIFAR-10 dataset. Please see Table 1 for more details on the models. In this section, we present the datasets and models we used, as well as the security parameters and inference optimization methods.
5.1. Datasets
MNIST [29]: The MNIST dataset consists of single-channel images of 10 handwritten Arabic numerals, each with a size of 28 × 28 pixels. It includes 60,000 images in the training set and 10,000 images in the test set, for a total of 70,000 images.
Skin-Cancer [30]: The Skin-Cancer dataset consists of medical images of different types of skin cancer, with a total of 10,015 images belonging to seven categories. We resized all images to 32 × 32 pixels and divided the dataset into a training set and a test set in an 8:2 ratio. Because the data is severely imbalanced, we performed data augmentation and resampling operations on the training data.
CIFAR-10 [31]: The CIFAR-10 dataset consists of color images of 10 different object classes, with a total of 60,000 images, each with a size of 32 × 32 pixels. It includes 50,000 images in the training set and 10,000 images in the test set. The training set is augmented by random rotation and random cropping.
5.2. Model Architecture
For the MNIST dataset, we built a network (named CNN-6) containing four convolutional layers and two fully connected layers. The exact arrangement of the network layers is shown in Table 1, where C represents a convolutional layer, B a batch normalization layer, A an activation layer, P an average pooling layer, and F a fully connected layer. For the Skin-Cancer dataset, we modified the standard AlexNet network [32] to accommodate the size of the input images, and replaced the max pooling layers with homomorphism-friendly average pooling layers. For the CIFAR-10 dataset, we used the standard ResNet-20 network [33].
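To make the layer notation concrete, the following PyTorch sketch shows a CNN-6-style network with the C-B-A-P ordering; the channel widths are placeholders of ours (the exact configuration is in Table 1), and the `act` argument lets ReLU be swapped for LotHps:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, act):
    # C-B-A-P: convolution, batch norm, activation, homomorphism-friendly avg pool.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out),
                         act(),
                         nn.AvgPool2d(2))

class CNN6(nn.Module):
    """Four conv layers + two FC layers; five activation layers in total."""
    def __init__(self, act=nn.ReLU, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32, act), conv_block(32, 64, act),
            conv_block(64, 128, act), conv_block(128, 128, act))
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128, 128), act(),
            nn.Linear(128, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

print(CNN6()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```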
5.3. Approximation Interval of Weight Initialization
Because the intermediate-layer outputs of different models differ, the input values of the activation layers also differ. If we choose too small an interval for weight initialization, the larger input values, which grow rapidly after activation, can easily lead to gradient explosion. At the same time, if the interval is too large, the LotHps function approximates the ReLU function poorly, resulting in a large initial training loss for the LotHps-based model. Therefore, choosing a good initialization interval is especially important, and we use the maximum absolute value of each activation layer's input as the parameter $B$ of the approximation interval $[-B, B]$. When training the original ReLU-based models, we measured this value as 23.2, 32.8, and 40.3 for the three models, so when training the LotHps-based models we used these values as the parameter $B$.
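One way to measure this value is with forward hooks on the ReLU-based model; the sketch below (names and hook placement are ours) records the maximum absolute input seen by each activation layer:

```python
import torch
import torch.nn as nn

def track_activation_input_ranges(model):
    """Attach hooks that record max |input| per ReLU layer of the teacher model."""
    max_abs = {}
    def make_hook(name):
        def hook(module, inputs, output):
            val = inputs[0].detach().abs().max().item()
            max_abs[name] = max(max_abs.get(name, 0.0), val)
        return hook
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            module.register_forward_hook(make_hook(name))
    return max_abs

# Usage: attach hooks, run the training data through the ReLU-based model,
# then set B = max(max_abs.values()) (e.g., 23.2 / 32.8 / 40.3 for our models).
```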
5.4. Security Parameter Setting
As with other encryption schemes, the CKKS homomorphic encryption scheme requires its parameters to be set so that known attacks are computationally infeasible. We chose different configurations for the different models, and all configurations satisfy 128-bit security, which means that an adversary would need to perform at least $2^{128}$ basic operations to break the scheme. The first configuration sets the integer and fractional precision to 10 and 50 bits, respectively, with a multiplication depth of 14. The second configuration sets the integer and fractional precision to 10 and 39 bits, respectively, with a multiplication depth of 20. The third configuration sets the integer and fractional precision to 12 and 48 bits, respectively, with a multiplication depth of 9. In addition, the third configuration uses bootstrapping: the degree of the polynomial approximating the modular reduction function is set to 14, and the maximum modulus length is 1332 bits, which meets 128-bit security, whereas the parameters of Lee et al. [14] only meet 111.6-bit security. The polynomial degree $N$ and the remaining parameters for each configuration are listed in Table 2.
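For readers unfamiliar with CKKS parameter selection, the sketch below shows how such a configuration is expressed in TenSEAL, a Python wrapper over Microsoft SEAL; the ring dimension, moduli chain, and LotHps weights here are generic illustrative values, not the exact parameters of Table 2:

```python
import tenseal as ts

# CKKS context: ring dimension N and a coefficient-modulus chain whose
# intermediate primes set the scale (fractional precision) per multiplication.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=16384,                         # N
    coeff_mod_bit_sizes=[60, 50, 50, 50, 50, 50, 60],  # depth-5 chain
)
context.global_scale = 2**50      # ~50 bits of fractional precision
context.generate_galois_keys()    # rotation keys for SIMD slot manipulation

# Evaluating a degree-2 LotHps activation homomorphically: in the monomial
# basis, f(x) = (w0 - 2*w2) + (2*w1)*x + (4*w2)*x^2.
w0, w1, w2 = 0.5, 0.25, 0.05      # placeholder trained weights
enc = ts.ckks_vector(context, [0.5, -1.25, 3.0])
out = enc.polyval([w0 - 2 * w2, 2 * w1, 4 * w2])
print(out.decrypt())              # approximate plaintext result
```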
5.5. Inference Optimization
When performing model inference on ciphertexts, ciphertext packing techniques and special convolution methods are required to reduce the complexity of the inference process. Aharoni et al. [34] proposed a data structure called the tile tensor together with an interleaved packing method that effectively reduces the latency and memory consumption of ciphertext inference, and that easily adapts the output of one convolutional layer as the input of the next. A tile tensor packs tensors into fixed-size blocks according to the requirements of homomorphic encryption and allows them to be manipulated like regular tensors [34]. We use this approach; the tile shapes and packing for the three models are shown in Table 3.
Given an image input $I \in \mathbb{R}^{W \times H \times C \times B}$, where the dimensions represent the width and height, the number of channels, and the batch size of the images, and a convolution kernel $K \in \mathbb{R}^{W_k \times H_k \times C \times F}$, where the dimensions represent the width and height of the kernel, the number of input channels, and the number of output channels (i.e., the number of convolution kernels), respectively, the two are multiplied to obtain the output $O \in \mathbb{R}^{W' \times H' \times F \times B}$. For a particular block shape and packing method, the dimensions C-W-H-F-B can be decomposed and rearranged into tile blocks of shape $[t_C, t_W, t_H, t_F, t_B]$, where the product $t_C \cdot t_W \cdot t_H \cdot t_F \cdot t_B$ equals the number of slots $s$ in a single ciphertext.
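As a toy illustration of this packing constraint (not the tile-tensor algorithm itself), the snippet below enumerates power-of-two tile shapes whose slot product matches the ciphertext size; the dimension bounds are illustrative, and the actual tile shapes for our three models are those in Table 3:

```python
from itertools import product

def valid_tile_shapes(slots, dim_bounds):
    """Yield power-of-two tile shapes (tC, tW, tH, tF, tB) with product == slots."""
    choices = []
    for bound in dim_bounds:
        choices.append([d for d in (1, 2, 4, 8, 16, 32, 64, 128) if d <= bound])
    for shape in product(*choices):
        prod = 1
        for t in shape:
            prod *= t
        if prod == slots:
            yield shape

# e.g., 2^14 slots for tensors with dims (C, W, H, F, B) = (64, 32, 32, 64, 16)
for shape in list(valid_tile_shapes(2**14, (64, 32, 32, 64, 16)))[:5]:
    print(shape)
```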