Image Perturbation-Based Deep Learning for Face Recognition Utilizing Discrete Cosine Transform

Abstract: Face recognition, including emotion classification and face attribute classification, has seen tremendous progress during the last decade owing to the use of deep learning. Large-scale data collected from numerous users have been the driving force in this growth. However, face images that reveal the identities of their owners can cause severe privacy leakage if linked to other sensitive biometric information. The novel discrete cosine transform (DCT) coefficient cutting method (DCC) proposed in this study combines DCT and pixelization to protect the privacy of the image. However, privacy is subjective, and it is not guaranteed that the transformed image will preserve privacy. To address this, a user study was conducted on whether DCC really preserves privacy. Convolutional neural networks were then trained on DCC-transformed images for face recognition and face attribute classification tasks. Our survey and experiments demonstrate that a face recognition deep learning model can be trained with images that most people think preserve privacy at a manageable cost in classification accuracy.


Introduction
Face recognition has long been one of the best-known topics in computer vision. The face is one of the most popular biometrics; as such, face recognition has become an essential tool in our daily lives [1]. Along with the development of deep learning, face recognition has achieved human-like performance. Deep learning uses the backpropagation algorithm to learn internal parameters and compute the representation in each layer [2]. Large-scale data collected from numerous users have contributed to the rapid development of deep learning.
However, face data contain the identities of individuals, which can be readily linked to other sensitive personal information, such as health data, causing severe privacy leakage. To make matters worse, deep learning models are often trained on images without the approval of the person observed in the image [3]. For example, face images in large-scale training datasets, such as social face classification (SFC) [4] and WIDER FACE [5], are collected from social networking services or search engines without explicit consent, which could violate privacy. In addition, more information than just the person's identity can be inferred from the feature representations of face recognition [6,7]. Extracting sensitive information, such as gender, ethnicity, and health status, without consent is therefore considered a violation of privacy [8]. For this reason, preserving the privacy of face data in deep learning tasks is indispensable in preventing privacy leakage.
There have been numerous studies on preserving privacy in deep learning. Cryptography-based deep learning protects privacy-sensitive information by encrypting sensitive content without compromising model accuracy, but it has high computational complexity. Federated learning [9] is designed to train neural networks locally on each client's data. Federated learning provides an advantage in privacy over centralized models because the aggregation server only sees trained models. However, cryptographic methods and federated learning require a trusted server; otherwise, the attacker can decrypt the ciphertext or restore the original data from gradients [10]. Therefore, in this paper, the focus is on image perturbation-based privacy-preserving methods, which do not require the trust of all parties. Image perturbation methods can be performed during the image distribution phase to transform the image so that the eye cannot recognize the original image. Image pixelization [11], also called mosaicing, can be achieved by dividing the image into a rectangular grid and averaging the pixels within each grid cell. Blurring [12] removes the image details by convolving the image with a filter function, such as a Gaussian filter or a bilateral filter.
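The pixelization described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a 2-D NumPy array; the function name `pixelize` and the `cell` parameter are illustrative choices, not names taken from [11].

```python
import numpy as np

def pixelize(image: np.ndarray, cell: int) -> np.ndarray:
    """Replace every cell x cell tile of a 2-D image with the tile's mean value."""
    height, width = image.shape
    out = image.astype(float).copy()
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            tile = out[top:top + cell, left:left + cell]  # view into `out`
            tile[...] = tile.mean()                        # average the grid cell
    return out

# Toy 4 x 4 "image" with values 0..15, pixelized with 2 x 2 cells.
image = np.arange(16, dtype=float).reshape(4, 4)
mosaic = pixelize(image, cell=2)
print(mosaic)
```

Each output tile carries a single value, so detail inside a cell is lost while the overall brightness of the image is unchanged.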
In this paper, a novel image perturbation method is proposed: the discrete cosine transform coefficient cutting method (DCC). Our approach is based on pixelization and the discrete cosine transform (DCT) [13]. The DCT expresses a finite sequence of data points as a sum of cosine functions, transforming the image into a DCT coefficient matrix. Most DCT coefficients have a value near zero, and only a few have a relatively large value. The main idea of DCC is that the larger values are more significant in forming an image; thus, cutting the smaller values conceals image detail and thereby protects privacy.
Image perturbation-based privacy-preserving methods, including our method, vary depending on the level of obfuscation. The notion of privacy is subjective, so one person may consider a transformed image to have preserved privacy, whereas another may not. Figure 1 displays two examples of the proposed DCC. There would be unanimous agreement that Figure 1a is a privacy-preserving transformation, whereas opinions with respect to Figure 1b are expected to be more subjective. To overcome this problem, a survey was conducted to determine whether the privacy of the transformed image was preserved. To the best of our knowledge, few studies have conducted surveys on whether transformed images preserve privacy. The original image and the DCC-transformed image were presented to the participants to determine whether the two images were perceived as having the same identity. The inability to determine whether the two images show the same person means that the privacy of the face image has been preserved. Then, face recognition and face attribute classification tasks were conducted. The neural network was trained with DCC-transformed images and tested on the original images. The accuracy dropped by 3-12%, depending on the task, when the network was trained on DCC images with a = b = 4 and r = 64, a setting that most survey participants judged to protect privacy.
Our main contributions are as follows:
1. A privacy-preserving image perturbation method, DCC, was proposed based on pixelization and the discrete cosine transform.
2. A survey was conducted on whether the proposed method really preserved privacy.
3. Neural networks were trained on face recognition and face attribute classification tasks with obfuscated images, achieving satisfactory accuracy and making the method suitable for real-world applications.

Related Works

Convolutional Neural Network
Deep learning processes language, images, audio, and video data mainly using convolutional neural networks (CNNs) [2]. A CNN automatically extracts features that distinguish objects from one another, inspired by the classical notion of neurons communicating with other cells via synapses. CNNs are applicable in many domains, such as speech recognition [14], object detection [15], and face recognition [16], and have advanced with the recent development of large datasets, new learning algorithms, and new architectures.
The architecture of a CNN includes several bundles of convolution layers and pooling layers, as well as a few fully connected layers. A convolution layer extracts features via the element-wise product between the kernel and the input. The output of the convolution, called a feature map, is passed through a nonlinear activation function, such as a sigmoid or tanh function, which is a mathematical representation of the behavior of a biological neuron. A pooling layer downsamples the output to decrease the number of learning parameters. The feature maps of the final layer are typically flattened to a one-dimensional array and connected to fully connected layers. The last activation function of the fully connected layer depends on the task of the CNN. In a classification task, the output of interest is the vector of class probabilities, where each score ranges between 0 and 1, and all scores sum to 1.
Training minimizes the loss function through gradient descent and the backpropagation algorithm. The purpose of training is to minimize the difference between the output of the network and the given ground truth labels.
AlexNet [17] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 on the ImageNet dataset [18]. The authors used the ReLU activation layer [19] to accelerate training and improve network performance. VGGNet [20] uses 3 × 3 convolution filters, which push the depth to 16-19 layers, thereby improving the classification accuracy on ImageNet. ResNet [21] introduced the skip connection, allowing training with 152 layers while having a lower complexity than VGGNet. As a result, ResNet attained a 3.57% error rate on ImageNet, which surpasses human-level performance.
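The class scores described above, each between 0 and 1 and summing to 1, are commonly produced by a softmax function. The text does not name the function used, so softmax is an assumption here, and the logit values are arbitrary illustrative numbers.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map raw class scores to probabilities that lie in (0, 1) and sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Three hypothetical class logits from the last fully connected layer.
scores = softmax(np.array([2.0, 1.0, 0.1]))
predicted_class = int(np.argmax(scores))
print(scores, predicted_class)
```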

Face Recognition Deep Learning Needs Privacy Preservation
Human faces are often used as training material for deep learning. Initially, faces in images or videos are detected, and their location is determined. After the detected faces are annotated, a deep learning model is trained for face recognition or face analysis.
Face recognition involves identifying or verifying a human in an image. DeepFace [4] trained a network with more than 120 million parameters on four million facial images of more than 4000 unique identities. An accuracy of 97.35% was attained on the Labeled Faces in the Wild (LFW) dataset [22], approaching human-level performance in face verification tasks. FaceNet [23] used 100 to 200 million faces with 8 million different identities and achieved an accuracy of 99.63% on face verification tasks using a CNN trained to directly optimize the embedding itself.
Face analysis recognizes valuable information, such as emotion, gender, and age, in images and is utilized for face attribute classification, age estimation, or face mask detection. DTAGN [24] boosts facial expression recognition performance by combining two deep networks: one extracts appearance-related features, and the other extracts geometric features. In [25], a real-time monitoring architecture was proposed to identify face masks using MobileNet V2. In [26], a multitask CNN-based architecture was presented to conduct face analysis tasks concurrently.
Despite the usefulness of face recognition and face analysis, privacy violation issues have been raised. In [27], it is argued that biometric data can be used to identify a person easily, so in certain cases, malicious leakage can lead to crimes such as identity theft or tracking of individuals. Therefore, a privacy-preserving mechanism is essential when using biometric data, such as face images.

Privacy-Preserving Deep Learning
Previous studies have addressed privacy preservation in the field of deep learning. In [28], a secure face verification system was proposed based on a CNN representation with the Paillier algorithm, saving all the feature representations in ciphertext so that the client would know only the verification result, ensuring privacy. In [29], a novel system was suggested that utilizes additive homomorphic encryption to protect the gradient. However, cryptography-based methodologies incur a high computational cost.
In [30], a new deep learning algorithm was developed to train a centralized CNN with differential privacy [31], which resulted in decreased accuracy. However, such a centralized CNN needs an honest server because all of the data are stored on the central server. Federated learning has been known to protect privacy, as the central server can only see the local training results while data remain local. Recent studies [10,32] have reconstructed the victim's private data by assuming a malicious server in the federated learning environment. Therefore, federated learning requires a trustworthy server.
Pix [33] extends differential privacy to image data using image pixelization methods; it was demonstrated that Pix can prevent re-identification attacks. PEEP [3] perturbs eigenfaces by utilizing differential privacy to recognize faces. The third-party server only sees the controlled information, which preserves privacy. Image perturbation methods, such as Pix and PEEP, do not require trusting third parties because the transformation can be applied at the image distribution stage.

Methods
This section outlines the entire process of our DCT coefficient cutting method (DCC): Step 1 formally describes the discrete cosine transform (DCT) and its blockwise application, which acts like pixelization; Step 2 provides the details of the coefficient cutting method; and Step 3 describes the inverse DCT and presents the results of DCC applied to facial images.

(Step 1) Discrete Cosine Transform (DCT)
The discrete cosine transform (DCT) was first proposed by Ahmed [13] in 1972. DCT transforms a signal or image from the spatial domain to the frequency domain, and inverse-DCT does the reverse. One-dimensional DCT (1D-DCT) is used in signal processing [34], and two-dimensional DCT (2D-DCT) is used in image processing [35]. In this study, 2D-DCT was used to transform images. For convenience, 2D-DCT is referred to as DCT in the remainder of this paper. DCT can be applied to both gray and color images. For a color image, DCT is performed on each RGB channel. Let V be the frequency domain, and X be the spatial domain (image). In its standard orthonormal form, the DCT of an M × N matrix X is defined as:

$$V_{pq} = \alpha_p \alpha_q \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} X_{mn} \cos\frac{\pi (2m+1) p}{2M} \cos\frac{\pi (2n+1) q}{2N}, \quad 0 \le p \le M-1,\ 0 \le q \le N-1 \tag{1}$$

where

$$\alpha_p = \begin{cases} 1/\sqrt{M}, & p = 0 \\ \sqrt{2/M}, & 1 \le p \le M-1 \end{cases} \qquad \alpha_q = \begin{cases} 1/\sqrt{N}, & q = 0 \\ \sqrt{2/N}, & 1 \le q \le N-1 \end{cases} \tag{2}$$

DCT transforms the image of Figure 2a to the frequency domain, generating the DCT coefficient matrix of Figure 2b. In the image resulting from the transformation, the white pixels are concentrated at the top left. The whiter the pixel, the larger the DCT coefficient, and the blacker the pixel, the smaller the DCT coefficient. Note that the DCT coefficient values are shown as absolute values. The larger DCT values associated with the lower frequencies represent an essential part of the original image in the transformation back to the spatial domain. This is because the human eye tends to sense the low-frequency components in the picture better. In summary, the most visually important information is concentrated in only a few coefficients of the DCT at the top left.

In this study, the image is split into a × b blocks. As shown in Figure 3b, the DCT is performed blockwise, so it works like a pixelization method.
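Step 1 can be sketched as follows. This is a minimal illustration, assuming the standard orthonormal DCT-II normalization, a single-channel image, and image dimensions divisible by the block counts; taking a as the number of block rows and b as the number of block columns is also an assumption, as are the helper names `dct_matrix` and `blockwise_dct`.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: entry [p, m] = alpha_p * cos(pi * (2m + 1) * p / (2n))."""
    m = np.arange(n)
    basis = np.cos(np.pi * (2 * m + 1) * m.reshape(-1, 1) / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)   # alpha_0
    basis[1:] *= np.sqrt(2.0 / n)  # alpha_p for p >= 1
    return basis

def blockwise_dct(image: np.ndarray, a: int, b: int) -> np.ndarray:
    """Split a 2-D image into an a x b grid of blocks and apply the 2-D DCT to each block."""
    height, width = image.shape
    if height % a or width % b:
        raise ValueError("this sketch assumes dimensions divisible by a and b")
    bh, bw = height // a, width // b
    row_basis, col_basis = dct_matrix(bh), dct_matrix(bw)
    coeffs = np.empty((height, width), dtype=float)
    for i in range(a):
        for j in range(b):
            rows = slice(i * bh, (i + 1) * bh)
            cols = slice(j * bw, (j + 1) * bw)
            # Separable 2-D DCT: transform the rows, then the columns.
            coeffs[rows, cols] = row_basis @ image[rows, cols] @ col_basis.T
    return coeffs

rng = np.random.default_rng(0)
image = rng.random((8, 8))
coeffs = blockwise_dct(image, a=2, b=2)
```

With this normalization, the top-left coefficient of each block equals the block sum divided by the square root of the block area, which is why a block reduced to that one coefficient behaves like a pixelized cell.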

(Step 2) Coefficient Cutting (CUT)
As described in Section 3.1, the information that matters most in forming an image is concentrated in a few low-frequency coefficients. The main idea of coefficient cutting is that even if most of the high-frequency content is omitted, the main features of the image remain intact while the sensitive information is concealed. The largest DCT coefficient of each block was selected and stored in the DCC coefficient matrix, as shown in Figure 4b, to maintain at least one DCT coefficient for each block. In addition to these a × b selected coefficients, the top (r − a × b) DCT coefficients were selected over the whole image, not per block, and stored in the DCC coefficient matrix, as shown in Figure 4c. Then, the remaining coefficients were discarded. The value r is the number of remaining DCT coefficients, which controls the privacy intensity and cannot be less than a × b. The larger the r value, the lower the privacy intensity. The coefficient cutting method filters the DCT coefficient matrix V in Figure 4a, and the DCC coefficient matrix V* of Figure 4c is produced as a result.
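The selection rule above can be sketched as a masking operation on the blockwise coefficient matrix. The sketch assumes that "largest" refers to the absolute value of a coefficient and that block dimensions divide the matrix evenly; the function name `cut_coefficients` is illustrative. The example reuses the setting of Figure 4 (a = b = 2, an 8 × 8 matrix, r = 12) with random stand-in coefficients.

```python
import numpy as np

def cut_coefficients(coeffs: np.ndarray, a: int, b: int, r: int) -> np.ndarray:
    """Keep the largest-magnitude coefficient of each block plus the top
    (r - a*b) remaining coefficients over the whole matrix; zero the rest."""
    if r < a * b:
        raise ValueError("r cannot be less than a * b")
    height, width = coeffs.shape
    bh, bw = height // a, width // b
    magnitude = np.abs(coeffs)
    keep = np.zeros(coeffs.shape, dtype=bool)

    # One guaranteed coefficient per block.
    for i in range(a):
        for j in range(b):
            block = magnitude[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            p, q = np.unravel_index(np.argmax(block), block.shape)
            keep[i * bh + p, j * bw + q] = True

    # Remaining budget is spent globally, ignoring block boundaries.
    extra = min(r, coeffs.size) - a * b
    if extra > 0:
        rest = np.where(keep, -np.inf, magnitude).ravel()
        keep.flat[np.argpartition(rest, -extra)[-extra:]] = True

    return np.where(keep, coeffs, 0.0)

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(8, 8))  # stand-in for a blockwise DCT coefficient matrix
cut = cut_coefficients(coeffs, a=2, b=2, r=12)
```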

(Step 3) Inverse Discrete Cosine Transform (I-DCT)
In Section 3.2, the DCT coefficients are cut to produce the DCC coefficient matrix V*, which is still in the frequency domain. The inverse discrete cosine transform (I-DCT) changes the frequency domain into the spatial domain. In its standard orthonormal form, I-DCT is defined as follows:

$$X^*_{mn} = \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} \alpha_p \alpha_q V^*_{pq} \cos\frac{\pi (2m+1) p}{2M} \cos\frac{\pi (2n+1) q}{2N}, \quad 0 \le m \le M-1,\ 0 \le n \le N-1 \tag{3}$$

Note that α_p and α_q are the same as in Equation (2).

However, the extent of cutting needed to preserve privacy cannot be determined in advance. Does the image in Figure 5b preserve privacy? The answer to that question is affirmative. However, does the image in Figure 5g preserve privacy? The response is more ambivalent. Therefore, a survey was conducted to understand people's perception of privacy, as discussed in Section 4.

User Study
This section summarizes the results of the study on people's perception of privacy. As discussed earlier, the notion of privacy is subjective. For example, one person may think that a modified image still contains sensitive private content, whereas another may think that the private content has been successfully eliminated. The proper degree of DCT coefficient pruning required to protect privacy is unclear, so a user study was conducted to explore this issue. For convenience, it is assumed that a = b = 4 for the rest of the paper, without any further reference to this notation.
As shown in Figure 6, the survey consists of questions about pairs of images. Celebrities with a cultural background similar to that of the participants were chosen for the images. Each question refers to two celebrity images of the same or different identities. The image size is 178 × 218 pixels, and the face is at the center. The first image is the original facial image without any manipulation. The second image was transformed by DCC. Each question consisted of sub-questions that paired the same first image with a second image at different privacy-preservation levels. Each sub-question asked whether the two images were of the "same person", "different person", or "cannot judge". The first sub-question was on the DCC (r = 16) image. The r of DCC was doubled for each subsequent sub-question, which means that the privacy level was lowered. Figure 6a corresponds to the third sub-question, and Figure 6b to the seventh sub-question, which is the lowest privacy level in our survey. To prevent cheating, a respondent could see the next sub-question only after answering the current one. The privacy level of the images was weakened gradually to judge the predictability of the answer. The main idea of the survey is that failure to determine whether the pair of images shows the same person implies that the privacy of the face image has been preserved.
The study had 69 participants and used four pairs of celebrity images. Each question consisted of seven sub-questions on DCC-transformed images of varying r, so a total of 28 questions were asked. Figure 7 shows that over 96% of people could not judge the identities in the DCC images ranging from r = 16 to r = 64. This result means that if r is 64 or less, the privacy of the face image is largely preserved. The participants started to identify the faces correctly from r = 128 onwards. The number of correct answers exceeded the number of undecided answers for r = 256. All participants in the experiment expressed an opinion for r = 1024, and most of the participants (97%) answered correctly. This result means that for r of 1024 or greater, the privacy of the face image is hardly preserved. DCC with r = 16, which is equivalent to pixelization, preserved privacy most strongly because the information in each block is compressed into one number. In other words, the pixelated image had only 16 pixels. However, pixelated images are inappropriate for deep learning tasks because of the lack of pixel information. Section 5 presents the results of a deep learning experiment conducted to evaluate whether face attributes can be classified from images transformed by our method.

Experiment and Results
In this section, two experiments are discussed: face recognition and face attribute classification. As illustrated in Figure 8, an SGD optimizer was used with a learning rate of 0.001, decay of 0.001, and momentum of 0.9; for each DCC setting, the face recognition and face attribute classification models were trained for 50 epochs. As shown in Figure 9, r began at 16 and doubled at each step. The results were evaluated for accuracy on validation data.
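For readers unfamiliar with these optimizer settings, the update they imply can be sketched as below. The text does not specify how decay is applied; the sketch assumes time-based decay of the learning rate, lr_t = lr / (1 + decay · t), which is one common convention, and uses a toy quadratic loss rather than the paper's network. The function name `sgd_momentum_step` is illustrative.

```python
import numpy as np

def sgd_momentum_step(weights, velocity, grad, step,
                      lr=0.001, decay=0.001, momentum=0.9):
    """One SGD update with momentum and (assumed) time-based learning-rate decay."""
    lr_t = lr / (1.0 + decay * step)             # learning rate shrinks with the step count
    velocity = momentum * velocity - lr_t * grad
    return weights + velocity, velocity

# Toy problem: minimize 0.5 * ||w||^2, whose gradient is simply w.
weights = np.array([1.0, -2.0])
velocity = np.zeros_like(weights)
initial_loss = 0.5 * np.sum(weights ** 2)
for step in range(200):
    grad = weights
    weights, velocity = sgd_momentum_step(weights, velocity, grad, step)
final_loss = 0.5 * np.sum(weights ** 2)
print(initial_loss, final_loss)
```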

Face Recognition
The LFW dataset [22], "Labeled Faces in the Wild", was used for the face recognition task. The dataset consists of 13,233 black and white images of 5749 individuals, of which only 1140 were used by requiring at least 100 faces per person, similar to the method used in [3]. The 1140 images comprised 236 of "Colin Powell", 121 of "Donald Rumsfeld", 530 of "George W Bush", 109 of "Gerhard Schroeder", and 144 of "Tony Blair". Training data and test data were divided so as to preserve the ratio of samples for each class and prevent imbalance. We used 75% of the input dataset for training and 25% for testing. The class "George W Bush" is far larger than the other classes, so data augmentation with horizontal flips was performed on all classes except that class. Note that the data augmentation was performed only on the training data. The total number of images was 1592 after the augmentation. In addition, five-fold validation was performed.
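The split-then-augment procedure described above can be sketched on toy data. This is an illustration only: the arrays are random stand-ins rather than LFW images, the helper names `stratified_split` and `augment_with_flip` are hypothetical, and the per-class rounding rule is an assumption, so the sketch is not expected to reproduce the image counts reported above.

```python
import numpy as np

def stratified_split(labels, train_fraction, rng):
    """Split indices so that each class keeps the same train/test ratio."""
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

def augment_with_flip(images, labels, skip_class):
    """Append horizontally flipped copies of every image not in skip_class."""
    mask = labels != skip_class
    flipped = images[mask][:, :, ::-1]  # reverse the width axis
    return (np.concatenate([images, flipped]),
            np.concatenate([labels, labels[mask]]))

rng = np.random.default_rng(0)
labels = np.array([0] * 8 + [1] * 4)  # class 0 plays the role of the dominant class
images = rng.random((12, 2, 3))       # 12 tiny grayscale "images"

train_idx, test_idx = stratified_split(labels, 0.75, rng)
# Augmentation touches the training portion only; the test set stays untouched.
train_images, train_labels = augment_with_flip(
    images[train_idx], labels[train_idx], skip_class=0)
```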

Figure 2. (a) Sample 32 × 32 image from CIFAR-10 [36]. DCT transforms the image pixels to the frequency domain. (b) 32 × 32 DCT coefficient matrix. White pixels represent the maximum DCT coefficient values, and black pixels represent the minimum DCT coefficient values, which are close to zero.

Figure 3. Three steps of the DCC process with a sample image from CelebA [37]. (a) Step 1 (DCT): The original image is divided into several blocks, and each block is transformed from the spatial domain to the frequency domain by DCT and displayed as the DCT coefficient matrix. (b) Step 2 (CUT): The DCT coefficient matrix is filtered by the coefficient cutting method, so that only the few largest coefficients remain in the DCC coefficient matrix. (c) Step 3 (IDCT): The DCC coefficient matrix is transformed from the frequency domain to the spatial domain by I-DCT per block. (d) The resulting privacy-preserved image.

Figure 4. Example of coefficient cutting for a = b = 2, m = n = 8, r = 12. (a) DCT coefficient matrix after step 1. (b) Select the largest DCT coefficient for each block. (c) Select the remaining top (r − a × b) coefficients. Then, generate the DCC coefficient matrix.

I-DCT transforms the DCC coefficients V* to the privacy image X* of Figure 3d. If the cutting phase is excluded, I-DCT transforms the DCT coefficients V to the original image X. Algorithm 1 shows the steps for transforming the DCC images.
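The statement that I-DCT recovers the original image when no cutting is applied can be checked numerically. The sketch below assumes the orthonormal DCT-II convention, under which the inverse is the transpose of the forward basis; the helper names are illustrative. It also shows that keeping only the top-left coefficient of a block reconstructs a flat block equal to the block mean.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: entry [p, m] = alpha_p * cos(pi * (2m + 1) * p / (2n))."""
    m = np.arange(n)
    basis = np.cos(np.pi * (2 * m + 1) * m.reshape(-1, 1) / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def dct2(x: np.ndarray) -> np.ndarray:
    """Forward 2-D DCT of one block."""
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T

def idct2(v: np.ndarray) -> np.ndarray:
    """Inverse 2-D DCT: the basis is orthonormal, so its transpose inverts it."""
    return dct_matrix(v.shape[0]).T @ v @ dct_matrix(v.shape[1])

rng = np.random.default_rng(1)
block = rng.random((4, 6))

# Without cutting, the round trip restores the block up to floating-point error.
restored = idct2(dct2(block))
max_error = np.abs(restored - block).max()

# Keeping only the top-left (lowest-frequency) coefficient yields a flat block.
dc_only = np.zeros_like(block)
dc_only[0, 0] = dct2(block)[0, 0]
flat = idct2(dc_only)
```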
Figure 5 displays a DCC example with a = b = 4. As observed in the figure, DCC hides the personal identity of the person in the image.
Figure 5b is equivalent to image pixelization with r = a × b = 16, so privacy is preserved most strongly. As r increases, the image becomes more comprehensible. Figure 5b is expressed using only 0.04% of the DCT coefficients, and Figure 5g using 1.32%, for an image size of 178 × 218.
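The two percentages quoted above follow from dividing r by the number of coefficients in a 178 × 218 image. The check below assumes that Figure 5g corresponds to r = 512 (six panels, b through g, with r doubling from 16) and that coefficients are counted per channel; neither is stated explicitly in the text.

```python
def kept_percentage(r: int, height: int, width: int) -> float:
    """Percentage of DCT coefficients that survive cutting for one image channel."""
    return 100.0 * r / (height * width)

total = 218 * 178  # 38804 coefficients per channel
for r in (16, 512):
    print(f"r = {r}: {kept_percentage(r, 218, 178):.2f}% of {total} coefficients")
```

Under these assumptions the script reports 0.04% for r = 16 and 1.32% for r = 512, matching the figures in the text.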

Figure 5. (a) Original image. (b-g) DCC results for a sample image with a = b = 4 and different r values. As r increases, the image becomes clearer and the person becomes easier to identify.

Algorithm 1 Discrete Cosine Transform Coefficient Cutting Method
Input: Image, number of blocks a × b, number of remaining coefficients r
  Divide the image into a × b blocks    (Step 1)
  for all blocks do
    Apply DCT to the image block
    Store the largest DCT coefficient value in DCC    (Step 2)
  end for
  if r > a × b then
    Store the top (r − a × b) DCT coefficient values in DCC
  end if
  for all blocks do
    Apply I-DCT to the DCC    (Step 3)
  end for
  Combine the blocks
  return Privacy-Preserved Image
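Algorithm 1 can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles one channel, requires image dimensions divisible by a and b, ranks coefficients by absolute value, and uses the orthonormal DCT-II basis. The names `dct_matrix` and `dcc` are illustrative.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: entry [p, m] = alpha_p * cos(pi * (2m + 1) * p / (2n))."""
    m = np.arange(n)
    basis = np.cos(np.pi * (2 * m + 1) * m.reshape(-1, 1) / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def dcc(image: np.ndarray, a: int, b: int, r: int) -> np.ndarray:
    """Blockwise DCT, coefficient cutting, then blockwise inverse DCT."""
    height, width = image.shape
    if height % a or width % b:
        raise ValueError("this sketch assumes dimensions divisible by a and b")
    if r < a * b:
        raise ValueError("r cannot be less than a * b")
    bh, bw = height // a, width // b
    row_basis, col_basis = dct_matrix(bh), dct_matrix(bw)
    blocks = [(slice(i * bh, (i + 1) * bh), slice(j * bw, (j + 1) * bw))
              for i in range(a) for j in range(b)]

    coeffs = np.empty((height, width))
    keep = np.zeros((height, width), dtype=bool)
    for rows, cols in blocks:
        # Step 1: DCT of the block.
        block = row_basis @ image[rows, cols] @ col_basis.T
        coeffs[rows, cols] = block
        # Step 2 (per block): always keep the largest-magnitude coefficient.
        p, q = np.unravel_index(np.argmax(np.abs(block)), block.shape)
        keep[rows.start + p, cols.start + q] = True

    # Step 2 (global): keep the top (r - a*b) of the remaining coefficients.
    extra = min(r, coeffs.size) - a * b
    if extra > 0:
        rest = np.where(keep, -np.inf, np.abs(coeffs)).ravel()
        keep.flat[np.argpartition(rest, -extra)[-extra:]] = True
    cut = np.where(keep, coeffs, 0.0)

    # Step 3: inverse DCT per block, written back into place.
    out = np.empty((height, width))
    for rows, cols in blocks:
        out[rows, cols] = row_basis.T @ cut[rows, cols] @ col_basis
    return out

rng = np.random.default_rng(0)
image = 100.0 + rng.random((8, 8))           # bright toy image, 8 x 8
pixelated = dcc(image, a=2, b=2, r=4)        # one coefficient per block
untouched = dcc(image, a=2, b=2, r=image.size)
```

For this toy image the lowest-frequency coefficient dominates every block, so r = a × b reproduces block-mean pixelization, while r equal to the total number of coefficients returns the input unchanged.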

Figure 6. The composition of the user study. The image on the left of each sub-question is the original image. (a) The image on the right of the third sub-question is transformed by DCC (r = 64). (b) The image on the right of the seventh sub-question is transformed by DCC (r = 1024).

Figure 9. Sample image of the training set and variation of DCC with r.