Deep Fake Image Detection Based on Pairwise Learning

Abstract: Generative adversarial networks (GANs) can be used to generate a photo-realistic image from low-dimensional random noise. Such a synthesized (fake) image with inappropriate content can be used on social media networks, which can cause severe problems. To detect fake images successfully, an effective and efficient image forgery detector is necessary. However, conventional image forgery detectors fail to recognize fake images generated by the GAN-based generator since these images are generated and manipulated from the source image. Therefore, in this paper, we propose a deep learning-based approach for detecting the fake images by using the contrastive loss. First, several state-of-the-art GANs are employed to generate the fake–real image pairs. Next, the reduced DenseNet is developed into a two-streamed network structure to allow pairwise information as the input. Then, the proposed common fake feature network is trained using the pairwise learning to distinguish the features between the fake and real images. Finally, a classification layer is concatenated to the proposed common fake feature network to detect whether the input image is fake or real. The experimental results demonstrated that the proposed method significantly outperformed other state-of-the-art fake image detectors.


Introduction
Recently, deep learning-based generative models, such as variational autoencoders and generative adversarial networks (GANs), have been widely used to synthesize photo-realistic partial or whole content of an image or a video. Furthermore, recent modifications of the GANs, such as progressive growing of GANs (PGGAN) [1] and BigGAN [2], have been used to synthesize highly photo-realistic images or videos, which humans cannot recognize as fake within a limited time. In general, the generative applications perform image translation tasks [3], which can cause serious problems if a fake image is improperly used on social media networks. For instance, the cycleGAN can be used to synthesize a fake face image in a pornography video [4]. Furthermore, the GANs can create a speech video with the synthesized facial content of any famous politician, causing severe problems for society and for political and commercial activities. Therefore, an effective fake face image detection technique is urgently needed. In this paper, our previous study [5] is extended to recognize generated fake images effectively and efficiently.
In the traditional image forgery detection approaches, two types of forensics schemes are commonly used: active schemes and passive schemes. In the active schemes, an external additive signal (i.e., a watermark) is embedded in the source image without visual artifacts. To determine whether an image has been tampered with, the watermark extraction process is performed on the target image to restore the watermark [6]. The extracted watermark image can be used to detect tampered regions in the target image. However, there is no source image for the images generated by the GANs, so the active image forgery detector cannot extract a watermark image. On the other hand, the passive image forgery detectors use statistical information of the source image that is highly consistent across different images. As a result, the intrinsic statistical information can be used to detect the fake regions in the image [7,8]. However, the passive image forgery detectors cannot be used to identify fake images generated by the GANs because these images are synthesized from a low-dimensional random vector. Specifically, the fake images generated by the GANs are not modified from any original image.
Since deep neural networks have been widely used in various recognition tasks, we can also adopt a deep neural network to detect fake images generated by the GANs. Recently, deep learning-based approaches for fake image detection using supervised learning have been studied. In these approaches, fake image detection is treated as a binary classification problem (i.e., fake or real image). For instance, a convolution neural network (CNN) was used to develop the fake image detector [9,10]. In [11], the performance of the fake face image detection was further improved by adopting the most advanced CNN, the Xception network [12]. In [13], a manipulated face detection algorithm was proposed based on a hybrid ensemble learning approach. However, none of these studies has investigated fully generated images; instead, they have focused only on partial manipulation of face images, and thus they cannot be used to detect fully generated fake images.
Many GANs have been proposed in recent years. Some of the recently proposed GANs [1-3,14-18] have been used to produce photo-realistic images. To develop a fake image detector with promising performance, it is necessary to collect images from all of these GANs as the training set for deep neural networks. However, it is difficult and very time-consuming to collect the training samples generated by all the GANs. In addition, such a supervised learning strategy [9-11] tends to learn the discriminative features of the fake images generated by the GANs in the training set, and as a result, the learned (trained) detector may not have a good generalization ability. In other words, the learned detector will be unable to recognize the fake images generated by the GANs that were not included in the detector training process.
To meet the current need for detecting fake images produced by GAN-based generators, we propose a modified network structure trained with a pairwise learning approach, called the common fake feature network (CFFN). By using the pairwise learning, the proposed structure overcomes the shortcomings of the supervised learning-based CNNs, such as those presented in [9,11]. To verify the effectiveness of the proposed method, the proposed deep fake detector (DeepFD) is applied to identify fake face images and fake general images. The main contributions of this work are as follows.

• A fake face image detector based on the novel CFFN, consisting of an improved DenseNet backbone network and Siamese network architecture, is proposed.
• The cross-layer features are investigated by the proposed CFFN, which can be used to improve the performance.
• The pairwise learning approach is used to improve the generalization property of the proposed DeepFD.

The rest of this paper is organized as follows. Sections 2 and 3 introduce the proposed CFFN for fake image detection with the pairwise learning intended for the face and general images, respectively. Section 4 presents the experimental results obtained for the fake face and general images. Finally, Section 5 gives the conclusions.

Fake Face Image Detection
The most serious challenge in the image and video forgery detection field is the fake face image detection. Fake face images can be used to create fake identities on social media networks, thus stealing personal information illegally. For instance, the fake image generator can be used to produce images of celebrities with inappropriate content, which has hazardous consequences. In this section, the proposed deep learning framework with the pairwise learning strategy is introduced in detail.
The proposed two-step learning method, which combines CFF learning based on the pairwise learning strategy with classifier learning, is presented in Figure 1. A purely supervised learning strategy for fake face image detection suffers from two problems: it is difficult to collect training samples generated by all possible GANs, and the fake face detector must be retrained to obtain an effective model for the fake face images generated by a new GAN. To overcome these problems, the fake and real images are paired, and the pairwise information is used to construct the contrastive loss with which the proposed CFFN learns the discriminative common fake feature (CFF). Once the discriminative CFF is learned, the classification network uses it to identify whether the image is real or fake. The details of the proposed method are described in the following.
Let the set of the collected training images generated by $M$ GANs be defined as $X_{fake} = [x^{k}_{i=1}, x^{k}_{i=2}, \ldots, x^{k}_{i=N_k}]_{k=1}^{M}$, where the $k$-th GAN generates $N_k$ training images. Let the training set consisting of real images be denoted as $X_{real} = [x_{i=1}, x_{i=2}, \ldots, x_{i=N_r}]$, containing $N_r$ training images. Therefore, the total number of training images, including both real and fake images, is $N_T = N_r + N_f = N_r + \sum_{k=1}^{M} N_k$. The label set $Y = [y_1, y_2, \ldots, y_{N_T}]$ indicates whether an image is fake ($y = 0$) or real ($y = 1$). As stated previously, the pairwise information is necessary for the training stage so that the CFFN can learn the discriminative CFFs well. Toward this end, the pairwise information is generated from the training set $X$ and its corresponding label set $Y$ by forming combinations of two training samples. Therefore, there are $C(N_T, 2)$ possible pairs $P = [p_{i=0,j=0}, p_{i=0,j=1}, \ldots, p_{i=0,j=N_r}, \ldots, p_{i=N_f,j=N_r}]$ generated from the training samples. In this paper, we set the total number of pairwise samples to 2,000,000.
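The pair construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' sampling procedure: the text states only that a fixed number of pairs is taken from the C(N_T, 2) combinations, so uniform random sampling without repetition is an assumption, and the function name `sample_pairs` is hypothetical.

```python
import random


def sample_pairs(labels, num_pairs, seed=0):
    """Draw `num_pairs` distinct index pairs (i, j) with i < j.

    labels[i] is 1 for a real image and 0 for a fake image.
    Each returned triple is (i, j, pair_label), where pair_label is
    1 for a genuine pair (real-real or fake-fake) and
    0 for an impostor pair (one real, one fake).
    Uniform sampling is an assumption made for this sketch.
    """
    n = len(labels)
    max_pairs = n * (n - 1) // 2  # C(n, 2)
    if num_pairs > max_pairs:
        raise ValueError("cannot draw more distinct pairs than C(n, 2)")
    rng = random.Random(seed)
    seen = set()
    pairs = []
    while len(pairs) < num_pairs:
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) in seen:
            continue  # skip pairs that were already drawn
        seen.add((i, j))
        pairs.append((i, j, int(labels[i] == labels[j])))
    return pairs


# Toy usage: four real and four fake images, ten sampled pairs.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pairs = sample_pairs(toy_labels, num_pairs=10)
```

Rejection sampling keeps the sketch short; it is efficient when the requested number of pairs is far below C(N_T, 2), which is the case for 2,000,000 pairs drawn from several hundred thousand images.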

Common Fake Feature Network
Many advanced CNNs can be used to learn the fake features from the training set. The Xception network was used in [11] to capture powerful features from the training images in a purely supervised way. Other advanced CNNs, such as DenseNet [19], ResNet [20], and Xception [12], can also be applied to the fake face detector training. However, most of these advanced CNNs are trained in a supervised way, so the classification performance depends on the training set. Rather than learning the fake features from all the GANs' images, we seek the CFF shared across different GANs. For this purpose, a suitable backbone network is needed for learning CFFs. However, the traditional CNNs (e.g., the DenseNet [19]) are not designed to learn the discriminative CFF. To overcome this shortcoming, we propose integrating the Siamese network with the DenseNet [19], developing the CFFN to achieve the discriminative CFF learning.
A dense block is a basic component of the DenseNet [19], which is one of the state-of-the-art CNN models for image recognition. However, it is trained by the supervised learning strategy, while the proposed pairwise learning strategy for the CFFs is a semi-supervised learning strategy. The proposed CFFN is a two-streamed network designed to allow the pairwise input for CFF learning. In contrast, the traditional CNNs, which are single-streamed networks, are unable to receive the paired information; thus, it is difficult for them to learn the common features. In the proposed CFFN, the backbone network can be any of the advanced CNNs, such as ResNet [20], Xception [12], or DenseNet [19]. The better the feature representation ability of the backbone network, the better the fake image recognition performance. Therefore, DenseNet is selected as the backbone network of the proposed CFFN.
Moreover, it is well known that CNNs capture the hierarchical feature representation from a low level to a high level. Typically, the CNNs use only the high-level feature representation to identify whether the image is fake or not. However, the CFFs of fake face images may exist not only in the high-level representation but also in the middle-level feature representation. Inspired by [21], in this work, the cross-layer features are integrated into the classification layer to improve the fake image recognition performance, as shown in Figure 2. The proposed CFFN consists of three dense units that include two, four, and three dense blocks, respectively, and the numbers of channels in the dense units are 48, 60, 78, and 126, respectively. The parameter θ in the transition layer is 0.5 and the growth rate is 24. Then, a convolution layer with 128 channels and a 3 × 3 kernel size is attached to the output layer of the last dense unit. Finally, the fully connected layer is added to obtain the discriminative feature representation. To obtain the cross-layer feature representation, we also reshape the last layers of the first and second dense units to aggregate the cross-layer features into the fully connected layer. Therefore, in the final feature representation, there are 128 × 3 = 384 neurons.
In general, the classification of fake images can be performed by different classification learning models, such as random forest, SVM, or Bayes classifier. However, the discriminative feature may be further improved by applying the back-propagation algorithm to an end-to-end structure. Therefore, in this work, a convolution layer and a fully connected layer are attached to the last convolution layer of the proposed CFFN to obtain the final decision result. The details of the proposed CFFN are given in Table 1.
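The channel bookkeeping implied by the growth rate and the transition parameter θ can be checked with simple arithmetic. The sketch below assumes the standard DenseNet convention, which the text does not spell out: every dense block adds `growth_rate` channels, and a transition layer after every dense unit except the last one scales the width by θ. The function name and the initial width of 48 are assumptions made for illustration.

```python
def dense_unit_widths(in_channels, blocks_per_unit, growth_rate=24, theta=0.5):
    """Return the channel width at the input and after each dense unit.

    Assumed convention (standard DenseNet, not confirmed by the text):
    each block adds `growth_rate` channels, and every unit except the
    last is followed by a transition layer that keeps a fraction
    `theta` of the channels.
    """
    widths = [in_channels]
    channels = in_channels
    last = len(blocks_per_unit) - 1
    for index, n_blocks in enumerate(blocks_per_unit):
        channels += n_blocks * growth_rate
        if index != last:
            channels = int(channels * theta)  # transition-layer compression
        widths.append(channels)
    return widths


widths_342 = dense_unit_widths(48, [3, 4, 2])
widths_243 = dense_unit_widths(48, [2, 4, 3])
```

Under these assumptions, an initial width of 48 with 3, 4, and 2 blocks per unit gives the sequence 48, 60, 78, 126, whereas 2, 4, and 3 blocks per unit give 48, 48, 72, 144. Table 1 remains the authoritative description of the configuration; the sketch only shows how such widths follow from a growth rate of 24 and θ = 0.5.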

Discriminative Feature Learning
The main drawback of supervised learning is that it is hard to identify subjects that were excluded from the training phase. To enhance the performance of the proposed method, we introduce the contrastive loss to learn the CFFs by pairwise learning. Therefore, the Siamese network structure [22] is used to allow pairwise inputs, as illustrated in Figure 3. To make the proposed CFFN learn the discriminative features during the training process, the contrastive loss term is incorporated into the energy function of the traditional loss function for supervised learning (i.e., the cross-entropy loss). Given the face image pair $(x_1, x_2)$ and the pairwise label $y$, where $y = 0$ indicates an impostor pair and $y = 1$ indicates a genuine pair, the energy function between two images is defined as:

$$E_W(x_1, x_2) = \left\| f_{CFFN}(x_1) - f_{CFFN}(x_2) \right\|_2 \qquad (1)$$

The most intuitive way to learn the discriminative features is to minimize the energy function $E_W$ given by (1). However, directly minimizing $E_W(x_1, x_2)$, computed as the $l_2$-norm distance in the feature domain, leads to a constant mapping, i.e., a mapping that sends any input to the same constant vector so that the energy function $E_W$ is trivially minimized. For instance, the learned mapping function can be $f_{CFFN}(x_1) = f_{CFFN}(x_2) = [1, 1, \ldots, 1]$. Thus, the constant mapping leads to a useless feature representation. Therefore, to overcome this problem, the contrastive loss is introduced to learn the discriminative feature representation as well as to avoid the constant mapping, which can be expressed as:

$$L(W, (y, x_1, x_2)) = \tfrac{1}{2}\, y\, E_W^2 + \tfrac{1}{2}\, (1 - y)\, \max(0,\, m - E_W)^2 \qquad (2)$$

where $m$ denotes the predefined threshold value. When the input is a genuine pair ($y = 1$), the cost function tends to minimize the energy $E_W$ (defined by the feature distance) between the two images. When the input is an impostor pair ($y = 0$), the contrastive loss minimizes the term $\max(0, m - E_W)$.
In other words, the energy E_W is pushed up whenever the feature distance between an impostor pair is smaller than the predefined threshold value m. In this way, it is possible to learn the common characteristics of the fake images generated by different GANs. When the contrastive loss is used, the feature representation f_CFFN(x_i) tends to become similar to f_CFFN(x_j) when the pair (x_i, x_j) has the pairwise label y_ij = 1 (i.e., for a fake-fake or real-real pair). By iteratively training the network f_CFFN using the contrastive loss, the CFFs of the collected GANs can be learned well.
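The behavior of the contrastive loss, including why it rules out the constant mapping, can be illustrated numerically. The sketch below assumes the widely used margin-based form with a factor of 1/2 on both terms; the exact scaling in the paper's equation may differ, and the function names are illustrative. The features are plain Python lists standing in for the outputs of f_CFFN.

```python
import math


def energy(feat_a, feat_b):
    """l2 distance between two feature vectors."""
    return math.dist(feat_a, feat_b)


def contrastive_loss(feat_a, feat_b, pair_label, margin=0.5):
    """Margin-based contrastive loss for one pair (assumed standard form).

    pair_label = 1: genuine pair, the loss pulls the features together.
    pair_label = 0: impostor pair, the loss pushes the features apart
                    until their distance reaches `margin`.
    """
    e = energy(feat_a, feat_b)
    if pair_label == 1:
        return 0.5 * e ** 2
    return 0.5 * max(0.0, margin - e) ** 2


# A constant mapping sends every image to the same vector.
constant = [1.0, 1.0, 1.0]
loss_genuine_constant = contrastive_loss(constant, constant, pair_label=1)
loss_impostor_constant = contrastive_loss(constant, constant, pair_label=0)

# Impostor features that are already farther apart than the margin cost nothing.
loss_impostor_far = contrastive_loss([0.0, 0.0], [1.0, 0.0], pair_label=0)
```

With the threshold m = 0.5, the constant mapping yields zero loss on genuine pairs but a positive loss on every impostor pair, so it is no longer a minimizer once impostor pairs are included in training.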

Classification Learning
As stated previously, there are multiple existing classifiers for fake image detection. To improve the performance of the fake face image detection, we adopt a sub-network as a classifier. Thus, the classifier can be trained quickly with the cross-entropy loss function, which is given by:

$$L_C(x_i, p_i) = -\sum_{i=1}^{N_T} \log \left[ f_{CLS}\big(f_{CFFN}(x_i)\big) \right]_{p_i} \qquad (3)$$

where $p_i$ is the class label of the image $x_i$ ($p_i = 0$ indicates the real image), $[\cdot]_{p_i}$ selects the predicted probability of class $p_i$, and $f_{CLS}$ denotes the classification sub-network consisting of a convolution layer with two channels and a fully connected layer with two neurons. The classifier can be easily trained by the back-propagation algorithm [23]. One way to learn both the CFFs and the classifier is the joint learning strategy, which incorporates the contrastive loss and the cross-entropy loss into the total energy function. The other way is to train the CFFN first with the proposed contrastive loss and then train the classifier with the cross-entropy loss. When the first strategy is applied, it is difficult to observe the impact of the contrastive and cross-entropy loss functions on the performance of the fake image detection tasks. Therefore, we adopt the second strategy to ensure the best performance of the proposed method. However, the first learning strategy is used as a baseline in the comparison, which will be presented in one of the following sections.
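For a two-neuron output layer, the cross-entropy loss of a single image can be computed as follows. This is a minimal sketch that assumes a softmax over the two outputs, which the text does not state explicitly; class index 0 stands for the real image, following the convention given above, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class `label`."""
    return -math.log(softmax(logits)[label])


# Two output neurons: index 0 = real, index 1 = fake (assumed ordering).
loss_undecided = cross_entropy([0.0, 0.0], label=0)   # equals log(2)
loss_confident = cross_entropy([4.0, -4.0], label=0)  # small loss
loss_wrong = cross_entropy([4.0, -4.0], label=1)      # large loss
```

The loss is log 2 when the classifier is undecided and grows as probability mass moves away from the true class, which is what back-propagation then corrects.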

Two-Step Learning Policy
The proposed method uses two loss functions: the contrastive loss for the proposed CFFN and the cross-entropy loss for classifier learning. A joint learning policy can be adopted to optimize the proposed CFFN and the classifier network based on the two loss functions simultaneously. However, it is hard to determine the weighting values for the two loss functions. In general, the weighting is determined empirically. Since the primary purpose of the proposed CFFN is discriminative feature learning, the CFF is learned first by minimizing the contrastive loss. Afterward, any classifier can be used to recognize the fake face image based on the trained CFF. In this study, we adopt a small neural network as the classifier, which enables end-to-end training. Moreover, it is well known that a classifier can be trained more easily on a better feature representation. Therefore, the CFFN is first trained based on the contrastive loss, and then the classifier network is optimized by minimizing the cross-entropy loss. To verify whether the two-step learning policy is valid, we also conduct an experiment comparing the performance of the joint learning (i.e., Baseline-I) and the two-step learning policy in the experimental section.
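The difference between the two policies can be summarized as a per-epoch weighting of the two losses. The sketch below is illustrative only: the epoch counts (2 for the CFFN step and 13 for the classifier step) are taken from the experimental settings reported later in the paper, while the function name, the joint weight `ce_weight`, and the assumption that the joint baseline runs for the same total number of epochs are hypothetical.

```python
def loss_schedule(policy, cffn_epochs=2, classifier_epochs=13, ce_weight=1.0):
    """Return one (contrastive_weight, cross_entropy_weight) tuple per epoch.

    "joint":    both losses are active in every epoch; the relative weight
                `ce_weight` has to be chosen empirically.
    "two_step": the contrastive loss alone drives the first epochs, and the
                cross-entropy loss alone drives the remaining ones.
    The schedule says which loss is minimized, not which layers are updated.
    """
    total_epochs = cffn_epochs + classifier_epochs
    if policy == "joint":
        return [(1.0, ce_weight)] * total_epochs
    if policy == "two_step":
        first = [(1.0, 0.0)] * cffn_epochs
        second = [(0.0, 1.0)] * classifier_epochs
        return first + second
    raise ValueError(f"unknown policy: {policy}")


two_step = loss_schedule("two_step")
joint = loss_schedule("joint", ce_weight=0.5)
```

The two-step schedule has no weighting hyperparameter at all, which is the practical advantage argued for above.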

Fake General Image Detection
In contrast to fake face images, fake general images are more difficult to detect because the content of a general image varies significantly. Moreover, the fake feature of a general image is more complicated than that of a face image. Therefore, a more effective backbone network than the one used in the fake face detection task is required to capture the CFFs of a general image. To this end, we increase the number of channels in the proposed CFFN. As given in Table 2, the total number of dense blocks in each dense unit is increased. The number of channels in each dense block is also increased to better capture the fake features of general images. As before, the contrastive loss and the classification sub-network presented in Section 2 are employed to detect whether an image is fake or real.

Data Collection
The dataset used in the experiments was extracted from CelebA [24]. The images from CelebA cover large pose variations and background clutter, and the dataset includes 10,177 identities and 202,599 aligned face images. In the experiment, five state-of-the-art GANs, namely the GANs of [15-18] and the PGGAN [1], were used to produce the training set of fake images based on the CelebA dataset. With the selected GANs, except for the PGGAN, it was hard to synthesize realistic high-resolution images. In [15-18], the default size of the generated face images in the released source code was only 64 × 64 pixels. Specifically, if the size of the fake image was set to 128 × 128 pixels, significant artifacts appeared in the generated images, so the fake images could be recognized easily by eye. In such a case, a fake image detector would not be needed. Most GANs could generate realistic fake images only at a smaller resolution, such as 64 × 64 pixels. To achieve a fair comparison of the performance of different image detectors in recognizing fake images generated by different GANs, the size of the input image was kept consistent and set to 64 × 64 pixels. For the PGGAN, the best model released by its authors was used. However, the PGGAN [1] generates high-resolution fake face images whose size differs from the one used in our experiments. Therefore, in the experiments, we downsampled the fake face images generated by the PGGAN to 64 × 64 pixels. Note that the generated images were downloaded from the official website provided by the authors of PGGAN [1].
Each GAN randomly generated 40,000 fake images with the size of 64 × 64 pixels, which were stored in the fake image pool. Since a fake image is generated by feeding a random vector to a GAN, the generated content varies each time unless the random vector is fixed. To make the experimental results fair, we saved the generated face images to a fake image pool to ensure that the fake contents of the generated face images were consistent in each experiment. There were 200,000 fake images in total in the pool. We also randomly selected 200,000 real images from CelebA. Therefore, the total number of images, including real and fake images, was 400,000. To evaluate the performance of the proposed method, we split the image dataset into the training, validation, and test sets consisting of 380,000, 10,000, and 10,000 images, respectively. In each set, the number of fake images was equal to the number of real images.
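A class-balanced split of this kind can be sketched as follows. The helper below is a hypothetical illustration, not the authors' code: it shuffles the real and fake pools independently and takes equal numbers from each for the validation and test sets, leaving the rest for training. Labels follow the convention used earlier (1 for real, 0 for fake).

```python
import random


def balanced_split(real_ids, fake_ids, val_size, test_size, seed=0):
    """Split into train/val/test with equal real and fake counts per set."""
    if len(real_ids) != len(fake_ids):
        raise ValueError("real and fake pools must have the same size")
    if val_size % 2 or test_size % 2:
        raise ValueError("val_size and test_size must be even")
    rng = random.Random(seed)
    real, fake = list(real_ids), list(fake_ids)
    rng.shuffle(real)
    rng.shuffle(fake)
    half_val, half_test = val_size // 2, test_size // 2

    def take(items, start, stop, label):
        return [(item, label) for item in items[start:stop]]

    cut = half_val + half_test
    val = take(real, 0, half_val, 1) + take(fake, 0, half_val, 0)
    test = take(real, half_val, cut, 1) + take(fake, half_val, cut, 0)
    train = take(real, cut, len(real), 1) + take(fake, cut, len(fake), 0)
    return train, val, test


# Scaled-down usage (1/1000 of the reported sizes): 200 real + 200 fake images.
reals = [f"real_{i}" for i in range(200)]
fakes = [f"fake_{i}" for i in range(200)]
train_set, val_set, test_set = balanced_split(reals, fakes, val_size=10, test_size=10)
```

At the reported scale, 200,000 real and 200,000 fake images with 10,000 validation and 10,000 test images leave 380,000 images for training, matching the numbers above.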

Experimental Settings
In the training of the proposed CFFN and fake face detector, we set the learning rate to 1e-3 and the total number of epochs to 15. The threshold value m of the contrastive loss was set to 0.5. The Adam optimizer [25] was used for both the first- and second-step learning. The number of epochs of the first-step learning of the CFFN was set to 2, and the number of epochs of the classification learning of the classification sub-network was set to 13. The batch size was 88 for all the learning tasks. The parameter settings followed [5,9,11].
In the experiments, we used the conventional image forgery detection method based on the sensor pattern noise [8] for performance comparison. The Baseline-I method was the joint learning method based on the contrastive loss and binary cross-entropy loss without two-step learning. In the Baseline-II method, instead of the CFFN structure, the DenseNet [19] with two-step learning of the contrastive and binary cross-entropy loss functions was adopted.
We compared the performance in terms of the precision P and recall R, which were defined as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP denoted the number of true positives, i.e., real images recognized as real; FP denoted the number of false positives, i.e., fake images detected as real; and FN denoted the number of false negatives, i.e., real images recognized as fake.
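The two metrics can be computed directly from predicted and true labels. The sketch below treats the real class (label 1) as the positive class, as in the definitions above; the function name and the zero fallback for an empty denominator are choices made for this illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Three real (1) and three fake (0) images.
truth = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]  # one real missed, one fake accepted as real
p_value, r_value = precision_recall(truth, predicted)
```

In this toy case TP = 2, FP = 1, and FN = 1, so both precision and recall equal 2/3.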

Objective Performance Comparison
To verify the effectiveness of the proposed method, we excluded one of the selected GANs from the training process and used it only in the testing process, so that the training and test sets came from different generators. For instance, when the PGGAN was excluded from the training phase of the proposed DeepFD, the fake images generated by the PGGAN and the corresponding real images were used to evaluate the performance of the trained fake face detector. The objective performance comparison of the proposed fake face detector, the two baseline methods, and the methods proposed in [9-11] in terms of precision and recall is presented in Table 3. As presented in Table 3, the proposed method significantly outperformed other state-of-the-art methods; thus, the CFFN can be used to capture the discriminative features of the fake images. The curves of the validation accuracy during the training phase are depicted in Figure 4 and demonstrate the effectiveness of the proposed DeepFD. The proposed pairwise learning successfully captured the CFFs from the training images generated by different GANs. Thus, it was verified that the proposed method had higher generalization ability and effectiveness than the other methods.

Table 3. The objective performance comparison of the proposed and other fake face detectors.
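The leave-one-generator-out protocol described in this section can be sketched as a small generator function. The function name and the placeholder generator names are hypothetical; only "PGGAN" is a name used in the text, and the pairing of each test set with real images is omitted for brevity.

```python
def leave_one_gan_out(fake_by_gan):
    """Yield (held_out, train_fakes, test_fakes) once per generator.

    fake_by_gan maps a generator name to the list of its fake images.
    The held-out generator contributes only to the test set; all other
    generators contribute only to the training set.
    """
    for held_out in fake_by_gan:
        train_fakes = [
            image
            for name, images in fake_by_gan.items()
            if name != held_out
            for image in images
        ]
        test_fakes = list(fake_by_gan[held_out])
        yield held_out, train_fakes, test_fakes


# Toy pool with placeholder generator names.
pool = {
    "gan_a": ["a0", "a1"],
    "gan_b": ["b0", "b1"],
    "PGGAN": ["p0", "p1"],
}
folds = list(leave_one_gan_out(pool))
```

Each fold measures how well a detector trained on the remaining generators transfers to a generator it has never seen, which is the generalization property the comparison is designed to test.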

Visualized Result
As presented in [26], an object can be localized by designing the number of channels in the last convolution layer to be the same as the number of classes. Also, as suggested in [26], the channels of the last convolutional layer of the proposed CFFN enable visualization. Therefore, the proposed model was used to visualize fake regions in the generated images by extracting the last convolution layer and mapping the responses to the image domain. Since the last convolution layer was designed to have two channels, the first channel was regarded as the feature response of the first class (i.e., real image), and the second channel corresponded to the second class (i.e., fake image). As a result, the proposed method could be used to visualize the fake regions, giving a more intuitive interpretation of typical fake features generated by the GANs. Moreover, the heat map of the last convolutional layer is produced from the normalized response values in the feature map of the second channel (i.e., the feature map for the fake feature). As shown in Figure 5, higher feature response values can be observed in the regions with artifacts in the fake face images, while the real images have relatively lower feature response values. We map the response values to the original image to draw the artifact regions in red.
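The normalization step behind such a heat map can be sketched as follows. The text says only that the response values are normalized, so the min-max scaling to [0, 1] used here is an assumption, as is the function name; the input is the second-channel (fake-class) response map given as a nested list.

```python
def fake_heat_map(fake_channel):
    """Min-max normalize a 2D fake-class response map to the range [0, 1].

    Higher values mark regions with a stronger fake response. A map with
    no variation is returned as all zeros.
    """
    flat = [value for row in fake_channel for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in fake_channel]
    span = high - low
    return [[(value - low) / span for value in row] for row in fake_channel]


# Toy 2 x 2 response map from the fake-class channel.
heat = fake_heat_map([[0.0, 2.0], [4.0, 8.0]])
```

The normalized map would then be resized to the input resolution and overlaid on the image, with high values drawn in red.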

Training Convergence
In the proposed method, it is necessary to guarantee the convergence of network training. Therefore, the convergence of the contrastive loss and of the CFFN training accuracy was analyzed, and the results are presented in Figure 6. In Figure 6a, the orange line depicts the accuracy curve during training with supervised learning without the contrastive loss, and the blue line indicates the validation accuracy curve of the training process with the proposed pairwise learning in the two-step learning policy. It is clear that the proposed pairwise learning improved convergence significantly compared to the supervised learning strategy. It was also necessary to ensure that the contrastive loss converged. Figure 6b depicts the convergence of the proposed contrastive loss during the CFFN training. As the results show, the proposed contrastive loss decreased after only several iterations. In general, over-fitting can be observed when the validation accuracy drops while the training accuracy is still improving. In our experiments, the validation accuracy improved only slightly after 21,000 iterations. It is clear that the validation accuracy in Figure 6a does not drop after 21,000 iterations. Therefore, the total number of training iterations can be higher than 21,000.
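The over-fitting criterion stated above (validation accuracy dropping while training accuracy still improves) can be checked mechanically on recorded accuracy curves. The sketch below is a hypothetical helper, not part of the proposed method; the `patience` parameter, which requires the pattern to persist for several consecutive checkpoints, is an assumption added to avoid reacting to noise.

```python
def first_overfitting_index(train_acc, val_acc, patience=3):
    """Return the first checkpoint index that starts a run of `patience`
    consecutive checkpoints where validation accuracy falls while training
    accuracy rises, or None if no such run exists."""
    if len(train_acc) != len(val_acc):
        raise ValueError("curves must have the same length")
    for end in range(patience, len(val_acc)):
        window = range(end - patience + 1, end + 1)
        if all(
            val_acc[k] < val_acc[k - 1] and train_acc[k] > train_acc[k - 1]
            for k in window
        ):
            return end - patience + 1
    return None


# Toy curves: validation accuracy peaks at checkpoint 2 and then declines.
train_curve = [0.60, 0.70, 0.80, 0.85, 0.90, 0.95]
val_overfit = [0.60, 0.70, 0.75, 0.74, 0.73, 0.72]
val_stable = [0.60, 0.70, 0.75, 0.76, 0.76, 0.77]
overfit_at = first_overfitting_index(train_curve, val_overfit)
stable_at = first_overfitting_index(train_curve, val_stable)
```

A curve like the one described for Figure 6a, where the validation accuracy keeps improving slightly, would return `None` under this check.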

Fake General Image Detection
In the task of fake general image detection, we used three state-of-the-art GANs [2,27,28] to generate high-quality fake images. The dataset was extracted from the ILSVRC12 [29]. We adopted the source code provided in [2,27,28] and the released models that were trained on the ILSVRC12 to generate the fake general images. Each GAN generated 100,000 fake images with a size of 128 × 128 pixels, which were stored in the fake general image pool. Then, we randomly selected 300,000 real images from the ILSVRC12. Therefore, the total number of images was 600,000. To evaluate the performance of the proposed method, we split the image dataset into the training, validation, and test sets that consisted of 580,000, 10,000, and 10,000 images, respectively.
The objective performance comparison of the proposed method and other state-of-the-art image forgery detection methods is presented in Table 4. The results given in Table 4 show that the proposed CFFN with the pairwise learning strategy performed significantly better than other state-of-the-art image forgery detectors, including the supervised learning-based methods [9,11]. Accordingly, the results indicate that the proposed method can learn the CFF of a fake general image.

Discussions and Limitations
In the proposed CFFN and DeepFD, the ability to detect fake face and general images is provided by deep neural networks. Since the main contribution of this work is that the CFFs are learned from the pairwise training samples, the proposed CFFs may fail when the fake features produced by a new generator are significantly different from most of those used in the training phase. In such a situation, the fake face and general image detector should be retrained. Another limitation of the proposed method is related to the collection of training samples. The technical details of some fake image generators may not have been revealed, so the training samples might be hard to collect in practice. To overcome this limitation, a few-shot learning policy should be employed to learn the CFF from a small-scale training set.

Conclusions
In this paper, pairwise learning based on a common fake feature network is proposed to detect the fake face and general images generated by the state-of-the-art GANs. The proposed CFFN can be used to learn the middle- and high-level discriminative fake features by aggregating the cross-layer feature representations. The proposed pairwise learning strategy enables the fake feature learning, which allows the trained fake image detector to detect a fake image generated by a new GAN, even if that GAN was not included in the training phase. The experimental results demonstrated that the proposed method outperformed other state-of-the-art methods in terms of precision and recall rate. Fake video detection is also an important issue, so in our future work, we will extend the proposed method to fake video detection, incorporating object detection and the Siamese network structure.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

CNN     Convolution neural network
CFF     Common fake feature
CFFN    Common fake feature network
DeepFD  Deep fake image detector
GAN     Generative adversarial nets