Multi-Feature Fusion Based Deepfake Face Forgery Video Detection

Abstract: With the rapid development of deep learning, generating realistic fake face videos is becoming easier. Deep forgery is commonly used to produce fake news, online pornography, extortion material and other illegal content. To mitigate the harm of deep forgery face videos, researchers have proposed many detection methods based on the tampering traces introduced by deep forgery. However, these methods generally have poor cross-database detection performance. Therefore, this paper proposes a multi-feature fusion detection method to improve the generalization ability of the detector. The method combines feature information of face video in the spatial domain, the frequency domain, the Pattern of Local Gravitational Force (PLGF) and the time domain, and effectively reduces the average error rate of cross-database detection while maintaining good in-database detection performance.


Introduction
The human face usually provides important identity information; thus, many related studies have been carried out, including face detection and recognition in 2D and 3D spaces [1][2][3][4]. In recent years, driven by computer graphics technology and deep learning, deep face forgery technology (Deepfake) has developed rapidly. It can replace the face of a character in the target video with a specified face, or make the target face reproduce specific expressions and actions, so as to generate and replace high-fidelity faces [5]. Open-source tools and applications allow ordinary users to swap faces according to their personal needs and generate high-fidelity deep-forged face videos. Early applications of deep forgery technology were only for entertainment. However, because deep forgery face videos have a low technical threshold, high fidelity and strong deceptive power, criminals can easily forge face-swapped videos of specific people for malicious use, which not only violates the privacy and reputation of the people involved but also misleads public opinion, erodes social trust and may even lead to serious political crises. On the other hand, digital video is an important class of electronic evidence in the judicial system and is increasingly used in various cases. Ensuring its authenticity is very important and requires the support of digital video authenticity detection technology [6]. How to detect and defend against deep forgery face videos has become one of the issues of greatest concern to governments, enterprises and individuals around the world.
To mitigate the harm caused by deep face forgery technology, researchers have explored face forgery video detection in depth and proposed detection from multiple perspectives, such as the spatial domain, the time domain and the frequency domain. These methods achieve satisfactory detection performance on some data sets. However, they suffer from low detection accuracy, poor generalization performance and weak anti-interference ability, which limits them and makes them difficult to apply to more complex real scenes. For this reason, this paper designs a feature fusion detection model that extracts features from the video spatial domain, frequency domain and time domain at the same time. The details are as follows: (a) spatial features are extracted directly from the spatial face image using the Xception network, (b) frequency domain features are obtained by applying the discrete Fourier transform to the face image to produce the corresponding spectrum map and then extracting features from the spectrum map with the Xception network, (c) the PLGF image is calculated and the PLGF features are then extracted by the Xception network and (d) time-domain features are extracted by concatenating the above three feature vectors for multiple consecutive frames and feeding them into the LSTM network structure. Finally, the output features of the LSTM network, which fuse the spatial domain, frequency domain, PLGF and time domain information of the face images, are used for the final classification. The source code of this work will be available at https://gitlab.com/test-2022/multi-feature-fusion-based-deepfake-detection (accessed on 3 March 2022).

Generation Method of Deep Forged Face Video
Deep forgery is an image synthesis technology based on deep learning. It mainly uses generative adversarial networks, deep convolutional neural networks and autoencoders, with a set of original faces and target faces as training data, to forge faces. One of the common applications of deep forgery is to replace one face in a video with another face, which is also known as face swapping [7]. The core idea of face swapping is to replace the face in the target video with the face in the source video and to make the replaced face as realistic as possible through corresponding detail processing, so that the naked eye cannot tell whether the face in the output video has been tampered with. Because it changes identity attributes, face swapping enables a specified person to appear in video scenes in which that person never appeared. Methods for deep forgery of face video have been publicly implemented. They are mainly based on the structure of a denoising autoencoder and use supervised learning to train a neural network for face replacement. In the following, Deep-Faceswap is taken as an example to introduce the generation process of deep forgery face video; other deep forgery face generation methods have similar generation processes [8].
Deep-Faceswap uses Dlib to extract the faces in the source video and the target video, then crops and aligns the extracted faces and resizes them to 64 × 64. In the training phase, original face A and target face B are used as training data to train a weight-sharing encoder that extracts the facial attributes common to A and B. In the decoding phase, an independent decoder is trained for each of A and B to learn their unique facial information and complete the corresponding face reconstruction. Once the encoder and the two decoders are trained, face replacement between original face A and target face B is realized by encoding the facial attributes of B with the encoder and then decoding this encoding with the decoder of A to reconstruct the face. This generates a deep forgery face image with the appearance features of face A and the facial expression and action of face B. The specific process is shown in Figure 1. Based on a similar idea, researchers have proposed and developed more face replacement methods and achieved better replacement results.

Figure 1. Generation process of Deep-Faceswap (recoverable labels: Encoder, Decoder A).

Deep Forged Face Video Database
In deep forgery video detection, a database is indispensable, as it is used to train the detection model and evaluate its performance. There are four commonly used public face-swapping databases, namely DeepfakeTIMIT [9], Fake Face in the Wild (FFW) [10], FaceForensics++ [11] and DeepFakeDetection [12].
In DeepfakeTIMIT, the face forgery videos were generated by the Idiap Research Institute in Switzerland using an open-source face replacement algorithm. The database selects 16 pairs of faces with similar facial features from VIDTIMIT to generate forgery videos, and each video has 2 versions with different resolutions.
FFW is a face-swapping database released by the biometrics laboratory of the Norwegian University of Science and Technology. It contains only face forgery videos, which were generated using a variety of forgery techniques.
In FaceForensics++, the videos were collected from the YouTube video website and comprise 1000 videos of real people. A total of 4 face forgery techniques (DeepFake, Face2Face, FaceSwap and NeuralTextures) are used to generate 4 types of face forgery videos, with 1000 videos of each type. In addition, the FaceForensics++ videos are compressed with H.264 into lossless, high quality and low quality versions, with corresponding compression rates of 0, 23 and 40, respectively, to simulate the compression that video undergoes in actual transmission.
The videos in DeepFakeDetection were jointly produced by Google and Jigsaw. The database contains 363 original videos and 3068 forged videos, with richer backgrounds and more diverse facial expressions. Similar to FaceForensics++, the DeepFakeDetection database also provides the videos at the compression rates C0, C23 and C40.
The algorithm presented in this paper will be tested on DeepfakeTIMIT, FaceForensics++ and DeepFakeDetection.

Deep Forgery Face Video Detection Method
The tamper detection methods for deep forgery face video are mostly based on the tampering traces introduced in the tampering process and on the inconsistency of video frame images in the spatial and temporal domains. The quality of a forged face video is related to the difference in skin color and in facial action between the two faces. When the skin color difference is too large, or when the feature points of the two faces cannot be accurately mapped onto each other, obvious artifacts tend to appear in the tampered video. Based on the tampering traces introduced in the process of deep forgery, researchers have proposed corresponding detection methods to distinguish fake face video images from real ones. These detection methods can be divided into two categories: machine learning methods based on handcrafted features and deep learning methods.
Zhang et al. [13] first proposed a classical machine learning method to detect face-swapped images. They first calculate speeded up robust features (SURF) descriptors, then generate bag-of-words (BOW) features with the K-means method, obtain the codebook histogram and input it into various classifiers, such as a support vector machine (SVM), random forest (RF) and multi-layer perceptron, which are trained to distinguish face-swapped images from real face images. Due to the defects in the generation process of deeply forged faces, stitching the generated face region into the original image introduces errors. Based on this, S. Lyu et al. [14] proposed a detection method based on differences in the 3D head pose. This method uses the difference between the head pose estimated from the central region of the face and that estimated from all 68 facial landmarks extracted by Dlib as the feature to distinguish real and fake faces. The extracted features are standardized (using the mean and standard deviation) and then input into an SVM classifier to obtain the detection results. Matern et al. [15] summarized the artifacts left by current facial tampering and its processing, specifically the lack of consistent texture features in facial regions, such as missing reflections and missing detail in the eye and teeth regions. Koopman et al. [16] proposed a method to classify real and fake videos by using photo response non-uniformity (PRNU). PRNU is usually considered to be the fingerprint left by the camera in the image. Because face swapping changes the local PRNU pattern of the face region in video frames, the PRNU pattern can be used as a feature to classify real and fake videos. By analyzing images generated by GANs, a research team at the Horst Görtz Institute for IT Security of Ruhr University Bochum in Germany found significant differences between generated images and real images in the frequency domain [17], which are caused by the up-sampling operation. Since a GAN generates images by transforming low-dimensional noise vectors into high-dimensional images, the up-sampling operation cannot be avoided, and grid-like characteristics therefore appear in the frequency domain of the generated images.
Habeeba [18] proposed a method that uses a neural network to detect the visual artifacts of facial regions in non-natural images in order to distinguish real and fake videos. Zhou et al. [19] proposed a two-stream network to detect face tampering, which fuses two features: the face feature in the spatial image and the steganalysis feature of image blocks. Yu et al. [20] proposed a method for detecting face-swapped images using separable convolutional neural networks, which combines block-level features with whole-image features to classify the images. Li et al. [21] used a convolutional neural network (CNN) to extract the features of video frame images and then input the features of a specific number of consecutive video frames into a recurrent neural network (RNN), which is trained to determine whether the video is a face-swapped video. Dolhansky [22] proposed three simple detection systems: (a) a small CNN model composed of six convolution layers and one fully connected layer to detect low-level image tampering, (b) the Xception network model trained only on face images and (c) the Xception network model trained on complete images [23]. The existing deep-learning based methods show impressive performance. However, most of them extract features from only one or two domains. We try to fuse features extracted from more domains to improve detection performance.

Detection Framework
Compared with a real face, a deep forgery face carries tampering trace information formed in the forgery process. Extracting these tiny tampering traces is the key to identifying deep forgery faces. However, the traces are too weak to be detected effectively from a single feature, and detectors based on a single feature have limited generalization ability. To this end, this paper proposes to detect from multiple perspectives, such as the spatial domain, the time domain and the frequency domain, and designs a detection network that simultaneously extracts features from multiple feature spaces of face images, so that multi-feature fusion gives the network better generalization ability.
The detection framework of this paper is shown in Figure 2. The Xception networks that extract features from the spatial-domain image, the frequency-domain image and the PLGF image have essentially the same structure, and each outputs a 2048-dimensional feature. The three features are concatenated into a 6144-dimensional feature, which represents the fused spatial, frequency and PLGF features of one face image. The time domain feature is extracted from the fused features of 10 face images by a double-layer LSTM network, which outputs a 512-dimensional feature [24]. Finally, the binary classification result is output through a fully connected layer. The choices of the Xception network and the double-layer LSTM were determined by experiments. We tried Xception, ResNet50, InceptionV3, EfficientNet and DenseNet201 as the feature extractor and found that the Xception network performs best among them. For the LSTM, we tried structures with 1, 2 and 3 layers and found that the 2-layer structure has 2% to 3% higher accuracy than the 1-layer structure, while the 3-layer structure performs almost the same as the 2-layer structure. Therefore, we selected the 2-layer LSTM for temporal feature extraction.

Data Pre-Processing
The CNN face detector of the Dlib machine learning library in Python is used to detect the face region in each frame of the video under test and obtain the face image. The face image, denoted by $I$, has three color channels (R, G, B) and variable size. In order to keep the edge of the face in the extracted image, the bounding box of the face is expanded. The new bounding box has the same center as the old one, but its width and height are 1.3 times the original ones.
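The bounding-box expansion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `expand_box` is hypothetical, and clipping the enlarged box to the image borders is an assumption that the text does not mention.

```python
def expand_box(left, top, right, bottom, scale=1.3, img_w=None, img_h=None):
    """Scale a face bounding box about its center by `scale`.

    Clipping to the image size is an ASSUMPTION added for safety; it is
    only applied when img_w / img_h are given.
    """
    cx = (left + right) / 2.0
    cy = (top + bottom) / 2.0
    half_w = (right - left) * scale / 2.0
    half_h = (bottom - top) * scale / 2.0
    new_left, new_right = cx - half_w, cx + half_w
    new_top, new_bottom = cy - half_h, cy + half_h
    if img_w is not None:
        new_left, new_right = max(0.0, new_left), min(float(img_w), new_right)
    if img_h is not None:
        new_top, new_bottom = max(0.0, new_top), min(float(img_h), new_bottom)
    return new_left, new_top, new_right, new_bottom


# A 100 x 100 box centered at (150, 150) becomes a 130 x 130 box with the same center.
box = expand_box(100, 100, 200, 200)
```

Keeping the center fixed means the extra 30% is split evenly on each side, so the face contour and a thin band of background remain in the crop.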
For the input of the Xception network that extracts spatial features, $I$ is resized to 224 × 224 × 3 by bilinear interpolation and normalized. The resulting spatial image, denoted as $I_S$, is the input of the network.
For the input of the Xception network that extracts frequency domain features, the DFT is applied to each channel of $I$, and the low frequency component is moved to the center to obtain the spectrum of each color channel. Assuming that the amplitude at position $(x, y)$ of the R channel is $A_R(x, y)$, the value at the corresponding position of the frequency domain image is given by Equation (1). The values at other positions are obtained in the same way. Then, each channel is resized to 224 × 224 by bilinear interpolation, and the frequency domain input image $I_F$ with a size of 224 × 224 × 3 is obtained.
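The per-channel spectrum computation can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name `centered_log_spectrum` is hypothetical, and the logarithmic scaling of the amplitude is an assumption standing in for the mapping of Equation (1). The final bilinear resize to 224 × 224 is omitted.

```python
import numpy as np


def centered_log_spectrum(face_rgb):
    """Return the centered amplitude spectrum of each channel of an H x W x 3 image."""
    face = np.asarray(face_rgb, dtype=np.float64)
    channels = []
    for c in range(face.shape[2]):
        freq = np.fft.fft2(face[:, :, c])   # 2-D DFT of one color channel
        freq = np.fft.fftshift(freq)        # move the low-frequency component to the center
        amplitude = np.abs(freq)            # amplitude A(x, y) at each frequency position
        # ASSUMPTION: log scaling compresses the dynamic range; the paper's
        # Equation (1) may define a different mapping.
        channels.append(np.log1p(amplitude))
    return np.stack(channels, axis=-1)


rng = np.random.default_rng(0)
face = rng.random((64, 64, 3))              # hypothetical stand-in for a cropped face
spectrum = centered_log_spectrum(face)      # shape (64, 64, 3)
```

After `fftshift`, the zero-frequency term sits at the center of each channel, so low frequencies form the middle of the spectrum map and high frequencies, where up-sampling artifacts tend to appear, lie toward the borders.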
For the input of the Xception network that extracts PLGF features, the horizontal gradient $G_{hor}$ and the vertical gradient $G_{ver}$ of each of the three color channels in $I$ are obtained by applying the PLGF operator in the horizontal and vertical directions, respectively. The PLGF convolution is given by Equation (2), where $f_{hor}$ and $f_{ver}$ are the 3 × 3 convolution kernels of the local gravitational force model (PLGF) in the horizontal and vertical directions, respectively, $I[x, y]$ is the pixel value at coordinates $(x, y)$ and $G_d[x, y]$ is the directional gradient at coordinates $(x, y)$.
Then, according to the Lambert model, illumination separation is applied to the horizontal and vertical gradients to obtain the horizontal illumination separation gradient $ISG_{hor}$ and the vertical illumination separation gradient $ISG_{ver}$. Illumination separation divides each gradient by the pixel value itself, with a safeguard so that the divisor cannot be zero. Since the light intensity changes slowly and can be treated as a constant value $L$ within a small area, the illumination component $L$ can be eliminated, leaving face texture features that depend only on the reflection coefficient. These features carry rich texture information and can serve as an effective feature for detecting face authenticity. The illumination separation is given by Equation (3). Then, the synthetic gradient $ISG$ is obtained by linear activation of the horizontal and vertical illumination separation gradients, which forms the PLGF image, as given by Equation (4). Finally, the PLGF image of each channel is resized to 224 × 224 by bilinear interpolation, and the final PLGF input image $I_P$ is obtained.
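The reason illumination separation removes the lighting component can be checked numerically. Under the Lambert model a pixel is the product of reflectance and a locally constant illumination $L$, so both the gradient and the pixel value scale with $L$ and their ratio does not. The sketch below is an illustration only: it uses central differences as a stand-in for the PLGF kernels $f_{hor}$ and $f_{ver}$, and the small constant `eps` and all function names are assumptions.

```python
import numpy as np


def directional_gradients(channel):
    """Horizontal and vertical gradients via central differences.

    ASSUMPTION: these simple differences are a stand-in for the PLGF
    kernels, which are not reproduced here.
    """
    padded = np.pad(channel, 1, mode="edge")
    g_hor = (padded[1:-1, 2:] - padded[1:-1, :-2]) / 2.0
    g_ver = (padded[2:, 1:-1] - padded[:-2, 1:-1]) / 2.0
    return g_hor, g_ver


def illumination_separated(channel, eps=1e-6):
    """Divide each gradient by the pixel value; eps guards against division by zero."""
    channel = np.asarray(channel, dtype=np.float64)
    g_hor, g_ver = directional_gradients(channel)
    denom = channel + eps
    return g_hor / denom, g_ver / denom


rng = np.random.default_rng(0)
reflectance = rng.random((32, 32)) + 0.1                  # hypothetical surface reflectance
dim = illumination_separated(0.5 * reflectance)           # same face under weak light
bright = illumination_separated(2.0 * reflectance)        # same face under strong light
same = np.allclose(dim[0], bright[0], rtol=1e-3) and np.allclose(dim[1], bright[1], rtol=1e-3)
```

The raw gradients of the two images differ by a factor of four, while the illumination separation gradients agree up to the effect of `eps`, which is the invariance the Lambert argument predicts.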
The video to be detected is down-sampled by keeping one frame out of every 5 frames, which yields the frames actually used for detection. This prevents the face images in adjacent detection frames from being too similar and carrying redundant information. The frames obtained after down-sampling are processed as described above to obtain the corresponding $I_S$, $I_F$ and $I_P$ of each frame, which completes the data pre-processing.
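The temporal down-sampling can be sketched as follows, together with the grouping into 10-frame sequences that the LSTM consumes. The function names are hypothetical, and three details are assumptions not stated in the text: sampling starts at frame 0, sequences do not overlap, and an incomplete final sequence is dropped.

```python
def sample_frame_indices(num_frames, step=5):
    """Keep one frame out of every `step` frames, starting at frame 0 (assumed)."""
    return list(range(0, num_frames, step))


def group_into_clips(indices, clip_len=10):
    """Split sampled indices into non-overlapping clips of `clip_len` frames.

    ASSUMPTION: an incomplete trailing clip is discarded.
    """
    return [indices[i:i + clip_len]
            for i in range(0, len(indices) - clip_len + 1, clip_len)]


indices = sample_frame_indices(120)   # frames 0, 5, 10, ..., 115
clips = group_into_clips(indices)     # two clips of 10 sampled frames each
```

With a step of 5, each 10-frame sequence spans 50 frames of the original video.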

Xception Feature Extraction Network
Since the network inputs $I_S$, $I_F$ and $I_P$ have exactly the same size, the Xception networks used to extract the spatial, frequency and PLGF features of face images also share the same structure. The Xception network structure used in this method is shown in Figure 3.
In Figure 3, Conv represents an ordinary convolution layer and SeparableConv represents a depthwise separable convolution layer; 3 × 3 and 1 × 1 represent the size of the convolution kernel or pooling kernel; stride = 2 × 2 indicates that the stride of the convolution kernel or pooling kernel is 2. Unless otherwise stated, the stride defaults to 1.
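To make the difference between Conv and SeparableConv concrete, the weight counts of the two layer types can be compared directly. The channel sizes below are hypothetical and are not taken from Figure 3, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of an ordinary convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights of a depthwise separable convolution.

    Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = conv_params(3, 128, 256)              # 294912 weights
separable = separable_conv_params(3, 128, 256)   # 1152 + 32768 = 33920 weights
ratio = standard / separable                     # roughly 8.7x fewer weights
```

The saving comes from factoring spatial filtering and channel mixing into two cheaper steps.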

Double-Layer LSTM Time Domain Feature Extraction Network
After the three 2048-dimensional features are extracted from the spatial domain, frequency domain and PLGF images by the above Xception networks, they are concatenated to obtain a 6144-dimensional fused feature. Then, the 6144-dimensional fused features of 10 frames of face images are input into the double-layer LSTM network, the final 512-dimensional fused feature is extracted, and the binary classification result (real face or forged face) is output through the fully connected layer. The network structure of this part is shown in Figure 4. In Figure 4, the input of the first LSTM layer is the 6144-dimensional feature and its output is a 512-dimensional feature, which further integrates the features of the originally separate domains. The output of the first LSTM layer contains 10 time steps, and these output features are sent to the second layer. The input of the second LSTM layer is the 512-dimensional feature, and its output is a 512-dimensional feature with only one time step, which fuses the spatial domain, frequency domain, PLGF and time domain information of the face images.
Finally, the 512-dimensional feature is mapped to a 2-dimensional vector by a fully connected layer, and the softmax activation function then outputs the binary classification result, indicating whether the video contains a real face or a fake face.
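The dimensions involved in fusion and classification can be traced with a short NumPy sketch. It only tracks shapes. The Xception outputs, the LSTM output and the fully connected weights are all random stand-ins (assumptions), so the resulting probabilities are meaningless except for their form.

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, feat_dim = 10, 2048

# Hypothetical stand-ins for the three Xception branch outputs of 10 frames.
spatial = rng.standard_normal((num_frames, feat_dim))
frequency = rng.standard_normal((num_frames, feat_dim))
plgf = rng.standard_normal((num_frames, feat_dim))

# Per-frame concatenation: 3 x 2048 = 6144 features for each of the 10 frames.
fused = np.concatenate([spatial, frequency, plgf], axis=1)   # shape (10, 6144)

# ASSUMPTION: a random vector replaces the 512-dimensional output that the
# second LSTM layer would produce from `fused`.
lstm_out = rng.standard_normal(512)


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


# Fully connected layer 512 -> 2 with hypothetical weights, then softmax.
weights = rng.standard_normal((512, 2)) * 0.01
bias = np.zeros(2)
probs = softmax(lstm_out @ weights + bias)                    # [P(real), P(fake)]
```

The order of the two output classes is an arbitrary choice made for this sketch.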

Introduction of Experimental Database
In order to evaluate the performance of the proposed method, we conduct experiments on three face forgery video databases: DeepFakeDetection (DFD), FaceForensics++ (FF++) and DeepfakeTIMIT (TIMIT). The DFD database contains 1089 real videos and 9204 face forgery videos, which are divided into 3 compression levels: compression rate 0 (C0), compression rate 23 (C23) and compression rate 40 (C40). The real video data come from 28 actors filmed in different scenes. The FF++ database contains 1000 real videos and 4000 face forgery videos. Of these, 1000 face forgery videos are synthesized by Deepfake tampering, and they are divided into the same 3 compression levels: C0, C23 and C40. The real video data come from the video website YouTube. The TIMIT database contains 559 real videos and 640 face-swapped videos. The face forgery videos include low quality (LQ) and high quality (HQ) versions.

Experimental Settings
In this experiment, an NVIDIA GTX 1080 Ti card with 11 GB of video memory is used. The operating system is Ubuntu 14.04, the programming language is Python 3.6 and the deep learning framework is Keras 2.2.5 based on TensorFlow 1.14.
During training, each database was divided into a training set, a validation set and a test set at a ratio of 7:2:1, and the batch size was set to 32. The numbers of real and fake samples in the databases are usually unbalanced, so half of the samples in each batch are randomly selected from the real samples and the other half from the fake samples. The Xception network is optimized with the Adam method at a learning rate of 0.0001. The LSTM network is optimized with the RMSProp method at a learning rate of 0.001. An automatic learning rate decay strategy is adopted in training: if the loss does not decrease for five iterations, the learning rate is multiplied by 0.5; if the loss does not decrease for 10 iterations, the network model is considered to have converged and training is terminated. Processing one batch takes about 15 s during training.
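The balanced batch sampling and the learning rate strategy can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' training code: the names `balanced_batch` and `PlateauSchedule` are hypothetical, and the schedule halves the rate once at the fifth stagnant step and stops at the tenth, which is one possible reading of the description. In Keras, the `ReduceLROnPlateau` and `EarlyStopping` callbacks offer comparable behavior.

```python
import random


def balanced_batch(real_ids, fake_ids, batch_size=32, rng=random):
    """Draw half of the batch from real samples and half from fake samples."""
    half = batch_size // 2
    batch = rng.sample(real_ids, half) + rng.sample(fake_ids, batch_size - half)
    rng.shuffle(batch)
    return batch


class PlateauSchedule:
    """Halve the learning rate after 5 stagnant steps, stop after 10 (assumed reading)."""

    def __init__(self, lr, factor=0.5, reduce_patience=5, stop_patience=10):
        self.lr = lr
        self.factor = factor
        self.reduce_patience = reduce_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.stale = 0

    def step(self, loss):
        """Record one loss value; return False once training should stop."""
        if loss < self.best:
            self.best = loss
            self.stale = 0
            return True
        self.stale += 1
        if self.stale == self.reduce_patience:
            self.lr *= self.factor
        return self.stale < self.stop_patience


# Hypothetical sample identifiers; real code would hold file paths or indices.
real_ids = [("real", i) for i in range(100)]
fake_ids = [("fake", i) for i in range(400)]
batch = balanced_batch(real_ids, fake_ids, rng=random.Random(0))

schedule = PlateauSchedule(lr=1e-4)   # Xception branch learning rate from the text
```

Sampling each class separately keeps every batch at 16 real and 16 fake samples regardless of how unbalanced the database is.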
Half total error rate (HTER) is used as the evaluation index. It is the average of the false alarm rate and the missed detection rate of the algorithm at the decision threshold, as defined in Equation (5):

$$\mathrm{HTER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{2} \quad (5)$$

Here, the False Acceptance Rate (FAR) is the proportion of face-swapped faces that the algorithm judges to be real faces, and the False Rejection Rate (FRR) is the proportion of real faces that the algorithm judges to be face-swapped faces. They are defined in Equations (6) and (7):

$$\mathrm{FAR} = \frac{N_{f2t}}{N_f} \quad (6)$$

$$\mathrm{FRR} = \frac{N_{t2f}}{N_t} \quad (7)$$

where $N_{f2t}$ is the number of times that a face-swapped face is judged as a real face, $N_f$ is the total number of face-swapped face attacks, $N_{t2f}$ is the number of times that a real face is judged as a face-swapped face and $N_t$ is the total number of real face detections. The lower the HTER, the better the performance of the algorithm.
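The metric follows directly from the four counts defined above. A minimal sketch, with a hypothetical function name and made-up counts used purely for illustration:

```python
def hter(n_f2t, n_f, n_t2f, n_t):
    """Return (HTER, FAR, FRR) from raw decision counts.

    n_f2t: fake faces judged as real;  n_f: total fake faces tested
    n_t2f: real faces judged as fake;  n_t: total real faces tested
    """
    far = n_f2t / n_f
    frr = n_t2f / n_t
    return (far + frr) / 2.0, far, frr


# Illustrative counts only, not results from the paper.
value, far, frr = hter(n_f2t=30, n_f=1000, n_t2f=10, n_t=200)
```

Because FAR and FRR are averaged with equal weight, HTER is not dominated by the larger class, which matters when a database contains many more fake videos than real ones.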

Ablation Experiment
In order to verify the effect of each branch of the algorithm, the corresponding ablation experiments are carried out in this section. The model was trained on the DFD (C23) database and then tested on the DFD (C23), FF++ (C0), FF++ (C23) and TIMIT databases, with HTER as the evaluation index. The experimental results are shown in Table 1. It can be seen from Table 1 that the complete method proposed in this paper performs best in both in-database and cross-database tests, followed by the network that directly fuses the spatial, frequency and PLGF features without the double-layer LSTM network. Each of the three branches used alone performs worse than the fused features. Using the spatial-domain image alone performs better than using either of the other two images alone, while using the frequency domain image alone performs worst. In summary, each branch of the multi-feature fusion method proposed in this paper contributes to its performance.

Comparing with Other Algorithms
In order to further verify the performance of the proposed algorithm, this section compares the proposed method with deep forgery face detection methods published in recent years. All methods are trained on DFD (C23) and on FF++ (C0 and C23), respectively, and then tested on DFD (C23), FF++ (C0), FF++ (C23) and TIMIT, with HTER as the evaluation index. The experimental results are shown in Tables 2 and 3.
From Tables 2 and 3, it can be seen that when DFD (C23) is used for training, the proposed algorithm performs best in in-database detection and in most cross-database tests; only when FF++ (C0) is used as the test set does it fall behind MISLnet. When FF++ (C0 and C23) is used for training, the proposed algorithm is not the best but remains competitive, which indicates the robustness and generalization ability of the algorithm.

Conclusions
To address the low detection accuracy, poor generalization performance and weak anti-interference ability of existing deep network detection algorithms for face-swapped video tampering, this paper proposes a deep forgery face video detection method based on multi-feature fusion. The spatial domain, frequency domain and PLGF feature information of face images is extracted by the Xception network, and the time domain feature information of face images is extracted by the double-layer LSTM network. The experimental results show that the proposed method has good in-database and cross-database detection performance and strong generalization ability.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest: The authors declare no conflict of interest.