3.1. Dataset Preprocessing
We analyzed the process by which DeepFake generates fake faces. The principle of DeepFake is shown in Figure 2. The core idea of DeepFake is the parallel training of two autoencoders. Its principle can be summarized as follows: a neural network is trained by supervised learning to restore a distorted face of person A to the original face, and the network is then expected to be able to restore any face to the face of A.
As shown in Figure 2, the source image is in the first rectangle, marked "Part 1"; it will become the background head of the final generated image. After a series of processing steps such as face detection, face landmark detection, and face alignment, its transformation matrix can be calculated. The target image is in the second rectangle, marked "Part 2"; it will be transformed into the foreground face of the final generated image. The target image goes through the same processing. Using the transformation matrix of the source image, the target image can be warped to obtain the synthetic forged image. After some post-processing, such as boundary smoothing to blend the composite, the final synthetic image is generated.
Due to limitations of computing resources and production time, the DeepFake algorithm can only generate faces at limited resolutions; it must then apply affine transformations such as scaling, rotation, and shearing to the generated images to match and cover the original faces they replace. This results in two different image compression ratios between the fake face area in the foreground and the original area in the background, which leaves obvious counterfeit traces.
Our purpose here was to detect the artifacts introduced by the affine transformation steps in DeepFake production. In addition, since DeepFake needs to be trained for each pair of videos, which is time-consuming and resource-demanding, we did not use the DeepFake algorithm to create negative examples. Instead, we simplified the process of generating negative examples by simulating the process of generating faces in DeepFake.
Specifically, we took the following steps to generate negative examples, as shown in Figure 3. First, we detected faces in the original images and extracted face landmarks from each detected facial area. Then, according to the landmarks, we aligned the face and calculated the transformation matrix. Facial landmarks encode the location of important facial structures such as the contours of the eyes, nose, and mouth. Differing facial orientations across images interfere with facial analysis, so we needed to align the facial area to a unified coordinate space according to the detected facial landmarks. That is to say, we adjusted a profile face, or a face at an angle, to a standard frontal face, so that the aligned face was in the center of the image, the eyes lay on a horizontal line, and the face was zoomed to a similar size.
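As a concrete illustration, the following Python sketch shows one way to implement the detection, landmark extraction, and alignment step using the dlib 68-point shape predictor and OpenCV. The paper does not name specific libraries, so these choices, the function names, and the canonical landmark `template` are assumptions.

```python
# A minimal sketch of the detect -> landmark -> align step (illustrative,
# assuming dlib's frontal face detector and 68-point shape predictor).
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(image_bgr):
    """Return a (68, 2) array of landmark coordinates for the first face."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float64)

def align_face(image_bgr, landmarks, template, size=256):
    """Warp the face so its landmarks match a canonical frontal template.

    `template` is a (68, 2) array of reference landmark positions in the
    unified coordinate space (eyes on a horizontal line, face centered).
    """
    # Estimate the 2x3 similarity (scale + rotation + translation) matrix
    # by least squares, as described below.
    matrix, _ = cv2.estimateAffinePartial2D(landmarks, template)
    aligned = cv2.warpAffine(image_bgr, matrix, (size, size))
    return aligned, matrix
```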
The transformation matrix is acquired by face alignment, which in essence finds the affine transformation from shape $A$ to shape $B$ by the least squares method. The two shape matrices of face landmarks can be denoted $A$ and $B$, respectively. Each row of a matrix holds the $x$ and $y$ coordinates of one facial feature point. The face has 68 landmarks, so $A$ and $B$ can be written as follows:

$$A = \begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ \vdots & \vdots \\ x_{68} & y_{68} \end{bmatrix}, \qquad B = \begin{bmatrix} x'_1 & y'_1 \\ x'_2 & y'_2 \\ \vdots & \vdots \\ x'_{68} & y'_{68} \end{bmatrix}$$
The above problem can be described as follows: two matrices $A$ and $B$ are known, and $B$ can be obtained from $A$ by the transformation $B = sRA + T$. Our goal was to calculate the zoom scale $s$, rotation $R$, and translation displacement $T$ for which the residual is minimized. Thus, the process of solving the affine matrix can be written in the following mathematical form:

$$\arg\min_{s, R, T} \sum_{i=1}^{68} \left\| sRA_i^T + T - B_i^T \right\|^2$$
where $A_i$ is the $i$-th row of the $A$ matrix; $R$ is a $2 \times 2$ orthogonal matrix; and the superscript $T$ on $A_i$ and $B_i$ represents the transposition of the matrix. The matrix form is as follows:

$$\arg\min_{s, R, T} \left\| sAR^T + T - B \right\|_F^2$$
where $\| \cdot \|_F$ represents the Frobenius norm, whose square is the sum of the squares of all matrix entries.
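This least-squares problem has a well-known closed-form solution via the singular value decomposition (the orthogonal Procrustes solution). The paper does not state which solver was used, so the following NumPy sketch is only illustrative:

```python
import numpy as np

def similarity_transform(A, B):
    """Least-squares similarity transform (s, R, T) with B_i ~ s * A_i @ R.T + T.

    A, B: (68, 2) landmark matrices as defined above. Solved in closed form
    via SVD; the degenerate reflection case is ignored for brevity.
    """
    mu_A, mu_B = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - mu_A, B - mu_B            # remove translation
    s_A = np.sqrt((A0 ** 2).sum() / len(A))
    s_B = np.sqrt((B0 ** 2).sum() / len(B))
    A0, B0 = A0 / s_A, B0 / s_B            # remove scale
    U, _, Vt = np.linalg.svd(B0.T @ A0)    # best rotation aligning A0 to B0
    R = U @ Vt                             # 2x2 orthogonal matrix
    s = s_B / s_A                          # zoom scale
    T = mu_B - s * (mu_A @ R.T)            # translation displacement
    return s, R, T
```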
After the previous steps, we applied Gaussian blur to the aligned face. Then, using the inverse of the transformation matrix, the face was warped back to its original angle and pasted over the original face region.
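A minimal sketch of this negative-example step is given below, assuming the 2 × 3 affine matrix from the alignment stage; the blur strength is an illustrative choice, as the paper does not give its kernel parameters.

```python
import cv2
import numpy as np

def make_negative_example(original_bgr, aligned_face, matrix, blur_sigma=3):
    """Simulate a DeepFake-style splice: blur the aligned face, warp it back
    with the inverse of the alignment matrix, and paste it over the original.

    `matrix` is the 2x3 affine matrix from the alignment step.
    """
    h, w = original_bgr.shape[:2]
    blurred = cv2.GaussianBlur(aligned_face, (0, 0), blur_sigma)
    # WARP_INVERSE_MAP applies the inverse of `matrix`, mapping the aligned
    # face back to its original position, angle, and scale.
    warped = cv2.warpAffine(blurred, matrix, (w, h),
                            flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    mask = cv2.warpAffine(np.ones_like(aligned_face) * 255, matrix, (w, h),
                          flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    out = original_bgr.copy()
    out[mask > 0] = warped[mask > 0]       # cover the original face area
    return out
```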
In addition, we also needed to preprocess the images before training. Since the input of the convolutional neural network is 128 × 128 pixels, the image area is limited. Therefore, it is important to retain the most effective and prominent signs of the forged area as our region of interest (ROI). We determined the ROI according to the face landmarks. For convenience of description in this paper, the rectangular region formed by the convex hull of the facial landmarks (excluding the contour of the cheek) is called the minimum circumscribed rectangle.
As analyzed above, the affine transformation mainly affects the inner region of the minimum circumscribed rectangle. As a result, there is an obvious contrast between the inner region and the adjacent outer region of the rectangle, where visible forgery marks may appear. Therefore, we chose to retain a slightly larger rectangular area, composed of the minimum circumscribed rectangle and its surrounding margin, as our ROI, and discarded the rest of the image. Specifically, for all positive and negative examples in the dataset, only this ROI, slightly larger than the minimum circumscribed rectangle, was preserved.
The two rectangular regions are represented as follows:

$$\begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \end{bmatrix}, \qquad \begin{bmatrix} x'_1 & y'_1 \\ x'_2 & y'_2 \end{bmatrix}$$

wherein the left matrix represents the minimum circumscribed rectangle that covers all of the facial landmarks (except the contour of the cheek), and corresponds to the green rectangle in the left image of Figure 4. $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of two diagonal vertices of the minimum circumscribed rectangle; its height and width are denoted $h$ and $w$, respectively. The right matrix represents the ROI rectangle that we want to retain. $(x'_1, y'_1)$ and $(x'_2, y'_2)$ are the coordinates of two diagonal vertices of the ROI rectangle, which is slightly larger than the minimum circumscribed rectangle and corresponds to the light yellow rectangle in the left image of Figure 4.
The conversion relationship between the two rectangular regions is as follows:

$$x'_1 = x_1 - \beta, \quad y'_1 = y_1 - \alpha, \quad x'_2 = x_2 + \beta, \quad y'_2 = y_2 + \alpha$$

wherein the variables $\alpha$ and $\beta$ are random values drawn from $[0, h/5]$ and $[0, w/8]$, respectively.
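This conversion can be implemented in a few lines. The assignment of the $h/5$ margin to the vertical axis and the $w/8$ margin to the horizontal axis below is our reading of the formula above, and the exclusion of dlib points 0–16 (the jawline) reflects "except the contour of the cheek":

```python
import numpy as np

def random_roi(landmarks, img_h, img_w, rng=np.random):
    """Expand the minimum circumscribed rectangle of the inner landmarks
    by random margins: alpha in [0, h/5] (vertical), beta in [0, w/8]
    (horizontal). `landmarks` excludes the cheek-contour points.
    """
    x1, y1 = landmarks.min(axis=0)
    x2, y2 = landmarks.max(axis=0)
    h, w = y2 - y1, x2 - x1                # height and width of the rectangle
    alpha = rng.uniform(0, h / 5)          # vertical margin
    beta = rng.uniform(0, w / 8)           # horizontal margin
    x1r = int(max(x1 - beta, 0))           # clip to the image bounds
    y1r = int(max(y1 - alpha, 0))
    x2r = int(min(x2 + beta, img_w - 1))
    y2r = int(min(y2 + alpha, img_h - 1))
    return x1r, y1r, x2r, y2r
```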
After selecting the ROI, we removed the rest of the image. In order to simulate the various resolutions of face affine transforms found in practice, we aligned the faces at multiple scales and randomly selected one scale to enlarge the training diversity, as can be seen in Figure 4. The green rectangle represents the minimum circumscribed rectangle that covers all of the facial landmarks (except the contour of the cheek). The light yellow rectangle represents one of the ROI rectangles that we wanted to retain. The orange rectangle represents the maximal circumscribed rectangle that may be retained. The images in the second column of Figure 4 show several different ROI results.
We also used image augmentation to simulate the different post-processing techniques that may exist in the DeepFake process, as can be seen in the third column of Figure 4; a sketch of such a pipeline follows below. Specifically, for all images in the training dataset, we applied augmentations consisting mainly of shape transformations (such as rotation, scaling, flipping, and translation) and color jittering (such as brightness, contrast, color distortion, and sharpness). We selected random values to produce different effects, so that the images were slightly different in each epoch, increasing the diversity of the training dataset. This also varies the shape of the affine-transformed face area, helping the model cope with different post-processing techniques. After all of the above steps, the image was resized to 128 × 128 for the subsequent ELA processing.
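A sketch of such an augmentation pipeline using Pillow is given below; the parameter ranges are illustrative choices, not the paper's exact values.

```python
import random
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Randomized shape transforms and color jittering, re-sampled each epoch.
    All ranges are illustrative; the paper does not specify its values.
    """
    # Shape transforms: flipping, rotation, scaling.
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
    img = img.rotate(random.uniform(-10, 10), resample=Image.BILINEAR)
    scale = random.uniform(0.9, 1.1)
    img = img.resize((int(img.width * scale), int(img.height * scale)),
                     Image.BILINEAR)
    # Color jittering: brightness, contrast, color, sharpness.
    for enhancer in (ImageEnhance.Brightness, ImageEnhance.Contrast,
                     ImageEnhance.Color, ImageEnhance.Sharpness):
        img = enhancer(img).enhance(random.uniform(0.8, 1.2))
    # Final resize to the CNN input size.
    return img.resize((128, 128), Image.BILINEAR)
```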
3.2. The Error Level Analysis Processing
The error level analysis (ELA) method [18] is a technique for detecting whether an image has been tampered with. ELA measures the compression distortion introduced by lossy image compression. The method detects tampered images by re-saving the image at a specific quality level and computing the difference between compression levels. Local minima in the image difference indicate original regions, and local maxima indicate tampered regions. Typically, this method is applied to images in lossy compression formats such as JPEG.
When saved in JPEG format, images undergo independent lossy compression in units of 8 × 8 pixels. Consequently, there are significant differences between the ELA of an original area and that of a spliced or modified one. If the image has been modified, the compression differences of the 8 × 8 pixel regions are no longer similar. We therefore examine the "compression feature" of the tested image on an 8 × 8 pixel grid. If the image was saved as a whole, the compression features of adjacent grid cells form an approximately uniform high-frequency white distribution. Instead, if it was saved after editing or modification, the ELA distribution across the grid shows obviously different characteristics, appearing as a discontinuous high-frequency white distribution. The more times an image has been re-saved or edited, the lower its ELA values.
We used the ELA method to process the input image as follows (a Pillow sketch follows the list):
1. Save the original image, and compress the input image according to the specified quality factor to generate a new image.
2. Calculate the absolute value of the difference between the two images pixel by pixel, generating a difference image.
3. From the largest pixel value of the difference image, obtain the enhancement factor.
4. Adjust the brightness of the difference image according to the enhancement factor, generating the final enhanced ELA image.
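The four steps map directly onto Pillow's `ImageChops` and `ImageEnhance` modules. The quality factor of 90 below is an illustrative choice, as the paper does not specify its value here:

```python
from PIL import Image, ImageChops, ImageEnhance

def ela_image(path, quality=90, out_path="ela.png"):
    """Error level analysis following the four steps above."""
    original = Image.open(path).convert("RGB")
    # Step 1: recompress the image at the specified JPEG quality factor.
    original.save("_resaved.jpg", "JPEG", quality=quality)
    resaved = Image.open("_resaved.jpg")
    # Step 2: pixel-by-pixel absolute difference.
    diff = ImageChops.difference(original, resaved)
    # Step 3: enhancement factor from the largest pixel value of the difference.
    extrema = diff.getextrema()                  # per-channel (min, max)
    max_diff = max(ch_max for _, ch_max in extrema) or 1
    factor = 255.0 / max_diff
    # Step 4: brighten the difference image by the enhancement factor.
    ela = ImageEnhance.Brightness(diff).enhance(factor)
    ela.save(out_path)
    return ela
```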
The effect of ELA processing is shown in Figure 5. The images in the first row are an original image and its ELA image; the compression ratio across the whole image remains consistent. The images in the second row are a tampered image and its ELA image; the compression ratios of the tampered face in the foreground and the original image in the background are quite different.
As shown in Figure 5, the ELA method can be used to detect whether an image has been tampered with. However, the method also has the following problems. First, it is only applicable to the compression distortion of lossy formats such as JPEG and cannot be used to detect tampering in losslessly compressed images; all of our ELA tests compare against JPEG lossy compression. Second, the ELA method can only roughly indicate which areas of an image have been processed, which makes low-quality images harder to distinguish.
3.3. CNN Architectures
A convolutional neural network (CNN) is a kind of feedforward neural network with a deep structure and convolution computations. It is inspired by the human visual nervous system and has two major characteristics: it can effectively reduce a large amount of data to a smaller representation, and it can effectively retain the characteristics of a picture, in line with the principles of image processing. CNNs have achieved great success in digital image processing tasks such as object detection [19], face detection [20], face recognition [21], video classification [22], super resolution [23], and so on.
A typical CNN consists of three parts: convolution layers, pooling layers, and fully connected layers. Generally speaking, the convolution layer is responsible for extracting local features from the input image; the pooling layer greatly reduces the order of magnitude of the parameters (dimensionality reduction); and the fully connected layer, similar to a traditional neural network, outputs the desired results. From the point of view of signal processing, the convolution operation in the convolution layer is a filter (convolution kernel) acting on the frequencies of the signal, and training the CNN amounts to finding the best filters so that the filtered signal is easier to classify. From the point of view of template matching, each convolution kernel can be regarded as a feature template, and training obtains suitable filters so that a specific pattern is highly activated, achieving classification or detection. Unlike a fixed image convolution, a convolution layer of a CNN can have multiple filters to obtain different feature maps, and the values of each filter are not fixed but trainable.
The CNN draws on the working principles of the human visual system. A convolutional neural network first obtains low-level features by finding the edges or curves of the input image, and then aggregates these low-level features into higher-level ones through a series of convolution layers. As the high-level features are composed of multiple low-level features, they can cover more information from the original image. The CNN architecture used in our method is described in Figure 6.
For the details of the deep learning model, we used the following settings. The first CNN layer consists of a convolutional layer with 32 filters. The second CNN layer consists of a convolutional layer with 32 filters followed by a max-pooling layer. Both convolution layers use the Glorot uniform kernel initializer and the ReLU activation function, so that the neurons in the convolution layers can receive useful signals from the input data. After that, a dropout of 0.25 is added after the max-pooling layer to prevent over-fitting. The next layer is a fully connected layer with 256 neurons and ReLU activations, after which a dropout of 0.5 is added to prevent over-fitting.
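A sketch of this architecture in tf.keras is given below. The kernel sizes were not recoverable from the text, so the 3 × 3 convolutions and 2 × 2 pooling are assumptions, as are the three-channel 128 × 128 input and the two-class output; the filter counts, dropout rates, initializer, and activations follow the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",          # kernel size assumed
                  kernel_initializer="glorot_uniform",
                  input_shape=(128, 128, 3)),             # ELA image, assumed 3-channel
    layers.Conv2D(32, (3, 3), activation="relu",          # kernel size assumed
                  kernel_initializer="glorot_uniform"),
    layers.MaxPooling2D((2, 2)),                          # pool size assumed
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),                # real vs. fake
])
# Compiled with the Rmsprop optimizer and softmax cross-entropy loss
# described in the rest of this section.
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```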
The root mean square prop (Rmsprop) optimizer is one of the adaptive learning rate methods. Rmsprop uses the same concept of an exponentially weighted average of the gradients as gradient descent with momentum, but differs in how the parameters are updated. It limits oscillations in the vertical direction, so that the algorithm can take larger steps in the horizontal direction and converge faster.
In Rmsprop, instead of using $dW$ and $db$ independently for each iteration, we take the exponentially weighted averages of the squares of $dW$ and $db$:

$$S_{dW} = \beta S_{dW} + (1 - \beta)\, dW^2$$
$$S_{db} = \beta S_{db} + (1 - \beta)\, db^2$$

In the above formulas, $S_{dW}$ and $S_{db}$ are the squared-gradient momenta accumulated from the loss function over the first $t$ iterations, respectively. $\beta$ is another hyperparameter taking values from 0 to 1; it sets the weight between the average of the previous values and the square of the current gradient when calculating the new weighted average.
After calculating the exponentially weighted averages, we update the parameters. The difference is that the Rmsprop algorithm divides the gradient by the square root of the weighted average of its squares. This is beneficial for eliminating directions with a large swing amplitude, modifying the swing amplitude so that it is smaller in each dimension, and it also makes the network converge faster:

$$W = W - \eta \frac{dW}{\sqrt{S_{dW}} + \epsilon}, \qquad b = b - \eta \frac{db}{\sqrt{S_{db}} + \epsilon}$$

wherein $\eta$ is the learning rate. In order to prevent the denominator from being zero, a very small smoothing value $\epsilon$ is used, generally $10^{-8}$. In the above formula, $S_{dW}$ is relatively small, so that here we divide $dW$ by a relatively small number, whereas $S_{db}$ is relatively large, so that here we divide $db$ by a relatively large number to slow down the updates in the vertical dimension.
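The two update rules above combine into a few lines of NumPy. The default learning rate and $\beta$ below are conventional values, not necessarily those used in the paper:

```python
import numpy as np

def rmsprop_update(w, dw, s_dw, lr=0.001, beta=0.9, eps=1e-8):
    """One Rmsprop step for a parameter w with gradient dw.

    s_dw carries the exponentially weighted average of squared gradients
    across iterations; eps prevents division by zero.
    """
    s_dw = beta * s_dw + (1 - beta) * dw ** 2   # accumulate squared gradient
    w = w - lr * dw / (np.sqrt(s_dw) + eps)     # damped, per-dimension update
    return w, s_dw
```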
The output layer uses the softmax loss function, which is composed of the softmax classifier and the cross-entropy loss function. Softmax normalizes the classification predictions and yields the probability distribution of a sample point belonging to each category. For example, the probability of belonging to category $j$ is:

$$p_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$

where $z_k$ is the network output for category $k$ and $K$ is the number of categories.
The above formula is the softmax function. The result satisfies the normalization requirements of a probability distribution: the output probability of every category is non-negative, and the output probabilities of all categories sum to 1.
Kullback–Leibler (KL) divergence, also known as relative entropy, can be used to measure the difference between two separate distributions $p$ and $q$, and is written as $D_{KL}(p \| q)$. In the context of machine learning, $D_{KL}(p \| q)$ is often called the information gain achieved if $p$ is used instead of $q$:

$$D_{KL}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

where $x$ ranges over all possibilities of the event. In machine learning, $p$ is often used to represent the real distribution of the samples; for example, [1, 0, 0] indicates that the current sample belongs to the first category. $q$ is used to represent the distribution predicted by the model, such as [0.7, 0.2, 0.1]. The smaller the value of $D_{KL}(p \| q)$, the closer the distributions $p$ and $q$.
From the perspective of information theory, minimizing the cross-entropy loss can be seen as minimizing the KL divergence between the real distribution $p$ and the predicted probability distribution $q$. By expanding the above formula, we can get

$$D_{KL}(p \| q) = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x)$$

The former part of the equation is the (negative) entropy of $p$, and the latter part is the cross-entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
In machine learning, we need to evaluate the gap between the labels and the predictions, and KL divergence is exactly suited to this, since $D_{KL}(p \| q) = -H(p) + H(p, q)$. As the former part of the KL divergence, the entropy of $p$, is constant for fixed labels, we only need to pay attention to the cross-entropy during optimization. Therefore, in machine learning, and especially in neural network classification problems, cross-entropy is used directly as the loss and evaluation of the model.
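The following NumPy snippet reproduces this computation on the example distributions from the text above; the `eps` guard is a small implementation convenience:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -np.sum(p * np.log(q + eps))

p = np.array([1.0, 0.0, 0.0])        # real distribution (one-hot label)
q = np.array([0.7, 0.2, 0.1])        # model prediction from the text
print(cross_entropy(p, q))           # = -ln(0.7), about 0.357; lower is closer
```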