1. Introduction
Manipulated media can be weaponized to destabilize political processes [1,2], put words into a politician's mouth [3], or control their movements and thereby steer public opinion [4,5]. It can also be used to disgrace, smear or blackmail innocent individuals by inserting their faces into non-consensual sex videos, known as revenge porn [6,7]. The technology becomes especially harmful when used to place the face of an innocent person at a crime scene [8], scam people out of their money [9] or create realistic fingerprints that unlock devices and invade people's privacy [10].
The generation of believable synthetic media has advanced rapidly since late 2017, when a Reddit user named "deepfakes" created a deep learning model that implants the faces of famous actresses into pornographic videos [11]. In 2019, a deepfake bot called DeepNude was released that synthesized naked female bodies and attached them to victims; statistics showed that 63% of its users preferred to undress women they knew personally, demonstrating how destructive such technology can be in malicious hands [12].
Despite these destructive uses, researchers have applied the same technology to beneficial applications such as dubbing foreign movies [13], recreating historical figures to illustrate educational content [14], simulating the lost voices of Amyotrophic Lateral Sclerosis (ALS) patients [15] and easing the online interactions we have relied on since COVID-19 by reenacting our live expressions onto photos or videos of our choice during meetings [16].
In most cases, deepfakes are created using variations or combinations of encoder-decoder networks [17,18] and Generative Adversarial Networks (GANs) [19]. A GAN consists of two neural networks, a generator and a discriminator. The generator learns a transform function that maps a simple random variable to an N-dimensional vector following a predefined target distribution. The discriminator takes the generator's output along with real input images and returns an assessment of how real the generated sample is. This assessment is then used to update the weights of both the generator and the discriminator. The overall objective is to trick the discriminator into classifying the generated fake image as real, which is finally produced as the output of the GAN.
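To make the adversarial setup concrete, the following minimal PyTorch sketch shows how the two losses drive both networks; the toy architectures, latent size and hyperparameters are illustrative assumptions, not those of any specific deepfake creation tool.

```python
import torch
import torch.nn as nn

# Illustrative toy networks; real deepfake generators are far larger.
latent_dim, img_dim = 100, 64 * 64 * 3
G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    b = real_batch.size(0)
    # Discriminator: label real images 1 and generated images 0.
    z = torch.randn(b, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(b, 1)) + bce(D(fake), torch.zeros(b, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator: try to make the discriminator classify its output as real.
    z = torch.randn(b, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(b, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```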
Harmful deepfakes can be categorized into two types: replacement and reenactment. Replacement, or face swap, replaces the face of someone who may have been involved in inappropriate actions with the face of a target innocent person. Several creation techniques have been deployed, such as Deepfakes [20], DeepFaceLab [21], FaceSwap [22], Faceswap-GAN [23], RSGAN [24], FSNet [25] and FaceShifter [8]. Reenactment is the second type, in which a source video/image/audio drives the facial expressions, mouth or full body of a target medium. Techniques such as Face2Face [26], HeadOn [27], Recycle-GAN [28] and Deep Video Portraits [4] have made expression reenactment, or puppet-master forgery, very easy. Mouth reenactment, or lip-sync, puts words into another person's mouth; Synthesizing Obama [3] and ObamaNet [29] are prominent examples. The last type of reenactment is body reenactment, which drives the movements of someone else's body; Liu et al. [30] and Everybody Dance Now [5] are good examples of this method.
1.1. Motivation and Objective
Advances in GANs and other creation methods now generate believable fake media that can have a severe harmful impact on society. At the same time, current deepfake detection techniques struggle to keep up with the evolution of creation methods, creating demand for a detector that generalizes to media created by any technique. This demand motivated the present research, whose objective is a deepfake video/image detector that generalizes across the creation techniques represented in recent challenging datasets and outperforms current state-of-the-art methods on different performance measures.
1.2. Research Contributions
In this work, a new deepfake detection model (iCaps-Dfake) is introduced to support the fight against this destructive phenomenon. The contributions of this work can be stated as follows:
A capsule network (CapsNet) is integrated with an enhanced concurrent routing technique for classification, providing superior feature abstraction and representation capabilities without requiring large amounts of data and with only an average number of parameters.
Two different feature extraction concepts are combined. The first is texture-based, using Local Binary Patterns (LBP) to point out the differences between the textures of real and forged parts of the image. The second is a convolutional neural network (CNN) based method with a modification introduced to the High-Resolution Network (HRNet), which has previously achieved strong results in applications such as pose estimation, semantic segmentation, object detection and image classification. Our modification of the HRNet outputs informative feature representations with strong semantics while preserving the CapsNet principle of not losing any spatial information.
For face detection, You Only Look Once (YOLO v3) is utilized, resulting in very few false positives compared to other approaches and thus enhancing data quality.
Data preprocessing is performed to minimize the noise in the faces fed to the HRNet, further enhancing data quality.
The rest of this paper is organized as follows. A literature review of deepfake detection methods is presented in Section 2. A detailed explanation of the proposed model is given in Section 3. Section 4 demonstrates the experimental results, and finally Section 5 presents a discussion of the proposed work.
2. Literature Review
Previous attempts to detect deepfakes follow one of two general approaches: artifact-specific detectors, which spot artifacts left by different creation methods, or undirected approaches, which try to achieve generalization by applying different deep learning models. The artifact-specific models can be further categorized into spatial models, which are concerned with environment, forensics and blending artifacts, and temporal models, which spot synchronization, behavior, coherence and physiology changes.
The environment spatial artifact detection models depend on the aberrant content of the fake face when compared to its surroundings. For example, FWA [31] addressed deepfake creations with low image resolutions that show artifacts when warped to fit the target face, building a model that combined four CNN models: VGG16, ResNet50, ResNet101 and ResNet152. DSP-FWA [32] extended the FWA method [31] with a pooling module to better detect the difference in resolution between the target face and its surroundings, and its test results improved on the FWA model. Another example is Nirkin et al. [33], who built a network that first splits the source image into face and hair/context, then generates three encodings: the whole source, the face alone and the hair/context. The source encoding is concatenated with the difference between the other two encodings and passed to a decoder for classification. The model was tested on the FaceForensics++ (FF++) [34], Celeb-DF [35] and DFDC-P [36] datasets and achieved comparable results.
Forensics spatial detection models analyze fine features and patterns caused by the creation models. Koopman et al. [37] analyzed the camera's unique sensor noise, called photo response non-uniformity (PRNU), to detect stitched content. They created their own dataset of only 26 videos, making their results specific to the noise generated by their own camera. In Two-branch [38], Masi et al. produced promising results by focusing on residuals, proposing a recurrent network with two branches: one amplifies and enhances the frequencies while the other transfers the original information with fine-tuning in the color domain. As a final example, HeadPose [39] looked for imperfections rather than residuals and applied a Support Vector Machine (SVM) model to detect inconsistent head poses in 3D.
The final spatial detection technique is blending, which looks for artifacts resulting from blending the face back onto the frame. In [40], the authors performed a frequency analysis using the Discrete Fourier Transform (DFT) together with azimuthal averaging to emphasize artifacts to the learner, then fed the output to two supervised algorithms, Support Vector Machine (SVM) and Logistic Regression (LR), along with an unsupervised algorithm, K-means clustering. The authors tested their model on the FF++ dataset.
Temporal lip-sync deepfakes can be detected by comparing the synchronization between speech and mouth landmarks, as achieved in [41,42]. Agarwal et al. [43] also exploited the irregularities between the dynamics of the mouth shape (visemes) and the spoken phonemes, focusing on the letters M, B and P. Behavioral anomalies can be detected as in Mittal et al. [44], who trained a Siamese network with a triplet loss to jointly process audio and video inputs and perceive emotions from both for deepfake detection. Coherence detection tests the coherence between consecutive video frames; some detectors use recurrent neural network (RNN) based models to predict fake videos. In [45], Guera et al. detected flicker and jitter artifacts by applying an RNN, while Sabir et al. [46] applied an LSTM specifically to the face area and tested their model only on FF++. Physiological detection is based on the hypothesis that generated content will not carry the same physiological signals as real content. Li et al. [47] detected the temporal pattern of blinking in early deepfakes using a Long-term Recurrent Convolutional Network (LRCN), since fake eyes blink less frequently than real ones. Another physiological detector is Gaze Tracking [48], where the authors trained a simple CNN on extracted 3D eye and gaze features. The problem with physiological methods is that they can easily be evaded if the creator adds a simple component, such as a GAN discriminator, that searches for these specific biological signals.
Undirected approaches detect deepfakes by applying deep learning models to extract features and find patterns that separate real from fake. Such models can then generalize to deepfakes created by any method rather than the artifacts left by a specific tool. In Two-Stream [49], Zhou et al. combined the feature extraction models of GoogLeNet and a patch-based triplet network to detect face artifacts and enhance the local noise residuals. MesoNet [50] targets unseen image properties through two CNN variants: Meso4, which uses conventional convolutional layers, and MesoInception4, which is based on Inception modules [51]. Artifacts in the eyes, teeth and facial contours of fake faces were addressed by VisualArtifacts (VA) [52], with two variants: a multilayer feedforward neural network (VA-MLP) and a logistic regression model (VA-LogReg). Another model, Multi-task [53], used a CNN in a multi-task learning setting to classify fake videos and localize manipulated areas. In FakeSpotter [54], Wang et al. overcame noise and distortions by monitoring the neuron activation patterns of each layer of a face recognition network to capture the fine features that help detect fakes. They tested their model on FF++, DFDC [55] and Celeb-DF [35] and produced comparable results. In DeepfakeStack [56], the authors followed a greedy layer-wise pretraining technique to train seven deep learning models (base-learners) initialized with ImageNet weights, which was computationally very expensive. They used a Stacking Ensemble (SE) and trained a CNN as a meta-learner to enhance the output of the base-learners, training and testing on the FF++ dataset.
Capsule networks have also been used, as an undirected approach, in different deepfake detection techniques. First introduced by Hinton et al. [57], they have recently shown exceptional capabilities in feature reduction and representation [58,59]. Nguyen et al. were involved in two approaches [60,61]: in the first, they created a network of three primary and two output capsules fed with latent features extracted by VGG-19 [62], adding a statistical pooling layer to handle forgery detection. Their second approach also used capsules to segment manipulated regions. Both models were trained and tested on FF++ and achieved a very high detection rate [35].
The proposed model combines the artifact-based and undirected approaches: it exploits environmental artifacts by using LBP for texture analysis and uses a deep learning-based model, HRNet, to automatically extract informative multi-resolution feature representations. Together with a capsule network as the classifier, it outperforms previous methods and achieves generalization.
3. Materials and Methods
The block diagram of the proposed iCaps-Dfake model is presented in Figure 1, showing three main stages: data preparation, feature extraction and classification. Each stage is explained in detail in the following subsections.
3.1. Data Preparation
Though some researchers use the Multi-task Cascaded Convolutional Network (MTCNN) [63] to detect faces, as did the winners of the Facebook Deepfake Detection Challenge [55], experiments have shown that MTCNN produces many false positives assigned more than 90% probability of being a face. This requires an extra data-cleaning step before training the network, which is both time consuming and performance degrading, so YOLO v3 [64] is chosen instead for better data quality. Figure 2 shows samples of detections that MTCNN wrongly considered faces with high confidence in crowded frames.
The first stage of iCaps-Dfake is data preparation, where faces for the network to train on are detected from the input video using YOLO v3. To downsample from nearly 2.2M training frames in the dataset, a sliding-window scheme for keyframe selection is followed with a window width of one. As shown in Figure 3, the first frame is selected as a start, then the window is slid by N frames to take the next. In each selected frame, the largest face is detected.
To detect the face, YOLO v3 performs a single forward pass in which a CNN is applied once to the selected frame to deduce face scores. The frame is split into an M × M grid, and a confidence score is calculated for each of the B bounding boxes contained in each cell, reflecting the certainty of having a face at its center. Each box is represented by five measurements (x, y, w, h and confidence), where (x, y) are the coordinates of the face center, w and h are the width and height, and the confidence is the probability that the box contains a face.
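The keyframe sampling and largest-face selection can be sketched as follows; the `detect_faces` callable is a hypothetical wrapper around a YOLO v3 face model, and the stride value and box format are assumptions for illustration.

```python
import cv2

def sample_largest_faces(video_path, detect_faces, stride_n=10):
    """Keyframe sampling sketch: take every N-th frame and keep its largest face.

    `detect_faces(frame)` is assumed to wrap a YOLO v3 face detector and return
    a list of (x, y, w, h, confidence) boxes; any detector with that contract works.
    """
    cap = cv2.VideoCapture(video_path)
    faces, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride_n == 0:                      # slide the selection window by N frames
            boxes = detect_faces(frame)
            if boxes:
                x, y, w, h, _ = max(boxes, key=lambda b: b[2] * b[3])  # largest area
                faces.append(frame[y:y + h, x:x + w])
        idx += 1
    cap.release()
    return faces
```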
3.2. Feature Extraction
To extract features from the faces detected by YOLO, two different techniques are applied: a CNN-based method and a texture-based analysis method. Both methods are explained in detail in the following subsections.
3.2.1. CNN-based Feature Extraction
To enhance the feature extraction capability of the CNN, an extra preprocessing step was added. The following subsections explain the preprocessing and the feature extraction model.
Preprocessing
To train the HRNet [65], preprocessing of the extracted faces is required to provide the CNN with a variety of examples, thus improving the learning process. First, the detected faces are normalized and resized to 300 × 300, then randomly cropped to 224 × 224, enhancing the model's ability to detect fake content even if it exists in only a fraction of the face. Second, different random augmentations are applied, such as rotation, horizontal and vertical flipping, and changes to the color attributes of the image.
Figure 4 shows a sample of the HRNet input faces after preprocessing.
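A plausible version of this preprocessing pipeline is sketched below with torchvision; the exact augmentation magnitudes and normalization statistics are assumptions, since the text only names the operations.

```python
from torchvision import transforms

# Sketch of the face preprocessing for the HRNet branch; parameter values are assumed.
face_transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.RandomCrop(224),                 # forgeries may occupy only part of the face
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```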
HRNet
The basic concept of the HRNet is to avoid losing any spatial information, which fits the capsule network well, except that the original HRNet includes a pooling layer at the end. To align the two concepts, the pooling layer is removed from the HRNet and the raw feature vector it generates is used directly [58]. In this way, the HRNet keeps high-resolution representations across the whole feature extraction process by connecting high-to-low resolution convolutions in parallel and repeatedly performing fusions across the parallel convolutions, producing strong high-resolution representations. The output feature maps are obtained by adding the (upsampled) representations from all the parallel convolutions.
Figure 5 shows the construction of the HRNet and how the resolution changes throughout the network. It takes a 3 × 224 × 224 face and passes it through two 3 × 3 convolutions with a stride of 2, producing an output of shape 64 × 56 × 56. The HRNet consists of four stages, where each stage is a subnetwork containing a number of parallel branches. Each branch has half the resolution and double the number of channels of the previous one; if the last branch resolution is denoted C, then the branch resolutions are 8C at the first branch, followed by 4C, 2C and C. One branch consists of two residual blocks [66], each containing two 3 × 3 convolutions.
The first stage is a high-resolution subnetwork consisting of one residual block [66] with four different-size convolutions, and it outputs a 16 × 56 × 56 feature vector. The following three stages contain high-to-low resolution subnetworks added gradually one by one. Stage two has two branches representing two different resolutions: the first branch preserves the resolution of the first stage and propagates it through the remaining stages, while the second contains the downsampled resolution obtained by applying a 3 × 3 convolution with stride 2, yielding a feature vector of size 32 × 28 × 28 that also propagates to the end of the stages. A fusion layer between each pair of consecutive stages is responsible for adding the feature vectors coming from the parallel branches. To achieve this, all feature maps need to be of the same size, so different resolutions are either downsampled through strided convolutions or upsampled using simple nearest-neighbor sampling [67]. At stage four, the output of the first three branches is passed through the residual block used at stage one to regulate the number of channels so the feature maps can be added, obtaining a network output of size 512 × 14 × 14.
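The fusion mechanism described above can be illustrated with the toy PyTorch module below, which aligns channel counts with 1 × 1 convolutions, brings every branch to a common resolution by nearest-neighbor upsampling and sums them. It is a sketch of the fusion idea only; the channel counts are illustrative, and the actual model's final fusion produces the 512 × 14 × 14 output described in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBranches(nn.Module):
    """Toy HRNet-style fusion: align channels, resample every branch to the
    resolution of the first (highest-resolution) branch, and add them."""
    def __init__(self, channels=(16, 32, 64, 128)):
        super().__init__()
        # 1x1 convolutions regulate channel counts before adding, as in stage four.
        self.align = nn.ModuleList([nn.Conv2d(c, channels[0], 1) for c in channels])

    def forward(self, branches):            # e.g. maps of size 56, 28, 14 and 7
        target = branches[0].shape[-2:]
        fused = 0
        for conv, feat in zip(self.align, branches):
            feat = conv(feat)
            if feat.shape[-2:] != target:
                feat = F.interpolate(feat, size=target, mode="nearest")  # nearest-neighbor upsampling
            fused = fused + feat
        return fused
```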
3.2.2. Texture Analysis—LBP
To perform texture analysis, the LBP [68] of each channel of two color spaces (HSV and YCbCr) is extracted, and a histogram of each LBP is calculated. The extracted histograms are then resized and concatenated to the feature maps extracted from the HRNet. Figure 6 shows the steps needed to extract the histograms of the LBPs.
To perform color texture analysis, the luminance and chrominance components of the detected faces are extracted by converting them to both the HSV and YCbCr color spaces. The RGB (red, green, blue) color space is not useful in this analysis because of its imperfect separation of luminance and chrominance information and the high correlation between its color components.
In the HSV color space, Hue (H) and Saturation (S) represent chrominance while Value (V) corresponds to luminance. In the YCbCr color space, luminance is represented by the (Y) component, while (Cb) and (Cr) represent chrominance blue and chrominance red, respectively. The output of this step has shape 6 × 224 × 224, and the LBP of each channel is then calculated.
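The color-space conversion can be sketched with OpenCV as follows; the function name is hypothetical and assumes a 224 × 224 BGR face crop as input.

```python
import cv2
import numpy as np

def split_color_channels(face_bgr):
    """Convert a detected face (OpenCV BGR image) to HSV and YCbCr and stack
    the six channels for per-channel LBP extraction."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    ycrcb = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2YCrCb)   # OpenCV orders the channels Y, Cr, Cb
    channels = np.concatenate([hsv, ycrcb], axis=2)        # shape (224, 224, 6)
    return np.moveaxis(channels, 2, 0)                     # shape (6, 224, 224), as in the text
```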
To calculate the LBP, a local representation of texture is computed for one channel at a time by comparing each center point with its eight surrounding neighbors. The center point is used as a threshold: if the intensity of a neighbor is greater than or equal to that threshold, it is set to one, otherwise to zero. The resulting eight-bit pattern can take up to 256 different decimal values, which becomes the new value of the center point. The LBP output can be represented as a gray-scale image highlighting the face texture.
Figure 7 shows the output LBPs for a sample input face, one per channel, along with the calculated histograms. The final step is to concatenate the six histograms, reshape them and normalize their values, producing a feature map of size 6 × 14 × 14.
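A minimal sketch of the LBP histogram features is given below, assuming eight neighbors at radius one, a 256-bin histogram per channel, and a resize of each histogram to 14 × 14; the bin count and normalization scheme are assumptions consistent with the 6 × 14 × 14 output described above.

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def lbp_histogram_features(channels):
    """Per-channel LBP (8 neighbors, radius 1) followed by a 256-bin histogram,
    normalized and reshaped to a 14x14 map. Input: array of shape (6, 224, 224)."""
    maps = []
    for ch in channels:
        lbp = local_binary_pattern(ch, P=8, R=1, method="default")
        hist, _ = np.histogram(lbp, bins=256, range=(0, 256))
        hist = hist.astype(np.float32)
        hist /= (hist.sum() + 1e-8)                        # normalize the histogram
        hist = cv2.resize(hist.reshape(16, 16), (14, 14))  # match the HRNet map size
        maps.append(hist)
    return np.stack(maps)                                  # shape (6, 14, 14)
```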
3.2.3. Feature Concatenation
The final step of the feature extraction phase is concatenation, where the features extracted by the HRNet are combined with those of the LBP. The 512 × 14 × 14 HRNet feature vector is concatenated with the 6 × 14 × 14 feature vector produced by the six histograms. The outcome of the feature extraction stage is a 518 × 14 × 14 feature vector that is used as input to the capsule network to train and update the network weights accordingly.
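In tensor terms, this is a channel-wise concatenation; the batch size below is only for illustration.

```python
import torch

hrnet_features = torch.randn(8, 512, 14, 14)   # output of the modified HRNet
lbp_features = torch.randn(8, 6, 14, 14)       # six reshaped LBP histograms
combined = torch.cat([hrnet_features, lbp_features], dim=1)
assert combined.shape == (8, 518, 14, 14)      # input to the capsule network
```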
3.3. Classification—Capsule Network
Regular CNNs are built on scalar neuron representations capable of identifying the probability that particular features exist. To detect a feature well, CNNs require many different variants of the same data entity, which in turn needs extra neurons and expands the size and parameter count of the entire network. Moreover, the pooling operations in CNNs propagate only the features that stand out and drop all others, losing a lot of spatial information. Unlike CNNs, CapsNets exploit all the spatial information to identify both the probability that a feature exists and its location relative to others (pose information), making them viewpoint-invariant and thus requiring less data.
First introduced by Sabour et al. [58], a capsule is a group of neurons arranged as a vector that encodes pose information and whose length represents the probability that an entity exists (activation probability). This makes it easier to derive part-whole relations, since the information describing a part is embedded in a single computational unit. For example, a face's components (eye, mouth, nose, etc.) would be embedded in the lower capsule levels along with their relative positions. The routing algorithm then tries to connect each lower-level capsule to a single higher-level capsule containing its specifications. In EM Routing [59], Hinton et al. represented the pose by a matrix and determined the activation probability with the EM algorithm. The proposed model uses the inverted dot-product attention routing algorithm [69], which applies a matrix-structured pose in each capsule.
As demonstrated in Figure 8, our capsule network consists of one primary capsule layer, two convolutional capsule layers and two class capsules, one for each class. The extracted features are fed to the primary capsule layer, which is responsible for creating the first low-level capsules. It applies one convolutional layer to the extracted features, then normalizes and reshapes the output into matrix-capsules of size 4 × 4, grouping the hidden channels that define each capsule. Capsules in the primary layer (children) are used to update capsules in the first convolutional capsule layer (parents), which in turn update their parents, and so forth. The convolutional capsule layers are formed using Equations (1)–(5), each containing 32 capsules of size 4 × 4.
To route a child capsule $i$ in layer $L$ to a parent capsule $j$ in layer $L+1$, a vote $v_{ij}$ is created per child for each parent by applying the weight matrix $W_{ij}$ assigned between them. Initially, all parent poses $p_j^{L+1}$ are set to zero.

$$v_{ij} = W_{ij}\, p_i^{L} \qquad (1)$$

By applying the dot-product similarity, the agreement $a_{ij}$ between each parent and all children is calculated using their votes $v_{ij}$.

$$a_{ij} = \left(p_j^{L+1}\right)^{\top} v_{ij} \qquad (2)$$

A routing coefficient $r_{ij}$ is calculated by passing the agreement scores through a softmax function.

$$r_{ij} = \frac{\exp(a_{ij})}{\sum_{j'} \exp(a_{ij'})} \qquad (3)$$

Each child contributes to updating the parent poses according to its vote $v_{ij}$ and routing coefficient $r_{ij}$.

$$p_j^{L+1} = \sum_i r_{ij}\, v_{ij} \qquad (4)$$

Finally, a normalization layer [70] is added to enhance the routing's convergence.

$$p_j^{L+1} \leftarrow \mathrm{LayerNorm}\left(p_j^{L+1}\right) \qquad (5)$$
Equations (1)–(5) show the calculation steps of the capsule layers using the inverted routing algorithm. The first iteration is sequential, where the values of all but the first capsule layer are computed; the following iterations are concurrent, resulting in improved training performance.
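A minimal sketch of one routing step is given below, assuming the 4 × 4 poses are flattened into 16-dimensional vectors and that the votes have already been formed from the child poses via the per-pair weight matrices of Equation (1); the tensor shapes and number of iterations are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def inverted_dot_product_routing(votes, num_iters=2):
    """Sketch of inverted dot-product attention routing between two capsule layers.

    `votes` has shape (B, n_child, n_parent, D), with D = 16 for flattened 4x4 poses.
    """
    B, n_child, n_parent, D = votes.shape
    parents = torch.zeros(B, n_parent, D, device=votes.device)   # parent poses start at zero
    for _ in range(num_iters):
        # Agreement a_ij = <parent_j, vote_ij> (Equation 2).
        agreement = torch.einsum('bpd,bcpd->bcp', parents, votes)
        # Routing coefficients: softmax over parents for each child (Equation 3).
        r = F.softmax(agreement, dim=-1)
        # Parent poses as the vote average weighted by r_ij (Equation 4).
        parents = torch.einsum('bcp,bcpd->bpd', r, votes)
        # Layer normalization aids convergence (Equation 5).
        parents = F.layer_norm(parents, (D,))
    return parents
```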
The last capsule layers are the class capsules, where the feature vector is significantly reduced before feeding a linear classification layer that produces the prediction logits; this classifier is shared among the class capsules. Each of the two class capsules has a size of 16 and is constructed using the routing algorithm described in Equations (1)–(5).
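One plausible reading of this shared-classifier head is sketched below: a single linear layer applied to each 16-dimensional class capsule to obtain one logit per class. The exact wiring is an assumption, not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn

class ClassCapsuleHead(nn.Module):
    """Sketch of the classification head: a linear layer shared between the
    two 16-dimensional class capsules, producing one logit per class."""
    def __init__(self, capsule_dim=16):
        super().__init__()
        self.classifier = nn.Linear(capsule_dim, 1)   # shared among the class capsules

    def forward(self, class_capsules):                # shape (batch, 2, capsule_dim)
        return self.classifier(class_capsules).squeeze(-1)   # shape (batch, 2) logits


head = ClassCapsuleHead()
logits = head(torch.randn(4, 2, 16))                  # e.g. a batch of four samples
```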