ADD: Attention-Based DeepFake Detection Approach

Abstract: Recent advancements of Generative Adversarial Networks (GANs) pose emerging yet serious privacy risks threatening digital media’s integrity and trustworthiness, specifically digital video, through synthesizing hyper-realistic images and videos, i.e., DeepFakes. The need for ascertaining the trustworthiness of digital media calls for automatic yet accurate DeepFake detection algorithms. This paper presents an attention-based DeepFake detection (ADD) method that exploits the fine-grained and spatial locality attributes of artificially synthesized videos for enhanced detection. The ADD framework is composed of two main components, face close-up and face shut-off data augmentation methods, and is applicable to any classifier based on a convolutional neural network architecture. ADD first locates potentially manipulated areas of the input image to extract representative features. Second, the detection model is forced to pay more attention to these forgery regions in the decision-making process through a particular focus on interpreting the sample in the learning phase. ADD’s performance is evaluated against two challenging datasets of DeepFake forensics, i.e., Celeb-DF (V2) and WildDeepFake. We demonstrate the generalization of ADD by evaluating four popular classifiers, namely VGGNet, ResNet, Xception, and MobileNet. The obtained results demonstrate that ADD can boost the detection performance of all four baseline classifiers significantly on both benchmark datasets. In particular, ADD with a ResNet backbone detects DeepFakes with more than 98.3% on Celeb-DF (V2), outperforming state-of-the-art DeepFake detection methods.


Introduction
The recent advances in the field of deep learning, specifically generative adversarial networks [1,2] and convolutional auto-encoders [3], have significantly propelled the generation of sophisticated and compelling forgeries of all kinds. Generally, fake information is created for malicious purposes, such as propaganda or misinformation campaigns. In the context of digital video, sophisticated image and video manipulation techniques have emerged as one of the most sinister forms of misinformation, posing growing privacy risks to large-scale communities all over the world [4][5][6][7][8]. Such manipulated videos are so sophisticated that they are extremely difficult to detect using state-of-the-art Artificial Intelligence (AI) visual media forensic tools, or even with human eyes [9]. Moreover, recent improvements in computer vision and deep learning techniques have made it extremely easy to create DeepFakes, i.e., hyper-realistic and deceptive videos of real people produced by manipulating the face region while leaving only minimal visual artifacts [10,11]. Mainly, DeepFakes are the product of merging, combining, replacing, and superimposing images and videos using AI techniques to generate fake digital videos that appear authentic [12]. While initial DeepFake videos were benign and plain, created for fun or artistic value, adversaries have abused this technology for malicious purposes, leading to severe political, social, financial, and legal consequences [12][13][14][15]. The impact of DeepFake videos becomes more critical considering the scope, scale, and sophistication of the technology involved, as they can be fabricated using a simple computer [14]. Furthermore, DeepFake generation algorithms are evolving continually, which not only improves their visual quality but also makes them better at circumventing existing detection methods.
Thanks to the accessibility of large-volume training data, high-throughput computing power, and automated generation procedures, there has been a huge surge in developing new DeepFake creation algorithms. DeepFake generation methods can be categorized into pre-deep-learning approaches [3,16,17,18,19,20] and deep learning-based approaches [21][22][23][24][25]. Despite small differences in the design of different DeepFake generators, they all follow the same flow: take in a video of a specific individual (target) and replace that individual's face with the face of another person (source). The backbone of the recent deceptive algorithms is generative adversarial networks, which map the source's facial expressions to the target and, with a proper post-processing step, can achieve a high level of realism [11].
As DeepFakes become super-realistic and more pervasive, ascertaining a digital video's trustworthiness and deciding on its authenticity becomes a more demanding and challenging task. The fact that DeepFakes are created by an AI algorithm rather than by a camera capturing real events implies that they can still be detected using advanced deep learning networks [26]. Recently, multiple research works have focused on presenting a comprehensive understanding of the state-of-the-art methods and comparative analysis of DeepFakes [27][28][29]. The literature in this field shows that DeepFakes inherently carry different artifacts, ranging from visible artifacts as in earlier DeepFakes [11,30,31] to more hidden traces in more sophisticated DeepFakes [10,32,33], which can be exploited using high-level AI models to develop an automated digital video authentication system.
Objectives. This paper's primary goal is to present a digital video authentication system that offers high detection performance while covering a wide range of possible manipulation techniques. Such a digital media forensics tool is vital in real-world scenarios, considering the adversary's ever-evolving techniques for generating more deceptive DeepFakes. In general, training a new detection model for each new manipulation technique is computationally heavy and time-consuming, or even impractical due to a lack of sufficient labeled data from that technique. However, this goal can be achieved by forcing the model to learn hidden traces and intrinsic representations from manipulated regions.

Contributions.
In this work, we treat the DeepFake detection task as a Fine-Grained Visual Classification (FGVC) problem. In both tasks, the main goal is to recognize subordinate-level categories under a basic-level category, and both share the same two difficulties. First, there is a substantial variance among images of the same class in terms of poses and viewpoints of the face, even for a person in the same video. Second, there is a minimal variance between images of the two different classes. The difference between the original and the fabricated image is small enough to deceive even human eyes, as can be observed in Figure 1. Furthermore, the forgery involves only the face region and leaves the background and other portions intact. By taking these characteristics into account, we developed a digital video authentication system, i.e., ADD, built on an attention mechanism. ADD first locates potentially manipulated areas of the input image and extracts key representative features. Second, ADD forces the detection model to pay extra attention to these manipulated regions for decision making by imposing additional supervision on instance interpretation in the learning procedure through attention-based data augmentation. Finally, the performance of ADD is evaluated against two challenging DeepFake forensic datasets. Comparing the obtained results with those of other existing models clearly demonstrates the excellence of ADD in the given task. The major contributions of this work are summarized as follows:
• We considered the DeepFake detection task as an FGVC problem and proposed a digital video authentication system, ADD, built on an attention mechanism.
• ADD first locates potentially manipulated areas of the input image and extracts discriminative features from those regions. Second, the detection model is made to pay more attention to these forgery regions for decision-making by imposing additional supervision on instance interpretation in the learning procedure through attention-based data augmentation.
• The performance of ADD is evaluated against two challenging DeepFake forensic datasets. Experimental results demonstrate that ADD could achieve a detection rate of 98.37% on Celeb-DF (V2), outperforming state-of-the-art DeepFake detection methods.
Organization. The rest of the paper is organized as follows. In Section 2 we review the related works on DeepFake generation and detection techniques. In Section 3 we describe ADD, outlining its three main components: frame-wise face localization, localized discriminative features, and attention-based data augmentation. In Section 4 we review the overall evaluation settings, including datasets, baseline network architectures, implementation specifics, and evaluation metrics. In Section 5 we discuss the experimental results of ADD, and finally, a conclusion is drawn in Section 6.

Figure 1. [...] [25]. The left column (green border) shows selected frames from original videos, and the other columns (red border) show the corresponding AI-generated frames. Note that the intra-class variance is high, whereas the inter-class variance is small.

Related Work
In general, the field of DeepFake video analysis can be categorized into two broad domains: DeepFake generation and DeepFake detection.
DeepFake Generation. Early DeepFake generation methods were simple and relied mostly on traditional vision and voice impersonation; however, most current methods involve sophisticated AI-based generation techniques, i.e., GANs. FakeApp was the first DeepFake creation software, developed by a Reddit user using an autoencoder-decoder pairing structure [34,35]. Furthermore, Thies et al. [3] presented real-time face capture and re-enactment of videos using non-rigid model-based bundling. Masi et al. [19] presented a face-specific data augmentation technique using 3D shapes and appearances of faces. Recent advancements in the field of deep learning have enabled adversaries to devise more sophisticated DeepFake creation techniques, leading to super-realistic videos, by exploiting the unique generation capabilities of generative adversarial networks.
For example, Zhu et al. [36] and Kim et al. [37] modified GANs and presented cycle-consistent GANs that change the domain of the output images based on the input image's domain. They utilized this method for DeepFake generation, where the source person's identity was changed to that of the target person while keeping the facial expression unchanged. Lu et al. [38] presented an identity-guided conditional CycleGAN to create high-resolution face images from their low-resolution counterparts. Similarly, Kim et al. [23] presented deep video portraits that transfer both the facial expression and the 3D pose of the source image into the target image. Moreover, Faceswap-GAN [39] improves the visual quality of the synthesized images with adversarial and perceptual losses. The generated videos were more realistic thanks to temporal smoothing of the frame-to-frame face detection box and an attention mask. Thies et al. [16] presented a facial reenactment forgery method, NeuralTextures, based on a patch-based adversarial loss alongside a photometric reconstruction loss. Wang et al. [40,41] presented a flow-based face reenactment forgery method known as the video-to-video synthesis approach based on multiple talking videos of the source and generating new DeepFakes using a single image of the target. In a similar approach, Siarohin et al. [42] incorporated a learnable optical flow network approximation to a first-order Taylor polynomial to generate a manipulated video of a person using a single image. Li et al. [43] presented FaceShifter, an adaptive attention-based denormalization generator for high-quality face replacement using a heuristic error acknowledging refinement network learning method.
DeepFake Detection. A large body of work in the DeepFake analysis domain is focused on devising automated yet effective detection techniques. Early detection techniques were focused on handcrafted features, i.e., blinking inconsistencies [11], biological signals [44], and unrealistic details [45]. Although manually crafted detection features helped to advance the DeepFake detection domain, their performance was poor and they could be easily circumvented. Techniques based on deep learning networks have lately been used to overcome this issue and build more reliable forgery detection tools. For instance, Afchar et al. [33] proposed MesoNet, which detects forgeries at an intermediate level of detail using a shallow convolutional network while avoiding microscopic features that can be eliminated during the video compression process. Cozzolino et al. [46] proposed the ForensicTransfer method, a forgery detection approach built on an autoencoder architecture and transfer learning. Nguyen et al. [47] extended this method by replacing the standard decoder with a decoder that generates a mask of the manipulated region using a multitask learning approach. Furthermore, Nguyen et al. [32] proposed the Capsule-Forensics method to detect both replay attacks and digitally generated images and videos. Rana et al. [48] introduced a technique that combines a series of deep learning classification models and creates an improved composite classifier for DeepFake detection.
While the previously discussed approaches target intraframe dissimilarities, Güera and Delp [49] utilized time-distributed features and a long short-term memory network for DeepFake detection. Furthermore, Sabir et al. [50] evaluated the same approach using ResNet [51] and DenseNet [52] feature extractors, where the extracted faces were aligned in consecutive order using facial landmarks to maintain temporal consistency. Furthermore, Yu et al. [53] investigated the potential of GAN fingerprinting analysis for DeepFake detection. Dordevic et al. [54] presented a method based on scale-invariant feature transform for DeepFake detection. Kaur et al. [55] presented a sequential temporal analysis to detect face-swapped video clips using convolutional long short-term memory. Mittal et al. [56] presented an approach that simultaneously exploited audio and video modalities and perceived emotions from the two modalities for DeepFake detection.
Although researchers in the community have investigated the DeepFake detection problem from various perspectives, only minimal effort has been devoted to investigating DeepFakes from a fine-grained visual classification point of view, especially using attention-based techniques. The works most similar to ADD are [26,57]. In line with [26,57], our proposed method treats the DeepFake detection problem as a fine-grained visual classification task while utilizing attention-based data augmentation techniques. However, our method differs from [26], where the authors proposed a DeepFake detection method from the FGVC angle built on an autoencoder structure, whereas our proposed method is based on a deep learning structure. Furthermore, ADD differs from [57] in that ADD considers only the last two convolutional blocks in the model for data augmentation rather than all convolutional blocks, as proposed in [57]. Besides, ADD uses two different modules, i.e., Face Close-Up and Face Shut-Off, to force the model to extract more discriminative information from different parts of the face region, whereas [57] generates attention masks focused only on the eyes, nose, and mouth for adjusting the feature map of the face.

ADD: Methods
In this section, the proposed framework for the attention-based digital video authentication system, ADD, is introduced. The general pipeline of the presented attention-based DeepFake detection approach is illustrated in Figure 2. ADD is composed of three main components, namely face localization and preprocessing (Section 3.1), localized discriminative feature extraction (Section 3.2), and attention-based data augmentation (Section 3.3), followed by a classifier to distinguish original frames from DeepFakes.

Face Localization and Preprocessing
As pointed out in Section 1, DeepFake generation algorithms mainly manipulate face regions and leave the background intact. Therefore, focusing on the face region of a video frame, instead of analyzing the whole frame as input to the learning model, not only improves the detection performance by reducing background noise but also reduces the computational time by reducing the size of the input sample [58]. To this end, the following steps, as shown in Figure 3, are taken for face localization and further analysis. First, for each input video, 20% of the frames are extracted in consecutive order, yielding over 2 million frames on the Celeb-DF (V2) dataset. Second, the state-of-the-art face detection method, i.e., RetinaFace [59], is utilized to locate facial landmarks on each extracted frame. The obtained facial landmarks are utilized to crop, align, and resize the faces to a standard configuration [60]. These cropped frames, containing only face regions, are further used for attention-based image augmentation and feature extraction.

Localized Discriminative Features
In the DeepFake detection task, it is essential to determine the face region along with different facial landmarks for effective feature extraction. In this work, the distribution of face regions and associated facial landmarks is represented using attention maps. For a given frame $I$, the feature maps $F \in \mathbb{R}^{H \times W \times C}$ are extracted using a CNN-based feature extractor, where $H$, $W$, and $C$ represent the feature layer's height, width, and number of channels, respectively. The obtained feature maps $F$ are then utilized to calculate the distribution of $M$ different parts of the face, i.e., the attention maps $A \in \mathbb{R}^{H \times W \times M}$, using a convolutional function $f(\cdot)$ as $A = f(F) = \bigcup_{k=1}^{M} A_k$. Here, each specific part of the face, i.e., lips, eyes, forehead, etc., is represented by $A_k \in \mathbb{R}^{H \times W}$. Having generated $M$ attention maps corresponding to $M$ different parts of the face, the representative feature maps of those parts, $F_k$, can be obtained by element-wise multiplication of the feature maps $F$ with each attention map $A_k$, i.e., $F_k = A_k \odot F$. This process is shown in Figure 4. A feature extractor, e.g., a global pooling function $g(\cdot)$, is then applied to each of these local feature maps $F_k$ to pool out the more discriminative local feature associated with the $k$-th attention map, $f_k = g(F_k) \in \mathbb{R}^{1 \times C}$. Finally, these local features $f_k$ are stacked to build a comprehensive and distinctive feature set containing detailed information of the whole frame. Passing this valuable information to the model forces it to focus specifically on the forgery regions of a given input image and learn local interpretations to perform its decision-making.
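The pooling step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `attention_pool` is hypothetical, tensors are stored channel-first as nested lists (rather than the H×W×C layout used in the text), and global average pooling is assumed for g(·), which the text names only as an example.

```python
from typing import List

Grid = List[List[float]]   # one H x W map
Stack = List[Grid]         # C feature channels or M attention maps


def attention_pool(features: Stack, attention: Stack) -> List[List[float]]:
    """Return an M x C matrix whose k-th row is f_k = g(A_k * F).

    Assumption: g is global average pooling over the H x W grid.
    """
    pooled = []
    for a_k in attention:                      # one attention map per face part
        row = []
        for channel in features:               # one feature channel at a time
            h, w = len(channel), len(channel[0])
            # element-wise product A_k * F_c, then average over all positions
            total = sum(a_k[i][j] * channel[i][j]
                        for i in range(h) for j in range(w))
            row.append(total / (h * w))
        pooled.append(row)
    return pooled
```

Stacking the M rows gives the part-level feature matrix that is passed on to the classifier; a map that is active on a single location simply selects the feature vector at that location, scaled by 1/(H·W).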

Attention-Based Data Augmentation
Once the attention maps are calculated, they can be employed for more efficient data augmentation. The problem with random data augmentation methods is their low efficiency and the background noise they introduce. In contrast, attention-based data augmentation is more efficient as it exposes the model to additional instance interpretation in the learning procedure. Two complementary attention-based data augmentation approaches are employed in this work, namely Face Close-Up and Face Shut-Off. While the former looks closer at a specific region of the face, the latter ignores that area and looks at other face regions.
Face Close-Up. The face close-up augmentation approach's primary goal is to look closer at specific regions of the face, e.g., eyes, forehead, lips, etc., and provide more distinctive local features to the model to enhance its local interpretability. The following steps are taken to perform face close-up augmentation. One attention map is randomly selected from the M available attention maps for each frame, and its elements are normalized to [0, 1]. All elements with a value greater than a predefined threshold are set to one, and the remaining are set to zero. Finally, only the region enclosed by a bounding box that covers all active areas is selected. The face close-up augmentation approach enlarges the scale of the face's selected region from the raw input, thus improving the detection model's local explainability by focusing on the forgery region while being exposed to more fine-grained features. The augmented image is illustrated in Figure 2.
Face Shut-Off. While the face close-up approach provides a closer look into specific regions of the face, the resulting bounding boxes for different attention maps might be very similar. In such cases, the model would not learn new representative features. To avoid this issue and extract more discriminative features from other regions, the face shut-off data augmentation approach is utilized. Like the previous approach, for each frame, one attention map out of the M available attention maps is randomly selected and normalized to [0, 1]. All the normalized attention map elements with a value greater than a predefined threshold are set to zero, whereas the remaining elements are set to one. This results in removing the active parts from the image, which in turn forces the model to look at other parts of the image and attain additional localized discriminative features.
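The face close-up steps (normalize, threshold, bounding box, crop) can be sketched as follows. This is a minimal illustration and not the authors' code: the helper names, the threshold value of 0.5, and the assumption that the attention map has already been resized to the image resolution are all hypothetical, and the final step of enlarging the crop back to the network input size is omitted.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), inclusive


def normalize(att: Grid) -> Grid:
    """Rescale an attention map linearly to the range [0, 1]."""
    lo = min(min(row) for row in att)
    hi = max(max(row) for row in att)
    span = hi - lo
    if span == 0:                      # flat map: nothing is active
        return [[0.0 for _ in row] for row in att]
    return [[(v - lo) / span for v in row] for row in att]


def close_up_box(att: Grid, threshold: float = 0.5) -> Optional[Box]:
    """Bounding box covering every element whose normalized value exceeds the threshold."""
    norm = normalize(att)
    active = [(i, j) for i, row in enumerate(norm)
              for j, v in enumerate(row) if v > threshold]
    if not active:
        return None
    rows = [i for i, _ in active]
    cols = [j for _, j in active]
    return min(rows), min(cols), max(rows), max(cols)


def crop(image: Grid, box: Box) -> Grid:
    """Keep only the pixels inside the (inclusive) bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]
```

In a training loop, the map passed to `close_up_box` would be one of the M attention maps picked at random for the current frame, and the cropped region would be resized before being fed back to the model.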
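The face shut-off mask can be sketched in the same spirit. Again this is an illustrative sketch rather than the authors' implementation: the function name `shut_off`, the threshold of 0.5, the single-channel image, and the assumption that the attention map matches the image resolution are hypothetical choices.

```python
from typing import List

Grid = List[List[float]]


def shut_off(image: Grid, att: Grid, threshold: float = 0.5) -> Grid:
    """Zero out the pixels whose normalized attention exceeds the threshold.

    The mask is 0 on active locations and 1 elsewhere, so the most attended
    region is removed and the remaining regions are left unchanged.
    """
    lo = min(min(row) for row in att)
    hi = max(max(row) for row in att)
    span = (hi - lo) or 1.0            # avoid division by zero on a flat map
    result = []
    for img_row, att_row in zip(image, att):
        result.append([0 if (a - lo) / span > threshold else p
                       for p, a in zip(img_row, att_row)])
    return result
```

Because the mask is the complement of the one used for face close-up, the two augmentations expose the model to disjoint parts of the same frame.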

Evaluation Settings
This section is devoted to introducing the overall evaluation settings, including the DeepFake detection datasets, baseline network architectures, implementation specifics, and evaluation metrics.

Datasets
For a digital video authentication system to be relevant and have real-world impact, it is crucial to evaluate it against high-quality DeepFake datasets. Such a dataset should be super-realistic and stealthy, cover diverse real-world scenes, and contain minimal visual artifacts to maintain its high visual quality. Different research groups in the community have introduced different DeepFake detection datasets, such as the UADFV dataset [61], the DeepFake-TIMIT dataset (DF-TIMIT) [62], the FaceForensics++ dataset (FF-DF) [58], and the Facebook DeepFake detection challenge (DFDC) dataset [63]. While these have considerably advanced DeepFake detection in its early stages, most of them are far from perfect for today's real-world applications. They have major visual problems, such as limited scenes in original videos, low-quality synthesized faces, visible splicing boundaries, color mismatch, visible parts of the original face, and inconsistent synthesized face orientations [25,57].
Thus, in this study, the performance of the proposed method is empirically evaluated against the two most recent and challenging DeepFake datasets, i.e., Celeb-DF (V2) [25] and WildDeepfake [57]. The former is the dataset with the highest visual quality score reported to date, and the latter is a challenging real-world DeepFake dataset with more diverse scenes and more persons with rich facial expressions in each scene.

Celeb-DF (V2).
The Celeb-DF (V2) is a large-scale challenging video dataset of 590 original videos of celebrities and 5639 high-quality DeepFake videos generated using an improved synthesis process, corresponding to over 2 million frames. Real videos are collected from publicly available YouTube videos, and the fake ones are created by swapping faces for each pair of the subjects.
WildDeepfake. The WildDeepfake is a challenging real-world DeepFake detection dataset, where, unlike other datasets, both real and DeepFake videos are collected completely from the internet. This dataset presents more diverse scenes, more persons in each scene, and rich facial expressions. Corresponding dataset statistics are provided in Table 1. For more detailed information we refer the interested readers to the original sources [25,57].

Baseline Architectures
In the following, we briefly review four state-of-the-art deep learning models used in this study. These models form the backbone of the different configurations of ADD.
VGG19 Structure. The Visual Geometry Group (VGG) network is a deep convolutional neural network comprising 19 weight layers: five blocks of convolutional layers followed by three fully connected layers. Each convolutional layer uses a 3 × 3 kernel with a stride of 1 and padding of 1 so that the output keeps the spatial size of the input, and is followed by a rectified linear unit (ReLU) activation. Each block ends with a max-pooling operation that uses a 2 × 2 kernel with a stride of 2 and no padding, which halves each spatial dimension. Afterward, two fully connected layers with 4096 ReLU-activated units are used before the final fully connected softmax classifier layer [64].
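The size arithmetic behind these kernel, stride, and padding choices can be checked with the standard output-size formula, out = floor((n + 2p - k) / s) + 1. The short sketch below is illustrative only; the 224-pixel input size is an assumption for the example and is not stated in the text.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed input resolution (hypothetical; not given in the text).
size = 224

# A 3 x 3 convolution with stride 1 and padding 1 keeps the size unchanged.
same = conv_output_size(size, kernel=3, stride=1, padding=1)

# Five 2 x 2 max-pooling layers with stride 2 each halve the size.
after_pools = size
for _ in range(5):
    after_pools = conv_output_size(after_pools, kernel=2, stride=2, padding=0)
```

With the assumed 224-pixel input, the five pooling stages reduce each spatial dimension to 7 before the fully connected layers.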
ResNet Structure. The Residual Networks (ResNets) [65] are a type of deep convolutional neural network where blocks of convolutional layers are skipped using shortcut connections. In this architecture, the down-sampling process takes place at convolutional layers with a stride of 2, after which batch normalization is performed. Finally, a ReLU activation is applied. The architecture has 101 layers in total, where the network ends with a fully connected layer with softmax activation [65].
Xception Structure. Xception is a convolutional neural network based on separable convolutions with residual connections. This model is composed of 71 deep layers, with an image input size of 299 by 299.
MobileNet Structure. MobileNet is a lightweight deep learning model developed using a depth-wise separable convolution architecture [66]. The MobileNet architecture comprises 19 bottleneck layers, each consisting of three convolution operations: a 1 × 1 convolution, a 3 × 3 depth-wise convolution, and a 1 × 1 point-wise convolution. While the 1 × 1 convolution enriches the features by increasing the number of channels, the 3 × 3 depth-wise convolution reduces computing costs by separating the feature filtering process. The separated features are then combined by the point-wise convolution [66].
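To see why separating the filtering step reduces cost, the weight counts of a standard convolution and a depth-wise separable one can be compared directly. The sketch below is a generic illustration with hypothetical channel counts (32 in, 64 out) and biases ignored; it does not reproduce the exact layer configuration of MobileNet.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a k x k depth-wise convolution followed by a 1 x 1 point-wise one."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
```

For this example layer the separable form needs 2336 weights instead of 18432, roughly an eightfold reduction.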

Implementation Specifics
Here, the implementation details of ADD are provided for reproducibility purposes.
Implementation. All baseline models, along with the various configurations of ADD, are implemented using the PyTorch machine learning library and trained for 20 epochs using the Stochastic Gradient Descent (SGD) optimizer [67] with a learning rate of $10^{-3}$, momentum of 0.9, and weight decay of $10^{-5}$ to minimize the softmax cross-entropy loss. Moreover, we used mini-batch training with a different mini-batch size for each network architecture, ranging from 8 for XceptionNet to 64 for VGG, on 4 NVIDIA Titan-V Graphics Processing Units (GPUs).
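For readers unfamiliar with how these hyperparameters interact, the following plain-Python sketch shows one update step of SGD with momentum and L2 weight decay on a list of scalar parameters. It follows the common formulation in which the weight-decay term is added to the gradient before the momentum buffer is updated (no dampening, no Nesterov term); whether the authors' training used exactly these defaults beyond the stated learning rate, momentum, and weight decay is an assumption.

```python
from typing import List, Tuple


def sgd_step(params: List[float], grads: List[float], velocity: List[float],
             lr: float = 1e-3, momentum: float = 0.9,
             weight_decay: float = 1e-5) -> Tuple[List[float], List[float]]:
    """Return updated parameters and momentum buffers after one SGD step."""
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p      # L2 weight decay folded into the gradient
        v = momentum * v + g          # momentum buffer accumulates past gradients
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity
```

Starting from zero momentum buffers, the first step reduces to a plain gradient step on the decayed gradient; later steps additionally carry 0.9 of the previous update direction.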

Experimental Setup.
All experiments were conducted on two Lambda Quad deep learning workstations. Each workstation was equipped with Ubuntu 18.04 OS, Intel Xeon E5-1650 v4 CPU, 64 GB DDR4 RAM, 2TB SSD, 4TB HDD, and 4 NVIDIA Titan-V Graphics Processing Units (GPUs).

Evaluation Metrics
Performance of ADD was evaluated using three evaluation metrics, namely accuracy rate, recall, and area under the Receiver Operating Characteristic curve (ROC-AUC), at the frame level for all key frames. Although accuracy rate is easy to interpret, it might not provide good insight for highly imbalanced datasets. Therefore, the ROC-AUC metric was utilized to demonstrate how well the detection model performed on both the DeepFake and pristine data distributions. Furthermore, the recall metric was employed to reflect how well the model detects manipulated videos, as missing a fake video is a costly mistake with potentially further adverse impacts. Additionally, all trained models will be published upon the acceptance of the paper. Once the three metrics are obtained, the baseline models are ranked according to each metric, and the rankings are compared to the ground-truth ranking attained from ADD on the target test set. For a fair comparison, all models were trained on the same training data and tested on the same hold-out test set.
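The three frame-level metrics can be computed as in the sketch below. It is a minimal reference written for illustration, with hypothetical function names and with label 1 assumed to denote a DeepFake frame; ROC-AUC is computed through its pairwise-ranking interpretation (the probability that a fake frame receives a higher score than a real one, with ties counted as one half), which is quadratic in the number of frames and therefore only suitable for small examples.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of frames whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Fraction of positive (fake) frames that the model actually flags."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)


def roc_auc(y_true: List[int], scores: List[float], positive: int = 1) -> float:
    """Probability that a random positive frame scores above a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On an imbalanced toy set with three fake frames and one real frame, a detector that labels every frame as fake reaches 75% accuracy and 100% recall, which is why accuracy alone is not reported.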

Results & Discussion
This section provides a detailed discussion of the performance of the proposed DeepFake detection method. The performance of ADD is evaluated based on three evaluation metrics: detection accuracy, ROC-AUC, and recall. In our analysis, we focus on the DeepFake detection task at the level of each frame; hence, all reported results in this study are based on frame-wise detection tasks. First, the results obtained from experiments with and without the ADD framework on the Celeb-DF (V2) and WildDeepFake detection tasks are discussed for each baseline model to highlight the impact of ADD. Second, the performance of ADD is compared to state-of-the-art DeepFake detection techniques.

ADD's Impact
To better understand the impact of the presented framework, ADD, we compared the performance of each baseline architecture with and without ADD using the Celeb-DF (V2) and WildDeepFake benchmark datasets on frame-wise DeepFake detection problems, as reported in Table 2.
Celeb-DF (V2). The results of the experiments on the Celeb-DF (V2) dataset for each model architecture are shown in Figure 5. As can be observed, the baseline models did not perform well; the best baseline, the VGG structure, reached 95.53% detection accuracy, which is not acceptable in the DeepFake detection task. While the baseline models performed poorly, their performance was boosted by considerably large margins once upgraded to the ADD framework. For example, the performance of the vanilla detection model with the ResNet architecture improved from an 88.47% detection accuracy rate to 98.37% for the same model with the ADD framework, an improvement of about 9.9 percentage points. The results of these experiments clearly demonstrate the outstanding impact of the proposed attention-based framework for enhanced DeepFake detection.
WildDeepFake. DeepFakes in the WildDeepFake dataset are more challenging to detect than virtual DeepFakes; therefore, the effectiveness of detectors developed on virtual DeepFake datasets can be limited when applied to wild DeepFakes. A similar set of experiments was conducted using WildDeepFake to evaluate the performance of the proposed method on a more challenging DeepFake detection task. The results of these experiments are illustrated in Figure 6. A pattern similar to that of the previous experiments was observed, which confirms the effectiveness of the ADD framework in improving the detection performance of all four baseline models.
As can be seen, vanilla models that were not equipped with an attention mechanism did not offer an acceptable detection accuracy rate, reaching no more than 69%, which is extremely low in the DeepFake detection field. However, all configurations of ADD were able to improve the evaluation metrics by significantly large margins. For example, ADD with the Xception baseline architecture detected DeepFakes with 79.23% detection accuracy. This result is outstanding compared to existing state-of-the-art DeepFake detection methods.

Comparison with State-of-the-Art Methods
This section is devoted to comparing the performance of ADD against state-of-the-art methods on DeepFake detection tasks. While we have reported different evaluation metrics in our analysis, we follow the metrics reported in the literature for comparison. Results on Celeb-DF (V2) are reported in the literature based on the AUC score; therefore, we compare the AUC score of ADD with that of the literature, as shown in Table 3. It can be observed that all configurations of ADD outperformed the state-of-the-art detection AUC score by significantly large margins. In particular, ADD with the ResNet baseline architecture achieved an AUC score of 98.65%, which is more than a 7% improvement compared to FakeCatcher [44].
Since detection accuracy rate is the only reported evaluation metric regarding the performance of detection techniques on the WildDeepFake benchmark [57], we used the same metric for our comparison study. The results of our experiments, along with those of other approaches on this particular dataset, are reported in Table 4. It can be observed that while most of the previous studies were bound to accuracy rates below 70%, our proposed ADD framework boosted the performance of all baseline models above 77%. For instance, a configuration of ADD with the Xception architecture was able to further improve the DeepFake detection performance on the WildDeepFake dataset and achieve 80.13%. Overall, it can be observed that all configurations of ADD outperformed the state-of-the-art DeepFake detection methods by significantly large margins.

Table 3. Comparing the performance of ADD against state-of-the-art DeepFake detection models on Celeb-DF (V2). Note that reported results in rows 1-8 are from [25].

Conclusions
This paper presents a DeepFake detection method, ADD, that exploits the fine-grained and spatial locality attributes of the AI-synthesized videos to boost detection performance. Potentially manipulated areas of the input image and corresponding features are first extracted, and then the detection model is forced to focus more on those manipulated regions for decision making. ADD performs this task by imposing extra supervision on instance interpretation in the learning procedure. The performance of ADD is evaluated against two recently introduced challenging datasets for DeepFake forensics, i.e., Celeb-DF (V2) and WildDeepFake. For example, ADD with ResNet architecture is able to detect DeepFakes with more than 98.3% AUC on Celeb-DF (V2), outperforming state-of-the-art DeepFake detection methods.