DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer

Abstract: The ever-growing threat of deepfakes and their large-scale societal implications have propelled the development of deepfake forensics to ascertain the trustworthiness of digital media. A common theme of existing detection methods is the use of Convolutional Neural Networks (CNNs) as a backbone. While CNNs have demonstrated decent performance on learning local discriminative information, they fail to learn relative spatial features and lose important information due to constrained receptive fields. Motivated by these challenges, this work presents DFDT, an end-to-end deepfake detection framework that leverages the unique characteristics of transformer models to learn hidden traces of perturbations from both local image features and the global relationship of pixels at different forgery scales. DFDT is specifically designed for deepfake detection and consists of four main components: patch extraction & embedding, a multi-stream transformer block, and attention-based patch selection followed by a multi-scale classifier. DFDT's transformer layer benefits from a re-attention mechanism instead of a traditional multi-head self-attention layer. To evaluate the performance of DFDT, a comprehensive set of experiments is conducted on several deepfake forensics benchmarks. The obtained results demonstrate the superior detection performance of DFDT, which achieves 99.41%, 99.31%, and 81.35% accuracy on FaceForensics++, Celeb-DF (V2), and WildDeepfake, respectively. Moreover, DFDT's excellent cross-dataset & cross-manipulation generalization provides additional strong evidence of its effectiveness.


Introduction
The recent advances in the field of Artificial Intelligence (AI), particularly Generative Adversarial Networks (GANs) [1,2], the abundance of training samples, and robust computational resources [3] have significantly propelled the field of AI-generated fake information of all kinds, e.g., deepfakes. Deepfakes are synthesized yet super-realistic images and videos generated by combining, merging, superimposing, or replacing the facial area of images/videos, leveraging advanced techniques from the computer vision and deep learning domains [4]. Deepfakes are among the most sinister types of misinformation, posing large-scale and severe security and privacy risks targeting critical governmental institutions and ordinary people across the world [5,6]. Furthermore, deepfake generation algorithms are constantly evolving and have become a potent tool for adversarial entities to perpetuate and disseminate criminal content in various forms, including ransomware, digital kidnapping, etc. [7].
The ever-growing threat of deepfakes and their large-scale societal implications have driven the development of deepfake forensics to ascertain the trustworthiness and authenticity of digital media. Different deepfake detection approaches have been proposed to address this challenge [8,9]. Early deepfake detection algorithms were primarily based on hand-crafted features and visible artifacts, such as inconsistency in head poses [10], eye blinking [11], and face warping artifacts [12]. However, as deepfakes become more deceitful and sophisticated, deepfake detection algorithms are advancing. The fact that deepfakes are GAN-generated digital content, and not actual events captured by a camera, implies that they can still be detected using advanced AI models [13]. Furthermore, it has been shown that deep neural networks tend to achieve better performance than traditional image forensic tools [9]. Typical components of most state-of-the-art deepfake detection approaches are convolutional neural networks operating on facial regions cropped out of the entire image [14][15][16]. The unique characteristics of the convolutional operator in deep CNNs have enabled them to demonstrate strong capabilities in capturing minor visual artifacts, yielding decent detection results [7,17,18]. Although CNNs have proven themselves solid candidates for learning local information of the image, they still fail to capture the spatial interdependence of pixels due to constrained receptive fields.
Almost all deepfakes are created by applying alterations to the facial area while leaving other regions intact. Therefore, in addition to local image features, every pixel's global relationship provides essential information regarding the intensity and extension of manipulations. This information can be leveraged to boost the performance of a deepfake detection algorithm and bring better insight into the location of the forgeries. At the same time, different deepfake generation techniques target different proportions and regions of the facial area, ranging from small regions such as color mismatch in lips to larger areas that extend throughout the image, like face boundaries in face-swapping approaches. Therefore, it is vital to successfully identify forged pixels to extract better discriminative features in a scalable manner. Motivated by these challenges, a multi-stream deepfake detection framework is presented that combines the spatial interdependence of pixels in a global context with local image features in a scalable scheme, exploiting the strength of transformer models in learning the global relationship of pixels. Transformer models have proven their strong capabilities in learning long-term dependencies in natural language processing tasks [19][20][21], and more recently in computer vision tasks [22][23][24].

Objectives. Although the deepfake detection task has gained massive attention within the last couple of years, mainstream detection methods rely on localized features and CNN-based structures. Surprisingly, only a few research works have been conducted at the intersection of vision transformers and facial forgery detection. The main goal of this study is to present a digital media authentication system leveraging the unique characteristics of vision transformers in modeling the global relationship of pixels at different manipulation scales.
While CNNs fail to learn relative spatial information and lose essential data in pooling layers, the global attention mechanism of vision transformers enables the network to learn higher-level information much faster, which leads to more promising performance in less computational time. Furthermore, in digital media forensics, it is of vital importance not only to detect deepfakes but also to recognize the specific parts of the image that have been forged. This goal can be achieved by extracting hidden traces and intrinsic representations from the image's manipulated regions.

Contributions. While existing deepfake detection approaches are primarily dependent on CNN-based structures, this work presents an end-to-end deepfake detection framework leveraging the unique characteristics of transformer models. DFDT discovers hidden traces of perturbations from both local image features and the global relationship of pixels at different forgery scales. Unlike previous studies that are limited to either a direct application of vision transformers or still heavily rely on CNN-based models as a backbone [25,26], this work presents a transformer model developed specifically for the deepfake detection task. A comprehensive set of analyses is conducted to assess the performance of the proposed method from various perspectives, including intra-dataset performance, cross-dataset & cross-manipulation generalization, and various ablation studies. The key contributions of this work are summarized as follows:

• An end-to-end deepfake detection framework, DFDT, is developed leveraging the unique characteristics of transformer models on learning hidden traces of perturbations from both local image features and the global relationship of pixels at different forgery scales.

Organization. The rest of the paper is organized as follows. Section 2 provides a brief discussion of recent significant works on deepfake generation and detection techniques.
The outline of the presented approach, DFDT, along with its main components patch extraction & embedding, multi-stream transformer block, and attention-based patch selection are presented in Section 3. Overall evaluation settings, including datasets, implementation specifics, and evaluation metrics are described in Section 4. The obtained experimental results on DFDT are discussed and compared to its counterparts in Section 5. Finally, concluding remarks are drawn in Section 6.

Related Work
A brief description of recent advancements in the deepfake analysis domain is provided here.

DeepFake Generation. Although early deepfake generation techniques were mostly based on traditional vision and voice impersonation methods [3,27,28], most recent techniques benefit from the unique generation capabilities of GANs. For instance, Zhu et al. [29] and Kim et al. [30] utilized cycle-consistent GANs to generate deepfakes that maintain the facial expressions of the target while swapping the identities of source and target. Furthermore, Lu et al. [31] presented an identity-guided conditional CycleGAN to convert low-resolution facial images to high-resolution images. Similarly, Kim et al. [32] introduced a deep video portraits method to transfer both the facial expression and 3D pose of the source image onto the target image. Moreover, Li et al. [33] presented a high-quality face replacement approach, FaceShifter, that exploits a learning method based on a heuristic error acknowledging refinement network. As the main scope of this study is deepfake detection techniques, interested readers are referred to [8,9] for more detailed information on state-of-the-art deepfake generation techniques.

Deepfake Detection. While the deepfake detection task has been studied from different perspectives, this study mainly explores AI-driven deepfake detection approaches. Given the importance and huge threat of deepfake technologies, a large body of work is devoted to devising high-performance and resilient detection technologies. While early-stage detection techniques mainly focused on handcrafted features, i.e., blinking inconsistencies [12], biological signals [15], and unrealistic details [34], more recent techniques are developed using advanced deep learning networks. For example, Afchar et al. [35] introduced MesoNet, a deepfake detection algorithm composed of a shallow convolutional network exploiting intermediate-level features.
A detection approach based on an auto-encoder architecture and transfer learning, ForensicTransfer, was presented by Cozzolino et al. [36]. Similarly, the Capsule-Forensics architecture was introduced by Nguyen et al. [16] for better detection of AI-generated images and videos. Furthermore, an ensemble-learning-based composite model was employed to improve deepfake detection by Rana et al. [37]. Additionally, Kaur et al. [38] proposed a detection approach based on sequential temporal analysis and convolutional long short-term memory networks. Wang et al. [39] evaluated the cross-dataset generalization capability of their detection model, which was trained on ProGAN and tested on other datasets. Mittal et al. [40] proposed a multi-modal approach composed of audio and video modalities to tackle deepfake detection tasks. Furthermore, Jian et al. [14] introduced a hierarchical classification approach that can recognize deepfakes at three different levels, including manipulated images, retouched GAN-based images, and specific GAN architectures. Other researchers have investigated deepfake detection from a fine-grained visual classification point of view, specifically attention-based techniques. For instance, Du et al. [13] proposed a deepfake detection method from a fine-grained visual classification angle that is built on an auto-encoder architecture. Furthermore, Khormali and Yuan [18] presented an attention-based deepfake detection approach utilizing two different modules, i.e., Face Close-up and Face Shut-off, to force the model to extract more discriminative information from other parts of the facial region. Quan et al. [41] presented a progressive transfer learning algorithm to tackle face spoofing attacks using only a limited number of training samples. The presented face anti-spoofing method benefits from a temporal consistency constraint to verify the reliability of the pseudo labels of selected data.
While a large body of work focuses on CNN-based approaches for deepfake detection, only minimal effort has been devoted to investigating more advanced technologies, such as vision transformers, for this task.

Vision Transformer. Transformer networks were primarily designed to learn long-range contextual information to solve natural language processing tasks, e.g., text classification and machine translation. Transformers are extremely scalable and have demonstrated remarkable performance in learning dependencies within large-scale datasets, e.g., BERT [20], BioBERT [21], and GPT-3 [42]. Inspired by the great performance of transformers on NLP tasks, they have recently been extended to computer vision and multi-modal vision-language tasks, such as image classification [24,43], object detection [23,44], and image segmentation [45]. On the other hand, minimal effort has been devoted to exploring vision transformers for deepfake detection. Existing methods still highly depend on CNNs for feature extraction, and the community lacks an end-to-end vision-transformer framework designed explicitly for deepfake detection tasks. For example, Khan and Dai [46] presented a video transformer with an incremental learning approach for deepfake detection. Their design benefits from XceptionNet [47] as a backbone for image feature extraction and 12 transformer blocks for feature learning. Similarly, Wodajo and Atnafu [26] presented a convolutional vision transformer that uses CNNs as a feature extractor and a transformer block as a classifier. Furthermore, Heo et al. [25] proposed a scheme based on vision transformers and distillation that is built on EfficientNet [48] features. Therefore, to fill this research gap, an end-to-end transformer-based framework is designed explicitly for the deepfake detection problem in this study.

Methodology
A detailed description of the building blocks of the proposed multi-stream transformer-based deepfake detection framework, DFDT, is presented in this section. DFDT consists of four main components: patch & embedding extraction (Section 3.1), attention-based patch selection (Section 3.2), a multi-stream transformer block (Section 3.3), and a multi-scale classifier (Section 3.4). The overall framework of DFDT is depicted in Figure 1.

Patch Extraction & Embedding
All mandatory preprocessing steps, including face localization, patch extraction, and patch/positional embedding, are described in this section.

Preprocessing & Face Localization. In general image manipulation tasks, the forgeries might be scattered across any region of the image; in deepfake generation techniques, however, manipulations are mainly limited to the facial area, and the background is left intact. Therefore, training a model on only the facial area not only reduces computational complexity but also improves model performance due to background noise reduction [49]. To this end, first, 20% of the frames of each input video are extracted in consecutive order, and then facial landmarks are extracted leveraging RetinaFace [50], a state-of-the-art face detection method. Finally, the facial area in each frame is cropped, resized, and aligned using the calculated landmarks [51].

Patch & Embedding Extraction. Before analyzing images with transformer models, they need to be converted into a sequence of smaller D-dimensional patch embeddings. While non-overlapping patch extraction harms neighboring local structures, the overlapping image splitting approach, where two adjacent patches share an area, helps preserve and learn the neighboring information of the local area much better. In other words, in the overlapping patch extraction approach, each given image with resolution (H, W) and C channels, I ∈ ℝ^(H×W×C), is dissected into N smaller image patches with resolution (P, P) and C channels, I_p ∈ ℝ^(P×P×C), using a sliding window of stride S. Each pair of adjacent patches shares an area of size P × (P − S).
The resulting patches are then flattened and projected into a latent D-dimensional embedding space, E ∈ ℝ^(N×D). Furthermore, to maintain the positional information of each patch, the patch embeddings are combined with position embeddings E_pos ∈ ℝ^((N+1)×D). The resulting sequence of embedded patches, z, serves as an effective input sequence for the transformer blocks [24].
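The overlapping patch extraction and embedding steps described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the projection matrix, classification token, and positional embeddings are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

def extract_overlapping_patches(image, P, S):
    """Dissect an H x W x C image into N flattened patches of size P x P x C,
    using a sliding window of stride S so adjacent patches overlap by P - S pixels."""
    H, W, C = image.shape
    patches = [
        image[i:i + P, j:j + P, :].reshape(-1)          # flatten each P x P x C patch
        for i in range(0, H - P + 1, S)
        for j in range(0, W - P + 1, S)
    ]
    return np.stack(patches)                            # shape: (N, P*P*C)

def embed_patches(patches, D, rng):
    """Project flattened patches into a D-dimensional latent space and add
    positional embeddings (random here purely for illustration)."""
    N, flat = patches.shape
    W_proj = rng.standard_normal((flat, D)) * 0.02      # stand-in for the learned projection
    E = patches @ W_proj                                # patch embeddings, (N, D)
    cls_token = rng.standard_normal((1, D)) * 0.02      # prepended classification token
    z = np.concatenate([cls_token, E], axis=0)          # (N + 1, D)
    E_pos = rng.standard_normal((N + 1, D)) * 0.02      # positional embeddings
    return z + E_pos

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
patches = extract_overlapping_patches(img, P=8, S=4)    # adjacent patches overlap by P - S = 4
z = embed_patches(patches, D=64, rng=rng)
print(patches.shape, z.shape)                           # (49, 192) (50, 64)
```

With H = W = 32, P = 8, and S = 4, the window lands at 7 positions per axis, giving N = 49 patches, and the final sequence has N + 1 = 50 tokens once the classification token is prepended.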

Attention-Based Patch Selection
Other researchers have demonstrated that transformer models cannot accurately represent the importance level of input tokens, especially in higher layers, due to a lack of token identifiability of the embeddings [52]. One solution to improve the transformer model's capability in capturing sensitive information is to pay more attention to discriminative patches during the training and inference phases, as depicted in Figure 2. In addition, attention-based mechanisms have demonstrated strong capability in improving the performance of traditional CNN-based deepfake detection models [18]. Therefore, inspired by [53], a patch selection mechanism based on attention weights is employed in this study. Generally, for a transformer model with K attention heads and L layers, the input feature to the last layer and the attention weights of the l-th layer can be represented as (3) and (4), respectively.
It is critically important to learn how information propagates through different layers and to ensure the correspondence of attention weights with input tokens. Therefore, the raw attention weights are fused throughout the transformer model using matrix multiplication, i.e., a_final = a_{L−1} · a_{L−2} · · · a_1, as shown in (5). Note that a_final provides better discriminative information for selecting the top patches compared to single-layer raw attention weights, i.e., a_l as in (4) [53]. Therefore, the positional indices of the maximum attention weights corresponding to the K attention heads in a_final, i.e., A_1, A_2, · · · , A_K, are calculated and utilized to extract the corresponding key tokens in z_{L−1}. Ultimately, the original input sequence in (3) is substituted with an updated sequence consisting of the concatenation of the key tokens corresponding to the most informative regions along with the classification token, as in (6). The new input sequence maintains the global information of the input image. This process ensures that the model pays specific attention to the subtle traces of deepfakes while dismissing less discriminative regions such as the background area.
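The fusion-and-selection procedure can be sketched as follows. This is a minimal illustration under the assumption of row-stochastic per-head attention matrices; the function name `select_key_patches` and all shapes are illustrative, not from the paper's code.

```python
import numpy as np

def select_key_patches(attn_per_layer, tokens):
    """Fuse raw attention across layers by matrix multiplication, then pick,
    for each head, the patch the classification token attends to most strongly."""
    # attn_per_layer: list of L arrays of shape (K, N+1, N+1); tokens: (N+1, D)
    a_final = attn_per_layer[0]
    for a_l in attn_per_layer[1:]:
        a_final = np.einsum('kij,kjm->kim', a_l, a_final)   # per-head matrix product
    cls_attn = a_final[:, 0, 1:]                            # CLS-to-patch attention, (K, N)
    idx = cls_attn.argmax(axis=1) + 1                       # +1 skips the CLS token itself
    # updated sequence: classification token plus the K selected key tokens
    return np.concatenate([tokens[:1], tokens[idx]], axis=0), idx

rng = np.random.default_rng(1)
L_layers, K, N, D = 4, 3, 16, 8
attn = [rng.random((K, N + 1, N + 1)) for _ in range(L_layers)]
attn = [a / a.sum(-1, keepdims=True) for a in attn]         # row-normalize like a softmax
tokens = rng.standard_normal((N + 1, D))
z_new, idx = select_key_patches(attn, tokens)
print(z_new.shape)                                          # (4, 8): CLS token + K key tokens
```

Because each layer's attention matrix is row-stochastic, the fused product stays a valid attention distribution, so the argmax over the CLS row directly identifies the most informative patch per head.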

Multi-Stream Transformer Block
As different deepfake generation techniques target different proportions and regions of the facial area to be manipulated, ranging from small regions such as color mismatch in lips to larger areas that extend throughout the image, like face boundaries in face-swapping approaches, it is critically important to identify those regions and extract discriminative features in a scalable manner. Therefore, a flexible field of view provides better information than a fixed one. While the majority of the existing literature focuses only on a fixed field of view, i.e., a single patch size, this study proposes a multi-level patch extraction and fusion mechanism that can leverage deepfake traces ranging from larger facial markers, e.g., eyes, nose, and lips, to more subtle details, such as the iris of the eye. As depicted in Figure 1, the proposed multi-scale deepfake detection framework consists of two branches, a low-level patch branch and a high-level patch branch, each composed of three main components: patch/positional embedding, a transformer block, and an attention-based patch selection module, followed by a multi-scale deepfake classifier. The key difference between the low-level and high-level branches is the size of the image patches and how the input sequences are constructed from the corresponding patch and positional embeddings. While the low-level transformer block learns from a larger number of extracted patches of smaller size, the high-level transformer block learns more global features from larger image patches. These characteristics enable the low-level and high-level transformer blocks to efficiently extract local and global features, respectively. For a given image, each branch's extracted patch/positional embeddings are fed into the corresponding transformer block, i.e., the low-patch transformer block or the high-patch transformer block.
As can be seen in Figure 3, each transformer block comprises three residual transformer blocks, each with three consecutive vision transformer encoders. The intuition behind using residual connections between adjacent transformer blocks is to extract additional texture features. It has been shown that feature maps tend to become identical in the top layers of deep vision transformer models, meaning that the self-attention mechanism fails to learn effective concepts for representation learning. In other words, traditional multi-head self-attention layers suffer from an attention collapse problem, which prevents the vision transformer model from scaling up and hurts model performance [54]. Unlike other vision transformer frameworks, where each transformer encoder is composed of a multi-head self-attention layer and a feed-forward multilayer perceptron, the transformer encoder presented in this study is composed of a re-attention mechanism along with a feed-forward multilayer perceptron. The re-attention mechanism regenerates the attention maps by establishing cross-head communication in an attempt to increase the diversity of attention maps at different layers. The intuition behind the re-attention mechanism is that, while the similarity between attention maps across different transformer blocks is high, the similarity between attention maps from different heads of the same transformer block is small. The general architectures of the traditional transformer layer with the self-attention mechanism and the transformer layer with the re-attention mechanism are demonstrated in Figure 4. Mathematical representations of the traditional multi-head self-attention layer and the re-attention mechanism can be written as (7) and (8), respectively [54]. Both mechanisms map a query Q and key K–value V pairs to an output via linear transformations of the input, forming a trainable associative memory.
MHSA(Q, K, V) = Softmax(QK^T/√d)·V, (7)

Re-Attention(Q, K, V) = Norm(θ^T·Softmax(QK^T/√d))·V, (8)

where √d is a scaling factor based on the embedding dimension and θ ∈ ℝ^(K×K) is a learnable transformation matrix that mixes the K attention heads, whereas Norm is a normalization function.

Figure 4. Transformer layer with re-attention mechanism vs. self-attention approach [54]. While traditional transformer layers with self-attention mechanisms suffer from the attention collapse problem, the transformer layer with the re-attention mechanism has better scalability. (a) Transformer layer with re-attention mechanism; (b) Transformer layer with self-attention mechanism.
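The re-attention idea can be sketched numerically as follows. This is an illustrative NumPy sketch of the mechanism from [54], with a simple row renormalization standing in for the Norm(·) step; the shapes, the softmax-normalized θ, and all names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(Q, K, V, theta):
    """Mix the per-head attention maps with a learnable K x K matrix theta
    (cross-head communication) before applying them to the values."""
    # Q, K, V: (H, N, d) for H heads, N tokens, head dimension d; theta: (H, H)
    d = Q.shape[-1]
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))   # standard per-head maps, (H, N, N)
    mixed = np.einsum('hg,gnm->hnm', theta, attn)           # cross-head mixing
    mixed = mixed / mixed.sum(axis=-1, keepdims=True)       # stand-in for the Norm(.) step
    return mixed @ V                                        # (H, N, d)

rng = np.random.default_rng(2)
H, N, d = 4, 10, 16
Q, K, V = (rng.standard_normal((H, N, d)) for _ in range(3))
theta = softmax(rng.standard_normal((H, H)))                # keep mixing weights positive
out = re_attention(Q, K, V, theta)
print(out.shape)                                            # (4, 10, 16)
```

Because θ recombines attention maps across heads, the regenerated maps differ between layers even when the per-head maps themselves have started to collapse toward each other.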

Multi-Scale Deepfake Detector
Different deepfake generation techniques target different proportions and regions of the facial area in the forgery process, ranging from small regions such as color mismatch in lips to larger areas that extend throughout the image, like face boundaries in face-swapping approaches. Therefore, key patches at different scales may contain a plethora of discriminative information that needs to participate in the decision-making process. To do so, the multi-scale deepfake detector takes the following steps, as depicted in Figure 3: (1) making an initial prediction based on the sequence output of each residual transformer block of the low/high-level transformer blocks, and (2) averaging over all predictions from the different scales to reach the final decision.
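The two steps above can be sketched as follows. The sketch assumes binary (sigmoid) outputs per scale; the six logit values are made-up stand-ins for the per-scale predictions of the three residual blocks in each of the two branches.

```python
import numpy as np

def multi_scale_decision(branch_logits):
    """Average the per-scale predictions from the residual transformer blocks of
    the low- and high-level branches to reach a final real/fake decision."""
    probs = [1 / (1 + np.exp(-l)) for l in branch_logits]   # sigmoid per scale
    avg = np.mean(probs, axis=0)                            # step (2): average over scales
    return avg, (avg >= 0.5).astype(int)                    # probability and hard label

# six stand-in logits: 3 residual blocks x 2 branches (low/high-level)
logits = [np.array([2.1]), np.array([1.7]), np.array([0.4]),
          np.array([2.8]), np.array([-0.2]), np.array([1.1])]
prob, label = multi_scale_decision(logits)
print(label)                                                # [1] -> classified as fake
```

Averaging probabilities rather than taking a majority vote lets scales that are very confident (large logits) pull the final decision more strongly than uncertain ones.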

Evaluation Settings
A detailed description of general evaluation settings, e.g., deepfake datasets, preprocessing steps, implementation specifics, and evaluation metrics, are provided in this section.

Datasets
Early deepfake forensics benchmarks have significantly enhanced the community's awareness of deepfake threats and encouraged the development of different detection tools. Nonetheless, they suffer from the following drawbacks: (1) limited scenes in original videos, (2) low-quality synthesized faces, (3) visible splicing boundaries, (4) color mismatch, (5) visible parts of the original face, and (6) inconsistent synthesized face orientations [55,56]. On the other hand, as adversarial entities are constantly devising new techniques to create more deceptive deepfakes, forensics tools must be comprehensive and applicable to challenging real-world applications. Therefore, every dataset selected for evaluation should hold specific characteristics, including diversified real-world scenes, minimal visual artifacts, super-realism and stealth, and coverage of a wide range of manipulation techniques.
To satisfy these requirements, DFDT's performance is empirically examined against a wide range of high-quality yet challenging benchmarks, including FaceForensics++ [49], Celeb-DF (V2) [55], and WildDeepfake [56]. While FaceForensics++ consists of four different types of facial forgeries, Celeb-DF (V2) and WildDeepfake are among the most challenging deepfake datasets in real-world scenarios. For each dataset, 80% of the video samples are held out for training purposes, whereas the rest are equally divided into test and validation sets. Furthermore, the official test sets of three other well-known benchmarks, i.e., DeeperForensics [57], Facebook's DeepFake Detection Challenge (DFDC) dataset [58], and FaceShifter [33], are utilized to evaluate the cross-dataset generalization capability of DFDT. The purpose of this measurement is to demonstrate how well the model performs on unseen deepfake samples. A brief description of the utilized datasets is presented as follows, and the associated statistical specifics are listed in Table 1.

Table 1. Statistical specifics of the three benchmarks utilized in this study. Holding diversified real-world scenes, minimal visual artifacts, super-realism & stealth [55,56], and covering a wide range of manipulation techniques [49] are key characteristics of the selected benchmarks.

FaceForensics++. FaceForensics++ is one of the well-known deepfake detection datasets, consisting of four different manipulation techniques: Deepfakes [27], FaceSwap [59], Face2Face [3], and NeuralTextures [60]. It has 1000 real videos from YouTube and corresponding deepfake videos generated using the aforementioned techniques.

Celeb-DF (V2). The Celeb-DF (V2) dataset is composed of large-scale deepfake videos generated using an improved synthesis process that swaps the faces of individuals in target and source videos. Celeb-DF (V2) offers high visual quality scores and consists of 5639 deepfakes corresponding to over 2 million frames.
WildDeepfake. Unlike FaceForensics++ and Celeb-DF (V2), the WildDeepfake dataset comprises both real and fake videos collected directly from the Internet rather than synthesized by the dataset curators, making them more challenging and closer to real-world scenarios. Furthermore, more diversified scenes, more individuals per scene, and a variety of facial expressions are among the other characteristics of this benchmark.

Implementation Specifics
A detailed description of the characteristics and technical specifics of the implementation of the proposed method is provided below.

Implementation. All models are implemented using the PyTorch machine learning library and trained using the Adam optimizer with a learning rate of 10^−4, decayed by a factor of ten every 40 steps. The whole network is trained for 100 epochs.

Experimental Setup. Two Lambda Quad deep learning workstations were used to conduct all experiments. Each machine runs Ubuntu 18.04 and is equipped with an Intel Xeon E5-1650 v4 CPU, 64 GB DDR4 RAM, a 2 TB SSD, a 4 TB HDD, and 4 NVIDIA Titan V GPUs.
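The stated schedule (base rate 10^−4, multiplied by 0.1 every 40 steps) can be written as a one-line step-decay rule. This is a sketch of the schedule as reported, not the paper's training code; in PyTorch the same effect is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)`.

```python
def learning_rate(step, base_lr=1e-4, decay=0.1, every=40):
    """Step decay matching the reported schedule: start at 1e-4 and
    multiply by 0.1 after every block of 40 steps."""
    return base_lr * decay ** (step // every)

# base rate, rate after the first decay, rate after the second decay
print(learning_rate(0), learning_rate(40), learning_rate(85))
```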

Evaluation Metrics
The performance of the proposed deepfake detection method is evaluated at both the frame level and the video level. The results are reported using the accuracy score (ACC) and/or the area under the receiver operating characteristic curve (AUC). These two evaluation metrics have commonly been used in existing deepfake detection tasks [17,61,62]. Therefore, to provide a better understanding of and insight into the performance of the presented model, the same metrics are employed in this study.
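For reference, the AUC can be computed from prediction scores with the rank-sum (Mann–Whitney) identity: the probability that a randomly chosen fake frame scores higher than a randomly chosen real one. This is a tie-free sketch for illustration; production code would use a library routine such as scikit-learn's `roc_auc_score`.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC via the rank-sum identity (assumes no tied scores):
    AUC = (sum of positive ranks - n_pos*(n_pos+1)/2) / (n_pos * n_neg)."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)            # 1-based ascending ranks
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# perfectly separated toy scores -> AUC of 1.0
print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.1, 0.2]))        # 1.0
```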

Results & Discussion
A comprehensive set of experiments is conducted to evaluate the proposed transformer-based deepfake detection method's performance from various perspectives. The aforementioned evaluation metrics, i.e., detection accuracy and AUC scores, are employed to measure the performance of DFDT. Most existing deepfake detection methods conduct only frame-level analysis. However, it is critically important to conduct a video-level examination, since most deepfake content disseminated on digital media consists of forged videos. Therefore, all experiments in this study are performed on two levels, covering both frame-level and video-level analysis. First, the intra-dataset performance of DFDT is evaluated against three well-known benchmark datasets, including Celeb-DF (V2), WildDeepfake, and FaceForensics++ [49,55,56].
Considering the critical role of the generalization property in deepfake detection task, another set of experiments are designed to examine the cross-dataset generalization capability of the DFDT. Third, the model's performance is compared with that of existing state-of-the-art deepfake detection methods. Finally, the impact of the different components of the DFDT on its function is investigated through various ablation studies. Each of these experiments is discussed in more detail in the following.

Intra-Dataset Evaluation
The main goal of this section is to investigate the learning capability of the model and see how well it performs on datasets with different visual qualities and on challenging real-world deepfake datasets. Therefore, the model is trained and tested on a range of deepfake detection datasets spanning different levels of visual quality, namely Celeb-DF (V2), WildDeepfake, and FaceForensics++ [49,55,56]. The findings of this experiment demonstrate that DFDT performs significantly well on every challenging dataset across all measured scores. Quantitative frame-level detection results are summarized in Table 2. In particular, in the frame-level setting, the DFDT model achieves accuracy scores of 99.41%, 99.31%, and 81.35% on FaceForensics++ (raw) [49], Celeb-DF (V2) [55], and WildDeepfake [56], respectively. A similar trend is apparent in the video-level analysis, providing additional strong evidence of the outstanding performance of the proposed transformer-based deepfake detection approach in intra-dataset settings. Furthermore, a quantitative comparison of the existing deepfake detection methods with DFDT on every dataset is presented in Table 3. Note that the same evaluation metrics as in the literature are utilized for each dataset. It can be observed from Figure 5 that the deepfake detection approach presented in this work outperforms existing methods on all three benchmarks. This figure also reveals another critical point: although most deepfake detection approaches perform well on relatively more straightforward datasets, i.e., FaceForensics++, their performance is still far from perfect on more challenging, real-world datasets, i.e., WildDeepfake.

Cross-Dataset Generalization
While deepfake generation methods are constantly evolving and span classical and AI-driven approaches, it is critically important for any deepfake detection method to have a high generalization capability so that it can recognize unseen samples effectively. To measure this property, in line with the existing literature, the DFDT model is trained on FaceForensics++ and then examined on other datasets, including Celeb-DF (V2), DeepFake Detection Challenge, Faceshifter, and DeeperForensics. Table 4 presents the attained AUC scores for the video-level analysis. Comparison of the results obtained in this study with those of existing state-of-the-art methods demonstrates the excellent cross-dataset generalization capability of the DFDT method. It can be observed from Figure 6 that the DFDT approach achieves remarkable cross-dataset generalization. It surpasses other systems on the Celeb-DF, Faceshifter, and DFDC datasets by relatively large margins and achieves scores on DeeperForensics competitive with the state-of-the-art method, i.e., LipForensics [17].

Cross-Manipulation Generalization
Another set of experiments is designed to understand the generalization capability of the DFDT method to fake videos created with different manipulation techniques applied to the same source video, while keeping the pose and illumination variables intact. To do so, DFDT is trained on three out of the four deepfake generation methods in FaceForensics++ and tested on the remaining one. With the same rationale as in [17,68,70,76], the experiments are performed on the high-quality subset of the FaceForensics++ dataset, as it is closer to real-world deepfake videos, i.e., videos processed with nearly lossless compression. The results of this experiment, shown in Table 5, indicate that DFDT's generalization property extends well to previously unseen forgery types. Specifically, it achieves higher or competitive cross-manipulation generalization ability compared to existing approaches. As can be observed in Figure 6, on average, DFDT scales to unseen forgery types better than or comparably to existing deepfake detection methods.
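The leave-one-out protocol above can be sketched directly. FaceForensics++ contains four manipulation methods (Deepfakes, Face2Face, FaceSwap, and NeuralTextures); the helper below is an illustration of the split enumeration, not the authors' actual code:

```python
# The four manipulation methods in FaceForensics++.
METHODS = ["Deepfakes", "Face2Face", "FaceSwap", "NeuralTextures"]

def leave_one_out_splits(methods=METHODS):
    """Yield (train_methods, test_method) pairs for the
    cross-manipulation experiment: train on three forgery
    types, hold out the fourth for testing."""
    for held_out in methods:
        train = [m for m in methods if m != held_out]
        yield train, held_out
```

Iterating over `leave_one_out_splits()` produces the four train/test configurations evaluated in Table 5.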

Ablation Study
The main goal of this experiment is to examine the impact of different attention mechanisms on the performance of the proposed deepfake detection framework. Two attention mechanisms, i.e., self-attention and re-attention, are investigated in this study. The re-attention mechanism re-generates the attention maps by establishing cross-head communication, in an attempt to increase the diversity of the attention maps across layers.
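As a rough illustration of this idea, the sketch below follows the standard re-attention formulation (as in DeepViT): per-head attention maps are mixed across heads by a learnable matrix `theta` before being applied to the values. The shapes and names here are illustrative, not DFDT's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Re-attention sketch.

    q, k, v : (H, N, d) per-head queries, keys, values.
    theta   : (H, H) learnable head-mixing matrix that establishes
              cross-head communication.
    Returns the (H, N, d) attended values.
    """
    H, N, d = q.shape
    # Standard per-head attention maps: (H, N, N).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    # Re-attention: regenerate maps by mixing them across heads.
    attn = np.einsum('hg,gnm->hnm', theta, attn)
    return attn @ v
```

With `theta` set to the identity matrix, this reduces to ordinary multi-head self-attention; a learned non-diagonal `theta` lets each head borrow information from the others' attention maps.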
Different experiments are conducted with and without this mechanism to explore the re-attention transformer layer's impact on DFDT's performance. The comparison results on the AUC score are shown in Figure 7. It can be observed from this figure that, without the re-attention mechanism, the performance of DFDT decreases by 1.7%, 0.9%, and 1.4% on FaceForensics++, Celeb-DF (V2), and WildDeepfake, respectively.
Figure 7. Investigating the impact of the transformer layer's attention mechanism on the overall performance of DFDT. The re-attention mechanism improves the performance of the deepfake detection model compared to the self-attention mechanism in the transformer layer.

Conclusions
This work introduces DFDT, an end-to-end deepfake detection framework using vision transformers. Unlike mainstream deepfake detection methods, which exploit CNNs as their backbone, DFDT leverages the unique characteristics of vision transformer networks to model local image features and global relationships of pixels simultaneously. DFDT's multi-stream design enables it to capture different scales of alteration effectively. Experimental results on several benchmarks demonstrate that DFDT achieves state-of-the-art performance, reaching 99.41%, 99.31%, and 81.35% on FaceForensics++, Celeb-DF (V2), and WildDeepfake, respectively. Furthermore, DFDT's excellent cross-dataset & cross-manipulation generalization provides additional strong evidence of its effectiveness.

Institutional Review Board Statement:
Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.