Deepfake Attacks: Generation, Detection, Datasets, Challenges, and Research Directions

Abstract: Recent years have seen a substantial increase in interest in deepfakes, a fast-developing field at the nexus of artificial intelligence and multimedia. These artificial media creations, made possible by deep learning algorithms, allow for the manipulation and creation of digital content that is extremely realistic and challenging to distinguish from authentic content. Deepfakes can be used for entertainment, education, and research; however, they pose a range of significant problems across various domains, such as misinformation, political manipulation, propaganda, reputational damage, and fraud. This survey paper provides a general understanding of deepfakes and their creation; it also presents an overview of state-of-the-art detection techniques, existing datasets curated for deepfake research, and the associated challenges and future research trends. By synthesizing existing knowledge and research, this survey aims to facilitate further advancements in deepfake detection and mitigation strategies, ultimately fostering a safer and more trustworthy digital environment.


Introduction
Deepfakes are produced by manipulating existing videos and images to produce realistic-looking but wholly fake content. The rise of advanced artificial-intelligence-based tools and software that require no technical expertise has made deepfake creation easier. With the unprecedented exponential advancement the world is currently witnessing in generative artificial intelligence, the research community is in dire need of keeping informed on the most recent developments in deepfake generation and detection technologies so as not to fall behind in this critical arms race.
Deepfakes present a number of serious issues that arise in a variety of fields. These issues could significantly impact people, society [1], and the reliability of digital media [2]. Some significant issues include fake news, which can lead to the propagation of deceptive information, manipulation of public opinion, and erosion of trust in media sources. Deepfakes can also be employed as tools for political manipulation, to influence elections, and to destabilize public trust in political institutions [3,4]. In addition, this technology enables malicious actors to create and distribute non-consensual explicit content to harass and cause reputational damage, or to create convincing impersonations of individuals, deceiving others for financial or personal gain [5]. Furthermore, the rise of deepfakes poses a serious issue in the domain of digital forensics, as it contributes to a general crisis of trust and authenticity in digital evidence used in litigation and criminal justice proceedings. All of these impacts show that deepfakes present a serious threat, especially given the current sensitive state of the international political climate and the high stakes of ongoing global conflicts. Deepfakes and fake news can be weaponized in the ongoing media war, which can ultimately result in catastrophic consequences.
Therefore, deepfake detection techniques need to be constantly improved to keep up with the fast-paced evolution of generative artificial intelligence. Literature reviews are needed to keep pace with this fast-changing field so that researchers and professionals can develop robust countermeasures and lay the groundwork that makes deepfakes easier to detect and mitigate.
The key contributions of this survey paper are as follows:
• A summary of state-of-the-art deepfake generation and detection techniques;
• An overview of fundamental deep learning architectures used as backbones in deepfake video detection models;
• A list of existing deepfake datasets contributing to the improvement of the performance, generalization, and robustness of deepfake detection models;
• A discussion of the limitations of existing techniques, challenges, and research directions in the field of deepfake detection and mitigation.
The remainder of this paper is organized as follows. Section 2 provides an outline of the most recent existing survey papers related to deepfake technology. Section 3 is devoted to deepfake manipulation techniques for generating deepfakes. Section 4 describes existing deepfake detection techniques. Section 5 gives a list of existing datasets used for deepfake research. In Section 6, we discuss some of the challenges and future research directions of the deepfake field. Finally, the survey ends with a conclusion.

Related Surveys
Multiple surveys of the literature in the area of deepfake detection have been published in recent years, as the topic is advancing rapidly. For instance, the authors of [6] offered a systematic literature review with a new, interdisciplinary viewpoint on deepfakes. They provided a meticulous definition of deepfakes and discussed the impact of the creation and spread of deepfakes. They also suggested future research directions for innovation. Alternatively, the authors of [7] provided a rich review paper with an exhaustive breakdown of deepfake types alongside the technology leveraged in their creation and detection, as well as open-source datasets and future trends in deepfake technology. In Ref. [8], the authors focused their systematic review on deepfake detection technology. They included machine learning and deep learning methods alongside statistical and blockchain-based techniques, assessed how well each method performs when applied to diverse datasets, and offered some recommendations on deepfake detection that may aid future studies. In Ref. [9], the author presented recent deepfake advancements, covering four face manipulation types, their generation, detection methods, and future prospects.
In Ref. [10], the authors explored the background and methods of deepfakes before looking at the development of improved and resilient deep learning techniques to combat their use. In Ref. [11], the authors provided a survey with an extensive summary of deepfake manipulation types, the tools and technology used to generate deepfakes, a technical background of deep learning models and public datasets, and the challenges faced in the creation and detection of deepfakes. In [12], the authors presented a detailed review of deepfake manipulation types and their generation processes, as well as several detection methods and the features they leverage, alongside some issues that demand serious consideration in future studies. The authors of [13] offered in their survey a technical background on the architectures used in deepfake creation, dealing with two manipulation types: reenactment and replacement. In addition to detection technologies and prevention solutions, they mentioned several inadequacies of the available defense options and areas that need more focus.
In a detailed survey [14], the authors covered several topics of deepfake manipulation, including audio deepfakes, the technology used in their creation and detection, performance metrics, and publicly available datasets, in addition to a discussion of the limitations and future trends in the field of deepfakes.

Deepfake Generation
In this section, we will first state the various types of deepfake manipulations and then deliver an overview of deepfake generation techniques.

Deepfake Manipulation Types
There exist five primary types of deepfake manipulation, as shown in Figure 1. Face synthesis [32] is a manipulation type which entails creating images of a human face that does not exist in real life. In attribute manipulation [33], only the region that is relevant to the attribute is altered in order to change the facial appearance by removing or donning eyeglasses, retouching the skin, and even making more significant changes, like changing the age or gender. Nevertheless, our attention is directed towards manipulations that are predominantly prevalent in video format due to their heightened engagement levels compared to image-based content. Consequently, it is more likely for people to fall victim to deepfake videos. These manipulations are designed to make it appear as though a person is doing or saying something that they did not actually do or say.
The most common manipulation types are identity swap or face swapping, face reenactment, and lip-syncing. Face swapping [34,35] is a form of manipulation that has primarily become prevalent in videos even though it can occur at the image level. It entails the substitution of one individual's face in a video, known as the source, with the face of another person, referred to as the target. In this process, the original facial features and expressions of the target subject are mapped onto the associated areas of the source subject's face, creating a seamless integration of the target's appearance into the source video. The origins of research on the subject of identity swap can be traced to the morphing method introduced in [36]. Meanwhile, face reenactment [37,38] is a manipulation technique that focuses on altering the facial expressions of a person in a video. It involves the replacement of the original facial expression of the subject with the facial expression of another person. Lastly, we have lip-syncing [39], where the objective is to generate a target face that appears authentic and synchronizes with given text or audio inputs. Achieving accurate lip movements and facial expressions that align with the source audio necessitates the use of advanced techniques. Additionally, meticulous postprocessing is crucial to ensuring that the resulting video portrays a natural and seamless facial appearance.

Deepfake Generation Techniques
Multiple techniques exist for generating deepfakes. Generative Adversarial Networks (GANs) [40] and autoencoders are the most prevalent techniques. GANs consist of a pair of neural networks, a generator network and a discriminator network, which engage in a competitive process. The generator network produces synthetic images, which are presented alongside real images to the discriminator network. The generator network learns to produce images that deceive the discriminator, while the discriminator network is trained to differentiate between real and synthetic images. Through iterative training, GANs become proficient at producing increasingly realistic deepfakes. On the other hand, autoencoders can be used as feature extractors to encode and decode facial features. During training, the autoencoder learns to compress an input facial image into a lower-dimensional representation that retains essential facial features. This latent-space representation can then be used to reconstruct the original image. For deepfake generation, however, two autoencoders are leveraged, one trained on the face of the source and another trained on the target.
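The adversarial training loop described above can be sketched in a few lines. The following is a deliberately minimal, illustrative toy (NumPy only, 1-D data rather than images): the generator is an affine map of noise, the discriminator a logistic regressor, and both are updated with the standard two-step GAN objective. All names and hyperparameters are illustrative choices, not taken from any particular deepfake system.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: samples from N(4, 1.5^2); the generator tries to mimic it.
def real_batch(n):
    return rng.normal(4.0, 1.5, size=n)

a, b = 1.0, 0.0      # generator: x = a*z + b
w, c = 0.1, 0.0      # discriminator: D(x) = sigmoid(w*x + c)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

lr, n = 0.01, 64
for step in range(500):
    z = rng.normal(size=n)
    fake = a * z + b
    real = real_batch(n)

    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    gw = np.mean((1 - d_real) * real) - np.mean(d_fake * fake)
    gc = np.mean(1 - d_real) - np.mean(d_fake)
    w += lr * gw
    c += lr * gc

    # Generator step: ascend log D(fake) (non-saturating loss)
    d_fake = sigmoid(w * fake + c)
    ga = np.mean((1 - d_fake) * w * z)   # d/da of log D(a*z + b)
    gb = np.mean((1 - d_fake) * w)       # d/db of log D(a*z + b)
    a += lr * ga
    b += lr * gb

print(f"generator offset b = {b:.2f} (real mean is 4.0)")
```

Even this toy exhibits the instabilities that make real GAN training delicate: the generator offset drifts toward the real mean, but the scale parameter can collapse, a miniature version of the mode-collapse problem seen at image scale.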
Numerous sophisticated GAN-based techniques have emerged in the literature, contributing to the advancement and complexity of deepfakes. AttGAN [41] is a technique for facial attribute manipulation; its attribute awareness enables precise and high-quality attribute changes, making it valuable for applications like face swapping and age progression or regression. Likewise, StyleGAN [42] is a GAN architecture that excels at generating highly realistic and detailed images. It allows for the manipulation of various facial features, making it a valuable tool for generating high-quality deepfakes. Similarly, STGAN [33] modifies specific facial attributes in images while preserving the person's identity. The model can work with labeled and unlabeled data and has shown promising results in accurately controlling attribute changes. Another technique is StarGANv2 [43], which is able to perform multi-domain image-to-image translation, enabling the generation of images across multiple different domains using a single unified model. Unlike the original StarGAN [44], which could only perform one-to-one translation between each pair of domains, StarGANv2 [43] can handle multiple domains simultaneously. An additional GAN variant is CycleGAN [45], which specializes in style transfer between two domains. It can be applied to transfer facial features from one individual to another, making it useful for face-swapping applications. Moreover, there is RSGAN [46], which can encode the appearances of faces and hair into underlying latent-space representations, enabling the image appearances to be modified by manipulating the representations in the latent spaces. For a given audio input, LipGAN [47] is intended to produce realistic lip motions and speech synchronization.
In addition to the previously mentioned methods, there is a range of open-source tools readily available for digital use, enabling users to create deepfakes with relative ease, like FaceApp [48], Reface [49], DeepBrain [50], DeepFaceLab [51], and Deepfakes Web [52]. These tools have captured the public's attention due to their accessibility and ability to produce convincing deepfakes. It is essential for users to utilize these tools responsibly and ethically to avoid spreading misinformation or engaging in harmful activities. As artificial intelligence develops at a fast pace, deepfake generation algorithms are simultaneously becoming more sophisticated, convincing, and hard to detect.

Deepfake Detection
This section will point out the diverse clues and detection models exploited to achieve the task of distinguishing fake media from genuine media. Next, it will delve into the various state-of-the-art deep learning architectures implemented in deepfake detection techniques and provide a summary of several recent deepfake detection models.

Deepfake Detection Clues
Deepfakes can be detected by exploiting various clues, as summarized in Figure 2. One approach is to analyze spatial inconsistencies by closely examining deepfakes for visual artifacts, facial landmarks, or intra-frame inconsistencies. Another method involves detecting convolutional traces that are often present in deepfakes as a result of the generation process, for instance, bi-granularity artifacts and GAN fingerprints. Additionally, biological signals such as abnormal eye-blinking frequency, eye color, and heartbeat can also indicate the presence of a deepfake, as can temporal inconsistencies or the discontinuity between adjacent video frames, which may result in flickering, jittering, and changes in facial position. Poor alignment of facial emotions on swapped faces in deepfakes is a high-level semantic feature used in detection techniques. Detecting audio-visual inconsistencies is a multimodal approach that can be used for deepfakes that involve swapping both faces and audio. Another multimodal approach is to exploit spatial-temporal features by inspecting visual irregularities within individual video frames (intra-frame inspection) and analyzing temporal characteristics across video streams (inter-frame examination).

Detection Based on Spatial Artifacts
To effectively use face landmark information, Liang et al., in Ref. [53], described a facial geometry prior module. The model harnesses facial maps and correlation within the frequency domain to study the distinguishing traits of altered and unmanipulated regions by employing a CNN-LSTM network. In order to predict manipulation localization, a decoder is utilized to acquire the mapping from low-resolution feature maps to pixel-level details, and a SoftMax function is implemented for the classification task. A different approach, dubbed forensic symmetry, by Li, G. et al. [54], assessed whether the natural features of a pair of mirrored facial regions are identical or dissimilar. The symmetry attribute extracted from frontal facial images and the resemblance feature obtained from profiles of the face images are obtained by a multi-stream learning structure that uses DRN as its backbone network. The difference between the two symmetrical face patches is then quantified by mapping them into angular hyperspace. A heuristic prediction technique was used to operate this model at the video level. As a further step, a multi-margin angular loss function was developed for classification.
Hu et al. [55] proposed DeepfakeMAE, a detection model that can leverage the commonalities across all facial components. To be more specific, a masked autoencoder is pretrained to learn facial part consistency by randomly masking some facial features and rebuilding the missing sections using the facial parts that are still visible. This is performed given a real face image. Moreover, a model employing two networks, both utilizing pre-trained encoders and decoders, is leveraged to optimize the differentiation between authentic and counterfeit videos. Yang, J. et al. [56] tackled deepfake detection from a different perspective, simulating the fake image generation process to explore forgery traces. A multi-scale self-texture attention generative network is suggested for this aim, employing an encoder-decoder generator, ResNet as the backbone network, and the self-texture attention method to improve the texture characteristics in the process of disassembling an image. Additionally, a loss function termed Prob-tuple loss, confined by classification probability, is suggested. To identify visual artifacts at different scales, Wang et al. [57] introduced a Multi-modal Multi-scale Transformer that works on patches of various sizes to identify disparities within images at various spatial tiers as well as forgery artifacts in the frequency domain; the latter is added to RGB information by means of a cross-modality fusion block. An approach based on GANs for deepfake detection is suggested by Xiao et al. [58], leveraging the concealed gradient data within the grayscale representation of the manipulated image and incorporating focal loss for the classification task.

Detection Based on Biological/Physiological Signs
Li, Y. et al. [59] adopted an approach based on identifying eye blinking, a biological signal that is not easily conveyed in deepfake videos. Therefore, a deepfake video can be identified by the absence of eye blinking. To spot open and closed eye states, a deep neural network model that blends a CNN and a recursive neural network is used, taking into account previous temporal knowledge. Alternatively, Hernandez-Ortega et al. [60] presented an innovative approach for detecting deepfake videos that focuses on analyzing heart rate information through remote photoplethysmography (rPPG). By examining video sequences and identifying slight alterations in skin color, the existence of human blood beneath the tissues can be revealed. The proposed detection system, called DeepfakesON-Phys, incorporates a Convolutional Attention Network to extract spatial and temporal details from video frames and effectively combines the two sources for improved fake video detection.
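The rPPG intuition behind such detectors can be illustrated without any neural network: the heartbeat leaves a faint periodic ripple in skin color that survives in genuine video and is typically absent or inconsistent in synthesized faces. The sketch below stands in for a video with a sequence of per-frame mean green-channel values and recovers the pulse rate with an FFT; all numbers are synthetic and purely illustrative.

```python
import numpy as np

fps = 30.0
t = np.arange(0, 10, 1 / fps)           # 10 s of frames at 30 fps
pulse_hz = 1.2                          # ~72 beats per minute

# Mean green value per frame: baseline + tiny heartbeat ripple + noise
green = (0.52
         + 0.004 * np.sin(2 * np.pi * pulse_hz * t)
         + 0.001 * np.random.default_rng(1).normal(size=t.size))

signal = green - green.mean()
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fps)

# Search only the plausible human heart-rate band (0.7-4 Hz)
band = (freqs > 0.7) & (freqs < 4.0)
est = freqs[band][np.argmax(spectrum[band])]
print(f"estimated pulse: {est:.2f} Hz ({est * 60:.0f} bpm)")
```

A detector can then flag videos whose face region shows no coherent peak in the heart-rate band; real systems such as DeepfakesON-Phys learn this mapping end to end rather than with a hand-built FFT.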

Detection Based on Audio-Visual Inconsistencies
Boundary Aware Temporal Forgery Detection is a multimodal technique introduced by Cai et al. [61] for correctly predicting the boundaries of fake segments based on visual and auditory input. While an audio encoder using a 2D CNN learns characteristics extracted from the audio, a video encoder leveraging a 3D CNN learns frame-level spatial-temporal information. Yang, W. et al. [62] also exploited the discrepancy between audio and visual elements for deepfake identification. A temporal-spatial encoder for feature embedding explores the disparity between audio and visual components at temporal and spatial levels, and a multi-modal joint-decoder is designed to concurrently acquire knowledge of multimodal interactions and integrate audio-visual data, alongside a cross-modal classifier incorporated for manipulation detection. Also considering both the audio and visual aspects of a video, Ilyas et al. [63] introduced an end-to-end method called AVFakeNet. The detection model comprises a Dense Swin Transformer Net (DST-Net).

Detection Based on Convolutional Traces
To detect deepfakes, Huang et al. [64] harnessed the imperfection of the up-sampling process in GAN-generated deepfakes by employing a map of gray-scale fakeness. Furthermore, an attention mechanism, augmentation of partial data, and clustering of individual samples are employed to improve the model's robustness. Chen et al. [65] exploited a different trace, namely bi-granularity artifacts: intrinsic-granularity artifacts that are caused by up-convolution or up-sampling operations, and extrinsic-granularity artifacts that are the result of the post-processing step that blends the synthesized face into the original video. Deepfake detection is tackled as a multi-task learning problem where ResNet-18 is used as the backbone feature extractor. L. Guarnera et al. [66] provided a method that uses an expectation-maximization algorithm to extract a set of local features intended to capture the convolutional patterns frequently found in generated photos. Five currently accessible architectures are considered: GDWCT [67], StarGAN [68], AttGAN [41], StyleGAN [42], and StyleGAN2 [69]. Next, naive classifiers are trained to differentiate between real images and those produced by these architectures.
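Why up-sampling leaves a detectable trace can be seen in one dimension. Zero-insertion upsampling (the first half of a transposed convolution, common in GAN decoders) exactly replicates the signal's spectrum at high frequencies, a periodic fingerprint that forensic classifiers can learn. The following toy uses a random 1-D "image row" rather than a real decoder and is meant only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
row = rng.normal(size=128)            # stands in for a row of a real image

up = np.zeros(2 * row.size)
up[::2] = row                         # zero-insertion 2x upsampling

spec = np.abs(np.fft.fft(up))
# The magnitude spectrum repeats with period 128: the low-frequency
# content is mirrored into the high band, where natural images have
# little energy. This replica is the "convolutional trace".
print(np.allclose(spec[:128], spec[128:]))
```

Real generators follow the zero-insertion with a learned filter that attenuates, but rarely eliminates, these replicas, which is what the gray-scale fakeness map and bi-granularity approaches above exploit.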

Detection Based on Identity Information
Based on the intuition that every person can exhibit distinct patterns in the simultaneous occurrence of their speech, facial expressions, and gestures, Agarwal et al. [70] introduced a multimodal detection method with a semantic focus that incorporates speech transcripts into the analysis of person-specific gestures, using interpretable action units to model the facial and cranial motion of an individual. Meanwhile, Dong et al. [71] proposed an Identity Consistency Transformer that simultaneously learns identity vectors for the inner face and the outer face; moreover, the model uses a novel consistency loss to drive the two identities apart when their labels are different and to bring them closer when their labels are the same. Similarly, Nirkin et al. [72] identified deepfakes by looking for identity inconsistencies between two identity vectors that represent the inner face region and its outer context. The identity vectors are obtained using two networks based on the Xception architecture and trained using a vanilla cross-entropy loss. Focusing on temporal identity inconsistency, Liu et al. [73] introduced a model that captures the disparities of faces within video frames of the same person by encoding the identity information in all frames into identity vectors and learning temporal embeddings from these vectors, thus identifying inconsistencies. The proposed model integrates a triplet loss for enhanced discrimination in learning temporal embeddings.
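The inner/outer identity-consistency idea reduces to comparing two embedding vectors. The toy below uses random prototype vectors in place of real network embeddings: in a genuine video the inner face and its outer context (hairline, ears, background) carry the same identity, while a face swap replaces only the inner region, so the two embeddings disagree. All vectors here are synthetic placeholders, not outputs of any published model.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(u, v):
    """Cosine similarity between two identity embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

id_a = rng.normal(size=128)            # identity A prototype
id_b = rng.normal(size=128)            # identity B prototype
noise = lambda: rng.normal(scale=0.1, size=128)  # per-frame jitter

# Genuine video: inner and outer regions embed to the same identity
genuine = cosine(id_a + noise(), id_a + noise())
# Face swap: inner face from B pasted into A's outer context
swapped = cosine(id_b + noise(), id_a + noise())

print(f"genuine consistency: {genuine:.2f}")
print(f"swapped consistency: {swapped:.2f}")
```

Thresholding this similarity is the simplest possible decision rule; the models above instead learn the embeddings and the decision jointly, with consistency or triplet losses shaping the embedding space.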

Detection Based on Facial Emotions
Despite the fact that deepfakes can produce convincing audio and video, it can be difficult to produce material that maintains coherence concerning high-level semantics, including emotions. Unnatural displays of emotion, as determined by characteristics like valence and arousal, where arousal indicates either heightened excitement or tranquility and valence represents the positivity or negativity of the emotional state, can offer compelling proof that a video has been artificially created. Using the emotion inferred from the face and voice of the speaker, Hosler et al. [74] introduced an approach for identifying deepfakes. The suggested method makes use of long short-term memory networks to infer emotion from low-level audio and visual descriptors; a supervised classifier is then incorporated to categorize videos as real or fake using the predicted emotion. Leveraging the same high-level features, Conti et al. [75] focused on identifying deepfake speech tracks created using text-to-speech (TTS) algorithms that manipulate the emotional tone of the voice content. To extract emotional features, a Speech Emotion Recognition network trained on a speech dataset labeled with the speaker's emotional expression is employed, alongside a supervised classifier that receives the emotional features as input and predicts the authenticity of the provided speech track as either genuine or deepfake.

Detection Based on Temporal Inconsistencies
To leverage temporal coherence to detect deepfakes, Zheng et al. [76] proposed an approach that reduces the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size constant, using a fully temporal convolution network in addition to a Transformer network that explores long-term temporal coherence. Pei et al. [77] exploited the temporal information in videos by incorporating a Bidirectional LSTM model. Gu et al. [78] proposed a Region-Aware Temporal Filter module that generates temporal filters for distinct spatial areas by breaking down the dynamic temporal kernel into fundamental, region-independent filters. Additionally, region-specific aggregation weights are introduced to steer these regions in adaptively acquiring knowledge of temporal incongruities. The input video is split into multiple snippets to cover the long-term temporal dynamics. Inspired by how humans detect fake media through browsing and scrutinizing, Ru et al. [79] presented a model dubbed Bita-Net, which consists of two pathways: one that checks temporal consistency by rapidly scanning the entire video, and a second pathway, enhanced by an attention branch, that analyzes key frames of the video at a lower rate.
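The temporal cue these models learn can be caricatured with a single statistic. Genuine video changes smoothly between adjacent frames, while frame-by-frame face synthesis often introduces flicker; below, each "frame" is reduced to a scalar brightness of the face region, and the mean absolute difference between consecutive frames separates a smooth sequence from one with injected jumps. This is a purely synthetic illustration of the intuition, not any published detector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # number of frames

# Genuine: slowly drifting brightness (random walk with small steps)
smooth = np.cumsum(rng.normal(0, 0.01, n))

# Manipulated: same drift plus occasional per-frame jumps (flicker)
jumps = rng.choice([0.0, 0.5], size=n, p=[0.95, 0.05])
flicker = smooth + jumps * rng.normal(size=n)

def jitter_score(frames):
    """Mean absolute difference between adjacent frames."""
    return float(np.mean(np.abs(np.diff(frames))))

print("genuine :", round(jitter_score(smooth), 4))
print("flicker :", round(jitter_score(flicker), 4))
```

Learned models replace this hand-crafted score with temporal convolutions, LSTMs, or attention over frame features, but the quantity being exploited, discontinuity between adjacent frames, is the same.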

Detection Based on Spatial-Temporal Features
The forced blending of the manipulated face in the generation process of deepfakes causes spatial distortions and temporal inconsistencies in crucial facial regions, which Sun et al. [80] proposed to reveal by extracting the displacement trajectory of the facial region. For the purpose of detecting fake trajectories, a fake trajectory detection network, utilizing a gated recurrent unit backbone in conjunction with a dual-stream spatial-temporal graph attention mechanism, is created. In order to detect the spatial-temporal abnormalities in the altered video trajectory, the network makes use of the extracted trajectory and explicitly integrates the important data from the input sequences. Lu et al. [81] proposed a detection method based on an improved Capsule Network and the fusion of temporal-spatial features. The optical flow algorithm effectively captures the temporal characteristics of manipulated videos, and the improved Capsule Network reaches a thorough conclusion by considering temporal-spatial features, using weight initialization and updating based on a dynamic routing algorithm. Meanwhile, Waseem et al. [82] described a dual-stream convolutional neural network strategy incorporating XceptionNet and a 3D CNN to capture spatial irregularities and temporal variations. Initially, MTCNN is employed for face detection and extraction from the input video frames. Subsequently, the 3D CNN and XceptionNet are utilized to extract features from the facial images. Finally, fully connected layers and sigmoid layers determine the authenticity of the video.

Deep Learning Models for Deepfake Detection
Several advanced technologies have been employed in the domain of deepfake detection, such as machine learning [83][84][85] and media-forensics-based approaches [86]. However, it is widely acknowledged that deep learning-based models currently exhibit the most remarkable performance in discerning between fabricated and authentic digital media. These models leverage sophisticated neural network architectures known as backbone networks, displayed in Figure 3, which have demonstrated exceptional efficacy in computer vision tasks. Prominent examples of such architectures include VGG [87], EfficientNet [88], Inception [89], CapsNet [90], and ViT [91], which are particularly renowned for their prowess in the feature extraction phase. Deep learning-based detection models go beyond conventional methods by incorporating additional techniques to further enhance their performance. One such approach is meta-learning, which enables the model to learn from previous experiences and adapt its detection capabilities accordingly. By leveraging meta-learning, these models become more proficient at recognizing patterns and distinguishing between genuine and manipulated content.
Furthermore, data augmentation plays a crucial role in training deep learning-based detection models. This technique involves augmenting the training dataset with synthetic or modified samples, which enhances the model's capacity to generalize and recognize diverse variations of deepfake media. Data augmentation enables the model to learn from a wider range of examples and improves its robustness against different types of manipulations. Attention mechanisms have also proven to be valuable additions to deep learning-based detection models. By directing the model's focus toward relevant features and regions of the input data, attention mechanisms enhance the model's discriminative power and improve its overall accuracy. These mechanisms help the model select critical details [92], making it more effective in distinguishing between real and fake media. Collectively, the combination of deep learning-based architectures, meta-learning, data augmentation, and attention mechanisms has significantly advanced the field of deepfake detection. These technologies work in harmony to equip models with the ability to identify and flag manipulated media with unprecedented accuracy.
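The attention operation referred to above is, at its core, scaled dot-product attention: each query re-weights a set of value vectors according to its similarity with the keys, which is what lets a detector concentrate on the most informative facial regions or frames. A minimal NumPy sketch, with illustrative shapes and random data:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # e.g., 4 query tokens (face patches)
K = rng.normal(size=(6, 8))   # 6 key tokens
V = rng.normal(size=(6, 8))   # 6 value vectors

out, w = attention(Q, K, V)
print(out.shape, w.shape)
```

In a transformer-style detector, Q, K, and V are learned projections of patch or frame features, and the attention weights themselves are often inspected to see which regions the model found suspicious.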
The Convolutional Neural Network (CNN) is a powerful deep learning algorithm designed for image recognition and processing tasks. It consists of various levels, encompassing convolutional layers, pooling layers, and fully connected layers. Different types of CNN models are used in deepfake detection, such as ResNet [93], short for Residual Network, an architecture that introduces skip connections to mitigate the vanishing gradient problem, in which the gradient diminishes significantly during backpropagation. These connections add identity mappings that skip over stacked layers, reusing a layer's prior activations, which speeds up early training by reducing the effective depth the gradient must traverse. The underlying concept differs from having the layers learn the full mapping directly: rather than forcing the stacked layers to fit the desired mapping H(x), the network is left to learn the residual and derive H(x) from it, as shown in Figure 4.
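The skip-connection idea can be sketched in a few lines of plain Python (a hypothetical one-layer residual function F, not ResNet's actual convolutional blocks): the block outputs F(x) + x, so when the weights are zero it defaults to the identity.

```python
def residual_block(x, weights, bias):
    """Compute y = F(x) + x, where F is one linear layer with ReLU.
    The identity shortcut means the layer only has to learn the
    residual F(x) = H(x) - x rather than the full mapping H(x)."""
    fx = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(weights, bias)]
    return [f + xi for f, xi in zip(fx, x)]  # add the skip connection
```

With all-zero weights, F(x) = 0 and the block passes its input through unchanged, which is why very deep stacks of such blocks remain trainable.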
Another architecture based on CNNs is VGG [87], short for Visual Geometry Group, which comprises multiple layers. Instead of using large kernel-sized filters, this architecture utilizes multiple filters with a kernel size of 3 × 3. The VGG16 architecture employs a doubling of filters at each convolutional layer, a fundamental design principle. However, a notable drawback of the VGG16 network is its substantial size, resulting in extended training times due to its depth and numerous fully connected layers. The model's file size exceeds 533 MB, rendering the implementation of a VGG network a time-intensive endeavor.
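A quick back-of-the-envelope calculation shows why stacked 3 × 3 filters are attractive: two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution, but with fewer weights (bias terms ignored in this sketch):

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution layer (no bias)."""
    return k * k * c_in * c_out

c = 64  # example channel count
stacked_3x3 = 2 * conv_weights(3, c, c)  # two 3x3 layers: 73,728 weights
single_5x5 = conv_weights(5, c, c)       # one 5x5 layer: 102,400 weights
```

The stacked version is cheaper and inserts an extra non-linearity between the two layers, which is the core VGG design insight.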
An additional significant CNN-based architecture in deepfake detection models is EfficientNet [88]. It has a scaling method that applies a uniform scaling approach to all dimensions of depth, width, and resolution. This is achieved by utilizing a compound coefficient. In Figure 5, the performance of EfficientNet is presented alongside other network architectures. The largest model within the EfficientNet series, EfficientNet B7, achieved remarkable results on both the ImageNet and CIFAR-100 datasets. Specifically, it achieved approximately 84.4% in top-1 accuracy and 97.3% in top-5 accuracy on the ImageNet dataset. Furthermore, this model was not only significantly more compact, being 8.4 times smaller, but also notably faster, with a speedup of 6.1 times compared to the prior leading CNN model. Additionally, it exhibited strong performance with 91.7% accuracy on the CIFAR-100 dataset and an impressive 98.8% accuracy on the Flowers dataset.
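The compound coefficient can be illustrated as follows, using the α, β, γ constants reported for the EfficientNet baseline; the function name itself is illustrative:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and resolution together by one compound
    coefficient phi: d = alpha**phi, w = beta**phi, r = gamma**phi,
    under the constraint alpha * beta**2 * gamma**2 ~ 2, so each
    increment of phi roughly doubles the network's FLOPs."""
    return alpha ** phi, beta ** phi, gamma ** phi
```

Because a convolution's cost grows linearly with depth but quadratically with width and resolution, the squared terms in the constraint keep the FLOPs budget predictable as phi increases.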
Inception [89] models help mitigate the computational cost and overfitting problems of CNN architectures by utilizing stacked 1 × 1 convolutions for dimensionality reduction. Xception [94], developed by researchers at Google, is an advanced version of the Inception architecture. It offers a novel approach by reinterpreting Inception modules as an intermediate step between standard convolution and depthwise separable convolution. While the conventional convolution operation combines channel-wise and spatial-wise computations in a single step, depthwise separable convolution divides this process into two distinct steps. Firstly, it employs depthwise convolution to apply an individual convolutional filter to each input channel, and subsequently, pointwise convolution is employed to create a linear combination of the results obtained from the depthwise convolution.
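The saving from this two-step factorization is easy to quantify. Per output position, a standard k × k convolution costs k²·Cin·Cout multiplications, while the depthwise-plus-pointwise pair costs k²·Cin + Cin·Cout (a rough sketch that ignores biases and strides):

```python
def standard_conv_cost(k, c_in, c_out):
    """Multiplications per output position for a standard convolution."""
    return k * k * c_in * c_out

def separable_conv_cost(k, c_in, c_out):
    """Depthwise (one k x k filter per channel) plus pointwise (1 x 1)."""
    return k * k * c_in + c_in * c_out

# e.g. a 3x3 layer mapping 128 channels to 128 channels:
# standard  -> 9 * 128 * 128       = 147,456
# separable -> 9 * 128 + 128 * 128 =  17,536
```

For this example configuration the separable form needs roughly an eighth of the multiplications, which is why Xception can afford deeper stacks at the same budget.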
An alternative to CNNs would be Capsule Networks [90], which are able to retrieve spatial information as well as other important details to avoid the information loss seen during pooling operations. Capsules exhibit equivariance characteristics and consist of a neural network that handles vectors as inputs and outputs, in contrast to the scalar values processed by CNNs. This unique attribute of capsules enables them to capture not only the features of an image, but also its deformations and various viewing conditions. Within a capsule network, each capsule comprises a cluster of neurons, with each neuron's output signifying a distinct attribute of the same feature. This structure offers the advantage of recognizing the entire entity by first identifying its constituent parts.
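Because a capsule outputs a vector, its non-linearity must act on the vector's length rather than on individual scalars. A common choice is the "squash" function from the original capsule work, sketched here in plain Python:

```python
import math

def squash(v, eps=1e-9):
    """Shrink vector v so that its direction is preserved but its
    length lies in [0, 1); the length can then act as the probability
    that the entity the capsule represents is present."""
    norm_sq = sum(x * x for x in v)
    norm = math.sqrt(norm_sq) + eps
    scale = norm_sq / (1.0 + norm_sq)
    return [scale * x / norm for x in v]
```

Short vectors are squashed toward zero length while long vectors saturate just below one, so the capsule's orientation encodes pose attributes and its magnitude encodes existence.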
Recurrent Neural Networks (RNNs) are a kind of neural network that processes sequential data one step at a time. They are specifically designed to handle time-series data, where the input is a sequence of data points. In an RNN, the input not only includes the current data point but also the previous ones. This creates a directed graph structure between the nodes, following the temporal sequence of the data. Additionally, each neuron in an RNN has its own internal memory, which retains information from the computations performed on the previous data points. LSTM, or Long Short-Term Memory, is a specific type of recurrent neural network that addresses the challenge of long-term dependencies in sequential data. While traditional RNNs struggle as the gap between relevant pieces of information increases, LSTM networks excel at retaining information over extended periods. This capability makes LSTM particularly effective for processing, predicting, and classifying time-series data.
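The recurrence at the heart of a vanilla RNN is compact enough to sketch directly; each hidden state depends on both the current input and the previous hidden state (a minimal single-step illustration, not a full network):

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One time step of a vanilla RNN:
    h_t = tanh(W_x @ x_t + W_h @ h_prev + b).
    The hidden state h carries a summary of earlier time steps."""
    h_t = []
    for row_x, row_h, b_i in zip(w_x, w_h, b):
        s = (sum(w * x for w, x in zip(row_x, x_t))
             + sum(w * h for w, h in zip(row_h, h_prev))
             + b_i)
        h_t.append(math.tanh(s))
    return h_t
```

An LSTM replaces this single tanh update with gated updates (input, forget, and output gates), which is what lets information persist across much longer gaps in the sequence.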
A newer model that has emerged as a strong alternative to convolutional neural networks is the Vision Transformer (ViT) [91]. ViT models exhibit exceptional performance, reported to outperform state-of-the-art CNNs with nearly four times better computational efficiency at comparable or higher accuracy. Transformers, which are non-sequential deep learning models, play a significant role in vision transformers. They utilize the self-attention mechanism, assigning varying degrees of importance to different segments of the input data. The Swin Transformer [95] is a type of ViT that exhibits versatility in modeling at different scales and maintains linear computational complexity with respect to image size. This advantageous combination of features makes the Swin Transformer well suited for a wide array of vision tasks, encompassing image classification, object detection, and semantic segmentation, among others. Another variant is the Video Transformer Network (VTN) [96], which is efficient for evaluating videos at scale, ensuring optimal utilization of computational resources and reduced wall-clock runtime. This capability enables full video processing at test time, making VTNs particularly well suited for handling lengthy videos. Table 2 shows some of the recent detection techniques. Abbreviations: FaceForensics++ (FF++), DeepFakes (DF), Face2Face (F2F), FaceSwap (FS), NeuralTextures (NT), DeeperForensics-1.0 (DFR1.0).
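The self-attention operation described above can be written out in a few lines; this sketch implements scaled dot-product attention over lists of vectors (a single head with no learned projections, for illustration only):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention: every output vector is a
    softmax-weighted mix of the value vectors, with weights
    softmax(q . k / sqrt(d)) -- the mechanism that lets a transformer
    assign different importance to different parts of its input."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

When all keys look identical the weights are uniform and the output is simply the mean of the values; distinctive keys pull the output toward their associated values, which is how relevant regions come to dominate.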

Datasets
In the context of deepfakes, datasets serve as the foundation for training, testing, and benchmarking deep learning models. The accessibility of reliable and diverse datasets plays a crucial role in the development and evaluation of deepfake techniques. A variety of important datasets, summarized in Table 3, have been curated specifically for deepfake research, each addressing different aspects of the problem and contributing to the advancement of the field. Figure 6 shows some of the datasets most widely used in developing deepfake detection models.
Computers 2023, 12, x FOR PEER REVIEW 17 of 27

Table 3. Key characteristics of the most prominent and publicly available deepfake datasets.
DFDC [123], the deepfake detection challenge dataset hosted by Facebook, stands as the most extensive openly accessible collection of face-swap videos. It contains over 100,000 total clips sourced from 3426 paid actors from diverse backgrounds, including different genders, ages, and ethnic groups.
DeeperForensics-1.0 [126] is a significant dataset available for detecting deepfakes that contains 50,000 original clips and 10,000 forged ones. These manipulated videos were generated using a conditional autoencoder called DF-VAE. The dataset includes a broad range of actor appearances and is designed to represent real-world scenarios more accurately by including a blend of alterations and disturbances, including compression, blurriness, noise, and other visual anomalies.
WildDeepfake [127] is a dataset that is widely recognized as a difficult one for deepfake detection. It features both authentic and deepfake samples obtained from the internet, which distinguishes it from other available datasets. While previous datasets have only included synthesized facial images, this dataset includes a variety of body types. However, there remains a need for a more comprehensive dataset that can generate full-body deepfakes to improve the robustness of deepfake detection models.
The Celeb-DF [122] dataset is a collection of authentic and synthesized deepfake videos that are visually similar in quality to those commonly shared online. This dataset represents a significant expansion of its first version, which contained only 795 deepfake videos. Celeb-DF comprises 590 unaltered videos sourced from YouTube, featuring individuals of varying ages, ethnicities, and genders, along with 5639 associated deepfake videos, all of which were created using readily accessible YouTube excerpts featuring 59 famous personalities from diverse backgrounds. The deepfake videos were generated using an advanced synthesis method, resulting in more realistic and convincing deepfakes.
Finding fake faces among numerous genuine faces in scenes captured in the wild is a significant difficulty. The OpenForensics [128] dataset was specifically created with face-wise rich annotations for the detection and segmentation of face forgeries. The OpenForensics dataset has considerable potential for research in generic human face detection and deepfake prevention because of its extensive annotations. A total of 334 K human faces are depicted across 115 K photos in version 1.0.0. This collection includes numerous individuals with different origins, ages, genders, poses, positions, and face occlusions.
FakeAVCeleb [133] is a multimodal deepfake detection dataset that includes deepfake videos and cloned deepfake audio. It features diverse celebrities in terms of ethnicity, age, and gender balance. The dataset was evaluated using 11 different deepfake detection methods, including unimodal, ensemble-based, and multimodal approaches. To create the deepfake videos, 500 real videos were used as sources to generate around 20,000 deepfake videos using various techniques such as face-swapping and facial reenactment.
DeepfakeTIMIT [135] is a dataset containing 620 videos where faces were swapped using GAN-based techniques. It was created by selecting 16 pairs of similar-looking individuals from the VidTIMIT database, with two quality levels for each pair (64 × 64 and 128 × 128). The original audio tracks were retained without any alterations.
The UADFV [136] dataset includes 98 videos, totaling 32,752 frames, evenly split between 49 real videos and 49 fake ones. Each video features a single subject and lasts around 11 s; each of the 49 original real videos was manipulated to generate a corresponding deepfake video.
DFD [137], or DeepFakeDetection, is a dataset created by Google and Jigsaw that encompasses a wide range of scenes, consisting of more than 363 genuine sequences featuring 28 paid actors across 16 different scenes. Additionally, it includes over 3000 manipulated videos.
HiFiFace [138] is a dataset that contains 1000 fake videos from FaceForensics++, meticulously adhering to the source and target pair configurations defined in FF++. Additionally, it includes 10,000 frames extracted from FF++ videos, facilitating quantitative testing.
KoDF [129] is an extensive compilation of synthesized and authentic videos primarily centered around Korean subjects. Its primary objective is to support the advancement of deepfake detection methods. This dataset comprises 62,166 authentic videos and 175,776 fake videos, featuring 403 different subjects.
One of the challenges faced by researchers in the field of deepfakes is the lack of comprehensive and diverse datasets for deepfake detection. Existing datasets either have limited diversity, meaning they do not cover a wide range of scenarios and variations, or only focus on basic forgery detection without capturing the intricacies and subtleties of advanced deepfakes. To address this problem and push the boundaries of deepfake detection, researchers and technology companies have taken up the task of constructing several benchmarks. These benchmarks serve as standardized datasets that encompass a broad range of facial variations, lighting conditions, camera angles, and other relevant factors. By including diverse samples, these benchmarks enable researchers to develop and evaluate advanced algorithms and techniques for detecting and analyzing deepfakes more effectively. To mention a few, ForgeryNet [139] is an extremely large deepfake benchmark with consistent annotations in both image and video data for four distinct tasks: Image Forgery Classification, Spatial Forgery Localization, Video Forgery Classification, and Temporal Forgery Localization. It consists of 2.9 million images and 221,247 videos, covering 15 manipulation methods.
Owing to their predominant focus on a single modality and limited coverage of forgery methods, current deepfake detection datasets are particularly constrained when it comes to audio-visual deepfakes. DefakeAVMiT [62] is a dataset that includes an ample amount of deepfake visuals paired with corresponding audios, generated by various deepfake methods affecting either modality. Alternatively, LAV-DF [61] consists of content-driven manipulations to support the detection of content-altering fake segments in videos, addressing the lack of suitable datasets for this task. It is important to note that the availability and creation of datasets are ongoing processes, with new datasets being introduced and existing ones being expanded or refined over time. The continuous development of diverse and representative datasets is crucial to ensure the robustness and generalizability of deepfake detection algorithms, as well as to keep up with the evolving techniques employed by malicious actors.

Challenges and Future Directions
Although deepfake detection has improved significantly, there are still a number of problems with the current detection algorithms that need to be resolved. The most significant challenge would be real-time detection of deepfakes and the implementation of detection models in diverse sectors and across multiple platforms. This challenge is difficult to surmount due to several complexities, such as the computational power needed to detect deepfakes in real time, considering the massive amount of data shared every second on the internet, and the necessity that these detection models be effective with almost no instances of false positives. To attain this objective, one can leverage advanced learning techniques, such as meta-learning and metric learning, employ efficient architectures like transformers, apply compression techniques such as quantization, and make strategic investments in robust software and hardware infrastructure foundations.
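Of these compression techniques, quantization is simple enough to sketch: weights are mapped to low-precision integers and scaled back at inference time, cutting memory and compute. Below is a symmetric linear scheme in plain Python; the function names are illustrative, and production systems would use a framework's quantization tooling instead:

```python
def quantize(xs, num_bits=8):
    """Symmetric linear quantization of floats to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = max(abs(x) for x in xs) / qmax or 1.0
    return [max(-qmax, min(qmax, round(x / scale))) for x in xs], scale

def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [q_i * scale for q_i in q]
```

Storing 8-bit codes instead of 32-bit floats shrinks a model roughly fourfold, at the cost of a small, bounded rounding error per weight.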
In addition, detection methods encounter challenges intrinsic to deep learning, including concerns about generalization and robustness. Deepfake content frequently circulates across social media platforms after undergoing significant alterations like compression and the addition of noise. Consequently, employing detection models in real-world scenarios might yield limited effectiveness. To address this problem, several approaches have been explored to strengthen the generalization and robustness of detection models, such as feature restoration, attention-guided modules, adversarial learning, and data augmentation. Additionally, when it comes to deepfakes, the lack of interpretability of deep learning models becomes particularly problematic, making it challenging to directly grasp how they arrive at their decisions. This lack of transparency can be concerning, especially in critical applications, such as forensics, where understanding the reasoning behind a model's output is important for accountability, trust, and safety. Furthermore, since private data access may be necessary, detection methods raise privacy issues.
The quality of the deepfake datasets is yet another prominent challenge in deepfake detection. The development of deepfake detection techniques is made possible by the availability of large-scale datasets of deepfakes. The content in the available datasets, however, has some noticeable visual differences from the deepfakes that are actually being shared online. Researchers and technology companies such as Google and Facebook constantly put forth datasets and benchmarks to improve the field of deepfake detection. A further threat faced by detection models is adversarial perturbations that can successfully deceive deepfake detectors. These perturbations are strategically designed to exploit vulnerabilities or weaknesses in the underlying algorithms used by deepfake detectors. By introducing subtle modifications to the visual or audio components of a deepfake, adversarial perturbations can effectively trick the detectors into misclassifying the manipulated media as real.
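The mechanics of such an attack can be illustrated with a toy linear "detector" and an FGSM-style perturbation (all names and numbers here are hypothetical; real attacks target deep networks through their gradients):

```python
def score(x, w, b):
    """Toy linear detector: score > 0 means 'classified as fake'."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def fgsm_perturb(x, w, epsilon):
    """Shift every input feature by epsilon against the sign of the
    score gradient (which, for a linear model, is simply sign(w)),
    pushing the sample toward the 'real' side of the boundary."""
    sign = [1.0 if w_i > 0 else -1.0 if w_i < 0 else 0.0 for w_i in w]
    return [x_i - epsilon * s for x_i, s in zip(x, sign)]
```

A sample sitting just past the decision boundary can be flipped with a perturbation so small it is visually negligible, which is exactly why such attacks are hard to guard against.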
Deepfake detection algorithms, although crucial, cannot be considered the be-all and end-all solution in the ongoing battle against the threat they pose. Recognizing this, numerous approaches have emerged within the field of deepfakes that aim to not only identify these manipulated media but also provide effective means to mitigate and defend against them. These multifaceted approaches serve the purpose of not only detecting deepfakes but also hindering their creation and curbing their rapid dissemination across various platforms. One prominent avenue of exploration in combating deepfakes involves the incorporation of adversarial perturbations to obstruct the creation of deepfakes. An alternative method involves employing digital watermarking, which discreetly embeds data or signatures within digital content to safeguard its integrity and authenticity. Additionally, blockchain technology offers a similar solution by generating a digital signature for the content and storing it on the blockchain, enabling the verification of any alterations or manipulations to the content.
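At its core, the blockchain approach reduces to content fingerprinting: record a cryptographic digest of the media at publication time, then recompute and compare it later. A minimal sketch using SHA-256 (the storage and signing of the digest, whether on a blockchain or in a signed registry, is out of scope here):

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """SHA-256 digest of the raw media bytes; this is the value that
    would be stored on a blockchain or in a registry at publication."""
    return hashlib.sha256(content).hexdigest()

def verify(content: bytes, stored_digest: str) -> bool:
    """Any alteration of the bytes changes the digest, exposing tampering."""
    return fingerprint(content) == stored_digest
```

Even a one-byte change to the media yields a completely different digest, so verification fails for any manipulated copy while succeeding for the untouched original.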
Moreover, increasing public knowledge of the existence and potential risks linked with deepfakes is essential. Education and media literacy initiatives can educate users on how to critically evaluate digital content, recognize signs of manipulation, and verify the authenticity of media before sharing or believing its content. By empowering individuals to be more discerning consumers of information, the impact of deepfakes can be mitigated. Lastly, governments and policymakers are working to develop regulations and laws that address the misuse of deepfake technology. These policies and legislative measures aim to prevent the creation and dissemination of malicious deepfakes, establish liability frameworks for their creation and distribution, and protect individuals' rights and privacy.

Conclusions
In conclusion, deepfake videos will become harder to detect as AI algorithms grow more sophisticated. This survey paper has provided a comprehensive overview encompassing the realm of deepfake generation, the spectrum of deep learning architectures employed in detection, the latest advances in detection techniques, and the pivotal datasets tailored to advance this field of study, all in order to stay one step ahead in the race with generative artificial intelligence, curb the spread of false information, safeguard the integrity of digital content, and stop the damage that deepfakes can cause on a social, political, and economic level. The survey has also highlighted the importance of continued research and development in deepfake detection techniques. Despite the issues presented by deepfakes, this technology shows potential for artistic uses in virtual communication, entertainment, and visual effects. Future work must continue to focus on finding a balance between utilizing deepfakes' beneficial potential and reducing their negative effects.

Figure 1 .
Figure 1. The five principal categories of deepfake manipulation.

Figure 2 .
Figure 2. Clues and features employed by deepfake detection models in the identification of deepfake content.


Figure 3 .
Figure 3. Overview of predominant deep learning architectures, networks, and frameworks employed in the development of deepfake detection models.


Figure 6 .
Figure 6. Frequency of usage of different deepfake datasets across the detection models discussed in this survey.


Table 1 .
An analysis of several CNN- and RNN-based deepfake detection models.

Table 2 .
Summary of recent deepfake detection models, employed techniques, feature sets, datasets, and intra-dataset performance results.