1. Introduction
Deepfake technology leverages recent advances in artificial intelligence (AI) and deep learning to produce hyper-realistic synthetic media such as videos, images, and audio [
1,
2].
Such manipulated content can portray people doing or saying things that never took place. Sophisticated algorithms, including generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based models, give the technology an unprecedented level of realism in facial expression manipulation, voice synthesis, and scene generation [
3].
Malik et al. comprehensively survey deepfake detection methods for face images and videos. They classify deepfake generation techniques into five categories and detection approaches into traditional forensic methods and deep learning-based techniques, including pixel-level, DNN, and artifact-based analyses. The survey identifies the key datasets (UADFV, FaceForensics++, and Celeb-DF) and major challenges such as limited data, unseen attacks, and temporal inconsistencies, and stresses the need for robust, generalizable models to counter emerging deepfakes [
4].
Since their emergence in 2017, deepfakes have evolved rapidly, fuelled by accessible open-source tools, vast amounts of available online data, and growing computational power. Despite potential benefits in entertainment, education, and creative technology, deepfakes pose immense harm to social cohesion and democratic systems by eroding trust and amplifying misinformation [
5,
6].
This accessibility has brought disinformation, financial fraud, identity theft, and non-consensual content generation to the fore, considerably compromising privacy, trust, and information integrity [
7,
8].
These challenges highlight the urgency of robust detection systems, effective regulatory mitigation, and improved media literacy to counter the risks of deepfakes [
9,
10].
This review analyzes the growing menace of AI-generated deepfakes, which undermine confidence in online media and endanger individual privacy. It surveys technical and non-technical preventive measures, covering current detection techniques and regulatory frameworks that can help mitigate these risks [
11,
12].
The evolution of deepfake technology was studied by Ruben Tolosana et al. from the Biometrics and Data Pattern Analytics (BiDA) Lab, who concentrated on facial-region analysis and detection-system performance across first- and second-generation deepfake datasets. The study emphasizes how deepfakes are becoming more realistic and more challenging to identify [
13].
The rapid development of deepfake generation and detection techniques has been systematically reported in recent, extensive surveys. These reviews note the growing complexity of both synthesis algorithms and defense mechanisms, especially in the newest generation models built on advanced generative adversarial networks and transformers [
14,
15]. The creation of challenging benchmark datasets such as Celeb-DF has enabled stronger testing of detection methods under a variety of manipulation conditions, augmenting prior benchmarks with high-quality celebrity deepfakes that better simulate real-world manipulation [
16].
Dolhansky et al. developed the DeepFake Detection Challenge (DFDC) dataset, a large-scale, ethically sourced collection of 128,154 videos made with over 3400 consenting actors and eight deepfake algorithms. Designed to overcome the shortcomings of previous datasets, DFDC offers varied, high-quality material and acts as a uniform standard for assessing detection techniques. The challenge's top-performing models, such as EfficientNet and XceptionNet, showed strong generalization to real-world deepfakes [
17].
The Anti-Deepfake Transformer (ADT) is a vision transformer framework that addresses the generalization limitations of CNN-based deepfake detectors. ADT leverages variant residual connections (VRC) in four trans-blocks, an attention-leading module (ALM), and a multi-forensics module (MFM) to capture both global and regional information. Trained with token-level contrastive loss on FaceForensics++, ADT achieves state-of-the-art cross-dataset performance (84.97% AUC on Celeb-DF) and competitive intra-dataset results (96.30% AUC on FF++ HQ) [
18].
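The exact token-level contrastive objective used by ADT is specific to that paper; the sketch below shows a generic supervised contrastive loss over token embeddings as a plausible stand-in under assumed shapes, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(tokens, labels, temperature=0.1):
    """Generic supervised contrastive loss: pulls together token embeddings
    sharing a label (real/fake) and pushes apart the rest. Illustrative only."""
    z = F.normalize(tokens, dim=-1)                      # (N, D) unit vectors
    sim = z @ z.T / temperature                          # pairwise similarities
    mask = labels[:, None].eq(labels[None, :]).float()   # positive-pair mask
    mask.fill_diagonal_(0)                               # exclude self-pairs
    # Mask self-similarity out of the denominator before the log-softmax.
    log_prob = sim - torch.logsumexp(sim + torch.eye(len(z)) * -1e9, dim=1, keepdim=True)
    return -(mask * log_prob).sum(1).div(mask.sum(1).clamp(min=1)).mean()

# Toy usage: 8 token embeddings of dimension 128 with binary labels.
loss = contrastive_loss(torch.randn(8, 128), torch.tensor([0, 0, 1, 1, 0, 1, 0, 1]))
print(loss.item())
```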
In a 2022 Applied Sciences publication, Khormali and Yuan introduce DFDT, a vision transformer-based deepfake detection framework. DFDT overcomes the receptive field limitations of CNNs by modeling both local and global pixel relationships across various forgery scales. The framework uses patch extraction and embedding, a multi-stream transformer block, attention-based patch selection, and a multi-scale classifier. It incorporates a re-attention mechanism for improved scalability and to prevent attention collapse. DFDT achieves high accuracy on FaceForensics++ (99.41%), Celeb-DF V2 (99.31%), and WildDeepfake (81.35%), demonstrating strong cross-dataset and cross-manipulation generalization [
19].
Nirkin et al. propose a face-swapping detection approach that exploits discrepancies between the manipulated face and its surrounding context, using two XceptionNet-based networks: one for face identification and one for context recognition. The approach outperforms traditional classifiers, achieving state-of-the-art results on FaceForensics++ and Celeb-DF while generalizing effectively to unseen manipulation methods [
20].
Guo et al. introduce Space-Frequency Interactive Convolution (SFIConv), which replaces standard convolution layers in backbone networks to improve deepfake detection. SFIConv incorporates Multichannel Constrained Separable Convolution (MCSConv) to jointly capture spatial and high-frequency manipulation traces. Experiments on HFF, FF++, DFDC, and Celeb-DF show improved accuracy at lower computational cost. However, its performance drops on unseen manipulation methods, highlighting persistent generalization challenges [
21].
Wang et al. propose a deep convolutional Transformer model that uses convolutional pooling and re-attention techniques to aggregate local and global image information for deepfake detection. On benchmarks such as FaceForensics++ and Celeb-DF, the model outperforms conventional CNN-based baselines. By introducing keyframe extraction to preserve high-resolution information and mitigate video-compression loss, the method reaches up to 97.69% AUC on FF++ in both within- and cross-dataset evaluations [
22].
Zhao et al. propose a multi-attentional deepfake detection framework that attends to manipulation artifacts in a scale-invariant way. Their method uses a spatial attention module to focus on susceptible facial regions, while channel attention mechanisms emphasize discriminative manipulation-artifact features. The multi-scale architecture detects both local artifacts (blending boundaries and facial-feature inconsistencies) and global artifacts (such as lighting anomalies and temporal lapses). Evaluated on FaceForensics++ and Celeb-DF, the method shows strong results under various manipulation conditions, achieving competitive accuracy without sacrificing computational performance and making it suitable for practical use. This work exemplifies the trend toward attention-based architectures that explicitly model spatial and semantic relationships for deepfake detection [
23].
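A minimal sketch of the spatial- and channel-attention building blocks that this line of work relies on is given below; these are generic SE/CBAM-style modules used as stand-ins, not Zhao et al.'s exact architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: reweights feature
    channels that carry discriminative manipulation artifacts."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))         # global average pool -> (B, C)
        return x * w[:, :, None, None]

class SpatialAttention(nn.Module):
    """Highlights face regions (e.g., blending boundaries) likely manipulated."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

feat = torch.randn(2, 64, 56, 56)
out = SpatialAttention()(ChannelAttention(64)(feat))
print(out.shape)  # torch.Size([2, 64, 56, 56])
```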
Yang et al. [
24] introduce AVoiD-DF, a multi-modal deepfake detection framework that learns audio-visual inconsistencies. AVoiD-DF outperforms uni-modal methods by using a two-stream Temporal-Spatial Encoder (TSE), a Multi-Modal Joint Decoder (MMD) with bi-directional cross-attention, and a cross-modal classifier to fuse features. It achieves state-of-the-art accuracy on DefakeAVMIT (91.2%), FakeAVCeleb (92.3%), and DFDC (91.4%), exceeding Xception, LipForensics, and CViT. To support further research, they present DefakeAVMIT, a dataset of 6480 audio-visual deepfake samples. AVoiD-DF exhibits strong cross-dataset generalization, effectively detecting unseen forgeries. The improved performance demonstrates the importance of audio-visual synchrony analysis, especially for detecting temporal misalignment, phoneme-viseme mismatch, prosodic inconsistency, and lip-sync degradation (a 6–8% improvement over visual-only methods) [
24].
A further study presents a CNN-based multi-color spatio-temporal method for deepfake detection. By analyzing facial color inconsistencies and temporal artifacts, it achieves high AUC scores on FaceForensics++, outperforming physics-based detectors in capturing flickering and boundary effects. However, its computational complexity and limited generalization, owing to sensitivity to advanced GAN-based deepfakes, highlight the need for better temporal continuity modeling [
25].
Figure 1 shows a comprehensive multimodal deepfake detection framework that processes visual and audio inputs in parallel to detect manipulated content. This architecture is fundamentally important because it overcomes a key limitation of unimodal detection methods, which analyze visual or audio features separately and therefore cannot capture the cross-modal inconsistencies that often reveal deepfake manipulation. The framework's power lies in identifying subtle discrepancies between audio and visual streams that are inherently characteristic of synthetic media yet hard for human observers to detect.
The framework has four interrelated components that work together to achieve robust detection. First, a spatio-temporal encoder processes the video frames to extract visual features, capturing both spatial inconsistencies (within single frames) and temporal inconsistencies (across frame sequences) that help identify manipulation artifacts occurring over time. Second, an audio feature extractor analyzes acoustic properties such as spectral features, prosodic features, and voice quality, which can be used to identify synthetic audio; a brief feature-extraction sketch follows this overview.
Third, the extracted features are fused through a joint decoder using cross-attention mechanisms to learn complex dependencies among modalities, for example in lip synchronization, detecting when lip movements are inconsistent with audio phonemes or when facial muscle dynamics diverge from voice production dynamics. Fourth, alignment and inconsistency detection modules pick up subtle discrepancies in the audio-visual signal that are hallmarks of synthetic manipulation, such as temporal misalignment or unnatural synchronization patterns.
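As a concrete illustration of the second component, the sketch below extracts a few spectral and prosodic descriptors with librosa. The specific feature set is an illustrative assumption, not the extractor used in the figure.

```python
import numpy as np
import librosa

def audio_features(path):
    """Spectral and prosodic descriptors of the kind an audio branch might use.
    The library calls are standard librosa functions; the feature combination
    is an assumption for illustration."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral envelope
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)             # pitch track (prosody)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral brightness
    return np.concatenate([mfcc.mean(1), [np.nanmean(f0)], centroid.mean(1)])
```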
The main innovation in this architecture is the cross-attention mechanism in the joint decoder, which helps the model find correlations between the video and audio streams that would be invisible to unimodal approaches. For example, genuine videos exhibit natural matching between facial muscle movements and vocalization, whereas deepfakes show temporal desynchronization or mismatches between sound and articulation because the visual and audio streams are generated independently.
Multimodal methods of this type, such as AVoiD-DF, can yield better results (91.2% on DefakeAVMIT, 92.3% on FakeAVCeleb) than unimodal methods thanks to the complementary information across modalities, which explains why integrated detection frameworks represent the current state of the art in deepfake detection.
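To make the fusion step concrete, the following minimal sketch implements bi-directional cross-attention between visual and audio feature sequences in PyTorch. Dimensions, pooling, and the classifier head are illustrative assumptions, not the AVoiD-DF implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bi-directional cross-attention fusion of visual and audio features,
    followed by a simple real/fake classifier. Illustrative sketch only."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video attends to audio
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio attends to video
        self.classifier = nn.Sequential(nn.LayerNorm(2 * dim), nn.Linear(2 * dim, 2))

    def forward(self, vis, aud):
        # vis: (B, Tv, dim) frame-level features; aud: (B, Ta, dim) audio features
        v_att, _ = self.v2a(vis, aud, aud)   # video enriched with audio context
        a_att, _ = self.a2v(aud, vis, vis)   # audio enriched with video context
        fused = torch.cat([v_att.mean(1), a_att.mean(1)], dim=-1)  # pool over time
        return self.classifier(fused)        # real/fake logits

# Toy usage: an 8-frame visual stream and a 20-step audio stream.
logits = CrossModalFusion()(torch.randn(2, 8, 256), torch.randn(2, 20, 256))
print(logits.shape)  # torch.Size([2, 2])
```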
Rossler et al. present FaceForensics++, a large-scale benchmark dataset and detection framework that has become the standard for assessing deepfake detection schemes. Their work defines systematic evaluation protocols across manipulation techniques (Deepfakes, Face2Face, FaceSwap, NeuralTextures) and compression levels, making it possible to compare detection methods fairly. The dataset consists of 1000 original videos manipulated under varied generation conditions, enabling estimation of method-specific detection abilities. This benchmark addresses past constraints in the field, where inconsistent assessment guidelines impaired meaningful performance comparisons between studies. FaceForensics++ has been cited more than 3000 times and is the most commonly used benchmark in our reviewed corpus (40% of studies) [
26].
Jayakumar et al. present a visually interpretable deepfake detection model based on an EfficientNetB0 backbone trained on a subset of FaceForensics++. Using MTCNN for face detection and preprocessing steps such as scaling, normalization, and augmentation, the system reaches 89.58% fidelity under human-grounded evaluation, with Anchors XAI and SLIC segmentation generating visual explanations of manipulated regions. Anchors outperform LIME in producing consistent interpretations, although additional testing on larger datasets is needed [
27].
Zou et al. propose a semantics-oriented multitask learning framework for deepfake detection, introducing a dataset expansion technique and a joint embedding approach using vision-language models. Their Semantics-based Joint Embedding DeepFake Detector (SJEDD) leverages face semantics, bi-level optimization, and automated task focus, and significantly outperforms 18 state-of-the-art detectors across six datasets. Limited dataset variety and computational complexity remain challenges, although SJEDD improves both generalizability and interpretability [
28].
Cozzolino et al. introduce POI-Forensics, an audio-visual deepfake detector that uses contrastive learning on real videos to derive identity-specific features. By comparing audio and video embeddings against a reference set, the technique achieves strong cross-dataset results, such as an AUC of 73.4% on FakeAVCelebV2 and good results on pDFDC and KoDF. Its usefulness on facial reenactment is lower (AUC: 70–80%), however, and it requires a large reference set [
29].
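The core identity-comparison idea can be sketched in a few lines. This is an illustrative scoring function under assumed embedding shapes, not the authors' code.

```python
import torch
import torch.nn.functional as F

def poi_score(test_emb, reference_embs):
    """Identity-based verification score: distance of a test clip's embedding
    to a person-of-interest reference set of real clips (illustrative only).
    test_emb: (D,) embedding; reference_embs: (N, D) embeddings of real clips."""
    sims = F.cosine_similarity(test_emb[None, :], reference_embs, dim=1)
    return 1.0 - sims.max().item()  # high score -> far from real references -> likely fake
```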
Pham et al. conducted a dual benchmarking study of facial forgery generation and detection, using a collection of 91,885 forged images and 2000 videos drawn from FaceForensics++. The authors compared conventional computer vision algorithms with deep learning algorithms such as XceptionNet and GAN-Fingerprint, evaluating their stability under changes in brightness, resolution, and compression. Most techniques were strongly affected by compression, although GAN-Fingerprint proved more robust. This two-sided benchmark allows forgery and detection procedures to be confronted directly and provides useful performance information, though some information was lost due to OCR limitations [
30].
Le et al. systematize deepfake detection with a five-step conceptual framework, evaluating 16 state-of-the-art detectors in black-box, grey-box, and white-box environments on datasets such as DFDC. Spatial-temporal and transformer models outperform the others, especially at detecting subtle facial changes. However, limited dataset diversity and reproducibility, and the fact that only 30 percent of the models are open source, hinder generalization. The paper identifies model architecture and data variety as essential to effective deepfake detection [
31].
Shahzad et al. examine the deepfake detection ability of ChatGPT-4 on the FakeAVCeleb dataset, comparing its results with human judges and state-of-the-art AI models. With context-rich prompts, ChatGPT reaches human-comparable performance (65 percent) but underperforms dedicated AI models (87.5–97.5 percent). Although its interpretability and generalization are strengths, its reliance on surface characteristics and poor performance with simple prompts limit its usefulness. The paper recommends enhancing future detection systems by integrating deep learning with large language models (LLMs) [
32].
Rana and Sung summarize deepfake detection in their IWSPA tutorial, dividing approaches into digital media forensics, face manipulation analysis, machine learning models (e.g., SVM, CNN), and other computational methods. When analyzed on datasets such as FaceForensics++ using visual material and facial landmarks, the weaknesses of individual methods become evident. They suggest hybrid methods and more diverse datasets to improve protection against emerging deepfakes [
33].
Figure 2 presents a detailed classification of deepfake detection techniques: a taxonomy of three main categories in which methodologies are grouped by their underlying algorithms and operating principles. This classification is important for understanding the detection landscape because it highlights fundamental trade-offs between methodological paradigms, ultimately helping researchers choose suitable techniques for particular deployment scenarios. The tripartite taxonomy reflects the evolution of detection methods from classical signal processing, through classical machine learning, to the latest deep learning architectures.
The first category, Deep Learning-Based Detection Techniques, encompasses advanced neural architectures such as Transformers, Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), Capsule Networks, Generative Adversarial Networks (GANs), XceptionNet, Recurrent Neural Networks (RNNs), Autoencoders, and attention mechanisms. These approaches currently lead the research field because of their capacity to automatically learn hierarchical feature representations from raw data, achieving state-of-the-art accuracy on benchmark datasets. However, as detailed in
Section 2 and
Section 4, this category shows large variability in computational resource requirements and cross-dataset generalization capabilities. The second category, Machine Learning and Traditional Methods, comprises Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (KNN), and Logistic Regression: approaches based on handcrafted features and classical optimization techniques. Although such models are overshadowed by deep learning in the recent literature, our analysis in
Section 3 shows that their accuracy is competitive at much lower computational overhead, with Random Forest in particular reaching 99.64% accuracy on DFDC. The third category, Traditional Image Processing Techniques, includes frequency-domain analysis methods such as the Discrete Cosine Transform (DCT) and Discrete Fourier Transform (DFT), as well as pixel and frequency analysis, edge detection, and wavelet analysis, which exploit statistical regularities in deepfake generation processes.
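To make the frequency-domain idea concrete, the sketch below pairs block-wise DCT statistics with a Random Forest classifier. It is a minimal illustration assuming grayscale face crops and placeholder labels, not a reproduction of any surveyed method.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.ensemble import RandomForestClassifier

def dct_feature_vector(gray_face, block=8):
    """Mean |DCT| spectrum over 8x8 blocks of a grayscale face crop.
    GAN upsampling tends to leave periodic high-frequency traces that
    show up as anomalies in these coefficients."""
    h = (gray_face.shape[0] // block) * block
    w = (gray_face.shape[1] // block) * block
    spectra = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = gray_face[i:i + block, j:j + block].astype(np.float64)
            coeffs = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
            spectra.append(np.abs(coeffs).ravel())
    return np.mean(spectra, axis=0)  # 64-dim feature vector

# Toy demo with random "faces"; a real pipeline would use detected face crops.
rng = np.random.default_rng(0)
X = np.stack([dct_feature_vector(rng.integers(0, 256, (64, 64))) for _ in range(40)])
y = rng.integers(0, 2, 40)  # placeholder real/fake labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:3]))
```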
This taxonomic organization makes several important points about methodological diversity and research gaps. The dominance of deep learning approaches (approximately 70% of reviewed studies) reflects current research trends, but our systematic evaluation refutes the assumption that architectural complexity equates to better performance.
Traditional methods have intriguing advantages in interpretability, computational efficiency, and deployment feasibility that should continue to be explored in parallel with deep learning innovations. The classification framework is thus not merely an organizational tool but an analytical instrument that reveals the need for multi-dimensional evaluation criteria encompassing accuracy, efficiency, and generalization rather than accuracy optimization alone.
Generative adversarial networks (GANs) have accelerated the development of deepfake technology. This has raised serious concerns about whether digital media can be trusted, underscoring the need for powerful detection systems. Cao and Gong [
34] show that although accuracy is very high (0.94–0.99) in controlled environments, current detection models are susceptible to Gaussian-noise adversarial attacks, weak cross-method generalization, and backdoor exploits. Their review covers recent advances and identifies the main challenges and adversarial threats on the path to secure and reliable deepfake detection [
34].
1.1. Novel Contributions
This review makes four innovative contributions that differentiate it from existing deepfake detection surveys and provide new insight into the capabilities, limitations, and future directions of deepfake detection.
1.1.1. Contribution 1: Deployment-Centric Evaluation Framework
Unlike previous surveys focused on architectural categorization, this review addresses methods from a deployment point of view, systematically comparing them against practical constraints such as computational overhead, cross-dataset robustness, and real-time operation requirements. We show that methodological superiority is not absolute but context-sensitive: transformer architectures excel in cross-dataset settings despite their computational burden, while classical machine learning approaches are competitive in accuracy with minimal resource needs in particular deployment scenarios. This framework shifts evaluation from "which method is best" to "which method is optimal for specific operational constraints."
1.1.2. Contribution 2: Challenging Deep Learning Dominance Assumptions
Through rigorous analysis of computational efficiency, we dispute common assumptions about the superiority of deep learning over traditional methods. We show that Random Forest obtains 99.64% accuracy on DFDC with 2 ms inference time (better than many deep learning methods that need 60 ms to attain 91% accuracy). This result, documented systematically across many traditional methods, shows that intelligent feature engineering can compete strongly with learned representations in resource-constrained applications, and suggests that an implicit assumption in the recent literature is incorrect: namely, that architectural complexity guarantees superior performance.
1.1.3. Contribution 3: Systematic Cross-Dataset Generalization Analysis
We provide the first quantification of cross-dataset generalization failures across methodological families, revealing systematic performance degradation of 10 to 15%, which suggests that models learn dataset-specific rather than universal deepfake characteristics. While Le et al. [
31] point out generalization issues and discuss them theoretically, our systematic empirical study quantifies the degradation patterns and shows that they hold for transformer (11.33% decline), CNN (15%+ decline), and traditional methods alike. Such evidence indicates a fundamental limitation: one that demands paradigm shifts in training methodologies and dataset construction rather than incremental architectural refinements.
1.1.4. Contribution 4: Comprehensive Computational Efficiency Quantification
We give the first systematic comparison of computational overhead (inference time, parameter count) across the methodology categories, showing that marginal accuracy gains often require disproportionate increases in computational cost. For example, ADT delivers only 1% better accuracy than XceptionNet while being 4× more computationally expensive. This quantification enables evidence-based architecture selection that balances performance against efficiency for specific deployment scenarios, critical information missing from existing surveys that deal mainly with accuracy. Together, these contributions give researchers and practitioners practical guidance for method selection, expose fundamental limitations requiring research attention, and establish evaluation frameworks that weigh deployment feasibility alongside laboratory performance.
1.2. Research Gap and Motivation
Despite the recent surge in research on deepfake detection, the literature has a number of critical limitations that hinder our understanding of detection capabilities and their practical deployment. First, the literature to date focuses mostly on deep learning approaches and systematically understudies or ignores the competitive performance of traditional machine learning approaches under certain deployment constraints. This bias toward architectural complexity tends to obscure situations in which simpler methods may provide better efficiency-accuracy trade-offs, especially in resource-constrained or real-time applications. Second, most reviews organize methods by architecture type (CNNs, Transformers, LSTMs) without systematically considering the underlying trade-offs among computational efficiency, detection accuracy, and cross-dataset generalization, factors shown to be decisive for real-world deployment beyond benchmarking on diverse image and video datasets. This emphasis on taxonomy over performance trade-offs offers practitioners little guidance for choosing detection methodologies in particular operational scenarios.
Third, performance comparisons in the existing literature usually focus on within-dataset accuracy (often exceeding 99% on FaceForensics++) while underreporting or ignoring the cross-dataset generalization failures that make many approaches unsuitable for production systems, which must contend with evolving manipulation techniques.
The systematic patterns of cross-dataset performance degradation, and what they indicate about fundamental limitations in current detection procedures, have largely remained unexamined. Fourth, the trade-off between architectural complexity and practical deployment viability is underexplored: there is almost no analysis of computational cost (inference time, parameter count, memory) against marginal performance benefits. This oversight matters because detection systems must scale across platforms, from high-performance servers to mobile devices. Finally, the field lacks a rich synthesis of the methodological diversity spanning deep learning, classical machine learning, and traditional image processing, making it difficult to understand the full landscape of detection approaches and their relative strengths across evaluation dimensions.
1.3. Research Questions
To fill these critical lacunae in the deepfake detection literature, this systematic review addresses four research questions.
1.3.1. RQ1: Comparative Performance Across Methodological Paradigms
How do various detection methodologies, including deep learning architectures (Transformers, CNNs, LSTMs), classical machine learning methods (SVM, Random Forest, KNN), and traditional image processing techniques (DCT, DFT, wavelet analysis), compare along the dimensions of detection accuracy, computational efficiency, and ability to generalize to unseen datasets?
1.3.2. RQ2: Cross-Dataset Generalization Patterns
What systematic patterns emerge in cross-dataset performance degradation when detection methods are tested on datasets they have never seen, and what do these patterns tell us about whether current approaches learn generalizable rather than dataset-specific deepfake characteristics?
1.3.3. RQ3: Viability of Traditional Methods
Do certain traditional machine learning and image processing approaches provide feasible alternatives to deep learning in deployment scenarios with limited computational resources, real-time processing needs, or requirements for model interpretability?
1.3.4. RQ4: Architectural Innovations for Generalization
Which architectural innovations, such as attention mechanisms, multimodal fusion, and frequency-domain analysis, show the most promise for improving cross-dataset generalization, and what design principles underlie their robustness to unseen manipulation techniques?
These research questions guide our systematic analysis of studies published between 2018 and 2025, providing evidence to assess detection capabilities, fundamental limitations, and promising research directions for the field.
1.4. Review Objectives
This systematic review pursues three main objectives that map directly onto the research gaps and questions identified above.
1.4.1. Objective 1: Comprehensive Methodological Synthesis
We systematically synthesize the deepfake detection methodologies of 74 peer-reviewed studies published between January 2018 and December 2025, categorizing approaches into three technological families: deep learning architectures (Transformers, CNNs, LSTMs, Capsule Networks, GANs, XceptionNet, RNNs, Autoencoders, attention mechanisms), classical machine learning methods (Support Vector Machines, Random Forest, K-Nearest Neighbors, Logistic Regression), and traditional image processing methods (Discrete Cosine Transform, Discrete Fourier Transform, pixel and frequency analysis, edge detection, wavelet analysis). This taxonomic organization constitutes the first comprehensive mapping of methodological diversity across the full spectrum of detection approaches, rather than focusing on deep learning methods alone.
1.4.2. Objective 2: Multi-Dimensional Performance Analysis
We perform a rigorous comparative performance analysis on three benchmark datasets (FaceForensics++, DFDC, and Celeb-DF), quantifying accuracy-efficiency trade-offs and cross-dataset generalization for representative algorithms in each technological category. Unlike existing reviews, which mainly optimize a single metric (most commonly accuracy on FaceForensics++), our analysis systematically assesses three key performance dimensions: (1) detection accuracy across manipulation types and quality levels, (2) computational efficiency, in terms of inference time and parameter count, and (3) cross-dataset generalization, in terms of performance degradation on unseen data distributions. This multi-dimensional framework enables deployment-centric evaluation that accounts for operational constraints beyond the laboratory environment.
1.4.3. Objective 3: Fundamental Limitation Identification and Future Directions
We identify failure patterns across the methodological literature and propose evidence-based directions for future work on the basic challenges of generalization, efficiency, and deployment. Through cross-dataset analysis, we determine whether performance degradation is random or reflects systematic limitations indicating that current methods learn dataset-specific artifacts rather than universal deepfake features. This evaluation is crucial for steering the field away from marginal accuracy improvements on saturated benchmarks and toward resolving the underlying causes of detection failure.
1.5. Review Methodology and Organizational Framework
This review follows a structured methodology to ensure systematic coverage and rigorous analysis of the deepfake detection literature. While narrative reviews do not require the formal protocols of systematic reviews (e.g., PRISMA guidelines), we use transparent selection criteria and analytical frameworks to enable reproducibility and systematic synthesis.
1.5.1. Literature Selection Protocol
We conducted systematic searches in five major academic databases whose full-text content is available online. The search string combined Boolean operators to retrieve relevant literature: (("deepfake detection" OR "face manipulation detection" OR "synthetic media detection" OR "facial forgery detection") AND ("deep learning" OR "machine learning" OR "computer vision" OR "image processing")). We limited the publication window to January 2018 through December 2025, spanning the period from the emergence of deepfakes in late 2017 to current state-of-the-art methods. This period covers the full development of detection techniques, from the initial convolution-based (CNN) methods to the latest transformer-based and multimodal models.
The search process yielded 327 potentially relevant publications. We applied systematic inclusion criteria requiring: (1) peer-reviewed publication in a journal or conference, or a highly cited preprint (>20 citations) from a reputable archive (arXiv, ResearchGate); (2) empirical evaluation on established benchmark datasets such as FaceForensics++, DFDC, Celeb-DF, or equivalents; (3) a clear methodological description allowing replication or critical assessment; and (4) reported quantitative performance metrics, such as accuracy, AUC, or F1-score. Exclusion criteria were: (1) purely theoretical papers lacking implementation or empirical validation; (2) papers without quantitative results or offering only qualitative analysis; (3) duplicate publications or extended versions of earlier work (we kept the most complete version); (4) papers focusing exclusively on deepfake generation rather than detection; and (5) papers addressing modalities outside our scope (e.g., exclusively audio deepfakes lacking visual components, or text-based manipulation). This systematic filtering process produced the 74 primary studies that make up the final review corpus.
1.5.2. Analytical Framework and Categorization
We divided the selected detection methodologies into three main categories according to their algorithmic basis and operating principle. The first category, Deep Learning-Based Detection Techniques (
Section 2), contains nine groups of algorithms: Transformers (global dependencies via self-attention mechanisms), Convolutional Neural Networks (hierarchical spatial feature extraction), Long Short-Term Memory networks (temporal dependencies), Capsule Networks (preservation of spatial hierarchies), Generative Adversarial Networks (artifact learning via adversarial training), XceptionNet (depthwise separable convolutions), Recurrent Neural Networks (sequence processing), Autoencoders (learning by reconstruction), and attention mechanisms (selection of salient features). This granular categorization makes it possible to identify architecture-specific strengths and weaknesses rather than treating "deep learning" as a monolithic category.
The second category, Machine Learning and Traditional Detection Techniques (
Section 3), covers classical supervised learning and signal processing techniques: Support Vector Machines (margin-optimizing classification), Random Forest (decision-tree ensembles), K-Nearest Neighbors (instance-based learning), Logistic Regression (probabilistic linear classification), Discrete Cosine Transform (frequency-domain analysis), Discrete Fourier Transform (spectral decomposition), pixel and frequency analysis (hybrid spatial-spectral features), edge detection (gradient-based features), and wavelet analysis (multi-resolution decomposition). Including these traditional methods fills the gap in current surveys, which systematically under-examine non-deep-learning approaches despite their theoretical advantages in computational efficiency and interpretability.
The third component, Performance Analysis (
Section 4), synthesizes a comparative evaluation across the benchmark datasets. For each reviewed technique, we extracted: (1) architectural design and algorithmic technique, (2) computational requirements (inference time, parameter count, and memory footprint where reported), (3) within-dataset performance on FaceForensics++, DFDC, and Celeb-DF, (4) cross-dataset generalization measured as performance degradation on unseen benchmarks, and (5) deployment considerations (real-time capability, hardware requirements).
1.5.3. Evaluation Metrics and Comparative Analysis
Our performance comparison framework emphasizes three dimensions critical to practical deployment viability beyond single-metric optimization. First, detection accuracy is evaluated on three benchmark datasets with different manipulation characteristics: FaceForensics++ (controlled manipulations using five generation methods at multiple compression levels), DFDC (large-scale realistic variation, 128,154 videos), and Celeb-DF (realistic celebrity deepfakes close to real-world scenarios). Overall accuracy, AUC (Area Under the ROC Curve), and F1-score serve as the accuracy metrics; although the reported metrics are heterogeneous, we compare only what the original studies report. The F1-score balances precision and recall, which is especially useful for the imbalanced datasets typical of deepfake detection tasks [
35].
Second, computational efficiency is measured by inference time (milliseconds per frame or video) and model complexity (parameter count and FLOPs where known). This dimension addresses the observed gap in cost-benefit analysis, clarifying whether additional accuracy warrants additional computation for particular deployments (real-time content moderation versus forensic analysis versus resource-constrained mobile deployment). Third, cross-dataset generalization is defined as the performance degradation when approaches trained on one dataset are tested on a different benchmark. This metric indicates whether techniques learn general deepfake characteristics or dataset-specific ones, a crucial distinction for practical deployment in real-world scenarios where manipulation techniques evolve continuously. This tri-dimensional evaluation framework supports nuanced assessment beyond conventional accuracy tests, addressing RQ1 (comparative performance by methodology) and RQ2 (generalization patterns across datasets). Methods that excel in all three dimensions are ideal solutions, while those that maximize one dimension at the cost of another suit particular deployment situations.
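As a concrete illustration of how these dimensions can be computed from study outputs, the sketch below uses standard scikit-learn metrics; the example AUC values are hypothetical, chosen only to show a degradation in the 10–15% range discussed later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

def evaluate(y_true, y_score, threshold=0.5):
    """Report the three accuracy metrics used in our comparison framework."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
        "f1": f1_score(y_true, y_pred),
    }

def generalization_drop(within_auc, cross_auc):
    """Cross-dataset degradation as an absolute AUC drop in percentage points."""
    return 100.0 * (within_auc - cross_auc)

# Hypothetical numbers: 0.99 AUC within-dataset, 0.85 AUC cross-dataset.
print(generalization_drop(0.99, 0.85))  # 14.0 -> within the 10-15% range we report
```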
1.5.4. Organizational Structure and Roadmap
The review is organized to build understanding progressively, from individual methodologies to comparative synthesis and critical analysis.
Section 2 presents a systematic discussion of deep learning architectures, starting with Transformer-based architectures (which demonstrate superior generalization despite computational overhead), moving through convolutional neural network (CNN) architectures (which trade effectiveness against efficiency), and ending with specialized architectures targeting specific challenges (temporal modeling with Long Short-Term Memory networks (LSTMs), spatial hierarchy preservation with Capsule Networks). Within each architectural family, we analyze representative methods, document performance on each benchmark, and identify failure patterns.
Section 3 analyzes machine learning and conventional image processing methods, thereby challenging the assumption that deep learning is always dominant and showing that these methods provide comparable performance in some situations. For each method category (SVM, Random Forest, frequency-domain analysis), we synthesize results across studies to surface consistent findings rather than isolated results.
Section 4 is devoted to comparative performance analysis through structured tables and cross-method synthesis, relating the evidence directly to our research questions.
Section 5 (Discussion) synthesizes the findings to identify fundamental trade-offs, systematic limitations, and promising future research directions.
Section 6 (Conclusion) distills key insights and actionable recommendations.
This organization enables readers to: (1) understand individual methodologies in their architectural context (
Section 2 and
Section 3), (2) compare methods based on evaluation dimensions through systematic analysis (
Section 4), (3) determine basic patterns and limitations that transcend single methods (
Section 5), and (4) apply the findings to inform method selection and guide future research (
Section 6). The structure supports both linear reading for deep understanding of particular ideas and selective navigation for practitioners seeking specific information.
1.6. Bibliometric Analysis and Research Landscape
To provide a quantitative, objective view of the deepfake detection research landscape, we performed a bibliometric analysis of our 74-study corpus, analyzing publication trends, methodological evolution, and thematic clustering patterns over time.
1.6.1. Temporal Publication Trends and Growth Patterns
Our corpus shows exponential growth in deepfake detection research: publications rose from 3 studies in 2018 to 23 studies in 2024, a 667% increase. This evolution spans numerous dimensions and is thoroughly reported in recent comprehensive surveys in top-tier journals. IEEE reviews discuss the foundations of media forensics and benchmarking challenges, while surveys published by Springer and Elsevier evaluate state-of-the-art methodologies and open challenges [
Wiley surveys provide systematic coverage of deep learning approaches, while MDPI and ACM publications emphasize machine learning fusion techniques and reliability perspectives. These 2023–2024 surveys converge on the same underlying issues: cross-dataset generalization, computational efficiency, and the need for evaluation frameworks that go beyond single-dataset accuracy measures. Kaur et al. discuss the main challenges of GAN-based generation methods and the detection approaches that must be tailored to specific cases; their bibliometric analysis uncovers publication trends and dominant research fields, indicating sustained academic interest in GAN-driven detection methods [
37]. This acceleration reflects two intertwined factors: an escalating deepfake threat and maturing detection methodologies. The publication distribution reveals three distinct phases: (1) a founding phase in 2018–2019 (15% of studies) dominated by CNN-based detection paradigms, (2) a diversification phase in 2020–2021 (28% of studies) introducing attention mechanisms and spatial-temporal modeling, and (3) a sophistication phase in 2022–2025 comprising 70% of studies.
Figure 3 shows the methodological evolution across three phases. Phase 1 (2018–2019) built foundational detection capabilities on traditional hand-crafted features such as Local Binary Patterns (LBP) and Histograms of Oriented Gradients (HOG), combined with classical machine learning classifiers such as Support Vector Machines (SVMs) and Random Forests. This phase accounts for 15% of the studies and focused mainly on interpretability and computational efficiency. Phase 2 (2019–2021) marked the shift toward deep learning architectures, chiefly Convolutional Neural Networks (XceptionNet, ResNet50, EfficientNet) and Long Short-Term Memory networks (LSTMs) for temporal modeling. This diversification phase, representing 28% of studies, introduced attention mechanisms and hybrid spatial-temporal fusion algorithms, gaining accuracy at the expense of interpretability. Phase 3 (2022–2025) represents the current state of the art, composed of advanced Vision Transformers (ADT, DFDT), multimodal audio-visual frameworks (AVoiD-DF), and hybrid spatial-frequency approaches. This sophistication phase dominates current research (70% of studies) and focuses on cross-dataset generalization while remaining computationally feasible in practice.
Keyword analysis shows four major research clusters. The Deep Learning Architecture cluster (45% of studies) emphasizes CNNs, Transformers, LSTMs, and hybrid models pursuing accuracy through architectural innovation. The Cross-Dataset Generalization cluster (25% of studies) addresses performance loss across benchmarks, tying directly to our RQ2. The Computational Efficiency cluster (18% of studies) studies accuracy-efficiency trade-offs and deployment constraints, supporting our deployment-centric evaluation framework. The Multimodal Detection cluster (12% of studies) covers emerging audio-visual integration approaches, focusing on synchrony cues such as lip-sync violations and voice-expression mismatches, with methods such as AVoiD-DF achieving 91–92% accuracy.
1.6.2. Benchmark Dataset Utilization Patterns
Consistent with
Figure 4, FaceForensics++ is by far the most popular benchmark (40% of studies), followed by DFDC (25%) and Celeb-DF (20%). This concentration creates potential evaluation biases, because methods tuned to FF++'s controlled conditions may learn dataset-specific rather than general deepfake characteristics. The growing number of specialized datasets (WildDeepfake 3%, audio-visual datasets 3%, speech-specific datasets 2%) reflects a recognition that single-benchmark evaluation is inadequate evidence of practical detection capability.
1.6.3. Citation Network and Methodological Evolution
Analysis of citation patterns reveals ground-breaking works that set detection paradigms, with dataset creation by Dolhansky et al. and architectural innovation by Wang et al. [
17,
18] standing out as highly influential publications that defined subsequent research directions. Citation network analysis shows the methodological evolution from spatial feature extraction (2018–2019), through attention mechanisms (2020–2021), to the current focus on cross-dataset generalization and computational efficiency (2022–2025).
1.6.4. Research Gap Validation
This bibliometric landscape empirically validates our identified research gaps. The systematic 10–15% cross-dataset performance degradation we document appears in only the 25% of papers that explicitly evaluate generalization, suggesting the problem remains far from solved. That only 18% of studies address computational efficiency supports our finding that this dimension receives far too little attention relative to its importance in deployment. The 70% focus on deep learning approaches justifies our observation that traditional machine learning approaches are systematically underexamined despite being competitive performers in certain deployment scenarios.
This bibliometric analysis strengthens the contribution of our review by showing that our categorization framework (
Figure 2), identified research gaps (
Section 1.2), and multi-dimensional evaluation approach (
Section 1.5.3) reflect objective patterns in the research landscape rather than subjective categorization choices.
4. Performance Analysis and Comparative Evaluation
Having systematically reviewed deep learning architectures (
Section 2) and machine learning and traditional methods (
Section 3), we now perform a careful comparative performance analysis across benchmark datasets to assess detection accuracy, computational efficiency, and cross-dataset generalization. This multi-dimensional analysis directly addresses our research questions by quantifying the fundamental trade-offs that determine practical deployability beyond the laboratory. The systematic comparison challenges conventional assumptions about methodological superiority and exposes basic limitations that call for paradigm shifts in detection approaches.
Our evaluation framework uses three main benchmark datasets with varying manipulation characteristics and complexity. FaceForensics++ (FF++), the most widely used benchmark, consists of 1000 original videos manipulated with five generation methods (Deepfakes, Face2Face, FaceSwap, NeuralTextures, and FaceShifter) at multiple compression levels, enabling controlled evaluation of method-specific detection capabilities. The DeepFake Detection Challenge (DFDC) dataset provides large-scale, realistic evaluation with 128,154 videos involving diverse actors, varied lighting conditions, and realistic compression effects, offering assessment conditions closer to the real world. Celeb-DF is a collection of high-quality celebrity deepfakes produced with advanced generation techniques that closely resemble the original videos, making it a demanding test of cross-dataset generalization in which methods must detect manipulations quite different from those they were trained on. Together, these datasets allow detection robustness to be examined across varying manipulation quality, dataset scale, and distribution shift.
We structure our comparative analysis around four complementary subsections. First,
Section 4.1 provides an in-depth analysis of within-dataset performance on FaceForensics++ (
Table 1), establishing baseline performance under controlled conditions and identifying the architectural approaches achieving state-of-the-art results on the community's most common benchmark. Second,
Section 4.2 evaluates performance on DFDC (
Table 2), testing scalability and robustness to realistic variations such as compression artifacts, lighting changes, and subject characteristics. Third,
Section 4.3 addresses cross-dataset generalization (
Table 3), quantifying the performance degradation when methods confront unseen data distributions, the critical test of whether detectors have learned generalizable deepfake characteristics or dataset-specific artifacts. Fourth,
Section 4.4 compares computational efficiency (
Table 4), examining inference time and parameter overhead to determine whether marginal accuracy gains justify the added architectural complexity in specific deployment scenarios.
One of the main results emerging from this analysis is the systematic performance degradation in cross-dataset evaluation scenarios. Most methods lose 10–15% accuracy when tested on a data distribution different from their training data, a pattern that holds across transformer-based, CNN-based, and traditional methods. This steady degradation indicates a fundamental limitation: current detection methods tend to learn dataset-specific cues, such as compression artifacts, manipulation boundaries, and generation signatures, rather than generalizable characteristics of synthetic media. This has far-reaching implications for real-world applications, where detection systems must maintain robust performance as deepfake generation techniques continue to evolve. The analysis therefore shows that benchmark saturation (achieving >99% accuracy on FaceForensics++) can disguise practical limitations, and that cross-dataset evaluation may be the most critical measure of detection capability.
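The evaluation protocol behind these numbers is straightforward: train on one benchmark, score on both the training benchmark and an unseen one, and report the gap. The toy sketch below uses random synthetic stand-ins for the two datasets, so the printed values carry no empirical meaning; it only illustrates the computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: "training benchmark" features and a shifted "unseen benchmark".
rng = np.random.default_rng(1)
X_train, y_train = rng.normal(0, 1, (500, 64)), rng.integers(0, 2, 500)
X_shift, y_shift = rng.normal(0.5, 1.3, (200, 64)), rng.integers(0, 2, 200)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
within = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
cross = roc_auc_score(y_shift, clf.predict_proba(X_shift)[:, 1])
print(f"within-dataset AUC {within:.2f}, cross-dataset AUC {cross:.2f}, "
      f"drop {100 * (within - cross):.1f} points")
```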
4.1. FaceForensics++ Benchmark Results
Table 1 shows the performance comparison on the FaceForensics++ dataset, revealing both the potency of current detection approaches and the shortcomings of this saturated benchmark for evaluating state-of-the-art methods. Recent methods routinely exceed 99% accuracy, with transformer-based methods (ADT, DFDT), advanced CNN architectures (MCX-API, 3D-CNN), and specialized frequency-domain approaches (DFT-MF) all reaching this performance ceiling. This saturation suggests either that FF++ no longer offers discriminative power for contemporary detection methods, or that its controlled manipulations and consistent compression artifacts allow even relatively simple methods to achieve near-perfect accuracy.
4.2. DFDC Performance Analysis
Table 2 shows detection accuracy on the DFDC dataset, which exhibits considerably more performance variance than FaceForensics++ because of its more realistic and challenging characteristics. Unlike the controlled conditions of FF++, DFDC contains diverse manipulation quality, varied compression artifacts, and in-the-wild recording conditions that are closer to real-world deployment scenarios. The performance distribution reveals important information about methodological robustness and the relationship between architectural sophistication and field usefulness.
4.3. Cross-Dataset Generalization Assessment
Table 3 quantifies the critical dimension of cross-dataset generalization, showing the systematic performance degradation when detection methods are tested on data different from their training data. This analysis directly addresses RQ2 (cross-dataset generalization patterns), providing empirical evidence on whether current methods learn generalizable or dataset-specific deepfake characteristics. The results show that generalization failure is not an isolated case but a systematic deficiency across all methodology types.
4.4. Computational Efficiency Comparison
Table 4 analyzes computational overhead by comparing inference time and parameter counts for representative methods from each architectural category. This dimension relates to RQ1 and RQ3 by showing whether marginal accuracy improvements from computationally expensive methods are justified when deployment costs are significant, and whether traditional methods offer a feasible efficiency trade-off under application-level resource constraints. The analysis shows that the relationship between computational cost and detection accuracy is not linear and reaches a point of diminishing returns as architectural complexity grows.
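As an illustration of how measurements of this kind can be reproduced, the sketch below counts parameters and times forward passes for a toy PyTorch model; the tiny CNN is a stand-in, not one of the reviewed architectures, and real benchmarking would require controlled hardware and batching conditions.

```python
# Minimal sketch of the efficiency measurements behind Table 4:
# parameter count and mean inference latency. The tiny CNN is a
# placeholder, not any of the reviewed detectors.
import time
import torch
import torch.nn as nn

model = nn.Sequential(  # placeholder detector backbone
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2),
).eval()

n_params = sum(p.numel() for p in model.parameters())

x = torch.randn(1, 3, 224, 224)           # one 224x224 RGB frame
with torch.no_grad():
    for _ in range(10):                    # warm-up runs
        model(x)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    latency_ms = 1000 * (time.perf_counter() - start) / runs

print(f"{n_params / 1e6:.2f} M parameters, {latency_ms:.2f} ms/frame")
```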
5. Discussion
5.1. Principal Findings and Novel Contributions
This systematic review of 74 deepfake detection studies published between 2018 and 2025 makes four key contributions toward advancing deepfake detection beyond the current state of the art. First, our multi-dimensional evaluation framework reveals basic trade-offs among detection accuracy, computational efficiency, and cross-dataset generalization that have been systematically understudied in previous surveys. While some existing reviews covered architectural categorization and within-dataset accuracy, our deployment-centric analysis shows that methodological superiority depends on context rather than on the methods themselves: transformer architectures show substantial benefit in cross-dataset scenarios at 3–5× higher computational cost, while convolutional and traditional machine learning methods achieve sufficient accuracy for specific deployment scenarios with minimal resource consumption.
Several comprehensive surveys published in 2023–2024 in IEEE, Springer, and Elsevier venues, as well as in MDPI, ACM, and Wiley, support our systematic findings on the major issues in detection. The IEEE literature focuses on media forensics principles and benchmarking specifications compatible with our cross-dataset evaluation focus. The state-of-the-art problems and generalization failures that we report are recorded independently in Springer and Elsevier reviews and are corroborated by MDPI and Wiley analyses, which also confirm our results on the competitiveness of traditional methods [10,11,15,36]. The reliability-oriented ACM survey [9] gives a complementary view legitimizing our deployment-oriented model. The consistency of results across numerous independent leading publications reinforces our argument that assessment must cover accuracy, efficiency, and generalization rather than maximize single-dataset performance indicators [9,12,13,37].
Second, we question current assumptions about the relative superiority of deep learning over traditional methods by showing that a Random Forest can achieve 99.64% DFDC accuracy with 2 ms inference time, compared with multimodal deep learning approaches that take 60 ms to achieve 91.4% accuracy. This finding challenges the tacit assumption in recent literature that architectural complexity guarantees good performance, and shows that intelligent feature engineering can compete with learned representations in resource-constrained situations.
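A minimal sketch of such a traditional-ML pipeline follows, assuming handcrafted features have already been extracted; the synthetic X and y below are stand-ins for illustration, not the DFDC features used in the cited study.

```python
# Hedged sketch of a Random Forest detector over precomputed
# handcrafted features. Feature extraction is abstracted away;
# the data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))                 # e.g., frequency/texture stats
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # toy real/fake labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

# Per-sample inference is dominated by tree traversal, which is why
# millisecond-scale CPU latency is plausible for this model family.
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```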
Third, our systematic cross-dataset analysis finds degradation across all methodological categories of 10–15% on average, a generalization failure in which models learn dataset-specific artifacts rather than general deepfake characteristics. This extends observations by Le et al. [31] and offers empirical evidence for generalization challenges that had been raised at the theoretical level by [34].
Fourth, we present a full-scale computational efficiency analysis showing that marginal improvements in accuracy are often disproportionately costly: for example, ADT improves accuracy by roughly 1% over XceptionNet at 4× the computational cost. Analyses of this sort allow evidence-driven architecture choices for specific deployment scenarios.
5.2. Comparison with Existing Review Literature
Our results both support and extend past deepfake detection reviews while exposing critical weaknesses of existing syntheses. Ref. [35] provides a comprehensive architectural taxonomy of detection methods but does not systematically test computational efficiency or cross-dataset generalization, dimensions we show are essential to practical deployment assessment. Their review classifies approaches by generation and detection methods but does not quantitatively compare the trade-offs among accuracy, efficiency, and generalization, making it difficult for practitioners to choose appropriate approaches under particular operational constraints. Ref. [4] identifies chronological incompatibilities, lack of generalization, and data scarcity as major challenges but lacks the systematic quantification of performance degradation patterns that our Table 3 provides. Their qualitative assessment of generalization difficulties is useful but insufficient to uncover the magnitude and consistency of cross-dataset failures.
Ref. [31] introduces a five-stage framework of acquisition, pre-processing, detection, post-processing, and decision for detector evaluation, and stresses the importance of model architecture and data diversity. Our review goes beyond this approach by adding computational efficiency as an evaluation dimension of equal importance with accuracy and generalization, uncovering trade-offs absent from their analysis. In addition, whereas [31] found that only 30% of reviewed models are open-source, which makes reproducibility difficult, we show that this reproducibility crisis extends to performance claims: the 10–15% degradation revealed by cross-dataset evaluation is routinely unreported in original publications focused on within-dataset optimization. Rana and Sung classify methods into Digital Media Forensics, Face Manipulation, and Machine Learning approaches, but consequently focus on architectural differences rather than performance trade-offs. Our deployment-centric view complements such taxonomies by quantifying the scenarios in which traditional methods hold practical advantages over deep learning despite their architectural simplicity.
Unlike previous reviews, in which cross-dataset generalization was framed as a binary success or failure, our systematic quantification indicates that generalization degradation correlates in a surprising way with algorithmic complexity: transformer architectures hold up better in relative terms (an 11.33% decline) even though their absolute accuracy is often lower than that of simpler methods on specific datasets. This nuanced finding implies that architectural sophistication facilitates learning of more transferable features even when it does not maximize within-dataset accuracy, breaking the assumption that the highest benchmark performance translates into the best transfer to new data.
5.3. Limitations of Current Detection Methods
Our systematic analysis identifies five pervasive, fundamental limitations that affect all detection methodologies, limiting their real-world effectiveness and necessitating paradigm shifts rather than incremental improvements.
Dataset Overfitting and Artifact Learning: The regular 10–15% degradation in cross-dataset performance shows that existing methods mainly learn dataset-specific compression types, facial-region boundaries, and manipulation signatures rather than generalizable deepfake features. This limitation appears even in transformer architectures intentionally designed for generalization (ADT: 11.33% decline) and is a fundamental challenge that calls for new training paradigms. Methods achieving > 99% within-dataset accuracy drop markedly on unseen data (Table 3), indicating that they optimize toward benchmark saturation rather than the actual detection goal. The consistency of this pattern across methodological categories suggests that current supervised learning paradigms optimize for dataset-specific accuracy rather than for manipulation-invariant features.
Computational Scalability Requirements: While transformer-based methods deliver superior cross-dataset generalization, their 3–5× computational burden relative to CNNs imposes deployment constraints for real-time applications, resource-constrained devices, and large-scale content moderation. The marginal accuracy gains (1–2% on average) relative to these computational increases call into question architectural complexity as a universal solution, a trade-off underexplored in a literature that favors maximizing accuracy without deployment context. For example, multimodal approaches such as AVoiD-DF require 60 ms inference time and are therefore infeasible for real-time video processing (which requires < 33 ms per frame at 30 fps), despite their higher scores on benchmark datasets.
Temporal Consistency Modeling Shortcomings: Despite LSTM and spatiotemporal approaches, existing methods fail to adequately represent temporal artifacts across longer video sequences. The excellent performance of the semantic temporal analysis by [48], which achieved 100% DFDC accuracy by detecting breaks in emotional continuity, indicates the superiority of explicit temporal modeling over frame-independent or short-sequence approaches. However, the computational cost of dense temporal processing across full video sequences limits scalability, creating a trade-off between temporal robustness and efficiency.
Adversarial Vulnerability: As shown by Cao and Gong, existing detectors are vulnerable to adversarial attacks through mechanisms such as Gaussian noise perturbations, backdoor attacks, and targeted adversarial examples. Our review finds very little attention to adversarial robustness among the evaluated methods, with only 3 of 74 studies performing adversarial training or evaluation. This gap leaves deployed systems exposed to deliberate adversarial manipulation, a security threat distinct from organic deepfake generation.
Multimodal Integration Challenges: While multimodal approaches deliver performance improvements by exploiting audio-visual inconsistencies, they bring synchronization requirements, computational overhead, and dependence on complete multimodal data. Many real-world scenarios involve visual-only content (static images, muted videos), which limits the applicability of multimodal detectors despite their stronger performance on audio-visual benchmarks. The field lacks unified frameworks that secure the advantages of multimodality when complete data exist while degrading gracefully to strong unimodal operation when modalities are missing.
5.4. Adversarial Attacks on Deepfake Detectors
As detection technology has improved, attacks on the detectors themselves have become an important security issue. Attackers can make deepfakes undetectable through several methods. Gaussian noise: imperceptible random perturbations added to degrade detector accuracy without deteriorating visual quality. Backdoor attacks: attackers intentionally insert triggers into training data so that detectors misclassify particular manipulated samples. Adaptive adversarial examples: inputs crafted specifically to exploit weaknesses of detection architectures.
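The Gaussian-noise attack in particular is simple to reproduce as a robustness probe. The sketch below, in which the `detector` callable is a hypothetical stand-in for any reviewed model, measures how much low-amplitude noise shifts detector scores.

```python
# Simple robustness probe in the spirit of the Gaussian-noise attack
# described above. `detector` is a hypothetical callable returning a
# scalar "fake" score; the demo detector is a toy, not a real model.
import numpy as np

def gaussian_noise_attack(image: np.ndarray, sigma: float = 0.02) -> np.ndarray:
    """Add near-imperceptible zero-mean Gaussian noise to a [0,1] image."""
    noisy = image + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)

def robustness_gap(detector, images: np.ndarray) -> float:
    """Mean shift in detector score caused by the perturbation."""
    clean = np.array([detector(img) for img in images])
    noisy = np.array([detector(gaussian_noise_attack(img)) for img in images])
    return float(np.mean(clean - noisy))

# Toy stand-in scoring high-frequency energy (not a real detector).
demo = lambda img: float(np.abs(np.diff(img, axis=0)).mean())
print(robustness_gap(demo, np.random.rand(8, 64, 64)))
```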
Our analysis shows that of the 74 reviewed studies, only 3 (4%) explicitly address adversarial robustness, revealing a significant gap in the literature. Cao and Gong showed that detectors achieving 94–99% accuracy under controlled conditions can be defeated by adversarial attacks, demonstrating the difference between benchmark performance and actual security. This vulnerability is especially worrying for security-critical applications where attackers actively work to circumvent detection systems.
Future research must therefore target robustness to adversarial attacks alongside accuracy. Promising directions include certified defense mechanisms offering provable robustness bounds, adversarial training against diverse kinds of attacks, and ensemble approaches that combine complementary detection principles to make attacks harder to mount. Standardized adversarial evaluation must become a routine part of detection method validation.
5.5. Future Research Directions and Recommendations
Based on the identified limitations and performance gaps, we propose seven critical research directions for improving deepfake detection capabilities and making them practically deployable.
1. Generalization-Focused Training Paradigms: The systematic cross-dataset degradation demands training methodologies specifically geared toward generalization rather than within-dataset accuracy. Promising approaches include: (a) meta-learning frameworks that train on distribution shifts rather than fixed datasets, learning to adapt to previously unseen manipulation types; (b) domain adaptation approaches that use unlabeled target-domain data to bridge distribution gaps; (c) disentangled representation learning that separates content-specific features from manipulation artifacts in order to learn invariant characteristics; and (d) contrastive learning that enforces consistent representations across manipulation methods (see the first sketch after this list). Future work should adopt cross-dataset AUC rather than within-dataset accuracy as the main optimization goal, with benchmarks that reward consistent performance under varied evaluation conditions.
2. Computational Efficiency through Neural Architecture Search: The transformer accuracy-efficiency trade-off requires a systematic search of the architecture design space over performance and computational cost. Neural architecture search (NAS) explicitly constrained by deployment requirements (inference time < 10 ms, parameters < 50 M) could find Pareto-optimal architectures inaccessible to manual design (see the second sketch after this list). The Mixture-of-Experts approach shows that selective computation can preserve the benefits of transformers while mitigating their overhead, so conditional computation and adaptive inference mechanisms warrant further research.
3. Temporal Consistency Modeling at Scale: The superior performance of semantic temporal analysis points to possibilities in explicit temporal consistency modeling. Research should explore: (a) efficient temporal attention mechanisms covering full video sequences rather than short clips; (b) hierarchical temporal modeling at different timescales (frame, shot, scene) to capture both local and global inconsistencies; and (c) enforcement of physical constraints based on facial biomechanics, audio-visual physics, and lighting consistency. Lightweight temporal encoders could offer temporal robustness without the computational cost of LSTMs.
4. Unified Multimodal-Unimodal Frameworks: Current multimodal approaches work well with audio-visual data but cannot handle visual-only data. Future architectures should degrade gracefully when modalities are absent, providing reasonable performance on unimodal inputs and superior accuracy when complete multimodal data are available. Self-supervised pre-training with diverse modality combinations could enable flexible deployment across different data-availability scenarios.
5. Adversarial Robustness by Design: Security-critical applications need adversarial robustness guarantees that current detection techniques lack. Research objectives should include: (a) certified defense mechanisms with provable robustness bounds against worst-case perturbations; (b) adversarial training against diverse attack types such as noise perturbations, backdoors, and adaptive adversarial examples; and (c) ensemble approaches that combine complementary detection principles, rendering attacks more difficult. Standardized adversarial evaluation benchmarks would enable systematic robustness assessment across methods.
6. Interpretable and Explainable Detection: Forensic and legal applications require interpretable detections in addition to binary classification scores. Attention visualization, attribution methods, and counterfactual explanations identifying specific manipulation regions build trust and allow human verification. Research should develop explanation methods that balance maintaining accuracy with providing actionable forensic evidence, potentially by integrating interpretability into the architecture rather than relying on post-hoc explainability methods.
7. Standardized Evaluation Protocols: Inconsistent evaluation protocols make it difficult to compare results across studies. Standardized evaluation should require: (a) mandatory cross-dataset testing on FaceForensics++, DFDC, and Celeb-DF with consistent splits and preprocessing (see the third sketch after this list); (b) reporting of inference time, parameter counts, and computational requirements alongside accuracy metrics; (c) adversarial robustness evaluation against standardized attack protocols; (d) statistical significance testing instead of single-run performance reporting; and (e) release of code and models upon publication to allow reproducibility, currently done in only 30% of studies. Benchmark committees could set and enforce such standards as publication requirements.
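Sketch for direction 1: a hedged illustration of a contrastive objective that rewards manipulation-invariant embeddings. This is a generic supervised-contrastive loss, not the formulation of any reviewed study; the batch of embeddings is synthetic.

```python
# Contrastive objective that pulls together embeddings of the same
# class (real vs. fake) across different manipulation methods, so
# the model is rewarded for manipulation-invariant features.
import torch
import torch.nn.functional as F

def cross_manipulation_contrastive(z: torch.Tensor, labels: torch.Tensor,
                                   tau: float = 0.1) -> torch.Tensor:
    """z: (N, D) embeddings; labels: (N,) 0 = real, 1 = fake (any method)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                          # pairwise cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))    # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Maximize the average log-probability of same-label (positive) pairs.
    per_sample = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_sample.mean()

z = torch.randn(16, 128)                         # toy batch of embeddings
labels = torch.randint(0, 2, (16,))
print(cross_manipulation_contrastive(z, labels))
```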
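Sketch for direction 2: a toy constraint filter plus Pareto-front selection over candidate architectures, standing in for the selection step of a constrained NAS loop; all candidates and numbers are invented for illustration.

```python
# Filter candidate architectures by deployment constraints, then keep
# the accuracy/latency Pareto front. All values are invented.
candidates = [  # (name, accuracy, latency_ms, params_M)
    ("cnn-small",   0.940,  4.0, 12),
    ("cnn-large",   0.960,  9.0, 48),
    ("transformer", 0.970, 22.0, 90),
    ("moe-lite",    0.965,  8.0, 45),
]

# Deployment constraints from the text: < 10 ms and < 50 M parameters.
feasible = [c for c in candidates if c[2] < 10 and c[3] < 50]

def dominated(a, rest):
    """a is dominated if some b is at least as accurate and as fast."""
    return any(b[1] >= a[1] and b[2] <= a[2] and b != a for b in rest)

pareto = [c for c in feasible if not dominated(c, feasible)]
print(pareto)  # architectures worth training further
```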
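Sketch for direction 7: a minimal cross-dataset evaluation harness that trains on each benchmark and tests on all others, reporting the full train/test grid. The `load_dataset` stand-in merely simulates domain shift; a real harness would plug in the actual benchmark loaders and a real detector.

```python
# Cross-dataset evaluation grid: off-diagonal cells expose the
# generalization gaps discussed throughout Section 4.3.
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

DATASETS = ["FaceForensics++", "DFDC", "Celeb-DF"]

def load_dataset(name, split):
    """Hypothetical stand-in loader; simulates a per-dataset domain shift."""
    seed = DATASETS.index(name) * 2 + (split == "test")
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 16)) + 2.0 * DATASETS.index(name)
    y = (X[:, 0] > X[:, 0].mean()).astype(int)
    return X, y

def evaluate_grid(make_detector):
    results = {}
    for tr, te in product(DATASETS, DATASETS):
        X_tr, y_tr = load_dataset(tr, "train")
        X_te, y_te = load_dataset(te, "test")
        model = make_detector().fit(X_tr, y_tr)
        results[(tr, te)] = (model.predict(X_te) == y_te).mean()
    return results

for pair, acc in evaluate_grid(LogisticRegression).items():
    print(pair, f"{acc:.2f}")
```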
By systematically addressing these research directions, the field can advance toward detection systems that combine high accuracy, cross-dataset generalization, computational efficiency, adversarial robustness, and the interpretability required for reliable real-world deployment. The shift from accuracy optimization to well-rounded assessment across these dimensions is the paradigm change required for practical impact.