1. Introduction
The rapid proliferation of Deepfake technology is fundamentally blurring the lines between reality and fiction. Driven by sophisticated generative models such as Generative Adversarial Networks (GANs), these techniques can create hyper-realistic face swaps, manipulate facial expressions, and synthesize entire audio-visual streams with a fidelity that can deceive the human senses. This technological diffusion presents a range of severe societal challenges, including the propagation of disinformation and defamation of individuals, as well as the erosion of public trust and threats to national security. In response, the research community has developed a vast array of detection models, achieving remarkable success in classification accuracy [
1]. However, the vast majority of these high-performance detectors operate as black boxes, capable of rendering a real or fake verdict but unable to articulate the rationale behind their decisions.
This opacity constitutes a critical barrier to the adoption of Deepfake detection technologies in real-world applications. In high-stakes domains such as forensic investigation, journalistic verification, and content moderation, a model that cannot explain its reasoning is of limited practical value. A court of law, for instance, requires a verifiable evidence chain that explicitly identifies manipulated regions, not merely a probabilistic score. Consequently, a significant trust gap exists between the high accuracy of current detectors and the level of credibility required for their operational deployment. Bridging this gap is the central challenge of explainability. Within the context of Deepfake detection, explainability transcends a simple classification label; it is the comprehensive ability to provide the reasoning for a decision, localize the evidence artifacts, and even shed light on the manipulation methodology [
2]. An explainable model would not only classify a video as a forgery, but would also function as a digital forensic expert, highlighting the boundaries of a face swap, pointing out inconsistencies between facial features and head movements, or clarifying that its conclusion is based on anomalies in sensor pattern noise.
This paper provides a comprehensive and systematic survey of explainability methods in deepfake detection, offering researchers a panoramic overview of the field’s technological trajectory. Existing approaches can be broadly grouped into several paradigms. One line of research focuses on methods based on forensic analysis, which provide direct evidence by identifying physical or algorithmic artifacts left during media generation and manipulation. Another direction emphasizes model-centric approaches, which enhance transparency either through post hoc interpretation of black-box models or by designing architectures with inherent interpretability. More recently, advances have emerged in multimodal and natural language explanations, where Large Multimodal Models (LMMs) are leveraged to generate rich, human-readable reports that articulate the reasoning behind detection outcomes [
3]. In addition, this survey systematically reviews the specialized metrics and datasets developed to evaluate the quality of these explanations. By clarifying the underlying principles, advantages, and limitations of these methods, this paper aims to chart the evolutionary trajectory of explainability in Deepfake detection and to highlight promising avenues for building the next generation of trustworthy and reliable detection systems.
2. Background
2.1. Overview of Deepfake Detection
The rapid proliferation of generative models has led to an unprecedented surge in fake content, posing profound challenges to the integrity of digital media and the social structures that rely on it. Early research in this domain primarily focused on a technical arms race aimed at maximizing the accuracy of binary detectors. However, the field has reached a critical juncture: Deepfakes are no longer merely a classification target but a tool for deception, disinformation, and malicious influence, demanding a paradigm shift from detection to understanding and explanation.
Many high-performance deep learning models remain black boxes. For example, prominent Deepfake detection architectures include XceptionNet [
4], which has achieved strong performance on benchmarks like FaceForensics++ [
2], as well as various methods that analyze frequency-domain artifacts. Despite their high accuracy on benchmark datasets, these models often function as black boxes due to their complex, high-parameter nature, and this opacity is a major limitation in high-stakes scenarios such as forensic analysis. A black-box model provides only a probability score, for instance indicating that a video is highly likely to be fake, without revealing the reasoning behind the decision. This lack of transparency undermines trust in key applications, including legal proceedings, journalism, and national security, where a simple label is insufficient for evidence-based decision-making [
5].
Opaque models also often suffer from poor generalization. Many detectors achieve high accuracy by learning artifacts specific to particular generative models or datasets, but their performance can collapse when confronted with novel manipulations. Explainable AI provides a diagnostic lens that reveals whether a model relies on fundamental generalizable manipulation traces, such as blending boundaries or frequency inconsistencies, or overfits to superficial, generator-specific cues [
6]. Understanding these mechanisms is crucial for building robust and widely applicable detection systems.
2.2. Interpreting Explainability
In the context of explainable AI (XAI), interpretability and explainability represent distinct yet complementary concepts relevant to Deepfake forensics.
Interpretability refers to the extent to which a model’s internal mechanisms and decision-making processes are transparent and intelligible without auxiliary tools. An interpretable model allows direct inspection of its logic. For instance, a decision tree is inherently interpretable, as each decision can be traced through explicit logical splits [
7]. In Deepfake detection, prototype-based networks exemplify this principle: the model compares new inputs against learned prototype examples, providing insights through its structure rather than requiring post hoc explanation.
Explainability denotes the capability of a system to produce human-understandable justifications for specific outputs, often applied to complex, otherwise opaque models. Post hoc techniques generate explanations such as visual heatmaps highlighting regions critical to a decision, or textual summaries describing detected manipulation artifacts. Even when the underlying model is too complex to be fully interpretable, these methods allow its outputs to be meaningfully analyzed.
This distinction establishes two primary research directions in Deepfake detection. One path develops inherently transparent architectures emphasizing interpretability, while the other applies post hoc techniques to high-performance black-box models to enhance explainability. Both approaches aim to move beyond binary detection and provide the contextual information necessary for informed assessment.
2.3. Motivation for Explainable Detection
High-stakes applications increase the need for explainable forensic systems. Deepfakes pose a cognitive threat by undermining trust in digital media, potentially creating a scenario in which citizens can no longer rely on visual or auditory information. Reliable detection systems serve as the primary defense against this threat, but their effectiveness depends on explainability.
In legal and forensic contexts, digital evidence must be verifiable and defensible. Expert testimony cannot rely solely on AI predictions; it requires the presentation of underlying evidence. Explainable systems provide this capability by highlighting manipulated pixels or identifying inconsistencies in sensor noise patterns, thereby transforming AI from a black box into a tool that strengthens human analytical expertise.
In journalism and fact-checking, detecting falsified content is insufficient without understanding the nature of the manipulation. Explainable models can classify manipulations, such as face swaps or lip-sync alterations, and highlight the relevant artifacts, offering actionable intelligence that supports accurate reporting and public media literacy [
8].
For content moderation, platforms face unprecedented volumes of user-generated media. Automation is necessary, but human oversight remains critical. Explainable detection guides moderators to specific frames or regions deemed suspicious, improving review efficiency and decision accuracy.
Across all domains, explainability underpins accountability. It enables auditing, challenges algorithmic conclusions, and integrates AI outputs into structured human decision processes. Without it, even highly accurate detectors remain isolated tools; with it, they become trusted partners in preserving digital integrity.
3. Foundational and Forensics-Based Explainability Methods
This chapter examines detection methods grounded in the intrinsic physical properties or generative process artifacts of digital media. These approaches provide explanations by identifying forensic traces resulting from manipulation, with interpretability arising directly from the data itself rather than from the internal decision-making mechanisms of complex detection models.
3.1. Sensor Noise Pattern Analysis
One of the earliest directions in Deepfake detection focused on the physical source of digital imagery—the camera sensor. Methods based on sensor pattern noise analysis, particularly Photo Response Non-Uniformity (PRNU), form a fundamental component of digital image forensics. The underlying principle is that every digital camera sensor possesses unique microscopic manufacturing imperfections, which produce a stable and distinctive noise pattern across images. This PRNU pattern is inherent to the sensor and remains consistent for authentic images. In contrast, manipulated content often exhibits local inconsistencies or a complete absence of this pattern in altered regions, thereby providing direct, physically grounded evidence of tampering.
3.1.1. Principle of PRNU
The seminal work by Lukas et al. [
9] systematically introduced the use of sensor pattern noise, particularly PRNU, as a unique identifier for the source camera. PRNU arises from minute manufacturing variations between sensor pixels, which cause slight differences in their sensitivity to incident light. This results in a stable, multiplicative noise pattern. By extracting and averaging the noise residuals from multiple images captured by the same camera, a reference PRNU pattern can be estimated. In forensic analysis, this reference serves as a camera fingerprint. By computing the correlation between the noise residual of a query image and the reference pattern, it is possible to determine whether the image originated from the same camera. The logic is intuitive: regions that lack the PRNU fingerprint expected from the rest of the image are likely to have been altered or sourced from a different device.
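To make the fingerprint-and-correlation workflow concrete, the following is a minimal sketch, assuming grayscale images of identical size and using a Gaussian filter as a stand-in for the wavelet denoisers used in practice; all function names are illustrative rather than taken from any published implementation.

```python
# Minimal PRNU-style sketch: estimate a reference fingerprint from images of one
# camera, then correlate a query image's noise residual against it.
import numpy as np
from scipy.ndimage import gaussian_filter

def noise_residual(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Residual = image minus its denoised version (approximates sensor noise)."""
    return img - gaussian_filter(img, sigma)

def estimate_fingerprint(images: list[np.ndarray]) -> np.ndarray:
    """Average the residuals of many images from the same camera."""
    residuals = [noise_residual(im) for im in images]
    return np.mean(residuals, axis=0)

def normalized_correlation(a: np.ndarray, b: np.ndarray) -> float:
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Usage sketch: a low correlation within a sliding window suggests a region that is
# inconsistent with the camera fingerprint (possible manipulation).
# fingerprint = estimate_fingerprint(reference_images)
# score = normalized_correlation(noise_residual(query_image), fingerprint)
```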
3.1.2. Advancements in PRNU Techniques
Marra et al. [
10] addressed a more challenging forensic setting involving large collections of images with unknown origins. They proposed a blind clustering technique that relies solely on the correlation between PRNU patterns, without prior knowledge of camera sources. Their multi-stage optimization strategy combines consensus clustering with a maximum-likelihood-based merging step, improving both clustering robustness and PRNU estimation accuracy. This enables investigators to identify common sources across extensive image datasets.
Saito et al. [
11] focused on the statistical reliability of PRNU-based source attribution, introducing a theoretical framework for estimating the False Acceptance Rate (FAR). They recommended the use of Peak-to-Correlation Energy (PCE) over normalized correlation, as PCE offers a more stable decision threshold that is less sensitive to variations in fingerprint strength or structured noise such as linear patterns. This work enhanced the statistical rigor of PRNU-based analysis.
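The PCE statistic can be illustrated with the simplified sketch below, which treats PCE as the squared correlation peak normalized by the energy of the remaining correlation surface; real pipelines additionally mean-subtract the signals, handle geometric shifts, and calibrate decision thresholds.

```python
# Simplified Peak-to-Correlation Energy (PCE): the cross-correlation peak between a
# noise residual and a camera fingerprint is compared against the background energy
# of the correlation surface, excluding a small neighborhood around the peak.
import numpy as np

def pce(residual: np.ndarray, fingerprint: np.ndarray, exclude: int = 5) -> float:
    # Circular cross-correlation via the FFT.
    xcorr = np.real(np.fft.ifft2(np.fft.fft2(residual) * np.conj(np.fft.fft2(fingerprint))))
    peak = xcorr.max()
    py, px = np.unravel_index(np.argmax(xcorr), xcorr.shape)
    # Mask out a small window around the peak before estimating background energy.
    mask = np.ones_like(xcorr, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1, max(0, px - exclude):px + exclude + 1] = False
    background_energy = np.mean(xcorr[mask] ** 2)
    return float(peak ** 2 / (background_energy + 1e-12))
```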
Anti-forensic research has also emerged as a countermeasure to PRNU-based detection. The DIPPAS method proposed by Picetti et al. [
12] employs a Deep Image Prior (DIP) framework to attenuate PRNU traces in an image. A convolutional neural network is trained to generate an anonymized image that minimizes correlation with the original PRNU pattern while preserving high visual quality. This demonstrates that PRNU-based evidence, while physically interpretable, remains susceptible to sophisticated anti-forensic techniques.
3.1.3. Strengths and Limitations of Sensor-Based Traces
PRNU constitutes one of the most direct and physically interpretable forms of evidence in media forensics, enabling conclusions such as the following: a given region was not captured by the same device as the rest of the image. Its strength lies in the clear causal link between the signal and a specific physical camera sensor [
9]. Early work, such as that by Koopman et al., exploited this property by detecting disruptions in the spatial or temporal consistency of the PRNU pattern [
13]. However, reliance on a physical imaging device becomes a limitation when addressing modern Deepfakes, particularly fully synthetic content. When manipulated content is generated entirely by an algorithm, it lacks any physical camera origin and therefore carries no PRNU signal. The detection problem then shifts from identifying inconsistent PRNU to detecting its absence, a more challenging task, since many factors, including compression, resizing, or noise, can attenuate or remove the PRNU pattern. Furthermore, anti-forensic techniques such as DIPPAS [
12] can deliberately suppress PRNU traces while preserving high visual fidelity. As a result, PRNU-based methods remain effective for partially manipulated forgeries, such as face swaps, but their reliability decreases significantly for fully synthetic media or under advanced adversarial attacks.
3.2. Convolutional Traces and Residual Analysis
As sensor-based traces proved less reliable for synthetic content, research attention shifted from physical device fingerprints to algorithmic fingerprints, unique artifacts left by the image generation process itself. These methods treat the generative pipeline as a source of forensic evidence.
3.2.1. Noiseprint
Cozzolino et al. [
14] introduced Noiseprint, which improves on traditional hand-crafted PRNU features by replacing the fixed noise model with a learned representation. A Siamese network is trained on pairs of patches from the same or different cameras, learning to suppress image content while amplifying subtle model-specific artifacts. This learned camera model fingerprint is more robust to post-processing and can be used for forgery localization by detecting spatial inconsistencies within the Noiseprint map. Cozzolino et al. [
15] extended the concept to video, introducing a video noiseprint derived from temporal sequences of patches. This method enables both camera model identification and the localization of temporal manipulations.
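The Siamese training setup behind learned fingerprints can be sketched as follows; this is a heavily simplified illustration in which a small shared CNN and a contrastive loss stand in for the DnCNN-style extractor and loss formulation of the original Noiseprint work.

```python
# Highly simplified sketch of the Siamese idea: a shared CNN maps image patches to
# noise-like residual maps, and a distance-based loss pulls together patches from the
# same camera model while pushing apart patches from different models.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualExtractor(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),  # one-channel "noiseprint" map
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def siamese_loss(extractor, patch_a, patch_b, same_model, margin=1.0):
    """Contrastive loss on patch pairs (same_model: 1 if same camera model, else 0)."""
    d = F.mse_loss(extractor(patch_a), extractor(patch_b), reduction="none").mean(dim=(1, 2, 3))
    return torch.mean(same_model * d + (1 - same_model) * F.relu(margin - d))

# Usage sketch with random tensors standing in for 48x48 grayscale patches.
extractor = ResidualExtractor()
a, b = torch.randn(8, 1, 48, 48), torch.randn(8, 1, 48, 48)
labels = torch.randint(0, 2, (8,)).float()
siamese_loss(extractor, a, b, labels).backward()
```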
3.2.2. Exposing the Generative Traces of Convolutional Networks
Guarnera et al. [
16] targeted artifacts inherent to GAN-based image generation, particularly those introduced by transposed convolution layers during upsampling, termed convolutional traces. These traces manifest as distinctive local correlation patterns determined by the generator architecture. Using an Expectation–Maximization algorithm, they extracted feature vectors capturing these correlations, enabling the creation of architecture-specific fingerprints. This approach can determine authenticity and, in some cases, identify the specific GAN model responsible (e.g., StyleGAN vs. StarGAN). Later work demonstrated that convolutional traces are robust to operations such as compression and rotation and are independent of semantic image content, making the method applicable to both facial and nonfacial images [
17].
3.2.3. Separating Tampering Traces via Residual Learning
Residual learning techniques aim to separate the natural image signal from manipulation artifacts. Zhang et al. [
18] pioneered this approach with DnCNN, a CNN designed to predict the residual (noise) rather than the clean image. This implicitly isolates clean content in the intermediate features of the network.
Motivated by this concept, Guo et al. [
19] introduced GRnet, featuring a Manipulation Trace Extractor (MTE) that leverages guided filtering to retain genuine content while extracting detailed residual information. By fusing residual domain and spatial domain features through an attention-fusion mechanism, GRnet achieves enhanced resilience to common degradations, including low resolution.
Chen et al. [
20] further formalized this approach in the SNIS network, explicitly framing post-processed forgery detection as a signal–noise separation problem. Their method isolates manipulated regions from background noise, improving resilience to compression and blurring.
3.2.4. From Physical to Algorithmic Fingerprints
This research trajectory reflects a conceptual shift in forensic explainability: from physical, device-specific artifacts (e.g., PRNU) to statistical, algorithm-specific artifacts (e.g., convolutional traces, learned residuals).
Noiseprint represents a transitional stage, learning camera model fingerprints that are more abstract than PRNU but still tied to device classes [
14]. Guarnera et al. [
17] advance this further by directly modeling the fingerprints of the GAN architecture, changing the explanatory statement from “captured by camera X” to “generated by algorithm Y”. Residual-based methods such as GRnet and SNIS [
19,
20] generalize this approach, identifying manipulation artifacts regardless of the generating algorithm. Here, the explanation becomes the following: manipulated regions contain statistical inconsistencies separable from natural image content. This evolution demonstrates increasing abstraction in explainability, from concrete physical traces to generalized statistical anomalies introduced by digital synthesis or editing.
3.3. Frequency-Domain and Re-Synthesis Methods
These methods examine representations beyond the spatial domain, such as frequency spectra, or actively probe the image through further transformations to reveal inconsistencies.
3.3.1. Detecting Artifacts Beyond the Visual Spectrum
Early generative models, particularly those employing transposed convolutions, often produced periodic artifacts in the frequency domain. Early detection strategies leveraged transformations such as the Discrete Fourier Transform (DFT) to reveal these spectral patterns, laying the foundation for later, more robust approaches.
3.3.2. Probing Forgeries via Re-Synthesis and Contextual Discrepancies
He et al. [
21] argued that reliance on static frequency artifacts is unsustainable as generative quality improves. This concern is valid, as many early frequency-based methods focused on specific, tell-tale signs like upsampling artifacts introduced by transposed convolutions, which newer generator architectures may avoid. They proposed a re-synthesis framework, passing the test image through a model trained exclusively on authentic data (e.g., super-resolution or denoising). Real images yield low reconstruction error, whereas fake images, drawn from a different distribution, produce higher and more structured residuals. Detection is based on these residual patterns, providing a distributional inconsistency explanation.
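The following sketch illustrates the re-synthesis principle under simplifying assumptions: a tiny autoencoder stands in for the super-resolution or denoising models trained on authentic data, and the detection feature is reduced to a single residual statistic.

```python
# Re-synthesis sketch: pass the test image through a reconstruction model trained
# only on authentic images, then score the image by its reconstruction residual.
# Fakes, drawn from a different distribution, tend to yield larger and more
# structured residuals.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))

def residual_score(model: nn.Module, image: torch.Tensor) -> float:
    """Mean squared reconstruction residual used as an anomaly score."""
    with torch.no_grad():
        residual = image - model(image)
    return float(residual.pow(2).mean())

# Usage sketch (the autoencoder is assumed to be pre-trained on real images only).
model = TinyAutoencoder().eval()
score = residual_score(model, torch.rand(1, 3, 128, 128))
```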
Nirkin et al. [
22] targeted semantic inconsistencies in face-swap forgeries. They trained separate recognition networks for the inner facial region and the surrounding context. Discrepancies between the two identity predictions strongly indicate a swap. The explanation is intuitive: the identity of the face does not match the identity cues from the context.
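A minimal version of this face-versus-context check is sketched below; the two encoders are untrained placeholders standing in for the recognition networks trained in the original work, and the threshold logic is omitted.

```python
# Sketch of the face/context identity check: one encoder embeds the inner face,
# another embeds the surrounding context (hair, ears, head region), and a low cosine
# similarity between the two identity embeddings flags a likely swap.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(dim: int = 128) -> nn.Module:
    return nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))

face_encoder, context_encoder = make_encoder(), make_encoder()

def swap_score(face_crop: torch.Tensor, context_crop: torch.Tensor) -> float:
    """Returns 1 - cosine similarity; higher values indicate an identity mismatch."""
    f = F.normalize(face_encoder(face_crop), dim=-1)
    c = F.normalize(context_encoder(context_crop), dim=-1)
    return float(1.0 - (f * c).sum(dim=-1).mean())

score = swap_score(torch.rand(1, 3, 112, 112), torch.rand(1, 3, 112, 112))
```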
3.3.3. The Role of Transformed Domains and Generative Probing
These methods signal a shift from passive observation of artifacts to active interrogation of an image. Re-synthesis [
21] evaluates an image’s compatibility with models trained on genuine data, while contextual identity checks [
22] test for semantic contradictions. As synthetic media becomes increasingly artifact-free, active probing strategies are likely to become central to detection, focusing on logical and statistical inconsistencies rather than fixed forensic markers.
3.4. Feature Decoupling and Contrastive Learning
Advanced representation learning methods aim to construct feature spaces that naturally separate real and manipulated content, yielding explanations derived from the learned embedding structure.
3.4.1. Learning Generalizable Forgery Features with Contrastive Learning
Xu et al. [
23] applied Supervised Contrastive Learning (SupCon) to enhance generalization in Deepfake detection. The SupCon loss encourages intra-class compactness and inter-class separation in the learned feature space. By contrasting authentic images with a diverse set of forgeries, the model learns a robust representation of authenticity. Heatmap visualizations reveal that the network often attends to facial boundaries where manipulations are introduced, offering spatial interpretability. The Dual Contrastive Learning (DCL) framework [
24] modifies the standard classification objective by operating at two granularities. First, Inter-Instance Contrastive Learning (Inter-ICL) is designed to learn a globally discriminative feature space by increasing the similarity between representations of authentic images while decreasing their similarity to those of forged images. Second, Intra-Instance Contrastive Learning (Intra-ICL) addresses local inconsistencies within a single forged image by contrasting features from manipulated regions against those from original regions. This dual mechanism is intended to produce a more generalizable representation by training the model to be sensitive to both global authenticity and local artifacts.
3.4.2. Explaining Detection via Source–Target Image Matching
Dong et al. [
25] analyzed Deepfake detectors through the lens of source–target forgery relationships, constructing fake-source-target (FST) image triplets to study model responses under different matching conditions (FST-Matching). They proposed that detectors implicitly learn artifact-related visual concepts via this relational structure and designed a model that explicitly incorporates such matching, with improvements that are particularly noticeable for heavily compressed videos.
3.4.3. Separating the Forgery Signal
Unlike methods that search for predefined artifacts, contrastive learning approaches define the objective in terms of feature-space geometry: maximize the distance between real and fake representations. Interpretability arises from post hoc analysis of model attention and activation patterns, revealing which features drive this separation. This reflects a higher level of abstraction in forensic reasoning, focusing on the learned discriminative features rather than fixed, physically motivated traces, allowing adaptation to novel manipulation techniques.
The effectiveness of such feature space separation is often visually explained using dimensionality reduction techniques like t-Distributed Stochastic Neighbor Embedding (t-SNE). By projecting the high-dimensional feature embeddings of real and fake images into a two- or three-dimensional map, t-SNE provides an intuitive visualization of whether a model has successfully learned to organize authentic and manipulated content into distinct clusters. This serves as an effective diagnostic tool for interpreting the structure of the model’s learned representation.
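A typical t-SNE diagnostic of this kind can be produced with a few lines of code, as in the sketch below; the random embeddings stand in for penultimate-layer features exported from a trained detector.

```python
# t-SNE diagnostic: project detector embeddings of real and fake samples to 2D and
# check whether the two populations form separate clusters.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(200, 256)            # (num_samples, feature_dim)
labels = np.array([0] * 100 + [1] * 100)          # 0 = real, 1 = fake

points = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)

plt.scatter(points[labels == 0, 0], points[labels == 0, 1], s=8, label="real")
plt.scatter(points[labels == 1, 0], points[labels == 1, 1], s=8, label="fake")
plt.legend()
plt.title("t-SNE of detector embeddings")
plt.savefig("tsne_embeddings.png")
```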
4. Model-Centric and Interpretable-by-Design Methods
This chapter shifts the focus from artifacts embedded in the data to the decision-making process of the detection model itself. It reviews two main paradigms, namely the use of post hoc techniques to explain existing black-box models and the design of inherently transparent model architectures.
4.1. Visualization and Post Hoc Explanations
These approaches typically begin with a pre-trained, high-performance deep learning model and apply external tools to generate explanations for its predictions.
4.1.1. Application of Gradient and Perturbation Methods
A significant body of work has adapted general-purpose XAI techniques for Deepfake detection. Among the most popular of these are gradient-based attribution methods such as Gradient-weighted Class Activation Mapping (Grad-CAM) [
26]. This technique produces a visual explanation in the form of a heatmap, highlighting the input regions that are most influential for a specific prediction. In the context of deepfake forensics, Grad-CAM is widely used to reveal which facial regions a detector focuses on when classifying an image as fake.
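The mechanics of Grad-CAM can be summarized in the short sketch below, which uses a torchvision ResNet-18 as a stand-in for a trained two-class detector; the hook names and layer choice are illustrative.

```python
# Minimal Grad-CAM sketch: gradients of the "fake" logit with respect to the last
# convolutional feature map weight the channels of that map into a coarse heatmap.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()                       # stand-in for a trained detector
model.fc = torch.nn.Linear(model.fc.in_features, 2)         # 2 classes: real / fake

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out
    out.register_hook(lambda grad: gradients.update(value=grad))

model.layer4.register_forward_hook(fwd_hook)                # last convolutional block

def grad_cam(image: torch.Tensor, target_class: int = 1) -> torch.Tensor:
    logits = model(image)
    model.zero_grad()
    logits[0, target_class].backward()
    acts, grads = activations["value"], gradients["value"]
    weights = grads.mean(dim=(2, 3), keepdim=True)           # global-average-pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()              # normalized heatmap in [0, 1]

heatmap = grad_cam(torch.rand(1, 3, 224, 224))
```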
Malolan et al. [
27] proposed a framework utilizing visual interpretability methods such as saliency maps and guided backpropagation to reveal which facial regions a CNN relies on in its decision-making process. They further compared white-box, black-box, and model-specific techniques, and adapted SHAP and Grad-CAM to explain an EfficientNet-based detector. Ge et al. [
28] demonstrated the utility of SHAP for analyzing both spoofing and Deepfake detectors. Their study showed how SHAP can expose unexpected model behaviors, such as reliance on non-speech intervals in audio, identify the most contributory artifacts, and highlight behavioral differences between competing classifiers. Parvez [
29] proposed combining a CNN with a CapsuleNet, integrated with Grad-CAM to visualize the input regions most important for prediction, thereby enhancing model transparency and accountability.
4.1.2. The Utility and Pitfalls of Post Hoc Explanations
The primary value of post hoc explanations lies not in providing definitive forensic evidence but in serving as diagnostic tools for model developers and as confidence-building mechanisms for end-users. These methods can expose whether a model is focusing on plausible cues, but their fidelity is limited, as the explanation itself is only an approximation of the model’s internal processes. The opacity of deep learning detectors creates a trust gap, which post hoc tools such as Grad-CAM and SHAP attempt to mitigate by providing heatmaps or feature importance scores that suggest why a model reached a given decision [
27,
30]. Ge et al. [
28] highlighted the critical role of this approach in discovering when a model exploits spurious correlations, such as irrelevant audio features.
However, explanations derived from gradient-based methods remain indirect. As Gowrisankar et al. [
31] emphasized, standard XAI evaluation frameworks may not be suitable for Deepfake detection, and the validity of explanations must themselves be scrutinized. Overall, post hoc explanations represent an important first step toward transparency, but their role is primarily to verify that the model attends to meaningful input regions rather than to provide direct forensic proof. This limitation motivates the design of interpretable-by-architecture models.
4.2. Attention Mechanisms and Transformers
Attention-based methods incorporate explainability directly into the model architecture. By design, the model outputs attention weights that indicate which parts of the input most influenced its decision.
4.2.1. Self-Attention and Cross-Modal Attention for Forgery Localization
Asha et al. [
32] introduced a defensive attention mechanism for detecting multimodal Deepfakes involving both audio and video. Their system employs a self-attentive VGG16 for visual features and a self-attentive RNN for audio features. The core innovation is a cross-modal attention mechanism that quantifies discrepancies between audio and video streams, simultaneously supporting detection and interpretability. Another line of research shifts focus from low-level visual artifacts to higher-level semantic inconsistencies. The Voice-Face Homogeneity approach [
33], for example, operates on the premise that an individual’s voice and face share a biometric link. It uses a pre-trained speaker-face matching model to determine if the audio speaker is the same person as the individual depicted visually. By targeting a potential identity mismatch in face-swap forgeries, rather than specific visual artifacts, this method is designed to generalize to novel manipulation techniques.
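The general form of such cross-modal attention can be sketched as follows; the feature shapes and dimensions are illustrative and do not correspond to the cited architectures.

```python
# Cross-modal attention sketch: video tokens attend over audio tokens, and the
# attention weights expose which temporal segments the model aligns (or fails to
# align) across modalities.
import torch
import torch.nn as nn

video_feats = torch.randn(1, 50, 256)   # (batch, video frames, feature dim)
audio_feats = torch.randn(1, 80, 256)   # (batch, audio frames, feature dim)

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Query = video, key/value = audio: each video frame gathers the audio evidence most
# consistent with it; the weight matrix itself is an interpretable artifact.
fused, attn_weights = cross_attn(query=video_feats, key=audio_feats, value=audio_feats)

# attn_weights has shape (batch, video frames, audio frames); rows with diffuse or
# misaligned weights can indicate audio-visual inconsistency.
consistency = attn_weights.max(dim=-1).values.mean()
```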
4.2.2. Case Study
Qi et al. [
34] proposed DeepRhythm, which detects Deepfakes by exploiting disruptions in remote photoplethysmography (rPPG) signals, corresponding to subtle heartbeat rhythms observable in facial videos. To achieve robust detection, they designed a dual spatio-temporal attention network that learns both spatial regions and temporal segments most reliable for rPPG extraction. The resulting attention maps not only drive the detection process but also provide intuitive explanations of where and when the model relies on physiological cues.
4.2.3. Building Intrinsic Focus into Detection Models
Unlike post hoc explanations, attention mechanisms provide explanations that are integral to the model’s computations. Post hoc tools approximate importance retrospectively, whereas attention-based models explicitly assign weights to input regions during training. In DeepRhythm [
34], for example, the model learns which facial areas are most informative for rPPG extraction, and the attention map functions both as a signal-processing step and as an explanation. Similarly, in Asha et al. [
32], cross-modal attention weights directly quantify consistency between audio and video streams, making them both computationally essential and interpretable. This tight coupling of explanation and decision processes makes attention-based methods more faithful to the model’s reasoning.
4.3. Prototype-Based and Case-Based Reasoning
Prototype-based methods aim to provide explanations in terms of similarity to representative cases, mirroring forms of human reasoning that rely on precedents.
4.3.1. Explaining Decisions with Dynamic and Learned Prototypes
Trinh et al. [
7] introduced the Dynamic Prototype Network (DPNet), which employs learnable prototypes to capture temporal artifacts in Deepfake videos. Instead of returning only a binary classification, the model explains a decision by comparing the input to a set of class-specific prototypes. For example, an unnatural head movement is detected as fake because it closely resembles a learned prototype of Deepfake head motion. PUDD (Prototype-based Unified Framework for Deepfake Detection) [
35] presents a similarity-driven approach to deepfake detection, where input data is compared to known prototypes for classification. The system identifies potential deepfakes and previously unseen classes by analyzing similarity drops. PUDD integrates image classification as an upstream task during training, allowing it to perform well in both image classification and deepfake detection. This approach offers notable efficiency and environmental benefits, as it requires minimal retraining time and has a significantly lower carbon footprint compared to existing models.
4.3.2. Human-in-the-Loop Refinement with Visual Analytics
den Bouter et al. [
36] addressed challenges in prototype interpretability, such as redundancy and poor human comprehensibility, by developing ProtoExplorer, a visual analytics tool for forensic experts. The system enables exploration and refinement of prototype sets, including visualization of spatio-temporal prototypes, interactive filtering of predictions, and removal of uninformative prototypes. This human-in-the-loop process improves interpretability while maintaining detection accuracy.
4.3.3. Towards Case-Based Explainability
Prototype methods shift the explanation from abstract features to concrete archetypal cases. Trinh et al. [
7] demonstrated that decisions can be explained by reference to representative examples, offering a precedent-based rationale closer to human reasoning. Whereas saliency maps highlight important regions, prototype methods categorize anomalies by comparison to known cases. The quality of these prototypes is therefore crucial. ProtoExplorer [
36] exemplifies how expert-guided curation of prototype sets can ensure interpretability and non-redundancy. Such case-based reasoning offers a more tangible form of evidence for forensic analysts, who can inspect the referenced prototype to understand the detected artifact.
4.4. Other Interpretable Architectures
Shang et al. [
37] introduced PRRNet, which captures inconsistencies between local facial regions and global context. By explicitly modeling the relationships between pixel-level and region-level representations, the model highlights disharmonies characteristic of forgeries. The interpretability arises from its architectural design, which attends to these discrepancies. Soltandoost et al. [
38] proposed a method to extract local explanations from global representations. This architecture enables decomposition of holistic feature vectors into localized evidence, bridging the gap between strong classification performance and interpretable outputs. Recent work has also explored architectures that address considerations such as the computational cost and parameter efficiency of deploying large models. The MoE-FFD framework [
39], for instance, adapts a pre-trained Vision Transformer (ViT) for face forgery detection using a Mixture of Experts (MoE) approach combined with Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA). In this architecture, a gating network dynamically selects a subset of ‘expert’ models to process an input image, while the ViT backbone remains frozen. While the primary goal is efficiency, this modular design provides a mechanism for potential interpretability; for example, analyzing which experts are activated for different forgery types may offer insight into model specialization.
5. Multimodal and Language-Based Explanations
This chapter discusses a recent and significant development in the field: the transition from purely visual or feature-based explanations to natural language explanations enabled by LMMs.
5.1. Textual Explanations and Visual Question Answering (VQA) Frameworks
To enhance both interpretability and robustness in Deepfake detection, recent works have increasingly adopted VQA-style formulations that jointly address classification and explanation. Kundu et al. [
40] proposed TruthLens, a framework that formulates Deepfake detection as a VQA task. The model provides not only binary classification but also detailed textual reasoning for its predictions, and it is capable of addressing fine-grained queries such as whether a specific facial attribute appears authentic. The architecture is hybrid, combining a vision-only model for local feature extraction with a Large Multimodal Model (LMM) for global context understanding and text generation. This design enables the system to detect subtle artifacts and explain them in natural language. Guo et al. [
41] introduced M2F2-Det, a multimodal detector that jointly generates detection scores and textual explanations. The model leverages the visual representation capabilities of a pre-trained CLIP model together with the generative ability of a Large Language Model (LLM), thereby linking subtle forgery-related visual cues with natural language descriptions. Jia et al. [
42] emphasized the role of common-sense reasoning in detection. They argued that many Deepfakes violate basic perceptual rules, such as inconsistent hairlines or unnatural skin shading, which are readily identified by humans but often overlooked by conventional CNN-based approaches. To address this, they introduced the DD-VQA dataset, which pairs images with questions and detailed, commonsense-based textual explanations. Their framework employs a vision-and-language transformer to execute the VQA task. Yan et al. [
43] proposed
-DFD, a framework for systematically assessing and enhancing the Deepfake detection capabilities of LMMs. The framework comprises three modules: Model Feature Assessment (MFA), which evaluates an LMM’s ability to capture different forgery features; Strong Feature Strengthening (SFS), which fine-tunes the model on features where it already demonstrates competence; and Weak Feature Supplementation (WFS), which incorporates external detectors for features where the LMM performs poorly. This structured approach yields a more robust and interpretable hybrid system.
Huang et al. [
44] developed SIDA, a framework tailored to social media platforms that performs detection, localization, and explanation simultaneously. Leveraging an LMM, SIDA produces not only a classification output but also a segmentation mask of manipulated regions together with a textual justification. They also introduced SID-Set, a large-scale dataset designed specifically for this multi-task setting. Following this direction, recent forensics-driven frameworks such as Propose and Rectify [
45] utilize the capabilities of Multimodal Large Language Models (MLLMs). These models are designed to first propose potential forgery regions and then rectify their localization, demonstrating a multi-step process that aims to improve the precision of manipulation detection. LayLens [
46] is a user-centric tool for deepfake forensics that combines explainable forgery localization, natural language simplification, and visual reconstruction. It translates complex model reasoning into accessible explanations for non-experts while maintaining technical depth, and presents side-by-side comparisons of original and reconstructed images. User studies demonstrate that LayLens improves clarity, reduces cognitive load, and enhances trust in deepfake detection.
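As a concluding illustration of the VQA-style formulation shared by these systems, the sketch below shows one way a structured prompt and reply parser might be set up; the prompt wording, JSON schema, and the `query_multimodal_model` call are hypothetical placeholders, not interfaces from the cited frameworks.

```python
# Illustrative sketch of detection framed as visual question answering: a structured
# prompt asks an LMM for a verdict, localized evidence, and a short justification,
# and the reply is parsed into fields.
import json

def build_prompt(question: str = "Is this face image authentic or manipulated?") -> str:
    return (
        f"{question}\n"
        "Answer in JSON with keys: verdict ('real' or 'fake'), "
        "evidence_regions (list of facial regions), explanation (one or two sentences)."
    )

def parse_reply(reply: str) -> dict:
    """Best-effort parse of the model's JSON reply; falls back to raw text."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return {"verdict": None, "evidence_regions": [], "explanation": reply}

# Usage sketch (the model call is a stand-in for whichever LMM interface is deployed):
# reply = query_multimodal_model(image_path="suspect.jpg", prompt=build_prompt())
# result = parse_reply(reply)
```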
5.2. The Paradigm Shift to Natural Language Explanations
The introduction of Large Vision-Language Models (LVLMs), also referred to as LMMs, marks a methodological shift in deepfake detection. While earlier methods often treated explainability as a separate, post hoc analysis, LVLM-based frameworks integrate explanation directly into their primary output. The task is consequently reoriented from a focus on binary classification toward a process of generating reasoned explanations for why an image is considered authentic or manipulated. This aligns the model’s output more closely with human-centric, evidence-based analysis. Traditional forensic and model-centric approaches, such as those relying on PRNU correlation plots, heatmaps, or prototype analysis, generate explanations that typically require expert interpretation. In contrast, LMMs can integrate visual and linguistic reasoning to produce directly human-readable explanations.
Jia et al. [
42] highlight the discrepancy between machine and human reasoning: while models excel at detecting low-level statistical artifacts, humans rely heavily on high-level common-sense reasoning. LMMs represent a generation of models capable of bridging this gap by articulating observations such as unrealistic texture or improper alignment of facial features in natural language, for instance noting that "the texture of the skin in the cheek area appears artificially smooth." Such output is directly accessible to non-expert users such as journalists or content moderators, unlike the correlation plots of PRNU analysis or the heatmaps of Grad-CAM, which typically require domain expertise to interpret. This reframes the detection task: instead of training a classifier and then constructing an auxiliary explanation mechanism, models such as TruthLens and M2F2-Det perform detection through the process of explanation itself [
40,
41].
This shift has substantial implications for the field. It suggests that the most effective and trustworthy detection systems will not be simple classifiers but reasoning frameworks that articulate their findings in a transparent and accessible manner. The focus of research is therefore moving from accuracy alone to the quality, coherence, and faithfulness of generated explanations.
7. Discussion and Future Directions
7.1. Evolution of Explainability Methods
Research on explainability in Deepfake detection has progressed along several stages. Early work concentrated on identifying physical or algorithmic artifacts of manipulation. Explanations at this stage were grounded in measurable irregularities, such as the PRNU patterns of camera sensors, or algorithmic traces introduced by generative architectures. These approaches connected the detection process to observable properties of either acquisition devices or synthesis methods.
As detection models became more complex, attention shifted from analyzing data artifacts to interpreting the models themselves. Post hoc methods such as Grad-CAM and SHAP were applied to visualize decision-making in trained models. In parallel, explainability was embedded directly into model architectures through mechanisms like attention, which indicates focus regions, or prototype-based learning, which relates predictions to known forgery cases. This development reframed explanations from pointing to artifacts in the data toward revealing the internal reasoning of the models.
Recent work has introduced a multimodal and language-based perspective. LMMs are increasingly used to generate natural language descriptions that articulate the reasoning behind classification outcomes. Frameworks such as TruthLens and DD-VQA treat the detection task not only as binary classification but as a process of producing textual explanations aligned with human understanding. This shift expands the role of explainability from providing technical cues for experts toward communicating results in a form accessible to broader audiences.
7.2. Challenges
A key difficulty concerns the balance between accuracy, interpretability, and computational efficiency. Simpler architectures tend to be more transparent, yet often lack the precision required to handle advanced forgeries. In contrast, highly accurate models usually remain opaque, which complicates the effort to provide explanations that are both faithful and efficient. Methods that can reconcile these competing objectives remain an open area of research.
Fidelity of explanations is also a persistent concern. Explanations should reflect the true reasoning process of the model rather than provide a post hoc narrative that appears plausible but is disconnected from the underlying decision. The reliability of methods such as Grad-CAM has been questioned in this respect. For example, Gowrisankar et al. [
31] introduced adversarial evaluation protocols to test whether highlighted regions truly align with the model’s reasoning. Robustness is equally important, since adversarial or anti-forensic manipulations can obscure forensic traces, such as deliberately suppressing PRNU signals.
Generalization across unseen forgery techniques presents another obstacle. Approaches that rely on specific artifacts, including PRNU patterns or convolutional traces, may lose effectiveness when confronted with manipulations produced by new generative architectures. Recent work has attempted to address this by focusing on universal forgery representations or by framing unknown manipulations as out-of-distribution instances, though this remains an open problem.
Evaluation and usability form an additional dimension. A technically valid explanation may not necessarily assist human decision-making. Early studies frequently relied on subjective inspection without standardized benchmarks. Prototype-based methods, while offering case-based reasoning, sometimes generate redundant or unintuitive examples that require expert curation. Developing systematic evaluation protocols and presentation strategies that adapt to different user needs is essential for broader adoption.
Furthermore, the deployment of these detection systems introduces significant regulatory and ethical considerations. Emerging legal frameworks for artificial intelligence, for instance, are beginning to mandate transparency for high-risk systems, which would include tools used for verifying forensic evidence or news content. The admissibility of AI-generated explanations in a court of law, the ethical responsibility of news organizations to audit their verification algorithms, and the potential for algorithmic bias in detection models are all critical challenges that require interdisciplinary attention beyond the purely technical domain.
7.3. Opportunities
Hybrid and Multimodal Explainability. Future research can integrate multiple layers of evidence into unified frameworks. LMMs can be used to interpret low-level forensic artifacts such as PRNU inconsistencies or convolutional traces, while simultaneously generating high-level semantic explanations. This type of hybrid system retains the rigor of forensic evidence while presenting results in a human-readable form. The -DFD framework, which incorporates external detectors to complement large models, illustrates an initial step in this direction.
Causal Reasoning and Robustness of Explanations. Moving beyond correlation-based attribution, causal inference provides a way to relate model outputs to underlying physical or semantic inconsistencies, such as deviations from lighting laws or natural dynamics. Explanation methods built on causal principles have the potential to improve reliability by clarifying why a decision was made rather than only where a model focused. At the same time, anti-forensic techniques highlight the need for explanations that are robust by design against adversarial manipulations intended to obscure forensic traces.
Continual Learning for Open-World Scenarios. Static detectors often fail when new forgery techniques emerge. Explainable models that incorporate continual, zero-shot, or few-shot learning could adapt dynamically to novel manipulations while preserving interpretability. Such approaches would allow systems to evolve in parallel with generative models rather than requiring retraining from scratch.
Interactive and User-Centered Explainable Systems. Explanation can extend beyond static outputs to interactive systems that allow users to query and refine results. Forensic experts may require the ability to trace specific artifacts, while general users may prefer concise textual descriptions. Systems such as ProtoExplorer demonstrate how prototypes can be curated and refined interactively, and VQA-based frameworks like TruthLens show the potential of dialog-style querying. These directions point toward explainable systems that adapt to the expertise and needs of different users.
Proactive Forensics and Content Provenance. Beyond passive detection approaches, researchers have investigated proactive authentication strategies. The FractalForensics framework [
53] is an example of this approach, which involves embedding an imperceptible, semi-fragile watermark into an image before its distribution. The watermark is designed to be robust to common operations like compression while being fragile to manipulations such as face swapping. In this framework, the integrity of the watermark serves as evidence of authenticity. This method also provides a form of direct explanation, as damaged or missing sections of the watermark serve to localize the manipulated regions.
Bridging Research and Practice in Real-World Scenarios. A gap remains between the evaluation of explainable methods on benchmark datasets and their practical implementation in high-stakes environments, such as forensic investigations or newsrooms. Future work should prioritize closing this gap by conducting case studies on real-world legal or journalistic corpora. Furthermore, there is a need to develop evaluation metrics that extend beyond algorithmic performance to measure practical utility, such as the time required for a human expert to reach a conclusion with AI assistance, or the overall impact of the explanation on the decision-making process.
8. Conclusions
The research paradigm in Deepfake detection has fundamentally shifted, as classification accuracy alone no longer satisfies real-world demands for credibility. This paper has systematically traced the evolution of explainability, from methods grounded in forensic data artifacts to the interpretation of model behavior, culminating in the current paradigm of semantic explanation via large multimodal models. The future trajectory of the field, therefore, will be defined not just by enhancing detection performance, but by developing transparent and reliable explanation mechanisms that foster human–machine synergy in safeguarding information integrity. This necessitates that future research afford equal weight to a model’s predictive power and the fidelity and usability of its explanations, a foundational strategy for confronting the challenges of digital disinformation.