1. Introduction
In recent years, with the rapid development of deep learning technologies such as Generative Adversarial Networks (GANs) and diffusion models, deepfake techniques have made significant progress [1,2]. These techniques are capable of generating highly realistic forged videos, which have greatly facilitated advancements in fields such as film production, virtual reality, and digital entertainment. However, the negative impacts of deepfake technology have also become increasingly prominent. Forged videos are frequently exploited for malicious purposes including disinformation dissemination, defamation, and political manipulation, thereby posing serious threats to public safety and the trust infrastructure of modern society [3]. A notable example occurred during the recent Russia–Ukraine conflict, in which a video depicting President Zelensky allegedly urging soldiers to lay down their arms was circulated online and later confirmed to be a deepfake, marking the first known application of deepfake technology in a military context [4]. Consequently, the development of efficient, robust, and generalizable deepfake video detection methods has emerged as a critical research focus in the fields of computer vision and AI security.
Current deepfake video detection methods face substantial challenges, primarily due to the nature of the forged content itself. Generated by powerful deep learning models, deepfake content often exhibits high fidelity in visual details and structural features, rendering it nearly indistinguishable from real content. This significantly undermines the effectiveness of traditional feature extraction and video analysis techniques. To address these issues, a wide range of deepfake detection approaches have been proposed in recent years, primarily falling into two major categories. The first category focuses on temporal consistency modeling, which aims to capture subtle temporal discontinuities or anomalies by analyzing the dynamic evolution between consecutive video frames [5,6,7,8,9,10,11]. The second category emphasizes intra-frame forensic analysis, which seeks to detect local artifacts by examining spatial cues such as texture irregularities, frequency domain distributions, and illumination inconsistencies within individual frames [12,13,14,15,16]. These two categories represent the most prominent directions in current deepfake video detection research, each addressing the problem from a complementary perspective: temporal modeling and spatial feature analysis.
However, as deepfake generation techniques continue to evolve, existing detection methods are increasingly challenged in complex and dynamic real-world scenarios. Inter-frame approaches may fail to capture meaningful temporal anomalies in low-motion or static scenes, while intra-frame methods, although effective in modeling local textures and structural inconsistencies, often struggle to distinguish high-fidelity forged content from genuine images and tend to overlook latent temporal dependencies across frames. Consequently, developing a unified framework that effectively integrates spatial perception with temporal modeling remains a central challenge in deepfake detection.
Meanwhile, uncertainty quantification has emerged as a fundamental research direction for improving the reliability and robustness of deep learning systems. Existing studies have systematically categorized uncertainty estimation methods and established the distinction between data-related and model-related uncertainty [17,18], while practical Bayesian learning techniques such as Monte Carlo Dropout have demonstrated that uncertainty can be effectively extracted from standard neural networks [19]. In parallel, recent advances indicate that continuous-time stochastic modeling provides a principled framework for capturing non-stationary temporal dynamics [20], quantum-inspired entropy measures such as von Neumann entropy offer advantages in representation manipulation [21], and fractional-order operators are effective in detecting anomalies with long-range dependencies due to their inherent non-local memory property [22]. Collectively, these findings motivate the development of a unified uncertainty-aware spatiotemporal modeling framework for robust deepfake detection.
To address the above challenges, we propose a novel deepfake video anomaly detection framework, termed SDEQ-Net, which integrates stochastic differential equations with quantum-based uncertainty modeling. The proposed method is designed to jointly capture spatial texture cues and inter-frame dynamic evolution patterns. SDEQ-Net comprises three key modules: a Continuous-time Neural Stochastic Differential Filtering Module (CNSDFM) for modeling temporally continuous frame transitions; a Quantum Uncertainty-Aware Fusion Module (QUAFM) for uncertainty-aware feature integration; and a Fractional-Order Temporal Anomaly Detection Module (FOTADM) for frame-level anomaly localization. To verify the effectiveness of the proposed framework, extensive experiments are conducted on three widely used deepfake video datasets, including FaceForensics++, Celeb-DF, and DFDC. Experimental results demonstrate that SDEQ-Net significantly outperforms existing state-of-the-art methods across various metrics and exhibits superior generalization capability in cross-dataset evaluations.
The main contributions of this study are summarized as follows:
We develop a novel CNSDFM that models the continuous evolution of video frame states using neural stochastic differential equations. By combining LSTM-based temporal encoding with gain filtering, the module enables dynamic modeling and temporal uncertainty estimation.
We propose a QUAFM that leverages density matrix representations and von Neumann entropy to quantify and fuse uncertainty-aware features, thereby improving the stability and robustness of uncertain region identification.
We design a FOTADM that introduces a residual-driven, fractional-order scoring mechanism for fine-grained frame-level anomaly detection, guiding the model to adaptively focus on anomalous dynamic patterns.
We perform comprehensive evaluations on three benchmark datasets (FF++, Celeb-DF, and DFDC), demonstrating the superior performance and generalizability of the proposed method compared to mainstream baselines.
The remainder of this paper is organized as follows:
Section 2 reviews related work.
Section 3 details the proposed methodology and component modules.
Section 4 presents experimental settings, evaluation metrics, and results.
Section 5 concludes the study and outlines future research directions.
2. Related Work
This section first reviews the mainstream techniques for facial forgery generation, including methods based on Autoencoders, Generative Adversarial Networks (GANs), Diffusion Models, and Transformer-based architectures. Subsequently, we discuss existing detection strategies for facial forgery, which can be broadly categorized into intra-frame-based and inter-frame temporal-based detection approaches.
2.1. Deepfake Generation Techniques
Deepfake generation has evolved through several paradigms. Early approaches employed autoencoders to learn latent facial representations for face swapping [23,24]. Generative Adversarial Networks (GANs) subsequently became dominant, with variants such as StyleGAN producing highly realistic results [25,26]. More recently, diffusion models have emerged as powerful alternatives, achieving state-of-the-art image quality through iterative denoising [27,28]. Transformer-based architectures have also been explored for their ability to model long-range dependencies [29,30]. These advances have significantly increased the realism of synthetic content, posing substantial challenges for detection methods.
2.2. Deepfake Detection Methods
With the rapid evolution of deepfake generation techniques, the challenge of accurately detecting manipulated content has become increasingly urgent. Deepfake detection methods primarily aim to determine the authenticity of visual content by identifying abnormal patterns or inconsistencies within images or videos. Based on the processing strategy, existing detection approaches can be broadly divided into two categories:
(1) Inter-frame-based deepfake detection. Real video sequences typically exhibit smooth and coherent temporal transitions, while manipulated videos often contain abrupt or inconsistent temporal dynamics. Therefore, detecting subtle discrepancies between consecutive frames has proven effective for identifying forgeries. For instance, the work in [8] utilized optical flow estimation to capture latent differences across frames, which were subsequently fed into a Convolutional Neural Network (CNN) for classification. In [31], a Gated Recurrent Unit (GRU) was integrated after the CNN to incorporate temporal dependencies for improved distinction of frame-level inconsistencies. The study in [10] employed 3D Convolutional Neural Networks (3D CNNs) to jointly model spatial artifacts and temporal motion inconsistencies, enabling comprehensive spatiotemporal detection. To further enhance performance, several works have introduced domain-specific priors, such as eye blinking patterns [32], lip movements [11], and emotional expressions [33], as discriminative features. In addition, recent approaches have focused on temporal representation learning. For example, ref. [34] proposed a framework that explicitly learns spatiotemporal inconsistencies for deepfake detection. The work in [35] adopted a self-supervised cross-modal learning strategy by leveraging the natural synchrony between visual and auditory modalities in real videos to learn temporally dense representations. Moreover, ref. [5] introduced a self-supervised framework that learns consistency representations from authentic facial videos, aiming to improve the generalization and robustness of detection models against various types of manipulations.
(2) Intra-frame-based deepfake detection. Intra-frame detection methods focus on analyzing the content of individual video frames, aiming to identify forgeries by extracting local image features such as texture artifacts, illumination inconsistencies, and frequency domain cues. The study in ref. [36] ignored fine-grained facial features and instead concentrated on detecting global artifacts for deepfake identification. In ref. [37], a method was proposed to estimate 3D head pose from facial images, where an angular deviation beyond a certain threshold between two directional vectors was considered indicative of forgery. The work in ref. [38] extracted and fused features from both YCbCr and RGB color spaces to improve detection accuracy. In ref. [39], original images were partitioned into equal-sized blocks prior to spatial feature extraction, encouraging the model to explore more discriminative tampering traces. In ref. [40], a deepfake detection framework with adaptively weighted multi-scale attention features was introduced to enhance feature representation. To improve generalization, ref. [41] utilized unsupervised domain adaptation to bridge the distribution gap between real and fake samples. To further boost detection performance, several studies have incorporated frequency domain information. The work in ref. [42] analyzed both global and local temporal inconsistencies from spatial and frequency perspectives, proposing a locally frequency-guided dynamic inconsistency network. In ref. [43], a frequency-aware learning strategy was introduced to enhance generalization performance by focusing on frequency domain patterns. In ref. [44], a spatial-phase shallow learning method was developed by integrating spatial image content with phase spectrum features, effectively capturing upsampling artifacts in manipulated faces. The study in ref. [45] employed the Laplacian of Gaussian (LoG) filter for frequency-based enhancement, aiming to suppress low-level image content while amplifying high-frequency forgery cues.
In summary, existing detection methods predominantly focus on either temporal or spatial features in isolation. Inter-frame approaches may fail in low-motion scenarios, while intra-frame methods struggle against high-fidelity forgeries. These limitations motivate the development of unified frameworks that effectively integrate both perspectives.
3. Proposed Method
This section introduces SDEQ-Net, a unified framework that integrates neural stochastic differential modeling with quantum-inspired uncertainty fusion and fractional-order temporal analysis for deepfake video detection. The key motivation is to jointly capture temporal uncertainty at complementary scales, including instantaneous dynamics, aggregated temporal correlations, and long-range dependency patterns, which has been shown to be more effective than single-paradigm approaches [46]. SDEQ-Net adopts a pre-trained ResNet3D-50 backbone for spatiotemporal feature extraction. While ResNet3D-50 effectively captures local spatiotemporal patterns through 3D convolutions, its temporal receptive field is inherently limited to short clip-level windows (typically 16–32 frames). To model long-range temporal dependencies and continuous-time dynamics across entire video sequences, SDEQ-Net incorporates LSTM-based recurrent processing within the CNSDFM. This hierarchical architecture separates local and global temporal modeling: the 3D CNN backbone extracts clip-level spatiotemporal representations, while the LSTM captures sequence-level temporal evolution and uncertainty propagation [47].
The proposed framework comprises three specialized modules organized in a coordinated pipeline, as illustrated in Figure 1: (1) CNSDFM receives the backbone features and outputs refined temporal representations F along with frame-level uncertainty estimates U; (2) QUAFM takes both F and U as inputs, constructing density matrices to perform uncertainty-aware feature fusion; and (3) FOTADM computes frame-wise anomaly scores A from the residuals between original and refined features. During training, the anomaly scores serve as dynamic weights in the loss function (Equation (17)). During inference, the fused features from QUAFM are concatenated with global features from the backbone and passed through fully connected layers for final classification.
3.1. Continuous-Time Neural Stochastic Differential Filter Module
Deepfake videos often exhibit subtle temporal inconsistencies that manifest as continuous, non-stationary dynamic variations during inter-frame propagation. To model the evolution of latent inter-frame states and their associated uncertainties, we propose the Continuous-time Neural Stochastic Differential Filter Module (CNSDFM). Unlike standard LSTM or RNN-based temporal models that operate on discrete time steps and lack principled uncertainty modeling, Neural SDEs enable continuous-time latent dynamics with mathematically grounded stochastic components [20]. CNSDFM adopts the Itô integral formulation compatible with the Euler–Maruyama discretization scheme [48], and integrates LSTM with an adaptive gain mechanism. This hybrid architecture leverages the complementary strengths of both paradigms rather than being a simple enhancement of either [49]. The overall workflow is illustrated in Figure 2.
Let the high-level semantic feature representation of the input video frame sequence be denoted by the tensor $X \in \mathbb{R}^{B \times D \times T}$, where $B$ is the batch size, $D$ is the feature dimension, and $T$ is the number of temporal steps. This tensor is generated by a pretrained visual encoder and serves as the observation input for the CNSDFM.
To capture long-range temporal dependencies, the tensor is transposed to shape $B \times T \times D$ and fed into an LSTM network, producing a sequence of hidden states $\{h_t\}_{t=1}^{T}$. These hidden states are subsequently mapped back to the feature space via a fully connected transformation, yielding the prior prediction $\hat{x}_t$ of the current latent state.
To model the continuous temporal evolution of the latent state, we introduce a Neural Stochastic Differential Equation (Neural SDE) defined as follows:

$dx_t = f(x_t, t)\,dt + g(x_t, t)\,dW_t, \quad (1)$

where $f(x_t, t)$ represents the drift function, $g(x_t, t)$ denotes the diffusion term, and $W_t$ is a standard Brownian motion under the Itô interpretation.
In practice, we approximate this continuous evolution using the Euler–Maruyama scheme in discrete time. Let the current time step be $t$ and the step size be $\Delta t$; the discretized state transition can then be approximated by:

$\hat{x}_{t+1} = x_t + f(x_t, t)\,\Delta t + g(x_t, t)\,\sqrt{\Delta t}\,\epsilon_t, \quad (2)$

where $\epsilon_t \sim \mathcal{N}(0, I)$ is a standard Gaussian noise term. This continuous-time modeling process effectively captures non-stationary phenomena such as stochastic jumps and nonlinear drifts that may occur in deepfake regions. As illustrated in Figure 2, the drift function employs tanh activation to produce bounded outputs, while the diffusion function uses ReLU to ensure non-negative coefficients, following standard practices in neural SDE implementations [48,49].
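The discretized update in Equation (2) can be sketched as follows. The `drift` and `diffusion` callables below are toy stand-ins for the learned networks (a tanh-bounded drift and a ReLU-constrained diffusion, as described above), not the trained modules:

```python
import numpy as np

def euler_maruyama_step(x, drift, diffusion, dt, rng):
    """One Euler-Maruyama step: x + f(x)*dt + g(x)*sqrt(dt)*eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x.shape)
    return x + drift(x) * dt + diffusion(x) * np.sqrt(dt) * eps

# Toy stand-ins for the learned drift/diffusion networks.
drift = lambda x: np.tanh(x)                     # bounded drift (tanh)
diffusion = lambda x: 0.1 * np.maximum(x, 0.0)   # non-negative diffusion (ReLU)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))                  # (batch, feature) latent state
for _ in range(8):                               # unroll 8 discrete time steps
    x = euler_maruyama_step(x, drift, diffusion, dt=0.1, rng=rng)
```

Smaller step sizes reduce the discretization error of the scheme at the cost of more unrolled steps per sequence.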
To address the uncertainty inherent in video frame observations, we further introduce an adaptive Kalman-inspired gain mechanism to dynamically balance the contributions from observation and prediction. Although the proposed update rule resembles the measurement update of a Kalman filter, it does not constitute a formal Kalman filter, as no explicit covariance propagation or Bayesian optimality assumptions are imposed. Instead, the design is motivated by the correction principle of Kalman filtering, where a residual between prediction and observation is used to adaptively refine the latent state. The observation residual is defined as:

$e_t = z_t - \hat{x}_t, \quad (3)$

where $z_t$ is the actual observation at time step $t$. The residual $e_t$ is passed through a two-layer fully connected network to estimate the adaptive fusion weight:

$K_t = \sigma\!\left(W_2\,\mathrm{ReLU}(W_1 e_t)\right), \quad (4)$

where $W_1$ and $W_2$ are learnable parameters, and $\sigma(\cdot)$ denotes the element-wise Sigmoid activation function. The final state update equation fuses the prior prediction from the SDE and the observation-guided correction term as follows:

$x_t = \hat{x}_t + K_t \odot e_t, \quad (5)$

where $\odot$ denotes element-wise multiplication. This formulation enables filtering updates from prior (predictive) to posterior (fused) latent states based on continuous-time evolution and observed evidence. To quantify the uncertainty in the latent state estimation at each time step, we further define the following metric:

$u_t = |e_t|. \quad (6)$

This uncertainty measure can be utilized in downstream decision modules or incorporated into loss function design to reflect the model's confidence in the current observation.
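The residual-gated update can be sketched as below. The weight shapes and the absolute-residual uncertainty proxy are illustrative assumptions standing in for the module's learned parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_update(x_prior, z_obs, W1, W2):
    """Kalman-inspired correction: residual -> learned gain -> fused state."""
    e = z_obs - x_prior                           # observation residual
    gain = sigmoid(np.maximum(e @ W1, 0.0) @ W2)  # two-layer FC gain in (0, 1)
    x_post = x_prior + gain * e                   # element-wise gated update
    u = np.abs(e)                                 # simple uncertainty proxy
    return x_post, gain, u

rng = np.random.default_rng(1)
D = 8
W1 = 0.1 * rng.standard_normal((D, D))
W2 = 0.1 * rng.standard_normal((D, D))
x_prior = rng.standard_normal((4, D))                  # prior from the SDE
z_obs = x_prior + 0.05 * rng.standard_normal((4, D))   # noisy observation
x_post, gain, u = gated_update(x_prior, z_obs, W1, W2)
```

Because the gain is produced by a Sigmoid, the fused state always lies on the segment between the prior prediction and the observation along each feature dimension.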
3.2. Quantum Uncertainty-Aware Fusion Module
Fusing smoothed features with their associated uncertainty information is critical for enhancing discriminative power in deepfake detection. Conventional methods typically rely on simple concatenation or attention-based fusion strategies, which struggle to capture the intrinsic correlation between these two types of information. To address this issue, we propose the Quantum Uncertainty-Aware Fusion Module (QUAFM), which introduces density matrices and von Neumann entropy from quantum information theory. Unlike classical covariance matrices that only capture second-order statistics and linear correlations, density matrices can encode both classical uncertainty and the intrinsic correlations among feature states in a unified framework [
50]. The von Neumann entropy operates on the eigenvalue spectrum of the density matrix and captures the correlations between feature states that classical entropy measures cannot capture [
21]. As depicted in
Figure 3, QUAFM receives the refined states xt (Equation (5)) as input features F and the uncertainty estimates (Equation (6)) as U from CNSDFM (
Figure 2). The overall workflow is illustrated in
Figure 3a, and the density matrix construction process is depicted in
Figure 3b.
Let the input consist of the preprocessed smoothed features denoted by $F \in \mathbb{R}^{B \times D \times T}$, and the corresponding uncertainty information be denoted by $U \in \mathbb{R}^{B \times D \times T}$. First, both types of features are independently normalized along the temporal dimension $T$, and their density matrix representations are constructed as follows:

$\rho_F = \frac{1}{T} \sum_{t=1}^{T} \tilde{f}_t \tilde{f}_t^{\top}, \qquad \rho_U = \frac{1}{T} \sum_{t=1}^{T} \tilde{u}_t \tilde{u}_t^{\top}, \quad (7)$

where $\tilde{f}_t$ and $\tilde{u}_t$ denote the normalized feature vectors at time step $t$. The density matrix can be interpreted as a statistical superposition of all temporal feature states, characterizing their spatial distribution structure. Notably, these density matrices are Hermitian by construction, since each outer product used in the summation satisfies this symmetry property. This Hermitian symmetry ensures that the eigenvalue spectrum consists entirely of real non-negative values, which is essential for the subsequent von Neumann entropy computation. To quantify the uncertainty associated with each representation, we introduce the von Neumann entropy, defined as:

$S(\rho) = -\mathrm{Tr}(\rho \log \rho), \quad (8)$

where $\rho$ is the density matrix and $\mathrm{Tr}(\cdot)$ denotes the trace operator. The Hermitian symmetry of the density matrix guarantees that all eigenvalues are real and non-negative with unit sum, which makes the von Neumann entropy well-defined. Unlike classical Shannon entropy, which only measures the uncertainty of individual random variables, the von Neumann entropy operates on the eigenvalue spectrum of the density matrix and thus captures the correlations and entanglement between feature states [21]. This property enables more accurate quantification of the uncertainty within temporal features, as it accounts for the statistical dependencies that classical entropy measures cannot capture. Next, we compute fusion weights based on the entropy values. Features with lower uncertainty are assigned higher weights to enhance their contribution in the fusion process. The fusion weights $w_F$ and $w_U$ for the two inputs are calculated as:

$w_F = \frac{\exp(-S(\rho_F))}{\exp(-S(\rho_F)) + \exp(-S(\rho_U))}, \qquad w_U = 1 - w_F. \quad (9)$

Finally, we perform temporal average pooling on both feature streams and project them into a unified feature space via a linear transformation. The final fused representation is obtained by the weighted combination of the two branches, as given by:

$Z = w_F \cdot W_F \bar{F} + w_U \cdot W_U \bar{U}, \quad (10)$

where $\bar{F}$ and $\bar{U}$ denote the temporally pooled features, and $W_F$ and $W_U$ are learnable projection matrices.
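A minimal sketch of the density-matrix and entropy computation, assuming unit-normalized temporal feature vectors; the softmax-style weighting over negative entropies is one plausible instantiation of the "lower uncertainty, higher weight" rule described above:

```python
import numpy as np

def density_matrix(feats):
    """Trace-one Hermitian matrix from (D, T) features: mean of outer products."""
    v = feats / (np.linalg.norm(feats, axis=0, keepdims=True) + 1e-8)
    rho = (v @ v.T) / feats.shape[1]
    return rho / np.trace(rho)

def von_neumann_entropy(rho):
    """S(rho) = -sum_i lam_i * log(lam_i) over the eigenvalue spectrum."""
    lam = np.clip(np.linalg.eigvalsh(rho), 1e-12, None)
    return float(-(lam * np.log(lam)).sum())

rng = np.random.default_rng(2)
F = rng.standard_normal((8, 16))   # (D, T) smoothed feature stream
U = rng.standard_normal((8, 16))   # matching uncertainty stream
s_F = von_neumann_entropy(density_matrix(F))
s_U = von_neumann_entropy(density_matrix(U))
w_F = np.exp(-s_F) / (np.exp(-s_F) + np.exp(-s_U))  # lower entropy -> higher weight
w_U = 1.0 - w_F
```

Using `eigvalsh` (rather than a general eigensolver) exploits the Hermitian structure and guarantees a real eigenvalue spectrum, which is exactly the property the entropy computation relies on.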
3.3. Fractional-Order Temporal Anomaly Detection Module
In deepfake video detection, manipulations typically affect only a small number of frames or localized regions, resulting in sparse and subtle forgery signals. Traditional supervised learning frameworks treat all frames with equal importance, making it difficult to model the temporal dynamics of key anomalous frames. To address this challenge, we propose the Fractional-Order Temporal Anomaly Detection Module (FOTADM), which generates frame-wise anomaly scores and uses them as loss weighting factors. FOTADM introduces fractional-order differential operators to capture long-range dependencies and multi-scale temporal variations through learnable fractional orders. The overall workflow is illustrated in Figure 4.
Let the original temporal feature sequence and its smoothed reference be denoted by $X, \hat{X} \in \mathbb{R}^{B \times C \times T}$, where $B$, $C$, and $T$ denote the batch size, number of channels, and number of temporal steps, respectively. The residual sequence is first computed frame-by-frame as:

$r_t = x_t - \hat{x}_t. \quad (11)$
To model more discriminative temporal anomaly responses, FOTADM introduces the Grünwald–Letnikov fractional-order differential operator, defined as:

$D^{\alpha} r_t = \sum_{k=0}^{L-1} (-1)^k \binom{\alpha}{k} r_{t-k}, \qquad \binom{\alpha}{k} = \frac{\Gamma(\alpha + 1)}{\Gamma(k + 1)\,\Gamma(\alpha - k + 1)}, \quad (12)$

where $\alpha$ is a learnable fractional order, $\Gamma(\cdot)$ denotes the Gamma function, and $L$ is the truncation length. Compared to integer-order derivatives, fractional-order operators exhibit non-local memory characteristics, enabling the modeling of long-range temporal dependencies. Physically, the Grünwald–Letnikov coefficients decay according to a power law with exponent determined by $\alpha$, where smaller $\alpha$ yields slower decay and thus a longer memory span [22].
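The truncated Grünwald–Letnikov difference can be sketched as below. Computing the signed binomial weights via the standard recurrence avoids evaluating the Gamma function directly; this is a common implementation trick assumed here, not a detail taken from the paper:

```python
import numpy as np

def gl_coefficients(alpha, L):
    """Signed GL weights w_k = (-1)^k * C(alpha, k) via the recurrence
    w_0 = 1, w_k = w_{k-1} * (k - 1 - alpha) / k (power-law decay for 0 < alpha < 1)."""
    w = np.empty(L)
    w[0] = 1.0
    for k in range(1, L):
        w[k] = w[k - 1] * (k - 1 - alpha) / k
    return w

def gl_derivative(x, alpha, L):
    """Truncated fractional difference along the last (time) axis."""
    w = gl_coefficients(alpha, L)
    y = np.zeros_like(x)
    for k in range(L):
        y[..., k:] += w[k] * x[..., : x.shape[-1] - k]
    return y
```

Two sanity checks follow directly from the definition: alpha = 0 reduces the operator to the identity, and alpha = 1 with L = 2 reduces it to the ordinary first-order difference.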
To enhance the representation of multi-scale temporal anomalies, FOTADM applies two parallel fractional orders $\alpha_1$ and $\alpha_2$, generating two distinct residual sequences, which are then fused:

$\tilde{r}_t = D^{\alpha_1} r_t + D^{\alpha_2} r_t. \quad (13)$

Subsequently, global average pooling is applied along the channel dimension to extract a single scalar response for each time step:

$s_t = \frac{1}{C} \sum_{c=1}^{C} \tilde{r}_{c,t}. \quad (14)$

This global response sequence is then passed through a two-layer perceptron (MLP) to produce normalized anomaly scores:

$A_t = \sigma\!\left(W_2\,\mathrm{ReLU}(W_1 s_t)\right). \quad (15)$

Here, $W_1$ and $W_2$ are learnable weight matrices, and $\sigma(\cdot)$ denotes the Sigmoid activation function. The final output $A_t$ represents the anomaly degree at each time step, where higher values indicate more significant dynamic variations in the residual space, potentially corresponding to forged regions.
The anomaly scores produced by FOTADM are not used directly as detection outputs. Instead, they are integrated into the loss function of the backbone network to adaptively weight training on anomalous frames. This design follows the weakly supervised learning paradigm commonly adopted in video anomaly detection, where only video-level labels are available during training while frame-level predictions are desired [51]. Critically, the frame-level anomaly scores function as soft attention weights rather than hard pseudo-labels, which avoids the self-confirmation bias typically associated with pseudo-label-based methods [52]. Let $\hat{Y}_t$ be the prediction of the backbone model at time step $t$ and $Y$ the ground-truth label. The standard cross-entropy loss is given by:

$\mathcal{L}_{\mathrm{CE}}(\hat{Y}_t, Y) = -\left[ Y \log \hat{Y}_t + (1 - Y) \log (1 - \hat{Y}_t) \right]. \quad (16)$

To incorporate anomaly-aware weighting, the loss is reformulated as:

$\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \left( 1 + \alpha A_t \right) \mathcal{L}_{\mathrm{CE}}(\hat{Y}_t, Y). \quad (17)$

Here, $\alpha > 0$ is a modulation factor that controls the influence of anomaly scores on the training loss. Frames assigned higher anomaly scores contribute more significantly to the overall loss, encouraging the model to focus on improving recognition accuracy at these time steps during optimization. Notably, the supervision signal is solely derived from the video-level ground-truth label $Y$, rather than the generated anomaly scores, which prevents the model from reinforcing its own predictions and ensures stable learning. During inference, video-level predictions are obtained by averaging the frame-level outputs [51].
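The anomaly-weighted objective amounts to a per-frame binary cross-entropy rescaled by the anomaly scores; a sketch under the assumption of sigmoid frame-level outputs (the example values are illustrative, not from the paper):

```python
import numpy as np

def anomaly_weighted_bce(y_hat, y, scores, alpha=1.0):
    """Per-frame BCE scaled by (1 + alpha * anomaly score), averaged over frames."""
    eps = 1e-7
    ce = -(y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps))
    return float(np.mean((1.0 + alpha * scores) * ce))

y_hat = np.array([0.9, 0.6, 0.2, 0.8])    # frame-level fake probabilities
y = 1.0                                   # video-level label (fake)
scores = np.array([0.1, 0.9, 0.9, 0.2])   # anomaly scores in [0, 1]
loss = anomaly_weighted_bce(y_hat, y, scores, alpha=2.0)
plain = anomaly_weighted_bce(y_hat, y, np.zeros(4), alpha=2.0)
```

Setting all scores (or the modulation factor) to zero recovers the unweighted cross-entropy, which is the uniform-weighting baseline examined in the ablation study.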
4. Experiments
This section presents the experimental setup and result analysis of the proposed method. We begin by detailing the experimental configuration, including the datasets used, preprocessing procedures, and implementation details. Subsequently, we conduct a comprehensive evaluation of the proposed algorithm through a series of experiments, including accuracy comparison, cross-dataset generalization, ablation studies, visualization analysis, and robustness assessment, thereby validating its effectiveness and robustness from multiple perspectives.
4.1. Datasets
To comprehensively evaluate the detection performance of the proposed method, experiments were conducted on three widely used deepfake benchmark datasets: FaceForensics++ [53], Celeb-DF [54], and DFDC [55]. These datasets contain large-scale collections of both real and manipulated videos or images, and are representative and challenging benchmarks commonly adopted in state-of-the-art deepfake detection research.
(1) FaceForensics++ (FF++): The FF++ dataset consists of 1000 real videos and 4000 fake videos generated using four manipulation techniques: DeepFake (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT), with 1000 videos per technique. The dataset is available in three different compression levels: uncompressed (c0), light compression (c23), and strong compression (c40). In this study, we adopt the c23 and c40 versions to assess the robustness of the model under varying compression conditions. For each forgery method, video samples are divided into training, validation, and testing sets in a ratio of 720:140:140. To mitigate the issue of class imbalance, 128 frames are uniformly sampled from each real video, and 32 frames are sampled from each fake video. All images are cropped to the facial region using the official face masks provided by the dataset, and resized to 224 × 224 pixels as input to the model.
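The uniform frame-sampling scheme described above (128 frames per real video, 32 per fake video) can be sketched as follows; the function and the clip length are illustrative of the protocol, not taken from a released implementation:

```python
import numpy as np

def uniform_frame_indices(n_total, n_sample):
    """Evenly spaced frame indices spanning the whole clip."""
    return np.linspace(0, n_total - 1, num=n_sample).round().astype(int)

# e.g., a 3000-frame clip: 128 samples for real videos, 32 for fakes
idx_real = uniform_frame_indices(3000, 128)
idx_fake = uniform_frame_indices(3000, 32)
```

Sampling fewer frames from fake videos counteracts the 4:1 fake-to-real video ratio in FF++, keeping the per-class frame counts roughly balanced.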
(2) Celeb-DF (CDF): The Celeb-DF dataset consists of 590 real videos and 5693 forged videos, involving subjects of diverse genders, ethnicities, and age groups. The fake videos are generated using an enhanced DeepFake algorithm, which significantly reduces common visual artifacts—such as unnatural boundaries and color inconsistencies—frequently found in earlier methods. This improvement leads to higher-quality forgeries and increases the difficulty of detection. In this study, we follow the official dataset partitioning protocol, extracting facial images from the designated videos. A total of 128 frames are sampled from each video, and face regions are detected and cropped using the Dlib library. All face images are then resized to a uniform resolution of 224 × 224 pixels for input into the model.
(3) Deepfake Detection Challenge (DFDC): Released by Facebook AI (now Meta AI) in 2019, the DFDC dataset is one of the largest and most challenging benchmarks for deepfake detection. It comprises 19,197 real videos recorded by 430 actors, along with approximately 100,000 fake videos generated using a variety of forgery techniques, including DFAE, MM/NN Face Swap, NTH, and FSGAN, among others. The dataset exhibits high diversity and complexity in both content and manipulation methods. Given the massive scale of the dataset, we selected the last 10 compressed archive files out of the 50 officially provided and constructed a subset consisting of 1000 real videos and 1000 fake videos. For each video, 64 frames were uniformly sampled, facial regions were extracted using Dlib, and all frames were resized to 224 × 224 pixels for training and evaluation purposes.
4.2. Implementation Details
The proposed network was implemented using the PyTorch 2.7.1 deep learning framework. A ResNet3D-50 backbone pre-trained on the Kinetics-400 dataset was adopted. All experiments were conducted on an NVIDIA RTX A6000 GPU. The model was trained for 20 epochs with a batch size of 32 using the Adam optimizer, where both the learning rate and weight decay were set to 1 × 10⁻⁴. A custom anomaly score-based loss function was employed to guide the optimization. Accuracy (ACC) and Area Under the ROC Curve (AUC) were used as evaluation metrics. ACC reflects the model's classification performance, while AUC provides a threshold-independent assessment of detection capability. To ensure fair comparison, video-level ACC and AUC were reported following the protocol in [11]. We strictly followed the official train/validation/test split without any overlap. In addition, no cross-video data augmentation was applied to avoid data leakage.
4.3. Accuracy Comparison
The proposed method was evaluated on two standard datasets, FF++ (c23) and FF++ (c40), and compared with several state-of-the-art deepfake detection approaches. The results, summarized in Table 1, demonstrate that the proposed method achieved an ACC of 99.12% and an AUC of 99.81% on FF++ (c23), and an ACC of 94.89% with an AUC of 97.91% on FF++ (c40). While the ACC ranks second among all compared methods, the AUC consistently outperforms existing approaches. These results indicate the robustness of the proposed model in maintaining high detection accuracy under varying compression levels.
4.4. Generalization Experiments
Two sets of experiments were conducted to evaluate the generalization capability of the proposed method. In the first set of experiments, the model was trained on the FF++ (c23) dataset and tested on two unseen datasets, Celeb-DF and DFDC, following a cross-dataset evaluation protocol. Several state-of-the-art deepfake detection methods were included for comparison. The Area Under the ROC Curve (AUC) was used as the evaluation metric, and the results are summarized in Table 2. On the Celeb-DF dataset, the proposed method achieved an AUC of 89.55%, outperforming all competing methods. On the DFDC dataset, it obtained an AUC of 86.21%, also achieving the best performance among the compared approaches.
In the second set of experiments, a leave-one-subset-out strategy was applied. Each of the four subsets in FF++ (c23)—DF, F2F, FS, and NT—was used as the training set, while the remaining three subsets served as the test sets. This procedure was repeated four times, and the AUC was reported as the evaluation metric. The results are shown in Table 3. When DF was used for training, the average AUC across the other three subsets reached 90.21%, the best performance among all methods. When trained on F2F, the average AUC was 92.61%, again outperforming all baseline models. Using FS as the training set yielded an average AUC of 92.00%, while training on NT achieved an average AUC of 91.18%, the best result in this setting as well. To provide a more intuitive understanding of the contribution of each module, we visualize the ablation results in Figure 5.
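The leave-one-subset-out procedure can be sketched as a simple loop; `train_model` and `evaluate_auc` below are hypothetical placeholders standing in for the actual training and evaluation code:

```python
# Sketch of the leave-one-subset-out protocol over the four FF++ (c23) subsets.
SUBSETS = ["DF", "F2F", "FS", "NT"]

def cross_manipulation_eval(train_model, evaluate_auc):
    """Train on each subset in turn; report the mean AUC over the other three."""
    results = {}
    for train_subset in SUBSETS:
        model = train_model(train_subset)
        test_subsets = [s for s in SUBSETS if s != train_subset]
        aucs = [evaluate_auc(model, s) for s in test_subsets]
        results[train_subset] = sum(aucs) / len(aucs)
    return results
```

Each entry of the returned dictionary corresponds to one row of Table 3 (the per-training-subset average AUC).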
These results consistently demonstrate that the proposed deepfake face detection method exhibits superior generalization ability compared to existing approaches when applied to previously unseen or cross-dataset fake video scenarios.
4.5. Ablation Study
To evaluate the effectiveness of the three proposed modules in the deepfake face detection framework, an ablation study was conducted. The model was trained on the FF++ (c23) dataset and tested separately on its four subsets (DF, F2F, FS, and NT), with the Area Under the ROC Curve (AUC) as the evaluation metric. The results are summarized in Table 4. Compared to the baseline model, incorporating the Continuous-time Neural Stochastic Differential Filtering Module (CNSDFM) led to a 4.87% improvement in AUC on FF++ (c23). Adding the Quantum Uncertainty-Aware Fusion Module (QUAFM) on top of CNSDFM improved the AUC by a further 1.23%, and introducing the Fractional-order Temporal Anomaly Detection Module (FOTADM) contributed an additional 1.25%. Overall, the progressive integration of these three modules significantly enhanced the model’s performance across all subsets: the AUC increased by 9.42% on DF, 8.46% on F2F, 6.37% on FS, and 7.47% on NT, for an overall improvement of 7.35% on the entire FF++ (c23) dataset (matching the sum of the three per-module gains). These results demonstrate the effectiveness and complementary contributions of CNSDFM, QUAFM, and FOTADM in boosting detection performance on both the full dataset and its individual subsets.
To further investigate the effectiveness of the fractional-order differential operator in FOTADM, we conducted additional experiments comparing different temporal weighting strategies. As shown in Table 5, four approaches were examined: (1) uniform weighting, which treats all frames equally (α = 0 in Equation (17)); (2) L1-residual weighting, which uses absolute residual magnitudes directly; (3) integer-order weighting, which employs the standard first-order temporal difference (α = 1); and (4) FOTADM with learnable fractional orders.
The results demonstrate that anomaly-based weighting consistently outperforms uniform weighting, and the proposed FOTADM with learnable fractional orders surpasses the integer-order variant by 0.58%. Notably, the learned orders converged to α1 = 0.73 and α2 = 1.42. According to the power-law decay property of Grünwald–Letnikov coefficients, α1 = 0.73 corresponds to slower coefficient decay and thus longer memory span, making the operator more sensitive to accumulated temporal inconsistencies. Conversely, α2 = 1.42 exhibits faster decay, emphasizing recent frame differences and capturing local discontinuities. The combination of these two complementary fractional orders enables FOTADM to simultaneously detect both long-range temporal drift and short-term abrupt artifacts in deepfake videos.
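The decay behavior of the two learned orders can be inspected directly from the Grünwald–Letnikov coefficients, which satisfy the recurrence w_0 = 1, w_k = w_{k-1} · (k − α − 1)/k. A short sketch (independent of the model code):

```python
def gl_coeffs(alpha, n):
    """First n Grünwald-Letnikov coefficients w_k = (-1)^k * binom(alpha, k),
    computed with the stable recurrence w_k = w_{k-1} * (k - alpha - 1) / k."""
    w = [1.0]
    for k in range(1, n):
        w.append(w[-1] * (k - alpha - 1) / k)
    return w

# |w_k| decays like k^(-(1 + alpha)): the smaller learned order (0.73) keeps
# larger weights at long lags (long memory); the larger order (1.42) does not.
w_long, w_short = gl_coeffs(0.73, 25), gl_coeffs(1.42, 25)
assert abs(w_long[20]) > abs(w_short[20])
```

Note that α = 1 reproduces the first-order difference kernel [1, −1, 0, …] of the integer-order baseline in Table 5, which makes the comparison between strategies (3) and (4) a direct test of the fractional generalization.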
4.6. Grad-CAM Visualization
To further validate the effectiveness of the proposed detection method, Grad-CAM was employed to visualize the attention distribution across different convolutional layers.
Figure 6 presents the visualization results from Layer1 to Layer4, clearly illustrating the evolution of the model’s focus as the network depth increases. At the shallow layer (Layer1), the model primarily responds to global facial structures and low-level texture features, with dispersed activation regions mostly located around the eyes and forehead. As the network deepens, the attention in Layer2 and Layer3 gradually shifts towards more representative local facial regions, such as the eye contour, nasal bridge, and mouth corners—regions that are inherently prone to forgery artifacts due to the technical limitations of deepfake generation pipelines. Specifically, the eye region frequently exhibits artifacts such as irregular pupil shapes, inconsistent corneal reflections between the two eyes, and unnatural blinking patterns, as generative models struggle to reproduce the physical and physiological constraints of human eyes [72]. The mouth and nasal regions are particularly susceptible to blending boundary artifacts and texture inconsistencies, since most face-swapping and face-reenactment methods require warping and blending the synthesized facial content onto the target frame, inevitably introducing visible seams and color mismatches around these semantically complex areas [73,74]. At the deepest level (Layer4), attention becomes highly concentrated around the nasal alae and mouth corners, indicating that the model is capable of accurately localizing the most discriminative forged regions in high-level semantic space.
Figure 7 shows Grad-CAM results for four types of forgeries (DF, F2F, FS, and NT) from the FF++ (c23) dataset. It can be observed that the proposed method accurately identifies forged regions across different manipulation types, with strong activation in key facial features. This consistent focus on the eye, nose, and mouth regions across diverse manipulation techniques aligns with findings from recent forensic studies, which demonstrate that these facial areas are the most manipulation-prone regions where geometric asymmetries, lighting irregularities, and up-sampling artifacts from the generator’s decoder are most pronounced [75]. This suggests that the method not only achieves high classification accuracy but also exhibits good interpretability and region-aware capability. These findings further confirm the effectiveness and feasibility of the proposed method in the task of deepfake detection.
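The Grad-CAM maps above follow the standard recipe: channel weights are the global-average-pooled gradients of the class score, and the map is the ReLU of the weighted activation sum. A minimal NumPy sketch over a single layer's captured tensors (the hook machinery for capturing them is framework-specific and omitted):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.
    activations, gradients: arrays of shape (C, H, W) -- the layer's forward
    activations and the gradient of the class score w.r.t. them."""
    weights = gradients.mean(axis=(1, 2))             # (C,) channel importance
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum -> (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence only
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1] for overlay
    return cam
```

The resulting map is typically upsampled to the input resolution and overlaid on the face crop, which is how visualizations like Figures 6 and 7 are produced.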
4.7. Robustness Evaluation
To assess the robustness of the proposed method under various visual degradation conditions, the model was trained on the FF++ (c23) dataset and systematically evaluated under seven types of degradation transformations: saturation variation, contrast variation, block occlusion, Gaussian noise, Gaussian blur, pixelation, and JPEG compression. Each degradation type was applied at five severity levels to simulate a spectrum of distortions ranging from mild to severe. The proposed method was compared against three representative deepfake detection models: F3-Net (frequency domain-based), Xception, and MADD (both spatial domain-based). The results are illustrated in Figure 8. Under color perturbations—namely saturation and contrast variations—the proposed method consistently achieved high AUC scores (97.86–99.02%), significantly outperforming the baselines. In block occlusion scenarios (with the number of occlusion blocks increasing from 10 to 50), our method exhibited the smallest performance degradation, maintaining strong stability. For Gaussian noise (with standard deviation σ ranging from 2 to 10), the proposed method achieved an AUC of 72.06% under the highest noise level, notably superior to F3-Net (55.05%) and Xception (65.37%).
Under Gaussian blur, with kernel sizes increasing from 1 to 9, the AUC of the proposed method declined from 98.69% to 80.23%, but still exceeded MADD (74.55%) and F3-Net (71.82%). Regarding pixelation (with block sizes from 2 to 6), the proposed method demonstrated greater adaptability, maintaining an AUC of 77.96% even at the most severe level. In the scenario of heavy JPEG compression (quality factor reduced from 50 to 10), the AUC remained at 61.35%, indicating stronger robustness compared to the competing approaches.
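Of the seven degradations, pixelation is the simplest to reproduce exactly: each block of pixels is replaced by its mean. A small NumPy sketch for a grayscale image (block sizes 2–6 match the severity range used above):

```python
import numpy as np

def pixelate(img, block):
    """Pixelate a grayscale image by replacing each block x block tile with its mean."""
    out = img.astype(float).copy()
    h, w = img.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            out[y:y + block, x:x + block] = img[y:y + block, x:x + block].mean()
    return out
```

The other degradations (Gaussian noise/blur, JPEG re-encoding, occlusion, color shifts) are similarly standard image operations available in libraries such as OpenCV or Pillow.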
Overall, the results confirm that the proposed deepfake detection method exhibits superior stability under various complex visual degradation conditions, highlighting its strong potential for real-world video deepfake detection applications.
4.8. Model Complexity and Inference Efficiency Analysis
To comprehensively evaluate the deployment potential of different types of deepfake detection models in real-world scenarios, we systematically compare representative 2D CNNs, 3D CNNs, Transformer-based architectures, and the proposed SDEQ-Net across two critical dimensions: model complexity (measured by parameter count and FLOPs) and inference efficiency (measured by frames per second, FPS, and average inference time per video clip, ms/clip). All experiments were conducted on the same hardware platform (NVIDIA RTX A6000 GPU), using standardized input clips consisting of 16 frames with a spatial resolution of 224 × 224 pixels to ensure fair and reproducible comparisons.
Table 6 presents the comparison results in terms of parameter count, FLOPs, average inference time, and FPS under FP32 precision. R3D-50 demonstrated the highest efficiency, achieving an inference speed of 567.17 FPS with an average inference time of only 28.21 ms per clip. While the proposed SDEQ-Net exhibits slightly higher complexity (60.16 GFLOPs and 47.83 M parameters), it maintains competitive inference performance, reaching 280.94 FPS and 56.95 ms/clip—significantly outperforming more complex models such as Swin3D. Regarding real-time applicability, standard video streams typically operate at 24–30 FPS, while high-frame-rate content may reach 60 FPS. The achieved inference speed of 280.94 FPS is approximately 9× higher than the 30 FPS broadcast standard, indicating that SDEQ-Net is well-suited for real-time deepfake detection on server-grade GPUs. The end-to-end latency of 56.95 ms per 16-frame clip remains within acceptable bounds for online monitoring applications such as video conferencing platforms and social media content moderation. For edge deployment or latency-critical scenarios, optimization techniques including INT8 quantization and TensorRT acceleration can be employed to further reduce computational overhead.
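The throughput figures follow directly from per-clip latency and the 16-frame clip length via FPS = frames_per_clip / latency_in_seconds; a quick conversion check against the reported numbers:

```python
def clip_latency_to_fps(ms_per_clip, frames_per_clip=16):
    """Convert average per-clip latency (ms) into throughput in frames per second."""
    return frames_per_clip / (ms_per_clip / 1000.0)

# Reproduces the figures reported in Table 6 up to rounding:
print(round(clip_latency_to_fps(56.95), 1))  # SDEQ-Net: ~280.9 FPS
print(round(clip_latency_to_fps(28.21), 1))  # R3D-50:  ~567.2 FPS
```

This also makes the real-time margin explicit: at 56.95 ms per 16-frame clip, the model processes roughly 9× more frames per second than a 30 FPS broadcast stream delivers.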
To analyze the computational overhead introduced by each proposed module, we report the cumulative cost as each component is progressively added, as shown in Table 7. CNSDFM introduces the largest overhead (18.2 GFLOPs, 12.3 ms) due to LSTM encoding and Euler–Maruyama discretization. QUAFM adds 1.4 GFLOPs and 8.6 ms for density matrix construction and entropy computation, while FOTADM contributes only 0.4 GFLOPs and 7.8 ms for fractional-order differentiation. Overall, the three modules collectively add 20.0 GFLOPs and 28.7 ms to the backbone, enabling SDEQ-Net to strike a favorable balance among accuracy, complexity, and latency.
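For intuition on the entropy computation that dominates QUAFM's per-clip cost: the von Neumann entropy of a trace-one Hermitian density matrix reduces to the Shannon entropy of its eigenvalues. The sketch below is illustrative only; the specific density-matrix construction (here, hypothetically, a trace-normalized Gram matrix of per-frame features) is an assumption, not the paper's exact formulation:

```python
import numpy as np

def von_neumann_entropy(features):
    """Entropy -tr(rho log rho) of a density matrix rho built (hypothetically)
    as the trace-normalized Gram matrix of a (frames x dim) feature matrix."""
    gram = features @ features.T
    rho = gram / np.trace(gram)    # Hermitian, positive semidefinite, trace one
    eig = np.linalg.eigvalsh(rho)  # real eigenvalues (Hermitian input)
    eig = eig[eig > 1e-12]         # drop numerical zeros before the log
    return float(-(eig * np.log(eig)).sum())
```

The eigendecomposition is the expensive step, which is consistent with QUAFM's entropy computation appearing as a measurable latency term in Table 7.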
4.9. Sensitivity Analysis
4.9.1. Backbone Architecture
The proposed CNSDFM, QUAFM, and FOTADM modules operate on intermediate feature representations, enabling integration with various backbone architectures. To evaluate this flexibility, we conducted experiments with four representative backbones: ResNet3D-50, EfficientNet-B4, Xception, and ViT-B. All models were trained on FF++ (c23) and evaluated on both intra-dataset and cross-dataset benchmarks. As shown in Table 8, the proposed modules yield consistent performance improvements across all backbone architectures. With EfficientNet-B4, the model achieves 98.92% AUC on FF++ (c23) and 87.43% on Celeb-DF. The Xception-based variant obtains 98.45% and 86.21% on the respective datasets. ViT-B, representing transformer-based architectures, achieves 99.15% on FF++ (c23) and 88.12% on Celeb-DF. ResNet3D-50 achieves the highest overall performance, with 99.81% and 89.55% on FF++ (c23) and Celeb-DF, respectively, which motivates its selection as the default backbone in our framework.
4.9.2. Training Data Ratio
We investigated model performance under varying amounts of training data. The models were trained using 100%, 75%, 50%, and 25% of the FF++ (c23) training set and evaluated on both FF++ (c23) and Celeb-DF, with Xception and EfficientNet-B4 as baselines. As shown in Table 9, the proposed method achieves 98.76% AUC on FF++ (c23) with 50% of the training data, compared to 96.45% for Xception and 97.23% for EfficientNet-B4. At 25% training data, the proposed method maintains 96.82% AUC, outperforming the two baselines by 3.70% and 2.26%, respectively. In the cross-dataset evaluation, the proposed method degrades more slowly as training data decreases: with only 25% of the training data, it achieves 81.24% AUC on Celeb-DF, while Xception and EfficientNet-B4 obtain 62.45% and 57.89%, respectively.
4.9.3. Regularization Strategy
We analyzed the effects of weight decay and dropout on model generalization. As shown in Table 10, without regularization the model achieves 99.56% AUC on FF++ (c23) but only 76.82% on Celeb-DF, a significant gap of 22.74%. Applying weight decay (1 × 10⁻⁴) reduces the FF++ (c23) performance slightly to 99.48% while improving cross-dataset generalization to 83.67%. Adding dropout (0.3) alone yields 99.35% on FF++ (c23) and 85.41% on Celeb-DF. The combination of weight decay (1 × 10⁻⁴) and dropout (0.3) achieves the optimal balance, with 99.81% on FF++ (c23) and 89.55% on Celeb-DF. Increasing the weight decay to 1 × 10⁻³ leads to performance degradation on both datasets, suggesting that excessive regularization impairs the model’s representational capacity.
5. Conclusions
In this paper, we propose SDEQ-Net, a novel deepfake video anomaly detection framework that integrates stochastic differential equation modeling, quantum uncertainty-aware fusion, and fractional-order temporal analysis. The proposed method addresses the challenges of continuous temporal dynamics modeling and fine-grained anomaly localization through three complementary modules: CNSDFM for continuous-time state evolution, QUAFM for uncertainty-aware feature fusion based on Hermitian-symmetric density matrices, and FOTADM for frame-level anomaly scoring. Extensive experiments on FaceForensics++, Celeb-DF, and DFDC demonstrate that SDEQ-Net consistently outperforms state-of-the-art methods in detection accuracy, cross-dataset generalization, and robustness. Future work will focus on improving computational efficiency and extending the framework to multimodal deepfake detection scenarios.