Article

STA-3D: Combining Spatiotemporal Attention and 3D Convolutional Networks for Robust Deepfake Detection

Laboratory for Big Data and Decision, National University of Defense Technology, Changsha 410073, China
* Authors to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1037; https://doi.org/10.3390/sym17071037
Submission received: 14 April 2025 / Revised: 17 June 2025 / Accepted: 23 June 2025 / Published: 1 July 2025

Abstract

Recent advancements in deep learning have driven the rapid proliferation of deepfake generation techniques, raising substantial concerns over digital security and trustworthiness. Most current detection methods primarily focus on spatial or frequency domain features but show limited effectiveness when dealing with compressed videos and cross-dataset scenarios. Observing that mainstream generation methods use frame-by-frame synthesis without adequate temporal consistency constraints, we introduce the Spatiotemporal Attention 3D Network (STA-3D), a novel framework that combines a lightweight spatiotemporal attention module with a 3D convolutional architecture to improve detection robustness. The proposed attention module adopts a symmetric multi-branch architecture, where each branch follows a nearly identical processing pipeline to separately model temporal-channel, temporal-spatial, and intra-spatial correlations. Our framework additionally implements Spatial Pyramid Pooling (SPP) layers along the temporal axis, enabling adaptive modeling regardless of input video length. Furthermore, we mitigate the inherent asymmetry in the quantity of authentic and forged samples by replacing standard cross entropy with focal loss for training. This integration facilitates the simultaneous exploitation of inter-frame temporal discontinuities and intra-frame spatial artifacts, achieving competitive performance across various benchmark datasets under different compression conditions: for the intra-dataset setting on FF++, it improves the average accuracy by 1.09 percentage points compared to existing SOTA, with a more significant gain of 1.63 percentage points under the most challenging C40 compression level (particularly for NeuralTextures, achieving an improvement of 4.05 percentage points); while for the cross-dataset setting, AUC is improved by 0.24 percentage points on the DFDC-P dataset.

1. Introduction

With the rapid advancement of image generation and manipulation technologies, face forgery techniques have proliferated, allowing the general public to easily and quickly create forged media content. The term Deepfake refers to images, videos, and other media, typically involving human faces, that are synthesized or manipulated using deep learning-based tools such as ZAO and FaceApp. Such content is often indistinguishable from authentic media, enabling applications in entertainment and privacy protection. However, the malicious abuse of these techniques poses significant threats to online security and digital trust.
Deepfake technologies typically encompass four distinct types: entire face synthesis, attribute manipulation, identity swapping, and expression manipulation (also known as face reenactment). Of these categories, entire face synthesis and attribute manipulation involve only a single face and generally present lower risks of malicious exploitation. This study concentrates on the more concerning categories—identity swapping and expression manipulation—which are commonly created using deep learning models such as Variational Autoencoders (VAEs) [1] and Generative Adversarial Networks (GANs) [2]. Identity swapping involves transferring identity-related features from a source face onto a target face while preserving the head pose, expressions, background, and lighting conditions of the latter. Expression manipulation modifies the target subject’s facial expressions to match those of the source subject while retaining the target’s original identity unchanged. For both concerning deepfake types, the deepfake video generation process comprises several sequential steps: frame extraction from source videos, facial region segmentation with feature extraction, facial region generation, and full-frame synthesis, as illustrated in Figure 1. Both techniques can produce media content that unlawfully exploits an individual’s portrait rights without their knowledge, leading to significant security concerns, such as telecommunication fraud, defamation of national leaders, creation of fake pornographic content, and the deception of facial recognition systems, thereby posing threats to social security.
Deep learning-based detection techniques have been developed to counteract Deepfake threats. Many current detection approaches are frame-based, analyzing single images or individual video frames. These methods often use image classification techniques to extract facial regions from input samples, followed by Convolutional Neural Networks (CNNs) for feature extraction and classification. For example, MesoNet [3] is a shallow network designed to capture mesoscopic-level features. Kumar et al. [4] proposed a method using five ResNet-18 networks, with one branch processing the entire facial image and four branches analyzing uniformly divided image patches independently. Rossler et al. [5] showed that the Xception network performs well in detection tasks. Li et al. [6] trained CNN-based models to identify artifacts from affine transformations. However, the effectiveness of these models can be compromised by post-processing operations, such as image and video compression, which degrade forensic artifacts in the spatial domain. To address this limitation, some methods have incorporated the extraction of frequency domain features. For instance, F3-Net [7] implements a dual-branch architecture utilizing Discrete Cosine Transform (DCT), enabling simultaneous analysis of subtle forgery patterns and high-level semantic features in image patches. Given that common forgery methods typically manipulate only facial regions, detection approaches leverage spatial inconsistencies between modified and unmodified areas. Face X-Ray [8] pioneered this approach by developing a method to detect the boundaries of forged facial regions. LAA-Net [9] establishes fused artifact-driven ‘vulnerable points’, formulating their regression and fusion boundaries as auxiliary tasks that extend Face X-Ray [8].
Despite advances in deepfake technology, contemporary generation methods predominantly operate on a frame-by-frame basis, as Figure 1 depicts, introducing temporal inconsistencies that manifest across video sequences, forming a symmetry with intra-frame inconsistencies. In contrast, video-based detection methods leverage temporal coherence by analyzing sequential or densely sampled frames, thereby exploiting the inherent temporal relationships present in video data. Additionally, video-level inputs inherently provide data augmentation [10]. Early research [11,12,13] employed modules such as CNNs to independently extract multi-frame embeddings, followed by Recurrent Neural Networks (RNNs) and classification modules to capture temporal features. These methods that extract temporal features from high-level spatial features often result in significant loss of fine-grained temporal features during feature migration and struggle to detect subtle spatiotemporal inconsistencies. Some studies utilized optical flow as an auxiliary feature to describe dynamic changes between frames [14,15], but environmental factors such as camera movement, viewpoint changes, and illumination variations can substantially impact the optical flow field. Various approaches leverage interdisciplinary research to identify spatial and temporal anomalies by analyzing physiological features that forgery generation methods often fail to accurately replicate. LipForensics [16] proposes a two-stage detection framework that first extracts lip-reading features and then performs forgery classification. Wu et al. [17] enhanced deepfake detection by analyzing remote Photoplethysmography (rPPG) signals [18]—physiological indicators obtained through facial light absorption patterns—by segmenting videos into temporal clips, extracting multi-scale features from local facial regions, and employing Vision Transformers (ViT) [19] to capture long-range temporal dependencies. DFGaze [20] demonstrates that authentic videos maintain consistent gaze directions over short intervals while deepfake videos exhibit irregular gaze patterns, achieving superior performance by integrating gaze direction variance with attribute and texture-based features. However, these physiological feature-based approaches, particularly those analyzing lip movements during speech, typically require extended temporal sequences for effective analysis. Some studies reduce the emphasis on spatial feature extraction and focus primarily on capturing temporal inconsistencies. FTCN [21] prioritizes temporal inconsistency detection by deliberately constraining spatial feature extraction through 1 × 1 convolutional kernels, followed by a ViT architecture for deepfake classification. Some approaches transform spatiotemporal features into structures compatible with conventional vision backbones through feature transformation methods. TALL-Swin [22] introduces a thumbnail layout that converts consecutive video frames into predefined configurations to preserve their temporal and spatial consistency features. ISTVT [23] extracts features from consecutive frames using CNN backbones, constructs patch embeddings with classification tokens, and incorporates spatial and temporal class tokens at respective spatial and temporal positions before feeding them into a Transformer architecture that decomposes spatiotemporal self-attention into independent temporal and spatial modules. 
Three-dimensional Convolutional Neural Networks (3D-CNNs) provide integrated extraction of spatiotemporal features, obviating the need for separate temporal sequence representation derived from pre-extracted spatial features of multiple frames or dependence on feature sets derived from auxiliary visual detection tasks. 3D-CNNs have been widely applied in video action recognition [24] and volumetric image analysis [25]. The effectiveness of pre-trained 3D-CNNs for deepfake detection was first demonstrated by [26]. Following this work, several studies proposed various improvements to the basic 3D-CNN architecture. Liu et al. [27] enhanced the network by incorporating SRM features as input. Lu et al. [28] and Ma et al. [29] integrated CBAM-like [30] and lightweight spatiotemporal attention modules, respectively. More recently, Wang et al. proposed AltFreezing [31], which introduced an alternate training strategy for temporal and spatial correlation components, along with task-specific video data augmentation methods.
Based on the above analysis, we selected 3D-CNN architecture as our foundational architecture due to its intrinsic capability for joint spatial–temporal feature extraction. To address the computational intensity of training 3D-CNNs from scratch while preserving their representational power, we initialize our model with a pre-trained S3D [32]. In addition, inspired by Triplet Attention [33], we design and incorporate a new lightweight attention module. This module innovatively extends conventional spatial/channel-wise attention mechanisms by simultaneously modeling four-dimensional interaction: spatial (intra-frame height–width structural relationships), temporal (inter-frame time–height/time–width motion trajectories), and cross-channel (channel–time dynamic feature evolution), thereby establishing comprehensive spatiotemporal–channel interdependencies within the 3D-CNN backbone. The main contributions of this paper are listed below:
  • We propose a novel framework named the Spatiotemporal Attention 3D Network (STA-3D), which integrates a lightweight spatiotemporal attention module with a 3D-CNN backbone for enhanced deepfake detection.
  • We initialize our 3D-CNN backbone using weights pre-trained on action recognition tasks, reducing computational overhead through transfer learning. The design incorporates a plug-and-play multi-branch attention layer for analyzing correlations across temporal and spatial dimensions, a spatial pyramid pooling layer supporting variable-length frame sequences, and focal loss [34] to address class asymmetry issues during training.
  • We conduct comprehensive experiments evaluating intra-dataset performance, cross-dataset generalization, and comparisons with multiple baselines. Our method demonstrates consistent performance across various testing scenarios.

2. Related Works

2.1. Deepfake Generation

The development of face deepfakes has evolved significantly over time. Early approaches relied primarily on Computer Graphics (CG) techniques, whereas modern methods leverage advanced deep learning architectures, particularly Variational Autoencoders (VAE) [1] and Generative Adversarial Networks (GANs) [2]. These deepfake generation methods typically specialize in one of two main tasks: identity swapping or expression manipulation, which will be analyzed systematically in the following discussions.
Face Identity Swap. Early approaches like FaceSwap [35] used lightweight 3D models for CPU-efficient processing but sacrificed visual quality. Fast Face-Swap [36] reformulates the identity swapping task as a specialized application of style transfer. Similarly, DeepFake [37] employs an architecture comprising one encoder and two decoders. Beyond conventional encoder–decoder architectures, researchers have successfully incorporated adversarial training strategies to enhance face swapping capabilities. A notable example is FaceSwap-GAN [38], which adapts the CycleGAN [39] framework to achieve improved identity transfer results. Research in face swapping has introduced several subject-agnostic methods to address the challenge of identity preservation. IPGAN [40] uses a dual-stream architecture to decompose facial images into identity and attribute features, employing multiple loss functions to ensure identity consistency in face swapping. FaceShifter [41] implements a two-stage framework where the initial face synthesis is followed by a specialized heuristic network that refines occluded regions to ensure robust performance under challenging conditions. MegaFS [42] achieves unprecedented megapixel-resolution outputs in face swapping applications by leveraging the pretrained StyleGAN2 [43] latent space. Smooth-Swap [44] replaces complex hand-designed networks and loss functions with an identity encoder, which projects the source face into a smooth feature space with well-defined gradient directions, eliminating the need for self-supervised contrastive learning. FaceDancer [45] proposes an adaptive attention mechanism that dynamically responds to diverse facial features, thereby maintaining both high-fidelity output and precise identity preservation during image generation.
Face Expression Swap. Face2Face [46] incorporates CG-driven 3D models for full facial control. Neural Textures [47] maps visual features to 3D textures while focusing adjustments primarily on mouth regions for facial reenactment. ICFace [48] implements a two-stage GAN architecture where the first stage extracts target face identity features and the second generates reenacted facial expressions. LIA [49] adopts a self-supervised autoencoder architecture to produce expression swap results by inferring transformation paths in the latent space, avoiding the need for explicit structural representations. Similarly, DPE [50] separates facial motion into pose and expression to represent global and local features, enabling selective modification of specific parts and mitigating inconsistencies in face shape during fusion with the background.
This paper focuses on detecting the two aforementioned types of forgeries. It is observed that expression manipulation introduces subtle alterations to the original video, and most methods generate forged videos on a frame-by-frame basis.

2.2. Plug-and-Play Attention Mechanisms

The human visual system demonstrates remarkable efficiency in identifying salient regions within complex visual scenes. The attention mechanism in computer vision simulates this biological selective attention process. This computational approach implements a content-driven weight adjustment mechanism that generates sample-specific activations [51], enabling its successful application across diverse computer vision tasks. Attention mechanisms in computer vision tasks primarily consist of two components: spatial attention, which emphasizes critical regions within the spatial domain, and channel attention, which selectively enhances semantic features and object-specific information across channels. Squeeze-and-Excitation Networks (SENet) [52] introduced a channel attention mechanism that exploits spatial global context by applying global average pooling, followed by two fully connected layers, to compute channel-wise attention weights. Non-local networks [53] extended the attention framework by introducing a spatial self-attention mechanism that captures long-range dependencies through pairwise pixel/patch correlations. Building on these advancements, BAM [54] proposed a dual-branch architecture to compute channel and spatial attention simultaneously, while CBAM [30] further optimized this design by sequentially applying channel and spatial attention modules. The Spatial Pooling Network (SPN) [55] employs dual-branch one-dimensional convolutions along rows and columns to simultaneously process global and local spatial features. GCNet [56] and Efficient Attention [57] enhanced the computational efficiency of non-local networks. Triplet attention [33] innovatively addresses cross-dimensional correlations by jointly modeling channel and spatial relationships through a three-branch architecture; each branch employs a combination of 1D pooling and 2D convolution operations: one branch processes spatial-wise features, while two branches handle channel-spatial dimensions (spatial dimensions H and W, respectively), collectively generating attention weights that are subsequently averaged across channels. Some methods focus on multi-scale feature extraction, using channel grouping to reduce computational overhead and avoid feature loss due to channel compression: EMA [58] introduces a cross-spatial learning strategy that combines CA [59] with local branches; SCSA [60] emphasizes the synergistic effect of spatial and channel attention to guide key feature extraction and mitigate disparities in multi-semantic features. The aforementioned methods, originally designed for 2D-CNNs (with the exception of non-local networks [53]), can generally be extended to 3D-CNNs by treating the additional temporal dimension as a spatial dimension. This paper employs a 3D-CNN framework and proposes an adapted version of triplet attention specifically redesigned for 3D-CNNs, capitalizing on its cross-dimensional interaction and scalability.

2.3. 3D-CNN and Action Recognition

Three-dimensional convolutional layers incorporate temporal information alongside spatial dimensions, unlike their 2D counterparts, making them particularly effective for video analysis tasks such as action recognition. Ji et al. [61] first introduced 3D-CNN layers for video action recognition. Tran et al. [62] developed a fully 3D network by extending the concept to include 3D pooling layers. Varol et al. [63] achieved higher temporal resolution by reducing spatial dimensions while maintaining computational complexity. Previous methods employed relatively shallow networks yet still incurred high computational costs. To address this limitation, subsequent research has focused on computational efficiency through various architectural innovations. One approach involves adapting established 2D-CNN architectures to create deeper networks, which has demonstrated enhanced representational capabilities on large-scale datasets. P3D [64] and R(2+1)D [65] decompose 3D convolutions into separate spatial and temporal components, utilizing a combination of 2D spatial convolutions and 1D temporal convolutions. I3D [66], which builds upon the Inception-v1 architecture [67], enables the use of ImageNet pre-trained weights in 3D networks, eliminating the need for training from scratch. Following this trajectory, S3D [32] extends the I3D architecture by incorporating computational efficiency improvements similar to those found in R(2+1)D. More recent works [68,69] have enhanced the modeling of long-range temporal dependencies by performing joint reasoning across multiple video clips’ embeddings extracted from the same video. Using lightweight 3D-CNNs as components in other lightweight models is also an effective means of improving both efficiency and performance. MCANet [70] expands the receptive field using multi-scale dilated convolutions, reduces the number of parameters with an inverted residual structure combined with depthwise separable convolutions, and extracts global features of the full video with an improved ViT. Hara et al. [71] demonstrated that deep 3D-CNNs trained on the Kinetics dataset [72] exhibit strong transfer learning capabilities across various video understanding tasks. Furthermore, as this study focuses on short-term inconsistencies, a single-clip-based pre-trained S3D model was selected and modified to enhance performance in deepfake detection.

3. Methods

3.1. Overview

This paper approaches deepfake detection as a binary classification problem through a model architecture comprising two main components: feature extraction and classification. The model takes as input a temporally continuous video sequence, denoted as $X_{\mathrm{clip}}$, along with a binary label $y \in \{0, 1\}$, where $y = 1$ represents a forged (deepfake) sample and $y = 0$ represents an authentic sample. For feature extraction, the model employs a 3D-CNN as its backbone architecture. The classification component consists of two fully connected layers that progressively reduce the feature dimensionality to a single output. An attention module is integrated between the feature extraction and classification components to enhance the model's focus on relevant spatiotemporal patterns. The entire architecture is designed for end-to-end training. The structure of the proposed model is illustrated in Figure 2.
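To make the overall data flow concrete, the following PyTorch sketch outlines the high-level structure described above (backbone, attention, temporal SPP, and two fully connected layers). It is a minimal illustration, not the authors' implementation: the placement of the global spatial pooling, the 1024-channel backbone output, and the resulting classifier input size are assumptions; only the 384-unit hidden layer is taken from Section 4.5.

```python
import torch
import torch.nn as nn

class STA3D(nn.Module):
    """High-level sketch of the STA-3D pipeline in Section 3.1 (pooling order
    and the 1024x3 classifier input are assumptions for illustration)."""

    def __init__(self, backbone, attention, temporal_spp,
                 feat_dim=1024 * 3, hidden_dim=384):
        super().__init__()
        self.backbone = backbone          # modified S3D trunk (Section 3.2)
        self.attention = attention        # spatiotemporal attention (Section 3.3)
        self.temporal_spp = temporal_spp  # fixed-size pooling along the time axis
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),     # single logit: forged vs. authentic
        )

    def forward(self, clip):              # clip: (B, 3, T, H, W)
        feat = self.backbone(clip)        # (B, C', T', H', W')
        feat = self.attention(feat)       # same shape, re-weighted
        feat = feat.mean(dim=[3, 4])      # global spatial average -> (B, C', T')
        feat = self.temporal_spp(feat)    # fixed-length temporal descriptor
        return self.classifier(feat.flatten(1))

# Example wiring with stand-ins (shapes only, not the real S3D/attention/SPP):
# model = STA3D(backbone=nn.Conv3d(3, 1024, 3, padding=1),
#               attention=nn.Identity(),
#               temporal_spp=nn.AdaptiveAvgPool1d(3))
```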

3.2. 3D-CNN Feature Extraction Module

The architecture of our proposed 3D-CNN network, detailed in Table 1, is fundamentally based on the S3D [32] network. Our model is designed to detect spatiotemporal inconsistencies in short sequences, where the temporal dimension is smaller compared to the spatial dimensions (i.e., height and width). To preserve temporal information in the early stages of processing, we modify the original S3D network by adjusting the temporal stride in the pooling layer following the first inception module (InceptionBlock-1 in Table 1) from 2 to 1. Furthermore, we remove the classification layers of the original S3D network.
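As a concrete illustration of this modification, the sketch below sets the temporal stride of a chosen 3D max-pooling layer in a pre-trained S3D to 1 while leaving the spatial strides unchanged. The module path of the pooling layer that follows the first inception block depends on the specific S3D implementation, so it is passed in as an argument rather than assumed.

```python
import torch.nn as nn

def keep_early_temporal_resolution(model: nn.Module, pool_name: str) -> None:
    """Set the temporal stride of a named MaxPool3d layer to 1 while leaving
    the spatial strides untouched, mirroring the S3D modification above.
    Locating the pooling layer that follows the first inception block is left
    to the caller, since the module path depends on the S3D implementation."""
    pool = dict(model.named_modules())[pool_name]
    assert isinstance(pool, nn.MaxPool3d), "expected a 3D max-pooling layer"
    s = pool.stride if isinstance(pool.stride, tuple) else (pool.stride,) * 3
    pool.stride = (1, s[1], s[2])  # temporal stride 2 -> 1, spatial strides kept
```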
To handle variable-length temporal sequences (8, 16, or 32 frames), we incorporate a Spatial Pyramid Pooling (SPP) layer [73] before the classification network. Although initially proposed for object detection tasks, the SPP layer generates fixed-dimensional output representations, as detailed in Equation (1):
$\mathrm{SPP}(x) = \mathrm{Concat}\big[\,\mathrm{Pool}(t/2,\ t/2,\ x),\ \mathrm{Pool}(t,\ t,\ x)\,\big]$
where $x \in \mathbb{R}^{t}$ (with batch dimension and spatial dimensions omitted for brevity). In $\mathrm{Pool}(\text{kernel\_size}, \text{stride}, x)$, the first two arguments specify the kernel size and the stride, respectively, while the third argument represents the input to be processed. This process is illustrated in Figure 3.
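A minimal PyTorch sketch of the temporal SPP of Equation (1) is given below. It assumes average pooling, an even temporal length, and an input already pooled over the spatial dimensions into shape (B, C, T); these are implementation choices not fixed by the text above.

```python
import torch
import torch.nn.functional as F

def temporal_spp(x: torch.Tensor) -> torch.Tensor:
    """Temporal SPP of Equation (1) for a feature tensor of shape (B, C, T),
    assuming average pooling and an even temporal length T. The two pooling
    scales yield 2 + 1 = 3 temporal bins regardless of T."""
    t = x.size(-1)
    half = F.avg_pool1d(x, kernel_size=t // 2, stride=t // 2)  # Pool(t/2, t/2, x): 2 bins
    full = F.avg_pool1d(x, kernel_size=t, stride=t)            # Pool(t, t, x): 1 bin
    return torch.cat([half, full], dim=-1)                     # (B, C, 3)

# 8-, 16-, and 32-frame inputs all map to the same fixed-length representation.
for t in (8, 16, 32):
    print(temporal_spp(torch.randn(2, 1024, t)).shape)  # torch.Size([2, 1024, 3])
```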

3.3. Attention Module

Our attention mechanism builds upon an enhanced version of triplet attention, which emphasizes cross-dimensional attention, capturing the interactions between spatial and channel dimensions. Although the original triplet attention framework was developed for single-image processing, we extend its rotational strategy to accommodate multi-frame video clips. In its original formulation, triplet attention employs three parallel processing branches: two branches model the relationships between the channel dimension and each of the spatial dimensions independently, while the third branch captures the correlations between the two spatial dimensions. In this paper, the attention module also takes the temporal dimension into account, modeling the correlations between the channel and temporal dimensions (C-T stream), the temporal and height dimensions (T-H stream), the temporal and width dimensions (T-W stream), and the height and width dimensions (H-W stream), respectively, as shown in Figure 4.
The spatial–temporal attention module receives feature embeddings $x \in \mathbb{R}^{C \times T \times H \times W}$ from the preceding block, where C, T, H, and W denote the channel, temporal, height, and width dimensions, respectively (with batch dimension omitted for brevity). The proposed approach extends the Z-pool mechanism in triplet attention by introducing a Z3D-pool operation. Specifically, this operation performs average and max pooling independently along the first two dimensions and then concatenates the results, as given in Equation (2):
$x_{pool} = \text{Z3D-pool}(x) = \mathrm{Concat}\big[\mathrm{AvgPool3D}(x),\ \mathrm{MaxPool3D}(x)\big]$
where
$\mathrm{AvgPool3D}(x)_{h,w} = \dfrac{1}{C \times T} \sum_{i=1}^{C} \sum_{j=1}^{T} x_{i,j,h,w}, \quad h = 1, \ldots, H,\ w = 1, \ldots, W$
$\mathrm{MaxPool3D}(x)_{h,w} = \max_{i = 1, \ldots, C;\ j = 1, \ldots, T} x_{i,j,h,w}, \quad h = 1, \ldots, H,\ w = 1, \ldots, W$
The operator $\mathrm{Concat}[\cdot]$ denotes the concatenation of tensors along the first dimension. At this stage, $x_{pool} \in \mathbb{R}^{2 \times H \times W}$.
In the C-T stream, a rotation operation $R_{CT}$ transforms the input tensor $x$ into $x_{rotate}^{CT} \in \mathbb{R}^{H \times W \times C \times T}$, reorganizing it such that the channel and temporal dimensions become the final two dimensions to facilitate subsequent computations. A Z3D-pool operation is then applied to obtain $x_{pool}^{CT} \in \mathbb{R}^{2 \times C \times T}$. Subsequently, a 2D convolution layer that outputs a single channel with kernel size $k_C \times k_T$, followed by batch normalization, generates the attention map, as shown in Equation (5):
$x_{attn}^{CT} = \mathrm{BatchNorm}\big[\mathrm{Conv}_{k_C \times k_T}(x_{pool}^{CT})\big]$
Before applying the attention weights $x_{attn}^{CT}$, the values are normalized to the range $[-1, 1]$, and element-wise multiplication is performed to obtain the re-weighted feature map, as shown in Equation (6):
$x_{w}^{CT} = 2 \cdot \big(\sigma(x_{attn}^{CT}) - 0.5\big) \odot x_{rotate}^{CT}$
This differs from the original triplet attention: the attention map is linearly scaled to $[-1, 1]$ after the sigmoid, rather than $[0, 1]$. The inverse rotation operation restores the tensor dimensions to their original order, yielding the branch output as shown in Equation (7):
$x^{CT} = \mathrm{InvR}_{CT}(x_{w}^{CT})$
In the H-W stream, rotation operations are unnecessary since the height and width dimensions are already the final two dimensions of the input tensor. The remaining two branches follow similar processing steps to obtain $x^{TH}$ and $x^{TW}$, respectively, with detailed procedures omitted for brevity.
After obtaining the outputs of the four branches, they are fused by averaging and combined with a residual connection, as shown in Equation (8):
$x_{ATT} = x + \dfrac{1}{4}\big(x^{CT} + x^{TH} + x^{TW} + x^{HW}\big)$
A residual connection is incorporated, representing another distinction from the original triplet attention.
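To make the branch structure concrete, the following PyTorch sketch implements the Z3D-pool and the H-W stream (the branch that requires no rotation); the other three streams apply the same steps after permuting the corresponding pair of dimensions to the last two axes. Kernel sizes and padding choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def z3d_pool(x: torch.Tensor) -> torch.Tensor:
    """Z3D-pool (Equation (2)): average- and max-pool over the first two
    non-batch dimensions and concatenate, e.g. (B, C, T, H, W) -> (B, 2, H, W)."""
    avg = x.mean(dim=(1, 2))              # AvgPool3D over C and T
    mx = x.amax(dim=(1, 2))               # MaxPool3D over C and T
    return torch.stack([avg, mx], dim=1)  # (B, 2, H, W)

class HWStream(nn.Module):
    """H-W branch of the proposed attention (no rotation needed); the single-
    channel convolution, batch normalization, and [-1, 1] scaling follow
    Equations (5) and (6)."""

    def __init__(self, k_h: int = 3, k_w: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=(k_h, k_w),
                              padding=(k_h // 2, k_w // 2), bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        attn = self.bn(self.conv(z3d_pool(x)))            # (B, 1, H, W)
        attn = 2.0 * (torch.sigmoid(attn) - 0.5)          # scale to [-1, 1]
        return x * attn.unsqueeze(1)                      # broadcast over C and T

# Fusion of the four branches (Equation (8)), with the residual connection:
# x_att = x + 0.25 * (x_ct + x_th + x_tw + x_hw)
```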
Complexity Analysis. The attention mechanism introduced in this paper has minimal impact on the model's overall complexity, as demonstrated by analyzing the number of additional parameters it introduces. Following the approach of previous works [33,56], no bias terms are used in any convolutional or linear layers, and parameters introduced by normalization layers are disregarded. During the attention generation process, operations such as rotation, pooling, and merging do not introduce additional parameters; trainable parameters exist only in the convolutional layers. Each of the C-T, T-H, T-W, and H-W stream branches contains one convolutional block, with parameter counts of $2 k_C k_T$, $2 k_T k_H$, $2 k_T k_W$, and $2 k_H k_W$, respectively. The total number of parameters introduced is $2(k_C k_T + k_T k_H + k_T k_W + k_H k_W)$. Assuming the kernel sizes are equal, i.e., $k_C = k_T = k_H = k_W = k$, the total number of parameters becomes $8k^2$. It is evident that the proposed attention mechanism adds only a small number of parameters (with kernel sizes typically being small, especially in deeper layers where $k \ll C$) while remaining independent of the input's channel count $C$ and spatial–temporal dimensions $T \times H \times W$. To further illustrate the lightweight nature of the proposed attention mechanism, we compare it with the SE block [52] and CBAM [30] mentioned in Section 2.2. These two methods were originally designed for 2D convolution and are adapted here for 3D convolution by treating the temporal dimension as an additional spatial axis. Let the reduction ratio be denoted as $r$. For the SE block [52], the predominant portion of its parameters arises from the two fully connected layers that adjust the number of feature channels, resulting in a total parameter count of $2C^2/r$. For CBAM [30], the channel attention part is similar to the SE block, with $2C^2/r$ parameters, while the spatial attention part includes one convolutional layer, adding $2k^2$ parameters, resulting in a total parameter count of $2C^2/r + 2k^2$. The number of additional parameters introduced by each of these attention modules is shown in Table 2.
Compared to the original triplet attention, the proposed attention mechanism extends to spatial–temporal inputs. This attention module remains nearly parameter-free relative to the backbone network, with trainable parameters existing only in the convolution layers of four branches, thus introducing minimal computational overhead.
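For a rough sense of scale, the snippet below evaluates the closed-form parameter counts from Table 2 under assumed illustrative values (uniform kernel size k = 3, reduction ratio r = 16, and an intermediate channel count C = 480); these numbers are not taken from the paper.

```python
# Illustrative parameter counts from Table 2, assuming k = 3, r = 16, C = 480.
k, C, r = 3, 480, 16
ours = 8 * k * k                    # four branches, 2*k*k weights each
se = 2 * C * C // r                 # two FC layers of the SE block
cbam = 2 * C * C // r + 2 * k * k   # SE-style channel part plus spatial conv
print(ours, se, cbam)               # 72 vs. 28800 vs. 28818
```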

3.4. Loss Function Design for Class Imbalance

The detection task in this study is a binary classification problem, for which Cross Entropy (CE) is the fundamental and most prevalent loss function, widely adopted in related research within our domain of interest [22,23]. However, CE loss exhibits notable limitations when handling class-imbalanced data, particularly demonstrating susceptibility to overfitting on easily classified samples. Given the inherent characteristics of our task, specifically the substantial class imbalance evident in our dataset (as detailed in Table 3), we explored alternative loss functions and optimization strategies.
Label smoothing [74] addresses overfitting through the introduction of soft targets, effectively mitigating the impact of label noise and annotation errors. Focal loss [34] combines dynamic modulating factors with class-weighted balancing, strategically suppressing the loss contribution from well-classified samples while compelling the model to concentrate on challenging misclassified instances. Building upon this foundation, asymmetric loss [75] incorporates asymmetric focusing parameters and probability shifting mechanisms. Label smoothing demonstrates particular efficacy in multi-class classification scenarios, whereas asymmetric loss shows superior performance in multi-label recognition tasks; however, the latter introduces additional hyperparameter complexity and requires meticulous tuning. Given that this paper addresses a binary classification task with class imbalance, we ultimately chose focal loss as our loss function, as shown in Equation (9):
$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$
where $p_t$ represents the predicted probability of the true class as defined in [34], $\alpha_t$ denotes the balancing factor, and $\gamma$ represents the focusing parameter.
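A minimal sketch of the binary focal loss of Equation (9) on raw logits is shown below; applying α to the positive (forged) class follows the convention of [34], and the defaults mirror the α_t = 0.2, γ = 2.0 setting reported in Section 4.2.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                      alpha: float = 0.2, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss of Equation (9) on raw logits; alpha weights the
    positive (forged) class and gamma down-weights well-classified samples."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing factor
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example with a small batch of logits and labels (1 = forged, 0 = authentic).
loss = binary_focal_loss(torch.randn(4), torch.tensor([1.0, 0.0, 1.0, 1.0]))
```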

4. Results and Analysis

4.1. Datasets and Metrics

This paper utilizes three publicly available datasets, which are described below.
The FaceForensics++ (FF++) dataset [5] contains original videos and manipulated samples generated via five distinct forgery methods: two expression manipulation techniques (Face2Face, NeuralTextures) and three identity-swapping methods (Deepfakes, FaceSwap, FaceShifter). Traditional computer graphics pipelines are employed in FaceSwap and Face2Face, whereas Deepfakes, NeuralTextures, and FaceShifter utilize deep learning architectures. The dataset offers three compression levels (C0, C23, C40), comprising 3000 real videos and 15,000 manipulated counterparts across all methods and compression settings. Official splits for training, validation, and testing are provided, though only its training set is used for model development in this work.
The DFDC-Preview (DFDC-P) dataset [76], released as the first version of the Deepfake Detection Challenge, enhances diversity and scale with 5000+ videos (4464 training, 780 test). It employs two identity-swap forgery techniques. Official training and test set splits are provided.
The Celeb-DF [77] dataset features 590 original YouTube videos from 59 celebrities, augmented by 5639 high-quality synthetic videos generated via an upgraded DeepFake algorithm. Its test set includes 518 challenging samples.
Table 3 quantifies the dataset composition across training and testing phases.
Deepfake detection systems can be evaluated using standard binary classification metrics, including accuracy, precision, recall, and the Receiver Operating Characteristic (ROC) curve. In this work, forged samples are designated as the positive class, while authentic samples constitute the negative class. The confusion matrix categorizes predictions into four groups: True Positive (TP) (correctly identified forged samples), False Negative (FN) (forged samples misclassified as authentic), False Positive (FP) (authentic samples erroneously labeled as forged), and True Negative (TN) (correctly recognized authentic samples). Let # ( · ) denote the number of samples in each category. The accuracy metric, quantifying the overall correct prediction rate, is formally defined as shown in Equation (10):
$\mathrm{Accuracy} = \dfrac{\#(\mathrm{TP}) + \#(\mathrm{TN})}{\#(\mathrm{Total\ Samples})}$
The true positive rate (TPR) and false positive rate (FPR) are formally defined as follows:
$\mathrm{TPR} = \dfrac{\#(\mathrm{TP})}{\#(\mathrm{TP}) + \#(\mathrm{FN})}, \qquad \mathrm{FPR} = \dfrac{\#(\mathrm{FP})}{\#(\mathrm{TN}) + \#(\mathrm{FP})}$
The TPR is synonymous with recall, quantifying the detection completeness of forged samples. Conversely, the FPR measures the proportion of authentic samples incorrectly classified as forged relative to all authentic samples. The Receiver Operating Characteristic (ROC) curve is generated by plotting the TPR against the FPR across varying classification thresholds, with TPR and FPR serving as the y-axis and x-axis coordinates, respectively. The Area Under the Curve (AUC) quantifies the overall detection performance, where higher AUC values indicate superior model discriminative capability. Statistically, the AUC represents the probability that a randomly selected positive sample will receive a higher detection score than a negative sample [78]. Compared to accuracy, the AUC provides a threshold-agnostic evaluation of model robustness. Following common practice, we employ AUC for cross-dataset generalization assessment and classification accuracy for controlled intra-dataset experiments.
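These metrics can be computed directly with scikit-learn; the snippet below uses dummy per-video scores (1 = forged, positive class) purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Dummy per-video scores (mean clip predictions); 1 = forged, 0 = authentic.
y_true = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.92, 0.61, 0.10, 0.70, 0.80, 0.30])

acc = accuracy_score(y_true, (scores >= 0.5).astype(int))  # Equation (10), threshold 0.5
auc = roc_auc_score(y_true, scores)                        # threshold-agnostic AUC
print(f"Accuracy: {acc:.3f}, AUC: {auc:.3f}")              # ~0.833 and ~0.889 here
```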

4.2. Implementation Details

Data Processing & Augmentation. For video samples, the data underwent a two-step preprocessing pipeline before model input: (1) facial region cropping and (2) data augmentation. This study employs the YOLO-face detector (https://github.com/akanametov/yolo-face, accessed on 18 February 2025) for face detection. To reduce frame-to-frame jitter caused by inconsistent face detections across frames, we apply a single bounding box that encloses the detected boxes of all frames in a clip. Frames exhibiting excessive facial movement are excluded from analysis. Following previous studies, the detection region is expanded by a fixed ratio (20% in this study) before cropping. The cropped images undergo a series of data augmentation transformations, including random horizontal flipping ($p = 0.5$), random inversion ($p = 0.5$), and affine transformation ($p = 0.8$, $shear = 10$), followed by normalization to $[0, 1]$. The resulting clips are resized to a spatial resolution of 256 × 256. During inference, the random augmentation transformations are omitted, while normalization and resizing are retained. For each test video sample, we randomly select 10 clips and compute their mean prediction score as the final result.
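The sketch below illustrates two steps of this pipeline: the union bounding box used to stabilize the crops and the 10-clip score averaging at inference. The box format (x1, y1, x2, y2) and the margin handling are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def union_box(boxes, margin=0.2):
    """Union of per-frame face boxes (x1, y1, x2, y2), expanded by a fixed
    ratio (20% here) to reduce frame-to-frame jitter in the cropped region."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    w, h = x2 - x1, y2 - y1
    return (x1 - margin * w, y1 - margin * h, x2 + margin * w, y2 + margin * h)

@torch.no_grad()
def video_score(model, clips: torch.Tensor) -> float:
    """Video-level prediction at inference: average the scores of the sampled
    clips (10 per video in this work); clips has shape (N, 3, T, 256, 256) and
    is assumed to be already cropped, resized, and normalized."""
    model.eval()
    return torch.sigmoid(model(clips)).mean().item()
```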
Hyper-parameters & Experiment Environment. All experiments were conducted using PyTorch 2.0.1 on Ubuntu 22.04 LTS with two NVIDIA RTX 2080 Ti GPUs. For the feature extraction backbone network, we utilized an S3D model (https://www.github.com/kylemin/S3D, accessed on 10 January 2025) pre-trained on the Kinetics-400 dataset with a Top-1 accuracy of 72.1%. The convolutional layers and fully connected layers were initialized using a normal distribution with mean 0.0 and variance 0.5 and He-uniform initialization [79], respectively. For the proposed attention module, the convolution kernel sizes were set to $k_C = 7$ and $k_T = k_H = k_W = 3$. For the SE block [52] and CBAM [30] used for comparison, the channel compression ratios were both set to $r = 16$, consistent with the default settings in the corresponding papers, and the kernel size of the latter was set to 7. For the focal loss, we set $\alpha_t = 0.2$ and $\gamma = 2.0$. During training, we employed a learning rate warm-up strategy for the first three epochs, linearly increasing the rate from $1 \times 10^{-5}$ to $1 \times 10^{-4}$. For subsequent epochs, we applied an exponential learning rate decay with a decay rate of 0.9, using the Adam optimizer [80]. The model was trained for 25 epochs with a batch size of 32. During the inference phase, as backpropagation and parameter updates were not required, less GPU memory was consumed, allowing for an increased batch size of 64.
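The warm-up plus exponential-decay schedule can be expressed with a single LambdaLR, as in the sketch below; the exact interpolation endpoints of the warm-up are an assumption, and a trivial linear layer stands in for the actual network.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder network standing in for STA-3D
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def lr_lambda(epoch: int) -> float:
    """Multiplier on the base LR of 1e-4 (epoch is 0-indexed): linear warm-up
    from 1e-5 over the first 3 epochs, then 0.9x decay per subsequent epoch."""
    if epoch < 3:
        return 0.1 + 0.9 * epoch / 3
    return 0.9 ** (epoch - 3)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(25):
    # ... one training epoch over the FF++ training set ...
    scheduler.step()
```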

4.3. Overall Model Parameters and Computational Cost

To provide a comprehensive understanding of the efficiency and scalability of our proposed STA-3D framework, we report the overall model complexity and resource requirements under different input frame settings during inference. Specifically, Table 4 summarizes key metrics that characterize our model's computational demands. The reported inference time is the average duration over 32 inference iterations under the conditions specified in Section 4.2, excluding sample loading and preprocessing time. Since these values can be influenced by multiple factors, they are provided for reference only. The remaining metrics were obtained using THOP.
As shown in Table 4, GPU memory usage and computational complexity increase approximately 2× with each doubling of the frame count, while the growth of inference time is sublinear. The number of input frames can be selected based on available hardware conditions and detection performance (as described in Section 4.6).
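As a reference for how such numbers are obtained, the snippet below shows a typical THOP profiling call; a small Conv3d stands in for STA-3D, and the 16-frame 256 × 256 input mirrors the default configuration.

```python
import torch
from thop import profile

model = torch.nn.Conv3d(3, 8, kernel_size=3, padding=1)  # stand-in for STA-3D
dummy_clip = torch.randn(1, 3, 16, 256, 256)             # 16 frames at 256x256

macs, params = profile(model, inputs=(dummy_clip,))
print(f"MACs: {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```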

4.4. Comparison with Previous Methods

In this section, we present the performance (in terms of accuracy) of our proposed method on the FF++ dataset. To contextualize our results, we compare with three categories of baselines: (1) conventional approaches, which take a single frame as model input (frame-based): Steg.Features [81], ResNet-50 [82], Meso-4 [3], MesoInception-4 [3], and Xception [83]; (2) approaches employing frequency-domain information as auxiliary features (frequency-based): Two-branchRN [84], F3-Net-Xception [7], fCNN [85], and SPSL [86]; (3) approaches modeling spatiotemporal characteristics jointly (spatiotemporal): C3D [87], I3D [66], SlowFast [88], TEINet [89], ADDNet-3D [90], and Ma et al. [29], where the first four are generic action recognition methods, while the latter two are specifically designed for deepfake detection with incorporated attention mechanisms.
We perform comprehensive experiments on the FF++ dataset at all three compression levels (C0, C23, C40) and present comparisons with state-of-the-art methods in Table 5. Notably, our experimental protocol combines all four manipulated subsets (DF, FS, F2F, NT) for both training and testing phases, which creates a more challenging evaluation scenario compared to approaches that train and test on individual subsets separately [91].
Our method surpasses all baseline approaches in terms of the average performance across different compression levels of FF++. Specifically, it outperforms the second-best detector F3-Net-Xception [7] by a margin of 1.09 percentage points. Regarding compression levels, the proposed method achieves state-of-the-art performance at a C40 compression level with 94.40% accuracy, demonstrating 1.63 percentage points improvement over the second-best result in this challenging scenario; for C0 and C23 compression levels, our method attains competitive second-best performance, with minimal accuracy gaps of 0.78 and 0.51 percentage points, respectively, compared to the top-performing methods (99.17% vs. 99.95% at C0, 97.62% vs. 98.04% at C23). Deepfake detection models should maintain consistent performance across varying compression ratios, demonstrating robustness to sample quality degradation. As evidenced in Table 5, spatial–temporal methods achieve superior robustness compared to methods relying solely on single-frame inputs, with the proposed framework outperforming all baselines under cross-compression evaluation.
We further dissect our method and baseline approaches across compression levels (C0/C23/C40) and forgery types (DF/FS/F2F/NT) to provide granular insights into performance variations across different scenarios in Table 6. For each method, we evaluate the results across 12 different condition combinations. Notably, our proposed approach achieves the best performance in six combinations and second-best performance in three combinations, demonstrating excellent cross-scenario adaptability.
Upon further analysis, we observe that several baseline methods exhibit superior performance only under specific conditions. For instance, ResNet-50 [82] demonstrates outstanding performance at the C23 compression level but shows relatively inferior results at the C40 compression level. Similarly, Ma et al. [29] achieve excellent detection performance on samples generated using the DF method across different compression levels, yet their detection capability for other methods is more significantly impacted by compression. In contrast, our proposed method maintains robust performance across various scenarios, particularly excelling in detecting the more challenging NT method, where it achieves the best performance among all baselines under all three compression conditions.

4.5. Visualization Analysis

To better understand how our model processes spatiotemporal features and makes decisions, we conduct visualization experiments using both t-SNE [92] and Grad-CAM [93] on the FF++ dataset.
Feature Space Visualization with t-SNE. To demonstrate the effectiveness of our model’s feature space, we employ t-SNE to analyze the feature distributions under different methods and compression levels. Specifically, following the settings in Section 4.2, we extracted the features (384-dimensional) after the first fully connected layer in the classification component and used t-SNE from https://scikit-learn.org/stable/index.html (accessed on 2 June 2025) for dimensionality reduction. We plotted the distributions for authentic/forged samples and different forgery types in the FF++ dataset against authentic samples, as shown in Figure 5 and Figure 6, respectively. Figure 5 demonstrates distinct separability across different forgery types, indicating our method’s strong discriminative capability for different manipulation techniques in the FF++ dataset. Meanwhile, Figure 6 reveals minimal separation between compression levels (particularly C0 and C23) in both authentic and forged samples, highlighting the model’s robustness to compression artifacts.
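A sketch of this visualization step is given below; random 384-dimensional vectors stand in for the embeddings collected after the first fully connected layer, and the t-SNE hyperparameters shown are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Random 384-dim vectors stand in for the features after the first FC layer;
# labels mark authentic (0) vs. forged (1) samples.
features = np.random.randn(500, 384)
labels = np.random.randint(0, 2, size=500)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("t-SNE of STA-3D features (illustrative)")
plt.show()
```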
Grad-CAM Visualization. To interpret the decision logic of STA-3D, we conducted pixel-level interpretability analysis using Grad-CAM. Diverging from conventional single-frame visualizations, this method quantifies the contribution of spatiotemporal regions to classification outcomes. Specifically, Grad-CAM is applied to the final block within InceptionBlock-3 (mentioned in Table 1) to trace gradient propagation. Representative samples are selected for analysis: a readily detectable DeepFakes case (Figure 7a) and a challenging FaceShifter example (Figure 7b). In Figure 7a, distinct fusion boundaries are evident. In the earlier frames, the model focuses on the pronounced skin tone inconsistency boundary at the subject’s chin, while in later frames with mouth movements, it attends to the loss of detail between the mouth and nose compared to earlier frames. In Figure 7b, the model emphasizes temporal inconsistencies: in later frames, the intermittent appearance of the nasolabial fold on the subject’s right side becomes a key region of focus.

4.6. Ablation Study

In this section, we conduct an ablation study examining two critical factors in our method: the number of consecutive input frames and the contribution of the attention mechanism. Table 7 presents detailed results on the interplay between these factors across three compression levels (C0/C23/C40) on the FF++ dataset, showing how frame count and attention mechanisms jointly affect detection performance under varying video quality conditions.
Effect of Consecutive Frame Numbers. As described in Section 3.2, our method supports various input frame configurations, including 4, 8, 16, and 32 consecutive frames. We evaluated the model’s performance on the FF++ dataset using 8, 16, and 32 consecutive frames during testing. It can be observed that increasing the number of input frames from 8 to 16 consistently leads to a significant improvement in detection performance, regardless of compression level or the presence of the attention module. However, further increasing the number of frames from 16 to 32 results in a performance drop overall. This decline is particularly notable under heavy compression and when the attention module is absent, where the performance even falls below that of the 8-frame input. Only under light compression (C0, C23) and with the attention module does the 32-frame input show marginal improvement. Considering both detection performance and computational cost (described in Section 4.3), we adopt 16 consecutive frames as the final input configuration for our model.
Impact of Attention Mechanism. As described in Section 3.3, the attention module employed in this work is designed to be insertable, maintaining identical tensor dimensions between input and output, while only introducing a small number of trainable parameters. We evaluated the performance differences with and without this module on the FF++ dataset across various compression levels. Under all compression levels, when using 32 consecutive frames as input, there was a performance increase, whereas with 8 consecutive frames, performance generally declined. When using 16 consecutive frames as input, the overall performance was optimal, especially at higher compression levels. It can be inferred that the attention module in this study helps to mitigate the impact of redundant spatial information in long sequences, enhancing the utilization of cross-frame discriminative features. Additionally, we replaced our proposed attention module with several plug-and-play attention modules analyzed in Section 3.3 under the condition of 8 frames as input, with the results shown in Table 8. The attention module adopted in this paper achieved the best results in terms of average accuracy, achieving the highest or second-highest accuracy across all compression levels; SE [52] also demonstrated good performance, but CBAM [30] was clearly unsuitable for this task. Overall, the attention module proposed in this paper demonstrated superior detection capability and generalizability under different compression conditions.

4.7. Generalization Ability

In this section, we analyze the generalization ability of the proposed STA-3D trained on the FF++ dataset by evaluating its performance on the DFDC-P and Celeb-DF datasets, as presented in Table 9. The model was trained exclusively on FF++ and subsequently evaluated on the Celeb-DF and DFDC-P test sets under a strict cross-domain protocol without any domain adaptation techniques.
Cross-dataset generalization remains a persistent challenge, with both baseline methods and our approach exhibiting notable performance degradation in cross-dataset scenarios, as shown in Table 9. Our proposed method achieves the highest AUC score of 69.24 on the DFDC-P dataset, surpassing all other baselines, including the RATF [94] method, which obtained an AUC of 69.1. On the Celeb-DF dataset, our method achieves an AUC of 59.64, which trails the best-performing baselines, indicating that further improvements in cross-dataset generalization are needed.
Table 9. Generalization performance (AUC) of various methods on DFDC-P and Celeb-DF datasets. Best and second-best results are annotated with double underlines and a single underline, respectively.

Methods                  | Training Dataset | DFDC-P | Celeb-DF
Meso-4 [3]               | FF++             | 59.4   | 53.6
HeadPose [95]            | FF++             | –      | 54.6
DSP [5]                  | FF++             | 67.9   | 73.7
VA-LogReg [96]           | FF++             | –      | 55.1
Capsule-Forensics [97]   | FF++             | –      | 57.5
Multi-task [98]          | FF++             | –      | 54.3
D-FWA [6]                | FF++             | –      | 56.9
LRNet [99]               | FF++             | –      | 53.2
RATF [94]                | FF++             | 69.1   | 76.5
Ours                     | FF++             | 69.24  | 59.64

5. Conclusions and Discussion

In this paper, we present the Spatiotemporal Attention 3D Network (STA-3D), a novel framework for deepfake video detection. The proposed method achieves state-of-the-art performance on the FaceForensics++ (FF++) dataset, particularly under challenging C40 compression, while maintaining strong performance on lightly compressed videos (i.e., RAW and C23). Our spatiotemporal attention mechanism effectively reduces redundant spatial information in long video sequences and enhances discriminative feature utilization under high compression conditions.
However, cross-dataset generalization remains a critical challenge. While STA-3D demonstrates competitive results on FF++, its performance may exhibit limited robustness when tested on some unseen datasets. A potential reason is that the authentic and forged samples do not exhibit clear separation in the model’s feature space, as demonstrated by the results in Section 4.5. Additionally, although 3D-CNNs efficiently capture spatiotemporal dependencies, they incur substantial computational overhead, posing challenges for real-time deployment. These challenges present key areas for further improvements. Furthermore, detection performance in real-world environments remains to be tested. Some directions for further exploration include anomaly detection, explainable AI, and incremental learning.

Author Contributions

J.W. conceived the research idea, designed and performed the experiments, and drafted the initial manuscript; J.L. provided technical guidance throughout the experimental procedures; J.L., S.L., and J.Z. evaluated the feasibility and completeness of this work and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All datasets used in this study are publicly available and can be accessed through the references cited in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
3D-CNNs     Three-dimensional convolutional neural networks
ViT         Vision Transformers
SPP         Spatial Pyramid Pooling
FF++        FaceForensics++
DFDC-P      Deepfake Detection Challenge Preview
Celeb-DF    Celeb-DeepFake
AUC         Area Under the Curve
CE          Cross Entropy
t-SNE       t-distributed Stochastic Neighbor Embedding
Grad-CAM    Gradient-weighted Class Activation Mapping

References

  1. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  2. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  3. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. Mesonet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
  4. Kumar, P.; Vatsa, M.; Singh, R. Detecting face2face facial reenactment in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 2589–2597. [Google Scholar]
  5. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  6. Li, Y.; Lyu, S. Exposing deepfake videos by detecting face warping artifacts. arXiv 2018, arXiv:1811.00656. [Google Scholar]
  7. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–103. [Google Scholar]
  8. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5001–5010. [Google Scholar]
  9. Nguyen, D.; Mejri, N.; Singh, I.P.; Kuleshova, P.; Astrid, M.; Kacem, A.; Ghorbel, E.; Aouada, D. LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–22 June 2024; pp. 17395–17405. [Google Scholar]
  10. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  11. Güera, D.; Delp, E.J. Deepfake video detection using recurrent neural networks. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  12. Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; Natarajan, P. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 2019, 3, 80–87. [Google Scholar]
  13. Vamsi, V.V.V.N.S.; Shet, S.S.; Reddy, S.S.M.; Rose, S.S.; Shetty, S.R.; Sathvika, S.; Supriya, M.; Shankar, S.P. Deepfake detection in digital media forensics. Glob. Transit. Proc. 2022, 3, 74–79. [Google Scholar] [CrossRef]
  14. Amerini, I.; Galteri, L.; Caldelli, R.; Del Bimbo, A. Deepfake video detection through optical flow based cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  15. Saikia, P.; Dholaria, D.; Yadav, P.; Patel, V.; Roy, M. A hybrid CNN-LSTM model for video deepfake detection by leveraging optical flow features. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–7. [Google Scholar]
  16. Haliassos, A.; Vougioukas, K.; Petridis, S.; Pantic, M. Lips don’t lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5039–5049. [Google Scholar]
  17. Wu, J.; Zhu, Y.; Jiang, X.; Liu, Y.; Lin, J. Local attention and long-distance interaction of rPPG for deepfake detection. Vis. Comput. 2024, 40, 1083–1094. [Google Scholar] [CrossRef]
  18. Verkruysse, W.; Svaasand, L.O.; Nelson, J.S. Remote plethysmographic imaging using ambient light. Opt. Express 2008, 16, 21434–21445. [Google Scholar] [CrossRef] [PubMed]
19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Peng, C.; Miao, Z.; Liu, D.; Wang, N.; Hu, R.; Gao, X. Where Deepfakes Gaze at? Spatial-Temporal Gaze Inconsistency Analysis for Video Face Forgery Detection. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4507–4517. [Google Scholar] [CrossRef]
  21. Zheng, Y.; Bao, J.; Chen, D.; Zeng, M.; Wen, F. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15044–15054. [Google Scholar]
  22. Xu, Y.; Liang, J.; Jia, G.; Yang, Z.; Zhang, Y.; He, R. Tall: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22658–22668. [Google Scholar]
  23. Zhao, C.; Wang, C.; Hu, G.; Chen, H.; Liu, C.; Tang, J. ISTVT: Interpretable spatial-temporal video transformer for deepfake detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1335–1348. [Google Scholar] [CrossRef]
  24. Tran, D.; Ray, J.; Shou, Z.; Chang, S.F.; Paluri, M. Convnet architecture search for spatiotemporal feature learning. arXiv 2017, arXiv:1708.05038. [Google Scholar]
  25. Turrisi, R.; Verri, A.; Barla, A. The effect of data augmentation and 3D-CNN depth on Alzheimer’s Disease detection. arXiv 2023, arXiv:2309.07192. [Google Scholar]
  26. De Lima, O.; Franklin, S.; Basu, S.; Karwoski, B.; George, A. Deepfake detection using spatiotemporal convolutional networks. arXiv 2020, arXiv:2006.14749. [Google Scholar]
  27. Liu, J.; Zhu, K.; Lu, W.; Luo, X.; Zhao, X. A lightweight 3D convolutional neural network for deepfake detection. Int. J. Intell. Syst. 2021, 36, 4990–5004. [Google Scholar] [CrossRef]
  28. Lu, C.; Liu, B.; Zhou, W.; Chu, Q.; Yu, N. Deepfake video detection using 3D-attentional inception convolutional neural network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 3572–3576. [Google Scholar]
  29. Ma, Z.; Mei, X.; Shen, J. 3D Attention Network for Face Forgery Detection. In Proceedings of the 2023 4th Information Communication Technologies Conference (ICTC), Nanjing, China, 17–19 May 2023; pp. 396–401. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  31. Wang, Z.; Bao, J.; Zhou, W.; Wang, W.; Li, H. Altfreezing for more general video face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4129–4138. [Google Scholar]
  32. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
  33. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148. [Google Scholar]
  34. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
35. Kowalski, M. FaceSwap: 3D Face Swapping Implemented in Python. 2021. Available online: https://github.com/MarekKowalski/FaceSwap (accessed on 15 January 2025).
  36. Korshunova, I.; Shi, W.; Dambre, J.; Theis, L. Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3677–3685. [Google Scholar]
  37. DeepFakes. 2018. Available online: https://www.theverge.com/2018/2/7/16982046/reddit-deepfakes-ai-celebrity-face-swap-porn-community-ban (accessed on 15 January 2025).
  38. Shaoanlu. Faceswap-GAN. 2022. Available online: https://github.com/shaoanlu/faceswap-GAN (accessed on 15 January 2025).
  39. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  40. Bao, J.; Chen, D.; Wen, F.; Li, H.; Hua, G. Towards open-set identity preserving face synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6713–6722. [Google Scholar]
  41. Li, L.; Bao, J.; Yang, H.; Chen, D.; Wen, F. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv 2019, arXiv:1912.13457. [Google Scholar]
  42. Zhu, Y.; Li, Q.; Wang, J.; Xu, C.Z.; Sun, Z. One shot face swapping on megapixels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4834–4844. [Google Scholar]
  43. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  44. Kim, J.; Lee, J.; Zhang, B.T. Smooth-swap: A simple enhancement for face-swapping with smoothness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10779–10788. [Google Scholar]
  45. Rosberg, F.; Aksoy, E.E.; Alonso-Fernandez, F.; Englund, C. Facedancer: Pose-and occlusion-aware high fidelity face swapping. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 3454–3463. [Google Scholar]
  46. Thies, J.; Zollhofer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2face: Real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2387–2395. [Google Scholar]
  47. Thies, J.; Zollhöfer, M.; Nießner, M. Deferred neural rendering: Image synthesis using neural textures. Acm Trans. Graph. (TOG) 2019, 38, 1–12. [Google Scholar] [CrossRef]
  48. Tripathy, S.; Kannala, J.; Rahtu, E. Icface: Interpretable and controllable face reenactment using gans. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 3385–3394. [Google Scholar]
  49. Wang, Y.; Yang, D.; Bremond, F.; Dantcheva, A. Latent image animator: Learning to animate images via latent space navigation. arXiv 2022, arXiv:2203.09043. [Google Scholar]
  50. Pang, Y.; Zhang, Y.; Quan, W.; Fan, Y.; Cun, X.; Shan, Y.; Yan, D.M. Dpe: Disentanglement of pose and expression for general video portrait editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 427–436. [Google Scholar]
  51. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  52. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
53. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  54. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  55. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012. [Google Scholar]
  56. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  57. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3531–3539. [Google Scholar]
  58. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  59. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  60. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
  61. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef]
  62. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  63. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef]
  64. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541. [Google Scholar]
  65. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  66. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  67. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  68. Zhang, S.; Guo, S.; Huang, W.; Scott, M.R.; Wang, L. V4d: 4d convolutional neural networks for video-level representation learning. arXiv 2020, arXiv:2002.07442. [Google Scholar]
  69. Wu, W.; Zhao, Y.; Xu, Y.; Tan, X.; He, D.; Zou, Z.; Ye, J.; Li, Y.; Yao, M.; Dong, Z.; et al. Dsanet: Dynamic segment aggregation network for video-level representation learning. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 1903–1911. [Google Scholar]
  70. Tian, Q.; Miao, W.; Zhang, L.; Yang, Z.; Yu, Y.; Zhao, Y.; Yao, L. MCANet: A lightweight action recognition network with multidimensional convolution and attention. Int. J. Mach. Learn. Cybern. 2025, 16, 3345–3358. [Google Scholar] [CrossRef]
  71. Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555. [Google Scholar]
  72. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  73. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  74. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  75. Ridnik, T.; Ben-Baruch, E.; Zamir, N.; Noy, A.; Friedman, I.; Protter, M.; Zelnik-Manor, L. Asymmetric loss for multi-label classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 82–91. [Google Scholar]
  76. Dolhansky, B.; Howes, R.; Pflaum, B.; Baram, N.; Ferrer, C.C. The deepfake detection challenge (dfdc) preview dataset. arXiv 2019, arXiv:1910.08854. [Google Scholar]
  77. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3207–3216. [Google Scholar]
  78. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  79. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
80. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  81. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882. [Google Scholar] [CrossRef]
  82. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  83. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
84. Masi, I.; Killekar, A.; Mascarenhas, R.M.; Gurudatt, S.P.; AbdAlmageed, W. Two-branch recurrent network for isolating deepfakes in videos. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII; Springer: Cham, Switzerland, 2020. [Google Scholar]
  85. Kohli, A.; Gupta, A. Detecting deepfake, faceswap and face2face facial forgeries using frequency cnn. Multimed. Tools Appl. 2021, 80, 18461–18478. [Google Scholar] [CrossRef]
  86. Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Zhang, W.; Yu, N. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 772–781. [Google Scholar]
  87. Tran, D.; Bourdev, L.D.; Fergus, R.; Torresani, L.; Paluri, M. C3D: Generic Features for Video Analysis. arXiv 2014, arXiv:1412.0767. [Google Scholar]
88. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  89. Liu, Z.; Luo, D.; Wang, Y.; Wang, L.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Lu, T. Teinet: Towards an efficient architecture for video recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11669–11676. [Google Scholar]
  90. Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; Jain, A.K. On the detection of digital face manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5781–5790. [Google Scholar]
  91. Pang, G.; Zhang, B.; Teng, Z.; Qi, Z.; Fan, J. MRE-Net: Multi-rate excitation network for deepfake video detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3663–3676. [Google Scholar] [CrossRef]
  92. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  93. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  94. Gu, Z.; Yao, T.; Chen, Y.; Yi, R.; Ding, S.; Ma, L. Region-Aware Temporal Inconsistency Learning for DeepFake Video Detection. In Proceedings of the IJCAI, Vienna, Austria, 23–29 July 2022; pp. 920–926. [Google Scholar]
  95. Yang, X.; Li, Y.; Lyu, S. Exposing deep fakes using inconsistent head poses. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8261–8265. [Google Scholar]
  96. Matern, F.; Riess, C.; Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 83–92. [Google Scholar]
  97. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using capsule networks to detect forged images and videos. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2307–2311. [Google Scholar]
  98. Nguyen, H.H.; Fang, F.; Yamagishi, J.; Echizen, I. Multi-task learning for detecting and segmenting manipulated facial images and videos. In Proceedings of the 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), Tampa, FL, USA, 23–26 September 2019; pp. 1–8. [Google Scholar]
  99. Sun, Z.; Han, Y.; Hua, Z.; Ruan, N.; Jia, W. Improving the efficiency and robustness of deepfakes detection through precise geometric features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3609–3618. [Google Scholar]
Figure 1. Schematic illustration of the typical deepfake generation pipeline, highlighting its sequential frame-wise processing approach.
Figure 2. Structure of the proposed model. The feature extraction part (from Input to InceptionBlock-3) corresponds to Table 1. The attention module and SPP module are described in Section 3.3 and illustrated in Figure 3, respectively.
Figure 3. The processing workflow of the SPP layer. The blue annotations indicate the dimensional shapes. For simplicity, the batch and spatial dimensions are omitted.
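As a concrete illustration of the workflow in Figure 3, the following PyTorch sketch applies pyramid pooling along the temporal axis only, in the spirit of He et al. [73]. The pyramid levels (1, 2, 4), the use of adaptive max pooling, and the prior averaging of the spatial dimensions are illustrative assumptions rather than the exact STA-3D configuration.

```python
import torch
import torch.nn as nn

class TemporalSPP(nn.Module):
    """Spatial Pyramid Pooling applied along the temporal axis only.

    Pools a (N, C, T, H, W) feature map into a fixed number of temporal
    bins per pyramid level, so clips of any length T yield the same
    output size. The pyramid levels here are illustrative.
    """

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        # x: (N, C, T, H, W); spatial dims are averaged away so the
        # pyramid only has to summarize the temporal axis.
        x = x.mean(dim=(3, 4))                                  # (N, C, T)
        pooled = [nn.functional.adaptive_max_pool1d(x, bins)    # (N, C, bins)
                  for bins in self.levels]
        return torch.cat(pooled, dim=2)                         # (N, C, sum(levels))


if __name__ == "__main__":
    spp = TemporalSPP()
    for frames in (8, 16, 32):
        clip = torch.randn(2, 1024, frames, 7, 7)
        print(frames, spp(clip).shape)                          # always (2, 1024, 7)
```

Regardless of whether 8, 16, or 32 frames are fed in, the pooled representation has the same size, which is what allows the classifier head to accept variable-length clips.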
Figure 4. Proposed spatial–temporal attention module, which is based on improved triplet attention. The blue annotations adjacent to the arrows indicate the dimensional shape of the feature map output from each boxed operation. The batch dimension has been omitted for clarity.
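The three symmetric branches in Figure 4 can be read as a 3D extension of triplet attention [33]: each branch compresses the feature map with max and mean pooling (the "Z-pool" of [33]), applies a small convolution, and produces a sigmoid gate over one pair of axes. The sketch below is one plausible implementation of that reading, assuming a kernel size of 7 and simple averaging of the branches; the exact pooling axes, kernel sizes, and normalization used in STA-3D may differ.

```python
import torch
import torch.nn as nn

def zpool(x, dims):
    """Concatenate max and mean over `dims` (the Z-pool of triplet attention [33])."""
    return torch.stack([x.amax(dim=dims), x.mean(dim=dims)], dim=1)

class SpatioTemporalAttention(nn.Module):
    """Illustrative three-branch gate over a (N, C, T, H, W) feature map,
    covering temporal-channel, temporal-spatial, and intra-spatial correlations."""

    def __init__(self, k=7):
        super().__init__()
        p = k // 2
        self.conv_ct = nn.Conv2d(2, 1, k, padding=p)   # gate over (C, T)
        self.conv_ts = nn.Conv3d(2, 1, k, padding=p)   # gate over (T, H, W)
        self.conv_hw = nn.Conv2d(2, 1, k, padding=p)   # gate over (H, W)

    def forward(self, x):
        # Branch 1: temporal-channel. Spatial dims are pooled away.
        a_ct = torch.sigmoid(self.conv_ct(zpool(x, (3, 4))))   # (N, 1, C, T)
        out_ct = x * a_ct.squeeze(1)[..., None, None]          # broadcast over H, W
        # Branch 2: temporal-spatial. Channel dim is pooled away.
        a_ts = torch.sigmoid(self.conv_ts(zpool(x, 1)))        # (N, 1, T, H, W)
        out_ts = x * a_ts
        # Branch 3: intra-spatial. Channel and temporal dims are pooled away.
        a_hw = torch.sigmoid(self.conv_hw(zpool(x, (1, 2))))   # (N, 1, H, W)
        out_hw = x * a_hw.unsqueeze(2)                         # broadcast over C, T
        return (out_ct + out_ts + out_hw) / 3.0


if __name__ == "__main__":
    attn = SpatioTemporalAttention()
    feat = torch.randn(2, 64, 8, 28, 28)
    print(attn(feat).shape)     # torch.Size([2, 64, 8, 28, 28])
```

Because each branch learns only a single small convolution over a two-channel pooled map, the added parameter cost stays negligible compared with channel-attention designs such as SE or CBAM (cf. Table 2).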
Figure 5. t-SNE visualization of feature distributions by forgery methods: (a) DeepFakes (DF); (b) FaceSwap (FS); (c) Face2Face (F2F); (d) NeuralTextures (NT).
Figure 6. t-SNE visualization of feature distributions across compression levels: (a) authentic samples; (b) forged samples.
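Projections like those in Figures 5 and 6 can be produced from the detector's penultimate-layer features with t-SNE [92]; the feature source, perplexity, and random seed in the sketch below are placeholders rather than the settings used for the published figures.

```python
# Sketch of a t-SNE projection as used for Figures 5 and 6 [92]; `features`
# stands in for penultimate-layer activations and `labels` for real/fake
# (or compression-level) annotations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 1024))   # placeholder feature matrix
labels = rng.integers(0, 2, size=500)         # 0 = real, 1 = fake (illustrative)

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("t-SNE of detector features (illustrative)")
plt.show()
```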
Figure 7. Grad-CAM visualization for detection difficulty cases. Each case displays the following: Top row—sampled consecutive frames; Middle row—Grad-CAM heatmaps; Bottom row—overlay of frames and heatmaps. (a) Easily detectable DeepFakes sample (C23, 135_880.mp4); (b) Challenging FaceShifter sample (C23, 000_003.mp4).
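The heatmaps in Figure 7 follow the standard Grad-CAM recipe [93]: gradients of the class score are averaged over a late convolutional activation map and used to weight that map. The sketch below demonstrates the recipe on a 2D torchvision ResNet-18 as a stand-in backbone; applying it to a 3D network such as STA-3D additionally requires handling the temporal axis, which is omitted here.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()   # stand-in backbone, not the STA-3D model
saved = {}

def save_activation(_, __, output):
    output.retain_grad()                # keep the gradient of this non-leaf tensor
    saved["act"] = output

model.layer4.register_forward_hook(save_activation)

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()               # score of the top class (illustrative target)
score.backward()

act, grad = saved["act"].detach(), saved["act"].grad
weights = grad.mean(dim=(2, 3), keepdim=True)               # channel importance
cam = F.relu((weights * act).sum(dim=1, keepdim=True))      # weighted activation map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalize to [0, 1]
print(cam.shape)                                            # torch.Size([1, 1, 224, 224])
```

The normalized map is then overlaid on the input frames to obtain visualizations of the kind shown in the middle and bottom rows of Figure 7.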
Table 1. Architecture of the 3D-CNN feature extraction module. Convolution/pooling layers are specified as (T_ker × H_ker × W_ker, T_str × H_str × W_str) for kernel/stride, followed by the filter count, while SepInc modules denote branch channels as (ch1, ch2, ch3, ch4) followed by the final output dimension.

| Layers | Filter Configuration | Temporal Output (Relative) | Spatial Output (Relative) |
|---|---|---|---|
| Input | - | 1 | 1 |
| SepConv3D-1 | (7 × 7 × 7, 2 × 2 × 2), 64 | 1/2 | 1/2 |
| MaxPooling3D-1 | (1 × 3 × 3, 1 × 2 × 2), 64 | 1/2 | 1/4 |
| ChannelConv3D-1 | (1 × 1 × 1, 1 × 1 × 1), 64 | 1/2 | 1/4 |
| SepConv3D-2 | (3 × 3 × 3, 1 × 1 × 1), 192 | 1/2 | 1/4 |
| MaxPooling3D-2 | (1 × 3 × 3, 1 × 2 × 2), 192 | 1/2 | 1/8 |
| InceptionBlock-1 | SepInc(64, 128, 32, 32), 256; SepInc(128, 192, 96, 64), 480 | 1/2 | 1/8 |
| MaxPooling3D-3 | (3 × 3 × 3, 1 × 2 × 2), 480 | 1/2 | 1/16 |
| InceptionBlock-2 | SepInc(192, 208, 48, 64), 512; SepInc(160, 224, 64, 64), 512; SepInc(128, 256, 64, 64), 512; SepInc(112, 288, 64, 64), 528; SepInc(256, 320, 128, 128), 832 | 1/2 | 1/16 |
| MaxPooling3D-4 | (2 × 2 × 2, 2 × 2 × 2), 832 | 1/4 | 1/32 |
| InceptionBlock-3 | SepInc(256, 320, 128, 128), 832; SepInc(384, 384, 128, 128), 1024 | 1/4 | 1/32 |
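The SepConv3D layers in Table 1 follow the separable-convolution idea of S3D [32], which the paper cites: a full 3D kernel is factorized into a spatial (1 × k × k) convolution followed by a temporal (k × 1 × 1) convolution. The sketch below reproduces the SepConv3D-1 row under that assumption; the exact ordering, normalization, and activation used in STA-3D are not specified in this excerpt.

```python
import torch
import torch.nn as nn

class SepConv3D(nn.Module):
    """Separable 3D convolution in the spirit of S3D [32]: a (1, k, k) spatial
    convolution followed by a (k, 1, 1) temporal convolution. The exact
    ordering/normalization in STA-3D may differ from this sketch."""

    def __init__(self, in_ch, out_ch, kernel=(7, 7, 7), stride=(2, 2, 2)):
        super().__init__()
        kt, kh, kw = kernel
        st, sh, sw = stride
        self.spatial = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, (1, kh, kw), stride=(1, sh, sw),
                      padding=(0, kh // 2, kw // 2), bias=False),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))
        self.temporal = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, (kt, 1, 1), stride=(st, 1, 1),
                      padding=(kt // 2, 0, 0), bias=False),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):            # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))


if __name__ == "__main__":
    # Mirrors the SepConv3D-1 row of Table 1: kernel 7x7x7, stride 2, 64 filters.
    layer = SepConv3D(3, 64)
    clip = torch.randn(1, 3, 16, 224, 224)
    print(layer(clip).shape)         # torch.Size([1, 64, 8, 112, 112])
```

The halving of both the temporal and spatial resolutions in the example matches the 1/2 entries reported for SepConv3D-1 in Table 1.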
Table 2. Comparison of parameter complexity across attention modules; the Parameter Count column gives the concrete numbers obtained under the settings in Section 4.2.

| Method | Parameter Formula | Parameter Count |
|---|---|---|
| SE Block [52] | 2C²/r | 131.07 K |
| CBAM [30] | 2C²/r + 2k² | 131.17 K |
| Proposed Attention Module | 8k² | 96 |
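The SE and CBAM counts in Table 2 can be reproduced directly from the formulas; the snippet below assumes C = 1024 (the channel width after InceptionBlock-3 in Table 1), reduction ratio r = 16, and kernel size k = 7, which are common defaults and consistent with the reported numbers, although the authoritative settings are in Section 4.2 (not shown here).

```python
# Reproducing the SE and CBAM parameter counts of Table 2.
# Assumed settings (not stated in this excerpt): C = 1024 channels,
# reduction ratio r = 16, spatial kernel k = 7.
C, r, k = 1024, 16, 7

se_params = 2 * C * C // r                 # SE block: two FC layers, C -> C/r -> C
cbam_params = 2 * C * C // r + 2 * k * k   # CBAM adds a kxk conv on a 2-channel map

print(f"SE   : {se_params / 1e3:.2f} K")   # 131.07 K, matching Table 2
print(f"CBAM : {cbam_params / 1e3:.2f} K") # 131.17 K, matching Table 2
```

Because the proposed module's cost (8k²) depends only on its small convolution kernels and not on the channel width C, its count (96 parameters) is roughly three orders of magnitude smaller than that of SE or CBAM.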
Table 3. Dataset overview: number of real and fake videos for training and testing. Note the imbalance between real and fake samples.

| Dataset | Usage | Real/Fake Videos | Fake-to-Real Ratio |
|---|---|---|---|
| FaceForensics++ (FF++) | Train | 2160/10,800 | 5.00 |
| FaceForensics++ (FF++) | Test | 420/2100 | 5.00 |
| DFDC-Preview (DFDC-P) | Test | 276/501 | 1.82 |
| Celeb-DF | Test | 178/340 | 1.91 |
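The fake-to-real imbalance summarized in Table 3 is what motivates replacing standard cross entropy with focal loss [34] during training. A minimal binary focal-loss sketch following Lin et al. [34] is given below; the α = 0.25 and γ = 2 values are the defaults from [34], not necessarily the settings chosen for STA-3D.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss of Lin et al. [34]: down-weights easy examples so the
    over-represented forged class (Table 3) does not dominate training.
    alpha/gamma are the defaults from [34], not necessarily STA-3D's values."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing term
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


if __name__ == "__main__":
    logits = torch.tensor([2.0, -1.0, 0.5])
    labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = fake, 0 = real (illustrative)
    print(binary_focal_loss(logits, labels))
```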
Table 4. Overall model complexity under different input frame settings. The inference time is provided as a reference value, as it can be affected by multiple factors.

| Frames | Trainable Parameters | GPU Memory Usage (MB) | Inference Time (s) | Computational Complexity (FLOPs) |
|---|---|---|---|---|
| 8 | 7.96 M | 1535 | 0.86 | 10.72 G |
| 16 | 7.96 M | 3009 | 1.03 | 21.44 G |
| 32 | 7.96 M | 5958 | 1.75 | 42.88 G |
Table 5. Accuracy comparison with state-of-the-art methods on FF++ under different compression levels. Best and second-best results are annotated with double underlines and single underline, respectively. ↓ means lower is better.

| Categories | Methods | C0 | C23 | C40 | Δ (C23-C40) ↓ | Avg |
|---|---|---|---|---|---|---|
| Frame-based | Steg.Features [81] | 97.63 | 70.97 | 55.98 | 14.99 | 74.86 |
| | ResNet-50 [82] | | 98.04 | 91.61 | 6.43 | 94.83 |
| | Meso-4 [3] | 96.4 | | | | 96.4 |
| | MesoInception-4 [3] | 94.08 | 86.09 | 75.15 | 10.94 | 85.11 |
| Frequency-based | Two-branchRN [84] | | 96.43 | 86.34 | 10.09 | 91.39 |
| | F3-Net-Xception [7] | 99.95 | 97.52 | 90.43 | 7.09 | 95.97 |
| | fCNN [85] | 87.48 | 83.78 | 72.49 | 11.29 | 81.25 |
| | SPSL [86] | | 91.5 | 81.57 | 9.93 | 86.54 |
| Spatial–temporal | C3D [87] | | 90.72 | 86.79 | 3.93 | 88.75 |
| | I3D [66] | | 93.13 | 86.88 | 6.25 | 90.00 |
| | SlowFast [88] | | | 92.48 | | 92.48 |
| | TEINet [89] | | 96.79 | 92.77 | 4.02 | 94.78 |
| | ADDNet-3D [90] | | 86.70 | 79.47 | 7.23 | 83.08 |
| | Ma et al. [29] | 98.79 | 95.92 | 91.49 | 4.43 | 95.40 |
| | Ours | 99.17 | 97.62 | 94.40 | 3.22 | 97.06 |
Table 6. Detailed accuracy by forgery type and compression level. Abbreviations with corresponding full methods: DF (DeepFakes), FS (FaceSwap), F2F (Face2Face), NT (NeuralTextures). Best and second-best results are annotated with double underlines and single underline, respectively.

| Methods | C0 DF | C0 FS | C0 F2F | C0 NT | C23 DF | C23 FS | C23 F2F | C23 NT | C40 DF | C40 FS | C40 F2F | C40 NT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [82] | | | | | 98.93 | 99.64 | 98.93 | 95.00 | 95.36 | 94.64 | 88.93 | 87.5 |
| [3] ¹ | 96.37 | 98.17 | 97.75 | 93.30 | | | | | | | | |
| [3] ² | 88.34 | 97.81 | 97.65 | 92.52 | 83.47 | 94.34 | 94.34 | 75.06 | 74.20 | 79.72 | 78.75 | 67.94 |
| [85] | 87.79 | 89.28 | 85.37 | | 85.24 | 85.03 | 85.03 | | 79.24 | 68.88 | 69.35 | |
| [87] | | | | | 92.86 | 91.79 | 91.79 | 89.64 | 89.29 | 87.86 | 82.86 | 87.14 |
| [66] | | | | | 92.86 | 96.43 | 96.43 | 90.36 | 91.07 | 91.43 | 86.43 | 78.57 |
| [88] | | | | | | | | | 97.50 | 95.00 | 94.90 | 82.50 |
| [89] | | | | | 97.86 | 97.50 | 97.50 | 94.29 | 95.00 | 94.64 | 91.07 | 90.36 |
| [90] | | | | | 92.14 | 92.50 | 83.93 | 78.21 | 90.36 | 80.00 | 78.21 | 69.29 |
| [29] | 99.60 | 98.84 | 99.12 | 97.58 | 98.43 | 94.34 | 94.34 | 93.45 | 96.97 | 90.63 | 93.80 | 84.54 |
| Ours | 99.29 | 99.29 | 99.29 | 98.69 | 97.62 | 97.62 | 98.22 | 97.62 | 93.22 | 94.41 | 95.00 | 94.41 |

¹ Meso-4 variant from [3]; ² MesoInception-4 variant from [3].
Table 7. Ablation study on frame numbers and attention mechanism across compression levels (C0/C23/C40) on FF++. Values represent detection accuracy (%).

| Frames | C0 w/ attn | C0 w/o attn | C23 w/ attn | C23 w/o attn | C40 w/ attn | C40 w/o attn | Avg w/ attn | Avg w/o attn |
|---|---|---|---|---|---|---|---|---|
| 8 | 96.55 | 97.74 | 92.50 | 93.33 | 90.00 | 91.67 | 93.02 | 94.25 |
| 16 | 99.17 | 99.64 | 97.62 | 98.10 | 94.40 | 93.10 | 97.06 | 96.95 |
| 32 | 99.64 | 99.40 | 98.33 | 97.26 | 92.86 | 91.07 | 96.94 | 95.91 |
| Avg | 98.45 | 98.93 | 96.15 | 96.23 | 92.42 | 91.95 | | |
Table 8. Comparison of different attention modules with 8-frame input on the FF++ dataset across compression levels (values in %).

| Attention Module | C0 | C23 | C40 | Avg |
|---|---|---|---|---|
| SE [52] | 96.19 | 92.26 | 89.40 | 92.62 |
| CBAM [30] | 92.14 | 79.64 | 81.55 | 84.44 |
| Ours | 96.55 | 92.50 | 90.00 | 93.02 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
