Article

Deepfake Face Detection and Adversarial Attack Defense Method Based on Multi-Feature Decision Fusion

by Shanzhong Lei, Junfang Song *, Feiyang Feng, Zhuyang Yan and Aixin Wang
College of Information Engineering, Xizang Minzu University, No. 6, East Section of Wenhui Road, Weicheng District, Xianyang 712082, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6588; https://doi.org/10.3390/app15126588
Submission received: 29 April 2025 / Revised: 5 June 2025 / Accepted: 8 June 2025 / Published: 11 June 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
The rapid advancement of deep forgery technology in recent years has created highly deceptive face video content, posing significant security risks. Detecting these fakes is increasingly urgent and challenging. To improve the accuracy of deepfake face detection models and strengthen their resistance to adversarial attacks, this manuscript introduces a method for detecting forged faces and defending against adversarial attacks based on multi-feature decision fusion. This approach allows for rapid detection of fake faces while effectively countering adversarial attacks. Firstly, the improved MTCNN (IMTCNN) network was employed to precisely extract facial features, complemented by a diffusion model for noise reduction and artifact removal. Subsequently, the FG-TEFusionNet (Facial-geometry and Texture enhancement fusion-Net) model was developed for deepfake face detection and assessment. This model comprises two key modules: one for extracting temporal features between video frames and another for spatial features within frames. Initially, a facial geometry landmark calibration module based on the LRNet baseline framework ensured an accurate representation of facial geometry. A SENet attention mechanism was then integrated into the dual-stream RNN to enhance the model’s capability to extract inter-frame information and derive preliminary assessment results based on inter-frame relationships. Additionally, a Gram image texture feature module was designed and integrated into EfficientNet together with the attention maps of WSDAN (Weakly Supervised Data Augmentation Network). This module extracts deep-level feature information from the texture structure of image frames, addressing the limitations of purely geometric features. The final decisions from both modules were integrated using a voting method, completing the deepfake face detection process. Ultimately, the model’s robustness was validated by generating adversarial samples using the I-FGSM algorithm and optimizing model performance through adversarial training. Extensive experiments demonstrated the superior performance and effectiveness of the proposed method across four subsets of FaceForensics++ and the Celeb-DF dataset.

1. Introduction

Against the backdrop of booming social media and digital technologies, the rise of deep forgery techniques poses a serious threat to information security and social stability. Highly realistic synthesized images and audio have become the vanguard of disinformation dissemination and privacy invasion, challenging personal privacy and information security. With the advancement of deep learning, deep face forgery is gradually becoming a new security challenge, and the performance of traditional detection methods grows more and more vulnerable in the face of highly realistic forged images. In recent years especially, the rise of advanced technologies such as the Generative Adversarial Network (GAN) [1] and the Diffusion Model (DM) [2] has made synthetic face images increasingly realistic and difficult to distinguish from real ones, so that traditional face detection methods face serious challenges in recognizing deeply faked images. At the same time, the emergence of adversarial attacks has provided new tools and strategies to the producers of forged images, making it even more difficult for existing detection systems to identify them. There is therefore an urgent need for researchers to develop more efficient and robust deep forgery face detection techniques to secure digital identity and privacy. Deep forgery face detection is a key application in the field of intelligent visual security surveillance: it can more effectively identify and block attempts to evade surveillance systems through fake face images, and it underpins applications such as authentication and security, social media and online platforms, video forensics and investigations, and personal privacy protection. Fake face videos may be harmless in intent, and video generation research has advanced across industries; however, once such videos are maliciously used to disseminate false information, harass individuals, or defame celebrities, they spread widely on social platforms around the world and severely undermine the credibility of digital media. As a result, fake face video detection has become a key challenge in the field of AI security. The emergence of sample adversarial attacks has further raised the difficulty of detection: these attacks introduce tiny perturbations that alter the original image almost imperceptibly yet effectively spoof deep forgery detection models, distorting their outputs to avoid detection.
Currently, detection models for deepfake face videos mostly use CNN methods that target the facial features in each frame and CNN+LSTM methods that focus on the features between consecutive frames. Xing et al. [3] proposed a deepfake face video detection model based on 3D CNNs, which exploits inconsistencies between the time-domain and spatial-domain features of deepfake face videos to achieve higher detection accuracy and robustness. Fu et al. [4] first revealed that deepfake detection’s generalization challenges stem not only from forgery method discrepancies but also from position bias (over-relying on specific image regions) and content bias (misusing irrelevant information). They proposed a transformer-based feature-reorganization framework that eliminates biases through latent space token rearrangement and mixing, significantly enhancing cross-domain generalization across benchmarks. Siddiqui et al. [5] proposed integrating vision transformers with DenseNet-based neural feature extractors, achieving state-of-the-art vision transformer performance without relying on knowledge distillation or ensemble methods. Frank et al. [6] found that the discrete cosine transform (DCT) of deepfake images and real images shows significant differences in the frequency domain. Sabir et al. [7] designed a framework for detecting deepfake videos using inter-frame temporal information, which achieved the SOTA at that time. Gu et al. [8] utilized the spatio-temporal inconsistency in deepfakes to propose the three templates SIM, TIM, and ISM, which form the STIL block plug-in module; this module can be inserted into a convolutional neural network to extract and fuse spatio-temporal information and output the deep forgery detection results. Kroiß et al. [9] implemented efficient synthetic/fake facial image detection using a pre-trained ResNet-50 architecture modified with adapted fully connected output layers, trained via transfer learning and fine-tuning on their “Diverse Facial Forgery Dataset”. R et al. [10] introduced TruthLens, a semantic-aware interpretable deepfake detection framework that simultaneously authenticates images and provides granular explanations (e.g., “eye/nose authenticity”), uniquely addressing facial manipulation deepfakes through unified multi-scale forensic analysis validated via multi-dataset experiments. Cheng et al. [11] proposed a directional progressive learning framework redefining hybrid forgeries as pivotal anchors in the “real-to-deepfake” continuum. They systematically implemented a Directional Progressive Regularizer (OPR) to enforce discrete anchor distributions and a feature bridging module for smooth transition modeling, demonstrating enhanced forgery information utilization through extensive experiments. Choi et al. [12] detected temporal anomalies in synthetic videos’ style latent vectors using StyleGRU and style attention modules, experimentally validating their cross-dataset robustness and the critical role of temporal dynamics in detection generalization. Cozzolino et al. [13] demonstrated the breakthrough potential of pre-trained vision-language models (CLIP) for cross-architecture AI-generated image detection under few-shot training, achieving state-of-the-art in-domain performance through a lightweight detection strategy with a 13% robustness improvement. VTD-Net [14] is a frame-based deep forgery video identification method using CNN, Xception [13], and LSTM [15].
In VTD-Net, faces are extracted from video frames using a multi-task cascaded CNN, and the Xception network is then used to learn the distinguishing features between real and fake faces. Coccomini et al. [16] further improved accuracy in deep forgery identification tasks by combining the model with the Vision Transformer [17]. Chen et al. [18] proposed DiffusionFake, enhancing detection generalization by reverse-engineering forgery generation through feature injection into frozen pre-trained diffusion models for source/target image reconstruction, effectively disentangling forgery features to improve cross-domain robustness. Zhao et al. [19] developed an Interpretable Spatiotemporal Video Transformer (ISTVT) featuring a novel spatiotemporal self-attention decomposition mechanism and a self-subtraction module to capture spatial artifacts and temporal inconsistencies, enabling robust deepfake detection. MesoNet [20] distinguishes whether content is forged by detecting the mid-level semantics of forged faces in a video; this approach can automatically and efficiently detect forged videos generated by methods such as Deepfake and Face2Face. Li et al. [21] proposed the Face X-Ray model for forgery detection by determining the boundaries of face fusion. Liu et al. [22] emphasized the importance of transferability, i.e., cross-dataset detection accuracy, and achieved the best transfer performance in forgery detection. Wang et al. [23] used a multiscale vision transformer to capture the local inconsistencies present in faces at different scales, showed strong robustness against compression algorithms, and achieved significant results on mainstream datasets. Lu et al. [24] proposed a new long-range attention mechanism to capture global semantic inconsistencies in forgery samples, which reduces model complexity and achieves good detection results.
Therefore, to address the above problems, this manuscript proposes an improved deep forgery face detection method that targets the main weaknesses of existing deep-learning-based detectors: noise artifacts, insufficient feature extraction capability in deep forgery identification tasks, and poor detection performance under sample adversarial attacks. The framework of the proposed deepfake face detection and adversarial attack defense method based on multi-feature decision fusion is shown in Figure 1. The method uses the improved IMTCNN to accurately extract faces and a diffusion model to denoise and de-artifact the forged face data; adds the SENet attention mechanism, WSDAN, and an image texture enhancement module to the FG-TEFusionNet network to improve the deep neural network’s acquisition of deep information and texture features of faces in video images; generates adversarial samples with the I-FGSM attack algorithm during the training process; and then applies adversarial training for defense, so that the accuracy, robustness, and security of the method can be studied and analyzed on the experimental datasets.
Figure 1 illustrates the overall framework and workflow of the proposed method, with each component linked to the main contributions; to enhance the method’s reproducibility and clarity, this paper adopted a four-stage processing workflow: (1) input preprocessing resizes raw video frames to 384 × 384 pixels and normalizes them, locates facial features via the IMTCNN network, and achieves denoising and artifact removal using a diffusion model; (2) the spatiotemporal feature extraction stage processes geometric and texture feature streams in parallel, where the geometric branch calibrates key points with LRNet and calculates inter-frame attention through SENet, while the texture branch extracts features via EfficientNet and detects spectral anomalies through the collaboration of Gram matrices and WSDAN modules; (3) adversarial optimization generates adversarial samples based on I-FGSM to update model weights, yielding preliminary inter-frame/intra-frame detection results; (4) decision fusion output produces frame-level authenticity predictions. This structured workflow complements the visualization in Figure 1 to ensure algorithmic transparency.
The main contributions of this manuscript are as follows:
(1) The MTCNN architecture was systematically improved (IMTCNN) to achieve high-precision facial feature extraction, complemented by a diffusion model that effectively suppresses noise and artifacts. Qualitative evaluations and quantitative metrics jointly validate the enhanced operational efficacy of this optimized framework.
(2) A novel multi-feature decision fusion model named FG-TEFusionNet is proposed for deepfake detection, which consists of two specialized modules: the SEAR-Net and the TE-WSDAN-DDNet. The SEAR-Net enhances inter-frame dependency modeling by integrating SENet attention mechanisms into a dual-stream RNN architecture based on the LRNet baseline, enabling preliminary predictions through frame-sequence correlation analysis. Simultaneously, the TE-WSDAN-DDNet embeds a Gram image texture module within the EfficientNet backbone, fusing feature maps from the Weakly Supervised Data Augmentation Network (WSDAN) to overcome geometric method limitations through deep texture pattern extraction. A voting mechanism synergistically combines geometric and textural features to generate final detection results, achieving state-of-the-art (SOTA) performance on the FaceForensics++ and Celeb-DF datasets.
(3) An adversarial training methodology was implemented to enhance defense robustness by incorporating I-FGSM-generated adversarial samples during training. The experimental results indicate that under adversarial training conditions, the success rate of adversarial attacks on the model significantly decreases, effectively improving the model’s detection accuracy.

2. Related Works

In the fields of computer vision and network security, deepfake face detection remains a fundamental challenge. In recent years, rapid advancements in deep generative models have enabled the creation of highly realistic and indistinguishable content, such as manipulated, forged, or synthesized images. This development has introduced new challenges and security threats to information security and societal stability. Consequently, there is an urgent need to explore deepfake detection methods that offer high efficiency, precision, and robust security to mitigate the potential risks associated with this mature technology. Currently, deepfake detection is predominantly approached as a binary classification problem, focusing on subtle local features of deepfake faces, such as minor variations in facial geometry, temporal consistency, and changes in feature textures. Effective feature extraction and meticulous model design are crucial for enhancing accuracy and detection efficacy. In this section, we provide a concise overview of prior research directly pertinent to our work.

2.1. Deepfakes

Current mainstream deepfake technologies primarily encompass face-swapping techniques such as FaceSwap [25] and Deepfakes [25], the use of generative adversarial networks (GANs) and diffusion models for generating entirely fake faces, and facial reenactment techniques based on facial expression transfer. Figure 2 provides a schematic classification of these deepfake face technologies. Identity replacement involves employing deep learning methods to substitute the facial shape and features of the source image with those of the target image. Facial reenactment is the process of achieving face forgery by altering facial expressions while maintaining the identity information of the face. Attribute editing entails modifying facial appearance attributes such as age, lips, and skin color. These three methods manipulate real faces. Another method generates realistic fake faces using GANs or DDPMs with labels or noise information. Facial forgery technology has significantly compromised individuals’ privacy and has even been used for illegal purposes, posing a considerable threat to social security.

2.2. Deepfake Detection

Traditional methods for detecting fake faces rely heavily on manually designed features and rules, yet they perform inadequately when confronted with complex deepfake videos. In recent years, deep-learning-based methods have emerged as the mainstream, capable of autonomously learning and extracting features from images and videos to effectively detect deepfakes. This section categorizes deepfake video detection according to the features and analytical methods applied within and between frames of fake videos, distinguishing between frame-level and video-level information, as shown in Figure 3.
In the realm of frame-level detection, much of the effort in deepfake detection focuses on methods analyzing individual image frames to identify irregularities and flaws. Typically, deep learning techniques are employed to distinguish between authentic and manipulated images. Zhao et al. [26] introduced an attention-mechanism-based approach for extracting artifact features from images and achieving fine-grained classification, yielding promising detection outcomes. Furthermore, Hu et al. [27] integrated convolutional attention modules that combine channel and spatial attention for detecting manipulated faces, leveraging attention mechanisms to enhance the model’s focus on crucial image details and improve its ability to discern inconsistencies surrounding manipulated content. Li et al. [28] proposed an adaptive frequency feature generation module to extract discriminative features from different frequency bands in a trainable manner. They also introduced the Single Center Loss (SCL) to enhance differentiation between real and fake faces. In the realm of video-level detection, the appeal of temporal-feature-based deepfake detection is significantly amplified by the intuition that videos, unlike images, inherently encapsulate richer information. Boosting detection accuracy involves leveraging spatiotemporal cues embedded within video sequences. These methodologies commonly entail optical flow analysis of video frames, validation of motion consistency, and statistical examination of video duration to pinpoint anomalies and inconsistencies in deepfake videos. Zheng et al. [29] discovered that by setting the temporal convolution kernel size to 1 within 3D convolutional kernels, the network’s ability to represent temporal information can be enhanced, allowing for the detection of temporal inconsistencies in forged videos. This approach exhibits excellent detection performance when confronted with unknown forgery methods. Shao et al. [30] introduced a novel research problem for multimodal fake media, termed Detecting and Grounding Multi-Modal Media Manipulation (DGM4). DGM4 aims not only to ascertain the authenticity of multimodal media but also to detect manipulated content, thereby facilitating the identification of video forgeries and ownership authentication.

2.3. MTCNN Network Model

The Multi-Task Convolutional Neural Network (MTCNN) is a deep learning model extensively employed for face detection and alignment. It is composed of three subnetworks, each dedicated to a specific task: face detection, key point localization, and bounding box regression. Each network contributes to the efficient detection and precise alignment of faces. Initially, the Proposal Network (P-Net) utilizes a convolutional neural network to concurrently generate candidate boxes and associated face confidence scores. P-Net effectively filters potential face regions, serving as input for subsequent stages. Following this, the Refinement Network (R-Net) further refines the candidate boxes produced by P-Net, enhancing the accuracy and stability of face detection. R-Net leverages a deeper neural network architecture to meticulously process and filter the candidate boxes, ensuring that the final detection results are both accurate and reliable. Finally, the output Network (O-Net) is tasked with the ultimate face landmark regression and bounding box refinement. O-Net not only pinpoints facial key points but also precisely adjusts the position and size of each face box, leading to superior alignment accuracy and effective facial feature extraction. MTCNN accomplishes the comprehensive process from initial candidate box generation to final face alignment and feature extraction through these three sequential subnetworks. Its architectural design and end-to-end processing enable rapid and accurate face detection, even in complex environments, providing a dependable foundation for subsequent face recognition and analysis tasks. MTCNN, as a typical face detection and alignment model, plays a crucial role in the field of artificial intelligence due to its effective multitasking capabilities and optimized network architecture. In this paper, the face detection and alignment method was improved based on the MTCNN framework. The receptive field of the images was expanded, enabling a more accurate capture of facial information in scenes. This enhancement provides more precise feature information for subsequent facial feature extraction tasks.

2.4. SENet Channel Attention Mechanism

In deep learning models, the richness of extracted features increases with the number of parameters, significantly improving the model’s accuracy and robustness. However, this parameter expansion can lead to information overload, negatively impacting model training and real-world performance. To address this challenge, an attention mechanism is introduced in this study. This mechanism allows the model to swiftly select high-value feature information, enabling it to accurately focus on critical information while effectively mitigating the interference of irrelevant data. Typical attention mechanisms are categorized into three main types: channel attention, spatial attention, and temporal attention. These mechanisms enable the model to assign varying weights to different positions in the input sequence, allowing it to concentrate on the most relevant parts when processing each sequence element. Accordingly, this study incorporates a SENet channel attention mechanism, which enhances the feature information learned within frames of forged videos by the attention dual-stream RNN. This approach also facilitates the effective collection of local features for the deepfake detection task, compelling the network to attend to diverse local information and thereby improving the accuracy of forgery detection.
The SENet attention optimization network model is described as follows. The fundamental concept of SENet involves the introduction of a lightweight attention module, which comprises three primary operations: Squeeze, Excitation, and Scale. The architecture of the SENet attention mechanism is illustrated in Figure 4.
SENet utilizes the Squeeze and Excitation operations to extract inter-channel relationships and recalibrate the convolutional input accordingly. This process involves extracting convolutional features between channels and using these features to adjust the feature maps produced by the original network, as detailed below.
The Squeeze operation is represented by the function $F_{sq}$ in Figure 4. This operation compresses channel features by applying global average pooling, which reduces multidimensional data to a single scalar value. This scalar represents the overall importance of the features within each channel, thereby capturing the global information for each channel. The Squeeze operation is illustrated in Equation (1):
$$ z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} u_c(x, y) \tag{1} $$
In Equation (1), $z_c$ represents the result of the Squeeze operation; $u_c$ denotes the input feature map matrix; $H$ and $W$ indicate the height and width of the feature map, respectively; and $x$ and $y$ index the rows and columns of the input matrix.
The Excitation component is represented by $F_{ex}(\cdot, W)$ in Figure 4. The Excitation operation functions as a gating mechanism to generate a feature weight for each feature channel. In this process, the features obtained from the Squeeze operation pass through a fully connected structure, which involves two fully connected operations, a ReLU activation function, and a Sigmoid activation function. This design enables the model to automatically learn the importance of each channel’s features. The Excitation operation is described in Equation (2):
$$ s_c = F_{ex}(z_c, W) = \sigma\big(g(z_c, W)\big) = \sigma\big(W_2\, \delta(W_1 z_c)\big) \tag{2} $$
In Equation (2), $s_c$ denotes the result of the Excitation operation, $\delta$ represents the ReLU activation function, $\sigma$ represents the Sigmoid activation function, and $W_1$ and $W_2$ refer to the weights of the two fully connected layer operations.
The Scale component is represented by $F_{scale}(\cdot, \cdot)$ in Figure 4. This component recalibrates the feature map by multiplying the weights obtained from the Excitation operation with the features of the original channels, resulting in the weighted output features. It is described in Equation (3):
$$ \text{output} = F_{scale}(u_c, s_c) = u_c \times s_c \tag{3} $$
Within the SENet architecture, the input tensor is defined as $X \in \mathbb{R}^{B \times C \times T \times H \times W}$ (batch size $B$, channel dimension $C$, temporal frames $T$, spatial resolution $H \times W$). Global average pooling is first applied to extract channel-wise statistics $S \in \mathbb{R}^{B \times C}$, followed by the generation of attention weight vectors $a \in \mathbb{R}^{B \times C}$ through two-stage fully connected (FC) layers: the first FC layer with ReLU activation models inter-channel dependencies, while the second FC layer with Sigmoid activation normalizes the attention weights. Crucially, parameter sharing across the temporal dimension $T$ is implemented in the FC layers, preserving temporal consistency while reducing computational complexity.
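To make the Squeeze–Excitation–Scale pipeline above concrete, the following minimal PyTorch sketch shows a channel-attention block of this form; the module name, the reduction ratio r = 16, and the tensor sizes are illustrative assumptions rather than the exact configuration used in this paper.
```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation block: GAP -> FC+ReLU -> FC+Sigmoid -> channel rescaling."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # Squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # Excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)                    # z_c in Eq. (1)
        s = self.fc(z).view(b, c, 1, 1)                # s_c in Eq. (2)
        return x * s                                   # Scale: Eq. (3)

# For video input (B, C, T, H, W), the same FC weights can be shared across T
# by folding the temporal axis into the batch dimension before calling the block.
x = torch.randn(2, 64, 32, 32)
y = SEBlock(64)(x)                                     # y has the same shape as x
```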

2.5. WSDAN Network Model

In practical applications of deepfake face detection, random data augmentation often introduces uncontrollable noise and interference, which can hinder the model’s learning capacity and negatively impact feature extraction, leading to suboptimal detection performance. Compared to traditional data augmentation methods, random augmentation is typically less efficient in handling small objects and can introduce background noise. To address these challenges, reference [31] proposes the Weakly Supervised Data Augmentation Network (WSDAN). The model leverages attention maps to augment data more effectively, enhancing feature extraction and data augmentation capabilities. This approach improves the accuracy and generalization ability of the detection model.
During the model training process, preprocessed images are denoted as $S$, which are then fed into the Xception weakly supervised network for data augmentation, producing feature maps $F$. Ref. [31] employs Bilinear Attention Pooling (BAP) to extract features. As depicted in Equation (4), feature fusion involves element-wise multiplication of the feature maps $F$ and attention map $A$, yielding values for the new feature maps. This element-wise multiplication enables the network to selectively emphasize local regions, significantly enhancing its capability to capture subtle features. Moreover, this strategy plays a constructive role by mitigating the network’s tendency to overfit irrelevant features, thereby ensuring that the feature matrix contains fewer disruptive elements and provides more precise feature descriptions.
$$ F_i = F \odot A_i, \quad i = 1, 2, \ldots, \pi \tag{4} $$
According to Equation (5), the attention map feature $f_i^A$ is derived using the Global Average Pooling (GAP) method, where $\odot$ represents element-wise multiplication. Following this, the attention feature matrix $M$ is constructed, as depicted in Equation (6).
$$ f_i^A = G_{avgPooling}(F_i), \quad i = 1, 2, \ldots, \pi \tag{5} $$
$$ M = (A, F) = \begin{pmatrix} G_{avgPooling}(F_{a_1}) \\ G_{avgPooling}(F_{a_2}) \\ \vdots \\ G_{avgPooling}(F_{a_\pi}) \end{pmatrix} = \begin{pmatrix} f_1^A \\ f_2^A \\ \vdots \\ f_\pi^A \end{pmatrix}, \quad i = 1, 2, \ldots, \pi \tag{6} $$
Regarding training images, as suggested in reference [31], an attention normalization method has been devised. This method strategically focuses attention on specific regions and normalizes them according to Equation (7), guiding the data augmentation process to produce enhanced data. Thus, it strategically directs attention to the most crucial areas of the image, offering more advantageous guidance for model training. Empirical [31] evidence shows that during the data augmentation phase, this attention normalization strategy enhances the model’s performance by improving its capacity to assimilate and discern critical features.
$$ \bar{A}_i = \frac{A_i - \min(A_i)}{\max(A_i) - \min(A_i)} \tag{7} $$
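The attention pooling and normalization of Equations (4)–(7) can be sketched as follows; this is a generic illustration assuming the attention maps A are produced elsewhere in the network, and the function names are hypothetical.
```python
import torch

def bilinear_attention_pooling(F: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """F: feature maps (B, C, H, W); A: attention maps (B, M, H, W).
    Returns the attention feature matrix of shape (B, M, C), one row per Eq. (5)."""
    parts = []
    for i in range(A.shape[1]):
        Fi = F * A[:, i:i + 1]                    # element-wise product, Eq. (4)
        fi = Fi.flatten(2).mean(dim=2)            # global average pooling, Eq. (5)
        parts.append(fi)
    return torch.stack(parts, dim=1)              # feature matrix M, Eq. (6)

def normalize_attention(Ai: torch.Tensor) -> torch.Tensor:
    """Min-max normalization of a single attention map, Eq. (7)."""
    return (Ai - Ai.min()) / (Ai.max() - Ai.min() + 1e-8)
```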

3. Multi-Feature Extraction

3.1. Extraction of Facial Geometric Features Based on Inter-Frame Information

In the realm of deepfake detection, the crucial task lies in accurately distinguishing genuine facial images from synthesized deepfake counterparts by extracting and analyzing facial geometric features such as contours, eyes, nose, and mouth structure. Researchers typically tackle this challenge by analyzing temporal features between video frames, albeit facing inherent challenges. Thus, there is a pressing need to explore temporal features specific to manipulated facial videos to bolster the accuracy and resilience of deepfake detection methods. This paper adopts LRNet (Landmark Recurrent Network), a foundational model proposed by Sun et al. [32], for extracting geometric features. LRNet leverages facial landmarks’ geometric attributes as discriminative features and introduces an effective facial landmark alignment module aimed at refining the precision of facial frames in videos. It mitigates noise and jitter induced by video motion or compression artifacts, ensuring high geometric fidelity of facial landmarks fed into subsequent feature extraction networks. This study addresses and improves upon the limitations of the LRNet model.

3.2. Extraction of Image Texture Information Based on Intra-Frame Information

Deepfake-generated synthetic media achieves visual realism through localized texture manipulation in facial regions. This methodology has gained prominence due to its effectiveness in enhancing visual authenticity through detailed feature reproduction, including micro-expressions, skin texture variations, and lighting effects. Conventional detection approaches typically employ direct texture analysis through comparative evaluation of pixel values and localized texture patterns. However, these techniques are frequently challenged by advanced generation methods that employ region-specific texture modifications capable of deceiving traditional analytical frameworks. Current detection systems demonstrate limited robustness when confronted with sophisticated manipulation algorithms that maintain structural consistency while altering subtle textural properties.
Gatys et al. [33] proposed using the Gram matrix as a method for describing image textures. To address this limitation, we introduce the Gram matrix as a means to capture the global texture features of an image. Unlike traditional methods that analyze the texture of individual local regions, the Gram matrix provides a more abstract and comprehensive image description by comparing the texture relationships across all regions of the image. Specifically, the Gram matrix utilizes the covariance matrix of feature maps to describe the texture correlations between different locations, thereby capturing the overall texture structure of the image within a broader spatial context. This approach to extracting global texture features is more effective for detecting deepfakes because, even when local textures are fine-tuned, the overall texture structure and relationships retain a certain degree of stability and distinctiveness. Furthermore, current convolutional neural networks often fail to capture long-range information due to their limited receptive fields, which leads to reduced performance on lower-resolution images. By incorporating auxiliary networks to gather texture information, the ability to capture long-range information is enhanced, thereby improving the model’s generalization capability. The Gram matrix is defined as shown in Equation (8).
$$ \Delta(\alpha_1, \alpha_2, \ldots, \alpha_k) = \begin{pmatrix} \langle \alpha_1, \alpha_1 \rangle & \langle \alpha_1, \alpha_2 \rangle & \cdots & \langle \alpha_1, \alpha_k \rangle \\ \langle \alpha_2, \alpha_1 \rangle & \langle \alpha_2, \alpha_2 \rangle & \cdots & \langle \alpha_2, \alpha_k \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle \alpha_k, \alpha_1 \rangle & \langle \alpha_k, \alpha_2 \rangle & \cdots & \langle \alpha_k, \alpha_k \rangle \end{pmatrix} \tag{8} $$
Equation (8) defines the matrix composed of the pairwise inner products of any $k$ vectors in an $n$-dimensional Euclidean space, known as the Gram matrix of these $k$ vectors. This paper employs the shallow and deep structures of the EfficientNet backbone network to extract global texture features of images. By calculating the Gram matrix from the feature vectors, the hidden relationships between image features can be revealed. The calculation method is detailed in Equation (9). The computational framework standardizes input tensor dimensions as (B, C, H, W), where B denotes batch size, C represents channel count, and H/W correspond to spatial dimensions. For the Gram matrix operation, the spatial dimensions of each channel are flattened into a single dimension of size H × W, so the intermediate representation has shape (B, C, H × W), while the backbone feature maps retain their (B, C, H, W) dimensions.
$$ Gram_{i,j}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l} \tag{9} $$
In Equation (9), $F^{l}$ denotes the feature maps at layer $l$, with $F_{ik}^{l}$ representing the $k$-th element of the $i$-th feature map at that layer. Each element in the feature map indicates the intensity of a particular feature. $Gram_{i,j}^{l}$ describes the entire feature map and is not constrained by the receptive field of the convolutional neural network, thus allowing the Gram matrix to capture long-range texture information within images. The Gram matrix reflects the pairwise correlations between different feature intensities in the feature map, with diagonal elements representing the response of specific filters and off-diagonal elements indicating the relationships between different filters. With standardized input dimensions of (B, C, H × W) for Gram matrix computation, the final output tensor dimensions are reformulated as (B, C × C). This dimensional transformation is systematically achieved via (1) flattening the spatial dimensions (H, W) into H × W, and (2) Gram matrix-based computation of inter-channel covariance features to generate cross-channel correlation matrices. Furthermore, Gram matrices are computed frame-wise to preserve temporal dynamics. Compared to direct feature mapping, the Gram matrix demonstrates heightened sensitivity to subtle spatiotemporal inconsistencies in generative deepfakes by encoding cross-channel texture statistics, thereby providing an interpretable physical basis for detection methodologies.
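As a concrete illustration of Equation (9), the frame-wise Gram matrix computation described above can be sketched in PyTorch as follows; this is a generic implementation, not the authors’ exact code.
```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (B, C, H, W) feature maps from one frame.
    Returns cross-channel correlation matrices of shape (B, C, C),
    i.e. Gram_{i,j} = sum_k F_{ik} F_{jk} over spatial positions k."""
    B, C, H, W = features.shape
    F = features.view(B, C, H * W)             # flatten spatial dims into H*W
    return torch.bmm(F, F.transpose(1, 2))     # (B, C, C); may be divided by C*H*W for normalization

# The (B, C, C) output can be flattened to (B, C*C) before fusion with backbone features.
```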
For model construction, a Gram matrix-based image texture enhancement module is introduced to calculate image texture information. This module is designed to aid the neural network in effectively capturing texture features, thereby significantly improving the network’s ability to extract these features. The structure of the Gram image texture enhancement module is depicted in Figure 5.
In the network module illustrated in Figure 5, feature extraction is first conducted by a 3 × 3 convolutional layer, followed by a Gram matrix calculation layer to capture global texture features. A channel attention mechanism is then introduced between the subsequent two 3 × 3 convolutional layers and the normalization and activation function blocks, enhancing the representation capability of the image and further optimizing texture feature representation. Finally, a global average pooling layer is employed to ensure dimensional consistency with the backbone network. The texture features extracted at different levels of the backbone network are fused together using appropriate feature fusion methods, resulting in comprehensive image texture feature information across various network layers. This approach facilitates a better understanding of the overall structure of textures and the relationships between texture elements.

4. Construction and Optimization for the Deepfake Detection Model

4.1. Construction of the Improved MTCNN-Based Model

MTCNN, a multi-task cascaded convolutional network, has advanced face detection and alignment, but it still has limitations. In complex scenarios such as uneven lighting or large pose variations, its detection accuracy drops, and its high computational complexity and slow speed limit large-volume data processing. To address these issues, MTCNN is improved in this work. The improved IMTCNN structure (Figure 6) adds a convolutional residual module in the R-Net layer, along with transposed convolution and max-pooling layers for dimension adjustment and for resolving feature fusion issues. This increases network depth and non-linearity and expands the receptive field. These enhancements boost both detection accuracy and adaptability.
The structure of the convolutional residual module proposed in this paper is illustrated in Figure 7. This module comprises a 5 × 5 convolutional kernel and two serially connected residual blocks, each containing a 5 × 5 convolutional kernel and a ReLU activation function. By employing larger convolutional kernels, the receptive field is expanded, allowing the network to accurately capture facial information in images while mitigating the effects of external environmental factors. The stacking of residual modules facilitates deeper network layers, enhancing training stability and improving detection performance.
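The convolutional residual module described above can be sketched as follows, assuming identity skip connections and “same” padding for the 5 × 5 kernels; the channel counts are illustrative.
```python
import torch
import torch.nn as nn

class ResidualBlock5x5(nn.Module):
    """One residual block: 5x5 convolution + ReLU with an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.conv(x))

class ConvResidualModule(nn.Module):
    """5x5 convolution followed by two serially connected residual blocks."""
    def __init__(self, in_channels: int, channels: int):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(ResidualBlock5x5(channels), ResidualBlock5x5(channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.head(x))
```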

4.2. Noise Reduction and De-Artifacting Using the Diffusion Model

In the realm of image processing, diffusion models stand out as extensively adopted methods for data generation, leveraging the diffusion relationships between pixels in images or videos to uncover the genesis of images. The fundamental concept behind this approach treats the data generation process akin to a Markov chain, transitioning gradually from a complex data distribution to a simpler Gaussian distribution, and then reversing this process to generate new data points, thereby achieving denoising and artifact removal. Traditional image denoising methods, such as those relying on wavelet transforms or conventional low-pass filters, though effective in some scenarios, often falter when dealing with intricate noise and artifacts, struggling to capture high-order correlations between pixels. Hence, this paper introduces deep-learning-based diffusion models to adeptly learn complex diffusion relationships within images, thereby enhancing the efficacy of image processing tasks. The algorithm of the diffusion model is depicted in Figure 8.
As illustrated in the process flow above, the diffusion process involves gradually introducing noise into the image, as indicated by the process $x_0 \rightarrow x_T$ in Figure 8. Conversely, the inverse diffusion process, $x_T \rightarrow x_0$, entails denoising and restoring the image or generating new images. The diffusion model excels in learning the ability to generate high-quality data samples from noise, thus demonstrating outstanding performance in generating diverse, high-fidelity data, and in image denoising and artifact removal.
In the diffusion model illustrated in Figure 8, a cosine noise schedule is primarily employed to balance generation efficiency and quality through a 1000-step fully trained sampling strategy. The denoising network, based on a pre-trained U-Net architecture with encoder–decoder structure, simultaneously captures global structural features and local details via multi-scale feature fusion and skip connections during the reverse denoising process. This design optimizes both noise prediction accuracy and generation speed during progressive denoising steps, which was experimentally validated to enhance detection sensitivity and synthetic data fidelity for forgery detection tasks.
In image noise reduction, the diffusion model treats the noisy data as the result of a partial backward diffusion process. Through training, it learns effective strategies to reverse this process, thereby achieving denoising. Deartifacting, meanwhile, targets the elimination of unnatural signals or structures introduced during data acquisition and processing. Consequently, the formulation of the diffusion model is articulated below. It delineates forward and reverse processes, representing sequential stages within the diffusion model, with a forward step of the diffusion model presented as follows.
The forward diffusion steps of the diffusion model are executed through the application of linear transformations and Gaussian noise, as depicted in Equation (10).
$$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right) \tag{10} $$
In Equation (10), $\beta_t$ denotes the noise scale at each time step, regulating the amount of noise added; $I$ stands for the identity matrix, ensuring isotropic noise; $\varepsilon \sim \mathcal{N}(0, 1)$ signifies Gaussian noise; and $q(x_t \mid x_{t-1})$ denotes the Gaussian transition probability distribution. Thus, the computation of the noised sample $x_t$ is expressed in Equation (11).
$$ x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \varepsilon_t \tag{11} $$
The above equation implies that as time $t$ progresses, $\beta_t$ increases gradually until $x_t$ closely approximates Gaussian noise. Further analysis reveals
$$ x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \varepsilon \tag{12} $$
In Equation (12), $1 - \alpha_t = \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Here, $x_0$ signifies the original image before noise addition, $x_t$ denotes the image fully diffused with noise, and $\varepsilon$ represents Gaussian noise. In addition, the inverse step of the diffusion model is formulated as follows:
$$ x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t) \right) + \sqrt{\beta_t}\, \varepsilon_t, \qquad x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \varepsilon_\theta(x_t, t) \right) + \sigma_t z \tag{13} $$
In Equation (13), $x_{t-1}$ is the denoised image, $x_t$ is the noisy image, and $\varepsilon_\theta(x_t, t)$ is a function approximator for the total noise $\varepsilon$ between $x_t$ and $x_0$ used in prediction. Within the diffusion model, the goal is to incorporate learned noise into $x_0$, and the relationship between the diffusion model score function and noise prediction is articulated in Equation (14).
$$ s_\theta(x_t, t) = -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}} \tag{14} $$
In Equation (14), we simultaneously use $\varepsilon_\theta(x, t)$ and $s_\theta(x, t)$ to represent the diffusion model.
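For clarity, the forward noising step of Equation (12) and one reverse denoising step of Equation (13) can be sketched as below; the linear schedule shown is a simplification (the paper uses a cosine schedule), and `eps_theta` stands for a pre-trained U-Net noise predictor that is assumed to exist.
```python
import torch

T = 1000
beta = torch.linspace(1e-4, 0.02, T)        # simplified linear schedule; the paper uses a cosine schedule
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)     # cumulative product of alpha_s, Eq. (12)

def forward_diffuse(x0: torch.Tensor, t: int):
    """Sample x_t from x_0 via Eq. (12): x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

@torch.no_grad()
def reverse_step(xt: torch.Tensor, t: int, eps_theta):
    """One reverse denoising step of Eq. (13), taking sigma_t^2 = beta_t as in DDPM."""
    eps_hat = eps_theta(xt, t)                                        # predicted noise
    mean = (xt - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
    if t == 0:
        return mean                                                   # no noise added at the final step
    return mean + beta[t].sqrt() * torch.randn_like(xt)
```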

4.3. Construction of the FG-TEFusionNet Model for Deepfake Detection

Multi-feature fusion is a method that utilizes different types of features for fusion judgments to improve detection performance. This approach typically encompasses both feature-level fusion and decision-level fusion. Feature-level fusion involves concatenating or adding features extracted from different feature extractors to create a richer feature representation. Decision-level fusion, on the other hand, aggregates classification outcomes from various features using weighting or voting methods to produce the ultimate classification result. By integrating cross-frame information analysis into established frame-level detection techniques, the effectiveness of deepfake video detection can be significantly enhanced. Depending solely on intra-frame or inter-frame information in deepfake video detection methods may result in a drastic reduction in detection accuracy when facing novel datasets or emerging forgery techniques. Comprehensive utilization of intra-frame and inter-frame information for integrated feature analysis not only enhances the model’s capability to identify forged videos but also substantially improves the accuracy and robustness of the detection system across various contexts. Consequently, this study utilizes decision-level fusion for deepfake detection. As shown in the method framework of Figure 1, this method leverages the LRNet network model, based on SENet attention, to extract facial geometric features, and the EfficientNet network model, incorporating WSDAN and image texture enhancement modules, to extract image texture features. By comprehensively extracting both inter-frame and intra-frame feature information and subsequently performing decision-level fusion, the system can more effectively differentiate between authentic and forged videos.

4.3.1. Construction of the LRNet Network Model Incorporating SENet Attention

The LRNet framework, by employing its specifically designed calibration module, enhances the accuracy of facial landmarks in input face images and effectively mitigates irregular jitter and noise resulting from video compression or facial movements within videos. Consequently, the geometric features of facial landmarks can be utilized for the detection of deepfake face videos. However, LRNet still has several limitations. Due to the specific nature of optical flow calculations, the tracking and denoising of facial landmarks can only be carried out between adjacent frames and propagated downwards. This approach lacks the capability to extract cross-frame information and does not sufficiently consider long-range temporal features, leading to inadequate extraction of inter-frame information. Thus, by comprehensively utilizing the detailed features within the facial landmarks, the accuracy and robustness of the deepfake face video detection model can be significantly improved. Building on the aforementioned concepts, an inter-frame information-based deepfake face detection model, SEAR-Net (SENet Attention in Two-RNN-Net), is proposed. To enhance the feature representation of inter-frame information in deepfake videos, the facial landmark calibration module from the LRNet framework was retained, while a SENet attention mechanism module was incorporated into the recurrent neural network branches. The outputs of these two branches are then fed into a decision fusion module. The SEAR-Net method extracts features from the inter-frame information of the input sample video and generates the results from the network branches, thus achieving the detection of deepfake face videos. The framework of the SEAR-Net method is illustrated in Figure 9.
  • Facial Landmarks Calibration Module
The facial landmarks obtained after preprocessing can accurately represent the contours and details of facial features. However, due to the continuous nature of video dynamics, these landmarks still exhibit noticeable jitter, which hinders the effective extraction of inter-frame information. To address this, a facial landmark calibration module is employed to reduce the impact of jitter noise, thereby enhancing the resolution and accuracy of the facial landmarks and improving detection accuracy. The facial landmark calibration module used in this study is illustrated in Figure 10. It incorporates the Pyramid Lucas–Kanade algorithm and the Kalman filter, which are utilized for predicting the position in the next frame and denoising the inter-frame facial landmarks, respectively.
To calibrate facial landmarks, the algorithm adjusts the landmarks by matching small patches in the surrounding area. The calibration module employs the optical flow method to track the positional changes of facial landmarks across consecutive video frames. Given that deepfake face images are highly detailed and the Lucas–Kanade algorithm is sensitive to patch size, a pyramidal structure is introduced. The Pyramid Lucas–Kanade algorithm facilitates the tracking of facial landmarks between adjacent frames, as depicted in Step 1 of the LK operation in Figure 10.
In applications, the Lucas–Kanade algorithm may introduce noise to facial landmark movement in sample videos, affecting network model feature extraction. Thus, the Kalman filter is used. It combines actual and predicted landmark positions to remove most noise, enhancing feature-point precision and stability, as shown in Figure 10 Step 2.
The working principle of the facial landmark calibration module is further detailed. First, the Lucas–Kanade algorithm tracks the facial feature points in the previous frame to predict the approximate location of the face in the next frame. To ensure prediction accuracy, small image patches with unreasonable predicted positions that exhibit significant discrepancies are discarded. The Kalman filter is then used to perform a weighted average of the positions predicted by the Lucas–Kanade algorithm and those actually detected by the face detection module. This fusion strategy not only calibrates the detection positions but also effectively reduces the noise level of facial landmarks within the video frame sequence. The calibrated detection position information is more precise, enabling subsequent feature extraction networks to capture it more effectively and extract critical temporal feature information, thereby enhancing the overall system’s capability to understand and analyze the dynamic characteristics of video content.
  • SENet Attention-Based Feature Extraction Module For Dual-Flow RNN
The input sample video is processed through the facial landmarks calibration module to obtain standardized facial landmarks. The Lucas–Kanade algorithm and Kalman filter are then utilized to fuse the predicted and actual positions, effectively reducing noise and yielding the optimal calibrated feature point locations. Subsequently, a dual-stream RNN is used to extract temporal dimension features. The calibrated facial landmark coordinates are embedded into two types of feature vector sequences in different forms, which are then fed into two separate RNNs to extract temporal information at different levels. Each set of facial landmarks for a video frame $L_i$ can be represented as $\left(x_i^1, x_i^2, \ldots, x_i^{68}\right)^T$, where a facial landmark point is denoted as $x_i^{\alpha} = \left(x_i^{\alpha}, y_i^{\alpha}\right)^T$. Consequently, the first type of feature vector sequence is expressed as shown in Equation (15):
$$ \alpha_i = \left(x_i^1, y_i^1, x_i^2, y_i^2, \ldots, x_i^{68}, y_i^{68}\right)^T \tag{15} $$
The first type of feature vector sequence represents direct facial geometric features. The second type of feature vector sequence is derived from the differences between facial landmarks of two consecutive frames, as illustrated in Equation (16).
$$ \beta_i = \alpha_{i+1} - \alpha_i = \left(x_{i+1}^1 - x_i^1,\ y_{i+1}^1 - y_i^1,\ \ldots,\ x_{i+1}^{68} - x_i^{68},\ y_{i+1}^{68} - y_i^{68}\right)^T \tag{16} $$
Therefore, by embedding, the geometric feature information of the optimal facial landmarks can be represented as two feature vector sequences $A = \left(\alpha_1, \alpha_2, \ldots, \alpha_n\right)^T$ and $B = \left(\beta_1, \beta_2, \ldots, \beta_n\right)^T$, where the subscript $n$ denotes the number of facial landmark sets in a video. Temporally, $A$ represents the collection of facial landmarks in the sample video and serves as the input vector sequence to an RNN network $g_1$. The vector sequence $B$, formed by the differences between consecutive frames, captures the velocity feature information of the facial landmarks and is input as a vector sequence into another RNN network $g_2$. In the branch network $g_1$, the SENet attention module is incorporated to enhance the representation of varying weights for each facial landmark in the temporal information features. This forces the network to focus on different local information, thereby improving classification accuracy. Both RNN branches subsequently pass through fully connected neural networks, with the sigmoid activation function then producing the prediction results $pred_1$ and $pred_2$ of each branch. Finally, these outputs are fed into the multi-feature decision fusion module to achieve deepfake face detection.
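A minimal sketch of how the two feature-vector sequences of Equations (15) and (16) can be assembled from calibrated 68-point landmarks is given below; the array shapes are assumptions, and the SENet-augmented RNN branches themselves are omitted.
```python
import numpy as np

def build_landmark_sequences(landmarks: np.ndarray):
    """landmarks: (n, 68, 2) calibrated facial landmarks for n frames.
    Returns A of shape (n, 136), the direct geometric features of Eq. (15),
    and B of shape (n - 1, 136), the frame-to-frame differences of Eq. (16)."""
    A = landmarks.reshape(landmarks.shape[0], -1)   # (x_1, y_1, ..., x_68, y_68) per frame
    B = A[1:] - A[:-1]                              # velocity features between consecutive frames
    return A, B
```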

4.3.2. Model Construction of Incorporating WSDAN and Image Texture Enhancement Module

The improved SEAR-Net model relies solely on optimized geometric features of facial landmarks for discrimination, which effectively eliminates finer facial textures and retains only the external contour shapes. By comprehensively utilizing detailed facial texture features and incorporating a temporal information branch within video frames into the discriminative network, the accuracy and robustness of deepfake face video detection can be significantly improved. Consequently, this study introduces the Gram image texture enhancement module. Compared to baseline neural networks, this module excels in capturing facial texture features, acquiring larger texture sizes and more extensive texture patterns, thereby enhancing the model’s capability to detect deepfake faces. Furthermore, the model integrates the WSDAN network, which further strengthens the extracted texture features, enhancing the model’s robustness. In summary, the image texture enhancement module and the WSDAN network are integrated into EfficientNet, resulting in the design of TE-WSDAN-DDNet (Texture Enhancement and WSDAN with EfficientNet for Deepfake Detection) for deepfake face detection, as illustrated in Figure 11.
As illustrated in Figure 11, the framework of TE-WSDAN-DDNet for deepfake face detection integrates the image texture enhancement module and the WSDAN network. The TE-WSDAN-DDNet model primarily comprises the Gram Block, WSDAN module, EfficientNet, and GAP Layer. Both shallow and deep layers of the network compute the Gram matrix to extract texture feature information at different levels. Specifically, texture enhancement blocks are added at five positions within EfficientNet: after the first convolution layer, after the second MBConv6, after the tenth MBConv6, after the twenty-second MBConv6, and before the final classification layer. Texture information is computed at each of these layers. This texture information from different network levels is concatenated and then combined with the backbone network of the EfficientNet network to produce image features containing multi-level texture information. The enhanced data derived from WSDAN is subsequently used to aid in deepfake detection. Finally, the features are input into the global average pooling module for the classification of the prediction results.
The EfficientNet backbone network in this paper primarily relies on the MBConv module. This module performs convolutional operations on input feature maps to capture inter-pixel correlations within convolutional maps and subsequently extracts global features. The composition of the MBConv module is depicted in Figure 12. During the model’s feature extraction process, Conv1 is initially employed to extract shallow-level information from the images using a 3 × 3 convolutional kernel and a stride of 2. Following this, batch normalization (BN) and Swish activation functions are incorporated to mitigate overfitting. In subsequent stages, seven MBConv modules are utilized for deep feature extraction. Among these, MBConv1 and MBConv6 employ expansion ratios of 1 and 6, respectively, with similar network structures. MBConv6, however, scales up the input channels by a factor of 6 to capture more intricate image features, necessitating five downsampling steps for extracting forged image texture features.

4.3.3. Multi-Feature Decision Fusion Module

The multi-feature extraction and decision fusion framework in this study incorporates three distinct network branches. In the network branch dedicated to inter-frame information extraction, the RNN g1 leverages the SENet attention mechanism to extract geometric features from the video input, yielding the branch prediction p r e d 1 . Concurrently, RNN g2 extracts velocity features of the optimal facial landmarks, resulting in the branch prediction p r e d 2 . Within the network branch focused on intra-frame information extraction, EfficientNet integrates the WSDAN network and an image texture enhancement module to extract detailed texture information from facial regions within the input video frames. This process effectively highlights the extracted features, with the output prediction result labeled as p r e d 3 . The final multi-feature extraction and decision fusion network predict the overall outcome pred for detecting deepfake videos. Consequently, depending on the type of prediction results, the voting method is employed to fuse and finalize the outcomes from each branch, as depicted in Equation (17).
$y_{pred} = f\left(pred_1 + pred_2 + pred_3\right)$
In Equation (17), $f$ denotes the voting function commonly employed in machine learning classification tasks. The decision logic of the base classifiers follows a hard majority rule: if at least two of the three independent branches agree on a sample’s class, that majority prediction is adopted as the ensemble result. For example, when two branches vote false and one votes true, the sample is classified as false.
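A minimal sketch of this hard-voting rule is given below, assuming predictions are encoded as 1 for fake and 0 for real; the function name is illustrative.

```python
def fuse_predictions(pred1: int, pred2: int, pred3: int) -> int:
    """Hard majority vote over the three branch predictions (Eq. 17).

    Labels are assumed to be encoded as 1 = fake and 0 = real; a sample is
    declared fake only when at least two branches vote fake, otherwise the
    ensemble falls back to the real (false) class.
    """
    votes = pred1 + pred2 + pred3
    return 1 if votes >= 2 else 0

# Example: two branches vote fake, one votes real -> ensemble outputs fake
assert fuse_predictions(1, 1, 0) == 1
```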

5. Construction of Sample Adversarial Attacks and Defense Model

Adversarial attack refers to the process of introducing adversarial perturbations into a model’s original inputs, thereby creating adversarial examples that cause the model to make incorrect predictions. Adversarial perturbations are subtle modifications to the model inputs that are capable of misleading the model’s output. IFGSM (Iterative Fast Gradient Sign Method) [34] is an iterative adversarial attack method used to generate adversarial samples for a model: it treats the interaction between attacker and model as a repeated game [35] and applies small gradient-sign perturbations over multiple iterations to produce stronger adversarial samples. Consequently, to enhance the robustness of the deepfake video detection in this study, IFGSM was introduced to generate adversarial examples, whose impact is then mitigated through adversarial training. The procedure for generating IFGSM adversarial examples is given in Equation (18) and summarized in Algorithm 1.
$x_0 = x, \qquad x_{t+1} = \mathrm{clip}_x^{\varepsilon}\left(x_t + \alpha \times \mathrm{sign}\left(\nabla_{x_t} J\left(f(x_t, \theta),\, y\right)\right)\right), \quad t \le T$
In Equation (18), $x$ represents the original sample, and $x_{t+1}$ denotes the adversarial example after $t$ iterations, where $t$ does not exceed the specified number of iterations $T$. $\mathrm{clip}_x^{\varepsilon}$ is the clipping function, which keeps the updated adversarial example within an $\varepsilon$-neighborhood of the original sample, i.e., $\|x_t - x\| \le \varepsilon$, thereby preventing large deviations between $x_{t+1}$ and $x$. Here, $\varepsilon$ denotes the perturbation range, $\alpha$ the learning rate (step size), $\mathrm{sign}(\cdot)$ the sign function, $J$ the cross-entropy loss function, $f$ the objective (detection) function, $\theta$ the model parameters, and $y$ the true label of the genuine face. The loss is computed as the cross-entropy between the model’s output on the current sample and the true label.
Algorithm 1 IFGSM algorithm for generating adversarial samples
Input: Original sample $x$, loss function $J$, perturbation range $\varepsilon$, number of iterations $T$, learning rate $\alpha$, clipping function $\mathrm{clip}_x^{\varepsilon}$ enforcing $\|x_t - x\| \le \varepsilon$, where $x_t$ denotes the sample after $t$ iterations
Output: Adversarial sample $x_T$
Initialization: $x_0 = x$
1. for $t = 1$ to $T$ do
2.  Compute the gradient of the cross-entropy loss between the current sample and the true label $y$: $grad = \nabla_{x_{t-1}} J\left(f(x_{t-1}, \theta), y\right)$
3.  Take the sign of the gradient: $sign\_grad = \mathrm{sign}(grad)$
4.  Update the sample: $x_t = x_{t-1} + \alpha \cdot sign\_grad$
5.  Clip the result so that pixels remain in $(0, 1)$ and $\|x_t - x\| \le \varepsilon$: $x_t = \mathrm{clip}_x^{\varepsilon}(x_t, 0, 1)$
6. end for
7. return $x_T$
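For reference, a minimal PyTorch sketch of Equation (18)/Algorithm 1 is shown below. It assumes images normalized to [0, 1] and uses the hyperparameter values reported later in Section 6.5 as defaults; it is an illustrative implementation, not the exact code used in this work.

```python
import torch
import torch.nn.functional as F

def ifgsm_attack(model, x, y, eps=24/255, alpha=1/255, steps=100):
    """Minimal I-FGSM sketch following Equation (18).

    `model` is any differentiable classifier returning logits; default
    hyperparameters mirror T=100, alpha=1/255, eps=24/255.
    """
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)        # J(f(x_t, theta), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # gradient-sign step
            # clip to the eps-ball around the original sample and to valid pixels
            x_adv = torch.clamp(x_adv, x_orig - eps, x_orig + eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```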
According to Equation (18), by employing the IFGSM algorithm to generate adversarial examples, this study utilizes adversarial training to assess the model’s robustness against adversarial attacks. The essence of adversarial training lies in generating effective adversarial examples and striking a balance between the significance of original samples and adversarial ones during the model training process. This methodology aims to bolster the model’s robustness in the face of adversarial assaults [36]. Figure 13 outlines the schematic of the proposed method for defending against sample adversarial attacks in this paper. The primary workflow involves generating adversarial examples using the IFGSM algorithm, feeding them alongside original samples into SEAR-Net and TE-WSDAN-DDNet for adversarial training, and ultimately employing decision fusion to determine the authenticity of the results.
Furthermore, when adversarial training is adopted in this paper, the adversarial samples are mixed with the original samples at a ratio of 1:1 for training. Pairing each original sample with its adversarial counterpart enhances the model’s robustness while preserving the original data distribution. Meanwhile, a weighted cross-entropy is used as the loss function for adversarial training to balance robustness against the original accuracy. The cross-entropy formulation used in adversarial training is therefore given as follows.
First, in the detection task, the model learns the true distribution of the data by minimizing the cross-entropy loss function, as shown in Equation (19).
$L_{CE}(x, y) = -\sum_{i=1}^{2} y_i \log\left(p_i(x)\right)$
Let $x$ denote the input image, and let $y = (y_1, y_2)$ represent the ground-truth label over the authentic and forged categories, respectively. The predicted probability for class $i$ is expressed as $p_i(x)$.
The adversarial perturbation δ is generated via the IFGSM algorithm defined in Equation (18). By injecting δ into the original data, the model is compelled to maintain robustness against the adversarial samples x + δ. Consequently, the adversarial loss function is formulated in Equation (20) as follows:
$L_{Adv}(x + \delta, y) = -\sum_{i=1}^{2} y_i \log\left(p_i(x + \delta)\right)$
The final adversarial training objective function is formulated by jointly optimizing the original data loss and adversarial sample loss, as mathematically expressed below:
$L_{Total} = L_{CE}(x, y) + \lambda L_{Adv}(x + \delta, y)$
As indicated in Equation (21), the adversarial loss weighting coefficient λ is empirically set to 0.5. This hyperparameter serves as a critical trade-off parameter to balance the model’s learning focus between clean sample feature extraction and robustness enhancement against adversarial perturbations, thereby preventing either loss term from dominating the optimization process.
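A minimal sketch of one adversarial training step implementing Equation (21) is shown below, reusing the `ifgsm_attack` sketch given earlier; λ = 0.5 follows the setting above, while the remaining names are illustrative assumptions rather than the exact training code.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, lam=0.5,
                              eps=24/255, alpha=1/255, steps=100):
    """One adversarial training step sketch: L_Total = L_CE + lambda * L_Adv."""
    model.eval()
    x_adv = ifgsm_attack(model, x, y, eps=eps, alpha=alpha, steps=steps)
    model.train()
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)      # L_CE on original samples
    loss_adv = F.cross_entropy(model(x_adv), y)    # L_Adv on perturbed samples
    loss_total = loss_clean + lam * loss_adv       # Eq. (21) with lambda = 0.5
    loss_total.backward()
    optimizer.step()
    return loss_total.item()
```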

6. Experiment and Result Analysis

6.1. Datasets

The method was tested and validated on the widely acknowledged benchmark datasets FaceForensics++ (FF++) [21] and CelebDF [37]. FF++ encompasses four distinct methods for generating deepfake facial data. The dataset, comprising 1000 original real video sequences and 4000 manipulated videos, is extensively utilized in training and evaluating our models. Specifically, Deepfakes employs an autoencoder-based approach to achieve precise face swapping through the training of one-to-one generation models. Face2Face performs facial reenactment, transferring the expressions of a source face to the target face without employing generative models. FaceSwap utilizes a 3D-based technique to swap faces by learning and reconstructing facial features. NeuralTextures uses learned neural textures to modify facial regions while maintaining the 3D consistency of the generated faces by incorporating texture details from reference images.
To cater to researchers with diverse requirements, the original data was encoded using H264 compression for both real and fake videos, with varying levels of compression applied. In our experiments, the datasets consisted of lightly compressed high-definition facial video images, specifically graded at the C23 quality level. This ensures dataset clarity and establishes an effective benchmark to assess the influence of different compression techniques on forged videos. This study selected 1000 real videos and 500 forged videos from each of the Deepfakes, FaceSwap, Face2Face, and NeuralTextures datasets for training and testing. Examples of dataset samples are illustrated in Figure 14.
Moreover, the CelebDF dataset exists in two versions, with this study opting for CelebDF-v2. This dataset effectively compensates for the FF++ dataset’s lower-quality deepfake content by offering synthetic videos that rival the high-quality fake data prevalent on the internet. Specifically, CelebDF-v2 consists of 590 real videos and 5639 synthetic videos of generally high quality, serving as the transfer test dataset for evaluating the deepfake face detection model in this study.

6.2. Evaluation Metrics and Experimental Parameter Settings

6.2.1. Model Evaluation Metrics

ACC (Accuracy) measures the proportion of samples that the model classifies correctly. The ROC [38] curve reflects the relationship between sensitivity and specificity and is obtained from the True Positive Rate (TPR) and False Positive Rate (FPR), defined in Equations (22) and (23); accuracy is computed as in Equation (24). AUC (Area Under the Curve) quantifies the area under the ROC curve, providing a comprehensive measure of the model’s classification performance across different thresholds and effectively mitigating the subjective impact of threshold selection. Therefore, this study employs AUC and ACC as the primary evaluation metrics to validate the performance of the predictive network models. These evaluation metrics are defined as follows.
$FPR = \frac{N_{FP}}{N_{FP} + N_{TN}}$
$TPR = \frac{N_{TP}}{N_{TP} + N_{FN}}$
$Accuracy = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}}$
In the three equations above, $N_{TP}$ represents the count of true positive samples, $N_{FN}$ the count of false negative samples, $N_{FP}$ the count of false positive samples, and $N_{TN}$ the count of true negative samples.
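A short sketch of how these metrics can be computed from detector scores is given below; scikit-learn’s `roc_auc_score` is used here as one possible AUC implementation, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_metrics(y_true, y_score, threshold=0.5):
    """Compute ACC, FPR, TPR (Eqs. 22-24) and AUC from fake-class scores."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    auc = roc_auc_score(y_true, y_score)
    return acc, fpr, tpr, auc
```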

6.2.2. Experimental Model Parameter Settings

In accordance with the study requirements, the dataset undergoes initial preprocessing. Each frame is cropped to a standardized size, followed by facial feature extraction using the improved MTCNN. A diffusion model is then applied to denoise the images and eliminate artifacts. Subsequently, each video is processed through the input pipeline of the respective network for feature extraction, training, and predictive assessment of facial forgery. The dataset is partitioned into training, validation, and test sets in a 60%/20%/20% ratio to assess both performance and generalization ability.
In the inter-frame information extraction network of SEAR-Net, the dual-stream RNN branches $g_1$ and $g_2$ are each equipped with 32 GRU units. SENet attention is applied at the output layer of the RNN, with input and output dimensions of 2 and 64, respectively, followed by an FC layer with input and output dimensions of 64 and 2 at the output layer. Before entering the RNN network, a Dropout layer with a dropout rate of 0.3 is applied to the input, and within the RNN network, four Dropout layers with a dropout rate of 0.5 each are inserted to enhance model generalization and prevent overfitting. For network training, the Adam optimization algorithm is chosen for parameter optimization, with a learning rate of 0.002, and batch size and block size of 2048 and 32, respectively. The epochs for $g_1$ and $g_2$ are set to 1000 and 800, respectively.
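As a structural illustration of the $g_1$ branch, the following PyTorch sketch shows a GRU with 32 units followed by SE-style channel gating and a two-class FC head. The reduction ratio, input dimensionality (68 landmarks × 2 coordinates), and sequence length are assumptions for the sketch, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class SEAttention1D(nn.Module):
    """Squeeze-and-Excitation style gating over GRU feature channels (sketch)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, time, channels)
        squeeze = x.mean(dim=1)            # temporal squeeze -> (batch, channels)
        scale = self.fc(squeeze)           # channel-wise excitation weights
        return x * scale.unsqueeze(1)      # re-weight GRU features

# Sketch of the g1 branch: GRU(32 units) -> SE gating -> FC classifier
gru = nn.GRU(input_size=136, hidden_size=32, batch_first=True)  # assumed 68 landmarks x 2
se = SEAttention1D(32)
clf = nn.Linear(32, 2)
seq = torch.randn(4, 60, 136)              # assumed 60-frame landmark sequences
out, _ = gru(seq)
logits = clf(se(out)[:, -1])               # classify from the last time step
```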
In the intra-frame information extraction network of TE-WSDAN-DDNet, the WSDAN network, with an attention map threshold of 0.45, is integrated, and a Gram Block is introduced for deep texture feature extraction. The EfficientNetB4 model, pre-trained and used as the backbone network, is employed during training. The input image size for model training is set to 384 × 384. Adam is selected as the optimizer, with cross-entropy loss serving as the loss function. The learning rate is set to 0.01, and ExponentialLR is utilized as the learning rate scheduler. The epochs and batch size for data training are 100 and 32, respectively. The GPU used for this experiment is the NVIDIA GeForce RTX4090, and the PyTorch 1.1.0 deep learning framework is employed. The primary evaluation metrics for analyzing model performance are AUC and ACC. Detailed parameters used in the experiments are listed in Table 1 and Table 2.

6.3. Analysis of Feature Extraction Effectiveness

The FG-TEFusionNet network proposed in this study extracts facial geometric and texture features independently and integrates them at the decision layer. Thus, the effectiveness of the method in deepfake detection is further validated through an analysis of the experimental results on facial geometric and texture features.

6.3.1. Analysis of Facial Geometric Feature Point Extraction

In the experiment, the Landmark method is utilized for extracting facial geometric features. Initially, the Lucas–Kanade algorithm is employed to track facial landmark points, followed by noise filtering using a Kalman filter. Research demonstrates the effectiveness of the Landmark method in extracting facial geometric landmarks for deepfake face detection. Figure 15 illustrates a schematic of facial landmark tracking on selected forged face images.
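To illustrate the tracking-plus-filtering step, the following OpenCV sketch propagates landmarks with Lucas–Kanade optical flow and smooths each point with a constant-velocity Kalman filter. All window sizes, noise covariances, and the landmark count are illustrative assumptions rather than the exact settings used in this work.

```python
import cv2
import numpy as np

def track_landmarks(frames, initial_landmarks):
    """Lucas-Kanade landmark propagation with per-point Kalman smoothing (sketch).

    `frames` is a list of BGR frames; `initial_landmarks` is an (N, 2) float32
    array detected on the first frame (e.g. 68 points).
    """
    lk_params = dict(winSize=(15, 15), maxLevel=2,
                     criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))

    def make_kf(x, y):
        kf = cv2.KalmanFilter(4, 2)                      # state: x, y, vx, vy
        kf.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                        [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
        kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
        kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
        kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-2
        kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
        return kf

    filters = [make_kf(x, y) for x, y in initial_landmarks]
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = initial_landmarks.reshape(-1, 1, 2).astype(np.float32)
    tracks = [initial_landmarks.copy()]

    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk_params)
        smoothed = []
        for kf, (x, y) in zip(filters, pts.reshape(-1, 2)):
            kf.predict()
            est = kf.correct(np.array([[x], [y]], np.float32))  # filtered position
            smoothed.append([est[0, 0], est[1, 0]])
        tracks.append(np.array(smoothed, np.float32))
        prev_gray = gray
    return tracks
```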

6.3.2. Analysis of Texture Feature Extraction

The following section introduces the effectiveness of the proposed image texture enhancement module in extracting texture features. A series of experiments were conducted on four distinct datasets from FaceForensic++, exploring training adjustments with incrementally increasing weight coefficients and evaluating training outcomes with different loss functions.
To enhance the training effectiveness in extracting texture features, this study incrementally added weights to the different texture layers of the EfficientNet model within the intra-frame network branch, aiming to improve the model’s detection performance across texture features at various levels. The incremental weights follow a geometric progression, with the common ratio between adjacent terms varied over q = 1, 2, 3, 3.5, and 4. Table 3 presents the AUC values tested on the EfficientNet network branch.
Based on the results in Table 3, the best detection performance was achieved by setting the incremental training weight ratio q to 3 when the texture feature module was added to the EfficientNet network. As q increased further, the detection AUC gradually decreased, possibly because the quality of the texture features extracted by the network declined, diminishing detection effectiveness. Additionally, this study employed the Earth Mover’s Distance (EMD) as a loss function for model optimization. EMD is a measure used in content-based image retrieval to compute the distance between two distributions and was initially proposed by Rubner et al. [39] for visual problems. Based on the EMD algorithm, we formulate the loss function shown in Equation (25).
$TextureLoss = \sum_{l} w_l \sum_{i} \left(\mathrm{sorted}(F)_i - \mathrm{sorted}(\hat{F})_i\right)^2$
In this study, the EMD loss function was integrated into the model. Building upon the incremental sequence with q = 3, experiments were conducted to train the L2 loss function for calculating the Gram matrix and the EMD loss function separately within EfficientNet. The detection AUC results are presented in Table 4.
Table 4 shows that the Gram L2 loss function achieved superior AUC performance. With it, the network model learns the relationships between feature vectors of the original texture images, yielding richer and more detailed texture features. Conversely, the EMD loss captures texture less effectively because its sorting operation discards the spatial arrangement of activations. This underscores the image texture enhancement module’s role in helping the model capture forgery features from both local and global perspectives. Integrated with the WSDAN network, it enables the backbone network to prioritize areas with prominent textures, thereby improving detection accuracy.
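The two loss variants compared in Table 4 can be sketched as follows; the layer weights $w_l$ and normalization are assumptions, and the functions operate on a real/reconstructed feature-map pair at a single layer.

```python
import torch

def gram_l2_loss(feat_real: torch.Tensor, feat_fake: torch.Tensor) -> torch.Tensor:
    """L2 distance between Gram matrices of two feature maps (B x C x H x W)."""
    def gram(f):
        b, c, h, w = f.shape
        f = f.view(b, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)
    return ((gram(feat_real) - gram(feat_fake)) ** 2).mean()

def sorted_emd_loss(feat_real: torch.Tensor, feat_fake: torch.Tensor,
                    weight: float = 1.0) -> torch.Tensor:
    """EMD-style loss of Eq. (25): L2 between channel-wise sorted activations."""
    b, c = feat_real.shape[:2]
    sorted_real, _ = feat_real.view(b, c, -1).sort(dim=-1)
    sorted_fake, _ = feat_fake.view(b, c, -1).sort(dim=-1)
    return weight * ((sorted_real - sorted_fake) ** 2).mean()
```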

6.4. Analysis of Diffusion Model Effectiveness

To validate the effectiveness of the diffusion model in denoising and artifact removal tasks, we employed four distinct subsets of FaceForensics++. Initial preprocessing steps were implemented to ensure the quality of the input data. Subsequently, the effectiveness of denoising and artifact removal was evaluated using assessment metrics and visual analysis, laying the foundation for subsequent deepfake detection efforts.
In the quantitative experiments of the diffusion model, we utilized the Peak Signal-to-Noise Ratio (PSNR) [40] and Learned Perceptual Image Patch Similarity (LPIPS) [41] as metrics to gauge image quality. PSNR is considered a reliable method for evaluating image quality, with higher scores indicating better image quality. LPIPS, meanwhile, uses deep learning models to assess perceptual differences between two images. In this study, VGG network models were employed for extracting image features, followed by computing the distance between them. Evaluation scores were derived by comparing the generated images with real images to calculate the L2 distance value. A smaller LPIPS score signifies higher image quality, as illustrated in Equations (26) and (27).
$PSNR(I_{t+1}, \hat{I}_{t+1}) = 10 \times \log_{10} \frac{\left[\max(I_{t+1})\right]^2}{\frac{1}{N}\sum\left(I_{t+1} - \hat{I}_{t+1}\right)^2}$
$LPIPS(I_1, I_2) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left(\hat{y}_{1,hw}^{l} - \hat{y}_{2,hw}^{l}\right)\right\|_2^2$
In Equation (26), × denotes multiplication, $\max(I_{t+1})$ is the maximum pixel value of the real image, and $\frac{1}{N}\sum(I_{t+1} - \hat{I}_{t+1})^2$ is the mean squared error between the generated and real images. In Equation (27), $I_1$ and $I_2$ denote two image blocks, and $\hat{y}_{1,hw}^{l}$ and $\hat{y}_{2,hw}^{l}$ denote the normalized feature-map activations of the real and generated images at layer $l$, respectively. These activations have spatial dimensions $H_l \times W_l$ (height and width), and $w_l$ is a scaling vector used to weight the feature channels of layer $l$. Finally, the L2 distance between $\hat{y}_{1,hw}^{l}$ and $\hat{y}_{2,hw}^{l}$ is computed for all layers and image blocks and averaged over the spatial dimensions.
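The two metrics can be computed as in the following sketch; the `lpips` Python package (with VGG features) is used here as one possible LPIPS implementation and is an assumption, not necessarily the implementation used in this study.

```python
import numpy as np
import torch
import lpips  # pip install lpips; learned perceptual metric with VGG features

def psnr(img: np.ndarray, ref: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR of Eq. (26) for images scaled to [0, max_val]."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    return 10.0 * np.log10((max_val ** 2) / mse)

# LPIPS with VGG features; inputs are B x 3 x H x W tensors scaled to [-1, 1]
lpips_vgg = lpips.LPIPS(net='vgg')
x0 = torch.rand(1, 3, 256, 256) * 2 - 1
x1 = torch.rand(1, 3, 256, 256) * 2 - 1
distance = lpips_vgg(x0, x1).item()
```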
Moreover, in the experiments, T was set to 1000 for all trials to ensure a sufficient and consistent number of neural network evaluations during sampling. Therefore, this paper illustrates the denoising process through the inverse reversible process of the diffusion model, as depicted qualitatively in Figure 16.
The experiment validated denoising efficacy on images degraded by motion blur and by added Gaussian noise. As observed in Figure 16, the inverse process of the diffusion model achieved notable denoising and restoration on both types of degradation, verifying its effectiveness in the experimental setting. Additionally, to ensure result stability, quantitative tests were conducted: PSNR and LPIPS scores were computed on the sample dataset, experiments were repeated, and the averages were taken as the final results. The model’s performance on the four subsets of the FaceForensics++ dataset is presented in Table 5.
Based on the experimental results, it is obvious that our diffusion model performs well in denoising tasks. Table 5 displays the PSNR evaluation results of our method on four different deepfake face datasets, with PSNR values consistently above 35 for each dataset and LPIPS values around 0.2. Particularly notable are the highest PSNR values achieved on the FaceSwap and NeuralTextures datasets, indicating excellent generalization capabilities of the diffusion model on these datasets. According to the quality definitions of these two metrics, such results typically indicate a very close visual resemblance between pairs of images, highlighting the effectiveness of our method in reducing noise levels. This further underscores the significant potential of the diffusion model in handling real-world noise and artifacts.

6.5. Analysis of Sample Adversarial Attack and Defence Effectiveness

Adversarial samples were generated using the IFGSM sample attack algorithm in this study. These samples were then trained along with the original dataset inputs on two models. This adversarial training approach effectively mitigates sample attacks and has demonstrated favorable outcomes. The following section presents an analysis of the results on adversarial attacks and defense.
Adversarial sample generation algorithms require manual configuration of three hyperparameters: the number of iterations $T$, the perturbation range $\varepsilon$, and the learning rate $\alpha$. Based on the experimental requirements of this study, the hyperparameters were set to $T = 100$, $\alpha = 1/255$, and $\varepsilon = 24/255$ for generating adversarial samples. The adversarial samples were evaluated using the average L1 loss and the attack success rate. $\hat{L}_{l1}$ denotes the average L1 loss between all adversarial sample frames and their corresponding original frames, as formulated in Equation (28). The attack success rate (ASR) represents the proportion of test samples on which the model’s predictions were successfully deceived. Adversarial samples were thus generated, and adversarial training was conducted, using the four datasets. The results of the adversarial attacks are presented in Table 6.
$\hat{L}_{l1} = \frac{1}{N}\sum_{i=1}^{N}\left|I_i - J_i\right|$
In Equation (28), I and J are the original and adversarial images, respectively, and N is the total number of pixels in the image.
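A brief sketch of how these two evaluation quantities can be computed is given below; the function names are illustrative.

```python
import numpy as np

def average_l1(original_frames, adversarial_frames) -> float:
    """Mean absolute pixel difference of Eq. (28), averaged over all frames."""
    diffs = [np.mean(np.abs(o.astype(np.float64) - a.astype(np.float64)))
             for o, a in zip(original_frames, adversarial_frames)]
    return float(np.mean(diffs))

def attack_success_rate(preds_on_adv, labels) -> float:
    """ASR: fraction of adversarial samples the detector misclassifies."""
    preds_on_adv = np.asarray(preds_on_adv)
    labels = np.asarray(labels)
    return float(np.mean(preds_on_adv != labels))
```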
From Table 6, it can be observed that the average L1 loss in this study was quite low, indicating that the differences between the original and adversarial frames were minimal and that the generated adversarial samples were effective. At the same time, the high attack success rate suggests that the robustness of the model needed improvement. Consequently, adversarial training was applied to strengthen the model against adversarial attacks. As shown in Table 6, the ASR under adversarial training decreased, confirming that adversarial training helps defend against adversarial attacks and improves robustness. Specifically, the strategy reduced the attack success rates (ASR) on the four datasets by 10.15%, 7.07%, 3.32%, and 1.24%, respectively, while still achieving high detection AUC and ACC and maintaining good robustness.
Figure 17 presents adversarial deepfake images from four datasets, randomly selected from successful attacks. The frames in the first row were correctly identified as Fake by the model, while the samples generated by the IFGSM attack algorithm were misclassified as Real. As depicted in the figure, the adversarial samples generated by the algorithm showed no noticeable differences from the original images but consistently resulted in misclassification during detection. This demonstrates the effectiveness of the IFGSM adversarial attack algorithm. Moreover, employing adversarial training as a defensive measure significantly improved the model’s robustness, effectively countering adversarial sample attacks.
To systematically assess the robustness of deepfake detection models under adversarial attack scenarios, controlled experiments were conducted to investigate the correlation between perturbation magnitude (ε) and attack success rates. With iterative steps held constant, the perturbation range was progressively increased while monitoring the variation patterns of attack success rates across four deepfake datasets: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. The quantitative relationship between ε and detection vulnerability is demonstrated in Figure 18.
The experimental results in Figure 18 reveal that the model’s adversarial robustness depends significantly on the generation method. Specifically, as the perturbation budget ε increased from 0.01 to 0.1, FaceSwap saturated rapidly, reaching an ASR of 98.24% at ε = 0.07 and gaining 29.78% in ASR for every 0.02 increase in ε, roughly 1.9 times faster than the saturation rate of NeuralTextures. This accelerated saturation pattern validates the heightened sensitivity of high-frequency artifact features to minor perturbations. Meanwhile, Deepfakes and Face2Face demonstrated progressive ASR growth across the entire perturbation range, with Face2Face showing a saturated ASR 8.3 percentage points lower than the other methods, indicating preserved detectability under subtle perturbations.
Furthermore, to comprehensively validate the model’s adversarial robustness, the evaluation framework was extended by incorporating Projected Gradient Descent (PGD) multi-step attacks [42] as a benchmark comparison. Under parameter configurations identical to I-FGSM (iterations, step size, learning rate, and perturbation budget), the attack success rates (ASR) and perturbation imperceptibility (L1-norm distortion) of single-step (I-FGSM) versus multi-step (PGD) attacks were systematically compared. This comparative analysis delineates the operational applicability boundaries of the two adversarial attack paradigms, with quantitative results detailed in Table 7.
As detailed in Table 7, significant performance disparities were observed between I-FGSM and PGD adversarial attacks across the four deepfake datasets under the unified parameter configuration (T = 100, α = 1/255, ε = 24/255). The experimental results demonstrate that I-FGSM achieved good perturbation imperceptibility, with L1 distortion values lower than those of PGD, confirming the enhanced stealthiness of single-step gradient-based perturbations in circumventing statistical anomaly detection.
Although PGD exhibited marginal advantages in ASR metrics, its adversarial transferability showed notable degradation: The adversarial transfer attack success rates (Adv-ASR) for PGD on FaceSwap and NeuralTextures datasets were 90.25% and 93.47%, respectively, representing reductions of 6.52% and 5.18% compared to I-FGSM. This phenomenon is attributed to PGD’s multi-step optimization process overfitting the decision boundaries of white-box models, whereas I-FGSM’s single-step perturbations preserved stronger cross-model generalization capabilities, making them more suitable for practical black-box attack scenarios. Furthermore, I-FGSM demonstrated computational efficiency advantages critical for real-time deepfake detection systems requiring low-latency processing. Through balanced optimization of attack effectiveness, perturbation stealthiness, and transferability, I-FGSM emerged as the preferred baseline method for adversarial attack-defense tasks in our deepfake detection framework.
In addition, accuracy experiments were also conducted in this paper under the conditions of with and without adversarial training to explore whether adversarial training would affect the performance of the model on clean samples. Based on the FG-TEFusionNet model, this study used I-FGSM to generate adversarial samples for adversarial training and systematically evaluated the performance of the deepfake detection model on four mainstream datasets (DeepFakes, Face2Face, FaceSwap, NeuralTextures). Table 8 shows the AUC and ACC metrics of deepfake face detection with and without adversarial training on the four subsets of the FaceForensics++ dataset.
As evidenced in Table 8, adversarial training substantially enhanced resistance to forgery attacks across domains, with clear improvements in the robustness metrics. Specifically, the adversarially trained model achieved significant gains in both AUC and ACC across all four benchmarks compared to the non-adversarial baseline. Notably, on Face2Face the AUC and ACC increased by 1.75% and 2.70%, respectively, confirming that gradient perturbation strategies effectively enhance inter-class discriminability in the deep feature space. The NeuralTextures dataset exhibited a different improvement pattern (AUC improved by 0.90% and ACC by 1.82%), potentially attributable to the distinct sensitivity of its neural-texture-based rendering mechanism to local gradient perturbations. Further analysis reveals that adversarial training compels the model to prioritize spatiotemporal consistency features over superficial artifact representations. On Deepfakes, the model’s sensitivity to identity-attribute tampering traces improved markedly, demonstrating the method’s effectiveness in strengthening global comprehension of cross-modal forgery semantics. In addition, based on the results in Table 8, a graph of the changes in AUC and ACC with and without adversarial training was plotted, as shown in Figure 19.
As visually demonstrated in Figure 19, the comparative experimental results across four datasets conclusively validated the optimization effects of adversarial training. The adversarially trained model exhibited significant superiority in both AUC and ACC metrics, with its performance curves consistently dominating the baseline model’s results within stable confidence bands. Particularly on the FaceSwap and Face2Face datasets, the ACC differentials reached 4.2% and 3.8%, respectively, which correlates strongly with their distinctive attack patterns: the facial geometric distortion in FaceSwap and expression parameter tampering in Face2Face enable the gradient smoothing mechanism induced by adversarial training to better enhance model adaptability against nonlinear attacks. The systematic improvement in ROC curve areas confirms the theoretical hypothesis that decision boundary optimization through adversarial training strengthens model generalization, providing strong support for enhancing the robustness and detection accuracy of models in practical applications.

6.6. Analysis of Ablation Experiments

To validate the role and effectiveness of the innovations introduced in this paper within the FG-TEFusionNet method, a quantitative analysis of the primary components of the model was conducted. To achieve this, ablation experiments were performed under consistent experimental conditions and settings to assess the impact of the fundamental components on the model’s performance, thereby verifying the effectiveness of each component. These experiments aim to provide an in-depth evaluation of the performance contribution of each component. Accordingly, this study examined the effectiveness of each module by dividing the validation into inter-frame information feature extraction networks and intra-frame information feature extraction networks.
In order to test the effectiveness of the SENet attention mechanism integrated into the SEAR-Net inter-frame information feature extraction network, tests were conducted under consistent experimental conditions. Table 9 summarizes the performance of the four different subsets of FaceForensics++ in the inter-frame information feature extraction branch.
From the ablation experiment results shown in Table 9, it is evident that the introduction of the SENet attention component significantly improved the AUC and ACC metrics of the g1 network branch of the two-stream RNN across the four different subsets of FaceForensics++ (Deepfakes, Face2Face, FaceSwap, NeuralTextures), with increases of 0.78%, 0.55%, 0.53%, and 2.15%, and 0.69%, 0.56%, 0.80%, and 0.59%, respectively. Furthermore, the AUC and ACC metrics of the inter-frame information feature extraction network, which includes both g1 and g2, also showed respective improvements of 0.68%, 0.41%, 0.21%, and 1.91%, and 0.88%, 0.94%, 0.25%, and 0.12%. This indicates that the SENet component effectively enhances the capability of extracting cross-frame information features from the input samples when geometric features are extracted, enabling the two-stream RNN network model to better learn temporal dimension forgery features.
Additionally, SENet attention was also added to the g2 network branch of the RNN, but the results were not satisfactory. The g2 network branch represents the differences in the optimal facial feature points between adjacent frames, capturing geometric feature velocity information. This method is not effective for representing long-term temporal information and can even negatively impact accuracy. Therefore, SENet attention is primarily applied to the g1 network branch of the RNN in this study to achieve the best overall network performance.
To validate the effectiveness of the Gram Block, WSDAN module, and EfficientNet within the TE-WSDAN-DDNet network model’s intra-frame information feature extraction branch, ablation experiments were conducted. The detection performance of each module was tested on four different subsets of the FaceForensics++ dataset, with the experimental results presented in Table 10.
The ablation experiment results in Table 10 demonstrate that the backbone network EfficientNetB4 of the intra-frame feature extraction branch, enhanced with the Gram Block and the WSDAN data augmentation network, effectively extracts and enhances image texture features, thereby providing assessments based on intra-frame information. Relative to SEAR-Net, FG-TEFusionNet improved the overall AUC and ACC across the FaceForensics++ subsets (Deepfakes, Face2Face, FaceSwap, NeuralTextures) by 0.29%, 0.56%, 1.95%, and 1.65%, and 1.75%, 2.75%, 3.63%, and 2.04%, respectively. This underscores that relying solely on geometric features extracted from facial landmarks for temporal information, i.e., inter-frame features alone, may neglect variations in the forged facial texture; hence, integrating an intra-frame information feature extraction branch enhances detection efficacy.

6.7. Comparative Experimental Analysis

6.7.1. Comparison Experiment

To assess the effectiveness of the proposed detection method, it was compared against existing state-of-the-art deepfake detection techniques. As mainstream evaluations usually emphasize AUC and ACC metrics, this study compares these metrics across the four subsets of the FaceForensics++ dataset with other methods. The proposed method has been thoroughly tested on this dataset. Table 11 presents the comparative results of AUC and ACC across various methods on the FaceForensics++ subsets (Deepfakes, Face2Face, FaceSwap, NeuralTextures).
Table 11 displays the AUC and ACC performance metrics of the FG-TEFusionNet method proposed in this study, compared with several leading deepfake detection methods across the four subsets of FaceForensics++. The table shows that the Long-Distance algorithm marginally outperforms our method on the Face2Face dataset, with AUC and ACC values higher by 0.56% and 0.80%, respectively. However, on the Deepfakes, FaceSwap, and NeuralTextures datasets, our approach demonstrates improvements in AUC and ACC by 0.08%, 0.03%, and 1.17%, and 0.05%, 0.84%, and 5.78%, respectively. The most substantial enhancement was observed on the NeuralTextures dataset, likely because this dataset emphasizes texture-altering forgeries, where our texture feature module excels at capturing these nuanced features, resulting in significant performance gains compared to other methods.
In addition, the experimental results demonstrate substantial performance variations of the proposed multi-feature decision fusion framework (FG-TEFusionNet) across the four deepfake categories, as detailed in Table 11. Notably, the model achieved superior detection capabilities for Deepfakes (AUC 99.87%/ACC 99.49%) and NeuralTextures (AUC 99.78%/ACC 99.03%), attributed to the synergistic operation of the dual-stream architecture. Geometric landmarks were effectively captured through 3D facial mesh reconstruction to identify the structural deviations caused by global face replacement in Deepfakes, while textural patterns were enhanced via the Gram texture feature module through gradient-based localization of synthetic artifacts. In contrast, the localized manipulations in Face2Face (AUC 99.42%) and FaceSwap (AUC 99.29%) exhibited micro-movement patterns that closely resemble authentic expressions, constraining feature discriminability within the dual-stream framework. This limitation precipitated a 0.38% accuracy decline compared to the Deepfakes detection performance.
Table 11 also highlights that, apart from the Long-Distance method, FG-TEFusionNet consistently achieves superior AUC and ACC metrics compared to all other algorithms. Specifically, in terms of AUC, this study outperformed the state-of-the-art Face X-ray and LipForensics by 0.35%, 4.88%, 6.06%, and 7.21%, and 0.33%, 0.07%, 9.26%, and 0.64% on the Deepfakes, Face2Face, FaceSwap, and NeuralTextures datasets, respectively. The results in the table demonstrate that our method maintained stable and robust detection performance across these four datasets, indicating minimal variance in detection accuracy. In contrast, LipForensics exhibited unstable detection performance, especially a significant decline on the FaceSwap dataset. Moreover, LipForensics heavily relies on fixed feature extractors pre-trained specifically for predefined tasks within the dataset, leading to substantial performance degradation when trained from scratch. In terms of ACC, improvements over the F3-Net method were observed across the four datasets: 0.87%, 3.34%, 1.68%, and 13%, respectively. Certain methods leverage frequency-domain feature extraction to enhance classification. For instance, M2TR enhances detection performance by integrating frequency-domain features into spatial information-based models, and the F3-Net method demonstrates effective performance on low-quality data using frequency-domain techniques. In comparison to the CNN-GRU method, which generally shows poorer overall detection efficacy, our approach notably enhances performance on the Face2Face, FaceSwap, and NeuralTextures forgery datasets, with the smallest gains on Deepfakes: CNN-GRU performs comparatively well on Deepfakes due to its robust handling of long-term dependencies, though it may overfit and underperform on shorter sequences. Importantly, as illustrated in Table 11, our proposed method achieves significant detection performance improvements over backbone networks such as EfficientNetB4 and LRNet, owing to the integration of texture features and guided inter-frame correlations.
The method proposed in this paper consistently outperformed other methods across all four datasets. This confirms that the multi-feature decision fusion approach proposed here compensates for deficiencies in feature extraction found in other methods, resulting in highly effective detection models that integrate both inter-frame and intra-frame network branches. In summary, this study validates the superiority of our method in the field of deepfake detection, achieving state-of-the-art performance levels. Based on comparative experimental results, we have selected several methods to illustrate the AUC and ACC values obtained on the Deepfakes, Face2Face, FaceSwap, and NeuralTextures datasets in the form of line graphs, as depicted in Figure 20.
Figure 20 presents line graphs of the AUC and ACC values tested on the four deepfake face datasets. Figure 20a and Figure 20b illustrate the ACC and AUC performance of the various methods across these datasets, respectively. The curves for our method generally lie above those of the other methods, and the corresponding AUC and ACC values are higher, indicating that combining facial geometry and texture features tailored to the deepfake datasets yields superior detection performance.
To comprehensively evaluate model reliability, a dual-metric framework incorporating both False Positive Rate (FPR) and False Negative Rate (FNR) was implemented to systematically quantify classification error patterns across four major deepfake categories: Deepfakes, Face2Face, FaceSwap, and NeuralTextures. As detailed in Table 12, the error distribution was rigorously analyzed under a fixed decision threshold (θ = 0.5).
As evidenced in Table 12, statistically significant disparities in False Positive Rate (FPR) and False Negative Rate (FNR) were observed across the four deepfake datasets. The Deepfakes category demonstrated superior error control, achieving the lowest FPR (0.32%) and FNR (0.51%), while NeuralTextures also maintained relatively low error rates (FPR = 0.41%, FNR = 0.63%). Notably, the elevated FPR in FaceSwap (1.12%, 2.1 times higher than the category average) is attributed to morphological similarities between synthetic mouth deformations and authentic smile expressions. Conversely, Face2Face exhibited heightened sensitivity to low-resolution inputs, resulting in a marked FNR of 0.89%. Analysis revealed a positive relationship between FPR and FNR, suggesting bias patterns that may originate from shared forgery feature distributions.

6.7.2. Experiments Across Datasets

In real-world scenarios, suspicious images or videos are often generated from unknown source videos using unknown algorithms. Therefore, cross-dataset transfer is crucial. To better demonstrate the effectiveness of our method, we adopted the widely used CelebDF dataset for a comprehensive evaluation. Representative algorithms in the field were trained on the FaceForensics++ dataset, and their transferability was tested on CelebDF. The test results are presented in Table 13.
Table 13 shows that the FG-TEFusionNet algorithm achieved an AUC of 72.15% when trained on the FF++ dataset and tested on the CelebDF dataset. While this cross-dataset result is not the absolute best obtained, our method still exhibits the highest AUC performance on the FF++ dataset itself. We evaluated the transferability of our framework in this setting by conducting tests on Celeb-DF. Given the substantial differences between the datasets, this remains a challenging task for most detection methods, and all methods show varying degrees of performance decline in cross-dataset tasks. Furthermore, the experimental results indicate that although our method was not explicitly designed for cross-dataset performance, its AUC still outperforms most methods and demonstrates satisfactory cross-dataset performance. SPSL, designed specifically for transferability, achieved the best result. The results in the table also affirm the robustness and generalizability of our proposed approach.

6.8. Analysis of Model Complexity

To validate the practical applicability of the proposed FG-TEFusionNet model, we conducted a comprehensive complexity evaluation from both time and space complexity perspectives. For temporal complexity optimization, the lightweight SENet module simplifies the overall network architecture by streamlining computation processes, effectively reducing floating-point operations (FLOPs) and accelerating training convergence. In terms of spatial complexity, the integration of global average pooling significantly decreases the number of model parameters, thereby lowering structural complexity while enhancing generalization capability. Notably, our framework achieves high detection performance with reduced overall complexity, accompanied by additional measurements of single-frame inference time. Table 14 summarizes the complexity comparison and single-frame inference time between the model proposed in this paper, XceptionNet, and EfficientNet.
To address the trade-off between model efficiency and performance in deepfake detection, this study conducted a comprehensive comparison between FG-TEFusionNet and mainstream baseline models (XceptionNet, EfficientNet-B4), as shown in Table 14. Although the parameter count of our model (18.6M) is similar to that of EfficientNet-B4 (19.3M), through the collaborative optimization of channel attention (the SE module) and global average pooling, FLOPs increased by only 4.3% while the inference speed improved by 26.2% (28.7 ms/frame vs. 38.9 ms/frame). In terms of performance, FG-TEFusionNet achieved an AUC of 99.83% on FaceForensics++, which is 1.1 percentage points higher than EfficientNet-B4 (98.74%) and 0.87 percentage points higher than XceptionNet (98.97%). These results indicate that a moderate, targeted investment of computational resources (such as SENet) can significantly enhance the discriminative ability for forged features; moreover, the attention and global average pooling optimizations offset the additional complexity cost and satisfy real-time detection requirements, providing a feasible solution for practical application scenarios with limited resources.
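For reproducibility of such complexity figures, the following sketch measures parameter count and per-frame latency for an arbitrary PyTorch model; the warm-up count, batch shape, and CUDA timing approach are assumptions, and the exact profiling procedure used for Table 14 may differ.

```python
import time
import torch

def profile_model(model, input_size=(1, 3, 384, 384), runs=50, device="cuda"):
    """Rough parameter count (millions) and per-frame latency (ms) measurement."""
    model = model.to(device).eval()
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                 # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    latency_ms = (time.time() - start) / runs * 1000
    return params_m, latency_ms
```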

6.9. Experimental Visualization and Analysis

The algorithm proposed in this study underwent preprocessing on four distinct subsets of FaceForensics++ before being fed into the FG-TEFusionNet network for training. Upon completion of training, it was subsequently tested on the Deepfakes, Face2Face, FaceSwap, and NeuralTextures datasets. The experimental results are visualized and analyzed using the AUC evaluation metric, with ROC curves for the AUC values obtained across the four datasets shown in Figure 21 and Figure 22.
Figure 21 and Figure 22 depict ROC curves obtained from testing on four categories of manipulated face datasets, evaluating the detection performance of our FG-TEFusionNet. The ROC-AUC values were used to assess our model’s performance across these datasets. These four AUC curves clearly demonstrate the outstanding performance of the model on different experimental datasets. The model achieved AUC values of 99.87%, 99.42%, 99.29%, and 99.78%, respectively, indicating its robust capability in detecting deepfake faces and reinforcing its overall superiority in the field. The results highlight the model’s optimal performance on the Deepfakes and NeuralTextures datasets, with good detection results on Face2Face and FaceSwap as well. This further validates the effectiveness and superiority of our approach.
Additionally, we utilized the dataset for predictions and visualized the results as shown in Figure 23. This figure displays precise classification results when provided with images of both real and fake faces, accurately distinguishing between the two types of facial images.
As shown in Figure 23, both real and manipulated fake face videos were input separately for classification prediction, and the model accurately discerned genuine from fake facial images. The figure presents two sets of predictions, one containing genuine faces and the other fake faces.

6.10. Limitations and Future Research Work

Although the method in this paper has achieved state-of-the-art performance, there are still two key limitations in our framework worthy of in-depth discussion.

6.10.1. Limitation 1: Occlusion Sensitivity in Real-World Scenarios

The model’s performance degrades significantly when processing heavily occluded faces (such as those covered by medical masks, sunglasses, or scarves). This stems from the fundamental dependence of geometric landmark calibration and texture analysis on visible facial regions: when key facial areas are occluded, both feature extraction pathways struggle to establish temporal consistency and to detect texture anomalies. This limitation reveals the inherent challenges of the current appearance-based detection paradigm in real-world occluded environments.

6.10.2. Limitation 2: Computational Constraints for High-Resolution Content

Processing 4K-resolution video streams incurs significant computational overhead, particularly in the Gram matrix calculations of the texture analysis branch. The quadratic complexity of high-dimensional feature processing creates a performance bottleneck, hindering real-time deployment in resource-constrained scenarios. This reflects the broader challenge of balancing model complexity with high-fidelity media operation efficiency.

6.10.3. Future Research Work

Based on the limitations identified in this study, future research will focus on the following aspects: (1) Occlusion-Robust Architectures: Develop attention-guided feature completion networks to dynamically reconstruct occluded regions; (2) Efficiency Optimization: Explore more lightweight feature compression and model distillation techniques to maintain accuracy while reducing complexity; (3) Cross-Cultural Generalization: Build more diverse deepfake datasets that include a wide range of ethnic features and cultural attributes; (4) Edge-Deployable Frameworks: Utilize Neural Processing Units (NPUs) to achieve hardware-aware acceleration.
These limitations highlight critical research opportunities for developing deployable deepfake defense systems. Future work will prioritize balancing robustness and efficiency without compromising detection integrity.

7. Conclusions

This paper addresses the challenges in detecting deepfake videos by proposing a multi-feature decision fusion detection method, which demonstrates its superiority across multiple datasets. Firstly, by integrating the improved IMTCNN and the diffusion model, we extract high-quality facial features from video frames, effectively reducing the impact of noise and artifacts. Secondly, the FG-TEFusionNet model proposed in this paper achieves efficient integration of inter-frame and intra-frame detection through a multi-feature decision fusion mechanism, which not only enhances detection accuracy but also strengthens the model’s generalization to different manipulation techniques. Specifically, integrating SENet attention into the LRNet framework enhances the model’s ability to capture and represent key information, thereby improving detection performance. Moreover, combining the image texture enhancement module of EfficientNet with the attention maps from WSDAN data augmentation further explores deep texture features, providing robust technical support for detecting manipulated videos. In terms of adversarial training, the model trained with I-FGSM-generated adversarial samples demonstrates good robustness. The experimental results show significant performance improvements of our method across the four subsets of the FaceForensics++ dataset, validating the effectiveness and superiority of our approach. Furthermore, cross-dataset testing on the Celeb-DF dataset demonstrates improved generalization, with notable gains in AUC. Under I-FGSM attacks, the framework enhanced adversarial robustness by reducing the average attack success rate (ASR) by 5.67%, overcoming the main limitations of single-feature detectors. The model achieved real-time detection on single-frame images (28.7 ms/frame), providing a deployable deepfake defense solution for social platforms and biometric authentication systems. Overall, this study not only provides a new perspective and solution for detecting deepfake videos but also offers valuable insights for research in information security and digital identity authentication. In the future, we will continue to explore more efficient detection methods to address the increasingly complex challenges of manipulated videos.

Author Contributions

Conceptualization, S.L.; methodology, S.L.; software, F.F.; formal analysis, A.W.; writing—original draft, S.L.; visualization, Z.Y.; supervision, J.S.; project administration, J.S.; data curation, F.F. and A.W.; writing—review and editing, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Scientific Research Program (No. 24MDY06) and Innovation Practice Project (Y2025135) and (Y2025127) of Xizang Minzu University, The National Natural Science Foundation of China (No. 62263028), and the Special Scientific Research Program Project of the Shaanxi Provincial Department of Education (24JK0690).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

I would like to express my sincere gratitude to my mentor, Song, and my fellow students for their invaluable guidance and support throughout the course of this research. Moreover, I also wish to acknowledge the Funding Agency for providing financial support for this research project. Their commitment to advancing scientific inquiry has been instrumental in the completion of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets; MIT Press: Cambridge, MA, USA, 2014.
2. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
3. Xing, H.; Li, M. Deep forgery video tampering detection based on 3D CNNS. Comput. Sci. 2021, 48, 86–92.
4. Fu, X.; Yan, Z.; Yao, T.; Fu, X.; Yan, Z. Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing. arXiv 2025, arXiv:2501.04376.
5. Yang, J.; Xiao, S.; Siddiqui, F.; Fahad, M. Enhanced deepfake detection with DenseNet and Cross-ViT. Expert Syst. Appl. 2025, 267, 126150.
6. Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; Holz, T. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3247–3258.
7. Sabir, E.; Cheng, J.; Jaiswal, A.; AbdAlmageed, W.; Masi, I.; Natarajan, P. Recurrent convolutional strategies for face manipulation detection in videos. Interfaces (GUI) 2019, 3, 80–87.
8. Gu, Z.; Chen, Y.; Yao, T.; Ding, S.; Li, J.; Huang, F.; Ma, L. Spatiotemporal inconsistency learning for deepfake video detection. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 3473–3481.
9. Kroiß, L.; Reschke, J. Deepfake Detection of Face Images based on a Convolutional Neural Network. arXiv 2025, arXiv:2503.11389.
10. Kundu, R.; Balachandran, A.; Roy-Chowdhury, A.K. TruthLens: Explainable DeepFake Detection for Face Manipulated and Fully Synthetic Data. arXiv 2025, arXiv:2503.15867.
11. Cheng, J.; Yan, Z.; Zhang, Y.; Luo, Y.; Wang, Z.; Li, C. Can We Leave Deepfake Data Behind in Training Deepfake Detector? arXiv 2024, arXiv:2408.17052.
12. Choi, J.; Kim, T.; Jeong, Y.; Baek, S.; Choi, J. Exploiting style latent flows for generalizing deepfake video detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1133–1143.
13. Cozzolino, D.; Poggi, G.; Corvi, R.; Nießner, M.; Verdoliva, L. Raising the Bar of AI-generated Image Detection with CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4356–4366.
14. Yang, T.; Wu, J.; Liu, L.; Chang, X.; Feng, G. VTD-Net: Depth face forgery oriented video tampering detection based on convolutional neural network. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 7247–7251.
15. Chinchalkar, R.; Sinha, R.; Kumar, M.; Chauhan, N.; Deokar, S.; Gonge, S. Detecting Deepfakes using CNN and LSTM. In Proceedings of the 2023 Second International Conference on Informatics (ICI), Noida, India, 23–25 November 2023; pp. 1–5.
16. Coccomini, D.A.; Messina, N.; Gennaro, C.; Falchi, F. Combining efficientnet and vision transformers for video deepfake detection. In Image Analysis and Processing–ICIAP 2022: 21st International Conference, Lecce, Italy, 23–27 May 2022; Proceedings, Part III; Springer: Cham, Switzerland, 2022; pp. 219–229.
17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
18. Chen, S.; Yao, T.; Liu, H.; Sun, X.; Ding, S.; Ji, R. Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffusion. Adv. Neural Inf. Process. Syst. 2024, 37, 101474–101497.
19. Zhao, C.; Wang, C.; Hu, G.; Chen, H.; Liu, C.; Tang, J. ISTVT: Interpretable spatial-temporal video transformer for deepfake detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1335–1348.
20. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. Mesonet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, 11–13 December 2018; pp. 1–7.
21. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5001–5010.
22. Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Zhang, W.; Yu, N. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 772–781.
23. Wang, J.; Wu, Z.; Ouyang, W.; Han, X.; Chen, J.; Lim, S.-N.; Jiang, Y.-G. M2tr: Multi-modal multi-scale transformers for deepfake detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 615–623.
24. Lu, W.; Liu, L.; Zhang, B.; Luo, J.; Zhao, X.; Zhou, Y. Detection of deepfake videos using long-distance attention. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 9366–9379.
25. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11.
26. Zhao, H.; Zhou, W.; Chen, D.; Wei, T.; Zhang, W.; Yu, N. Multi-attentional deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2185–2194.
27. Hu, J.; Pu, W.; Kong, W. A two-branch deep face replacement detection method based on multi-task learning and attention mechanism. Telecommun. Technol. 2022, 2, 1383–1390.
28. Li, J.; Xie, H.; Li, J.; Wang, Z.; Zhang, Y. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6458–6467.
29. Zheng, Y.; Bao, J.; Chen, D.; Zeng, M.; Wen, F. Exploring temporal coherence for more general video face forgery detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15044–15054.
30. Shao, R.; Wu, T.; Liu, Z. Detecting and grounding multi-modal media manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6904–6913.
31. Hu, T.; Qi, H.; Huang, Q.; Lu, Y. See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv 2019, arXiv:1901.09891.
32. Sun, Z.; Han, Y.; Hua, Z.; Ruan, N.; Jia, W. Improving the efficiency and robustness of deepfakes detection through precise geometric features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3609–3618.
33. Gatys, L.; Ecker, A.S.; Bethge, M. Texture synthesis using convolutional neural networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1880.
34. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 99–112.
  35. Cheng, G.; Sun, Q.; Bie, Y. Mandatory lane-changing modelling based on a game theoretic approach in traditional and connected environments. Transp. Saf. Environ. 2023, 5, tdac035. [Google Scholar] [CrossRef]
  36. Chen, X.; Hu, R.; Luo, K.; Wu, H.; Biancardo, S.A.; Zheng, Y.; Xian, J. Intelligent ship route planning via an A∗ search model enhanced double-deep Q-network. Ocean. Eng. 2025, 327, 120956. [Google Scholar] [CrossRef]
  37. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3204–3213. [Google Scholar] [CrossRef]
  38. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  39. Rubner, Y.; Tomasi, C.; Guibas, L.J. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
  40. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  41. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  42. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  43. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using capsule networks to detect forged images and videos. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2307–2311. [Google Scholar]
  44. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII. Springer International Publishing: Cham, Switzerland, 2020; pp. 86–103. [Google Scholar]
  45. Haliassos, A.; Vougioukas, K.; Petridis, S.; Pantic, M. Lips don’t lie: A generalizable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5039–5049. [Google Scholar]
  46. Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-Stream Neural Networks for Tampered Face Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1839. [Google Scholar] [CrossRef]
  47. Masi, I.; Killekar, A.; Mascarenhas, R.M.; Gurudatt, S.P.; AbdAlmageed, W. Two-branch recurrent network for isolating deepfakes in videos. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 667–684. [Google Scholar]
Figure 1. The framework of the deepfake face detection and adversarial attack defense method based on multi-feature decision fusion. (The framework comprises four end-to-end workflows: an input preprocessing module (denoising/artifact removal), inter-frame and intra-frame feature extraction, decision fusion, and adversarial attack and defense.)
Figure 2. The classification diagram of deepfake technology.
Figure 3. The classification diagram of deepfake detection methods.
Figure 4. Structure of the SENet attention mechanism network.
Figure 5. Structure of the Gram image texture enhancement module.
Figure 6. Structure of the improved MTCNN model.
Figure 7. Structure of the convolutional residual module.
Figure 8. Directed graphical modeling diagram for diffusion models.
Figure 9. The framework of the SEAR-Net model.
Figure 10. Schematic diagram of the facial landmarks calibration module.
Figure 11. Framework diagram of TE-WSDAN-DDNet incorporating image texture enhancement.
Figure 12. Component structure of the MBConv module.
Figure 13. Adversarial training approach for adversarial attack sample defense.
Figure 14. Sample examples from the FaceForensics++ dataset.
Figure 15. The Landmark method is employed to annotate and track facial landmarks in two sets of sample data. (The second column shows facial landmark annotation, and the third column depicts landmark tracking.).
Figure 16. Qualitative results of motion blur denoising (left) and Gaussian blur denoising (right) using the diffusion model on samples from the FaceForensics++ dataset.
Figure 17. Examples of I-FGSM adversarial samples from the four forged datasets.
Figure 18. The attack success rate (ASR) variations of the I-FGSM adversarial attacks under different perturbation budgets (ε) across four deepfake datasets.
Figure 19. Graph of the changes in AUC and ACC on different datasets with and without adversarial training.
Figure 20. Line graphs comparing the AUC and ACC of various specific algorithms across different datasets. (a) Line graphs of ACC for different methods across four datasets. (b) Line graphs of AUC for different methods across four datasets.
Figure 21. The ROC curves were obtained from testing on the Deepfakes (a) and Face2Face (b) datasets.
Figure 22. The ROC curves were obtained from testing on the FaceSwap (a) and NeuralTextures (b) datasets.
Figure 23. Predicted results on the four datasets.
Table 1. Meaning of parameters in the experimental section.
Parameter | Meaning | Setting
Gram_{i,j}^l | The Gram matrix | —
batch_size1, batch_size2, batch_size3 | Training batch sizes of branch networks g1, g2 and g3 | 2048, 32 and 32
Image_size | Input image size | 384 × 384
Lr_scheduler | Learning-rate scheduling strategy | Exponential LR
ε | Perturbation budget of the adversarial attack | 16/255
lr1, lr2, lr3 | Learning rates of branch networks g1, g2 and g3 | 0.002, 0.002 and 0.01
epoch1, epoch2, epoch3 | Training epochs of branch networks g1, g2 and g3 | 1000, 800 and 100
optimizer | Parameter optimizer | Adam
a | Attention-map threshold of WSDAN | 0.45
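As a reading aid for Table 1, the sketch below wires the listed hyperparameters into three PyTorch optimizers for the branch networks g1, g2 and g3. Only the batch sizes, learning rates, epochs, image size, ε, and the Adam/exponential-decay choices come from the table; the tiny linear placeholders and the decay factor gamma = 0.98 are assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Placeholder branch networks; the real branches are the inter-frame and
# intra-frame models described in the paper.
g1, g2, g3 = nn.Linear(10, 2), nn.Linear(10, 2), nn.Linear(10, 2)

# Hyperparameters taken from Table 1.
cfg = {
    "g1": {"batch_size": 2048, "lr": 0.002, "epochs": 1000},
    "g2": {"batch_size": 32,   "lr": 0.002, "epochs": 800},
    "g3": {"batch_size": 32,   "lr": 0.01,  "epochs": 100},
}
image_size = (384, 384)      # input image size
epsilon = 16 / 255           # I-FGSM perturbation budget
wsdan_threshold = 0.45       # WSDAN attention-map threshold a

optimizers, schedulers = {}, {}
for name, net in {"g1": g1, "g2": g2, "g3": g3}.items():
    opt = torch.optim.Adam(net.parameters(), lr=cfg[name]["lr"])
    # Table 1 only states an "Exponential LR" strategy; gamma = 0.98 is assumed.
    sch = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.98)
    optimizers[name], schedulers[name] = opt, sch
```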
Table 2. Parameter settings in the experimental section.
Algorithm Model | Parameter Setting
SENet | z_c = F_squ(u_c), s_c = F_ex(z_c, W), F_scale(u_c, s_c)
IMTCNN | k_Conv5×5 = 5 × 5
GramBlock | F^l_{ik}: activation of filter i at position k in layer l; Gram^l_{i,j} = Σ_k F^l_{ik} F^l_{jk}
WSDAN | θ_c = 0.45, global average pooling = 1 × 1
DDPM | s_θ(x_t, t) ≈ −ε_θ(x_t, t)/√(1 − ᾱ_t), ε ∼ N(0, 1), q(x_t | x_{t−1})
EfficientNet | k_DSC = 3 × 3 or 5 × 5, k_eff = 3 × 3, stride = 2, p_MBConv1 = 1, p_MBConv6 = 6
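The SENet row of Table 2 compresses the squeeze, excitation, and rescaling steps into three symbols. Below is a minimal PyTorch squeeze-and-excitation block in that spirit; the reduction ratio of 16 and the FC–ReLU–FC–Sigmoid excitation layout are standard SENet choices assumed here, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: z_c = F_squ(u_c), s_c = F_ex(z_c, W), output = F_scale(u_c, s_c)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # F_squ: global average pooling
        self.excite = nn.Sequential(                    # F_ex: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)                  # channel descriptor z_c
        s = self.excite(z).view(b, c, 1, 1)             # channel weights s_c
        return u * s                                    # F_scale: reweight the feature maps

x = torch.randn(2, 64, 48, 48)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 48, 48])
```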
Table 3. AUC results (%) from testing the four FaceForensics++ subsets on the EfficientNet network branch.
Measurement | Deepfakes | Face2Face | FaceSwap | NeuralTextures
q = 1 | 98.75 | 97.93 | 98.71 | 97.43
q = 2.5 | 98.99 | 98.54 | 99.02 | 98.49
q = 3 | 99.50 | 98.87 | 99.24 | 98.75
q = 3.5 | 99.08 | 98.24 | 98.57 | 98.05
q = 4 | 98.33 | 96.74 | 96.36 | 95.78
Table 4. Detection AUC results (%) obtained from training EfficientNet with the Gram L2 loss and the EMD loss.
Loss Function | Deepfakes | Face2Face | FaceSwap | NeuralTextures
Gram L2 Loss | 99.50 | 98.87 | 99.24 | 98.75
EMD Loss | 96.78 | 95.43 | 96.48 | 95.97
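Table 4 favours the Gram L2 loss over the EMD loss when training the EfficientNet branch. A minimal sketch of the Gram matrix from Table 2 (Gram^l_{i,j} = Σ_k F^l_{ik} F^l_{jk}) and of an L2 distance between the Gram matrices of two feature maps is shown below; the normalisation by the number of feature elements is an assumption.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Gram_{i,j} = sum_k F_{ik} F_{jk} over flattened spatial positions."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)                 # F_{ik}: channel i, spatial position k
    gram = torch.bmm(f, f.transpose(1, 2))     # (b, c, c)
    return gram / (c * h * w)                  # normalisation (assumed)

def gram_l2_loss(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Mean squared L2 distance between the Gram matrices of two feature maps."""
    return torch.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2)

fa, fb = torch.randn(4, 32, 24, 24), torch.randn(4, 32, 24, 24)
print(gram_l2_loss(fa, fb).item())
```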
Table 5. Results of the diffusion model across the four subsets of the FaceForensics++ dataset.
Metric | Deepfakes | Face2Face | FaceSwap | NeuralTextures | Average
PSNR (dB) | 35.71 | 36.44 | 38.36 | 38.93 | 37.31
LPIPS | 0.2218 | 0.2203 | 0.2175 | 0.2154 | 0.2154
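The PSNR and LPIPS values in Table 5 can be measured as in the sketch below, which assumes 8-bit RGB inputs and uses the lpips package with an AlexNet backbone; it illustrates how such metrics are computed rather than reproducing the authors' evaluation code.

```python
import numpy as np
import torch
import lpips  # pip install lpips

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for two same-shaped uint8 images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

loss_fn = lpips.LPIPS(net="alex")  # learned perceptual image patch similarity

def lpips_score(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """LPIPS expects NCHW tensors scaled to [-1, 1]."""
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        return float(loss_fn(to_tensor(img_a), to_tensor(img_b)))

a = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
b = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
print(psnr(a, b), lpips_score(a, b))
```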
Table 6. Results of adversarial attacks using I-FGSM.
Metric | Deepfakes | Face2Face | FaceSwap | NeuralTextures
L̂_l1 | 0.0039 | 0.0042 | 0.0040 | 0.0041
ASR (%) | 99.87 | 99.65 | 100.00 | 99.89
Adv-ASR (%) | 89.72 | 92.58 | 96.77 | 98.65
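The attacks summarised in Table 6 follow the iterative FGSM formulation of Kurakin et al. [34]. A minimal PyTorch sketch is given below; ε = 16/255 matches Table 1, while the step size α = 2/255 and the 10 iterations are assumptions.

```python
import torch
import torch.nn.functional as F

def ifgsm_attack(model, x, y, eps=16 / 255, alpha=2 / 255, steps=10):
    """Iterative FGSM: step along the gradient sign, then clip back into the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()      # gradient-sign step
        x_adv = torch.clamp(x_adv, x - eps, x + eps)      # project into the eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)              # keep a valid image
    return x_adv.detach()

# Toy usage with a stand-in classifier.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
x, y = torch.rand(4, 3, 64, 64), torch.randint(0, 2, (4,))
x_adv = ifgsm_attack(model, x, y)
```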
Table 7. Results of adversarial attacks using PGD.
Metric | Deepfakes | Face2Face | FaceSwap | NeuralTextures
L̂_l1 | 0.0072 | 0.0085 | 0.0069 | 0.0063
ASR (%) | 99.91 | 99.83 | 99.95 | 99.92
Adv-ASR (%) | 82.14 | 85.39 | 90.25 | 93.47
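PGD, used for Table 7, differs from I-FGSM mainly in starting from a random point inside the ε-ball before applying the same projected gradient-sign updates [42]. A compact sketch under the same assumed step size and iteration count:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=16 / 255, alpha=2 / 255, steps=10):
    """PGD: like I-FGSM but initialised at a random point inside the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = torch.clamp(x_adv.detach() + alpha * grad.sign(), x - eps, x + eps)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```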
Table 8. AUC and ACC (%) of deepfake face detection with and without adversarial training on the four datasets.
Method | Deepfakes AUC / ACC | Face2Face AUC / ACC | FaceSwap AUC / ACC | NeuralTextures AUC / ACC
No adversarial training | 98.95 / 98.74 | 97.67 / 96.48 | 97.54 / 95.98 | 98.88 / 97.21
Adversarial training | 99.87 / 99.49 | 99.42 / 99.18 | 99.29 / 99.11 | 99.78 / 99.03
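The adversarial-training gains in Table 8 come from retraining the detector on a mixture of clean and adversarial samples. The sketch below shows one such training step using an attack function like the I-FGSM sketch above; the 50/50 loss weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, attack, clean_weight=0.5):
    """One optimisation step on a mix of clean and adversarial samples (50/50 weighting assumed)."""
    model.train()
    x_adv = attack(model, x, y)                   # e.g. ifgsm_attack defined earlier
    optimizer.zero_grad()
    loss = (clean_weight * F.cross_entropy(model(x), y)
            + (1 - clean_weight) * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```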
Table 9. Performance comparison on the four subsets of FaceForensics++ for the inter-frame branch (four configurations combining g1, g1 + g2, and the SENet attention module).
Dataset | AUC (%), configurations 1–4 | ACC (%), configurations 1–4
Deepfakes | 97.23 / 98.01 / 98.90 / 99.58 | 85.64 / 86.33 / 96.86 / 97.74
Face2Face | 97.69 / 98.24 / 98.45 / 98.86 | 85.43 / 85.99 / 95.49 / 96.43
FaceSwap | 96.39 / 96.92 / 97.92 / 98.13 | 84.27 / 85.07 / 96.74 / 96.99
NeuralTextures | 92.19 / 94.34 / 95.43 / 97.34 | 84.76 / 85.35 / 95.36 / 95.48
Table 10. Performance comparison on the four subsets of FaceForensics++ for the intra-frame branch (nine configurations combining SEAR-Net, EfficientNetB4, the Gram Block, and WSDAN).
AUC (%), configurations 1–9:
Deepfakes | 99.58 / 99.28 / 99.62 / 99.50 / 99.78 / 99.32 / 99.57 / 99.79 / 99.87
Face2Face | 98.86 / 97.63 / 98.92 / 98.87 / 99.23 / 98.28 / 98.48 / 98.83 / 99.42
FaceSwap | 97.34 / 98.09 / 99.34 / 99.24 / 99.03 / 97.68 / 98.25 / 99.14 / 99.29
NeuralTextures | 98.13 / 98.64 / 98.99 / 98.75 / 99.24 / 98.50 / 98.77 / 99.03 / 99.78
ACC (%), configurations 1–9:
Deepfakes | 97.74 / 96.63 / 97.52 / 97.28 / 98.93 / 97.93 / 98.09 / 99.15 / 99.49
Face2Face | 96.43 / 95.49 / 97.43 / 96.35 / 98.87 / 97.48 / 97.69 / 98.47 / 99.18
FaceSwap | 95.48 / 96.39 / 97.73 / 97.24 / 98.85 / 97.21 / 97.47 / 98.94 / 99.11
NeuralTextures | 96.99 / 97.53 / 97.84 / 97.43 / 98.82 / 97.35 / 97.25 / 98.91 / 99.03
Table 11. Comparison of AUC and ACC results (%) across different methods on the four subsets of the FaceForensics++ dataset. Each cell lists AUC / ACC; methods with a single value per subset report only one of the two metrics.
Method | Deepfakes | Face2Face | FaceSwap | NeuralTextures | Year
MesoNet [20] | 87.27 | 56.23 | 61.17 | 40.67 | 2018
Capsule [43] | 96.01 / 95.68 | 94.15 / 92.91 | 91.89 / 92.85 | 89.55 / 87.35 | 2019
CNN+GRU [7] | 97.62 / 86.26 | 85.80 / 82.04 | 86.64 / 77.31 | 79.62 / 76.45 | 2019
XceptionNet [13] | 98.68 / 95.35 | 93.77 / 83.48 | 98.82 / 98.84 | 97.18 / 95.84 | 2019
EfficientNetB4 [18] | 99.28 / 96.63 | 97.63 / 95.49 | 99.34 / 96.38 | 98.99 / 97.53 | 2019
F3-Net [44] | 98.62 | 95.84 | 97.43 | 86.03 | 2020
Face X-ray [21] | 99.52 | 94.54 | 93.23 | 92.57 | 2020
SPSL [22] | 98.50 / 93.48 | 94.62 / 86.02 | 98.10 / 92.26 | 84.23 / 76.78 | 2021
LipForensics-Scratch [45] | 93.05 | 98.76 | 58.76 | 98.34 | 2021
LipForensics [45] | 99.54 | 99.35 | 90.03 | 99.14 | 2021
LRNet [32] | 98.90 / 95.83 | 98.45 / 94.79 | 97.92 / 93.87 | 95.43 / 92.54 | 2021
M2TR [23] | 98.65 / 94.23 | 97.64 / 93.54 | 96.74 / 94.97 | 95.47 / 93.39 | 2022
Long-Distance [24] | 99.79 / 99.44 | 99.98 / 99.98 | 99.26 / 98.27 | 98.61 / 93.25 | 2023
FG-TEFusionNet (Ours) | 99.87 / 99.49 | 99.42 / 99.18 | 99.29 / 99.11 | 99.78 / 99.03 | 2024
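The FG-TEFusionNet row of Table 11 fuses the inter-frame and intra-frame branch decisions by voting. The sketch below shows one plausible score-level voting scheme; the 0.5 threshold, the tie-breaking rule, and the equal branch weighting are assumptions rather than the authors' exact fusion rule.

```python
import numpy as np

def vote_fusion(branch_probs, threshold=0.5):
    """Fuse per-branch fake probabilities by majority vote; ties fall back to the mean score.

    branch_probs: array-like of shape (n_branches, n_samples) holding P(fake) per branch.
    """
    probs = np.asarray(branch_probs)
    votes = (probs >= threshold).astype(int)          # each branch votes real (0) / fake (1)
    vote_share = votes.mean(axis=0)
    fused = np.where(vote_share != 0.5, vote_share > 0.5, probs.mean(axis=0) > 0.5)
    return fused.astype(int)

inter_frame = np.array([0.91, 0.12, 0.55])   # hypothetical branch outputs
intra_frame = np.array([0.88, 0.40, 0.35])
print(vote_fusion([inter_frame, intra_frame]))  # [1 0 0]
```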
Table 12. Classification error distribution (FPR/FNR) of the proposed model across the four deepfake datasets.
Dataset | FPR (%) | FNR (%) | AUC (%) | ACC (%)
Deepfakes | 0.32 | 0.51 | 99.87 | 99.49
Face2Face | 0.89 | 0.92 | 99.42 | 99.18
FaceSwap | 1.12 | 1.05 | 99.29 | 99.11
NeuralTextures | 0.45 | 0.62 | 99.78 | 99.03
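The FPR, FNR, AUC, and ACC figures of Table 12 can be reproduced from per-sample fake scores with standard scikit-learn utilities. The sketch below is a generic evaluation helper, not the authors' script, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

def error_report(y_true, y_score, threshold=0.5):
    """Return FPR, FNR, AUC and ACC (all in %) from ground-truth labels and fake scores."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = 100.0 * fp / (fp + tn)
    fnr = 100.0 * fn / (fn + tp)
    auc = 100.0 * roc_auc_score(y_true, y_score)
    acc = 100.0 * accuracy_score(y_true, y_pred)
    return fpr, fnr, auc, acc

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.65, 0.3, 0.55]
print(error_report(y_true, y_score))
```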
Table 13. AUC results (%) of transferability testing on the Celeb-DF dataset across different methods.
Method | FaceForensics++ (%) | Celeb-DF (%) | Year
Two-Stream [46] | 70.10 | 53.90 | 2017
MesoNet [20] | 84.70 | 54.80 | 2018
Capsule [43] | 96.60 | 57.50 | 2019
XceptionNet [13] | 98.97 | 65.30 | 2019
Two-Branch [47] | 93.18 | 73.41 | 2020
F3-Net [44] | 97.97 | 65.17 | 2020
SPSL [22] | 96.94 | 76.88 | 2021
Multi-attentional [26] | 99.80 | 67.64 | 2021
M2TR [23] | 98.50 | 65.70 | 2022
Long-Distance [24] | 99.79 | 70.33 | 2023
FG-TEFusionNet (Ours) | 99.83 | 72.15 | 2025
Table 14. Comparison of model complexity and inference time between the proposed method and the baseline models.
Model | FLOPs (M) | Parameters (M) | Inference Time (ms/frame)
XceptionNet [13] | 22.5 | 20.8 | 45.2
EfficientNetB4 [18] | 18.7 | 19.3 | 38.9
FG-TEFusionNet (Ours) | 19.5 | 18.6 | 28.7
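The parameter counts and per-frame inference times in Table 14 can be estimated as sketched below. The torchvision EfficientNet-B4 backbone stands in for the actual detector (an assumption), FLOPs counting is omitted because it requires an external profiler such as thop or fvcore, and the measured latency depends heavily on hardware.

```python
import time
import torch
import torchvision

model = torchvision.models.efficientnet_b4(weights=None).eval()   # stand-in backbone

# Parameter count in millions.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# Average per-frame latency over repeated forward passes at the 384x384 input size.
x = torch.randn(1, 3, 384, 384)
with torch.no_grad():
    for _ in range(5):                       # warm-up iterations
        model(x)
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    ms_per_frame = 1000.0 * (time.perf_counter() - start) / runs

print(f"{params_m:.1f} M parameters, {ms_per_frame:.1f} ms/frame")
```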