Towards Generalizable Deepfake Detection via Facial Landmark-Guided Convolution and Local Structure Awareness

Chen, Hao; Zhang, Zhengxu; Li, Qin; Feng, Chunhui

doi:10.3390/a19040270


¹ Center for Agroforestry Mega Data Science, School of Future Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China

² College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(4), 270; https://doi.org/10.3390/a19040270

Submission received: 12 February 2026 / Revised: 3 March 2026 / Accepted: 13 March 2026 / Published: 1 April 2026


Abstract

As deepfakes become increasingly realistic, there is a growing need for robust and highly accurate facial forgery detection algorithms. Existing studies show that global feature modeling approaches (Transformer, VMamba) are effective in capturing long-range dependencies, yet they often lack sufficient sensitivity to localized facial tampering artifacts. Meanwhile, traditional convolutional methods excel at extracting local image features but struggle to incorporate prior knowledge about facial anatomy, resulting in limited representational capability. To address these limitations, this paper proposes LGMamba, a detection framework that combines global modeling with facial guidance focused on the key facial components and fine-grained detail regions commonly manipulated in deepfakes. First, we introduce Landmark-Guided Convolution (LGConv), which adaptively adjusts convolutional sampling positions using facial landmark information. This allows the model to attend to forgery-prone facial regions, such as the eyes and mouth. Second, we design a Facial Structure Awareness Block (FSAB) that operates in parallel with the VMamba-based visual State-Space Model. Equipped with a multi-stage residual design and a CBAM attention mechanism, FSAB enhances the model’s sensitivity to subtle facial artifacts, enabling joint exploitation of global semantic consistency and fine-grained forgery cues within a unified architecture. In cross-dataset evaluations, LGMamba attains AUC scores of 92.34% on CD1 and 96.01% on CD2, outperforming all compared methods.

Keywords: face forgery detection; landmark-guided convolution; facial structure awareness; artificial intelligence

1. Introduction

The rapid advancement of Generative Adversarial Networks (GANs) [1,2] and diffusion models [3,4] has greatly improved the realism of synthetic facial images and videos. Modern deepfake techniques can now generate content that is nearly indistinguishable from real media, even for human observers. This capability poses serious risks to personal privacy, social trust, and public security. As a result, developing accurate and robust deepfake detection algorithms has become a pressing challenge in both academia and industry. Facial forgery detection is typically formulated as a binary classification problem (real vs. fake) [5,6,7]. Existing approaches can be broadly grouped into four categories: (1) Physiological-based methods, which rely on interpretable signals such as eye blinking or head motion [8,9]; (2) Pixel-level statistical methods, which detect inconsistencies in compression, resampling, or color distribution [10]; (3) Deep feature learning, which leverages end-to-end CNNs or frequency-aware modules to extract complex visual forensics cues [11,12,13,14,15]; and (4) Video-level modeling, which uses RNNs or optical flow to detect temporal artifacts across frames [16,17].

Despite these advances, generalization to unseen datasets or manipulation methods remains a major challenge. For instance, on cross-dataset evaluations like Celeb-DF, most models suffer significant performance drops, limiting their real-world applicability [18,19,20,21,22].

To improve generalization in deepfake detection, researchers have explored enhancements along two major directions: data and model design. On the data side, image mixing and augmentation strategies are used to encourage the learning of generator-invariant artifacts, such as boundary inconsistencies or regional texture shifts [1,23]. On the model side, approaches incorporate cues like temporal consistency—motivated by frame-by-frame generation artifacts such as flicker or drift [24,25,26]—or region-specific signals such as lip motion and audio-visual mismatches [27]. However, each of these cues can be limited in scope or effectiveness when applied across diverse manipulation types.

In general, existing efforts fall into two strategic paradigms: (i) crafting stable features under strong assumptions (e.g., blending boundaries, high-frequency noise) [28,29] and (ii) employing transfer learning techniques like domain adaptation or meta-learning to mitigate distribution shifts [30,31]. Yet, the former can fail under complex post-processing or unseen forgery styles, while the latter often introduces high training complexity and limited scalability.

At the feature representation level, key bottlenecks persist. Pixel and frequency cues highlight global inconsistencies but are less sensitive to subtle manipulations around lips, eyes, or facial contours. Geometric priors can capture shape distortions but are difficult to integrate end-to-end. Global modeling structures such as VMamba offer efficient long-range modeling [32], but still lack sensitivity to localized artifacts—which are critical for robust detection.

While traditional deformable convolution methods (e.g., DCN [33]) introduce learnable offsets to enhance geometric flexibility, these offsets are inferred purely from appearance features and lack semantic guidance. In deepfake scenarios, such unconstrained sampling may fail to focus on semantically meaningful regions. In contrast, our proposed LGConv directly generates sampling offsets from facial landmarks, which serve as strong structural priors. This design enables more stable and targeted feature extraction from forgery-prone facial regions, such as the eyes and mouth.

To this end, we propose a novel LGMamba framework. The main contributions of this paper are summarized as follows:

(1) We propose Landmark-Guided Convolution (LGConv), which generates facial landmark-guided offsets to adjust the sampling positions of the convolutional kernel. This allows LGConv to respond specifically to semantically meaningful facial regions prone to forgery artifacts, such as the areas around the eyes, lips, and nose wings.
(2) We propose a Facial Structure Awareness Block (FSAB) that enhances the detection of local and fine-grained deepfake artifacts. By incorporating residual connections and a Convolutional Block Attention Module (CBAM) for adaptive feature recalibration, FSAB effectively fuses the landmark-guided local features extracted by LGConv with the global representations from the backbone, thereby improving the model’s sensitivity to subtle facial forgery cues.
(3) Extensive evaluations on multiple mainstream benchmark datasets show that our method achieves state-of-the-art detection performance in both intra-dataset and cross-dataset evaluations. In particular, on the highly challenging CD1, CD2, and DFDCP datasets, our method obtains results superior to existing methods, validating its strong generalization ability in complex real-world scenarios.

The remainder of this paper is organized as follows: Section 2 reviews related work on deepfake detection and deformable convolution techniques. Section 3 details the proposed LGMamba framework and its key components. Section 4 presents the experimental setup, results, and ablation studies. Section 5 concludes the paper and discusses future research directions.

2. Related Works

To provide a clearer contextual foundation for our proposed method, this section reviews existing facial forgery detection approaches from three perspectives: global facial consistency, structural landmark guidance, and local forgery trace modeling.

2.1. Methods Based on Overall Facial Consistency

Early deepfake detection methods primarily relied on global appearance cues—such as texture disruptions and boundary inconsistencies—modeled via CNN-based classifiers [11,12,19,34]. While effective under controlled conditions, these methods generalize poorly to cross-dataset or heavily post-processed content due to limited robustness.

To enhance global consistency modeling, subsequent studies introduced low-level priors such as color and frequency statistics [35,36], or explored semantic coherence between facial regions and contextual backgrounds [37]. These works highlighted the need for capturing inconsistencies that extend beyond pixel-level artifacts.

More recent efforts focused on improving generalization through representation regularization, such as using perturbations, token-level mixing, or multi-source training to suppress dataset-specific biases [38,39,40]. Others applied information-theoretic constraints or attention mechanisms to avoid overfitting to non-generalizable textures [41].

With the rise of Vision Transformers, global modeling has gained traction. However, the quadratic complexity of self-attention hinders fine-grained processing. Visual state-space models like VMamba offer a linear-complexity alternative, enabling global reasoning at higher resolutions. For example, WMamba integrates VMamba with wavelet-domain cues [42], while MSER-Net enhances VMamba with edge refinement modules [43].

These approaches demonstrate the strength of global modeling but often overlook explicit localization of facial forgeries. Most VMamba-based models prioritize holistic context without incorporating structured spatial priors—an important gap that our proposed LGMamba addresses through landmark-guided sampling and facial structure-aware enhancement.

2.2. Methods Based on Facial Landmark Structural Features

Facial landmark-based methods utilize the explicit geometric structure of faces to extract transferable cues related to shape, pose, and motion. In generation tasks, landmark points are often used to establish semantic correspondences between source and target faces for motion transfer [44,45]. For example, X2Face transfers expressions and poses between identities in a frame-wise manner [46]. To address large pose variations, later works integrated 3D landmark detectors and dual-generator architectures to improve geometric stability [47,48].

In contrast, our proposed LGMamba framework repurposes landmarks from geometric alignment tools into semantic priors for detection. Specifically, we introduce Landmark-Guided Convolution, which adaptively modulates sampling positions based on landmark locations, guiding attention toward forgery-prone facial regions such as the eyes and mouth. This approach transforms structural facial cues into an active component of the feature learning process, enhancing spatial focus and regional sensitivity for detecting subtle manipulations.

2.3. Methods Based on Local Facial Forgery Traces

Unlike facial-consistency-based methods, local forgery detection approaches aim to identify subtle artifacts in specific facial regions—such as the eyes, lips, and contours—that may evade global modeling. One line of work exploits biological signals, under the premise that deepfakes often fail to replicate natural human behavior. For instance, early studies proposed blink-frequency-based indicators [8], while others used lip-reading models to assess the mismatch between visual lip movements and audio content [27].

Another common strategy targets local texture inconsistencies and high-frequency artifacts. Face X-Ray [1] treats forgery as a blending problem and generates response maps to highlight fusion boundaries. Gram-Net models global texture statistics through Gram matrices, while other methods fuse high-pass filtered residuals with RGB features to emphasize manipulation cues around edges and fine details [49].

Despite their effectiveness under specific conditions, these methods often suffer from limited generalization. For example, blink-based detection fails if the tampered region excludes the eyes, and handcrafted filtering strategies may overfit to specific forgery types.

In contrast, our method introduces facial landmarks as semantic priors to guide attention toward regions likely to contain forgeries. Combined with a global–local dual-branch design, LGMamba enhances both spatial focus and semantic invariance, improving robustness across diverse manipulation types and datasets.

2.4. Deformable Convolution

To enhance the spatial adaptability of convolutional neural networks, deformable convolution techniques have been extensively explored. Early methods like DCN [33] introduced learnable sampling offsets to overcome the constraints of fixed receptive fields, yielding significant improvements in object detection and segmentation.

Building on this, later approaches proposed more sophisticated mechanisms. WTConv [50] employs multi-scale wavelet transforms to capture texture-rich regions, while DSConv [51] enforces geometric constraints using B-spline-based sampling paths, particularly effective in modeling smooth anatomical structures such as vessels.

Although these methods demonstrate strong capabilities in general vision tasks, their application to deepfake detection remains limited. Specifically, techniques like WTConv and DSConv focus on edge or topology-aware modeling but lack explicit guidance for facial semantics. In deepfake scenarios, forged regions often exhibit subtle, spatially sensitive distortions. Relying solely on data-driven offset learning can lead to misaligned sampling in key facial areas, reducing the model’s sensitivity to localized forgeries.

These limitations motivate our use of facial landmarks as structural priors to enhance semantic alignment in spatial sampling, bridging the gap between flexible receptive fields and forgery-aware attention.

3. Methods

3.1. Overview

Our framework is illustrated in Figure 1. First, the input image $I \in \mathbb{R}^{H \times W \times 3}$ undergoes facial landmark detection to obtain 68 facial landmarks. Next, the image is passed through a stem module and partitioned into a 2D feature map of size $H/4 \times W/4$. Without any additional positional embeddings, the subsequent network stages produce hierarchical representations at resolutions $H/8 \times W/8$, $H/16 \times W/16$, and $H/32 \times W/32$. Each stage except the first begins with a downsampling layer, followed by multiple Visual State Space (VSS) Blocks and Facial Structure Awareness Block (FSAB) modules connected in parallel. The feature maps are thus processed in two parallel branches, one performing global state-space modeling and the other local feature extraction, and the branch outputs are fused via element-wise addition to serve as input to the next stage. After four stages, a classifier outputs the probability of the image being real or fake.
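As a rough illustration of the resolution hierarchy described above, the following sketch computes the feature-map size at each of the four stages and shows the element-wise fusion of two branch outputs. The function names `stage_resolutions` and `fuse` are hypothetical and chosen only for this example; the sketch assumes a stem stride of 4 and a factor-of-2 downsampling in every later stage, as the stated resolutions imply.

```python
def stage_resolutions(height, width, num_stages=4, stem_stride=4):
    """Return the (H, W) of the feature map produced by each stage.

    The stem reduces the input by `stem_stride`; every later stage
    halves the resolution with its downsampling layer.
    """
    sizes = []
    stride = stem_stride
    for _ in range(num_stages):
        sizes.append((height // stride, width // stride))
        stride *= 2
    return sizes


def fuse(global_feat, local_feat):
    """Element-wise addition of two branch outputs (nested lists of equal shape)."""
    return [[g + l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_feat, local_feat)]


print(stage_resolutions(224, 224))  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

For the 224 × 224 inputs used in the experiments, this gives 56, 28, 14, and 7 pixels per side across the four stages.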

3.2. VMamba and Facial Structure Awareness Block

In deepfake detection, accurately identifying fake content requires not only modeling the image’s overall semantic consistency but also capturing localized facial artifact traces. However, current mainstream architectures show notable limitations in global modeling: CNNs struggle to establish long-range dependencies due to their limited receptive fields, and although Swin Transformer provides regional interaction, its windowed self-attention mechanism restricts effective global context integration. Full (global) self-attention avoids this restriction, but its computational complexity grows quadratically with the number of image tokens, hindering efficient application to high-resolution face images. For this reason, we adopt VMamba as our backbone network.

While VMamba effectively captures long-range dependencies, its global modeling paradigm may exhibit limited sensitivity to subtle, localized artifacts. Therefore, in this work, we preserve VMamba’s strong global representation capability and introduce the Facial Structure Awareness Block (FSAB) to enhance the model’s sensitivity to fine-grained facial details.

VMamba is a visual backbone based on a State-Space Model (SSM) that achieves global long-range dependency modeling with linear computational complexity. The core idea of an SSM is to map a one-dimensional input sequence $x(t) \in \mathbb{R}$ to an output $y(t) \in \mathbb{R}$ through a latent hidden state $h(t) \in \mathbb{R}^{N}$, described by the following continuous-time equations:

$$\begin{aligned} h'(t) &= A h(t) + B x(t) \\ y(t) &= C h(t) + D x(t) \end{aligned} \tag{1}$$

Here, $A \in \mathbb{R}^{N \times N}$ is the state matrix governing the system’s time evolution, and $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$, and $D \in \mathbb{R}^{1 \times 1}$ are the input, output, and feedforward projection parameters, respectively. To enable efficient computation on modern hardware, VMamba uses the zero-order hold (ZOH) method to convert the above continuous system into discrete form:

$$\begin{aligned} h_{t} &= \bar{A} h_{t-1} + \bar{B} x_{t} \\ y_{t} &= C h_{t} + D x_{t} \end{aligned} \tag{2}$$

Here, the discrete parameters $\bar{A}$ and $\bar{B}$ are computed as

$$\begin{aligned} \bar{A} &= \exp(\Delta A) \\ \bar{B} &= (\Delta A)^{-1} (\exp(\Delta A) - I) \cdot \Delta B \end{aligned} \tag{3}$$

where $\Delta$ is an input-dependent time step parameter. This allows the model to adaptively adjust its behavior based on the input content; such input-dependent discretization is key to VMamba’s strong performance.
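To make the zero-order-hold discretization concrete, the sketch below works through the scalar case (state size N = 1), where the matrix exponential reduces to `math.exp` and the matrix inverse to a division. It is only an illustration under simplifying assumptions: the step size `delta` is held fixed here, whereas VMamba predicts it from the input, and the function names `zoh_discretize` and `run_ssm` are hypothetical.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization for a scalar state (N = 1).

    A_bar = exp(delta * a)
    B_bar = (delta * a)^-1 * (exp(delta * a) - 1) * (delta * b)
    """
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar


def run_ssm(xs, a, b, c, d, delta):
    """Run the discretized state update over an input sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x      # state update
        ys.append(c * h + d * x)       # output read from the updated state, as in y = Ch + Dx
    return ys


# Constant input: the continuous system h' = -h + x settles at h = -b / a = 1.
ys = run_ssm([1.0] * 200, a=-1.0, b=1.0, c=1.0, d=0.0, delta=0.1)
print(round(ys[-1], 6))
```

For a constant input the discretized recurrence converges to the same steady state as the continuous system, which is the property that zero-order hold is designed to preserve for piecewise-constant inputs.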

As illustrated in Figure 2, the core of VMamba is the Visual State Space (VSS) Block, built around a Selective-Scan 2D (SS2D) module. SS2D first unfolds a 2D feature map into four 1D sequences via a four-directional cross-scan (one scan, for example, runs from top-left to bottom-right), enabling each patch to access a global receptive field. Each sequence is then processed by a series of S6 blocks with state-space models, which extract features in linear time by preserving key context and filtering noise. Finally, a Scan Merge module reconstructs the 2D feature map through inverse flattening, fusing multi-directional information while maintaining spatial structure. The output is further refined by a normalization layer and a Feed-Forward Network (FFN), providing a globally consistent foundation for classification.
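The cross-scan and merge steps can be sketched in a few lines of plain Python. This is a simplified illustration, not the VMamba implementation: it assumes the four scan orders are row-major, column-major, and the reverse of each, and it merges the unmodified sequences directly, whereas in the real block each sequence passes through an S6 layer first. The names `cross_scan` and `cross_merge` are used here only for the example.

```python
def cross_scan(grid):
    """Unfold an H x W grid (list of rows) into four 1D sequences."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(seqs, h, w):
    """Map the four sequences back to their H x W positions and sum them."""
    row_major, col_major, row_rev, col_rev = seqs
    row_rev, col_rev = row_rev[::-1], col_rev[::-1]  # undo the reversal
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = (row_major[i * w + j] + row_rev[i * w + j]
                         + col_major[j * h + i] + col_rev[j * h + i])
    return out


grid = [[1, 2, 3],
        [4, 5, 6]]
seqs = cross_scan(grid)
print(seqs[1])                  # [1, 4, 2, 5, 3, 6]
print(cross_merge(seqs, 2, 3))  # [[4, 8, 12], [16, 20, 24]]
```

Because every position appears once in each of the four sequences, merging the untouched scans returns four times the input, which is a convenient sanity check that the inverse flattening restores the spatial layout.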

We now detail the structure of the Facial Structure Awareness Block (FSAB). It consists of two cascaded residual units as the core feature extractor, followed by a Convolutional Block Attention Module (CBAM) [52] for feature recalibration. The pseudocode is given in Algorithm 1.

The operation of FSAB is defined by the following formulation:

$$\begin{aligned} z &= x + \mathrm{ReLU}(\mathrm{LN}(\mathrm{LGConv}(x, L))) \\ y &= \mathrm{CBAM}(z + \mathrm{ReLU}(\mathrm{LN}(\mathrm{LGConv}(z, L)))) \end{aligned} \tag{4}$$

As shown in Figure 3, $x \in \mathbb{R}^{H \times W \times C}$, $L$, and $y \in \mathbb{R}^{H \times W \times C}$ denote the input feature, the facial landmark coordinates, and the output feature of the FSAB, respectively. The residual connections ensure effective gradient propagation in deep networks while letting the block focus on extracting forgery-related features, so that important details of the input are preserved rather than lost through multiple nonlinear transformations.

Algorithm 1: Facial Structure Awareness Block (FSAB).

Input: Input feature $x \in \mathbb{R}^{H \times W \times C}$, facial landmarks $L$

Output: Output feature $y$

Step 1: Apply landmark-guided convolution: $\hat{x}_{1} = \mathrm{LGConv}(x, L)$

Step 2: Residual unit 1 with normalization and activation: $z = x + \mathrm{ReLU}(\mathrm{LN}(\hat{x}_{1}))$

Step 3: Apply LGConv again with updated input: $\hat{x}_{2} = \mathrm{LGConv}(z, L)$

Step 4: Residual unit 2: $z' = z + \mathrm{ReLU}(\mathrm{LN}(\hat{x}_{2}))$

Step 5: Apply CBAM attention module: $y = \mathrm{CBAM}(z')$

return $y$
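The data flow of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: `lgconv` and `cbam` are passed in as caller-supplied stand-ins, the layer normalization has no learned scale or shift, and a single image in (H, W, C) layout is assumed. The linear channel mix used as a stand-in for LGConv in the demo ignores the landmarks entirely and exists only to exercise the residual structure.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (last axis), without learned parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def relu(x):
    return np.maximum(x, 0.0)


def fsab(x, landmarks, lgconv, cbam):
    """Two cascaded residual units followed by attention, following Algorithm 1.

    `lgconv(x, landmarks)` and `cbam(x)` are stand-ins supplied by the caller;
    both must preserve the (H, W, C) shape of their input.
    """
    z = x + relu(layer_norm(lgconv(x, landmarks)))    # Steps 1-2: residual unit 1
    z2 = z + relu(layer_norm(lgconv(z, landmarks)))   # Steps 3-4: residual unit 2
    return cbam(z2)                                   # Step 5: feature recalibration


# Hypothetical stand-ins, only to exercise the data flow.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
landmarks = rng.uniform(0, 8, size=(68, 2))
mix = rng.normal(size=(16, 16)) * 0.1
y = fsab(x, landmarks, lgconv=lambda f, lm: f @ mix, cbam=lambda f: f)
print(y.shape)  # (8, 8, 16)
```

Since each residual unit adds a non-negative ReLU output to its input, the identity path always carries the original features forward, which is the detail-preserving behavior the block is designed for.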

We further introduce the CBAM to recalibrate features both channel-wise and spatially. The channel attention module generates a weight vector via global pooling and a multi-layer perceptron, emphasizing channels strongly associated with forgery traces. The spatial attention module then computes an attention map by aggregating max and average feature maps and applying a convolution. This further highlights subtle texture anomalies in spatial regions roughly aligned with facial landmarks.
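The two attention steps can be sketched in NumPy as below. The sketch follows the general CBAM recipe described above, but several details are assumptions made for illustration only: a channel reduction ratio of 4, a 7 × 7 spatial kernel, randomly initialized weights, no biases, and a single image in (H, W, C) layout. The convolution is written as an explicit loop for readability, not speed.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """Per-channel weights from a shared MLP over avg- and max-pooled descriptors."""
    def mlp(v):
        return np.maximum(v @ w1, 0.0) @ w2        # C -> C/r -> C

    avg = x.mean(axis=(0, 1))                      # (C,)
    peak = x.max(axis=(0, 1))                      # (C,)
    return sigmoid(mlp(avg) + mlp(peak))           # values in (0, 1)


def spatial_attention(x, kernel):
    """Convolve the stacked channel-wise avg/max maps with a (k, k, 2) kernel."""
    desc = np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)   # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(desc, ((pad, pad), (pad, pad), (0, 0)))       # zero padding
    h, w = x.shape[:2]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(out)                                           # (H, W) map in (0, 1)


def cbam(x, w1, w2, kernel):
    x = x * channel_attention(x, w1, w2)                  # channel recalibration
    return x * spatial_attention(x, kernel)[..., None]    # spatial recalibration


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
w1 = rng.normal(size=(16, 4)) * 0.1     # reduction ratio 4 (assumption)
w2 = rng.normal(size=(4, 16)) * 0.1
kernel = rng.normal(size=(7, 7, 2)) * 0.1
y = cbam(x, w1, w2, kernel)
print(y.shape)  # (8, 8, 16)
```

Both attention maps lie in (0, 1), so the module can only scale features down relative to their input magnitude; what it learns is which channels and locations to scale down least.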

Finally, the FSAB output is fused with the global features from the VSS block via element-wise addition. This dual-branch design enables the network to integrate holistic facial context with detailed, landmark-sensitive forgery features, enhancing its ability to capture subtle manipulations in structurally important regions.

3.3. Facial Landmark-Guided Convolution

In deepfake detection, forgery traces exhibit clear semantic correlations in their spatial distribution, often concentrating in facial regions with well-defined semantics such as the eyes, mouth, teeth, and the glabella between the eyebrows. As shown in Figure 4, these regions closely coincide with the spatial distribution of facial landmarks.

Traditional convolution operations, constrained by their fixed regular sampling grid, struggle to effectively capture the irregular geometric distortions and local artifacts introduced during the forgery process. Although deformable convolutions introduce learnable offsets to improve geometric transformation modeling to some extent, the learning process for these offsets lacks explicit semantic guidance, resulting in considerable randomness and uncertainty in the model’s perception of critical forged regions.

Therefore, we propose a Landmark-Guided Convolution mechanism that incorporates facial landmarks as strong semantic priors, explicitly guiding the convolution’s sampling toward high-risk forgery regions. This achieves precise focus and enhanced sensitivity to local forgery artifacts.

Specifically, building on LDConv [53], we design a Landmark-Guided Convolution (LGConv) that dynamically steers the convolution kernel’s spatial sampling offsets using facial landmarks as semantic guidance. This design allows the model to adaptively adjust its focus regions based on facial structural semantics, significantly enhancing its ability to perceive local forgery artifacts and texture anomalies.

A standard convolution operation samples features on a regular grid. Let $R$ denote the regular $3 \times 3$ sampling grid, defined as

$$R = \{(-1, -1), (-1, 0), \dots, (0, 1), (1, 1)\} \tag{5}$$

Here, the center of $R$ is the sampling origin of the convolution kernel. To give an irregular convolution kernel a corresponding sampling grid, we first construct an initial sampling coordinate distribution for the given kernel size. Unlike traditional convolution, which takes the kernel center as the origin $(0, 0)$, we set the top-left corner as the sampling origin $(0, 0)$ to better accommodate the scale variation of the deformable kernel. We then map the detected facial landmark coordinates $L$ through a Multi-Layer Perceptron (MLP) into landmark-guided offsets $P_{lm}$ and combine these with the initial sampling coordinates $P_{n}$ of the deformable kernel. This provides explicit guidance for the convolution kernel to sample more densely in semantically critical regions.
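The difference between the center-origin grid of a standard convolution and a top-left-origin initial pattern can be sketched as follows. The row-by-row layout used for an arbitrary number of points is an assumption made for this example (one plausible scheme for irregular kernels), and the function names `generate_fixed_pattern` and `centered_grid` are hypothetical.

```python
import math


def generate_fixed_pattern(n):
    """Lay out n sampling points row by row, with the top-left point at (0, 0)."""
    cols = max(1, round(math.sqrt(n)))
    full_rows, remainder = divmod(n, cols)
    points = [(r, c) for r in range(full_rows) for c in range(cols)]
    points += [(full_rows, c) for c in range(remainder)]   # partial last row
    return points


def centered_grid(k=3):
    """The regular k x k grid of a standard convolution, with the center as origin."""
    half = k // 2
    return [(r, c) for r in range(-half, half + 1) for c in range(-half, half + 1)]


print(generate_fixed_pattern(5))   # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]
print(centered_grid(3)[0])         # (-1, -1)
```

With nine points the top-left pattern is simply the centered 3 × 3 grid shifted by (1, 1); the top-left convention becomes useful for point counts such as five, which have no natural center.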

The convolution operation at position $P_{0}$ can be defined as

$$\mathrm{Conv}(P_{0}) = \sum_{P_{n} \in R} w_{P_{n}} \cdot x(P_{0} + P_{n} + P_{lm}) \tag{6}$$

where $R$ represents the set of sampling offsets, $w$ represents the convolution weights, and $x(P_{0} + P_{n} + P_{lm})$ is the feature value at the corresponding sampling position. The pseudocode is shown in Algorithm 2.

Algorithm 2: Landmark-Guided Convolution (LGConv).

Input: Input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, facial landmarks $L \in \mathbb{R}^{B \times 68 \times 2}$

Output: Output feature map $y$

Step 1: Generate fixed base sampling shape: $P_{n} = \mathrm{GenerateFixedPattern}(N)$

Step 2: Generate landmark-guided offsets: $P_{lm} = \mathrm{MLP}(L)$

Step 3: Learn image-dependent offsets: $P_{0} = \mathrm{Conv}(x)$ (output shape: $B \times 2N \times H \times W$)

Step 4: Compute sampling positions: $P = P_{n} + P_{lm} + P_{0}$

Step 5: Resample feature map at $P$: $x' = \mathrm{BilinearSample}(x, P)$

Step 6: Apply convolution over sampled features: $y = \mathrm{Conv}(x')$

return $y$
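Steps 4 to 6 of Algorithm 2 can be sketched in NumPy for a single image in (H, W, C) layout. This is an illustration under stated assumptions, not the authors' code: the output location of each kernel application is added to the three offset terms (as in deformable convolution), sampling positions are clamped to the image border, and the offsets are taken as given arrays rather than produced by an MLP or a convolution. The argument name `learned` stands for the image-dependent offsets of Step 3.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at real-valued (ys, xs), clamping to the border."""
    h, w = feat.shape[:2]
    ys = np.clip(ys, 0, h - 1)
    xs = np.clip(xs, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[..., None]
    wx = (xs - x0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def lgconv(feat, p_n, p_lm, learned, weight):
    """feat: (H, W, C_in); p_n, p_lm: (N, 2); learned: (H, W, N, 2); weight: (N, C_in, C_out)."""
    h, w = feat.shape[:2]
    grid_y, grid_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    base = np.stack([grid_y, grid_x], axis=-1)[:, :, None, :]   # output locations, (H, W, 1, 2)
    pos = base + p_n + p_lm + learned                            # Step 4: (H, W, N, 2)
    sampled = bilinear_sample(feat, pos[..., 0], pos[..., 1])    # Step 5: (H, W, N, C_in)
    return np.einsum("hwnc,nco->hwo", sampled, weight)           # Step 6: weighted sum


rng = np.random.default_rng(0)
feat = rng.normal(size=(6, 6, 3))
p_n = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)  # top-left origin
p_lm = np.zeros((9, 2))
learned = np.zeros((6, 6, 9, 2))
weight = np.zeros((9, 3, 3))
weight[0] = np.eye(3)          # only the (0, 0) sampling point contributes
out = lgconv(feat, p_n, p_lm, learned, weight)
print(np.allclose(out, feat))  # True
```

With all offsets set to zero and an identity weight on the (0, 0) point, the operation reduces to copying the input, which checks that the position arithmetic and the sampling agree. Nonzero `p_lm` values would shift all nine sampling points together, while `learned` shifts them independently at every location.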

For example, as illustrated in Figure 5, we first generate an initial sampling shape $P_{n}$ based on $N$ (the number of sampling points), then map the facial landmark coordinates through an MLP to obtain the landmark-guided offsets $P_{lm}$.

The operation to obtain $P_{lm}$ is defined by the following formulation:

$$P_{lm} = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(L))) \tag{7}$$
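A two-layer mapping of this form can be sketched in NumPy as follows. The text does not specify how the 68 landmarks are fed to the first linear layer or how wide the hidden layer is, so the choices here are assumptions for illustration: the landmarks are flattened to a 136-dimensional vector, the hidden width is 64, the coordinates are normalized to [0, 1], and one 2D offset is produced per sampling point. The function name `landmark_offsets` is hypothetical.

```python
import numpy as np


def landmark_offsets(landmarks, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear on flattened landmarks.

    landmarks: (B, 68, 2); returns offsets of shape (B, N, 2).
    """
    flat = landmarks.reshape(landmarks.shape[0], -1)     # (B, 136)
    hidden = np.maximum(flat @ w1 + b1, 0.0)             # (B, hidden)
    out = hidden @ w2 + b2                               # (B, 2N)
    return out.reshape(out.shape[0], -1, 2)


rng = np.random.default_rng(0)
landmarks = rng.uniform(0.0, 1.0, size=(4, 68, 2))   # normalized coordinates (assumption)
n_points, hidden = 9, 64                             # illustrative sizes (assumption)
w1 = rng.normal(size=(136, hidden)) * 0.01
b1 = np.zeros(hidden)
w2 = rng.normal(size=(hidden, 2 * n_points)) * 0.01
b2 = np.zeros(2 * n_points)
p_lm = landmark_offsets(landmarks, w1, b1, w2, b2)
print(p_lm.shape)  # (4, 9, 2)
```

Because the offsets depend only on the landmarks, they are shared by every spatial location of a given image; the per-location variation comes from the separate image-dependent offset map.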

Then, the input feature map $x$ is processed by a convolution to produce an offset map $P_{0}$ of shape $(B, 2N, H, W)$. Adding $P_{0}$ and $P_{lm}$ to the initial sampling coordinates $P_{n}$ yields the new sampling positions $(P_{0} + P_{n} + P_{lm})$, so the convolution kernel adopts a different sampling pattern at each spatial location of the feature map. Finally, we use bilinear interpolation to resample the feature map at these positions and apply the corresponding convolution to the resampled features. Through these three steps, LGConv completes the convolution and produces the output feature $y$.

4. Experiments

4.1. Datasets and Experimental Setup

To evaluate the effectiveness and robustness of the proposed model, we conducted experiments on multiple mainstream face forgery detection datasets, including FaceForensics++ (FF++) [19], Celeb-DF V1 (CD1) and Celeb-DF V2 (CD2) [22], Deepfake Detection Challenge Preview (DFDCP) [54], and DeepFakeDetection (DFD) [55].

FaceForensics++ (FF++) contains 1000 real videos and 4000 fake videos generated by four different manipulation methods: DeepFake (DF), FaceSwap (FS), Face2Face (F2F), and NeuralTextures (NT). Each video is provided in three quality levels: raw (uncompressed), high quality (C23), and low quality (C40). In this study, we use the C23 compressed version for training and evaluation.

Celeb-DF is available in two versions, CD1 and CD2. CD1 comprises 408 real videos and 795 fake videos, while CD2 further expands to 590 real videos and 5639 high-fidelity fake videos.

DFDCP (Deepfake Detection Challenge Preview) is the preview version of the DFDC dataset, containing 1131 real videos and 4113 fake videos generated from the real ones.

The DFD (DeepFakeDetection) dataset consists of 363 real videos and 3068 fake videos generated using publicly available deepfake generation methods. It is also provided in raw, C23, and C40 versions, and we use the C23 version in our experiments.

Model Training: The model was trained using the PyTorch 2.2.0 framework on an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) for a total of 300 epochs. We employed the Adam optimizer with betas set to (0.9, 0.999) and a weight decay of 0.05 to help regularize the model. The initial learning rate was set to 5 × 10⁻⁴ and adjusted using a cosine decay scheduler to gradually reduce it during training. The optimizer’s epsilon parameter was set to 5 × 10⁻⁶ to improve numerical stability. We used the Cross-Entropy Loss function for the binary classification task. The model was trained with a batch size of 1, and the number of frames per epoch varied depending on the experimental setup: 14,400 frames for intra-dataset experiments and 28,800 frames for cross-dataset experiments.
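For reference, a cosine decay schedule with the stated initial learning rate and epoch count can be sketched as below. The text does not give the minimum learning rate, any warm-up phase, or whether the rate is updated per epoch or per step, so this sketch assumes a floor of zero, no warm-up, and per-epoch updates. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=5e-4, min_lr=0.0):
    """Cosine-decayed learning rate at the start of `epoch` (0-indexed)."""
    progress = min(epoch, total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The rate starts at 5 × 10⁻⁴, passes through half that value at the midpoint of training, and reaches the assumed floor at epoch 300.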

For data preprocessing, we randomly extracted 10 face frames from each video, and all face images were center-cropped to $224 \times 224$ pixels. We then extracted 68 facial landmark coordinates from each cropped face image using the OpenFace 2.2.0 toolkit. The model outputs a real/fake prediction for each frame. For evaluation, we adopt the Area Under the Receiver Operating Characteristic Curve (AUC) as the primary performance metric to measure detection accuracy.
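Frame-level AUC can be computed directly from its rank interpretation: the probability that a randomly chosen fake frame receives a higher score than a randomly chosen real frame. The sketch below uses this pairwise formulation for clarity; it is quadratic in the number of frames, so a sorting-based routine would be preferable for large test sets. The label convention (1 for fake, 0 for real) and the function name `auc_score` are assumptions for this example.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random fake frame outscores a random real one.

    labels: 1 for fake, 0 for real; scores: predicted probability of "fake".
    Ties count as half a win (Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one real and one fake sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75
```

In the toy example, three of the four fake/real pairs are ranked correctly, giving an AUC of 0.75. Because the metric depends only on the ranking of scores, it is unaffected by the choice of decision threshold.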

4.2. Results

4.2.1. Intra-Dataset Experiments

To evaluate the effectiveness of the proposed method, we first performed intra-dataset experiments on the FaceForensics++ (FF++) dataset (using the C23 compression level). We randomly selected 720 videos from each manipulation method for training, 140 for validation, and 140 for testing. On this benchmark, we conducted within-subset evaluations on the four subsets corresponding to each forgery method (DF, FS, F2F, and NT). We also performed a cross-method evaluation: training the model on the samples of one manipulation subset and testing on the other three methods. We compared our approach with several current state-of-the-art face forgery detection techniques. The detailed results are shown in Table 1.
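The split and the evaluation protocol described above can be sketched as follows. The random seed, the use of integer video identifiers, and the function names `split_videos` and `evaluation_pairs` are assumptions for illustration; the text only specifies the 720/140/140 split sizes and the four manipulation subsets.

```python
import random

METHODS = ["DF", "FS", "F2F", "NT"]


def split_videos(video_ids, n_train=720, n_val=140, n_test=140, seed=0):
    """Randomly partition video ids into disjoint train/val/test lists."""
    ids = list(video_ids)
    if len(ids) < n_train + n_val + n_test:
        raise ValueError("not enough videos for the requested split")
    random.Random(seed).shuffle(ids)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:n_train + n_val + n_test])


def evaluation_pairs(methods=METHODS):
    """(train, test, kind) tuples for within-subset and cross-method evaluation."""
    return [(tr, te, "within" if tr == te else "cross")
            for tr in methods for te in methods]


train, val, test = split_videos(range(1000))
print(len(train), len(val), len(test))                       # 720 140 140
print(sum(kind == "cross" for _, _, kind in evaluation_pairs()))  # 12
```

Training on each of the four subsets and testing on all four yields 4 within-subset and 12 cross-method results, which is the layout of a 4 × 4 results table.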

Our comparison includes a range of representative deepfake detection frameworks. Face X-Ray [1] generates an “X-ray” image to expose blending boundaries of manipulated regions, focusing on detecting the compositing artifacts when content from different source images is merged. MAT [58] employs multiple spatial attention heads to capture forgery traces in various local facial regions and uses a texture enhancement module to amplify subtle artifacts in shallow features. RECCE [59] and MRL [60] are methods that guide the network using facial priors to learn authentic face characteristics for detecting manipulations. Mf-net [61] integrates multiple feature types and utilizes multi-scale information to learn more generalized forgery representations. D2Fusion [62] takes a multi-domain detection approach: through bidirectional attention and fine-grained frequency-domain attention, it extracts spatial and frequency features and uses feature fusion to enhance the differences between real and fake characteristics.

As shown in Table 1, the proposed method achieves excellent performance on test sets sharing the same distribution as the training data. Under the HQ setting of FF++, the average AUC scores obtained from training and testing on each individual subset reach 91.37%, 88.97%, 86.23%, and 85.80%, respectively. Most of these results outperform other competing state-of-the-art detection methods. Moreover, the proposed method attains superior performance in cross-method training evaluations across multiple subsets.

As a baseline, Face X-ray shows limited performance under cross-method testing. The core assumption of Face X-ray is the existence of “blending boundaries” in forged images. However, the validity of this assumption varies across different forgery methods, ultimately leading to a degradation in its cross-subset performance compared with its within-subset performance.

Although the attention mechanism in MAT can effectively focus on local forgery artifacts, such artifacts may not be universally representative across different manipulation techniques. This potentially restricts its capability to capture global consistency cues and generalizable forgery traces, causing inferior performance in cross-method evaluations. By contrast, our proposed method adaptively allocates computational resources to facial regions that are more prone to unnatural artifacts during forgery—such as the eyes, mouth, and nasal wings—allowing the model to capture more essential and generalizable facial forgery features. Therefore, even though our within-subset scores on FF++ are slightly lower than those of MAT, the proposed method consistently achieves superior cross-subset performance.

Similarly, Mf-net benefits from dual-stream feature extraction and extensive multi-scale fusion, which indeed provide abundant information. However, these operations may simultaneously introduce features or noise unrelated to the core artifacts that distinguish each specific forgery method. If the model cannot effectively suppress such redundant information, the within-method performance may remain high, but the most discriminative features across different forgery methods could be obscured, thereby limiting its performance under cross-method testing.

4.2.2. Cross-Dataset Evaluation

To assess the generalization ability of our method under different data distributions, we conducted rigorous cross-dataset experiments. In particular, after training the model on FF++ (C23), we directly tested it on four other datasets: CD1, CD2, DFDCP, and DFD. The results are presented in Table 2.
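For reference, the AUC metric reported in these evaluations can be computed from frame-level scores with the rank-sum (Mann-Whitney) identity. The following is a minimal sketch rather than the evaluation code used in the paper; the function name `roc_auc` and the toy labels and scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0 (real) / 1 (fake); scores: predicted fake probability.
    Tied scores receive their average rank, so a tie counts as half a win.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one real and one fake sample")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of samples that share the same score.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j + 1])
        i = j + 1

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


# Toy frame-level predictions standing in for an unseen test set (made-up values).
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Because AUC depends only on the ranking of scores, it needs no decision threshold, which is convenient when score calibration shifts between the training and test datasets.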

The proposed method achieves AUC scores of 92.34%, 96.01%, and 88.87% on CD1, CD2, and DFDCP, respectively, all of which significantly outperform existing mainstream approaches. This performance advantage primarily stems from the designed LGConv and the dual-branch collaborative architecture.

Specifically, facial landmarks serve as strong priors that guide the model to adaptively concentrate computational resources on high-frequency forgery-prone regions—such as the periocular area, lips, and nasal wings. These regions are consistently manipulated across diverse forgery methods and thus contain concentrated local artifacts. This mechanism prevents the model from learning non-generalizable forgery patterns in the absence of explicit guidance.

Furthermore, while the VMamba module enables global state-space modeling, the introduction of the Facial Structure Awareness Block facilitates fine-grained feature extraction from key facial regions. The dual-branch architecture allows the model to simultaneously maintain global semantic consistency and regional specificity, yielding complementary discrimination of forgery cues.

Compared to LAA-Net, our method achieves higher AUC scores on CD2 and DFDCP, demonstrating stronger generalization under distribution shifts. While LAA-Net focuses on localized artifacts using blended pseudo-fakes and explicit attention mechanisms, our method leverages a dual-branch architecture and semantically guided LGConv to model diverse forgery cues across both local and global facial regions. This design enables more robust detection of subtle manipulations that are not strictly boundary-based, contributing to improved cross-dataset performance.

Compared to CADDM, which suppresses identity-related features and relies on localized artifact detection, our method adopts a dual-branch architecture that jointly models both local and global forgery cues. This design enables the model to better capture subtle and diverse forgery traces. As shown in Table 2, our method outperforms CADDM by 2.77%, 18.97%, and 7.64% AUC on CD1, CD2, and DFDCP, respectively. We attribute this improvement to the landmark-guided convolution and the global-aware branch, which enhance both regional focus and holistic consistency modeling, thereby improving generalization to unseen distributions.

It is worth noting, however, that although the proposed approach performs remarkably well in most cross-dataset evaluations, its performance on the DFD dataset is relatively limited—achieving 92.26%, which represents a reduction of 1.66% compared with CADDM. This performance gap is largely due to the nature of the DFD dataset, which includes a subset of fake videos generated through operations such as frame interpolation or dropping. These manipulations introduce temporal inconsistencies that are difficult to detect from a single frame. As our method is designed for frame-level analysis and does not model temporal dependencies, it is less effective in scenarios where forgery cues mainly lie in the temporal domain.

4.2.3. Ablation Experiment

To systematically evaluate the contributions of LGConv and FSAB in our framework, we conducted a series of ablation experiments. All ablation models were trained on FF++ and tested on CD2, DFDCP, and FF++ to ensure consistent evaluation conditions.

In particular, we focus on analyzing the advantages of our landmark-guided convolution over conventional deformable convolutions in the forgery detection task. In the experiments, we compared our proposed landmark-guided convolution module against several deformable convolution variants, including DCN [33], DSConv [51], and WTConv [50]. As shown by the cross-dataset results in Table 3, LGConv achieved the highest performance on both test datasets.

DSConv depends on continuous structural cues, such as center lines, to guide kernel offsets. However, facial manipulations are often sparse, irregular, and concentrated around key components like the mouth and eye regions where such continuity does not exist. WTConv increases the receptive field but lacks region-specific semantic guidance, which limits its effectiveness in capturing localized forgery artifacts. As a result, DSConv and WTConv perform slightly worse in our setting.

In contrast, LGConv leverages sparse landmark points as priors to adaptively guide convolutional sampling toward potentially manipulated regions. This region-aware sampling strategy enables better localization of forgery clues and contributes to stronger generalization across manipulation types.
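The idea of steering convolutional sampling toward landmarks can be illustrated with a single-channel toy example. This is only a sketch of the general principle and not the LGConv formulation defined earlier in the paper: the nearest-landmark rule, the pull strength `alpha`, and all function names are assumptions made for illustration.

```python
import math


def bilinear(img, y, x):
    """Bilinearly sample a 2D list `img` at fractional (y, x); outside counts as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            value += weight * img[yy][xx]
    return value


def landmark_positions(center, landmarks, alpha=0.5):
    """Return the 3x3 sampling positions around `center`, each moved a fraction
    `alpha` of the way toward its nearest landmark (assumed rule)."""
    cy, cx = center
    positions = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            py, px = cy + dy, cx + dx
            ly, lx = min(landmarks, key=lambda l: (l[0] - py) ** 2 + (l[1] - px) ** 2)
            positions.append((py + alpha * (ly - py), px + alpha * (lx - px)))
    return positions


def guided_conv_at(img, kernel, center, landmarks, alpha=0.5):
    """Weighted sum of bilinear samples taken at the landmark-shifted positions."""
    flat_kernel = [k for row in kernel for k in row]
    positions = landmark_positions(center, landmarks, alpha)
    return sum(k * bilinear(img, y, x) for k, (y, x) in zip(flat_kernel, positions))


# Toy 5x5 image and an all-ones 3x3 kernel (made-up values).
toy = [[float(5 * r + c) for c in range(5)] for r in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
plain = guided_conv_at(toy, ones, (2, 2), [(1, 1)], alpha=0.0)   # regular grid
pulled = guided_conv_at(toy, ones, (2, 2), [(1, 1)], alpha=1.0)  # all taps on the landmark
```

With `alpha=0.0` the sketch reduces to an ordinary 3x3 convolution at the chosen position, while larger values concentrate the kernel taps around the landmark. A learned variant would predict the offsets instead of using a fixed rule.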

To verify the effectiveness of CBAM, we conducted ablation experiments in which it was replaced with two alternative attention mechanisms, SimAM and SENet; the results are reported in Table 4. The first row of the table shows the performance without any attention module. When CBAM was replaced with SimAM and SENet, the AUC dropped by 0.78% and 0.98%, respectively.

SimAM, as a lightweight parameter-free 3D attention mechanism, adopts a generic design that may lack the specificity required to capture subtle and complex local artifacts characteristic of deepfake manipulations. SENet focuses solely on channel-wise recalibration and does not explicitly model spatial locations where forgery traces typically appear.

In contrast, CBAM sequentially applies learnable channel and spatial attention, enabling the model to concentrate on critical forgery-prone regions such as eye contours and lip boundaries. This dual-attention strategy aligns more closely with the nature of deepfake forgeries, making CBAM better suited for the detection task.
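The sequential channel-then-spatial gating can be sketched in plain Python as follows. This is a didactic re-implementation of the generic CBAM structure, not the FSAB code: the weight arguments `w1`, `w2`, and `kernel` are placeholders that would be learned in practice, and the demo uses a 3x3 spatial kernel for brevity where the original CBAM design uses 7x7.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(vec, w1, w2):
    """Two-layer bottleneck MLP with ReLU, shared by the avg- and max-pooled paths."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """feat: list[C][H][W]. Returns one gate per channel in (0, 1)."""
    flat = [[v for row in ch for v in row] for ch in feat]
    avg = [sum(ch) / len(ch) for ch in flat]
    mx = [max(ch) for ch in flat]
    a, m = shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(feat, kernel):
    """kernel: list[2][k][k], applied with zero padding to the channel-wise
    [mean, max] maps. Returns an H x W gate map."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    mean_map = [[sum(feat[ch][y][x] for ch in range(c)) / c for x in range(w)]
                for y in range(h)]
    max_map = [[max(feat[ch][y][x] for ch in range(c)) for x in range(w)]
               for y in range(h)]
    k = len(kernel[0])
    pad = k // 2
    gate = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for idx, plane in enumerate((mean_map, max_map)):
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - pad, x + j - pad
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kernel[idx][i][j] * plane[yy][xx]
            gate[y][x] = sigmoid(acc)
    return gate


def cbam(feat, w1, w2, kernel):
    """Channel gating first, then spatial gating on the channel-refined features."""
    ch_gate = channel_attention(feat, w1, w2)
    refined = [[[g * v for v in row] for row in ch] for g, ch in zip(ch_gate, feat)]
    sp_gate = spatial_attention(refined, kernel)
    return [[[sp_gate[y][x] * ch[y][x] for x in range(len(ch[0]))]
             for y in range(len(ch))] for ch in refined]
```

SENet corresponds roughly to the channel-gating step alone (with average pooling only), which is why it cannot express where in the face a forgery trace appears.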

To verify the generality and effectiveness of the proposed modules, we also performed ablation studies across different backbone networks: ResNet [72], Swin Transformer (Swin-T) [73], and VMamba [32]. In experiments where the Landmark-Guided Convolution was not used, we replaced it with a standard convolution. As shown in Table 5, after incorporating the Landmark-Guided Convolution (LGConv) and the Facial Structure Awareness Block (FSAB), the model with VMamba achieved AUC of 88.87% (DFDCP) and 96.01% (CD2) in cross-dataset evaluation, outperforming all other backbone configurations.

Although Swin-T, with its window-based self-attention, has shown excellent performance in many vision tasks, its localized inductive bias limits the effective integration of global context information. In contrast, VMamba, built on a state-space model, can achieve a global receptive field within a single processing layer, thereby more naturally modeling long-range spatial dependencies in the image. This characteristic allows it to provide a more consistent global semantic context for the landmark-guided local feature extraction, achieving a deep synergy between global understanding and local analysis. This is likely the main reason why our method’s results on DFDCP and CD2 are higher by 6.63% and 1.13%, respectively, compared to using the Swin-T backbone.

As shown in Table 5, the full model with the VMamba backbone contains only 31 M parameters and requires 5.1 G FLOPs, with an inference time of 199 ms. Compared with the Swin-T configuration, our model is approximately 32 ms faster while achieving better detection performance, indicating a better trade-off between performance and efficiency. The results suggest that the LGMamba framework has strong potential for lightweight deployment and practical applicability.

4.2.4. Robustness Evaluation

To simulate challenging conditions, we introduce three common types of visual perturbations: Gaussian blur, salt-and-pepper noise, and block-wise occlusion.
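The three perturbation types can be sketched on a grayscale image stored as a nested list. This is a minimal illustration under assumed settings: the blur strength, noise ratio, and block size used in the experiments are not given in this section, so the default values and function names below are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalised 1D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(img, kernel):
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        # Clamp indices at the border (edge replication).
        out.append([
            sum(k * row[min(max(x + i - radius, 0), width - 1)]
                for i, k in enumerate(kernel))
            for x in range(width)
        ])
    return out


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter rows, transpose, filter rows, transpose back."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = _blur_rows(img, kernel)
    transposed = [list(col) for col in zip(*rows_done)]
    return [list(col) for col in zip(*_blur_rows(transposed, kernel))]


def salt_and_pepper(img, amount=0.05, rng=None):
    """Set roughly `amount` of the pixels to 0 or 255 with equal probability."""
    rng = rng or random.Random(0)
    out = [row[:] for row in img]
    for y in range(len(out)):
        for x in range(len(out[0])):
            if rng.random() < amount:
                out[y][x] = 255 if rng.random() < 0.5 else 0
    return out


def block_occlusion(img, top, left, size, fill=0):
    """Overwrite a size x size square (clipped to the image) with `fill`."""
    out = [row[:] for row in img]
    for y in range(max(top, 0), min(top + size, len(out))):
        for x in range(max(left, 0), min(left + size, len(out[0]))):
            out[y][x] = fill
    return out
```

Each function returns a new image and leaves its input unchanged, so the same clean test frames can be reused for all three degradations.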

As shown in Table 6, LGMamba achieves an average AUC of 88.06% under three common types of image degradation, consistently outperforming all baseline methods. Although there is a slight drop compared to the clean setting, with the average AUC decreasing by 1.85%, the model still maintains stable forgery detection performance.

This robustness is primarily attributed to the use of facial landmarks in LGConv. Unlike low-level texture cues, landmarks encode semantically structured facial regions (e.g., eye contours, lips), which remain geometrically stable under moderate noise. In addition, LGConv performs region-guided sampling around landmarks rather than relying on single points, allowing the model to remain effective even when landmark positions are slightly perturbed.

Overall, LGMamba shows stronger resilience to visual degradation than methods that rely heavily on pixel-level artifacts.

4.3. Grad-CAM Visualization

To further investigate the model’s decision-making mechanism and verify the spatial consistency between its attention regions and the actual forgery traces, this study employs the Grad-CAM [74] technique for visual interpretability analysis. We focus on examining the activation regions within the FSAB under conditions with and without the LGConv, and we extract representative heatmaps for comparison.

As shown in Figure 6, without LGConv, the model produces heatmaps with dispersed activations, where attention often spreads across non-essential facial regions and even background areas. This indicates that conventional convolutions, in the absence of prior guidance, struggle to consistently focus on discriminative forgery regions. In contrast, after integrating LGConv, the heatmaps become highly concentrated around key facial structures such as the periocular area, lip contours, and nasal wing edges, which are regions prone to geometric distortions or texture inconsistencies in various face forgery techniques. This visual comparison demonstrates that the facial structure prior introduced by LGConv effectively guides the model's attention toward more discriminative local forgery artifacts, thereby providing more reliable feature evidence for classification decisions.
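For readers unfamiliar with Grad-CAM, the heatmap computation itself is compact. The sketch below assumes that the activation maps of the inspected layer and the gradients of the target class score with respect to them have already been extracted; the function name and the nested-list layout are illustrative choices, not the visualization code used for Figure 6.

```python
def grad_cam(activations, gradients):
    """activations, gradients: list[K][H][W] for one layer and one target class.

    Channel weight = spatial mean of the gradient; heatmap = ReLU of the
    weighted sum of activation maps, rescaled to [0, 1] when it is non-zero.
    """
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in gradients[c]) / (h * w) for c in range(k)]
    cam = [[max(0.0, sum(weights[c] * activations[c][y][x] for c in range(k)))
            for x in range(w)] for y in range(h)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

The ReLU keeps only the locations that push the target class score upward, which is why the heatmaps highlight regions supporting the "fake" decision rather than all active features.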

In summary, the heatmap analysis not only visually verifies the effectiveness of the proposed Landmark-Guided Convolution, but also further demonstrates that combining global semantic understanding with localized region focusing enables the model to more reliably capture forgery cues that are consistent across datasets. This, in turn, enhances both the interpretability and the generalization capability of the detection system.

5. Conclusions

In this paper, we proposed LGMamba, a deepfake detection framework that integrates facial landmark guidance to model both global-level forgery cues and two types of local forgery patterns: structural inconsistencies around key facial components (eyes, mouth, nasal wings) and fine-grained texture-level artifacts that commonly appear in manipulated regions. By explicitly combining these complementary cues, the framework achieves more comprehensive and reliable forgery detection. The proposed method is built upon two key innovations. First, inspired by deformable convolution, we introduced the Landmark-Guided Convolution (LGConv), which leverages facial landmark information to adaptively adjust convolutional sampling positions. This mechanism guides the network to focus on forgery-prone facial regions, ensuring stable and anatomically meaningful local feature extraction. Second, we proposed a Facial Structure Awareness Block (FSAB) that operates in parallel with the VMamba-based visual State-Space Model, so that the framework jointly captures global semantic consistency and subtle local artifacts, enabling robust feature discrimination across diverse manipulation methods. Extensive experiments on multiple benchmark datasets, including both intra-dataset and cross-dataset evaluations, demonstrate that LGMamba achieves leading performance in most settings. In addition, the Grad-CAM-based visualization analysis further validates that the model's attention aligns well with actual forgery traces, providing interpretability evidence for the design choices and highlighting the contribution of the landmark-guided mechanism.

While the proposed framework focuses on spatial-level modeling at the frame level, deepfake content typically exists in video form. In future work, we plan to extend our approach by incorporating temporal information to enhance detection robustness and generalization. Specifically, we aim to leverage the temporal dynamics of facial landmarks by analyzing inconsistencies in their motion trajectories for detecting manipulation artifacts across frames. We also aim to develop a more unified detection architecture.

Author Contributions

Conceptualization, H.C. and C.F.; methodology, H.C.; software, H.C.; validation, H.C., Z.Z. and Q.L.; formal analysis, H.C. and Z.Z.; investigation, H.C. and Z.Z.; resources, C.F.; data curation, H.C. and Z.Z.; writing—original draft preparation, H.C. and Z.Z.; writing—review and editing, C.F.; visualization, H.C.; supervision, C.F.; project administration, C.F.; funding acquisition, C.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 61802064, and by the Fujian Agriculture and Forestry University Science and Technology Innovation Special Fund KFB23157A.

Institutional Review Board Statement

Not applicable. The experiments were conducted using publicly available datasets, including FaceForensics++ and Celeb-DF. No new data involving human or animal participants were collected in this study.

Informed Consent Statement

According to the terms of use for the FaceForensics++ and Celeb-DF datasets: The FaceForensics++ dataset consists of manipulated and original videos collected from YouTube under Creative Commons licenses, and is intended for research use only. The dataset creators confirm that they do not own the content and take no responsibility for the source videos. The Celeb-DF dataset contains Internet-sourced public celebrity videos. The dataset is released for non-commercial research use only, and users must agree not to redistribute the data or use it for commercial purposes.

Data Availability Statement

The datasets used in this study are publicly available. FaceForensics++ and Celeb-DF can be accessed through their official websites. DFDCP and DFD are also publicly available through their official dataset sources. No new data were created in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-Ray for More General Face Forgery Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5000–5009.
Zhu, H.; Huang, H.; Li, Y.; Zheng, A.; He, R. Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, Vienna, Austria, 11–17 July 2020; pp. 2362–2368.
Wang, Y.; Chen, X.; Zhu, J.; Chu, W.; Tai, Y.; Li, J.; Wang, C.; Wu, Y.; Huang, F.; Ji, R. HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 1136–1142.
He, Y.; Yu, N.; Keuper, M.; Fritz, M. Beyond the Spectrum: Detecting Deepfakes via Re-Synthesis. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 2534–2541.
Hu, Z.; Xie, H.; Wang, Y.; Li, J.; Wang, Z.; Zhang, Y. Dynamic Inconsistency-aware DeepFake Video Detection. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 736–742.
Nirkin, Y.; Wolf, L.; Keller, Y.; Hassner, T. DeepFake Detection Based on Discrepancies Between Faces and Their Context. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6111–6121.
Yang, Q.; Yu, D.; Zhang, Z.; Yao, Y.; Chen, L. Spatiotemporal Trident Networks: Detection and Localization of Object Removal Tampering in Video Passive Forensics. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4131–4144.
Li, Y.; Chang, M.; Lyu, S. In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security, WIFS 2018, Hong Kong, China, 11–13 December 2018; pp. 1–7.
Matern, F.; Riess, C.; Stamminger, M. Exploiting Visual Artifacts to Expose Deepfakes and Face Manipulations. In Proceedings of the IEEE Winter Applications of Computer Vision Workshops, WACV Workshops 2019, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 83–92.
Akhtar, Z.; Dasgupta, D. A Comparative Evaluation of Local Feature Descriptors for DeepFakes Detection. In Proceedings of the 2019 IEEE International Symposium on Technologies for Homeland Security (HST), Woburn, MA, USA, 5–6 November 2019; pp. 1–5.
Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. MesoNet: A Compact Facial Video Forgery Detection Network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security, WIFS 2018, Hong Kong, China, 11–13 December 2018; pp. 1–7.
Hsu, C.; Lee, C.; Zhuang, Y. Learning to Detect Fake Face Images in the Wild. arXiv 2018, arXiv:1809.08754.
Hsu, C.C.; Zhuang, Y.X.; Lee, C.Y. Deep Fake Image Detection Based on Pairwise Learning. Appl. Sci. 2020, 10, 370.
Nguyen, H.H.; Fang, F.; Yamagishi, J.; Echizen, I. Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos. In Proceedings of the 10th IEEE International Conference on Biometrics Theory, Applications and Systems, BTAS 2019, Tampa, FL, USA, 23–26 September 2019; pp. 1–8.
Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics: Using Capsule Networks to Detect Forged Images and Videos. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, 12–17 May 2019; pp. 2307–2311.
Guera, D.; Delp, E.J. Deepfake Video Detection Using Recurrent Neural Networks. In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2018, Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
Amerini, I.; Galteri, L.; Caldelli, R.; Bimbo, A.D. Deepfake Video Detection through Optical Flow Based CNN. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Republic of Korea, 27–28 October 2019; pp. 1205–1207.
Li, Y.; Lyu, S. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019.
Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11.
Sun, K.; Liu, H.; Ye, Q.; Gao, Y.; Liu, J.; Shao, L.; Ji, R. Domain General Face Forgery Detection by Learning to Weight. Proc. AAAI Conf. Artif. Intell. 2021, 35, 2638–2646.
Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-Stream Neural Networks for Tampered Face Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 1831–1839.
Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3204–3213.
Shiohara, K.; Yamasaki, T. Detecting Deepfakes with Self-Blended Images. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18699–18708.
Ganiyusufoglu, I.; Ngô, L.M.; Savov, N.; Karaoglu, S.; Gevers, T. Spatio-temporal Features for Generalized Detection of Deepfake Videos. arXiv 2020, arXiv:2010.11844.
Amerini, I.; Caldelli, R. Exploiting Prediction Error Inconsistencies through LSTM-based Classifiers to Detect Deepfake Videos. In Proceedings of the IH&MMSec ’20: ACM Workshop on Information Hiding and Multimedia Security, Denver, CO, USA, 22–24 June 2020; pp. 97–102.
Masi, I.; Killekar, A.; Mascarenhas, R.M.; Gurudatt, S.P.; AbdAlmageed, W. Two-Branch Recurrent Network for Isolating Deepfakes in Videos. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12352, pp. 667–684.
Haliassos, A.; Vougioukas, K.; Petridis, S.; Pantic, M. Lips Don’t Lie: A Generalisable and Robust Approach to Face Forgery Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5037–5047.
Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Zhang, W.; Yu, N. Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 772–781.
Cozzolino, D.; Thies, J.; Rössler, A.; Riess, C.; Nießner, M.; Verdoliva, L. ForensicTransfer: Weakly-supervised Domain Adaptation for Forgery Detection. arXiv 2018, arXiv:1812.02510.
Kong, C.; Chen, B.; Li, H.; Wang, S.; Rocha, A.; Kwong, S. Detect and Locate: Exposing Face Manipulation by Semantic- and Noise-Level Telltales. IEEE Trans. Inf. Forensics Secur. 2022, 17, 1741–1756.
Wang, Y.; Peng, C.; Liu, D.; Wang, N.; Gao, X. ForgeryNIR: Deep Face Forgery and Detection in Near-Infrared Scenario. IEEE Trans. Inf. Forensics Secur. 2022, 17, 500–515.
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, 10–15 December 2024.
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 22–29 October 2017; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 764–773.
Bayar, B.; Stamm, M.C. A Deep Learning Approach to Universal Image Manipulation Detection Using a New Convolutional Layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec 2016, Vigo, Galicia, Spain, 20–22 June 2016; pp. 5–10.
Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in Frequency: Face Forgery Detection by Mining Frequency-Aware Clues. In Proceedings of the Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12357, pp. 86–103.
Li, H.; Li, B.; Tan, S.; Huang, J. Identification of deep network generated images using disparities in color components. Signal Process. 2020, 174, 107616.
Nirkin, Y.; Wolf, L.; Keller, Y.; Hassner, T. DeepFake Detection Based on the Discrepancy Between the Face and its Context. arXiv 2020, arXiv:2008.12262.
Sun, K.; Chen, S.; Yao, T.; Liu, H.; Sun, X.; Ding, S.; Ji, R. Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffusion. Adv. Neural Inf. Process. Syst. 2024, 37, 101474–101497.
Fu, X.; Yan, Z.; Yao, T.; Chen, S.; Li, X. Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 3040–3048.
Yan, Z.; Zhang, Y.; Fan, Y.; Wu, B. UCF: Uncovering Common Features for Generalizable Deepfake Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; pp. 22355–22366.
Sun, K.; Liu, H.; Yao, T.; Sun, X.; Chen, S.; Ding, S.; Ji, R. An Information Theoretic Approach for Attention-Driven Face Forgery Detection. In Proceedings of the Computer Vision—ECCV 2022—17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XIV; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13674, pp. 111–127.
Peng, S.; Zhang, T.; Gao, L.; Zhu, X.; Zhang, H.; Pang, K.; Lei, Z. WMamba: Wavelet-based Mamba for Face Forgery Detection. In Proceedings of the 33rd ACM International Conference on Multimedia, New York, NY, USA, 27–31 October 2025; pp. 4768–4777.
Zhang, Y.; Wang, C.; Zhou, X. MSER-Net: Multi-stage edge refinement network for deepfake detection. Knowl. Based Syst. 2025, 328, 114280.
Zakharov, E.; Shysheya, A.; Burkov, E.; Lempitsky, V.S. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9458–9467.
Zhang, J.; Zeng, X.; Wang, M.; Pan, Y.; Liu, L.; Liu, Y.; Ding, Y.; Fan, C. FReeNet: Multi-Identity Face Reenactment. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5325–5334.
Wiles, O.; Koepke, A.S.; Zisserman, A. X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part XIII; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11217, pp. 690–706.
Hsu, G.S.; Tsai, C.H.; Wu, H.Y. Dual-Generator Face Reenactment. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 632–640.
Doukas, M.C.; Ververas, E.; Sharmanska, V.; Zafeiriou, S. Free-HeadGAN: Neural Talking Head Synthesis With Explicit Gaze Control. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9743–9756.
Liu, Z.; Qi, X.; Torr, P.H. Global Texture Enhancement for Fake Face Detection in the Wild. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8057–8066.
Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the Computer Vision—ECCV 2024—18th European Conference, Milan, Italy, 29 September–4 October 2024; Proceedings, Part LIV; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2024; Volume 15112, pp. 363–380.
Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 6047–6056.
Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part VII; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11211, pp. 3–19.
Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear deformable convolution for improving convolutional neural networks. Image Vis. Comput. 2024, 149, 105190.
Dolhansky, B.; Howes, R.; Pflaum, B.; Baram, N.; Canton-Ferrer, C. The Deepfake Detection Challenge (DFDC) Preview Dataset. arXiv 2019, arXiv:1910.08854.
Dufour, G.R.N.; Gully, A. Contributing Data to Deepfake Detection Research. 2019. Available online: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html (accessed on 7 December 2025).
Li, X.; Ni, R.; Yang, P.; Fu, Z.; Zhao, Y. Artifacts-Disentangled Adversarial Learning for Deepfake Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1658–1670.
Bai, N.; Wang, X.; Han, R.; Hou, J.; Wang, Q.; Pang, S. Towards generalizable face forgery detection via mitigating spurious correlation. Neural Netw. 2025, 182, 106909.
Zhao, H.; Wei, T.; Zhou, W.; Zhang, W.; Chen, D.; Yu, N. Multi-attentional Deepfake Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2185–2194.
Cao, J.; Ma, C.; Yao, T.; Chen, S.; Ding, S.; Yang, X. End-to-End Reconstruction-Classification Learning for Face Forgery Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4103–4112.
Yang, Z.; Liang, J.; Xu, Y.; Zhang, X.; He, R. Masked Relation Learning for DeepFake Detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1696–1708.
Duan, H.; Jiang, Q.; Jin, X.; Wozniak, M.; Zhao, Y.; Wu, L.; Yao, S.; Zhou, W. Mf-net: Multi-feature fusion network based on two-stream extraction and multi-scale enhancement for face forgery detection. Complex Intell. Syst. 2025, 11, 11. [Google Scholar] [CrossRef]
Qiu, X.; Miao, X.; Wan, F.; Duan, H.; Shah, T.; Ojha, V.; Long, Y.; Ranjan, R. D2Fusion: Dual-domain fusion with feature superposition for Deepfake detection. Inf. Fusion 2025, 120, 103087. [Google Scholar] [CrossRef]
Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Volume 97, pp. 6105–6114. [Google Scholar]
Chen, S.; Yao, T.; Chen, Y.; Ding, S.; Li, J.; Ji, R. Local Relation Learning for Face Forgery Detection. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, 2–9 February 2021; pp. 1081–1088. [Google Scholar] [CrossRef]
Sun, K.; Yao, T.; Chen, S.; Ding, S.; Li, J.; Ji, R. Dual Contrastive Learning for General Face Forgery Detection. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; pp. 2316–2324.
Hu, J.; Liao, X.; Liang, J.; Zhou, W.; Qin, Z. FInfer: Frame Inference-Based Deepfake Detection for High-Visual-Quality Videos. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; pp. 951–959.
Dong, S.; Wang, J.; Ji, R.; Liang, J.; Fan, H.; Ge, Z. Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 3994–4004.
Nguyen, D.; Mejri, N.; Singh, I.P.; Kuleshova, P.; Astrid, M.; Kacem, A.; Ghorbel, E.; Aouada, D. LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 17395–17405.
Yan, Z.; Luo, Y.; Lyu, S.; Liu, Q.; Wu, B. Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, 16–22 June 2024; pp. 8984–8994.
Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Proceedings of Machine Learning Research. Volume 139, pp. 11863–11874.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Piscataway, NJ, USA, 2016; pp. 770–778.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359.

Figure 1. LGMamba network.

Figure 2. Visual State Space Block. Different colors are used to distinguish different functional modules and scan directions. Green circles denote element-wise addition, and black arrows indicate the forward flow.

Figure 3. The proposed Facial Structure Awareness Block.

Figure 4. Visualization of Facial Landmarks and Forgery Artifact Regions. Orange dots denote facial landmarks. Cyan dashed boxes and connecting lines highlight typical forgery-prone regions in Sample 1, while the red dashed box indicates the corresponding artifact region in Sample 2.

Figure 5. Illustration of LGConv. Arrows indicate the processing flow. Colored circles in the generating algorithm represent different candidate initial sampling patterns, while colored squares denote the corresponding sampling positions or offsets used for landmark-guided adjustment.

Figure 6. Visualization of Grad-CAM results. The blue and pink panels denote pristine and fake samples, respectively. Warmer colors indicate regions with stronger model attention, while cooler colors denote weaker responses.

Table 1. FF++ intra-dataset evaluation. The best results are highlighted in bold, and the second-best are underlined. * indicates methods re-implemented by us.

Training Set	Methods	Venue	DF AUC (%)	FS AUC (%)	F2F AUC (%)	NT AUC (%)	Avg AUC (%)
DF	Face-Xray *	CVPR2020	98.70	60.07	63.36	69.82	72.99
	MAT	CVPR2021	99.92	40.51	75.23	71.08	71.69
	RECCE *	CVPR2022	99.95	54.72	69.75	77.15	75.39
	ADA * [56]	TCSVT2023	99.68	79.41	73.61	81.92	83.66
	MRL *	TIFS2023	99.75	95.63	79.04	79.69	88.53
	TGF [57]	NN2024	99.47	69.06	77.39	68.51	78.61
	Mf-net	CIS2024	99.97	74.54	70.82	82.37	81.93
	D2Fusion	IF2025	99.98	62.25	77.88	75.73	78.96
	LGMamba (Ours)		99.81	96.82	85.17	83.66	91.37
FS	Face-Xray *	CVPR2020	45.84	95.89	76.17	70.22	72.03
	MAT	CVPR2021	64.13	99.67	66.39	50.10	70.07
	RECCE *	CVPR2022	63.05	99.72	66.21	58.07	71.76
	ADA *	TCSVT2023	72.36	99.91	70.20	62.12	76.15
	MRL *	TIFS2023	89.33	95.67	75.66	80.35	85.25
	TGF	NN2024	81.47	98.93	65.28	60.63	76.58
	Mf-net	CIS2024	76.62	99.91	70.12	60.34	76.75
	D2Fusion	IF2025	77.50	99.92	69.76	58.45	76.41
	LGMamba (Ours)		93.22	98.10	80.63	83.92	88.97
F2F	Face-Xray *	CVPR2020	63.06	68.81	94.43	72.58	74.72
	MAT	CVPR2021	86.15	60.14	99.13	64.59	77.50
	RECCE *	CVPR2022	71.55	50.02	99.20	72.27	73.26
	ADA *	TCSVT2023	90.32	69.49	99.17	73.13	83.03
	MRL *	TIFS2023	81.44	83.43	83.41	79.52	81.95
	TGF	NN2024	78.07	67.58	98.27	74.01	79.48
	Mf-net	CIS2024	86.74	67.51	99.96	90.42	86.16
	D2Fusion	IF2025	89.50	62.64	99.86	75.23	81.81
	LGMamba (Ours)		83.01	77.50	98.23	86.19	86.23
NT	Face-Xray *	CVPR2020	70.51	78.37	79.22	92.57	80.17
	MAT	CVPR2021	87.23	75.33	48.22	98.66	77.36
	RECCE *	CVPR2022	72.37	51.61	64.69	99.59	72.07
	ADA *	TCSVT2023	90.94	78.47	63.28	99.28	82.99
	MRL *	TIFS2023	80.54	81.74	76.56	78.42	79.32
	TGF	NN2024	83.81	63.88	78.60	92.42	79.68
	Mf-net	CIS2024	89.68	64.59	74.97	99.36	82.15
	D2Fusion	IF2025	94.44	80.75	71.08	99.43	86.43
	LGMamba (Ours)		83.88	81.23	83.03	95.07	85.80
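
The Avg column in Table 1 appears to be the arithmetic mean of the four per-manipulation AUC values. The table does not state this explicitly, so it is treated here as an assumption. A minimal Python sketch that recomputes it for the LGMamba rows follows; the values are copied from the table, and the names `LGMAMBA_AUC`, `REPORTED_AVG`, and `mean_auc` are illustrative only.

```python
# AUC (%) for LGMamba, keyed by training subset; column order is DF, FS, F2F, NT.
# Values are copied from Table 1. All names here are illustrative.
LGMAMBA_AUC = {
    "DF": [99.81, 96.82, 85.17, 83.66],
    "FS": [93.22, 98.10, 80.63, 83.92],
    "F2F": [83.01, 77.50, 98.23, 86.19],
    "NT": [83.88, 81.23, 83.03, 95.07],
}
REPORTED_AVG = {"DF": 91.37, "FS": 88.97, "F2F": 86.23, "NT": 85.80}


def mean_auc(values):
    """Arithmetic mean of a list of AUC percentages."""
    return sum(values) / len(values)


for train_set, aucs in LGMAMBA_AUC.items():
    recomputed = mean_auc(aucs)
    # A tolerance of 0.01 allows for rounding to two decimals in the table.
    matches = abs(recomputed - REPORTED_AVG[train_set]) <= 0.01
    print(f"{train_set}: recomputed {recomputed:.3f}, "
          f"reported {REPORTED_AVG[train_set]:.2f}, match={matches}")
```

Under this assumption, the recomputed means agree with the reported Avg values to within rounding for all four training subsets.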

Table 2. Cross-dataset evaluation on CD1, CD2, DFDCP, and DFD. The best results are highlighted in bold, and the second-best are underlined. * indicates methods re-implemented by us. ~ indicates that the corresponding result was not reported in the original paper.

Methods	Venue	CD1 AUC (%)	CD2 AUC (%)	DFDCP AUC (%)	DFD AUC (%)	Avg AUC (%)
Xception * [19]	ICCV2019	78.90	73.75	74.96	80.66	77.07
Ef-b4 * [63]	ICML2019	69.44	64.29	70.38	83.17	71.82
LRL * [64]	AAAI2021	~	78.26	~	89.24	~
LipFor [27]	CVPR2021	~	82.40	~	~	~
DCL * [65]	AAAI2022	~	82.30	76.71	91.66	83.56
FInfer [66]	AAAI2022	70.60	~	70.39	~	~
ADA *	TCSVT2023	82.49	84.62	78.51	92.14	84.44
MRL *	TIFS2023	~	83.58	71.53	~	~
CADDM * [67]	CVPR2023	89.57	77.04	81.23	93.92	85.44
LAA-Net [68]	CVPR2024	~	95.40	86.94	~	~
LSDA [69]	CVPR2024	86.70	83.00	81.50	88.00	84.80
D2Fusion	IF2025	88.14	83.29	~	~	~
UDD [39]	AAAI2025	~	86.90	85.60	91.00	~
LGMamba (Ours)		92.34	96.01	88.87	92.26	92.37

Table 3. Ablation study results for LGConv. The best results are highlighted in bold, and the second-best are underlined.

DConv	Training Set	CD2 AUC (%)	DFDCP AUC (%)
DCN	FF++	92.50	80.77
DSConv	FF++	95.23	84.62
WTConv	FF++	95.51	84.98
LGConv	FF++	96.01	88.87

Table 4. Ablation Study on Attention Mechanisms. The best results are highlighted in bold, and the second-best are underlined.

Components	Training Set	FF++ AUC (%)
None	FF++	87.50
SimAM [70]	FF++	87.93
SENet [71]	FF++	88.22
CBAM	FF++	89.91

Table 5. Comparison across different backbone architectures. The best results are highlighted in bold, and the second-best are underlined. A checkmark indicates that the corresponding module is included.

LGConv	FSAB	Backbone	Params	FLOPs	Inference Time	Training Set	DFDCP AUC (%)	CD2 AUC (%)
		ResNet	44 M	7.8 G	59 ms	FF++	72.57	70.38
	✓	ResNet	45 M	8.0 G	91 ms	FF++	72.91	71.02
✓	✓	ResNet	45 M	8.0 G	127 ms	FF++	73.80	72.22
		Swin-T	49 M	8.5 G	102 ms	FF++	80.22	89.90
	✓	Swin-T	50 M	8.7 G	194 ms	FF++	81.83	91.01
✓	✓	Swin-T	50 M	8.7 G	231 ms	FF++	82.24	92.88
		VMamba	30 M	4.9 G	68 ms	FF++	85.12	94.28
	✓	VMamba	31 M	5.1 G	178 ms	FF++	86.97	95.66
✓	✓	VMamba	31 M	5.1 G	199 ms	FF++	88.87	96.01
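
To read Table 5 as a cost/benefit trade-off, one can compare each plain backbone with its full LGConv + FSAB variant. The sketch below is only a reading aid under that framing: the values are copied from the table, while `ROWS`, `summarize`, and `SUMMARY` are hypothetical names and not part of the authors' code.

```python
# (DFDCP AUC %, CD2 AUC %, inference time in ms), copied from Table 5.
# "base" is the plain backbone; "full" adds both LGConv and FSAB.
ROWS = {
    "ResNet": {"base": (72.57, 70.38, 59), "full": (73.80, 72.22, 127)},
    "Swin-T": {"base": (80.22, 89.90, 102), "full": (82.24, 92.88, 231)},
    "VMamba": {"base": (85.12, 94.28, 68), "full": (88.87, 96.01, 199)},
}


def summarize(base, full):
    """Return AUC gains in percentage points and the latency ratio."""
    dfdcp_gain = round(full[0] - base[0], 2)
    cd2_gain = round(full[1] - base[1], 2)
    latency_ratio = round(full[2] / base[2], 2)
    return dfdcp_gain, cd2_gain, latency_ratio


SUMMARY = {name: summarize(r["base"], r["full"]) for name, r in ROWS.items()}
for name, (dfdcp, cd2, ratio) in SUMMARY.items():
    print(f"{name}: DFDCP +{dfdcp:.2f}, CD2 +{cd2:.2f}, latency x{ratio:.2f}")
```

For the VMamba backbone, this gives gains of 3.75 points on DFDCP and 1.73 points on CD2, at roughly 2.9 times the inference time of the plain backbone.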

Table 6. Robustness evaluation across different types of perturbations. The best results are highlighted in bold, and the second-best are underlined.

Methods	Venue	Clean	Blur	Noise	Block	Avg
Xception	ICCV2019	75.98	73.64	72.75	72.56	73.73
Face-Xray	CVPR2020	80.92	76.31	75.02	77.25	77.38
RECCE	CVPR2022	78.33	76.20	74.69	74.23	75.86
ADA	TCSVT2023	80.92	79.33	80.01	77.59	79.46
MRL	TIFS2023	80.21	79.65	77.32	78.44	78.91
LGMamba		89.91	87.69	88.70	85.95	88.06


Share and Cite

MDPI and ACS Style

Chen, H.; Zhang, Z.; Li, Q.; Feng, C. Towards Generalizable Deepfake Detection via Facial Landmark-Guided Convolution and Local Structure Awareness. Algorithms 2026, 19, 270. https://doi.org/10.3390/a19040270

