Article

Forgery-Aware Guided Spatial–Frequency Feature Fusion for Face Image Forgery Detection

School of Cyberspace Security, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1148; https://doi.org/10.3390/sym17071148
Submission received: 28 June 2025 / Revised: 11 July 2025 / Accepted: 15 July 2025 / Published: 18 July 2025
(This article belongs to the Section Computer)

Abstract

The rapid development of deepfake technologies has led to the widespread proliferation of facial image forgeries, raising significant concerns over identity theft and the spread of misinformation. Although recent dual-domain detection approaches that integrate spatial and frequency features have achieved noticeable progress, they still suffer from limited sensitivity to local forgery regions and inadequate interaction between spatial and frequency information in practical applications. To address these challenges, we propose a novel forgery-aware guided spatial–frequency feature fusion network. A lightweight U-Net is employed to generate pixel-level saliency maps by leveraging structural symmetry and semantic consistency, without relying on ground-truth masks. These maps dynamically guide the fusion of spatial features (from an improved Swin Transformer) and frequency features (via Haar wavelet transforms). Cross-domain attention, channel recalibration, and spatial gating are introduced to enhance feature complementarity and regional discrimination. Extensive experiments conducted on two benchmark face forgery datasets, FaceForensics++ and Celeb-DFv2, show that the proposed method consistently outperforms existing state-of-the-art techniques in terms of detection accuracy and generalization capability. Future work includes improving robustness under compression, incorporating temporal cues, extending to multimodal scenarios, and evaluating model efficiency for real-world deployment.

1. Introduction

With the rapid advancement of deepfake and image manipulation technologies, facial image forgeries have become increasingly common in daily life. Today, even ordinary users can easily modify or synthesize fake facial images using open-source tools [1,2,3]. This accessibility significantly lowers the barrier to creating convincing fake images, making such forgeries more widespread and harder to detect. Figure 1 [4,5] illustrates several common types of facial image forgeries. These altered facial images pose serious threats, such as identity theft, misinformation dissemination, and financial fraud, severely undermining public trust and information security systems [6]. Therefore, developing effective detection methods to identify facial manipulations is of critical importance.
Early facial forgery detection methods mainly relied on digital forensics and handcrafted features, including texture consistency analysis [7,8], noise statistical modeling [9], and error level analysis (ELA) [10]. While these approaches perform reasonably well on simple manipulations, they often fail when faced with high-fidelity deepfake content, leading to poor robustness and high false-negative rates. With the development of deep neural networks and large-scale facial datasets, end-to-end models based on CNNs or Transformers have become the mainstream [11,12,13,14]. These models can automatically extract potential tampering patterns and generalize across various manipulation types. Moreover, frequency-domain modeling has been widely adopted to capture high-frequency perturbations and spectral anomalies in images, using techniques such as DCT, FFT, and wavelet transforms [15,16,17]. These serve as a valuable complement to spatial semantic features.
However, relying solely on features from a single domain often overlooks local details and spatial structures, making it difficult to comprehensively capture inconsistencies within forged images. In particular, manipulated facial images frequently disrupt natural symmetry in key facial regions such as the eyes and mouth. This facial symmetry is typically stable and crucial in authentic images. Traditional forgery detection methods have largely failed to exploit this symmetry property, resulting in blind spots when detecting subtle manipulations. To address these limitations, recent studies have explored joint modeling of spatial and frequency domain features. For instance, LRL [18] enhances spatial features through an RGB–frequency attention module, SPSL [19] uses phase spectrum features for identifying upsampling artifacts, and F2Trans [20] introduces a high-frequency guided strategy to improve fine-grained feature extraction. Nevertheless, most existing methods remain confined to basic spatial–frequency fusion and face two critical challenges. First, existing methods lack sufficient sensitivity to local tampered regions, making it difficult for models to focus on key manipulated areas—especially small regions such as asymmetric eyes or mouths or semantically contradictory expressions like a smiling mouth paired with neutral eyes. Second, their fusion strategies provide limited cross-domain complementarity, failing to capture high-order collaborative relationships between spatial and frequency features.
To overcome these limitations, we propose a forgery-aware guided spatial–frequency feature fusion network. Based on both spatial and frequency domain features, we introduce a forgery-aware module (FAM) and a symmetry analysis mechanism to integrate spatial semantics, spectral cues, and forgery saliency into a unified detection framework. This design enhances the model’s ability to focus on manipulated regions and improves feature interaction and the completeness of feature representation.
The main contributions of this work are as follows.
  • We propose a novel forgery-aware guided spatial–frequency feature fusion network that jointly utilizes spatial semantics, frequency patterns, and forgery-aware cues. This unified framework enhances the detection of fine-grained tampering and strengthens the complementarity between spatial and frequency domains.
  • We design a lightweight forgery-aware module (FAM) based on U-Net to generate pixel-level saliency maps. These maps, guided by a symmetry analysis mechanism and semantic consistency modeling, help the network concentrate on manipulated facial regions, improving regional discrimination without requiring ground-truth masks.
  • We enhance feature extraction by incorporating an improved Swin Transformer with dynamic windowing and spatial pyramid pooling for capturing multi-scale spatial features. Additionally, we utilize Haar wavelet transforms to extract high-frequency forgery artifacts. A cross-domain interaction mechanism, along with channel recalibration and spatial gating, is employed to effectively fuse and refine features from both domains.
The remainder of this paper is organized as follows. Section 2 reviews related work, summarizing existing algorithms and detection methods. Section 3 presents the proposed forgery-aware guided spatial–frequency feature fusion network in detail, including its overall architecture and key components. Section 4 describes the experimental setup and presents results from comparative evaluations, ablation studies, and visual analyses. Section 5 provides a discussion of the findings and potential limitations. Finally, Section 6 concludes this paper and outlines future research directions.

2. Related Work

In recent years, face forgery detection has evolved from traditional image processing techniques to deep learning-based models. Early approaches primarily relied on edge detection, illumination consistency analysis, noise estimation, and color–texture features, aiming to identify tampering traces by analyzing both local irregularities and global consistency. However, these methods often suffer from limited accuracy and robustness, especially under complex real-world conditions.
With the widespread adoption of deep learning, convolutional neural networks (CNNs) have become the mainstream solution for face manipulation detection. Representative architectures such as Xception [21] and EfficientNet [22] enable effective classification of real and fake images by extracting multi-scale representations through hierarchical convolutional layers. Transformer-based methods have emerged more recently. The Vision Transformer (ViT) [23] demonstrates superior capability in modeling global contextual information, while the Swin Transformer [24] enhances the perception of multi-scale tampering traces through a hierarchical window-based attention mechanism. ViXNet, proposed by Ganguly et al. [25], integrates Xception and ViT structures to improve the network’s sensitivity to subtle local artifacts in facial regions.
Beyond spatial-domain analysis, researchers have observed that deepfakes often leave distinctive artifacts in the frequency domain, which are difficult to perceive directly in the RGB space. Qian and Wu et al. employed the discrete cosine transform (DCT) to extract frequency-based features [16,26]. HFI-Net [27], introduced by Miao et al., adopts a dual-branch design with a global–local interaction (GLI) module to capture hierarchical frequency artifacts. F3-Net [28], proposed by Liu et al., leverages frequency-aware decomposition (FAD) in one branch to detect subtle forgery patterns, while the other branch applies local frequency statistics (LFS) to extract high-level semantics for real–fake classification. Tan et al. proposed FreqNet [29], which focuses on high-frequency components and introduces a frequency learning module to obtain domain-independent features, thereby improving detection accuracy. Similarly, MLFFEViT [30] utilizes discrete wavelet transforms in conjunction with a Vision Transformer to enhance multi-level local features for robust frequency-aware deepfake detection.
To better exploit both spatial and frequency information while preserving multi-scale feature representations, researchers have explored cross-domain fusion strategies [31,32,33]. Chen et al. [18] designed a joint RGB–frequency attention module to enhance forgery localization through local relation learning. Recognizing that upsampling is a common step in most face manipulation pipelines, Liu et al. [19] developed the spatial-phase shallow learning (SPSL) method, which targets phase spectrum variations to detect upsampling artifacts. Building on this direction, a recent study [34] rethinks the role of upsampling operations in CNN-based generative networks and proposes an upsampling-aware detection mechanism to improve generalization across manipulation methods. F2Trans, introduced by Miao et al. [20], employs a high-frequency fine-grained transformer to integrate spatial and frequency cues, enabling the detection of subtle tampering signals. These dual-domain methods have shown notable improvements in detection accuracy and generalization ability.
Nevertheless, challenges remain. Most existing approaches lack strong regional awareness and suffer from insufficient coupling between spatial and frequency features. This limitation becomes more pronounced in complex cases involving blurred boundaries or small-scale manipulations.

3. Methods

This paper proposes a forgery-aware guided spatial–frequency feature fusion network, as illustrated in Figure 2. The overall architecture consists of four key modules: a forgery-aware module, a spatial feature branch, a frequency feature branch, and a fusion module. The network is designed to simultaneously extract complementary clues of facial forgery—namely, pixel-level forgery saliency, semantic spatial features, and frequency-domain artifacts—from different perspectives. These features are then integrated through a fusion mechanism to enhance cross-domain complementarity and unified decision-making, thereby improving the accuracy and generalization of forgery detection. The forgery-aware module takes intermediate spatial features as input and employs a lightweight U-Net [35] structure to generate pixel-level saliency maps, which are used to guide regional attention during the subsequent fusion process. The spatial branch leverages an improved Swin Transformer with a dynamic window mechanism to strengthen global feature representation, while incorporating a multi-scale spatial pyramid structure to model local semantics. The frequency branch performs hierarchical transforms to extract frequency-domain information, aiming to enhance sensitivity to high-frequency disturbances and compression artifacts. To achieve effective cross-domain modeling, the fusion module adopts a cross-domain interaction mechanism that captures dependencies across both channel and spatial dimensions. This collaborative strategy improves the network’s discriminative capability. The following sections describe each component in detail, including the forgery-aware module (Section 3.1), the spatial feature extraction module (Section 3.2), the frequency modeling module (Section 3.3), and the spatial–frequency fusion mechanism (Section 3.4).

3.1. Forgery-Aware Module Guided by Facial Symmetry and Semantic Consistency

In face forgery detection, manipulated regions are typically small, randomly located, and visually inconspicuous, often featuring blurred boundaries and low saliency. These subtle alterations are challenging to localize using global modeling, potentially resulting in missed detections. To address this, we propose a forgery-aware module guided by facial symmetry and semantic consistency (FAM-GS), which generates pixel-level saliency maps to highlight suspected regions. These maps guide the adaptive fusion of spatial and frequency features, enhancing the model’s sensitivity to tampered areas.

3.1.1. Symmetry-Aware Modeling of Facial Structure

Human faces naturally exhibit strong bilateral symmetry, particularly in key regions such as the eyes, eyebrows, and mouth corners. Forgery operations—such as splicing or region replacement—often disrupt this symmetry, resulting in position mismatches or boundary inconsistencies.
To detect such asymmetries, FAM-GS includes a symmetry-aware mechanism. Both the input image and its horizontally flipped version are passed through a shared feature encoder, and their feature difference is computed via an $L_1$ loss:
$$\mathcal{L}_{\mathrm{sym}} = \big\| f_{\mathrm{enc}}(x) - \mathrm{Flip}\big(f_{\mathrm{enc}}(\mathrm{Flip}(x))\big) \big\|_1,$$
where $f_{\mathrm{enc}}(\cdot)$ is the encoder and $\mathrm{Flip}(\cdot)$ denotes horizontal flipping. This formulation enables a soft, statistical measure of structural inconsistency. Rather than enforcing absolute symmetry, the model learns to differentiate forgery-induced asymmetry from pose-induced variations through training. This design offers robustness even under non-frontal or distorted facial orientations.
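For clarity, a minimal PyTorch sketch of this loss is given below; the argument `f_enc` stands in for the shared feature encoder, and flipping is taken along the width axis (names are illustrative, not from our implementation).

```python
import torch
import torch.nn.functional as F

def symmetry_loss(f_enc, x):
    """Structural symmetry loss L_sym: L1 distance between encoder features of
    the input and the re-flipped features of its horizontal mirror."""
    feat = f_enc(x)                                   # f_enc(x)
    feat_mirror = f_enc(torch.flip(x, dims=[-1]))     # f_enc(Flip(x))
    feat_mirror = torch.flip(feat_mirror, dims=[-1])  # Flip(f_enc(Flip(x)))
    return F.l1_loss(feat, feat_mirror)
```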

3.1.2. Semantic Consistency-Aware Modeling of Facial Expressions

Besides structural anomalies, semantic contradictions between facial regions are common in forged images. For example, a person might appear to smile due to mouth curvature, while the eyes express a neutral emotion. FAM-GS captures such inconsistencies by modeling both local and global facial expressions.
Facial landmarks are first used to extract features from key regions such as the eyes and mouth. A lightweight MLP predicts the expression category for each region. Meanwhile, a global representation of facial emotion is derived. Their discrepancy is measured by a semantic consistency loss:
$$\mathcal{L}_{\mathrm{exp}} = \sum_{r \in \mathrm{regions}} \mathrm{CE}\big(E_{\mathrm{local}}(r),\, E_{\mathrm{global}}\big),$$
where $E_{\mathrm{local}}(r)$ is the predicted expression for region $r$, $E_{\mathrm{global}}$ is the global emotion class, and $\mathrm{CE}$ denotes the cross-entropy loss.
To further enhance robustness, we introduce a symmetric semantic consistency loss that encourages expression similarity between symmetric components (e.g., eyes and mouth corners):
$$\mathcal{L}_{\mathrm{sem\_sym}} = \big\| F_{\mathrm{left\_eye}} - F_{\mathrm{right\_eye}} \big\|_1 + \big\| F_{\mathrm{left\_mouth}} - F_{\mathrm{right\_mouth}} \big\|_1.$$
This loss promotes balanced representations across symmetric facial regions, strengthening the model’s ability to identify semantically unnatural manipulations.
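The two consistency losses can be expressed compactly as in the following sketch; it assumes the regional logits and part features have already been extracted, and the dictionary keys and tensor shapes are illustrative.

```python
import torch.nn.functional as F

def expression_consistency_loss(local_logits, global_label):
    """L_exp: cross-entropy between each regional expression prediction and the
    global emotion class. `local_logits` maps a region name (e.g., "eyes",
    "mouth") to a (B, num_classes) logit tensor; `global_label` has shape (B,)."""
    return sum(F.cross_entropy(logits, global_label) for logits in local_logits.values())

def symmetric_semantic_loss(part_feats):
    """L_sem_sym: L1 distance between features of symmetric facial components.
    `part_feats` maps a part name to a (B, D) feature tensor."""
    return (F.l1_loss(part_feats["left_eye"], part_feats["right_eye"])
            + F.l1_loss(part_feats["left_mouth"], part_feats["right_mouth"]))
```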

3.1.3. Saliency Map Generation and Guided Feature Fusion

FAM-GS constructs a saliency map from intermediate features extracted by ResNet-50. We use features from res2, res3, and res4 layers. Each is passed through a 1 × 1 convolution to unify channel dimensions, and lower-resolution features are upsampled via bilinear interpolation to match the spatial size of the highest-resolution feature map. These are then concatenated along the channel dimension:
$$F_s^{\mathrm{mid}} = \mathrm{Concat}\big(\phi_2(F_2),\, \mathrm{Up}_3(\phi_3(F_3)),\, \mathrm{Up}_4(\phi_4(F_4))\big),$$
where $\phi_i(\cdot)$ denotes a $1 \times 1$ convolution and $\mathrm{Up}_j(\cdot)$ is bilinear upsampling. The fused feature map $F_s^{\mathrm{mid}}$ is then input to a lightweight U-Net to generate a saliency map:
$$M = \sigma\big(f_{\mathrm{dec}}(f_{\mathrm{enc}}(F_s^{\mathrm{mid}}))\big),$$
with $\sigma(\cdot)$ denoting the Sigmoid function. The U-Net leverages skip connections to preserve local information while extracting contextual cues. Importantly, this saliency map is trained without explicit pixel-level supervision; instead, it is optimized jointly via the main classification loss and the auxiliary structural and semantic losses ($\mathcal{L}_{\mathrm{sym}}$, $\mathcal{L}_{\mathrm{exp}}$, $\mathcal{L}_{\mathrm{sem\_sym}}$). This setup enables weakly supervised learning of manipulation-sensitive attention maps.
During fusion, the saliency map $M$ modulates the spatial ($F_s$) and frequency ($F_f$) features through element-wise weighting:
$$\tilde{F}_s = M \odot F_s,$$
$$\tilde{F}_f = (1 - M) \odot F_f,$$
where ⊙ denotes broadcasted pixel-wise multiplication. This operation allocates more attention to tampered regions and balances spatial–frequency information accordingly, thereby enhancing both discriminative capability and localization precision. As a core component, FAM-GS provides dynamic, region-aware guidance to the fusion process, significantly improving detection performance.
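The following PyTorch sketch illustrates the saliency-guided weighting described above. The class name, channel widths, and the 1 × 1 convolution standing in for the lightweight U-Net are assumptions for illustration, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyGuidedFusion(nn.Module):
    """Illustrative sketch of FAM-GS saliency guidance: aggregate ResNet-50
    res2-res4 features, predict a pixel-level saliency map M, and use it to
    reweight the spatial (F_s) and frequency (F_f) branches."""

    def __init__(self, in_channels=(256, 512, 1024), mid_channels=128, unet=None):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid_channels, 1) for c in in_channels])
        # A 1x1 convolution stands in for the lightweight U-Net here.
        self.unet = unet if unet is not None else nn.Conv2d(3 * mid_channels, 1, 1)

    def forward(self, f2, f3, f4, F_s, F_f):
        target = f2.shape[-2:]  # highest-resolution feature size
        feats = [
            self.reduce[0](f2),
            F.interpolate(self.reduce[1](f3), size=target, mode="bilinear", align_corners=False),
            F.interpolate(self.reduce[2](f4), size=target, mode="bilinear", align_corners=False),
        ]
        M = torch.sigmoid(self.unet(torch.cat(feats, dim=1)))  # saliency map M
        M = F.interpolate(M, size=F_s.shape[-2:], mode="bilinear", align_corners=False)
        return M * F_s, (1.0 - M) * F_f                        # ~F_s, ~F_f
```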

3.2. Spatial-Domain Feature Extraction Module

3.2.1. Global Feature Extraction Module

To enhance contextual modeling and long-range dependencies across windows, we improve the traditional Swin Transformer and build a global feature extraction module, as shown in Figure 3. First, we introduce a dynamic window mechanism that adjusts the window size N based on the input resolution:
$$N = \max\!\left(4,\ \frac{\min(H, W)}{k}\right),$$
where k is a scaling factor. This ensures effective multi-scale representation and adaptability to varying resolutions.
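As a concrete illustration, the window-size rule can be implemented as below; the integer division and the default k = 32 are our own illustrative choices rather than values reported above.

```python
def dynamic_window_size(height, width, k=32):
    """Dynamic window size N = max(4, min(H, W) / k); the integer division and
    the default k = 32 are illustrative choices."""
    return max(4, min(height, width) // k)
```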
Within each window, features are projected to generate Q, K, and V, and attention is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}} + R\right) V,$$
where R denotes the relative position bias capturing spatial structure. We incorporate a geometric-bias form of R to better represent internal spatial patterns with negligible computational overhead.
To improve feature fusion across windows, we adopt an explicit window interaction mechanism using depthwise convolutions. Horizontal and vertical contexts are aggregated via k × 1 and 1 × k convolutions:
$$Y_{i,j}^{(h)} = \sum_{u=1}^{k} W_{i,\,j-u} \cdot K_{u}^{h},$$
$$Y_{i,j}^{(v)} = \sum_{u=1}^{k} W_{i-u,\,j} \cdot K_{u}^{v},$$
and the final interaction output is
$$Y_{i,j} = Y_{i,j}^{(h)} + Y_{i,j}^{(v)}.$$
This mechanism strengthens cross-window communication while preserving lightweight computation. Additionally, we replace selected convolution layers with depthwise separable convolutions to reduce computational cost and enhance feature expressiveness.
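A possible realization of the strip-convolution interaction is sketched below; the kernel size k = 7 and the module name are illustrative assumptions.

```python
import torch.nn as nn

class WindowInteraction(nn.Module):
    """Explicit window interaction via depthwise strip convolutions: a 1 x k
    kernel aggregates horizontal context, a k x 1 kernel aggregates vertical
    context, and the two responses are summed (Y = Y^(h) + Y^(v))."""

    def __init__(self, channels, k=7):
        super().__init__()
        self.h_conv = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
        self.v_conv = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)

    def forward(self, x):  # x: (B, C, H, W) features after window attention
        return self.h_conv(x) + self.v_conv(x)
```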
Altogether, the global module integrates dynamic window partitioning, internal attention, and inter-window communication, yielding semantically rich and spatially aware global representations.

3.2.2. Local Feature Extraction Module

The local feature extraction module complements global modeling by focusing on fine-grained spatial cues such as edges, textures, and local patterns. It utilizes depthwise separable convolutions, spatial pyramid pooling, and cross-stage fusion to extract multi-scale local features efficiently.
Initially, the input feature map undergoes a 3 × 3 convolution for downsampling and local detail extraction. A sequence of 1 × 1 , 3 × 3 , and 1 × 1 convolutions then enhances nonlinear representation:
$$X_{\mathrm{conv}} = \mathrm{Conv}_{1\times1}\big(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{1\times1}(X))\big).$$
Next, a multi-scale spatial pyramid pooling (SPP) module captures contextual features using max-pooling with kernel sizes 5 × 5, 9 × 9, and 13 × 13. These pooled features are concatenated with the original features:
$$X_{\mathrm{SPP}} = \mathrm{Concat}\big(X_{\mathrm{conv}},\, \mathrm{Pool}_5(X_{\mathrm{conv}}),\, \mathrm{Pool}_9(X_{\mathrm{conv}}),\, \mathrm{Pool}_{13}(X_{\mathrm{conv}})\big).$$
The resulting $X_{\mathrm{SPP}}$ is compressed and fused via $1 \times 1$ and $3 \times 3$ convolutions. To retain low-level information, a shortcut path applies a $1 \times 1$ convolution directly to the original input $X$.
Finally, features from both the pooled and shortcut paths are concatenated and integrated:
$$X_{\mathrm{output}} = \mathrm{Conv}_{1\times1}\Big(\mathrm{Concat}\big(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{1\times1}(\mathrm{Conv}_{1\times1}(X_{\mathrm{SPP}}))),\, \mathrm{Conv}_{1\times1}(X)\big)\Big).$$
This module enables efficient local representation with reduced computation. The combination of SPP, depthwise convolutions, and residual fusion allows for detailed local feature learning while maintaining high efficiency.
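A simplified PyTorch sketch of this local branch is given below; it follows the 1 × 1–3 × 3–1 × 1 bottleneck, the 5/9/13 SPP pooling, and the 1 × 1 shortcut described above, while the channel widths and the exact ordering of the compression convolutions are illustrative.

```python
import torch
import torch.nn as nn

class LocalSPPBlock(nn.Module):
    """Simplified local branch: 1x1-3x3-1x1 bottleneck, SPP with 5/9/13
    max-pooling, compression, and a 1x1 shortcut fused by a final 1x1 conv."""

    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True))
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)])
        self.compress = nn.Sequential(
            nn.Conv2d(4 * mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.shortcut = nn.Conv2d(in_ch, mid_ch, 1)
        self.out = nn.Conv2d(2 * mid_ch, mid_ch, 1)

    def forward(self, x):
        y = self.bottleneck(x)                                    # X_conv
        y = torch.cat([y] + [p(y) for p in self.pools], dim=1)    # X_SPP
        y = self.compress(y)
        return self.out(torch.cat([y, self.shortcut(x)], dim=1))  # X_output
```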

3.3. Frequency-Domain Feature Extraction Module

Forgery operations often introduce artifacts in the frequency domain. To exploit this, we design a frequency-domain module based on a two-stage Haar wavelet transform to extract multi-scale high-frequency and low-frequency features for forgery detection, as shown in Figure 3. The Haar transform captures both fine-grained textures and broader semantic information by decomposing feature maps across scales.
In the first stage, the input feature map $x \in \mathbb{R}^{C \times H \times W}$ is transformed into four components: diagonal ($D$), vertical ($V$), and horizontal ($H$) high-frequency components, and a low-frequency component ($L$). The process for each channel $X_c$ is defined as follows:
$$D[i,j] = \tfrac{1}{2}\big(X_c[2i,2j] - X_c[2i,2j{+}1] - X_c[2i{+}1,2j] + X_c[2i{+}1,2j{+}1]\big),$$
$$V[i,j] = \tfrac{1}{2}\big(X_c[2i,2j] - X_c[2i,2j{+}1] + X_c[2i{+}1,2j] - X_c[2i{+}1,2j{+}1]\big),$$
$$H[i,j] = \tfrac{1}{2}\big(X_c[2i,2j] + X_c[2i,2j{+}1] - X_c[2i{+}1,2j] - X_c[2i{+}1,2j{+}1]\big),$$
$$L[i,j] = \tfrac{1}{2}\big(X_c[2i,2j] + X_c[2i,2j{+}1] + X_c[2i{+}1,2j] + X_c[2i{+}1,2j{+}1]\big),$$
where $i \in [1, H/2]$ and $j \in [1, W/2]$. The low-frequency component $L$ from the first stage undergoes a second Haar transform, generating a new set of $D$, $V$, $H$, and $L$ components. This two-stage structure improves feature discrimination across scales, especially in high-frequency bands.
To emphasize detail, each high-frequency component is processed with convolutional layers and adaptive scaling. We then concatenate the high-frequency outputs from both stages. Due to resolution differences, bilinear interpolation aligns all feature maps to the size of the second-stage outputs. Let $F_1 = \mathrm{Concat}(H_1, V_1, D_1)$ and $F_2 = \mathrm{Concat}(H_2, V_2, D_2)$ represent the high-frequency results of the first and second transforms, respectively. Their shapes are $(3C, H/2, W/2)$ and $(3C, H/4, W/4)$.
The final output is computed as follows:
$$F_{\mathrm{high}},\, F_{\mathrm{low}} = \Big(\mathrm{Concat}\big(\mathrm{Interpolate}(F_1, \mathrm{size}{=}F_2),\, F_2\big),\ L_2\Big),$$
where $F_{\mathrm{high}}$ represents the concatenated high-frequency features and $L_2$ is the low-frequency output from the second transform. All maps are unified to the size $(3C, H/4, W/4)$.
This frequency decomposition provides complementary features: high-frequency components highlight textures and edges, while low-frequency features preserve holistic structure and semantics. While wavelet transforms alone may not fully detect manipulations, this module lays a strong foundation for cross-domain fusion with spatial features. The resulting frequency features enhance both local anomaly detection and global representation, improving the robustness of face forgery detection.
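The two-stage decomposition can be reproduced with a few tensor slicing operations, as sketched below; the function names are illustrative, and only the construction of F_high and F_low (without the subsequent convolutions and adaptive scaling) is shown.

```python
import torch
import torch.nn.functional as F

def haar_decompose(x):
    """Single-level 2D Haar transform of a (B, C, H, W) tensor (H, W even);
    returns low-frequency L and high-frequency H, V, D bands of size H/2 x W/2."""
    a = x[..., 0::2, 0::2]  # X[2i, 2j]
    b = x[..., 0::2, 1::2]  # X[2i, 2j+1]
    c = x[..., 1::2, 0::2]  # X[2i+1, 2j]
    d = x[..., 1::2, 1::2]  # X[2i+1, 2j+1]
    L = 0.5 * (a + b + c + d)
    H = 0.5 * (a + b - c - d)
    V = 0.5 * (a - b + c - d)
    D = 0.5 * (a - b - c + d)
    return L, H, V, D

def two_stage_haar(x):
    """Two-stage decomposition: first-stage high-frequency bands are resized to
    the second-stage resolution and concatenated, yielding F_high and F_low."""
    L1, H1, V1, D1 = haar_decompose(x)
    L2, H2, V2, D2 = haar_decompose(L1)
    F1 = torch.cat([H1, V1, D1], dim=1)  # (B, 3C, H/2, W/2)
    F2 = torch.cat([H2, V2, D2], dim=1)  # (B, 3C, H/4, W/4)
    F1 = F.interpolate(F1, size=F2.shape[-2:], mode="bilinear", align_corners=False)
    return torch.cat([F1, F2], dim=1), L2  # F_high, F_low
```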

3.4. Forgery-Aware Guided Spatial–Frequency Fusion Module

3.4.1. Cross-Domain Interaction Fusion Module

To further explore the complementarity between spatial and frequency features and achieve deep semantic synergy, we design a forgery-aware guided cross-domain interaction fusion module. This module enables deeper alignment and adaptive selection of cross-domain information. It not only emphasizes the integration of features from different semantic domains but also incorporates regional attention to focus the model on potentially forged areas during the fusion process.
Specifically, the spatial features $F_s$ and frequency features $F_f$ capture discriminative cues from structural and frequency perturbations, respectively. However, due to their divergent semantic distributions, direct fusion often results in poor alignment, redundant information, or mutual interference. To mitigate this, we introduce a forgery saliency map $M$ as a region-guided weighting mask. Pixel-wise weighting is applied to both domains to obtain attention-aware feature maps $\tilde{F}_s$ and $\tilde{F}_f$.
These two weighted feature maps are then passed into a multi-scale orthogonal convolution module, which constructs diverse receptive fields using combined strip convolutions of size 1 × k and k × 1 (OC). This enhances the structural expressiveness of features. A 1 × 1 convolution is subsequently used to unify dimensions and generate the Q, K, and V triplets for attention computation:
$$Q, K, V = \delta_{1\times1}\big(\mathrm{OC}_7(\mathrm{LN}(x)) + \mathrm{OC}_{11}(\mathrm{LN}(x)) + \mathrm{OC}_{21}(\mathrm{LN}(x))\big).$$
To ensure semantic alignment during cross-domain fusion, we apply a bi-directional cross-attention mechanism. The spatial and frequency domains, respectively, construct attention paths $(Q_1, K_2, V_2)$ and $(Q_2, K_1, V_1)$, enabling cross-query operations between the two domains:
$$F_1 = \delta_{1\times1}\!\left(\mathrm{Softmax}\!\left(\frac{Q_2 K_1^{T}}{\sqrt{d}}\right) V_1\right),$$
$$F_2 = \delta_{1\times1}\!\left(\mathrm{Softmax}\!\left(\frac{Q_1 K_2^{T}}{\sqrt{d}}\right) V_2\right).$$
The final cross-domain fused representation is obtained by concatenating the outputs from both attention branches:
$$F_{\mathrm{fuse}} = \mathrm{Concat}(F_1, F_2).$$
This module enables a structure-aware cross-domain attention fusion strategy. Through region-guided weighting initialization, multi-scale structural encoding, and semantic cross-matching, it achieves a highly cooperative representation between spatial and frequency domains while preserving their individual discriminative characteristics. This provides a more distinguishable joint representation for downstream classification tasks.
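A compact sketch of the bi-directional cross-attention step is shown below; it assumes the saliency weighting, orthogonal strip convolutions, and $1 \times 1$ projections $\delta$ have already produced the two (Q, K, V) triplets, so only the cross-query computation and concatenation are illustrated.

```python
import torch

def cross_domain_attention(Q1, K1, V1, Q2, K2, V2):
    """Bi-directional cross-attention between the spatial (subscript 1) and
    frequency (subscript 2) branches; tensors are (B, N, d) with spatial
    positions flattened into N tokens."""
    d = Q1.shape[-1]
    A1 = torch.softmax(Q2 @ K1.transpose(-2, -1) / d ** 0.5, dim=-1) @ V1  # frequency queries spatial
    A2 = torch.softmax(Q1 @ K2.transpose(-2, -1) / d ** 0.5, dim=-1) @ V2  # spatial queries frequency
    return torch.cat([A1, A2], dim=-1)                                     # F_fuse
```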

3.4.2. Channel Recalibration Attention

The fused spatial–frequency features may still contain redundant or weakly discriminative channels. Before proceeding to finer spatial modeling, it is necessary to selectively enhance informative channels at the global level. Inspired by the Squeeze-and-Excitation (SE) [36] mechanism, we adopt a channel recalibration strategy. Global average pooling (GAP) is applied to compute channel-wise statistics, followed by a two-layer perceptron for nonlinear transformation and compression. The resulting attention map is used to rescale the feature channels:
$$A_c = \sigma\big(W_2 \cdot \delta(W_1 \cdot \mathrm{GAP}(F_{\mathrm{fuse}}))\big) \in \mathbb{R}^{B \times C \times 1 \times 1},$$
$$F_c = A_c \odot F_{\mathrm{fuse}},$$
where $W_1$ and $W_2$ are learnable parameters, $\delta$ denotes the ReLU activation, and $\sigma$ is the Sigmoid function. This module dynamically modulates the importance of each channel at the global level, enhancing the network’s responsiveness to critical feature pathways.
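The recalibration step mirrors the standard SE design, as in the following sketch; the reduction ratio of 16 is an illustrative choice.

```python
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """SE-style channel recalibration: GAP -> W1 -> ReLU -> W2 -> Sigmoid,
    followed by channel-wise rescaling of the fused features."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # GAP
            nn.Conv2d(channels, channels // reduction, 1),  # W1
            nn.ReLU(inplace=True),                          # delta
            nn.Conv2d(channels // reduction, channels, 1),  # W2
            nn.Sigmoid())                                   # sigma

    def forward(self, f_fuse):
        return self.fc(f_fuse) * f_fuse                     # F_c = A_c * F_fuse
```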

3.4.3. Spatial Gating Attention

Although channel recalibration strengthens global semantic representation, forged regions may still be neglected due to background interference, especially in complex forgery scenarios. To further enhance regional focus, we introduce a spatial gating mechanism after channel attention. A spatial attention map is generated through a 3 × 3 convolution:
$$A_s = \sigma\big(\mathrm{Conv}_{3\times3}(F_{\mathrm{fuse}})\big) \in \mathbb{R}^{B \times 1 \times H \times W},$$
where $A_s$ represents the importance of each spatial location, estimating the likelihood that a region contains forgery. The final output is obtained by element-wise multiplication of the spatial map with the input features:
$$F_{\mathrm{final}} = A_s \odot F_{\mathrm{fuse}}.$$
This step further enhances the localization of tampered areas such as blurred boundaries or low-contrast regions. By suppressing background noise while retaining channel discrimination, it improves the quality of spatial representation and overall detection robustness.
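A minimal sketch of this gating step is given below; the module name is illustrative.

```python
import torch
import torch.nn as nn

class SpatialGating(nn.Module):
    """Spatial gating attention: a 3x3 convolution produces a one-channel map
    A_s of per-location forgery likelihood that rescales the fused features."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, f_fuse):
        A_s = torch.sigmoid(self.conv(f_fuse))  # (B, 1, H, W)
        return A_s * f_fuse                     # F_final
```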

4. Experiments

This section presents a comprehensive evaluation of the proposed method through four main components. First, the experimental setup is introduced, detailing the datasets used, the data preprocessing steps, and the training and evaluation settings. Second, we conduct a comparative evaluation against several state-of-the-art methods to demonstrate the effectiveness of our approach. Third, an ablation study is performed to analyze the contributions of key architectural components. Finally, a visualization analysis using Grad-CAM [37] is provided to enhance the interpretability of the model and highlight its forgery-aware capability.

4.1. Experimental Setup

4.1.1. Datasets

In this paper, we select two existing publicly available datasets, namely FaceForensics++ (FF++) and Celeb-DFv2. FaceForensics++ (FF++) [4]: This dataset contains 1000 original video sequences collected from 977 YouTube videos, along with 4000 forged videos generated using four common face manipulation techniques: DeepFakes, Face2Face, FaceSwap, and NeuralTextures. For each video, three quality levels are provided: a visually lossless version (referred to as RAW) and two lossy compressed versions using H.264 encoding with constant quantization parameters (QPs) of 23 and 40, labeled as C23 and C40, respectively. It is important to note that the RAW version is not a true uncompressed format from YouTube, but rather a high-quality re-encoded version prepared by the dataset authors. All videos are uniformly downsampled to a fixed resolution of 854 × 480 pixels (480p) to ensure consistency across all quality levels. In our experiments, we primarily use the C23 (moderate compression) and C40 (high compression) versions to better simulate real-world forensic conditions. Following common practice, the dataset is split into training, validation, and test sets in a 720:140:140 ratio. Celeb-DFv2 [5]: This dataset is collected from 590 original YouTube videos of 59 well-known public figures, encompassing diverse genders, ethnicities, and facial features, as well as 5639 corresponding deepfake videos. From the extracted facial frames, we randomly select a total of 80 k face images (40 k real and 40 k forged) for evaluation. The dataset partition follows the same 720:140:140 ratio as FF++. We selected FaceForensics++ (FF++) and Celeb-DFv2 as they are the most widely used and benchmarked datasets in the face forgery detection community. FF++ contains multiple manipulation methods with different compression levels, making it suitable for evaluating intra-domain robustness. Celeb-DFv2, on the other hand, features high-quality and more naturally blended deepfakes, and serves as a challenging test set for cross-domain generalization. These datasets are also used in numerous prior works [4,5], enabling standardized comparisons.

4.1.2. Data Preprocessing

In face forgery images, the crucial manipulated regions are usually concentrated on the face, while the background remains largely unchanged. Because facial regions occupy only a small portion of each frame, and experimental results show that training with face crops significantly outperforms training with entire frames, this approach is widely adopted in video-based forgery detection. In our work, the original videos are converted into consecutive frames emphasizing the face area. We use OpenCV’s VideoCapture interface to read the videos frame by frame and then apply a Haar cascade classifier to detect facial regions. Finally, we obtain single-frame images containing the key facial areas.
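The preprocessing pipeline can be sketched as follows; the output size and frame stride are illustrative parameters rather than the exact values used in our experiments.

```python
import cv2

def extract_face_crops(video_path, out_size=224, frame_stride=10):
    """Read frames with cv2.VideoCapture, detect faces with a Haar cascade,
    and return resized face crops."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            for (x, y, w, h) in faces:
                crops.append(cv2.resize(frame[y:y + h, x:x + w], (out_size, out_size)))
        idx += 1
    cap.release()
    return crops
```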

4.1.3. Training and Evaluation Settings

All experiments were conducted on a workstation running Ubuntu 20.04 with an NVIDIA RTX 4090 GPU (24 GB), using Python 3.8 and PyTorch 2.4.1. To ensure fair and reproducible comparisons, all baseline models were trained and evaluated under consistent settings. For Xception [21] and EfficientNet-B4 [22], we used official or widely adopted PyTorch implementations. ViT [23] was reimplemented following its original design and adapted for binary classification by replacing the classification head with a single sigmoid-activated output node.
Input preprocessing followed the procedure described in Section 4.1.2, including face cropping via Haar cascade detection and resizing to 224 × 224 resolution. The data splits, augmentation strategies (color jitter, mixup, and label smoothing), and optimizer settings were kept identical across all methods. Table 1 summarizes the hyperparameters used.
Each model was trained for 50 epochs with early stopping (patience = 5), using the AdamW optimizer and a cosine annealing learning rate scheduler. Random seeds were fixed to 42 to ensure reproducibility. Evaluation was performed on the best validation checkpoint, and metrics including AUC, ACC, F1-score, and G-Mean were computed using scikit-learn to capture both accuracy and robustness under class imbalance.
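For reference, the following sketch shows how the seeding, optimization, and evaluation settings described above can be set up; the learning rate and weight decay are placeholders (the values actually used are listed in Table 1).

```python
import random
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

def set_seed(seed=42):
    """Fix all random seeds for reproducibility (seed 42, as stated above)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def build_optimizer(model, lr=1e-4, weight_decay=1e-4, epochs=50):
    """AdamW with cosine annealing; lr and weight_decay are placeholders."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute AUC, ACC, F1, and G-Mean (geometric mean of sensitivity and specificity)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    sens = recall_score(y_true, y_pred, pos_label=1)
    spec = recall_score(y_true, y_pred, pos_label=0)
    return {"AUC": roc_auc_score(y_true, y_prob),
            "ACC": accuracy_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred),
            "G-Mean": float(np.sqrt(sens * spec))}
```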

4.2. Comparative Evaluation

4.2.1. Evaluation and Comparison on FF++ and Celeb-DFv2

Comparison across Different Datasets: To conduct a fair and consistent evaluation, several representative and state-of-the-art face forgery detection methods were re-implemented under identical training settings. All models were evaluated on three standard benchmarks: FF++ (C23 and C40) and Celeb-DFv2. The performance results, summarized in Table 2, demonstrate that the proposed method consistently outperforms these baselines across all metrics—accuracy (ACC), area under the curve (AUC), F1-score, and geometric mean (G-Mean). In particular, the proposed model achieves an AUC of 98.96% on FF++ (C23), 84.52% on FF++ (C40), and 99.71% on Celeb-DFv2, outperforming GFFD by margins of +0.7%, +0.36%, and +0.08%, respectively. While GFFD performs strongly on FF++ (C23), it suffers from performance degradation under higher compression settings such as FF++ (C40), indicating sensitivity to compression artifacts. In contrast, our method maintains robustness under varying compression levels, attributable to the combination of spatial–frequency domain modeling and the forgery-aware saliency-guided fusion module. Compared to GocNet, which introduces gradient-based analysis for forgery localization, our approach demonstrates more stable performance on compressed datasets, suggesting improved generalization. Additionally, the NPR framework—designed to enhance generalizability by revisiting upsampling design—shows competitive results on FF++ (C23) but falls behind in heavily compressed and cross-domain scenarios. Similarly, the MLFFE_ViT architecture improves upon ViT through wavelet decomposition, yet still underperforms the proposed method, particularly in terms of robustness and G-Mean. The visual comparisons in Figure 4 further illustrate the superiority of the proposed method across all evaluation metrics and datasets.
Comparison across Different Manipulation Types: To comprehensively assess the adaptability of the proposed method to different types of facial manipulations, a set of representative state-of-the-art face forgery detection models were re-implemented and evaluated under unified training conditions. All models were tested on four representative manipulation types in the FF++ dataset: DeepFake, Face2Face, FaceSwap, and NeuralTexture. The comparison results are presented in Table 3. Overall, the proposed method consistently achieves the best performance across all manipulation types and all evaluation metrics, including accuracy (ACC), area under the curve (AUC), F1-score, and geometric mean (G-Mean), demonstrating strong robustness and generalization. For global manipulation types such as deepfake and FaceSwap, the proposed model achieves an AUC of 99.99%, significantly outperforming lightweight Transformer-based models such as ViT (93.63%, 61.20%) and MLFFE_ViT (97.42%, 91.50%) while also surpassing CNN-based baselines such as GFFD (99.99%, 99.99%) and Xception (99.88%, 99.57%). In the case of Face2Face, which mainly modifies facial expressions through local manipulation, the proposed method still maintains a leading performance, achieving an AUC of 99.81%. This outperforms GFFD (99.58%), GocNet (99.49%), and EfficientNet-B4 (98.02%), indicating the model’s enhanced sensitivity to fine-grained expression variations. For the most challenging manipulation type, NeuralTexture, where tampering is typically confined to small regions around the mouth and global artifacts are minimal, the proposed method achieves an AUC of 97.45%. This result exceeds those of GFFD (97.16%), GocNet (95.64%), and NPR (94.39%), confirming the effectiveness of the proposed model in accurately localizing subtle and highly localized forgeries. The visualizations in Figure 5 further illustrate the model’s accurate focus on manipulated regions across different forgery categories.

4.2.2. Cross-Dataset Evaluation and Comparison

To better simulate real-world conditions—where forged images are often derived from unknown sources and undisclosed synthesis techniques—the generalization capability of deepfake detection models becomes essential. To assess the robustness of our proposed method under such realistic settings, we conducted cross-dataset experiments where models were trained on four representative manipulation types from the FF++ (C23) dataset and tested on the challenging Celeb-DFv2 dataset. Celeb-DFv2 comprises high-quality, naturally appearing forged videos, making it well-suited for evaluating generalization. As shown in Table 4, our method achieved the best AUC score of 79.66% and the highest F1-score of 73.44%, outperforming all compared baselines. Although EfficientNet-B4 obtained the highest accuracy (71.48%) and geometric mean (71.44%), our model demonstrated a superior balance between precision and recall, highlighting its robustness against distribution shifts. Notably, all models experienced performance drops compared to in-dataset evaluations, underscoring the intrinsic difficulty of cross-domain deepfake detection. Nonetheless, our approach consistently performed well, which we attribute to the proposed forgery-aware guided spatial–frequency feature fusion mechanism. This design enables the model to effectively capture both spatial anomalies and frequency-domain artifacts indicative of tampering. Detailed insights into each component’s contribution are provided in the subsequent ablation studies.

4.3. Ablation Study

In this subsection, we conduct ablation experiments to analyze the effectiveness of each component in our proposed framework under an intra-dataset setting using FF++ (C23). The model is trained on the FF++ (C23) training set and evaluated on its corresponding test set. Performance is assessed using four standard metrics: accuracy (ACC), area under the ROC curve (AUC), F1-score, and geometric mean (G-Mean).
To further investigate the impact of each module and its subcomponents on forgery detection, we conducted a detailed ablation study based on the results in Table 5. Within the spatial branch, using only global features (Global Only) or local features (Local Only) yields limited performance, with ACC values of 60.54% and 60.62%, respectively. This suggests that either global or local spatial features alone are insufficient for effective forgery detection. In contrast, the Full Spatial Branch achieves a significantly higher performance of 95.28% ACC and 98.19% AUC, highlighting the critical importance of spatial structural information.
Similarly, in the frequency branch, high-frequency components (High-Frequency Only) outperform low-frequency components (Low-Frequency Only), achieving 58.17% ACC versus 51.55%, which demonstrates that forgeries tend to introduce anomalies in the high-frequency domain. The Full Frequency Branch, which combines multi-scale frequency information, further improves the performance to 95.12% ACC and 98.07% AUC, confirming the complementary role of frequency-based features.
When spatial and frequency features are directly fused (+Space +Frequent), the model achieves 95.66% ACC and 98.43% AUC, indicating the synergy between semantic and spectral information. However, static fusion remains limited in its ability to capture cross-domain dependencies. With the introduction of the forgery-aware module (FAM), which includes a symmetry-aware branch, a semantic consistency branch, and a saliency-guided fusion mechanism, the model performance is further enhanced. Specifically, each component individually contributes to improved accuracy—95.73%, 95.90%, and 96.29% ACC, respectively—while the complete FAM yields the best result of 96.68% ACC and 98.97% AUC, confirming the effectiveness of each submodule and their collective synergy.
These results validate the effectiveness of the saliency map in localizing forgery regions, guided jointly by structural symmetry and semantic consistency. The saliency mechanism significantly enhances regional discrimination by emphasizing manipulated areas with subtle artifacts. The joint modeling of spatial and frequency features forms the foundation of the detection framework, while the forgery-aware guided fusion mechanism acts as a critical enhancement. Together, they enable a cross-modal, high-precision fusion strategy that further strengthens feature consistency across domains and improves the precision of region-level activation.

4.4. Visualization Analysis

To further investigate the role of each component in enhancing forgery-aware perception, we conducted a module-level ablation visualization experiment on four representative forgery types (deepfake, Face2Face, FaceSwap, and NeuralTexture) within the FF++ (C23) dataset. Specifically, based on the Grad-CAM visualization technique, we present activation heatmaps under four network configurations, (a) Full Spatial Branch, (b) Full Frequency Branch, (c) Spatial + Frequency, and (d) Spatial + Frequency + FAM, as illustrated in Figure 6. The results show that using spatial features alone (Full Spatial Branch) leads to generalized attention patterns mainly focused on facial contours, which makes it difficult to accurately localize manipulated regions. In contrast, using only frequency features (Full Frequency Branch) yields scattered responses, which are susceptible to spectral interference from non-forged regions. When spatial and frequency features are combined (Spatial + Frequency), the activation gradually concentrates on core areas such as the mouth and nasal wings, indicating strong complementarity between the two domains in identifying tampered regions. With the addition of the FAM (forgery-aware module), the network’s attention becomes more focused on specific manipulated areas. In global forgery cases like deepfake and FaceSwap, the heatmaps concentrate on edge-blending regions. For local manipulations such as Face2Face and NeuralTexture, the model accurately highlights subtle changes in the mouth region, demonstrating enhanced sensitivity to localized forgeries. These visualization results confirm that the saliency-guided mechanism significantly improves the model’s perception of local forgeries and further strengthens the cooperative representation between spatial and frequency features. This contributes critically to fine-grained forgery localization.

4.5. Symmetry Analysis Under Non-Frontal Poses

To explicitly investigate the effect of head pose on bilateral facial symmetry and the robustness of our structural symmetry modeling, we constructed a yaw-varied dataset by synthetically rotating facial images across a range of horizontal viewing angles. Specifically, we applied a perspective warping process based on 3D yaw transformation, simulating horizontal head rotations from −45° to +45° with controlled interpolation. The warping function employs an approximate camera model with focal length estimation, enabling rotation-based transformations while preserving key facial structures. All datasets used in this work were augmented using this method to ensure consistent exposure to pose variation throughout the learning process. The augmented dataset was divided into four yaw groups: frontal faces (0°), mild poses (−15°, 15°], moderate poses (−30°, −15°] ∪ (15°, 30°], and large poses (−45°, −30°] ∪ (30°, 45°]. For each group, we evaluated both the detection performance (in terms of AUC, ACC, F1-score, and G-Mean) and the structural symmetry loss $\mathcal{L}_{\mathrm{sym}}$, computed as the $L_1$ distance between features of the input image and its horizontally flipped counterpart.
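For illustration, the following sketch shows one way to realize such a yaw-based perspective warp under an approximate pinhole camera whose focal length equals the image width; it conveys the idea of the augmentation rather than the exact warping function used in our experiments.

```python
import cv2
import numpy as np

def yaw_warp(image, yaw_deg):
    """Approximate yaw rotation of a face image via perspective warping,
    assuming a pinhole camera with focal length equal to the image width."""
    h, w = image.shape[:2]
    f = float(w)                                # approximate focal length
    theta = np.deg2rad(yaw_deg)
    # Image-plane corners centered at the origin (depth 0 before translation)
    corners = np.array([[-w / 2, -h / 2, 0], [w / 2, -h / 2, 0],
                        [w / 2, h / 2, 0], [-w / 2, h / 2, 0]], dtype=np.float64)
    R = np.array([[np.cos(theta), 0, np.sin(theta)],
                  [0, 1, 0],
                  [-np.sin(theta), 0, np.cos(theta)]])   # rotation about the vertical axis
    rotated = corners @ R.T
    rotated[:, 2] += f                          # place the plane in front of the camera
    projected = rotated[:, :2] * (f / rotated[:, 2:3]) + np.array([w / 2, h / 2])
    src = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, projected.astype(np.float32))
    return cv2.warpPerspective(image, M, (w, h))
```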
As shown in Table 6, the structural symmetry loss $\mathcal{L}_{\mathrm{sym}}$ for real faces increases gradually with larger yaw angles, rising from 0.03528 in strictly frontal views to 0.03985 in large profile views. This reflects the natural asymmetry caused by geometric distortions under oblique poses. Despite this trend, the proposed model consistently maintains high detection performance, with the AUC remaining at 98.98% for frontal faces, 97.75% for mild profiles, 97.68% for moderate profiles, and 96.40% even under large pose variations. These results demonstrate that the structural alignment mechanism exhibits strong robustness to natural head pose changes and effectively discriminates between forgery-induced structural inconsistency and pose-induced facial asymmetry.

5. Discussion

This study proposes a forgery-aware guided network that fuses spatial and frequency features, achieving high accuracy and strong generalization capability across multiple benchmark datasets. By introducing a forgery-aware module (FAM) and a cross-domain interaction fusion mechanism, the model not only effectively focuses on local forged regions but also enhances the complementarity between spatial and frequency features. In particular, the proposed method demonstrates remarkable robustness and discriminative power against fine-grained and high-concealment forgery types such as NeuralTextures, indicating its strong practical value and potential for broader application.
Despite the promising results, there remains room for further improvement. For instance, when dealing with highly compressed and low-quality images, the model still shows occasional false positives or missed detections in fine-grained regions, suggesting that its adaptability to compression noise and weak texture features can be further enhanced. Future research could explore the following directions:
  • Developing more robust modeling approaches to improve detection accuracy under compression and quality degradation;
  • Incorporating temporal modeling by leveraging dependencies across video frames;
  • Extending the framework to multi-modal forgery detection tasks, such as voice–face matching verification and synthetic speech detection.
  • Conducting comprehensive technical comparisons (e.g., parameter count, model complexity, and inference latency) to further assess the computational efficiency and deployment feasibility of the proposed method in comparison with state-of-the-art baselines.

6. Conclusions

This paper presents a forgery-aware guided spatial–frequency feature fusion network for facial image manipulation detection, designed to address the limitations of existing methods in localizing fine-grained forgery regions and capturing cross-domain feature interactions. The proposed architecture integrates a dual-branch design, combining spatial–frequency fusion with a forgery-aware saliency module that generates pixel-level attention maps guided by structural symmetry and semantic consistency, thereby enabling dynamic and targeted feature refinement. This design enhances the model’s ability to focus on key forged regions and improves its discriminative capability. In the spatial branch, an improved Swin Transformer combined with dynamic sliding windows and a pyramid structure is employed to extract both global semantic and local texture features. The frequency branch leverages multi-level Haar wavelet transforms to capture high-frequency artifacts and compression traces. These complementary features are fused through a saliency-guided cross-domain interaction module, enabling effective collaboration between spatial and frequency domains as well as between local and global representations. This significantly enhances the discriminative power and robustness of the learned features.
Experiments on FaceForensics++ and Celeb-DFv2 demonstrate the effectiveness of the proposed method across diverse forgery scenarios. The model achieves 96.68% accuracy and 98.98% AUC on FF++ (C23), 98.61% accuracy and 99.71% AUC on Celeb-DFv2, and maintains 77.06% accuracy under strong compression (FF++ (C40)), outperforming prior state-of-the-art methods. Further evaluations on different manipulation types confirm the model’s robustness against various forgery patterns with diverse visual characteristics. Ablation studies validate the contribution of each individual component—including the spatial representation branch, frequency-domain modeling, saliency-guided attention, and cross-domain fusion module—with their integration, yielding notable performance gains in accuracy and robustness. The model also maintains robust detection under pose variations, with AUC scores from 98.98% (frontal) to 96.40% (large profile), highlighting the benefit of symmetry-aware design. The framework shows strong potential for real-world applications. Future work will aim at improving inference efficiency and extending the method to multi-modal and cross-frame forgery detection.

Author Contributions

The following statements are based on the CRediT taxonomy. Conceptualization, Z.H.; Supervision, Z.H.; Project Administration, Z.H.; Funding Acquisition, Z.H.; Methodology, Z.L.; Formal Analysis, Z.L.; Investigation, Z.L. and Z.Z.; Writing—Original Draft Preparation, Z.L.; Visualization, Z.L.; Data Curation, Z.Z.; Validation, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Gansu Province Higher Education Institutions Industrial Support Program under Grant 2020C-29 and in part by the National Natural Science Foundation of China under Grant 6156200.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gao, J.; Xia, Z.; Marcialis, G.L.; Dang, C.; Dai, J.; Feng, X. DeepFake detection based on high-frequency enhancement network for highly compressed content. Expert Syst. Appl. 2024, 249, 123732. [Google Scholar] [CrossRef]
  2. Tyagi, S.; Yadav, D. A detailed analysis of image and video forgery detection techniques. Vis. Comput. 2023, 39, 813–833. [Google Scholar] [CrossRef]
  3. Dou, L.; Feng, G.; Qian, Z. Image Inpainting Anti-Forensics Network via Attention-Guided Hierarchical Reconstruction. Symmetry 2023, 15, 393. [Google Scholar] [CrossRef]
  4. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  5. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3207–3216. [Google Scholar]
  6. Tolosana, R.; Vera-Rodriguez, R.; Fierrez, J.; Morales, A.; Ortega-Garcia, J. Deepfakes and beyond: A survey of face manipulation and fake detection. Inf. Fusion 2020, 64, 131–148. [Google Scholar] [CrossRef]
  7. Popescu, A.C.; Farid, H. Exposing digital forgeries by detecting traces of resampling. IEEE Trans. Signal Process. 2005, 53, 758–767. [Google Scholar] [CrossRef]
  8. Zhao, Y.; Jin, X.; Gao, S.; Wu, L.; Yao, S.; Jiang, Q. Tan-gfd: Generalizing face forgery detection based on texture information and adaptive noise mining. Appl. Intell. 2023, 53, 19007–19027. [Google Scholar] [CrossRef]
  9. Lukáš, J.; Fridrich, J.; Goljan, M. Detecting digital image forgeries using sensor pattern noise. In Proceedings of the Security, Steganography, and Watermarking of Multimedia Contents VIII, San Jose, CA, USA, 16–19 January 2006; SPIE: Bellingham, WA, USA, 2006; Volume 6072, pp. 362–372. [Google Scholar]
  10. Luo, W.; Huang, J.; Qiu, G. JPEG error analysis and its applications to digital image forensics. IEEE Trans. Inf. Forensics Secur. 2010, 5, 480–491. [Google Scholar] [CrossRef]
  11. Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Two-stream neural networks for tampered face detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1831–1839. [Google Scholar]
  12. Dang, L.M.; Hassan, S.I.; Im, S.; Moon, H. Face image manipulation detection based on a convolutional neural network. Expert Syst. Appl. 2019, 129, 156–168. [Google Scholar] [CrossRef]
  13. Luo, A.; Cai, R.; Kong, C.; Ju, Y.; Kang, X.; Huang, J.; Kot, A.C. Forgery-Aware Adaptive Learning with Vision Transformer for Generalized Face Forgery Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 4116–4129. [Google Scholar] [CrossRef]
  14. Pawar, D.; Gowda, R.; Chandra, K. Image forgery classification and localization through vision transformers. Int. J. Multimed. Inf. Retr. 2025, 14, 8. [Google Scholar] [CrossRef]
  15. Chang, C.C.; Lu, T.C.; Zhu, Z.H.; Tian, H. An effective authentication scheme using DCT for mobile devices. Symmetry 2018, 10, 13. [Google Scholar] [CrossRef]
  16. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–103. [Google Scholar]
  17. Li, J.; Xie, H.; Yu, L.; Gao, X.; Zhang, Y. Discriminative feature mining based on frequency information and metric learning for face forgery detection. IEEE Trans. Knowl. Data Eng. 2021, 35, 12167–12180. [Google Scholar] [CrossRef]
  18. Chen, S.; Yao, T.; Chen, Y.; Ding, S.; Li, J.; Ji, R. Local relation learning for face forgery detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1081–1088. [Google Scholar]
  19. Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Zhang, W.; Yu, N. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 772–781. [Google Scholar]
  20. Miao, C.; Tan, Z.; Chu, Q.; Liu, H.; Hu, H.; Yu, N. F2Trans: High-frequency fine-grained transformer for face forgery detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1039–1051. [Google Scholar] [CrossRef]
  21. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  22. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 6105–6114. [Google Scholar]
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  25. Ganguly, S.; Ganguly, A.; Mohiuddin, S.; Malakar, S.; Sarkar, R. ViXNet: Vision Transformer with Xception Network for deepfakes based video and image forgery detection. Expert Syst. Appl. 2022, 210, 118423. [Google Scholar] [CrossRef]
  26. Wu, J.; Zhang, B.; Li, Z.; Pang, G.; Teng, Z.; Fan, J. Interactive two-stream network across modalities for deepfake detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6418–6430. [Google Scholar] [CrossRef]
  27. Miao, C.; Tan, Z.; Chu, Q.; Yu, N.; Guo, G. Hierarchical frequency-assisted interactive networks for face manipulation detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 3008–3021. [Google Scholar] [CrossRef]
  28. Sun, B.; Liu, G.; Yuan, Y. F3-Net: Multiview scene matching for drone-based geo-localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610611. [Google Scholar] [CrossRef]
  29. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5052–5060. [Google Scholar]
  30. Uddin, M.; Fu, Z.; Zhang, X. Deepfake face detection via multi-level discrete wavelet transform and vision transformer. Vis. Comput. 2025, 41, 7049–7061. [Google Scholar] [CrossRef]
  31. Song, L.; Fang, Z.; Li, X.; Dong, X.; Jin, Z.; Chen, Y.; Lyu, S. Adaptive face forgery detection in cross domain. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 467–484. [Google Scholar]
  32. Wang, Y.; Peng, C.; Liu, D.; Wang, N.; Gao, X. Spatial-temporal frequency forgery clue for video forgery detection in VIS and NIR scenario. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7943–7956. [Google Scholar] [CrossRef]
  33. Wang, F.; Chen, Q.; Jing, B.; Tang, Y.; Song, Z.; Wang, B. Deepfake Detection Based on the Adaptive Fusion of Spatial-Frequency Features. Int. J. Intell. Syst. 2024, 2024, 7578036. [Google Scholar] [CrossRef]
34. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Rethinking the up-sampling operations in CNN-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28130–28139. [Google Scholar]
35. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  37. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  38. Guo, Z.; Yang, G.; Zhang, D.; Xia, M. Rethinking gradient operator for exposing AI-enabled face forgeries. Expert Syst. Appl. 2023, 215, 119361. [Google Scholar] [CrossRef]
  39. Luo, Y.; Zhang, Y.; Yan, J.; Liu, W. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16317–16326. [Google Scholar]
Figure 1. (a) Examples of facial forgeries from the FaceForensics++ dataset, demonstrating four typical manipulation methods. (b) Representative forged samples from the Celeb-DFv2 dataset.
Figure 2. Overview of the proposed forgery-aware guided spatial–frequency feature fusion network. The architecture comprises four main modules: (1) a forgery-aware module that uses a lightweight U-Net to generate pixel-level saliency maps from backbone features, guided by facial symmetry and semantic consistency; (2) a spatial feature branch that extracts global and local features with an improved Swin Transformer and spatial pyramid pooling; (3) a frequency feature branch that applies two-stage Haar wavelet transforms to capture high-frequency forgery cues; and (4) a fusion module that integrates spatial and frequency features through cross-domain attention, followed by channel and spatial recalibration. Preprocessing extracts face regions from video frames with a Haar cascade classifier.
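To make the fusion stage concrete, the sketch below shows one way to realize saliency-guided spatial–frequency fusion in PyTorch: cross-domain attention between the two feature maps, SE-style channel recalibration, and a spatial gate driven by the saliency map. The module name SaliencyGuidedFusion, the 256-channel feature size, the four attention heads, and the exact gating form are illustrative assumptions; this is a minimal sketch, not the authors' implementation.

```python
# Minimal, illustrative sketch of saliency-guided spatial-frequency fusion.
# Shapes, module names, and the exact attention/gating forms are assumptions.
import torch
import torch.nn as nn

class SaliencyGuidedFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Cross-domain attention: frequency features attend to spatial features.
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4, batch_first=True)
        # SE-style channel recalibration.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # Spatial gate driven by the forgery-aware saliency map.
        self.gate = nn.Sequential(nn.Conv2d(1, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, f_spa, f_freq, saliency):
        # f_spa, f_freq: (B, C, H, W); saliency: (B, 1, H, W) in [0, 1]
        b, c, h, w = f_spa.shape
        q = f_freq.flatten(2).transpose(1, 2)            # (B, HW, C) queries from frequency domain
        kv = f_spa.flatten(2).transpose(1, 2)            # (B, HW, C) keys/values from spatial domain
        cross, _ = self.attn(q, kv, kv)                  # cross-domain attention
        fused = cross.transpose(1, 2).reshape(b, c, h, w) + f_spa
        fused = fused * self.se(fused).view(b, c, 1, 1)  # channel recalibration
        fused = fused * self.gate(saliency)              # saliency-driven spatial gating
        return fused

# Usage example with random tensors.
fusion = SaliencyGuidedFusion(channels=256)
f_spa, f_freq = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
saliency = torch.rand(2, 1, 14, 14)
print(fusion(f_spa, f_freq, saliency).shape)  # torch.Size([2, 256, 14, 14])
```

Letting the frequency features act as queries over the spatial features is only one plausible direction for the cross-domain attention; a reversed or bidirectional scheme would fit the same interface.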
Figure 3. Architecture of core modules in the forgery-aware dual-domain fusion network. (a) Forgery-Aware Module (FAM): Generates a pixel-level saliency map from multi-scale ResNet-50 features using a lightweight U-Net, guided by two auxiliary branches for symmetry-aware modeling and semantic consistency analysis. (b) Global Feature Extraction Module: Extracts global semantic features F_Global using a Swin Transformer. (c) Local Feature Extraction Module: Extracts local features F_Local through spatial pyramid pooling and convolution. (d) Frequency-Domain Feature Extraction Module: Extracts frequency-aware features through Haar wavelet transformation.
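As a concrete reference for the frequency branch in panel (d), the following is a minimal two-stage 2-D Haar decomposition written with strided slicing in PyTorch. The sub-band naming and the choice to keep the first-stage detail bands for fusion are illustrative assumptions rather than the paper's exact channel layout.

```python
# Minimal sketch of a two-stage Haar wavelet decomposition for the frequency branch.
import torch

def haar_dwt2(x: torch.Tensor):
    """One level of the 2-D Haar transform. x: (B, C, H, W) with even H, W."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a + b - c - d) / 2  # detail across rows (horizontal edges)
    hl = (a - b + c - d) / 2  # detail across columns (vertical edges)
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

x = torch.randn(2, 3, 224, 224)
ll1, lh1, hl1, hh1 = haar_dwt2(x)    # first stage: 112x112 sub-bands
ll2, lh2, hl2, hh2 = haar_dwt2(ll1)  # second stage on the approximation: 56x56
high_freq = torch.cat([lh1, hl1, hh1], dim=1)  # high-frequency cues kept for fusion
print(ll2.shape, high_freq.shape)
```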
Figure 4. Comparison results of different methods on four evaluation metrics (ACC, AUC, F1, and G-Mean) across three datasets: (a) FF++ (C23), (b) FF++ (C40), and (c) Celeb-DFv2. The methods from a to h represent ViT, MLFFE_ViT, NPR, EfficientNet-B4, Xception, GocNet, GFFD, and the proposed method, respectively. The proposed method shows consistently superior performance under all settings.
Figure 5. Comparison results of different methods on four evaluation metrics (ACC, AUC, F1, and G-Mean) across four manipulation types: (a) deepfake, (b) Face2Face, (c) FaceSwap, and (d) NeuralTexture. The methods from a to h represent ViT, MLFFE_ViT, NPR, EfficientNet-B4, Xception, GocNet, GFFD, and the proposed method, respectively. The proposed method exhibits consistently superior performance under all manipulation scenarios and evaluation metrics.
Figure 6. Grad-CAM visualizations for different module configurations across five manipulation types: deepfake, Face2Face, FaceSwap, NeuralTexture, and real faces. From top to bottom: input face images; (a) Full Spatial Branch; (b) Full Frequency Branch; (c) Spatial + Frequency; and (d) Spatial + Frequency + FAM. The results show that the combination of spatial and frequency features enhances attention to forgery regions, while the inclusion of the FAM further improves localization precision, especially for subtle manipulations such as NeuralTexture.
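For readers who wish to reproduce visualizations of this kind, the snippet below is a generic Grad-CAM sketch [37] built on a torchvision ResNet-50 with forward and backward hooks; the backbone, target layer, and min–max normalization are assumptions and do not reproduce the paper's exact visualization pipeline.

```python
# Generic Grad-CAM sketch (not the paper's backbone or visualization code).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
feats, grads = {}, {}
layer = model.layer4  # last convolutional stage

layer.register_forward_hook(lambda m, i, o: feats.update(a=o))          # store activations
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))  # store gradients

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()   # score of the predicted class
score.backward()

w = grads["g"].mean(dim=(2, 3), keepdim=True)            # channel weights (GAP of gradients)
cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))   # weighted sum of activations + ReLU
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1] for overlay
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```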
Table 1. Hyperparameter configuration of the forgery-aware guided spatial–frequency feature fusion network.
Parameters                Value
Batch Size                32
Input Size                224
Dropout                   0.1
Optimizer                 AdamW
Optimizer Epsilon         1 × 10⁻⁸
Gradient Clipping         0.05
Weight Decay              0.03
Learning Rate Scheduler   Cosine
Learning Rate             1 × 10⁻³
Warmup Learning Rate      1 × 10⁻⁵
Color Jitter              0.4
Smoothing                 0.1
Mixup                     0.8
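The sketch below wires the values from Table 1 into a PyTorch training loop (AdamW, cosine learning-rate schedule, gradient clipping, label smoothing). The placeholder model, random data, and three-epoch run are simplifications for illustration; Table 1 does not specify the epoch count, warmup and mixup are omitted for brevity, and this is not the authors' training script.

```python
# Sketch of a training setup using the values in Table 1.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))  # placeholder classifier
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)              # Smoothing = 0.1
optimizer = optim.AdamW(model.parameters(), lr=1e-3, eps=1e-8, weight_decay=0.03)

epochs = 3  # assumed; Table 1 does not report the epoch count
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-5)

# Dummy 224x224 inputs with binary labels, loaded with batch size 32.
loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 2, (64,))),
    batch_size=32, shuffle=True)

for epoch in range(epochs):
    for images, labels in loader:
        loss = criterion(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)  # Gradient Clipping = 0.05
        optimizer.step()
    scheduler.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```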
Table 2. Quantitative results on FF++ (C23 and C40) and Celeb-DFv2 datasets across ACC, AUC, F1-score, and G-Mean (all in %). Note that “Gm” in the table denotes the geometric mean (G-Mean).
Methods                FF++ (C23)                    FF++ (C40)                    Celeb-DFv2
                       ACC    AUC    F1     Gm       ACC    AUC    F1     Gm       ACC    AUC    F1     Gm
ViT [23]               59.57  63.50  59.50  59.43    56.40  58.87  56.39  56.39    86.09  92.30  86.09  86.08
MLFFE_ViT [30]         71.43  79.25  73.40  71.35    66.47  72.71  66.28  66.04    77.29  85.49  77.12  76.83
NPR [34]               90.74  96.42  90.73  90.69    73.08  80.40  73.06  73.03    95.33  99.24  95.33  95.33
EfficientNet-B4 [22]   92.77  97.71  92.77  92.77    75.16  84.04  75.16  75.16    97.20  99.52  97.20  97.19
Xception [21]          94.04  97.62  94.04  94.00    75.78  83.89  75.75  75.71    97.40  99.65  97.40  97.40
GocNet [38]            94.79  98.46  94.79  94.79    73.51  81.55  73.49  73.43    97.05  99.36  97.05  97.05
GFFD [39]              96.59  98.26  96.62  96.61    75.39  84.16  75.39  75.39    97.77  99.63  97.77  97.77
Ours                   96.68  98.98  96.68  96.67    77.06  84.52  77.06  77.05    98.61  99.71  98.61  98.60
The bold numbers represent the best performance.
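For reference, the four metrics reported in Tables 2–6 can be computed from binary predictions as in the scikit-learn sketch below, with G-Mean taken as the square root of sensitivity times specificity; the toy labels and scores are illustrative only and are not taken from the experiments.

```python
# Computing ACC, AUC, F1, and G-Mean for a binary real/fake classifier.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # 0 = real, 1 = fake (toy labels)
y_prob = np.array([0.1, 0.6, 0.8, 0.7, 0.9, 0.3, 0.4, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
f1 = f1_score(y_true, y_pred)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
gmean = np.sqrt(sensitivity * specificity)  # geometric mean of class-wise recalls
print(f"ACC={acc:.4f} AUC={auc:.4f} F1={f1:.4f} G-Mean={gmean:.4f}")
```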
Table 3. Quantitative results by manipulation type (deepfake, Face2Face, FaceSwap, and NeuralTexture) on FF++ dataset across ACC, AUC, F1-score, and G-Mean (all in %). Note that “Gm” in the table denotes the geometric mean (G-Mean).
Methods                Deepfake                      Face2Face                     FaceSwap                      NeuralTexture
                       ACC    AUC    F1     Gm       ACC    AUC    F1     Gm       ACC    AUC    F1     Gm       ACC    AUC    F1     Gm
ViT [23]               86.71  93.63  86.71  86.69    61.82  66.37  61.81  61.80    57.71  61.20  57.21  56.68    56.89  58.66  56.63  56.36
MLFFE_ViT [30]         90.61  97.42  90.59  90.52    76.82  86.49  76.82  76.81    83.07  91.50  83.07  83.07    71.92  78.80  71.92  71.91
NPR [34]               97.50  99.69  97.50  97.50    97.18  99.38  97.18  97.16    96.64  99.49  96.64  96.64    88.29  94.39  88.25  88.13
EfficientNet-B4 [22]   98.00  99.78  98.00  98.00    93.64  98.02  93.64  93.64    92.50  97.58  92.50  92.50    86.18  92.51  86.16  86.08
Xception [21]          98.86  99.88  98.86  98.85    97.86  99.26  97.86  97.86    97.96  99.57  97.96  97.96    89.75  95.80  89.73  89.66
GocNet [38]            98.61  99.87  98.61  98.60    98.00  99.49  98.00  98.00    97.14  99.59  97.14  97.14    90.54  95.64  90.52  90.45
GFFD [39]              99.54  99.99  99.54  99.54    99.29  99.58  99.29  99.29    99.61  99.99  99.61  99.61    92.71  97.16  92.74  92.64
Ours                   99.82  99.99  99.82  99.82    99.43  99.81  99.43  99.43    99.75  99.99  99.75  99.75    94.39  97.45  94.39  94.33
The bold numbers represent the best performance.
Table 4. Quantitative cross-dataset results on Celeb-DFv2 after training on FF++ (C23), evaluated across ACC, AUC, F1-score, and G-Mean (all in %).
Methods                Training Set   Testing Set (Celeb-DFv2)
                                      ACC    AUC    F1     G-Mean
ViT [23]               FF++ (C23)     49.92  49.77  62.12  38.14
MLFFE_ViT [30]         FF++ (C23)     61.16  68.28  66.73  58.82
NPR [34]               FF++ (C23)     63.11  68.15  54.28  60.08
EfficientNet-B4 [22]   FF++ (C23)     71.48  79.61  72.18  71.44
Xception [21]          FF++ (C23)     70.86  77.94  70.25  70.83
GocNet [38]            FF++ (C23)     68.84  76.10  70.47  68.62
GFFD [39]              FF++ (C23)     67.43  74.84  71.93  65.50
Ours                   FF++ (C23)     69.63  79.66  73.44  68.14
The bold numbers represent the best performance.
Table 5. Quantitative ablation results of modules on FF++ (C23), evaluated across accuracy (ACC), area under the curve (AUC), F1-score, and geometric mean (G-Mean) (all in %).
Component               Configuration                 ACC    AUC    F1     G-Mean
Spatial                 Global Only                   60.54  64.64  60.15  59.72
Spatial                 Local Only                    60.62  64.60  59.84  59.00
Spatial                 Full Spatial Branch           95.28  98.19  95.28  95.17
Frequency               First-Stage Haar Only         57.71  60.17  57.53  57.35
Frequency               High-Frequency Only           58.17  60.57  58.00  57.85
Frequency               Low-Frequency Only            51.55  51.40  51.19  50.82
Frequency               Full Frequency Branch         95.12  98.07  95.10  95.09
Spatial + Frequency                                   95.66  98.43  95.66  95.65
Spatial + Frequency +   Symmetry-Aware Module         95.73  98.30  95.73  95.72
Spatial + Frequency +   Semantic Consistency Module   95.90  98.46  95.91  95.88
Spatial + Frequency +   Saliency-Guided Fusion Only   96.29  98.71  96.29  96.28
Spatial + Frequency +   Full FAM (with saliency)      96.68  98.97  96.68  96.67
The bold numbers represent the best performances.
Table 6. Quantitative evaluation of the symmetry loss L_sym (measured on fake samples) and detection performance across yaw-angle groups, reported in terms of accuracy (ACC), area under the curve (AUC), F1-score, and geometric mean (G-Mean); ACC, AUC, F1, and G-Mean are given in %, and L_sym is a normalized value.
Group               Yaw Range                    L_sym     ACC    AUC    F1     G-Mean
Strict Frontal      0°                           0.03528   96.68  98.98  96.68  96.67
Mild Profile        (−15°, +15°]                 0.03603   94.81  97.75  94.81  94.81
Moderate Profile    (−30°, −15°] ∪ (15°, 30°]    0.03666   93.03  97.68  93.03  93.03
Large Profile       (−45°, −30°] ∪ (30°, 45°]    0.03985   91.80  96.40  91.80  91.78
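The precise definition of L_sym is given in the method section of the paper; as a rough, assumed illustration of how a saliency-map symmetry measure of this general kind can be computed, the snippet below compares a map with its horizontal mirror using a mean absolute difference.

```python
# Assumed, illustrative symmetry measure on a saliency map; NOT the paper's
# exact L_sym formulation.
import torch

def symmetry_loss(saliency: torch.Tensor) -> torch.Tensor:
    """saliency: (B, 1, H, W); compares the map with its left-right flip."""
    mirrored = torch.flip(saliency, dims=[-1])
    return (saliency - mirrored).abs().mean()

saliency = torch.rand(4, 1, 56, 56)
print(symmetry_loss(saliency).item())
```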