Frequency-Domain Masking and Spatial Interaction for Generalizable Deepfake Detection

Luo, Xinyu; Wang, Yu

doi:10.3390/electronics14071302

by Xinyu Luo ¹ and Yu Wang ¹,²,*

¹ College of Computer Science and Technology, Jilin University (JLU), Qianjin Street 2699, Changchun 130012, China

² Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Qianjin Street 2699, Changchun 130012, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1302; https://doi.org/10.3390/electronics14071302

Submission received: 19 February 2025 / Revised: 20 March 2025 / Accepted: 24 March 2025 / Published: 26 March 2025

(This article belongs to the Special Issue Advances in Machine Learning for Image Classification)

Abstract

Over the past few years, the rapid development of deepfake technology based on generative models has posed a significant threat to information security. Despite notable progress in spatial-domain deepfake-detection methods, detection capability drops sharply on low-quality images. Moreover, detection effectiveness depends on the realism of the forged images and on traces specific to particular forgery techniques, which often weakens the models’ generalization ability. To address these issues, we propose the Frequency-Domain Masking and Spatial Interaction (FMSI) model. FMSI introduces masked image modeling into frequency-domain processing, which prevents the model from focusing too much on specific frequency-domain features and enhances its generalization ability. We design a high-frequency information convolution module operating over the spatial and channel dimensions to help the model capture subtle forgery traces more effectively. We also design a dual-stream architecture in which frequency-domain and spatial-domain information interact, overcoming the limitations of single-domain detection. The model is tested on three public benchmark datasets (FaceForensics++, Celeb-DF, and WildDeepfake) through intra-domain and cross-domain experiments, and its detection and generalization capabilities are evaluated using the AUC and EER metrics. The experimental results demonstrate that our model not only possesses high detection capability but also exhibits excellent generalization ability.

Keywords: deepfake detection; masked image modeling; Fourier transform; vision transformer

1. Introduction

Deepfakes refer to the use of deep learning to create multimedia content, particularly forged faces and voices. With the third wave of development in artificial intelligence, generative models in deep learning have advanced rapidly [1,2,3,4,5]. Generative models have creative applications in domains such as film post-production, dubbing, educational productivity, and identity anonymization. However, they can also be used maliciously to steal others’ identities and to create large volumes of fake news [6,7], influencing the direction of public opinion. For example, in the South Korean “Room N 2.0” incident, AI-swapped faces of ordinary women were used to create pornographic content and leak personal details, triggering social unrest. In 2024, the British company ARUP fell victim to a deepfake video conference scam, losing USD 25.6 million. During the Russia-Ukraine conflict, a deepfake video of Ukrainian President Volodymyr Zelenskyy “calling on soldiers to lay down their arms and surrender” spread on social media. This incident affected public opinion in both Russia and Ukraine and undermined the international trust system. As these examples show, the use of this technology has deepened public concerns over information authenticity and privacy protection and has also created significant security vulnerabilities for nations [8,9]. Consequently, developing robust methods for detecting face forgery is imperative.

Deepfake detection leverages computer vision and pattern recognition technologies to verify the authenticity of facial content, with the core objective of identifying forgery artifacts left during deep learning-based synthesis or manipulation. Although generative models can produce high-fidelity facial content, multidimensional forgery traces are inevitably introduced through processes such as image synthesis, video compression, and cross-device recapturing. For instance, these artifacts may manifest as deficiencies in high-frequency components within the frequency domain, abnormal patterns in biological signals, and violations of physical lighting reflection principles. While existing models demonstrate strong performance on the datasets they were trained on, their detection capability deteriorates significantly when handling unseen data. Thus, the critical challenge in current research lies in developing generalizable methods and models aimed at uncovering universal forgery artifacts shared across both known and unknown deepfake generation techniques.

To date, researchers have explored numerous forgery detection methods and models from multiple perspectives to identify forgery traces and enhance detection capabilities [10,11]. To address the limited sensitivity of existing detection methods to cross-domain forgery traces, Shiohara and Yamasaki proposed a self-blended image generation framework (SBI). This approach incorporates forgery traces of different modalities to construct a data augmentation strategy with controllable forged features, breaking through the constraints of traditional geometric- and illumination-based data augmentation methods [12]. Furthermore, with the advancement of deepfake generation techniques and the proliferation of open-source tools such as FaceSwap, FakeApp, and DeepFaceLab [13], datasets of forged facial images and videos have been established, including FaceForensics++ [14], Celeb-DF [15], and WildDeepfake [16].

To tackle the insufficient forgery trace extraction in existing detection models, Chen et al. designed a dual-granularity forgery trace detection model. It employs a multiscale feature distillation network and a local-global attention module to capture forgery traces across hierarchical granularities [17]. Ankit et al. proposed an adaptively weighted multiscale attention module that selects optimal multiscale features and highlights critical spatial-channel regions in both local and global contexts, enhancing the model’s ability to extract forgery features [18].

Some studies focus on leveraging intrinsic image properties and discrepancies between forged and authentic content. Their key advantage lies in bypassing reliance on specific forgery traces, instead exploiting intrinsic consistency features in digital media for authenticity verification, thereby improving generalization. For instance, Dong et al. developed an identity-agnostic model that eliminates facial identity information to perform classification solely based on artifact cues. In the work of Gao Yu et al., forgery detection is achieved by analyzing inconsistencies between internal and external facial identity features [19]. Liu Yu et al. exploited temporal coherence and facial identity information in videos, identifying deepfakes by detecting inconsistencies across frames of the same identity [20].

In summary, most existing deepfake-detection models are built upon image spatial information [21], as spatial data contains the richest feature representations, such as pixel values, colors, and textures. By leveraging these features, detection models can more accurately identify image details and structures, thereby effectively capturing forgery traces. However, most models focus too heavily on these specific forgery characteristics, leading to widespread overfitting. This typically manifests in models that excel at detecting data generated by specific forgery methods but suffer degraded performance on data produced by unseen techniques, resulting in poor generalization. From a data augmentation perspective, given the rapid emergence of new forgery technologies, it is impractical to collect and train on datasets that cover all possible forgery methods, no matter how datasets are combined or expanded. Additionally, spatial forgery traces and synthetic artifacts often degrade during compression, making the detection of compressed data even more challenging.

To address these limitations, some methods leverage frequency-domain information [22,23,24,25,26]. Techniques such as the Fast Fourier Transform (FFT) [27], Discrete Cosine Transform (DCT) [28,29,30], and Discrete Wavelet Transform (DWT) [31] have been adopted. However, these approaches predominantly rely on single-domain frequency transformations. While such studies partially mitigate the issues of spatial forgery traces, the global frequency clues extracted by FFT or manually partitioned multi-frequency DCT information only provide limited frequency-domain insights, failing to significantly enhance the overall generalization performance of detection models.

In our work, we propose the Frequency-Domain Masking and Spatial Interaction (FMSI) model. We introduce masked image modeling in the frequency domain and leverage the Vision Transformer’s [32] advantage in capturing global image features to extract both frequency and spatial features. This approach compels the model to focus on invariant features in the frequency domain, thereby enhancing the generalization of the forgery detection model. Additionally, we employ a high-frequency information convolution module to prioritize high-frequency information processing. Finally, we use an attention mechanism to integrate the spatial features extracted by the Vision Transformer with high-frequency features. This combined feature set is then input into the classifier for authenticity assessment.

The contributions of this paper are as follows:

(1): Introducing masked image modeling into the frequency domain and integrating it with Vision Transformers.
(2): Proposing a high-frequency convolutional module to capture critical frequency components across spatial and channel dimensions.
(3): Designing a spatial-frequency attention module to holistically interpret image details by fusing spatial and frequency-domain information.
(4): Demonstrating experimentally the superior detection accuracy and robust generalization capability of the proposed model.

The remainder of this paper is organized as follows: Section 2 provides a detailed introduction to the related techniques and theoretical foundations employed in this study. Section 3 elaborates on the structure of the FMSI model presented in this paper, as well as its detailed components. Section 4 offers an in-depth introduction to the three datasets, explains the evaluation metrics and experimental implementation details, and presents the experimental data and visualized results. Section 5 concludes our research.

2. Related Work

2.1. Deepfake Methods

Deepfake methods use deep learning techniques to generate or modify digital content, with key technologies including Generative Adversarial Networks (GANs) and autoencoders. Li et al. [15] have broadly divided the evolution of facial deepfake methods into two time periods. Among the earlier methods are Face2Face [33], FaceSwap [34], NeuralTextures [35], and DeepFakes. The Face2Face method synthesizes the target face by altering the source face. FaceSwap uses corresponding rendering models and color calibration to swap facial regions. NeuralTextures trains a model to learn the textural characteristics of the target video face, modifying only the facial expressions corresponding to the mouth area while keeping the eye area unchanged. DeepFakes, on the other hand, generates fake videos based on two autoencoders with a shared encoder. However, these methods have their respective drawbacks, such as color mismatches and flickering in videos, leading to datasets that are visually ordinary with relatively obvious traces of facial synthesis. Subsequent deepfake methods have improved the quality of forgery. The representative forgery dataset, Celeb-DF, although focusing solely on deepfake forgery techniques, has made significant visual enhancements. It performs color transformations between forged and real faces, better integrates the forged facial contours, reduces jitter during video synthesis, and makes the videos more continuous along the temporal axis.

2.2. Frequency-Domain Deepfake Detection

Deepfake-detection methods based on the frequency domain work by analyzing abnormal features in images’ frequency representations to capture the artifacts left by generative models. Frequency-domain analysis has been widely used in various applications, such as image classification [36,37,38], steganography analysis [39,40], texture classification [41], and super-resolution [42,43]. Many studies have considered using frequency-domain features for deepfake detection. Wang et al. demonstrated in [44] that high-frequency components play a significant role in the generalization ability of convolutional neural networks, and ref. [45] showed that learning in the frequency domain can preserve most of the information in high-resolution images. It was pointed out in [27] that using the Discrete Fourier Transform (DFT) and averaging the amplitude of each frequency band can reveal discriminative spectral irregularities in fake faces. F³-Net [30] employed two frequency-based processes: one discovers significant frequencies from the Discrete Cosine Transform (DCT), and the other extracts local frequency-domain statistical information. Ref. [46] showed that GAN-generated content exhibits severe spectral artifacts, which originate from the upsampling steps of generative models and their variants. Ref. [47] indicated that CNN-generated content has traceable artifacts common to all forgery methods, and by performing appropriate data augmentation, the model can generalize well to unseen tampering techniques and datasets. In addition, filtering, as a classic image signal processing method, is used in forgery detection to refine and mine potential subtle information. Some studies use high-pass filters [47], Gabor filters [39], etc., to extract features of interest based on high-frequency components. However, the filters used in these studies are usually fixed and manually crafted, and thus unable to adaptively capture forgery patterns, and relying solely on one type of feature information for detection is not always reliable.

2.3. Masked Image Modeling

Masked Image Modeling (MIM) [48], a self-supervised learning-based approach for visual representation, trains models to predict randomly masked image regions, capturing intrinsic structural and semantic patterns. Drawing inspiration from NLP’s masked language modeling, it leverages advancements in Vision Transformers [49,50]. This reconstruction-based approach is widely used in image-processing tasks across fields and has delivered solid experimental results for both 2D and 3D images [51]. Despite their superior performance across various downstream tasks, these models incur substantial computational costs and learn slowly [52]. In practice, such models are expected to be pre-trained on a vast amount of unlabeled data, such as freely available Internet data, to achieve more generalized representations. However, the efficiency of pre-training is an inescapable bottleneck that limits the practical application of MIM in industrial settings. Consequently, accelerating representation learning within MIM is a significant area of research interest. The emergence of the Masked Autoencoder (MAE) [53] has greatly advanced the development of related work. In MAE, a large portion of the input image patches is randomly masked, and the missing pixels are then reconstructed. A series of improved works has been proposed since then: VideoMAE [54] introduces masked modeling to video and significantly enhances the model’s understanding of fine-grained temporal information; ConvMAE [55] incorporates general convolutional structures into MAE, designing a pyramid-style local inductive bias model that improves performance in various downstream tasks; and GAN-MAE [56] combines Generative Adversarial Networks with MAE, making adversarial training more efficient.

While frequency-domain-based deepfake-detection methods effectively capture forgery artifacts, and MIM demonstrates potential in self-supervised representation learning, existing techniques still face multifaceted challenges. Frequency-domain detection relies on fixed handcrafted filters and struggles to adaptively capture dynamic forgery patterns. Meanwhile, MIM suffers from computational inefficiency, imbalanced semantic modeling, and limited cross-domain adaptability. Research has shown that using multiple information sources together gives better classification results than using just one [57]. To address these limitations, this study proposes the FMSI model. We introduce a frequency-aware masking strategy to enhance model generalization and design a high-frequency feature convolution module that prioritizes critical high-frequency information processing. Leveraging attention mechanisms, the model integrates spatial global features extracted by Vision Transformers with high-frequency representations, bridging the gap between spatial and frequency domains. This dual-domain synergistic framework overcomes the limitations of single-domain detection paradigms, significantly improving generalization and robustness in detecting diverse forgery traces.

3. Method

The general process of the FMSI is as follows: First, the image is transformed into the frequency-domain space via the FFT to extract its frequency-domain features, which facilitates the effective analysis of the forged regions in the image. For the frequency-domain features, a frequency-domain mask perception mechanism is designed to mask out irrelevant information, such as possible high-frequency noise, while retaining the abnormal frequency-domain features that the forged regions may possess. After the frequency-domain feature processing, the model uses a convolutional neural network to learn the frequency-domain information and deeply extract the frequency-domain features. This process involves a combination of multiple convolutional layers, pooling layers, and fully connected layers to capture the potential differences in forged images. To enhance the model’s detection capability, a fusion strategy of spatial and frequency-domain features is incorporated, where the spatial features of the image are combined with the frequency-domain features and fed into the subsequent fully connected layers for feature fusion. Finally, the classifier determines whether the frame is a deepfake image. The key to the entire process is to utilize the abnormal features in the frequency domain to assist in the detection of deepfakes, combining the advantages of frequency domain and spatial-domain features to improve the model’s detection accuracy and generalization ability. The FMSI is shown in Figure 1.

3.1. Frequency Masking

The core purpose of the frequency-domain masking module is to focus on the frequency-domain feature anomalies produced during the deepfake process while suppressing irrelevant information in the frequency domain of the image. Since forgery operations typically leave specific traces in the frequency domain (such as high-frequency noise or low-frequency inconsistencies), the module uses masking operations to make the feature extraction process more focused on the frequency-domain features of the forged regions, as shown in Figure 1.

The frequency-domain masking first uses the FFT to obtain the frequency information of the image. Given an image of size $H \times W$, where $H$ and $W$ are the height and width of the image, respectively, we first use the FFT to compute its frequency representation $F(u, v)$:

$$F(u, v) = \mathcal{F}(I(x, y)) \tag{1}$$

Here, $u$ and $v$ correspond to the frequencies along the width and height of the image, respectively, $\mathcal{F}$ denotes the FFT operation, and $I(x, y)$ represents the original image in spatial coordinates. The parameter required for the masking operation is the masking ratio $r$, which indicates the fraction of frequency-domain components to be masked.

The frequency representation $F(u, v)$ contains different frequency bands, each capturing distinct types of image information. Typically, the obtained frequency spectrum is divided into four regions according to the image size, namely the low-frequency band, mid-frequency band, high-frequency band, and the full frequency band. The low-frequency band captures the global features of the image, where changes are relatively slow and lack distinct edges and details, reflecting the overall brightness variations of the image. The mid-frequency band contains texture information, such as the patterns of fabrics and the texture of skin. While it includes some detailed information, it is not as pronounced as the detail in the high-frequency band. Finally, the high-frequency band mainly contains edge information and fine details of the image, such as the contours of objects, lines, and hairs. These details are not dominant but can be crucial for tasks like deepfake detection.

The masking ratio r is crucial in the frequency-domain masking module. A proper r can prevent the model from overfitting the training data, thereby significantly improving detection efficiency. Inspired by MIM and MAE, which split images into blocks and mask a certain proportion, we select spectral regions and calculate masked areas based on the masking ratio. Since our goal is to use the backbone network to extract global information and the full frequency band contains all frequency components, we apply the mask over the full frequency band. Thus, the parameters u and v for determining the mask are within the range u ∈ [0,H] and v ∈ [0,W]. After determining the parameter range, the number of frequency components to mask, N, can be calculated as follows:

$$N = \lceil r \times u \times v \rceil \tag{2}$$

By setting the frequencies within this area to zero, we obtain the masked frequency representation $M(u, v)$ after the masking operation:

$$M(u, v) = \begin{cases} 0, & (u, v) \in N \\ F(u, v), & \text{otherwise} \end{cases} \tag{3}$$

Finally, the inverse Fourier transform is applied to the obtained $M(u, v)$ to convert the masked frequency-domain information back to the spatial domain, yielding $I_m(x, y)$.
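As an illustration of Equations (1)-(3), the following NumPy sketch masks a single-channel image in the frequency domain. It is a minimal sketch under stated assumptions rather than the implementation used in this work: the function name `frequency_mask`, the uniformly random choice of masked positions over the full band, and keeping only the real part after the inverse transform are all assumptions made for the example.

```python
import math
import numpy as np


def frequency_mask(image, ratio, rng=None):
    """Zero a fraction `ratio` of the FFT coefficients of a 2D image.

    Returns the masked image I_m(x, y) and the boolean keep-mask.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    spectrum = np.fft.fft2(image)                  # Eq. (1): F(u, v)
    n_masked = math.ceil(ratio * h * w)            # Eq. (2) over the full band
    # Assumption: masked positions are drawn uniformly at random.
    masked_idx = rng.permutation(h * w)[:n_masked]
    keep = np.ones(h * w, dtype=bool)
    keep[masked_idx] = False
    keep = keep.reshape(h, w)
    masked_spectrum = spectrum * keep              # Eq. (3): M(u, v)
    # A randomly masked spectrum is no longer conjugate-symmetric, so the
    # inverse FFT is complex; keeping the real part is a simplification.
    masked_image = np.fft.ifft2(masked_spectrum).real
    return masked_image, keep


rng = np.random.default_rng(0)
image = rng.random((224, 224))                     # stand-in for a face crop
masked_image, keep = frequency_mask(image, ratio=0.3, rng=rng)
n_zeroed = int((~keep).sum())
print(n_zeroed, masked_image.shape)
```

For a 224 × 224 input and r = 0.3, Equation (2) gives N = ⌈0.3 × 224 × 224⌉ = 15,053 zeroed coefficients.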

3.2. High-Frequency Feature Convolution

Different forgery methods produce different characteristic features, such as visual artifacts and texture differences. However, when images undergo processing such as compression, blurring, and cropping, these subtle differences in the spatial domain may be obscured, causing detection methods based on spatial-domain features to fail. This may lead the detector to overfit specific details of the training source. To address this issue and enhance the generalization ability of the detector, we use the Fast Fourier Transform to convert the input into the frequency space across the spatial dimensions $(W, H)$ and the channel dimension $C$. Subsequently, we use convolutional operations to extract features, enabling the detector to learn within the frequency space. Then, we use the inverse Fast Fourier Transform to transform the learned spectral information back to the feature space. This integrated process strengthens the detector’s representation ability in the frequency domain, enhancing its sensitivity to key features present in that space, as shown in Figure 1.

For each input image $x$, we first transform it into the frequency domain using the Fast Fourier Transform. Subsequently, we extract the high-frequency components by applying a high-pass filter, denoted as $\mathrm{filter}(\cdot)$, as follows:

$$f_h = \mathrm{filter}(\varphi(x)) \tag{4}$$

Here, $\varphi$ denotes the Fast Fourier Transform, and $f_h$ represents the high-pass-filtered frequency representation of the image. The zero frequency is moved to the center.

After extracting the high-frequency components, we transform this frequency information back to the image space:

$$x_h = \mathrm{IF}(f_h) \tag{5}$$

Here, $\mathrm{IF}$ denotes the inverse Fast Fourier Transform, and $x_h$ represents the high-frequency components in the image space. This process ensures that our focus remains consistently on the relevant high-frequency information contained within the image.
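Equations (4) and (5) can be sketched in NumPy as follows. The shape of the high-pass filter is not specified above, so the circular cutoff used here, the `cutoff` parameter, and the function name `high_frequency_component` are assumptions chosen only for illustration.

```python
import numpy as np


def high_frequency_component(x, cutoff):
    """Return x_h, the part of a 2D image with frequency radius > cutoff."""
    h, w = x.shape
    # phi(x), with the zero frequency moved to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(x))
    rows = np.arange(h) - h // 2
    cols = np.arange(w) - w // 2
    radius = np.sqrt(rows[:, None] ** 2 + cols[None, :] ** 2)
    # Eq. (4): assumed high-pass filter that removes a centered disc.
    f_h = spectrum * (radius > cutoff)
    # Eq. (5): undo the shift, then apply the inverse FFT.
    x_h = np.fft.ifft2(np.fft.ifftshift(f_h)).real
    return x_h


# A constant offset (pure low frequency) plus a checkerboard (highest frequency).
checker = (np.indices((16, 16)).sum(axis=0) % 2) * 2.0 - 1.0
image = 3.0 + checker
x_h = high_frequency_component(image, cutoff=4)
print(np.abs(x_h - checker).max())
```

In this toy input the constant offset is removed and the checkerboard pattern survives, which is the behavior expected of a high-pass filter.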

3.3. Spatial-Frequency Attention

Frequency information contains undiscovered clues in the frequency domain, especially in high-frequency information that is related to image details. These clues reveal abnormal frequency distributions between real and fake faces. After extracting frequency-domain features through frequency-domain convolution, they can be integrated with spatial-domain features to form a more comprehensive and rich feature representation. To fully utilize the spatial and high-frequency information, we design a high-frequency guided attention module. Specifically, for the i-th block of ViT, we employ a cross-attention mechanism, using the features extracted by ViT as the Query and the high-frequency features as the Key and Value. The specific calculation formula can be defined as:

$$\hat{F}_{h} = F_{vit} + \mathrm{Attention}(\mathrm{Query}(F_{vit}), \mathrm{Key}(F_{h}), \mathrm{Value}(F_{h})) \tag{6}$$

Here, Attention(⋅) represents the attention operation.

To better extract both spatial and frequency-domain information, after the first attention mechanism, the features $F_{vit}$ obtained from the backbone network are used as Key and Value, while $\hat{F}_{h}$ serves as Query for the attention operation to obtain $\hat{F}_{vit}$. This process is mathematically represented as:

$$\hat{F}_{vit} = \hat{F}_{h} + \mathrm{Attention}(\mathrm{Query}(\hat{F}_{h}), \mathrm{Key}(F_{vit}), \mathrm{Value}(F_{vit})) \tag{7}$$

Finally, the derived $\hat{F}_{vit}$ is processed by the FFN layer to yield the final output $F_{next}$:

$$F_{next} = \hat{F}_{vit} + \mathrm{FFN}(\mathrm{norm}(\hat{F}_{vit})) \tag{8}$$

Here, norm(⋅) represents the normalization layer. The obtained $F_{next}$ is used as the input for the subsequent spatial-frequency attention module.
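The data flow of Equations (6)-(8) can be illustrated with a small NumPy sketch. It assumes single-head scaled dot-product attention, a layer normalization without learnable parameters, a two-layer ReLU FFN, and randomly initialized projection weights; the token count and feature width are likewise illustrative, and none of these choices is taken from the text above.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def cross_attention(query_src, kv_src, w_q, w_k, w_v):
    """Single-head attention: queries from one stream, keys/values from the other."""
    q, k, v = query_src @ w_q, kv_src @ w_k, kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def spatial_frequency_block(f_vit, f_h, params):
    # Eq. (6): ViT features query the high-frequency features.
    f_h_hat = f_vit + cross_attention(f_vit, f_h, *params["attn1"])
    # Eq. (7): the result queries the original ViT features.
    f_vit_hat = f_h_hat + cross_attention(f_h_hat, f_vit, *params["attn2"])
    # Eq. (8): residual FFN on the normalized features.
    w1, w2 = params["ffn"]
    ffn_out = np.maximum(layer_norm(f_vit_hat) @ w1, 0.0) @ w2
    return f_vit_hat + ffn_out


rng = np.random.default_rng(0)
tokens, dim = 197, 64          # illustrative sizes, not the model's configuration
params = {
    "attn1": [rng.normal(scale=0.02, size=(dim, dim)) for _ in range(3)],
    "attn2": [rng.normal(scale=0.02, size=(dim, dim)) for _ in range(3)],
    "ffn": [rng.normal(scale=0.02, size=(dim, 4 * dim)),
            rng.normal(scale=0.02, size=(4 * dim, dim))],
}
f_vit = rng.normal(size=(tokens, dim))
f_h = rng.normal(size=(tokens, dim))
f_next = spatial_frequency_block(f_vit, f_h, params)
print(f_next.shape)
```

Because every step is residual, the output keeps the shape of $F_{vit}$ and can be passed directly to the next block.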

3.4. Loss Function

For each given true label value $Y^{i}$ and its corresponding predicted value $\tilde{Y}^{i}$, we use the cross-entropy loss to train the deepfake detector:

$$L_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \left[ Y^{i} \log(\tilde{Y}^{i}) + (1 - Y^{i}) \log(1 - \tilde{Y}^{i}) \right] \tag{9}$$
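As a worked instance of Equation (9), the short sketch below evaluates the loss on four hypothetical predictions. The label convention (1 = fake, 0 = real), the sample values, and the clamping constant `eps` are assumptions introduced for the example.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy of Eq. (9) for labels in {0, 1}."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Assumption: clamp predictions to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = [1, 0, 1, 0]            # hypothetical ground truth
probs = [0.9, 0.2, 0.6, 0.4]     # hypothetical predicted probabilities
loss = binary_cross_entropy(labels, probs)
print(round(loss, 4))
```

Here the loss is $-\frac{1}{4}(\ln 0.9 + \ln 0.8 + \ln 0.6 + \ln 0.6) \approx 0.3375$; confident correct predictions contribute little, while the two less certain ones dominate.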

4. Experiments

4.1. Experiments Settings

4.1.1. Datasets

The FaceForensics++ (FF++) dataset is a high-quality dataset widely used for research on face forgery detection. It comprises 1000 original video sequences, derived from 977 YouTube videos, that have been processed using four primary face manipulation methods: Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT). The original videos feature individuals of diverse ages, genders, and ethnicities, ensuring the data’s diversity and representativeness. DF uses deep learning models, based on an autoencoder architecture, to replace one person’s face with another’s, generating highly realistic forged videos. F2F maps the facial expressions and movements of one person onto another’s face, which allows for modifications specifically targeting facial expressions. FS primarily uses image processing techniques to achieve face swapping: through detection, alignment, and blending, it exchanges the source face with the target face. NT combines neural textures and rendering networks to optimize the synthesis and rendering of facial textures, capturing subtle changes and dynamic features of the face. The dataset provides not only the original videos and the videos generated by the four forgery methods but also mask images of the faces, which can be used for tasks such as image and video classification as well as segmentation. The four types of forged data in FF++ are shown in Figure 2.

To further test the model’s cross-dataset performance, we used the Celeb-DF and WildDeepfake datasets.

Celeb-DF (Celebrities-DeepFake) is a large-scale, high-quality deepfake-detection dataset. It is designed to provide a challenging benchmark for deepfake-detection research and to promote the development of related technologies. The Celeb-DF dataset contains 590 real celebrity videos collected from YouTube (Celeb-real), 5639 corresponding deepfake synthesized videos (Celeb-synthesis), and 518 test videos, totaling over 6000 videos. These videos cover a wide range of subjects with different ages, races, and genders, simulating the diverse scenarios found in the real world.

WildDeepfake is a challenging real-world dataset for deepfake detection. This dataset aims to address the limitations of existing deepfake-detection datasets in terms of diversity and authenticity by providing forged video samples that are closer to real-world scenarios. WildDeepfake includes 7314 facial sequences extracted from 707 deepfake videos collected from the Internet, covering a wide range of scenarios, individuals, and expressions and reflecting the complexity of deepfakes in the real world.

4.1.2. Evaluation Metrics

In this research, we predominantly utilize the AUC (Area Under the Curve) metric as the criterion for discerning authenticity. The AUC is defined as the area beneath the ROC curve (Receiver Operating Characteristic Curve). The ROC curve is constructed by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) across varying classification thresholds, thereby providing a comprehensive assessment of the model’s performance. The AUC value spans a range from 0 to 1, with higher values indicative of superior model performance.
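For reference, the AUC can be computed without tracing the ROC curve explicitly, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count one half). The sketch below uses this equivalence; the function name `roc_auc`, the label convention (1 = fake), and the sample scores are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC via pairwise ranking of positive against negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]                  # hypothetical ground truth
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]      # hypothetical detector outputs
auc = roc_auc(labels, scores)
print(auc)
```

In this example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of samples and is meant only to make the definition concrete.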

4.1.3. Implementation Details

In the data preprocessing stage, we aligned the masked videos processed by the official method from the FF++ dataset with the original videos and extracted face images. We extracted images from these videos by sampling every five frames. All images were cropped to $224 \times 224$ pixels and normalized uniformly to the range of 0 to 1. In addition, we applied common image augmentation techniques, including flipping, contrast enhancement, and blurring. The model was trained with a learning rate of $5 \times 10^{-4}$, and a mask ratio of 30% was employed in the frequency-domain masking module. We used the base version of the Vision Transformer model, implemented using PyTorch 1.12.1 on an NVIDIA RTX 3090 24 GB GPU (manufactured by ASUS, Suzhou, China).
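The settings above can be collected in a small configuration sketch. The constant and function names are hypothetical, and dividing 8-bit pixel values by 255 is an assumed way of obtaining the stated 0-to-1 range; face detection, cropping, and augmentation are omitted.

```python
import numpy as np

FRAME_STEP = 5          # sample every five frames
IMAGE_SIZE = 224        # face crops are 224 x 224 pixels
LEARNING_RATE = 5e-4
MASK_RATIO = 0.30       # frequency-domain masking ratio


def sample_frame_indices(num_frames, step=FRAME_STEP):
    """Indices of the frames kept from a video with `num_frames` frames."""
    return list(range(0, num_frames, step))


def normalize_face(face_uint8):
    """Scale an 8-bit face crop to floating point values in [0, 1]."""
    return face_uint8.astype(np.float32) / 255.0


indices = sample_frame_indices(23)
rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.uint8)
normalized = normalize_face(face)
print(indices, normalized.dtype)
```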

4.2. Evaluations

4.2.1. Intra-Dataset Evaluation

We conducted an intra-dataset evaluation of our model. Specifically, we performed experiments on the FF++ dataset, where both the training and testing data were sourced from the same dataset. As shown in Table 1, our method demonstrated significant improvements in the low-quality setting.

We observe that our model performs better on the low-quality data than on the high-quality data. We posit that in the low-quality FF++ dataset, images and videos are subjected to heavier compression and processing, which introduces a greater number of artifacts and inconsistencies in frequency features. Frequency-domain-based detection methods are capable of effectively capturing these anomalies, thereby enhancing detection performance. In contrast, high-quality datasets usually undergo more refined processing and lighter compression, resulting in fewer artifacts and anomalies in frequency features, making it challenging to discern sufficient feature differences for effective detection.

4.2.2. Cross-Manipulation Evaluation

For the cross-manipulation evaluation, we trained our deepfake detector on each of the four manipulation subsets of the FF++ dataset in turn and tested it on the remaining three. This protocol assesses the model's ability to recognize forgery methods not encountered during training. We used the low-quality (LQ) version of the FF++ dataset for training. As shown in Table 2, our method generalizes well in most settings. For example, when the model is trained on FS, its AUC on DF, F2F, and NT is 92.44%, 78.87%, and 71.86%, compared with 82.73%, 64.85%, and 56.96% for DisGRL. We attribute this improvement to the model learning subtle forgery traces in the frequency domain while combining spatial features with features obtained under global frequency-domain masking. The model is designed to avoid overfitting to specific forgery methods by treating all non-real data as anomalies, which supports broad applicability across different forgery techniques.

Comparative analysis indicates that our model generalizes better than the other models in most cross-manipulation settings. FS and F2F primarily generate fake faces through facial replacement and mapping, and the resulting images often contain unnatural boundaries between the facial area and the background. Because pixel values change abruptly at these boundaries, the boundary artifacts appear as anomalies in the high-frequency components of the spectrum. Consequently, our model exhibits strong detection capabilities on these two methods. However, the model's performance is relatively weaker when it is trained and tested on NT and DF. DF mainly relies on autoencoder structures for face generation, while NT combines computer vision techniques with deep learning methods. This results in more realistic lighting, texture, and geometric features in the generated images, leading to subtler anomalies in the frequency domain, which are more difficult for the model to detect.
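The link between abrupt boundaries and high-frequency content can be checked numerically. The sketch below is a toy illustration under stated assumptions, not part of the paper's method: the synthetic images, the function names, and the cutoff of 0.25 cycles per pixel are all chosen here for demonstration.

```python
import numpy as np


def high_frequency_energy_ratio(image, cutoff=0.25):
    """Fraction of spectral energy at radial frequency above cutoff (cycles per pixel)."""
    height, width = image.shape
    energy = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(height)[:, None]   # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(width)[None, :]    # horizontal frequencies in [-0.5, 0.5)
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return float(energy[radius > cutoff].sum() / energy.sum())


def make_demo_images(size=64):
    """Return a smooth image and a copy with a sharply bounded pasted patch."""
    x = np.arange(size)
    smooth = 0.5 + 0.25 * np.outer(np.sin(2 * np.pi * x / size),
                                   np.cos(2 * np.pi * x / size))
    pasted = smooth.copy()
    pasted[20:44, 20:44] += 0.3   # abrupt intensity step at the patch border
    return smooth, pasted


smooth_image, pasted_image = make_demo_images()
print("smooth:", high_frequency_energy_ratio(smooth_image))
print("pasted:", high_frequency_energy_ratio(pasted_image))
```

The pasted image carries more energy above the cutoff than the smooth one, which mirrors the argument that replacement-style forgeries leave boundary traces in the high-frequency band.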

4.2.3. Cross-Dataset Evaluation

We trained our model using the low-quality (LQ) version of the FF++ dataset and evaluated its generalization performance on the Celeb-DF and WildDeepfake test sets. In addition to our model, we compared its performance with several existing methods. The results presented in Table 3 indicate that, compared with existing methods, our model demonstrates superior generalization and adaptability across different datasets.

Our model achieves the highest AUC and the lowest EER on both datasets (75.82% AUC on WildDeepfake and 74.86% AUC on Celeb-DF). Xception, a classic deep convolutional architecture, has multiscale representation ability in spatial-domain feature extraction. However, its unimodal spatial feature modeling mechanism is inherently limited, lacking frequency-domain feature extraction and cross-domain fusion capabilities. This makes it hard to fully capture the common patterns of cross-dataset forgery traces, restricting its generalization on complex adversarial samples. DisGRL and RECCE rely on image reconstruction strategies, determining the authenticity of images by detecting pixel-level differences in reconstruction results. However, as spatial-domain methods, they do not use frequency-domain information, and their detection ability decreases on highly concealed forgeries. F³Net improves detection through dual-domain feature fusion but lacks constraints on model attention regions, making it prone to overfitting frequency-domain forgery traces from specific methods and reducing its generalization. In contrast, our model fuses spatial and frequency-domain features so that the two sources of information complement each other. It also uses a frequency-domain masking mechanism to dynamically mask specific frequency bands, forcing the model to avoid oversensitivity to local forgery traces and instead learn more generalizable cross-method abnormal representations. This improves detection accuracy and generalization.

4.3. Ablation Study

The baseline model is a Vision Transformer, and ablation experiments were conducted on each module. We trained on the Face2Face (F2F) forgery method in the low-quality (LQ) version of the FF++ dataset and tested on the remaining methods (DF, FS, and NT), as listed in Table 4. Here, FM denotes the frequency-domain masking module, HFC represents the high-frequency convolution module, and SFA indicates the spatial-frequency attention.

According to the ablation study in Table 4, adding only SFA to the baseline model significantly improves detection. SFA fuses global spatial and frequency features via attention, capturing structural and distributional anomalies. In contrast, adding FM alone yields smaller AUC gains. FM's frequency masking indirectly constrains learning but does not fully utilize the complementary features. Further adding HFC to SFA boosts performance. HFC enhances frequency detail capture through high-pass filtering and convolution across spatial and channel dimensions, maintaining detection ability even for low-quality images. The complete model, combining FM, SFA, and HFC, shows the best performance. FM prevents overfitting to specific frequencies, SFA enables cross-domain feature fusion, and HFC strengthens high-frequency detail extraction. Together, they help the model identify more universal forgery patterns, surpassing single- or dual-module setups. This underscores the importance of multimodal fusion and frequency masking for enhancing detection generalization.

To investigate the impact of frequency-domain masking on the overall detection capability of the model, we set the mask ratio to 15%, 30%, 45%, 60%, and 75%. Training was conducted on the F2F method of the FF++ (LQ) dataset, and testing was performed on the DF method. The results are shown in Table 5.

At a low masking ratio r (e.g., 15%), less frequency information is masked, allowing the model to retain more original frequency features. This helps the model learn comprehensive image information but may cause it to focus on specific forgery features. At a moderate r (e.g., 30%), the model retains sufficient image information while effectively identifying abnormal frequency patterns from various forgery methods, improving detection accuracy and generalization. However, too high an r (e.g., 75%) masks a large amount of frequency information, limiting the model’s learnable image features. This may result in insufficient overall image understanding, making it hard to accurately judge image authenticity and leading to decreased detection performance.
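To illustrate what the masking ratio r controls, the following sketch zeroes a random fraction r of an image's Fourier coefficients and transforms the result back to the spatial domain. It is an assumption-laden illustration rather than the paper's module: the uniformly random choice of coefficients, the function name, and the use of the real part after the inverse transform are choices made here.

```python
import numpy as np


def random_frequency_mask(image, mask_ratio=0.30, seed=0):
    """Zero a random fraction of Fourier coefficients of an (H, W) or (H, W, C) image.

    Returns the masked image in the spatial domain and the boolean keep-mask.
    Assumption: coefficients are chosen uniformly at random.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    num_masked = int(round(mask_ratio * height * width))

    keep = np.ones(height * width, dtype=bool)
    keep[rng.choice(height * width, size=num_masked, replace=False)] = False
    keep = keep.reshape(height, width)

    spectrum = np.fft.fft2(image, axes=(0, 1))
    # Apply the same frequency mask to every channel.
    channel_mask = keep[:, :, None] if image.ndim == 3 else keep
    masked_spectrum = spectrum * channel_mask

    # A random mask is not conjugate-symmetric, so the inverse transform is
    # complex in general; only the real part is kept in this sketch.
    masked_image = np.fft.ifft2(masked_spectrum, axes=(0, 1)).real
    return masked_image, keep


demo = np.random.default_rng(1).random((32, 32))
for ratio in (0.15, 0.30, 0.45, 0.60, 0.75):
    masked, keep_mask = random_frequency_mask(demo, mask_ratio=ratio)
    print(ratio, "fraction masked:", 1.0 - keep_mask.mean())
```

With a small r most of the spectrum survives, while r = 0.75 removes three quarters of the coefficients, which matches the trade-off between retained image information and regularization described above.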

4.4. Visualize

4.4.1. Analysis of Classification Decision

To better understand the features extracted by our model, we conducted a Grad-CAM visualization experiment on the FF++ dataset; the results are shown in Figure 3. The visualization reveals distinct activation patterns for real and forged faces. Because the model captures the high-frequency noise and artifacts in forged images and is sensitive to the differences in feature distribution between real and forged images, the activation maps show more pronounced activation in the facial areas of the forged images.
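For readers unfamiliar with Grad-CAM, the core computation can be sketched in a few lines. This shows the standard Grad-CAM weighting on generic feature maps, not the authors' visualization code: the function name is hypothetical, and it assumes the activations and gradients of one layer are already available as (K, H, W) arrays. For a Vision Transformer, the patch tokens would first need to be reshaped into such a grid.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heat map.

    activations: (K, H, W) feature maps of the chosen layer.
    gradients:   (K, H, W) gradient of the class score w.r.t. those maps.
    """
    # Channel weights: spatial average of the gradients for each feature map.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps over the channel axis, giving an (H, W) map.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting map is usually upsampled to the input resolution and overlaid on the face image, which is how heat maps such as those in Figure 3 are commonly rendered.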

4.4.2. Feature Distribution Analysis

We used t-SNE to visualize the feature distribution of our model trained on the FF++ (LQ) dataset. The features were extracted from the layers just before the final fully connected layer. For this visualization, we randomly selected 1000 images from FF++. The results are shown in Figure 4, where (a) shows the distribution after the first training epoch, (b) after the 10th epoch, and (c) and (d) after the 50th and 60th epochs, respectively. The t-SNE plots show that our method can effectively distinguish between real face images and forged images, separating the two categories distinctly.

5. Conclusions

To tackle the challenges posed by the evolving deepfake technology, this study proposes a facial forgery detection algorithm based on frequency-domain information masking and spatial-domain interaction. Using the Fourier transform to map images to the frequency domain, we design a frequency masking module to prevent the model from focusing too much on specific frequency-domain features, enhancing its generalization. We also create a high-frequency information convolution module for spatial and channel dimensions to help the model capture subtle forgery traces more effectively. Moreover, we introduce a cross-domain attention mechanism to integrate frequency-domain and spatial-domain information, forming a multimodal detection method that overcomes the limitations of single-domain approaches. Experiments show that our model outperforms others in intra-domain and cross-domain detection on benchmark datasets like FaceForensics++, Celeb-DF, and WildDeepfake. This confirms the effectiveness of our method, which achieves both high detection accuracy and strong generalization.

Author Contributions

X.L.: Conceptualization, Methodology, Data curation, Writing—review and editing. Y.W.: Supervision, Investigation, Methodology, Visualization—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The work of Xinyu Luo and Yu Wang is supported by the Jilin Province Science and Technology Development Plan Key R&D Project under Fund 20240304140SF.

Data Availability Statement

The authors have used the publicly archived datasets FaceForensics++ (FF++), Celeb-DF, and WildDeepfake for this study. FaceForensics++ (FF++) is available through its official repository at https://github.com/ondyari/FaceForensics (accessed on 15 March 2024). This dataset includes manipulated facial videos generated via Deepfakes, Face2Face, FaceSwap, and NeuralTextures. Celeb-DF is accessible via its dedicated platform at https://github.com/yuezunli/celeb-deepfakeforensics (accessed on 18 March 2024). It contains high-quality deepfake videos synthesized from celebrity source materials. WildDeepfake can be downloaded from the project page https://github.com/deepfakeinthewild/deepfake-in-the-wild (accessed on 24 March 2024). It comprises diverse real-world forged videos collected from open Internet sources.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Pipeline of our FMSI. The entire network is divided into two parallel branches. In Branch A, the most fundamental Vision Transformer architecture is employed. Prior to Patch Embedding, frequency-domain masking is applied to the image’s frequency information. Subsequently, the frequency-domain information is transformed back into the spatial domain and subjected to Patch Embedding with the image itself. In Branch B, the image is transformed into the frequency domain, where high-frequency filtering is performed across spatial and channel dimensions, followed by the extraction of high-frequency convolutional features. An attention mechanism is then utilized to integrate spatial features from Branch A with high-frequency features. After several such attention operations, the resulting features are classified.

Figure 2. FF++ forged data. In each group of images, the left side shows the real image, while the right side shows the forged image. Specifically, (a) represents the Deepfakes (DF) method, (b) represents the Face2Face (F2F) method, (c) represents the FaceSwap (FS) method, and (d) represents the NeuralTextures (NT) method.

Figure 3. The Grad-CAM visualization. (a) represents the real image, and (b) represents the fake image.

Figure 4. The t-SNE visualization of FMSI trained on FF++. Red dots signify images classified as real, while purple dots indicate those identified as deepfakes.

Table 1. Intra-dataset evaluation (AUC (%)) on FF++.

Methods	FF++ (HQ)	FF++ (LQ)
Xception	97.80	87.21
Multi-task	88.72	76.74
EfficientNetB4	98.94	88.14
Face X-ray	87.40	61.60
Two-Branch	98.70	86.59
RFM	98.79	89.83
GFFD	98.95	88.27
SPSL	95.32	82.82
FST+Efficient-b4	98.81	91.27
ALFE(Xception)	99.10	92.60
PESAF	99.32	92.86
Ours	98.18	96.17

Table 2. Cross-manipulation evaluation (AUC (%)) on each forgery type of FF++ when trained on one type and tested on the remaining types.

Methods	Train	DF	F2F	FS	NT
Freq-SCL	DF	98.91	58.90	66.87	63.61
RECCE	DF	99.65	70.66	74.29	67.34
DisGRL	DF	99.67	71.76	75.21	68.74
Ours	DF	99.78	62.34	84.68	56.67
Freq-SCL	F2F	67.55	93.06	55.35	66.66
RECCE	F2F	75.99	98.06	64.53	72.32
DisGRL	F2F	75.73	98.69	65.71	74.15
Ours	F2F	73.08	99.52	72.84	74.46
Freq-SCL	FS	75.90	54.64	98.82	56.70
RECCE	FS	82.39	64.44	98.82	56.70
DisGRL	FS	82.73	64.85	99.01	56.96
Ours	FS	92.44	78.87	99.68	71.86
Freq-SCL	NT	79.09	74.21	53.99	88.54
RECCE	NT	78.83	80.89	63.70	93.63
DisGRL	NT	80.29	83.30	65.23	94.10
Ours	NT	68.15	87.12	79.11	96.21

Table 3. Cross-dataset evaluation (EER and AUC (%)) on Celeb-DF and WildDeepfake when trained on FF++. Here, ↑ indicates that a higher value for this type of metric corresponds to better model detection capability, while ↓ indicates that a lower value reflects better detection capability.

Methods	WDF AUC ↑	WDF EER ↓	CDF AUC ↑	CDF EER ↓
Xception	62.72	40.65	61.80	41.73
Add-Net	62.35	41.42	65.29	38.90
F³Net	57.10	45.12	61.51	42.03
RFM	57.75	45.45	65.63	38.54
MultiAtt	59.74	43.73	67.02	37.90
MAT	59.74	43.73	67.02	37.90
RECCE	64.31	40.53	68.71	35.73
DisGRL	66.73	39.24	70.03	34.23
PESAF	-	-	73.61	32.70
Ours	75.82	31.49	74.86	31.56

Table 4. Ablation studies (AUC (%)) on FF++ when trained on F2F and tested on the remaining types.

Train	Baseline	SFA	HFC	FM	DF	FS	NT
F2F	√				67.47	64.25	69.36
F2F	√	√			70.31	68.04	70.67
F2F	√			√	69.51	67.66	70.24
F2F	√	√	√		72.40	70.01	71.22
F2F	√	√	√	√	73.08	72.84	74.46

Table 5. Mask Ratio evaluation (AUC (%)) on FF++.

Mask Ratio	AUC
15%	68.89
30%	73.08
45%	70.26
60%	70.41
75%	69.47

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
