Article

Transformer Based on Multi-Domain Feature Fusion for AI-Generated Image Detection

Department of Computer Engineering, Gachon University, 1342 Seongnamdaero, Sujeong-gu, Seongnam-si 13120, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 716; https://doi.org/10.3390/electronics15030716
Submission received: 6 January 2026 / Revised: 4 February 2026 / Accepted: 5 February 2026 / Published: 6 February 2026
(This article belongs to the Special Issue Artificial Intelligence, Computer Vision and 3D Display)

Abstract

With the rapid advancement of Generative Adversarial Networks (GANs), diffusion models, and other deep generative techniques, AI-generated images have achieved unprecedented levels of visual realism, posing severe challenges to the authenticity, security, and credibility of digital content. This paper proposes a novel hybrid transformer model that integrates spatial and frequency domains. It leverages CLIP to extract semantic inconsistencies in the image’s spatial domain while employing wavelet transforms to capture multi-scale frequency anomalies in AI-generated images. After cross-domain feature fusion, global modeling is performed within the Swin-Transformer architecture, enabling robust authenticity detection of AI-generated images. Extensive experiments demonstrate that our detector maintains high accuracy across diverse datasets.

1. Introduction

Recently, AI-generated content (AIGC) technology has made groundbreaking progress, especially with the widespread application of generative models such as Generative Adversarial Networks (GANs) and Diffusion Models. These advances have enabled AI-generated images to reach a level of visual quality that is nearly indistinguishable from reality. The continuous evolution of diffusion models, in particular, has significantly blurred the boundary between synthetic and real images. Their powerful text-to-image generation systems not only produce works of high artistic value but also demonstrate broad application potential in advertising, entertainment, and education. However, the rapid development of these AI technologies has also introduced serious societal challenges, including security risks such as the spread of misinformation, digital identity forgery, and the erosion of evidentiary credibility. Consequently, accurately and efficiently detecting AI-generated images has become a critical research topic in computer vision, digital forensics, and cybersecurity.
Faced with this pressing challenge, developing reliable and robust detection methods for AI-generated images has emerged as a core issue in multimedia forensics, content security, and AI governance. Early detection approaches primarily relied on convolutional neural networks (CNNs), which distinguished synthetic images by modeling high-frequency artifacts, boundary discontinuities, or spectral anomalies unique to generated content in pixel space. Representative works include Chen et al. [1], who employed an improved Xception model to detect locally GAN-generated faces; Afchar et al. [2], who proposed MesoNet—an efficient architecture for automatically detecting face tampering in videos; and Guarnera et al. [3], who utilized the expectation-maximization (EM) algorithm to extract local features and model convolutional traces potentially present in images. These models were largely optimized for specific generative architectures (e.g., ProGAN [4] and StyleGAN [5,6,7]) and achieved strong performance on data drawn from similar distributions. However, their detection accuracy degrades dramatically in “open-world” scenarios—such as images generated by unknown architectures, cross-domain generation (e.g., from natural photographs to artistic styles), or images subjected to common post-processing operations (e.g., JPEG compression, blurring, cropping, or color dithering). This generalization bottleneck fundamentally stems from CNNs’ over-reliance on local textures and low-level statistical cues, which limits their ability to capture high-level semantic inconsistencies or cross-modal logical contradictions inherent in the AI image generation process.
To address this limitation, researchers have begun exploring more generalizable detection paradigms. One promising direction involves integrating frequency-domain analysis into AI-generated image detection, leveraging systematic biases exhibited by synthetic images in the Fourier domain or discrete cosine transform (DCT)—such as periodic artifacts and anomalous high-frequency energy—for authentication. For instance, Li et al. [8] proposed FreqBlender, a frequency analysis network that adaptively segments frequency components associated with forgery traces. Luo et al. [9] employed frequency-domain masking combined with spatial interactions to help models more effectively capture subtle manipulation signatures and enhance generalization. Liu et al. [10] introduced SFANet, a Spatio-Frequency Attention Network based on wavelet transforms, which uses a dual-attention mechanism to dynamically adjust frequency-domain weights for deepfake detection. Nevertheless, approaches that merely combine frequency-domain features with CNNs or standalone frequency classifiers can capture generation-specific fingerprints but often lack robust semantic context modeling, rendering them prone to misclassification in complex, open-world environments.
More recently, Transformer-based detector models have demonstrated superior generalization capabilities due to their global receptive fields and exceptional ability to model long-range dependencies. As a result, they are gradually replacing CNNs as the backbone architecture for forgery detection. Li et al. [11] developed the Detail-Aware Transformer (DAT) to focus on subtle fusion traces arising from inconsistencies in image details. Wang et al. [12] proposed M2TR (Multimodal Multiscale Transformer), which fuses image frequency-domain features with RGB information and processes image patches at multiple scales to detect local inconsistencies across different spatial resolutions. Furthermore, the advent of large-scale pre-trained vision-language models like CLIP offers a powerful tool for assessing high-level semantic consistency in images. Liu et al. [13] introduced FatFormer, a forgery-aware adaptive Transformer that identifies and integrates local forgery traces from both spatial and frequency domains. Yan et al. [14] presented AIDE, an AI-generated image detector based on hybrid features, which employs multiple expert modules to simultaneously extract visual artifacts and noise while leveraging semantic and contextual cues for effective identification.
Existing CLIP/ViT-based methods emphasize semantic generalization but underutilize fine-grained, local frequency artifacts. Conversely, simple wavelet-based CNNs or frequency-domain classifiers can detect generation fingerprints yet lack robust semantic reasoning, leading to performance degradation in complex open-world settings involving diverse categories and environmental variations. To bridge this gap, we propose a hybrid model that jointly leverages spatial and frequency domains. Specifically, our approach utilizes CLIP to extract semantic inconsistencies in the spatial domain, employs discrete wavelet transforms to capture multi-scale frequency anomalies characteristic of AI-generated images, and performs cross-domain feature fusion before feeding the combined representation into a Swin Transformer for robust authenticity verification.
Specifically, this paper makes the following contributions:
  • We propose a multimodal backbone network based on Transformers and CLIP, effectively capturing semantic inconsistencies inherent in AI-generated images.
  • We introduce the discrete wavelet transform for multi-scale frequency analysis, extracting distinctive features across different sub-bands to enhance sensitivity to generation traces.
  • We design an efficient feature fusion mechanism that organically integrates semantic and frequency-domain features into complementary representations. Extensive experiments validate the superior performance and robustness of our method across various generative models and under diverse perturbation conditions.

2. Related Works

Spatial Domain-Based Forgery Detection. Deep learning techniques dominate the field of detecting AI-generated images, with convolutional neural networks (CNNs) and vision transformers (ViTs) [15] being particularly effective. Early approaches to deepfake and AI-generated image detection [16,17] primarily relied on CNN architectures, which automatically learn spatial-domain artifacts such as edge blurring, inconsistent textures from upsampling, and local color discrepancies. Much of this work is built upon established backbone networks—such as ResNet [18] and Xception [19]—to perform binary classification on benchmark datasets including FaceForensics++ [20], Celeb-DF [21], and GAN-synthesized image collections [22]. While these methods achieve near-saturated accuracy on their respective training datasets, subsequent studies have revealed significant performance degradation when evaluated across unseen or heterogeneous datasets—a phenomenon often referred to as domain shift or generalization collapse.
Frequency-Based Forensics. Frequency-domain analysis methods transform images into spectral representations (e.g., via Discrete Cosine Transform or wavelet transforms) to examine generation-induced artifacts that are less apparent in the spatial domain. A key advantage of these approaches is their relative insensitivity to semantic content and robustness to common post-processing operations such as resizing and cropping. For instance, Sun et al. [23] proposed a time-frequency convolutional neural network that leverages an Upsampling Artifact Representation Module (UARM) and a Frequency-Assisted Temporal Incoherence Module (FATIM) to detect fake faces by modeling inconsistencies in frequency responses. Zhou et al. [24] introduced the Frequency-based Local and Global (FLAG) Network, which incorporates a Frequency-based Attention Enhancement Module (FAEM) to facilitate synergistic fusion between CNNs and ViTs. By exploiting frequency-domain cues to capture local textural anomalies and global structural inconsistencies jointly, FLAG demonstrates improved cross-dataset generalization. Similarly, Jia et al. [25] developed a frequency-based adversarial attack detection framework for facial forgeries, applying the Discrete Cosine Transform (DCT) and introducing a dedicated fusion module to highlight salient regions of adversarial perturbations in the frequency domain. However, as generative models advance—particularly with the rise of diffusion-based synthesis—the frequency-domain artifacts they produce have become increasingly subtle and model-dependent, limiting the effectiveness of purely frequency-based detection strategies.
Challenges Posed by Generative Model Diversity. A major challenge in current forgery detection lies in the heterogeneity of modern generative models. Common architectures include Generative Adversarial Networks (GANs), Diffusion Models, Variational Autoencoders (VAEs), and Autoregressive Models, each with numerous variants and distinct generation mechanisms. These differences lead to diverse statistical footprints and artifact patterns in synthesized images. For example, GAN-generated images often exhibit grid-like or rasterized artifacts due to transposed convolutions during upsampling, whereas diffusion models may introduce subtle deviations in noise distribution or high-frequency coherence. Consequently, detectors fine-tuned on one class of generative models (e.g., StyleGAN) frequently fail to generalize to others (e.g., Stable Diffusion or DALL·E), highlighting a critical “model gap.” This gap is exacerbated by the rapid pace of generative model development, underscoring the need for detection frameworks that identify universal traces of synthetic origin rather than model-specific signatures.
To address these limitations, this paper proposes a hybrid detection architecture that jointly exploits spatial and frequency domains. Specifically, we employ CLIP to capture semantic inconsistencies in the spatial domain—leveraging its strong zero-shot generalization and sensitivity to implausible visual-textual alignments—while simultaneously analyzing multi-scale frequency anomalies through wavelet-based decomposition. The resulting multi-domain features are then fused and globally modeled using a Swin Transformer, enabling robust and generalizable detection of AI-generated imagery across diverse generative paradigms.

3. Method

The model we propose is illustrated in Figure 1. It consists of preliminary multi-domain feature extraction, comprising spatial semantic inconsistency detection based on CLIP and frequency-domain feature anomaly detection based on wavelet transform. These complementary features are then fused and input to a Swin Transformer for global contextual modeling, ultimately determining whether the image is AI-generated.

3.1. Feature Extraction

This module extracts two complementary feature representations from the input image: spatial-domain features capturing CLIP deep semantics and frequency-domain features capturing wavelet-domain texture.
In spatial-domain feature extraction, the CLIP encoder extracts pixel-level features from the image. Balancing accuracy against computational complexity, we selected ViT-L/14 as the backbone network. Since the authenticity detection model does not need to learn new semantic features from images—it only needs to detect “semantic inconsistencies” using existing prior knowledge—we adopt a frozen CLIP model. First, the input image $I \in \mathbb{R}^{H \times W \times 3}$ is partitioned into $N$ non-overlapping patches of $14 \times 14$ pixels, and each patch is mapped to a $D_p$-dimensional embedding vector by a learnable linear projection layer $E$. The embedding sequence is then fed into CLIP-ViT, an $L_{clip}$-layer Transformer encoder in which each layer contains a multi-head self-attention (MSA) module and a multilayer perceptron (MLP) module, together with layer normalization (LayerNorm) and residual connections. The spatial-domain feature $F_s \in \mathbb{R}^{N \times D_p}$ is computed by this Transformer encoder.
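As an illustration, the frozen spatial branch could be realized as in the minimal sketch below, assuming the Hugging Face transformers implementation of CLIP ViT-L/14; the specific library, the preprocessing call, and the choice to drop the [CLS] token and keep only patch tokens are our assumptions rather than details fixed by the paper.

```python
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

# Frozen CLIP ViT-L/14 used purely as a spatial-semantic feature extractor.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
clip.eval()
for p in clip.parameters():
    p.requires_grad = False  # the detector never fine-tunes CLIP

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def extract_spatial_features(images):
    """images: list of PIL images -> patch-token features F_s of shape (B, N, D_p)."""
    pixel_values = processor(images=images, return_tensors="pt")["pixel_values"]
    tokens = clip(pixel_values=pixel_values).last_hidden_state  # (B, 1 + N, 1024)
    return tokens[:, 1:, :]  # drop the [CLS] token, keep the N patch tokens
```

For a 224 × 224 input, this yields $N = 256$ patch tokens with $D_p = 1024$.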
In frequency-domain feature extraction, to capture the inherent high-frequency inconsistencies in AI-generated images—particularly those arising from upsampling operations in diffusion models and generative adversarial networks (GANs)—while keeping computation low, we employ a discrete wavelet transform (DWT) based on the Daubechies-4 (db4) wavelet basis. The db4 wavelet was selected because its four vanishing moments yield a sparse representation of smooth natural signals while remaining highly sensitive to the unnatural oscillatory artifacts commonly found in synthetic images. First, a single-level two-dimensional DWT (2D-DWT) is applied independently to each color channel (R, G, B) of the input image $I \in \mathbb{R}^{H \times W \times 3}$. Each channel yields four sub-bands: a low-frequency approximation component (LL) and high-frequency detail components in three directions: horizontal (LH), vertical (HL), and diagonal (HH). The low-frequency sub-band LL mainly carries the overall image content and largely overlaps with the pixel-domain features, whereas the high-frequency sub-bands LH, HL, and HH contain texture and edge details, which are precisely the regions where generative models are prone to distortion. We therefore discard the low-frequency LL sub-band and concatenate the three high-frequency sub-bands along the channel dimension; for a three-channel image, this yields a 9-channel high-frequency feature map $W \in \mathbb{R}^{(H/2) \times (W/2) \times 9}$. To encode $W$ into a token sequence matching the spatial resolution of the spatial-domain feature $F_s$, we designed a lightweight CNN encoder consisting of three convolutional blocks, each containing a 3 × 3 convolutional layer, batch normalization (BatchNorm), and a ReLU activation function; downsampling is performed by convolutions with a stride of 2. Finally, an adaptive average pooling layer resizes the feature map to match $F_s$ and flattens it into a sequence $F_f$.
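Under the same caveat, the frequency branch might look like the sketch below, which uses PyWavelets for the single-level db4 decomposition and PyTorch for the lightweight encoder; the intermediate channel widths (64, 256) and the 16 × 16 output token grid (chosen to match the CLIP patch-token grid at a 224 × 224 input) are illustrative assumptions, since the paper only fixes the three stride-2 convolutional blocks with BatchNorm, ReLU, and adaptive average pooling.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def highfreq_subbands(image: np.ndarray) -> torch.Tensor:
    """image: (H, W, 3) array in [0, 1] -> 9-channel high-frequency map (9, H/2, W/2).

    Single-level db4 2D-DWT per color channel; the LL band is discarded and the
    LH, HL, HH bands of all three channels are stacked along the channel axis.
    """
    bands = []
    for c in range(3):
        _, (lh, hl, hh) = pywt.dwt2(image[..., c], "db4", mode="periodization")
        bands.extend([lh, hl, hh])
    return torch.from_numpy(np.stack(bands, axis=0)).float()

class FreqEncoder(nn.Module):
    """Lightweight CNN mapping the 9-channel wavelet map to a token sequence F_f."""

    def __init__(self, dim: int = 1024, tokens_hw: int = 16):
        super().__init__()
        chans = [9, 64, 256, dim]
        blocks = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                       nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        self.cnn = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(tokens_hw)  # resize to the CLIP token grid

    def forward(self, w: torch.Tensor) -> torch.Tensor:  # w: (B, 9, H/2, W/2)
        x = self.pool(self.cnn(w))           # (B, dim, 16, 16)
        return x.flatten(2).transpose(1, 2)  # (B, N = 256, dim)
```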

3.2. Spatial-Frequency Cross-Domain Feature Fusion

To effectively fuse semantic (CLIP) and artifact (Wavelet) information, we designed a two-stage fusion mechanism:
(1) Cross-Attention Alignment. Using the spatial feature $F_s$ as the Query and the frequency-domain feature $F_f$ as both Key and Value, we compute the cross-domain attention:
$$Q = F_s W_Q, \quad K = F_f W_K, \quad V = F_f W_V, \qquad A = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{D}}\right)V, \quad A \in \mathbb{R}^{N \times D}$$
where $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$ are learnable projection matrices.
(2) Gated Feature Integration. Compared to simple addition or concatenation, gating mechanisms can adaptively suppress noise bands or semantically ambiguous regions. Here, we introduce a gating mechanism to dynamically weight information from both domains:
$$G = \sigma\left(\mathrm{MLP}([F_s; A])\right) \in \mathbb{R}^{N \times D}, \qquad F_{\mathrm{fused}} = G \odot F_s + (1 - G) \odot A,$$
where $[\cdot\,;\cdot]$ denotes channel concatenation, the MLP consists of two linear layers with a GELU activation, $\sigma$ is the Sigmoid function, and $\odot$ denotes element-wise multiplication. The gate value $G$ learns the relative importance of the spatial and frequency features for each token.
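A minimal PyTorch sketch of this two-stage fusion is given below. It follows the equations above with single-head cross-attention and a two-layer gate MLP with GELU; the paper does not state the number of attention heads or the gate's hidden width, so those are assumptions.

```python
import torch
import torch.nn as nn

class CrossDomainFusion(nn.Module):
    """Cross-attention alignment (F_s queries F_f) followed by gated integration."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim), nn.Sigmoid())
        self.scale = dim ** -0.5

    def forward(self, f_s: torch.Tensor, f_f: torch.Tensor) -> torch.Tensor:
        # (1) Cross-attention: spatial tokens attend to frequency tokens.
        q, k, v = self.w_q(f_s), self.w_k(f_f), self.w_v(f_f)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        a = attn @ v                                # aligned features A, (B, N, D)
        # (2) Gated integration: per-token weighting between F_s and A.
        g = self.gate(torch.cat([f_s, a], dim=-1))  # gate G, (B, N, D)
        return g * f_s + (1.0 - g) * a              # F_fused
```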

3.3. Swin Transformer Backbone

The fused spatial-frequency features are input into the Swin Transformer, which outputs multi-scale feature maps through progressive downsampling with shifted-window MSA, as shown in Figure 2.
The fused feature map $F_{\mathrm{fused}} \in \mathbb{R}^{H \times W \times C}$ is first projected into non-overlapping patches using a convolutional embedding layer:
$$X_0 = \mathrm{PatchEmbed}(F_{\mathrm{fused}})$$
where each patch is mapped into a token of dimension D. This operation preserves spatial correspondence while enabling efficient Transformer-based processing.
The embedded tokens are processed by a sequence of Swin-Tiny Transformer blocks, each consisting of window-based multi-head self-attention (W-MSA), shifted-window multi-head self-attention (SW-MSA), and a feed-forward network (FFN). Formally, the $l$-th Swin block is defined as:
$$\hat{X}^{l} = \mathrm{W\text{-}MSA}\left(\mathrm{LN}(X^{l})\right) + X^{l}, \qquad X^{l+1} = \mathrm{FFN}\left(\mathrm{LN}(\hat{X}^{l})\right) + \hat{X}^{l}$$
where LN denotes Layer Normalization.
We take only the final-layer output $Z_{\mathrm{final}}$ for classification:
$$z_{\mathrm{pool}} = \mathrm{GlobalAvgPool}(Z_{\mathrm{final}}) \in \mathbb{R}^{768}, \qquad \hat{y} = \mathrm{Sigmoid}\left(w^{\top}\,\mathrm{MLP}(z_{\mathrm{pool}}) + b\right) \in (0, 1)$$
The main loss function can be expressed as:
$$\mathcal{L}_{cls} = -\frac{1}{B}\sum_{i=1}^{B}\left[(1 - p_i)^{\gamma}\, y_i \log p_i + p_i^{\gamma}\,(1 - y_i)\log(1 - p_i)\right]$$
where $p_i = \hat{y}_i$, $\gamma = 2$ is the focusing parameter, and $y_i \in \{0, 1\}$ is the ground-truth label (0 = real, 1 = AI-generated).
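The classification head and loss can be sketched as follows. Global average pooling, the sigmoid output, and the focusing parameter γ = 2 follow the equations above, while the hidden width and activation of the head MLP are assumptions.

```python
import torch
import torch.nn as nn

class FocalBCELoss(nn.Module):
    """Binary focal loss with focusing parameter gamma, as in L_cls above."""

    def __init__(self, gamma: float = 2.0, eps: float = 1e-7):
        super().__init__()
        self.gamma, self.eps = gamma, eps

    def forward(self, p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # p: sigmoid probability of "AI-generated"; y in {0, 1}.
        p = p.clamp(self.eps, 1.0 - self.eps)
        pos = (1.0 - p) ** self.gamma * y * torch.log(p)
        neg = p ** self.gamma * (1.0 - y) * torch.log(1.0 - p)
        return -(pos + neg).mean()

class ClassificationHead(nn.Module):
    """Global average pooling over the final Swin tokens, then MLP + sigmoid."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.out = nn.Linear(dim, 1)

    def forward(self, z_final: torch.Tensor) -> torch.Tensor:  # (B, L, 768) tokens
        z_pool = z_final.mean(dim=1)  # GlobalAvgPool over tokens -> (B, 768)
        return torch.sigmoid(self.out(self.mlp(z_pool))).squeeze(-1)  # y_hat in (0, 1)
```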

4. Experiments and Results

4.1. Dataset

In terms of dataset selection, to verify the reliability and robustness of the model and to account for the impact of different types of generative models on the detector, this study used multiple datasets for the verification experiments: the common face dataset DFDC, as well as ForenSynths [26] and GenImage [27], which contain various types of images, as shown in Figure 3. Each dataset was split into 80% for model training and 20% for testing.
DFDC, the Deepfake Detection Challenge dataset, is a large-scale benchmark released by Meta to measure the progress of deepfake detection technology. It consists of more than 100,000 fake videos created from 19,154 real videos and fully accounts for the diversity of subjects and backgrounds encountered in real scenes (skin tone, gender, lighting conditions, etc.).
ForenSynths: This dataset contains fake images generated by 11 different convolutional neural network (CNN)-based image generator models. These models cover commonly used architectures today (ProGAN, StyleGAN, BigGAN, CycleGAN, StarGAN, GauGAN, DeepFakes, etc.).
GenImage is an AI-generated image detection dataset comprising millions of images. It utilizes the same 1000 categories as ImageNet, with synthetic images generated by Midjourney, Stable Diffusion, ADM, GLIDE, Wukong, VQDM, and BigGAN.

4.2. Evaluation Metrics and Implementation Details

Before model training, the training images were uniformly resized to 224 × 224 pixels and augmented using random flipping, random cropping, and JPEG compression. During training, the lightweight CNN used the AdamW optimizer with a learning rate of 1 × 10−4, while the Swin Transformer backbone used AdamW with a learning rate of 1 × 10−5. The batch size was set to 16, and the number of training epochs was 100. We used average precision (AP) and accuracy (ACC) as evaluation metrics. All experiments were run on a server equipped with dual NVIDIA RTX 3090 Ti GPUs, an AMD Threadripper 2950X CPU, and 64 GB of RAM.
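A sketch of this setup is shown below under stated assumptions: the JPEG quality range and the random-crop padding are ours, and `freq_encoder` and `swin_backbone` are hypothetical placeholders for the lightweight CNN encoder and the Swin Transformer backbone described in Section 3; only the optimizer choice, the two learning rates, the batch size of 16, and the 100 training epochs come from the text.

```python
import io
import random

from PIL import Image
from torch.optim import AdamW
from torchvision import transforms

def random_jpeg_compress(img: Image.Image, p: float = 0.5) -> Image.Image:
    """Randomly re-encode the image as JPEG to simulate compression artifacts."""
    if random.random() < p:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=random.randint(60, 95))  # quality range assumed
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img

# Augmentations named in the text: resize to 224 x 224, random flip, random crop,
# and JPEG compression (the crop padding is an assumption).
train_transform = transforms.Compose([
    transforms.Lambda(random_jpeg_compress),
    transforms.Resize((224, 224)),
    transforms.RandomCrop(224, padding=8, padding_mode="reflect"),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Separate AdamW learning rates: 1e-4 for the lightweight CNN encoder, 1e-5 for the
# Swin Transformer backbone. The module objects are hypothetical placeholders here.
optimizer = AdamW([
    {"params": freq_encoder.parameters(), "lr": 1e-4},
    {"params": swin_backbone.parameters(), "lr": 1e-5},
])
```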

4.3. Comparisons with State-of-the-Art Methods

Among AI-generated fake images, fake facial images dominate. Here, we first analyze model performance on the face-based DFDC dataset and, at the same time, test the impact of different modules on the overall network architecture. As shown in Table 1, we evaluated models using CLIP alone for spatial feature extraction, models combining CLIP-based spatial features with frequency-domain feature extraction, and models employing the feature fusion method with cross-domain attention and gating. By introducing spatial semantic analysis and frequency-domain features, and by focusing on anomalous features through cross-attention and gating, our model demonstrates outstanding performance in both accuracy and average precision.
We performed a frequency domain visualization comparison of real and synthetic facial images, as shown in Figure 4. Although current AI generation capabilities have nearly reached the level of fooling the human eye, subtle differences remain detectable in the frequency domain features. Concurrently, we visualized the model’s feature attention patterns.
In model testing experiments designed to detect deepfake images generated by GAN-based models, we used the ForenSynths dataset to evaluate images produced by various GAN models, as shown in Table 2. Compared to other state-of-the-art models, our model demonstrates superior robustness.
Compared to deepfake images generated using Generative Adversarial Networks (GANs), those produced by diffusion models are more difficult to detect. They exhibit more natural edge transitions and more concealed forgery traces, making them challenging for detection models based on a single network. As shown in Table 3, most detection models perform poorly when addressing these challenges. Our proposed multi-domain fusion Transformer network simultaneously searches for forgery traces in both the spatial domain’s semantic inconsistencies and the frequency domain’s high-frequency features of tampered images, significantly enhancing the model’s detection capabilities. Even when compared to other state-of-the-art models, our approach consistently demonstrates superior performance.
In AI-generated image authenticity detection tasks, operations such as image file compression, blurring, and resizing can significantly degrade a model’s performance, even leading to failure. As shown in Figure 5, our proposed model combines spatial, frequency, and semantic cues into a multi-dimensional representation, mitigating these problems to some extent and yielding more robust performance.
We conducted a visualization analysis of the regions of interest for real and fake images in model detection, especially for AI-generated images based on diffusion models, which are currently more difficult to distinguish from real images, as shown in Figure 6.

5. Conclusions

In this paper, we propose a novel cross-domain AI-generated image detection framework that integrates spatial-semantic information based on CLIP with frequency representations based on wavelets through a cross-attention fusion mechanism. Leveraging the hierarchical modeling capabilities of the Swin Transformer backbone, our method effectively captures the inherent local artifacts and global inconsistencies present in AI-generated images. Extensive experiments demonstrate that our approach outperforms existing methods across multiple benchmarks and generative models, particularly in cross-model generalization scenarios.

Author Contributions

Conceptualization, Q.M.; methodology and software, Q.M.; validation, Q.M. and Y.-I.C.; formal analysis, Q.M.; investigation, Q.M.; resources, Q.M. and Y.-I.C.; data curation, Q.M.; writing—original draft preparation, Q.M.; writing—review and editing, Q.M.; visualization, Q.M.; supervision, Q.M. and Y.-I.C.; project administration, Q.M. and Y.-I.C.; funding acquisition, Q.M. and Y.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Korea Institute of Marine Science & Technology Promotion (KIMST) from 2022 under the project entitled “Development and Demonstration of a Data Platform for AI-Based Safe Fishing Vessel Design” (RS-2022-KS221571). This work was also supported by the Ministry of Trade, Industry and Energy (MOTIE) and implemented by the Korea Institute for Advancement of Technology (KIAT) under the project entitled “Development of an International Standardization and Sustainability Integration Framework for AI Industry Internalization and Global Competitiveness Enhancement” (RS-2025-07372968). In addition, this work was supported by the Gachon University Research Fund in 2021 (GCU-202106340001).

Institutional Review Board Statement

All subjects gave their informed consent for inclusion before they participated in the study. Ethics approval is not required for this type of study. The study has been granted exemption by the Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark 1.0, Public Domain CC0 1.0, or U.S. Government Works license.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets utilized in this article are open-source and publicly available for researchers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, B.; Ju, X.; Xiao, B.; Ding, W.; Zheng, Y.; de Albuquerque, V.H.C. Locally GAN-generated face detection based on an improved Xception. Inf. Sci. 2021, 572, 16–28. [Google Scholar] [CrossRef]
  2. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. Mesonet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
  3. Guarnera, L.; Giudice, O.; Battiato, S. Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 16–18 June 2020; pp. 666–667. [Google Scholar]
  4. Gao, H.; Pei, J.; Huang, H. Progan: Network embedding via proximity generative adversarial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1308–1316. [Google Scholar]
  5. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 18–20 June 2019; pp. 4401–4410. [Google Scholar]
  6. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8110–8119. [Google Scholar]
  7. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  8. Li, H.; Zhou, J.; Li, Y.; Wu, B.; Li, B.; Dong, J. FreqBlender: Enhancing DeepFake Detection by Blending Frequency Knowledge. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  9. Luo, X.; Wang, Y. Frequency-Domain Masking and Spatial Interaction for Generalizable Deepfake Detection. Electronics 2025, 14, 1302. [Google Scholar] [CrossRef]
  10. Liu, X.; Xiao, W.; Lin, X.; He, S.; Huang, C.; Guo, D. Deepfake Detection via Spatial-Frequency Attention Network. IEEE Trans. Consum. Electron. 2025, 71, 9832–9841. [Google Scholar] [CrossRef]
  11. Li, J.; Yu, L.; Liu, R.; Xie, H. A Detail-Aware Transformer to Generalisable Face Forgery Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 3262–3275. [Google Scholar] [CrossRef]
  12. Wang, J.; Wu, Z.; Ouyang, W.; Han, X.; Chen, J.; Jiang, Y.-G.; Li, S.-N. M2tr: Multi-modal multi-scale transformers for deepfake detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 615–623. [Google Scholar]
  13. Liu, H.; Tan, Z.; Tan, C.; Wei, Y.; Wang, J.; Zhao, Y. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–21 June 2024; pp. 10770–10780. [Google Scholar]
  14. Yan, S.; Li, O.; Cai, J.; Hao, Y.; Jiang, X.; Hu, Y.; Xie, W. A sanity check for ai-generated image detection. arXiv 2024, arXiv:2406.19435. [Google Scholar] [CrossRef]
  15. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  16. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics networks for deepfake detection. In Handbook of Digital Face Manipulation and Detection: From DeepFakes to Morphing Attacks; Springer International Publishing: Cham, Switzerland, 2022; pp. 275–301. [Google Scholar]
  17. Chen, L.; Zhang, Y.; Song, Y.; Liu, L.; Wang, J. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 18710–18719. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar]
  19. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  20. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  21. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3207–3216. [Google Scholar]
  22. Ren, H.; Yan, A.; Ren, X.; Ye, P.-G.; Gao, C.-z.; Zhou, Z.; Li, J. Ganfinger: Gan-based fingerprint generation for deep neural network ownership verification. arXiv 2023, arXiv:2312.15617. [Google Scholar]
  23. Sun, R.; Yu, X.; Wang, F.; Da, Z.; Zhang, Y.; Gao, J. Frequency-Assisted Temporal Upsampling Artifacts Representation Learning for Face Forgery Detection. IEEE Trans. Biom. Behav. Identity Sci. 2025, 7, 728–739. [Google Scholar] [CrossRef]
  24. Zhou, K.; Sun, G.; Wang, J.; Wang, J.; Yu, L. FLAG: Frequency-based local and global network for face forgery detection. Multimed. Tools Appl. 2025, 84, 647–663. [Google Scholar] [CrossRef]
  25. Jia, S.; Ma, C.; Yao, T.; Yin, B.; Ding, S.; Yang, X. Exploring frequency adversarial attacks for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4103–4112. [Google Scholar]
  26. Wang, S.-Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 8695–8704. [Google Scholar]
  27. Zhu, M.; Chen, H.; Yan, Q.; Huang, X.; Lin, G.; Li, W.; Tu, Z.; Hu, H.; Hu, J.; Wang, Y. Genimage: A million-scale benchmark for detecting ai-generated image. Adv. Neural Inf. Process. Syst. 2023, 36, 77771–77782. [Google Scholar]
  28. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 86–103. [Google Scholar]
  29. Zhao, Y.; Jin, X.; Gao, S.; Wu, L.; Yao, S.; Jiang, Q. TAN-GFD: Generalizing face forgery detection based on texture information and adaptive noise mining. Appl. Intell. 2023, 53, 19007–19027. [Google Scholar] [CrossRef]
  30. Peng, S.; Zhang, T.; Gao, L.; Zhu, X.; Zhang, H.; Pang, K.; Lei, Z. Wmamba: Wavelet-based mamba for face forgery detection. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 4768–4777. [Google Scholar]
  31. Zhang, H.; He, Q.; Bi, X.; Li, W.; Liu, B.; Xiao, B. Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 23828–23837. [Google Scholar]
  32. Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; Holz, T. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 3247–3258. [Google Scholar]
  33. Jeong, Y.; Kim, D.; Min, S.; Joe, S.; Gwon, Y.; Choi, J. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 48–57. [Google Scholar]
  34. Jeong, Y.; Kim, D.; Ro, Y.; Choi, J. Frepgan: Robust deepfake detection using frequency-level perturbations. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 1060–1068. [Google Scholar]
  35. Ojha, U.; Li, Y.; Lee, Y.J. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 24480–24489. [Google Scholar]
  36. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 5052–5060. [Google Scholar]
Figure 1. Our proposed model framework.
Figure 2. Swin-Transformer internal main framework.
Figure 3. Images generated by different AI models.
Figure 4. Frequency domain features of real and fake facial images and regions of interest for artifact detection.
Figure 5. Robustness analysis of model performance across different image qualities.
Figure 6. Visual analysis of model attention focus for real images and AI-generated fake images (left: real image; right: AI-generated image).
Table 1. Comparison with state-of-the-art models on the AI-generated face forgery dataset (DFDC); values are accuracy (ACC) and average precision (AP).

Model                        ACC     AP
Xception                     66.3    68.3
F3-Net [28]                  75.7    76.0
TAN-GFD [29]                 84.3    85.8
WMamba [30]                  90.5    90.0
VIB-Net [31]                 93.8    93.2
Ours (CLIP)                  94.1    93.8
Ours (CLIP + F)              95.3    94.5
Ours (CLIP + F + A)          97.6    96.0
Ours (CLIP + F + A + G)      98.1    96.8
Table 2. Performance comparison with other state-of-the-art models on the GAN-based dataset, reported as accuracy (ACC)/average precision (AP).

Methods         ProGAN     StyleGAN   StyleGAN2  BigGAN     CycleGAN   StarGAN    GauGAN     Deepfake   Mean
Wang [26]       64.6/92.7  52.8/80.8  75.7/96.3  50.7/70.2  58.1/79.3  51.2/81.7  53.6/84.7  50.3/51.5  57.1/79.7
Frank [32]      85.7/81.3  73.1/68.5  75.0/70.9  76.9/70.8  86.5/80.8  85.0/77.0  67.3/65.3  50.1/55.3  75.0/71.2
F3-Net          87.8/82.4  80.3/84.7  82.2/87.9  65.5/73.4  81.2/89.7  87.8/90.4  57.0/59.5  59.9/83.0  75.2/81.4
BiHPF [33]      87.4/89.3  71.5/74.1  77.0/81.1  82.6/80.6  86.0/86.6  93.8/95.5  75.3/84.7  53.5/55.8  78.4/81.0
FrePGAN [34]    95.3/97.1  82.0/90.9  72.2/93.8  66.7/69.4  69.7/71.1  97.3/99.0  53.7/55.0  62.7/80.1  75.0/82.1
UniFD [35]      98.3/99.8  78.5/92.8  75.4/96.0  89.1/94.7  91.9/98.0  96.1/99.3  92.6/98.3  80.8/90.2  88.1/96.1
FreqNet [36]    99.2/99.9  90.4/98.0  85.8/98.3  89.7/96.4  96.7/99.1  97.5/99.4  88.3/98.9  81.9/92.7  91.2/98.0
FatFormer       99.6/99.9  78.7/97.5  75.7/97.1  96.3/98.9  98.1/99.4  98.8/99.6  95.5/98.7  89.3/95.7  91.5/98.4
VIB-Net         89.4/96.6  82.1/94.9  89.8/97.2  92.5/98.2  97.6/98.4  95.7/97.6  96.6/98.4  92.7/93.3  92.1/96.8
Ours (C + F)    97.9/98.1  93.2/96.7  88.8/97.0  96.3/98.4  98.2/99.0  97.3/98.6  96.1/97.9  93.3/95.4  95.1/97.6
Ours            98.9/99.0  96.3/97.5  91.7/99.1  98.5/99.1  98.5/99.4  98.7/99.5  97.8/98.9  95.3/98.6  96.9/98.9
Table 3. Accuracy and average precision comparisons with state-of-the-art methods on the diffusion model dataset, reported as ACC/AP.

Methods         PNDM       Guided     DALL-E     VQ-Diffusion   Mean
Wang [26]       50.8/90.3  54.9/66.6  51.8/61.3  50.0/71.0      51.8/72.3
Frank [32]      44.0/38.2  53.4/52.7  57.1/62.8  52.0/66.3      51.6/55.0
F3-Net          72.8/80.5  69.7/72.1  72.3/80.0  91.8/94.7      76.7/81.8
UniFD           75.3/92.5  75.7/85.1  89.5/96.8  83.5/97.7      81.0/93.0
FreqNet         89.3/97.0  81.2/92.0  94.8/98.3  92.0/97.3      89.3/96.2
FatFormer       92.5/94.2  76.8/91.7  95.3/99.0  95.4/99.1      90.0/96.0
VIB-Net         94.9/97.1  85.1/88.9  97.0/98.4  96.5/97.8      93.4/95.6
Ours (C + F)    95.7/97.5  85.8/90.7  97.8/98.8  97.3/97.6      94.2/96.2
Ours            96.5/98.3  87.3/91.9  98.5/99.1  98.0/98.3      95.1/96.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
