Transformer Based on Multi-Domain Feature Fusion for AI-Generated Image Detection
Abstract
1. Introduction
- We propose a multimodal backbone network based on Transformers and CLIP that captures the semantic inconsistencies inherent in AI-generated images.
- We introduce the discrete wavelet transform (DWT) for multi-scale frequency analysis, extracting distinctive features from different sub-bands to increase sensitivity to generation traces.
- We design an efficient feature fusion mechanism that integrates semantic and frequency-domain features into complementary representations. Extensive experiments validate the superior performance and robustness of our method across various generative models and under diverse perturbation conditions.
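As an illustration of the multi-scale frequency analysis named in the second contribution, the following minimal sketch computes a single-level 2D Haar DWT and then recurses on the approximation band. It is not the authors' implementation: the choice of the Haar wavelet, the number of levels, the list-of-lists image representation, and the names `haar_dwt2`, `multiscale_haar`, and `band_energy` are all assumptions made for this example, and the LH/HL labels follow only one of several naming conventions in use.

```python
def haar_dwt2(image):
    """Single-level 2D Haar DWT of a list-of-lists image (even height and width).

    Assumption: the Haar wavelet is used purely for illustration.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must both be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = image[i][j], image[i][j + 1]          # top-left, top-right
            c, d = image[i + 1][j], image[i + 1][j + 1]  # bottom-left, bottom-right
            ll_row.append((a + b + c + d) / 2.0)  # approximation: scaled sum of the 2x2 block
            lh_row.append((a + b - c - d) / 2.0)  # top minus bottom: change across rows
            hl_row.append((a - b + c - d) / 2.0)  # left minus right: change across columns
            hh_row.append((a - b - c + d) / 2.0)  # diagonal detail
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def multiscale_haar(image, levels=2):
    """Apply haar_dwt2 repeatedly to the approximation band.

    Returns the final approximation and a list of (lh, hl, hh) tuples, one per level.
    """
    details = []
    approx = image
    for _ in range(levels):
        approx, lh, hl, hh = haar_dwt2(approx)
        details.append((lh, hl, hh))
    return approx, details


def band_energy(band):
    """Sum of squared coefficients in one sub-band."""
    return sum(v * v for row in band for v in row)


# Hypothetical 8x8 test pattern, for illustration only.
image = [[float((r * 8 + c) % 5) for c in range(8)] for r in range(8)]
approx, details = multiscale_haar(image, levels=2)
for level, (lh, hl, hh) in enumerate(details, start=1):
    print(level, band_energy(lh), band_energy(hl), band_energy(hh))
```

Plain Python lists keep the sketch dependency-free; an array library would be the natural choice in practice. With the 1/2 scaling used here the transform is orthonormal, so the total energy of the four sub-bands equals the energy of the input, and per-band energy is one simple summary of where an image's detail is concentrated.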
2. Related Works
3. Method
3.1. Feature Extraction
3.2. Spatial-Frequency Cross-Domain Feature Fusion
3.3. Swin Transformer Backbone
4. Experiments and Results
4.1. Dataset
4.2. Evaluation Metrics and Implementation Details
4.3. Comparisons with State-of-the-Art Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chen, B.; Ju, X.; Xiao, B.; Ding, W.; Zheng, Y.; de Albuquerque, V.H.C. Locally GAN-generated face detection based on an improved Xception. Inf. Sci. 2021, 572, 16–28. [Google Scholar] [CrossRef]
- Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. Mesonet: A compact facial video forgery detection network. In Proceedings of the 2018 IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
- Guarnera, L.; Giudice, O.; Battiato, S. Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 16–18 June 2020; pp. 666–667. [Google Scholar]
- Gao, H.; Pei, J.; Huang, H. Progan: Network embedding via proximity generative adversarial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1308–1316. [Google Scholar]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 18–20 June 2019; pp. 4401–4410. [Google Scholar]
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8110–8119. [Google Scholar]
- Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
- Hanzhe, L.; Zhou, J.; Li, Y.; Wu, B.; Li, B.; Dong, J. FreqBlender: Enhancing DeepFake Detection by Blending Frequency Knowledge. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
- Luo, X.; Wang, Y. Frequency-Domain Masking and Spatial Interaction for Generalizable Deepfake Detection. Electronics 2025, 14, 1302. [Google Scholar] [CrossRef]
- Liu, X.; Xiao, W.; Lin, X.; He, S.; Huang, C.; Guo, D. Deepfake Detection via Spatial-Frequency Attention Network. IEEE Trans. Consum. Electron. 2025, 71, 9832–9841. [Google Scholar] [CrossRef]
- Li, J.; Yu, L.; Liu, R.; Xie, H. A Detail-Aware Transformer to Generalisable Face Forgery Detection. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 3262–3275. [Google Scholar] [CrossRef]
- Wang, J.; Wu, Z.; Ouyang, W.; Han, X.; Chen, J.; Jiang, Y.-G.; Li, S.-N. M2tr: Multi-modal multi-scale transformers for deepfake detection. In Proceedings of the 2022 International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 615–623. [Google Scholar]
- Liu, H.; Tan, Z.; Tan, C.; Wei, Y.; Wang, J.; Zhao, Y. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–21 June 2024; pp. 10770–10780. [Google Scholar]
- Yan, S.; Li, O.; Cai, J.; Hao, Y.; Jiang, X.; Hu, Y.; Xie, W. A sanity check for ai-generated image detection. arXiv 2024, arXiv:2406.19435. [Google Scholar] [CrossRef]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Nguyen, H.H.; Yamagishi, J.; Echizen, I. Capsule-forensics networks for deepfake detection. In Handbook of Digital Face Manipulation and Detection: From DeepFakes to Morphing Attacks; Springer International Publishing: Cham, Switzerland, 2022; pp. 275–301. [Google Scholar]
- Chen, L.; Zhang, Y.; Song, Y.; Liu, L.; Wang, J. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 18710–18719. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
- Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3207–3216. [Google Scholar]
- Ren, H.; Yan, A.; Ren, X.; Ye, P.-G.; Gao, C.-z.; Zhou, Z.; Li, J. Ganfinger: Gan-based fingerprint generation for deep neural network ownership verification. arXiv 2023, arXiv:2312.15617. [Google Scholar]
- Sun, R.; Yu, X.; Wang, F.; Da, Z.; Zhang, Y.; Gao, J. Frequency-Assisted Temporal Upsampling Artifacts Representation Learning for Face Forgery Detection. IEEE Trans. Biom. Behav. Identity Sci. 2025, 7, 728–739. [Google Scholar] [CrossRef]
- Zhou, K.; Sun, G.; Wang, J.; Wang, J.; Yu, L. FLAG: Frequency-based local and global network for face forgery detection. Multimed. Tools Appl. 2025, 84, 647–663. [Google Scholar] [CrossRef]
- Jia, S.; Ma, C.; Yao, T.; Yin, B.; Ding, S.; Yang, X. Exploring frequency adversarial attacks for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4103–4112. [Google Scholar]
- Wang, S.-Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 8695–8704. [Google Scholar]
- Zhu, M.; Chen, H.; Yan, Q.; Huang, X.; Lin, G.; Li, W.; Tu, Z.; Hu, H.; Hu, J.; Wang, Y. Genimage: A million-scale benchmark for detecting ai-generated image. Adv. Neural Inf. Process. Syst. 2023, 36, 77771–77782. [Google Scholar]
- Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 86–103. [Google Scholar]
- Zhao, Y.; Jin, X.; Gao, S.; Wu, L.; Yao, S.; Jiang, Q. TAN-GFD: Generalizing face forgery detection based on texture information and adaptive noise mining. Appl. Intell. 2023, 53, 19007–19027. [Google Scholar] [CrossRef]
- Peng, S.; Zhang, T.; Gao, L.; Zhu, X.; Zhang, H.; Pang, K.; Lei, Z. Wmamba: Wavelet-based mamba for face forgery detection. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 4768–4777. [Google Scholar]
- Zhang, H.; He, Q.; Bi, X.; Li, W.; Liu, B.; Xiao, B. Towards Universal AI-Generated Image Detection by Variational Information Bottleneck Network. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 23828–23837. [Google Scholar]
- Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; Holz, T. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 3247–3258. [Google Scholar]
- Jeong, Y.; Kim, D.; Min, S.; Joe, S.; Gwon, Y.; Choi, J. Bihpf: Bilateral high-pass filters for robust deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 48–57. [Google Scholar]
- Jeong, Y.; Kim, D.; Ro, Y.; Choi, J. Frepgan: Robust deepfake detection using frequency-level perturbations. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 1060–1068. [Google Scholar]
- Ojha, U.; Li, Y.; Lee, Y.J. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 24480–24489. [Google Scholar]
- Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Liu, P.; Wei, Y. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 5052–5060. [Google Scholar]






| Models | DFDC ACC | DFDC AP |
|---|---|---|
| Xception | 66.3 | 68.3 |
| F3-Net [28] | 75.7 | 76.0 |
| TAN-GFD [29] | 84.3 | 85.8 |
| WMamba [30] | 90.5 | 90.0 |
| VIB-Net [31] | 93.8 | 93.2 |
| Ours (CLIP) | 94.1 | 93.8 |
| Ours (CLIP + F) | 95.3 | 94.5 |
| Ours (CLIP + F + A) | 97.6 | 96.0 |
| Ours (CLIP + F + A + G) | 98.1 | 96.8 |
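
The ACC and AP columns above are standard binary-classification metrics. The sketch below shows one common way to compute them from per-image detector scores. The 0.5 threshold, the label convention (1 = generated, 0 = real), the function names, and the sample data are assumptions for illustration and are not taken from the paper's implementation details.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label.

    Assumption: label 1 means generated, label 0 means real.
    """
    correct = sum(int(s >= threshold) == y for y, s in zip(labels, scores))
    return correct / len(labels)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample.

    Samples are ranked by descending score; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` samples
    if hits == 0:
        raise ValueError("average precision is undefined without positive samples")
    return precision_sum / hits


# Hypothetical labels and detector scores, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(f"ACC = {100 * accuracy(labels, scores):.1f}")
print(f"AP  = {100 * average_precision(labels, scores):.1f}")
```

ACC depends on the chosen threshold, whereas AP summarizes ranking quality across all thresholds, which is one reason the two are often reported together.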
| Methods | ProGAN | StyleGAN | StyleGAN2 | BigGAN | CycleGAN | StarGAN | GauGAN | Deepfake | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Wang | 64.6/92.7 | 52.8/80.8 | 75.7/96.3 | 50.7/70.2 | 58.1/79.3 | 51.2/81.7 | 53.6/84.7 | 50.3/51.5 | 57.1/79.7 |
| Frank [32] | 85.7/81.3 | 73.1/68.5 | 75.0/70.9 | 76.9/70.8 | 86.5/80.8 | 85.0/77.0 | 67.3/65.3 | 50.1/55.3 | 75.0/71.2 |
| F3-Net | 87.8/82.4 | 80.3/84.7 | 82.2/87.9 | 65.5/73.4 | 81.2/89.7 | 87.8/90.4 | 57.0/59.5 | 59.9/83.0 | 75.2/81.4 |
| BiHPF [33] | 87.4/89.3 | 71.5/74.1 | 77.0/81.1 | 82.6/80.6 | 86.0/86.6 | 93.8/95.5 | 75.3/84.7 | 53.5/55.8 | 78.4/81.0 |
| FrePGAN [34] | 95.3/97.1 | 82.0/90.9 | 72.2/93.8 | 66.7/69.4 | 69.7/71.1 | 97.3/99.0 | 53.7/55.0 | 62.7/80.1 | 75.0/82.1 |
| UniFD [35] | 98.3/99.8 | 78.5/92.8 | 75.4/96.0 | 89.1/94.7 | 91.9/98.0 | 96.1/99.3 | 92.6/98.3 | 80.8/90.2 | 88.1/96.1 |
| FreqNet [36] | 99.2/99.9 | 90.4/98.0 | 85.8/98.3 | 89.7/96.4 | 96.7/99.1 | 97.5/99.4 | 88.3/98.9 | 81.9/92.7 | 91.2/98.0 |
| FatFormer | 99.6/99.9 | 78.8.7/97.5 | 75.7/97.1 | 96.3/98.9 | 98.1/99.4 | 98.8/99.6 | 95.5/98.7 | 89.3/95.7 | 91.5/98.4 |
| VIB-Net | 89.4/96.6 | 82.1/94.9 | 89.8/97.2 | 92.5/98.2 | 97.6/98.4 | 95.7/97.6 | 96.6/98.4 | 92.7/93.3 | 92.1/96.8 |
| Ours (C + F) | 97.9/98.1 | 93.2/96.7 | 88.8/97.0 | 96.3/98.4 | 98.2/99.0 | 97.3/98.6 | 96.1/97.9 | 93.3/95.4 | 95.1/97.6 |
| Ours | 98.9/99.0 | 96.3/97.5 | 91.7/99.1 | 98.5/99.1 | 98.5/99.4 | 98.7/99.5 | 97.8/98.9 | 95.3/98.6 | 96.9/98.9 |
| Methods | PNDM | Guided | DALL-E | VQ-Diffusion | Mean |
|---|---|---|---|---|---|
| Wang | 50.8/90.3 | 54.9/66.6 | 51.8/61.3 | 50.0/71.0 | 51.8/72.3 |
| Frank | 44.0/38.2 | 53.4/52.7 | 57.1/62.8 | 52.0/66.3 | 51.6/55.0 |
| F3-Net | 72.8/80.5 | 69.7/72.1 | 72.3/80.0 | 91.8/94.7 | 76.7/81.8 |
| UniFD | 75.3/92.5 | 75.7/85.1 | 89.5/96.8 | 83.5/97.7 | 81.0/93.0 |
| FreqNet | 89.3/97.0 | 81.2/92.0 | 94.8/98.3 | 92.0/97.3 | 89.3/96.2 |
| FatFormer | 92.5/94.2 | 76.8/91.7 | 95.3/99.0 | 95.4/99.1 | 90.0/96.0 |
| VIB-Net | 94.9/97.1 | 85.1/88.9 | 97.0/98.4 | 96.5/97.8 | 93.4/95.6 |
| Ours (C + F) | 95.7/97.5 | 85.8/90.7 | 97.8/98.8 | 97.3/97.6 | 94.2/96.2 |
| Ours | 96.5/98.3 | 87.3/91.9 | 98.5/99.1 | 98.0/98.3 | 95.1/96.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Man, Q.; Cho, Y.-I. Transformer Based on Multi-Domain Feature Fusion for AI-Generated Image Detection. Electronics 2026, 15, 716. https://doi.org/10.3390/electronics15030716

