Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers
Abstract
1. Introduction
2. Materials and Methods
2.1. Model Overview
2.2. Training Setup
2.3. Prototype-Based Few-Shot Inference Using SigLIP-2
2.3.1. Robustness Through Test-Time Augmentation
2.3.2. Threshold Selection and Performance Evaluation
3. Results
3.1. Training Progress
3.2. Video Classification Model Performance
3.3. Prediction Confidence Distribution
3.4. Comparative Analysis with HiDF Baselines
3.4.1. SIDA Dataset
3.4.2. CIFake Dataset
3.5. Few-Shot Inference and Cross Validation on Different Datasets
4. Discussion and Conclusions
4.1. Comparison with Prior Work and Benchmark Performance
4.2. Limitations and Failure Modes
4.3. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Bregler, C.; Covell, M.; Slaney, M. Video Rewrite: Driving Visual Speech with Audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH); ACM: New York, NY, USA, 1997. [Google Scholar] [CrossRef]
- Thies, J.; Zollhöfer, M.; Stamminger, M.; Theobalt, C.; Nießner, M. Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 2387–2395. [Google Scholar] [CrossRef]
- Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 2017, 36, 95. [Google Scholar] [CrossRef]
- Blanz, V.; Vetter, T. A morphable model for the synthesis of 3D faces. In Proceedings of SIGGRAPH; ACM: New York, NY, USA, 1999; pp. 187–194. [Google Scholar]
- Guo, Y.; He, W.; Zhu, J.; Li, C. A light autoencoder networks for face swapping. In Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence; ACM: New York, NY, USA, 2018; pp. 459–462. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 4401–4410. [Google Scholar] [CrossRef]
- Choi, Y.; Choi, M.; Kim, M.; Ha, J.-W.; Kim, S.; Choo, J. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar] [CrossRef]
- Yermakov, A.; Cech, J.; Matas, J.; Fritz, M. Deepfake detection that generalizes across benchmarks. arXiv 2025, arXiv:2508.06248. [Google Scholar] [CrossRef]
- Kaddar, B.; Fezza, S.A.; Akhtar, Z.; Hamidouche, W.; Hadid, A.; Serra-Sagristá, J. Deepfake detection using spatio-temporal transformer. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 345. [Google Scholar] [CrossRef]
- Raza, M.A.; Malik, K.M.; Haq, I.U. HolisticDFD: Infusing spatiotemporal transformer embeddings for deepfake detection. Inf. Sci. 2023, 645, 119352. [Google Scholar] [CrossRef]
- Xu, Y.; Liang, J.; Jia, G.; Yang, Z.; Zhang, Y.; He, R. TALL: Thumbnail layout for deepfake video detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 22658–22668. [Google Scholar]
- Guan, J.; Zhou, H.; Hong, Z.; Ding, E.; Wang, J.; Quan, C.; Zhao, Y. Delving into sequential patches for deepfake detection. Adv. Neural Inf. Process. Syst. 2022, 35, 4517–4530. [Google Scholar]
- Liu, H.; Tan, Z.; Tan, C.; Wei, Y.; Wang, J.; Zhao, Y. Forgery-aware adaptive transformer for generalizable synthetic image detection (FatFormer). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024. [Google Scholar]
- Smeu, S.; Oneata, E.; Oneata, D. DeCLIP: Decoding CLIP representations for deepfake localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025; IEEE: New York, NY, USA, 2025; pp. 149–159. [Google Scholar]
- Gupta, P.; Ghosh, S.; Gedeon, T.; Do, T.T.; Dhall, A. Multiverse through deepfakes: The MultiFakeVerse dataset of person-centric visual and conceptual manipulations. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; Association for Computing Machinery, ACM: New York, NY, USA, 2025. [Google Scholar]
- Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The DeepFake Detection Challenge (DFDC) dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar] [CrossRef]
- Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020. [Google Scholar]
- Kang, C.; Jeong, S.; Lee, J.; Choi, D.; Woo, S.S.; Han, J. HiDF: A human-indistinguishable deepfake dataset. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2025. [Google Scholar] [CrossRef]
- Huang, Z.; Hu, J.; Li, X.; He, Y.; Zhao, X.; Peng, B.; Wu, B.; Huang, X.; Cheng, G. SIDA: Social media image deepfake detection, localization and explanation with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2025; pp. 28831–28841. [Google Scholar]
- Bird, J.J.; Lotfi, A. CIFAKE: Image classification and explainable identification of AI-generated synthetic images. IEEE Access 2024, 12, 15642–15650. [Google Scholar] [CrossRef]
- Sala, A. AI vs. Human-Generated Images Dataset. Kaggle Datasets 2024. Available online: https://www.kaggle.com/datasets/alessandrasala79/ai-vs-human-generated-dataset (accessed on 10 March 2026).
- Prithiv Sakthi U R. Deepfake-vs-Real-Classification Dataset. Kaggle Datasets 2025. Available online: https://www.kaggle.com/datasets/prithivsakthiur/deepfake-vs-real-60k (accessed on 10 March 2026).
- Smith, M.S. Real-time audio deepfakes have arrived: A cybersecurity firm has created convincing voices on the fly. IEEE Spectrum, 21 October 2025. [Google Scholar]
- Momin, M.D.S.; Sufian, A.; Barman, D.; Leo, M.; Distante, C.; Damer, N. Explainable deepfake detection across different modalities: An overview of methods and challenges. Image Vis. Comput. 2025, 163, 105738. [Google Scholar] [CrossRef]

| Type | Model | AP | AUC | ΔAP vs. Baseline | ΔAUC vs. Baseline |
|---|---|---|---|---|---|
| Video | EB4 (best baseline) | 0.712 | 0.733 | - | - |
| Video | Proposed Model | 0.925 | 0.931 | +0.213 | +0.198 |
| Image | EB4 (best baseline) | 0.722 | 0.697 | - | - |
| Image | Proposed Model | 0.972 | 0.968 | +0.250 | +0.271 |
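The AP and AUC figures above are threshold-free ranking metrics over per-sample fake-probability scores. As a minimal, self-contained sketch of how they can be computed (the labels and scores below are illustrative, not taken from the paper's data):

```python
def auc_score(labels, scores):
    """AUC via the rank (Mann-Whitney) statistic: the probability that a
    randomly chosen fake (label 1) outranks a randomly chosen real (label 0),
    counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """AP: mean of precision evaluated at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / sum(labels)

# Illustrative scores: 1 = fake, 0 = real
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))          # 0.75
print(average_precision(labels, scores))  # ~0.833
```

The ΔAP and ΔAUC columns are then simple differences between the proposed model's scores and the best baseline's.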
| Method | Year | Real (Acc/F1) | Synthetic (Acc/F1) | Tampered (Acc/F1) | Overall (Acc/F1) |
|---|---|---|---|---|---|
| AntifakePrompt | 2024 | 64.8/78.6 | 93.8/96.8 | 30.8/47.2 | 63.1/74.2 |
| CNNSpot | 2021 | 79.8/88.7 | 39.5/56.6 | 6.9/12.9 | 42.1/52.7 |
| FreDect | 2020 | 83.7/91.1 | 16.8/28.8 | 11.9/21.3 | 37.4/47.0 |
| Gram-Net | 2020 | 70.1/82.4 | 93.5/96.6 | 0.8/1.6 | 54.8/60.2 |
| UnivFD | 2023 | 68.0/67.4 | 62.1/87.5 | 64.0/85.3 | 64.7/80.0 |
| LGrad | 2023 | 64.8/78.6 | 83.5/91.0 | 6.8/12.7 | 51.7/60.7 |
| LNP | 2023 | 71.2/83.2 | 91.8/95.7 | 2.9/5.7 | 55.3/61.5 |
| SIDA-7B | 2024 | 89.1/91.0 | 98.7/98.6 | 92.7/91.0 | 93.5/93.5 |
| SIDA-13B | 2024 | 89.6/91.1 | 98.5/98.7 | 92.9/91.2 | 93.6/93.5 |
| Proposed Model | 2025 | 99.0/99.0 | 98.6/99.3 | 99.6/98.9 | 99.1/99.1 |
| Method | Year | F1 | IoU |
|---|---|---|---|
| MVSS-Net | 2023 | 31.6 | 23.7 |
| HIFI-Net | 2023 | 45.9 | 21.1 |
| PSCC-Net | 2022 | 71.3 | 35.7 |
| LISA-7B-v1 | 2024 | 69.1 | 32.5 |
| SIDA-7B | 2024 | 73.9 | 43.8 |
| Proposed Model | 2025 | 82.2 | 74.3 |
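The localization F1 and IoU values above are computed pixel-wise between predicted and ground-truth manipulation masks. A minimal sketch over flattened binary masks (the masks below are illustrative, not the paper's evaluation code):

```python
def mask_metrics(pred, gt):
    """Pixel-wise IoU and F1 for flattened binary masks (1 = manipulated)."""
    tp = sum(p & g for p, g in zip(pred, gt))        # predicted and true
    fp = sum(p & (1 - g) for p, g in zip(pred, gt))  # predicted, not true
    fn = sum((1 - p) & g for p, g in zip(pred, gt))  # true, missed
    denom = tp + fp + fn
    iou = tp / denom if denom else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if denom else 1.0
    return iou, f1

pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
print(mask_metrics(pred, gt))  # (0.5, ~0.667)
```

Note that F1 = 2·IoU/(1 + IoU) for any single mask pair, which is why F1 is always at least as large as IoU in the table.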
| Dataset | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) | Total Samples |
|---|---|---|---|---|---|
| AI vs. Human Generated Images [23] | 0.8917 | 0.8923 | 0.8917 | 0.8917 | 79,850 |
| Deepfake-vs-Real-Classification [24] | 0.9820 | 0.9819 | 0.9819 | 0.9820 | 91,234 |
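The macro-averaged precision, recall, and F1 above weight each class equally regardless of how many samples it contains, so a model cannot score well by favoring the majority class. A minimal sketch of the computation (the labels below are illustrative):

```python
def macro_scores(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n

y_true = ["real", "real", "fake", "fake"]
y_pred = ["real", "fake", "fake", "fake"]
print(macro_scores(y_true, y_pred))
```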
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Soundararajan, J.; Xu, D. Deepfake Detection Using Multimodal CLIP-Based SigLIP-2 Vision Transformers. AI 2026, 7, 115. https://doi.org/10.3390/ai7030115