A Multimodal Fake News Detection Method Based on Contrastive Learning and Variational Autoencoder
Abstract
1. Introduction
- In loss function design, we construct a cross-modal fake news detection algorithm based on contrastive learning. By employing both supervised and unsupervised contrastive learning approaches, we capture correlations among multimodal features to enhance the model’s generalization capability.
- For auxiliary task design, we incorporate a variational autoencoder (VAE) into the detection framework to process features. By progressively learning reasonable latent representations via KL divergence, the model adaptively aggregates unimodal and multimodal features, thereby improving interpretability.
- We enhance feature extraction for text and images by integrating LSTM (Long Short-Term Memory) networks and the CBAM (Convolutional Block Attention Module) attention mechanism.
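
The two contrastive objectives named in the first contribution can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: the function names, the temperature value, the symmetric InfoNCE form for the unsupervised term, and the SupCon-style form [21] for the supervised term are illustrative choices, not confirmed details of the proposed model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_modal_info_nce(text_feats, image_feats, temperature=0.1):
    """Unsupervised term (assumed form): text i and image i of the same
    post are the positive pair; all other pairings in the batch are negatives."""
    t = [l2_normalize(f) for f in text_feats]
    v = [l2_normalize(f) for f in image_feats]
    n = len(t)
    sim = [[dot(t[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]  # transpose
    text_to_image = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    image_to_text = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_image + image_to_text)


def supervised_contrastive(feats, labels, temperature=0.1):
    """Supervised term (assumed SupCon-style form): samples that share a
    label (fake/real) are positives for each other."""
    z = [l2_normalize(f) for f in feats]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = [dot(z[i], z[j]) / temperature for j in others]
        log_probs = dict(zip(others, log_softmax(logits)))
        total += -sum(log_probs[j] for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

Both functions return a small value when matching pairs (or same-label samples) are close in feature space and a large value when they are not, which is the behavior a contrastive auxiliary loss relies on.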
2. Related Work
3. Methods
3.1. Model Structure
3.2. Enhancement of Feature Extraction Module
3.3. Loss Calculation Method of Dual Contrastive Learning
3.4. Variational Autoencoder
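
As a generic illustration of the KL term that a VAE regularizer of this kind optimizes, the sketch below uses the standard closed form for a diagonal Gaussian posterior against a standard normal prior. The function names and the `beta` weight are hypothetical; the snippet shows the textbook objective, not the exact loss used in this paper.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form:
    -0.5 * sum(1 + logvar - mu^2 - exp(logvar))."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def vae_objective(recon_error, mu, logvar, beta=1.0):
    """Reconstruction error plus a weighted KL penalty.
    `beta` is an assumed trade-off weight, not a value from the paper."""
    return recon_error + beta * kl_to_standard_normal(mu, logvar)
```

The KL term is zero when the posterior already equals the prior and grows as the latent mean or variance drifts away from it, which is what keeps the learned latent representation well behaved.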
3.5. Multimodal Feature Fusion Strategy
4. Experiments and Result Analysis
4.1. Datasets
4.2. Parameter Settings
4.3. Comparison Experiments
- Text-GRU: A text-only baseline that extracts features with a bidirectional Gated Recurrent Unit network (Bi-GRU) to capture word-level semantics, then classifies the news from those features.
- Image-VGG: An image-only baseline that extracts features with the VGG-19 convolutional neural network and passes them through a fully connected layer to obtain the classification result.
- EANN [26]: An end-to-end rumor detection model based on adversarial neural networks. It extracts local text features with Text-CNN and image features with VGG-19, fuses the two, and trains the fused representation adversarially to detect rumors.
- MKEMN [27]: A model that utilizes a multimodal knowledge-aware network to fuse multimodal information such as text and images, and employs an event memory network to capture the development context of news events, achieving fake news recognition through their collaboration.
- SAFE [15]: A model that proposes an innovative multimodal similarity calculation method that can jointly learn the representations of text and visual information and their relationships. By designing a reasonable similarity measurement function, it calculates the similarity between text and images in the semantic space to determine the authenticity of news.
- MFCD [28]: A fake news detection model based on multi-level fusion that addresses insufficient inter-modal fusion and excessive reliance on complete multimodal input. It fuses text and image information at several levels, exploiting the complementarity across levels while reducing the dependence on every modality being present, which improves adaptability in practical applications.
- MMCSC [29]: A model that extracts high-level semantic features from text and images with deep learning models, then measures cross-modal consistency in topic and sentiment between the two modalities to judge the authenticity of news.
- LIIMR [30]: A model that combines intra-modal and inter-modal analysis. The intra-modal branch extracts and analyzes text and image features independently to mine patterns within each modality; the inter-modal branch fuses text and image features to exploit their complementary information.
- MCNN [16]: A multimodal consistency model that captures the overall features of social media posts. A multimodal fusion layer combines text and image information, and a consistency constraint keeps the modalities aligned during fusion, yielding more representative overall features for fake news detection.
- CAFE [25]: A model that maps heterogeneous unimodal features to a shared semantic space using a mapping function, and designs an ambiguity estimation module to evaluate and handle potential ambiguities between different modalities, improving detection reliability.
- BDANN [31]: A model that encodes text with the pre-trained language model BERT and images with the pre-trained VGG-19 model, then introduces a domain classifier to reduce the dependence of the learned features on specific events.
- MVAE [14]: A model that encodes text and image data with a variational autoencoder to learn a shared representation capturing the correlations between them, then feeds that representation into a news classifier to detect fake news.
- MMF [8]: A fake news detection model based on a shared network and contrastive learning. It fully integrates text and visual information through a graph convolutional neural network, uses a shared representation module to extract fine-grained representations for richer multimodal information, and introduces two different types of contrastive learning as auxiliary tasks to enable the model to better learn correlations between samples of the same category.
4.4. Out-of-Distribution Evaluation on AI-Generated and Niche Data
4.5. Ablation Experiment
4.6. Parameter Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Liu, H.L.; Chen, S.H.; Cao, S.J.; Zhu, J.L.; Ren, Q.Q. Research on Fake News Detection Based on Multimodal Learning. J. Front. Comput. Sci. Technol. 2023, 17, 2022–2029. [Google Scholar]
- Liu, H.; Wang, W.; Li, H. Interpretable Multimodal Misinformation Detection with Logic Reasoning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023; pp. 9781–9796. [Google Scholar]
- Wang, L.; Zhang, C.; Xu, H.; Xu, S.; Xu, B. Cross-modal Contrastive Learning for Multimodal Fake News Detection. In Proceedings of the 31st ACM International Conference on Multimedia (MM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5696–5704. [Google Scholar]
- Cao, B.; Wu, Q.; Cao, J.; Liu, B.; Gui, J. External Reliable Information-enhanced Multimodal Contrastive Learning for Fake News Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; pp. 31–39. [Google Scholar]
- Chen, W.; Cai, F.; Guo, Y.; Pan, Z.; Chen, W.; Zhang, Y. Contrastive Learning of Cross-Modal Information Enhancement for Multimodal Fake News Detection. Complex Intell. Syst. 2025, 11, 303. [Google Scholar] [CrossRef]
- Hu, S.; Hu, J.; Zhang, H. Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 1426–1440. [Google Scholar]
- Shen, L.; Long, Y.; Cai, X.; Razzak, I.; Chen, G.; Liu, K.; Jameel, S. GAMED: Knowledge Adaptive Multi-Experts Decoupling for Multimodal Fake News Detection. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM), Hannover, Germany, 10–14 March 2025; pp. 586–595. [Google Scholar]
- Du, Z.; Wang, H.; Liu, J. Multimodal Fake News Detection Integrating Shared Representation and Contrastive Learning. Comput. Eng. Des. 2025, 46, 2879–2887. [Google Scholar]
- Lao, A.; Zhang, Q.; Shi, C.; Cao, L.; Yi, K.; Hu, L.; Zhao, D. Frequency Spectrum is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18426–18434. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. SpotFake: A Multi-modal Framework for Fake News Detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; pp. 39–47. [Google Scholar]
- Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal Variational Autoencoder for Fake News Detection. In Proceedings of the World Wide Web Conference (WWW), San Francisco, CA, USA, 13–17 May 2019; pp. 2915–2921. [Google Scholar]
- Zhou, X.; Wu, J.; Zafarani, R. Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 11–14 May 2020; pp. 354–367. [Google Scholar]
- Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting Fake News by Exploring the Consistency of Multimodal Data. Inf. Process. Manag. 2021, 58, 102610. [Google Scholar] [CrossRef] [PubMed]
- Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs. In Proceedings of the 25th ACM International Conference on Multimedia (MM), Mountain View, CA, USA, 23–27 October 2017; pp. 795–816. [Google Scholar]
- Alzaidi, M.S.A.; Alshammari, A.; Hassan, A.Q.A.; Yousafzai, S.N.; Thaljaoui, A.; Fitriyani, N.L.; Kim, C.; Syafrudin, M. An Efficient Fusion Network for Fake News Classification. Mathematics 2024, 12, 3294. [Google Scholar] [CrossRef]
- Li, X.; Qiao, J.; Yin, S.; Wu, L.; Gao, C.; Wang, Z.; Li, X. A Survey of Multimodal Fake News Detection: A Cross-Modal Interaction Perspective. IEEE Trans. Emerg. Top. Comput. Intell. 2025, 9, 2658–2675. [Google Scholar] [CrossRef]
- Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar]
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 18661–18673. [Google Scholar]
- Zhou, Y.; Yang, Y.; Ying, Q.; Qian, Z.; Zhang, X. Multimodal Fake News Detection via CLIP-Guided Learning. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2825–2830. [Google Scholar]
- Gôlo, M.P.S.; de Souza, M.C.; Rossi, R.G.; Marcacini, R.M.; Rezende, S.O. One-class Learning for Fake News Detection through Multimodal Variational Autoencoders. Eng. Appl. Artif. Intell. 2023, 122, 106088. [Google Scholar] [CrossRef]
- Boididou, C.; Papadopoulos, S.; Dang-Nguyen, D.-T.; Boato, G.; Riegler, M.; Middleton, S.; Petlund, A.; Kompatsiaris, Y. Verifying Multimedia Use at MediaEval 2016. In Proceedings of the MediaEval 2016 Workshop, Hilversum, The Netherlands, 20–21 October 2016. [Google Scholar]
- Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference (WWW), Lyon, France, 25–29 April 2022; pp. 2897–2905. [Google Scholar]
- Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), London, UK, 19–23 August 2018; pp. 849–857. [Google Scholar]
- Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection. In Proceedings of the 27th ACM International Conference on Multimedia (MM), Nice, France, 21–25 October 2019; pp. 1942–1951. [Google Scholar]
- Wang, Z.; Sui, J. Multimodal Rumor Detection Model Based on Multi-Level Fusion. Comput. Eng. Des. 2022, 43, 1756–1761. [Google Scholar]
- Zhao, Y.; Hao, K.; Zhao, J.; Xin, C. MMCSC: A Cross-modal Fake News Detection Method. J. Northeast. Univ. (Nat. Sci.) 2024, 45, 18–25. [Google Scholar]
- Singhal, S.; Pandey, T.; Mreig, S.; Shah, R.; Kumaraguru, P. Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference (WWW), Lyon, France, 25–29 April 2022; pp. 726–734. [Google Scholar]
- Zhang, T.; Wang, D.; Chen, H.; Zeng, Z.; Guo, W.; Miao, C.; Cui, L. BDANN: BERT-based Domain Adaptation Neural Network for Multi-modal Fake News Detection. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]

| Dataset | Model | Acc | Fake News Precision | Fake News Recall | Fake News F1 | Real News Precision | Real News Recall | Real News F1 |
|---|---|---|---|---|---|---|---|---|
| Weibo | Text-GRU | 0.643 | 0.662 | 0.578 | 0.617 | 0.662 | 0.578 | 0.617 |
| Weibo | Image-VGG | 0.663 | 0.630 | 0.500 | 0.550 | 0.630 | 0.750 | 0.690 |
| Weibo | EANN [26] | 0.827 | 0.847 | 0.812 | 0.829 | 0.807 | 0.843 | 0.825 |
| Weibo | MVAE [14] | 0.824 | 0.854 | 0.769 | 0.809 | 0.802 | 0.875 | 0.837 |
| Weibo | SAFE [15] | 0.816 | 0.818 | 0.815 | 0.817 | 0.816 | 0.818 | 0.817 |
| Weibo | MCNN [16] | 0.823 | 0.858 | 0.801 | 0.828 | 0.787 | 0.848 | 0.816 |
| Weibo | CAFE [25] | 0.840 | 0.855 | 0.830 | 0.842 | 0.825 | 0.851 | 0.837 |
| Weibo | BDANN [31] | 0.842 | 0.830 | 0.870 | 0.850 | 0.850 | 0.820 | 0.830 |
| Weibo | MFCD [28] | 0.829 | 0.834 | 0.829 | 0.830 | 0.834 | 0.829 | 0.830 |
| Weibo | MMCSC [29] | 0.815 | 0.857 | 0.701 | 0.806 | 0.857 | 0.701 | 0.806 |
| Weibo | MMF [8] | 0.815 | 0.778 | 0.828 | 0.802 | 0.803 | 0.823 | 0.813 |
| Weibo | VCLMMF | 0.896 | 0.923 | 0.874 | 0.898 | 0.868 | 0.941 | 0.903 |
| Twitter | Text-GRU | 0.526 | 0.586 | 0.553 | 0.569 | 0.469 | 0.526 | 0.496 |
| Twitter | Image-VGG | 0.596 | 0.695 | 0.518 | 0.593 | 0.550 | 0.700 | 0.599 |
| Twitter | EANN [26] | 0.719 | 0.642 | 0.474 | 0.545 | 0.771 | 0.870 | 0.817 |
| Twitter | MVAE [14] | 0.745 | 0.801 | 0.719 | 0.758 | 0.689 | 0.777 | 0.730 |
| Twitter | SAFE [15] | 0.762 | 0.831 | 0.724 | 0.774 | 0.695 | 0.811 | 0.748 |
| Twitter | MCNN [16] | 0.784 | 0.778 | 0.781 | 0.779 | 0.790 | 0.787 | 0.788 |
| Twitter | CAFE [25] | 0.806 | 0.807 | 0.799 | 0.803 | 0.805 | 0.813 | 0.809 |
| Twitter | BDANN [31] | 0.830 | 0.810 | 0.630 | 0.710 | 0.830 | 0.930 | 0.880 |
| Twitter | LIIMR [30] | 0.831 | 0.836 | 0.832 | 0.830 | 0.825 | 0.830 | 0.827 |
| Twitter | MMF [8] | 0.871 | 0.889 | 0.840 | 0.864 | 0.894 | 0.863 | 0.878 |
| Twitter | VCLMMF | 0.904 | 0.943 | 0.871 | 0.905 | 0.868 | 0.941 | 0.903 |

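
The accuracy, precision, recall, and F1 columns reported above follow the usual per-class definitions. The sketch below shows how such values are computed from labels and predictions; the function names and the toy labels are hypothetical, and this is not the authors' evaluation script.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Toy example (made-up labels, for illustration only).
y_true = ["fake", "fake", "fake", "real", "real"]
y_pred = ["fake", "fake", "real", "real", "fake"]
fake_p, fake_r, fake_f1 = class_metrics(y_true, y_pred, positive="fake")
real_p, real_r, real_f1 = class_metrics(y_true, y_pred, positive="real")
acc = accuracy(y_true, y_pred)
```

Because F1 is the harmonic mean of precision and recall, a reported row can be checked by hand: for EANN [26] on fake news, precision 0.847 and recall 0.812 give 2 × 0.847 × 0.812 / (0.847 + 0.812) ≈ 0.829, the F1 listed in the table.
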
| Model | Dataset | Acc | Fake News Precision | Fake News Recall | Fake News F1 | Real News Precision | Real News Recall | Real News F1 |
|---|---|---|---|---|---|---|---|---|
| VCLMMF | Weibo (AIGC) | 0.891 | 0.918 | 0.870 | 0.893 | 0.865 | 0.936 | 0.899 |
| VCLMMF | Weibo (Niche) | 0.901 | 0.920 | 0.876 | 0.897 | 0.872 | 0.932 | 0.901 |
| VCLMMF | Twitter (AIGC) | 0.902 | 0.939 | 0.868 | 0.902 | 0.866 | 0.938 | 0.901 |
| VCLMMF | Twitter (Niche) | 0.907 | 0.941 | 0.873 | 0.906 | 0.870 | 0.939 | 0.903 |

| Dataset | Variant | Acc | F1 (Fake News) | F1 (Real News) |
|---|---|---|---|---|
| Weibo | w/o CL | 0.840 | 0.837 | 0.842 |
| Weibo | w/o VAE | 0.852 | 0.844 | 0.853 |
| Weibo | VCLMMF | 0.896 | 0.898 | 0.903 |
| Twitter | w/o CL | 0.842 | 0.850 | 0.830 |
| Twitter | w/o VAE | 0.881 | 0.873 | 0.902 |
| Twitter | VCLMMF | 0.904 | 0.905 | 0.903 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wu, B.; Hu, R.; Wang, J.; Sui, X.; Sun, J.; Liu, J.; Qu, Y. A Multimodal Fake News Detection Method Based on Contrastive Learning and Variational Autoencoder. Mathematics 2026, 14, 1773. https://doi.org/10.3390/math14101773

