A Multimodal Fake News Detection Model Based on Bidirectional Semantic Enhancement and Adversarial Network Under Web3.0
Abstract
1. Introduction
- Insufficient Cross-Modal Semantic Alignment and Deep Interaction: When processing features from different modalities [19,20], many multimodal models rely on a fusion operation. However, simple concatenation or shallow interaction mechanisms are often insufficient to bridge the semantic gap between text and images and fail to adequately explore deep cross-modal semantic correlations. Even when attention mechanisms are introduced [21,22], the efficacy of interaction remains constrained without effective initial alignment of features within a shared semantic space.
- Lack of Calibration Capability for Cross-Modal Semantic Discrepancies: In real-world multimodal news, varying degrees of semantic discrepancy, or even outright contradiction, can exist between text and images. Most current models lack a mechanism to dynamically assess and calibrate these discrepancies. Consequently, they cannot adaptively adjust the contribution of each modality based on the degree of visual-textual consistency, rendering them susceptible to interference from inconsistent or misleading information.
- Need for Enhanced Robustness and Generalization in Feature Representations: Fake news encompasses diverse event themes and manipulation tactics. If a model overfits to specific patterns prevalent in the training data, its generalization ability will be limited. This restricts its adaptability to varying event contexts and differences in modal information quality, and it hinders the learning of stable, discriminative features for identifying fake news.
- We designed a bidirectional modality mapping method that, by promoting cross-modal feature transformation, aims to achieve effective alignment of textual and visual features within a shared semantic space, providing a high-quality semantic foundation for subsequent interaction (a minimal sketch of this mapping follows this list);
- We designed a unified Semantic Enhancement and Calibration Network (SECN). This network deeply explores cross-modal relationships through a semantic interaction mechanism and adaptively adjusts modal contributions via a deviation calibration strategy driven by perceived semantic differences. This enables a refined understanding of cross-modal information and robust adaptation to potential semantic conflicts, thereby enhancing the model’s ability to discern semantic inconsistencies;
- Through extensive comparative experiments and ablation studies against several state-of-the-art benchmark models on public Weibo and Twitter multimodal fake news datasets, the proposed BSEAN model demonstrates significant performance improvements across key evaluation metrics, validating its effectiveness in the task of multimodal fake news detection.
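To make the first contribution more concrete, the following is a minimal PyTorch sketch of a cycle-consistent bidirectional mapping between textual and visual feature spaces. The two-layer MLPs, feature dimensions, and L1 cycle loss are illustrative assumptions only; the actual mapping network in Section 3.2 may differ in architecture and regularization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalModalityMapping(nn.Module):
    """Sketch of bidirectional modality mapping: two small MLPs translate
    text features into the visual space and vice versa, trained with a
    cycle-consistency objective so that a feature mapped across modalities
    and back lands close to where it started. Layer sizes are illustrative."""

    def __init__(self, text_dim=768, image_dim=768, hidden_dim=512):
        super().__init__()
        self.text_to_image = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, image_dim))
        self.image_to_text = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, text_dim))

    def forward(self, text_feat, image_feat):
        # Map each modality into the other's feature space.
        text_as_image = self.text_to_image(text_feat)
        image_as_text = self.image_to_text(image_feat)
        # Map back to the original space to close the cycle.
        text_cycled = self.image_to_text(text_as_image)
        image_cycled = self.text_to_image(image_as_text)
        # Cycle-consistency loss keeps the two mappings approximately inverse.
        cycle_loss = F.l1_loss(text_cycled, text_feat) + F.l1_loss(image_cycled, image_feat)
        return text_as_image, image_as_text, cycle_loss
```

In a complete model, the cycle-consistency term would be added to the detection objective with a weighting coefficient, so that alignment is learned jointly with classification.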
2. Related Work
2.1. Unimodal Detection Methods
2.2. Multimodal Detection Methods
3. Methodology
3.1. Feature Extraction
3.1.1. Text Encoder
3.1.2. Image Encoder
3.2. Bidirectional Modality Mapping Network
3.2.1. Cross-Modal Bidirectional Feature Mapping
3.2.2. Cycle Consistency and Regularization Optimization
3.3. Semantic Enhancement and Calibration Network
3.3.1. Semantic Interaction Enhancement
3.3.2. Semantic Deviation Calibration
3.4. Dual Adversarial Learning
4. Experiments
4.1. Datasets
4.2. Experimental Setup
4.3. Evaluation Metrics
4.4. Comparative Experiments
- The model proposed in this paper outperforms most other baseline models across the evaluation metrics. In terms of accuracy, our model improves on the other models on both the Weibo and Twitter datasets, reaching overall accuracies of 90.1% and 88.3%, respectively. Furthermore, the F1-scores indicate that our proposed model offers stable discriminative ability for fake news detection. Although the improvement in accuracy over some baseline models, such as MCAN on Weibo, is marginal, a deeper analysis reveals the distinct advantages of BSEAN. As shown in Table 1, BSEAN achieves the highest Macro-F1 score, indicating a more balanced performance across the real and fake news categories. Furthermore, as discussed in Section 4.7, our model attains a lower Expected Calibration Error (ECE), meaning its prediction confidences are more reliable (a sketch of how these metrics are computed follows this list).
- The data in the table show that fake news detection relying solely on either textual or visual unimodal information performs poorly, as unimodal approaches struggle to capture rich semantic information. Moreover, detection relying solely on the textual modality performs better than detection relying solely on the visual modality, likely because the textual modality typically provides more complete information for semantic understanding than what can be obtained directly from the visual modality.
- In the context of multimodal fake news detection: The CLIP-based model demonstrates the potential of pre-trained vision-language models, yet its results also suggest a potential limitation of relying solely on powerful general-purpose representations. EANN, through its event adversarial neural network, fuses textual and visual features and excels at mining explicit and implicit common features across events. However, the removal of specific event features can diminish its identification capability. MVAE’s multimodal variational autoencoder generates fused features via a self-supervised loss, enhancing generalization. Nevertheless, its simple fusion of modal information and lack of sophisticated interaction lead to suboptimal performance, and its sensitivity to hyperparameters further degrades overall efficacy. MCNN improves detection performance by analyzing social media features and the consistency of multimodal data. However, this method does not fully exploit the potential synergistic value of information from different modalities, a limitation that constrains the effective utilization of fused features by the model. ENRoBERTa achieves cross-modal information complementarity through efficient feature extraction and contrastive learning, effectively improving detection accuracy. However, the depth of feature fusion for complex modal interaction scenarios needs further enhancement. MCAN deeply fuses multimodal features through a multi-layer co-attention stack. This approach learns interdependencies between cross-modal features, underscoring the importance of capturing cross-modal semantics.
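For reference, the metrics used above (Accuracy, Macro-F1, and the ECE analyzed in Section 4.7) can be computed as in the following sketch. This is generic evaluation code built on NumPy and scikit-learn with dummy predictions; the binning scheme and 0.5 decision threshold are common defaults assumed here, not the authors' evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Standard ECE: bin predictions by confidence and average the gap between
    per-bin accuracy and per-bin mean confidence, weighted by bin size."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        bin_acc = (predictions[mask] == labels[mask]).mean()
        bin_conf = confidences[mask].mean()
        ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# Illustrative usage with dummy outputs (probability of the "fake" class).
labels = np.array([1, 0, 1, 1, 0, 0])
probs = np.array([0.92, 0.31, 0.67, 0.88, 0.12, 0.55])
preds = (probs >= 0.5).astype(int)
confs = np.where(preds == 1, probs, 1.0 - probs)

print("Accuracy:", accuracy_score(labels, preds))
print("Macro-F1:", f1_score(labels, preds, average="macro"))
print("ECE:", expected_calibration_error(confs, preds, labels))
```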
4.5. Ablation Study Analysis
4.6. Quantitative Analysis of SDC Module Weights
- When visual-textual information is highly consistent and mutually supportive, the weight values tend to be higher, ensuring that this strong, consistent evidence is fully utilized.
- When significant deviations or even contradictions exist between visual and textual information, the weight values tend to be lower to appropriately suppress this potential noise or conflicting information, preventing it from misleading the model’s discrimination.
- This context-aware dynamic adjustment capability enables the model to simulate the human cognitive process of “weighing” and “filtering” evidence when processing multimodal information, thereby achieving effective fusion of heterogeneous information (a minimal sketch of such a consistency-aware gate follows this list).
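The following PyTorch sketch illustrates one way such consistency-aware weighting could be realized: a gate scores visual-textual agreement and scales the visual evidence accordingly. The cosine-similarity feature and sigmoid gate are illustrative assumptions and are not claimed to match the exact formulation of the SDC module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeviationGate(nn.Module):
    """Sketch of a consistency-aware gate in the spirit of semantic deviation
    calibration: score how well visual and textual features agree, then use
    that score to up- or down-weight the visual evidence."""

    def __init__(self, dim=768):
        super().__init__()
        # Scores agreement from both features plus their cosine similarity.
        self.scorer = nn.Linear(2 * dim + 1, 1)

    def forward(self, text_feat, image_feat):
        # Cosine similarity as a coarse visual-textual consistency signal.
        cos = F.cosine_similarity(text_feat, image_feat, dim=-1).unsqueeze(-1)
        # Learnable refinement conditioned on both features and their similarity.
        gate_logit = self.scorer(torch.cat([text_feat, image_feat, cos], dim=-1))
        weight = torch.sigmoid(gate_logit)  # high when modalities agree, low when they conflict
        # Down-weight the visual evidence when it deviates from the text.
        fused = text_feat + weight * image_feat
        return fused, weight
```

The gate output corresponds to the weight values discussed above: consistent image-text pairs receive values near 1, while contradictory pairs are suppressed toward 0.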
4.7. Investigation into Model Reliability via Calibration Analysis
4.8. Analysis of Model Boundaries
5. Case Study
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Li, K.; Guo, B.; Liu, J.; Wang, J.; Ren, H.; Yi, F.; Yu, Z. Dynamic probabilistic graphical model for progressive fake news detection on social media platform. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–24. [Google Scholar] [CrossRef]
- Zhang, Z.; Jing, J.; Li, F.; Zhao, C. Survey on fake information detection, propagation and control in online social networks from the perspective of artificial intelligence. Chin. J. Comput. 2021, 44, 2261–2282. [Google Scholar]
- Zhang, X.; Ghorbani, A.A. An overview of online fake news: Characterization, detection, and discussion. Inf. Process. Manag. 2020, 57, 102025. [Google Scholar] [CrossRef]
- Wentao, C.; Kuok-Tiung, L.; Wei, L.; Bhambri, P.; Kautish, S. Predicting the security threats of internet rumors and spread of false information based on sociological principle. Comput. Stand. Interfaces 2021, 73, 103454. [Google Scholar]
- Alam, F.; Cresci, S.; Chakraborty, T.; Silvestri, F.; Dimitrov, D.; Martino, G.D.S.; Shaar, S.; Firooz, H.; Nakov, P. A survey on multimodal disinformation detection. arXiv 2021, arXiv:2103.12541. [Google Scholar]
- Pawar, P.P.; Femy, F.F.; Rajkumar, N.; Jeevitha, S.; Bhuvanesh, A.; Kumar, D. Blockchain-Enabled Cybersecurity for IoT Using Elliptic Curve Cryptography and Black Winged Kite Model. Int. J. Inf. Technol. 2025, 1–11. [Google Scholar] [CrossRef]
- Roitero, K.; Soprano, M.; Portelli, B.; Spina, D.; Della Mea, V.; Serra, G.; Mizzaro, S.; Demartini, G. The COVID-19 infodemic: Can the crowd judge recent misinformation objectively? In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 1305–1314. [Google Scholar]
- Meel, P.; Vishwakarma, D.K. Fake news, rumor, information pollution in social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities. Expert Syst. Appl. 2020, 153, 112986. [Google Scholar] [CrossRef]
- Osmundsen, M.; Bor, A.; Vahlstrup, P.B.; Bechmann, A.; Petersen, M.B. Partisan polarization is the primary psychological motivation behind political fake news sharing on Twitter. Am. Political Sci. Rev. 2021, 115, 999–1015. [Google Scholar] [CrossRef]
- Ma, J.; Gao, W.; Wong, K.F. Detect rumors on twitter by promoting information campaigns with generative adversarial learning. In The World Wide Web Conference; Association for Computing Machinery: New York, NY, USA, 2019; pp. 3049–3055. [Google Scholar]
- Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; (long and short papers). Volume 1, pp. 4171–4186. [Google Scholar]
- Zhou, Y.; Yang, Y.; Ying, Q.; Qian, Z.; Zhang, X. Multimodal fake news detection via clip-guided learning. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2825–2830. [Google Scholar]
- Cao, J.; Qi, P.; Sheng, Q.; Yang, T.; Guo, J.; Li, J. Exploring the role of visual content in fake news detection. In Disinformation, Misinformation, and Fake News in Social Media: Emerging Research Challenges and Opportunities; Springer: Cham, Switzerland, 2020; pp. 141–161. [Google Scholar]
- Zhou, P.; Han, X.; Morariu, V.I.; Davis, L.S. Learning rich features for image manipulation detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061. [Google Scholar]
- Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. Eann: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 849–857. [Google Scholar]
- Ajao, O.; Bhowmik, D.; Zargari, S. Sentiment aware fake news detection on online social networks. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2507–2511. [Google Scholar]
- Giachanou, A.; Rosso, P.; Crestani, F. Leveraging emotional signals for credibility detection. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 877–880. [Google Scholar]
- Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. Mvae: Multimodal variational autoencoder for fake news detection. In The World Wide Web Conference; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2915–2921. [Google Scholar]
- Li, P.; Sun, X.; Yu, H.; Tian, Y.; Yao, F.; Xu, G. Entity-oriented multi-modal alignment and fusion network for fake news detection. IEEE Trans. Multimed. 2021, 24, 3455–3468. [Google Scholar] [CrossRef]
- Qian, S.; Wang, J.; Hu, J.; Fang, Q.; Xu, C. Hierarchical multi-modal contextual attention network for fake news detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11 July 2021; pp. 153–162. [Google Scholar]
- Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal ambiguity learning for multimodal fake news detection. In ACM Web Conference 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 2897–2905. [Google Scholar]
- Castillo, C.; Mendoza, M.; Poblete, B. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 675–684. [Google Scholar]
- Horne, B.; Adali, S. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In Proceedings of the International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017; Volume 11, pp. 759–766. [Google Scholar]
- Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. arXiv 2017, arXiv:1708.07104. [Google Scholar] [CrossRef]
- Dun, Y.; Tu, K.; Chen, C.; Hou, C.; Yuan, X. Kan: Knowledge-aware attention network for fake news detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 81–89. [Google Scholar]
- Xu, W.; Wu, J.; Liu, Q.; Wu, S.; Wang, L. Evidence-aware fake news detection with graph neural networks. In Proceedings of the ACM Web Conference 2022, Virtual, 25–29 April 2022; pp. 2501–2510. [Google Scholar]
- Liao, H.; Peng, J.; Huang, Z.; Zhang, W.; Li, G.; Shu, K.; Xie, X. Muser: A multi-step evidence retrieval enhancement framework for fake news detection. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 4461–4472. [Google Scholar]
- Huang, Y.; Shu, K.; Yu, P.S.; Sun, L. From creation to clarification: ChatGPT’s journey through the fake news quagmire. In Proceedings of the Companion Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024; pp. 513–516. [Google Scholar]
- Qi, P.; Cao, J.; Yang, T.; Guo, J.; Li, J. Exploiting multi-domain visual information for fake news detection. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 518–527. [Google Scholar]
- Singh, B.; Sharma, D.K. Predicting image credibility in fake news over social media using multi-modal approach. Neural Comput. Appl. 2022, 34, 21503–21517. [Google Scholar] [CrossRef] [PubMed]
- Singhal, S.; Kabra, A.; Sharma, M.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P. Spotfake+: A multimodal framework for fake news detection via transfer learning (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13915–13916. [Google Scholar]
- Wang, Y.; Ma, F.; Wang, H.; Jha, K.; Gao, J. Multimodal emergent fake news detection via meta neural process networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 3708–3716. [Google Scholar]
- Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. Spotfake: A multi-modal framework for fake news detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; pp. 39–47. [Google Scholar]
- Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 795–816. [Google Scholar]
- Wu, Y.; Zhan, P.; Zhang, Y.; Wang, L.; Xu, Z. Multimodal fusion with co-attention networks for fake news detection. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 2560–2569. [Google Scholar]
- Qi, L.; Wan, S.; Tang, B.; Xu, Y. Multimodal fusion rumor detection method based on attention mechanism. Comput. Eng. Appl. 2022, 58, 209–217. [Google Scholar]
- Shang, L.; Kou, Z.; Zhang, Y.; Wang, D. A duo-generative approach to explainable multimodal covid-19 misinformation detection. In Proceedings of the ACM Web Conference 2022, Virtual, 25–29 April 2022; pp. 3623–3631. [Google Scholar]
- Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 2021, 58, 102610. [Google Scholar] [CrossRef] [PubMed]
- Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-aware multi-modal fake news detection. arXiv 2020, arXiv:2003.04981. [Google Scholar]
- Selvam, L.; Vinothkumar, E.S.; Krishnan, R.S.; Rajkumar, G.V.; Raj, J.R.F.; Malar, P.S.R. A Unified Deep Learning Model for Fake Account Identification Using Transformer-Based NLP and Graph Neural Networks. In Proceedings of the 2025 International Conference on Inventive Computation Technologies (ICICT), Kirtipur, Nepal, 23–25 April 2025; pp. 1033–1040. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Boididou, C.; Papadopoulos, S.; Zampoglou, M.; Apostolidis, L.; Papadopoulou, O.; Kompatsiaris, Y. Detection and visualization of misleading content on Twitter. Int. J. Multimed. Inf. Retr. 2018, 7, 71–86. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
| Dataset | Method | Accuracy | Macro-F1 | Fake News Precision | Fake News Recall | Fake News F1 | True News Precision | True News Recall | True News F1 |
|---|---|---|---|---|---|---|---|---|---|
| Weibo | Visual-Only | 0.528 | 0.528 | 0.540 | 0.520 | 0.530 | 0.515 | 0.538 | 0.526 |
| Weibo | Text-Only | 0.661 | 0.658 | 0.685 | 0.610 | 0.645 | 0.630 | 0.715 | 0.670 |
| Weibo | CLIP | 0.771 | 0.771 | 0.769 | 0.778 | 0.773 | 0.773 | 0.764 | 0.768 |
| Weibo | EANN | 0.782 | 0.780 | 0.827 | 0.697 | 0.756 | 0.752 | 0.863 | 0.804 |
| Weibo | MVAE | 0.824 | 0.823 | 0.854 | 0.769 | 0.809 | 0.802 | 0.875 | 0.837 |
| Weibo | MCNN | 0.846 | 0.845 | 0.809 | 0.857 | 0.832 | 0.879 | 0.837 | 0.858 |
| Weibo | ENRoBERTa | 0.812 | 0.799 | 0.851 | 0.784 | 0.816 | 0.744 | 0.826 | 0.782 |
| Weibo | MCAN | 0.899 | 0.899 | 0.913 | 0.889 | 0.901 | 0.884 | 0.909 | 0.897 |
| Weibo | BSEAN (Ours) | 0.901 | 0.905 | 0.925 | 0.870 | 0.897 | 0.890 | 0.935 | 0.912 |
| Twitter | Visual-Only | 0.534 | 0.534 | 0.550 | 0.480 | 0.513 | 0.520 | 0.595 | 0.555 |
| Twitter | Text-Only | 0.592 | 0.591 | 0.610 | 0.530 | 0.567 | 0.575 | 0.660 | 0.615 |
| Twitter | CLIP | 0.788 | 0.785 | 0.770 | 0.760 | 0.765 | 0.790 | 0.820 | 0.805 |
| Twitter | EANN | 0.648 | 0.639 | 0.810 | 0.498 | 0.617 | 0.584 | 0.759 | 0.660 |
| Twitter | MVAE | 0.745 | 0.744 | 0.801 | 0.719 | 0.758 | 0.689 | 0.777 | 0.730 |
| Twitter | MCNN | 0.784 | 0.784 | 0.778 | 0.781 | 0.779 | 0.790 | 0.787 | 0.788 |
| Twitter | ENRoBERTa | 0.853 | 0.849 | 0.821 | 0.943 | 0.877 | 0.913 | 0.745 | 0.820 |
| Twitter | MCAN | 0.809 | 0.809 | 0.889 | 0.765 | 0.822 | 0.732 | 0.871 | 0.795 |
| Twitter | BSEAN (Ours) | 0.883 | 0.876 | 0.905 | 0.845 | 0.874 | 0.911 | 0.846 | 0.877 |