Article

User Comment-Guided Cross-Modal Attention for Interpretable Multimodal Fake News Detection

1 School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
2 Research Institute, Shenzhen Huazhong University of Science and Technology, Shenzhen 518057, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(14), 7904; https://doi.org/10.3390/app15147904
Submission received: 16 June 2025 / Revised: 12 July 2025 / Accepted: 13 July 2025 / Published: 15 July 2025
(This article belongs to the Special Issue Explainable Artificial Intelligence Technology and Its Applications)

Abstract

The proliferation of fake news in the digital age poses a pressing challenge with profound and harmful effects on societal structures, including the misguidance of public opinion, the erosion of social trust, and the exacerbation of social polarization. Current fake news detection methods are largely limited to superficial text analysis or basic text–image integration and face significant limitations in accurately identifying deceptive information. To bridge this gap, we propose the UC-CMAF framework, which comprehensively integrates news text, images, and user comments through an adaptive co-attention fusion mechanism. The UC-CMAF workflow consists of four key subprocesses: multimodal feature extraction; cross-modal adaptive collaborative attention fusion of news text and images; cross-modal attention fusion of user comments with news text and images; and input of the fused features into a fake news detector. In addition, we introduce multi-head cross-modal attention heatmaps and comment importance visualizations to provide interpretability support for the model’s predictions, revealing the key semantic areas and user perspectives that influence its judgments. Through the cross-modal adaptive collaborative attention mechanism, UC-CMAF achieves deep semantic alignment between news text and images and uses social signals from user comments to build an enhanced credibility evaluation path, offering a new paradigm for interpretable fake information detection. Experimental results demonstrate that UC-CMAF consistently outperforms 15 baseline models across two benchmark datasets, achieving F1 Scores of 0.894 and 0.909. These results validate the effectiveness of its adaptive cross-modal attention mechanism and the incorporation of user comments in enhancing both detection accuracy and interpretability.

1. Introduction

In the era of digital information explosion, the widespread dissemination of fake news has become an urgent social challenge, threatening not only public discourse and democratic processes but also individual decision-making [1,2]. In particular, multimodal fake news, which strategically combines manipulated text, images, and user interactions, spreads significantly faster than purely textual misinformation and achieves higher engagement rates, posing severe challenges to traditional detection methods [3,4]. Therefore, developing effective fake news detection and mitigation methods has become a critical research direction in the fields of artificial intelligence and machine learning [5,6].
In recent years, with the widespread application of multimodal learning in information retrieval, its advantages of leveraging multiple data sources for complementary information have become increasingly prominent [7,8]. From image–text matching to video captioning and sentiment analysis, multimodal methods have significantly improved task performance by extracting and fusing features from different modalities [9,10]. In this context, extending multimodal learning to the fake news detection domain, particularly utilizing user comments as a key modality for analysis to improve accuracy in identifying fake news through social signals, holds significant research value and practical importance [11,12].
Although fake news detection has developed various methods based on content, context, and social signals [13,14], these methods are typically limited to superficial textual analysis or basic text–image integration, encountering significant obstacles in precisely locating deceptive content [15,16]. Existing methods often fail to comprehensively capture the nuances and complexities of fake news, primarily manifested in several aspects: sophisticated fake news often employs subtle linguistic techniques that are difficult to detect through traditional textual analysis; visual content is frequently manipulated or taken out of context to support false narratives; fake news creators exploit cognitive biases and emotional triggers to enhance credibility; valuable insights provided by user comments and reactions are often underutilized in current detection systems [17,18].
To address these challenges, this work tackles three fundamental research problems: the effective integration of user comments as a primary modality rather than auxiliary information in multimodal fake news detection; the design of cross-modal attention mechanisms capable of capturing fine-grained semantic relationships between news content and social validation signals; and the development of interpretable detection systems that provide transparent decision-making processes for practical deployment scenarios. These problems arise from the limitations of existing methods that treat user comments as supplementary data and lack the ability to capture nuanced cross-modal interactions while providing explainable detection decisions.
To bridge these gaps, we propose the UC-CMAF framework, which fundamentally transforms the role of user comments from auxiliary information to a primary modality while introducing novel adaptive cross-modal attention mechanisms for comprehensive fake news detection. Our research objectives are threefold: first, to develop a framework that effectively harnesses collective intelligence and social validation signals embedded in user comments; second, to design sophisticated cross-modal attention mechanisms that capture intricate semantic relationships between textual claims, visual evidence, and user feedback; and third, to ensure interpretable detection outcomes through comprehensive attention visualizations and hierarchical importance scoring, thereby enabling transparency and facilitating practical deployment in real-world fact-checking scenarios.
The main contributions of this paper include three aspects. First, we propose UC-CMAF, an innovative multimodal fusion framework that successfully integrates three different information modalities—news text, images, and user comments—for fake news detection. Second, UC-CMAF integrates an adaptive co-attention fusion mechanism that can dynamically adjust weights between different modalities, achieving efficient cross-modal information fusion. Finally, we conducted extensive experimental validation on multiple datasets, demonstrating the excellent performance of the UC-CMAF model in fake news detection tasks, and further validated the importance of each component in the model through ablation studies.
While effectively identifying fake news, improving the interpretability of detection decisions is of great significance for building public trust and supporting human intervention [19,20]. With the rapid development of Explainable Artificial Intelligence (XAI), incorporating its principles into multimodal misinformation identification tasks can achieve the synergistic goal of “visual explanation + social context explanation” [21,22]. Specifically, fake news detection not only requires high accuracy but also needs the model to answer “why” certain information is judged as false, which relies on the joint support of cross-modal semantic consistency and verifiable signals from user comments [23,24].

2. Review of Fake News Detection Approaches and Multimodal Learning Methods

2.1. Content-Based Fake News Detection

Content-based approaches primarily identify false information by analyzing intrinsic characteristics of news articles, including linguistic patterns, semantic inconsistencies, and visual manipulations. Early research relied on handcrafted features such as lexical diversity, syntactic patterns, readability metrics, and writing style analysis [25,26,27]. Perez et al. [25] proposed a classification method based on linguistic features, identifying fake news through lexical complexity and syntactic structure analysis. Zhou et al. [26] further extended this approach by constructing comprehensive feature sets incorporating psycholinguistic and rhetorical characteristics. While these traditional methods provided baseline detection capabilities, they often failed to capture sophisticated manipulation techniques employed in modern fake news.
With the advancement of deep learning technologies, neural network approaches have demonstrated superior performance in capturing complex semantic patterns. Wang et al. [28] proposed a CNN-based text classification method that effectively extracted local textual features. Yang et al. [8] employed Long Short-Term Memory (LSTM) networks to handle temporal dependencies in sequential data, achieving significant improvements in fake news detection tasks. Khattar et al. [16] developed MVAE (Multimodal Variational Autoencoder) that learns joint representations of text and images, enabling the model to capture latent correlations between different modalities.
Recent multimodal approaches have explored various fusion strategies to integrate textual and visual information. Wang et al. [15] proposed EANN (Event Adversarial Neural Network) that combines adversarial training with multimodal feature extraction, significantly improving generalization capabilities across different domains and events. Zhou et al. [29] designed multimodal fusion networks that dynamically balance the importance of different modalities through attention mechanisms. However, these methods often fail to capture subtle inconsistencies characteristic of complex fake news and lack the ability to leverage external validation signals from user communities.

2.2. Context-Based Fake News Detection

Context-based approaches identify false information by examining the social environment surrounding news propagation, including user characteristics, propagation patterns, temporal dynamics, and social network structures. These methods recognize that fake news often exhibits distinctive behavioral patterns during propagation through social networks [30,31].
Social network analysis research has revealed characteristic propagation patterns of fake news. Vosoughi et al. [32] analyzed large-scale Twitter data and found that false news spreads faster, farther, and deeper than true news. Jin et al. [33] further demonstrated that fake news exhibits higher engagement rates and distinctive user interaction patterns. Tacchini et al. [34] showed that false news tends to propagate within echo chambers and filter bubbles, forming identifiable network topological features.
Graph-based methods have shown unique advantages in modeling complex relationships. Guo et al. [35] proposed Hierarchical Social Attention (HSA) networks that utilize hierarchical social network structures and user comments for rumor identification. Liu et al. [36] developed graph convolutional networks capable of simultaneously processing complex relationships among users, content, and propagation patterns. Ma et al. [37] designed recurrent neural networks that model temporal evolution characteristics of rumors by constructing propagation tree structures. Additionally, tensor factorization with sparse and graph regularization has been applied to capture intricate social network structures for fake news detection [38].
Recent research has focused more on deep integration of user comments and social signals. Shu et al. [39] proposed the dEFEND framework that uses user comments as external knowledge to enhance detection performance. Shang et al. [40] developed DGExplain, introducing a dual-generative framework that combines content analysis with comment-based explanations, providing both detection capabilities and interpretable insights. However, existing methods typically treat comments as auxiliary information rather than primary modalities, representing an important research gap.

2.3. Attention Mechanisms in Multimodal Learning

Attention mechanisms have revolutionized multimodal learning by enabling models to focus on relevant information from different modalities while suppressing noise inputs [41,42]. Bahdanau et al. [41] initially introduced attention mechanisms for machine translation tasks, allowing models to dynamically attend to relevant parts of input sequences when generating outputs.
Cross-modal attention mechanisms enable identification of correspondences and relationships between different data types [43,44]. Xu et al. [45] proposed visual attention models capable of precisely aligning textual descriptions with image regions. Anderson et al. [46] developed bottom-up attention mechanisms that extract visual features through object detection techniques, significantly improving performance on vision–language tasks. Yu et al. [47] designed collaborative attention mechanisms that facilitate bidirectional information flow between modalities, enabling each modality to mutually enhance representational capabilities.
The emergence of Transformer architectures has further advanced attention mechanism capabilities. Vaswani et al. [42] proposed self-attention mechanisms capable of modeling long-range dependencies. Building upon this, Lu et al. [48] developed ViLBERT and Li et al. [49] proposed VisualBERT, achieving breakthrough progress in multimodal understanding tasks. Chen et al. [50] unified multiple vision–language pre-training tasks with UNITER, demonstrating the powerful potential of Transformers in multimodal learning.
Despite the significant success of attention mechanisms in multimodal learning, their application in fake news detection, particularly integrating user comments as primary modalities, remains limited, providing important opportunities for improving detection accuracy and robustness.

2.4. User Comment Analysis for Credibility Assessment

User comments provide valuable information for news credibility assessment through mechanisms including collective intelligence, crowdsourced fact-checking, and early warning signals [51,52]. Castillo et al. [30] conducted pioneering research showing that user replies on Twitter contain rich credibility cues that effectively assist rumor detection. Kumar et al. [53] further validated the phenomenon that domain experts, fact-checkers, and informed citizens in comment sections can quickly identify and expose misinformation.
Temporal dynamics of comments also provide important detection signals. Kwon et al. [31] found that comment patterns for legitimate news typically show predictable correlations with news cycles and public interest, while fake news often exhibits artificial engagement patterns or abnormal temporal distributions. Zhao et al. [54] confirmed the crucial role of early comments in rumor detection through Weibo data analysis.
Semantic analysis of comments reveals additional credibility indicators. Popat et al. [55] discovered that comments on legitimate news tend to focus on substantive discussions about reported events, while comments on fake news often exhibit characteristics such as emotional manipulation, topic deflection, or coordinated messaging campaigns. Enayet et al. [56] further analyzed the impact of stance detection and sentiment orientation in comments on credibility assessment.
However, comment analysis faces numerous challenges requiring effective solutions: noise filtering, relevance assessment, temporal dynamics modeling, and scalability issues [4,57]. Existing research still lacks sufficient exploration of comment information potential, particularly in cross-modal fusion and real-time processing capabilities.

2.5. Explainable AI in Misinformation Detection

Explainable Artificial Intelligence (XAI) aims to enhance transparency and interpretability of model decisions, playing crucial roles in safety-critical applications [20,58]. In fake news detection, Ribeiro et al. [22] proposed the LIME method that explains black-box model predictions through local approximation. Lundberg et al. [23] developed the SHAP framework that quantifies feature contributions to predictions based on game theory principles.
Research on explainability for multimodal fake news detection is gradually emerging. Antol et al. [59] demonstrated model attention focus when processing vision–language tasks through attention visualization. Selvaraju et al. [60] proposed Grad-CAM techniques that generate class activation maps to explain convolutional neural network decision processes. For fake news detection, Shang et al. [40] provided interpretable detection evidence for COVID-19 information detection through the DGExplain generative explanation framework.
However, existing explainability methods primarily focus on single modalities or simple multimodal fusion, lacking deep interpretability capabilities for social signals such as user comments. Furthermore, cross-modal consistency analysis and interpretable mining of community intelligence remain important directions to be explored [61,62].
Our proposed UC-CMAF framework introduces cross-modal interpretability mechanisms that not only display attention distributions of image–text consistency differences but also reveal evidence signal sources of collective intelligence through user comment importance scoring, achieving truly multimodal interpretable detection pathways.

3. Methodology

This study employs a quantitative experimental design to develop and evaluate a novel multimodal fake news detection framework. Our research design systematically addresses three specific objectives: (1) developing the UC-CMAF architecture that elevates user comments to primary modality status, (2) implementing adaptive cross-modal attention mechanisms for capturing semantic relationships between heterogeneous data sources, and (3) ensuring interpretable detection outcomes through comprehensive attention visualization and hierarchical importance scoring mechanisms.

3.1. Problem Statement

In this paper, our focus is on determining the authenticity of news articles circulating on social media platforms. Given a multimodal news article $P$ composed of text $T$, images $I = \{I_1, I_2, \ldots, I_m\}$, and associated user comments $C = \{C_1, C_2, \ldots, C_n\}$, our goal is to predict the authenticity label $y \in \{0, 1\}$, where $y = 0$ indicates real news and $y = 1$ indicates fake news.
The challenge lies in effectively modeling the complex relationships between these heterogeneous modalities while filtering noise and irrelevant information that may confound the detection process. Our approach must consider the following:
  • Multimodal Consistency: Analyzing whether textual claims are supported by visual evidence.
  • Social Validation: Leveraging user comments to identify credibility signals and fact-checking information.
  • Temporal Dynamics: Considering the timing and sequence of content publication and user responses.
  • Noise Resilience: Filtering irrelevant content, spam comments, and manipulative elements.

3.2. Overall Framework

The architecture of our UC-CMAF model is shown in Figure 1. It comprises four key sub-processes: multimodal feature extraction, cross-modal adaptive collaborative attention fusion for news text and images, cross-modal attention fusion of user comments with news text and images, and fake news detection.
The UC-CMAF framework consists of four integrated components working synergistically to achieve accurate fake news detection. The framework begins with text and image encoding networks that leverage state-of-the-art pre-trained models to extract rich semantic representations from textual and visual modalities of the provided news content. These encoded representations are then processed through the cross-modal adaptive collaborative attention (CMA) component, which implements sophisticated attention mechanisms that capture fine-grained relationships between textual and visual content while dynamically adjusting attention weights based on content characteristics. The User Comment Integration (UCI/UCT) component introduces specialized networks to process and integrate user comments with news content, enabling the model to leverage social signals and collective intelligence for authenticity assessment. Finally, the fake news detector serves as the aggregation component that combines all multimodal features and produces authenticity predictions with associated confidence scores, completing the comprehensive framework for detecting fake news through the synergistic integration of content analysis and social validation signals.

3.3. Multimodal Feature Extraction

As described in the problem statement, our model accepts multimodal news input P = { T , I , C } , where T represents textual content, I represents visual content, and C represents corresponding user comments. Text and comment inputs are encoded using a shared pre-trained BERT model to ensure consistent semantic representations. Images are processed using a ResNet-50 backbone, with final convolutional outputs adjusted to match the dimension of text embeddings through 2D convolution layers. This alignment facilitates effective interaction across modalities in the subsequent attention fusion stage.

3.3.1. Text Encoding Network

For textual content, we leverage BERT (Bidirectional Encoder Representations from Transformers), a pre-trained language model developed by Google that understands the contextual meaning of words in sentences by analyzing text bidirectionally, enabling it to capture nuanced semantic relationships and linguistic patterns essential for fake news detection. BERT generates features containing word semantics and linguistic context. Given text content $T$, it is modeled as a word sequence $T = \{w_1, w_2, \ldots, w_m\}$, where $m$ represents the number of words in the article text.
$H_T = \mathrm{BERT}(T) = \{h_1, h_2, \ldots, h_m\}$
where $h_i \in \mathbb{R}^{d_t}$ represents the hidden-state output of the corresponding token in BERT, and $d_t$ represents the dimension of text embeddings.

3.3.2. Image Encoding Network

For visual content I, we adopt ResNet-50 (Residual Network with 50 layers), a deep convolutional neural network that excels at image classification and feature extraction by using residual connections to effectively train very deep networks, making it particularly suitable for extracting comprehensive visual features from news images that may contain manipulated or misleading content. Considering that a piece of content may contain multiple images, we process each image independently and then aggregate the features.
$F_V = \{f_1, f_2, \ldots, f_m\}, \quad \text{where } f_j = \mathrm{ResNet50}(I_j)$
where $f_j \in \mathbb{R}^{d_v}$ represents the feature vector of the $j$-th image, and $d_v$ represents the dimension of image embeddings (typically 2048 for ResNet-50).
To ensure dimensional consistency across modalities, we incorporate a 2D convolutional layer that adapts the embedding dimension from $d_v$ to $d_t$:
$F_{V\_adapted} = \mathrm{Conv2D}(F_V), \quad \mathrm{Conv2D}: \mathbb{R}^{d_v} \rightarrow \mathbb{R}^{d_t}$
where Conv2D represents a 2D convolutional layer, a fundamental component in deep learning that applies filters to detect visual patterns and features in images.

3.3.3. Comment Encoding Network

User comments are processed through the same BERT architecture as text encoding, ensuring semantic representation consistency across textual modalities.
$F_C = \{c_1, c_2, \ldots, c_n\}, \quad \text{where } c_i = \mathrm{BERT}(C_i)$
where $c_i \in \mathbb{R}^{d_t}$ represents the encoded representation of the $i$-th comment, and $n$ represents the total number of comments.
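The following sketch illustrates how the three encoders described above could be assembled in PyTorch using the Hugging Face transformers and torchvision libraries. The class name MultimodalEncoder and the use of a 1 × 1 convolution as the dimension adapter are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import resnet50

class MultimodalEncoder(nn.Module):
    """Sketch of the Section 3.3 encoders: shared BERT for text/comments, ResNet-50 for images."""
    def __init__(self, d_t=768):
        super().__init__()
        # Shared BERT encoder for news text and user comments
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # ResNet-50 backbone with the classification head removed (outputs 2048-d features)
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # 1x1 convolution adapting 2048-d visual features to d_t = 768
        self.adapter = nn.Conv2d(2048, d_t, kernel_size=1)

    def encode_text(self, input_ids, attention_mask):
        # Returns H_T: one hidden state per token, shape (batch, m, d_t)
        return self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

    def encode_images(self, images):
        # images: (batch, 3, 224, 224) -> (batch, 2048, 1, 1) -> (batch, d_t)
        feats = self.cnn(images)
        return self.adapter(feats).flatten(1)
```

Comments are encoded through the same encode_text path, which realizes the shared-BERT design described above.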

3.4. Cross-Modal Adaptive Collaborative Attention Fusion for Text and Images

This component directly addresses our first research objective by implementing sophisticated attention mechanisms that treat user comments as equal partners with textual and visual content, moving beyond auxiliary data treatment to primary modality integration.

3.4.1. Attention Distribution Generation

To generate an attention distribution of visual features guided by textual features, we input the extracted text features H T and image features F V into a single-layer neural network followed by a softmax function:
$\rho = \tanh\left(W_H H_T \oplus (W_V F_V + b_V)\right)$
$\alpha = \mathrm{softmax}(W_\alpha \rho + b_\alpha)$
where $W_H, W_V \in \mathbb{R}^{r \times d}$ and $W_\alpha$ are learnable parameters, $b_V$ and $b_\alpha$ are bias terms, and $\oplus$ represents element-wise addition with broadcasting. The softmax function converts a vector of scores into a probability distribution, ensuring that all attention weights sum to one.

3.4.2. Visual–Text Feature Alignment

The visual vector related to text features can be obtained by using the attention distribution as weights:
$\hat{v} = \sum_i \alpha_i \cdot f_i$
This weighted aggregation allows the model to focus on visual regions most relevant to textual content, enabling detection of inconsistencies between claimed events and visual evidence.

3.4.3. Bidirectional Attention Mechanism

We also compute text features guided by visual features to achieve bidirectional information flow:
$\mu = \tanh\left(W_V F_V \oplus (W_H H_T + b_H)\right)$
$\beta = \mathrm{softmax}(W_\beta \mu + b_\beta)$
$\hat{h} = \sum_j \beta_j \cdot h_j$
where $W_V, W_H \in \mathbb{R}^{r \times d}$, $b_H$ and $b_\beta$ are bias terms, $\mu_j \in \mathbb{R}^{r}$ is the $j$-th column of $\mu \in \mathbb{R}^{r \times m}$, $W_\beta \in \mathbb{R}^{1 \times r}$, $b_\beta \in \mathbb{R}$, and $m$ is the number of text tokens. This process implements bidirectional attention by computing text features guided by visual information: it generates attention weights that determine text token importance based on visual guidance and produces visually guided text representations, enabling the model to identify the textual content most relevant to the visual information.

3.4.4. Adaptive Fusion Strategy

After obtaining attention distributions α and β , we compute the mutually guided representation δ between text and visual modalities:
$\delta = \hat{v} \oplus \hat{h}$
where $\delta$ represents the fused representation capturing text and visual semantics and their cross-modal dependencies.
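To make the fusion above concrete, the following minimal PyTorch sketch implements one way to realize the bidirectional co-attention, assuming token features $H_T$ of shape (m, d) and image features $F_V$ of shape (k, d) after the dimension adapter. Pooling the guiding modality before the broadcast addition and fusing $\hat{v}$ and $\hat{h}$ by element-wise addition are simplifying assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of the Section 3.4 text-image co-attention."""
    def __init__(self, d, r):
        super().__init__()
        self.W_H = nn.Linear(d, r, bias=False)  # projects text tokens
        self.W_V = nn.Linear(d, r)              # projects image features (includes b_V)
        self.w_alpha = nn.Linear(r, 1)          # scores image features
        self.w_beta = nn.Linear(r, 1)           # scores text tokens

    def forward(self, H_T, F_V):
        # Text-guided attention over images: rho = tanh(W_V F_V (+) pooled W_H H_T)
        rho = torch.tanh(self.W_V(F_V) + self.W_H(H_T).mean(0, keepdim=True))
        alpha = torch.softmax(self.w_alpha(rho).squeeze(-1), dim=0)   # attention over k images
        v_hat = (alpha.unsqueeze(-1) * F_V).sum(0)                    # text-guided visual vector

        # Image-guided attention over text tokens
        mu = torch.tanh(self.W_H(H_T) + self.W_V(F_V).mean(0, keepdim=True))
        beta = torch.softmax(self.w_beta(mu).squeeze(-1), dim=0)      # attention over m tokens
        h_hat = (beta.unsqueeze(-1) * H_T).sum(0)                     # visually guided text vector

        return v_hat + h_hat                                          # fused representation delta
```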

3.5. Fusion of User Comments and News Content

The UCI and UCT components fulfill our second research objective by establishing bidirectional semantic relationships between social validation signals and news content, enabling comprehensive credibility assessment through collective intelligence.

3.5.1. User Comment and Image Fusion (UCI)

We adopt a cross-attention mechanism to fuse news images and user comments, enabling the model to identify visual elements that users are discussing or questioning:
The comment-to-image attention mechanism involves several sequential transformations, each serving a specific purpose in establishing cross-modal relationships.
The query transformation extracts attention queries from comment features:
$Q_{CI} = \mathrm{Linear}_Q(F_C)$
where $Q_{CI}$ represents the query matrix derived from comments, $F_C$ denotes the comment feature representations, and $\mathrm{Linear}_Q$ is a learnable linear transformation layer that projects comment features into the query space.
The key transformation maps image features to attention keys:
$K_{CI} = \mathrm{Linear}_K(F_V)$
where $K_{CI}$ is the key matrix generated from visual content, $F_V$ represents the visual feature representations, and $\mathrm{Linear}_K$ is a learnable linear transformation that projects image features into the key space.
The value transformation converts image features into attention values:
$V_{CI} = \mathrm{Linear}_V(F_V)$
where $V_{CI}$ denotes the value matrix from visual features, $F_V$ represents the same visual feature representations, and $\mathrm{Linear}_V$ is a learnable linear transformation that projects image features into the value space.
The attention weight computation applies scaled dot-product attention:
$A_{CI} = \mathrm{softmax}\left(\frac{Q_{CI} K_{CI}^{T}}{\sqrt{d_k}}\right)$
where $A_{CI}$ represents the normalized attention weights, $Q_{CI} K_{CI}^{T}$ computes the dot product between queries and keys, $d_k$ is the dimensionality of the key vectors used for scaling, and softmax ensures the attention weights sum to one.
The final attended feature computation aggregates visual information based on attention weights:
$F_{CI} = A_{CI} V_{CI}$
where $F_{CI}$ represents the final comment-informed visual features, $A_{CI}$ are the attention weights, and $V_{CI}$ are the value representations that are weighted and aggregated.

3.5.2. User Comment and Text Fusion (UCT)

Since user comments often contain rich textual information that can supplement or refute the main news text, we implement a sophisticated fusion mechanism to capture comment-to-text relationships.
The query transformation extracts attention queries from comment features:
$Q_{CT} = \mathrm{Linear}_Q(F_C)$
where $Q_{CT}$ represents the query matrix derived from comment features, $F_C$ denotes the comment feature representations, and $\mathrm{Linear}_Q$ is a learnable linear transformation layer that projects comment features into the query space for text attention.
The key transformation maps news text features to attention keys:
$K_{CT} = \mathrm{Linear}_K(H_T)$
where $K_{CT}$ is the key matrix generated from textual content, $H_T$ represents the news text feature representations, and $\mathrm{Linear}_K$ is a learnable linear transformation that projects text features into the key space.
The value transformation converts text features into attention values:
$V_{CT} = \mathrm{Linear}_V(H_T)$
where $V_{CT}$ denotes the value matrix from textual features, $H_T$ represents the same news text feature representations, and $\mathrm{Linear}_V$ is a learnable linear transformation that projects text features into the value space.
The attention weight computation applies scaled dot-product attention:
$A_{CT} = \mathrm{softmax}\left(\frac{Q_{CT} K_{CT}^{T}}{\sqrt{d_k}}\right)$
where $A_{CT}$ represents the normalized attention weights, $Q_{CT} K_{CT}^{T}$ computes the dot product between comment queries and text keys, $d_k$ is the dimensionality of the key vectors used for scaling, and softmax ensures the attention weights sum to one.
The final attended feature computation aggregates textual information based on attention weights:
$F_{CT} = A_{CT} V_{CT}$
where $F_{CT}$ represents the final comment-informed textual features, $A_{CT}$ are the attention weights, and $V_{CT}$ are the value representations that are weighted and aggregated to capture relevant textual information guided by comment content.
Algorithm 1 achieves deep fusion of user comments with news content, including two main branches:
UCI Branch (Comment–Image Fusion): Adopts cross-attention mechanism, using comments as queries and image features as keys and values. This design enables the model to identify specific visual elements discussed or questioned by users in comments, such as inconsistencies in images, traces of photo manipulation, or visual content that does not match textual descriptions.
UCT Branch (Comment–Text Fusion): Similarly uses the cross-attention mechanism, but with text features as key-value pairs. This enables the model to capture supplementary, corrective, or refuting information from comments regarding text content.
Aggregation Mechanism: Combined with the importance weights obtained from the first algorithm, comment features are weighted and aggregated to ensure that high-quality comments play a greater role in the fusion process. Finally, the aggregated comment representation is combined with original text features through feature concatenation.
Algorithm 1 User comment fusion algorithm.
Require: Comment features $F_C$, image features $F_V$, text features $H_T$, importance weights $S_{importance}$
Ensure: Comment–image fusion features $F_{CI}$, comment–text fusion features $F_{TC}$
1: // UCI: Comment–Image Fusion
2: $Q_{CI} \leftarrow \mathrm{Linear}_Q(F_C)$  ▹ Comment queries
3: $K_{CI} \leftarrow \mathrm{Linear}_K(F_V)$, $V_{CI} \leftarrow \mathrm{Linear}_V(F_V)$  ▹ Image keys and values
4: $A_{CI} \leftarrow \mathrm{softmax}\left(Q_{CI} K_{CI}^{T} / \sqrt{d_k}\right)$  ▹ Attention weights
5: $F_{CI} \leftarrow A_{CI} V_{CI}$  ▹ Weighted fusion
6: // UCT: Comment–Text Fusion
7: $Q_{CT} \leftarrow \mathrm{Linear}_Q(F_C)$  ▹ Comment queries
8: $K_{CT} \leftarrow \mathrm{Linear}_K(H_T)$, $V_{CT} \leftarrow \mathrm{Linear}_V(H_T)$  ▹ Text keys and values
9: $A_{CT} \leftarrow \mathrm{softmax}\left(Q_{CT} K_{CT}^{T} / \sqrt{d_k}\right)$  ▹ Attention weights
10: $F_{CT} \leftarrow A_{CT} V_{CT}$  ▹ Weighted fusion
11: // Aggregated Comment Representation
12: $C_{agg} \leftarrow \sum_{i=1}^{n} S_{importance}(c_i) \cdot c_i$  ▹ Importance weighting
13: $F_{TC} \leftarrow \mathrm{Concat}(H_T, C_{agg})$  ▹ Feature concatenation
14: return $F_{CI}$, $F_{TC}$
Algorithm 1 establishes a comprehensive framework for integrating user comments with multimodal news content. The UCI branch enables the model to identify visual elements that users discuss or question, while the UCT branch captures supplementary or contradictory information from comments regarding textual content. The importance weighting mechanism ensures that high-quality, credible comments play a more significant role in the detection process. This dual-branch design fully utilizes the diversity of user comments, considering both the verification of visual content by comments and the supplementation and correction of textual content.
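A compact sketch of Algorithm 1 is given below, assuming comment features $F_C$ of shape (n, d), image features $F_V$ of shape (k, d), and text features $H_T$ of shape (m, d). Pooling $H_T$ with a mean before the final concatenation is an illustrative choice for handling the dimension mismatch between token-level text features and the aggregated comment vector.

```python
import math
import torch
import torch.nn as nn

class CommentCrossAttention(nn.Module):
    """Scaled dot-product cross-attention with comments as queries (UCI/UCT branches)."""
    def __init__(self, d, d_k):
        super().__init__()
        self.q = nn.Linear(d, d_k)   # Linear_Q
        self.k = nn.Linear(d, d_k)   # Linear_K
        self.v = nn.Linear(d, d_k)   # Linear_V

    def forward(self, F_C, target):
        Q, K, V = self.q(F_C), self.k(target), self.v(target)
        A = torch.softmax(Q @ K.T / math.sqrt(K.size(-1)), dim=-1)   # attention weights A
        return A @ V                                                  # comment-informed features

def fuse_comments(F_C, F_V, H_T, importance, uci, uct):
    F_CI = uci(F_C, F_V)                              # UCI: comment-image fusion
    F_CT = uct(F_C, H_T)                              # UCT: comment-text fusion
    C_agg = (importance.unsqueeze(-1) * F_C).sum(0)   # importance-weighted comment vector
    F_TC = torch.cat([H_T.mean(0), C_agg], dim=-1)    # concatenate pooled text with C_agg
    return F_CI, F_CT, F_TC
```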

3.5.3. Hierarchical Comment Importance Scoring

Considering that not all comments are equally informative for credibility assessment, we implement a hierarchical scoring mechanism to evaluate comment importance across multiple dimensions.
The relevance score measures semantic similarity between comments and news text:
$S_{rel}(c_i) = \mathrm{cosine\_similarity}(c_i, H_T)$
where $S_{rel}(c_i)$ represents the relevance score for comment $c_i$, $c_i$ denotes the $i$-th comment representation, $H_T$ represents the news text feature representation, and cosine_similarity computes the normalized dot product between comment and text embeddings.
The credibility score evaluates trustworthiness indicators within comments:
$S_{cred}(c_i) = \mathrm{CredibilityScorer}(c_i)$
where $S_{cred}(c_i)$ represents the credibility score for comment $c_i$, $c_i$ denotes the comment content, and CredibilityScorer is a specialized neural network that analyzes linguistic patterns, factual consistency, and other credibility indicators within the comment text.
The temporal importance score accounts for comment recency:
$S_{temp}(c_i) = \exp\left(-\lambda \cdot (t_{current} - t_{comment})\right)$
where $S_{temp}(c_i)$ represents the temporal importance score for comment $c_i$, $\lambda$ is a decay parameter controlling the rate of temporal degradation, $t_{current}$ denotes the current timestamp, $t_{comment}$ represents the comment posting time, and the exponential decay ensures that recent comments receive higher weights.
The overall importance score combines all three dimensions:
$S_{importance}(c_i) = \alpha \cdot S_{rel}(c_i) + \beta \cdot S_{cred}(c_i) + \gamma \cdot S_{temp}(c_i)$
where $S_{importance}(c_i)$ represents the final importance score for comment $c_i$; $\alpha$, $\beta$, and $\gamma$ are learnable weighting parameters that balance the contributions of relevance, credibility, and temporal factors, respectively; and the linear combination produces a comprehensive measure of comment informativeness for fake news detection.
The core idea of Algorithm 2 is to assign an importance weight to each user comment to identify the most valuable comments for fake news detection. The algorithm provides a systematic approach to evaluating comment quality across multiple dimensions. By combining relevance, credibility, and temporal factors, this algorithm automatically identifies the most valuable comments for fake news detection, focusing the model’s attention on expert opinions, fact-checking information, and timely responses while filtering out noise and irrelevant content.
Relevance Score ($S_{rel}$): By calculating the cosine similarity between comment features and news text features, it measures the degree of relevance between comments and main content. Comments with high relevance typically contain direct discussions or questioning of news content.
Credibility Score ($S_{cred}$): Through a specialized credibility scorer, it identifies credibility indicators in comments such as fact-checking information, source citations, and professional knowledge. These comments often come from users with professional backgrounds or contain verifiable information.
Temporal Importance ($S_{temp}$): An exponential decay function is used to model the publication time of comments. Early comments usually contain more original information and immediate reactions, making them more valuable for detecting fake news.
Algorithm 2 User comment importance scoring algorithm.
Require: Comment set $F_C = \{c_1, c_2, \ldots, c_n\}$, text features $H_T$, hyperparameters $\alpha, \beta, \gamma, \lambda$
Ensure: Importance weight set $S_{importance} = \{s_1, s_2, \ldots, s_n\}$
1: for $i = 1$ to $n$ do
2:   $S_{rel}(c_i) \leftarrow \mathrm{cosine\_similarity}(c_i, H_T)$  ▹ Relevance score
3:   $S_{cred}(c_i) \leftarrow \mathrm{CredibilityScorer}(c_i)$  ▹ Credibility score
4:   $S_{temp}(c_i) \leftarrow \exp\left(-\lambda \cdot (t_{current} - t_{comment}(c_i))\right)$  ▹ Temporal decay
5:   $s_i \leftarrow \alpha \cdot S_{rel}(c_i) + \beta \cdot S_{cred}(c_i) + \gamma \cdot S_{temp}(c_i)$  ▹ Weighted fusion
6: end for
7: Normalize $S_{importance}$ such that $\sum_{i=1}^{n} s_i = 1$
8: return $S_{importance}$
The final importance score is obtained through weighted linear combination and normalized to ensure that the sum of all comment weights equals 1. This design enables the model to automatically identify and focus on the most informative comments.
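The scoring procedure of Algorithm 2 can be sketched in a few lines; the small credibility scorer (for example, a linear layer over the comment embedding) and the default values of α, β, γ, and λ shown here are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def comment_importance(comment_vecs, text_vec, cred_scorer, comment_times, t_current,
                       alpha=0.4, beta=0.4, gamma=0.2, lam=0.1):
    """comment_vecs: (n, d); text_vec: (d,); comment_times, t_current: timestamps."""
    # Relevance: cosine similarity between each comment and the pooled news text
    s_rel = F.cosine_similarity(comment_vecs, text_vec.unsqueeze(0), dim=-1)
    # Credibility: learned scorer mapping each comment embedding to a scalar
    s_cred = cred_scorer(comment_vecs).squeeze(-1)
    # Temporal decay: more recent comments receive higher weight
    s_temp = torch.exp(-lam * (t_current - comment_times))
    scores = alpha * s_rel + beta * s_cred + gamma * s_temp
    return scores / scores.sum()   # normalize so that all weights sum to 1
```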

3.5.4. Aggregated Comment Representation

We compute the aggregated comment attention for text features:
$C_{agg} = \sum_i S_{importance}(c_i) \cdot c_i$
$F_{TC} = \mathrm{Concat}(H_T, C_{agg})$
This aggregation ensures that the most informative comments receive greater weight in the final representation. This process aggregates user comments based on their importance scores and integrates them with news text features. The weighted aggregation ensures that comments with higher credibility, relevance, and temporal importance contribute more significantly to the final multimodal representation, while the concatenation creates a unified feature vector that captures both original news content and filtered social feedback signals.

3.6. Fake News Detection

The detection framework achieves our third research objective through multi-level feature integration and interpretable prediction mechanisms that provide transparent decision-making processes.

3.6.1. Multi-Level Feature Integration

We combine features from different fusion paths to create a comprehensive multimodal representation that captures both content-based and social signal-based information.
The cross-modal attention features represent text–image fusion:
$F_{CMA} = \delta$
where $F_{CMA}$ represents the fused features from the cross-modal attention network, and $\delta$ denotes the integrated text–image representation that captures semantic correspondences between textual and visual content.
The comment–image fusion features capture visual information guided by user comments:
$F_{UCI} = F_{CI}$
where $F_{UCI}$ represents the User Comment Integration features for visual content, and $F_{CI}$ denotes the comment-informed visual features obtained from the comment-to-image attention mechanism.
The comment–text fusion features capture textual information guided by user comments:
$F_{UCT} = F_{TC}$
where $F_{UCT}$ represents the User Comment Integration features for textual content, and $F_{TC}$ denotes the comment-informed textual features obtained from the comment-to-text attention mechanism.
The global feature representation concatenates all fusion pathways:
$F_{global} = \mathrm{Concat}([F_{CMA}, F_{UCI}, F_{UCT}])$
where $F_{global}$ represents the comprehensive feature vector, Concat denotes the concatenation operation, and the bracketed terms represent the feature vectors from cross-modal attention, comment–image fusion, and comment–text fusion, respectively.

3.6.2. Multi-Branch Prediction

Both the cross-modal attention network (CMA) and User Comment Integration networks (UCI/UCT) predict probability distributions for fake news classification through separate prediction branches.
The primary prediction utilizes cross-modal content features:
$y_1 = \mathrm{Classifier}_{CMA}(F_{CMA})$
where $y_1$ represents the primary prediction probability distribution, $\mathrm{Classifier}_{CMA}$ denotes a specialized neural network classifier, and $F_{CMA}$ represents the cross-modal attention features that capture content-based evidence.
The secondary prediction leverages comment integration features:
$y_2 = \mathrm{Classifier}_{UCI\_UCT}\left(\mathrm{Concat}([F_{UCI}, F_{UCT}])\right)$
where $y_2$ represents the secondary prediction probability distribution, $\mathrm{Classifier}_{UCI\_UCT}$ denotes another specialized classifier, and the input combines both comment–image and comment–text fusion features to capture social signal-based evidence.
The final prediction combines both branches through weighted averaging:
$y_{final} = \lambda \cdot y_1 + (1 - \lambda) \cdot y_2$
where $y_{final}$ represents the final prediction probability distribution, $\lambda$ is a learnable weight parameter that balances the contributions of content-based and comment-based evidence, $y_1$ represents the content-focused prediction, and $y_2$ represents the social signal-focused prediction.
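A minimal sketch of this two-branch head is shown below; the hidden width of 256, the use of a sigmoid to keep the learnable weight λ in (0, 1), and the classifier layouts are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class FakeNewsDetector(nn.Module):
    """Two prediction branches combined by a learnable weight lambda (Section 3.6.2)."""
    def __init__(self, d_cma, d_uc, num_classes=2):
        super().__init__()
        self.cls_cma = nn.Sequential(nn.Linear(d_cma, 256), nn.ReLU(), nn.Linear(256, num_classes))
        self.cls_uc = nn.Sequential(nn.Linear(d_uc, 256), nn.ReLU(), nn.Linear(256, num_classes))
        self.lam = nn.Parameter(torch.tensor(0.0))   # learnable balance parameter

    def forward(self, F_CMA, F_UCI, F_UCT):
        y1 = torch.softmax(self.cls_cma(F_CMA), dim=-1)                             # content branch
        y2 = torch.softmax(self.cls_uc(torch.cat([F_UCI, F_UCT], dim=-1)), dim=-1)  # comment branch
        lam = torch.sigmoid(self.lam)                                               # constrain to (0, 1)
        return lam * y1 + (1 - lam) * y2, y1, y2                                    # y_final, y1, y2
```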

3.6.3. Training Objectives

We adopt multiple loss functions to optimize different aspects of the model:
Primary Classification Loss:
$L_{cls} = \mathrm{CrossEntropy}(y_{true}, y_{final})$
Auxiliary Classification Loss:
$L_{aux} = \mathrm{CrossEntropy}(y_{true}, y_1) + \mathrm{CrossEntropy}(y_{true}, y_2)$
Attention Regularization Loss:
$L_{att} = \|A_{CI}\|_F + \|A_{CT}\|_F$
Total Training Loss:
$L_{total} = L_{cls} + \alpha_{aux} \cdot L_{aux} + \alpha_{att} \cdot L_{att}$
where $\alpha_{aux}$ and $\alpha_{att}$ are hyperparameters controlling the relative importance of the auxiliary objectives.
Algorithm 3 implements a multi-objective training strategy that balances primary classification performance with auxiliary learning goals. The comprehensive loss function ensures that each model component learns effective representations independently while maintaining overall system coherence, leading to robust performance across different types of misinformation scenarios. The algorithm designs multi-level loss functions to achieve comprehensive model optimization:
Primary Classification Loss ($L_{cls}$): Uses cross-entropy loss to train the final binary classification prediction, which is the main optimization objective of the model and directly relates to the accuracy of fake news detection.
Auxiliary Classification Loss ($L_{aux}$): Provides supervised training for the cross-modal branch and the comment integration branch separately. This design ensures that each branch can independently learn effective feature representations, avoiding over-dependence of one branch on others and improving model robustness.
Attention Regularization Loss ($L_{att}$): Regularizes attention matrices through the Frobenius norm to prevent attention weights from being overly concentrated on a few positions, promoting the model to learn more balanced and generalizable attention distributions.
Algorithm 3 Multi-objective training algorithm.
Require: Ground truth labels $y_{true}$, primary prediction $y_1$, auxiliary prediction $y_2$, final prediction $y_{final}$, attention matrices $A_{CI}$, $A_{CT}$, hyperparameters $\alpha_{aux}$, $\alpha_{att}$
Ensure: Total loss $L_{total}$
1: // Primary Classification Loss
2: $L_{cls} \leftarrow \mathrm{CrossEntropy}(y_{true}, y_{final})$
3: // Auxiliary Classification Loss
4: $L_{aux1} \leftarrow \mathrm{CrossEntropy}(y_{true}, y_1)$  ▹ Cross-modal branch
5: $L_{aux2} \leftarrow \mathrm{CrossEntropy}(y_{true}, y_2)$  ▹ Comment integration branch
6: $L_{aux} \leftarrow L_{aux1} + L_{aux2}$
7: // Attention Regularization Loss
8: $L_{att\_CI} \leftarrow \|A_{CI}\|_F$  ▹ Comment–image attention regularization
9: $L_{att\_CT} \leftarrow \|A_{CT}\|_F$  ▹ Comment–text attention regularization
10: $L_{att} \leftarrow L_{att\_CI} + L_{att\_CT}$
11: // Total Loss Computation
12: $L_{total} \leftarrow L_{cls} + \alpha_{aux} \cdot L_{aux} + \alpha_{att} \cdot L_{att}$
13: return $L_{total}$
Through the weighted combination of these three types of losses, the algorithm achieves a balance between main-task performance, branch independence, and attention distribution quality. The hyperparameters $\alpha_{aux}$ and $\alpha_{att}$ allow adjustment of the relative importance of different loss terms according to specific datasets and task requirements, providing flexibility for model optimization.
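The total loss of Algorithm 3 can be computed as sketched below, assuming the branch outputs are probability distributions (so the negative log-likelihood of their logarithms plays the role of cross-entropy) and A_CI, A_CT are the cross-attention weight matrices; the default values of the two loss weights are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def total_loss(y_true, y1, y2, y_final, A_CI, A_CT, alpha_aux=0.5, alpha_att=0.01):
    eps = 1e-8
    L_cls = F.nll_loss(torch.log(y_final + eps), y_true)              # primary classification loss
    L_aux = F.nll_loss(torch.log(y1 + eps), y_true) + \
            F.nll_loss(torch.log(y2 + eps), y_true)                   # auxiliary branch losses
    L_att = torch.norm(A_CI, p="fro") + torch.norm(A_CT, p="fro")     # attention regularization
    return L_cls + alpha_aux * L_aux + alpha_att * L_att
```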

4. Materials and Methods

4.1. Datasets

We evaluate our UC-CMAF framework on multiple benchmark datasets to demonstrate its generalizability and robustness across different domains and contexts (Table 1).
ReCOVery Dataset: The ReCOVery dataset [63] contains 2029 COVID-19-related news articles collected from January 2020 to May 2020, comprising 665 unreliable (fake) and 1364 reliable (real) news articles. The dataset includes 1675 associated images and 140,820 user comments, making it ideal for evaluating multimodal fake news detection with social signals.
MMCoVaR Dataset: The MMCoVaR (Multimodal COVID-19 Vaccine focused Repository) dataset [64] comprises 2593 news articles (1635 fake and 958 real) with 2357 associated images. This dataset focuses specifically on COVID-19 vaccine misinformation and includes rich comment data for social signal analysis.

4.2. Implementation Details

To thoroughly evaluate the effectiveness of our UC-CMAF framework, we adopt a structured experimental pipeline. This includes data preprocessing, multimodal feature extraction from news content (text and image) and user comments, cross-modal fusion using attention mechanisms, and model training with evaluation on benchmark datasets. Each step is designed with reproducibility and clarity in mind, and detailed settings are provided to ensure transparent experimental procedures.
Our multimodal embedding method employs carefully selected pre-trained models and architectures for feature extraction and fusion. For text processing, we utilize the pre-trained BERT-base-uncased model to extract 768-dimensional text features, with maximum sequence lengths set to 512 tokens for news articles and 128 tokens for comments, using a vocabulary of 30,522 tokens. Visual processing adopts ResNet-50 for feature extraction, reducing the 2048-dimensional features to 768 dimensions through a 2D convolution to maintain dimensional consistency. Image preprocessing includes resizing to 224 × 224 pixels and normalization using ImageNet statistics, with data augmentation applied during training through random horizontal flipping and color jittering.
For comment processing, we select up to 50 comments per article for ReCOVery and up to 20 for MMCoVaR, padding with empty comments when necessary; comments are filtered to a minimum length of 10 characters and a maximum length of 280 characters, and their temporal order is preserved. User comments undergo a multi-step preprocessing pipeline: (1) removal of non-linguistic content such as URLs, emojis, and special characters; (2) filtering of comments that are too short or too long; (3) BERT-based tokenization and normalization; and (4) retention of timestamp metadata for use in the temporal importance weighting module. This ensures that only informative and high-quality comments are incorporated into the model.
The model is trained using the Adam optimizer with a learning rate of 0.001, a batch size of 64, and 200 epochs. These settings are selected via cross-validation to balance convergence speed and generalization, and Xavier initialization is applied to stabilize training dynamics. The fusion weight parameter λ for integrating user comments is set to 0.8 based on experimental validation.
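The comment preprocessing steps described above can be illustrated with the short filter below; the regular expressions and exact cleaning rules are assumptions consistent with the stated 10-280 character bounds and the removal of URLs, emojis, and special characters.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NON_LINGUISTIC_RE = re.compile(r"[^\w\s.,!?'-]")   # drops emojis and most special characters

def preprocess_comments(comments, min_len=10, max_len=280, max_comments=50):
    """comments: list of dicts with 'text' and 'timestamp' keys."""
    kept = []
    for c in comments:
        text = URL_RE.sub("", c["text"])
        text = NON_LINGUISTIC_RE.sub("", text).strip()
        if min_len <= len(text) <= max_len:               # length filtering
            kept.append({"text": text, "timestamp": c["timestamp"]})
    kept.sort(key=lambda c: c["timestamp"])               # preserve temporal order
    return kept[:max_comments]                            # cap at 50 (20 for MMCoVaR)
```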

4.3. Evaluation Metrics

We adopt comprehensive evaluation metrics to assess model performance across multiple dimensions: accuracy measures overall classification correctness; precision measures the proportion of predicted fake news items that are actually fake, which is important for minimizing false accusations; recall assesses sensitivity to actual fake news instances; and the F1 Score provides the harmonic mean of precision and recall to balance both metrics.
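For reference, the four metrics can be computed directly with scikit-learn; the labels below are toy values (1 = fake, 0 = real) used only to show the calls.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0]   # toy predictions
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```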

4.4. Baseline Methods

We compare UC-CMAF against state-of-the-art methods from different categories to demonstrate its effectiveness:
Traditional Machine Learning Approaches:
  • SVM-based [30]: Support Vector Machine utilizing TF-IDF features for text-based classification.
  • Random Forest [25]: Ensemble method employing handcrafted linguistic and statistical features.
Deep Learning Baselines:
  • BERT-only [65]: Fine-tuned BERT model for text classification without multimodal integration.
  • ResNet-only [66]: CNN-based image classification using ResNet-50 architecture.
  • Simple Concatenation [28]: Basic feature concatenation without attention mechanisms or sophisticated fusion strategies.
Multimodal Methods:
  • HSA [35]: Hierarchical Social Attention for rumor detection through social networks and user comments.
  • ExFaux [67]: A weakly supervised approach to explainable fauxtography detection.
  • dEFEND [39]: Explainable fake news detection leveraging news–comment correlations with co-attention mechanisms.
  • EANN [15]: Event Adversarial Neural Network combining adversarial training with multimodal feature extraction for cross-event generalization.
  • BTIC [68]: Supervised contrastive learning for multimodal unreliable news detection in COVID-19 pandemic.
  • SAFE [26]: Similarity-Aware multimodal Fake nEws detection with cross-modal attention.
  • HMCAN [69]: Hierarchical multimodal contextual attention network for fake news detection.
  • SpotFake [10]: Framework leveraging pre-trained BERT and VGG-19 for Tweet verification.
  • MVAE [16]: Multimodal Variational AutoEncoder generating representations through dual-modal VAE.
  • DGExplain [40]: A duo-generative approach to explainable multimodal COVID-19 misinformation detection.

5. Results and Discussion

5.1. Overall Performance Comparison

Our comprehensive evaluation demonstrates that UC-CMAF achieves superior performance across multiple dimensions while maintaining high interpretability standards.
Table 2 and Figure 2 present comprehensive performance comparisons between our proposed UC-CMAF and various baseline methods on both the ReCOVery and MMCoVaR datasets, with evaluations using precision, recall, and F1 Score metrics across 15 representative multimodal fake news detection methods. UC-CMAF consistently achieves the best performance across all evaluation metrics, demonstrating its superiority in multimodal fake news detection. Specifically, on the ReCOVery dataset, UC-CMAF achieves an accuracy of 0.927, precision of 0.916, recall of 0.872, and an F1 Score of 0.894, outperforming all other models and surpassing the strongest baseline DGExplain (F1: 0.873) by 2.1%. On the more challenging MMCoVaR dataset, UC-CMAF further improves its performance with the highest precision of 0.939, recall of 0.881, and F1 Score of 0.909, exceeding DGExplain (F1: 0.881) by a margin of 2.8%. In contrast, traditional machine learning methods such as SVM-Based and Random Forest perform significantly worse with F1 Scores of only 0.679/0.700 and 0.723/0.736, respectively, while single-modal methods like BERT-only show relatively stable results (F1: 0.803 and 0.817) but still fall short of UC-CMAF, and unimodal baselines such as ResNet-only achieve even lower F1 Scores (0.661 on ReCOVery; 0.674 on MMCoVaR), underscoring the limitations of relying solely on either text or image modalities. Some multimodal methods such as dEFEND and EANN (F1: 0.823/0.838 and 0.824/0.814, respectively) also underperform in comparison, and several approaches including BTIC, SAFE, and ExFaux exhibit noticeable fluctuations in performance across datasets, suggesting limited generalizability. UC-CMAF maintains strong and consistent performance on both datasets, validating its robustness and ability to generalize across different scenarios with different semantic and stylistic characteristics. These findings confirm that UC-CMAF’s unified cross-modal attention mechanism enables it to effectively model fine-grained interactions between textual and visual modalities, leading to superior detection performance in complex multimodal fake news detection tasks, with the visual annotations clearly demonstrating the consistent superiority of the UC-CMAF approach over all baselines.
Figure 3 presents the accuracy and F1 Score trends of 15 representative models on the ReCOVery and MMCoVaR datasets. Overall, models that leverage multimodal fusion strategies consistently outperform unimodal baselines such as BERT-only and ResNet-only, demonstrating the importance of integrating textual and visual cues in fake news detection. On both datasets, the proposed UC-CMAF model shows clear superiority over all other methods. Specifically, UC-CMAF achieves the highest accuracy and F1 score on both datasets, with values of 0.927 and 0.894 on ReCOVery, and 0.931 and 0.909 on MMCoVaR, respectively. These results indicate not only strong predictive power but also robust generalization across different distributions of misinformation. Close competitors such as DGExplain, MVAE, and dEFEND also exhibit relatively strong performance, but their F1 Scores remain 2–3% lower than that of UC-CMAF, underscoring the effectiveness of our proposed cross-modal attention fusion strategy. Additionally, traditional machine learning approaches like SVM-Based and Random Forest display limited performance, further validating the advantages of deep, modality-aware architectures.
Figure 4 provides a comprehensive visualization of the performance metrics for 15 representative models across the ReCOVery and MMCoVaR datasets. Each heatmap illustrates four standard evaluation metrics (accuracy, precision, recall, and F1 Score) for every method. On the ReCOVery dataset, our proposed UC-CMAF model clearly outperforms all baselines across all metrics, with an accuracy of 0.927, precision of 0.916, recall of 0.872, and an F1 Score of 0.894. This demonstrates the model’s ability to effectively capture fine-grained cross-modal interactions and maintain semantic consistency in multimodal news content. Similarly, on the MMCoVaR dataset, UC-CMAF once again leads in every metric, achieving an accuracy of 0.931, precision of 0.939, recall of 0.881, and an F1 Score of 0.909. The consistently high scores across datasets with different characteristics confirm the model’s strong generalization capability. By contrast, traditional machine learning approaches such as SVM-Based and Random Forest perform noticeably worse, particularly in recall and F1 Score. Unimodal baselines like BERT-only and ResNet-only also lag behind, reaffirming the necessity of multimodal fusion. Mid-tier multimodal methods like DGExplain and dEFEND perform competitively but still fall short of UC-CMAF, especially in recall on MMCoVaR. Overall, these heatmaps highlight UC-CMAF’s superior performance and robustness in multimodal fake news detection.
These empirical improvements across all evaluation metrics confirm that the proposed adaptive fusion mechanism effectively captures and leverages complementary signals from text, images, and user comments.

5.2. Analysis of UC-CMAF Components

-CMA (w/o Cross-Modal Attention): This component is the core mechanism connecting information across modalities, learning the interactive relationships among text, images, and user comments. The cross-modal attention mechanism dynamically fuses multimodal features through attention weights, enabling the model to capture semantic associations across modalities and to identify consistencies and contradictions between information sources, thereby providing a crucial basis for fake news judgments (a minimal sketch of this mechanism, together with the adaptive weighting described below, is given after this list).
-UCI (w/o Comment–Image Fusion): This component specifically handles the fusion process between user comments and image content. Through deep learning techniques, it combines users’ textual comments with visual information, capturing users’ emotional tendencies and semantic understanding of image content. This fusion mechanism can identify users’ judgments about image authenticity in their comments, as well as the degree of matching between images and comment content.
-UCT (w/o Comment–Text Fusion): This component is responsible for effectively fusing user comments with original text content. Through contrastive learning or attention mechanisms, it identifies consistency, contradictions, or complementary information between user comments and the main text, mining users’ collective judgment on text content credibility and providing verification information from the user perspective to the model.
-Comments (w/o User Comments): This component collects and processes feedback information from users, which typically contains rich subjective judgments and emotional expressions. The user comment module can capture collective wisdom in social media environments, identifying users’ questioning, support, or neutral attitudes toward content authenticity, providing important social verification signals for the model.
-Images (Text + Comments Only): This component is responsible for processing and analyzing visual content, extracting image features and identifying visual clues that may indicate misleading or false information. The image module analyzes technical characteristics, semantic content, and potential tampering traces in images through computer vision techniques, providing visual evidence support for multimodal fake news detection.
-Adaptive Weights: This component dynamically adjusts the importance weights of various modalities and components according to the characteristics of different samples, implementing personalized multimodal information fusion strategies. The adaptive weight mechanism can automatically optimize the contribution of different information sources based on input data characteristics, improving the model’s adaptability and accuracy when facing different types of fake news.
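To make the fusion pipeline concrete, the following is a minimal, standalone PyTorch sketch of the two mechanisms most central to this ablation: cross-modal attention and adaptive weighting. The feature dimensions, the single attention head, the softmax gate, and all class and function names are our own illustrative assumptions rather than the exact UC-CMAF implementation.

```python
# Illustrative sketch only: scaled dot-product cross-modal attention plus a
# sample-dependent softmax gate for adaptive fusion. Not the UC-CMAF codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Text tokens attend over image regions (single-head, scaled dot-product)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text: (n_tokens, dim), image: (m_regions, dim)
        scores = self.q(text) @ self.k(image).T / (text.size(-1) ** 0.5)  # (n, m)
        attn = scores.softmax(dim=-1)   # the (n, m) matrix that can be drawn as a heatmap
        return attn @ self.v(image)     # text features enriched with visual context

class AdaptiveFusion(nn.Module):
    """Softmax gate producing per-sample weights over modality-specific vectors."""
    def __init__(self, dim: int = 256, n_sources: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_sources, n_sources)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of pooled (dim,) vectors, e.g. [text-image, comment-text, comment-image]
        w = F.softmax(self.gate(torch.cat(feats)), dim=-1)   # adaptive weights
        return sum(w[i] * f for i, f in enumerate(feats))    # weighted fusion vector
```

In this sketch the attention matrix plays the role of the visualized cross-modal heatmap discussed in Section 5.7, and the gate weights correspond to the component ablated as "-Adaptive Weights".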
The results of the ablation study (Table 3) show that the complete model performs strongly on both datasets, with results on MMCoVaR (accuracy: 0.931; F1 Score: 0.909) slightly exceeding those on ReCOVery (accuracy: 0.927; F1 Score: 0.894) across all metrics, demonstrating the effectiveness of the architectural design. Analyzing component importance reveals that the user comment module is the most critical on both datasets: its removal causes the largest decline in F1 Score (approximately 0.050 on both ReCOVery and MMCoVaR), indicating that user comments contain vital information that traditional text and image analysis struggles to capture, and underscoring the value of subjective judgment and collective intelligence in detecting misinformation. The cross-modal attention mechanism and the comment-fusion components (UCI/UCT) rank second in importance, highlighting the limitations of unimodal information and demonstrating that effective integration across modalities is essential to the model's success. In contrast, removing the image module has a smaller but non-negligible impact (F1 Score drops by approximately 0.02), suggesting that images play a supplementary, validating role by providing visual cues that are not expressed in text. The adaptive weighting mechanism has the smallest impact when removed, yet it still contributes a stable performance gain, reflecting the value of dynamic weight adjustment in optimizing information fusion. Overall, performance on MMCoVaR is more robust across all ablation settings, whereas performance on ReCOVery depends more strongly on the key components; the pronounced reliance on user comments in both cases highlights the critical role of user feedback in misinformation detection within social media environments.
Figure 5 presents the impact of module ablations on model performance across the ReCOVery and MMCoVaR datasets. Figure 5a shows the change in accuracy when individual components are removed: removing the cross-modal attention module (-CMA) or the user comments module (-Comments) causes substantial accuracy drops, highlighting their importance for capturing multimodal associations and integrating user feedback, whereas removing the adaptive weight component has a relatively minor impact on accuracy. Figure 5b illustrates the corresponding F1 Score variations under the same ablation settings. The full model achieves the highest F1 Score, and removing the CMA, UCI, or comment modules causes notable declines, reaffirming their critical roles in fake news detection; the degradation is especially pronounced when the cross-modal attention (-CMA) module is removed, indicating its pivotal contribution to balancing precision and recall in the final predictions.
Figure 6 presents radar charts illustrating the ablation study results on the ReCOVery and MMCoVaR datasets, clearly validating the effectiveness and synergy of the proposed model components. The full model consistently achieves the highest scores across all four metrics—accuracy, precision, recall, and F1—demonstrating the overall advantage of our multimodal fusion strategy. Notably, removing the cross-modal attention (CMA) component leads to a significant performance drop, indicating its crucial role in modeling the interactions among text, images, and user comments. Similarly, the exclusion of the user comment module results in considerable degradation, highlighting the importance of leveraging social feedback signals in fake news detection. Furthermore, the removal of the comment–image (UCI) and comment–text (UCT) fusion components also negatively impacts performance, underscoring the value of semantic alignment between user perspectives and content. Overall, the proposed model exhibits strong robustness and accuracy in detecting fake news, benefiting from fine-grained multimodal integration and an adaptive weighting mechanism tailored to different input scenarios.
The ablation study clearly demonstrates that each component of UC-CMAF, particularly the user comment module and cross-modal attention, plays a vital role in achieving high detection accuracy, thereby justifying their inclusion in the final architecture.

5.3. Attention Mechanism Analysis

To validate the effectiveness and rationale behind our attention mechanism design, we compare several attention variants and analyze the trade-off between performance and computational complexity through visualizations. As illustrated in Figure 7 and Table 4, the F1 Scores on both ReCOVery and MMCoVaR improve consistently from no attention to single-head, multi-head, and cross-modal attention, with the highest scores achieved by our proposed adaptive attention. The performance–complexity scatter plot shows that although adaptive attention has a computational complexity of O(nm), it clearly outperforms the other variants in average F1 Score, achieving the best balance between effectiveness and efficiency. The bar–line combination plot presents both the performance metrics and the encoded complexity levels, further highlighting that adaptive attention offers superior performance without incurring overhead beyond that of standard cross-modal attention. These results collectively confirm the advantages of our design and the practical value of integrating adaptive attention in our architecture.

5.4. Weight Parameter Sensitivity

As shown in Figure 8, the introduction of the weight λ significantly improves model performance, particularly in terms of accuracy and F1 Score. Observing the results at different λ values, performance on both ReCOVery and MMCoVaR improves markedly as λ increases from 0 to 0.8. At λ = 0.8, the model achieves its highest accuracy and F1 Score on both datasets, and performance remains relatively stable in this region. Furthermore, the model generalizes more strongly on MMCoVaR at higher λ values: at λ = 0.8 its accuracy reaches 0.931 and its F1 Score reaches 0.909, exceeding the corresponding results on ReCOVery. Based on these results, we set λ to approximately 0.8, which maintains high performance while avoiding the degradation observed when the parameter is set too high.
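For reference, a typical way such a trade-off weight enters a fusion model is as a convex combination of two representations. The sketch below is illustrative only; the precise quantity that λ weights in UC-CMAF is the one defined in our methodology, and the symbols for the comment-guided and news (text–image) representations here are our own placeholders, not the paper's notation:

\[
\mathbf{h}_{\text{fused}} \;=\; \lambda\,\mathbf{h}_{\text{comment}} \;+\; (1-\lambda)\,\mathbf{h}_{\text{news}}, \qquad \lambda \in [0, 1],
\]

so that λ = 0 ignores the comment-guided pathway entirely, while λ = 0.8 lets it dominate the fused representation yet retains a residual contribution from the text–image features.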

5.5. Impact of Comment Selection Strategy

Figure 9 and Table 5 illustrate the impact of different comment selection strategies on model accuracy and processing time. The Importance Score strategy achieves the highest accuracy on both ReCOVery (92.7%) and MMCoVaR (93.1%), demonstrating the effectiveness of the hierarchical importance mechanism. Although this approach incurs the longest processing time (203 ms), the trade-off is justified by its superior predictive performance. In contrast, simpler strategies such as random or time-order selection process comments faster but yield lower accuracy. These results show that intelligent comment selection is key to balancing detection performance against computational cost in misinformation detection tasks: our hierarchical importance scoring mechanism provides the best performance at a modest increase in processing time.
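To make the selection step concrete, the following is a minimal sketch of top-k comment selection driven by an importance score. The scoring features (relevance, engagement, length) and their weights are illustrative assumptions, not the exact hierarchical scorer used in UC-CMAF.

```python
# Hedged sketch of top-k comment selection by an importance score.
# The feature set and weights below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Comment:
    text: str
    likes: int
    relevance: float  # e.g., cosine similarity between comment and news embeddings

def importance_score(c: Comment, w_rel: float = 0.6, w_eng: float = 0.3, w_len: float = 0.1) -> float:
    """Weighted combination of relevance, engagement, and length cues."""
    engagement = min(c.likes / 100.0, 1.0)          # clip engagement to [0, 1]
    length = min(len(c.text.split()) / 50.0, 1.0)   # reward longer comments up to a cap
    return w_rel * c.relevance + w_eng * engagement + w_len * length

def select_comments(comments: list, k: int = 50) -> list:
    """Keep the k highest-scoring comments for downstream fusion."""
    return sorted(comments, key=importance_score, reverse=True)[:k]
```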

5.6. Impact of Number of Comments

Figure 10 illustrates how accuracy on ReCOVery and training time evolve as the number of comments incorporated per article increases. Accuracy improves steadily from 0.918 with 10 comments to 0.928 with 100 comments, indicating that additional user feedback contributes positively to model performance. However, training time also increases significantly, especially beyond 50 comments, where it grows from 3.2 to 4.8 h. This suggests a nonlinear relationship between comment volume and computational cost.
Figure 11 provides a more comprehensive comparison across both the ReCOVery and MMCoVaR datasets. Accuracy on MMCoVaR is consistently higher at every comment count, and the gap widens as more comments are incorporated, indicating that the model leverages the additional comment information more effectively on this dataset. In terms of resource consumption, both memory usage and training time increase with the number of comments. Nonetheless, the overall resource usage remains within acceptable bounds, indicating a favorable balance between accuracy and computational efficiency.
To evaluate the practical deployment feasibility of UC-CMAF, we conducted comprehensive computational cost analysis covering both training and inference scenarios (Table 6).
Inference Performance Metrics: Our full UC-CMAF model with 50 comments requires an average inference time of 203 ms per sample. The memory footprint during inference reaches 2.9 GB, primarily allocated for storing comment embeddings and attention matrices across multiple fusion components.
Baseline Computational Comparison: While UC-CMAF introduces computational overhead compared to simpler baselines (BERT-only: 45 ms; Simple Concatenation: 78 ms), this additional cost is justified by substantial performance improvements. The 2.6× increase in inference time corresponds to a 9.1% improvement in F1-score on ReCOVery dataset, demonstrating favorable cost–performance trade-offs.
Scalability Bottleneck Analysis: Profiling analysis reveals that cross-modal attention computation constitutes the primary computational bottleneck, accounting for approximately 60% of total inference time. The comment–image fusion (UCI) and comment–text fusion (UCT) modules contribute 35% and 25% respectively to the overall computational load. Memory consumption scales linearly with comment volume, with each additional comment requiring approximately 58 MB.
Deployment Optimization Strategies: For resource-constrained environments, we recommend adaptive comment sampling (reducing to 20 comments) which maintains 99.7% of full model accuracy while achieving 30% speed improvement. Additionally, batch processing techniques can improve throughput by 40% for large-scale deployment scenarios.
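As a concrete illustration of these two optimizations, the sketch below shows adaptive comment sampling to a fixed budget and simple batched inference. The function names, data layout, and model interface are assumptions for illustration rather than UC-CMAF's actual deployment code.

```python
# Hedged sketch of the deployment optimizations discussed above:
# (1) adaptive comment sampling down to a budget, (2) batched inference.
import torch

def sample_comments(scored_comments, budget: int = 20):
    """Adaptive sampling: keep only the top-`budget` comments by importance."""
    return sorted(scored_comments, key=lambda c: c["importance"], reverse=True)[:budget]

@torch.no_grad()
def batched_predict(model, samples, batch_size: int = 16, device: str = "cpu"):
    """Group samples into mini-batches to raise throughput at deployment time."""
    model.eval()
    preds = []
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        # assumes each sample carries a pre-collated fused feature tensor
        feats = torch.stack([s["features"] for s in batch]).to(device)
        preds.extend(model(feats).argmax(dim=-1).tolist())
    return preds
```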

5.7. Attention Visualization

We analyze the semantic alignment between text and images using the cross-modal attention (CMA) module. As shown in Figure 12, the model significantly focuses on the “data chart” region of the image when processing text tokens related to semantics such as “vaccine”, “efficacy”, and “95%”. The highest attention weight reaches 0.90, indicating that the model effectively associates quantitative information like vaccine efficacy with the data visualization in the image. Simultaneously, for words like “clinical” and “trial”, the model mainly focuses on the “lab scene” region, with attention weights of 0.80 and 0.70, respectively, further validating the model’s effectiveness in understanding medical experiment scenarios. Overall, this module demonstrates strong cross-modal semantic alignment capabilities, which help enhance the model’s performance in medical text–image tasks.
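For readers who wish to reproduce this kind of interpretability output, the following is a minimal sketch of rendering a token-to-region attention heatmap such as Figure 12. Only the "data chart" and "lab scene" regions and the reported weights of 0.90, 0.80, and 0.70 come from the analysis above; the remaining region labels and numeric values are illustrative placeholders.

```python
# Hedged sketch: plotting a cross-modal attention heatmap with matplotlib.
# Region labels beyond "data chart"/"lab scene" and most weights are placeholders.
import numpy as np
import matplotlib.pyplot as plt

tokens = ["vaccine", "efficacy", "95%", "clinical", "trial"]
regions = ["data chart", "lab scene", "speaker", "background"]
attn = np.array([
    [0.90, 0.05, 0.03, 0.02],   # reported peak: vaccine-related tokens -> data chart
    [0.85, 0.08, 0.04, 0.03],
    [0.88, 0.06, 0.03, 0.03],
    [0.10, 0.80, 0.06, 0.04],   # reported: "clinical" -> lab scene (0.80)
    [0.15, 0.70, 0.09, 0.06],   # reported: "trial" -> lab scene (0.70)
])

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(attn, cmap="viridis", vmin=0.0, vmax=1.0)
ax.set_xticks(range(len(regions)), labels=regions, rotation=30, ha="right")
ax.set_yticks(range(len(tokens)), labels=tokens)
ax.set_xlabel("Image regions")
ax.set_ylabel("Text tokens")
fig.colorbar(im, ax=ax, label="Attention weight")
fig.tight_layout()
plt.savefig("cma_heatmap.png", dpi=200)
```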
As shown in Figure 13, UC-CMAF assigns high weights of 0.75 , 0.80 , and 0.70 to Evidence-Doubt, Procedure-Doubt, and Judgment-Support on their logically corresponding dimensions of “Evidence Sufficiency”, “Procedure Legitimacy”, and “Scientific Accuracy”, respectively, while cross-dimension weights remain between 0.05 and 0.15 , indicating effective suppression of semantic interference. At the same time, Emotional-Tendency distributes evenly across all four dimensions with weights between 0.20 and 0.30 (with 0.30 on both “Evidence Sufficiency” and “Scientific Accuracy”), ensuring comprehensive coverage of emotional context. In the stacked bar chart on the right, 95% of the weight for the “Evidence Sufficiency” dimension is contributed by Evidence-Doubt ( 0.75 ) and Emotional-Tendency ( 0.30 ); for “Procedure Legitimacy”, Procedure-Doubt ( 0.80 ) and Emotional-Tendency ( 0.20 ) comprise the entire weight; “Scientific Accuracy” is dominated by Judgment-Support ( 0.70 ) and Emotional-Tendency ( 0.30 ); and “Data Reliability” is primarily composed of Emotional-Tendency ( 0.20 ) and Judgment-Support ( 0.15 ), with other categories contributing less than 0.10 . This distribution both validates UC-CMAF’s high discriminative power in fine-grained intent classification and demonstrates its ability to deeply parse subjective emotional cues.
As shown in the attention alignment in Figure 14, the UCT module demonstrates highly focused comment–text alignment: Comment 1 places a weight of 0.75 on “efficacy” (all other tokens receive weights below 0.10 ); Comment 2 attends most strongly to “95%” with 0.70 , and retains minor attention on “efficacy” ( 0.15 ) and “phase 3” ( 0.05 ); Comment 3 concentrates on “phase 3” with 0.60 , while assigning 0.10 each to “clinical” and “efficacy”. Correspondingly, UC-CMAF assigns the highest importance score of 0.92 to Comment 2, with Comment 1 and Comment 3 scoring 0.85 and 0.78 , respectively, perfectly mirroring the UCT attention distributions. These results demonstrate that UCT can precisely align user comments with key textual terms, and that UC-CMAF can effectively discriminate comment priority via quantified importance, validating the model’s alignment accuracy and discriminative power in multi-comment, multi-text scenarios.

5.8. Robustness Analysis

To comprehensively evaluate the stability and reliability of our model in practical applications, we designed robustness experiments along two dimensions: noise tolerance and adversarial attack resistance. The noise tolerance test applies four typical noise types (text synthesis, image Gaussian blur, comment injection, and combined noise) and measures performance degradation as the noise intensity increases from 10 dB to 50 dB. The adversarial attack test covers four attack strategies (text perturbation, image adversarial, comment manipulation, and multimodal attack) and validates the effectiveness of the UC-CMAF defense mechanism.
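To illustrate how such controlled corruption can be generated, the sketch below injects Gaussian noise at a level specified on a decibel scale. The exact corruption procedures for text synthesis, comment injection, and combined noise, as well as the dB convention used in our experiments, are defined by the experimental setup; the formula here, which reads the dB value as a signal-to-noise ratio, is an assumption for illustration only.

```python
# Hedged sketch: Gaussian corruption at a target level on a dB scale.
# Assumes the dB value is a signal-to-noise ratio; the paper's convention may differ.
import numpy as np

def add_noise_at_snr(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) == snr_db."""
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

# Example: corrupt a feature vector at the four intensities used in Table 7.
feats = np.random.default_rng(1).normal(size=512)
noisy_versions = {db: add_noise_at_snr(feats, db) for db in (10, 20, 30, 50)}
```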
(1) Noise Tolerance Evaluation
The results presented in Table 7 demonstrate that the system is robust to all four noise types, maintaining satisfactory performance even as the noise intensity increases. Under the most severe 50 dB condition, image Gaussian blur has the smallest impact (only 4.7% degradation), while combined noise has the largest effect (6.4% degradation), indicating that superimposed multi-source noise harms system performance most. Overall, performance degradation under all noise types remains within acceptable ranges, showing that the model possesses good anti-interference capability.
(2) Adversarial Attack Resistance
UC-CMAF demonstrates strong resistance against adversarial attacks, in large part because attacks on a single modality can be identified through inconsistency with the remaining modalities. The experimental results presented in Table 8 show that image adversarial attacks have the lowest original success rate (8.7%) and the best defense effectiveness, with system accuracy maintained at 91.3% after defense. In contrast, coordinated multimodal attacks are the most challenging, with an original success rate of 18.9%, but the UC-CMAF defense mechanism still keeps their impact within an acceptable range, achieving 81.1% accuracy after defense.
To comprehensively evaluate the robustness of the model under complex environmental conditions, Figure 15a, Figure 15b, and Figure 15c respectively illustrate the model’s accuracy under varying noise intensities, performance degradation under different adversarial attack strategies, and a heatmap showing key robustness metrics for each attack type.
Figure 15a presents the accuracy trend under different types of noise, with the x-axis representing noise intensity (in dB) and the y-axis representing classification accuracy. It can be observed that as the noise intensity increases, the model performance degrades gradually, but the overall decline is relatively small, indicating strong noise robustness. Among the noise types, the model shows the highest tolerance to image Gaussian blur, while combined noise (multi-source) has the most significant impact.
Figure 15b displays, in the form of a stacked bar chart, the attack success rate, accuracy drop, and defense effectiveness under four types of adversarial attacks. The results show that the model performs most robustly against image-based adversarial attacks, maintaining over 91% accuracy after defense. Although multimodal attacks have the highest success rate (18.9%), the UC-CMAF defense mechanism still effectively limits performance degradation, achieving a defense effectiveness of 81.1%.
Figure 15c shows a heatmap of the relationships between attack types and key robustness metrics. The color intensity indicates the metric value, with darker shades representing better performance. It is evident that different attack strategies affect system performance differently: image-based attacks show the best robustness, while multimodal attacks pose the greatest challenge. The heatmap confirms the trends observed in the previous figures and further highlights the UC-CMAF model’s potential in defending against complex adversarial threats.

6. Conclusions

6.1. Summary and Conclusions

This paper introduces UC-CMAF (User Comment-Guided Cross-Modal Attention Fusion), a novel framework designed for interpretable multimodal fake news detection in the digital era. Our primary contribution lies in developing a pioneering fusion architecture that effectively integrates news text, images, and user comments through adaptive cross-modal attention mechanisms. Critically, we treat user comments as a primary modality rather than auxiliary data, recognizing their essential role in understanding the broader social context of information dissemination.
The experimental evaluation demonstrates UC-CMAF’s superior performance across multiple benchmark datasets. Our framework achieves remarkable results with 92.7% accuracy and 89.4% F1 Score on the ReCOVery dataset, and 93.1% accuracy with 90.9% F1 Score on MMCoVaR, significantly outperforming existing state-of-the-art methods. Beyond performance metrics, UC-CMAF addresses the critical demand for explainable AI through comprehensive multi-head attention heatmaps and comment importance visualizations, while maintaining robust performance against adversarial attacks with defense effectiveness rates ranging from 81.1% to 91.3%.
Our comprehensive ablation studies provide compelling evidence that user comments contain unique contextual and social information that extends far beyond traditional content-based analysis capabilities. This finding validates our architectural decision to elevate user comments to primary modality status and underscores the importance of incorporating social validation signals into misinformation detection systems. UC-CMAF’s consistent superior performance across diverse datasets, coupled with its robust interpretability mechanisms and strong adversarial resistance, represents a significant advancement in the field of multimodal fake news detection.
The framework successfully demonstrates how AI systems can effectively combine rigorous content analysis with social validation while maintaining comprehensive interpretability—a critical requirement for deployment in real-world scenarios where trust and transparency are paramount. This work establishes several important directions for future research: (1) development of real-time processing capabilities for immediate misinformation detection, (2) cross-platform generalization to handle diverse social media environments, and (3) implementation of adaptive mechanisms capable of responding to emerging manipulation techniques.
By successfully integrating social context with sophisticated multimodal analysis, UC-CMAF offers a new paradigm for developing robust, explainable verification systems capable of combating increasingly sophisticated misinformation campaigns. As digital misinformation continues to evolve in complexity and scale, frameworks like UC-CMAF provide essential foundations for maintaining information integrity while preserving the transparency and accountability necessary for fostering public trust in AI-assisted fact-checking systems.

6.2. Limitations and Future Directions

While UC-CMAF demonstrates strong performance across multiple datasets, several limitations warrant discussion for future research directions.
Both ReCOVery and MMCoVaR are centered around COVID-19-related misinformation, which may introduce topic-specific linguistic, visual, and social biases. For instance, common text tokens (e.g., “vaccine,” “clinical trial”) and visual patterns (e.g., medical diagrams or bar charts) may dominate feature learning, while user comments tend to reflect highly polarized or emotionally charged discourse. As a result, the model may learn domain-specific representations that do not generalize well to topics such as politics, finance, or AI-generated content. However, the modular design of UC-CMAF—including separate CMA, UCI, and UCT components—allows for efficient adaptation to new domains via fine-tuning without altering the core architecture. Future work should evaluate the framework on a broader range of datasets, such as PolitiFact or GossipCop, to systematically assess its robustness across diverse semantic and visual distributions.
Comment Availability Dependency: Our framework’s effectiveness is fundamentally tied to the presence and quality of user comments. In scenarios where comments are sparse—such as newly published articles, niche topics, or platforms with limited user engagement—the model’s performance may degrade as it relies more heavily on traditional text–image analysis. Future work should explore hybrid approaches that can gracefully handle comment-scarce environments.
Cross-linguistic and Cross-platform Adaptability: The current evaluation focuses primarily on English-language content from specific social media platforms. Extending to multilingual scenarios presents challenges, including (1) linguistic variations in commenting behavior across cultures, (2) different misinformation propagation patterns in various linguistic communities, and (3) platform-specific user interaction mechanisms. Future research should investigate transfer learning approaches and culturally aware adaptation strategies.
Temporal Dynamics: Our model treats comment collections as relatively static, but real-world comment threads evolve continuously. The temporal importance scoring mechanism may need enhancement to better capture the dynamic nature of evolving misinformation narratives and user response patterns.
Computational Scalability: As demonstrated in our analysis, processing large numbers of comments significantly increases computational overhead. While our importance scoring helps mitigate this issue, real-time deployment scenarios may require further optimization strategies or adaptive sampling techniques.
Adversarial Robustness: Although our framework shows resistance to various attack types, sophisticated adversaries might develop coordinated comment manipulation strategies specifically targeting our importance scoring mechanism, necessitating ongoing defensive improvements.
Multimodal Perception and Social Signal Integration: Although UC-CMAF is evaluated empirically, its design is inspired by cognitive principles of multimodal perception and social reasoning. Human perception tends to integrate semantically consistent signals across modalities while filtering out irrelevant inputs—an idea reflected in our bidirectional adaptive attention mechanism, which selectively enhances aligned text–image regions and dynamically adjusts fusion weights based on contextual relevance. In addition, our use of user comments draws on the notion that social discourse conveys distributed credibility signals; comments often express skepticism, support, or counter-evidence that complement or challenge the source content. By leveraging these social cues through dedicated attention pathways, UC-CMAF models a form of collective reasoning, enhancing its effectiveness in detecting multimodal misinformation.
These limitations provide clear directions for future research, including real-time processing optimization, multilingual adaptation, and enhanced adversarial robustness mechanisms.

Author Contributions

Conceptualization, Z.Y., C.T. and S.L.; methodology, Z.Y., C.T. and S.L.; software, Z.Y. and C.T.; validation, Z.Y., C.T. and S.L.; formal analysis, Z.Y. and C.T.; investigation, Z.Y., C.T. and S.L.; resources, S.L. and C.T.; data curation, Z.Y., C.T. and S.L.; writing—original draft preparation, Z.Y., C.T. and S.L.; writing—review and editing, Z.Y., C.T. and S.L.; visualization, Z.Y. and C.T.; supervision, S.L.; project administration, S.L. and C.T.; funding acquisition, S.L. and C.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Research Plan of Hubei Province under Grant/Award No. 2023BAA027 and the project of Science, Technology and Innovation Commission of Shenzhen Municipality of China under Grant No. GJHZ20240218114659027.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We would like to thank Chengfu Sun for his support of this article, as well as the expert evaluators, study participants, and reviewers whose feedback has significantly improved this work.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Bondielli, A.; Marcelloni, F. A survey on fake news and rumour detection. Inf. Sci. 2019, 497, 38–55. [Google Scholar] [CrossRef]
  2. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  3. Zhang, J.; Dong, B.; Yu, P.S. FakeLocator: Robust localization of fake news via semantic analysis. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; ACM: New York, NY, USA, 2020; pp. 2371–2381. [Google Scholar]
  4. Zhou, X.; Zafarani, R. A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv. 2020, 53, 109. [Google Scholar] [CrossRef]
  5. Aslam, N.; Ullah, I.; Ullah, A.; Rahman, F.; Jahan, F.; Jaffar, A.; Alrashoud, M. Deep learning and fusion mechanism-based multimodal fake news detection methodologies: A review. Eng. Technol. Appl. Sci. Res. 2022, 12, 8892–8900. [Google Scholar]
  6. Wu, B.; Luo, Q.; Jing, C.; Zhang, W.; Zhou, J. Deep learning for fake news detection: A comprehensive survey. Comput. Commun. 2022, 189, 160–172. [Google Scholar]
  7. Zhang, C.; Gupta, A.; Kamath, C.; Deodhare, D.; Bhatia, S. Detecting fake news for reducing misinformation risks using analytics approaches. Eur. J. Oper. Res. 2019, 279, 1036–1052. [Google Scholar] [CrossRef]
  8. Yang, Y.; Zheng, L.; Zhang, J.; Cui, Q.; Li, Z.; Yu, P.S. TI-CNN: Convolutional neural networks for fake news detection. arXiv 2018, arXiv:1806.00749. [Google Scholar]
  9. Guo, Y.; Ge, H.; Li, J. A two-branch multimodal fake news detection model based on multimodal bilinear pooling and attention mechanism. Front. Comput. Sci. 2023, 5, 1159063. [Google Scholar] [CrossRef]
  10. Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. SpotFake: A multi-modal framework for fake news detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data, Singapore, 11–13 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 39–47. [Google Scholar]
  11. Fu, M.; Qiu, H.; Zhao, H.; Li, Y.; Liu, H.; Shi, G.; Duan, Y. Multimodal fake news detection incorporating external knowledge and user interaction feature. Adv. Multimed. 2023, 2023, 8836476. [Google Scholar] [CrossRef]
  12. Dellys, H.N.; Mokeddem, H.; Sliman, L. On the integration of social context for enhanced fake news detection using multimodal fusion attention mechanism. AI 2023, 6, 78. [Google Scholar] [CrossRef]
  13. Helmstetter, S.; Paulheim, H. Weakly supervised learning for fake news detection on Twitter. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Barcelona, Spain, 28–31 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 274–277. [Google Scholar]
  14. Riedel, B.; Augenstein, I.; Spithourakis, G.P.; Riedel, S. A simple but tough-to-beat baseline for the Fake News Challenge stance detection task. arXiv 2017, arXiv:1707.03264. [Google Scholar]
  15. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; ACM: New York, NY, USA, 2018; pp. 849–857. [Google Scholar]
  16. Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal variational autoencoder for fake news detection. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 2915–2921. [Google Scholar]
  17. Qi, P.; Cao, J.; Yang, T.; Guo, J.; Li, J. Exploiting multi-domain visual information for fake news detection. In Proceedings of the 2019 IEEE International Conference on Data Mining, Beijing, China, 8–11 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 518–527. [Google Scholar]
  18. Silva, R.M.; Santos, R.L.; Almeida, T.A.; Pardo, T.A. Towards automatically filtering fake news in Portuguese. Expert Syst. Appl. 2020, 146, 113199. [Google Scholar] [CrossRef]
  19. Camacho-Collados, J.; Pilehvar, M.T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 2018, 63, 743–788. [Google Scholar] [CrossRef]
  20. Adadi, A.; Berrada, M. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
  21. Molnar, C. Interpretable Machine Learning; Lulu.com: Morrisville, NC, USA, 2020. [Google Scholar]
  22. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 1135–1144. [Google Scholar]
  23. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 4765–4774. [Google Scholar]
  24. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. 2018, 51, 93. [Google Scholar] [CrossRef]
  25. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. In Proceedings of the COLING 2018, Santa Fe, NM, USA, 20–26 August 2018; pp. 3391–3401. [Google Scholar]
  26. Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-aware multi-modal fake news detection. In Proceedings of the PAKDD 2020, Singapore, 11–14 May 2020; pp. 354–367. [Google Scholar]
  27. Rashkin, H.; Choi, E.; Jang, J.Y.; Volkova, S.; Choi, Y. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the EMNLP 2017, Copenhagen, Denmark, 7–11 September 2017; pp. 2931–2937. [Google Scholar]
  28. Wang, W.Y. Liar, liar pants on fire: A new benchmark dataset for fake news detection. In Proceedings of the ACL 2017, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 422–426. [Google Scholar]
  29. Zhou, X.; Liang, W.; Shimizu, K.I.; Ma, J.; Jin, Q. MFAN: Multi-modal feature-enhanced attention networks for rumor detection. In Proceedings of the IJCAI 2022, Vienna, Austria, 23–29 July 2022; pp. 2413–2419. [Google Scholar]
  30. Castillo, C.; Mendoza, M.; Poblete, B. Information credibility on twitter. In Proceedings of the WWW 2011, Hyderabad, India, 28 March–1 April 2011; pp. 675–684. [Google Scholar]
  31. Kwon, S.; Cha, M.; Jung, K.; Chen, W.; Wang, Y. Prominent features of rumor propagation in online social media. In Proceedings of the ICDM 2013, Dallas, TX, USA, 7–10 December 2013; pp. 1103–1108. [Google Scholar]
  32. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef] [PubMed]
  33. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. News verification by exploiting conflicting social viewpoints in microblogs. In Proceedings of the AAAI 2016, Phoenix, AZ, USA, 12–17 February 2016; pp. 2972–2978. [Google Scholar]
  34. Tacchini, G.; Ballarin, G.; Della Vedova, M.L.; Moret, S.; de Alfaro, L. Some like it hoax: Automated fake news detection in social networks. In Proceedings of the DSAA 2017, Tokyo, Japan, 19–21 October 2017; pp. 740–746. [Google Scholar]
  35. Guo, H.; Cao, J.; Zhang, Y.; Guo, J.; Li, J. Rumor detection with hierarchical social attention network. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018; pp. 943–951. [Google Scholar]
  36. Liu, Y.; Wu, Y.F.B. Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In Proceedings of the AAAI 2018, New Orleans, LA, USA, 2–7 February 2018; pp. 354–361. [Google Scholar]
  37. Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.J.; Wong, K.F.; Cha, M. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the IJCAI 2016, New York, NY, USA, 9–15 July 2016; pp. 3818–3824. [Google Scholar]
  38. Che, H.; Pan, B.; Leung, M.-F.; Cao, Y.; Yan, Z. Tensor Factorization with Sparse and Graph Regularization for Fake News Detection on Social Networks. IEEE Trans. Comput. Soc. Syst. 2024, 11, 4888–4898. [Google Scholar] [CrossRef]
  39. Shu, K.; Cui, L.; Wang, S.; Lee, D.; Liu, H. dEFEND: Explainable fake news detection. In Proceedings of the KDD 2019, Anchorage, AK, USA, 4–8 August 2019; pp. 395–405. [Google Scholar]
  40. Shang, L.; Kou, Z.; Zhang, Y.; Wang, D. A duo-generative approach to explainable multimodal COVID-19 misinformation detection. In Proceedings of the ACM Web Conference 2022, Virtual, 25–29 April 2022; pp. 3623–3631. [Google Scholar]
  41. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  43. Wang, L.; Zhang, C.; Xu, H.; Xu, Y.; Xu, X.; Wang, S. Cross-modal attention for multi-modal fake news detection. Inf. Process. Manag. 2022, 59, 102762. [Google Scholar]
  44. Li, X.; Liu, H.; Zhang, W.; Qian, S. Improving multimodal fake news detection by leveraging cross-modal content correlation. Inf. Process. Manag. 2023, 60, 103627. [Google Scholar]
  45. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the ICML 2015, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  46. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the CVPR 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  47. Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans. Image Process. 2018, 27, 5158–5170. [Google Scholar] [CrossRef] [PubMed]
  48. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]
  49. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. VisualBERT: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar] [CrossRef]
  50. Chen, Y.C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal image-text representation learning. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
  51. Zubiaga, A.; Liakata, M.; Procter, R.; Wong Sak Hoi, G.; Tolmie, P. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 2016, 11, e0150989. [Google Scholar] [CrossRef] [PubMed]
  52. Hannak, A.; Wagner, C.; Garcia, D.; Mislove, A.; Strohmaier, M.; Wilson, C. Bias in online freelance marketplaces: Evidence from TaskRabbit and Fiverr. In Proceedings of the CSCW 2017, Portland, OR, USA, 25 February–1 March 2017; pp. 1914–1933. [Google Scholar]
  53. Kumar, S.; West, R.; Leskovec, J. Disinformation on the web: Impact, characteristics, and detection of wikipedia hoaxes. In Proceedings of the WWW 2016, Montreal, QC, Canada, 11–15 April 2016; pp. 591–602. [Google Scholar]
  54. Zhao, Z.; Resnick, P.; Mei, Q. Enquiring minds: Early detection of rumors in social media from enquiry posts. In Proceedings of the WWW 2015, Florence, Italy, 18–22 May 2015; pp. 1395–1405. [Google Scholar]
  55. Popat, K.; Mukherjee, S.; Strötgen, J.; Weikum, G. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of the WWW 2017, Perth, Australia, 3–7 April 2017; pp. 1003–1012. [Google Scholar]
  56. Enayet, O.; El-Beltagy, S.R. NileTMRG at SemEval-2017 task 8: Determining rumour and veracity support for rumours on Twitter. In Proceedings of the SemEval-2017, Vancouver, BC, Canada, 3–4 August 2017; pp. 470–474. [Google Scholar]
  57. Zubiaga, A.; Aker, A.; Bontcheva, K.; Liakata, M.; Procter, R. Detection and resolution of rumours in social media: A survey. ACM Comput. Surv. 2018, 51, 32. [Google Scholar] [CrossRef]
  58. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
  59. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Lawrence Zitnick, C.; Parikh, D. VQA: Visual question answering. In Proceedings of the ICCV 2015, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  60. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the ICCV 2017, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  61. Gunning, D.; Aha, D.W. DARPA’s explainable artificial intelligence program. AI Mag. 2019, 40, 44–58. [Google Scholar] [CrossRef]
  62. Liao, Q.V.; Gruen, D.; Miller, S. Questioning the AI: Informing design practices for explainable AI user experiences. In Proceedings of the CHI 2020, Honolulu, HI, USA, 25–30 April 2020; pp. 1–15. [Google Scholar]
  63. Zhou, X.; Mulay, A.; Ferrara, E.; Zafarani, R. ReCOVery: A multimodal repository for COVID-19 news credibility research. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; pp. 3205–3212. [Google Scholar]
  64. Chen, M.; Chu, X.; Subbalakshmi, K. Mmcovar: Multimodal COVID-19 vaccine focused data repository for fake news detection and a baseline architecture for classification. In Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Virtual, 8–11 November 2021; pp. 31–38. [Google Scholar]
  65. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  66. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  67. Kou, Z.; Zhang, D.Y.; Shang, L.; Wang, D. Exfaux: A weakly supervised approach to explainable fauxtography detection. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 631–636. [Google Scholar]
  68. Zhang, W.; Gui, L.; He, Y. Supervised contrastive learning for multi modal unreliable news detection in COVID-19 pandemic. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual, 1–5 November 2021; pp. 3637–3641. [Google Scholar]
  69. Qian, S.; Wang, J.; Hu, J.; Fang, Q.; Xu, C. Hierarchical multi-modal contextual attention network for fake news detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 153–162. [Google Scholar]
Figure 1. UC-CMAF architectural diagram.
Figure 2. Performance comparison of 15 multimodal fake news detection methods on ReCOVery and MMCoVaR datasets.
Figure 3. Line plots of precision, recall, and F1 Score of different models on ReCOVery (left) and MMCoVaR (right) datasets.
Figure 4. Heatmaps of accuracy, precision, recall, and F1 Score for 15 representative models on the ReCOVery (left) and MMCoVaR (right) datasets.
Figure 5. Impact of module ablations on accuracy and F1 Score.
Figure 6. Radar charts illustrating the performance impact of each component via an ablation study on the ReCOVery and MMCoVaR datasets.
Figure 7. Comparison of attention mechanisms in terms of performance and complexity.
Figure 8. Performance comparison of ReCOVery and MMCoVaR with varying λ.
Figure 9. Impact of comment selection strategy on accuracy and processing time.
Figure 10. Relationship between number of comments, ReCOVery accuracy, and training time.
Figure 11. Comparison of ReCOVery and MMCoVaR accuracy with memory usage and training time.
Figure 12. Cross-modal attention heatmap for text–image semantic alignment.
Figure 13. Comment type distribution across credibility assessment dimensions.
Figure 14. Comment–text attention alignment and importance score distribution.
Figure 15. Robustness evaluation results under noisy and adversarial conditions. (a) Accuracy under different noise intensities. (b) Performance under various adversarial attack types. (c) Heatmap of key robustness metrics for each attack.
Table 1. Dataset statistics and characteristics.
Metric | ReCOVery [63] | MMCoVaR [64]
Total Articles | 2029 | 2593
Fake News | 1364 | 1635
Real News | 665 | 958
Images | 1675 | 2357
Comments | 140,820 | Rich
Focus | COVID-19 General | COVID-19 Vaccine
Table 2. Performance comparison of multimodal fake news detection methods on the ReCOVery and MMCoVaR datasets.
Method | ReCOVery (Acc. / Prec. / Rec. / F1) | MMCoVaR (Acc. / Prec. / Rec. / F1)
SVM-Based | 0.695 / 0.678 / 0.681 / 0.679 | 0.712 / 0.698 / 0.703 / 0.700
Random Forest | 0.734 / 0.719 / 0.728 / 0.723 | 0.748 / 0.731 / 0.742 / 0.736
BERT-only | 0.821 / 0.809 / 0.798 / 0.803 | 0.834 / 0.823 / 0.812 / 0.817
ResNet-only | 0.663 / 0.652 / 0.671 / 0.661 | 0.678 / 0.665 / 0.683 / 0.674
Simple Concatenation | 0.756 / 0.742 / 0.751 / 0.746 | 0.769 / 0.753 / 0.762 / 0.757
HSA | 0.779 / 0.737 / 0.736 / 0.736 | 0.803 / 0.782 / 0.785 / 0.784
ExFaux | 0.763 / 0.719 / 0.695 / 0.704 | 0.769 / 0.784 / 0.694 / 0.707
dEFEND | 0.856 / 0.826 / 0.813 / 0.823 | 0.856 / 0.847 / 0.831 / 0.838
BTIC | 0.763 / 0.719 / 0.695 / 0.704 | 0.829 / 0.823 / 0.791 / 0.803
SAFE | 0.831 / 0.803 / 0.789 / 0.795 | 0.788 / 0.773 / 0.749 / 0.757
EANN | 0.847 / 0.816 / 0.834 / 0.824 | 0.833 / 0.819 / 0.810 / 0.814
SpotFake | 0.681 / 0.637 / 0.650 / 0.641 | 0.699 / 0.670 / 0.620 / 0.623
MVAE | 0.825 / 0.813 / 0.755 / 0.774 | 0.815 / 0.805 / 0.834 / 0.808
DGExplain | 0.897 / 0.890 / 0.861 / 0.873 | 0.895 / 0.896 / 0.871 / 0.881
UC-CMAF | 0.927 / 0.916 / 0.872 / 0.894 | 0.931 / 0.939 / 0.881 / 0.909
Table 3. Ablation study results.
Removed Components | ReCOVery (Acc. / Prec. / Rec. / F1) | MMCoVaR (Acc. / Prec. / Rec. / F1)
None (Full Model) | 0.927 / 0.916 / 0.872 / 0.894 | 0.931 / 0.939 / 0.881 / 0.909
-CMA (w/o Cross-Modal Attention) | 0.902 / 0.883 / 0.836 / 0.856 | 0.914 / 0.903 / 0.862 / 0.887
-UCI (w/o Comment–Image Fusion) | 0.895 / 0.846 / 0.858 / 0.851 | 0.901 / 0.888 / 0.861 / 0.881
-UCT (w/o Comment–Text Fusion) | 0.898 / 0.884 / 0.811 / 0.853 | 0.902 / 0.895 / 0.861 / 0.883
-Comments (w/o User Comments) | 0.883 / 0.867 / 0.823 / 0.844 | 0.886 / 0.874 / 0.845 / 0.859
-Images (Text + Comments Only) | 0.908 / 0.895 / 0.851 / 0.872 | 0.919 / 0.911 / 0.867 / 0.888
-Adaptive Weights | 0.912 / 0.901 / 0.847 / 0.873 | 0.923 / 0.916 / 0.871 / 0.893
Table 4. Comparison of attention mechanisms.
Attention Variant | ReCOVery F1 | MMCoVaR F1 | Complexity | Description
No Attention | 0.844 | 0.859 | O(n) | Simple direct connection
Single-Head | 0.867 | 0.881 | O(n²) | Basic attention mechanism
Multi-Head | 0.873 | 0.896 | O(n²) | Standard multi-head attention
Cross-Modal | 0.889 | 0.903 | O(nm) | Cross-modal attention only
Adaptive Attention (Ours) | 0.894 | 0.909 | O(nm) | Our integrated adaptive attention mechanism
Table 5. Comment selection strategy analysis.
Selection Strategy | ReCOVery Accuracy | MMCoVaR Accuracy | Processing Time
Random Selection | 0.902 | 0.915 | 156 ms
Time Order | 0.908 | 0.921 | 143 ms
By Length | 0.911 | 0.923 | 167 ms
By Engagement | 0.918 | 0.926 | 189 ms
Importance Score | 0.927 | 0.931 | 203 ms
Table 6. Impact of number of comments.
Number of Comments | ReCOVery Accuracy | MMCoVaR Accuracy | Memory Usage | Training Time
10 | 0.918 | 0.924 | 1.8 GB | 2.1 h
20 | 0.923 | 0.928 | 2.1 GB | 2.4 h
30 | 0.925 | 0.930 | 2.3 GB | 2.7 h
50 | 0.927 | 0.931 | 2.9 GB | 3.2 h
100 | 0.928 | 0.932 | 4.1 GB | 4.8 h
Table 7. Robustness against different noise types.
Noise Type | Clean Acc. | 10 dB | 20 dB | 30 dB | 50 dB
Text Synthesis | 0.927 | 0.921 | 0.913 | 0.904 | 0.887
Image Gaussian Blur | 0.927 | 0.924 | 0.920 | 0.914 | 0.896
Comment Injection | 0.927 | 0.923 | 0.918 | 0.912 | 0.891
Combined Noise | 0.927 | 0.918 | 0.906 | 0.892 | 0.868
Table 8. Performance under adversarial attacks.
Attack Method | Success Rate | Accuracy Drop | Defense Effectiveness
Text Perturbation | 12.3% | −3.2% | 87.7%
Image Adversarial | 8.7% | −2.1% | 91.3%
Comment Manipulation | 15.6% | −4.1% | 84.4%
Multimodal Attack | 18.9% | −5.3% | 81.1%