Article

Cross-Modal Fake News Detection Method Based on Multi-Level Fusion Without Evidence

1  School of Management Science and Information Engineering, Hebei University of Economics and Business, Shijiazhuang 050061, China
2  Hebei Cross-Border E-Commerce Technology Innovation Center, Shijiazhuang 050061, China
*  Author to whom correspondence should be addressed.
Algorithms 2025, 18(7), 426; https://doi.org/10.3390/a18070426
Submission received: 7 June 2025 / Revised: 7 July 2025 / Accepted: 7 July 2025 / Published: 10 July 2025
(This article belongs to the Special Issue Algorithms for Feature Selection (3rd Edition))

Abstract

Although multimodal feature fusion in fake news detection can integrate complementary information from different modalities, semantic inconsistency across modalities makes feature fusion difficult, and a single fusion pass is prone to information loss. In addition, although detection can be improved by drawing on external evidence, such evidence often arrives with a lag, its reliability and completeness are hard to guarantee, and it may introduce noise that interferes with the model's judgment. Therefore, a cross-modal fake news detection method based on evidence-free multi-level fusion (CM-MLF) is proposed. The method resolves semantic inconsistency through cross-modal alignment and uses attention mechanisms to fuse text and image features at multiple levels, without the assistance of other evidential features, to further enhance the expressive power of the features. Experiments show that the method achieves better detection results on multiple benchmark datasets, effectively improving the accuracy and robustness of cross-modal fake news detection.

Graphical Abstract

1. Introduction

As an important channel of information dissemination, online news covers a wide range of fields and offers rich and varied content. It can meet the spiritual and cultural needs of different groups and improve the quality of public life. Online news spreads quickly and is highly timely, making it easier for the public to obtain information and to participate in discussions of social hotspots. With the rise of the Internet, especially social media platforms (e.g., Facebook, Twitter, WeChat, Weibo), the speed and scope of information dissemination have greatly expanded, so that any piece of news, opinion, or rumor can quickly spread across the globe. However, information on social media is often not rigorously vetted or verified, which makes the spread of fake news much easier. Such false information not only misleads public perception and affects individual decisions and behaviors, but may also undermine trust between the government, the media, and the public, causing social instability.
Given the double-edged-sword effect of online news dissemination, fake news detection is becoming increasingly important. Traditional news verification relies on the vetting processes of journalists and media organizations. However, with the explosive growth of information, the limitations of manual verification in terms of speed and scale are becoming increasingly apparent. In addition, the anonymity and decentralized nature of social media make it more difficult to track and verify sources. With the rapid development of artificial intelligence, natural language processing, machine learning, and big data, fake news detection systems can automatically identify and filter fake news by analyzing news text and images. Combined with professional news review teams, such systems can effectively identify and intercept fake news and reduce its negative impact on the public and society.
Meanwhile, fake news detection techniques continue to improve. For example, the rise of large-scale language models and multimodal fusion technology has made it possible to detect fake news from multiple dimensions, such as text and image. Research on fake news detection has mainly gone through two stages: unimodal and multimodal. Early research was mostly unimodal, focusing on a single type of object, usually text or images, and suffered from limited information and fuzzy semantics. The core tasks of multimodal fake news detection are the extraction and fusion of features from news information. Therefore, how to effectively extract features and how to integrate multimodal features such as text and images have become hotspots and difficulties of current research.
In real scenarios where fake news spreads rapidly, obtaining external evidence for detection often lags behind the spread. The reliability and completeness of evidence sources are also difficult to guarantee, which may introduce additional noise and interfere with the model's judgment. In addition, the semantic inconsistency of multimodal features makes feature fusion difficult. Traditional single-pass fusion methods (e.g., directly splicing or weighting text and image features) are prone to information loss because of the semantic gap between textual and visual features. If the fused features from a single fusion pass are used directly for detection, unimodal text and image features are under-utilized. For this reason, this paper focuses on mining the text and image information of the news itself to improve detection when no external evidence is available, and proposes a Cross-modal Fake News Detection Model Based on Multi-level Fusion without Evidence (CM-MLF). The model relies solely on the text and image content of the news, making it especially suitable for real-time screening in the content review systems of social media platforms. Since it does not depend on external evidence, it can make judgments as soon as news is published, even before user interaction data are generated, which is crucial for curbing the initial spread of fake news.
The CM-MLF model first maps text and images to a unified semantic space through a cross-modal alignment module, which eliminates modal differences and solves the problem of semantic inconsistency between modalities. The multimodal information is then fused progressively in two stages. The first stage generates primary fusion features from the aligned text and image features through cross-modal cross-attention. The second stage introduces KL ambiguity scores to assist in allocating attention weights and realizes fine-grained secondary fusion of the text, image, and primary fusion features. This design avoids the information loss of traditional single-pass fusion, enhances the model's attention to key features, and improves the accuracy of fake news detection. The main contributions of this paper are as follows:
  • This paper proposes a cross-modal fake news detection model, CM-MLF, based on evidence-free multilevel fusion, and designs a cross-modal alignment and secondary fusion framework to solve the problems of semantic inconsistency and information loss of multimodal features.
  • This paper designs an attention mechanism and KL score-assisted weight assignment network. This network adaptively adjusts the contribution of text, image and primary fusion features in the second stage of fusion, which enhances the model’s ability to capture key features.
  • Experiments on public datasets demonstrate the effectiveness of the CM-MLF model, with accuracy and other evaluation metrics outperforming other baseline models.
The paper is structured as follows: Section 2 reviews problems and solutions for fake news detection; Section 3 describes the design of the CM-MLF model; Sections 4 and 5 present experiments that demonstrate the effectiveness of the method; and Section 6 concludes the paper.

2. Related Work

Research on fake news detection can be traced back to the fields of "social media credibility analysis" and "rumor detection". Early research did not directly use the term "fake news", which has been widely used as an academic term since 2016. In particular, the surge of political disinformation during the U.S. election period prompted in-depth research in this field. Shu et al. [1] published a review that systematically surveyed fake news detection on social media and analyzed the characteristics and detection methods of fake news. They defined fake news as "a news article that is intentionally and verifiably false". This study serves as an important early normative definition of the concept and provides a theoretical foundation for subsequent research. With the popularization of the Internet and social media platforms, the dissemination of fake news has shown significant multimodal characteristics, with carriers covering text, images, video, and audio. In this context, traditional unimodal detection methods show obvious limitations, prompting the research paradigm to shift from unimodal analysis to multimodal fusion. Current multimodal detection research focuses on three core strategies [2]: multimodal complementarity, multimodal consistency, and multimodal enhancement.
Multimodal complementarity refers to the splicing and fusion of features from different modalities, where visual information is regarded as complementary to textual features. Multimodal consistency refers to verifying whether the information content described by text and vision is relevant and consistent, which is measured by calculating the similarity between the two. Multimodal enhancement refers to bi-directional enhancement by comparing and observing the aligned parts of text and vision.

2.1. Multimodal Complementarity

Early multimodal approaches fuse text and image features by simple splicing. Methods based on multimodal complementarity usually consider visual information as a complement to the textual content of fake news. Such methods first use a text encoder and a visual encoder to extract text and image features, respectively, and then realize multimodal fusion by simple feature splicing.
Inspired by adversarial networks, Wang et al. [3] proposed an end-to-end fake news detection framework called Event Adversarial Neural Network (EANN). The multimodal feature extractor is first responsible for extracting textual and visual features from the posts. Subsequently, the feature vectors of the two modalities are spliced to form a fusion feature representation. The fused features are not only input to the fully connected classifier for news authenticity discrimination, but also used as input to the event discriminator for recognizing news event labels. Wang et al. [4] further proposed an end-to-end fake news detection framework called MetaFEND. The framework achieves the unification of cross-event generalized representation learning and event-specific modeling through the synergistic fusion of meta-learning and neural process. The framework requires only a small number of labeled samples to quickly adapt to unexpected events, which significantly improves the performance of fake news detection in data-scarce scenarios. Khattar et al. [5] improved the complementary framework into a fake news detection framework called Multimodal Variational Autoencoder (MVAE). The framework uses Bi-LSTM and VGG19 to obtain textual representation and visual representation respectively and splices both as shared representation. The encoder encodes the text and image information into shared latent vectors, and the decoder maps the latent vectors back into the text and image space to reconstruct the original input. Singhal et al. [6] proposed a multimodal detection framework called SpotFake+. This framework introduces BERT [7] pre-trained language model for text feature encoding and VGG19 [8] model for image feature encoding. Finally the two features are spliced and input to the classifier.
Although multimodal complementarity fuses information from the text and image modalities, simple splicing ignores the complex semantic associations between modalities and therefore limits detection performance to some extent.

2.2. Multimodal Consistency

The multimodal consistency strategy addresses modal conflict situations. Contradictory features are identified through the alignment differences between visual and textual entities. This consistency metric not only recognizes false content that does not match the text and image, but also effectively detects elaborate cross-modal deceptions, such as semantically misleading illustrations or text descriptions.
Zhou et al. [9] proposed a similarity-aware fake news detection method (SAFE), which extracts textual and visual features separately through neural networks and deeply mines cross-modal feature associations. The method realizes the joint learning of textual, visual and their associated features, and is able to identify fake news based on textual and image features or their “mismatch”. The method can effectively recognize false news based on content features or cross-modal mismatches, which enhances the semantic understanding ability of the model. Xue et al. [10] proposed a multimodal consistency neural network framework (MCNN). Based on the traditional feature extraction module, a visual tampering detection module and a similarity metric module are innovatively integrated. The framework systematically combines semantic feature extraction with physical feature analysis. It realizes the accurate detection of image tampering traces and the quantitative evaluation of the degree of match between graphic and textual data. Inspired by the superior performance of Transformer in visual representation learning, Ghorbanpour et al. [11] proposed a Fake News Detection method (FNR) based on transform learning and contrastive loss. The method innovatively uses visual Transformer [12] and BERT to extract image and text features, respectively, and introduces a contrastive loss function to quantify the image-text similarity.
Although multimodal consistency methods try to verify the semantic and logical consistency of text and images by calculating the similarity between them, there is a natural difference between text and visual information in terms of expression and semantic level. Texts usually use language as a carrier to express more abstract concepts and emotions, while images convey information through visual elements. Due to this semantic divide, traditional similarity metrics are difficult to accurately capture and match the subtle differences between the two modalities, leading to certain limitations in consistency detection.

2.3. Multimodal Enhancement

In multimodal information fusion, news text and images are correlated at the high-level semantic level, and their aligned parts usually indicate important features of the news. Multimodal enhancement makes text and image information complement each other by comparatively analyzing the alignment part in text and visual information. The enhancement strategy is based on deep cross-modal interaction and realizes bidirectional enhancement of text and image information by mining the higher-order semantic features shared across modalities. Thus, the ability to understand and analyze multimodal content is enhanced on the whole.
Jin et al. [13] proposed an attention mechanism-based recurrent neural network model (att-RNN). The model can effectively fuse information from three different modalities: text, image and social context. It uses LSTM network to obtain the joint feature representation of text and social context, and introduces the attention mechanism to dynamically adjust the fusion weights of visual features. Zhang et al. [14] proposed a novel multimodal knowledge-aware event memory network (MKEMN). The network considers text embedding, visual embedding, and knowledge embedding as multiple stacked channels, and uses a multi-channel CNN with an attention mechanism to fuse multimodal information while maintaining their alignment relationships. The network focuses on unidirectional enhancement of multimodal content, highlighting important image regions through the guidance of textual information. Song et al. [15] further proposed a multimodal fake news detection model based on cross-modal attentional residuals and multi-channel convolutional neural network (CARMN). The model dynamically fuses text and image features through the attention mechanism and constructs a bidirectional enhancement relationship between image and text. On this basis, key features are further extracted from the fused features, and finally the authenticity of news is determined based on the extracted key features. Wu et al. [16] proposed a multimodal collaborative attention network (MCAN). For image processing, this network uses a convolutional neural network CNN to extract the physical feature representation of an image and a VGG19 architecture to extract the semantic feature representation of an image. For text processing, the BERT model is utilized to obtain the semantic feature representation of the text. The three are fused by stacking multiple Co-attention Blocks to learn their interdependencies. Qi et al. [17] proposed an entity-enhanced multimodal fake news detection framework (EM-FEND). The framework introduces visual entities and embedded text, and models three cross-modal association features, namely entity inconsistency, mutual enhancement, and text supplementation. Chen et al. [18] proposed a multimodal false news detection method with cross-modal ambiguity learning (CAFE). The method performs feature fusion through cross-modal alignment, fuzzy learning algorithms, and interaction matrices to explore the deep semantic associations between different modalities. Peng et al. [19] proposed a multimodal rumor detection method based on deep metric learning (MRML), which extracts intramodal and intermodal relationships by designing ternary learning and comparison learning, respectively. Semantic associations between text and images are modeled by deep metric learning and joint representation. Ying et al. [20] proposed an improved multi-gate mixture-of-expert networks for feature refinement and fusion. This network weighs the feature weights of each modality through single-view prediction and cross-modal consistency learning. In addition, the importance of different modalities for adaptive aggregated unimodal representation is estimated with the help of multi-view learning. Peng et al. [21] proposed a contextual semantic representation learning method for multimodal fake news detection. The method extracts local contextual features through unsupervised context learning and then hierarchically fuses global semantic features with visual features to model multimodal feature correlation. Yu et al. [22] proposed a fake news detection method based on progressive fusion of multi-source heterogeneous data. The method constructs a multi-source dataset and integrates the features of different modalities step-by-step through a progressive fusion strategy, which reduces the information redundancy and improves the efficiency and accuracy of the model. Zhong et al. [23] proposed a multimodal fake news detection model based on evidence enhancement and local semantic interaction. Evidence text was introduced and feature enhancement methods were designed, and finally the complementary information between the enhanced features was learned through a cross-attention module. Deng et al. [24] proposed an attention-guided multimodal feature fusion method for fake news detection (AGMFN). This method extracts multi-level frequency domain information of images by wavelet transform and combines the attention mechanism to realize the deep fusion of text, image and frequency domain features, effectively exploiting the higher-order semantic associations among multimodal features.
Although these methods have made some achievements in multimodal feature learning, there is still room for optimization in terms of complementary utilization of cross-modal information and adequacy of feature interaction. Especially when dealing with complex semantic scenes, the synergistic effect among modalities has not been fully utilized.

3. Methods

3.1. Model Definition

In this paper, we propose a cross-modal fake news detection method based on multi-level fusion without evidence (CM-MLF). As shown in Figure 1, the model is mainly composed of four parts: feature extraction, feature alignment, feature fusion, and feature detection.
Each input multimodal news item is denoted $x = [T, P] \in D$, where $T$, $P$, and $D$ denote the text, the image, and the dataset, respectively. First, in the feature extraction module, for the text $T$ of a given news item, high-quality semantic features are extracted using the large-scale pre-trained language model BERT and serve as the base text features $e_t$ input to the subsequent modules. For a given news image $P$, a large-scale pre-trained deep architecture, ResNet, is used to extract high-quality semantic features as the base image features $e_v$. Second, in the feature alignment module, the extracted text features $e_t$ and image features $e_v$ are cross-modally aligned by contrastive learning, yielding the aligned unimodal representations $m_t$ and $m_v$. Next, the feature fusion module works in two stages. In the first stage, the aligned text features $m_t$ and image features $m_v$ are input into a multi-head attention mechanism for cross-modal feature fusion to obtain the primary fusion features $m_{final}$. In the second stage, KL ambiguity scores are introduced to assist a dynamically learnable attention weighting that adaptively adjusts the contributions of the aligned text features $m_t$, the aligned image features $m_v$, and the primary fusion features $m_{final}$. These three features are fused again, and the result is used as the detection feature $\tilde{X}$. Finally, in the feature detection module, the detection feature $\tilde{X}$ is fed to the fake news classifier, which outputs the final binary classification result.

3.2. Feature Extraction

In this paper, we use the pre-trained models BERT and ResNet to encode the text $T$ and the image $P$ into unimodal embeddings $e_t$ and $e_v$.

3.2.1. Text Feature Extraction

The BERT model learns representations of linguistic information by training on large-scale unlabeled data. The model offers good parallelism and high efficiency and can transform data into sentence vectors of consistent dimension. Through bidirectional encoded representations, BERT effectively captures underlying semantic and contextual information and can represent word and sentence relations. Therefore, this study extracts high-quality semantic features from news texts using the large-scale pre-trained language model BERT-base, which contains 12 encoder layers.
First, the input text $T = \{t_1, \ldots, t_l\}$ is processed with the BERT tokenizer, which converts the text into a sequence of tokens. Each token is mapped to an integer to form the input sequence $K = \{k_1, \ldots, k_l\}$. The input sequence $K$ is then passed to the BERT-base model, and the forward pass produces hidden states for multiple layers. The hidden state $H$ of the last layer is obtained, with shape $(n, l, d)$, where $n$ is the number of sequences, $l$ is the sequence length, and $d$ is the hidden dimension. Special attention is paid to the output of the [CLS] token, as it is commonly used to represent the global semantics of the whole text. The [CLS] feature is taken from the first token of each sequence, as shown in Equation (1).
$e_{cls} = H[:,\,0,\,:] \in \mathbb{R}^{d}$
The dimension of the [CLS] feature is $d$. To adjust this feature to a dimension suitable for subsequent processing, a fully connected layer is applied, as shown in Equation (2).
$e_t = \mathrm{ReLU}(W_t e_{cls} + b_t) \in \mathbb{R}^{dim\_t}$
where $dim\_t$ is the output dimension of the textual fully connected layer, $W_t$ and $b_t$ are its weight and bias parameters, and $\mathrm{ReLU}$ is the activation function. $e_t$ is the extracted text feature, which contains dynamic global abstract information about the text and provides rich semantic information for the subsequent fake news detection task.
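As a concrete illustration, the following is a minimal sketch of this text-feature pipeline using the Hugging Face transformers library; the model name, maximum sequence length, and projection dimension dim_t are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextFeatureExtractor(nn.Module):
    """BERT [CLS] feature plus fully connected projection (Equations (1)-(2))."""
    def __init__(self, dim_t=256, bert_name="bert-base-chinese"):  # assumed settings
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.fc = nn.Linear(self.bert.config.hidden_size, dim_t)

    def forward(self, texts):
        # Tokenize the raw texts into input ids and attention masks.
        enc = self.tokenizer(texts, padding=True, truncation=True,
                             max_length=128, return_tensors="pt")
        hidden = self.bert(**enc).last_hidden_state   # shape (n, l, d)
        e_cls = hidden[:, 0, :]                       # [CLS] token, Eq. (1)
        e_t = torch.relu(self.fc(e_cls))              # projection, Eq. (2)
        return e_t
```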

3.2.2. Image Feature Extraction

In computer vision, CNNs are a common way to extract important image features. ResNet, as a deep network, effectively extracts high-level semantic features of images by introducing a residual learning structure, and it outperforms the traditional VGG network in feature extraction and classification. In particular, with deeper network hierarchies, ResNet alleviates the gradient vanishing and degradation problems that occur in VGG networks, while also reducing computation and storage. Therefore, this study uses a large-scale pre-trained ResNet model to extract high-quality semantic features from news images. The core residual part of the model has four stages with two residual blocks in each stage; each residual block consists of two 3 × 3 convolutional layers, for a total of 16 convolutional layers, and each convolutional layer is followed by a BN layer and a ReLU activation function.
Image preprocessing is first performed by cropping and scaling the input image $P$ to a standard size of 224 × 224 × 3. The pixel values are scaled to the range $[0, 1]$ and then normalized. The preprocessed image is then sequentially passed through the initial convolutional layer, the max pooling layer, and the residual modules. The size variation of the output feature maps is shown in Table 1.
The output feature maps of the last residual module are processed through a global average pooling layer. It compresses the feature maps of each channel into a scalar, and finally outputs a 1 × 1 × 512 feature vector. In order to adjust this feature to a dimension suitable for subsequent processing, a fully connected layer is designed as shown in Equation (3).
$e_v = \mathrm{ReLU}(W_v e_{res} + b_v) \in \mathbb{R}^{dim\_v}$
where $dim\_v$ is the output dimension of the image fully connected layer, $W_v$ and $b_v$ are its weight and bias parameters, and $\mathrm{ReLU}$ is the activation function. $e_v$ is the extracted image feature, which contains dynamic global abstract information about the image. The residual connections help avoid gradient vanishing and enable efficient image feature extraction.
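A minimal sketch of the image branch is given below, assuming a torchvision ResNet-18 backbone (which matches the four-stage, two-block, 16-convolution description above); the projection dimension dim_v and the normalization statistics are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

class ImageFeatureExtractor(nn.Module):
    """ResNet backbone + global average pooling + projection (Equation (3))."""
    def __init__(self, dim_v=256):  # assumed output dimension
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the final classification layer, keep conv stages + global avg pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(512, dim_v)
        # Preprocessing applied to raw PIL images before the forward pass.
        self.preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),                     # pixel values scaled to [0, 1]
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def forward(self, images):                         # images: (B, 3, 224, 224)
        e_res = self.backbone(images).flatten(1)       # (B, 512) after global pooling
        e_v = torch.relu(self.fc(e_res))               # projection, Eq. (3)
        return e_v
```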

3.3. Feature Alignment

3.3.1. Similarity Matching

To determine whether there is a semantic matching relationship between text and image, a similarity matching task is designed. Specifically, a new dataset $\tilde{x} = [\tilde{T}, \tilde{P}] \in \tilde{D}$ is first constructed from dataset $D$. If the text and image belong to the same piece of real news, they are regarded as a matched text-image pair and labeled $\tilde{y} = 1$; otherwise, they are regarded as an unmatched pair and labeled $\tilde{y} = 0$. Then, the initial text $\tilde{T}$ and image $\tilde{P}$ are encoded by their respective encoders to obtain normalized feature vectors $\tilde{e}_t$ and $\tilde{e}_v$ of the same dimension. These are fed into two independent aligners that map them into a shared semantic space, producing the text features $\tilde{e}_{st}$ and image features $\tilde{e}_{sv}$, as shown in Equation (4).
$\tilde{e}_{st},\ \tilde{e}_{sv} = \mathrm{aligner}(\tilde{e}_{t},\ \tilde{e}_{v}) = \mathrm{aligner}(\mathrm{encoder}(\tilde{T},\ \tilde{P}))$
where the $\mathrm{aligner}$ consists of a two-layer fully connected network, each layer followed by BN and ReLU activation functions.
Finally, $\tilde{e}_{st}$ and $\tilde{e}_{sv}$ are spliced and the result is input to a simple binary-classification fully connected network, whose output is the predicted similarity label $\tilde{y}$. To train the similarity module, matched feature pairs are pulled closer together and mismatched feature pairs are pushed farther apart. In this paper, the cosine embedding loss function is used, as shown in Equation (5).
$L_{sim} = \begin{cases} 1 - \cos(\tilde{e}_{st}, \tilde{e}_{sv}), & \tilde{y} = 1 \\ \max\left(0,\ \cos(\tilde{e}_{st}, \tilde{e}_{sv}) - d\right), & \tilde{y} = 0 \end{cases}$
where $\cos(\cdot)$ denotes the cosine similarity and $d$ is the margin.
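The matching objective of Equation (5) corresponds to the standard cosine embedding loss; a minimal sketch using PyTorch's built-in implementation is shown below, where the margin value d = 0.2 is an assumed setting.

```python
import torch
import torch.nn as nn

# Equation (5): pull matched pairs together, push mismatched pairs apart.
# PyTorch's CosineEmbeddingLoss expects targets of +1 (match) / -1 (mismatch),
# so the 0/1 labels of the similarity-matching task are remapped first.
cos_loss = nn.CosineEmbeddingLoss(margin=0.2)          # d = 0.2 is an assumed value

def similarity_loss(e_st, e_sv, y_tilde):
    """e_st, e_sv: (B, dim) aligned text/image features; y_tilde: (B,) labels in {0, 1}."""
    target = y_tilde.float() * 2 - 1                   # 1 -> +1, 0 -> -1
    return cos_loss(e_st, e_sv, target)
```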

3.3.2. Contrastive Learning

To further optimize the representations of text and image features, a CLIP-based contrastive learning task is introduced. This task enhances the model's ability to understand multimodal data through contrastive learning. The aligned feature representations $m_t$ and $m_v$ are first extracted by the CLIP module from the given input text $T$ and image $P$, as shown in Equation (6).
$m_t,\ m_v = \mathrm{CLIP}(T,\ P)$
Then, the similarity matrix between the text features $m_t$ and image features $m_v$ is calculated, and a temperature parameter $\tau$ is used to control the scaling of the similarity matrix, as shown in Equation (7).
$logits = m_v \times (m_t)^{T} \times \tau$
Next, a symmetric cross-entropy loss is used to optimize the feature alignment. The image-to-text and text-to-image contrastive losses are calculated separately, as shown in Equations (8) and (9), where $L_{CE}$ denotes the cross-entropy loss function and $labels = [0, 1, \ldots, N-1]$ is the diagonal labeling sequence.
$L_{clip}^{v \to t} = L_{CE}(logits,\ labels)$
$L_{clip}^{t \to v} = L_{CE}(logits^{T},\ labels)$
Finally, the mean of the bidirectional losses is taken as the final contrastive loss $L_{clip}$, as shown in Equation (10).
$L_{clip} = \tfrac{1}{2}\left(L_{clip}^{v \to t} + L_{clip}^{t \to v}\right)$
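A minimal sketch of the symmetric contrastive loss in Equations (7)-(10) is given below; the temperature scaling value tau is an assumed setting.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(m_t, m_v, tau=100.0):
    """Symmetric contrastive loss of Equations (7)-(10).

    m_t, m_v: (N, dim) L2-normalized text/image features; tau is the scaling
    factor of Eq. (7), here an assumed value.
    """
    logits = m_v @ m_t.t() * tau                              # Eq. (7): (N, N) similarity matrix
    labels = torch.arange(m_t.size(0), device=m_t.device)     # diagonal labels [0..N-1]
    loss_v2t = F.cross_entropy(logits, labels)                # Eq. (8): image -> text
    loss_t2v = F.cross_entropy(logits.t(), labels)            # Eq. (9): text -> image
    return 0.5 * (loss_v2t + loss_t2v)                        # Eq. (10)
```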

3.3.3. Soft Label

In traditional contrastive learning, one-hot labels are usually used to supervise the learning process. One-hot labeling classifies image-text pairs strictly into two categories, and this dichotomy over-penalizes: it applies the same penalty to all mismatched predictions, even though some mismatched pairs may share some semantic similarity. Soft labels capture the fine-grained semantic similarity between images and text, leading to a better understanding of image-text semantic alignment. Therefore, we introduce soft labeling as an auxiliary task.
On the basis of the unimodal features $e_t$ and $e_v$, the similarity features $e_{st}$ and $e_{sv}$ of the text and image are obtained by the similarity matching module and used to calculate the soft-label similarity matrix, as shown in Equation (11).
$soft\_labels = e_{sv} \times (e_{st})^{T} \times \tau$
To align the predicted similarity distribution with the soft-label distribution, the model calculates the soft-label contrastive loss in both directions using the negative log-likelihood. Specifically, it computes the image-to-text soft-label loss $L_{soft}^{v \to t}$ and the text-to-image loss $L_{soft}^{t \to v}$, and takes the mean of the two as the final soft-label loss, as shown in Equation (12).
$L_{soft} = \tfrac{1}{2}\left(L_{soft}^{v \to t} + L_{soft}^{t \to v}\right)$
By jointly training the cross-modal feature alignment module, semantically aligned unimodal representations $m_t$ and $m_v$ are generated and used as input to the subsequent cross-modal fusion module.
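The following sketch illustrates one plausible implementation of the soft-label auxiliary loss in Equations (11) and (12), assuming softmax-normalized soft targets and a negative log-likelihood formulation; the exact normalization details are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(m_t, m_v, e_st, e_sv, tau=100.0):
    """Soft-label contrastive loss (Equations (11)-(12)), a sketch with assumed details.

    m_t, m_v: aligned features used for the predicted similarity distribution;
    e_st, e_sv: similarity features used to build the soft-label distribution.
    """
    logits = m_v @ m_t.t() * tau                        # predicted image-text similarities
    soft_labels = e_sv @ e_st.t() * tau                 # Eq. (11): soft-label similarity matrix
    log_p_v2t = F.log_softmax(logits, dim=-1)
    log_p_t2v = F.log_softmax(logits.t(), dim=-1)
    q_v2t = F.softmax(soft_labels, dim=-1).detach()     # soft targets (no gradient)
    q_t2v = F.softmax(soft_labels.t(), dim=-1).detach()
    loss_v2t = -(q_v2t * log_p_v2t).sum(dim=-1).mean()  # negative log-likelihood, image -> text
    loss_t2v = -(q_t2v * log_p_t2v).sum(dim=-1).mean()  # text -> image
    return 0.5 * (loss_v2t + loss_t2v)                  # Eq. (12)
```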

3.4. Feature Fusion

To realize interactive fusion between the aligned text features $m_t$ and the aligned image features $m_v$, this paper designs a cross-modal feature fusion network based on bidirectional multi-head cross-attention. Its core is to capture cross-modal semantic associations through text-to-image and image-to-text bidirectional interactions, which are finally fused to generate a unified joint representation. The structure of this fusion network is shown in Figure 2; it consists of two parallel multi-head cross-attention networks.
The principle formula of the single-head attention mechanism is shown in Equation (13).
$\mathrm{head} = \mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$
where $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, and $\sqrt{d_k}$ is the scaling factor, which prevents the gradients from destabilizing due to overly large dot products.
The multi-head attention mechanism segments features into several “heads”. Each “head” can independently learn different semantic associations between text and image features. The computation of each attention head is shown in Equation (14).
$\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\ K W_i^{K},\ V W_i^{V}\right)$
where $i$ indicates the $i$-th attention head, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are trainable weight matrices that transform the input features into query, key, and value representations, respectively.
When all the “heads” have been calculated, the results are spliced. The output formula for multi-head attention is shown in Equation (15).
$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h\right) W^{O}$
where $h$ is the number of attention heads and $W^{O}$ is a trainable weight matrix.
Based on the above computation, in the feature fusion network the first attention network takes the aligned text feature $m_t$ as the query $Q$ and the aligned image feature $m_v$ as both the key $K$ and the value $V$; the text-to-image attention $O_{txt \to img}$ is computed as shown in Equation (16). Conversely, the second attention network takes $m_v$ as $Q$ and $m_t$ as both $K$ and $V$; the image-to-text attention is computed as shown in Equation (17).
$O_{txt \to img} = \mathrm{MultiHeadAttention}(m_t,\ m_v,\ m_v)$
$O_{img \to txt} = \mathrm{MultiHeadAttention}(m_v,\ m_t,\ m_t)$
Next, the two interaction results, $O_{txt \to img}$ and $O_{img \to txt}$, are spliced along the first dimension. The spliced features are then subjected to a linear transformation, batch normalization, and ReLU activation. Finally, the primary fusion feature $m_{final}$ is obtained, as shown in Equations (18) and (19).
$C = O_{txt \to img} \oplus O_{img \to txt}$
$m_{final} = \mathrm{ReLU}\left(\mathrm{BN}(C W + b)\right)$
where $\oplus$ denotes the splicing operation, $W$ is the weight matrix of the linear layer, $b$ is its bias vector, $\mathrm{BN}$ is the batch normalization layer, and $\mathrm{ReLU}$ is the activation function.
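A minimal sketch of the bidirectional cross-attention fusion in Equations (16)-(19) is shown below, built on PyTorch's nn.MultiheadAttention; the feature dimension, number of heads, and splicing along the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Bidirectional multi-head cross-attention fusion (Equations (16)-(19)).

    A sketch: the feature dimension and number of heads are assumed values.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, m_t, m_v):                          # m_t, m_v: (B, dim)
        t, v = m_t.unsqueeze(1), m_v.unsqueeze(1)         # add a length-1 sequence dimension
        o_t2i, _ = self.txt2img(t, v, v)                  # Eq. (16): text queries image
        o_i2t, _ = self.img2txt(v, t, t)                  # Eq. (17): image queries text
        c = torch.cat([o_t2i, o_i2t], dim=-1).squeeze(1)  # Eq. (18): splice both directions
        return torch.relu(self.bn(self.proj(c)))          # Eq. (19): primary fusion feature
```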

3.5. Secondary Fusion

3.5.1. Attention Weight Acquisition Network

The purpose of the secondary fusion is to form a joint representation of the aligned text features $m_t$, the aligned image features $m_v$, and the primary fusion features $m_{final}$. However, these three features do not contribute equally to the final decision. Therefore, this paper adopts a cross-attention mechanism to model the inter-modal relationships and dynamically adjust the weight of each feature. The schematic of the attention weight acquisition network is shown in Figure 3. The interaction modalities text → image, image → text, and fusion → text are first modeled by three-way cross-attention, covering multi-granularity semantic alignment. The interacted features are then spliced and fed into a simple MLP that dynamically weights each modal feature.
The computational representations of attention for text → image, image → text, and fusion → text are shown in Equations (20)–(22).
$text\_attn = \mathrm{MultiHeadAttention}(m_t,\ m_v,\ m_v)$
$image\_attn = \mathrm{MultiHeadAttention}(m_v,\ m_t,\ m_t)$
$com\_attn = \mathrm{MultiHeadAttention}(m_{final},\ m_t,\ m_t)$
The interacted features are spliced together to form a comprehensive feature representation $C_{attn}$, as shown in Equation (23).
$C_{attn} = text\_attn \oplus image\_attn \oplus com\_attn$
For the output feature $C_{attn} \in \mathbb{R}^{B \times D}$, $B$ is the batch size and $D$ is the dimension of the input feature. It is passed through a two-layer fully connected network to generate the normalized weights $AttentionScore$, as shown in Equations (24)-(26).
$F_1 = \mathrm{ReLU}(W_1 \cdot C_{attn} + b_1)$
$F_2 = W_2 \cdot F_1 + b_2$
$AttentionScore = \mathrm{softmax}(F_2)$
where $W_1 \in \mathbb{R}^{D \times H}$ and $b_1 \in \mathbb{R}^{H}$ are the weights and biases of the first layer, and $H$ is the hidden layer dimension; $W_2 \in \mathbb{R}^{H \times O}$ and $b_2 \in \mathbb{R}^{O}$ are the weights and biases of the second layer, and $O$ is the output dimension. $AttentionScore$ contains the attention weights of the three modalities, with shape $\mathbb{R}^{B \times O}$, and each set of weights $a = [a_t, a_v, a_f]$ satisfies $a_t + a_v + a_f = 1$.
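The following sketch illustrates the attention weight acquisition network of Equations (20)-(26); the dimensions, head count, and hidden size are assumed values.

```python
import torch
import torch.nn as nn

class AttentionWeightNet(nn.Module):
    """Three-way cross-attention followed by an MLP that outputs softmax weights
    (Equations (20)-(26)). Dimensions and head count are assumed values."""
    def __init__(self, dim=256, heads=8, hidden=128):
        super().__init__()
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # text -> image
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # image -> text
        self.attn_f = nn.MultiheadAttention(dim, heads, batch_first=True)  # fusion -> text
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, hidden), nn.ReLU(),     # Eq. (24)
            nn.Linear(hidden, 3),                      # Eq. (25): one logit per feature
        )

    def forward(self, m_t, m_v, m_final):              # all (B, dim)
        t, v, f = (x.unsqueeze(1) for x in (m_t, m_v, m_final))
        text_attn, _ = self.attn_t(t, v, v)            # Eq. (20)
        image_attn, _ = self.attn_v(v, t, t)           # Eq. (21)
        com_attn, _ = self.attn_f(f, t, t)             # Eq. (22)
        c_attn = torch.cat([text_attn, image_attn, com_attn], dim=-1).squeeze(1)  # Eq. (23)
        return torch.softmax(self.mlp(c_attn), dim=-1) # Eq. (26): a = [a_t, a_v, a_f]
```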

3.5.2. Ambiguity Score Acquisition Network

In the above process, the attentional weights are computed to measure the importance of different modalities in the task at hand. These weights are able to adaptively adjust the contribution of each modality. However, this weighting method may pose the following issues. First, the model’s learning results may lack intuitive explanations, making it difficult to understand why a certain modality is assigned a higher weight. Second, in some cases, there may be ambiguity in the information between different modalities. This may lead to an irrational allocation of attention weights.
Therefore, this paper proposes a symmetric KL divergence-based ambiguity score acquisition network. This network generates KL ambiguity scores that assist in adjusting the attention weights. Specifically, the symmetric KL score between the two modalities is calculated as a measure of the degree of ambiguity between them. If the KL score is large, the difference between the two modalities is significant, indicating high ambiguity; in this case, the model should rely more on cross-modal features, and the value of $a_f$ should be increased. Conversely, the model can rely more on unimodal features, and the values of $a_t$ and $a_v$ should be increased.
During model training, the differences in cross-modal distributions for single samples are quantified by the following steps. First, the conditional distributions $q_{\phi_t}$ and $q_{\phi_v}$ of the latent variables are generated by the text encoder and the image encoder, as shown in Equations (27) and (28).
$q_{\phi_t}(z_1 \mid m_t) = N\left(z_1;\ \mu_t,\ \sigma_t^{2}\right)$
$q_{\phi_v}(z_2 \mid m_v) = N\left(z_2;\ \mu_v,\ \sigma_v^{2}\right)$
where $\mu_t, \sigma_t$ and $\mu_v, \sigma_v$ are the means and variances of the text and image distributions, respectively.
Then, the latent variables $z_1$ and $z_2$ are sampled from the text and image distributions, respectively, using the reparameterization trick, as shown in Equations (29) and (30).
$z_1 = \mu_t + \epsilon \odot \sigma_t, \quad \epsilon \sim N(0, 1)$
$z_2 = \mu_v + \epsilon \odot \sigma_v, \quad \epsilon \sim N(0, 1)$
Next, to measure the difference between the two modal distributions, this paper approximates the single-sample KL divergence by Monte Carlo sampling, which avoids computing integrals directly and improves training efficiency. Specifically, the KL divergence is obtained as follows: first, samples are drawn from each of the two distributions; then, the log-probabilities of these samples are evaluated under the other distribution; finally, the KL divergence is estimated as the expectation of the difference between these log-probabilities. The two directional KL divergences, $k_i^{t \to v}$ and $k_i^{v \to t}$, are calculated as shown in Equations (31) and (32).
$k_i^{t \to v} \approx \log q_{\phi_t}(z_1) - \log q_{\phi_v}(z_1)$
$k_i^{v \to t} \approx \log q_{\phi_v}(z_2) - \log q_{\phi_t}(z_2)$
Subsequently, $k_i^{t \to v}$ and $k_i^{v \to t}$ are averaged to obtain a combined ambiguity measure, which is then passed through the sigmoid activation function to normalize it to the range $(0, 1)$, representing the degree of ambiguity. The resulting ambiguity score is shown in Equation (33).
$k_i = \mathrm{sigmoid}\!\left(\dfrac{k_i^{v \to t} + k_i^{t \to v}}{2}\right) \in [0, 1]$
We obtain the cross-modal ambiguity scores $k = [1 - k_i,\ 1 - k_i,\ k_i]$ from the bidirectional KL divergence, where $k_i$ indicates the degree of ambiguity between the image and text modalities, with higher values indicating greater ambiguity. To make the attention weights $a = [a_t, a_v, a_f]$ closer to the ambiguity scores $k$, a loss function $L_{ak}$ is introduced. This loss measures the difference between the attention weights and the ambiguity scores, guiding the model to learn a more reasonable distribution of attention weights. The loss function $L_{ak}$ is calculated as shown in Equation (34).
$L_{ak} = D_{KL}(k \,\|\, a) = \sum_i k(i) \cdot \log\!\left(\dfrac{k(i)}{a(i)}\right)$
Finally, by minimizing $L_{ak}$, the model learns to adjust the attention weights reasonably according to the input image and text features; features with larger attention weights are assigned higher importance. A joint representation $\tilde{X}$ is then generated by weighted combination of the three features and serves as the final detection feature, as shown in Equation (35).
$\tilde{X} = (a_t \times m_t) \oplus (a_v \times m_v) \oplus (a_f \times m_{final})$
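A sketch of the ambiguity score computation and the weight-guided fusion (Equations (29)-(35)) is given below; the parameter names, the normalization of the score vector k into a distribution before the KL loss, and the concatenation reading of the splicing operator are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def ambiguity_score(mu_t, sigma_t, mu_v, sigma_v):
    """Monte Carlo estimate of the symmetric KL ambiguity score (Equations (29)-(33)).

    mu/sigma: (B, z_dim) distribution parameters from the text/image encoders
    (assumed names; sigma must be positive, e.g. via softplus).
    """
    q_t = Normal(mu_t, sigma_t)
    q_v = Normal(mu_v, sigma_v)
    z1 = q_t.rsample()                                       # reparameterized sample, Eq. (29)
    z2 = q_v.rsample()                                       # reparameterized sample, Eq. (30)
    kl_tv = (q_t.log_prob(z1) - q_v.log_prob(z1)).sum(-1)    # Eq. (31)
    kl_vt = (q_v.log_prob(z2) - q_t.log_prob(z2)).sum(-1)    # Eq. (32)
    return torch.sigmoid(0.5 * (kl_tv + kl_vt))              # Eq. (33): ambiguity in (0, 1)

def ambiguity_alignment_loss(k_i, a):
    """KL loss between scores k = [1-k_i, 1-k_i, k_i] and attention weights a (Eq. (34))."""
    k = torch.stack([1 - k_i, 1 - k_i, k_i], dim=-1)
    k = k / k.sum(dim=-1, keepdim=True)                      # assumption: normalize k to a distribution
    return F.kl_div(a.clamp_min(1e-8).log(), k, reduction="batchmean")

def fuse_with_weights(m_t, m_v, m_final, a):
    """Weighted combination of the three features (Equation (35)); here the weighted
    features are concatenated, an assumed reading of the splicing operator."""
    a_t, a_v, a_f = a[:, 0:1], a[:, 1:2], a[:, 2:3]
    return torch.cat([a_t * m_t, a_v * m_v, a_f * m_final], dim=-1)
```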

3.6. Feature Detection

To perform fake news classification on the detection feature $\tilde{X}$, an MLP classifier is used in the proposed model. This classifier maps the extracted detection features to the corresponding categories. It consists of multiple fully connected layers, each followed by a BN layer and a ReLU activation function. Finally, the detection features are mapped to a binary output, which is converted into a probability distribution by the softmax function. The predicted label $\hat{y}$ of the classifier is given in Equation (36).
$\hat{y} = \mathrm{softmax}\left(\mathrm{MLP}(\tilde{X})\right)$
During training, this paper uses the cross-entropy loss function to optimize the parameters of the classifier. The loss function $L_{cl}$ is defined in Equation (37), where $y$ denotes the true label and $\hat{y}$ the predicted label.
$L_{cl} = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$
Finally, the overall loss function $L$ of the model is given in Equation (38). It contains the classification loss $L_{cl}$ from the feature detection module, the KL loss $L_{ak}$ from the cross-modal fusion module, and three losses from the cross-modal alignment module: the similarity matching loss $L_{sim}$, the contrastive loss $L_{clip}$ from CLIP, and the soft-label loss $L_{soft}$.
$L = L_{cl} + \lambda L_{ak} + L_{sim} + L_{clip} + \gamma L_{soft}$
where $\lambda$ and $\gamma$ control the proportions of the auxiliary task losses $L_{ak}$ and $L_{soft}$, respectively.
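A minimal sketch of the detection head and the joint objective of Equations (36)-(38) is shown below; the layer sizes are assumed values, and the defaults for λ and γ follow the optimum reported in Section 5.4.

```python
import torch
import torch.nn as nn

class FakeNewsClassifier(nn.Module):
    """MLP detection head (Equation (36)); the layer sizes are assumed values."""
    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x_detect):
        return self.mlp(x_detect)                       # logits; softmax is applied inside the loss

criterion = nn.CrossEntropyLoss()                       # classification loss of Equation (37)

def total_loss(l_cl, l_ak, l_sim, l_clip, l_soft, lam=0.4, gamma=0.3):
    """Joint training objective of Equation (38)."""
    return l_cl + lam * l_ak + l_sim + l_clip + gamma * l_soft
```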

4. Experiment

4.1. Datasets

In this paper, two real social media disinformation detection datasets, Weibo [13] and Twitter [25], are used to conduct experiments and analyze the performance of the CM-MLF model. Weibo and Twitter are the mainstream social media platforms, and their data are representative. The two datasets selected for this study have been systematically validated within the field, and their core value is primarily reflected in the following three aspects: First, benchmarking. As classic benchmark datasets in the field of fake news detection, they have been widely adopted by authoritative research such as JCST 2025 (CCF-A class journal) [23] and IP&M 2024 (JCR Q1, IF = 8.6) [21]; second, irreplaceability. These two datasets fully preserve the original propagation characteristics prior to platform algorithm intervention, providing a clean experimental environment for cross-modal alignment and fusion research; Finally, reproducibility. Representative studies [15,18] based on these two datasets have been cited over 200 times, establishing them as the gold standard for method comparison within the field. Both datasets contain text, images, and annotations. The text in the Weibo dataset is in Chinese and the text in the Twitter dataset is in English. During the experimental process, we follow the same steps in the work [3,13] to remove the duplicate and low-quality images to ensure the quality of the datasets.

4.1.1. Weibo Dataset

This paper uses the Sina Weibo dataset, a publicly available dataset in the field of multimodal fake news detection. In this dataset, real news posts are collected from official news sources in China, and fake news posts are verified by Weibo's official disinformation platform. It contains all verified fake posts from May 2012 to January 2016. The dataset is divided into a training set and a testing set. The training set contains 5369 posts, including 2492 real posts and 2877 fake posts. The testing set contains 1432 posts, including 691 real posts and 741 fake posts. The specific division is shown in Table 2. The ratio of training to testing posts is approximately 8:2.

4.1.2. Twitter Dataset

The Twitter dataset was collected on the Twitter platform by the MediaEval multimedia evaluation benchmark, MediaEval 2016. It contains all verified fake posts from October 2015 to January 2016. The entire dataset is divided into a training set and a testing set. The training set contains 11,847 posts, including 6840 real posts and 5007 fake posts. The testing set contains 1406 posts, including 833 real posts and 573 fake posts. The specific division is shown in Table 3. The ratio of training to testing posts is approximately 9:1.

4.2. Baseline Model

In order to validate the effectiveness of the method proposed in this paper, we compare various classic methods for detecting fake news in recent years with the CM-MLF model.
(1) Att-RNN [13], which uses LSTM to fuse news text and social context information, and combines attention mechanisms to fuse the text features with visual features.
(2) EANN [3], which introduces an event discriminator as part of the adversarial network.
(3) MVAE [5], which uses variational autoencoders to learn latent representations of text and images.
(4) SAFE [9], which feeds the correlation between news text and image features into a classifier to detect fake news.
(5) CARMN [15], which extracts feature representations of raw and fused information through cross-attention residuals and multi-channel convolutional neural networks.
(6) CAFE [18], which performs feature fusion through cross-modal alignment, ambiguity learning algorithms, and interaction matrices.
(7) BMR [20], which proposes a multi-expert hybrid network for feature refinement and fusion, and performs detection through cross-modal consistency learning.
(8) CSFND [21], which proposes a context-semantic representation learning for multi-modal fake news detection.
(9) AGMFN [24], which proposes a detection method guided by attention-based multi-modal feature fusion, achieving deep fusion combined with frequency domain features.

4.3. Evaluation Metrics

Fake news detection is essentially a classification problem, so the results are often evaluated using Accuracy (Ac), Precision (P), Recall (R), and F1-score (F1). The ultimate goal of the model is to detect whether the news is fake news or not. The news in the testing set can be classified into four categories based on their truth value labels and model prediction labels: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The labeling concepts are described in Table 4.
Based on the above concepts, the formulas for accuracy (Ac), precision (P), recall (R), and F1 score (F1) are shown in Equations (39)–(42).
$Ac = \dfrac{TP + TN}{TP + FP + TN + FN}$
$P = \dfrac{TP}{TP + FP}$
$R = \dfrac{TP}{TP + FN}$
$F1 = \dfrac{2 P \cdot R}{P + R}$
Ac assesses the overall performance of the detection model and indicates the ratio of correctly predicted samples to all samples. P assesses the model's ability to correctly predict fake news, indicating the proportion of samples correctly predicted as fake news among all samples predicted as fake news. R measures the sensitivity of the classifier, denoting the proportion of samples correctly predicted as fake news among all samples whose true label is fake news. F1 measures the overall predictive performance of the classifier and denotes the harmonic mean of P and R; a higher F1 score indicates that both P and R of the classification model are high.
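For reference, the four metrics can be computed directly from the confusion counts, as in the short sketch below.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from the confusion counts (Equations (39)-(42))."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return acc, p, r, f1
```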

4.4. Experimental Setup

The environment settings for this experiment are shown in Table 5.
The parameter settings for this experiment are shown in Table 6.
The values of λ and γ are the optimal parameter combination determined by the grid search described in Section 5.4. The remaining parameters are based on our experience in related fields and on empirical values from other studies in the field [23,24].

5. Results and Discussion

5.1. Comparative Experiment

To evaluate the model performance, comparative experiments were conducted on both the Weibo and Twitter datasets. The experimental results for the Weibo dataset are shown in Table 7 and Figure 4, while the results for the Twitter dataset are shown in Table 8 and Figure 5.
The experimental results from the Weibo dataset show that CM-MLF performs best among all baseline models. Specifically, CM-MLF outperforms the best-performing AGMFN model by 0.5% in terms of accuracy, achieving an accuracy of 92.2%. In addition, CM-MLF generally outperforms other models in other key metrics such as precision, recall, and F1 score. Among them, the CM-MLF model can achieve the highest precision rate of 93.5% for fake news detection. These results validate the effectiveness and robustness of this paper’s model in handling multimodal data and detecting fake news.
The experimental results on the Twitter dataset show that CM-MLF model still performs the best among all baseline models. Specifically, CM-MLF outperforms the best-performing AGMFN model by 5.4% in terms of accuracy, reaching 90.1% accuracy. Furthermore, CM-MLF generally outperforms other models in other key metrics such as precision, recall, and F1 score. Among them, the CM-MLF model can achieve the highest precision rate of 99.0% for true news detection. In addition, these models perform better overall on the Weibo dataset. This is because the two datasets themselves have certain differences. For example, the average length of a news article on the Weibo dataset is about 10 times that on the Twitter dataset.
Att-RNN struggles with complex modal interactions due to its single-attention fusion mechanism and the temporal limitations of LSTM. Early methods such as EANN and MVAE rely on simple feature concatenation, which leads to severe information loss. The SAFE method relies on text-image similarity; by comparison, CM-MLF achieves an average recall rate 15.95% higher on the Weibo dataset, indicating its superior ability to handle semantic differences. While CARMN enhances modality interaction through a cross-attention mechanism, it is limited by its single-pass fusion and lack of dynamic weight adjustment. Although CAFE introduces cross-modality alignment and ambiguity learning, its static fusion of interaction matrices struggles to adapt to modality differences. BMR uses a multi-expert hybrid network to process features by modality, which enhances single-modal representations but falls short in modal interaction. CSFND extracts local features through unsupervised context learning, but its global semantic fusion is insufficient. AGMFN improves image utilization by introducing frequency domain information and designs a multi-modal feature fusion layer to achieve deep interaction between features; its detection performance is therefore superior.
Compared with the above methods, the CM-MLF model proposed in this paper outperforms other baseline models in terms of accuracy on both the Weibo and Twitter datasets. It also achieves excellent results in other evaluation metrics. These results further validate the effectiveness and robustness of the model in processing multimodal data and detecting fake news.

5.2. Ablation Experiment

In order to deeply analyze the contribution of different modules in the model to the performance of news detection, ablation experiments were carried out. The impact of different modules on the model performance is assessed by comparing the full model with a variant of the model with specific modules removed.
(1)
Ours-b: This variant removes the model's multi-head cross-attention feature fusion module in favor of a simple weighted fusion to integrate the aligned text and image features.
(2)
Ours-c: This variant removes the attention weight acquisition network and the ambiguity score acquisition network, and assigns the same weight to all three features in the secondary fusion.
(3)
Ours-d: This variant removes both the ablated modules and networks in Ours-b and Ours-c.
(4)
Ours-e: This variant removes the cross-modal feature alignment module.
The results of the ablation experiments are shown in Table 9. It details the performance of different models on the Weibo and Twitter datasets.
As can be seen from the table, the full CM-MLF model achieves the best performance on both datasets, proving the effectiveness of the individual modules. On the Weibo dataset, the complete model achieves an accuracy of 92.2% and an F1 score of 92.3%, which are significantly higher than those of the other variants. In particular, the performance degradation is most pronounced in the Ours-e variant, which lacks the feature alignment module, with both accuracy and F1 score dropping to 89.6%. This indicates that feature alignment is crucial for the model.
The full model also performs best on the Twitter dataset, with 90.1% accuracy and a 90.8% F1 score. The accuracy of the Ours-e variant drops to 87.1%, and this drop reaffirms the importance of the alignment module. The accuracy of the Ours-c variant drops to 88.3% due to the removal of several key networks, and its performance is similarly affected.
In summary, the results of the ablation experiments clearly demonstrate the value of each module in the model. In particular, the fusion of the alignment module and the attention mechanism has a key role in improving the performance of fake news detection. These findings not only validate the effectiveness of the CM-MLF model design, but also provide a valuable reference for future research on multimodal fake news detection models.

5.3. Visual Analysis

To further explore the model performance, the classifier input feature distributions of the CM-MLF model and its variants are compared by t-SNE visualization. Specifically, the high-dimensional multimodal features extracted by these models were downscaled to a two-dimensional space using the t-SNE algorithm, and they are plotted on a coordinate map for visual comparison.
The t-SNE 2D visualization results of the CM-MLF model and its variants for detecting features on the Weibo datasets are shown in Figure 6.
As can be seen in Figure 6, the detection features extracted by these methods are able to classify most of the news correctly. However, the boundaries of the different categories of labeled points in Figure 6a are clearer than those in its variants, which indicates the stronger differentiation ability of the features extracted by the full model Ours. In contrast, the cross overlap between the different color points in the model variants of Figure 6b–d has increased, and the separation between categories has decreased in all of them. This indicates that these variants are less effective in feature learning compared to the full model, resulting in lower classification performance. These variants fail to adequately capture the multimodal features of the data due to the lack of certain key components, which affects the final feature representation capability and classification accuracy.

5.4. Parameter Sensitivity Experiment

To further analyze the impact of the auxiliary tasks on detection performance, this paper conducts several groups of experiments with different combinations of loss function weights. Specifically, for the two loss functions $L_{soft}$ and $L_{ak}$, we perform a grid search with a step size of 0.1 over the weights $\lambda$ and $\gamma$ in the interval $[0.1, 0.5]$, yielding 25 parameter combinations in total. Each combination is evaluated on the Weibo validation set with classification accuracy as the evaluation metric; the experimental results are shown in Figure 7.
The experimental results show that the model achieves its highest validation accuracy, 92.18%, with skl_loss_weight λ = 0.4 and clip_loss_weight γ = 0.3. This combination clearly outperforms the neighboring configurations: with λ = 0.3 and γ = 0.3 the accuracy is 91.76% (0.42 percentage points lower), with λ = 0.4 and γ = 0.2 it is 91.41% (0.77 points lower), and with λ = 0.4 and γ = 0.4 it is 90.64% (1.54 points lower). Two observations follow. First, a moderate alignment loss weight (γ = 0.3) ensures cross-modal semantic alignment without overfitting to noisy samples. Second, a higher weight on the ambiguity-score loss (λ = 0.4) allows the KL divergence to dynamically suppress the interference of unreliable modalities.
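For readers who wish to reproduce this sweep, the following is a minimal sketch of how the 25 combinations can be organized. The train_and_validate function is a placeholder standing in for the actual CM-MLF training and validation routine, and the exact composition of the total loss (including the main classification term) is assumed here, not taken from the released code.

```python
# Hypothetical sketch of the loss-weight grid search; train_and_validate is a placeholder.
import itertools

lambdas = [0.1, 0.2, 0.3, 0.4, 0.5]   # candidate weights for L_soft (ambiguity loss)
gammas = [0.1, 0.2, 0.3, 0.4, 0.5]    # candidate weights for L_ak (alignment loss)

def train_and_validate(skl_loss_weight: float, clip_loss_weight: float) -> float:
    """Placeholder: train with an assumed total loss of the form
    L_total = L_cls + skl_loss_weight * L_soft + clip_loss_weight * L_ak
    and return the accuracy on the Weibo validation set."""
    return 0.0  # replace with the real training/validation run

results = {}
for lam, gam in itertools.product(lambdas, gammas):   # 5 x 5 = 25 combinations
    results[(lam, gam)] = train_and_validate(lam, gam)

best_lam, best_gam = max(results, key=results.get)
print(f"best lambda={best_lam}, gamma={best_gam}, val acc={results[(best_lam, best_gam)]:.4f}")
```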

5.5. Limitations

This study focuses on detecting fake news without leveraging external evidence. Although this avoids the problems associated with obtaining such evidence, integrating external information (e.g., propagation paths, user comments) may significantly improve detection performance in practical applications. Moreover, the current architecture lacks a well-defined mechanism for integrating external evidence sources; if future implementations require external verification data, extensive architectural and algorithmic modifications would be needed. In addition, the CM-MLF model consists of multiple complex modules and incurs high computational costs during training. On large-scale datasets it demands considerable training time and substantial hardware resources, which may restrict its deployment in resource-constrained environments and limit its practicality and applicability.

6. Conclusions

With the rapid development of information dissemination, the spread of fake news has become a serious problem worldwide. Against the background of the information explosion and the proliferation of multimodal content on social media, fake news detection faces the dual challenges of the cross-modal semantic gap and the loss of information during multimodal fusion. In this paper, we propose CM-MLF, a cross-modal fake news detection model based on evidence-free multilevel fusion, to address these challenges. Cross-modal alignment and multilevel feature fusion significantly improve the accuracy and robustness of fake news detection.
The main contributions of this paper are as follows:
  • This paper proposes CM-MLF, a cross-modal fake news detection model based on evidence-free multilevel fusion. The problems of semantic inconsistency across multimodal features and of information loss during fusion are addressed by a cross-modal alignment and multilevel fusion framework. Experiments on public datasets show that CM-MLF outperforms the benchmark algorithms on evaluation metrics such as accuracy, verifying its effectiveness.
  • A two-stage progressive fusion framework is designed to realize a multi-level feature fusion strategy, together with an ambiguity learning module. The first stage generates primary fusion features through a cross-modal multi-head cross-attention mechanism. The second stage introduces a KL-based ambiguity score to guide attention weight allocation, further improving the model's ability to focus on key features. Fine-grained secondary fusion is realized by adaptively adjusting the secondary fusion weights of the text, image, and primary fusion features; an illustrative sketch of this weighting scheme is given after this list.
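The following is a minimal, simplified sketch of ambiguity-guided secondary fusion, written under several assumptions: the unimodal features m_t, m_v and the primary fused feature m_final are already computed and share one dimension, the ambiguity score is taken as a symmetric KL divergence between distributions derived from the two modalities, and all module names and sizes are illustrative rather than the released CM-MLF implementation.

```python
# Simplified sketch of ambiguity-guided secondary fusion (not the released CM-MLF code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondaryFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Small heads that map each modality to a distribution for the KL-based score.
        self.text_head = nn.Linear(dim, dim)
        self.image_head = nn.Linear(dim, dim)
        # Produces the three secondary-fusion weights for text, image, and primary fusion.
        self.weight_head = nn.Linear(dim * 3 + 1, 3)

    def forward(self, m_t, m_v, m_final):
        # Ambiguity score: symmetric KL divergence between the two unimodal distributions.
        p_t = F.log_softmax(self.text_head(m_t), dim=-1)
        p_v = F.log_softmax(self.image_head(m_v), dim=-1)
        kl_tv = F.kl_div(p_t, p_v, log_target=True, reduction="none").sum(-1, keepdim=True)
        kl_vt = F.kl_div(p_v, p_t, log_target=True, reduction="none").sum(-1, keepdim=True)
        ambiguity = 0.5 * (kl_tv + kl_vt)

        # Ambiguity-aware weights over text, image, and primary fused features.
        weights = torch.softmax(
            self.weight_head(torch.cat([m_t, m_v, m_final, ambiguity], dim=-1)), dim=-1)
        w_t, w_v, w_f = weights.unbind(-1)
        return (w_t.unsqueeze(-1) * m_t
                + w_v.unsqueeze(-1) * m_v
                + w_f.unsqueeze(-1) * m_final)

# Example usage with random placeholder features.
fuser = SecondaryFusion(dim=256)
m_t, m_v, m_final = (torch.randn(8, 256) for _ in range(3))
fused = fuser(m_t, m_v, m_final)   # shape: (8, 256)
```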
Experimental results on the two public datasets, Weibo and Twitter, show that CM-MLF outperforms the existing baseline models in accuracy, precision, recall, and F1 score, reaching accuracies of 92.2% and 90.1%, respectively, and validating its effectiveness and robustness. The ablation experiments further confirm the necessity of each module; in particular, the cross-modal alignment and attention mechanisms play a key role in the performance improvement.
Future research will focus on extending the feature fusion process. Specifically, we will explore how to incorporate additional feature dimensions, such as propagation, social, comment, and emotion features, into the detection framework to achieve more effective complementarity of information across modalities and further enhance the accuracy and robustness of fake news detection.

Author Contributions

Conceptualization, P.H. and H.Z.; methodology, P.H. and H.Z.; software, H.Z.; validation, H.Z. and Y.W.; formal analysis, S.C.; writing—original draft preparation, P.H. and H.Z.; writing—review and editing, P.H. and H.Z.; visualization, H.Z.; supervision, P.H.; funding acquisition, P.H. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hebei based Universities and Shijiazhuang Industry University Research Cooperation Project (2511300301A), Science Research and Development Project of Hebei University of Economics and Business (2022YB05).

Data Availability Statement

The source code and data are publicly available at https://www.kaggle.com/datasets/hanxue123/cm-mlf (accessed on 7 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial intelligence
NLP	Natural language processing
ML	Machine learning
KL	Kullback–Leibler divergence
LSTM	Long short-term memory network
VGG	Visual geometry group
CNN	Convolutional neural network
BN	Batch normalization
MLP	Multilayer perceptron

References

  1. Shu, K.; Sliva, A.; Wang, S.; Tang, J.; Liu, H. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explor. Newsl. 2017, 19, 22–36. [Google Scholar] [CrossRef]
  2. Hu, L.; Wei, S.; Zhao, Z.; Wu, B. Deep learning for fake news detection: A comprehensive survey. AI Open 2022, 3, 133–155. [Google Scholar] [CrossRef]
  3. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018. [Google Scholar] [CrossRef]
  4. Wang, Y.; Ma, F.; Wang, H.; Jha, K.; Gao, J. Multimodal emergent fake news detection via meta neural process networks. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021. [Google Scholar] [CrossRef]
  5. Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal variational autoencoder for fake news detection. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019. [Google Scholar] [CrossRef]
  6. Singhal, S.; Kabra, A.; Sharma, M.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P. SpotFake+: A multimodal framework for fake news detection via transfer learning (Student abstract). In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar] [CrossRef]
  7. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar] [CrossRef]
  8. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 2019, 13, 95. [Google Scholar] [CrossRef] [PubMed]
  9. Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-aware multi-modal fake news detection. In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, 11–14 May 2020. [Google Scholar] [CrossRef]
  10. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting Fake News by Exploring the Consistency of Multimodal Data. Inf. Process. Manag. 2021, 58, 102610. [Google Scholar] [CrossRef] [PubMed]
  11. Ghorbanpour, F.; Ramezani, M.; Fazli, M.A.; Rabiee, H.R. FNR: A Similarity and Transformer-Based Approach to Detect Multi-Modal Fake News in Social Media. Soc. Netw. Anal. Min. 2023, 13, 56. [Google Scholar] [CrossRef]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021; Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 7 June 2025).
  13. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017. [Google Scholar] [CrossRef]
  14. Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019. [Google Scholar] [CrossRef]
  15. Song, C.; Ning, N.; Zhang, Y.; Wu, B. A Multimodal Fake News Detection Model Based on Crossmodal Attention Residual and Multichannel Convolutional Neural Networks. Inf. Process. Manag. 2021, 58, 102437. [Google Scholar] [CrossRef]
  16. Wu, Y.; Zhan, P.; Zhang, Y.; Wang, L.; Xu, Z. Multimodal Fusion with Co-attention Networks for Fake News Detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Bangkok, Thailand, 1–6 August 2021. [Google Scholar] [CrossRef]
  17. Qi, P.; Cao, J.; Li, X.; Liu, H.; Sheng, Q.; Mi, X.; He, Q.; Lv, Y.; Guo, C.; Yu, Y. Improving Fake News Detection by Using an Entity-Enhanced Framework to Fuse Diverse Multimodal Clues. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021. [Google Scholar] [CrossRef]
  18. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-Modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022. [Google Scholar] [CrossRef]
  19. Peng, L.; Jian, S.; Li, D.; Shen, S. MRML: Multimodal Rumor Detection by Deep Metric Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
  20. Ying, Q.; Hu, X.; Zhou, Y.; Qian, Z.; Zeng, D.; Ge, S. Bootstrapping Multi-View Representations for Fake News Detection. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar] [CrossRef]
  21. Peng, L.; Jian, S.; Kan, Z.; Qiao, L.; Li, D. Not All Fake News Is Semantically Similar: Contextual Semantic Representation Learning for Multimodal Fake News Detection. Inf. Process. Manag. 2024, 61, 103564. [Google Scholar] [CrossRef]
  22. Yu, Y.; Ji, K.; Gao, Y.; Chen, Z.; Ma, K.; Zhao, X. Multi-Source Heterogeneous Data Progressive Fusion for Fake News Detection. Comput. Sci. 2024, 51, 30–38. [Google Scholar] [CrossRef]
  23. Zhong, J.; Gao, J.; Huang, J.; Yang, Y. Multimodal Fake News Detection Based on Evidence Enhancement and Local Semantic Interaction. Chin. J. Comput. 2025, 48, 556–571. Available online: https://dl.ccf.org.cn/article/articleDetail.html?type=qkwz&_ack=2&id=7477135053260800 (accessed on 7 June 2025).
  24. Deng, X.; Wang, L.; Zeng, X.; Ye, H.; Che, X. Fake News Detection Method Based on Attention-Guided Multimodal Feature Fusion. J. Front. Comput. Sci. Technol. 2025, pp. 1–12. Available online: https://link.cnki.net/urlid/11.5602.TP.20250307.1854.002 (accessed on 7 June 2025).
  25. Boididou, C.; Papadopoulos, S.; Zampoglou, M.; Apostolidis, L.; Papadopoulou, O.; Kompatsiaris, Y. Detection and Visualization of Misleading Content on Twitter. Int. J. Multimed. Inf. Retr. 2018, 7, 71–86. [Google Scholar] [CrossRef]
Figure 1. Cross-modal fake news detection model based on multi-level fusion without evidence (CM-MLF). The three different colored arrows represent three sets of inputs and output with different Q values.
Figure 2. Schematic diagram of a two-way multi-head cross-attention fusion network.
Figure 3. Schematic diagram of the attention weight acquisition network during secondary fusion. The green arrows represent the inputs and output when m_t is used as the Q value, the blue arrows those when m_v is used as the Q value, and the red arrows those when m_final is used as the Q value.
Figure 4. Bar chart of comparative experimental results on the Weibo dataset.
Figure 5. Bar chart of comparative experimental results on the Twitter dataset.
Figure 6. t-SNE 2D visualization of the features of this model and its variants on the Weibo dataset. Points of the same color belong to the same label category; (a) denotes the CM-MLF model in this paper, (b) the model variant Ours-B, (c) the model variant Ours-C, and (d) the model variant Ours-D.
Figure 7. The effect of different combinations of auxiliary-task loss function weights on accuracy.
Table 1. The size variation of the output feature map.
Stage | Output Size | Number of Channels | Key Operation
Conv | 112 × 112 | 64 | 7 × 7 convolutional kernel, step size 2
Maxpool | 56 × 56 | 64 | 3 × 3 maximum pooling, step size 2
Stage 1 | 56 × 56 | 64 | 2 residual blocks
Stage 2 | 28 × 28 | 128 | 2 residual blocks, step size 2
Stage 3 | 14 × 14 | 256 | 2 residual blocks, step size 2
Stage 4 | 7 × 7 | 512 | 2 residual blocks, step size 2
Table 2. Statistics for the Weibo dataset.
Label | Real | Fake | Total
Train | 2492 | 2877 | 5369
Test | 691 | 741 | 1432
Total | 3183 | 3618 | 6801
Table 3. Statistics for the Twitter dataset.
Label | Real | Fake | Total
Train | 6840 | 5007 | 11,847
Test | 833 | 573 | 1406
Total | 7673 | 5580 | 13,253
Table 4. Labeling concept descriptions.
Number | Label | Description
1 | TP | Number of news stories where the truth value label is fake and the predicted label is fake
2 | FP | Number of news stories where the truth value label is real and the predicted label is fake
3 | TN | Number of news stories where the truth value label is real and the predicted label is real
4 | FN | Number of news stories where the truth value label is fake and the predicted label is real
Table 5. Experimental environment settings.
Experimental Environment | Settings
PyTorch | 1.11.0
Python | 3.8
CUDA | 11.3
GPU | RTX 4090 (24 GB)
CPU | 22 vCPU AMD EPYC 7T83 64-Core Processor
Memory | 90 GB
Hard Disk | 30 GB + 50 GB
Table 6. Experimental parameter settings.
Experimental Parameter | Settings
Batch_size | 64
Epochs | 20
Optimizer | AdamW
Learning_rate | 2 × 10−5
Weight-decay | 1 × 10−4
λ | 0.4
γ | 0.3
Table 7. Comparative experimental results on the Weibo dataset.
Method | Accuracy | Rumor P | Rumor R | Rumor F1 | Non-Rumor P | Non-Rumor R | Non-Rumor F1
Att_RNN | 0.781 | 0.802 | 0.765 | 0.783 | 0.761 | 0.801 | 0.781
EANN | 0.810 | 0.831 | 0.792 | 0.812 | 0.789 | 0.829 | 0.809
MVAE | 0.824 | 0.854 | 0.769 | 0.809 | 0.802 | 0.875 | 0.837
SAFE | 0.763 | 0.757 | 0.799 | 0.777 | 0.772 | 0.726 | 0.749
CARMN | 0.853 | 0.891 | 0.814 | 0.851 | 0.818 | 0.894 | 0.854
CAFE | 0.840 | 0.855 | 0.830 | 0.842 | 0.825 | 0.851 | 0.837
BMR | 0.884 | 0.875 | 0.886 | 0.880 | 0.874 | 0.881 | 0.877
CSFND | 0.895 | 0.899 | 0.895 | 0.897 | 0.892 | 0.896 | 0.894
AGMFN | 0.917 | 0.918 | 0.910 | 0.912 | 0.913 | 0.924 | 0.918
Ours | 0.922 | 0.935 | 0.912 | 0.923 | 0.908 | 0.932 | 0.920
The bold values in the table highlight the advantages of the CM-MLF model.
Table 8. Comparative experimental results on the Twitter dataset.
Method | Accuracy | Rumor P | Rumor R | Rumor F1 | Non-Rumor P | Non-Rumor R | Non-Rumor F1
Att_RNN | 0.681 | 0.758 | 0.659 | 0.705 | 0.603 | 0.712 | 0.653
EANN | 0.678 | 0.765 | 0.641 | 0.698 | 0.597 | 0.729 | 0.657
MVAE | 0.598 | 0.697 | 0.543 | 0.610 | 0.518 | 0.676 | 0.587
SAFE | 0.643 | 0.676 | 0.506 | 0.579 | 0.625 | 0.772 | 0.691
CARMN | 0.735 | 0.778 | 0.652 | 0.709 | 0.704 | 0.817 | 0.756
CAFE | 0.806 | 0.807 | 0.799 | 0.803 | 0.805 | 0.813 | 0.809
BMR | 0.842 | 0.821 | 0.782 | 0.796 | 0.845 | 0.904 | 0.891
CSFND | 0.833 | 0.899 | 0.799 | 0.846 | 0.763 | 0.878 | 0.817
AGMFN | 0.847 | 0.816 | 0.913 | 0.883 | 0.933 | 0.805 | 0.803
Ours | 0.901 | 0.816 | 0.988 | 0.894 | 0.990 | 0.838 | 0.908
The bold values in the table highlight the advantages of the CM-MLF model.
Table 9. Comparative results of the ablation experiment.
Dataset | Method | Accuracy | Rumor F1 | Non-Rumor F1
Weibo | Ours-b | 0.911 | 0.912 | 0.911
Weibo | Ours-c | 0.906 | 0.906 | 0.906
Weibo | Ours-d | 0.905 | 0.905 | 0.905
Weibo | Ours-e | 0.896 | 0.895 | 0.896
Weibo | Ours | 0.922 | 0.923 | 0.920
Twitter | Ours-b | 0.892 | 0.891 | 0.892
Twitter | Ours-c | 0.886 | 0.884 | 0.888
Twitter | Ours-d | 0.883 | 0.884 | 0.885
Twitter | Ours-e | 0.871 | 0.868 | 0.874
Twitter | Ours | 0.901 | 0.894 | 0.908
The bold values in the table highlight the advantages of the CM-MLF model.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
