Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection

Guo, Zheheng; Liu, Haonan; Zuo, Lijiao; Wen, Junhao

doi:10.3390/math13111731

Open AccessFeature PaperArticle

Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection

¹

School of Earth Resources, China University of Geosciences (Wuhan), Wuhan 430074, China

²

School of Electronic Information and Communication, Huazhong University of Science and Technology, Wuhan 430074, China

³

School of Bigdata and Software Engineering, Chongqing University, Chongqing 400044, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2025, 13(11), 1731; https://doi.org/10.3390/math13111731

Submission received: 7 April 2025 / Revised: 14 May 2025 / Accepted: 16 May 2025 / Published: 24 May 2025

Download

Browse Figures

Versions Notes

Abstract

The rapid growth of social media and online information-sharing platforms facilitates the spread of rumors. Accurate rumor detection to minimize manual verification efforts remains a critical research challenge. While multimodal rumor detection leveraging both text and visual data has gained increasing attention due to the diversification of social media content, existing approaches face the following three key limitations: (1) yhey prioritize lexical features of text while neglecting inherent logical inconsistencies in rumor narratives; (2) they treat textual and visual features as independent modalities, failing to model their intrinsic connections; and (3) they overlook semantic incongruities between text and images, which are common in rumor content. This paper proposes a dual-chain multimodal feature learning framework for rumor detection to address these issues. The framework comprehensively extracts rumor content features through the following two parallel processes: a basic semantic feature extraction module that captures fundamental textual and visual semantics, and a logical connection feature learning module that models both the internal logical relationships within text and the cross-modal semantic alignment between text and images. The framework achieves the multi-level fusion of text–image features by integrating modal alignment and cross-modal attention mechanisms. Extensive experiments on the Pheme and Weibo datasets demonstrate that the proposed method performs better than baseline approaches, confirming its effectiveness in detecting multimodal rumors.

Keywords:

rumor detection; multimodal feature; modal alignment

MSC:

68T35

1. Introduction

The scenarios of rumor detection research mainly focus on social media platforms. In the early days, the primary form of social media platforms was forums. A post on a forum primarily comprises an original message posted by the poster and reply messages. The reply and original messages were presented in the same format, taking up much space. A long post could often reach the length of dozens of web pages. With the development of the information age, Weibo gradually replaced forums as the mainstream application for information exchange. A Weibo post mainly consists of text within a limited number of characters and images or videos. The comment and forward chains can be easily traced. It has become more convenient for people to browse posts and the comments and opinions of others. There is no need to jump between web pages repeatedly, and previous records can be viewed without returning to a specific page.

The research on the content features of rumor detection mainly focuses on the following two modalities: text and image. A specific scenario (as shown in Figure 1) is selected for analysis. The content features of the text are divided into two aspects. One is the semantic features of text entities. For example, in the text of User 1, the entities “Mary” and “cat” represent roles, the entity “supermarket” represents a location, and “fly away” represents an event. The other is the logical features of the text. From the perspective of a single text, it is the concatenation relationship between entities. “Mary’s cat flew away near the supermarket” represents the logical relationship of a role having an event occur at a specific place. At the same time, the logical features of the text also include the correctness of the relationship between entities. According to the statement of User 1 and combined with the actual situation, a “cat” cannot act “flying away”. In the process of feature extraction by the model, this is manifested as the distance between entities in the dictionary. The closer the distance, the stronger the correlation. Through a comprehensive analysis combined with tweet comments, the text of User 2 mentions a new entity “dog”, which has a logical conflict with the statement of User 1, that is, the logical features of the text context. Finally, by combining the two modalities of text and image, the entity in the text of User 1 is a “cat”, but in the image, it is a small dog. The inconsistency between the text and the image is a logical feature between the text and the image.

Currently, in the research on rumor detection based on content features, the methods with the core of mining multimodal features mainly involve extracting the content features of texts and images. After fusing these two multimodal features, the fused features are input into a classifier for classification prediction. Such algorithms have learned the common content features of texts and images in social posts and achieved a high accuracy rate. However, the following two aspects can be improved: Firstly, the perspective of learning the content features of texts and images is rather singular. Usually, it only focuses on one aspect, either the basic semantic features or the logical features, for feature learning. There is still room for improvement in comprehensively learning the content features. Secondly, existing studies mainly adopt fusion methods such as attention mechanisms for multimodal feature fusion. At the same time, there is relatively little research on reducing the differences between the features of different modalities before the fusion process.

Aiming at the above two problems, this paper proposes a rumor detection algorithm based on double-chain multimodal feature learning (DMFL). This algorithm adopts the method of double-chain learning to learn the basic semantic features and logical connection features of multimodalities in parallel. In the essential semantic feature learning chain, modal alignment is carried out for the text and image modal features, enabling the model to learn the mutual representation information between different modalities and reducing the differences between the features of different modalities. In the logical feature learning chain, by injecting image features into the multi-head attention layer of the text logical feature encoder, the text logical features fused with the logical features of text and images can be obtained. The fused multimodal content features are obtained by fusing the basic semantic features and logical features through cross-modal attention. Finally, these features are input into a classifier for classification prediction to determine the label of the input post, whether it is a rumor or not.

The research contributions made in this paper can be summarized as follows:

A method of double-chain multimodal feature learning is proposed. The learning processes of semantic and logical features are separated and carried out in parallel, paying more comprehensive attention to various existing logical features, namely, the logical features of text entities and the logical features between texts and images. Finally, more comprehensive and in-depth content features are obtained through the double-chain learning method for rumor detection.
Modal alignment operations are performed on the semantic features of the two modalities of text and image before fusion, narrowing the differences between features of different modalities and improving the effect of multimodal fusion. Feature extraction plays a role in feature enhancement, making the ultimately obtained multimodal semantic features more conducive to rumor detection.
Empirical evaluations on two public real-world datasets (Pheme and Weibo) reveal that the DMFL model achieves statistically significant improvements in accuracy, surpassing state-of-the-art baselines by 1.65% and 3.89%, respectively, underscoring its efficacy in multimodal rumor detection. The model’s robustness and generalizability are further validated through comprehensive ablation studies, parameter sensitivity analysis, and cross-modal correlation assessments.

2. Related Work

Early rumor detection research [1,2] introduced recurrent neural networks to learn hidden representations from the textual content of relevant articles, setting a precedent for deep learning to extract rumor text features automatically. Ref. [3] used convolutional neural networks to obtain key and hidden semantic features from text content. However, these studies only used the full text as input, losing the original structure of the article. Therefore, ref. [4] adopts the hierarchical structure of “word-sentence-article” to understand the text, and it extracts the word and sentence levels as features to understand the text. Refs. [5,6] proposed learning word-level and sentence-level representations of declarative texts and news articles as text features. However, the above studies only considered the text modality and ignored visual features.

In general, rumors carry multimodal information, such as images and videos. Many studies have combined visual features with features for rumor detection. Early research in [7] first attempted to manually extract information, such as the propagation time of false images on Twitter, and proposed a classification model to identify false images on Twitter. In recent years, most studies have directly utilized deep learning pre-trained models to extract high-dimensional representations of images from visual information (e.g., images and videos) for computation. For example, Ref. [8] extracts fake news, rumors, and tweet image information by a pre-trained convolutional neural network (CNN) to mine deep visual features and extract high-dimensional representations of the visual information as combined text and other modal features. There is also research on converting images to matrices using image embedding methods. Ref. [9] embeds visual information into a matrix via image2sentence [10] and then extracts visual features using TextCNN. It computes the similarity between different modal data and increases the sensory field compared to models pre-trained using convolutional neural networks such as VGG [11]. In another study, Wang [12] et al. consider the introduction of a knowledge graph and propose the KMGCN model to model the global structure between text, images, and knowledge concepts to obtain a comprehensive semantic representation.

The above research on rumor detection based on multimodal data enhances rumor feature representation for classification detection by investigating the multimodal representation of rumor features. However, there is still a lack of in-depth research on multi-level multimodal data features and multimodal fusion.

3. Problem Definition

This section will introduce the basic scenarios and important definitions studied in this chapter. In a multimodal social platform, an event is a collection of multimedia posts containing text and images on social media. For an event S containing n posts, it can be expressed as

S = {s_{1}, s_{2}, s_{3}, \dots, s_{n}}

. For each post

s_{i} \in S, s_{i} = {T_{i}, V_{i}, U_{i}, C_{i}}

, where

T_{i}

,

V_{i}

represent the text and image of the post, and

U_{i}

represents the user who posted the post.

C_{i} = {c_{1 i}, c_{2 i}, \dots, c_{j i}}

represents the set of forwarded comments on post

s_{i}

, where each forwarded comment

c_{j i}

is posted by the corresponding user

u_{j i}

.

Y = {y_{1}, y_{2}, \dots, y_{n}}

is the set of labels for n posts. This paper defines rumor detection as a binary classification task, that is,

y \in {0, 1}

represents the classification label, where

y = 0

indicates non-rumor and

y = 1

indicates that the post is a rumor.

This paper aims to predict the label of a given post

s_{i}

by learning the function

f (s_{i}) = y

to determine whether the post is a rumor.

For each post

s_{i} \in S, s_{i} = {T_{i}, V_{i}, R_{i}, U_{i}, C_{i}}

, where

R_{i}

represents the forwarding relationship of post

s_{i}

.

R_{i} = {s_{i}, w_{1}^{i}, w_{2}^{i}, \dots, w_{n_{i} - 1}^{i}, G_{i}}

, where

n_{i}

refers to the number of posts in the forwarding relationship

r_{i}

,

s_{i}

is the source post, each

w_{j}^{i}

represents the j-th comment forwarding the source post, and

G_{i}

refers to the propagation structure. Specifically,

G_{i}

is defined as a graph

G_{i} = < N_{i}, E_{i} >

,

s_{i}

is the root node, where

N_{i} = {s_{i}, w_{1}^{i}, w_{2}^{i}, \dots, w_{n_{i} - 1}^{i}}

is a node set containing source post nodes, comment nodes, and forwarding nodes.

E_{i} = {e_{s t}^{i} | s, t = 0, \dots, n_{i} - 1}

represents the edge set between posts. If

w_{1}^{i}

is a forwarded or commented post of

s_{i}

, there is a directed edge

s_{i} \to w_{1}^{i}

, i.e.,

e_{01}^{i}

. If

w_{2}^{i}

forwards or comments

w_{1}^{i}

, there is a directed edge

w_{1}^{i} \to w_{2}^{i}

, i.e.,

e_{12}^{i}

.

A_{i} \in {0, 1}^{n_{i} \times n_{i}}

represents the adjacency matrix, where

a_{t s}^{i} = \{\begin{matrix} 1, & if e_{s t}^{i} \in E \\ 0, & otherwise \end{matrix}

(1)

In addition, Table 1 summarizes the symbols used in this chapter and their descriptions.

4. Model Design and Description

4.1. Algorithm Model Framework

The rumor detection algorithm model based on the double-chain multimodal feature learning proposed in this paper is divided into four modules as follows: the basic semantic feature extraction module, the logical connection feature extraction module, the text–image modality alignment module, and the classification prediction module. The algorithm framework is shown in Figure 2.

First, the essential semantic feature extraction module extracts the basic semantic features of the text from the semantic content of the text words. The text–image multimodal basic semantic features are obtained after passing through the cross-attention mechanism with the extracted image content features. At the same time, the logical connection feature extraction module inputs the image features through the cross-attention mechanism inside the text encoder to obtain the representation of the text–image multimodal logical connection features. Then, the modal alignment is carried out for the text and image features to reduce the differences in features of different modalities, and the multimodal features in the unified representation space are obtained. Finally, in the classification prediction module, the fused multimodal features are obtained through the fusion of the cross-modal cross-attention layer, which is finally used for classification prediction to obtain labels.

4.2. Text Basic Semantic Feature Encoder

The input natural language must first be processed into a vector form to enable computers to understand and process natural language. The model’s basic semantic feature extraction module in this paper uses Word2vec, which leverages neural networks to learn and generate word vectors. It simplifies text content into vectors in the vector space. It expresses the similarity of text content by calculating the cosine similarity in the vector space, converting natural language words in the input text into low-dimensional dense vectors. In this process, words with similar meanings will be mapped to similar positions in the vector space. The CBOW model and Skip–Gram are two essential models of Word2vec. This paper uses the Skip–Gram model based on the hierarchical Softmax framework, which utilizes a binary tree with words as leaves to calculate the probability of each word.

Each word can be reached from the root node through several random node paths. Therefore, the probability of calculating the output word vector

w_{O}

based on the input word vector

w_{I}

is calculated as follows:

p (w_{O} | w_{I}) = \prod_{j = 1}^{L (w) - 1} σ ([n (w, j + 1) = c h (n (w, j))] {\dot{v}}_{n (w, j)}^{T} v_{w_{I}})

(2)

where

[x]

is an indicator function. If the value of x is true,

[x]

is 1; otherwise, it is −1. After obtaining the embedded word vector of the text, CNN and pooling focus on the local word vector to extract the semantic features of the input text. First, for each post

s_{i}

, we pad or truncate its embedding vector

T_{i}

so that all input text vectors have the same length, that is, the length is L, expressed as follows:

T_{1 : L}^{i} = {t_{1}^{i}, y_{2}^{i}, \dots, t_{L}^{i}}

(3)

Among them,

t \in R^{d}

, where d is the dimension of the word embedding, and

t_{j}^{i}

represents the j-th word embedding of

T_{i}

.

Then, the obtained word embedding matrix

T_{j : j + k - 1}^{i}

is input into the convolutional layer to obtain the feature map

f_{i j}

. By setting the receptive field size as k, the feature matrix

x^{i}

can be expressed as Formula (4), as follows:

x^{i} = x_{i 1}, x_{i 2}, \dots, x_{i (L - k + 1)}

(4)

\hat{x^{i}} = m a x (x_{i})

(5)

We perform the max-pooling operation on the feature matrix, as shown in Equation (5), to obtain

\hat{x^{i}}

. Next, this paper uses

d / 3

filters with different receptive fields where

k \in {3, 4, 5}

to obtain semantic features at different granularities. Finally, the outputs of all filters are concatenated to form the overall text basic semantic features of

T^{i}

as follows:

R_{t}^{i} = c o n c a t ({\hat{x}}_{k = 3}^{i}, {\hat{x}}_{k = 4}^{i}, {\hat{x}}_{k = 5}^{i})

(6)

4.3. Image Encoder

At the same time, the Vision Transformer (ViT) is used to process the input image and extract the features of the image content. The module structure is shown in Figure 3:

The standard Transformer takes a one-dimensional embedding sequence with position encoding as input. To process two-dimensional images, this module reshapes the image

V^{i} \in H \times W \times C

into a sequence of flattened two-dimensional patches

V_{p}^{i} \in R^{N \times (P^{2} \cdot C)}

, where

(H, W)

is the resolution of the original image, C is the number of channels,

(P, P)

is the resolution of each image patch, and

N = H W / P^{2}

is the number of generated image patches, which also serves as the effective input sequence length of the Transformer. The Transformer uses a fixed latent vector size D in all its layers. Therefore, the image patches are flattened and mapped to the D dimension using a trainable linear projection. Like the learnable position encoding used in the BERT model, this paper’s image feature extraction module adds a learnable embedding in front of the sequence of embedded patches. A standard learnable one-dimensional position embedding is used to retain the position information, and the position embedding is added to the image patch embedding. The obtained sequence of embedding vectors is used as the input of the encoder to obtain the projected output patch embedding, and the expression is as follows:

x_{0}^{i} = [v_{c l a s s}^{i}; v_{p_{1}}^{i} E; v_{p_{2}}^{i} E; \dots; v_{p_{N}}^{i} E] + E_{p o s}

(7)

Among them,

E \in R^{(P^{2} \cdot C) \times D}

,

E_{p o s} \in R^{(N + 1) \times D}

is the added position information embedding, which is used to provide position information, and

v_{c l a s s}^{i}

is the class token embedding associated with the i-th element.

The Transformer encoder consists of alternating layers of multi-head self-attention (MSA) and an MLP module that contains two layers with GELU nonlinearities. Layer normalization (LN) is applied before each module, and residual connections are used after each module. The process by which the encoder outputs the image feature representation

R_{v}^{i}

is as follows:

x_{ℓ}^{i'} = M S A (L N (x_{ℓ - 1}^{i})) + x_{ℓ - 1}^{i}, ℓ = 1 \dots L

(8)

x_{ℓ}^{i} = M L P (L N (x_{ℓ}^{i'})) + x_{ℓ}^{i'}, ℓ = 1 \dots L

(9)

R_{v}^{i} = L N (x_{L}^{i})

(10)

4.4. Cross-Modal Cross Attention

The model in this paper uses a cross-modal cross-attention mechanism to capture the mutual information between different modalities and learn the attention weights between features of different modalities to enhance cross-modal features. The essence of the cross-modal cross-attention mechanism is similar to that of the multi-head attention mechanism. Its core idea is to divide the hidden state vector into multiple independent heads, each responsible for processing a subset of the vector dimensions, thus forming multiple sub-semantic spaces. Specifically, each head independently performs a linear transformation on the input to generate a new set of vector representations. These heads work in subspaces independently, enabling each head to focus on capturing different features and patterns in the input data.

That is, each head is an independent attention mechanism that can focus on the information of different parts in the input sequence. For example, if the input vector is in the text modality, one head focuses on the relationship between nouns and verbs in a sentence. In contrast, another head focuses on the tense and mood of the sentence. Multiple heads can cover a broader range of semantic information through such division of labor and cooperation.

This multi-head attention design enables the model to analyze the input data from multiple perspectives. Each head extracts features in its unique sub-semantic space. By concatenating or combining these features, the model can obtain a more comprehensive and rich representation. In this way, the model can capture the input data’s local features and understand the global context and complex semantic relationships, thus achieving multimodal fusion and obtaining fused multimodal features.

Multi-head self-attention is first used for each modality to enhance the intra-modal feature representation. For the basic text semantic feature

R_{t}^{i}

, we use

Q_{t}^{i} = R_{t}^{i} W_{t}^{Q}

,

K_{t}^{i} = R_{t}^{i} W_{t}^{K}

and

V_{t}^{i} = R_{t}^{i} W_{t}^{V}

to calculate its query matrix, key matrix, and value matrix. Among them,

W_{t}^{Q}, W_{t}^{K}, W_{t}^{V} \in R^{d \times \frac{d}{H}}

are linear transformations, and H is the number of heads. Then, a weighted operation is performed on the results of each head to obtain the multi-head self-attention features of the basic text semantic modality. The calculation formula is as follows:

Z_{t}^{i} = (‖_{h = 1}^{H} s o f t m a x (\frac{Q_{t}^{i} {K_{t}^{i}}^{T}}{\sqrt{d}}) V_{t}^{i}) W_{t}^{O}

(11)

Among them, h represents the h-th head, and

W_{t}^{O} \in R^{d \times d}

is the output linear transformation. We perform the same operation on the image feature

R_{v}^{i}

derived above to obtain the image modality’s multi-head self-attention feature

Z_{v}^{i}

.

Then, an enhanced multimodal feature is generated through the cross-attention mechanism. To obtain the text–image multimodal basic semantic features of the post

s_{i}

, we perform an operation similar to the self-attention procedure described above, but we replace

R_{t}^{i}

with

Z_{v}^{i}

to obtain the query matrix

Q_{v}^{i}

, and we replace

R_{t}^{i}

with

Z_{t}^{i}

to receive the key matrix

K_{t}^{i}

and the value matrix

V_{t}^{i}

. Finally, a weighted operation is carried out to obtain the text–image multimodal essential semantic feature

Z_{v t}^{i}

, and the calculation formula is as follows:

Z_{v t}^{i} = (‖_{h = 1}^{H} s o f t m a x (\frac{Q_{v}^{i} {K_{t}^{i}}^{T}}{\sqrt{d}}) V_{t}^{i}) W_{v t}^{O}

(12)

Among them,

W_{v t}^{O} \in R^{d \times d}

is the output linear transformation.

4.5. Extraction of Text–Image Multimodal Logical Connection Features

The characteristics of rumor content include semantic features and require learning of logical features. In the text–image multimodal environment, there are logical relationships between words within the text and logical errors such as self-contradictions and mismatches between text and images. Moreover, extracting the logical features of rumor content from previous studies is considered a single perspective. It did not profoundly explore the logical features within and between the text and the image. This paper designs a text–image multimodal logical connection feature extraction module to deeply learn the text–image logical connections and extract multimodal logical features. Visual information is injected by inserting an additional cross-attention (CA) layer between the bidirectional self-attention (Bi-SA) layer and the feed-forward network (FFN) layer of the text logical feature encoder. Finally, the output features are used as the multimodal representation of the text–image logical connection features. The specific structure is shown in Figure 4.

Different from the way of processing text embeddings in the previous section, to learn the contextual relationships of the text, a unique classification token, denoted as

[C L S]

, is inserted at the beginning of the text word vector sequence. The aggregated sequence of the final hidden states of this token will be used as the representation for the final classification task. Meanwhile, a unique token, denoted as

[S E P]

, is inserted into the text vector sequence to separate the specific post text. A learned embedding is added to each token to indicate its specific attribution. Suppose the text input embedding is denoted as

T^{'}

. In that case, the final hidden vector of the classification token

[C L S]

is represented as

C \in R^{H}

, and the final hidden vector of the i-th input token is

T_{i}^{'} \in R^{H}

. Therefore, the calculation process of the text–image multimodal logical connection features is as follows:

x_{l}^{i t^{'}} = S e l f A t t e n t i o n (T^{'})

(13)

R_{v t^{'}}^{i} = C o - A t t e n t i o n (x^{i t^{'}}, R_{v}^{i})

(14)

Z_{v t^{'}}^{i} = F e e d F o r w a r d N e t w o r k (R_{v t^{'}}^{i})

(15)

4.6. Modality Alignment

In multimodal research, data from different modalities usually have different data distributions and feature representation methods, which makes it difficult for models to model and fuse multimodal data directly. To address this issue, this paper maps data from different modalities into a unified representation space through modality alignment, enabling the model to perform unified modeling and analysis. At the same time, this method can also help eliminate the feature differences between different modalities, allowing the model to pay more attention to the semantic similarities and correlations between modalities. By combining the results of modality alignment, the model can better integrate information from different modalities, thereby improving the generalization ability and performance of the model.

In this paper, modality alignment is carried out by forcing the enhanced text features of the post to align with its enhanced image features to enhance the representations learned in each modality and learn the inherent correlations between different modalities. For post

s_{i}

, its enhanced text features

Z_{t}^{i}

and enhanced image feature

Z_{v}^{i}

are transformed into the same modal feature space, that is,

\begin{matrix} {Z_{t}^{i}}^{'} = W_{t}^{'} Z_{t}^{i} \\ {Z_{v}^{i}}^{'} = W_{v}^{'} Z_{v}^{i} \end{matrix}

(16)

Among them,

W_{t}^{'}

and

W_{v}^{'}

are learnable parameters. Then, the Mean Squared Error (MSE) loss is used to reduce the distance between

{Z_{t}^{i}}^{'}

and

{Z_{v}^{i}}^{'}

, shown in Equation (17), as follows:

L_{a} = \frac{1}{n} \sum_{i = 1}^{n} {({Z_{t}^{i}}^{'} - {Z_{v}^{i}}^{'})}^{2}

(17)

Among them, n is the total number of posts. By calculating the loss, the text–image basic semantic multimodal features

\tilde{Z_{v t}^{i}}

and the logical connection multimodal features

\tilde{Z_{v t^{'}}^{i}}

are obtained, and these are used for fusion and classification operations.

4.7. Classification Prediction Module

By fusing the above text–image multimodal features, the essential semantic multimodal feature

\tilde{Z_{v t}^{i}}

and the logical connection multimodal feature

\tilde{Z_{v t^{'}}^{i}}

are input into the cross-modal cross-attention mechanism to obtain the following two cross-modal enhanced fusion features:

\tilde{Z_{v t v t^{'}}^{i}}

and

\tilde{Z_{v t^{'} v t}^{i}}

. Then, a weighted operation is performed on them to obtain the final fused multimodal feature for classification prediction as follows:

Z^{i} = c o n c a t (\tilde{Z_{v t v t^{'}}^{i}}, \tilde{Z_{v t^{'} v t}^{i}})

(18)

Finally, the final multimodal feature

Z^{i}

of the post

s_{i}

is input into the fully connected layer to predict the label of

s_{i}

and determine whether it is a rumor.

\hat{y_{i}} = s o f t m a x (W_{c} Z^{i} + b),

(19)

Among them,

\hat{y_{i}}

represents the probability that the post

s_{i}

is a rumor. The cross-entropy loss function is used to optimize the classification task, and the calculation process is as follows:

L_{c} = - y log (\hat{y^{i}}) - (1 - y) log (1 - \hat{y^{i}})

(20)

We assign custom weights, respectively, to the loss

L_{a}

of the modality alignment task, and we then add them together. The calculation formula of the final loss function is as follows:

L = λ_{c} L_{c} + λ_{a} L_{a}

(21)

Among them, the parameters

λ_{c}

and

λ_{a}

are the weight values of the classification task and the modality alignment task, respectively, which are used to balance the two losses.

5. Experimental Design

To verify the effectiveness of the rumor detection algorithm based on dual-chain multimodal feature learning, many experiments were conducted on two real and publicly available multimodal datasets. The model is implemented using Pytorch 2.6.0, and all the programs in this paper are executed on a computer equipped with dual Intel Xeon E5-2678 v3 processors, an RTX 3090 GPU, and 24 GB of memory.

5.1. Dataset Description

The experiments are based on the following two real-world public datasets: Pheme [13] and Weibo [14]. The Pheme dataset is an English dataset collected from the foreign social platform X (formerly Twitter), and it is a collection of rumors and non-rumors summarized based on five breaking news stories. The Weibo dataset is a Chinese dataset sourced from Sina Weibo, China’s most significant social platform. According to the Sina Community Management Center verification of event authenticity, it consists of 2351 actual events and 2313 groups of false events from 2015 to 2016. This paper mainly focuses on the research of multimodal rumor detection algorithms. Therefore, the two datasets’ data instances that do not contain text or images were removed from the experiments.

The Pheme and Weibo datasets are widely used in rumor detection and social media analysis research. Both datasets contain labeled instances of non-rumors (true information) and false rumors, along with metadata such as user interactions, images, and comments. Pheme contains 1428 non-rumors and 590 false rumors, along with 2018 images and 7388 comments. Weibo contains 877 non-rumors and 590 false rumors, with 1467 images and 4534 comments.

5.2. Benchmark Method

To evaluate the effectiveness of the rumor detection algorithm based on dual-chain multimodal feature learning, this paper selects 6 representative advanced methods for comparison. All of these methods take text content and additional images as the input of the model, and the specific introductions are as follows:

(1) att-RNN [15]: att-RNN is based on the LSTM (Long Short-Term Memory) model, and it extracts text and social context features also based on the pre-trained VGG-19 model to extract visual features. It obtains the associated features between text/social context and visual features and calculates rumor presentation through the attention mechanism.

(2) EANN [16]: TextCNN and VGG19 are, respectively, used to extract text features and image features, and the features are concatenated and input into the rumor classifier and the event discriminator. Among them, the event discriminator is used to learn the invariant representation of the event. This paper uses only the rumor classifier for a fair comparison.

(3) MVAE [17]: MVAE uses an encoder–decoder method to construct a multimodal feature expression. Training a multimodal variational autoencoder can reconstruct the two modalities according to the learned shared representation to find the correlations between cross-modalities.

(4) SAFE [9]: SAFE is a multimodal rumor detection method based on a similarity-aware approach. It extracts text and visual features from posts and explores their common representation.

(5) CARMN [18]: CARMN extracts text feature representations from the original text and the fused text through a multi-channel convolutional neural network (MCN) and uses VGG19 to extract image feature representations for multimodal rumor detection.

(6) CAFE [19]: CAFE analyzes cross-modal ambiguity learning from the perspective of information theory, adaptively aggregates unimodal and cross-modal relevant features, and conducts rumor detection.

(7) MFAN [20]: MFAN integrates text, visual, and social graph features in a unified framework for rumor detection. The model in this paper only involves text and image modalities. For fairness, MFAN only uses the text and visual modalities for the experiment.

5.3. Experimental Setup

The default settings in the original papers are used for the parameters of all baseline methods. During the training process of the DMFL model in this paper, the stochastic gradient descent method is adopted to update the model parameters, and the Adam optimizer is used to optimize the model’s loss function. The maximum length of the text in Word2vec is set to 50. The sizes of the convolution kernels in TextCNN are selected as

{3, 4, 5}

. The maximum number of epochs is set to 20, the batch size for training is 64, and the keep probability of Dropout is set to 0.6. The prediction results of the DMFL model and the baseline models are evaluated using Accuracy, Precision, Recall, and F1-score.

6. Analysis of Experimental Results

6.1. Performance Comparison and Analysis of DMFL

The comparison experimental results between the rumor detection algorithm based on dual-chain multimodal feature learning (DMFL) proposed in this paper and the baseline models on two public datasets are shown in Table 2 and Table 3 and Figure 5. In the tables, the sub-optimal results among the baseline models are underlined, and the best results among all the models are marked in bold.

The results of the comparative experiments can prove that the DMFL model performs excellently in the rumor detection task. Specifically, in terms of accuracy, compared with the best performance of the baseline model, the MFAN model that fuses text semantics and image features, the DMFL model shows improvements of 1.65% and 3.89 respectively, on the Pheme and Weibo datasets. This experimental result verifies the research findings of this paper, that is, paying attention to the internal logical connections of rumor content and extracting multi-level rumor content features are conducive to improving the effectiveness of the rumor detection task.

In addition, by comparing the experimental results of the CARMN and SAFE models on the two datasets, it can be found that on the Pheme dataset, the accuracy of the CARMN model, which extracts multi-level text feature representations, is lower than that of the SAFE model, which focuses on the connection between text and visual features. However, the situation is the opposite on the Weibo dataset. This is because the text length of the Weibo dataset is much longer than that of the Pheme dataset. For the Pheme dataset with short texts, the SAFE model that focuses on the connection between text and images performs better. For the Weibo dataset with long texts, studying the text’s multi-level content features significantly improves the accuracy of rumor detection.

In conclusion, the DMFL’s dual-chain learning fuses the multi-level text features, the text–image logical connection multimodal features that connect text and images, and the text–image basic semantic multimodal features for rumor detection, and it performs better on both datasets.

6.2. Performance Analysis of the Feature Extraction Module of DMFL

To evaluate the role and influence of each principal component on the DMFL model, this paper designs seven variants of DMFL for ablation experiments. By controlling variables, one important module is removed at a time from the complete DMFL model for experiments, and the contribution of this module to the model performance is analyzed through the results. The seven designed variant models are as follows:

DMFL-TB: The text–image basic semantic feature extraction module is removed, and rumor detection is performed only by extracting text–image logical connection multimodal features. The design of this variant aims to discuss the influence of text–image basic semantic features on the model performance.
DMFL-TL: The text–image logical connection feature extraction module is removed, and rumor detection is performed only through text–image basic semantic multimodal features. The design of this variant aims to discuss the influence of text–image logical connection features on model performance.
DMFL-A: The modality alignment module is removed. The basic semantic features of the text and image modalities are directly fused to obtain a multimodal semantic feature vector, and the rest of the model operations remain unchanged for rumor detection. The design of this variant aims to discuss the influence of the modality alignment module on the model performance.
DMFL-ATT: For multimodal basic semantic features and multimodal logical features, instead of using cross-modal cross-attention for multimodal fusion, a weighted sum is directly performed to obtain fused multimodal features for rumor detection. The design of this variant aims to discuss the influence of using the cross-modal cross-attention mechanism for multimodal fusion on the model performance.
DMFL-TCNN: Only the basic semantic features of the text modality are used for rumor detection. The design of this variant aims to discuss the influence of the basic semantic features of the single text modality on the rumor detection results.
DMFL-TTRAS: Only the logical features of the text modality are used for rumor detection. The design of this variant aims to discuss the influence of the logical features of the single text modality on the rumor detection results.
DMFL-I: Only the image modality features are used for rumor detection. The design of this variant aims to discuss the influence of the single image modality features on the rumor detection results.

Table 4 and Figure 6 show the performance comparison between the seven variants of DMFL and the original DMFL on the Pheme and Weibo datasets. The experiments prove that the accuracy and F1-score values of the DMFL algorithm model are higher than those of all DMFL variants. The four main modules of DMFL, namely, the multimodal semantic feature extraction module, the multimodal logical feature extraction module, the modality alignment module, and the cross-modal cross-attention mechanism, are all reasonable and effective and positively impact the model performance.

By comparing DMFL-TB and DMFL-TL, the influence on multimodal semantic features and multimodal logical features is analyzed. From the experimental results, it can be seen that on the Pheme dataset, the overall experimental results of DMFL-TB have the most significant decline, which indicates that for the Pheme dataset mainly composed of short texts, the essential semantic features of text and images play a more critical role in the rumor detection task. On the Weibo dataset, the accuracy of DMFL-TL drops the most, which shows that for the Weibo dataset with long texts, paying attention to the multi-level content features within the text and the text–image logical connection features that focus on the connection between text and images contributes more significantly to the rumor detection task. In addition, it can be found from the experimental results that the variation ranges of DMFL-TB and DMFL-A are almost the same, which is in line with the principle that modality alignment operates based on the basic semantic features of text and images.

To explore the influence of the modality alignment and cross-modal cross-attention mechanism acting on multimodal fusion, the experimental results of DMFL-A and DMFL-ATT are compared and analyzed. The experimental results show that on the Pheme dataset, the accuracy of DMFL-A drops by 2.71%, and the accuracy of DMFL-ATT drops by 1.19%, while on the Weibo dataset, the accuracies of both DMFL-A and DMFL-ATT drop by about 1.6%. The decreased accuracy in the experimental results indicates that the modality alignment and cross-modal cross-attention mechanism are effective for multimodal fusion. At the same time, the variant with the modality alignment removed has a more significant decline on the Pheme dataset than the variant with the cross-modal cross-attention removed, while the variation ranges are almost the same on the Weibo dataset. This shows that the differences between text semantic and image features in the Pheme dataset are more significant. A better multimodal fusion effect can be obtained after the modality alignment reduces the differences between the two modalities.

Three variants using a single modality for rumor detection, namely, DMFL-TCNN, DMFL-TTRAS, and DMFL-I, are set up to analyze the performance of text semantic features, text logical features, and image features. According to the experimental results, compared with the complete DMFL model, the accuracies of DMFL-TCNN, DMFL-TTRAS, and DMFL-I on the Pheme dataset drop by 2.26%, 6.7%, and 4.89%, respectively, while on the Weibo dataset, the accuracies drop by 3.23%, 4.88%, and 4.88%, respectively. By comparing the performance of each single variant on the two datasets, DMFL-TCNN performs better on the Weibo dataset, while DMFL-TTRAS and DMFL-I perform better on the Pheme dataset. This indicates that the Weibo dataset has richer text semantic features than the Pheme dataset and is helpful for the rumor detection task. By comparing the performance of the three variants, DMFL-TCNN performs better than the other two variants on both datasets. In contrast, the performance gap between DMFL-TTRAS and DMFL-I is relatively tiny. The results show that text semantic features help the rumor detection task more than text logical features and image features.

6.3. Analysis of Important Parameters of DMFL

(1) Classification loss coefficient

λ_{c}

.

λ_{c}

is used to adjust the weight of the classification loss. The value of

λ_{c}

is increased with a step size of 0.4 for experiments to find the range of values with the best performance. Then, the step size is reduced for further experiments to find the optimal value for the best performance. As can be seen from Figure 7, the overall influence of the classification coefficient

λ_{c}

on the model performance shows a trend of first increasing and then decreasing, and the performance differences on the two datasets are relatively small, which proves that the performance of the classification task on the two datasets has little difference. The optimal value of

λ_{c}

is 1.8 on the Pheme dataset and 1.2 on the Weibo dataset.

(2) Modality alignment loss coefficient

λ_{a}

.

λ_{a}

is used to adjust the weight of the modality alignment loss. Similarly, the value of

λ_{a}

is increased with a step size of 0.4 to find the optimal range, and then, the bisection method is used to find the optimal value of

λ_{a}

within the optimal range. The optimal value of the modality alignment loss coefficient

λ_{a}

is 2.4 for both the Pheme and Weibo datasets, as shown in Figure 8.

By comparing the trends of the line charts, the increase and decrease trend of

λ_{a}

on the Pheme dataset is more tortuous, while its performance on the Weibo dataset is smoother. The experimental results show that the weight of the modality alignment task is more sensitive on the Pheme dataset, and it has a more prominent impact on the rumor detection results.

(3) Comprehensive analysis of the weight parameters of the loss function of DMFL.

Combined with the above experiments, it can be seen that on the two datasets, the value of

λ_{c}

is smaller than that of

λ_{a}

. Therefore, the modality analysis task has a more significant impact on the rumor detection task in the DMFL model. Furthermore, since the optimal value of the ratio of

λ_{c}

to

λ_{a}

on the Pheme dataset is more significant than that on the Weibo dataset, this indicates that the modality alignment task plays a more significant role on the Weibo dataset, proving that the difference between the text and image modalities in the Weibo dataset is more prominent.

(4) The influence of the number of heads h in the cross-modal attention mechanism

The cross-modal cross-attention mechanism divides the hidden state vectors into multiple heads, and each head forms an independent sub-semantic space, enabling the model to capture different types of features simultaneously. This greatly enriches the expressive ability of the model, allowing it to understand and process input data more comprehensively, thus improving the performance and generalization ability of the model. The choice of the number of heads will affect the performance and computational complexity of the model, so the experiment on the number of heads parameter is of great importance. This section sets the initial number of heads to 2, gradually increasing with a step size of 2. The accuracy and F1-score performance indicators for each number of heads on the Pheme and the Weibo datasets are recorded, respectively, and drawn into bar charts, as shown in Figure 9.

According to the experimental results on the Pheme dataset, as the number of heads increases, the overall accuracy and F1-score of the model show a trend of decreasing first, then rising to the highest point, and then reducing. When the number of heads is two, the model performs well, with an accuracy of 0.8807 and an F1-score of 0.8812. This is because a smaller number of heads reduces the complexity of the model, decreases the risk of overfitting, and it can effectively capture the key information in the input multimodal features when the number of heads is four and six, both the accuracy and F1-score of the model decrease. This indicates that as the number of heads increases, the model becomes more complex, leading to a certain degree of overfitting. Especially when the amount of training data is not large, more heads do not effectively improve the representation ability of the model but instead increase the instability of training. When the number of heads is eight, the DMFL model performs the best, with an accuracy of 88.83% and an F1-score of 88.81%. This shows that under this setting, the increase in the number of heads helps to capture more diverse features, enhance the representation ability of the model, and thus improve the performance of rumor detection and classification. Having eight heads is a balance point, which increases the complexity of the model without significantly increasing the risk of overfitting. Finally, when the number of heads is 10, the accuracy and F1-score decrease again because further increasing the number of heads makes the model too complex, increasing the risk of overfitting. More heads make the features learned by each head overly focused on the specific patterns of the training data and unable to generalize well to the test data.

On the Weibo dataset, as the number of heads increases, the model’s accuracy shows a trend of first rising, then falling, then rising to the highest point, and then falling again. Analyzing the reasons for this trend, when the number of heads is two, the accuracy is relatively high but not outstanding. This is because the complexity of the model is low, and it can capture some basic features, but it may not be able to fully utilize the rich semantic information of the text and images. This results in the model having a specific generalization ability but limited representation ability. Subsequently, when the number of heads is four, the complexity and representation ability of the model is improved, and it can capture more features and patterns. The model can better understand and process the input data, thus increasing the accuracy. However, when the number of heads is six, the accuracy decreases. As the number of heads increases, the model becomes too complex, possibly leading to overfitting. This indicates that the model performs well on the training data but fails to generalize effectively on the test data. Then, the accuracy reaches the highest point when the number of heads is eight. At this time, the model balances complexity and generalization ability, can effectively capture various complex features in the input data, and avoids the overfitting problem. Finally, the accuracy decreases again when the number of heads is 10. The complexity of the model continues to increase, and the risk of overfitting becomes greater. More heads introduce more noise and unnecessary complexity, affecting the model’s generalization ability.

Through the experiments, we can conclude that a moderate number of heads can provide sufficient feature representation ability, thereby improving the model’s performance. Too few heads, although reducing the complexity of the model, may not be sufficient to capture the rich features in the data. Furthermore, too many heads may lead to overfitting of the model. Although it increases the complexity of the model, it also introduces more noise and instability. Therefore, for the Pheme and Weibo datasets, a number of eight heads is the most appropriate setting.

6.4. Correlation Analysis

A rumor detection model needs to make inferences and decisions from a large amount of information, and these decisions directly affect the final rumor detection results. Correlation analysis plays a crucial role in this process, as it can help us better understand the model’s decision-making mechanism. For example, correlation analysis can determine which specific features or data points play a key role in the model’s judgment when processing specific information. In this way, we can see the results output by the model and understand the evidence or logic based on which the model arrives at these results.

This section selects a cross-modal attention mechanism for the multimodal fusion of text semantics and image features. In each round of iterative training, the same batch of processed data is selected, and its attention scores are visualized as a heatmap. Then, a correlation analysis is conducted on the situation of the model learning the correlation between text semantics and image features.

(1) Visualization of attention scores in the Pheme dataset

Experiments were conducted on the Pheme dataset. The attention scores of the cross-modal attention mechanism for a multimodal fusion of text semantic features and image features were visualized as a heatmap. The changes in these scores with the number of iterations are shown in Figure 10. The cross-modal attention mechanism increases the scores of vectors with strong correlations and decreases the vectors of features with weak correlations. In the heatmap, this is manifested as the darker the color, the stronger the correlation between the text semantic features and the image features in that part.

In the initial stage of model training, when the number of iterations is one, the color blocks in the heatmap of the attention scores are evenly distributed, and the colors are all relatively light, with no apparent distribution of dark areas. This indicates that the model has learned the feature correlations between text semantics and images to a certain extent, but the generalization ability on the test dataset is limited. When the number of iterations is equal to five, the color blocks in the heatmap of the attention scores begin to show a trend of regional distribution, and prominent dark areas can be divided in the figure. This indicates that the model has further learned the feature correlations between text semantics and images, and the increase in the model’s accuracy shows that the model’s generalization ability has also been improved. When the number of iterations is seven, the trained model achieves the best model performance, and the accuracy is the highest. Compared with the previous two images, the dark color blocks in the attention heatmap are more abundant, and at this time, the generalization ability reaches a balance. Finally, when the number of iterations reaches 10, the model’s accuracy no longer increases but shows a downward trend. In the attention heatmap, the shape of the dark areas has not changed compared with the best performance, but the number of dark color blocks has increased based on the original. This introduces unnecessary complexity and noise to the model, resulting in overfitting.

(2) Visualization of attention scores in the Weibo dataset

Experiments were conducted on the Weibo dataset. To ensure the comparability of the experiments, the attention scores of the cross-modal attention mechanism for a multimodal fusion of text semantic features and image features were also visualized in the form of a heatmap. The changes in these scores with the number of iterations are shown in Figure 11. The cross-modal attention mechanism enhances the weights of high-correlation feature vectors while weakening the weights of low-correlation features. This process is intuitively reflected in the heatmap, where the darker the area, the stronger the correlation between the text semantic features and the image features in that part.

At the initial stage of model training, when the number of iterations is 1, there are already prominent dark areas in the distribution of color blocks in the heatmap of the attention scores. This indicates that the model has learned the feature correlations between text semantics and images to a certain extent and has a specific generalization ability on the test dataset. When the number of iterations is equal to five, the number of dark color blocks in the heatmap of the attention scores decreases, but the shape of the dark areas does not change. This indicates that the model has further learned the feature correlations between text semantics and images, weakened the weights of features with correlations in the noisy part, and improved the model’s generalization ability, which is reflected in the increase in accuracy. When the number of iterations is nine, the trained model achieves the best model performance, and the accuracy is the highest.

The distribution of dark color blocks in the attention heatmap is further changed. At this time, the darkest area is different from that at the beginning of the training, which shows that during the training process, the model has achieved noise reduction, reducing the weights of correlations in the noisy part and enhancing the weights of favorable high-correlation feature vectors. Finally, when the number of iterations reaches 20, the model’s accuracy no longer increases but shows a downward trend. In the attention heatmap, the shape of the dark areas has not changed compared with the best performance, but the number of dark color blocks has been further reduced on the original basis. This causes the model to perform excessive noise reduction operations, increasing the unnecessary complexity of the model and resulting in the phenomenon of model overfitting.

(3) Comprehensive analysis

Through the analysis of Figure 10 and Figure 11, it can be seen that for the multimodal fusion of text semantic features and image features, the training situations of the model on the Pheme and Weibo datasets are significantly different. First, in the early stage of training, the correlations between text semantics and images learned by the model show an apparent strong–weak distribution on the Weibo dataset. In contrast, the Pheme dataset has a uniform weak distribution. This indicates that the Weibo dataset contains more noise. One possible reason is that the Weibo dataset is Chinese, while the Pheme dataset is English. Under fields of equal length, Chinese texts contain more semantic information than English texts. During the process of learning the correlations between text semantics and images, more noise will be introduced.

Then, for the Pheme dataset, the correlations between text semantics and image features are learned from scratch. This is mainly reflected in the changes in the heatmap, which is a process of learning the distribution of dark areas, that is, discovering features with strong correlations during training. Learning the correlations between text semantics and image features is a noise-reduction process for the Weibo dataset. The primary manifestation of the changes in the heatmap is the gradual reduction of the number of dark color blocks, reducing the noise weights and retaining the features with favorable strong correlations.

6.5. Time Complexity Analysis

In our model, the time complexity of the propagation feature extraction module is

O (4 L h^{2})

. The time complexity of the contrastive learning optimization fusion feature module is

O (\frac{d}{3} m^{2} (k_{1} + k_{2} + k_{3}))

. The time complexity of the propagation feature extraction and contrastive learning modules is

O (| V |^{2})

. The time complexity of multimodal basic semantic and logical features is

O (L^{2} d + L d^{2})

. Let the time complexity of obtaining image representation by CLIP be

T 1

. Then, the final time complexity of our model is

T 1 + O (L^{2} d + L d^{2})

.

7. Conclusions and Future Work

This paper introduces the Rumor Detection Algorithm based on Dual-chain Multimodal Feature Learning (DMFL). This algorithm conducts dual-chain learning of the basic semantic multimodal features and logical connection features of text and images. Through modal alignment, it obtains fused multimodal features for rumor detection. DMFL comprehensively considers the multi-level nature of rumor content features and uses the method of modal alignment to reduce the differences between modalities, resulting in multimodal features with a better fusion effect. This is the first study on rumor detection targeting multi-level rumor content features, providing an essential reference for subsequent studies. Many experiments have been carried out on two publicly available multimodal datasets, and the experimental results have proven the rumor detection algorithm’s effectiveness based on the proposed dual-chain multimodal feature learning.

Author Contributions

Conceptualization, Z.G. and H.L.; Methodology, Z.G. and H.L.; Software, L.Z.; Validation, L.Z. and J.W.; Formal analysis, L.Z.; Resources, J.W.; Data curation, J.W.; Writing—original draft, Z.G. and H.L.; Writing—review& editing, L.Z.; Supervision, J.W.; Funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (grant no. 62072060).

Data Availability Statement

The data presented in this study are available in Github at https://github.com/TianBian95/BiGCN, accessed on 1 Apirl 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.J.; Wong, K.; Cha, M. Detecting Rumors from Microblogs with Recurrent Neural Networks; AAAI Press: New York, NY, USA, 2016; pp. 3818–3824. [Google Scholar]
Hamed, S.K.; Juzaiddin Ab Aziz, M.; Ridzwan Yaakub, M. Improving Data Fusion for Fake News Detection: A Hybrid Fusion Approach for Unimodal and Multimodal Data. IEEE Access 2024, 12, 112412–112425. [Google Scholar] [CrossRef]
Yu, F.; Liu, Q.; Wu, S.; Wang, L.; Tan, T. A Convolutional Approach for Misinformation Identification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, Australia, 19–25 August 2017; Sierra, C., Ed.; 2017; pp. 3901–3907. [Google Scholar]
Yang, Y.; Zheng, L.; Zhang, J.; Cui, Q.; Li, Z.; Yu, P.S. TI-CNN: Convolutional Neural Networks for Fake News Detection. arXiv 2018, arXiv:1806.00749. [Google Scholar]
Mishra, R.; Setty, V. SADHAN: Hierarchical Attention Networks to Learn Latent Aspect Embeddings for Fake News Detection. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, Santa Clara, CA, USA, 2–5 October 2019; pp. 197–204. [Google Scholar]
Hosseinimotlagh, S.; Papalexakis, E.E. Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles. In Proceedings of the MIS2: Misinformation and Misbehavior Mining on the Web Workshop Held in Conjunction with WSDM 2018, Los Angeles, CA, USA, 9 February 2018. [Google Scholar]
Gupta, A.; Lamba, H.; Kumaraguru, P.; Joshi, A. Faking Sandy: Characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 729–736. [Google Scholar]
Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection. In Proceedings of the 27th ACM International Conference on Multimedia, MM 2019, Nice, France, 21–25 October 2019; Amsaleg, L., Huet, B., Larson, M.A., Gravier, G., Hung, H., Ngo, C., Ooi, W.T., Eds.; ACM: New York, NY, USA, 2019; pp. 1942–1951. [Google Scholar]
Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the Advances in Knowledge Discovery and Data Mining—24th Pacific-Asia Conference (PAKDD 2020), Singapore, 11–14 May 2020; Proceedings, Part II. Lauw, H.W., Wong, R.C., Ntoulas, A., Lim, E., Ng, S., Pan, S.J., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2020; Volume 12085, pp. 354–367. [Google Scholar]
Karpathy, A.; Joulin, A.; Fei-Fei, L. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar] [CrossRef]
Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar]
Wang, Y.; Qian, S.; Hu, J.; Fang, Q.; Xu, C. Fake News Detection via Knowledge-driven Multimodal Graph Convolutional Networks. In Proceedings of the 2020 on International Conference on Multimedia Retrieval, ICMR 2020, Dublin, Ireland, 8–11 June 2020; Gurrin, C., Jónsson, B.Þ., Kando, N., Schöffmann, K., Chen, Y.P., O’Connor, N.E., Eds.; ACM: New York, NY, USA, 2020; pp. 540–547. [Google Scholar]
Zubiaga, A.; Liakata, M.; Procter, R. Exploiting Context for Rumour Detection in Social Media. In Proceedings of the Social Informatics—9th International Conference, SocInfo 2017, Oxford, UK, 13–15 September 2017; Proceedings, Part I. Ciampaglia, G.L., Mashhadi, A.J., Yasseri, T., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2017; Volume 10539, pp. 109–123. [Google Scholar]
Song, C.; Yang, C.; Chen, H.; Tu, C.; Liu, Z.; Sun, M. CED: Credible Early Detection of Social Media Rumors. IEEE Trans. Knowl. Data Eng. 2021, 33, 3035–3047. [Google Scholar] [CrossRef]
Zhang, P.; Xue, J.; Lan, C.; Zeng, W.; Gao, Z.; Zheng, N. EleAtt-RNN: Adding Attentiveness to Neurons in Recurrent Neural Networks. IEEE Trans. Image Process. 2020, 29, 1061–1073. [Google Scholar] [CrossRef] [PubMed]
Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, 19–23 August 2018; Guo, Y., Farooq, F., Eds.; ACM: New York, NY, USA, 2018; pp. 849–857. [Google Scholar]
Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal Variational Autoencoder for Fake News Detection. In Proceedings of the World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019; Liu, L., White, R.W., Mantrach, A., Silvestri, F., McAuley, J.J., Baeza-Yates, R., Zia, L., Eds.; ACM: New York, NY, USA, 2019; pp. 2915–2921. [Google Scholar]
Song, C.; Ning, N.; Zhang, Y.; Wu, B. A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Inf. Process. Manag. 2021, 58, 102437. [Google Scholar] [CrossRef]
Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Lu, T.; Shang, L. Cross-modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, 25–29 April 2022; Laforest, F., Troncy, R., Simperl, E., Agarwal, D., Gionis, A., Herman, I., Médini, L., Eds.; ACM: New York, NY, USA, 2022; pp. 2897–2905. [Google Scholar]
Zheng, J.; Zhang, X.; Guo, S.; Wang, Q.; Zang, W.; Zhang, Y. MFAN: Multi-modal Feature-enhanced Attention Networks for Rumor Detection. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23–29 July 2022; pp. 2413–2419. [Google Scholar]

Figure 1. Specific scene in a social platform.

Figure 2. Diagram of the rumor detection algorithm based on a dual-chain multimodal feature learning framework.

Figure 3. Image feature extraction module structure diagram. * means position embedding of the image.

Figure 4. Text–image multimodal logic link feature extraction module structure diagram.

Figure 5. Performance of DMFL and different baseline models on the Pheme and Weibo datasets.

Figure 6. Performance of different variants of DMFL on Pheme and Weibo datasets.

Figure 7. Performance of different

λ_{c}

of DMFL on the Pheme and Weibo datasets.

Figure 7. Performance of different

λ_{c}

of DMFL on the Pheme and Weibo datasets.

Figure 8. Performance of different

λ_{a}

of DMFL on the Pheme and Weibo datasets.

Figure 8. Performance of different

λ_{a}

of DMFL on the Pheme and Weibo datasets.

Figure 9. Influence of different head h values in DMFL on the Pheme and Weibo datasets.

Figure 10. DMFL visualization of attention scores based on the Pheme dataset.

Figure 11. DMFL visualization of attention scores based on the Weibo dataset.

Table 1. Symbols used in this chapter and their descriptions.

Symbols	Descriptions
$S = {s_{1}, s_{2}, \dots, s_{n}}$	Event collection
$s = {T, V, R, U, C}$	Post content collection
T	Post content
L	Text vector length
d	Word embedding dimension
V	Post image
U	Users who posted the message
$C = {c_{1}, c_{2}, \dots, c_{j}}$	Forward comments
$R_{i} = {s_{i}, w_{1}^{i}, w_{2}^{i}, \dots, w_{n_{i} - 1}^{i}, G_{i}}$	Forwarding relationship set
$G = 〈 N, E 〉$	Communication diagram
$N_{i} = {s_{i}, w_{1}^{i}, \dots, w_{n_{i} - 1}^{i}}$	Node collection
$E_{i} = {e_{s t}^{i} \| s, t = 0, \dots, n_{i} - 1}$	Edge collection
$Y = {y_{1}, y_{2}, \dots, y_{n}}$	Tags collection for the post
w	Edge
$A_{i} \in {0, 1}^{n_{i} \times n_{i}}$	Adjacency matrix
R	Feature representation

Table 2. Results of the performance comparison between DMFL and other baseline models on the Pheme datasets.

Model	Accuracy	Precision	Recall	F1-Sore
att-RNN	0.6822	0.7801	0.6149	0.6894
EANN	0.7713	0.7139	0.7007	0.7044
MVAE	0.7762	0.7349	0.7225	0.7277
SAFE	0.8149	0.7988	0.7950	0.7968
CARMN	0.7410	0.8542	0.6192	0.7186
CAFE	0.8066	0.8072	0.7991	0.8035
MFAN	0.8687	0.8598	0.8104	0.8295
DMFL	0.8883	0.8770	0.8853	0.8781
Percentage of improvement	1.65%	2.00%	9.24%	5.86%

Table 3. Results of the performance comparison between DMFL and other baseline models on the Weibo datasets.

Model	Accuracy	Precision	Recall	F1-Sore
att-RNN	0.7886	0.8625	0.6861	0.7648
EANN	0.8096	0.8019	0.7968	0.7987
MVAE	0.7167	0.7052	0.7021	0.7034
SAFE	0.8495	0.8498	0.8495	0.8496
CARMN	0.8537	0.8910	0.8140	0.8513
CAFE	0.8407	0.8554	0.8302	0.8423
MFAN	0.8712	0.8659	0.8574	0.8612
DMFL	0.9051	0.8971	0.9024	0.8995
Percentage of improvement	3.89%	0.68%	5.25%	4.45%

Table 4. Performance comparison of different variants on the Pheme and Weibo datasets.

Data Set		Pheme		Weibo
Evaluation Index		Accuracy	F1-Sore	Accuracy	F1-Sore
Variant	DMFL-TB	0.8597	0.8572	0.8904	0.8887
	DMFL-TL	0.8642	0.8597	0.8835	0.8815
	DMFL-A	0.8597	0.8604	0.8901	0.8919
	DMFL-ATT	0.8778	0.8479	0.8904	0.8805
	DMFL-TCNN	0.8687	0.8314	0.8767	0.8554
	DMFL-TTRAS	0.8325	0.8240	0.8630	0.8519
	DMFL-I	0.8461	0.8163	0.8630	0.8431
	DMFL	0.8883	0.8881	0.9051	0.8995

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Guo, Z.; Liu, H.; Zuo, L.; Wen, J. Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection. Mathematics 2025, 13, 1731. https://doi.org/10.3390/math13111731

AMA Style

Guo Z, Liu H, Zuo L, Wen J. Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection. Mathematics. 2025; 13(11):1731. https://doi.org/10.3390/math13111731

Chicago/Turabian Style

Guo, Zheheng, Haonan Liu, Lijiao Zuo, and Junhao Wen. 2025. "Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection" Mathematics 13, no. 11: 1731. https://doi.org/10.3390/math13111731

APA Style

Guo, Z., Liu, H., Zuo, L., & Wen, J. (2025). Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection. Mathematics, 13(11), 1731. https://doi.org/10.3390/math13111731

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Bilinear Learning with Dual-Chain Feature Attention for Multimodal Rumor Detection

Abstract

1. Introduction

2. Related Work

3. Problem Definition

4. Model Design and Description

4.1. Algorithm Model Framework

4.2. Text Basic Semantic Feature Encoder

4.3. Image Encoder

4.4. Cross-Modal Cross Attention

4.5. Extraction of Text–Image Multimodal Logical Connection Features

4.6. Modality Alignment

4.7. Classification Prediction Module

5. Experimental Design

5.1. Dataset Description

5.2. Benchmark Method

5.3. Experimental Setup

6. Analysis of Experimental Results

6.1. Performance Comparison and Analysis of DMFL

6.2. Performance Analysis of the Feature Extraction Module of DMFL

6.3. Analysis of Important Parameters of DMFL

6.4. Correlation Analysis

6.5. Time Complexity Analysis

7. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI