A Multimodal Fake News Detection Method Based on Contrastive Learning and Variational Autoencoder

Wu, Baowen; Hu, Ruijiao; Wang, Jilin; Sui, Xin; Sun, Jiaxing; Liu, Jie; Qu, Youli

doi:10.3390/math14101773


by Baowen Wu ¹, Ruijiao Hu ¹, Jilin Wang ¹, Xin Sui ²,*, Jiaxing Sun ³, Jie Liu ³,⁴ and Youli Qu ⁵

¹ School of Artificial Intelligence, Wenshan University, Wenshan 663099, China

² Digital and Smart Campus Construction Center, Beijing University of Chinese Medicine, Beijing 100028, China

³ School of Artificial Intelligence and Computer Science, North China University of Technology, Beijing 100144, China

⁴ Research Center for Language Intelligence of China, Beijing 100048, China

⁵ School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1773; https://doi.org/10.3390/math14101773

Submission received: 21 March 2026 / Revised: 23 April 2026 / Accepted: 6 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue Recent Advances of Neural Network Optimization and Algorithms in Deep Learning)


Abstract

Fake news often exhibits pronounced bias and misleading content. To foster a harmonious information environment, there is an urgent need for rapid fake news identification. Fake news detection can assess news authenticity by analyzing multidimensional information such as text, images, and comments. This automated approach significantly reduces human and material resource costs. However, existing detection methods often focus on extracting textual features, employ coarse-grained fusion techniques when integrating multimodal information, and neglect the inherent correlations between different modalities. Moreover, these methods rely on static network structures and fixed feature weighting strategies and lack targeted neural network optimization and adaptive learning mechanisms, which results in insufficient interpretability and limited generalization performance across most detection approaches. To address these challenges, from the perspective of neural network optimization and regularization enhancement, this paper proposes a multimodal fake news detection method based on contrastive learning and variational autoencoders. First, we design a dual contrastive learning loss function as a specialized regularization strategy for multimodal neural networks. By learning features through comparing similar and dissimilar samples, it more effectively captures correlations across multimodal data, optimizing the feature distribution and enhancing the model's generalization capability via contrastive regularization. Second, we introduce a variational autoencoder to realize adaptive learning and dynamic optimization of the weights assigned to unimodal and multimodal features during decision-making. This adaptive mechanism enables the model to distinguish the relative importance of different modal information, optimizing the decision-making process of the multimodal neural network and thereby improving detection accuracy. Experiments conducted on the public Chinese dataset Weibo and the English dataset Twitter demonstrate that the proposed optimized network architecture outperforms other multimodal methods by 3% to 8% in terms of detection accuracy, validating the superiority of this neural network optimization-based approach for multimodal fake news detection tasks.

Keywords:

fake news detection; contrastive learning; variational autoencoder; multimodality

MSC:

68T50

1. Introduction

Fake news detection aims to promptly identify and distinguish false information, mitigate adverse impacts, and conserve human and material resources. In recent years, the scope of news content has expanded and become increasingly diverse, with fake news transitioning from single-text to multimodal dissemination formats [1]. Multimodal information is expressed through multiple channels, including text and images, offering richer expression than unimodal information. It leverages the characteristics and complementary information across different modalities to provide richer contextual understanding. Therefore, processing multimodal data is a critical step in fake news detection. By utilizing deep learning and natural language processing technologies, one can deeply understand and analyze textual and visual content. Through fusion strategies, multimodal feature integration can be achieved, thereby enabling automated detection, reducing human-intervention error rates, and improving detection efficiency. The task of fake news detection has become a focal point for both academia and industry, with its practical value widely recognized. It can promptly halt the spread of misinformation, safeguard information integrity, and maintain social stability. Furthermore, it assists media organizations in correcting false news, thereby enhancing news quality and media credibility [2].

Over recent years, contrastive learning has become an important paradigm in multimodal fake news detection because it can explicitly align semantically matched image–text pairs while separating mismatched ones [3,4,5]. Recent studies further show that this paradigm is evolving from simple pairwise alignment toward external knowledge enhancement, adaptive fusion, and LLM-assisted reasoning [4,6,7]. These advances indicate that multimodal fake news detection is moving beyond coarse-grained fusion toward more robust cross-modal interaction and dynamic decision-making. Nevertheless, current methods still face challenges in modeling latent semantic relations across samples, balancing unimodal and multimodal evidence, and handling ambiguous cross-modal signals [5,7,8]. Therefore, further optimizing contrastive objectives and adaptive multimodal aggregation remains necessary.

However, the contrastive framework applied to multimodal fake news detection may still be constrained by several key factors. On the one hand, the vast majority of image–text pairs in fake news are inherently mismatched. For instance, in Figure 1a,c, the text content alone suffices to identify the news as fake, and Figure 1b can be identified as fake news by detecting traces of image manipulation. Figure 1d, in contrast, exemplifies a false connection phenomenon: the news text describes a war-like scenario featuring tense situations and conflict elements (typical war-related semantics), yet the paired image conveys joyful emotions through visual representations such as cheerful facial expressions and festive scenes. This stark divergence in semantic and emotional expression between text and image is a classic example of false connection, highlighting how multimodal inconsistencies manifest in fake news. On the other hand, latent semantic correlations may exist between different image–text pairs, particularly when multiple multimodal news reports cover the same event. From a semantic association perspective, these image–text pairs, though varied in form, often share deep semantic connections and collectively convey diverse information around the core event. However, existing contrastive learning objective functions handle such image–text pairs in a simplistic, binary manner, uniformly treating these potentially relevant pairs as negative samples [9]. Consequently, although contrastive learning paradigms can benefit multimodal representation learning, their application in multimodal fake news detection remains underexplored.

To overcome the limitations of existing contrastive learning methods in fake news detection, this paper proposes a multimodal fake news detection model that combines a variational autoencoder with contrastive learning (the VCLMMF model). The model targets two shortcomings of existing methods: the inability to fully capture correlations between different modal data, and the failure to distinguish the importance of unimodal versus multimodal information. The approach operates on two dimensions: loss function design and auxiliary tasks. The specific contributions are as follows:

In loss function design, we construct a cross-modal fake news detection algorithm based on contrastive learning. By employing both supervised and unsupervised contrastive learning approaches, we capture correlations among multimodal features to enhance the model’s generalization capability.
For auxiliary task design, we incorporate a variational autoencoder (VAE) into the detection framework to process features. By progressively learning reasonable latent representations via KL divergence, the model adaptively aggregates unimodal and multimodal features, thereby improving interpretability.
We enhance feature extraction for text and images by integrating LSTM (Long Short-Term Memory) networks and the CBAM (Convolutional Block Attention Module) attention mechanism.

Combining these three aspects, we propose a multimodal fake news detection framework based on contrastive learning and variational autoencoders. Comparative experiments, ablation experiments, and parameter analysis conducted on Weibo and Twitter datasets validate that this approach effectively improves model performance.

2. Related Work

Early multimodal fake news detection studies mainly relied on direct fusion of textual and visual representations. Typical approaches extracted image features with CNN backbones such as VGG or ResNet and textual features with pre-trained language models such as BERT, and then concatenated them for classification [10,11,12,13]. Although such strategies are simple and effective, they often suffer from coarse interaction modeling and limited interpretability. To alleviate these limitations, some studies introduced auxiliary learning objectives and generative modeling. For example, Khattar et al. [14] used a multimodal variational autoencoder to reconstruct textual and visual signals, while subsequent methods further modeled image–text mismatch and multimodal consistency for fake news verification [15,16].

Another important research line focuses on exploiting cross-modal complementarity and semantic interaction. Jin et al. [17] adopted attention-based multimodal fusion, and Alzaidi et al. [18] combined global textual features with deep sequence modeling to improve representation quality. A recent survey by Li et al. [19] summarized multimodal fake news detection from the perspective of cross-modal interaction and pointed out that robust alignment, modality balancing, and explainability remain open challenges.

In 2025, several representative studies further pushed the field forward. Du et al. [8] integrated shared representation learning with contrastive objectives to enhance fine-grained multimodal feature modeling. Cao et al. [4] proposed ERIC-FND, which introduces external reliable information and multimodal contrastive learning to strengthen entity-aware cross-modal representations. Hu et al. [6] proposed GLPN-LLM, integrating LLM-generated pseudo labels with global label propagation to improve multimodal fake news detection. Shen et al. [7] proposed GAMED, which dynamically decouples modality-specific and shared knowledge through adaptive multi-expert modeling. Chen et al. [5] further improved cross-modal representation quality through contrastive learning-based information enhancement.

Compared with these studies, the method proposed in this paper jointly improves multimodal fake news detection from two perspectives, namely loss-function optimization and auxiliary ambiguity learning. Specifically, we introduce dual contrastive learning to strengthen cross-modal supervision and employ a variational autoencoder to adaptively regulate the contributions of unimodal and multimodal features.

3. Methods

We propose a fake news detection model based on a variational autoencoder and dual contrastive learning. In this section, we first introduce the model architecture and then discuss the specific details of the model.

3.1. Model Structure

The structure of the VCLMMF model is shown in Figure 2. The fake news detection process is briefly described as follows. Given an input text–image pair, unimodal features are first extracted by the encoders and the feature extraction module. These unimodal features are fed into the variational autoencoder, and the learned ambiguity score is obtained by evaluating the KL (Kullback–Leibler) divergence between the two unimodal distributions approximated by modality-specific variational autoencoders. The score is then used to adaptively regulate the contributions of cross-modal and unimodal features in the detection framework. In terms of the loss function, the InfoNCE loss, which has similar characteristics to the cross-entropy loss, is selected to calculate the unsupervised contrastive loss. This unsupervised learning method is then extended to the fully supervised setting, where positive and negative sample pairs are selected using the information contained in the labels. In this way, latent cross-modal information can be discovered, making the model more interpretable and improving the performance of the fake news detection model. The VCLMMF model mainly comprises an enhanced feature extraction module, supervised and unsupervised contrastive learning, cross-modal ambiguity learning using a variational autoencoder, and a fusion strategy for multimodal and unimodal features.

3.2. Enhancement of Feature Extraction Module

The direct output features from feature encoders are typically coarse and suffer from numerous limitations. Specifically, these features may contain substantial redundant information, significantly impairing the model’s generalization capabilities. Simultaneously, they may be contaminated with excessive noise, reducing the model’s sensitivity to key features and hindering its ability to focus on core information closely related to the task. Therefore, processing the raw features is essential to enhance feature quality and thereby optimize the model’s overall performance. For textual features, a dynamic LSTM [20] network is added on top of the BERT model. This enables the dynamic learning of semantic representations beyond the fixed-dimensional features extracted by BERT. As the input text sequence progresses, the LSTM continuously updates its hidden state, thereby capturing the dynamic evolution of textual semantics throughout the sequence. The specific operation of combining BERT with LSTM is as follows: Assuming that the token sequence obtained after tokenization and encoding of the input text is

$X = [x_{1}, x_{2}, \ldots, x_{n}]$, the output of the BERT model is $E = [e_{1}, e_{2}, \ldots, e_{n}]$, and the network output of the LSTM is $O = \mathrm{LSTM}(E)$. After regularizing $O$, the output at the last token of each sequence is taken and passed to a fully connected layer to obtain the final feature representation, as follows:

$$F = \mathrm{FC}\left(\mathrm{LSTM}(E)_{\mathrm{dropout}}[:, -1, :]\right) \tag{1}$$

For image features, CBAM (Convolutional Block Attention Module) is introduced on top of ResNet. CBAM chains a channel attention mechanism and a spatial attention mechanism, which extracts fine-grained features with higher saliency more effectively. By calculating weights over the feature maps along each dimension, it enables the network to extract image features efficiently. CBAM is composed of a channel attention module and a spatial attention module, and its architecture is shown in Figure 3.

In the channel attention module, global max pooling and global average pooling are applied to each channel, compressing the feature map into vectors along the channel dimension. A shared Multi-Layer Perceptron (MLP) then learns from these vectors to obtain the importance weight of each channel. In this way, channels with higher weights are emphasized and key channel information is strengthened, which significantly enhances feature expression ability and enables the model to focus on channels that carry important semantics in the image. The spatial attention module focuses on the spatial dimension of the image. It generates a spatial attention map by performing max pooling and average pooling on the feature map along the channel dimension, and then learns the importance of spatial positions through a convolution operation to calculate the weight of each pixel. This improves the saliency of local image features, enabling the network to locate target objects more accurately in the spatial dimension and to capture the spatial distribution and detailed information of targets in the image. In the fake news detection scenario, CBAM can effectively highlight semantic feature channels in images that are closely related to the authenticity of news content. Let the image feature processed by ResNet be

$V_{r}$, the processed feature be $CB\_V_{r}$, and the full calculation steps for introducing CBAM are given as follows:

$$CB_{c}(V_{r}) = \sigma\{\mathrm{MLP}[\mathrm{AvgPool}(V_{r})] + \mathrm{MLP}[\mathrm{MaxPool}(V_{r})]\} \tag{2}$$

$$V_{r}' = CB_{c}(V_{r}) \otimes V_{r} \tag{3}$$

$$CB_{s}(V_{r}') = \sigma\{f([\mathrm{AvgPool}(V_{r}'); \mathrm{MaxPool}(V_{r}')])\} \tag{4}$$

$$V_{r}'' = CB_{s}(V_{r}') \otimes V_{r}' \tag{5}$$

$$CB\_V_{r} = V_{r}'' \oplus V_{r} \tag{6}$$

where $\sigma$ denotes the activation function, $f$ represents the convolution operation with a filter size of 7 × 7, $CB_{c}(V_{r})$ denotes channel attention, and $CB_{s}(V_{r}')$ denotes spatial attention.
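The channel-attention step in Eqs. (2) and (3) can be illustrated with a small sketch. This is a toy, standard-library-only version rather than the authors' implementation: the feature map is a nested list of shape channels × height × width, the shared MLP is assumed to be two bias-free linear layers with a ReLU in between, and the weights `W1` and `W2` are hypothetical values chosen for readability. The spatial-attention step and the residual connection are omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vec, w1, w2):
    """Shared two-layer MLP with ReLU (assumed form): C -> hidden -> C."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feature_map, w1, w2):
    """Eq. (2): sigmoid(MLP(AvgPool(V)) + MLP(MaxPool(V))), one weight per channel."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(max(row) for row in ch) for ch in feature_map]
    return [sigmoid(a + m) for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]

def apply_channel_attention(feature_map, weights):
    """Eq. (3): scale every element of a channel by that channel's weight."""
    return [[[wt * x for x in row] for row in ch]
            for ch, wt in zip(feature_map, weights)]

# Toy input: 2 channels of size 2x2 (hypothetical values).
V_r = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
W1 = [[1.0, 0.0]]    # 2 channels -> 1 hidden unit
W2 = [[1.0], [0.0]]  # 1 hidden unit -> 2 channels

weights = channel_attention(V_r, W1, W2)
refined = apply_channel_attention(V_r, weights)
```

In this toy setting the informative first channel receives a weight close to 1, while the all-zero second channel receives the neutral weight 0.5.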

3.3. Loss Calculation Method of Dual Contrastive Learning

This model employs a dual contrastive learning approach to compute losses. The introduction of contrastive learning aims to focus on semantic correlations between modal data, promote multimodal learning during training, and enhance the explainability and generalization capabilities of the fake news detection model in complex scenarios. Dual contrastive learning consists of supervised contrastive learning and unsupervised contrastive learning. The specific process is as follows: First, use unsupervised contrastive learning based on proxy tasks and objective functions to compare the similarities of and differences in different modal features in spatial semantics, guide the model to learn cross-modal consistent feature representations, enabling it to synthesize multimodal information to overcome the limitations of single-modal data and enhance its ability to understand and capture the complex characteristics of fake news. Then, use the rich information contained in accurate labels to clearly distinguish positive and negative sample pairs, prompting the model to deeply explore the differences and commonalities between intra-modal data features, strengthen the model’s ability to discriminate different feature patterns within the modality, and enable the model to accurately capture feature patterns related to fake news, thereby effectively improving the recognition accuracy of fake news.

Unsupervised contrastive learning lacks label information, so it is necessary to use proxy tasks to define the positive and negative attributes of contrastive learning samples. The loss calculation process of unsupervised contrastive learning is shown in Figure 4. In this context, unsupervised contrastive learning can be regarded as a dictionary lookup task, where a given anchor sample

$T^{query}$ and its matching sample form a positive sample pair, while the other samples $I_{i}^{key}$ are treated as negative samples. After the positive and negative sample pairs are defined, they are fed into separate encoders, and the contrastive learning loss function is used to guide the model's learning process. The corresponding mathematical expression is given as follows:

$$l_{UCL} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(x_{i}^{t} \cdot x_{i+}^{v} / \tau)}{\sum_{j=1}^{N} \exp(x_{i}^{t} \cdot x_{j-}^{v} / \tau)} + \log \frac{\exp(x_{i}^{v} \cdot x_{i+}^{t} / \tau)}{\sum_{j=1}^{N} \exp(x_{i}^{v} \cdot x_{j-}^{t} / \tau)} \right] \tag{7}$$

where $\tau$ is the temperature hyperparameter, which regulates the scale of the similarity measure. The larger the value of $\tau$, the looser the constraint imposed by the similarity measure, meaning that the model distinguishes positive and negative samples less strictly during learning. $x_{i}$ denotes the anchor sample, $x_{i+}$ denotes the positive sample corresponding to the anchor sample, and $x_{j-}$ denotes the corresponding negative sample. In addition, experimental studies have shown that selecting different types of data as keys and queries, respectively, enables the model to focus on different-dimensional features of the input sequence. Based on this, image data and text data are used as the key and the query of the dictionary lookup, respectively. Taking the mean of the similarity between them promotes the model's learning of cross-modal associations, thereby improving its performance on related tasks.
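A minimal sketch of the symmetric text-to-image and image-to-text objective in Eq. (7) is given below, using only the Python standard library. Several choices are assumptions and not taken from the paper: features are L2-normalised before the dot product, the denominator includes the positive pair as in the usual InfoNCE formulation, the log-ratio is negated so that the returned value is a quantity to minimise (Eq. (7) is printed without a leading minus sign), and `tau=0.1` is only a placeholder value.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def info_nce_one_direction(queries, keys, tau):
    """Mean of -log softmax(q_i . k_j / tau) evaluated at the matching key j = i."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)
    return total / len(queries)

def unsupervised_contrastive_loss(text_feats, image_feats, tau=0.1):
    """Average of the text->image and image->text directions, as in Eq. (7)."""
    t = [l2_normalize(v) for v in text_feats]
    v = [l2_normalize(x) for x in image_feats]
    return 0.5 * (info_nce_one_direction(t, v, tau)
                  + info_nce_one_direction(v, t, tau))

# Toy batch of two pairs (hypothetical 2-d features).
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = unsupervised_contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = unsupervised_contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

With matched pairs the loss is close to zero, and it grows when each text is paired with the wrong image, which is the behaviour the objective is meant to reward.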

To make the model more accurate and more interpretable, supervised contrastive learning is introduced on top of unsupervised contrastive learning. Specifically, we build upon the seminal work of Khosla et al. [21] on supervised contrastive learning and apply it to the fake news detection task, because it provides a principled objective for constructing multiple positive and negative pairs within a labeled mini-batch. The core idea of supervised contrastive learning is shown in Figure 5. Recent task-specific studies have further confirmed the effectiveness of contrastive objectives in multimodal fake news detection, including shared-representation contrastive learning [8], external-information-enhanced multimodal contrastive learning [4], and cross-modal information enhancement via contrastive learning [5]. Based on this foundation, supervised contrastive learning is incorporated into our fake news detection framework.

The specific operation in the fake news detection task is as follows: Samples are divided into fake news and real news according to label information. During sample selection, samples of the same category as the anchor sample are designated as positive samples, while samples of a different category are treated as negative samples. After the sample pairs are constructed in this way, the final loss function is given as follows:

$$l_{SCL} = \sum_{i \in I} \frac{1}{|R(i)|} \sum_{r \in R(i)} \log \frac{\exp(x_{i} \cdot x_{r} / \tau)}{\sum_{a \in A(i)} \exp(x_{i} \cdot x_{a} / \tau)} \tag{8}$$

where $I$ denotes the sample set and $|R(i)|$ denotes the total number of samples in the set $R(i)$. For a given sample $i$, $R(i)$ denotes the set of samples in the batch that have the same category as sample $i$ but different content. According to the annotation information of the data, all samples are divided into positive samples and negative samples. When calculating the loss, positive samples and negative samples are respectively included in $R(i)$ to calculate the loss. $a \in A(i)$ represents a sample instance belonging to the same category as $r \in R(i)$ but different from $r$.
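The label-driven pairing in Eq. (8) can be sketched as follows, again with the standard library only. The sketch follows the formulation of Khosla et al. [21] under stated assumptions: $A(i)$ is taken to be every sample in the batch other than the anchor, features are L2-normalised, the log-ratio is negated so the result is minimised, anchors without any positive are skipped, and `tau=0.1` is a placeholder. The function and variable names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def supervised_contrastive_loss(features, labels, tau=0.1):
    """Sum over anchors i of the mean negative log-ratio over positives R(i)."""
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total = 0.0
    for i in range(n):
        # R(i): same label as the anchor, excluding the anchor itself.
        positives = [r for r in range(n) if r != i and labels[r] == labels[i]]
        if not positives:
            continue
        # A(i): assumed here to be all other samples in the batch.
        logits = {a: dot(feats[i], feats[a]) / tau for a in range(n) if a != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[r] - log_denom for r in positives) / len(positives)
    return total

# Toy batch: two directions in feature space (hypothetical values).
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
clustered_loss = supervised_contrastive_loss(features, [0, 0, 1, 1])
mixed_loss = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

When samples with the same label already point in the same direction the loss is near zero; when labels cut across the feature clusters it is large, so minimising it pulls same-class samples together.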

Fake news detection is regarded as a binary classification task, and the common cross-entropy loss function is selected as the optimization objective of the model. By minimizing this loss function, the gap between the model’s prediction results and the true labels is minimized as much as possible, thereby improving the detection performance of the model. On the basis of the cross-entropy loss function, the use of dual contrastive learning for optimization is discussed. In binary classification tasks, the core function of the contrastive loss function is to help the model learn the differential features between categories. In view of this, the contrastive loss is set as an auxiliary task, with the cross-entropy loss function as the main task [22], and the stochastic gradient descent algorithm is used to train the model. When the category distribution of the data is unbalanced, the contrastive loss function can prompt the model to map samples of the same category to adjacent feature space points, effectively preventing the model from producing overconfident prediction results for a certain category. Based on the above analysis, to regulate the effect of the contrastive loss function, a weight parameter is introduced into the loss function. By integrating the loss functions of the main classification task and the auxiliary learning task, the final loss function of the model is given as follows:

$$l_{final} = l_{CE} + \alpha \, l_{SCL} + \beta \, l_{UCL} \tag{9}$$

where $l_{CE}$ is the cross-entropy loss, which is used to measure the degree of information difference between the model's prediction results and the true labels; $l_{SCL}$ represents the supervised contrastive loss, which can guide the model to learn the differences between categories and the similarity of samples of the same category; and $l_{UCL}$ represents the unsupervised contrastive loss, whose goal is to mine the potential structure and feature representation of data from unlabeled data.

3.4. Variational Autoencoder

Multimodal data involved in fake news detection, such as text and images, exhibit complex nonlinear semantic relationships and latent connections. Through its nonlinear mapping capability and probabilistic modeling mechanism, a variational autoencoder can map data of different modalities into a unified latent space, providing a basis for the fusion and analysis of cross-modal data, thereby more accurately evaluating cross-modal ambiguity and improving the accuracy of fake news detection. Based on the above analysis, this paper introduces a cross-modal ambiguity learning method based on a variational autoencoder into the detection framework, which is realized by evaluating the KL (Kullback–Leibler) divergence between unimodal distributions approximated by two modality-specific variational autoencoders [23]. KL divergence can be used to quantify the relative difference between the latent variable distribution $p(x \mid z)$ learned by the encoder and the standard Gaussian distribution $q(x \mid z)$. The learned ambiguity score enables adaptive adjustment of the weights of cross-modal features and unimodal features in the fake news detection task. It should be noted that the KL divergence in this work is not intended to serve as a comprehensive semantic consistency metric, but rather as an auxiliary signal to estimate global cross-modal discrepancy and guide adaptive feature fusion. Therefore, the ambiguity score functions as a modulation factor in the fusion process rather than a standalone indicator of semantic consistency. Specifically, when the information conveyed by unimodal features is uncertain or ambiguous, the weight of multimodal features should be increased during the detection task; otherwise, the dependence on cross-modal features should be appropriately reduced. In summary, KL divergence is employed to capture distribution-level discrepancies between modalities and to guide adaptive fusion, while finer-grained semantic consistency is further modeled by other components in the framework.

For a given unimodal input sample $x^{t}$ or $x^{v}$, the core problem of the variational autoencoder is how to infer the latent variable $z$ for the input $x$. In practical application scenarios, it is difficult to accurately calculate the distribution of fixed unimodal features. According to the conditional probability formula (Bayes' rule), the probability of the latent variable is given as follows:

$$p(z \mid x) = \frac{p(x \mid z) \, p(z)}{p(x)} \tag{10}$$

Therefore, this paper samples unimodal features from the latent space with an isotropic Gaussian prior. Under the assumption that the distribution difference between unimodal features reflects the information gap between modalities, a new distribution $q_{\theta}$ is sought that approximates the posterior distribution $p(z \mid x)$ as closely as possible. The latent variable information of different modal data is different, so the posterior distribution of the data sequence also differs. The variational posterior distributions of text and image can be expressed as follows:

$$q_{\theta}(z_{i}^{t} \mid x_{i}^{t}) = \mathcal{N}(z_{i}^{t} \mid \mu(x_{i}^{t}), \sigma(x_{i}^{t})) \tag{11}$$

$$q_{\theta}(z_{i}^{v} \mid x_{i}^{v}) = \mathcal{N}(z_{i}^{v} \mid \mu(x_{i}^{v}), \sigma(x_{i}^{v})) \tag{12}$$

where $\mu$ and $\sigma$ denote the mean and variance, respectively, and $\mathcal{N}$ denotes the Gaussian distribution with that mean and variance. The specific values can be obtained from the variational autoencoder. Based on the above analysis, the variational posteriors of the two modalities can be calculated respectively, and the calculation formulas are as follows:

$$q_{\theta}(z^{t}) = \frac{1}{N} \sum_{i=1}^{N} q_{\theta}(z_{i}^{t} \mid x_{i}^{t}) \tag{13}$$

$$q_{\theta}(z^{v}) = \frac{1}{N} \sum_{i=1}^{N} q_{\theta}(z_{i}^{v} \mid x_{i}^{v}) \tag{14}$$

Furthermore, the ambiguity degree existing in different modalities of data sample $x_{i}$ can be measured by the KL divergence between unimodal distributions, and the specific forms are given as follows:

$$b_{i}^{1} = \frac{D_{KL}\left(q_{\theta}(z_{i}^{t} \mid x_{i}^{t}) \,\|\, q_{\theta}(z_{i}^{v} \mid x_{i}^{v})\right)}{D_{KL}\left(q_{\theta}(z^{t}) \,\|\, q_{\theta}(z^{v})\right)} \tag{15}$$

$$b_{i}^{2} = \frac{D_{KL}\left(q_{\theta}(z_{i}^{v} \mid x_{i}^{v}) \,\|\, q_{\theta}(z_{i}^{t} \mid x_{i}^{t})\right)}{D_{KL}\left(q_{\theta}(z^{v}) \,\|\, q_{\theta}(z^{t})\right)} \tag{16}$$

$$b_{i} = \mathrm{sigmoid}\left(\frac{1}{2}\left(b_{i}^{1} + b_{i}^{2}\right)\right) \tag{17}$$

where $D_{KL}$ represents the KL divergence. The ambiguity score $b_{i}$ is constructed by normalizing the two directed terms and taking their mean, which yields a symmetrized KL divergence. Here, $q_{\theta}(z_{i}^{t} \mid x_{i}^{t})$ and $q_{\theta}(z_{i}^{v} \mid x_{i}^{v})$ represent the variational posterior distributions based on the text unimodal feature $x^{t}$ and the image unimodal feature $x^{v}$, respectively. $\mathrm{sigmoid}(\cdot)$ serves as the activation function that maps the ambiguity score to the interval [0, 1]. A lower score indicates that the two unimodal distributions are close to each other. Based on this principle, this paper uses the ambiguity score $b$ as a weight to adjust the contribution of unimodal features and multimodal features to the detection process. Specifically, when the score is high, this mechanism strengthens the contribution of cross-modal features and weakens the influence of unimodal features accordingly; conversely, when the score is low, the model relies more on unimodal features and reduces its dependence on cross-modal features.
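The per-sample computation in Eqs. (15) to (17) can be sketched with the closed-form KL divergence between two diagonal Gaussians, $D_{KL} = \frac{1}{2}\sum_{d}\left[\ln\frac{\sigma_{q,d}^{2}}{\sigma_{p,d}^{2}} + \frac{\sigma_{p,d}^{2} + (\mu_{p,d} - \mu_{q,d})^{2}}{\sigma_{q,d}^{2}} - 1\right]$. The sketch below is illustrative only. It assumes diagonal covariances, and it takes the batch-level denominators of Eqs. (15) and (16) as the hypothetical arguments `norm_tv` and `norm_vt` rather than computing them, because the KL divergence between the aggregated posteriors of Eqs. (13) and (14) has no closed form.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def ambiguity_score(mu_t, var_t, mu_v, var_v, norm_tv=1.0, norm_vt=1.0):
    """Eqs. (15)-(17): normalised KL in both directions, averaged, then sigmoid.

    norm_tv and norm_vt stand in for the batch-level denominators
    (an assumption of this sketch; they are supplied by the caller).
    """
    b1 = kl_diag_gaussians(mu_t, var_t, mu_v, var_v) / norm_tv
    b2 = kl_diag_gaussians(mu_v, var_v, mu_t, var_t) / norm_vt
    return 1.0 / (1.0 + math.exp(-0.5 * (b1 + b2)))

# Identical text and image posteriors versus clearly separated ones.
same_score = ambiguity_score([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
far_score = ambiguity_score([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0])
far_kl = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0])
```

Because KL divergence is non-negative, this sketch returns 0.5 for identical posteriors and values approaching 1 as the two modalities diverge, provided the normalisers are positive.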

3.5. Multimodal Feature Fusion Strategy

After obtaining the ambiguity score through the variational autoencoder, multimodal feature fusion is performed. The input of the classifier consists of two parts: one is the unimodal representation from the cross-modal alignment module, and the other is the cross-modal correlation information from the cross-modal fusion module. Among them, the fusion degree of cross-modal correlation information is controlled by the cross-modal ambiguity score generated by the cross-modal ambiguity learning module, and the specific form is given as follows:

\tilde{x} = (a_{x} \times e_{s}^{m}) \oplus (b_{x} \times x^{t}) \oplus (b_{x} \times x^{v})

(18)

where

\tilde{x}

denotes the fused feature vector,

e_{s}^{m}

denotes the cross-modal feature vector, and

x^{t}

and

x^{v}

denote unimodal feature vectors. The sum of

a_{x}

and

b_{x}

is 1. The three weighted terms are combined through the ⊕ operation, and the resulting representation is fed into a fully connected network. The predicted label

{\tilde{y}}_{l}

is given as follows:

{\tilde{y}}_{l} = softmax (MLP (\tilde{x}))

(19)
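Equations (18) and (19) can be sketched in a few lines of Python. The sketch takes the cross-modal weight a as an input and sets b = 1 − a, which is the only relation between the two weights that the text states. It also assumes that ⊕ is vector concatenation and replaces the MLP with a single linear layer as a stand-in. All function names are hypothetical.

```python
import math


def fuse(e_m, x_t, x_v, a):
    """Equation (18): weight the cross-modal vector by a and both unimodal vectors by b = 1 - a."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    b = 1.0 - a
    # The ⊕ operation is taken to be concatenation (an assumption of this sketch).
    return [a * v for v in e_m] + [b * v for v in x_t] + [b * v for v in x_v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, bias):
    """Equation (19) with one linear layer standing in for the MLP."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy example: 2-d cross-modal vector, 1-d text and image vectors, two classes.
fused = fuse([1.0, 2.0], [3.0], [4.0], a=0.25)
probs = predict(fused, weights=[[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]], bias=[0.0, 0.0])
```

A larger a shifts the classifier input toward the cross-modal vector, and a smaller a shifts it toward the two unimodal vectors.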

4. Experiments and Result Analysis

4.1. Datasets

The datasets used in the experiments are the English dataset Twitter [24] and the Chinese dataset Weibo [17]. Both are mainly used for fake news detection tasks and cover news texts and images from multiple domains. Each dataset is divided into a training set, a test set, and a validation set at a ratio of 7:1:2. This split gives the model enough training data to learn the data features while reserving held-out data for assessing its generalization ability and reliability, so that its performance in practical application scenarios can be evaluated more comprehensively.

4.2. Parameter Settings

During model training, we follow the settings of CAFE [25] and use the Adam optimizer to minimize the loss function. The model is trained for 100 epochs with five-fold cross-validation. All other hyperparameters remain consistent with those of CAFE [25], and the target learning rate is set to

1 \times 10^{- 4}

. Considering that improper learning rate settings may lead to training instability, we further introduce a linear warm-up mechanism. Specifically, during the early stage of training, the learning rate is gradually increased from

1 \times 10^{- 6}

to

1 \times 10^{- 4}

following a linear schedule. This strategy enables the model to adapt to the data smoothly at the beginning of training and avoids abrupt parameter updates. As training progresses, the learning rate reaches the preset value, allowing the model to fully exploit the data for effective optimization, thereby improving both training stability and overall performance. In the dual contrastive learning module, the key hyperparameters are set as

α = β = 0.2

. For evaluating the model’s generalization ability and robustness on AI-generated text, we adopt GPT-3.5-Turbo as the text generation model.
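The linear warm-up described above can be sketched as a small schedule function. The start and target learning rates are taken from the text; the number of warm-up steps is not stated there, so warmup_steps is a hypothetical parameter, as is the function name.

```python
def warmup_lr(step, warmup_steps, lr_start=1e-6, lr_target=1e-4):
    """Learning rate at a given step under a linear warm-up.

    The rate rises linearly from lr_start to lr_target over warmup_steps
    steps and stays at lr_target afterwards.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return lr_target
    fraction = step / warmup_steps
    return lr_start + fraction * (lr_target - lr_start)


# Example: learning rates over a hypothetical 10-step warm-up.
schedule = [warmup_lr(step, warmup_steps=10) for step in range(12)]
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update.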

4.3. Comparison Experiments

To verify the overall performance of the proposed model, this paper selects different baseline models for comparative analysis and designs comparative experiments for unimodal baseline models and multimodal baseline models, respectively.

Text-GRU: A single-text method that extracts features from text data through a bidirectional Gated Recurrent Unit network (Bi-GRU) to capture the semantic information of words, thereby achieving text classification tasks.
Image-VGG: A single-visual method that extracts features from image data using the VGG-19 convolutional neural network, then inputs these features into a fully connected layer to map the features to the classification space through a weight matrix to obtain classification results.
EANN [26]: An end-to-end rumor detection model built based on adversarial neural networks. It uses Text-CNN to capture local text features and further extract key features of the text. VGG-19 is used to extract image features. The two features are fused and input into the adversarial neural network for training to achieve rumor detection.
MKEMN [27]: A model that utilizes a multimodal knowledge-aware network to fuse multimodal information such as text and images, and employs an event memory network to capture the development context of news events, achieving fake news recognition through their collaboration.
SAFE [15]: A model that proposes an innovative multimodal similarity calculation method that can jointly learn the representations of text and visual information and their relationships. By designing a reasonable similarity measurement function, it calculates the similarity between text and images in the semantic space to determine the authenticity of news.
MFCD [28]: A fake news detection model based on multi-level fusion, effectively addressing the problems of insufficient inter-modal information fusion and excessive requirements for the integrity of multimodal information. By designing a multi-level fusion strategy, it fuses multimodal information such as text and images at different levels, fully exploiting the information complementarity across levels while reducing reliance on the integrity of multimodal information, thereby improving the model’s adaptability in practical applications.
MMCSC [29]: A model that extracts high-level semantic features of text and images and designs a calculation method for cross-modal topic and sentiment consistency. It uses deep learning models to extract high-level semantic features of text and images, respectively, then constructs a cross-modal topic and sentiment consistency measurement model to calculate the consistency in topic and sentiment between different modalities, thereby judging the authenticity of news.
LIIMR [30]: A model that performs fake news detection using intra-modal and inter-modal methods. In intra-modal, it independently extracts and analyzes features from text and image data to mine feature patterns within unimodal data. In inter-modal, it fuses text and image features through an effective fusion strategy, making full use of complementary information between different modalities to improve fake news detection performance.
MCNN [16]: A multimodal consistency detection model that can capture the overall features of social media information for fake news detection. By constructing a multimodal fusion layer, it fuses multimodal information such as text and images, and uses a consistency constraint mechanism to ensure the consistency of different modal information during fusion, thereby extracting more representative overall features and improving the accuracy of fake news detection.
CAFE [25]: A model that maps heterogeneous unimodal features to a shared semantic space using a mapping function, and designs an ambiguity estimation module to evaluate and handle potential ambiguities between different modalities, improving detection reliability.
BDANN [31]: A model that conducts in-depth analysis of text data using the pre-trained language model BERT, and performs feature learning on image data using the pre-trained VGG-19 model. It then introduces a domain classifier to eliminate feature dependence for fake news detection.
MVAE [14]: A model that processes text and image data using a variational autoencoder to mine correlation information between them. After obtaining relevant features, they are input into a news classifier, and through the classifier’s operation and judgment mechanism, false information detection is achieved.
MMF [8]: A fake news detection model based on a shared network and contrastive learning. It fully integrates text and visual information through a graph convolutional neural network, uses a shared representation module to extract fine-grained representations for richer multimodal information, and introduces two different types of contrastive learning as auxiliary tasks to enable the model to better learn correlations between samples of the same category.

The experimental results are shown in Table 1, which lists the comparison results of VCLMMF and other benchmark methods in the fake news detection task.

The following conclusions can be drawn from the experimental results. VCLMMF achieves an accuracy of 89.6% on the Weibo dataset, 5.4 percentage points above the previous best result, and 90.4% on the English Twitter dataset. Compared with the strongest baselines, VCLMMF improves accuracy by roughly 3–8 percentage points. On the one hand, the dual contrastive learning paradigm enables the model to learn more discriminative features, which helps improve model interpretability and thus accuracy. On the other hand, the variational autoencoder adaptively aggregates unimodal and multimodal features, which effectively resolves inter-modal information conflicts and significantly improves the precision of fake news detection.

A closer comparison with individual baselines supports this analysis. The MCNN model [16] focuses on the consistency of cross-modal data and integrates visual, physical tampering, and image-text mismatch features. However, because news text is central to fake news detection, MCNN's heavy focus on cross-modal consistency may leave the text content insufficiently analyzed, which hurts detection performance. In contrast, VCLMMF uses a variational autoencoder to dynamically adjust the weights of text, image, and multimodal features, highlighting the parts that contribute most to the detection task. CAFE [25] uses cross-modal alignment to map text and visual data into a shared semantic space and thereby construct better multimodal representations. However, flaws in the labeling method used during its training phase mean that data are not labeled accurately and completely, which introduces biased information during encoding and interferes with the capture of real semantics. In contrast, VCLMMF introduces unsupervised contrastive learning and a self-training method that automatically generates pseudo-labels for unlabeled data during training and iteratively refines them, improving the capture of semantics.

4.4. Out-of-Distribution Evaluation on AI-Generated and Niche Data

Although the proposed method achieves strong performance on standard benchmark datasets, fake news in real-world scenarios exhibits continuously evolving characteristics, such as AI-generated text and niche topics under long-tail distributions. These out-of-distribution data often differ significantly from the training data, thereby posing greater challenges to the model’s generalization ability. To address this issue, we further design extended experiments from two perspectives—AI-generated text and niche topics—to evaluate the robustness and generalization capability of the proposed model in complex and dynamic environments.

In the experiments on AI-generated text, we randomly sample data from the test sets of Weibo and Twitter while maintaining class balance, i.e., 250 real and 250 fake samples from each dataset, resulting in 500 samples per dataset. Based on this, we keep the original images unchanged and only rewrite the textual content using large language models to construct AI-generated test data. Specifically, a prompt-based fine-tuning strategy is adopted to guide the generation process. The prompt is defined as follows: “Rewrite the following news text to better resemble AI-generated content while preserving its original semantics. The rewritten text should be more fluent and may include slight ambiguity or stylistic enhancement, but must not alter its factual label (real or fake).” As shown in Table 2, VCLMMF maintains stable performance under this setting, achieving an Accuracy of 0.891 on Weibo and 0.902 on Twitter, with Precision, Recall, and F1-score for both classes remaining largely consistent with those on the original test sets. These results indicate that the proposed model remains robust under textual distribution shifts, which can be attributed to the cross-modal ambiguity modeling mechanism that reduces reliance on superficial textual features, as well as the contrastive learning strategy that enhances cross-modal representation consistency.
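The class-balanced sampling step described above (250 real and 250 fake samples per dataset) can be sketched as follows. This is a minimal illustration, assuming each test sample is a dictionary with a "label" field; that field name, the function name, and the seed are hypothetical and not taken from the text.

```python
import random


def balanced_sample(samples, per_class=250, seed=0):
    """Draw per_class samples from every label without replacement."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample["label"], []).append(sample)
    picked = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"not enough samples for label {label!r}")
        picked.extend(rng.sample(pool, per_class))
    rng.shuffle(picked)  # avoid grouping the subset by label
    return picked


# Toy test set: 300 real and 300 fake items with placeholder text.
toy_test_set = [{"id": i, "label": "real" if i % 2 == 0 else "fake", "text": "..."} for i in range(600)]
subset = balanced_sample(toy_test_set, per_class=250)
```

Only the text field of each sampled item would then be rewritten with the prompt quoted above, leaving the image and the label unchanged.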

In the niche topic experiments, we similarly construct subset datasets of 500 samples from the Weibo and Twitter test sets, each containing multiple low-frequency topics while preserving class balance. Specifically, for the Weibo dataset, due to the absence of explicit topic labels, we first cluster texts based on semantic similarity and then select low-frequency clusters as niche topics. For the Twitter dataset, niche topics are directly selected from small-scale events (e.g., Passport Hoax, Garissa Attack). The Weibo subset contains 12 niche topics, while the Twitter subset contains 10 niche topics, all derived from events with relatively low frequency or attention in the original datasets. As shown in Table 2, the model achieves Accuracies of 0.901 and 0.907 on the Weibo and Twitter niche datasets, respectively, which are comparable to those on the original test sets. This demonstrates that VCLMMF maintains stable performance under data scarcity and domain shift, mainly due to the VAE-based modeling of cross-modal distribution discrepancies and the robust feature representations learned through contrastive learning, enabling effective generalization under small-sample and distribution-shift conditions.
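For the Weibo subset, the selection of low-frequency clusters can be illustrated with a short sketch. It assumes that each text has already been assigned a cluster id by some semantic clustering step, which is not shown, and that "low-frequency" means the clusters with the fewest members. The text does not specify the exact criterion, so this ranking rule and the function name are assumptions.

```python
from collections import Counter


def low_frequency_clusters(cluster_ids, num_topics):
    """Return the ids of the num_topics smallest clusters.

    Ties are broken by cluster id so that the result is deterministic.
    """
    counts = Counter(cluster_ids)
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [cluster_id for cluster_id, _ in ranked[:num_topics]]


# Toy example: cluster ids for seven texts; keep the two rarest clusters.
niche = low_frequency_clusters([0, 0, 0, 1, 2, 2, 3], num_topics=2)
```

Samples belonging to the returned clusters would then form the niche-topic pool from which the class-balanced subset is drawn.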

4.5. Ablation Experiment

To verify the effectiveness of the variational autoencoder and dual contrastive learning modules in the fake news detection framework, this section reports ablation experiments that remove each module in turn and compare the resulting performance with that of the full model. The aim is to quantify the actual contribution of each module to the fake news detection task.

Table 3 presents the ablation results. Both the dual contrastive learning module (CL) and the variational autoencoder (VAE) module improve detection accuracy. Removing CL lowers accuracy from 0.896 to 0.840 on Weibo and from 0.904 to 0.842 on Twitter, a drop of about 5.6 and 6.2 percentage points, respectively. Removing VAE lowers accuracy to 0.852 on Weibo and 0.881 on Twitter, a drop of about 4.4 and 2.3 percentage points. The degradation is therefore largest when the dual contrastive learning module is removed. These results demonstrate that both components of the proposed method contribute positively to the performance of fake news detection.

4.6. Parameter Analysis

This section analyzes how key parameters affect model performance by varying them and comparing the results. In the dual contrastive learning module, the weights of the supervised and unsupervised contrastive losses are adjusted to change the influence of the auxiliary tasks on the model. The hyperparameters were varied within the range 0–0.5 on the Weibo dataset, and the results are shown in Figure 6.

The results show that detection performance is best when

α = β = 0.2

. Further analysis reveals that compared with other methods, supervised contrastive learning has significant advantages in promoting the learning of complementary relationships between cross-modal features, enabling more effective mining of potential associations between different modal features and enhancing the model’s ability to fuse and process multimodal information.

In addition to parameters, the choice of optimizer is also crucial for the training process. Different optimizers vary in convergence speed, stability, and generalization ability. To explore the impact of different optimizers on model training, this paper selects three representative optimizers: Adaptive Moment Estimation (Adam), Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD) for experiments.

Adam: Combines the advantages of Adagrad and RMSprop by maintaining estimates of both the first moment (mean) and the second moment (uncentered variance) of the gradients, and uses them to adjust the learning rate of each parameter dynamically. It is computationally efficient, has low memory requirements, and converges quickly in most cases. Because it adapts the learning rate per parameter, it balances exploration and exploitation well during training. It is widely used in deep learning tasks and is especially suitable for models with large data and parameter scales.

RMSprop: Maintains an exponentially weighted moving average of the squared gradients and uses it to scale the learning rate. It thereby adapts the learning rate per parameter, using smaller learning rates for frequently updated parameters and larger ones for sparse parameters. It handles non-stationary objective functions effectively, which makes training more stable and reduces oscillations in parameter updates. It is suitable for non-convex optimization problems, especially when the objective function is noisy or the gradients vary widely. When the model structure is complex (e.g., deep convolutional neural networks processing image data), RMSprop can balance training stability and convergence speed well.

SGD: Randomly samples a small batch from the training data in each iteration and updates the model parameters using the gradient computed on that batch. The algorithm is simple and easy to implement. On large datasets, small batches keep each update computationally cheap, and the noise they introduce can help avoid local optima to a certain extent. It is suitable for training on large-scale datasets and works effectively when the data distribution is relatively uniform and the model structure is relatively simple.
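To make the differences between the three optimizers concrete, the following sketch writes out one update step of each for a single scalar parameter. These are the standard textbook update rules, and the default values of rho, beta1, beta2, and eps are commonly used defaults rather than settings reported in this paper. The function names are hypothetical.

```python
import math


def sgd_step(w, grad, lr):
    """Plain SGD: move against the gradient with a fixed learning rate."""
    return w - lr * grad


def rmsprop_step(w, grad, sq_avg, lr, rho=0.9, eps=1e-8):
    """RMSprop: scale the step by a moving average of squared gradients."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(sq_avg) + eps), sq_avg


def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first and second moment estimates; t counts steps from 1."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Toy usage: minimize f(w) = (w - 3)^2 with Adam, starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
```

On the first step, the Adam update has magnitude close to the learning rate regardless of the gradient scale, because the bias-corrected moments reduce to the gradient and its square.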

The experimental results are shown in Figure 7. Training stabilizes once the number of iterations reaches roughly 40–50, after which the gradients change smoothly. Comparing the convergence behavior of the three optimizers shows that Adam performs best in both convergence speed and stability.

5. Conclusions

To address issues in multimodal fake news detection, such as limited model interpretability and neglect of the semantic gap between modalities, this paper proposes a fake news detection method based on a variational autoencoder and dual contrastive learning. The method improves on existing approaches at three levels. First, LSTM and CBAM are introduced into the feature extraction module to enhance text and image features, respectively. Second, a variational autoencoder is integrated into the fake news detection framework, enabling the model to adaptively adjust the contribution of unimodal and multimodal features to detection. Finally, a dual contrastive learning module is introduced, using both supervised and unsupervised contrastive learning as auxiliary tasks to better capture semantic correlations. Experimental results on public datasets show that the proposed model outperforms the comparison methods in accuracy, verifying the effectiveness of the proposed optimization scheme.

Despite these encouraging results, several limitations remain. In particular, the current ambiguity modeling relies on distribution-level divergence, which may not fully capture fine-grained semantic inconsistencies. In addition, although multiple mechanisms have been introduced to mitigate biases from pre-trained models, further improvements are still needed in terms of bias robustness and domain generalization. In future work, we plan to explore more expressive cross-modal semantic consistency modeling approaches, such as fine-grained entity alignment and reasoning-based methods, as well as bias-aware training strategies and domain-specific pre-trained models. Furthermore, we aim to extend the proposed framework to more complex multimodal settings by incorporating video modalities along with temporal modeling techniques, in order to further explore the model’s capability in detecting deepfake videos and other emerging forms of fake news. These directions are expected to further enhance the model’s capability to handle complex, diverse, and evolving fake news scenarios.

Author Contributions

Conceptualization, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Methodology, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Software, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Validation, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Formal analysis, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Investigation, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Resources, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Data curation, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Writing—original draft, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q.; Writing—review & editing, B.W., R.H., J.W., X.S., J.S., J.L. and Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Yunnan Provincial Science and Technology Program Project grant number 202501BA070001–109, Key Programs of the Joint Funds of the National Natural Science Foundation of China grant number U23B2029, Natural Science Foundation of Beijing Municipal grant number 4252035, Intelligent Perception and Control Engineering Research Center of Yunnan Provincial Department of Education and Yuxiu Innovation Project of NCUT grant number 2024NCUTYXCX102.

Data Availability Statement

The data presented in this study are openly available in GitHub repository “image-verification-corpus” at https://github.com/MKLab-ITI/image-verification-corpus (accessed on 8 March 2026). Specifically, the MediaEval 2015 Verifying Multimedia Use dataset was used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, H.L.; Chen, S.H.; Cao, S.J.; Zhu, J.L.; Ren, Q.Q. Research on Fake News Detection Based on Multimodal Learning. J. Front. Comput. Sci. Technol. 2023, 17, 2022–2029.
2. Liu, H.; Wang, W.; Li, H. Interpretable Multimodal Misinformation Detection with Logic Reasoning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023; pp. 9781–9796.
3. Wang, L.; Zhang, C.; Xu, H.; Xu, S.; Xu, B. Cross-modal Contrastive Learning for Multimodal Fake News Detection. In Proceedings of the 31st ACM International Conference on Multimedia (MM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5696–5704.
4. Cao, B.; Wu, Q.; Cao, J.; Liu, B.; Gui, J. External Reliable Information-enhanced Multimodal Contrastive Learning for Fake News Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; pp. 31–39.
5. Chen, W.; Cai, F.; Guo, Y.; Pan, Z.; Chen, W.; Zhang, Y. Contrastive Learning of Cross-Modal Information Enhancement for Multimodal Fake News Detection. Complex Intell. Syst. 2025, 11, 303.
6. Hu, S.; Hu, J.; Zhang, H. Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 1426–1440.
7. Shen, L.; Long, Y.; Cai, X.; Razzak, I.; Chen, G.; Liu, K.; Jameel, S. GAMED: Knowledge Adaptive Multi-Experts Decoupling for Multimodal Fake News Detection. In Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM), Hannover, Germany, 10–14 March 2025; pp. 586–595.
8. Du, Z.; Wang, H.; Liu, J. Multimodal Fake News Detection Integrating Shared Representation and Contrastive Learning. Comput. Eng. Des. 2025, 46, 2879–2887.
9. Lao, A.; Zhang, Q.; Shi, C.; Cao, L.; Yi, K.; Hu, L.; Zhao, D. Frequency Spectrum is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18426–18434.
10. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
13. Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. SpotFake: A Multi-modal Framework for Fake News Detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; pp. 39–47.
14. Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal Variational Autoencoder for Fake News Detection. In Proceedings of the World Wide Web Conference (WWW), San Francisco, CA, USA, 13–17 May 2019; pp. 2915–2921.
15. Zhou, X.; Wu, J.; Zafarani, R. Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 11–14 May 2020; pp. 354–367.
16. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting Fake News by Exploring the Consistency of Multimodal Data. Inf. Process. Manag. 2021, 58, 102610.
17. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs. In Proceedings of the 25th ACM International Conference on Multimedia (MM), Mountain View, CA, USA, 23–27 October 2017; pp. 795–816.
18. Alzaidi, M.S.A.; Alshammari, A.; Hassan, A.Q.A.; Yousafzai, S.N.; Thaljaoui, A.; Fitriyani, N.L.; Kim, C.; Syafrudin, M. An Efficient Fusion Network for Fake News Classification. Mathematics 2024, 12, 3294.
19. Li, X.; Qiao, J.; Yin, S.; Wu, L.; Gao, C.; Wang, Z.; Li, X. A Survey of Multimodal Fake News Detection: A Cross-Modal Interaction Perspective. IEEE Trans. Emerg. Top. Comput. Intell. 2025, 9, 2658–2675.
20. Graves, A. Long Short-Term Memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Heidelberg, Germany, 2012; pp. 37–45.
21. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 18661–18673.
22. Zhou, Y.; Yang, Y.; Ying, Q.; Qian, Z.; Zhang, X. Multimodal Fake News Detection via CLIP-Guided Learning. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 2825–2830.
23. Gôlo, M.P.S.; de Souza, M.C.; Rossi, R.G.; Marcacini, R.M.; Rezende, S.O. One-class Learning for Fake News Detection through Multimodal Variational Autoencoders. Eng. Appl. Artif. Intell. 2023, 122, 106088.
24. Boididou, C.; Papadopoulos, S.; Dang-Nguyen, D.-T.; Boato, G.; Riegler, M.; Middleton, S.; Petlund, A.; Kompatsiaris, Y. Verifying Multimedia Use at MediaEval 2015. In Proceedings of MediaEval Benchmarking Initiative for Multimedia Evaluation, Wurzen, Germany, 14–15 September 2015.
25. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference (WWW), Lyon, France, 25–29 April 2022; pp. 2897–2905.
26. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), London, UK, 19–23 August 2018; pp. 849–857.
27. Zhang, H.; Fang, Q.; Qian, S.; Xu, C. Multi-modal Knowledge-aware Event Memory Network for Social Media Rumor Detection. In Proceedings of the 27th ACM International Conference on Multimedia (MM), Nice, France, 21–25 October 2019; pp. 1942–1951.
28. Wang, Z.; Sui, J. Multimodal Rumor Detection Model Based on Multi-Level Fusion. Comput. Eng. Des. 2022, 43, 1756–1761.
29. Zhao, Y.; Hao, K.; Zhao, J.; Xin, C. MMCSC: A Cross-modal Fake News Detection Method. J. Northeast. Univ. (Nat. Sci.) 2024, 45, 18–25.
30. Singhal, S.; Pandey, T.; Mreig, S.; Shah, R.; Kumaraguru, P. Leveraging Intra and Inter Modality Relationship for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference (WWW), Lyon, France, 25–29 April 2022; pp. 726–734.
31. Zhang, T.; Wang, D.; Chen, H.; Zeng, Z.; Guo, W.; Miao, C.; Cui, L. BDANN: BERT-based Domain Adaptation Neural Network for Multi-modal Fake News Detection. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.

Figure 1. Common fake news cases. (a) Woman, 36, gives birth to 14 children from 14 different fathers. (b) New species of fish found at Arkansas. (c) A shark just chilling down the freeway in NYC. (d) Little Syrian girl sells chewing gum on the street so she can feed herself. Examples are sourced from the publicly available Twitter dataset proposed in [10]. Facial regions have been anonymized for privacy protection.

Figure 2. Fake news detection method based on dual contrastive learning and variational autoencoder. The ⊗ symbol denotes element-wise multiplication.

Figure 3. CBAM structure diagram. The ⊗ symbol denotes element-wise multiplication between the attention maps and feature maps.

Figure 4. Schematic diagram of unsupervised contrastive learning. The ⊗ symbol denotes the dot-product similarity between query and key representations used to compute the contrastive loss.

Figure 5. Core idea diagram of supervised contrastive learning.

Figure 6. Contrastive learning parameter analysis chart.

Figure 7. Optimizer analysis chart.

Table 1. Comparison experimental results of fake news detection. The bold values in the tables indicate our proposed method and the best performance results, respectively.

Dataset	Model	Acc	Fake News Precision	Fake News Recall	Fake News F1	Real News Precision	Real News Recall	Real News F1
Weibo	Text-GRU	0.643	0.662	0.578	0.617	0.662	0.578	0.617
	Image-VGG	0.663	0.630	0.500	0.550	0.630	0.750	0.690
	EANN [26]	0.827	0.847	0.812	0.829	0.807	0.843	0.825
	MVAE [14]	0.824	0.854	0.769	0.809	0.802	0.875	0.837
	SAFE [15]	0.816	0.818	0.815	0.817	0.816	0.818	0.817
	MCNN [16]	0.823	0.858	0.801	0.828	0.787	0.848	0.816
	CAFE [25]	0.840	0.855	0.830	0.842	0.825	0.851	0.837
	BDANN [31]	0.842	0.830	0.870	0.850	0.850	0.820	0.830
	MFCD [28]	0.829	0.834	0.829	0.830	0.834	0.829	0.830
	MMCSC [29]	0.815	0.857	0.701	0.806	0.857	0.701	0.806
	MMF [8]	0.815	0.778	0.828	0.802	0.803	0.823	0.813
	VCLMMF	0.896	0.923	0.874	0.898	0.868	0.941	0.903
Twitter	Text-GRU	0.526	0.586	0.553	0.569	0.469	0.526	0.496
	Image-VGG	0.596	0.695	0.518	0.593	0.550	0.700	0.599
	EANN [26]	0.719	0.642	0.474	0.545	0.771	0.870	0.817
	MVAE [14]	0.745	0.801	0.719	0.758	0.689	0.777	0.730
	SAFE [15]	0.762	0.831	0.724	0.774	0.695	0.811	0.748
	MCNN [16]	0.784	0.778	0.781	0.779	0.790	0.787	0.788
	CAFE [25]	0.806	0.807	0.799	0.803	0.805	0.813	0.809
	BDANN [31]	0.830	0.810	0.630	0.710	0.830	0.930	0.880
	LIIMR [30]	0.831	0.836	0.832	0.830	0.825	0.830	0.827
	MMF [8]	0.871	0.889	0.840	0.864	0.894	0.863	0.878
	VCLMMF	0.904	0.943	0.871	0.905	0.868	0.941	0.903
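
The F1 columns in Table 1 can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the helper name `f1_score` is illustrative and not taken from the paper. The inputs are copied from the VCLMMF row for Weibo.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the VCLMMF Weibo row of Table 1.
rows = {
    "fake": (0.923, 0.874, 0.898),
    "real": (0.868, 0.941, 0.903),
}

for label, (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{label}: recomputed={recomputed:.3f} reported={reported:.3f}")
```

For this row the recomputed values agree with the reported F1 scores to three decimals. Because the table lists rounded precision and recall, other rows may differ in the last digit.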

Table 2. Experimental results on AI-generated and niche-topic datasets.

Model	Dataset	Acc	Fake News Precision	Fake News Recall	Fake News F1	Real News Precision	Real News Recall	Real News F1
VCLMMF	Weibo (AIGC)	0.891	0.918	0.870	0.893	0.865	0.936	0.899
VCLMMF	Weibo (Niche)	0.901	0.920	0.876	0.897	0.872	0.932	0.901
VCLMMF	Twitter (AIGC)	0.902	0.939	0.868	0.902	0.866	0.938	0.901
VCLMMF	Twitter (Niche)	0.907	0.941	0.873	0.906	0.870	0.939	0.903

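Table 3. Ablation results for the key modules. "w/o CL" and "w/o VAE" denote the model without the contrastive learning module and without the variational autoencoder, respectively.

Dataset	Test	Acc	F1 (Fake News)	F1 (Real News)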
Weibo	w/o CL	0.840	0.837	0.842
	w/o VAE	0.852	0.844	0.853
	VCLMMF	0.896	0.898	0.903
Twitter	w/o CL	0.842	0.850	0.830
	w/o VAE	0.881	0.873	0.902
	VCLMMF	0.904	0.905	0.903
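
The contribution of each module can be read off Table 3 as the accuracy lost when that module is removed. The sketch below is a small illustration of that arithmetic using the accuracy values copied from the table; the function name `accuracy_drops` is hypothetical.

```python
# Accuracy values copied from Table 3.
accuracy = {
    "Weibo": {"w/o CL": 0.840, "w/o VAE": 0.852, "VCLMMF": 0.896},
    "Twitter": {"w/o CL": 0.842, "w/o VAE": 0.881, "VCLMMF": 0.904},
}


def accuracy_drops(results: dict) -> dict:
    """Return full-model accuracy minus each ablated variant's accuracy."""
    full = results["VCLMMF"]
    return {
        name: round(full - acc, 3)
        for name, acc in results.items()
        if name != "VCLMMF"
    }


for dataset, results in accuracy.items():
    print(dataset, accuracy_drops(results))
```

On both datasets the drop from removing contrastive learning is larger than the drop from removing the variational autoencoder.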
