Article

A Compact GPT-Based Multimodal Fake News Detection Model with Context-Aware Fusion

1 School of Business, Shandong Normal University, Jinan 250357, China
2 School of Information Science and Electrical Engineering, Shandong Jiaotong University, Jinan 250357, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4755; https://doi.org/10.3390/electronics14234755
Submission received: 18 October 2025 / Revised: 28 November 2025 / Accepted: 28 November 2025 / Published: 3 December 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

With the rapid development of social networks, online news has gradually surpassed traditional print media and become a main channel for information dissemination. However, the proliferation of fake news also poses a serious threat to individuals and society. Since online news often involves multimodal content such as text and images, multimodal fake news detection has become increasingly important. To address the challenges of feature extraction and cross-modal fusion in this task, this study presents a new multimodal fake news detection model. The model uses a GPT-style encoder to extract textual semantic features, a ResNet backbone to extract visual features, and dynamically captures correlations between modalities through a context-aware multimodal fusion module. In addition, a joint optimization strategy combining contrastive loss and cross-entropy loss is designed to enhance modal alignment and feature discrimination while optimizing classification performance. Experimental results on the Weibo and PHEME datasets show that the proposed model outperforms baseline methods in accuracy, precision, recall, and F1-score, effectively captures correlations between modalities, and improves the quality of feature representation and overall model performance. These findings suggest that the proposed approach may serve as a useful tool for fake news detection on social platforms.

1. Introduction

With the rapid development of Internet technology, social media platforms such as Twitter, Weibo, and Facebook have become important channels for disseminating news. While these platforms make it easy for people to access news, they also significantly accelerate the spread of fake news. The proliferation of fake news on social media not only severely disrupts the balance of the online ecosystem but also triggers a series of deep-seated social problems. For example, certain fake news may exaggerate or distort facts, misleading the public’s perception of a particular event or individual, leading to unnecessary panic, anger, or prejudice, and ultimately affecting social stability and harmony. Traditional methods of detecting fake news are based mainly on text data, but in the face of increasingly complex and varied forms of news, their effectiveness has become insufficient [1]. Therefore, integrating multiple data modalities for fake news detection has become a hot and challenging area of research.
The core challenge of multimodal fake news detection lies in how to effectively integrate and utilize data from different modalities, especially text and image data, which each provide different aspects of news content. Text data usually contain the main content and contextual information of the news, while image data show the scene and atmosphere of the news through visual elements. However, due to the differences in representation dimensions and features between text and images, how to achieve high-quality feature extraction and cross-modal semantic alignment has become a key issue in improving detection performance [2].
For text feature extraction, traditional multimodal fake news detection models mostly use the BERT model [3] and its variants [4,5,6]. However, in recent years, the GPT-3 model [7] proposed by OpenAI has demonstrated strong language modeling and semantic understanding capabilities in natural language processing tasks [4]. GPT-3 is based on a decoder architecture, which differs from traditional encoder models, and can more effectively capture semantic dependencies between contexts and generate text representations with stronger semantic expressiveness [7]. At the same time, small GPT-3-style models strike a good balance between model size and performance. Therefore, this paper selects the TurkuNLP/gpt3-finnish-small model [8] (abbreviated as T-GPT in this paper) as the text encoder. This model follows the decoder-only architecture of GPT-3, has 186 million parameters, and combines efficient semantic feature extraction with moderate computational complexity. By extracting deep semantic features from news text using this model, we aim to improve the accuracy of multimodal fake news detection.
In terms of feature fusion, traditional multimodal data fusion methods, such as simple feature concatenation or weighted summation, often overlook the intrinsic connection and complementarity between different modal data. This can result in fused features that contain a large amount of redundant information, consequently reducing detection performance [9]. To address this issue, this study introduces a context-aware multimodal fusion module based on a cross-modal collaborative attention mechanism [4,10], which retains the original text and image semantics while achieving deep modeling of the dependency relationship between text and image features. This module models the correlation between modalities and dynamically weights and fuses multimodal features, thereby enhancing the expression of key features while retaining the original information, and thus improving the accuracy and robustness of classification [11].
In summary, this paper proposes a multimodal fake news detection model based on the fusion of the T-GPT model and a context-aware multimodal fusion method, named CA-MFD for short. This model uses T-GPT to encode text and extract deep semantic features, and extracts image features through ResNet34. Subsequently, a context-aware multimodal fusion module is applied to integrate the features from both modalities, thereby obtaining the fused feature representation. In the training phase, the cross-entropy loss and contrastive loss functions are used to guide the model to increase the discrimination ability and semantic consistency between modalities while performing classification. Finally, a classifier is used to detect fake news. Experiments on the Weibo and PHEME multimodal fake news datasets show that the proposed method is superior to traditional methods in terms of accuracy, and that the context-aware multimodal fusion method and contrastive learning play a key role in the multimodal fake news detection task. At the same time, the effectiveness of T-GPT in text feature extraction is verified.
The contributions of this article are summarized as follows:
  • We propose a multimodal fake news detection model based on T-GPT and context-aware multimodal fusion method (CA-MFD), which efficiently integrates semantic information from both image and text while simultaneously preserving the original text and image features. This approach effectively captures the correlations between modalities and enhances the representation of key features, thereby improving the accuracy and robustness of fake news detection.
  • We introduce T-GPT for text feature extraction in multimodal fake news detection. By leveraging T-GPT to encode text data, it successfully extracts deep semantic features, thereby providing strong support for subsequent fake news detection tasks.
  • We conduct extensive experiments on two publicly available datasets for multimodal fake news detection. The experimental results demonstrate that the proposed method achieves a significant performance improvement in fake news detection tasks.
  • We design a joint optimization strategy that combines contrastive loss and cross-entropy loss. This strategy enhances the model’s ability to align modalities and discriminate features while optimizing classification performance.
The rest of this paper is organized as follows. First, a review of the work related to our research is presented in Section 2. Then, the proposed model and its specific implementation method are detailed in Section 3. Finally, we analyze the experiments and summarize our work, providing an outlook for future work in Section 4 and Section 5, respectively.

2. Related Works

Early research on fake news detection was mainly based on text content, especially methods built on explicit, handcrafted features such as TF-IDF and n-grams combined with traditional machine learning models (SVM, random forest, etc.). With the development of deep learning, models such as RNN, LSTM, and Transformer have made breakthroughs in text understanding. Compared with traditional methods based on explicit features, deep learning methods can automatically learn latent feature representations and show better performance in fake news detection. For example, Yu et al. innovatively integrated convolutional neural networks (CNNs) into their model to explore deeper connections between basic features [12]. Chen et al. designed a fake news detection model based on RNN, which can learn hidden temporal patterns in news text sequences and generate feature vectors with specific focuses, thus obtaining context that reveals changes in news content over time [13]. Shen et al. applied transfer learning technology to fake news detection; by analyzing news comment data and fine-tuning the feature extraction network, the extracted comment text information can provide effective support for subsequent detection [14].
However, with the diversification of news dissemination methods, relying solely on textual information for fake news detection can no longer meet evolving needs. News containing images and videos is more likely to capture readers’ attention than traditional text-based news (see Figure 1), and these visual elements can also help us identify false information. In addition, social information such as comments, shares, and retweets on social media also provides valuable clues for identifying false information. Numerous studies [2,9,15] have confirmed that multimodal fake news detection models outperform unimodal models when evaluated on the same dataset. In recent years, researchers have proposed a variety of multimodal fake news detection methods, significantly improving detection accuracy by combining text, images, and social information. For example, the EANN model [2] proposed by Wang et al. combines the idea of adversarial learning to guide the model in learning event-independent features, thus removing event correlation and enhancing the model’s generalization ability. This model not only considers the textual content of news, but also integrates image and social information, enabling a multidimensional analysis to assess authenticity. Singhal et al. proposed a fake news detection framework called SpotFake [16] that analyzes inconsistencies across modalities; it uses the BERT model to extract text semantic features and VGG-19 to extract image features. These features are then fused to obtain the final multimodal feature representation, which is used for fake news detection and achieved promising results.
Since its proposal in 2014, the attention mechanism has become a crucial technology in deep learning, widely applied in natural language processing tasks. It has also inspired new approaches for fake news detection. The attention mechanism can dynamically focus on key textual information, enhancing the model’s ability to identify fake news characteristics. For instance, self-attention mechanisms effectively capture contextual information and key features in text, making them frequently integrated with other models for fake news detection. Asutosh Mohapatra et al. combined a self-attention mechanism with BiLSTM to simultaneously capture key features and bidirectional semantic dependencies in text [17]. Additionally, the co-attention mechanism can simultaneously focus on two or more input sequences and capture their correlations, demonstrating strong performance in multimodal detection tasks [18,19,20,21]. For example, the GCAN model leverages the Graph Convolutional Network (GCN) to model potential social relationships between users and integrates a co-attention mechanism to capture correlations between source tweets and user communication behaviors, enhancing both detection performance and interpretability in fake news detection [21]. The MCAN model captures the semantic associations between the two by focusing on key information in both text and images through a collaborative attention mechanism. Multiple collaborative attention layers are stacked at the same time to obtain the dependencies between different modalities and generate deeply integrated multimodal feature representations [19]. Among them, the bidirectional cross-modal attention mechanism is a special form of the collaborative attention mechanism. It can simultaneously model the bidirectional dependencies between modalities, thereby more comprehensively integrating multimodal information. For example, Chuanming Yu et al. [22] combined BERT and DeiT to encode text and images, respectively, and dynamically adjusted the integration of different modal features through bidirectional cross-modal attention and gating mechanisms, thereby improving the representation ability of the model and the accuracy and generalization ability of fake news detection. Existing methods commonly employ collaborative or bidirectional cross-modal attention mechanisms to capture the interactions between text and images. However, while emphasizing cross-modal alignment, these mechanisms often weaken the expressive power of original unimodal features, resulting in the weakening or loss of some key details. To address this issue, we propose a context-aware multimodal fusion method that preserves the original text and image features, based on bidirectional cross-modal attention, taking into account both modal interactions and modal details for fake news detection.
In 2019, Jacob Devlin et al. proposed the BERT model [3], which considers contextual information on both the left and right sides of each word to capture its contextual meaning and obtain a better text representation. It has been widely used in fake news detection and has proven to be a state-of-the-art model for text classification and disinformation verification tasks [23,24,25,26]. However, traditional BERT-like models have limitations in window length when processing long texts. To address this, this study uses the hidden-layer features of T-GPT (the mean of the last-layer states) as the text representation, replacing traditional BERT-like encoders. This choice is based on the autoregressive architecture of T-GPT, which may be better suited to long-distance semantic modeling under certain conditions, while using relatively few parameters (186M).

3. The Proposed Method

3.1. Problem Definition

The primary objective of the fake news detection task is to classify news items as either true or false using the text and images in a given news dataset. Let the dataset be denoted as a set $N = \{n_1, n_2, \ldots, n_m\}$, where $m$ represents the total number of news items. Each news item $n_k$ contains two modalities, text and image, represented as $n_k = \{T_k, I_k\}$, where $T_k$ is the text and $I_k$ is the image associated with the news item. Our task is to predict the label $y$ for each news item, where $y \in \{0, 1\}$, with 0 indicating true news and 1 indicating fake news.

3.2. Method Overview

In this paper, we propose a novel multimodal fake news detection model based on a GPT-style text encoder and context-aware multimodal fusion, named Context-Aware Multimodal Fake News Detection (CA-MFD). The proposed model aims to improve the accuracy of news classification by effectively integrating information from both textual and visual modalities. As illustrated in Figure 2, CA-MFD comprises four main components: (1) textual feature extraction, which encodes the text information into embeddings via T-GPT; (2) visual feature extraction, which encodes the visual information into embeddings via ResNet34; (3) context-aware multimodal fusion, which uses text-to-image enhanced attention and image-to-text enhanced attention to obtain text-enhanced image features and image-enhanced text features, respectively, and then obtains the multimodal representation by concatenating them with the original text and visual features, thereby generating richer modality representations; and (4) a classifier, implemented as a linear layer, which produces the final predictions. During training, a joint optimization strategy combining contrastive loss and cross-entropy loss is employed to enhance modal alignment and feature discrimination while optimizing classification performance.

3.3. Multimodal Feature Extraction

3.3.1. Text Embedding

The CA-MFD model employs a GPT-style pretrained model, TurkuNLP/gpt3-finnish-small from the Huggingface Transformers library, as the text encoder, hereafter referred to as T-GPT for short. This model follows the decoder-only architecture of GPT-3, which consists of 12 Transformer layers, each with 12 self-attention heads, a hidden size of 768, and approximately 186 million parameters. T-GPT was trained on large-scale corpora via next-token prediction, enabling it to capture rich semantic and contextual information. Compared with larger GPT variants, T-GPT maintains strong semantic modeling ability while achieving high computational efficiency, making it suitable as a compact text encoder in multimodal tasks.
Given a batch of raw textual inputs $T = \{T_1, T_2, \ldots, T_B\}$, where $B = 32$ is the batch size, each text $T_i$ is tokenized into subword units using a Byte-Pair Encoding (BPE) tokenizer. The resulting token sequences are uniformly padded or truncated to a fixed length of $L = 128$, yielding a token ID matrix $X \in \mathbb{R}^{B \times L}$.
The token IDs are fed into the pretrained T-GPT encoder along with an attention mask indicating valid token positions. The encoder generates contextualized representations for each token, resulting in a hidden-state tensor $H \in \mathbb{R}^{B \times L \times 768}$. To obtain the global textual representation, the hidden state of the first token in the final layer, $H[:, 0, :]$, is extracted to form the text feature vector $F_{\text{text}} \in \mathbb{R}^{B \times 768}$.
This feature serves as the global semantic embedding of the text for subsequent multimodal fusion and classification tasks.
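For illustration, the text-embedding step described above can be sketched as follows. This is a minimal example assuming the Hugging Face Transformers AutoTokenizer/AutoModel interface; the function name extract_text_features and the padding handling are illustrative rather than the released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "TurkuNLP/gpt3-finnish-small"  # decoder-only GPT-3-style model (~186M parameters)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def extract_text_features(texts, max_len=128):
    """Tokenize a batch of texts and return one global feature vector per text."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers may lack a pad token
    batch = tokenizer(texts, padding="max_length", truncation=True,
                      max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(input_ids=batch["input_ids"],
                          attention_mask=batch["attention_mask"])
    hidden = outputs.last_hidden_state   # (B, L, H); H = 768 for this model according to the paper
    return hidden[:, 0, :]               # first-token state used as the global text feature
```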

3.3.2. Visual Embedding

The CA-MFD model employs the classic convolutional neural network ResNet34 for image feature extraction. With its deep network structure and the advantages of residual learning, ResNet34 can accurately capture rich details and high-level semantic information in images, ensuring the accuracy and depth of feature extraction.
ResNet, or Residual Network, was proposed by Kaiming He and colleagues in 2016 as an approach to training very deep neural networks [27]. This network structure is composed of multiple sub-networks that can be stacked to construct an architecture of considerable depth. Figure 3 illustrates the core component of ResNet: the residual learning module.
In traditional deep learning, the loss and degradation of information during interlayer transmission is an unavoidable issue. This loss can lead to vanishing or exploding gradients, a major challenge that has long troubled researchers.
Table 1 illustrates the ResNet network architecture with different layer configurations.
Given an image input $I \in \mathbb{R}^{B \times 3 \times 224 \times 224}$, we use the ResNet34 backbone network to extract deep image features. The extracted feature map $Z = \mathrm{ResNet}^{*}(I) \in \mathbb{R}^{B \times 512 \times 7 \times 7}$ is passed through global average pooling to produce a 512-dimensional feature vector. To enable image–text contrastive learning, this 512-dimensional vector is then projected to 768 dimensions using a fully connected layer to obtain the final image representation.
$$F_{\text{img}} = \mathrm{FC}_{512 \rightarrow 768}\big(\mathrm{GAP}(\mathrm{ResNet}^{*}(I))\big) \in \mathbb{R}^{B \times 768}$$
where $I$ represents the input image and $B$ is the batch size; $\mathrm{ResNet}^{*}(\cdot)$ refers to the convolutional backbone of ResNet34, which includes the initial convolutional block and the four residual stages but excludes the global average pooling and the final fully connected layer; $\mathrm{GAP}(\cdot)$ denotes the global average pooling operation; and $\mathrm{FC}_{512 \rightarrow 768}(\cdot)$ is a fully connected layer that projects the image features from 512 dimensions to 768 dimensions.
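A corresponding sketch of the visual branch is given below, assuming the torchvision implementation of ResNet34; the module name VisualEncoder and the use of ImageNet-pretrained weights are assumptions made for illustration.

```python
import torch.nn as nn
from torchvision import models

class VisualEncoder(nn.Module):
    """ResNet34 backbone + global average pooling + 512->768 projection (sketch of the equation above)."""
    def __init__(self, out_dim=768):
        super().__init__()
        backbone = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)  # ImageNet weights (assumed)
        # Keep the stem and the four residual stages; drop the original avgpool and fc layers.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.gap = nn.AdaptiveAvgPool2d(1)    # GAP(.)
        self.proj = nn.Linear(512, out_dim)   # FC_{512->768}(.)

    def forward(self, images):                # images: (B, 3, 224, 224)
        z = self.features(images)             # (B, 512, 7, 7)
        z = self.gap(z).flatten(1)            # (B, 512)
        return self.proj(z)                   # (B, 768)
```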

3.4. Context-Aware Multimodal Fusion

The CA-MFD model proposed in this paper introduces context-aware multimodal fusion in the feature fusion stage. This approach is implemented based on the global semantic vectors of each modality. Specifically, it employs two efficient cross-modal attention mechanisms: one projects the contextual information of image features onto text features, and the other projects the contextual information of text features onto image features. These projections are then concatenated with the original global vectors of their respective modalities and passed through a fully connected fusion layer to generate the final multimodal representation.
Compared to traditional fine-grained co-attention mechanisms, this method performs interactions at the global semantic level, which significantly reduces computational and memory costs. Additionally, it facilitates the reuse of pretrained CLS tokens or global features, making it more robust and stable in resource-constrained or noisy environments. While effectively mining the semantic relevance and importance between text and images and achieving deep interaction and fusion between modalities, it also takes into account the expressive power of the original unimodal features, combining the two to generate a more semantically complete multimodal feature representation.
Specifically, the mechanism includes attention calculations in two directions:
Text-to-image attention: Taking the text features F text as the query, the image features F img as both key and value, we obtain the text-enhanced image features.
$$Q_{\text{text}} = F_{\text{text}} W_Q^{t2i}, \quad K_{\text{img}} = F_{\text{img}} W_K^{t2i}, \quad V_{\text{img}} = F_{\text{img}} W_V^{t2i}$$
$$A_{t2i} = \mathrm{softmax}\!\left(\frac{Q_{\text{text}} K_{\text{img}}^{\top}}{\sqrt{d_{\text{att}}}}\right) V_{\text{img}}$$
where $d_{\text{att}}$ denotes the dimension of the query and key projections in the attention mechanism.
Image-to-text attention: Using the image features F img as the query, the text features as both key and value, we obtain the image-enhanced text features.
$$Q_{\text{img}} = F_{\text{img}} W_Q^{i2t}, \quad K_{\text{text}} = F_{\text{text}} W_K^{i2t}, \quad V_{\text{text}} = F_{\text{text}} W_V^{i2t}$$
$$A_{i2t} = \mathrm{softmax}\!\left(\frac{Q_{\text{img}} K_{\text{text}}^{\top}}{\sqrt{d_{\text{att}}}}\right) V_{\text{text}}$$
After computing bidirectional cross-modal attention, this paper obtains the fused features by concatenating the original text features, the original image features, the text-enhanced image features, and the image-enhanced text features. The fused features $F_{\text{fusion}}$ are obtained as follows:
$$F_{\text{fusion}} = \mathrm{concat}(F_{\text{text}}, A_{t2i}, F_{\text{img}}, A_{i2t})$$
This fusion method fully retains the original modal information while using bidirectional attention interaction to obtain cross-modal complementary information, achieving dual feature enhancement and multi-level feature abstraction capability.
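The fusion module can be sketched as follows, a minimal PyTorch example of the bidirectional cross-modal attention and concatenation described above. The projection dimension d_att = 256 and the mean pooling over query positions (trivial when each modality contributes a single global vector) are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class ContextAwareFusion(nn.Module):
    """Bidirectional cross-modal attention with preserved unimodal features (sketch)."""
    def __init__(self, dim=768, d_att=256):
        super().__init__()
        self.scale = d_att ** 0.5
        # text-to-image direction (W_Q^{t2i}, W_K^{t2i}, W_V^{t2i})
        self.q_t2i, self.k_t2i, self.v_t2i = nn.Linear(dim, d_att), nn.Linear(dim, d_att), nn.Linear(dim, dim)
        # image-to-text direction (W_Q^{i2t}, W_K^{i2t}, W_V^{i2t})
        self.q_i2t, self.k_i2t, self.v_i2t = nn.Linear(dim, d_att), nn.Linear(dim, d_att), nn.Linear(dim, dim)

    def forward(self, f_text, f_img):
        # Accept (B, dim) global vectors or (B, L, dim) token sequences.
        if f_text.dim() == 2:
            f_text = f_text.unsqueeze(1)
        if f_img.dim() == 2:
            f_img = f_img.unsqueeze(1)
        # Text-to-image attention: text queries attend over image keys/values.
        w = torch.softmax(self.q_t2i(f_text) @ self.k_t2i(f_img).transpose(1, 2) / self.scale, dim=-1)
        a_t2i = (w @ self.v_t2i(f_img)).mean(dim=1)                    # (B, dim)
        # Image-to-text attention: image queries attend over text keys/values.
        w = torch.softmax(self.q_i2t(f_img) @ self.k_i2t(f_text).transpose(1, 2) / self.scale, dim=-1)
        a_i2t = (w @ self.v_i2t(f_text)).mean(dim=1)                   # (B, dim)
        # Concatenate original and cross-enhanced features: F_fusion of size (B, 4 * dim).
        return torch.cat([f_text.mean(dim=1), a_t2i, f_img.mean(dim=1), a_i2t], dim=-1)
```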

3.5. Classification

After completing the multimodal feature fusion, the CA-MFD model uses a fully connected layer with ReLU activation and dropout to perform a nonlinear transformation and regularization on the fused features to enhance the generalization ability of the model and alleviate overfitting. The process is as follows:
$$F_{\text{drop}} = \mathrm{Dropout}\big(\mathrm{ReLU}(W_f F_{\text{fusion}} + b_f),\, p\big)$$
Here, $F_{\text{fusion}}$ denotes the fused feature representation obtained from the cross-modal attention fusion in the previous step; $W_f$ and $b_f$ are the weight and bias of the fusion layer, respectively; and $p$ denotes the dropout probability.
This processed feature $F_{\text{drop}}$ is then fed into the classification layer, implemented as a fully connected layer, to generate the prediction result $\hat{y}$:
$$\hat{y} = W_c F_{\text{drop}} + b_c$$
where $W_c$ and $b_c$ are the weight matrix and bias vector of the classifier, respectively.
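A minimal sketch of this classification head is shown below; the hidden width of 768 and the dropout probability p = 0.3 are illustrative assumptions (the actual hyperparameters are those listed in Table 3).

```python
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fusion layer with ReLU and dropout, followed by a linear classifier (sketch)."""
    def __init__(self, in_dim=4 * 768, hidden_dim=768, num_classes=2, p=0.3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),       # W_f, b_f
            nn.ReLU(),
            nn.Dropout(p),
            nn.Linear(hidden_dim, num_classes),  # W_c, b_c
        )

    def forward(self, f_fusion):
        return self.head(f_fusion)               # logits for \hat{y}
```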

3.6. Model Learning

The total loss function in this study is formulated as the weighted average of cross-entropy loss and contrastive loss, with the objective of optimizing classification performance and feature representation capability at the same time, while improving the generalizability of the model. The loss function is defined as follows:
$$L = \alpha \cdot L_{CE} + (1 - \alpha) \cdot L_{con}$$
Here, $\alpha = 0.5$. The cross-entropy loss $L_{CE}$ supervises the classification task and is computed as:
$$L_{CE} = -\frac{1}{N_b} \sum_{i=1}^{N_b} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
where $N_b$ is the number of samples in the current batch, $C$ is the number of categories, $y_{i,c}$ is the true label of sample $i$ for category $c$ ($y_{i,c} = 1$ if sample $i$ belongs to category $c$, and 0 otherwise), and $\hat{y}_{i,c}$ is the predicted probability.
The contrastive loss is used to enhance the discriminative ability of features, and its calculation formula is:
$$L_{con} = -\frac{1}{N_b} \sum_{i=1}^{N_b} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(f_i \cdot f_p / \tau)}{\sum_{a \in A(i)} \exp(f_i \cdot f_a / \tau)}$$
where $N_b$ is the number of news samples in the current batch, $f_i$ is the feature vector of news sample $i$, $P(i)$ is the set of other samples in the batch with the same label as sample $i$ (positive samples), $A(i)$ is the set of all samples in the current batch except sample $i$ (including both positive and negative samples), and $\tau$ is the temperature parameter that controls the smoothness of the loss function.
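The joint objective can be sketched as follows. This is an illustrative implementation assuming L2-normalized features (cosine-style similarity) and a temperature of 0.5; these choices, along with the function names, are assumptions rather than the exact released code.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.5):
    """Sketch of L_con: pull together samples with the same label, push apart the rest."""
    f = F.normalize(features, dim=-1)                # feature normalization is an assumption
    sim = f @ f.T / temperature                      # (N_b, N_b) pairwise similarities
    n = f.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=f.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))  # A(i): all samples except the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(dim=1).clamp(min=1)    # |P(i)|, guarded against anchors with no positives
    loss_per_anchor = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss_per_anchor.mean()

def total_loss(logits, labels, features, alpha=0.5, temperature=0.5):
    """L = alpha * L_CE + (1 - alpha) * L_con, with alpha = 0.5 as in the paper."""
    return alpha * F.cross_entropy(logits, labels) + \
           (1 - alpha) * supervised_contrastive_loss(features, labels, temperature)
```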
During training, we use the validation set to monitor the model’s performance and apply early stopping to prevent overfitting. At the same time, the learning rate is adjusted after each epoch to accelerate convergence.
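The training procedure with early stopping and per-epoch learning-rate adjustment can be sketched as follows; the optimizer, the step-decay schedule, and the patience value are illustrative assumptions, and loss_fn stands for the joint objective defined above.

```python
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=50, lr=1e-4, patience=5):
    """Training loop sketch: Adam, per-epoch LR decay, validation-based early stopping."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    best_val, wait = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for texts, images, labels in train_loader:
            logits, features = model(texts, images)   # assumed model interface
            loss = loss_fn(logits, labels, features)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                              # adjust the learning rate after each epoch

        # Monitor validation loss for early stopping.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for texts, images, labels in val_loader:
                logits, features = model(texts, images)
                val_loss += loss_fn(logits, labels, features).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        if val_loss < best_val:
            best_val, wait = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")
        else:
            wait += 1
            if wait >= patience:                      # stop when validation stops improving
                break
```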

4. Experiments

4.1. Dataset

4.1.1. Dataset Description

To evaluate our model, we selected two real-world datasets: Weibo and PHEME. Weibo is a dataset of fake news tweets collected by Jin et al. on Weibo [9]. This dataset covers all fake news posts that were officially debunked by Weibo from May 2012 to January 2016, forming a comprehensive collection of fake news samples. This dataset not only integrates various representations, such as text and images, but also includes social background information closely related to the news content, providing rich dimensions for analysis. In this Weibo dataset, real news was verified by the authoritative Chinese news agency, Xinhua News Agency, while fake news was validated by Weibo’s official debunking system. Each tweet records the text content, accompanying images or videos, as well as relevant social background information.
The PHEME [28] dataset was created by collecting tweets related to five breaking news events on Twitter: the Germanwings crash, the Charlie Hebdo shooting, the Sydney siege, the Ferguson unrest, and the Ottawa shooting. Each data entry contains text, image URLs, thread topology, and user relationships. However, many of the original image URLs are no longer accessible. In this work, we adopted the publicly available version of PHEME with images provided by Zhao et al. [29]. Following prior studies [29,30], we retained only the samples containing both text and images, resulting in a subset of 3670 instances.
In this experiment on the two datasets, we used only the text and image information. Given the common redundancy and noise of tweets on social media, we employed a series of data preprocessing methods to enhance data quality, ensuring the effectiveness and accuracy of subsequent experiments. The statistical information of the datasets is shown in Table 2:

4.1.2. Preprocessing

The preprocessing of the Weibo dataset included two aspects: text data preprocessing and image data preprocessing.
Text preprocessing involved basic data cleaning. We used regular expressions to remove HTML tags, special characters, and extra spaces, convert numerical values to text, and delete irrelevant information such as advertisements and citations. We also employed Unicode encoding to ensure data consistency. The cleaned text was then fed into the tokenizer of the pretrained model, which handled subword segmentation and vocabulary mapping automatically. These steps aimed to eliminate irrelevant information, standardize the format, optimize feature extraction and model training, and improve the overall performance and effectiveness of NLP tasks.
To ensure that the model could efficiently and accurately extract image features, meticulous cleaning of the image data was necessary. First, the pixel size of the images was standardized to reduce processing time and maintain data consistency. Second, damaged and duplicate images were removed to avoid resource waste and interference with feature extraction. Furthermore, OpenCV’s Laplacian operator was utilized to assess image clarity, with a threshold set to distinguish between clear and blurry images (here, the threshold was set to 100, with images above the threshold considered clear, and those below regarded as blurry; an example is shown in Figure 4). Only images with high clarity were retained to optimize dataset quality. This process effectively enhanced the usability of the image data and laid a solid foundation for model training.
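For illustration, the clarity filter can be sketched as below, using the variance of the OpenCV Laplacian with the threshold of 100 mentioned above; resizing before scoring and the function name are assumptions made for this sketch.

```python
import cv2

def is_clear(image_path, size=(224, 224), threshold=100.0):
    """Return True if the image is readable and its Laplacian variance exceeds the threshold."""
    img = cv2.imread(image_path)
    if img is None:
        return False                               # unreadable or damaged image: discard
    img = cv2.resize(img, size)                    # standardize pixel size (assumed here)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= threshold                  # above threshold -> considered clear
```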
For the PHEME dataset, following prior work [29,30], we filtered out all text-only tweets, retaining only those containing both text and images. This filtering yielded 3670 multimodal instances. After that step, both text and image data underwent exactly the same preprocessing pipeline as described above for the Weibo dataset.

4.1.3. Evaluation Metrics

To evaluate the effectiveness of the fake news detection model, we employed a series of evaluation metrics commonly used in classification tasks, which together provide a comprehensive measure of the model’s performance. Specifically, these metrics included accuracy, precision, recall, and F1-score. To clearly understand these metrics, we first need to define a few basic concepts:
  • True Positives (TP): The number of positive samples accurately predicted by the model.
  • True Negatives (TN): The number of negative samples accurately predicted by the model.
  • False Positives (FP): The number of actual negative samples incorrectly classified as positive by the model, also known as false alarms.
  • False Negatives (FN): The number of actual positive samples incorrectly classified as negative by the model, also known as missed detections.
Based on the above definitions, we further explain the meaning of each evaluation metric. Accuracy refers to the proportion of correctly predicted samples among all samples. Its calculation formula can be expressed as:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
It reflects the overall predictive ability of the model; however, it may not be accurate enough when the distribution of positive and negative samples is highly imbalanced. Precision refers to the proportion of samples predicted as positive by the model that are actually positive. Its calculation formula is as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
This metric assesses the reliability of the model’s positive predictions. Recall refers to the proportion of correctly predicted positive samples out of all actual positive samples. Its calculation formula is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
It reflects the model’s ability to identify all positive samples, but a high recall may come at the expense of precision. The F1-score is a comprehensive metric that combines precision and recall. It balances and considers the performance of both indicators by calculating their harmonic mean. The F1-score provides a holistic assessment of performance, reflecting both the accuracy of the model’s positive predictions and its ability to identify all positive samples. The calculation formula can be expressed as follows:
$$F1 = \frac{2TP}{2TP + FP + FN}$$
By comprehensively utilizing these evaluation metrics, we can conduct a thorough and detailed assessment of the effectiveness of the fake news detection model.
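In practice, these metrics can be computed directly from the predictions, for example with scikit-learn as sketched below (macro averaging, as used in Section 4.3.2).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_predictions(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall, and F1-score."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```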

4.2. Baselines

In this paper, to validate the efficacy of our proposed model, we compared it with ten multimodal baseline methods in terms of accuracy, precision, recall and F1-score. To further explore the effectiveness of the generative text encoder within the proposed model, we conducted experiments by replacing T-GPT with four models from the BERT family. The multimodal baseline methods are listed, and the BERT family models are described in the following.

4.2.1. Multimodal Methods

EANN [2]: The EANN model uses Text-CNN and VGG-19 to extract text and image features, respectively, and then concatenates these features and feeds them into a classifier for fake news detection. By introducing the event adversarial mechanism, EANN can learn event-invariant multimodal feature representations, effectively addressing the challenge of fake news detection on new events.
att-RNN [9]: The att-RNN model uses an LSTM-based recurrent neural network combined with an attention mechanism to detect rumors by integrating multimodal features, including text, social context, and image features, in two stages.
MVAE [31]: The MVAE model maps the semantic features extracted by Bi-LSTM and the news visual features extracted by VGG-19 into a shared semantic space through the VAE framework and uses reconstruction error and modal consistency to discover anomalies of image–text mismatch to detect fake news.
CAFE [32]: To address the problem of cross-modal ambiguity, CAFE introduces the KL divergence from the perspective of information theory to quantify cross-modal ambiguity. The model constructs cross-modal alignment, ambiguity learning, and fusion modules, which can adaptively aggregate cross-modal features and single-mode features, thereby alleviating the problem of misclassification caused by modality discrepancies.
SAFE [33]: The SAFE model incorporates a similarity-aware mechanism to enhance the alignment between text and image features. By leveraging contrastive learning, SAFE effectively reduces the modality gap and improves the model’s robustness against adversarial attacks.
HMCAN [4]: The HMCAN model extracts text features through BERT and image features through ResNet50, fuses text and image representations via a multimodal contextual attention network, captures the hierarchical semantics of the text through a hierarchical encoding network, and finally classifies fake news using a classifier.
SpotFake [16]: The SpotFake model uses pretrained BERT to extract text semantic features and VGG-19 to extract visual features of news images and then fuses and inputs them into a fully connected classifier for fake news classification.
SpotFake+ [34]: Based on SpotFake, SpotFake+ upgrades the text encoder from BERT to pretrained XLNet and uses VGG-19 as the image encoder to achieve more effective text–image multimodal fusion, thereby improving the performance of fake news detection.
FND-CLIP [35]: The FND-CLIP model uses BERT and ResNet to extract unimodal features of text and images, respectively, and utilizes the CLIP pretrained model to extract and align cross-modal features. It guides feature fusion through cross-modal similarity, thereby effectively integrating multimodal information for fake news detection.
MCOT [36]: The MCOT model extracts text and image features through BERT and ViT, uses cross-modal attention to enhance features, and finally fuses this information for fake news classification.

4.2.2. BERT Series Models

To further verify the effectiveness of T-GPT as a text encoder, we replaced it with a series of BERT-based pretrained models, including BERT, LERT, PERT, and MiniRBT.
BERT [3]: The original bidirectional pretrained model based on Transformer for contextual embedding which realizes bidirectional context modeling through masked language modeling (MLM) and next-sentence prediction (NSP) tasks and has good language representation capabilities.
LERT [37]: A linguistically enhanced pretraining model proposed by the Harbin Institute of Technology and iFlytek Joint Laboratory (HFL). During pretraining, it incorporates linguistic features such as part-of-speech tagging, named-entity recognition, and dependency syntax. It combines this with a masked language model for multi-task learning and employs a hierarchical task weighting strategy (LIP). This model outperforms mainstream models in Chinese comprehension tasks.
PERT [38]: A BERT-style model pretrained with a permuted language model objective; rather than masking tokens, it shuffles a portion of the input text and trains the model to predict each token’s original position, enabling self-supervised learning of contextual representations without [MASK] tokens.
MiniRBT [39]: MiniRBT is a compact model derived from RoBERTa through knowledge distillation. Compared to RoBERTa, it has fewer layers and parameters. While retaining much of RoBERTa’s representational power, MiniRBT significantly improves efficiency, making it suitable for resource-constrained environments.

4.3. Comparison Experiments

4.3.1. Experimental Settings

In this study, we conducted the experiments in an Ubuntu environment using Python 3.7 and PyTorch 1.13.1. Following previous works [2,31,40,41,42], the dataset was split into training, validation, and test sets with a 7:1:2 ratio, with the most recent news assigned to the test set to avoid temporal leakage.
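The chronological 7:1:2 split can be sketched as follows; the assumption that each sample carries a timestamp field is made for illustration only.

```python
def temporal_split(samples, ratios=(0.7, 0.1, 0.2)):
    """Sort by time and hold out the most recent news as the test set (7:1:2)."""
    samples = sorted(samples, key=lambda s: s["timestamp"])   # oldest first
    n = len(samples)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]                          # most recent news
    return train, val, test
```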
The hyperparameter settings are shown in Table 3.

4.3.2. Overall Performance

In order to comprehensively evaluate the performance of the proposed multimodal fake news detection model CA-MFD, this paper designed a series of comparative experiments on the Weibo [9] and PHEME [28] multimodal fake news datasets and selected representative fake news detection models such as EANN, att-RNN, MVAE, CAFE, SAFE, HMCAN, SpotFake, SpotFake+, FND-CLIP, and MCOT as baseline methods for a comprehensive analysis.
In the comparative experiments, in addition to comparing the performance with multimodal fake news detection models such as CAFE and EANN, this paper also replaced the text encoder in the CA-MFD model with multiple BERT series models, such as BERT [3], MiniRBT [39], PERT [38], and LERT [37], to verify the effectiveness of GPT-based encoder in the fake news detection task. The performance evaluation adopted indicators widely used in various classification tasks, including accuracy, precision, recall, and F1-score. Among them, the calculation of precision, recall, and F1-score was based on the indicator values of the two categories of “real news” and “fake news”, and the macro-average method was used to reduce the impact of imbalanced category distribution on the results. These indicators reflected the effectiveness of the model in fake news detection from different perspectives. For example, precision measured the accuracy of the model’s prediction of the positive class, and recall reflected the model’s ability to identify positive samples.
As shown in Table 4, the CA-MFD model achieved the best performance in all four main evaluation indicators, showing strong discrimination ability and overall advantages. The model’s accuracy was 0.903, slightly higher than the second-best model, MCAN; the macro-average precision, recall and F1-score were 0.905, 0.903, and 0.9024, respectively, all of which were better than other comparison models. Although the traditional models MPFN and EANN performed stably in some indicators, there was still a gap in overall performance; HMCAN and MCAN, proposed in recent years, performed similarly but were slightly inferior in all key indicators. Thus, our experimental results show that the model proposed in this paper has better comprehensive performance in the fake news detection task.
To further explore the effectiveness of the generative text encoder within the CA-MFD model, we conducted experiments by replacing T-GPT with several BERT-based models, including BERT, LERT, PERT, and MiniRBT. The experimental results are shown in Table 5.
As shown in Table 5, T-GPT outperformed the other models on the Weibo dataset with an accuracy of 91.0%, which was higher than the 88.9% of BERT and its variants LERT (89.61%), PERT (87.53%), and MiniRBT (85.61%). In fine-grained classification, although T-GPT achieved slightly lower precision on fake news and lower recall on real news than the BERT family models, it excelled across all other metrics. Notably, it obtained the highest F1-score, a key metric for balanced performance that, alongside accuracy, is often used as the sole reported metric in many studies [44]. On the macro-average scores, T-GPT led all evaluation metrics, with a precision of 0.9106, a recall of 0.9104, and an F1-score of 0.9103. From Table 5, we observe similar results on the PHEME dataset, demonstrating that T-GPT offers greater balance and robustness in overall classification performance.
Although BERT and its derivative models, such as LERT and MiniRBT, performed stably in some tasks, they did not surpass T-GPT, especially MiniRBT, which exhibited the lowest performance across all evaluation metrics, indicating its limitations in this task.
These results demonstrate that T-GPT, despite being relatively compact, outperforms BERT family models in multiple key indicators of this task. It exhibits strong generalization capabilities in text classification tasks, which may be due to its advantage as a large-scale pretrained model, making it better at capturing complex patterns in text. Therefore, although the BERT series of models are widely used in natural language processing, T-GPT achieves better classification performance in the fake news detection task while maintaining a compact architecture, reflecting a better balance between efficiency and effectiveness.
It is worth noting that we also tried to use bert-base-uncased as the text encoder, but the performance on both datasets was lower than that of the gpt3-finnish-small model currently used, which may be more suitable for our fusion process due to its larger parameter count and autoregressive structure. Since all variants share the same encoder, the comparative advantage of CA-MFD still holds.

4.3.3. Ablation Study

We designed six ablation experiments to investigate the influence of the context-aware multimodal fusion module and the contrastive learning strategy on model performance. “CA-MFD(Image only)” denotes the model with only ResNet34; “CA-MFD(Text only)” denotes our model with only the T-GPT text encoder; “CA-MFD w/o CA” denotes our model without context-aware multimodal fusion, where we directly concatenate the text and image features for classification; “CA-MFD(Con CoAtt)” refers to the variant with the context-aware multimodal fusion module replaced by a conventional co-attention mechanism; “CA-MFD w/o Contrast” indicates that the model is trained using only the cross-entropy loss, without the contrastive loss term; and “CA-MFD w/o CA&Contrast” denotes our model without both the context-aware multimodal fusion module and the contrastive loss term. The results are summarized in Table 6.
From Table 6, we observe that removing either the context-aware multimodal fusion module or the contrastive learning strategy leads to a decline in all four metrics. Specifically, eliminating the context-aware multimodal fusion module decreases the F1-score from 91.03% to 88.99%, indicating that it strengthens cross-modal interaction effectively. Similarly, excluding the contrastive learning term reduces the F1-score to 89.03%, demonstrating its importance in improving the model’s representation quality and classification ability. The performance comparison is visualized in Figure 5.
The complete model achieves the best overall performance, confirming that the two modules are both essential and complementary. Removing either one results in a performance drop, and removing both simultaneously causes a more significant drop, for example, from 90.13% to 86.76% on the PHEME dataset. This is primarily due to the complementary nature of the two modules: the CA module enhances cross-modal interaction modeling capabilities, while contrastive learning improves the generalization and robustness of feature representations, resulting in a synergistic effect between the two.
In addition, we evaluated a variant that replaced the context-aware fusion (CA) module with a conventional co-attention mechanism. As shown in Table 6, this variant achieved 85.40% accuracy and a fake news F1-score of 85.00% on the Weibo dataset, while the text-only baseline achieved 87.00% and 85.93%, respectively; on the PHEME dataset, the accuracy and fake news F1-score of the variant were 85.03% and 84.26%, respectively, higher than the 82.01% and 69.64% of the text-only baseline. This inconsistency highlights the risk of naive cross-modal fusion. In contrast, our complete CA-MFD model achieved the best performance on both datasets, demonstrating the effectiveness of context-aware fusion with preserved unimodal representations.
Overall, the ablation results validate the independent contribution of each module and the performance gain brought by its combination.

5. Conclusions

In this study, we proposed a Context-Aware Multimodal Fake News Detection (CA-MFD) model that integrated textual and visual information through a compact design. Specifically, a moderate-scale GPT-based text encoder was utilized to efficiently capture semantic representation, while ResNet34 was employed to extract image features. Then, a context-aware multimodal fusion module was introduced to dynamically model cross-modal correlations while retaining unimodal details. In addition, a joint optimization strategy combining cross-entropy loss and contrastive loss was used to enhance feature discrimination and modal alignment.
Experimental results on the Weibo and PHEME datasets demonstrated that the proposed model outperformed the baselines in terms of accuracy, precision, recall, and F1-score. These results verify the effectiveness of the proposed fusion mechanism and loss design in improving feature representation and overall detection performance.
Notably, the proposed model utilizes a compact architecture based on moderately sized components and offers a practical trade-off between performance and model scale. This practical contribution aligns with the growing demand for scalable and effective misinformation mitigation tools in today’s fast-paced online ecosystem.
In future work, we are planning to extend our approach to incorporate more modalities, such as video and audio, and to explore cross-lingual detection scenarios. We also aim to investigate the integration of external knowledge to enhance model generalization ability across multiple domains.

Author Contributions

Conceptualization, Z.C. and F.L.; Methodology, Z.C.; Software, P.G.; Validation, Z.C.; Formal Analysis, Z.C.; Investigation, Z.C.; Resources, F.L.; Data Curation, P.G. and Z.C.; Writing—Original Draft Preparation, Z.C.; Writing—Review and Editing, F.L., Z.C., and P.G.; Visualization, P.G. and Z.C.; Supervision, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shandong Jiaotong University under grant number 2023266.

Data Availability Statement

The datasets analyzed in this study are publicly available. Specifically, the Weibo [9] dataset was released by Jin et al. and can be accessed at https://drive.google.com/drive/folders/1SYHLEMwC8kCibzUYjOqiLUeFmP2a19CH?usp=sharing (accessed on 1 March 2025). The PHEME [28] dataset was released by Zubiaga et al., and the version we used can be found in the following repository: https://github.com/zhaowanqing/FCN-LP [29] (accessed on 1 March 2025). All links were confirmed accessible during manuscript revision.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BERTBidirectional Encoder Representations from Transformers
PERTPre-training BERT with Permuted Language Model
LERTLinguistically motivated bidirectional Encoder Representation from Transformer
MiniRBTMini RoBERTa (A Two-stage Distilled Small Chinese Pre-trained Model)
GPT-3Generative Pre-trained Transformer 3
ResNetResidual Network
CAContext-Aware
MFDMultimodal Fake News Detection
T-GPTTurkuNLP/gpt3-finnish-small model

References

  1. Comito, C.; Caroprese, L.; Zumpano, E. Multimodal Fake News Detection on Social Media: A Survey of Deep Learning Techniques. Soc. Netw. Anal. Min. 2023, 13, 101. [Google Scholar] [CrossRef]
  2. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 849–857. [Google Scholar] [CrossRef]
  3. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  4. Qian, S.; Wang, J.; Hu, J.; Fang, Q.; Xu, C. Hierarchical Multi-modal Contextual Attention Network for Fake News Detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 153–162. [Google Scholar] [CrossRef]
  5. Al Obaid, A.; Khotanlou, H.; Mansoorizadeh, M.; Zabihzadeh, D. Multimodal Fake-News Recognition Using Ensemble of Deep Learners. Entropy 2022, 24, 1242. [Google Scholar] [CrossRef] [PubMed]
  6. Segura-Bedmar, I.; Alonso-Bartolome, S. Multimodal Fake News Detection. Information 2022, 13, 284. [Google Scholar] [CrossRef]
  7. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  8. TurkuNLP Research Group. TurkuNLP Finnish GPT-3 Models. 2022. Available online: https://huggingface.co/TurkuNLP/gpt3-finnish-small (accessed on 15 April 2025).
  9. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 795–816. [Google Scholar] [CrossRef]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  11. Gan, C.; Fu, X.; Feng, Q.; Zhu, Q.; Cao, Y.; Zhu, Y. A Multimodal Fusion Network with Attention Mechanisms for Visual–Textual Sentiment Analysis. Expert Syst. Appl. 2024, 242, 122731. [Google Scholar] [CrossRef]
  12. Yu, F.; Liu, Q.; Wu, S.; Wang, L.; Tan, T. A Convolutional Approach for Misinformation Identification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17), Melbourne, Australia, 19–25 August 2017; pp. 3901–3907. [Google Scholar]
  13. Chen, T.; Li, X.; Yin, H.; Zhang, J. Call Attention to Rumors: Deep Attention Based Recurrent Neural Networks for Early Rumor Detection. In Trends and Applications in Knowledge Discovery and Data Mining; Ganji, M., Rashidi, L., Fung, B.C.M., Wang, C., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 40–52. [Google Scholar]
  14. Shen, R.; Pan, W.; Peng, C.; Yin, P. A Microblog Rumor Detection Method Based on Multi-Task Learning. Comput. Eng. Appl. 2021, 57, 192–197. [Google Scholar]
  15. Qian, S.; Hu, J.; Fang, Q.; Xu, C. Knowledge-Aware Multi-modal Adaptive Graph Convolutional Networks for Fake News Detection. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–23. [Google Scholar] [CrossRef]
  16. Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. SpotFake: A Multi-modal Framework for Fake News Detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; IEEE: New York, NY, USA, 2019; pp. 39–47. [Google Scholar] [CrossRef]
  17. Mohapatra, A.; Thota, N.; Prakasam, P. Fake News Detection and Classification Using Hybrid BiLSTM and Self-Attention Model. Multimed. Tools Appl. 2022, 81, 18503–18519. [Google Scholar] [CrossRef]
  18. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  19. Wu, Y.; Zhan, P.; Zhang, Y.; Wang, L.; Xu, Z. Multimodal Fusion with Co-Attention Networks for Fake News Detection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2560–2569. [Google Scholar] [CrossRef]
  20. Hu, L.; Zhao, Z.; Qi, W.; Song, X.; Nie, L. Multimodal Matching-Aware Co-Attention Networks with Mutual Knowledge Distillation for Fake News Detection. Inf. Sci. 2024, 664, 120310. [Google Scholar] [CrossRef]
  21. Lu, Y.J.; Li, C.T. GCAN: Graph-aware Co-Attention Networks for Explainable Fake News Detection on Social Media. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 505–514. [Google Scholar] [CrossRef]
  22. Yu, C.; Ma, Y.; An, L.; Li, G. BCMF: A bidirectional cross-modal fusion model for fake news detection. Inf. Process. Manag. 2022, 59, 103063. [Google Scholar] [CrossRef]
  23. Anggrainingsih, R.; Hassan, G.M.; Datta, A. Evaluating BERT-based Pre-Training Language Models for Detecting Misinformation. arXiv 2022, arXiv:2203.07731. [Google Scholar]
  24. Pelrine, K.; Danovitch, J.; Rabbany, R. The Surprising Performance of Simple Baselines for Misinformation Detection. In Proceedings of the Web Conference, WWW ’21, Ljubljana, Slovenia, 19–23 April 2021; pp. 3432–3441. [Google Scholar] [CrossRef]
  25. Kula, S.; Choraś, M.; Kozik, R. Application of the Bert-Based Architecture in Fake News Detection. In Proceedings of the 13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020), Burgos, Spain, 16–18 September 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 239–249. [Google Scholar]
  26. Cui, W.; Zhang, X.; Shang, M. Multi-View mutual learning network for multimodal fake news detection. Expert Syst. Appl. 2025, 279, 127407. [Google Scholar] [CrossRef]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  28. Zubiaga, A.; Liakata, M.; Procter, R. Exploiting Context for Rumour Detection in Social Media. In Proceedings of the Social Informatics; Ciampaglia, G.L., Mashhadi, A., Yasseri, T., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 109–123. [Google Scholar]
  29. Zhao, W.; Nakashima, Y.; Chen, H.; Babaguchi, N. Enhancing Fake News Detection in Social Media via Label Propagation on Cross-Modal Tweet Graph. In Proceedings of the 31st Acm International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 2400–2408. [Google Scholar] [CrossRef]
  30. Liu, H.; Wang, W.; Sun, H.; Rocha, A.; Li, H. Robust Domain Misinformation Detection via Multi-Modal Feature Alignment. IEEE Trans. Inf. Forensics Secur. 2024, 19, 793–806. [Google Scholar] [CrossRef]
  31. Khattar, D.; Goud, J.S.; Gupta, M.; Varma, V. MVAE: Multimodal Variational Autoencoder for Fake News Detection. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2915–2921. [Google Scholar] [CrossRef]
  32. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-Modal Ambiguity Learning for Multimodal Fake News Detection. In Proceedings of the ACM Web Conference 2022, Virtual Event, Lyon, France, 25–29 April 2022; pp. 2897–2905. [Google Scholar] [CrossRef]
  33. Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-Aware Multi-modal Fake News Detection. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, 11–14 May 2020; Proceedings, Part II. Springer International Publishing: Cham, Switzerland, 2020; pp. 354–367. [Google Scholar] [CrossRef]
  34. Singhal, S.; Kabra, A.; Sharma, M.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P. SpotFake+: A Multimodal Framework for Fake News Detection via Transfer Learning (Student Abstract). Proc. AAAI Conf. Artif. Intell. 2020, 34, 13915–13916. [Google Scholar] [CrossRef]
  35. Zhou, Y.; Yang, Y.; Ying, Q.; Qian, Z.; Zhang, X. Multimodal Fake News Detection via CLIP-guided Learning. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo, Brisbane, Australia, 10–14 July 2023; pp. 2825–2830. [Google Scholar] [CrossRef]
  36. Shen, X.; Huang, M.; Hu, Z.; Cai, S.; Zhou, T. Multimodal Fake News Detection with Contrastive Learning and Optimal Transport. Front. Comput. Sci. 2024, 6, 1473457. [Google Scholar] [CrossRef]
  37. Cui, Y.; Che, W.; Wang, S.; Liu, T. LERT: A Linguistically-motivated Pre-trained Language Model. arXiv 2022, arXiv:2211.05344. [Google Scholar]
  38. Cui, Y.; Yang, Z.; Liu, T. PERT: Pre-training BERT with Permuted Language Model. arXiv 2022, arXiv:2203.06906. [Google Scholar] [CrossRef]
  39. Yao, X.; Yang, Z.; Cui, Y.; Wang, S. MiniRBT: A Two-stage Distilled Small Chinese Pre-trained Model. arXiv 2023, arXiv:2304.00717. [Google Scholar]
  40. Zhang, T.; Wang, D.; Chen, H.; Zeng, Z.; Guo, W.; Miao, C.; Cui, L. BDANN: BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News Detection. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
  41. Wei, L.; Hu, D.; Zhou, W.; Hu, S. Modeling Both Intra- and Inter-Modality Uncertainty for Multimodal Fake News Detection. IEEE Trans. Multimed. 2023, 25, 7906–7916. [Google Scholar] [CrossRef]
  42. Mu, Y.; Song, X.; Bontcheva, K.; Aletras, N. Examining the limitations of computational rumor detection models trained on static datasets. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 6739–6751. [Google Scholar]
  43. Nan, Q.; Cao, J.; Zhu, Y.; Wang, Y.; Li, J. MDFEND: Multi-domain Fake News Detection. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, Queensland, Australia, 1–5 November 2021; pp. 3343–3347. [Google Scholar] [CrossRef]
  44. Tong, Y.; Lu, W.; Zhao, Z.; Lai, S.; Shi, T. MMDFND: Multi-modal Multi-Domain Fake News Detection. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 1178–1186. [Google Scholar] [CrossRef]
Figure 1. Example of multimodal fake news.
Figure 2. The main architecture of Context-Aware Multimodal Fake News Detection.
Figure 3. Residual Network learning module.
Figure 4. Examples of blurry and clear images from the dataset. (Left) Blurry; (Right) clear.
Figure 5. Ablation study results on Weibo and PHEME datasets.
Table 1. ResNet network architectures with different layer configurations.

Layer Name | Output Size | 18-Layer | 34-Layer | 50-Layer | 101/152-Layer
conv1 | 112 × 112 | 7 × 7, 64, stride 2 (all configurations)
— | — | 3 × 3 max pool, stride 2 (all configurations)
conv2.x | 56 × 56 | [3 × 3, 64; 3 × 3, 64] × 2 | [3 × 3, 64; 3 × 3, 64] × 3 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
conv3.x | 28 × 28 | [3 × 3, 128; 3 × 3, 128] × 2 | [3 × 3, 128; 3 × 3, 128] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
conv4.x | 14 × 14 | [3 × 3, 256; 3 × 3, 256] × 2 | [3 × 3, 256; 3 × 3, 256] × 6 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 23/36
conv5.x | 7 × 7 | [3 × 3, 512; 3 × 3, 512] × 2 | [3 × 3, 512; 3 × 3, 512] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
— | 1 × 1 | Average pool, 1000-d fc, softmax (all configurations)
FLOPs | — | 1.8 × 10^9 | 3.6 × 10^9 | 3.8 × 10^9 | 7.6/11.3 × 10^9
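Figure 3 and Table 1 describe the residual blocks that make up the ResNet image backbone. As a concrete illustration, the PyTorch sketch below implements one bottleneck block of the kind used in the 50/101/152-layer columns (a 1 × 1 reduction, a 3 × 3 convolution, and a 1 × 1 expansion around a shortcut connection). It is a minimal reimplementation for illustration only, not the authors' code; in practice a pretrained torchvision ResNet would typically serve as the visual encoder.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal 1x1 -> 3x3 -> 1x1 bottleneck block, as in the 50/101/152-layer columns of Table 1."""
    expansion = 4  # output channels = planes * 4 (e.g., 64 -> 256 in conv2.x)

    def __init__(self, in_channels, planes, stride=1):
        super().__init__()
        out_channels = planes * self.expansion
        self.conv1 = nn.Conv2d(in_channels, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the spatial size or channel count changes.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)  # residual addition, then activation

# Example: one conv2.x block of ResNet-50 on a 56 x 56 feature map.
x = torch.randn(1, 64, 56, 56)
block = Bottleneck(in_channels=64, planes=64)
print(block(x).shape)  # torch.Size([1, 256, 56, 56])
```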
Table 2. Dataset overview.

Dataset | News with Images | No. of Real News | No. of Fake News
Weibo | 9528 | 4779 | 4749
PHEME | 3670 | 3830 | 1972
Table 3. Hyperparameter settings.

Attribute | Hyperparameter
Dropout | 0.3
Required improvement | 2000
Num epochs | 100
Batch size | 32
Pad size | 128
Learning rate | 1 × 10^-5
Optimizer | Adam
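Read together, the Table 3 settings describe a fairly standard fine-tuning recipe: Adam at a learning rate of 1 × 10^-5, batches of 32 posts padded or truncated to 128 tokens, dropout of 0.3, up to 100 epochs, and early stopping once no improvement has been observed for 2000 batches (the "required improvement" entry). The skeleton below shows one way these values could be wired together; the model and data are placeholders for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical training skeleton wiring up the Table 3 settings; the model and data
# are stand-ins, not the authors' released code.
DROPOUT = 0.3
BATCH_SIZE = 32
NUM_EPOCHS = 100
LEARNING_RATE = 1e-5
REQUIRED_IMPROVEMENT = 2000   # stop if no improvement has been seen for this many batches
# Pad size 128 would cap the token sequence length at the tokenizer stage (not shown here).

# Toy data standing in for fused multimodal features (dim 768) and binary real/fake labels.
dataset = TensorDataset(torch.randn(256, 768), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Dropout(DROPOUT), nn.Linear(256, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()

best_loss, last_improve, step, stop = float("inf"), 0, 0, False
for epoch in range(NUM_EPOCHS):
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
        step += 1
        if loss.item() < best_loss:            # in practice this check uses dev-set loss
            best_loss, last_improve = loss.item(), step
        if step - last_improve > REQUIRED_IMPROVEMENT:
            stop = True                        # early stopping ("required improvement")
            break
    if stop:
        break
```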
Table 4. Performance comparison of various models on two datasets.

Dataset | Model | Accuracy | Fake News (Precision / Recall / F1-Score) | Real News (Precision / Recall / F1-Score) | Macro Avg (Precision / Recall / F1-Score)
Weibo | EANN | 0.782 | 0.827 / 0.697 / 0.756 | 0.752 / 0.863 / 0.804 | 0.789 / 0.780 / 0.780
Weibo | att-RNN | 0.772 | 0.854 / 0.656 / 0.742 | 0.720 / 0.889 / 0.795 | 0.787 / 0.773 / 0.769
Weibo | MVAE | 0.824 | 0.854 / 0.769 / 0.809 | 0.802 / 0.875 / 0.837 | 0.828 / 0.822 / 0.823
Weibo | CAFE | 0.840 | 0.855 / 0.830 / 0.842 | 0.825 / 0.851 / 0.837 | 0.840 / 0.841 / 0.840
Weibo | SAFE | 0.763 | 0.833 / 0.659 / 0.736 | 0.717 / 0.868 / 0.785 | 0.775 / 0.764 / 0.761
Weibo | HMCAN | 0.885 | 0.920 / 0.845 / 0.881 | 0.856 / 0.926 / 0.890 | 0.888 / 0.886 / 0.886
Weibo | SpotFake | 0.869 | 0.877 / 0.859 / 0.868 | 0.861 / 0.879 / 0.870 | 0.869 / 0.869 / 0.869
Weibo | SpotFake+ | 0.870 | 0.887 / 0.849 / 0.868 | 0.855 / 0.892 / 0.873 | 0.871 / 0.871 / 0.871
Weibo | FND-CLIP * | 0.882 | 0.930 / 0.828 / 0.876 | 0.842 / 0.936 / 0.886 | 0.886 / 0.882 / 0.881
Weibo | MCOT | 0.901 | 0.895 / 0.911 / 0.903 | 0.906 / 0.890 / 0.898 | 0.901 / 0.901 / 0.901
Weibo | CA-MFD (Ours) | 0.910 | 0.924 / 0.896 / 0.910 | 0.897 / 0.925 / 0.911 | 0.911 / 0.911 / 0.911
PHEME | EANN | 0.685 | 0.664 / 0.694 / 0.701 | 0.750 / 0.747 / 0.681 | 0.707 / 0.721 / 0.691
PHEME | att-RNN | 0.850 | 0.791 / 0.749 / 0.770 | 0.876 / 0.899 / 0.888 | 0.834 / 0.824 / 0.829
PHEME | MVAE | 0.852 | 0.806 / 0.719 / 0.760 | 0.871 / 0.917 / 0.893 | 0.839 / 0.818 / 0.827
PHEME | CAFE | 0.861 | 0.812 / 0.645 / 0.719 | 0.875 / 0.943 / 0.907 | 0.844 / 0.794 / 0.813
PHEME | SAFE | 0.811 | 0.827 / 0.559 / 0.667 | 0.806 / 0.940 / 0.866 | 0.817 / 0.750 / 0.767
PHEME | HMCAN | 0.881 | 0.830 / 0.838 / 0.834 | 0.910 / 0.905 / 0.893 | 0.870 / 0.872 / 0.864
PHEME | SpotFake | 0.823 | 0.743 / 0.745 / 0.744 | 0.864 / 0.863 / 0.863 | 0.804 / 0.804 / 0.804
PHEME | SpotFake+ | 0.800 | 0.730 / 0.668 / 0.697 | 0.832 / 0.869 / 0.850 | 0.781 / 0.769 / 0.774
PHEME | FND-CLIP * | 0.812 | 0.852 / 0.891 / 0.871 | 0.694 / 0.615 / 0.652 | 0.773 / 0.753 / 0.762
PHEME | MCOT | 0.870 | 0.839 / 0.727 / 0.779 | 0.882 / 0.936 / 0.908 | 0.861 / 0.832 / 0.844
PHEME | CA-MFD (Ours) | 0.901 | 0.888 / 0.909 / 0.898 | 0.914 / 0.894 / 0.904 | 0.901 / 0.902 / 0.901
Note: * Denotes results reproduced by the authors. Bold values indicate the best performance among all compared methods.
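For readers cross-checking Tables 4 and 5, each Macro Avg figure is simply the unweighted mean of the corresponding fake-news and real-news scores. A quick check against the CA-MFD (Ours) row on Weibo:

```python
# Macro average = unweighted mean of the two per-class scores (values from the CA-MFD row, Weibo).
fake = {"precision": 0.924, "recall": 0.896, "f1": 0.910}
real = {"precision": 0.897, "recall": 0.925, "f1": 0.911}

macro = {k: (fake[k] + real[k]) / 2 for k in fake}
print(macro)  # {'precision': 0.9105, 'recall': 0.9105, 'f1': 0.9105} -> 0.911 after rounding
```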
Table 5. Performance comparison of different text encoders on Weibo and PHEME datasets.

Dataset | Model | Accuracy (%) | Fake News (Precision / Recall / F1-Score) | Real News (Precision / Recall / F1-Score) | Macro Avg (Precision / Recall / F1-Score)
Weibo | BERT | 88.99 | 0.9297 / 0.8453 / 0.8855 | 0.8564 / 0.9351 / 0.894 | 0.893 / 0.8902 / 0.8898
Weibo | LERT | 89.81 | 0.9128 / 0.8818 / 0.897 | 0.8841 / 0.9146 / 0.8991 | 0.8985 / 0.8982 / 0.8981
Weibo | PERT | 87.53 | 0.9057 / 0.8397 / 0.8714 | 0.8486 / 0.9113 / 0.8789 | 0.8772 / 0.8755 / 0.8751
Weibo | MiniRBT | 85.81 | 0.8574 / 0.8615 / 0.8595 | 0.8589 / 0.8547 / 0.8568 | 0.8581 / 0.8581 / 0.8581
Weibo | T-GPT | 91.03 | 0.9240 / 0.8955 / 0.9095 | 0.8973 / 0.9253 / 0.9111 | 0.9106 / 0.9104 / 0.9103
PHEME | BERT | 87.09 | 0.8339 / 0.9125 / 0.8714 | 0.9118 / 0.8325 / 0.8704 | 0.8728 / 0.8725 / 0.8709
PHEME | LERT | 88.90 | 0.8556 / 0.9245 / 0.8887 | 0.9249 / 0.8562 / 0.8893 | 0.8902 / 0.8904 / 0.8890
PHEME | PERT | 88.57 | 0.8616 / 0.9074 / 0.8839 | 0.9103 / 0.8657 / 0.8874 | 0.8859 / 0.8865 / 0.8857
PHEME | MiniRBT | 78.54 | 0.8015 / 0.7341 / 0.7663 | 0.7727 / 0.8325 / 0.8015 | 0.7871 / 0.7833 / 0.7839
PHEME | T-GPT | 90.13 | 0.8878 / 0.9091 / 0.8983 | 0.9144 / 0.8942 / 0.9042 | 0.9011 / 0.9016 / 0.9012
Note: Bold values indicate the best performance in each metric.
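The encoder comparison in Table 5 amounts to swapping the text branch while keeping the rest of the pipeline fixed. The sketch below shows the typical Hugging Face pattern for doing so; the checkpoint name and the use of the first-token vector as the text feature are illustrative assumptions rather than details taken from the paper (T-GPT in particular is the encoder proposed here and has no public checkpoint).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative encoder swap: "bert-base-chinese" stands in for the BERT row of Table 5;
# the LERT / PERT / MiniRBT rows would substitute their respective pretrained checkpoints.
checkpoint = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

# Pad/truncate to 128 tokens, matching the pad size in Table 3.
batch = tokenizer(["这是一条待检测的微博文本"],  # a sample Weibo post to be classified
                  padding="max_length", truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state   # shape (1, 128, 768)

text_feature = hidden[:, 0]   # first-token vector handed to the fusion module (assumption)
print(text_feature.shape)     # torch.Size([1, 768])
```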
Table 6. Ablation study on two datasets.

Dataset | Model | Accuracy (%) | Fake News (Precision / Recall / F1-Score) | Real News (Precision / Recall / F1-Score)
Weibo | CA-MFD (Image only) | 61.43 | 0.5976 / 0.7166 / 0.6517 | 0.6399 / 0.5107 / 0.5680
Weibo | CA-MFD (Text only) | 87.00 | 0.9438 / 0.7787 / 0.8593 | 0.8163 / 0.9524 / 0.8791
Weibo | CA-MFD w/o CA | 88.99 | 0.9250 / 0.8501 / 0.8860 | 0.8596 / 0.9302 / 0.8935
Weibo | CA-MFD (Con CoAtt) | 85.40 | 0.8802 / 0.8217 / 0.8500 | 0.8308 / 0.8867 / 0.8578
Weibo | CA-MFD w/o Contrast | 89.03 | 0.9313 / 0.8445 / 0.8858 | 0.8560 / 0.9368 / 0.8946
Weibo | CA-MFD w/o CA&Contrast | 87.77 | 0.9415 / 0.8073 / 0.8692 | 0.8293 / 0.9491 / 0.8851
Weibo | CA-MFD | 91.03 | 0.9240 / 0.8955 / 0.9095 | 0.8973 / 0.9253 / 0.9111
PHEME | CA-MFD (Image only) | 74.75 | 0.5778 / 0.4483 / 0.5049 | 0.7962 / 0.8681 / 0.8306
PHEME | CA-MFD (Text only) | 82.01 | 0.6757 / 0.7184 / 0.6964 | 0.8836 / 0.8611 / 0.8722
PHEME | CA-MFD w/o CA | 89.14 | 0.9005 / 0.8696 / 0.8848 | 0.8836 / 0.9115 / 0.8974
PHEME | CA-MFD (Con CoAtt) | 85.03 | 0.8499 / 0.8353 / 0.8426 | 0.8507 / 0.8641 / 0.8574
PHEME | CA-MFD w/o Contrast | 89.56 | 0.9207 / 0.8559 / 0.8871 | 0.8754 / 0.9321 / 0.9028
PHEME | CA-MFD w/o CA&Contrast | 86.76 | 0.8625 / 0.8611 / 0.8618 | 0.8722 / 0.8736 / 0.8729
PHEME | CA-MFD | 90.13 | 0.8878 / 0.9091 / 0.8983 | 0.9144 / 0.8942 / 0.9042
Note: Bold values indicate the best performance in each metric.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
