Article

SAFE-GTA: Semantic Augmentation-Based Multimodal Fake News Detection via Global-Token Attention

1 The Department of Information Engineering, Yangzhou University, Yangzhou 225012, China
2 The Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA
3 Future Design Laboratory, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 961; https://doi.org/10.3390/sym17060961
Submission received: 28 April 2025 / Revised: 11 June 2025 / Accepted: 16 June 2025 / Published: 17 June 2025
(This article belongs to the Special Issue Symmetries and Symmetry-Breaking in Data Security)

Abstract: Large pre-trained models (PLMs) have created tremendous opportunities for multimodal fake news detection. However, existing multimodal methods rarely exploit the token-wise hierarchical semantics of news yielded by PLMs, and they rely heavily on contrastive learning while ignoring the symmetry between text and image in terms of abstraction level. This paper proposes a novel multimodal fake news detection method that balances the understanding between text and image via (1) a global-token cross-attention mechanism that captures the correlations between global text and token-wise image representations (or token-wise text and global image representations) obtained from BERT and ViT; (2) a QK-sharing strategy within cross-attention that enforces model symmetry, reducing information redundancy and accelerating fusion without sacrificing representational power; and (3) a semantic augmentation module that systematically extracts token-wise multilayered text semantics from stacked BERT blocks via CNN and Bi-LSTM layers, thereby rebalancing abstract-level disparities by symmetrically enriching shallow and deep textual signals. We demonstrate the effectiveness of our approach by comparing it with four state-of-the-art baselines on three widely adopted multimodal fake news datasets. The results show that our approach outperforms the benchmarks by 0.8% in accuracy and 2.2% in F1-score on average across the three datasets, demonstrating that a symmetric, token-centric fusion of fine-grained semantics drives more robust fake news detection.

1. Introduction

Fake news is a globally controversial topic that has drawn sustained public attention since being exacerbated by the 2016 U.S. presidential election and the COVID-19 pandemic [1,2,3]. Recognizing fake news that spreads on social media platforms is a significant challenge due to its readability and uncontrollable propagation [4]. However, the datasets used for training fake news detection models often exhibit biases and domain differences, which can affect model robustness. These biases can arise from factors such as topic selection, language style, and image type. Addressing these concerns is essential to ensure that models generalize well across diverse scenarios and avoid unintended biases.
To counter the negative effects mentioned above, numerous studies have addressed the task of fake news detection. Traditional studies tackle the task by combining various features, including textual features (content-based latent features such as sentiment, syntax, and style) [5,6], user-profile features [7], social context features [8], and propagation network-based features [9], with the assistance of deep neural networks. The appearance of large pre-trained language models such as BERT [10,11] and T5 [12] has fundamentally revolutionized the landscape of fake news detection modeling. Such pre-trained language models capture complex linguistic patterns, semantic symmetry, and contextual cues in language, thereby facilitating the discernment between reliable and misleading information [13,14]. For example, Kaliyar et al. proposed a BERT-based fake news detection model called FakeBERT that takes advantage of multiple parallel blocks of single-layered, kernel-size-differentiated deep Convolutional Neural Networks [15].
News articles normally contain abundant text-based (titles, written content, etc.) and non-text-based (images, videos) information, which allows researchers to leverage such multimodal information for automatic fake news detection. However, existing multimodal approaches often suffer from an inherent asymmetry between textual and visual modalities: text encodes sequential, hierarchical semantics, while images capture spatially distributed patterns [16,17]. This asymmetry leads to mismatched abstraction levels and fragmented cross-modal interactions, limiting models' ability to holistically reconcile complementary signals. Figure 1 shows four multimodal news samples from PolitiFact that illustrate the four kinds of relationships between text and image in fake and real news: true positive (aligned text and image in real news), true negative (unaligned text and image in real news), false positive (unaligned text and image in fake news), and false negative (aligned text and image in fake news). In contrast to text-based unimodal approaches, multimodal fake news detection models leverage images whose visual cues aid in assessing the credibility of the news. Moreover, images can provide additional context to textual information, which helps to identify discrepancies or inconsistencies in image–text pairs that might indicate misinformation or manipulation, thereby boosting the effectiveness of fake news detection [18].
In recent years, many multimodal pre-trained models have been presented that demonstrate powerful potential for multimodal representation-related downstream tasks [19,20,21]. Such pre-trained models mainly utilize a text–image alignment approach called contrastive learning, implemented with a contrastive loss that distinguishes true positive image–text pairs from all other unmatched true negative pairs (see Figure 1). Yet this paradigm assumes a rigid symmetry between modalities, treating text and images as interchangeable inputs for alignment. In reality, their semantic roles are asymmetric: text often dominates narrative intent, while images serve as supplementary evidence [16,17]. This leads to three critical limitations: (1) Aligned text–image pairs in fake news act as false positives that reinforce the same false claims. Such alignment fosters a reinforcing effect in which the images and texts mutually support the fake narrative, thereby making it harder for detection systems to identify fake news. (2) The image in a news sample is normally supplementary to the text content, which means the semantic gap between an image and its text can be large due to the varying levels of abstraction across modalities. (3) Text–image pairs originally obtained from different news sources may fall in the same semantic field (such as the same or related topics or events), making the pairs inherently correlated, yet existing contrastive learning methods treat them as negative instances. Although contrastive learning approaches have delivered a certain impact in this setting, their effectiveness for detecting multimodal fake news needs further verification.
To address the asymmetry between text and image, we propose SAFE-GTA, a Semantic Augmentation-based multimodal FakE news detection approach using Global-Token Attention. SAFE-GTA introduces symmetry-aware fusion through two key designs: (1) a global-token cross-attention mechanism that dynamically balances global contextual and local token-wise interactions between modalities, ensuring bidirectional information reconciliation without assuming rigid alignment; (2) a QK-sharing strategy that enforces parameter symmetry between text and image projection layers, reducing redundancy while preserving discriminative power. Unlike contrastive learning's forced parity, our approach respects the intrinsic asymmetry of modalities while establishing complementary symmetry in feature fusion. The remainder of this paper is organized as follows: Section 2 reviews related work, including unimodal and multimodal fake news detection. Section 3 presents the proposed framework. Section 4 discusses experimental analyses, including comparative and ablation studies. Section 5 concludes the paper, and Section 6 highlights limitations. The contributions of this study are as follows:
  • We design a global-token cross-attention mechanism for multimodal representation fusion to effectively capture the correlations between text and image, in which the trainable Query and Key matrices for the two modalities are shared with each other.
  • We manipulate token-level representations yielded from twelve stacked blocks of BERT using a Convolutional Neural Network for semantic augmentation.
  • We propose SAFE-GTA, a multimodal fake news detection system implemented based on dual unimodal pre-trained models. Specifically, we adopt BERT for text learning and ViT for image representation.
  • We conduct extensive experiments, and the results show that our approach outperforms the state-of-the-art baselines by 0.8% in accuracy and 2.2% in F1-score on average across three widely adopted datasets.

2. Literature Review

Identifying fake news has become a significant challenge in the current era of artificial intelligence. Existing fake news detection models targeting different modalities of news data are built on modern algorithmic architectures [22] and large pre-trained models [10,12,20] for classification decisions. This section systematically reviews the literature related to this study from two main aspects: unimodal fake news detection and multimodal fusion for fake news detection.

2.1. Unimodal Fake News Detection

News published on social media contains rich multimodal data (e.g., images, text, videos) that offer complementary insights, prompting exploration into their combined contribution to the authenticity verification of news. Current unimodal approaches to fake news detection primarily extract textual or visual features from news metadata for deep representation learning and higher-level unimodal interactions by leveraging neural networks and pre-trained models. Most mainstream studies focus on manipulating text [23,24] and images [25] for fake news detection.
Text-oriented Approaches: Recent fake news detection research emphasizes the textual modality of news by analyzing textual content [26,27], user profiles [28], social metadata (e.g., retweets) [29,30,31], and the latent features implied by the texts (e.g., sentiment, syntax) [32,33,34]. For example, a two-level (word and sentence) CNN-based generative model was presented to capture new semantic patterns and detect fake news by analyzing users' historically significant responses [35]. Another study proposes a memory-guided, multi-view fake news detection method that accounts for correlations across different domains. It leverages textual features from three perspectives, namely semantic, emotional, and stylistic, for feature fusion and representation aggregation [36].
Modern approaches increasingly utilize fine-tuned pre-trained language models, with particularly successful applications in COVID-19-related misinformation detection. Glazkova et al. [37] demonstrated the effectiveness of CT-BERT (a COVID-Twitter-BERT variant) combined with ensemble learning for identifying pandemic-related fake news, achieving state-of-the-art performance in the Constraint@AAAI2021 shared task. Similarly, Alghamdi et al. [38] systematically evaluated various transformer architectures (including BERT, RoBERTa, and ELECTRA) for COVID-19 fake news detection, showing these models' superior ability to capture pandemic-specific linguistic patterns compared to traditional methods. These studies highlight how domain-adapted transformer models can significantly improve detection accuracy for emerging misinformation themes. However, these text-centric approaches inherently assume textual dominance, neglecting the complementary role of visual evidence and failing to address the asymmetry in multimodal abstraction levels, a limitation inherent to text-only fake news detection systems.
Image-oriented Approaches: Images play a crucial role in news reports; they attract readers' attention and provide a visible abstraction of the news for audiences. In fake news detection, images are normally used as information complementary to text features, since the information they imply is limited for final decision-making. For example, Qi et al. observe that fake-news images consistently exhibit low resolution and strong emotional content [39]; to address these characteristics, the authors propose a method that considers image features in the frequency and pixel domains for detecting misinformation. Another study investigated the effectiveness of images by extracting propagation patterns from them to discriminate between real and fake samples [40]. Yet such methods treat images as isolated signals, overlooking their semantic asymmetry with text (e.g., spatial vs. sequential representations), which complicates joint reasoning.
In summary, unimodal-based approaches can effectively examine efficacious features of news, but these methods might ignore comprehensive understandings and crucial contextual cues present in other modalities. Thus, the robustness and accuracy of unimodal fake news detection systems could be enhanced by integrating other modalities.

2.2. Multimodal Fusion for Fake News Detection

Fake news can be manipulated through various means by malicious creators, including altering images and fabricating text content [41], so relying solely on single-modal approaches yields suboptimal results. To address this challenge, state-of-the-art studies aim to improve performance by considering information from different sources and taking multiple modalities of news into account [42,43,44].
Intuitively, the feature representations of each modality can be obtained separately or jointly from large pre-trained models, such as ResNet [45] and Vision Transformer (ViT) [46] for images, BERT and T5 for text, and CLIP for joint image–text learning. An earlier study presents an Event Adversarial Neural Network-based model, called EANN [47], which uses VGG-19 for image representation and TextCNN for word embeddings. This study utilizes an auxiliary event classification task to assist feature extraction; the event classification branch disentangles the multimodal features so that they contain both event-specific and event-independent information. Moreover, conventional feature-level fusion methods (e.g., concatenation) fail to preserve token-level correspondences between visual patches and textual tokens, leading to information loss during cross-modal alignment. Nevertheless, such disentanglement strategies enforce artificial separation between modalities, exacerbating their inherent asymmetry rather than reconciling it.
Another example, the HMCAN model proposed by Qian et al. [48], is a hierarchical multimodal contextual semantics-based network implemented using ResNet and BERT for image and text representation learning, respectively. This model grasps different levels of semantics produced from layers of BERT, and the semantic representations are fused with image representation using three weight-shared multimodal attention blocks. While weight-sharing introduces partial symmetry, it rigidly aligns modalities at all abstraction levels, ignoring the dynamic complementarity between global context (e.g., image gist) and local semantics (e.g., textual tokens). In particular, the most recent works utilize the technique of contrastive learning that embeds an image–text alignment component constrained using contrastive loss [49,50]. For example, Xu et al. propose a cross-modal contrastive learning framework for multimodal fake news detection, called COOLANT, which adopts a multimodal pre-trained model, CLIP, for feature representation. However, contrastive learning assumes symmetric interchangeability between modalities, forcing matched text–image pairs into identical embedding spaces. This violates their natural asymmetry—texts define narrative intent while images provide contextual evidence, leading to suboptimal fusion.
These limitations highlight a critical gap: existing fusion paradigms either impose artificial symmetry (via contrastive alignment) or tolerate unmitigated asymmetry (via simple concatenation), failing to balance modality-specific strengths. Our work addresses this by proposing symmetry-aware fusion that respects intrinsic modality differences while establishing complementary interactions through hierarchical semantic alignment and parameter sharing.

3. The Framework of SAFE-GTA

This section introduces our proposed SAFE-GTA model. SAFE-GTA aims to deal with the tasks of detecting fake news by utilizing the fusion information of cross-modalities and exploiting the different levels of semantic representations produced from the various blocks of pre-trained text encoder. More specifically, the architecture of the SAFE-GTA model is composed of four components, as shown in Figure 2:
Multimodal encoding module. We employ BERT as our text encoder for textual representation extraction. To match the configuration of the BERT structure, we adopt ViT as the encoder for image learning. These deployments allow the encoders of the two modalities to collaborate for information contrast and fusion.
Multimodal fusion module. This module is designed to fuse the information between images and text, adopting a weight-sharing strategy for the Query and Key matrices. The global and token representations produced by the text and image encoders are used for cross-attention calculation by deploying the QK-sharing layers and two value projection layers. This module not only helps the SAFE-GTA model dynamically adjust its attention over the global and token-wise information of images and text, but also decreases the calculation cost, thereby improving the model's efficiency and multimodal learning ability.
Semantic augmentation module. We obtain the text representation via a semantic augmentation module that fully utilizes the token representations output by the twelve stacked blocks of BERT. Moreover, the deployed convolutional kernels serve as BERT-block-wise weights for semantic representations from shallow to deep during the convolution operations. The learned augmented sequential representations are then sent to a Bi-LSTM network to generate augmented global semantic representations.
Binary classification module. The input of this module is obtained by concatenating the outputs of the semantic augmentation module and the multimodal fusion module. A multilayer perceptron and a softmax layer then predict the authenticity probability of the news.

3.1. Data Preprocessing

In our multimodal framework, raw textual and visual inputs undergo distinct preprocessing pipelines tailored to the BERT and Vision Transformer (ViT) backbones. For textual data, we rely on Hugging Face’s BertTokenizer to convert sentences into token IDs, attention masks, and segment embeddings. Simultaneously, image data is prepared via the ViTFeatureExtractor, which resizes, normalizes, and converts raw images into patch embeddings suitable for a ViT encoder. By explicitly specifying each function call and its parameters, we ensure reproducibility and clarity for downstream model training.
As shown in Table 1, we instantiate a BERT tokenizer using "bert-base-uncased (https://huggingface.co/google-bert/bert-base-uncased)", accessed on 1 June 2024, from "BertTokenizer" for text tokenization. This call performs (1) lowercasing and WordPiece segmentation, (2) special token insertion ([CLS], [SEP]), (3) padding/truncation to a fixed length of 128 tokens, and (4) the generation of an attention mask. The result is a dictionary containing input_ids, attention_mask, and token_type_ids, ready for input into a BERT encoder. For image preprocessing, we adopt "vit-base-patch16-224 (https://huggingface.co/google/vit-base-patch16-224)", accessed on 10 June 2024, for image tokenization and normalization. Each raw image (RGB, variable size) is then processed via the following: (1) resizing the shorter side of the image to 256 pixels, (2) center-cropping to 224 × 224, (3) normalizing pixel values to the mean/std used during ViT pre-training, and (4) converting to a PyTorch tensor of shape (B, 3, 224, 224). The output dictionary contains pixel_values, which are directly consumable by a ViT encoder.
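For concreteness, the following is a minimal sketch of this preprocessing pipeline using the Hugging Face classes and checkpoints named in Table 1; the sample text and image path are illustrative placeholders, and the exact resize/crop behavior follows the checkpoint's stored preprocessing configuration.

```python
# Minimal preprocessing sketch for Section 3.1; news_text and "sample.jpg"
# are illustrative placeholders, not items from the datasets.
from PIL import Image
from transformers import BertTokenizer, ViTFeatureExtractor

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

news_text = "Breaking: officials confirm the report."  # placeholder headline
news_image = Image.open("sample.jpg").convert("RGB")   # placeholder image path

# Lowercasing + WordPiece, [CLS]/[SEP] insertion, pad/truncate to 128 tokens,
# attention-mask generation -> input_ids, attention_mask, token_type_ids.
text_inputs = tokenizer(news_text, padding="max_length", truncation=True,
                        max_length=128, return_tensors="pt")

# Resizing, cropping, and normalization per the checkpoint's stored config;
# yields pixel_values of shape (B, 3, 224, 224).
image_inputs = extractor(images=news_image, return_tensors="pt")
```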

3.2. Image and Text Encoders

As we mentioned above, two unimodal pre-trained models, BERT and ViT, are deployed for text and image learning, and the global and token representations of text and image can be obtained from the pre-trained models.
Text representations: Given a text sequence of news, T = {t_1, t_2, …, t_n}, where n denotes the number of tokens in the sequence, T is fed to BERT to obtain (1) the vector corresponding to the [CLS] token as the global representation e_G^T, and (2) the embeddings of each token in the sequence, e_T = {e_1^t, e_2^t, …, e_n^t}.
Image representations: We utilize ViT, pre-trained on the ImageNet dataset, for image encoding. Each image is divided into patches that, together with the class token, form a sequence of 197 image tokens. Thus, an image sequence I = {i_1, i_2, …, i_197} is conveyed to ViT to obtain (1) the global representation e_G^I, which is the embedding corresponding to the class token, and (2) the token-wise sequential representation e_I = {e_1^i, e_2^i, …, e_197^i}, where e_1^i denotes the embedding of image token i_1 after fine-tuning the pre-trained model.
The global representations e_G^T and e_G^I stand for the overall comprehension of the text and image, because the bidirectional transformer framework deployed in the unimodal encoders (i.e., BERT and ViT) allows each token in a sequence to attend to all the other tokens via the self-attention mechanism.
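A minimal sketch of extracting the four representations with Hugging Face's BertModel and ViTModel is given below; text_inputs and image_inputs are the dictionaries produced by the preprocessing sketch in Section 3.1, and the no-grad context is for illustration only (training would fine-tune both encoders).

```python
# Sketch of Section 3.2: global ([CLS]-style) and token-wise representations
# from the two unimodal encoders. Variable names mirror the paper's notation.
import torch
from transformers import BertModel, ViTModel

bert = BertModel.from_pretrained("bert-base-uncased")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

with torch.no_grad():                       # illustration only; fine-tuning needs grads
    text_out = bert(**text_inputs)          # last_hidden_state: (B, n, 768)
    image_out = vit(**image_inputs)         # last_hidden_state: (B, 197, 768)

e_T = text_out.last_hidden_state            # token-wise text embeddings e_T
e_G_T = e_T[:, 0]                           # global text vector from [CLS]
e_I = image_out.last_hidden_state           # 197 image-token embeddings e_I
e_G_I = e_I[:, 0]                           # global image vector from the class token
```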

3.3. Multimodal Information Fusion

The multimodal fusion mechanism in traditional multimodal fake news detection systems merely leverages the global representation of either images or text. In contrast, the token-wise sequential representation can retain the fine-grained features of the modality, which allows the capture of the local features and structural information of a single modality. By combining the global representation and sequential representation with the cross-attention mechanism in the multimodal fake news detection task, the task model can adjust the focused range from global to local adaptively. This means that in tasks demanding an emphasis on overall meaning, the task model is able to concentrate on comprehensive and global information. Conversely, for tasks that necessitate a focus on intricate details, the model is capable of shifting its attention to concentrate more intensely on a particular localized segment of the sequence.
The QK-sharing mechanism inherently prioritizes cross-modal consistency checks. For instance, when textual claims about "protest size" contradict the actual crowd density in images, the attention weights automatically redistribute to highlight this discrepancy. The multimodal information fusion module not only takes advantage of the global and token-level representations captured by fine-tuning large pre-trained models but also adopts a Query (W^Q) and Key (W^K) weight-sharing strategy in the cross-attention mechanism during information fusion. Specifically, we initialize two QK-sharing layers to adapt the shapes of the image and text representations (W_T^P for text linear projection and W_I^P for image projection), and each modality goes through a value linear layer to obtain its value matrix (W_T^V for text and W_I^V for images). The Q, K, and V representations of images and texts are calculated via these four weight matrices, as the following equations show.
K_T = e_T · W_T^P,  e_T ∈ ℝ^{n×h_t},  W_T^P ∈ ℝ^{h_t×h}    (1)
where e_T denotes the embeddings of the tokens in the sequence T, and W_T^P is the QK-sharing linear layer for text. The product of these two matrices, K_T ∈ ℝ^{n×h}, is the key matrix of the sequence T, where n is the length of the sequence T, h is the output dimension of the projection layer W_T^P, and h_t is the dimensionality of the text tokens' embeddings.
q_T = e_G^T · W_T^P,  e_G^T ∈ ℝ^{1×h_t}    (2)
where e_G^T is the global representation of the sequence T obtained by fine-tuning BERT, and q_T is the query vector of e_G^T under the QK-sharing weight matrix W_T^P, used to check its correlation with the image sequence I.
V_T = e_T · W_T^V,  W_T^V ∈ ℝ^{h_t×h}    (3)
The QK-sharing weight matrix for images has a setting similar to that for text. The following equations produce the query vector q_I, key matrix K_I, and value matrix V_I of an image:
K_I = e_I · W_I^P,  e_I ∈ ℝ^{197×h_i},  W_I^P ∈ ℝ^{h_i×h}    (4)
q_I = e_G^I · W_I^P,  e_G^I ∈ ℝ^{1×h_i}    (5)
V_I = e_I · W_I^V,  W_I^V ∈ ℝ^{h_i×h}    (6)
where K_I ∈ ℝ^{197×h} and q_I ∈ ℝ^{1×h}, which are compatible with the dimensionalities of the text's q_T and K_T, respectively. Once the query vectors and key matrices are obtained, the QK interactions between text and images occur in two cross-attention blocks. Equations (7) and (8) show the global-token interactions between an image and text in the two cross-attention blocks.
h_T^C = Softmax(q_T (K_I)^⊤ / √h) · V_I    (7)
h_I^C = Softmax(q_I (K_T)^⊤ / √h) · V_T    (8)
where h_T^C denotes the hidden representation that captures the correlation between the global information of the text and the token-wise information of the paired image. In contrast, h_I^C maintains the interactive information between the image's global representation and the text's sequence information. Afterward, the two hidden representations h_T^C and h_I^C are concatenated for information fusion, as shown in Equation (9):
h_TI = MLP(Concat(h_T^C, h_I^C))    (9)
where MLP refers to a multilayer perceptron that takes the concatenation of h_T^C and h_I^C as input and generates h_TI, the deep abstraction of the fused text and image representation.
During our preliminary experimental setup and task-model initialization, we found that cross-attention with the QK-sharing strategy does not yield worse results than the traditional cross-attention mechanism. Moreover, the strategy cuts the number of parameter matrices required by cross-attention from six (separate Q, K, and V for each modality) to four (a shared QK projection and a V projection for each modality). This not only preserves the effectiveness of the model but also removes unnecessary parameter overhead, thereby improving the efficiency of the task model.
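The following PyTorch sketch, assuming the shapes defined above, illustrates how Equations (1)–(9) compose with only four projection matrices; the hidden sizes and class names are illustrative rather than the authors' released implementation.

```python
# Minimal sketch of the QK-shared global-token cross-attention: each modality
# owns one shared QK projection (W_T^P, W_I^P) and one value projection
# (W_T^V, W_I^V), i.e., four weight matrices instead of six.
import math
import torch
import torch.nn as nn

class QKSharedCrossAttention(nn.Module):
    def __init__(self, h_t=768, h_i=768, h=256):
        super().__init__()
        self.W_T_P = nn.Linear(h_t, h, bias=False)  # shared QK projection, text
        self.W_I_P = nn.Linear(h_i, h, bias=False)  # shared QK projection, image
        self.W_T_V = nn.Linear(h_t, h, bias=False)  # value projection, text
        self.W_I_V = nn.Linear(h_i, h, bias=False)  # value projection, image
        self.mlp = nn.Sequential(nn.Linear(2 * h, h), nn.ReLU())  # Eq. (9)

    def forward(self, e_T, e_G_T, e_I, e_G_I):
        K_T, q_T = self.W_T_P(e_T), self.W_T_P(e_G_T).unsqueeze(1)  # Eqs. (1)-(2)
        V_T = self.W_T_V(e_T)                                       # Eq. (3)
        K_I, q_I = self.W_I_P(e_I), self.W_I_P(e_G_I).unsqueeze(1)  # Eqs. (4)-(5)
        V_I = self.W_I_V(e_I)                                       # Eq. (6)
        scale = math.sqrt(K_T.size(-1))
        # Eq. (7): global text query attends over image tokens
        h_T_C = torch.softmax(q_T @ K_I.transpose(1, 2) / scale, dim=-1) @ V_I
        # Eq. (8): global image query attends over text tokens
        h_I_C = torch.softmax(q_I @ K_T.transpose(1, 2) / scale, dim=-1) @ V_T
        # Eq. (9): concatenate and fuse through an MLP -> h_TI of shape (B, h)
        return self.mlp(torch.cat([h_T_C, h_I_C], dim=-1)).squeeze(1)
```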
A semantic augmentation module is designed in our model to take full advantage of the hierarchical semantics of text from BERT. Specifically, the proposed SAFE-GTA model exploits the different levels of token-wise representations of a text sequence (i.e., {H_1, …, H_C}) produced by the stacked blocks of BERT by deploying a CNN, in which 1 × 1 convolutional kernels weigh the importance of the sequence representations at the various block levels. The CNN module employs three 1 × 1 convolutional kernels for feature transformation, followed by ReLU activation and layer normalization. These kernels are used for dimensionality reduction and feature fusion, a standard practice in deep learning. Figure 3 illustrates the workflow of the CNN-based semantic augmentation module.
In the figure, the token-wise representations of the 12 layers for a text sequence are treated as three-dimensional data with 12 channels. The values of the corresponding convolutional kernels are randomly initialized and trainable. Each channel undergoes a convolution calculation, and the outputs of all channels are then summed as the return of the semantic augmentation module, so that the result maintains the hierarchical semantic information. Notably, the training of this module is also constrained by the decision label. Equation (10) gives the calculation of the hierarchical semantic representation of text.
Y = Σ_{c=1}^{C} K_c * H_c,  Y ∈ ℝ^{n×h_t}    (10)
where K_c represents a 1 × 1 kernel for channel c, with c indexing the channels from 1 to 12; H_c denotes the tokens' hidden representations of an input text obtained from the c-th layer of BERT; and * denotes the convolution operation. The output Y is the token-wise summation of the results over the 12 channels, which allows the module to dynamically learn and fuse the hierarchical semantics of texts for decision-making. To obtain the whole representation of Y, we adopt a Bi-LSTM network to learn the sequential information of Y, formalized as Equation (11). Specifically, the Bi-LSTM component consists of two layers with 256 hidden units each, with dropout (p = 0.2) between layers to prevent overfitting.
z = BiLSTM(Y),  z ∈ ℝ^{1×h_t}    (11)
where z is the vector learned by the Bi-LSTM from input Y. The shape of z equals the dimensionality of the word embeddings.
Overall, our proposed semantic augmentation module can (1) flexibly capture the hierarchical semantics of text; (2) effectively strengthen the comprehension of textual information for the task model. Finally, the output of the module is concatenated with the fusion representation produced from the multimodal fusion module, aiming at the classification between fake and real news.
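Below is a minimal PyTorch sketch of the module under the stated settings (12 BERT blocks, 1 × 1 kernels, a two-layer Bi-LSTM with 256 hidden units and dropout 0.2); for brevity it collapses the block-wise weighting of Equation (10) into a single 12-channel 1 × 1 convolution and pools the Bi-LSTM output by its last step, both simplifying assumptions.

```python
# Sketch of the semantic augmentation module (Equations (10)-(11)).
import torch
import torch.nn as nn

class SemanticAugmentation(nn.Module):
    def __init__(self, num_layers=12, h_t=768):
        super().__init__()
        # One trainable 1x1 weight per BERT block; a 12-channel 1x1 convolution
        # realizes the channel-wise weighted summation of Eq. (10).
        self.conv = nn.Conv2d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.norm = nn.Sequential(nn.ReLU(), nn.LayerNorm(h_t))
        self.bilstm = nn.LSTM(h_t, 256, num_layers=2, dropout=0.2,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * 256, h_t)  # map Bi-LSTM output back to h_t

    def forward(self, hidden_states):
        # hidden_states: 12 tensors of shape (B, n, h_t), e.g., from
        # BertModel(..., output_hidden_states=True).hidden_states[1:]
        H = torch.stack(hidden_states, dim=1)    # (B, 12, n, h_t): 12 channels
        Y = self.norm(self.conv(H).squeeze(1))   # Eq. (10): (B, n, h_t)
        out, _ = self.bilstm(Y)                  # (B, n, 512)
        z = self.proj(out[:, -1])                # Eq. (11): global vector z, (B, h_t)
        return z
```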

3.4. Classification Module

In this module, we adopt a classic classification design consisting of a dropout-enabled fully connected layer coupled with a Sigmoid function. The module takes the concatenation of the outputs of the multimodal fusion module (i.e., h_TI) and the semantic augmentation module (i.e., z) to enrich the representation of the news. Equation (12) gives the prediction of the SAFE-GTA model for a given image–text pair.
ŷ = Sigmoid(W_fc · (h_TI ⊕ z))    (12)
where W_fc is the fully connected layer with shape ℝ^{2×(h+h_t)} for binary classification, and ⊕ denotes concatenation. Finally, the prediction ŷ and the label y are jointly sent to the loss function for model training. We adopt the loss function shown in Equation (13).
L = L_CE + (λ / 2n) Σ_w ‖w‖²    (13)
where L_CE is the cross-entropy loss, and (λ / 2n) Σ_w ‖w‖² is a standard L2 regularization term used to improve the generalization ability of the task model.
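A sketch of the classification head and loss is shown below; it outputs two logits and uses PyTorch's cross-entropy (the numerically stable counterpart of the Sigmoid/softmax output in Equation (12)), and the reading of n in Equation (13) as the batch size is our assumption.

```python
# Sketch of the classification head and training loss (Equations (12)-(13)).
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, h=256, h_t=768, p_drop=0.5):
        super().__init__()
        # dropout-enabled fully connected layer over the concatenated features
        self.fc = nn.Sequential(nn.Dropout(p_drop), nn.Linear(h + h_t, 2))

    def forward(self, h_TI, z):
        return self.fc(torch.cat([h_TI, z], dim=-1))  # Eq. (12): logits over {fake, real}

def loss_fn(model, logits, labels, lam=1e-4):
    ce = nn.functional.cross_entropy(logits, labels)      # L_CE
    l2 = sum(w.pow(2).sum() for w in model.parameters())  # sum of squared weights
    return ce + lam / (2 * logits.size(0)) * l2           # Eq. (13), n = batch size (assumed)
```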

4. Experimental Analysis

This section introduces the layout of the experiments to highlight the effectiveness of our proposed model. First, the configuration of our platform is described in detail. Second, the constitution of the datasets adopted for our experiments is clearly demonstrated. Third, we describe the comparative experiments and their results with reference to four popular, well-established baselines. Finally, we describe the ablation experiments conducted to reveal the importance of the different components of our model.

4.1. Experiment Platform and Datasets

The platform used for our experiments was a high-performance computing server consisting of an Intel Core i9-13900 CPU (a 13th-generation processor with 24 cores and a base clock speed of 2.3 GHz), 32 GB of DDR5 RAM at 5200 MHz, a 2 TB M.2 2280 NVMe SSD offering high-speed data transfer and storage, and two NVIDIA RTX 3090 Turbo GPUs providing strong graphical performance for deep learning applications. The learning rate was set to 3 × 10⁻⁵, AdamW was used as the optimizer, and the batch size was 16. More details about the hyperparameters used for SAFE-GTA training are shown in Table 2. We set up the experimental environment using PyTorch, a popular open-source deep learning framework, on Python 3.8.
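These optimization settings translate directly into PyTorch; in the sketch below, the tiny random TensorDataset and the linear placeholder model merely stand in for the real datasets and the assembled SAFE-GTA network.

```python
# Training-setup sketch matching the stated hyperparameters
# (lr = 3e-5, AdamW optimizer, batch size 16).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the multimodal datasets (illustrative only).
train_set = TensorDataset(torch.randn(64, 1024), torch.randint(0, 2, (64,)))
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = torch.nn.Linear(1024, 2)  # placeholder for the assembled SAFE-GTA model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```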
To verify the effectiveness of our model, we collected three widely adopted real-world datasets for our experiments. The three datasets are well organized by their publishers (90% for training sets and 10% for testing sets). It is noteworthy that while the ReCOVery dataset provides authentic real-world examples, its class distribution (22 fake vs. 123 real test samples) reflects natural but imbalanced occurrence patterns. We preserved this original distribution to maintain ecological validity while implementing three safeguards: (1) stratified sampling during data splits, (2) class-weighted cross-entropy loss, and (3) balanced evaluation metrics, including F1-score alongside accuracy. The details of the datasets are shown in Table 3. While these datasets provide a good starting point, it is important to acknowledge that biases and domain differences may exist within and across them. In future work, we plan to investigate strategies to mitigate these biases and enhance the model's generalizability.
  • PolitiFact is a well-known non-profit fact-checking website that rates the accuracy of claims by officials, pundits, and other public figures. A large number of in-field researchers collect data from the website, aiming to detect unimodal or multimodal misinformation.
  • Gossipcop is a resource specifically established for the detection and analysis of fake news, particularly in the context of entertainment news and celebrity gossip. This dataset is key for research and development in the field of misinformation studies, especially with a focus on multimodal data.
  • ReCOVery is a specially designed and constructed repository aimed at supporting research efforts in the fight against misinformation related to COVID-19.
Table 3 illustrates the statistics of the three deployed datasets. Specifically, the PolitiFact dataset contains 722 real news samples (647 for training and 75 for testing) and 1238 fake news samples (1117 for training and 121 for testing); the Gossipcop dataset consists of 7020 real news samples (6253 for training and 767 for testing) and 2301 fake news samples (2135 for training and 166 for testing); and the ReCOVery dataset contains 1228 real news items (1105 for training and 123 for testing) and 216 fake ones (194 for training and 22 for testing). These datasets are all in English and domain-specific. As a result, the model's generalizability to non-English languages or culturally distinct domains remains unverified, which calls for future exploration of multilingual and cross-cultural settings to validate SAFE-GTA's broader applicability. Although all three datasets are unbalanced, they are still adopted for method verification in many studies. On the other hand, our training protocol employs dynamic class weights adjusted per batch, gradient clipping to prevent minority-class overfitting, and early stopping based on validation F1 to mitigate potential bias from class imbalance; a sketch of these safeguards follows.
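The sketch below illustrates two of these safeguards, per-batch class-weighted cross-entropy and gradient clipping; the clipping norm of 1.0 is an assumed value, and the early-stopping criterion on validation F1 would wrap the surrounding epoch loop.

```python
# Sketch of the class-imbalance safeguards described above.
import torch
import torch.nn as nn

def weighted_step(model, optimizer, logits, labels):
    # Weight each class inversely to its frequency in the current batch
    # (dynamic per-batch class weights).
    counts = torch.bincount(labels, minlength=2).clamp(min=1).float()
    weights = counts.sum() / (2 * counts)
    loss = nn.functional.cross_entropy(logits, labels, weight=weights)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping against minority-class overfitting (norm assumed).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```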
It is important to note that the datasets used in our fake news detection research may contain ambiguous or noisy labels due to weak supervision during their annotation process. In this paper, we propose a novel multimodal fake news detection method, but we acknowledge that the potential limitations of the datasets could impact our model’s performance. To address this, we plan to investigate label confidence, source credibility, and annotation strategy in more depth in our future work.

4.2. Baselines and Evaluation Metrics

To spotlight the effectiveness of our approach, we employed four well-known, representative multimodal fake news detection benchmarks, collecting their code from GitHub or replicating their experiments in PyTorch 2.5.1 according to the implementation details in their papers.
Spotfake: Reference [51] employs pre-trained language models, such as BERT, for learning textual data, and uses VGG-19, which is pre-trained on the ImageNet dataset, to extract features from images.
Spotfake+: Reference [52] strengthens SpotFake by utilizing XLNet to extract textual features. The text and image features are concatenated, and the resulting multimodal features are fed into a binary classifier.
HMCAN: Reference [48] establishes multimodal attention networks using self-attention mechanisms to combine textual and visual features. In the case of news that lacks images, HMCAN generates placeholder images to create text–image pairs.
CAFE: Reference [50] intelligently combines the individual features of two modalities and the extracted interaction representation between multimodalities to identify fake news using contrastive learning.
Metrics: Accuracy is a frequently used evaluation metric in binary classification tasks, including fake news detection. Nonetheless, its reliability diminishes significantly when dealing with imbalanced datasets. In our experiment, we included additional evaluation metrics, namely precision, recall, and F1-score, to provide a more comprehensive assessment of the task’s performance in such situations. Furthermore, we report the Area Under the ROC Curve (AUC) as a threshold-independent metric that reflects the model’s overall discriminative ability, providing further insight into performance under varying decision boundaries.
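These metrics can be computed with scikit-learn as sketched below; the label convention (1 = fake) and the example arrays are illustrative.

```python
# Sketch of the evaluation metrics used here: accuracy, precision, recall,
# F1-score, and the threshold-independent AUC.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = fake, 0 = real (assumed convention)
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6])  # predicted P(fake)
y_pred = (y_prob >= 0.5).astype(int)         # default decision threshold

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_prob))  # threshold-independent
```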

4.3. Effectiveness Evaluation

The experiments in this study were conducted in two stages: (1) a comparison with the four baselines mentioned above and (2) an ablation study that disables the multimodal fusion module or the semantic augmentation module. Table 4 reports the performance comparison between the four baseline methods and our SAFE-GTA model on the PolitiFact, Gossipcop, and ReCOVery datasets. The table shows that our model outperformed the four baselines on most metrics. In more detail, we make the following observations, which spotlight the effectiveness, robustness, and superiority of our model.
Effectiveness in task classification: SAFE-GTA achieved the best performance on almost all metrics across the three datasets. (1) On PolitiFact, SAFE-GTA had the highest accuracy (0.888), indicating it is the most reliable method for classifying fake and real news. Moreover, SAFE-GTA's precision, recall, and F1-score were consistently high for both fake and real news, meaning that SAFE-GTA is adept at correctly identifying each class. (2) On Gossipcop, SAFE-GTA again showed strong performance with the highest accuracy (0.876) and the highest precision on fake news (0.903), outperforming the second-best model, CAFE, by 0.5% and 2.8%, respectively. It had moderate precision and recall for real news but excelled with the highest F1-score for fake news (0.927), suggesting a good balance between precision and recall in this category. (3) SAFE-GTA outperformed all baselines on all metrics for the ReCOVery dataset, marking it the superior method for detecting both real and fake news in this context. The high accuracy (0.937), precision (0.936 on average across fake and real news), recall (0.814 on average), and F1-score (0.86 on average) indicate that SAFE-GTA generalized its detection capabilities very effectively on this dataset. As shown in Figure 4, the proposed SAFE-GTA model achieved the highest AUC scores across all three datasets: PolitiFact (0.905), Gossipcop (0.933), and ReCOVery (0.821). These consistently high AUC values quantitatively validate SAFE-GTA's superior discriminative capacity in separating real from fake news across diverse classification thresholds. Because the AUC evaluates the model's ability to balance true positive rates (TPRs) and false positive rates (FPRs) at all decision boundaries, these results confirm SAFE-GTA's robustness against threshold variations, a critical advantage for real-world deployment where optimal thresholds are often dataset-dependent. Notably, SAFE-GTA's AUC surpassed those of all baselines (e.g., +1.0% over CAFE on PolitiFact and +3.1% on Gossipcop), underscoring its generalized superiority in multimodal fake news detection.
Robustness Across Datasets: SAFE-GTA consistently ranked as one of the top-performing methods across all three datasets, highlighting its robustness and adaptability to different types of news content. Meanwhile, SAFE-GTA maintained a high F1-score, especially on the Gossipcop and ReCOVery datasets, while some baselines struggled to balance precision and recall for fake news. The robustness of our model benefits from the dual QK-shared global-to-token cross-attention in SAFE-GTA that is able to capture the relevance between images and text from perspectives of whole representation and token-wise sequence representations. In other words, our proposed global-to-token cross-attention helps the SAFE-GTA model possess a better ability to discover the correlations and inconsistencies between modalities from a global viewpoint of text or images.
Superiority in Complex Environments: The datasets adopted for the experiments differ from one another in topics, news length, consistency between modalities, and other factors. Nevertheless, SAFE-GTA achieved relatively outstanding performance across all datasets compared with the other baselines. This superiority stems from the embedded semantic augmentation module, which grasps different levels of token-wise representation of text produced by the stacked blocks of BERT, thereby boosting SAFE-GTA's comprehension of hierarchical textual information. Table 4 shows that SAFE-GTA achieved the best accuracy and F1-scores on both fake and real testing, except for the F1-score for fake news on the Gossipcop dataset (92.7%, 0.1% below the 92.8% produced by Spotfake+). Thus, SAFE-GTA has the best overall ability among the compared models to precisely identify both real and fake news. Meanwhile, SAFE-GTA is well suited for transfer to news-veracity-related real-world applications, such as social media platforms and news aggregation services.
Domain analysis: Our experiments reveal important insights about SAFE-GTA's domain adaptability. While the architecture is fundamentally domain-agnostic through its use of general-purpose BERT/ViT backbones, performance varies with the characteristics of the training data. The framework achieved its highest accuracy (93.7%) on the COVID-19-focused ReCOVery dataset, where pandemic-related misinformation exhibits distinct visual–textual patterns such as manipulated graphs. For entertainment news (Gossipcop, 87.6% accuracy), performance was slightly constrained by the noisy, emotionally charged nature of celebrity content. Political claims (PolitiFact, 88.8% accuracy) presented intermediate difficulty due to their subtle factual distortions. These results demonstrate that while SAFE-GTA can generalize across domains, optimal deployment requires domain-representative training data.
Summary of the comparison results: The performance comparison with strong benchmarks shows SAFE-GTA to be a highly effective method for fake news detection. The experimental results demonstrate the effectiveness, robustness, and potential of SAFE-GTA for dealing with misinformation in sensitive and rapidly evolving news environments. Although SAFE-GTA is designed to rely primarily on textual and visual content, incorporating source metadata (e.g., publisher identity, account credibility) could provide additional context. Such information, where available, could be appended as auxiliary input or integrated into the hierarchical textual encoder, potentially enhancing performance in cases where multimodal clues are weak or ambiguous.

4.4. The Ablation Study

To assess the impact of the embedded components of our approach, we conducted an ablation study in which we disabled the fusion block and/or the semantic augmentation module. To this end, we designed the following SAFE-GTA variants to examine the importance of these components.
  • CF–SAFE: the SAFE-GTA model with the multimodal information fusion module disabled.
  • SA–SAFE: the SAFE-GTA variant with the semantic augmentation module disabled.
  • FA–SAFE: the SAFE-GTA variant with both information fusion and semantic augmentation disabled.
Figure 5 depicts the results yielded by SAFE-GTA and its variants in terms of accuracy and F1-score for real and fake news on the three datasets, which helps clarify the contributions of the components to the performance of the original SAFE-GTA. To highlight the implications of the ablation study, we offer the following observations based on the results in the figure.
SAFE-GTA serves as the reference with the highest performance across all metrics on the three datasets. CF–SAFE shows a drop in all metrics compared with SAFE-GTA, indicating the importance of the ablated fusion module in recognizing both real and fake news. SA–SAFE shows a drastic decrease in both F1-scores, with a larger impact on F1-True, again highlighting the importance of the semantic augmentation module in classifying news. FA–SAFE shows the least variation across the metrics.
The CF–SAFE variant's performance indicates that the multimodal fusion module is critical to the model's ability to identify fake news accurately across datasets, particularly on the ReCOVery dataset, where its removal also affected the detection of fake news. On the other hand, the performance gaps between SAFE-GTA and SA–SAFE, especially in the F1-True metric, highlight the importance of the semantic augmentation block. That is, semantic augmentation plays a nuanced role in the model, potentially offering a trade-off between accurately detecting real and fake news.
The varying impacts of the different ablated components on the three datasets suggest that the characteristics of each dataset, such as the proportions of fake and real news, the information alignment among modalities, and the length of news items, significantly influence which model features matter most. In summary, the ablation study clearly demonstrates that the performance of the SAFE-GTA model across metrics relies on the integrated functioning of multiple components: removing any single module reduces performance to varying degrees. The study also underscores the need for tailored strategies for different datasets and contexts when combating fake news.

5. Conclusions

This work proposes an effective semantic augmentation-based multimodal fake news detection model that takes advantage of multimodal fusion and the fine-grained levels of semantic information yielded by large pre-trained models. Specifically, we address the intrinsic asymmetry between text and image modalities through symmetry-aware designs: (1) We establish bidirectional symmetry in feature projection by sharing Query–Key matrices between modalities to harmonize their interaction spaces while preserving modality-specific value projections. This not only captures the interactive relationship between images and text in terms of global representations and token-wise sequences but also reduces parameter redundancy through enforced symmetry, thereby improving efficiency without sacrificing discriminative power. (2) We design a semantic augmentation module that resolves vertical asymmetry in textual abstraction levels by hierarchically fusing shallow and deep token representations from BERT's stacked blocks, balancing semantic granularity across modalities. The experimental results show that our approach outperformed the baselines by 0.8% in accuracy and 2.2% in F1-score on average across the three widely adopted datasets. The success of SAFE-GTA underscores the importance of reconciling modality asymmetry through controlled symmetry in fusion mechanisms, where rigid alignment (e.g., contrastive learning) and complete independence (e.g., simple concatenation) both fail to leverage complementary signals. In the future, we will extend the current static QK-sharing to context-aware parameter symmetry, allowing modalities to dynamically negotiate interaction weights based on input characteristics, and develop contrastive objectives that respect inherent asymmetry while encouraging semantic-level symmetry (e.g., aligning image regions with textual phrases rather than entire sentences).

6. Limitations

While our SAFE-GTA model demonstrates promising results in detecting multimodal fake news, the approach still has several limitations. First, the current framework was evaluated primarily on English text–image pairs. Its performance in multilingual scenarios or on other multimodal tasks (e.g., video-based misinformation) remains untested, although the architecture's reliance on BERT and ViT suggests potential adaptability through language-specific pre-training or modality-specific encoders. Future work will investigate these extensions. Second, the datasets used in our experiments (PolitiFact, Gossipcop, and ReCOVery) are imbalanced, with more real news items than fake. This imbalance can affect the model's performance, especially for the minority class (fake news). Although we employed techniques such as class weighting during training, the model still makes some misclassifications, particularly for underrepresented classes. Additionally, fake news often exhibits complex semantic patterns and contextual cues that can be challenging for the model to capture. The current model relies on pre-trained BERT and ViT models to extract text and image representations, which may not fully account for the nuanced semantic details present in some fake news articles.
On the other hand, the model’s performance may vary across different domains and topics. The datasets used in our experiments cover a range of topics, but the model may struggle to generalize to completely new domains or topics not seen during training. This highlights the need for domain-adaptive approaches to further enhance the model’s robustness. Finally, while our proposed global-token cross-attention mechanism and QK-sharing strategy effectively fuse multimodal information, there is still room for improvement in the fusion mechanism. More advanced techniques that can better harmonize the interactions between text and image modalities could potentially lead to even better performance.

Author Contributions

Conceptualization, C.Z.; methodology, L.Z. and C.Z.; software, L.Z.; validation, L.Z. and Z.Z.; formal analysis, L.Z. and Z.Z.; resources, Y.H.; data curation, L.Z.; writing—original draft preparation, L.Z. and C.Z.; writing—review and editing, Z.Z. and Y.H.; visualization, L.Z.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 62403412) and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China (grant number 23KJB520040). The APC was funded by the National Natural Science Foundation of China (grant number 62403412).

Data Availability Statement

Data are available in publicly accessible repositories: https://github.com/nguyenvo09/EMNLP2020/tree/master/formatted_data/Politifact (PolitiFact, accessed on 10 June 2024), https://github.com/KaiDMML/FakeNewsNet/tree/master/dataset (Gossipcop, accessed on 10 June 2024), and https://github.com/apurvamulay/ReCOVery/tree/master/dataset (ReCOVery, accessed on 10 June 2024).

Acknowledgments

This study is partially supported by the National Natural Science Foundation of China (62403412) and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China under grant 23KJB520040.

Conflicts of Interest

We declare that we have no financial or personal relationships with other people or organizations that could have inappropriately influenced our work; there are no professional or other personal interests of any nature or kind with any product, service, and/or company that could be construed as influencing the authors or the review of this manuscript. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Grinberg, N.; Joseph, K.; Friedland, L.; Swire-Thompson, B.; Lazer, D. Fake news on Twitter during the 2016 US presidential election. Science 2019, 363, 374–378. [Google Scholar] [CrossRef] [PubMed]
  2. Patwa, P.; Sharma, S.; Pykl, S.; Guptha, V.; Kumari, G.; Akhtar, M.S.; Ekbal, A.; Das, A.; Chakraborty, T. Fighting an infodemic: Covid-19 fake news dataset. In Proceedings of the Combating Online Hostile Posts in Regional Languages During Emergency Situation: First International Workshop, CONSTRAINT 2021, Collocated with AAAI 2021, Virtual Event, 8 February 2021. Revised Selected Papers 1; Springer: Berlin/Heidelberg, Germany, 2021; pp. 21–29. [Google Scholar]
  3. Tenali, N.; Babu, G.R.M. A systematic literature review and future perspectives for handling big data analytics in COVID-19 diagnosis. New Gener. Comput. 2023, 41, 243–280. [Google Scholar] [CrossRef] [PubMed]
  4. Lazer, D.M.; Baum, M.A.; Benkler, Y.; Berinsky, A.J.; Greenhill, K.M.; Menczer, F.; Metzger, M.J.; Nyhan, B.; Pennycook, G.; Rothschild, D.; et al. The science of fake news. Science 2018, 359, 1094–1096. [Google Scholar] [CrossRef] [PubMed]
  5. Alonso, M.A.; Vilares, D.; Gómez-Rodríguez, C.; Vilares, J. Sentiment analysis for fake news detection. Electronics 2021, 10, 1348. [Google Scholar] [CrossRef]
  6. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big data 2020, 8, 171–188. [Google Scholar] [CrossRef]
  7. Shu, K.; Zhou, X.; Wang, S.; Zafarani, R.; Liu, H. The role of user profiles for fake news detection. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 436–439. [Google Scholar]
  8. Nguyen, V.H.; Sugiyama, K.; Nakov, P.; Kan, M.Y. Fang: Leveraging social context for fake news detection using graph representation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020; pp. 1165–1174. [Google Scholar]
  9. Shu, K.; Mahudeswaran, D.; Wang, S.; Liu, H. Hierarchical propagation networks for fake news detection: Investigation and exploitation. In Proceedings of the International AAAI Conference on Web and Social Media, Virtual, 8 June 2020; Volume 14, pp. 626–637. [Google Scholar]
  10. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  11. Sharma, U.; Pandey, P.; Kumar, S. A transformer-based model for evaluation of information relevance in online social-media: A case study of covid-19 media posts. New Gener. Comput. 2022, 40, 1029–1052. [Google Scholar] [CrossRef]
  12. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  13. Tian, L.; Zhang, X.; Wang, Y.; Liu, H. Early detection of rumours on twitter via stance transfer learning. In Proceedings of the Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, 14–17 April 2020, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2020; pp. 575–588. [Google Scholar]
  14. Lin, H.; Yi, P.; Ma, J.; Jiang, H.; Luo, Z.; Shi, S.; Liu, R. Zero-shot rumor detection with propagation structure via prompt learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 5213–5221. [Google Scholar]
  15. Kaliyar, R.K.; Goswami, A.; Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimed. Tools Appl. 2021, 80, 11765–11788. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Yang, Y.; Ying, Q.; Qian, Z.; Zhang, X. Multimodal fake news detection via CLIP-guided learning. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2825–2830. [Google Scholar]
  17. Yang, C.; Zhu, F.; Han, J.; Hu, S. Invariant Meets Specific: A Scalable Harmful Memes Detection Framework. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4788–4797. [Google Scholar]
  18. Molina, M.D.; Sundar, S.S.; Le, T.; Lee, D. “Fake news” is not simply false information: A concept explication and taxonomy of online content. Am. Behav. Sci. 2021, 65, 180–212. [Google Scholar] [CrossRef]
  19. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 9694–9705. [Google Scholar]
  20. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  21. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  23. Przybyła, P.; Soto, A.J. When classification accuracy is not enough: Explaining news credibility assessment. Inf. Process. Manag. 2021, 58, 102653. [Google Scholar] [CrossRef]
  24. Bhatt, S.; Goenka, N.; Kalra, S.; Sharma, Y. Fake news detection: Experiments and approaches beyond linguistic features. In Data Management, Analytics and Innovation: Proceedings of ICDMAI 2021; Springer: Berlin/Heidelberg, Germany, 2022; Volume 2, pp. 113–128. [Google Scholar]
  25. Xu, F.; Sheng, V.S.; Wang, M. A unified perspective for disinformation detection and truth discovery in social sensing: A survey. ACM Comput. Surv. (CSUR) 2021, 55, 1–33. [Google Scholar] [CrossRef]
  26. Capuano, N.; Fenza, G.; Loia, V.; Nota, F.D. Content Based Fake News Detection with machine and deep learning: A systematic review. Neurocomputing 2023, 530, 91–103. [Google Scholar] [CrossRef]
  27. Kochkina, E.; Hossain, T.; Logan IV, R.L.; Arana-Catania, M.; Procter, R.; Zubiaga, A.; Singh, S.; He, Y.; Liakata, M. Evaluating the generalisability of neural rumour verification models. Inf. Process. Manag. 2023, 60, 103116. [Google Scholar] [CrossRef]
  28. Jarrahi, A.; Safari, L. Evaluating the effectiveness of publishers’ features in fake news detection on social media. Multimed. Tools Appl. 2023, 82, 2913–2939. [Google Scholar] [CrossRef]
  29. Raza, S.; Ding, C. Fake news detection based on news content and social contexts: A transformer-based approach. Int. J. Data Sci. Anal. 2022, 13, 335–362. [Google Scholar] [CrossRef]
  30. Allein, L.; Moens, M.F.; Perrotta, D. Preventing profiling for ethical fake news detection. Inf. Process. Manag. 2023, 60, 103206. [Google Scholar] [CrossRef]
  31. Hamdi, T.; Slimi, H.; Bounhas, I.; Slimani, Y. A hybrid approach for fake news detection in twitter based on user features and graph embedding. In Proceedings of the Distributed Computing and Internet Technology: 16th International Conference, ICDCIT 2020, Bhubaneswar, India, 9–12 January 2020, Proceedings; Springer: Berlin/Heidelberg, Germany, 2020; pp. 266–280. [Google Scholar]
  32. Lin, S.Y.; Kung, Y.C.; Leu, F.Y. Predictive intelligence in harmful news identification by BERT-based ensemble learning model with text sentiment analysis. Inf. Process. Manag. 2022, 59, 102872. [Google Scholar] [CrossRef]
  33. Luvembe, A.M.; Li, W.; Li, S.; Liu, F.; Xu, G. Dual emotion based fake news detection: A deep attention-weight update approach. Inf. Process. Manag. 2023, 60, 103354. [Google Scholar] [CrossRef]
  34. Zhu, Y.; Wang, Y.; Qiang, J.; Wu, X. Prompt-Learning for Short Text Classification. IEEE Trans. Knowl. Data Eng. (TKDE) 2024, 36, 5328–5339. [Google Scholar] [CrossRef]
  35. Qian, F.; Gong, C.; Sharma, K.; Liu, Y. Neural User Response Generator: Fake News Detection with Collective User Intelligence. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; Volume 18, pp. 3834–3840. [Google Scholar]
  36. Zhu, Y.; Sheng, Q.; Cao, J.; Nan, Q.; Shu, K.; Wu, M.; Wang, J.; Zhuang, F. Memory-guided multi-view multi-domain fake news detection. IEEE Trans. Knowl. Data Eng. 2022, 35, 7178–7191. [Google Scholar] [CrossRef]
  37. Glazkova, A.; Glazkov, M.; Trifonov, T. g2tmn at Constraint@AAAI2021: Exploiting CT-BERT and Ensembling Learning for COVID-19 Fake News Detection. In Combating Online Hostile Posts in Regional Languages During Emergency Situation; Springer International Publishing: Cham, Switzerland, 2021; pp. 116–127. [Google Scholar] [CrossRef]
  38. Alghamdi, J.; Lin, Y.; Luo, S. Towards COVID-19 fake news detection using transformer-based models. Knowl.-Based Syst. 2023, 274, 110642. [Google Scholar] [CrossRef] [PubMed]
  39. Qi, P.; Cao, J.; Yang, T.; Guo, J.; Li, J. Exploiting multi-domain visual information for fake news detection. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 518–527. [Google Scholar]
  40. Jin, Z.; Cao, J.; Zhang, Y.; Zhou, J.; Tian, Q. Novel visual and statistical image features for microblogs news verification. IEEE Trans. Multimed. 2016, 19, 598–608. [Google Scholar] [CrossRef]
  41. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 2021, 58, 102610. [Google Scholar] [CrossRef]
  42. Jing, J.; Wu, H.; Sun, J.; Fang, X.; Zhang, H. Multimodal fake news detection via progressive fusion networks. Inf. Process. Manag. 2023, 60, 103120. [Google Scholar] [CrossRef]
  43. Comito, C.; Caroprese, L.; Zumpano, E. Multimodal fake news detection on social media: A survey of deep learning techniques. Soc. Netw. Anal. Min. 2023, 13, 101. [Google Scholar] [CrossRef]
  44. Qi, P.; Bu, Y.; Cao, J.; Ji, W.; Shui, R.; Xiao, J.; Wang, D.; Chua, T.S. Fakesv: A multimodal benchmark with rich social context for fake news detection on short video platforms. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14444–14452. [Google Scholar]
  45. Wu, Z.; Shen, C.; Van Den Hengel, A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognit. 2019, 90, 119–133. [Google Scholar] [CrossRef]
  46. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  47. Wang, Y.; Ma, F.; Jin, Z.; Yuan, Y.; Xun, G.; Jha, K.; Su, L.; Gao, J. EANN: Event adversarial neural networks for multi-modal fake news detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 849–857. [Google Scholar]
  48. Qian, S.; Wang, J.; Hu, J.; Fang, Q.; Xu, C. Hierarchical multi-modal contextual attention network for fake news detection. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 153–162. [Google Scholar]
  49. Wang, L.; Zhang, C.; Xu, H.; Xu, Y.; Xu, X.; Wang, S. Cross-modal contrastive learning for multimodal fake news detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5696–5704. [Google Scholar]
  50. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal ambiguity learning for multimodal fake news detection. In Proceedings of the ACM Web Conference 2022, Virtual Event, Lyon, France, 25–29 April 2022; pp. 2897–2905. [Google Scholar]
  51. Singhal, S.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P.; Satoh, S. Spotfake: A multi-modal framework for fake news detection. In Proceedings of the 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), Singapore, 11–13 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 39–47. [Google Scholar]
  52. Singhal, S.; Kabra, A.; Sharma, M.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P. Spotfake+: A multimodal framework for fake news detection via transfer learning (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13915–13916. [Google Scholar]
Figure 1. An example collected from PolitiFact that demonstrates four kinds of alignment relationships between text and image in fake and real news, where the True and False buttons respectively represent real and fake news, and the positive and negative buttons denote whether a text–image pair is aligned or not.
Figure 2. The framework of SAFE-GTA contains a data representation module (the pre-trained models in the figure), a semantic augmentation module, a multimodal fusion module, and a classification module. The notation used in the diagram is explained in the top-left corner of the figure.
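To make the fusion module concrete, the following minimal PyTorch sketch illustrates a single-head global-token cross-attention with QK sharing: the global representation of one modality queries the token-wise representation of the other, and one projection matrix serves both queries and keys. The head count, dimensions, and residual wiring are illustrative assumptions, not the exact published implementation.

```python
import torch
import torch.nn as nn

class GlobalTokenCrossAttention(nn.Module):
    """Sketch of a single-head global-token cross-attention block.

    A global vector (e.g., BERT's [CLS]) attends over token-wise features of
    the other modality (e.g., ViT patches). QK-sharing is modeled here by
    reusing one projection for both queries and keys.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        self.qk_proj = nn.Linear(dim, dim)   # shared Q/K projection (QK-sharing)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, global_vec: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # global_vec: (B, D) global representation of one modality
        # tokens:     (B, N, D) token-wise representation of the other modality
        q = self.qk_proj(global_vec).unsqueeze(1)   # (B, 1, D)
        k = self.qk_proj(tokens)                    # (B, N, D), same weights as q
        v = self.v_proj(tokens)                     # (B, N, D)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, N)
        fused = (attn @ v).squeeze(1)               # (B, D)
        return self.out_proj(fused)

# Usage with illustrative shapes: a global text vector attends to ViT tokens.
fusion = GlobalTokenCrossAttention(dim=768)
text_global = torch.randn(16, 768)          # BERT [CLS] vectors
image_tokens = torch.randn(16, 197, 768)    # ViT patch tokens incl. [CLS]
fused = fusion(text_global, image_tokens)   # (16, 768)
```

The symmetric direction (global image vector querying token-wise text) uses the same block with the arguments swapped.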
Figure 3. An illustration of the CNN-based semantic augmentation module, which obtains hierarchical semantic representations of the text.
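The following is a hypothetical sketch of the semantic augmentation idea: the hidden states from all stacked BERT blocks are compressed across the layer axis by a CNN and then passed token-wise through a Bi-LSTM. The 1×1 kernel, channel counts, and wiring are our assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SemanticAugmentation(nn.Module):
    def __init__(self, dim: int = 768, num_layers: int = 13):
        super().__init__()
        # Treat the 13 hidden-state tensors (embeddings + 12 blocks) as
        # channels and fuse them into one token-wise representation.
        self.layer_cnn = nn.Conv2d(num_layers, 1, kernel_size=(1, 1))
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, hidden_states):
        # hidden_states: tuple of 13 tensors, each (B, L, D), from BERT
        x = torch.stack(hidden_states, dim=1)   # (B, 13, L, D)
        x = self.layer_cnn(x).squeeze(1)        # (B, L, D) fused across layers
        out, _ = self.bilstm(x)                 # (B, L, D) with token-order context
        return out

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
batch = tokenizer(["an example news headline"], return_tensors="pt", padding=True)
hidden = bert(**batch).hidden_states           # 13 x (B, L, 768)
augmented = SemanticAugmentation()(hidden)     # (B, L, 768)
```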
Figure 4. The visualization of performance among the baselines and SAFE-GTA in terms of ROC curves.
Figure 5. Ablation comparisons among SAFE-GTA variants in terms of accuracy, F1-score on fake news, and F1-score on real news on three datasets.
Table 1. Detailed preprocessing steps for text (BERT) and images (ViT).

Step | Modality | Function Call
1 | Text | BertTokenizer.from_pretrained("bert-base-uncased")
2 | Text | tokenizer(raw_text, padding="max_length", truncation=True, max_length=128, return_tensors="pt")
3 | Image | ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
4 | Image | feature_extractor(raw_image, do_resize=True, size=224, do_center_crop=True, return_tensors="pt")
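The calls in Table 1 assemble into the short runnable snippet below. The sample text and image are placeholders, and the resize/crop flags are left to the extractor's defaults here, since exact keyword support varies across transformers versions.

```python
# Minimal preprocessing sketch following Table 1 (Hugging Face transformers).
from PIL import Image
from transformers import BertTokenizer, ViTFeatureExtractor

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

raw_text = "Example news headline for tokenization."  # placeholder text
text_inputs = tokenizer(
    raw_text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)  # -> input_ids and attention_mask of shape (1, 128)

raw_image = Image.new("RGB", (640, 480))  # stand-in for a news image
image_inputs = feature_extractor(
    raw_image,
    return_tensors="pt",
)  # defaults resize to 224x224 -> pixel_values of shape (1, 3, 224, 224)
```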
Table 2. The key hyperparameters and their corresponding values used in SAFE-GTA training, including epoch count, batch size, optimizer type, input dimension, learning rate, weight decay factor, and dropout rate, along with brief descriptions of their roles.

Hyperparameter | Setup | Description
Epochs | 50 | The number of complete passes through the training dataset during model training.
Batch Size | 16 | The number of training samples processed simultaneously in one forward/backward pass.
Optimizer | Adam | The algorithm used to update model weights based on the computed gradients.
Input Size | 768 | The dimensionality of the input features fed into the model.
Learning Rate | 3 × 10⁻⁵ | The step size in each iteration while moving toward a minimum of the loss function.
Weight Decay | 1 × 10⁻⁴ | A regularization technique that penalizes large weights to help prevent overfitting.
Dropout Rate | 0.4 | The proportion of neurons randomly set to zero during training to improve generalization.
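A minimal training-loop sketch consistent with Table 2 is shown below; the classifier head is a placeholder, while the optimizer, learning rate, weight decay, dropout rate, batch size, and epoch count follow the table.

```python
# Training-setup sketch matching Table 2; the model is a stand-in head over
# 768-dimensional fused features, not the full SAFE-GTA network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 256),     # input size 768 (Table 2)
    nn.ReLU(),
    nn.Dropout(p=0.4),       # dropout rate 0.4
    nn.Linear(256, 2),       # binary real/fake output
)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-5,                 # learning rate 3 x 10^-5
    weight_decay=1e-4,       # weight decay 1 x 10^-4
)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):                       # 50 epochs
    features = torch.randn(16, 768)           # placeholder batch of size 16
    labels = torch.randint(0, 2, (16,))       # placeholder labels
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```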
Table 3. The statistics of the three datasets adopted for our model validation, with a stratified train and test split.

Dataset | Real News (Training Set / Test Set) | Fake News (Training Set / Test Set)
PolitiFact | 647 / 75 | 1117 / 121
Gossipcop | 6253 / 767 | 2135 / 166
ReCOVery | 1105 / 123 | 194 / 22
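A stratified split such as the one summarized in Table 3 can be reproduced with scikit-learn as sketched below; the sample data and the 10% test fraction (inferred from the table's roughly 9:1 train/test ratios) are assumptions.

```python
# Stratified train/test split sketch (scikit-learn).
from sklearn.model_selection import train_test_split

news_items = [f"news_{i}" for i in range(1000)]   # placeholder samples
labels = [0] * 850 + [1] * 150                    # 0 = real, 1 = fake

train_x, test_x, train_y, test_y = train_test_split(
    news_items,
    labels,
    test_size=0.1,
    stratify=labels,      # preserve the real/fake ratio in both splits
    random_state=42,
)
print(len(train_x), len(test_x))  # 900 100
```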
Table 4. A comparison of results between four baselines and the proposed SAFE-GTA model evaluated on the PolitiFact, Gossipcop, and ReCOVery datasets. ∗ indicates a statistically significant difference between SAFE-GTA and the best baseline, CAFE, with a p-value less than 0.05 († and ‡, respectively, indicate that SAFE-GTA has a statistically significant difference from HMCAN and Spotfake+).
Dataset | Method | AUC | Accuracy | Real P | Real R | Real F1 | Fake P | Fake R | Fake F1
PolitiFact | Spotfake | 0.843 | 0.796 | 0.804 | 0.884 | 0.843 | 0.777 | 0.653 | 0.711
PolitiFact | Spotfake+ | 0.839 | 0.842 | 0.830 | 0.933 | 0.879 | 0.866 | 0.693 | 0.770
PolitiFact | HMCAN | 0.881 | 0.847 | 0.902 | 0.843 | 0.871 | 0.772 | 0.853 | 0.810
PolitiFact | CAFE | 0.895 | 0.882 | 0.865 | 0.958 | 0.909 | 0.919 | 0.760 | 0.832
PolitiFact | SAFE-concat | 0.882 | 0.857 | 0.855 | 0.926 | 0.889 | 0.862 | 0.747 | 0.800
PolitiFact | SAFE-GTA | 0.905 ∗,†,‡ | 0.888 ∗,†,‡ | 0.896 | 0.925 | 0.916 ∗,†,‡ | 0.873 | 0.826 | 0.849 ∗,†,‡
Gossipcop | Spotfake | 0.715 | 0.869 | 0.714 | 0.227 | 0.344 | 0.876 | 0.983 | 0.927
Gossipcop | Spotfake+ | 0.895 | 0.903 | 0.700 | 0.636 | 0.667 | 0.936 | 0.951 | 0.943
Gossipcop | HMCAN | 0.883 | 0.924 | 0.923 | 0.545 | 0.687 | 0.924 | 0.991 | 0.957
Gossipcop | CAFE | 0.902 | 0.903 | 0.900 | 0.409 | 0.563 | 0.903 | 0.991 | 0.945
Gossipcop | SAFE-concat | 0.905 | 0.903 | 0.654 | 0.773 | 0.708 | 0.958 | 0.927 | 0.942
Gossipcop | SAFE-GTA | 0.933 ∗,†,‡ | 0.938 ∗,†,‡ | 0.882 | 0.682 | 0.769 ∗,†,‡ | 0.945 | 0.984 | 0.964 ∗,†,‡
ReCOVery | Spotfake | 0.666 | 0.869 | 0.714 | 0.227 | 0.344 | 0.876 | 0.983 | 0.927
ReCOVery | Spotfake+ | 0.688 | 0.903 | 0.700 | 0.636 | 0.667 | 0.936 | 0.951 | 0.943
ReCOVery | HMCAN | 0.781 | 0.924 | 0.923 | 0.545 | 0.687 | 0.924 | 0.991 | 0.957
ReCOVery | CAFE | 0.815 | 0.903 | 0.900 | 0.409 | 0.563 | 0.903 | 0.991 | 0.945
ReCOVery | SAFE-concat | 0.614 | 0.823 | 0.509 | 0.181 | 0.267 | 0.844 | 0.962 | 0.900
ReCOVery | SAFE-GTA | 0.821 ∗,†,‡ | 0.937 ∗,†,‡ | 0.933 ∗,†,‡ | 0.636 ∗,†,‡ | 0.756 ∗,†,‡ | 0.938 ∗,†,‡ | 0.991 ∗,†,‡ | 0.964 ∗,†,‡
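For reference, the metrics reported in Table 4 (AUC, accuracy, and per-class precision/recall/F1) can be computed with scikit-learn as sketched below; the predictions here are random placeholders rather than the paper's model outputs.

```python
# Metric-computation sketch for AUC, accuracy, and per-class P/R/F1.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)      # 0 = real news, 1 = fake news
y_score = rng.random(200)                  # predicted probability of fake
y_pred = (y_score >= 0.5).astype(int)      # thresholded class predictions

auc = roc_auc_score(y_true, y_score)
acc = accuracy_score(y_true, y_pred)
# per-class precision/recall/F1: index 0 = real news, index 1 = fake news
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(f"AUC={auc:.3f} Acc={acc:.3f} real-F1={f1[0]:.3f} fake-F1={f1[1]:.3f}")
```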
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
