TLFND: A Multimodal Fusion Model Based on Three-Level Feature Matching Distance for Fake News Detection

Wang, Junda; Zheng, Jeffrey; Yao, Shaowen; Wang, Rui; Du, Hong

doi:10.3390/e25111533


by Junda Wang ¹,², Jeffrey Zheng ²,*, Shaowen Yao ¹, Rui Wang ² and Hong Du ²

¹ Engineering Research Center of Cyberspace, Yunnan University, Kunming 650091, China
² School of Software, Yunnan University, Kunming 650091, China
* Author to whom correspondence should be addressed.

Entropy 2023, 25(11), 1533; https://doi.org/10.3390/e25111533

Submission received: 16 October 2023 / Revised: 29 October 2023 / Accepted: 6 November 2023 / Published: 10 November 2023

(This article belongs to the Special Issue Application of Information Theory to Physical Modeling and State Awareness in Complex Systems)


Abstract

In the rapidly evolving information era, the dissemination of information has become swifter and more extensive. Fake news, in particular, spreads more rapidly and is produced at a lower cost compared to genuine news. While researchers have developed various methods for the automated detection of fake news, challenges such as the presence of multimodal information in news articles or insufficient multimodal data have hindered their detection efficacy. To address these challenges, we introduce a novel multimodal fusion model (TLFND) based on a three-level feature matching distance approach for fake news detection. TLFND comprises four core components: a two-level text feature extraction module, an image extraction and fusion module, a three-level feature matching score module, and a multimodal integrated recognition module. This model seamlessly combines two levels of text information (headline and body) and image data (multi-image fusion) within news articles. Notably, we introduce the Chebyshev distance metric for the first time to calculate matching scores among these three modalities. Additionally, we design an adaptive evolutionary algorithm for computing the loss functions of the four model components. Our comprehensive experiments on three real-world publicly available datasets validate the effectiveness of our proposed model, with remarkable improvements demonstrated across all four evaluation metrics for the PolitiFact, GossipCop, and Twitter datasets, resulting in an F1 score increase of 6.6%, 2.9%, and 2.3%, respectively.

Keywords:

fake news detection; deep learning; matching distance; multimodal integrated detection

1. Introduction

The popularization of the Internet has significantly increased the role of social platforms in people’s daily lives, serving as convenient channels for communication and information exchange [1]. Consequently, the number of social platform users has grown, leading to an exponential increase in the volume of information data. However, this growth has also given rise to a pervasive problem: news is disseminated on these platforms irrespective of its authenticity. This phenomenon is driven by the platforms’ pursuit of rapid development rather than a commitment to truth [2]. The spread of fake news poses a grave threat to information security and exerts a substantial influence on politics [3], the economy [4], and individuals’ well-being, resulting in immeasurable harm to society at large [5].

During the 2018 presidential elections in Brazil, false news regarding candidate Fernando Adula Bosonaro circulated [6]. These news articles claimed that Bosonaro’s opponent supported terrorism and abortion, leading to widespread misconceptions and prejudice against the opponent. In the same year, a fake news story about the Amazon rainforest fires spread rapidly on social media [7]. The fake news claimed that the Amazon rainforest fires were caused by arson by environmental groups, and this false information misinformed the public about rainforest conservation and sparked many false discussions and accusations. Additionally, in 2017, fake news falsely claimed that a bitcoin exchange had been hacked [8]. This resulted in a significant decline in the price of bitcoin, triggering panic among investors who hastily sold their bitcoins, thereby causing substantial volatility and losses within the bitcoin market.

As these problems have grown more serious, researchers have developed methods to detect fake news automatically [9,10]. Initially, fake news detection primarily relied on textual content analysis [11,12,13,14]. As machine learning techniques advanced, support vector machines (SVM) were employed to identify fake news, although SVMs alone may not fully capture the intricate relationships present in news texts [15]. Furthermore, the naive Bayesian classifier was also utilized as a baseline model, serving a similar purpose to SVM [16]. Subsequently, deep learning techniques gained prominence, with such models becoming prevalent across various tasks. For instance, Goldani et al. [17] successfully applied a capsule network, originally used in computer vision, to detect fake news. In a similar vein, Raza et al. [18] integrated news content and contextual information using the Transformer architecture for detection. Notably, as the inclination to share news articles with images grew, researchers recognized the potential of incorporating visual elements in the detection process.

Scholars have embarked on investigating the detection of fake news in a multimodal manner. Giachanou et al. [19] employed neural networks to integrate text, visual, and semantic components. They utilized pre-trained GoogleNews-vectors-negative word vector models and the lexicon of affective reasoning (VADER) to process textual information, while also incorporating image tagging and LBP for the visual aspect. Similarly, Xue et al. [20] took into account the extrinsic and intrinsic characteristics of fake news and combined them with the overall consistency to identify the differences in their features, designing a branching network for the images to enhance their message. Additionally, Wu et al. [21] capitalized on people’s news consumption habits by comparing text and images. They designed a two-layer co-attention mechanism for image–text information, employed VGG to model the latent features of images, employed CNN to process the frequency domain information of images, and used co-attention to fuse the frequency and latent domain information. Finally, they combined these features with BERT for fake news detection. However, all of these multimodal approaches overlooked the fact that news headlines also constitute textual content and failed to account for the presence of multiple images within a single news article. Consequently, these limitations have resulted in unsatisfactory detection outcomes.

To address the following issues: (1) the limited utilization of multimodal information in unimodal detection, leading to unsatisfactory results; (2) the underutilization of multimodal information in recent multimodal detection approaches, such as neglecting headlines and multiple images; and (3) the simplistic fusion and detection of multimodal information without comprehensive consideration of matching distances between features, we propose an innovative model called the Three-Level Feature-Based Matching Distance Multimodal Fusion Model (TLFND). TLFND is based on RoBERTa [22] for extracting two levels of text features (title and body) and VGG-19 for obtaining image features. We incorporate bidirectional long short-term memory (BILSTM) in image fusion to enhance the comprehensiveness of image features. Furthermore, we bring the three features into a unified dimensional space and calculate the Chebyshev distance metric between them to determine the veracity of news. The model is optimized using a hinge loss function with a predefined threshold. Specifically, for fake news with a label value of 0, we aim to maximize the distances, while for true news with a label value of 1, we strive to minimize the three-way distance below the threshold.

The main contributions of this paper are summarized as follows:

Taking into account the strengths and weaknesses of previous research, we have developed an innovative model called TLFND (Three-Level Feature Similarity Distance-based Multimodal Fusion Model) for detecting fake news. This model effectively harnesses the power of news headlines, textual content, and multiple images to accurately identify fake news.
We propose an auxiliary task that uses the Chebyshev distance metric function as a distance metric between multiple modalities; calculates the distance to the center of mass using headlines, text, and fused images; optimizes the model using HingeEmbeddingLoss with set thresholds; and finally, designs an adaptive evolutionary algorithm to calculate the loss functions of the four components in order to detect false news more accurately.
We conducted a comprehensive evaluation of the TLFND model using three widely used real-world datasets (PolitiFact, GossipCop, and Twitter). Through multiple sets of experiments and the incorporation of various evaluation metrics, our findings consistently demonstrated the superior performance of TLFND when compared to the current state-of-the-art multimodal fake news detection models.

The following sections of this paper are organized as follows: In Section 2, we provide a comprehensive review of previous research efforts in the field of fake news detection. Section 3 focuses on the symbolic representations used in this paper and discusses the structural framework of the TLFND model. In Section 4, we delve into the various components of the TLFND model, providing detailed explanations of the technical aspects involved. Section 5 presents the experimental setup, including the dataset, experimental parameters, baseline, and comparison results. Finally, Section 6 concludes our study with a comprehensive summary of our findings.

2. Related Work

In this section, we provide a comprehensive overview of the existing methods proposed for the automatic detection of fake news. We categorize these methods into two groups: unimodal detection methods and multimodal detection methods. For each method, we examine the key techniques employed and the progress made in the field. Additionally, we emphasize the strengths and limitations of each approach.

2.1. Single-Mode Detection Method

The unimodal approach to fake news detection focuses on analyzing a single modality, such as text, image, or video, by identifying the features within that modality. For instance, Ma et al. [23] employed deep learning techniques, including RNN, LSTM, and GRU, to convert text into a vector representation and feed it into a classifier for results. However, they overlooked other important textual information such as headlines and comments. In a novel approach, Yu et al. [24] developed a perceptual framework that combines the news environment and domain history. By considering the contextual factors, they designed a perceptual recognition module and achieved significant improvements by incorporating domain fusion. Kausar et al. [25] utilized n-grams with TF-IDF for word embedding to extract content features and trained LSTM and BERT models to process news contextual features. They then employed a feedforward neural network for classification. However, their approach did not consider the comprehensive utilization of multiple textual features. Addressing the issue of biased information, Liao et al. [26] studied topic tags and authors’ historical postings. They employed representation learning and multi-task learning, combined with a dynamic weighting strategy to reduce risks, and achieved a promising detection performance. Bazmi et al. [27] recognized the credibility disparities between users and news based on differences in news topic viewpoints. By integrating the viewpoint, source bias, and user bias, they constructed three corresponding components and applied joint coding to detect biases through simulated interactions. However, their method did not fully consider the dynamic nature of news dissemination. To address this limitation, Song et al. [28] proposed a dynamic graph neural network that incorporates temporal information from news dissemination graphs. 
By generating dynamic representations using a perception module, they improved false news detection by capturing the temporal dynamics. In another approach, Kumar et al. [29] developed a hybrid model that optimized feature selection based on usefulness. They initially extracted features using TF-IDF for weighting and then employed the MGO algorithm to select the most salient features. Finally, they combined the selected features for detection. However, their approach did not fully exploit the multimodal nature of news, which often contains multiple types of information. A common drawback in the aforementioned studies is that they only extract features from a single modality for prediction, while contemporary news encompasses multiple modalities. This limitation hinders the comprehensive utilization of news information for effective detection.

2.2. Multimodal Detection Methods

To overcome the limitations of single modality approaches, researchers have increasingly turned to multimodal techniques. Advanced feature extractors such as BERT and Transformer for text and VGG and ResNet for visuals have emerged to facilitate convenient multimodal fusion in detection tasks. Song et al. [30] utilized multimodal adversarial multitask learning to capture and homogenize news article feature distributions from different domains. They employed the transformer KAT to enhance the selective embedding of entities from external knowledge graphs, resulting in improved performance. The attention mechanism has also been explored for news tasks. Guo et al. [31] employed an attention mechanism neural network to enhance modal fusion. They designed a structured framework to protect middle-layer information, preventing information loss during fusion and thereby enhancing detection. In a different approach, Li et al. [32] proposed semantic enhancement for fusion. They combined textual, visual, and semantic information by leveraging snapshot techniques, which involved integrating these different modalities. They designed an adaptive network to classify special and shared features, reducing errors in the fusion process. Wang et al. [33] not only fused different modalities but also considered various image features. They incorporated attention mechanisms to jointly combine correlations and dependencies between the features, enhancing fusion in the process. Metric calculations have also been introduced. Chen et al. [34] proposed aligning modal features by mapping features from different modalities into an embedding space. They measured the distribution between single modalities using KL divergence. However, these models often require complex parameter settings. To address this, Singh et al. [35] designed a stacked framework. For text processing, they employed BERT and ELECTRA, while for images, they utilized the efficient NasNet Mobile.
This framework not only achieved excellent detection results but also reduced the overall parameters by about 20%, image parameters by 2%, and text parameters by 60%.

3. Problem Statement

In this section, we discuss the symbolic representations utilized in this paper and describe the application process of the proposed TLFND model. In a news article, various components such as the title, content, image, and label (indicating true or false news) are present. We represent the collection of news articles as $N_{i} = \{(T_{i}, C_{i}, S_{i}\{V_{j}, V_{j+1}, \dots\}, B_{i})\}$, the title as $T_{i}$, the content as $C_{i}$, the collection of images as $S_{i}$, a single image in a news article as $V_{i}$, and the label as $B_{i}$. A false news article is denoted by $B_{i} = 0$, while a true news article is represented by $B_{i} = 1$.

To begin, we extract the title features $T_{i}^{r}$ from $T_{i}$ and the content features $C_{i}^{r}$ from $C_{i}$ using the RoBERTa model (described in Section 4.1). Subsequently, we extract the features $V_{i}^{r}$ of each individual image $V_{i}$ using the pre-trained VGG-19 model (described in Section 4.2). Afterward, we fuse the multiple $V_{i}^{r}$ features into $S_{i}^{r}$ with a BILSTM model operating in two opposite directions (described in Section 4.2). Next, we concatenate $T_{i}^{r}$, $C_{i}^{r}$, and $S_{i}^{r}$ to create a multimodal feature vector denoted as $E_{i}^{r}$. Finally, we input $E_{i}^{r}$ and $B_{i}$ into the matching distance calculation module (described in Section 4.3).
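The notation of this section can be mirrored in a small data structure. The sketch below is illustrative only: the class name `NewsItem`, its field names, and the helper `has_images` are assumptions introduced here and are not part of TLFND itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NewsItem:
    """One news article N_i in the notation of this section (hypothetical container)."""

    title: str    # T_i
    content: str  # C_i
    label: int    # B_i: 0 = false news, 1 = true news
    images: List[str] = field(default_factory=list)  # S_i = {V_j, V_{j+1}, ...} as file paths

    def has_images(self) -> bool:
        """Return True when the article carries at least one image."""
        return len(self.images) > 0


# Example article with two images, labeled as true news.
item = NewsItem(title="Headline", content="Body text", label=1,
                images=["a.jpg", "b.jpg"])
print(item.has_images(), len(item.images))
```

Storing the images as a list keeps the multi-image case explicit, which is the case the image fusion module is meant to handle.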

4. Proposed Method

To address the limitations of previous studies, we propose a new TLFND model that combines the RoBERTa model and the VGG-19 model. We recognize that a news article contains not only the body text but also the headline, and there may exist inconsistencies among the headline, the body text, and the accompanying image. However, many researchers have overlooked this issue and focused solely on the matching distance between the body text and the image. Additionally, news articles often include multiple images, yet researchers often only consider the first image when constructing their datasets. To address these challenges, we extract text features (headline and body text) and fuse them with multi-image features. Finally, we utilize the Chebyshev distance metric to calculate the matching distance of the fused vector features, enabling the detection of false news. The overall structure of our proposed TLFND model is illustrated in Figure 1.

The model consists of the following parts:

Two-level text feature extraction module.
Image extraction and fusion module.
Three-level feature matching distance module.
Multimodal integration recognition module.

4.1. Two-Level Text Feature Extraction Module

The traditional technique Word2Vec [36] represents each word as an independent vector, without considering the intricate relationships between words and their surrounding contexts. Notably, the BERT model [37] has demonstrated promising results across various tasks. To achieve optimal detection performance, we leverage an enhanced variant of the BERT model, namely RoBERTa. The RoBERTa model surpasses its predecessor by utilizing a larger dataset and longer training time, enabling it to learn deeper linguistic representations and richer semantic information [38]. Moreover, RoBERTa performs exceptionally well in sentiment analysis, thanks to its deeper language representations [39]. Furthermore, some researchers have suggested that RoBERTa exhibits advantages over other state-of-the-art Transformer architectures in tasks related to text classification and detection [40,41]. On top of RoBERTa, we define two fully connected layers, each comprising a linear layer and a ReLU activation function. Dropout layers are employed to randomly discard output from the RoBERTa model, thereby mitigating the risk of overfitting.

4.2. Image Extraction and Fusion Module

VGG [42] is a convolutional neural network architecture developed at Oxford University, and VGG-19 is a specific variant of VGG whose simple structure makes its working principles and parameters easy to understand. It also exhibits a strong generalization capability, which makes it a popular base model for a variety of tasks. The VGG-19 model stacks multiple convolutional layers with a $3 \times 3$ kernel size and pooling operations with a $2 \times 2$ size. This design enhances the expressiveness of the model and its ability to extract features. In our approach, we avoid converting the image information into textual representations, as this may lead to information loss. Instead, we input each image into the VGG-19 model, remove its classifier layer, and then pass the extracted features through a fully connected layer to map them to a lower-dimensional representation. Subsequently, the per-image features are passed through a BILSTM, whose hidden states in the two opposite directions are combined and mapped to the final $S^{r}$.
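One possible way to write the multi-image fusion is sketched below. The symbols $k$ (number of images in the article), $v_{j}$, $h_{j}$, $W_{v}$, $b_{v}$, $W_{s}$, and $b_{s}$ are introduced here for illustration only, and combining the two final hidden states by concatenation is an assumption; the text states only that the two directions are combined and mapped to $S^{r}$.

```latex
\begin{aligned}
v_{j} &= W_{v}\,\mathrm{VGG19}(V_{j}) + b_{v}, \qquad j = 1, \dots, k \\
\overrightarrow{h}_{j} &= \overrightarrow{\mathrm{LSTM}}\big(v_{j}, \overrightarrow{h}_{j-1}\big) \\
\overleftarrow{h}_{j} &= \overleftarrow{\mathrm{LSTM}}\big(v_{j}, \overleftarrow{h}_{j+1}\big) \\
S^{r} &= W_{s}\,\big[\overrightarrow{h}_{k} \,;\, \overleftarrow{h}_{1}\big] + b_{s}
\end{aligned}
```

Under this reading, an article with a single image ($k = 1$) still produces a valid $S^{r}$, so no special case is needed for single-image news.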

4.3. Three-Level Feature Matching Distance Module

We apply fully connected layers to map $T^{r}$, $C^{r}$, and $S^{r}$ to a shared space. These mapped representations are then concatenated to form a multimodal feature vector $E^{r}$. $E^{r}$ is a three-dimensional matrix with a shape of $p \times q \times n$, where $p$ represents the number of samples, $q$ represents the number of text and image features, and $n$ represents the dimension of the multimodal space (set to 128). To calculate the center of mass $M$ in dimension 1, we utilize Equation (1), where $L$ denotes the number of samples and $x_{i}$ denotes the feature vector of the $i$th sample. Following this, the center of mass $M$ is transformed into a matrix of the same size as $E^{r}$, as shown in Equation (2). We then employ the Chebyshev distance to compute the distance between $E^{r}$ and $M_{\mathrm{reshape}}$, as illustrated in Equation (3). Here, $\mathrm{PT}$ denotes the Chebyshev distance function, $i$ indexes the samples, and $L$ represents the number of samples. Finally, we compute the average HingeEmbeddingLoss over all sample pairs in $\mathrm{dist\_mat}$ using PyTorch’s built-in loss function, HingeEmbeddingLoss, as shown in Equation (4). We set a predefined threshold value to minimize the distance between matched sample pairs (label value = 1) and maximize the distance between mismatched sample pairs (label value = 0). The objective of using HingeEmbeddingLoss is therefore to optimize the model parameters so that the distance between positive samples is minimized while the distance between negative samples is maximized. This optimization aims to enhance the performance and discriminative power of the model.

M = \frac{1}{L} \sum_{i = 1}^{L} x_{i}

(1)

M_{\mathrm{reshape}} = M_{p \times q \times n}

(2)

\mathrm{dist\_mat} = \frac{1}{L} \sum_{i = 1}^{L} \mathrm{PT}(E^{r}, M_{\mathrm{reshape}})_{i}

(3)

\mathrm{Hing\_ls} = \frac{1}{L} \sum_{i = 1}^{L} \mathrm{HINGE\_LOSS}(y_{i}, \mathrm{dist\_mat}_{i})

(4)
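Equations (1)-(4) can be illustrated with a small pure-Python sketch. It follows one reading of the text, namely that the center of mass is taken over the feature vectors of a single article (title, body, fused image) and that the matching distance of the article is the mean Chebyshev distance to that center. All function names are hypothetical. PyTorch's `HingeEmbeddingLoss` expects targets of 1 and -1, so mapping the label value 0 to -1 is a further assumption.

```python
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Equation (1): element-wise mean of the feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def chebyshev(a: Vector, b: Vector) -> float:
    """Chebyshev (L-infinity) distance: the largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def matching_distance(features: Sequence[Vector]) -> float:
    """Equation (3) under the stated assumption: mean Chebyshev distance to the centroid."""
    center = centroid(features)
    return sum(chebyshev(f, center) for f in features) / len(features)


def to_hinge_target(label: int) -> int:
    """Assumed mapping from the label values 1/0 in the text to hinge targets +1/-1."""
    return 1 if label == 1 else -1


def hinge_embedding_loss(distances: Sequence[float], targets: Sequence[int],
                         margin: float = 1.0) -> float:
    """Equation (4): mean of d for target +1 and max(0, margin - d) for target -1."""
    total = 0.0
    for d, y in zip(distances, targets):
        total += d if y == 1 else max(0.0, margin - d)
    return total / len(distances)


# Toy 2-dimensional title, body, and image features for one article.
features = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
print(matching_distance(features))  # mean of the distances 1, 1 and 2
print(hinge_embedding_loss([0.5, 0.25], [to_hinge_target(1), to_hinge_target(0)]))
```

The margin plays the role of the predefined threshold: mismatched pairs contribute to the loss only while their distance is still below it.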

4.4. Multimodal Integration Recognition Module

Combining the aforementioned three modules, we arrive at the central component of this paper, namely, a multimodal integrated recognizer. First, we pass the outputs of the three modules mentioned above into the multimodal integrated recognition module. Three prediction results ($N_{TC}$, $N_{S}$, $N_{E}$) are defined to compute the final loss function. The text feature vector and title feature vector are concatenated and subjected to a linear transformation through the fully connected layer, yielding $N_{TC}$, as depicted in Equation (5). Similarly, the image features undergo a linear transformation to obtain $N_{S}$, as shown in Equation (6). Furthermore, the fused multimodal features are linearly transformed to obtain $N_{E}$, as illustrated in Equation (7). Each transformation uses its own weight matrix and bias vector.

N_{TC} = [T^{r}, C^{r}] W_{TC} + B_{TC}

(5)

N_{S} = S^{r} W_{S} + B_{S}

(6)

N_{E} = E^{r} W_{E} + B_{E}

(7)
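Equations (5)-(7) are affine maps applied to different inputs. The plain-Python sketch below shows the shape of the computation for Equation (5); the helper name `linear` and all numeric values are toy assumptions, not trained parameters.

```python
from typing import List, Sequence


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Compute x W + b for one feature vector x and a weight matrix of shape (len(x), out)."""
    return [
        sum(xi * w for xi, w in zip(x, column)) + b
        for column, b in zip(zip(*weight), bias)
    ]


# Toy title and body features (1-dimensional each) and a 2-class output head.
title_features = [1.0]
body_features = [2.0]
w_tc = [[1.0, 0.0],
        [0.0, 1.0]]
b_tc = [0.5, -0.5]

# Equation (5): concatenate title and body features, then apply the affine map.
n_tc = linear(title_features + body_features, w_tc, b_tc)
print(n_tc)
```

The same helper would serve Equations (6) and (7) with the image features and the fused features as input, each with its own weight matrix and bias vector.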

4.5. Adaptive Evolutionary Loss

In the TLFND fake news detection model, the design of the overall loss function is crucial. We calculate a total of four partial loss functions and design an adaptive evolutionary algorithm to combine them, as shown in Algorithm 1. Weight adjustments are generated from the relative differences of the loss values, so that each loss is weighted according to its relative importance and the contributions of the different losses are better balanced. Finally, a weighted summation is performed using the weight-adjusted loss values and the corresponding weights. Using the symbolic representation above, we set the following four loss values:

$LN_{TC}$: cross-entropy loss calculation for $N_{TC}$;
$LN_{S}$: cross-entropy loss calculation for $N_{S}$;
$LN_{E}$: cross-entropy loss calculation for $N_{E}$;
$LN_{M}$: matching distance loss calculation (Section 4.3).

For $LN_{TC}$, $LN_{S}$, and $LN_{E}$, the calculation is shown in Equation (8).

LN = -\sum y \ln(p)

(8)
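Equation (8) and the weighted summation described in this section can be sketched as follows. The weighting rule used here, in which each loss is weighted by its share of the summed losses, is only an assumption chosen for illustration and should not be read as the procedure of Algorithm 1. The function names are hypothetical as well.

```python
import math
from typing import List, Sequence


def cross_entropy(y_true: Sequence[float], probs: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Equation (8): -sum(y * ln(p)) for a one-hot label and class probabilities."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))


def relative_weights(losses: Sequence[float]) -> List[float]:
    """Assumed rule: weight each loss by its share of the total loss."""
    total = sum(losses)
    if total == 0:
        # All losses are zero, so fall back to uniform weights.
        return [1.0 / len(losses)] * len(losses)
    return [value / total for value in losses]


def combined_loss(losses: Sequence[float]) -> float:
    """Weighted sum of the partial losses using the assumed relative weights."""
    weights = relative_weights(losses)
    return sum(w * value for w, value in zip(weights, losses))


# One-hot label for class 1 and a predicted distribution over two classes.
ln_example = cross_entropy([0.0, 1.0], [0.25, 0.75])
print(ln_example)
print(relative_weights([1.0, 3.0]))
print(combined_loss([1.0, 3.0]))
```

Under this assumed rule, larger losses receive larger weights, so the optimizer focuses on the component that currently fits worst; other rules that balance the four losses differently are equally compatible with the description above.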

Algorithm 1 Loss adaptive evolutionary algorithm

5. Experiments and Results

In this section, we provide a comprehensive overview of the experimental setup, which encompasses the dataset description, parameter settings, and a comparison of baselines along with the evaluation metrics.

5.1. Datasets

We validated the performance of the TLFND model using a fake news detection database called FakenewsNet DataSet (consisting of two datasets, PolitiFact and GossipCop) [43] in conjunction with a dataset, Twitter, applied to the Scalable Multimedia task of MediaEval 2016 [44].

PolitiFact dataset. The PolitiFact dataset focuses on politically relevant news and incorporates fact-checking information gathered from the PolitiFact website (www.politifact.com accessed on 15 June 2023). It includes news articles, fact checks, and associated comments from social media platforms. Fact checkers have rated each news article as true, partially true, or false. Each news article within this dataset contains various details, such as text content, news image address, posting time, author information, and social media comments.
GossipCop dataset. The GossipCop dataset, on the other hand, revolves around entertainment and celebrity news. It consists of entertainment news articles and related comments obtained from the GossipCop website (www.gossipcop.com accessed on 15 June 2023). Similar to the PolitiFact dataset, each news article within the GossipCop dataset includes information like text content, news image address, posting time, author information, and associated social media comments. The news items in this dataset have been rated as either true or false.
Twitter dataset. The Twitter dataset is an English-language dataset containing images and text, released for the MediaEval multimedia task. To validate the performance of TLFND in multiple scenarios, we use only tweets containing both text and image content and filter out other tweets (e.g., those with video).

To adapt the TLFND model, we extracted a subset of data from the FakenewsNet DataSet, which is a database specifically designed for fake news detection. We focused on collecting news headlines, content, and images while excluding publication time, author information, and user comments, as these data were not relevant to our model. Additionally, we removed news articles that did not contain any images. In total, we collected 853 data samples from the PolitiFact dataset, consisting of 529 real news items and 326 fake news items. Similarly, we collected 15,613 data samples from the GossipCop dataset, including 12,119 real news items and 3494 fake news items. Finally, we collected 11,261 data samples from the Twitter dataset, including 4425 real items and 6836 fake items. All three datasets were divided into training and test sets using a 5:1 ratio, as presented in Table 1.
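A 5:1 split of this kind can be sketched in a few lines. The function name `split_five_to_one`, the shuffling step, and the fixed seed are assumptions made for illustration; the text does not state how the samples were ordered before splitting.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_five_to_one(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the items reproducibly and split them into train/test at a 5:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 5 // 6  # five sixths of the data go to training
    return shuffled[:cut], shuffled[cut:]


train, test = split_five_to_one(range(12))
print(len(train), len(test))
```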

5.2. Evaluation Metrics

In this section, evaluation metrics for assessing the fake news detection model TLFND, including accuracy, recall, precision, and F1 score, are presented in Equations (9), (10), (11), and (12), respectively.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(9)

Recall = \frac{TP}{TP + FN}

(10)

Precision = \frac{TP}{TP + FP}

(11)

F1\ Score = \frac{2 \times (Precision \times Recall)}{Precision + Recall}

(12)
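The four metrics can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions, with precision taken as TP / (TP + FP); the function name `classification_metrics` and the zero-division fallbacks are choices made for this example.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Return accuracy, recall, precision, and F1 score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Toy counts: 8 true positives, 6 true negatives, 2 false positives, 4 false negatives.
print(classification_metrics(tp=8, tn=6, fp=2, fn=4))
```

On imbalanced data such as GossipCop, where real items far outnumber fake ones, accuracy alone can look favorable even when one class is poorly detected, which is why recall, precision, and F1 are reported alongside it.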

5.3. Model Parameters

We conducted our experiments on an NVIDIA RTX 4090 GPU with 24 GB of memory, utilizing CUDA version 11.0 for implementation. In our setup, we defined the maximum input length for news headlines as 100 and for news content as 510. We extracted headline and content features with dimensions of 128 and 512, respectively. The input images were resized to $224 \times 224$ pixels. Subsequently, the images were processed through two fully connected layers of VGG-19, with the first fully connected layer having a dimension of 1024 and the second fully connected layer outputting a dimension of 256. To prevent overfitting and enhance model convergence, we employed the AdamW optimizer with a learning rate of $1 \times 10^{-5}$. Furthermore, to improve the generalization ability of the model, we applied a dropout rate of 0.4. For all three datasets, we used an approximate ratio of 5:1 to divide the data into training and test sets. During the training phase, each batch consisted of sixteen samples (train_batch_size), while during the evaluation phase, each batch contained one sample (eval_batch_size). We trained for a total of 50 epochs, with the specific parameter settings detailed in Table 2.
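The settings stated in this section can be gathered into a single configuration mapping. The values below are taken from the text; the dictionary name `TLFND_CONFIG` and the key names other than `train_batch_size` and `eval_batch_size` are hypothetical.

```python
# Hyperparameters as stated in Section 5.3; key names are illustrative assumptions.
TLFND_CONFIG = {
    "max_title_length": 100,    # maximum input length for headlines
    "max_content_length": 510,  # maximum input length for body text
    "title_feature_dim": 128,
    "content_feature_dim": 512,
    "image_size": (224, 224),   # input images are resized to 224 x 224 pixels
    "image_fc_dims": (1024, 256),
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "dropout": 0.4,
    "train_batch_size": 16,
    "eval_batch_size": 1,
    "epochs": 50,
}

for name, value in TLFND_CONFIG.items():
    print(f"{name}: {value}")
```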

5.4. Baselines

We conducted experiments using the FakenewsNet database (PolitiFact and GossipCop) and the Twitter dataset. We compared the results of two types of models: unimodal detection models and multimodal detection models.

5.4.1. Single-Mode Detection

SVM [15]: The SVM classifier classifies news articles by identifying an optimal hyperplane within the feature space, which includes relevant features such as word frequency and word vectors, social media features like the number of retweets and likes, and structural features such as headline length and paragraph structure. In our dataset, we utilize text features to train the SVM model and achieve a good fit.

CNN [45]: We employed a state-of-the-art shallow convolutional neural network (CNN) specifically designed for this type of task, which previously secured first place in the 9th International Competition on Authorship [46]. This network processes the samples from the dataset, constructs a dictionary, and customizes a preprocessing function for generating n-grams. Finally, the output is fed into a fully connected global pooling layer.

VGG-19 [42]: VGG-19 is a variant of the VGG model that can effectively extract high-level features of images and capture details and semantic information in images by stacking multiple convolutional and pooling layers.

5.4.2. Multimodal Detection

SAFE [47]: The model is a representative work for recognizing the similarity of news images and texts. The image2text model is used to transform the images into corresponding headlines, which are then mapped to the same vector space as the text, and an improved cosine similarity is proposed for recognition.

SpotFake+ [48]: The authors summarize the drawbacks of the previous year’s study and introduce transfer learning into the model, training it to recognize semantics and explore contextual relationships. Images and text are processed using VGG and XLNet, and the results are fed into a fully connected layer for classification.

DEFD [49]: The model utilizes the integration of deep learning and attention mechanisms to obtain text features from pre-trained XLNet and image features from pre-trained VGG-19. A hybrid feature loss function is designed to reduce the classification error, and finally, the final prediction results are output using a weighting mechanism.

BCMF [50]: The method, published in a journal with a 5-year average impact factor of 7.4, uses a pre-trained visual model, DeiT, to process images and BERT to process text, and proposes a novel bidirectional mechanism that loops from text to image and from image to text. Finally, detection is performed by an FFN whose output is fed into the Softmax function.

5.5. Experimental Results and Analysis

To verify the performance of the TLFND model from multiple angles, we tested it in three different groups of experiments.

5.5.1. Comparison Experiments

We use the PolitiFact, GossipCop, and Twitter datasets, evaluate with the metrics defined in Section 5.2, and compare TLFND against the two types of baseline models described in Section 5.4. The results are shown in Table 3, Table 4 and Table 5.

According to the experimental results, TLFND outperforms the other unimodal and multimodal methods on all four evaluation metrics for all three datasets. Specifically, TLFND achieved an accuracy of 94.4% and an F1 score of 97.0% on the PolitiFact dataset, an improvement of 2.1 percentage points in accuracy and 5.7 percentage points in F1 score over the best baseline value for each metric. On the GossipCop dataset, TLFND achieved an accuracy of 90.9% and an F1 score of 93.9%, an improvement of 1.8 percentage points in accuracy and 2.9 percentage points in F1 score over the best baseline. On the Twitter dataset, TLFND achieved an accuracy of 83.1% and an F1 score of 83.7%, an improvement of 1.6 percentage points in accuracy and 2.3 percentage points in F1 score over the best baseline. To visualize these differences, we include line graphs of the experimental results on the three datasets (refer to Figure 2, Figure 3 and Figure 4).

Multimodal methods, such as the SAFE model, outperform unimodal methods, such as the CNN model, on all three datasets. The SAFE model represents a significant advance in using both news headlines and images. However, it does not compare images with text directly; instead, it first transforms images into text. This transformation can lose thematic meaning and may be one reason for the SAFE model’s relatively poor performance among the multimodal methods. The CNN model is the most effective of the unimodal methods, but its performance is still inferior to that of the MAVE model, which further confirms that fusing information from different modalities can improve the performance and accuracy of a model.
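The reported margins can be rechecked directly from the tables. The following is a minimal sketch, not part of the original experiments: the accuracy and F1 values are copied from Tables 3 to 5, while the names `RESULTS`, `METRICS`, and `margin_over_best_baseline` are hypothetical helpers introduced only for this illustration.

```python
# (accuracy, F1) pairs copied from Tables 3-5; "TLFND" is the proposed model.
RESULTS = {
    "PolitiFact": {
        "SVM": (0.589, 0.617), "CNN": (0.749, 0.811), "VGG-19": (0.649, 0.720),
        "SAFE": (0.874, 0.896), "SpotFake+": (0.882, 0.913),
        "DEFD": (0.855, 0.761), "BCMF": (0.923, 0.904), "TLFND": (0.944, 0.970),
    },
    "GossipCop": {
        "SVM": (0.498, 0.656), "CNN": (0.796, 0.863), "VGG-19": (0.775, 0.836),
        "SAFE": (0.838, 0.895), "SpotFake+": (0.856, 0.898),
        "DEFD": (0.853, 0.725), "BCMF": (0.891, 0.910), "TLFND": (0.909, 0.939),
    },
    "Twitter": {
        "SVM": (0.536, 0.555), "CNN": (0.683, 0.652), "VGG-19": (0.604, 0.616),
        "SAFE": (0.762, 0.761), "SpotFake+": (0.790, 0.788),
        "DEFD": (0.809, 0.808), "BCMF": (0.815, 0.814), "TLFND": (0.831, 0.837),
    },
}
METRICS = {"accuracy": 0, "f1": 1}  # position of each metric in the tuples


def margin_over_best_baseline(dataset, metric):
    """Return (best baseline name, TLFND margin in percentage points)."""
    idx = METRICS[metric]
    scores = RESULTS[dataset]
    baselines = {name: vals[idx] for name, vals in scores.items() if name != "TLFND"}
    best = max(baselines, key=baselines.get)
    gap = scores["TLFND"][idx] - baselines[best]
    return best, round(100 * gap, 1)


if __name__ == "__main__":
    for dataset in RESULTS:
        for metric in METRICS:
            best, gap = margin_over_best_baseline(dataset, metric)
            print(f"{dataset:10s} {metric:8s} best baseline={best:9s} margin={gap} pp")
```

Note that the best baseline can differ by metric: on PolitiFact, the strongest baseline accuracy belongs to BCMF, whereas the strongest baseline F1 score belongs to SpotFake+.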

Overall, the TLFND model performs best, which we attribute mainly to the following reasons:

TLFND extracts text features using the RoBERTa model, which is trained for more iterations than the BERT model and uses a dynamic masking strategy, thus improving the model’s ability to understand and generalize from context. The text features cover not only the body content but also the title, which is very important for news: the keywords, event descriptions, and other information in the title also shape the body content. We use the VGG-19 model to extract image features. A news article usually contains multiple images, yet the existing multimodal approaches do not use multiple images; we therefore use BILSTM to fuse the features of multiple images for a better understanding of the news.
We designed a multimodal matching distance module that takes the title features, body features, and fused image features and concatenates them into a multimodal feature vector. We then use the Chebyshev distance to calculate the distance of the three features from their centroid and set a threshold: if the sample pair is labeled as true news, the distance between the sample pair should be as small as possible relative to the threshold, whereas if it is labeled as false news, the distance should be as large as possible relative to the threshold. This design should improve the performance and discriminative ability of the model.
The loss function plays a crucial role in multimodal false news detection, since it measures the difference between the model’s predictions and the true labels; a loss function suited to the model enables it to learn more accurate and reliable predictions. The loss in the TLFND model has four parts: F1, a cross-entropy loss on the linear transformation (through the fully connected layer) of the concatenated text and title feature vectors; F2, a cross-entropy loss on the linear transformation of the image features; F3, a cross-entropy loss on the fused multimodal features; and F4, the matching distance loss calculated using the HingeEmbeddingLoss function. Moreover, a dynamic weighting algorithm weights the loss values according to their relative importance to better balance their contributions. Finally, the weighted loss values are summed to give the combined loss.
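The matching distance and the combined loss described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: averaging the three centroid distances, the margin value, and the proportional weighting rule in `combined_loss` are all assumptions, since the text here does not give those details. The hinge term follows the usual definition of a hinge embedding loss (the distance itself for label +1, and `max(0, margin - distance)` for label -1).

```python
def chebyshev(u, v):
    """Chebyshev distance: largest absolute per-dimension difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def matching_distance(title, body, image):
    """Mean Chebyshev distance of the three features from their centroid.

    Assumption: the three distances are averaged; the source only says the
    distance of the three features from the centre of mass is computed.
    """
    center = centroid([title, body, image])
    return sum(chebyshev(vec, center) for vec in (title, body, image)) / 3


def hinge_embedding_loss(distance, label, margin=1.0):
    """label +1 (real news): penalise the distance itself.
    label -1 (fake news): penalise distances that fall below the margin."""
    if label == 1:
        return distance
    return max(0.0, margin - distance)


def combined_loss(losses):
    """Weighted sum of the four loss terms.

    Assumption: each weight is the term's share of the total loss; this is
    a stand-in for the dynamic weighting algorithm, not its definition.
    """
    total = sum(losses)
    if total == 0:
        return 0.0
    weights = [value / total for value in losses]
    return sum(w * value for w, value in zip(weights, losses))


if __name__ == "__main__":
    title, body, image = [0.0, 0.0], [0.0, 0.0], [3.0, 0.0]
    d = matching_distance(title, body, image)
    print("matching distance:", d)
    print("loss if real:", hinge_embedding_loss(d, 1))
    print("loss if fake:", hinge_embedding_loss(d, -1))
```

In this toy example the image feature disagrees with the two text features, so the distance is large: that is penalised when the label is real and costs nothing when the label is fake.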

5.5.2. Ablation Study

To assess how much each component contributes to the overall performance, and to clarify which factors matter most in the model, we conducted an ablation study.

The TLFND variants are as follows:

TLFND-T: removed the visual information, multimodal information, and matching distance and used only text information.
TLFND-C: removed the text information, multimodal information, and matching distance and used only visual information.
TLFND-E: removed the matching distance and used text information, image information, and multimodal fusion.

The experimental results are shown in Table 6.
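The variants can also be written down as a small configuration table, which makes explicit what each one switches off. The sketch below is hypothetical: `AblationConfig`, `VARIANTS`, and `active_loss_terms` are names invented for this illustration, and the mapping from enabled components to loss terms is an assumption rather than something stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Which parts of the model are enabled in a given variant."""
    use_text: bool
    use_image: bool
    use_fusion: bool
    use_matching_distance: bool


VARIANTS = {
    "TLFND-T": AblationConfig(use_text=True, use_image=False,
                              use_fusion=False, use_matching_distance=False),
    "TLFND-C": AblationConfig(use_text=False, use_image=True,
                              use_fusion=False, use_matching_distance=False),
    "TLFND-E": AblationConfig(use_text=True, use_image=True,
                              use_fusion=True, use_matching_distance=False),
    "TLFND": AblationConfig(use_text=True, use_image=True,
                            use_fusion=True, use_matching_distance=True),
}


def active_loss_terms(cfg):
    """Assumed mapping: each enabled component keeps its own loss term."""
    terms = []
    if cfg.use_text:
        terms.append("text")
    if cfg.use_image:
        terms.append("image")
    if cfg.use_fusion:
        terms.append("fusion")
    if cfg.use_matching_distance:
        terms.append("matching")
    return terms


if __name__ == "__main__":
    for name, cfg in VARIANTS.items():
        print(name, active_loss_terms(cfg))
```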

The results in Table 6 lead to the following findings:

TLFND-C performs worst, which indicates that image features alone are less informative than text features alone; the text-only variant (TLFND-T) is better on all three datasets.
TLFND-E performs better than both TLFND-T and TLFND-C, which shows that fusing modalities is better than using a single modality and confirms the research significance of multimodal news detection.
The full TLFND model performs best on every metric, which shows that the modules play distinct and complementary roles.

In summary, each module contributes to false news detection, and the full TLFND model, which integrates all four modules, performs best.

5.5.3. Matching Distance Module Analysis

In the TLFND model, we utilize the Chebyshev distance metric to calculate the matching distance between $T^{r}$, $C^{r}$, and $S^{r}$. To demonstrate the effectiveness of the Chebyshev distance metric in false news detection, we compare it with two alternative distance metrics applied to the TLFND model: TLFND-COS, which replaces the Chebyshev distance metric with cosine similarity, and TLFND-MD, which replaces the Chebyshev distance metric with Manhattan distance. The experimental results of this comparison are presented in Table 7.

The results in Table 7 show that using the Chebyshev distance in the matching distance module performs better than using cosine similarity or the Manhattan distance. This may be due to the following reasons:

The Chebyshev distance metric considers the maximum absolute difference between vectors, i.e., the largest absolute difference over all dimensions, which we believe better captures the most salient difference between vectors. In contrast, the Manhattan distance only considers the cumulative difference between vectors, while the cosine similarity metric only considers the angle between vectors.
The Chebyshev distance metric is able to capture the difference between modalities more sensitively in the multimodal matching distance, while the Manhattan distance and cosine similarity metrics may be affected by the feature scale or distribution.
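The difference between the three metrics is easy to see on a toy example. The sketch below is illustrative only: the vectors are invented for the example and are not features from the model. It compares a reference vector with one vector that differs strongly in a single dimension and another that differs slightly in every dimension.

```python
import math


def chebyshev(u, v):
    """Largest absolute per-dimension difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def manhattan(u, v):
    """Sum of absolute per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


reference = [1.0, 1.0, 1.0, 1.0]
one_large_gap = [1.0, 1.0, 1.0, 4.0]          # mismatch concentrated in one dimension
many_small_gaps = [1.75, 1.75, 1.75, 1.75]    # mismatch spread over all dimensions

if __name__ == "__main__":
    for name, vec in (("one_large_gap", one_large_gap),
                      ("many_small_gaps", many_small_gaps)):
        print(name,
              "chebyshev =", chebyshev(reference, vec),
              "manhattan =", manhattan(reference, vec),
              "cosine =", round(cosine_similarity(reference, vec), 3))
```

Both comparison vectors are at the same Manhattan distance from the reference, so that metric cannot tell them apart, whereas the Chebyshev distance singles out the concentrated mismatch. Cosine similarity treats the uniformly scaled vector as identical in direction to the reference.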

5.5.4. Convergence Analysis

To analyze whether the TLFND model overfits during training, we plotted the average loss per iteration for TLFND on the three datasets, as illustrated in Figure 5. Figure 5 shows that the three loss curves decrease at the outset and gradually stabilize later, signifying that the model reaches a stable state. We took three measures against overfitting. First, we introduced Dropout layers in both the text and image processing branches; Dropout randomly drops neurons during training, which reduces the model’s susceptibility to overfitting the training data. Second, we incorporated weight-decay regularization (an L2-style penalty) via the AdamW optimizer, which helps control the model’s complexity. Lastly, we divided the loss function into four parts and designed an adaptive evolutionary algorithm, which both optimizes the model and helps manage overfitting.
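The two generic regularizers mentioned above can be illustrated without any deep learning framework. The following is a minimal scalar sketch under stated assumptions, not the training code of the model: the learning rate and dropout rate follow Table 2, while the weight decay value, the Adam coefficients, and the function names `adamw_step` and `dropout` are assumptions chosen for the example. AdamW applies weight decay directly to the parameter, separately from the gradient-based Adam update.

```python
import math
import random


def adamw_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count.

    weight_decay=0.01 is an assumed value; the source does not report it.
    """
    # Decoupled weight decay: shrink the parameter independently of the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised moving averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def dropout(values, rate=0.4, rng=None, training=True):
    """Inverted dropout: zero each value with probability `rate` and rescale
    the survivors by 1 / (1 - rate); identity when not training."""
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in values]


if __name__ == "__main__":
    p, m, v = 1.0, 0.0, 0.0
    for step in range(1, 4):
        p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=step)
        print(step, p)
    print(dropout([1.0] * 8, rate=0.4, rng=random.Random(0)))
```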

6. Conclusions

To enhance the performance of multimodal methods in fake news detection tasks, we propose a model called TLFND. The model builds on the RoBERTa and VGG-19 models, combining their strengths for fake news detection. The TLFND model comprises four components: a two-level text feature extraction module, an image extraction and fusion module, a three-level feature matching distance module, and a multimodal integrated recognition module. Compared to existing multimodal models, TLFND considers two levels of text, headlines and body text, and fuses multiple news images using BILSTM, yielding a more comprehensive representation of the news content. Additionally, we introduce the Chebyshev distance metric, for the first time in this task, to calculate the matching distance between the three distinct features. To optimize the model, we employ an adaptive evolutionary algorithm that computes and weights the loss values of the four components, so that the parameters are learned from the combined feedback signal. We evaluated the TLFND model on three real-world public datasets, and it outperforms the compared approaches on all four metrics. These results indicate that the TLFND model is practical for binary false news detection and can help counter the spread and impact of false information. We also hope that the model serves as a useful reference for researchers in the field and can be applied in other domains.

Looking ahead, our future research will focus on adapting the TLFND model to different domains and on extensive testing and refinement to improve its effectiveness. By continuously improving the model, we aim to contribute to society’s fight against fake news.

Author Contributions

Conceptualization, J.Z. and J.W.; formal analysis, S.Y.; methodology, J.W. and R.W.; validation, H.D.; writing—original draft, J.W.; Writing—review and editing, J.W. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Youth Basic Research Program under Grant No. 202301AU070194 of Yunnan Provincial Science and Technology Department, the Science and Technology Plan in Key Fields of Yunnan Province under Grant No. 202202AD080002 and the Basic Research Program under Grant No. 202001BB050076 of Yunnan Provincial Science and Technology Department.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

We declare that we do not have any commercial or associative interests that represent a conflict of interest in connection with the work submitted.

References

1. Carr, C.T.; Hayes, R.A. Social media: Defining, developing, and divining. Atl. J. Commun. 2015, 23, 46–65.
2. Figueira, Á.; Oliveira, L. The current state of fake news: Challenges and opportunities. Procedia Comput. Sci. 2017, 121, 817–825.
3. Farkas, J.; Schou, J. Fake news as a floating signifier: Hegemony, antagonism and the politics of falsehood. Javn.-Public 2018, 25, 298–314.
4. Hirst, M. Towards a political economy of fake news. Political Econ. Commun. 2017, 5, 82–94.
5. Zhang, X.; Ghorbani, A.A. An overview of online fake news: Characterization, detection, and discussion. Inf. Process. Manag. 2020, 57, 102025.
6. Rennó, L.R. The Bolsonaro voter: Issue positions and vote choice in the 2018 Brazilian presidential elections. Lat. Am. Politics Soc. 2020, 62, 1–23.
7. Symonds, A. Amazon rainforest fires: Here’s what’s really happening. New York Times, 23 August 2019.
8. Feder, A.; Gandal, N.; Hamrick, J.; Moore, T. The impact of DDoS and other security shocks on Bitcoin currency exchanges: Evidence from Mt. Gox. J. Cybersecur. 2017, 3, 137–144.
9. Long, Y.; Lu, Q.; Xiang, R.; Li, M.; Huang, C.R. Fake news detection through multi-perspective speaker profiles. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 27 November–1 December 2017; pp. 252–256.
10. Jain, A.; Shakya, A.; Khatter, H.; Gupta, A.K. A smart system for fake news detection using machine learning. In Proceedings of the 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), Ghaziabad, India, 27–28 September 2019; Volume 1, pp. 1–4.
11. Castillo, C.; Mendoza, M.; Poblete, B. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 675–684.
12. Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic detection of fake news. arXiv 2017, arXiv:1708.07104.
13. Sharma, K.; Qian, F.; Jiang, H.; Ruchansky, N.; Zhang, M.; Liu, Y. Combating fake news: A survey on identification and mitigation techniques. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–42.
14. Qazvinian, V.; Rosengren, E.; Radev, D.; Mei, Q. Rumor has it: Identifying misinformation in microblogs. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Scotland, UK, 27–31 July 2011; pp. 1589–1599.
15. Prasetijo, A.B.; Isnanto, R.R.; Eridani, D.; Soetrisno, Y.A.A.; Arfan, M.; Sofwan, A. Hoax detection system on Indonesian news sites based on text classification using SVM and SGD. In Proceedings of the 2017 4th International Conference on Information Technology, Computer, and Electrical Engineering (ICITACEE), Semarang, Indonesia, 18–19 October 2017; pp. 45–49.
16. Granik, M.; Mesyura, V. Fake news detection using naive Bayes classifier. In Proceedings of the 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kyiv, Ukraine, 29 May 2017–2 June 2017; pp. 900–903.
17. Goldani, M.H.; Momtazi, S.; Safabakhsh, R. Detecting fake news with capsule neural networks. Appl. Soft Comput. 2021, 101, 106991.
18. Raza, S.; Ding, C. Fake news detection based on news content and social contexts: A transformer-based approach. Int. J. Data Sci. Anal. 2022, 13, 335–362.
19. Giachanou, A.; Zhang, G.; Rosso, P. Multimodal fake news detection with textual, visual and semantic information. In Text, Speech, and Dialogue, Proceedings of the 23rd International Conference, TSD 2020, Brno, Czech Republic, 8–11 September 2020; Proceedings 23; Springer: Berlin/Heidelberg, Germany, 2020; pp. 30–38.
20. Xue, J.; Wang, Y.; Tian, Y.; Li, Y.; Shi, L.; Wei, L. Detecting fake news by exploring the consistency of multimodal data. Inf. Process. Manag. 2021, 58, 102610.
21. Wu, Y.; Zhan, P.; Zhang, Y.; Wang, L.; Xu, Z. Multimodal fusion with co-attention networks for fake news detection. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 2560–2569.
22. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
23. Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.J.; Wong, K.F.; Cha, M. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, NY, USA, 9–15 July 2016.
24. Yu, W.; Ge, J.; Yang, Z.; Dong, Y.; Zheng, Y.; Dai, H. Multi-domain Fake News Detection for History News Environment Perception. In Proceedings of the 2022 IEEE 17th Conference on Industrial Electronics and Applications (ICIEA), Chengdu, China, 16–19 December 2022; pp. 428–433.
25. Kausar, N.; AliKhan, A.; Sattar, M. Towards better representation learning using hybrid deep learning model for fake news detection. Soc. Netw. Anal. Min. 2022, 12, 165.
26. Liao, Q.; Chai, H.; Han, H.; Zhang, X.; Wang, X.; Xia, W.; Ding, Y. An integrated multi-task model for fake news detection. IEEE Trans. Knowl. Data Eng. 2021, 34, 5154–5165.
27. Bazmi, P.; Asadpour, M.; Shakery, A. Multi-view co-attention network for fake news detection by modeling topic-specific user and news source credibility. Inf. Process. Manag. 2023, 60, 103146.
28. Song, C.; Teng, Y.; Zhu, Y.; Wei, S.; Wu, B. Dynamic graph neural network for fake news detection. Neurocomputing 2022, 505, 362–374.
29. Kumar, S.; Kumar, A.; Mallik, A.; Singh, R.R. OptNet-Fake: Fake News Detection in Socio-Cyber Platforms Using Grasshopper Optimization and Deep Neural Network. IEEE Trans. Comput. Soc. Syst. 2023.
30. Song, C.; Ning, N.; Zhang, Y.; Wu, B. Knowledge augmented transformer for adversarial multidomain multiclassification multimodal fake news detection. Neurocomputing 2021, 462, 88–100.
31. Guo, Y. A mutual attention based multimodal fusion for fake news detection on social network. Appl. Intell. 2023, 53, 15311–15320.
32. Li, S.; Yao, T.; Li, S.; Yan, L. Semantic-enhanced multimodal fusion network for fake news detection. Int. J. Intell. Syst. 2022, 37, 12235–12251.
33. Wang, J.; Mao, H.; Li, H. FMFN: Fine-grained multimodal fusion networks for fake news detection. Appl. Sci. 2022, 12, 1093.
34. Chen, Y.; Li, D.; Zhang, P.; Sui, J.; Lv, Q.; Tun, L.; Shang, L. Cross-modal ambiguity learning for multimodal fake news detection. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 2897–2905.
35. Singh, P.; Srivastava, R.; Rana, K.; Kumar, V. SEMI-FND: Stacked ensemble based multimodal inferencing framework for faster fake news detection. Expert Syst. Appl. 2023, 215, 119302.
36. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119.
37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
38. Naseer, M.; Asvial, M.; Sari, R.F. An empirical comparison of bert, roberta, and electra for fact verification. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 13–16 April 2021; pp. 241–246.
39. Dai, J.; Yan, H.; Sun, T.; Liu, P.; Qiu, X. Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa. arXiv 2021, arXiv:2104.04986.
40. Basu, P.; Roy, T.S.; Singhal, A. But how robust is roberta actually?: A benchmark of sota transformer networks for sexual harassment detection on twitter. In Proceedings of the 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 11–13 November 2021; pp. 1328–1333.
41. Gupta, P.; Gandhi, S.; Chakravarthi, B.R. Leveraging transfer learning techniques-bert, roberta, albert and distilbert for fake review detection. In Proceedings of the 13th Annual Meeting of the Forum for Information Retrieval Evaluation, Virtual Event, 13–17 December 2021; pp. 75–82.
42. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
43. Shu, K.; Mahudeswaran, D.; Wang, S.; Lee, D.; Liu, H. Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data 2020, 8, 171–188.
44. Maigrot, C.; Claveau, V.; Kijak, E.; Sicre, R. Mediaeval 2016: A multimodal system for the verifying multimedia use task. In Proceedings of the MediaEval 2016: “Verifying Multimedia Use” Task, Hilversum, The Netherlands, 20–21 October 2016.
45. Siino, M.; Di Nuovo, E.; Tinnirello, I.; La Cascia, M. Fake news spreaders detection: Sometimes attention is not all you need. Information 2022, 13, 426.
46. Siino, M.; Di Nuovo, E.; Tinnirello, I.; La Cascia, M. Detection of hate speech spreaders using convolutional neural networks. In Proceedings of the CLEF (Working Notes), Bucharest, Romania, 21–24 September 2021; pp. 2126–2136.
47. Zhou, X.; Wu, J.; Zafarani, R. SAFE: Similarity-Aware Multi-modal Fake News Detection. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 24th Pacific-Asia Conference, PAKDD 2020, Singapore, 11–14 May 2020; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2020; pp. 354–367.
48. Singhal, S.; Kabra, A.; Sharma, M.; Shah, R.R.; Chakraborty, T.; Kumaraguru, P. Spotfake+: A multimodal framework for fake news detection via transfer learning (student abstract). Proc. AAAI Conf. Artif. Intell. 2020, 34, 13915–13916.
49. Al Obaid, A.; Khotanlou, H.; Mansoorizadeh, M.; Zabihzadeh, D. Multimodal Fake-News Recognition Using Ensemble of Deep Learners. Entropy 2022, 24, 1242.
50. Yu, C.; Ma, Y.; An, L.; Li, G. BCMF: A bidirectional cross-modal fusion model for fake news detection. Inf. Process. Manag. 2022, 59, 103063.

Figure 1. Structure diagram of the proposed multimodal model TLFND for false news detection. T represents the news headline, C represents the news content, and V represents the images in the news.

Figure 2. Line graph of the accuracy and F1 score of the baseline approaches versus the TLFND model on the PolitiFact dataset.

Figure 3. Line graph of the accuracy and F1 score of the baseline approaches versus the TLFND model on the GossipCop dataset.

Figure 4. Line graph of the accuracy and F1 score of the baseline approaches versus the TLFND model on the Twitter dataset.

Figure 5. Mean loss curves for three different datasets.

Table 1. Statistics and segmentation of PolitiFact, GossipCop, and Twitter datasets.

Dataset	Train/Test	Real/Fake
PolitiFact	710/143	527/326
GossipCop	13,011/2602	12,119/3494
Twitter	9384/1877	4425/6836

Table 2. Model parameter settings.

Parameter	Value
Epochs	50
Train_batch_size	16
Eval_batch_size	1
Optimizer	AdamW
Learning rate	$1 \times 10^{- 5}$
Dropout rate	0.4

Table 3. Results of comparison experiments using the PolitiFact dataset (bold represents the best results).

Method	Accuracy	Precision	Recall	F1 Score
SVM	0.589	0.467	0.911	0.617
CNN	0.749	0.786	0.837	0.811
VGG-19	0.649	0.668	0.787	0.720
SAFE	0.874	0.889	0.903	0.896
SpotFake+	0.882	0.903	0.925	0.913
DEFD	0.855	0.705	0.827	0.761
BCMF	0.923	0.904	0.904	0.904
TLFND	$0.944$	$0.974$	$0.966$	$0.970$

Table 4. Results of comparison experiments using the GossipCop dataset (bold represents the best results).

Method	Accuracy	Precision	Recall	F1 Score
SVM	0.498	0.523	0.883	0.656
CNN	0.796	0.831	0.897	0.863
VGG-19	0.775	0.793	0.884	0.836
SAFE	0.838	0.857	0.927	0.895
SpotFake+	0.856	0.879	0.918	0.898
DEFD	0.853	0.912	0.601	0.725
BCMF	0.891	0.902	0.919	0.910
TLFND	$0.909$	$0.932$	$0.947$	$0.939$

Table 5. Results of comparison experiments using the Twitter dataset (bold represents the best results).

Method	Accuracy	Precision	Recall	F1 Score
SVM	0.536	0.571	0.541	0.555
CNN	0.683	0.655	0.651	0.652
VGG-19	0.604	0.639	0.596	0.616
SAFE	0.762	0.763	0.767	0.761
SpotFake+	0.790	0.789	0.748	0.788
DEFD	0.809	0.825	0.793	0.808
BCMF	0.815	0.813	0.816	0.814
TLFND	$0.831$	$0.852$	$0.824$	$0.837$

Table 6. Ablation study on PolitiFact, GossipCop, and Twitter datasets for the TLFND model design (bold represents the best results).

Dataset	Method	Accuracy	F1 Score
PolitiFact	TLFND-T	0.896	0.893
	TLFND-C	0.828	0.831
	TLFND-E	0.912	0.899
	TLFND	$0.944$	$0.970$
GossipCop	TLFND-T	0.864	0.901
	TLFND-C	0.683	0.785
	TLFND-E	0.869	0.910
	TLFND	$0.909$	$0.939$
Twitter	TLFND-T	0.788	0.801
	TLFND-C	0.721	0.739
	TLFND-E	0.810	0.808
	TLFND	$0.831$	$0.837$

Table 7. Analysis of the matching distance module in the TLFND model on the GossipCop, PolitiFact, and Twitter datasets (bold represents the best results).

Dataset	Method	Accuracy	F1 Score
PolitiFact	TLFND-COS	0.919	0.935
	TLFND-MD	0.928	0.941
	TLFND	$0.944$	$0.970$
GossipCop	TLFND-COS	0.863	0.906
	TLFND-MD	0.875	0.913
	TLFND	$0.909$	$0.939$
Twitter	TLFND-COS	0.803	0.811
	TLFND-MD	0.819	0.815
	TLFND	$0.831$	$0.837$

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
