The Power of Context: A Novel Hybrid Context-Aware Fake News Detection Approach

Abstract: The detection of fake news has emerged as a crucial area of research due to its potential impact on society. In this study, we propose a robust methodology for identifying fake news by leveraging diverse aspects of language representation and incorporating auxiliary information. Our approach is based on the utilisation of Bidirectional Encoder Representations from Transformers (BERT) to capture contextualised semantic knowledge. Additionally, we employ a multichannel Convolutional Neural Network (mCNN) integrated with stacked Bidirectional Gated Recurrent Units (sBiGRU) to jointly learn multi-aspect language representations. This enables our model to effectively identify valuable clues from news content while simultaneously incorporating content- and context-based cues, such as user posting behaviour, to enhance the detection of fake news. Through extensive experimentation on four widely used real-world datasets, our proposed framework demonstrates superior performance (↑ 3.59% (PolitiFact), ↑ 6.8% (GossipCop), ↑ 2.96% (FA-KES), and ↑ 12.51% (LIAR), considering both content-based features and additional auxiliary information) compared to existing state-of-the-art approaches, establishing its effectiveness in the challenging task of fake news detection.


Introduction
Fake news detection is a research area that has attracted much attention. People are increasingly utilising social networking platforms, such as Twitter, to share their opinions. Detecting fake news is crucial, yet it is challenging due to the intricate semantics of natural language and the high dimensionality of textual data, leading to data sparsity. Furthermore, malicious actors frequently modify their writing style to imitate credible content, making it difficult to identify fake news based solely on cues from the news content. It is, therefore, prudent to consider auxiliary information, such as user behaviour clues, to advance fake news detection.
Previous studies in the field of automatic fake news detection have predominantly focused on content-based features, including n-grams and part-of-speech tags [1][2][3][4][5]. However, this approach often fails to consider the significance of simultaneously learning various aspects of language to achieve effective detection. Moreover, some researchers [6] have attempted to differentiate fake content by incorporating both content and context characteristics, employing unidirectional encoding of news content with widely-used pretrained word embedding models such as GloVe [7]. Nonetheless, a practical approach is needed to capture contextualised semantic patterns that classical statistical approaches or context-independent representation models cannot adequately model.
To overcome the limitations of previous approaches and address the issue of considering only a single aspect of language, our proposed framework for fake news detection takes a multi-aspect representation approach to model the input text. This comprehensive approach incorporates various language levels, including content, style, morality, and sentiment. By considering multiple aspects simultaneously, our framework provides a more holistic understanding of the textual data, enhancing the detection process. In addition, we propose the adoption of context-aware representation models, such as BERT [8], to encode the text input. By utilising BERT and its ability to capture contextual knowledge, our framework gains an additional layer of understanding, leading to improved performance in detecting fake news.
To evaluate the effectiveness of our framework, we conduct experiments using two different types of text representations for news content: context-independent pre-trained embedding models like GloVe and context-aware pre-trained embedding models like BERT. This comparison allows us to analyse the impact of incorporating contextual information on the detection performance. Furthermore, we assess the performance of our framework across various scenarios, including short- and long-text news content, small- and large-scale datasets, and the inclusion of a rich set of crucial auxiliary features. This comprehensive evaluation enables us to thoroughly analyse our architecture's capabilities and potential for real-world applications.
This paper focuses on two problems: (i) how to detect fake news by using multi-aspect language representations and (ii) how to process multiple resolutions of news at once while simultaneously learning how to best integrate these interpretations and other contextual information through joint feature learning.
To respond to these challenges, this work presents a context-aware fake news detection framework (BERT-base-mCNN-sBiGRU) by jointly modelling context- and content-based clues through a coherent process that consists of (1) encoding news content using the pretrained BERT model, (2) using multi-channel CNN [9] (mCNN) with three input channels to process different resolutions of the input text, and (3) introducing a stacked BiGRU [10] (sBiGRU) to encode the given auxiliary information, allowing the model to capture more contextual semantic information. Our key contributions are summarised as follows:
1. The development of a novel hybrid deep learning framework that can effectively learn from multiple sources to detect fake news. These sources include news content, social user behaviour information, and various language aspects.
2. The evaluation of the proposed framework using two different embedding models, BERT and GloVe, to determine its efficacy.
3. The evaluation of the proposed framework through extensive experimentation on four real-world datasets to demonstrate its effectiveness in detecting fake news.
4. The discovery that incorporating user behavioural representation with content-based information can lead to more accurate outcomes compared to existing state-of-the-art baselines.
Overall, our study focused on developing a more effective way of detecting fake news by combining various sources of information and using a hybrid deep learning framework. The findings of the study suggest that this approach can yield superior outcomes compared to existing state-of-the-art methods.
The remainder of this paper is structured as follows: Section 2 summarises relevant literature on fake news detection. Section 3 outlines our research methodology and introduces the proposed model. In Section 4, we present comprehensive outcomes on the performance of the predictive models, including all other models developed in this study. Finally, Section 5 concludes the paper.

Related Work
Current research focuses on detecting fake news using both contextual and content-based approaches. The content-based features used in such research can be broadly classified into two groups [11]: general features and latent features. General textual features, commonly used in traditional machine learning models, employ statistical techniques like bag-of-words (BoW) models to calculate the frequency statistics of lexicons and parts-of-speech (POS) tags [12], such as nouns and verbs, to evaluate the syntax [11].
Conversely, latent textual features capture implicit patterns or embeddings that can be generated at the word [13], sentence [14], or document level [14]. The outcome of this process is the generation of compact vector representations, which can be exploited for further analysis. The existence of irrelevant or noisy text in fake news datasets, particularly those extracted from social media platforms like Twitter, poses a challenge to automatic fake news detection. Failure to process this text can negatively affect the detection performance. Addressing this challenge requires encoding news content in a way that mitigates the problem. In order to achieve this goal, several neural network-based models have been suggested, each offering distinct and valuable features that aid in distinguishing between real and fake news [5]. As an example, Wang et al. [15] conducted a study that employed Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNNs) to encode the combination of textual and speaker metadata, aiming to identify fake news.
News content can vary considerably in length. Although longer text can be more informative, it may also contain noisy words or sentences. In contrast to the static embeddings produced by traditional context-independent word embedding methods (word2vec and GloVe), more advanced pre-trained contextualised embedding models based on the attention mechanism, such as BERT [8], can provide embeddings of a word based on its context. However, such models limit the length of the input text, which must be truncated or padded to the maximum length imposed by the model.
In [16], the authors proposed a deep learning technique termed FakeBERT, which combines BERT with parallel blocks of a deep CNN featuring diverse kernel sizes and filters. This integration has demonstrated efficacy in mitigating ambiguity, a notable challenge in natural language comprehension. Nonetheless, the authors overlooked the potential benefits of integrating user behaviour cues, which could potentially improve classification accuracy. The comparative study conducted by [5] investigates the effectiveness of different machine learning and deep learning models, showcasing the capacity of BERT and its variations to improve detection performance across a range of datasets. Alghamdi et al. [17] proposed a computational framework for automatic fake news detection leveraging BERT with the LIAR dataset. The methodology employed BERT for encoding the input text, accompanied by a CNN for extracting local features. Furthermore, metadata information was encoded using a combination of CNN, BiLSTM, and a classification layer. The findings showcased enhanced performance in comparison to prior state-of-the-art techniques on the LIAR multiclass classification task.
Additional data from social media platforms can provide valuable insights into the dissemination of fake news, despite potential challenges such as noise and inconsistency. Alongside the news content itself, the study incorporates auxiliary information, such as post-based features, extracted from source tweets within the Twitter context. The authors in [18] proposed an outlier knowledge management framework designed to identify fake news during emergency situations, incorporating principles from complex adaptive systems theory. Their hybrid model, which integrates CNN, BiLSTM networks, and attention mechanisms, demonstrates enhanced detection metrics while providing valuable insights into the characteristics of fake news. The authors in [19] introduced an arithmetic optimization algorithm (AOA)-based approach designed to improve classification accuracy through feature reduction. Utilizing AOA as a wrapper feature-selection technique, the study conducted extensive simulations, comparing the proposed method against established classifiers and alternative evolutionary approaches.
Numerous studies have leveraged contextual cues from social user posts, including temporal patterns observed in sequences of responses on platforms like Twitter, along with other features reflecting their engagements and interactions [5]. User-based features are also believed to be valuable clues in detecting fake news content. Because users prone to sharing fake news have distinctive traits from those who do not, researchers have become interested in exploring user-based cues for identifying fake news [20]. For instance, in their study, Shu et al. [20] examined user profiles to distinguish fake content from real.
It has also been demonstrated that the language used by purveyors of fake news contains strong explicit or implicit indicators that can be harnessed to advance fake news detection. The former includes lexicon and linguistic cues, such as stylistic features like part-of-speech (POS) tags, while the latter refers to implicit clues such as sentimental and emotional cues. Table 1 shows different features used in previous related work. Few studies have integrated such multiple aspect clues to detect fake news. Incorporating various aspects of language alongside a diverse array of user behaviour cues could potentially enhance the accuracy of fake news detection. In this study, we introduce a context-aware hybrid framework that integrates multi-aspect language representations, behavioural information, and a diverse set of relevant text-based features. Our experiments illustrate that the fusion of these features yields significantly improved results compared to existing baselines.

Table 1. Features used in previous related work.

| Study | Feature types marked (✓) | Method |
| … | ✓ | RNN |
| Chen et al. [23] | ✓ ✓ | Anomaly detection, KNN |
| Wu et al. [24] | ✓ | LSTM-RNN |
| Gupta et al. [25] | ✓ ✓ | Graph-based method |
| Gupta et al. [26] | ✓ ✓ | Graph-based method, DT |
| Qazvinian et al. [27] | ✓ | L1-regularized log-linear model |
| Zhao et al. [28] | ✓ | DT ranking method |
| Chua et al. [29] | ✓ | Linear Regression (LR) |
| Ma et al. [30] | ✓ | Kernel-based method |
| Kwon et al. [31] | ✓ ✓ | Random Forest |
| Kwon et al. [32] | ✓ ✓ | SpikeM |
| Zubiaga et al. [33] | ✓ ✓ | Conditional Random Fields |
| Qin et al. [34] | ✓ | SVM |
| Shu et al. [35] | ✓ ✓ | Neural Network |
| Jin et al. [36] | ✓ ✓ | LDA, Graph |
| Li et al. [37] | ✓ ✓ | SVM |
| Li et al. [38] | ✓ ✓ | LSTM |
| Shu et al. [… | … | … |

The Global Vectors for Word Representation method, known as GloVe, was developed by Pennington et al. [7] to enhance the process of learning word vectors. Building upon the word2vec approach, GloVe is more efficient in acquiring word embeddings. By combining global statistics from matrix factorization methods like LSA with context-based learning techniques such as word2vec, GloVe has gained widespread recognition as a superior method for generating word embeddings.
Transfer learning, a machine learning technique, involves storing knowledge gained from a specific task and utilising it to solve related problems, thereby improving the learning process. This technique is particularly valuable when faced with limited training data and the need to evaluate models. In Natural Language Processing (NLP), transfer learning has made significant advancements by utilising pre-trained embedding models trained on large text corpora. This has resulted in remarkable breakthroughs in NLP. However, these methods have encountered challenges in distinguishing the context in which words are written. To address this issue, contextualised word embedding models have been introduced, such as BERT [8], which is specifically designed to capture contextual information effectively.

Bidirectional Encoder Representations from Transformers (BERT)
Contextualised word embedding models have become increasingly significant in recent years, surpassing the limitations of traditional context-free neural embedding models like word2vec and GloVe in capturing deep contextual relationships. These traditional models focus on short-range context within a specific co-occurrence window, which restricts their ability to grasp nuanced contextual information [5]. Consequently, their popularity has diminished in favor of transfer learning methods, with Google's BERT model leading the way. The BERT model, introduced by Devlin et al. [8], is an unsupervised language representation model that revolutionised the field. Unlike its predecessors, BERT incorporates a deeply bidirectional architecture that simultaneously considers both the forward and backward contexts in all layers, resulting in highly context-aware embeddings. The attention mechanism plays a pivotal role in improving the semantic representation of words within a given context by recognising their varying impacts. This mechanism is a crucial component of the transformer architecture, designed to assign different weights to different parts of the input text, thus distinguishing their contributions to the final output [5]. To achieve this, the attention mechanism transforms each word into matrix vectors, namely Q, K, and V, through separate linear transformations, and independently calculates the associations between words. This process allows BERT to capture intricate contextual dependencies and generate more nuanced and informative word embeddings.
The formula for scaled dot-product attention can be observed in Equation (1) [39]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

The query, key, and value vectors are represented by Q, K, and V, respectively, and $d_k$ is the dimension of the key vectors. In order to normalise the attention scores to values between 0 and 1, the attention mechanism employs the Softmax activation function. BERT utilises a multi-head attention mechanism, which is based on the transformer's encoder, as expressed in Equation (2) [39], where the subscript i represents each specific head and its corresponding weight matrices:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{2}$$

where each head i is calculated as follows:

$$\mathrm{head}_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{3}$$

Text classification tasks have demonstrated remarkable success through the implementation of the BERT model, but this success comes with a computational cost due to the millions of parameters required. Specifically, BERT-base has 110 million parameters, while BERT-large has 340 million parameters [8]. Nevertheless, BERT-base provides exceptional results and is simpler to train than BERT-large. In this study, we utilised BERT-base to generate context-aware text representations for the input text provided.
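As an illustration of the scaled dot-product attention described above, the following is a minimal NumPy sketch for a single head. It is not the BERT implementation; the function names, the toy matrix sizes, and the random inputs are assumptions chosen only to show how the Softmax turns scaled query-key scores into weights over the value vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                    # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row is non-negative and sums to 1
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional projections (sizes are illustrative).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention repeats this computation with a separate set of projection matrices per head and concatenates the per-head outputs before a final linear projection.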

Long Short-Term Memory (LSTM)
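The LSTM network, introduced by [40], represents a notable advancement within the realm of recurrent neural networks (RNNs). By incorporating three distinct types of gates, namely the input, forget, and output gates, LSTM overcomes certain limitations of traditional RNNs. These gates play a critical role in regulating the flow of information within the LSTM cells, thereby mitigating issues such as gradient vanishing and explosion. Consequently, LSTM has proven to be highly effective in handling long sentences, as evidenced by prior research [41]. LSTM components can be formulated mathematically, in their standard form, as follows [42]:

$$
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

In the formulas above, σ represents the logistic sigmoid activation function, ⊙ denotes element-wise multiplication, and $x_t$ and $h_t$ are the input and hidden state at time t. W, b, and $C_t$ represent the weight matrix, the bias, and the state of the memory unit at time t, respectively.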
Nonetheless, one key drawback of the basic RNN architecture, including LSTM, is its limited capacity to consider future context. While these models can successfully capture dependencies based on previous context, their inability to account for subsequent context poses a notable limitation. Alternative architectures, such as Bidirectional LSTM (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU), have been proposed as potential solutions to address this issue. These models comprise both forward and backward hidden layers, which are subsequently merged to facilitate the flow of temporal information in both directions [5]. Consequently, BiLSTM and BiGRU models offer superior learning performance by effectively considering both the preceding and subsequent contexts.

Gated Recurrent Unit (GRU)
As a type of RNN, the GRU has two gates: the update and reset gates. The update gate, which combines the roles of the LSTM forget and input gates, controls the amount of information that needs to be transferred to the current state. Additionally, the reset gate determines when to discard the previous hidden state. The update and reset gates are computed similarly to LSTM as follows [10]:

$$
\begin{aligned}
z_t &= \delta\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \delta\left(W_r x_t + U_r h_{t-1} + b_r\right)
\end{aligned}
$$

In the formulas above, $z_t$ and $r_t$ are the update and reset gates, $x_t$ is the input at time t, δ(.) signifies the logistic sigmoid function, W and U are gate weight matrices, and $h_t$ and b are the hidden state and bias vectors, respectively.
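The two GRU gates described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy dimensions and randomly initialised weights; the parameter names (`W_z`, `U_z`, `b_z`, and so on) are hypothetical labels for the gate weight matrices and biases, and the sketch stops at the gates rather than implementing a full GRU step.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid, squashing each entry into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def gru_gates(x_t, h_prev, params):
    """Compute the update gate z_t and reset gate r_t for one time step."""
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    return z_t, r_t

# Illustrative sizes only: 3 input features, 5 hidden units.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
params = {
    "W_z": rng.normal(size=(hidden_dim, input_dim)),
    "U_z": rng.normal(size=(hidden_dim, hidden_dim)),
    "b_z": np.zeros(hidden_dim),
    "W_r": rng.normal(size=(hidden_dim, input_dim)),
    "U_r": rng.normal(size=(hidden_dim, hidden_dim)),
    "b_r": np.zeros(hidden_dim),
}
x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
z_t, r_t = gru_gates(x_t, h_prev, params)
```

Because both gates pass through a sigmoid, every entry lies strictly between 0 and 1 and can be read as a per-unit mixing proportion.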

Convolutional Neural Network (CNN)
One-dimensional Convolutional Neural Networks (Conv1D) have emerged as a popular choice for prediction generation and have demonstrated their efficacy across a range of NLP tasks. Conv1D models employ a fixed window size filter, which slides over the input data during training. Each cell within the filter is initialised with a weight, and at each processing step, the input, typically consisting of word vectors, undergoes element-wise multiplication with the filter weights [5]. This process generates an output array referred to as the feature map or filter output array, encoding salient features extracted from the input data. Multi-channel CNNs have been particularly effective for text classification tasks [9]. CNNs are well-regarded for their ability to automatically extract relevant features, allowing them to capture local features with precision. A multi-channel CNN can capture diverse features across different regions of the input data by employing multiple channels, each with its own set of filters. This multi-channel approach enables the model to capture both low-level and high-level features, enhancing its ability to discern intricate patterns and improve classification performance.
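To make the sliding-filter computation above concrete, here is a small NumPy sketch of a single Conv1D filter applied to a matrix of word vectors, repeated for three window sizes to mimic a multi-channel setup. The window sizes, embedding dimension, random weights, and the final global max over each feature map are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np

def conv1d_feature_map(embeddings, kernel, bias=0.0):
    """Slide one filter over a (seq_len, emb_dim) matrix and return its feature map."""
    window = kernel.shape[0]
    n_positions = embeddings.shape[0] - window + 1
    feature_map = np.empty(n_positions)
    for i in range(n_positions):
        patch = embeddings[i:i + window]                 # (window, emb_dim) slice of word vectors
        feature_map[i] = np.sum(patch * kernel) + bias   # element-wise product, then sum
    return feature_map

# Toy input: 10 tokens with 6-dimensional word vectors (sizes are illustrative).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 6))

# One filter per channel, each with a different window size (n-gram width).
kernels = {k: rng.normal(size=(k, 6)) for k in (2, 3, 4)}
feature_maps = {k: conv1d_feature_map(embeddings, w) for k, w in kernels.items()}

# Assumed pooling step: keep the strongest response of each feature map.
pooled = np.array([fm.max() for fm in feature_maps.values()])
```

A wider window yields a shorter feature map, which is why the per-channel outputs are pooled or flattened before they are combined.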

Content-Based Features
News content representations. News content provides valuable clues for distinguishing between fake and real news. We utilised BERT representations (embeddings) to encode each news text. The selection of this pre-trained BERT model is justified by its outstanding performance across a variety of NLP tasks. For the context-independent text representations, we initialised the networks with 100-dimensional pre-trained embeddings using the off-the-shelf context-free embedding method (i.e., GloVe). Preprocessing of the input text was performed using the NLTK package, including lowercasing, tokenization, stemming, and punctuation removal. Notably, these preprocessing steps were not applied to BERT's input text because BERT has its own built-in tokenizer and punctuation handling capability; for more details, refer to [43]. Indeed, our BERT-based models exhibited a significant drop in performance when punctuation was removed. According to [11], distinguishing fake news from the truth involves various factors, including quality, style, and quantity (e.g., word count and expressed sentiments). We, therefore, explored a range of content-based features as follows.
Stylistic features. We extract word-based features such as word length, the number of words beginning with capital letters and those starting with lowercase letters, character length, count of digits, exclamation marks, question marks, and periods.
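The stylistic features listed above could be computed along the following lines. This is a hedged sketch rather than the extraction code used in the study: the whitespace tokenisation, the feature names, and the reading of "word length" as a word count plus an average word length are all assumptions.

```python
def stylistic_features(text):
    """Return simple surface-level counts for one piece of text."""
    words = text.split()  # assumption: whitespace tokenisation
    return {
        "n_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_capitalised": sum(1 for w in words if w[0].isupper()),
        "n_lowercase": sum(1 for w in words if w[0].islower()),
        "n_chars": len(text),
        "n_digits": sum(1 for c in text if c.isdigit()),
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "n_periods": text.count("."),
    }

example = "Breaking News! Is this real? 3 experts say no."
features = stylistic_features(example)
```

Each news item then contributes one fixed-length numeric vector, which can be supplied to the model alongside the other auxiliary features.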
Sentimental features. Fake news creators tend to write content that provokes readers' emotions to promote their creations' success and draw wide public attention. That is, fake news usually has a strong positive or negative sentiment of hate, anger or resentment [44]. Social science research shows that news stories that evoke high-arousal or triggering emotions (such as awe, anger, or anxiety) go viral more often on social media [45,46]. It is thus imperative to extract useful sentimental clues in our text. We extract such features (i.e., containing two categories: positive and negative) using the NRC lexicon [47]. It is claimed that negative emotions with stronger intensity expressed in fake news content are expected to provoke intense emotions in the public [48]. Consequently, hyperbolic-based characteristics, such as words that convey intense positive or negative sentiments like "terrifying", extracted from clickbait news headlines are also considered [49].

Context-Based Features
For the FakeNewsNet dataset, to obtain the contextual information (i.e., user posting behaviour) for each news article, we collect the set of tweets related to that article and summarise some features (e.g., the number of likes, number of retweets, number of verified tweets, etc.) over this set of tweets. For modelling, we use the news articles and these tweets. For the LIAR dataset, we consider useful cues from user profile information. For the FA-KES dataset, we extract temporal features derived from the date attribute as additional useful cues.
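The per-article summarisation of tweets and the date-derived features described above might be sketched as follows. The record layout (`article_id`, `likes`, `retweets`, `verified`) and the function names are hypothetical; the actual field names and aggregation statistics in the datasets may differ.

```python
from collections import defaultdict
from datetime import date

def summarise_tweets(tweets):
    """Aggregate engagement counts over all tweets linked to each article."""
    summary = defaultdict(
        lambda: {"n_tweets": 0, "likes": 0, "retweets": 0, "verified_tweets": 0}
    )
    for tweet in tweets:
        stats = summary[tweet["article_id"]]
        stats["n_tweets"] += 1
        stats["likes"] += tweet["likes"]
        stats["retweets"] += tweet["retweets"]
        stats["verified_tweets"] += int(tweet["verified"])
    return dict(summary)

def date_features(d):
    """Split a date into simple temporal features (weekday: Monday is 0)."""
    return {"year": d.year, "month": d.month, "day": d.day, "weekday": d.weekday()}

# Hypothetical tweet records for two articles.
tweets = [
    {"article_id": "a1", "likes": 10, "retweets": 2, "verified": True},
    {"article_id": "a1", "likes": 5, "retweets": 1, "verified": False},
    {"article_id": "a2", "likes": 0, "retweets": 7, "verified": False},
]
per_article = summarise_tweets(tweets)
temporal = date_features(date(2024, 1, 1))
```

The resulting dictionaries can be flattened into numeric vectors and used as the auxiliary input of the model.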

Model Architecture
The proposed approach to detecting fake news involves a binary classification task in which we aim to develop a model that predicts the credibility of a given news article. Specifically, the model determines whether a given article is fake or real based on certain attributes. The BERT-base-mCNN-sBiGRU (see Figure 1) consists of two main modules: the mCNN module and the sBiGRU module.
The former (right branch) consists of five layers: input, embedding, convolution (different filter channels), max-pooling, and flattening. First, BERT-base is used to generate 768-d vector representations for input text (statements/news), which are fed into what we call the mCNN module. The mCNN module is defined with three input channels for processing different n-grams of the input text (we also experimented with increasing the number of input channels; however, this led to a decrease in performance). Each channel comprises a convolution layer with ReLU activation and a kernel size of 4, 6, or 8, so that the channels operate on 4-, 6-, and 8-grams, respectively. These layers work together to extract different word n-gram features and capture more complex ones. Additionally, a max-pooling layer consolidates the output and facilitates the extraction of the most important features from each feature map. A flattening layer then reduces the three-dimensional output to a two-dimensional one in preparation for concatenation later. Finally, the extracted features from the three channels are concatenated into a single vector to prepare for concatenation with the output of the stacked BiGRU. Using multiple convolution kernels increases the model's complexity but gives it far more expressive power by allowing it to learn high-level contextual features.
The latter (left branch) consists of a stacked BiGRU layer with 50 units to encode the auxiliary information by using multiple BiGRU layers run for the same number of steps in order to learn long-term bidirectional (forward and backward) dependencies. The resultant output is concatenated with the previously mentioned single merged vector. The concatenated vector is then passed through a dense layer comprising one unit with a Sigmoid activation function. The model is trained using the Adam optimizer with binary cross-entropy as the loss function. The training is conducted for seven epochs, with a batch size of 16, on a single GPU.
It is important to mention that these hyper-parameter values were chosen after multiple runs, as they consistently yielded the best results.
We have chosen to utilise a multi-channel CNN approach for encoding the input text due to its ability to handle multiple news resolutions simultaneously. This allows us to process different aspects of the news content and integrate them effectively by jointly learning features. By employing this approach, we aim to capture a comprehensive understanding of the input text. In addition to capturing textual information, we recognise the importance of behavioural patterns in detecting fake news. We have incorporated various behavioural features to address this, including user posting behaviour, sentiments, morality, etc. To model these patterns effectively, we have employed a stacked BiGRU architecture. In this architecture, multiple BiGRU layers are run for the same number of time steps to encode behavioural patterns so as to capture the implicit and explicit temporal dependencies and context in such data. By leveraging the bidirectional nature of BiGRU, we can capture information from both past and future contexts, allowing us to model the dynamics of the behavioural clues effectively.
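The two-branch architecture described in this section can be approximated with the Keras functional API as sketched below. This is an assumption-laden illustration, not the authors' code: the sequence length, number of filters, pooling size, number of auxiliary features, and the use of exactly two BiGRU layers are guesses, the text input is assumed to be precomputed BERT token embeddings, and the three channels share one input tensor for brevity.

```python
from tensorflow.keras import layers, Model

MAX_LEN = 128     # assumed number of tokens per news item
EMB_DIM = 768     # BERT-base embedding size (stated in the text)
N_AUX = 10        # assumed number of auxiliary features per item
N_FILTERS = 64    # assumed number of filters per channel

def build_model():
    # Right branch (mCNN): three channels with kernel sizes 4, 6, and 8.
    text_in = layers.Input(shape=(MAX_LEN, EMB_DIM), name="bert_embeddings")
    channels = []
    for kernel_size in (4, 6, 8):
        x = layers.Conv1D(N_FILTERS, kernel_size=kernel_size, activation="relu")(text_in)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Flatten()(x)
        channels.append(x)
    text_vec = layers.Concatenate()(channels)

    # Left branch (sBiGRU): auxiliary features treated as a length-N_AUX sequence.
    aux_in = layers.Input(shape=(N_AUX, 1), name="auxiliary_features")
    y = layers.Bidirectional(layers.GRU(50, return_sequences=True))(aux_in)
    y = layers.Bidirectional(layers.GRU(50))(y)

    # Merge both branches and classify with a single sigmoid unit.
    merged = layers.Concatenate()([text_vec, y])
    out = layers.Dense(1, activation="sigmoid")(merged)

    model = Model(inputs=[text_in, aux_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_model()
```

Given arrays of matching shapes, training with the settings reported above would correspond to calling `model.fit` on the two inputs with `epochs=7` and `batch_size=16`.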

Experiments
In this section, we describe the experiments carried out to evaluate the effectiveness of the proposed architecture. The experiments were conducted using the Python programming language, and the models were built using the TensorFlow library. We evaluate the performance of the models using commonly used evaluation criteria for text classification tasks: accuracy, precision, recall, and F1 measure. The datasets are divided into training (80%) and test (20%) sets to evaluate the model's performance.
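For reference, the 80/20 split and the evaluation criteria named above can be written out in plain Python as follows. The helper names, the random seed, and the choice of label `1` as the positive class are assumptions made for this sketch; in practice a library implementation would normally be used.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a list and split it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

train, test = train_test_split(list(range(10)))
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
```

In this toy example there are three true positives, one false positive, and one false negative, so all four scores come out at 0.75.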

Datasets
The effectiveness of the proposed framework is evaluated using three publicly available datasets, detailed as follows.

FakeNewsNet Dataset
In line with [51], we recognise the importance of analysing user engagement with news articles on social media platforms to improve the detection of fake news. Therefore, we incorporate relevant information from user-news interactions to enhance the accuracy of our detection framework. To facilitate this, we utilised the FakeNewsNet (https://github.com/KaiDMML/FakeNewsNet (accessed on 1 January 2023)) dataset, which was collected from two fact-checking platforms, PolitiFact and GossipCop. This dataset consists of labelled news articles and includes social context information obtained from Twitter, such as user engagements/activities. This social user posting behaviour encompasses various features, including follower count, favourite count, retweets count, and verified tweets count. In our approach, we utilised this dataset as input for the BERT-mCNN module to encode the news articles. Through the utilisation of this module, we captured the contextualised semantic information within the news articles, facilitating a comprehensive representation of the textual content. Simultaneously, we used the user posting behaviour clues as input to the sBiGRU module, which encoded the temporal dynamics and patterns of user interactions. The statistics of the dataset are shown in Table 2.

FA-KES Dataset
The FA-KES (https://zenodo.org/record/2607278#.X3oK8WgzaUk (accessed on 2 February 2023)) dataset comprises 804 news articles concerning the Syrian war. Each article includes the full body of text, headline, news sources, date, location, and two class labels denoted as '0' and '1' for fake news and real news, respectively. We utilise the BERT-mCNN module, as detailed in Section 3.2, to encode both the headline and news text. Additionally, numerical features such as date features (month, day, year, and weekday) are incorporated. These numerical features are vital for providing temporal information associated with the news and are encoded using the sBiGRU module. The statistics of the dataset are shown in Table 3.

LIAR Dataset
The LIAR dataset (https://www.cs.ucsb.edu/william/data/liardataset.zip (accessed on 19 February 2023)) consists of approximately 12,800 labelled short statements. This dataset comprises two main components: the user profile (e.g., subject, speaker's name, credit history, etc.) and short political statements. The statements were reported between 2007 and 2016 and were categorised by the editors of Politifact.com using six fine-grained categories. For our study, we considered a binary classification problem where we categorised statements labelled as true, mostly true, and half true as "true" while the remaining were considered "false". In our approach, we have incorporated textual auxiliary features, such as subject, speaker's name, etc., by appending them to the end of each corresponding statement. This allows us to integrate these additional textual attributes into the encoding process of the model. On the other hand, numerical features such as credit history features are used as direct inputs to the sBiGRU module; see Section 3.2 for more details. The statistics of the dataset are tabulated in Table 4.

We hypothesise that utilising content-based features and contextual cues can potentially improve detection performance. Based on this hypothesis, we extract the content-based features described in Section 3.1.1 and experiment with different datasets containing different user behavioural information. We conduct three experiments: (1) To evaluate the effectiveness of the proposed model, we compare it against baseline methods. (2) We experiment using only contextual features (user posting behaviour, user profile, and temporal information) together with the text content (e.g., news article) as input to our models, with two pre-trained embedding models, namely, BERT-base and GloVe. (3) We experiment using only content-based features (i.e., features extracted from the text input, such as sentiment, morality, and other linguistic-based clues) together with the original text content (e.g., news article) as input to our models, with the above-mentioned pre-trained word representation models.
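The LIAR preparation steps described above (collapsing the six fine-grained labels into two classes and appending textual metadata to each statement) can be sketched as follows. The label spellings follow the hyphenated form commonly seen in the released files and should be treated as an assumption, as should the function names and the single-space separator used when appending metadata.

```python
# Assumed label spellings; the three "truer" categories map to "true".
TRUE_LABELS = {"true", "mostly-true", "half-true"}

def binarise_label(label):
    """Map a six-way LIAR label to the binary scheme used in this study."""
    return "true" if label in TRUE_LABELS else "false"

def build_model_input(statement, subject, speaker):
    """Append textual metadata to the end of the statement (separator is assumed)."""
    return f"{statement} {subject} {speaker}"

six_way = ["true", "mostly-true", "half-true", "barely-true", "false", "pants-fire"]
binary = [binarise_label(label) for label in six_way]
text = build_model_input("Taxes went up.", "taxes", "jane-doe")
```

Numerical profile attributes, such as the credit history counts, are not appended to the text; they would form the separate auxiliary input instead.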

Baseline Methods
In this study, we conduct a comparative analysis between the proposed method and state-of-the-art algorithms across multiple datasets. Our primary objective is to assess the predictive efficacy of our model and the utility of the features introduced for detecting fake news. The comparative evaluation encompasses a range of methods proposed for fake news detection and general text classification models:
1. mCNN [9]: a model consisting of multiple convolution filters to capture different granularity from text data (e.g., a news article). We use BERT as an encoding model; thus, we call this baseline BERT-base-mCNN.
2. SAF [52]: a model which uses the FakeNewsNet dataset and integrates social user activities-related features with linguistic-based features. The results reported here are adopted from [52].
3. BiLSTM-BERT [53]: a natural language inference approach used to determine the truthfulness of news using the PolitiFact dataset. This method utilises BiLSTM and BERT embeddings. The results reported here are adopted from [53].
4. LNN-KG [1]: a neural network model applied to the PolitiFact dataset, which uses different representations for both the textual patterns and the embeddings of concepts found in the input text. The results reported here are adopted from [1].
5. Multinomial Naive Bayes [2]: a hybrid model leveraging the FA-KES dataset that integrates features derived from both textual content and metadata associated with news articles to discern fake news. The results reported here are adopted from [2].
6. Hybrid CNN-RNN [3]: a deep learning model that utilises both CNN and RNN in a hybrid approach for the classification of fake news using the FA-KES dataset. The results reported here are adopted from [3].
7. Naive Bayes (NB) [4]: a probabilistic model used to distinguish fake from real news on the LIAR dataset. The results reported here are adopted from [4].
8. Support Vector Machine (SVM) [2]: a popular classifier used to distinguish fake from real news on the LIAR dataset. The results reported here are adopted from [2].
9. BERT-base-BiGRU(Att): a stacked BiGRU architecture is employed, incorporating multiple BiGRU layers operating for an identical number of time steps, followed by an attention layer. This configuration enables the model to capture bidirectional dependencies, encompassing both forward and backward contexts, thereby enhancing effectiveness compared to unidirectional GRU models.
10. BERT-base-text-sBiLSTM: this model focuses solely on textual content, disregarding user posting features and other auxiliary factors. By solely analysing the text, the model may overlook the nuances where certain portions of the text are true but are used to bolster false claims. Incorporating BiLSTM provides an advantage as it enables a thorough examination of the input text, encompassing both preceding and subsequent events, thereby enhancing the model's ability to discern the veracity of the content [4].
11. BERT-base-CNN-sBiGRU: this framework employs a combination of BERT-CNN for encoding text representations and stacked BiGRU layers for modelling additional auxiliary features. Subsequently, the outputs from both models are merged, followed by the application of a Sigmoid layer.
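The final step of the last baseline (merging the outputs of the text branch and the auxiliary branch, then applying a Sigmoid layer) can be illustrated with a small sketch. It assumes that "merged" means concatenation and that the Sigmoid layer is a single output unit; the vectors, weights and the 0.5 decision threshold below are invented for illustration and are not learned or reported values.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse_and_classify(text_vec, aux_vec, weights, bias=0.0):
    """Concatenate two branch outputs and apply one sigmoid unit.

    Concatenation is an assumption; the text only says the outputs
    of both models are merged.
    """
    fused = list(text_vec) + list(aux_vec)
    if len(fused) != len(weights):
        raise ValueError("weights must match the length of the fused vector")
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(z)


# Hypothetical branch outputs and weights, for illustration only.
text_vec = [0.2, -0.1, 0.4]   # stands in for the text encoder output
aux_vec = [1.0, 0.5]          # stands in for the auxiliary-feature branch output
weights = [0.5, 0.5, 0.5, -0.2, 0.1]

prob = fuse_and_classify(text_vec, aux_vec, weights)
label = int(prob >= 0.5)      # threshold chosen for the sketch
```

In a trained network the weights and bias would be learned jointly with both branches; the sketch only shows how the two representations meet at the output.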

Results and Discussion
Based on the results presented in Tables 5-8, the proposed model consistently outperforms the state-of-the-art results by notable percentages: ↑3.59% (PolitiFact), ↑6.8% (GossipCop), ↑2.96% (FA-KES), and ↑12.51% (LIAR), considering both content-based features and additional auxiliary information. For the FakeNewsNet dataset, the proposed model exhibits superior performance compared to the models presented in [52]. The referenced model in [52] considered various feature sets learned jointly with news article content, providing significant contextual knowledge to the model. The improved performance of the sBiGRU(Att) model can be attributed to its ability to select the most salient parts of a sequence, facilitated by the attention mechanism. In contrast, the text-sBiLSTM model shows suboptimal performance across all datasets, especially on the LIAR dataset. Consistent with [15], we acknowledge that BiLSTM is more prone to overfitting on the LIAR dataset, leading to underperformance. Additionally, the mCNN model performs poorly on the same dataset, indicating vulnerability to overfitting. Figure 2a illustrates the training and validation loss and accuracy on the LIAR dataset using BERT-base-text-sBiLSTM, while Figure 2b depicts those of BERT-base-mCNN. Similar insights can be observed in the FA-KES dataset. Surprisingly, the CNN-sBiGRU model achieves the lowest F1 score, particularly on the LIAR dataset, suggesting its incapacity to capture useful patterns for detecting fake content. However, the same network ranks as the second-best performing model on FA-KES. In contrast, the proposed model consistently yields the best results across datasets, demonstrating its capability to capture dataset intricacies effectively. Models integrating metadata, content-based cues, and news content representation consistently outperform those relying solely on context-based cues (metadata) or content-based clues. We can see that BERT-base-mCNN-sBiGRU (all) > BERT-base-mCNN-sBiGRU (metadata) > BERT-base-mCNN-sBiGRU (content). The results can be seen in the tables below. Numbers in bold refer to the highest scores achieved among all the implemented algorithms. For brevity, we exclude the outcomes achieved by GloVe when utilising the fusion of content- and context-based features. It should be noted that A%, P%, R%, and F1% denote accuracy, precision, recall, and F1 score, respectively. Based on the experimental results, we have the following observations:
1. The interplay between the distinct properties of metadata and news content uncovers more patterns for machine learning models to identify, ultimately leading to enhanced detection performance. This seems to confirm the hypothesis that using both content- and context-based clues can improve the detection performance. More details on assessing the impact of selecting useful features can be found in Section 4.5.
2. Considering solely news content features for detecting fake news yielded subpar results, underscoring the importance of behavioural information in distinguishing between fake and genuine news articles.
3. Context-aware embedding models like BERT-base perform considerably better than off-the-shelf context-independent pre-trained models like GloVe. This highlights the advantage of semantically contextual representations over context-independent embedding methods, albeit with the caveat of their higher computational complexity.
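For readers who want to reproduce the A%, P%, R%, and F1% columns, the standard binary definitions can be computed as in the sketch below. The text does not state which class is treated as positive or whether any averaging is applied, so the `positive=1` convention and the toy label lists are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"A": accuracy, "P": precision, "R": recall, "F1": f1}


# Toy labels, invented for illustration (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = classification_metrics(y_true, y_pred)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

On imbalanced data such as FakeNewsNet, accuracy alone can look favourable for a majority-class predictor, which is one reason precision, recall and F1 are reported alongside it.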

Prediction Performance of BERT vs. GloVe Using Solely Context-Based Features
In this section, we assess the effectiveness of the proposed model and different baseline models (i.e., CNN-BiGRU and sBiGRU(Att)) on different datasets using solely basic contextual features, such as user posting behaviour in FakeNewsNet (Tables 9 and 10), spatiotemporal features in the FA-KES dataset (Table 11) and the user profile in the LIAR dataset (Table 12). The investigation found that using mCNN coupled with stacked BiGRU (the proposed approach) achieved the best accuracy. The results are tabulated in the tables below, highlighting the best scores in bold. We achieved better accuracy using pre-trained BERT-base embeddings than with the off-the-shelf pre-trained GloVe. The analysis found that the attention mechanism boosts the detection performance, where the attention-based classifier (sBiGRU(Att)) yields comparable results to the proposed approach. This could be attributed to the fact that placing a stacked BiGRU on the input text representations results in more semantic representations, harnessed by extracting both past and future contexts. Moreover, such semantic representations might be improved by applying the attention layer to the output of the stacked BiGRU layers, leading to more accurate results. This model (sBiGRU(Att)) is found to be the best-performing model for the GossipCop and LIAR datasets and the second-best among all the models. For the baseline CNN-sBiGRU, we found it to be the second-best model for FA-KES. Moreover, the proposed framework achieved the best accuracy compared to all baselines.
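The role of the attention layer on top of the stacked BiGRU outputs can be illustrated with a small pooling sketch. The exact attention formulation is not specified in this section, so dot-product scoring against a context vector is an assumption, and the hidden states and context vector below are invented values standing in for learned quantities.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weight each time step by its similarity to a context vector, then sum.

    hidden_states: one vector per time step (stands in for BiGRU outputs).
    context: a vector of the same size (learned in a real model; assumed here).
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


# Hypothetical outputs for a three-step sequence.
hidden_states = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
context = [1.0, 1.0]

pooled, weights = attention_pool(hidden_states, context)
```

The time step most similar to the context vector receives the largest weight, which is the sense in which attention lets the classifier focus on the most salient parts of a sequence.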
Exploiting user behaviour gives the model extra contextual knowledge, resulting in good detection performance. It should be noted that we conducted further experiments to test the contextual features in FakeNewsNet. On PolitiFact, using only followers count, favourites count, retweets count, and verified tweets count achieved a 92% F1 score, which is better than using all features. The opposite is true for the GossipCop data, where using all user posting behavioural information yields the best results, with a 91% F1 score. Readers are referred to Section 4.5 for more details.
It is important to acknowledge that the imbalanced class distribution within the FakeNewsNet dataset could potentially impact the performance of the model. The inherent skewness in the number of fake and real news instances may lead to a model biased towards the majority class. We anticipate that by equalising the representation of fake and real news instances in such a dataset, the model will be better equipped to learn from both classes, thereby mitigating potential bias and improving its ability to generalise to a wider range of scenarios. Future iterations of our research will explore and implement class balancing techniques to ensure a fair and robust evaluation of our fake news detection model.
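As one illustration of the class balancing mentioned above, inverse-frequency class weights can be computed so that errors on the minority class count more during training. This is only one possible technique, chosen for the sketch and not something the study reports using; the label counts are invented.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    With this heuristic, a perfectly balanced dataset gives every
    class a weight of 1.0, and rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Invented label distribution: six instances of class 0, two of class 1.
labels = [0] * 6 + [1] * 2

class_weights = inverse_frequency_weights(labels)
```

Such weights are typically passed to the loss function; resampling the training set (over- or under-sampling) is an alternative with different trade-offs.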
Moreover, the incorporation of user behavioural information introduced heightened complexity to the model architecture. The utilisation of the stacked BiGRU layer, chosen for its effectiveness in capturing temporal dependencies in user behaviour, contributed to increased computational overhead and extended training times. As part of future work, we aim to explore approaches to further streamline the model, investigating techniques such as model quantisation and pruning to reduce computational demands without compromising the model's ability to capture nuanced user behaviour.
The tables below (Tables 13-16) show how deep learning models perform when relying merely on content-based features. The proposed model performs the best among all baselines (across all datasets), with stacked BiGRU(Att) ranking first on the LIAR dataset. It is noticeable that the proposed framework utilising merely user behavioural patterns outperforms the models that rely on just content-based features on all datasets. The stacked BiGRU(Att) model continues its promising results, while the CNN-sBiGRU network performs poorly using solely content-based features on the LIAR and FA-KES datasets.

Assessing Impacts of Selecting Useful Features
In this section, we aim to thoroughly evaluate the efficacy of the proposed model. Additionally, we conduct a case study to examine the model's performance with various features. To achieve this, we feed the model with these features separately and compare their performance. According to a study by [54], false political news on Twitter tends to be retweeted by a large number of users and spreads very quickly. The authors in [55] also suggest that the dissemination of fake news on social media starts with the behaviour of the user who posts the news. As such, (1) we analyse the FakeNewsNet dataset by investigating the role played by each user posting behavioural feature; and (2) we also analyse the performance of each language feature individually using all datasets. The results of these experiments are shown in the tables below.
For the first experiment, we evaluate the performance of user posting behaviour using the proposed model (BERT-base-mCNN-sBiGRU). We test different sets of feature combinations. Table 18 shows the performance of the proposed model using news article content and user posting behaviour on Twitter for the PolitiFact dataset. Similarly, Table 19 shows that of the GossipCop dataset. Using merely the followers count, favourites count, retweets count and verified tweets features achieves a significantly better performance than using other features on the PolitiFact dataset. On the other hand, considering all features shows better results for the GossipCop dataset. For the second experiment (see Table 20), we first run the model using solely sentiment features, which shows that the model did not perform well when considering merely such features across all datasets. Interestingly, the model performs even worse when considering the combination of sentiment and morality-based features.
Similar observations apply when using only morality-based features. In contrast, the model performs well when considering only the combination of linguistic and sentiment cues on PolitiFact, while considering merely sentiment clues yields better results on the GossipCop dataset. The combination of linguistic and morality-based features yields good performance on the FA-KES dataset; thus, identifying language indicators from user-generated information is critical for detecting fake news. It is noticeable that considering all features together further increases the performance of fake news detection. We deduce that content- and context-based features provide complementary information towards improving fake news detection.
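The feature-selection study described above amounts to evaluating the same model on different combinations of feature groups. A minimal sketch of such an ablation loop is given below; the group names follow the text, but `run_ablation`, `dummy_evaluate` and its scores are hypothetical placeholders, and a real `evaluate` function would train and score the model on the chosen groups.

```python
from itertools import combinations

# Content-based feature groups named in the text.
FEATURE_GROUPS = ("sentiment", "morality", "linguistic")


def feature_subsets(groups):
    """Yield every non-empty combination of feature groups."""
    for size in range(1, len(groups) + 1):
        for subset in combinations(groups, size):
            yield subset


def run_ablation(groups, evaluate):
    """Map each feature subset to the score returned by `evaluate`."""
    return {subset: evaluate(subset) for subset in feature_subsets(groups)}


def dummy_evaluate(subset):
    """Placeholder scorer: returns the number of groups used.

    This is NOT a real result; it only makes the sketch runnable.
    """
    return len(subset)


results = run_ablation(FEATURE_GROUPS, dummy_evaluate)
best = max(results, key=results.get)
```

With three groups the loop covers seven configurations; adding the user posting behaviour features as further groups would extend the same loop to the experiments summarised in Tables 18 and 19.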
To this end, the research methodology is strategically designed to stay adaptive and robust to the dynamic nature of fake news and the evolving landscape of online content. Recognising the limitations of static approaches, we incorporate advanced contextual embeddings, specifically leveraging pre-trained models like BERT, to capture the ever-changing semantics of news articles. The proposed framework places a strong emphasis on multi-aspect language representations, acknowledging the diverse forms fake news can take. Furthermore, the inclusion of user behavioural information adds a layer of adaptability, as user engagement patterns continually shift over time. The hybrid nature of the proposed framework, encompassing pre-trained embeddings, multi-channel CNN, and stacked BiGRU, allows the model to dynamically learn from multiple sources, ensuring its efficacy in the face of evolving content structures. Through extensive experimentation on diverse real-world datasets, we evaluate the performance of the proposed framework across various scenarios, thereby affirming its adaptability and effectiveness in addressing the challenges posed by the dynamic nature of misinformation in online content.

Limitations
While the proposed approach demonstrates promising results in fake news detection, several limitations need to be acknowledged. Firstly, the effectiveness of the proposed model may vary across different datasets, as it heavily relies on the quality and diversity of the training data. Additionally, the computational resources required for training and testing the model may pose practical constraints, particularly for researchers with limited access to high-performance computing facilities. Lastly, the generalisation of the findings to other languages or domains beyond the scope of the study remains uncertain and requires further investigation. Addressing these limitations would be essential for enhancing the robustness and applicability of the proposed approach in real-world scenarios.

Note
The data can be downloaded from the links provided in Section 4.1.

Conclusion
We have proposed a novel deep learning framework that integrates additional auxiliary feature representations together with news content representation. The interplay between user posting behaviour, temporal or user profile cues, and model training helps uncover informative features that distinguish falsified from genuine information. Enriching such features with extra valuable content-based knowledge significantly improves the detection performance. Experiments on four real-world datasets demonstrate the effectiveness of our proposed framework. Our experiments demonstrate that not only does BERT-base-mCNN-sBiGRU perform well, but individual components of it also outperform comparative methods. BERT offers a good advantage in improving performance, but at the expense of considering only sequences within a certain length. That is, BERT imposes a 512-token length limit, and longer sequences are simply truncated, resulting in the loss of some important information. One way to overcome this is to apply an effective summarisation method so as to shrink the sequence to a length equal to or less than the token length limit imposed by BERT. Moreover, using knowledge-based networks to identify the veracity of news by checking the facts could further improve fake news classification results. Furthermore, future endeavours should consider adapting and refining these indicators to suit the specific linguistic and cultural nuances of diverse languages. This nuanced approach aims to enhance the framework's robustness and applicability in capturing each language's unique characteristics, thereby contributing to the development of more effective language-specific indicators for global fake news detection. Finally, although conducting a statistical analysis is a commendable idea, it presents significant challenges due to issues such as data imbalance and algorithmic randomness. Overcoming these challenges requires a robust statistical model
capable of providing insights across all approaches.However, given the complexity and scope of this task, it exceeds the current paper's capacity, and therefore, we

Figure 1. The high-level structure of the proposed approach comprises three components: (1) the news content module; (2) the multi-modalities module, where the former is used to model semantic contextualised representations from the news content while the latter runs gated layers to learn the multimodal information; and (3) the classifier component, which makes predictions by fusing the information from these two modules.

Figure 2. LIAR training and validation accuracy and loss graphs using (a) BERT-base-text-sBiLSTM and (b) BERT-base-mCNN models.

Table 1. Previous studies on fake news and rumours detection using various features.

Table 2. The statistical information of the FakeNewsNet dataset.

Table 3. The statistical information of the FA-KES dataset.

Table 4. The statistical information of the LIAR dataset.

Table 5. A comparison (%) of detection effectiveness on the PolitiFact dataset. The best performance scores are bolded.

Table 6. A comparison (%) of detection effectiveness on the GossipCop dataset. The best performance scores are bolded.

Table 7. A comparison (%) of detection effectiveness on the FA-KES dataset. The best performance scores are bolded.

Table 8. A comparison (%) of detection effectiveness on the LIAR dataset. The best performance scores are bolded.

Table 9. Performance comparison (%) of contextual features using (a) BERT-base and (b) GloVe on the PolitiFact dataset. The best performance scores are bolded.

Table 10. Performance comparison (%) of contextual features using (a) BERT-base and (b) GloVe on the GossipCop dataset. The best performance scores are bolded.

Table 11. Performance comparison (%) of contextual features using (a) BERT-base and (b) GloVe on the FA-KES dataset. The best performance scores are bolded.

Table 12. Performance comparison (%) of contextual features using (a) BERT-base and (b) GloVe on the LIAR dataset. The best performance scores are bolded.

Table 13. Performance comparison (%) of content-based features using (a) BERT-base and (b) GloVe on the PolitiFact dataset. The best performance scores are bolded.

Table 14. Performance comparison (%) of content-based features using (a) BERT-base and (b) GloVe on the GossipCop dataset. The best performance scores are bolded.

Table 15. Performance comparison (%) of content-based features using (a) BERT-base and (b) GloVe on the FA-KES dataset. The best performance scores are bolded.

Table 16. Performance comparison (%) of content-based features using (a) BERT-base and (b) GloVe on the LIAR dataset. The best performance scores are bolded.

Table 17 gives insights into the computational efficiency of the proposed model for different datasets.

Table 17. Training and inference time of the proposed model in seconds.

Table 18. Evaluating (%) the effectiveness of selecting useful features using BERT-base-mCNN-sBiGRU on the PolitiFact dataset. Note that A, P, R, and F1 refer to accuracy, precision, recall, and F1 score, respectively. The best performance scores are bolded.

Table 19. Evaluating (%) the effectiveness of selecting useful features using BERT-base-mCNN-sBiGRU on the GossipCop dataset. Note that A, P, R, and F1 refer to accuracy, precision, recall, and F1 score, respectively. The best performance scores are bolded.

Table 20. Evaluating (%) the effectiveness of selecting useful features using BERT-base-mCNN-sBiGRU. Note that P, G, F, and L refer to the PolitiFact, GossipCop, FA-KES, and LIAR datasets, respectively. The best performance scores are bolded.