HyproBert: A fake news detection model based on deep hyper context

News media agencies are known to publish misinformation, disinformation, and propaganda for the sake of money, wider news propagation, political influence, or other unfair motives. The exponential increase in the use of social media has also contributed to the frequent spread of fake news. The introduction of deep learning techniques for complex natural language processing has enhanced the ability to understand and detect fake news and propaganda. This paper proposes HyproBert, a hybrid model for automatic fake news detection. The proposed HyproBert model first uses DistilBERT for tokenization and word embeddings. The embeddings are fed to a convolution layer to highlight and extract spatial features. The output is then passed to a BiGRU to extract contextual features. A capsule network (CapsNet), together with a self-attention layer, processes the output of the BiGRU to model the hierarchical relationships among the spatial features. Finally, a dense layer combines all the features for classification. The proposed HyproBert model is evaluated using two fake news datasets (ISOT and FA-KES) and achieves higher accuracy than baseline and state-of-the-art models.


Introduction
Over the years, the growing use of social media platforms has facilitated the spread of false information. Fake news can serve several goals and can be targeted at specific individuals, groups, political actors, regions, religions, ethnicities, cultures, races, etc. It can also be interconnected with illegal and fraudulent activities [8] [9]. The generation and propagation of fake news has a huge negative impact on the economy, peace, and health all over the world, and it is considered a major threat to journalism. In 2013, the American stock market lost over 130 billion dollars over fake news that the president had been injured in an explosion. In 2016, the US election campaign was packed with fake news glorifying and defaming both presidential candidates. One such fake claim stated that "Pope Francis endorsed President Trump." This caused huge turmoil among the rivals and incited riots among supporters. In recent years, fake news related to the spread and effects of COVID-19 has disturbed the harmony of society, created social instability, and strained the economies of several governments.
To overcome the adverse effects of fake news, its propagation needs to be stopped. Considerable effort goes into moderating social media content with the help of human content moderators. Machine learning algorithms, trained on previously moderated data, are effective for fake news detection. However, classical machine learning models do not provide a significant level of efficacy, nor do they improve over time across various datasets. This research highlights the significance of deep learning techniques for natural language processing tasks. The proposed model is an ensemble of deep models, their careful placement, and hyperparameter tuning, forming an effective solution for fake news detection. It utilizes the strengths of deep learning models in a structured manner to provide an automated, efficient, and effective classification solution for the identification and detection of fake news.
The main contributions of this research in the field of fake news detection include:
- The introduction of a novel deep neural network model, HyproBert, for fake news detection
- Extraction and evaluation of content attributes at multiple orientations based on deep hyper context
- Analysis of the effects of hyperparameter optimization on HyproBert

Section 2 describes the literature review of fake news detection along with the limitations in this domain. Section 3 presents the methodology of the proposed HyproBert model. Section 4 covers the experiments and results. Section 5 discusses the evaluation results and comparative analysis. Section 6 concludes the research along with future work.

Literature review
In this section, a review of various existing machine learning and deep learning techniques being used for detection of fake news is presented. This section also summarizes the status and limitations of fake news detection in the news media.
The research on fake news detection is divided into two types: supervised learning and unsupervised learning. Ozbay and Alatas [1] suggested a two-step method to identify false news using text information; the first stage entails preprocessing, and the second stage applies the textual feature vector to 23 supervised artificial intelligence classifiers for experimental assessment. By applying multiple word embedding approaches across five datasets in three languages sourced from different social media sites, Faustini and Covões [2] created a model for detecting false news that is independent of language and platform. Guo et al. [3] thoroughly reviewed the advancements in the field of bogus information detection, detailing the open problems, notable methodologies, and specifics of available datasets.
For the first time in the domain of fake news detection, Ozbay and Alatas [4] employed two metaheuristic algorithms for the identification of fake news; Grey Wolf Optimization (GWO) resulted in higher performance for social media data. Kumar et al. [7] used five different classifiers and thirteen different attributes to categorize tweets; they also used particle swarm optimization (PSO) to extract an optimal feature set from the tweets' text and improve classifier performance.
Perez-Rosas et al. [5] created two new datasets for automatic identification of false news, each including seven different types of news-related content, and laid out an extensive framework and patterns for distinguishing between authentic and fake news. To identify fake news, Ahmed et al. [6] developed a novel dataset called ISOT, compiled from real news stories around the globe. A Linear Support Vector Machine (LSVM) classifier, along with Term Frequency-Inverse Document Frequency (TF-IDF) feature vector representation, was used for the classification of fake news. However, these methods require bespoke features, which adds to their development and time requirements. Akyol et al. [10] determined the applicability of Random Forest (RF), Gradient Boosted Trees (GBT), and Multilayer Perceptron (MLP) to fake news detection.
With the rise of deep learning, an increasing number of researchers employ it to detect fake news, and recent research is progressively more focused on unsupervised and semi-supervised detection methods. The capacity of deep learning models to automatically extract high-level properties from news articles is a major draw for academics, which makes them particularly useful for diagnosing fake news [11]-[20]. Goldani et al. [?] recommend the use of embedding models and Convolutional Neural Networks (CNN) to detect bogus news. During model training, static and dynamic word embeddings were compared and gradually updated. Their approach was evaluated using two public datasets, including ISOT, with accuracy improvements of 7.9% and 2.1%, respectively. Ma et al. [14] developed a recurrent neural network (RNN) to represent the textual data sequence for rumor detection. Using text data, Asghar et al. [11] created a deep learning model that recognizes rumors by combining CNN with bidirectional long short-term memory (Bi-LSTM). To effectively manage textual contents in a bidirectional manner, Kaliyar et al. [13] developed a hybrid of the CNN and BERT models, named FakeBERT.
Yu et al. [18] emphasized the shortcomings of RNN-based algorithms for the early identification of disinformation and suggested a CNN model for extracting significant features and spotting propaganda. To identify and categorize bogus news, Shu et al. [15] took a pragmatic approach by establishing a co-attention method that finds the top K most important sentences in the material and combines them with the top K most significant user reactions. Autoencoder-based unsupervised fake news detection (UFNDA) was proposed by Li et al. [20]; their methodology combines the context and content data from news to produce a feature vector that improves the identification of false news. For early rumor detection, Chen et al. [12] developed an attention-based RNN to accumulate distinctive language properties over time. Wang [16] offered a new dataset named LIAR and suggested a deep learning model that combines CNN and Bi-LSTM to extract textual features and metadata features, respectively. Yin et al. [17] employed SVM classifiers to classify relationships, with CNN and principal component analysis (PCA) as the feature extraction tools.

Current status and limitations
The literature analysis above confirms the prevalence of false news and its uncontrolled spread. The extraction of deep semantic and contextual information is crucial for the identification and classification of propaganda. Extracting context from an instance's several orientations is a key capability of the capsule network [24]. Capsule networks have also been applied to NLP in various studies, including text categorization [25], misleading headline recognition [26], and opinion categorization [27]. They have also been used specifically for fake news detection [28] [29], but these studies lack an effective exploration of CapsNet. While DL models have seen extensive use in analyzing news articles, they have yet to address the following together: retaining long-term word dependencies, using a parallelization technique in training, accepting bidirectional input sentences, and maintaining an attention mechanism. The proposed HyproBert model integrates a convolution layer, BiGRU, and CapsNet with self-attention to extract spatial features along with contextual information and hidden representations from multiple orientations. In the current work, DistilBERT, a lighter and more efficient version of the BERT model, is implemented to extract context-based high-level characteristics from news text. The goal of the proposed HyproBert model is to offer an interesting, effective, and efficient way to detect fake news.

Methodology
In this section, a detailed description of the proposed HyproBert model for fake news detection in news articles is presented.
The proposed HyproBert model is presented in Figure 1. The title of the news article, along with the complete text, is used for the tokenization process. Afterward, embedding vectors are generated to represent features in the solution space for further classification. Initially, a CNN is applied, along with pooling operations, to extract the spatial features from the embeddings. A BiGRU is applied sequentially to the output of the CNN to gather contextual information. A capsule network is used along with the attention layer to enhance the spatial features. Although the CNN extracts the spatial features from the contents, the capsule network identifies the spatial relations between the extracted features to model the hierarchical relationships within the data. The proposed HyproBert model is an ensemble of essential, efficient, and effective DL models. The sequential processing of the input extracts, enhances, and correlates highly valuable features semantically and contextually. The introduction of the attention layer between the BiGRU and CapsNet improves context awareness and helps CapsNet identify semantic representations and hidden contextual information. The details of the steps involved, from data preprocessing to the final output of the proposed HyproBert model, are provided in the following sections.

Data preprocessing
To utilize a dataset, it must be cleansed according to the requirements of the model. This significantly helps the model understand the data and increases its performance. Preprocessing generally includes steps such as stop word removal, tokenization, removal of special characters and extra spaces, conversion of numbers to words, and sentence segmentation. The datasets under consideration are real-world news datasets and contain numerous URLs. Before using these datasets for experimentation, URLs, stop words, and other unnecessary noise are removed.
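As an illustration, the following minimal Python sketch shows this kind of cleaning using NLTK and regular expressions; the exact patterns and steps are assumptions, not the authors' published code.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the NLTK stop word list
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Remove URLs, special characters, and stop words from a news article."""
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # strip URLs
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)            # strip special characters and digits
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return ' '.join(tokens)

print(clean_text("BREAKING: read more at https://example.com !!!"))
```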

The proposed HyproBert model
The components of the proposed HyproBert model, along with the function and significance of each component, are described as follows.

Input layer
After preprocessing, the data is fed into the HyproBert input layer. The input layer is responsible for splitting the title and text of a news article into smaller tokens. Each token is then indexed with a unique number using a dictionary. Additionally, padding is used to keep the length of the input text constant. Finally, numerical vectors are created from all the text of a news article N, converted into tokens t, and indexed with a dictionary D, such that N ∈ D^(1×t). DistilBertTokenizer is used for end-to-end tokenization.
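The paper names DistilBertTokenizer; a typical way to invoke it is through the Hugging Face transformers library, sketched below. The maximum length of 128 and the concatenation of title and body are illustrative assumptions.

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

title, body = "Example headline", "Example article body ..."  # placeholder inputs
# Padding/truncation keep the input length constant, as described above.
encoded = tokenizer(title + ' ' + body,
                    padding='max_length', truncation=True,
                    max_length=128, return_tensors='tf')
input_ids = encoded['input_ids']            # unique index per token
attention_mask = encoded['attention_mask']  # 1 for real tokens, 0 for padding
```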

Embedding layer
The embedding layer is responsible for learning word representations based on training over immense real-world data. Words that are conceptually related receive comparable vector representations. In this research, we use DistilBERT embeddings. DistilBERT is a distilled version of BERT that retains around 97% of BERT's performance while approximating the output distribution of the BERT model [21]. Whereas BERT-base has over 110 million parameters, DistilBERT is much smaller, with roughly 66 million, and the number of transformer layers is reduced to six. Furthermore, its training and prediction times are significantly shorter. This makes DistilBERT the natural choice for the proposed model.
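A hedged sketch of obtaining the token embeddings with the TensorFlow DistilBERT model from Hugging Face, reusing the tokenizer output from the previous sketch; the paper does not specify this exact API.

```python
from transformers import TFDistilBertModel

distilbert = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

# last_hidden_state has shape (batch, sequence_length, 768): one contextual
# 768-dimensional vector per token, which feeds the convolution layer next.
outputs = distilbert(input_ids, attention_mask=attention_mask)
embeddings = outputs.last_hidden_state
```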

Convolutional layer
Convolutional layers are the fundamental constituents of convolutional neural networks. Each neuron in a convolutional layer is linked to a nearby region of the input and produces its output by computing the scalar product of its weights with that local region.
The convolution layer is utilized in HyproBert to extract the spatial features from the embedding layer. In the experimentation, a one-dimensional convolution layer with 128 filters of size 3 is employed. The local spatial features are combined in a pooling operation to generate higher-order features; the max pooling function is used for this operation. ReLU is used as the activation function to improve performance. The feature sequence can be represented as S_i = f((W_n · x_n) + b), where i is the index of the feature sequence, f is the activation function, W_n is the filter weight, x_n is the n-th input window, and b is the bias.
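In tensorflow.keras, which the paper uses, this layer corresponds to the following sketch, continuing from the embedding sketch above (filter count, kernel size, pooling, and activation taken from the reported settings):

```python
from tensorflow.keras import layers

# 128 one-dimensional filters of size 3 with ReLU, then max pooling (pool size 2).
conv = layers.Conv1D(filters=128, kernel_size=3, activation='relu')(embeddings)
pooled = layers.MaxPooling1D(pool_size=2)(conv)
```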

BiGRU layer
The gated recurrent unit (GRU) is the fundamental building block of the gated recurrent neural network (GRNN). At each step, the GRNN takes a textual vector as input and uses the previous step's output vector in a weighted sum to update the nodes of its hidden layer. The present context is built by selectively retaining or discarding related information. The operation of a hidden layer is managed by the following equations:

I_t = σ(W_I · [h_{t-1}, x_t])
K_t = σ(W_K · [h_{t-1}, x_t])
h̃_t = tanh(W_h · [K_t ⊙ h_{t-1}, x_t])
h_t = (1 - I_t) ⊙ h_{t-1} + I_t ⊙ h̃_t

where I_t is the update gate, K_t is the reset gate, σ is the sigmoid function, tanh is the hyperbolic tangent function, and W_I, W_K, and W_h are trainable parameters. A BiGRU network is capable of understanding and learning relationships between current, past, and future data, which is effective for extracting deep features from the input sequence. The structure of BiGRU is presented in figure 2. The proposed HyproBert model utilizes BiGRU for its strengths in processing context and extracting sentence representations with forward and backward propagation. The BiGRU layer is applied directly to the output of the convolution layer. A forward GRU processes the input sequence (S_1, S_2, S_3, ..., S_128) and the backward GRU processes the input sequence (S_128, S_127, S_126, ..., S_1); both GRUs are then integrated to combine the collected context information. As a result, the output of the BiGRU layer is an improved feature representation of the propaganda incorporated in the input news text.
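A minimal tensorflow.keras sketch of this layer, continuing from the convolution sketch: 128 units per direction follows the hyperparameter settings, and return_sequences=True keeps the per-timestep outputs for the attention and capsule layers downstream.

```python
from tensorflow.keras import layers

# Forward and backward GRUs over the convolutional feature sequence.
bigru = layers.Bidirectional(layers.GRU(128, return_sequences=True))(pooled)
```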

Attention layer
The contextual information and representations from the BiGRU are stored in a fixed-size vector. For large input sequences, the related information cannot be fully stored in this vector, which hinders the model's understanding and decreases overall efficiency. Attention mechanisms are known to focus on the most important parts of the input by generating dynamic adaptive weights [22] [23]. Furthermore, self-attention (SA) lets all inputs interact with one another to generate the output for a single input. SA is a powerful tool for identifying dependencies within input sequences. It considers the entire context of the vector representations and weighs the words according to their positions and associated weights. In the proposed HyproBert model, SA is used to highlight context awareness and relevant features to detect hidden relations in the news data. This results in a better understanding of the data and increases classification accuracy.
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q is the query, K is the key, V is the value, and d_k is the dimension of K. The HyproBert model uses the self-attention process and therefore sets Q = K = V. This offers the benefit that the information at the present position and the information at all other positions can be combined to capture interdependence throughout the full sequence. If the input is a sentence, for instance, each word is attention-computed against all other words in the sentence.
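The scaled dot-product self-attention described above can be written directly in TensorFlow as a short sketch (illustrative, not the authors' implementation):

```python
import tensorflow as tf

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x, shape (batch, seq, dim)."""
    d_k = tf.cast(tf.shape(x)[-1], tf.float32)
    scores = tf.matmul(x, x, transpose_b=True) / tf.sqrt(d_k)  # QK^T / sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)                   # attention weights
    return tf.matmul(weights, x)                               # weighted sum of values
```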

Capsule network layer
Generally, a CNN is known to lose important information during the classification process, and high CNN performance requires hyperparameter tuning, which is a cumbersome and manual task. Hinton et al. [31] presented the capsule network, a novel neural network architecture. It collects syntactically enriched characteristics from the input data while considering the positions and order of other words in a sentence. It has the ability to distinguish between whole and partial relations in textual data. It outperforms a CNN in recognizing representations of the data and the underlying relevant information in the input text [25]-[27]. Over the years, it has shown outstanding results for text categorization and information retrieval.
A capsule network is a combination of several capsules, connected as neurons, that detects semantic and syntactic information from input data. The capsules are represented as vectors of classification probability, and the orientation of a capsule represents the position of text in the input. Additionally, the weights of hidden features are adjusted by the dynamic routing algorithm [32], which improves and controls the attention and connections between capsules to optimize the capsule network. The proposed HyproBert model employs the capsule network to develop and focus the semantic and syntactic awareness of the news text input. The output of the attention layer is provided as input to CapsNet. Initially, a nonlinear function converts the input to a feature capsule F_i, which is used to produce a prediction vector P_{j|i} = W_{ij} F_i, where W_{ij} is a weight matrix, to forecast the relationship between the input and output layers. The coupling coefficients C_{ij} = softmax(T_{ij}) are calculated by the dynamic routing algorithm, and the output capsule is R_j = squash(Σ_i C_{ij} P_{j|i}). The routing logits T_{ij} are updated over the iterations as T_{ij} ← T_{ij} + P_{j|i} · R_j. This process gives attention to the high weights of related propaganda words and overlooks irrelevant, less effective words. As a result, the output of the capsule is a higher-level contextual representation that considers various orientations; the final output capsule R_j encodes a local ordering of words across those orientations.
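For concreteness, here is a minimal sketch of the squash non-linearity and the dynamic routing update described above, following the standard routing-by-agreement scheme with this paper's symbols; it is illustrative, not the authors' code.

```python
import tensorflow as tf

def squash(v, axis=-1):
    """Shrink vector norms into [0, 1) while preserving direction."""
    sq_norm = tf.reduce_sum(tf.square(v), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / tf.sqrt(sq_norm + 1e-8)

def dynamic_routing(P, iterations=3):
    """P: prediction vectors P_{j|i}, shape (batch, in_caps, out_caps, dim)."""
    T = tf.zeros(tf.shape(P)[:3])                          # routing logits T_ij
    for _ in range(iterations):
        C = tf.nn.softmax(T, axis=2)                       # coupling coefficients C_ij
        S = tf.reduce_sum(C[..., None] * P, axis=1)        # weighted sum over input capsules
        R = squash(S)                                      # output capsules R_j
        T += tf.reduce_sum(P * R[:, None, :, :], axis=-1)  # agreement update
    return R
```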

Dense and output layer
The proposed HyproBert model uses the output of the capsule network layer as input to a fully connected layer. This layer classifies the input by applying the sigmoid activation function: a higher probability score indicates that the news article is fake, and a lower score indicates that it is real.
The orderly processing of the proposed HyproBert model is presented in algorithm 1. Firstly, the processing of HyproBert is set to sequential. The news article's title and text body are separated for tokenization, and the tokens are then used for word embeddings; DistilBERT is used for both tokenization and embedding. A single layer of one-dimensional CNN processes the word embedding vectors to extract the spatial features from the data. An optimized BiGRU layer then extracts the contextual features, the self-attention layer and CapsNet enhance and correlate them, and a dense layer produces the final classification.
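Putting the pieces together, a hedged tensorflow.keras sketch of the overall pipeline is shown below. It assumes precomputed DistilBERT embeddings as input and marks where the capsule layer (see the routing sketch above) would slot in; layer sizes follow the hyperparameter settings reported later.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hyprobert_head(seq_len=128, emb_dim=768):
    inp = layers.Input(shape=(seq_len, emb_dim))          # DistilBERT embeddings
    x = layers.Conv1D(128, 3, activation='relu')(inp)     # spatial features
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)  # context
    x = layers.Attention()([x, x])                        # self-attention, Q = K = V
    # A CapsNet layer (5 capsules, 4 dimensions, 3 routing iterations)
    # built on the dynamic-routing sketch above would be applied here.
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(1, activation='sigmoid')(x)        # fake vs. real
    model = tf.keras.Model(inp, out)
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```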

Experiments and results
The experimental details are presented in this section. It includes datasets, experiments, hyperparameter settings, evaluation metrics, comparative analysis, and results.

Dataset
In this research, we validated the proposed HyproBert model using two datasets, FA-KES and ISOT. FA-KES is a fake news dataset containing 804 news articles. The articles relate to the Syrian war and contain the title, date, location, and news text. All articles are labeled: fake news is marked with '0' and real news with '1'. It is considered a well-balanced dataset, as 47% of the news articles are fake and the rest (53%) are real.
ISOT is also a fake news detection dataset, collected by the ISOT lab at the University of Victoria. It consists of about 45,000 news articles related to politics and world news between 2016 and 2017, each containing the title, date, topic, and news body text. The true news comes from the Reuters website, while the fake news was collected from various unreliable sources flagged by PolitiFact and Wikipedia. This is also considered a well-balanced dataset, with an almost equal distribution of fake and true news.
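ISOT is commonly distributed as separate files of true and fake articles; a sketch of loading and labeling it with Pandas follows, where the file names are assumptions.

```python
import pandas as pd

true_df = pd.read_csv('True.csv')   # assumed file name for the real news split
fake_df = pd.read_csv('Fake.csv')   # assumed file name for the fake news split
true_df['label'], fake_df['label'] = 1, 0

# Merge and shuffle so training batches mix both classes.
data = (pd.concat([true_df, fake_df])
          .sample(frac=1, random_state=42)
          .reset_index(drop=True))
```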

Experimental settings
The experimentation of the proposed HyproBert model is performed using Google Colab, a cloud-based tool that provides an environment to execute code along with the GPUs and TPUs required for high performance. The coding language of choice is Python. TensorFlow is an open-source library developed by Google to help researchers build and test machine learning models; we use tensorflow.keras, a high-level API, to build and train the proposed HyproBert model. The Natural Language Toolkit (NLTK) in Python is used for preprocessing the datasets, and the Pandas and NumPy packages are used for data manipulation and analysis.

Hyperparameter settings
The datasets under consideration (FA-KES and ISOT) are divided into training and testing subsets of 80% and 20%, respectively. The optimized hyperparameters for the convolution layer are 128 filters and a pool size of 2. The number of neurons employed by the BiGRU is 128. An attention layer is implemented to extract the hierarchical context representation using the weights of the keywords. Finally, the capsule network uses 3 routing iterations and 5 capsules with 4 dimensions. Additionally, the proposed HyproBert model uses a dropout of 0.4, a batch size of 128, 100 epochs, sigmoid as the activation function, and Adam as the optimizer.
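A sketch of the corresponding split and training call, assuming X holds the embedded articles, y the labels, and build_hyprobert_head the model sketch from the methodology section:

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split, batch size 128, and 100 epochs, per the settings above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = build_hyprobert_head()
model.fit(X_train, y_train, batch_size=128, epochs=100,
          validation_data=(X_test, y_test))
```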

Evaluation metrics
The proposed HyproBert model is evaluated using standard evaluation metrics. Accuracy is the ratio of correctly classified news articles (both fake and real) to the total number of news articles. Precision is the proportion of genuinely fake news among all articles predicted as fake. Recall is the proportion of fake news that is correctly identified. F-measure is the harmonic mean of precision and recall.

Accuracy = (T_p + T_n) / TotalNews    (6)
Precision = T_p / (T_p + F_p)
Recall = T_p / (T_p + F_n)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)

where T_p is the number of true positives, i.e., articles correctly classified as fake news; T_n is the number of true negatives, i.e., articles correctly classified as real news; TotalNews is the total number of news articles; F_p is the number of false positives, i.e., articles incorrectly classified as fake news; and F_n is the number of false negatives, i.e., articles incorrectly classified as real news.
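These metrics can be computed from the model's thresholded predictions, for example with scikit-learn (an illustrative sketch continuing the training example above):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = (model.predict(X_test) >= 0.5).astype(int)  # threshold the sigmoid output
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F-measure:', f1_score(y_test, y_pred))
```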

Results
The results of the proposed HyproBert model on the FA-KES and ISOT datasets, along with comparisons, are shown in Table 2 and Table 3 below. The results reveal that the HyproBert model surpasses the performance benchmarks for both datasets in terms of accuracy, precision, recall, and F-measure, outperforming the baseline and state-of-the-art models. Various factors contribute to the high performance of HyproBert, including the selection of the hyperparameter values, which played a vital role in the higher accuracy. The one-dimensional CNN performs the extraction of spatial features efficiently and effectively. The BiGRU is deployed in a multi-layer manner, which captures and integrates different textual features simultaneously, and its gating mechanism allows it to circumvent the problem of long-term dependencies. The multi-layer BiGRU provides significantly better feature extraction; however, increasing the number of BiGRU layers increases the processing cost exponentially, so the optimal number of layers was obtained through experimentation. To retain the background details supplied by the BiGRU, the attention mechanism computes an adaptive weight that changes constantly, which strengthens the generalization ability and mitigates overfitting. A self-attention layer along with CapsNet is also added to the ensemble to dynamically highlight the syntactically explicit features considering various orientations. Elhadad et al. [30] reported 100% accuracy with their Decision Tree classifier on the ISOT dataset; however, the accuracy of their approaches varied from 85% to 100%, and since only 25.2 thousand articles from the ISOT dataset were used, it appears they used only a portion of it. The proposed HyproBert method performs better than the state-of-the-art method of Elhadad et al. [30]. Among the conventional classifiers, Random Forest performed notably well on both datasets in terms of accuracy, while KNN performed best for FA-KES and Decision Tree performed best for ISOT.

Discussions
As seen in the previous section, the proposed HyproBert model outperforms all the baseline models for fake news detection. The study of the baseline models provided the intuition for the development of HyproBert: the best-performing components are integrated in a logical manner to solve the fake news detection problem effectively, efficiently, and correctly. Several factors contribute to the high performance of HyproBert. Various architectural arrangements and hyperparameters are involved in obtaining optimal outcomes, and their optimal values were found through extensive experimentation for each parameter. The details and effects of these parameters are described in the sections below.

Convolutional layer optimization
Hyperparameters play a vital role in the performance of deep neural networks; an optimal set of hyperparameters produces a highly optimized output. The filter size is an important parameter for detecting the spatial features in the input. Various filter size configurations were tested, and the optimal filter size was determined to be 3 for both datasets under consideration. The effects of the filter size on the performance of the proposed HyproBert model are presented in figure 5. The number of epochs is also an important factor in the performance of the proposed HyproBert model. Increasing the number of epochs does not always result in higher performance; therefore, the optimal number of epochs is also determined through experimentation.

BiGRU optimization
The number of neurons in the BiGRU strongly affects the classification performance of the proposed HyproBert model. The experimentation covered neuron counts of 32, 64, 128, and 256; the optimal number of neurons was determined to be 128 for both datasets. The impact of the number of BiGRU neurons on HyproBert is represented in figure 8.
A single BiGRU is very effective in the classification process, and its performance can be enhanced by deploying it in a multi-layer manner. However, increasing the number of BiGRU layers also increases resource utilization, so to a certain extent it is a trade-off between available resources and performance. The optimal number of BiGRU layers was determined to be 2 for both datasets. The effects of an increasing number of layers on the accuracy, precision, recall, and F-measure are presented in tables 4 and 5 for the FA-KES and ISOT datasets, respectively.
The experimentation also includes testing various multi-layer BiGRU configurations to find the optimal number of layers in terms of iterations and training time for the proposed HyproBert model. The effect of increasing the number of BiGRU layers and iterations is represented in figure 12.

CapsNet optimization
As with conventional neural networks, the performance of CapsNet heavily depends on its hyperparameters. Generally, determining hyperparameters requires experimentation, as there is no universally optimal value for each hyperparameter; optimal values vary with the data and the type of solution. One such hyperparameter is the number of capsules, where a capsule is a set of neurons that highlights a specific property. The optimal number of capsules was determined to be 5 for both datasets. The effects of the number of capsules on the proposed HyproBert model are presented in figure 10.
The next important consideration is the dimension of a capsule, which indicates the length of its output vector. A lower-dimensional capsule contains fewer features and loses most of the information, while a higher-dimensional capsule contains more important features at the cost of increased computational complexity. Therefore, an optimal number of dimensions is needed for optimal output. Dimension values of 2, 4, and 8 were tested for both datasets under consideration, and the optimal capsule dimension was determined to be 4 for both. The effects of the capsule dimension on the proposed HyproBert model are presented in figure 11.
The number of routing iterations is another CapsNet hyperparameter; it governs how the capsules of consecutive layers are connected. The experimentation for evaluating the optimal number of routing iterations was performed on both datasets with values of 2, 3, 4, and 5, and the optimal value was determined to be 3 iterations for both datasets. The effects of the number of routing iterations on the proposed HyproBert model are presented in figure 12.

Ablation Study
An ablation study was performed to evaluate the effectiveness of the designed architecture of the proposed HyproBert model. The architecture of HyproBert was altered with different ensemble combinations to measure the best possible outcome, and the ablation architectures were evaluated on the FA-KES and ISOT datasets. The first ablation architecture is BiGRU + CapsNet: BiGRU collects the contextual information, and CapsNet perceives the hidden relations among the identified features. The second ablation architecture is BiLSTM + CapsNet + Conv: BiLSTM processes data in a cell state as well as hidden states, which can be costly. The third ablation architecture is Conv + BiLSTM + Attention + CapsNet, in which the BiGRU is replaced by a BiLSTM. The fourth is the ensemble of the proposed HyproBert model. The three ablation models, along with the proposed model and the resulting F1 scores, are presented in the table. The results suggest that the proposed architecture of HyproBert is effective and performs better than its ablation counterparts. The receiver operating characteristic (ROC) curve, which assesses the usefulness of a binary classifier, is presented in figure 13; the area under the curve (AUC) determines the performance of the model, i.e., a greater area means better performance.