Ternion: An Autonomous Model for Fake News Detection

Abstract: In recent years, people have increasingly relied on social media to keep up with global news, and verifying the authenticity of that news has become a considerable challenge. Social media enables us to easily access news anywhere, anytime, but it also gives rise to the spread of fake news, thereby delivering false information. This has a negative impact on society. Therefore, it is necessary to determine whether or not news spreading over social media is real. Doing so helps to avoid confusion among social media users and is important in ensuring positive social development. This paper proposes a novel solution that verifies the authenticity of news through natural language processing techniques. Specifically, this paper proposes a novel scheme comprising three steps, namely, stance detection, author credibility verification, and machine learning-based classification. In the last stage of the proposed pipeline, several machine learning techniques are applied, such as decision trees, random forest, logistic regression, and support vector machine (SVM) algorithms. For this study, the fake news dataset was taken from Kaggle. The experimental results show an accuracy of 93.15%, precision of 92.65%, recall of 95.71%, and F1-score of 94.15% for the support vector machine algorithm. In terms of accuracy, the SVM is better than the second best classifier, i.e., logistic regression, by 6.82%.


Introduction
Fake news detection has long been a problem because of its long-term repercussions and consequences. Its roots can be traced back to 17th-century propaganda, which evolved into misinformation during the Cold War [1]. In modern days, this problem has become grave due to the emergence of social media platforms. Specifically, in the past few years, social media channels, such as Facebook, Twitter, and Instagram, have emerged as platforms for quick dissemination and retrieval of information. Figure 1 shows a snapshot of some fake news in recent years. According to various studies [2], almost 50% of the population of developed nations depends on social media for news. The importance of social media cannot be denied, and it has emerged as an effective medium in times of crisis, for example, through the role it plays in breaking news [3]. However, one drawback of the convenience provided by social media is the quick dissemination of fake news. In contrast to conventional media such as print media or television, the content of social media can be modified by users, who may enrich it with their opinions or biases. This can alter the meaning or context of the news altogether [5]. According to various studies, social media is a fertile ground for quick sharing of information without fact checking [1].
Fake news can be defined as the creation or modification of news content by a social media user to deliberately or non-deliberately change its apparent meaning or context, contaminating it with their opinion or biases, where the intent may be to jeopardize or harm a person, organization, or society, monetarily or morally. Examples of fake news are sarcasm, memes, fake advertisements, fake political statements, and rumors [3]. A fakester is a term used for a person responsible for spreading fake news. News can have various degrees of credibility, i.e., true, half-true, and false [5]. Fake news can be transmitted in the form of images, video, and text. The life cycle of fake news has been described in [6] as the creation, publication, and propagation of the news.
The impact of fake news spread on social media is immense [7]. It can cause a decline in stock prices, a drop in potential investments, etc. [6]. For instance, the 2016 US election was heavily impacted by fake news [2]. The fake news about the death of President Obama led to the loss of USD 130 billion in the stock market in a very short time. The intent of fake news may be to malign someone for political or personal reasons or to mislead people [6]. There are numerous websites used for detecting fake news, such as FactCheck, Snopes, TruthorFiction, and PolitiFact. Moreover, Google has also launched an initiative called Google News Initiative to counter fake news [3]. However, fake news detection is still a cumbersome task. This is because fake news often contains misleading information mixed with credible facts [2]. The motivation behind fake news can be driven by politics, financial benefit, or ideology [3,5]. In the literature, various approaches based on linguistic features or deep learning techniques, such as the recurrent neural network, convolutional neural network, transformer, bidirectional encoder representations from transformers (BERT), and their combinations, have been used for fake news detection [8]. Detection of fake news can be formulated as a binary or multi-class classification problem. Alternatively, it can be modeled as a regression problem. A number of datasets are also available for fake news classification, such as Kaggle, ISOT, and LIAR [3].
Despite the extensive studies being carried out, the problem of fake news detection is still very challenging, and it is believed that it requires a comprehensive multi-phased approach. Addressing this problem, this paper proposes a novel approach to validate the authenticity of news. The approach comprises first detecting the stance of the news, then identifying the author's credibility, and finally using machine learning to classify the news as fake or authentic. The objective of the research is to classify news as fake or genuine based on various attributes, such as the text of the news and its author's profile.
The potential implications of the proposed work are multifold. As discussed earlier, fake news related to medical symptoms can have severe consequences if assumed true by its consumer. Similarly, fake news can lead to irreparable damage in the health, political, social, and economic sectors. By using the proposed approach, this catastrophic effect can be avoided. This study also serves as a baseline and opens up avenues for future research on fake news detection. There is a scarcity of research related to the use of a three-pronged approach to fake news classification. Research based on machine learning and deep learning is being extensively carried out to identify a novel solution to the issue of fake news detection. The current paper proposes a three-step solution. We have not found any such study in the past. Finally, based on the proposed work, a commercial tool can be developed that can tag news as fake and also provide appropriate ratings on its credibility.
The remainder of this paper is structured as follows: Section 2 presents related work; Section 3 describes the proposed novel approach to detect fake news; the experimental results are discussed in Section 4; and, finally, Section 5 provides the conclusions and future directions.

Related Work
In recent years, several approaches have been proposed to address the detection of fake news. Primarily, they are classified as machine learning approaches, hybrid approaches, topic-agnostic approaches, knowledge-based approaches, and language approaches [1]. The authors of [7] classified the approaches as news content-based learning and social context-based learning. The former is based on the styles of the news being published, while the latter is based on latent information provided to a user by a news article. Users present on social media play an active role in the identification of fake news. For example, Facebook ranks the comments on a post based on the number of replies or user engagement for a particular post [6]. An analysis of the existing literature revealed that there is major work in the direction of stance detection, identifying authors' credibility, and using machine learning to classify news as fake or not. Hence, we discuss the work in these three directions below. Interested readers are directed to [9] for a comprehensive survey.

Stance Detection
Stance detection is an important natural language processing task. It can be the very first step in fact checking [10,11]. In 2016, an online contest known as the Fake News Challenge was started [12]. The objective of this challenge was to encourage the development of tools that may help human fact checkers to recognize intentional falsehoods in news reports using artificial intelligence (AI), natural language processing, and machine learning. In this challenge, stance detection is regarded as stage 1 in the identification of fake news. The main aim is to determine the relevancy of a news article headline to its body. Chaudhary et al. [13] discussed numerous deep neural network-based models for stance detection. They found that using pre-trained Global Vectors for Word Representation (GloVe) word embeddings along with a long short-term memory (LSTM)-based bidirectional conditional encoding model provided the best performance with 97% accuracy.
Bhatt et al. [14] presented a novel approach combining neural, external, and statistical features. The neural embedding was computed with a deep recurrent model, the statistical features were taken from an n-gram bag-of-words model, and the external features were handcrafted using feature engineering heuristics. Bourgonje et al. [15] worked on a system that used a lemmatization-based n-gram approach to carry out binary classification of headlines and article sets. They achieved the best accuracy of the system using logistic regression. In [16], the authors proposed a method to detect spam comments on YouTube by using different machine learning algorithms with the n-gram approach, and they showed that this technique is effective in detecting spam comments. García et al. [17] introduced a system for text classification that executes embedded feature elimination via an a priori algorithm. The aim of their study was to speed up the construction of word sequences by minimizing the number of explored branches.
In order to classify fake news, Saikh et al. [18] used the technique of stance detection with textual entailment (TE). Moreover, they proposed a system that used a combination of deep learning and statistical machine learning approaches. To detect a stance in fake news, Ghanem et al. [19] combined n-gram, lexical features, and word embedding. They accomplished state-of-the-art results (59.6% Macro F1) on the FNC-1 dataset [20]. In [21], a deep neural network architecture was used to predict the stance of a headline and article body.

Author Credibility
Research suggests that information related to the authors of articles helps to identify whether the presented news is fake or not. Hence, another area of research is identifying author credibility. Sitaula et al. [2] discussed different attributes that could help to determine author credibility and its role in news. With the attributes explained, they identified 26 features that were obtained in different categories. Their results show not only the credibility of a given article but also the credibility of articles published by the same author. Another work related to author profiling is mentioned in [24]. A corpus of Twitter data was used for this purpose. According to [22], author credibility plays a very important role in identifying fake reviews online. However, most users do not consider author credibility before sharing news on social media [23]. Therefore, the work on author credibility can be considered to be in its infancy and regarded as an open research challenge in various fields [25].

Machine Learning-Based Classification
In a considerable amount of research, machine learning algorithms have been used for fake news detection. The credibility of news is one of the most important topics of discussion, and many approaches have evolved with time for its assessment. To detect fake news in online text, Girgis et al. [26] utilized deep learning algorithms, such as LSTMs and RNNs. Models (vanilla and GRU) were implemented on the LIAR dataset. Among all algorithms, GRU showed the best performance, so in order to achieve better accuracy, a hybrid model was developed using the techniques of CNN and GRU on the dataset. For the detection of fake news, Gilda [27] applied different machine learning approaches. More machine learning techniques for the detection of fake news can be found in [28–30].
Ajao et al. [31] used a hybrid of long short-term memory (LSTM) recurrent neural network and convolutional neural network (CNN) models. They implemented various deep neural networks: (1) LSTM, (2) LSTM along with dropout regularization, and (3) LSTM-CNN. Among all approaches, LSTM stands out and gives 82% accuracy. Sajjad et al. [32] proposed a reasonably accurate model to identify fake news that combines knowledge engineering and machine learning. In another work, automated discovery of social news is proposed, utilizing three feature extraction procedures: a count vectorizer, term frequency-inverse document frequency (TF-IDF), and a hashing vectorizer [4]. An ensemble-based technique for fake news detection is presented in [33]. Ensemble-based approaches combine various weak classifiers to achieve better accuracy on the combined classification task.
In [34], various machine learning algorithms, such as logistic regression, naive Bayes, and random forest classification, are used.
In [31], a deep learning technique called Fake-BERT was used for the detection of fake news. In [6], a deep learning-based model, EchoFakeD, was proposed with a mix of content and contextual features. The authors proposed an effective tensor factorization scheme. In a number of studies, data augmentation, transfer learning, auto-encoders, and other semi-supervised models have been used for fake news detection [8]. A capsule-based neural network was used in [3] to classify fake news. In [35], the authors used geometric deep learning-based techniques for fake news detection. These are an extension of the convolutional neural network that fuses other information, such as user profiles, news propagation, and the actual content. A hybrid deep learning model based on the combination of CNN and RNN was presented in [36]. The proposed model utilizes a combination of embedding, CNN, and RNN layers implemented in Keras and tested on the ISOT and FA-KES datasets. In [37], blockchain technology was used for the detection of fake news.
In recent years, following the spread of COVID-19, several pieces of fake news have spread in this context. Therefore, numerous studies have focused on the detection of news related to COVID-19. For instance, a novel approach to the detection of fake tweets related to COVID-19 was proposed in [8]. In a similar direction, an analysis of public sentiments based on tweets related to COVID-19 was performed in [38]. In [36], several supervised learning approaches, such as CNN, LSTM, and BERT, were used for the detection of fake news related to COVID-19. Moreover, unsupervised learning techniques, such as model pre-training and distributed word representations, were used.
After an extensive review of the literature, it was found that most of the studies on this topic have focused on stance detection, author credibility, and classification of news. However, existing approaches are limited because of the lack of social or political context awareness underlying the news. Therefore, a multi-stage pipeline is required for the correct classification of the credibility of news. This paper presents a novel approach, combining stance detection, author credibility, and news classification. This approach is motivated by [34], a study in which several machine learning algorithms are used for classification. The objective of this study is to spot fake news on a social media platform, i.e., Twitter. Similar studies focusing on a specific platform have been conducted [35,39,40].

Proposed Approach and Implementation Details
This paper proposes a novel approach to fake news detection. The proposed method comprises the following modules: (1) data collection, (2) pre-processing, (3) feature extraction, and (4) inference engine. The architecture of this fake news detector is depicted in Figure 2.

Dataset Description
For this paper, a dataset called the fake news dataset [14] is selected from Kaggle. The dataset contains five features, namely "Id", "Title", "Text", "Author", and "Label." The dataset has 20718 entries, of which 10349 entries are deemed fake news and the remaining are real news. A description of the dataset is provided in Table 1. A few records of the dataset are displayed in Figure 3.
The data extracted from the dataset were passed through the pre-processing module. By using the Natural Language Toolkit (NLTK) library [19], the text was divided sentence by sentence into tokens. This was followed by Parts of Speech (PoS) tagging, lemmatization, stop word elimination, and Named Entity Recognition (NER). In this module, the proposed model not only identifies traditional named entities (i.e., name, location, and organization), but it also recognizes additional entity types, such as movies, book titles, cartoons, etc. This extension of NER is achieved by utilizing DBpedia.
A word cloud was made for the headline and body text of fake and real news in the selected dataset, and it is shown in Figure 4. A word cloud is a technique for visualizing word frequency: the more often a term appears in the text being analyzed, the larger the word is drawn in the resulting image.
For machine learning-based fake news detection, pre-processed text documents should be represented in vector form. Among the variety of options for converting text into features, the classifiers here use a Bag of Words (BoW) representation along with the TF-IDF vectorizer. Furthermore, the data were split into train, validation, and test datasets.
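To make the vectorization step concrete, the following is a minimal sketch of TF-IDF weighting written in plain Python. It is an illustration under stated assumptions rather than the implementation used in this study: the whitespace tokenizer stands in for the NLTK pre-processing, the function names `tokenize` and `tfidf_vectors` are hypothetical, and the weighting variant (term frequency normalized by document length, inverse document frequency as log(N/df)) is only one of several in common use.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the NLTK pipeline)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed variant: tf = term count / document length,
    idf = log(number of documents / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


if __name__ == "__main__":
    docs = ["fake news spreads fast", "real news spreads slowly"]
    vocab, vecs = tfidf_vectors(docs)
    for term, weight in zip(vocab, vecs[0]):
        print(f"{term}: {weight:.4f}")
```

In this variant, terms that occur in every document (here "news" and "spreads") receive a weight of zero, which is why TF-IDF emphasizes the words that distinguish one article from another.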
During stance detection, the very first step in the inference engine, it is determined whether or not the headline and the body of a news article are relevant to each other. Listing 1 shows the pseudo-code of stance detection. In order to determine relevancy, the cosine similarity technique is implemented, which measures the similarity between two text documents irrespective of their size. If the headline and body text are similar, then the article proceeds to the next module, i.e., author credibility; otherwise, the model declares that the examined news is fake news. Stance detection is a well-established and popular approach in NLP. From the text, it determines whether the author is against, in favor of, or neutral toward a given target [41]. The target could be an individual, an organization, a government policy, a movement, a product, and so forth.
The next step is the verification of author credibility. In this module, the inference engine validates an author's information to judge whether the news is fake or not. The Twitter API [42] is used to obtain the author's Twitter profile. The module first checks how many followers the author has and then checks how many times the news has been retweeted.
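The relevancy check described above can be sketched as follows. This is a minimal illustration, assuming plain bag-of-words counts and a hypothetical similarity threshold of 0.2; the text does not state the representation or the threshold actually used, and the function names are chosen here for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only terms shared by both texts contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as unrelated to everything
    return dot / (norm_a * norm_b)


def headline_matches_body(headline, body, threshold=0.2):
    """Return True if headline and body are similar enough to proceed.

    The threshold value is an assumption made for this sketch.
    """
    return cosine_similarity(headline, body) >= threshold


if __name__ == "__main__":
    body = "the stock market falls sharply today"
    print(headline_matches_body("stock market falls", body))
    print(headline_matches_body("aliens land in paris", body))
```

Because the vectors are normalized by their lengths, a short headline and a long body can still score highly, which matches the property noted above that the measure is independent of document size.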
Gupta et al. [41] described different features for evaluating the credibility of user-generated content on Twitter and proposed a novel real-time framework to assess the trustworthiness of tweets. The framework accomplishes this by assigning a score or rating to content on Twitter to indicate its reliability. The authors of [43] investigated different grouping strategies in order to support versatility and proposed a solution to the constraints present in previously existing procedures.
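The author credibility check can be illustrated with a small rule-based sketch. Everything specific in it is an assumption: the `AuthorProfile` container, the function name, and both threshold values are hypothetical, and the profile data would in practice come from the Twitter API rather than being constructed by hand. Only the order of the two checks (followers first, then retweets) follows the description in the text.

```python
from dataclasses import dataclass


@dataclass
class AuthorProfile:
    """Hypothetical container for the two signals named in the text."""
    followers: int  # number of followers of the author's account
    retweets: int   # number of times the news item was retweeted


def is_author_credible(profile, min_followers=1000, min_retweets=10):
    """Check the follower count first, then the retweet count.

    Both thresholds are assumed values chosen only for illustration.
    """
    if profile.followers < min_followers:
        return False
    return profile.retweets >= min_retweets


if __name__ == "__main__":
    print(is_author_credible(AuthorProfile(followers=5000, retweets=40)))
    print(is_author_credible(AuthorProfile(followers=50, retweets=400)))
```

A score-based variant, in the spirit of the rating approach described in [41], could return a number instead of a Boolean; the binary form is used here because the pipeline passes or rejects an article at this stage.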
Finally, for fake news detection, four different machine learning algorithms are applied, and their results are compared in this paper. The selected algorithms are as follows:
• Decision tree: A decision tree is one of the most popular supervised classifiers for prediction and classification. It splits the dataset by recursively selecting features, which can be in nominal or continuous form. Its most distinctive feature is that it breaks a complex decision process down into a series of simpler decisions, and, as a result, it provides an easy way to understand and interpret the outcome [44].
• Random forest: Random forest is a supervised machine learning method. On the basis of random feature selection, a set of decision trees (base classifiers) is produced, and the class that receives the majority of the votes is selected as the classification. The individual trees are intended to be both accurate and diverse [45]. In a random forest, the individual decision trees form an ensemble whose predictions are aggregated to increase the accuracy of the model. This also reduces over-fitting. The sub-samples are drawn with replacement, keeping their size the same as the original input sample size.
• Logistic regression: Logistic regression is a machine learning technique for classification. In this algorithm, the probabilities describing the possible outcomes are modeled using a logistic function. It is widely used in circumstances in which humans are not suited to perform the classification and automated functionality is required for this purpose [46].
• Support vector machine (SVM): The SVM is a supervised learning algorithm that is widely used to predict or classify data. The classifier is formally defined by a separating hyperplane. That is, given a labeled training dataset, the algorithm outputs an optimal hyperplane that categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class located on one of the two sides. SVMs have been reported to achieve substantial improvements over other well-performing methods, and they can be applied to a wide range of learning tasks. Moreover, they are fully automatic, eliminating the requirement for manual parameter tuning [47].
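The role of the separating hyperplane in two dimensions can be illustrated with a short sketch. It shows only the prediction step of a linear classifier for a hyperplane that is already given; it does not perform SVM training, i.e., it does not search for the maximum-margin hyperplane. The weights, bias, and sample points are made-up values, and the function names are hypothetical.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Return +1 for points on the non-negative side of the hyperplane, else -1."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


if __name__ == "__main__":
    # Illustrative hyperplane x1 - x2 = 0, i.e., the diagonal line in the plane.
    weights, bias = (1.0, -1.0), 0.0
    for point in [(2.0, 1.0), (1.0, 3.0)]:
        print(point, classify(weights, bias, point))
```

In a trained SVM, the weights and bias would be learned from the labeled TF-IDF vectors, and the same sign test would then assign each news item to the fake or genuine class.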

Experimental Results
For experiments, the authors of this paper implemented the proposed approach in Python. To begin the experiment, the selected dataset was passed through the proposed pipeline. Initially, the pre-processing step was performed by using the NLTK library. Stance detection and author credibility were then determined. During the author credibility and stance detection phases, 28.88% of the news was classified as fake, among which 8% was in fact genuine (Figure 6). In the last step, different machine learning algorithms were applied to the data after the pre-processed text document was converted into vector form using the TF-IDF vectorizer.
Moreover, different machine learning algorithms were applied to the proposed dataset. The first model applied was a decision tree for the detection of fake news. The performance of the decision tree was represented by a confusion matrix (shown in the accompanying figure). It can be seen that for the decision tree, the number of true positives (TP) is 1916, and the number of true negatives (TN) is 1524. The overall accuracy is defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where FP and FN denote the numbers of false positives and false negatives, respectively. In many situations, accuracy is not a very good measure. Hence, it is essential to calculate other measures, such as precision, recall, and F1-score. These measures are defined mathematically as
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
The confusion matrix for the random forest classifier, as illustrated in Figure 8, shows that the accuracy of the classifier is 82.23%, the precision value is 81.95%, the recall is 84.44%, and the F1-score is 83.17%. The confusion matrix of the logistic regression classifier, as illustrated in Figure 9, shows that the accuracy of the classifier is 87.2%, the precision value is 87.90%, the recall is 88.88%, and the F1-score is 88.30%. Lastly, an SVM classifier was applied. The confusion matrix and the accuracy of this classifier are shown in Figure 10, and it can be observed that the accuracy of the classifier is 93.15%, the precision value is 92.65%, the recall value is 95.71%, and the F1-score is 94.15%.
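The four measures named above follow directly from the entries of a confusion matrix, as the sketch below shows. The counts in the example are invented for illustration and are not the values of any classifier in this study; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Illustrative counts only (not taken from the experiments).
    metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Reporting precision and recall alongside accuracy matters here because a detector that misses fake items (low recall) and one that flags genuine items (low precision) can have the same accuracy while failing in very different ways.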
After implementing all of the classifiers, their results were compared. It was observed that the support vector machine provides the best results for the proposed fake news detector and performs better than the other classifiers, with an accuracy of 93.15%, precision of 92.65%, recall of 95.71%, and F1-score of 94.15%. Table 2 and Figure 11 provide a comparison of various aspects of the classifiers. Comparing the SVM with logistic regression, which was the second best classifier, it can be observed that the SVM is better than logistic regression in terms of accuracy as follows:
$$\text{Improvement in accuracy} = \frac{93.15 - 87.20}{87.20} \times 100\% = 6.82\%$$
[Figure panel (c): Comparison of recall of classifiers.]
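The improvement figure is a relative change with respect to the logistic regression accuracy, which can be checked with a few lines of Python. The function name is hypothetical; the two accuracy values are those reported above.

```python
def relative_improvement(new_value, baseline):
    """Relative improvement of new_value over baseline, in percent."""
    return (new_value - baseline) / baseline * 100.0


if __name__ == "__main__":
    svm_accuracy = 93.15                   # percent, as reported above
    logistic_regression_accuracy = 87.20   # percent, as reported above
    relative = relative_improvement(svm_accuracy, logistic_regression_accuracy)
    absolute = svm_accuracy - logistic_regression_accuracy
    print(f"relative improvement: {relative:.2f}%")
    print(f"absolute difference: {absolute:.2f} percentage points")
```

The distinction is worth keeping in mind when reading the comparison: the two accuracies differ by 5.95 percentage points, which corresponds to the 6.82% relative improvement.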

Conclusions and Future Work
The detection of fake news on social media platforms is an essential topic of discussion considering the wide dissemination of news and the number of people consuming information through these platforms. In this paper, a solution is proposed based on natural language processing and machine learning for a fake news dataset obtained from Kaggle. The proposed approach is based on stance detection, author credibility, and machine learning algorithms. Stance detection verifies the relevancy between the title and paragraphs of a news article; if there is a match, the next module checks whether the author is authentic in order to determine whether or not the news should be believed. Finally, machine learning algorithms, i.e., logistic regression, support vector machine, decision tree, and random forest, are implemented, and among these, the support vector machine stands out with an accuracy of 93.15%.
In the modern day, access to the internet has become ubiquitous. In just one minute on the internet, 18 million text messages are exchanged over WhatsApp, 2.4 million snaps are created on SnapChat, 38 million SMS messages and 187 million emails are sent, and 0.5 million tweets are posted [48]. Unfortunately, most of the population is dependent on the consumption of information from the internet. Hence, fake news detection has become a major concern. Most of the information flow on the internet is unverified and generally assumed true. This can be used to spread misinformation, destabilize a regime, and create riots. It has been predicted that in the next few years, people will consume more false information than true content [21]. Unfortunately, most content analyses cannot address fake news detection because of its challenges. The existing natural language processing techniques are limited because of the absence of the political or social context required to understand the content [35]. Therefore, there is a need for a multi-stage solution that can address this issue in the form of a pipeline. The proposed approach provides a three-pronged solution to verify the authenticity of any news article. After the stance and the credibility of the author have been checked, the remaining task is formulated as a machine learning problem that can be solved using any of the tested algorithms, such as SVM, random forest, and decision trees. The main advantages of using machine learning are its ability to learn the rules for the detection of fake news from data and the fact that the end user is not required to explicitly program these rules.
There are several limitations of the proposed approach that can be worked on in the future. The proposed approach does not consider the correlation among news items. The correlation among news articles can assist in determining the credibility of a news article. Moreover, the author credibility check is based on Twitter information. This can be extended to include other attributes that are generally not available on social media. The proposed approach can also be extended to the use of advanced deep learning algorithms based on convolutional neural networks, LSTM, GRU, or BERT. Currently, the proposed approach is a sequential pipeline, and news passes through each stage one by one. A novel objective function can be developed based on the scores of stance detection, author credibility, and a machine learning classifier to determine if news is fake or not in a joint fashion. The currently available solutions only mark the news as authentic or unauthentic; however, a working solution requires a score or rating on the credibility of news. The detection of fake news is only one aspect of a bigger problem. Work regarding the fake news evolution process, its mitigation, and later steps of account detection and deletion must also be conducted.