Fake News Data Exploration and Analytics

Abstract: Before the internet, people acquired their news from the radio, television, and newspapers. With the internet, the news moved online, and suddenly, anyone could post information on websites such as Facebook and Twitter. The spread of fake news has also increased with social media, and it has become one of the most significant issues of this century. Fake news is used to damage the reputation of well-regarded organizations for the benefit of those who spread it. The main motivation for this project is to build a tool that examines the language patterns that distinguish fake news from real news through machine learning. This paper proposes machine learning models that can successfully detect fake news. These models identify which news is real or fake and report the accuracy of the classification, even in a complex environment. After data pre-processing and exploration, we applied four machine learning models: the term frequency-inverse document frequency (TF-IDF) vectorizer, logistic regression, random forest classifier, and decision tree classifier. The accuracy of the TF-IDF vectorizer, logistic regression, random forest classifier, and decision tree classifier models was approximately 99.52%, 98.63%, 99.63%, and 99.68%, respectively. Machine learning models can be considered a strong choice for finding reality-based results and can be applied to other unstructured data for various sentiment analysis applications.


Introduction
Fake news is something that everyone is familiar with and needs no introduction. Internet use has grown dramatically in recent years as social media platforms such as Facebook, Twitter, and WhatsApp have evolved. YouTube should also be mentioned as one of the biggest channels through which fake news spreads among the population. These applications have many benefits, such as sharing useful information for the betterment of the population. Their biggest disadvantage is fake news, which spreads like fire in a forest. Fake news is usually spread to achieve financial or political benefits for a person or an organization [1]. Detecting fake news draws on sentiment analysis, a branch of information retrieval and information extraction [2,3].

The main contributions of this study are as follows:
• Pre-processing and extensive data exploration are applied in our work to understand fake and real news.
• To the best of our knowledge, our four proposed machine learning models are more efficient than those reported in previous studies.
• The proposed approach could help determine fake or real news for various other types of datasets.
The organization of our study is as follows: Section 2 presents the related work on the detection of fake news. Section 3 presents the methods and materials. Section 4 presents the results obtained by applying different machine learning techniques to the given dataset. Section 5 discusses the results obtained by applying the machine learning techniques. Finally, Section 6 presents the conclusion of the study and future work.
Electronics 2021, 10, 2326

Literature Review
Fake news is pervasive, and consistently checking the data, content, and distribution of news to label it as true or false has become a research challenge. Many researchers have worked on this problem with some success. Some have applied machine learning, and others have explored deep learning. Still, to our knowledge, little research has approached the problem through sentiment analysis or sentiment information.
Ahmed et al. [18] applied a 4-gram model with term frequency and TF-IDF to extract fake content. The nonlinear machine learning models did not perform as well as the linear models for fake and real news. A limitation of the study was lower accuracy when higher-order n-grams were applied.
Conroy et al. [19] surveyed two major classes of methods for detecting fake news. The first class comprises linguistic approaches, in which the content of deceptive messages is extracted and analyzed to associate language patterns with deception. The second class comprises network approaches, in which network information, such as message metadata or structured knowledge network queries, can be aggregated to provide overall deception measures. The authors see promise in an innovative hybrid approach that combines linguistic cues and machine learning with network-based behavioral data.
Hussein [20] reviewed 41 articles on sentiment analysis (SA) through natural language processing (NLP). The study did not deal with fake news; instead, it focused on detecting fake websites or fake reviews. Moreover, the more a sentiment challenge was explored, the lower the average accuracy rate was. The paper outlines work that could be completed in the future and states that the focus should be on developing a broader line of research that can consistently explore these challenges.
Bondielli and Marcelloni [21] examined features that have been considered to help detect fake news and rumors, provided an analysis of the different methods used for these tasks, and highlighted how difficult it is to collect relevant data for performing them. The study had two limitations. First, it reported and examined the different definitions of fake news and rumors, which have not been stated consistently in the literature. Second, the collection of relevant data featured in the study to represent fake news was inadequate, and the performance of the machine learning models was lower.
Bali et al. [22] addressed fake news detection from the standpoint of NLP and ML. Three representative datasets were assessed, each with its own set of features extracted from the headlines and contents. According to the study's results, gradient boosting surpassed all other classifiers. The accuracy and F1 scores of seven machine learning algorithms were investigated, but they all remained under 90%.
Faustini and Covões [23] recommended using one-class classification (OCC) to detect fake news by training the model on a dataset containing solely fake samples. The case study focused on the Brazilian political scene at the beginning of the 2018 general elections and used information from Twitter and WhatsApp. The study required a great deal of human labour for fact-checking, which made it costly and time-consuming.
Shaikh and Patil [24] extracted TF-IDF features from news datasets to detect fake news sources. The passive-aggressive classifier and SVM model achieved 95% accuracy. However, the datasets were limited, and the number of samples was small.
Recent research by Ahmad et al. [25] looked into different linguistic features that can differentiate between fake and real content. They used a variety of ensemble approaches to train a variety of machine learning algorithms. Experimental evaluation showed that the suggested ensemble learner strategy outperformed individual learners. The KNN model did not perform well in this study. However, the study's findings apply only to textual data; other data types are not addressed.
In another study, Hakak et al. [26] developed an ensemble classification model for detecting fake news, which outperformed state-of-the-art models in terms of accuracy. The proposed methodology collects fundamental properties from false news datasets, then categorizes them using an ensemble model that combines three main machine learning methods. However, the study's implications cannot be generalized due to limited dataset considerations.
Abdullah et al. [27] created a multimodal deep learning model to detect fake news. Still, the model did not produce good results with the convolutional neural network (CNN) and long short-term memory (LSTM) approaches. Model training was time-consuming, and the study was biased towards its datasets.
Sharma et al. [28] developed a tool for fake news detection. The research used a publicly available fake news dataset to examine how the LSTM and BI-LSTM deep learning models work. The models had high loss rates, and LSTM and BI-LSTM only achieved a performance rate of 91.51%.
Nasir et al. [29] investigated automatic detection approaches based on deep learning and machine learning to combat the rise and distribution of fake news. For the categorization of fake news, the study proposed a novel hybrid deep learning model. The model was validated on two fake news datasets, yielding detection results that were superior to non-hybrid baseline approaches. Still, on the ISOT dataset, the accuracy of the machine learning models was less than 90%.
All of the above studies suggested a clear gap in achieving higher performance through machine learning models from datasets based on multiple features such as title and the subject of the fake news.

Dataset and Methodology
This section consists of the materials and methods used in this study to detect fake news from the chosen dataset. Furthermore, Section 3.1 explains the datasets and all of the information related to the dataset. Section 3.2 presents the data pre-processing, Section 3.3 is about data exploration, and the last section, Section 3.4, is related to the methods and algorithms essential to solving this problem.

Dataset Description and Architecture
The dataset used in this study consists of fake news and real news. Each file of the dataset contains more than twenty thousand examples of fake or real news. The dataset includes the title, text, subject, and date on which each article was posted, and it is drawn from the fake and real news datasets used by Ahmed, Traore and Saad [18]. Figure 1 shows the number of fake and real news samples in the form of a bar chart. Figure 2 shows the system architecture representing the stages used in our approach. After analyzing the dataset, we pre-processed it, split it into training and test sets, applied four machine learning classification models to it, and then performed experiments on the test set.

Data Pre-Processing
The data need to be pre-processed before the training, testing, and modeling phases. Before moving to these phases, the real news and fake news are concatenated. In the dataset cleaning process, we removed the columns that were not needed for processing. Punctuation and stop words were also removed. Stop words are words that occur frequently, such as "I," "are," "will," "shall," "is," and "it." Uppercase letters were converted into lowercase letters. After cleaning, the dataset was ready for the exploration step. However, for the sake of more in-depth research, the dataset exploration was completed on both the cleaned and uncleaned data. For the exploration process, both the fake and real datasets were grouped into a data frame to make the processing easier.
The combined total of fake and real news samples can be seen in Table 1.
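The cleaning steps described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the code used in the study: the `STOP_WORDS` set, the `clean_text` helper, the sample titles, and the label convention (1 = fake, 0 = real) are all assumptions made for the example.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real run would use a
# fuller list (for example, the one shipped with NLTK).
STOP_WORDS = {"i", "are", "will", "shall", "is", "it", "the", "a", "to"}


def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    # A translation table with a deletion set removes ASCII punctuation marks.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


# Concatenate the two sources into one labeled collection
# (assumed convention: 1 = fake, 0 = real).
fake_titles = ["WATCH: It Is the End, Trump Says!"]
real_titles = ["Senate leaders will meet to discuss the budget"]
dataset = [(clean_text(t), 1) for t in fake_titles] + [
    (clean_text(t), 0) for t in real_titles
]
```

Keeping the label next to each cleaned text makes it straightforward to explore the fake and real subsets separately later on.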

Data Exploration
The data exploration stage is used to explore and visualize the data to identify patterns and insights from fake and real news. We plotted various charts using the Python libraries Matplotlib [30] and Seaborn [31].
First, we plotted word clouds for the real and fake news samples. The word clouds showed the most prominent terms in the datasets. Figure 3a shows the word cloud of the real news titles, with words such as Trump, Korea, republican, house, Russia, say, new, leader, white, and senate. Figure 3b shows the word cloud of the fake news titles, with words such as Trump, video, watch, Clinton, Obama, Tweet, president, woman, Muslim, and democrat.
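A word cloud sizes each word by how often it occurs, so the quantity behind such a figure is a simple frequency count. The sketch below shows that count with the standard library only; the `top_terms` helper and the sample titles are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def top_terms(titles, n=3):
    """Return the n most frequent lowercase tokens across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts.most_common(n)


# Hypothetical titles used only to illustrate the counting step.
sample_titles = [
    "trump meets senate leader",
    "trump video goes viral",
    "senate vote on trump bill",
]
most_frequent = top_terms(sample_titles, n=2)
```

A plotting library can then map these counts to font sizes to draw the cloud.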

Figure 4a shows the word cloud of keywords from the titles of the real news samples, with words such as Trump, state, republican, president, said, Reuters, and party. Figure 4b shows the word cloud of keywords from the titles of the fake news samples, with words such as Trump, people, and said.
Further, we created two new features from the date column, "year" (Figure 6a) and "month" (Figure 6b), to check which periods contained more fake or real news. All of the 2015 articles in the dataset are fake news. The amount of fake news is higher until month 8, after which the amount of real news increases drastically. In other words, if the month is <=8, the probability of the news being fake is higher.
We plotted a bar chart with the counts of the various news subjects in Figure 7. Political and world news had the highest counts after cleaning the dataset. There are eight different subjects, and their frequencies are shown in Table 2.
Figure 8a explores the text length of real news, and Figure 8b explores the text length of fake news. In real news, the longest text is 3500, and in fake news, the longest text is around 7000.
After the exploration, the data were prepared for modeling, training, and testing, then presented to the machine learning algorithms. The machine learning algorithms were applied to both the cleaned and uncleaned datasets.
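The "year" and "month" features and the month <=8 observation can be illustrated with a small sketch. Everything here is assumed for the example: the rows are made up, the `"%B %d, %Y"` date format is a guess at how the dates are stored, and the helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (date string, label), with 1 = fake and 0 = real.
# The "%B %d, %Y" date format is an assumption about how dates are stored.
rows = [
    ("March 5, 2015", 1),
    ("June 12, 2016", 1),
    ("July 30, 2017", 0),
    ("October 2, 2017", 0),
    ("November 20, 2017", 0),
    ("December 1, 2017", 1),
]


def add_date_features(date_string):
    """Parse the date string and return the derived (year, month) pair."""
    parsed = datetime.strptime(date_string, "%B %d, %Y")
    return parsed.year, parsed.month


def fake_share_by_period(rows, cutoff_month=8):
    """Share of fake articles for months <= cutoff and for later months."""
    totals = defaultdict(lambda: [0, 0])  # period -> [fake count, total count]
    for date_string, label in rows:
        _, month = add_date_features(date_string)
        period = "early" if month <= cutoff_month else "late"
        totals[period][0] += label
        totals[period][1] += 1
    return {period: fake / total for period, (fake, total) in totals.items()}


shares = fake_share_by_period(rows)
```

With these toy rows the "early" share of fake articles is higher than the "late" share, which mirrors the direction of the pattern described for Figure 6b, not its actual magnitude.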

Our Approach
The methods that we used to predict which news was fake and which was real are discussed in this section.

TF-IDF Vectorizer
The Python library Scikit-learn was used [32], which provides a ready-made TF-IDF vectorizer. TF-IDF vectors represent the relative importance of a term within a document and across the dataset as a whole. The first component is the term frequency (TF), which represents how often a word occurs in a document of the dataset (we determined the word frequency in an article during data exploration) [33]. The standard formula for the TF is shown in Equation (1):

$$\mathrm{TF}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}} \tag{1}$$

The next component that needs to be determined for the model to work properly is the inverse document frequency (IDF). It measures how notable a term is in the entire dataset. The formula for the IDF, in its standard logarithmic form, is shown in Equation (2):

$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right) \tag{2}$$

The last quantity is the TF-IDF itself, which is the term frequency multiplied by the inverse document frequency, as shown in Equation (3):

$$\text{TF-IDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t) \tag{3}$$

The TF-IDF model performed the feature extraction and weighted the most relevant terms from the real and fake news in our dataset, which helped to achieve better performance. The TF-IDF vectorizer uses an in-memory vocabulary (a Python dictionary) to map the most frequent words to feature indices and to compute a word occurrence frequency (sparse) matrix. The TF-IDF vectorizer tokenizes the documents and applies inverse document frequency weightings [34].
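As a worked illustration of the three quantities above, the following sketch computes TF, IDF, and TF-IDF in plain Python, assuming the standard definitions (term count over document length, and the natural logarithm of the document ratio). The function names and the three-document corpus are hypothetical, and each term is assumed to occur in at least one document.

```python
import math


def term_frequency(term, document_tokens):
    """TF: occurrences of the term divided by the document length."""
    return document_tokens.count(term) / len(document_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF: natural log of (number of documents / documents containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)


def tf_idf(term, document_tokens, corpus_tokens):
    """TF-IDF: product of the term frequency and inverse document frequency."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Hypothetical, already tokenized mini-corpus.
corpus = [
    "trump said the senate will vote".split(),
    "watch this video of trump".split(),
    "the senate vote was delayed".split(),
]
score_trump = tf_idf("trump", corpus[0], corpus)
score_video = tf_idf("video", corpus[1], corpus)
```

Here "video" receives a higher weight than "trump" because it appears in only one document. Note that Scikit-learn's `TfidfVectorizer` applies IDF smoothing and L2 normalization by default, so its values differ from this textbook form.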

Logistic Regression
The next technique that we used is logistic regression. A logistic regression classifier is used when the value being predicted is categorical; for instance, it gives either a true or a false response. Logistic regression models the relationship between the features and the probability of a specific outcome [35]. The logistic regression model can be imported from sklearn.linear_model.
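To make the link between features and outcome probability concrete, here is a minimal from-scratch sketch of logistic regression trained with batch gradient descent. It is not the Scikit-learn implementation used in the study; the toy feature vectors, the learning rate, the number of epochs, and the 0.5 decision threshold are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias with plain batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this sample drives the gradient.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [
            w - learning_rate * g / len(features) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(features)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (fake) when the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Hypothetical two-feature vectors (for example, TF-IDF weights of two terms).
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

On this separable toy data the fitted model assigns a positive weight to the first feature and a negative weight to the second, so the training points are recovered correctly.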

Random Forest Classifier
A random forest classifier builds multiple decision trees and combines them to produce a more accurate and stable prediction. The random forest has almost the same hyperparameters as a decision tree or a bagging classifier. This technique adds more randomness to the model while growing the trees [36]. Each of the random trees produces a prediction, and the prediction with the most votes is the final output of the classifier [37]. Like the logistic regression model, it can be imported from sklearn.
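Two ingredients of the description above, the randomness added while growing trees and the final vote, can be sketched without any library. The example assumes that the randomness comes from bootstrap sampling (drawing rows with replacement), which is one common source of randomness in random forests; the helper names and the vote values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done for each tree's training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    label, _ = Counter(tree_predictions).most_common(1)[0]
    return label


# Hypothetical outputs of five trees for one article (1 = fake, 0 = real).
votes = [1, 0, 1, 1, 0]
final_label = majority_vote(votes)  # three of the five trees vote for label 1

# Each tree would see a different resampled view of the training rows.
sample = bootstrap_sample(list(range(10)), random.Random(0))
```

Because every tree is trained on a different resample, their individual errors tend to differ, and the vote averages them out.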

Decision Tree Classifier
The last machine learning algorithm that we used is the decision tree classifier, one of the most widely used classifiers in machine learning. Decision trees are a non-parametric supervised learning method that can be used for both classification and regression tasks. The method builds a model in the form of a tree [38]. Tree models in which the target variable takes a discrete set of values are called classification trees. Decision trees perform well and can be built quickly based on the Gini index. Additionally, a decision tree may be suitable for detecting fake news [39]. The decision tree classifier is imported from the sklearn tree module.
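The Gini index mentioned above measures how mixed the class labels in a node are, and a tree prefers the split with the lowest weighted impurity. The following sketch assumes the usual definition (one minus the sum of squared class proportions); the function names and label lists are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left_labels) + len(right_labels)
    left_part = (len(left_labels) / total) * gini_impurity(left_labels)
    right_part = (len(right_labels) / total) * gini_impurity(right_labels)
    return left_part + right_part


# A perfectly mixed node versus a candidate split of seven labels
# (1 = fake, 0 = real).
mixed_node = gini_impurity([1, 1, 0, 0])
candidate_split = weighted_gini([1, 1, 1], [0, 0, 0, 1])
```

A pure node has impurity 0, so the candidate split above, whose left child is pure, scores well below the mixed node.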

Experimental Results
This section has two parts: Section 4.1 describes the experimental setup, and Section 4.2 presents the results.

Experimental Setup
All four models were implemented on Google Colab, which provided a cloud environment. We used Python 3.5 or above. The libraries used for training and testing were NumPy, Pandas, Scikit-learn, the Natural Language Toolkit (NLTK), Matplotlib, and Seaborn. We divided the dataset into training and test sets with a ratio of 80:20.
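An 80:20 split of this kind can be sketched with the standard library alone. The helper name and the fixed seed below are assumptions for reproducibility of the example; the study itself may have used a library routine for the split.

```python
import random


def train_test_split_80_20(samples, seed=42):
    """Shuffle a copy of the samples and split it into 80% train and 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for 100 labeled articles.
train, test = train_test_split_80_20(range(100))
```

Shuffling before cutting matters here because the fake and real files are concatenated, so an unshuffled cut would put mostly one class in the test set.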

Results
The results were evaluated through a confusion matrix and the Scikit-learn classification report of precision, recall, and F1-score.
First, the TF-IDF vectorizer was evaluated on the test dataset. The TF-IDF vectorizer achieved an accuracy of 99%, which is almost perfect. The model correctly identified a total of 4709 fake news instances and 4222 real news instances. However, it also produced 25 real-fake and 24 fake-real instances, that is, samples whose predicted label differed from their true label (misclassifications).
Secondly, the logistic regression model was evaluated on the test dataset. The model performed with an accuracy of 98%. It correctly identified a total of 4644 fake news instances and 4248 real news instances.
Thirdly, the random forest classifier achieved an accuracy of 99%. The model correctly identified a total of 4688 fake news instances and 4210 real news instances.
Lastly, we applied the decision tree classifier, which performed with 99% accuracy. The model correctly identified a total of 4716 fake news instances and 4235 real news instances. The remaining 15 real-fake and 14 fake-real instances are misclassifications, that is, samples whose predicted label differed from their true label. A summary of the results obtained before cleaning the data, which shows the accuracies and the fake and true news counts in numeric form, is given in Table 3. The corresponding summary of the results obtained after cleaning is given in Table 4. A classification report with the precision, recall, and F1-score of each machine learning algorithm is shown in Table 5.
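The metrics in a classification report follow directly from the confusion matrix counts. The sketch below plugs in the decision tree counts reported above, treating fake news as the positive class. Which of the two error counts (15 and 14) is the false-positive count is not stated in the text, so that assignment is an assumption; it affects precision and recall slightly but not accuracy.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Decision tree counts from the text, with fake news as the positive class.
# Assumption: 15 real articles predicted as fake (fp) and 14 fake articles
# predicted as real (fn); the text does not say which count is which.
metrics = classification_metrics(tp=4716, tn=4235, fp=15, fn=14)
accuracy_percent = round(metrics["accuracy"] * 100, 2)
```

With these counts the accuracy works out to roughly 99.68%, which is consistent with the figure reported for the decision tree classifier.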

Discussion
The results of the current study show that all of the classifiers performed exceptionally well. The present study yielded success rates well above 90%, which is notable given that this is the first time the authors have attempted such a project. This research has shown that fake news can be detected quickly and handled effectively. As many research papers are considered successful when results above 80% are achieved, the results of the current study are quite an achievement. The results suggest that fake news need not remain an overwhelming problem in society.
Additionally, this study determined that dataset cleaning is the main step that should be completed in similar studies. There are a variety of factors that cause the spread of fake news, and our paper has shown that fake news can be handled. In our opinion, future work continuing this study should build a graphical user interface (GUI). A good GUI is essential when building an application: people could simply copy and paste any text into it and obtain its classification result. Technology has made our lives easier as well as more challenging. In terms of user requirements, technological options, and support for the decision, the main user requirement is to differentiate between fake and real news, so that users can determine which news is real and which is fake. The technologies involved in this research study are machine learning techniques: the TF-IDF vectorizer, random forest classifier, logistic regression, and decision tree classifier, which can be used after importing the necessary libraries. We chose this design because these classifiers are capable of producing highly accurate results.
A comparison of the different schemes tested within the last three years is shown in Table 6. For example, row 5 of Table 6 reports an accuracy of 95.05% for term frequency-inverse document frequency (TF-IDF) with a support vector machine (SVM) [24]. From Table 6, we see that the accuracies of other papers are lower than the accuracies of our work, which suggests that our models compare favorably. One of the limitations of our study is that the datasets were not massive. In addition, the analysis was only performed on four machine learning models.

Conclusions
Social media generates every kind of news, and much of it is fake. We often see conflicting facts about the same topic and wonder whether both are true, and we then struggle to figure out which source to trust. As discussed in the Discussion section, cleaning the dataset is very important because it changes the results of the study. When we determined the frequencies of words in the dataset, words such as "Trump" and "said" occurred most frequently in the cleaned data. However, when the dataset had not been cleaned, stop words such as "the" and "are" appeared most often. These words carry little meaning on their own until they are used with other terms. Hence, the datasets should be cleaned to produce accurate results. On a concluding note, the authors want to say that spreading fake news sometimes causes happiness, but for many, it causes sorrow. The spreading of fake news should be stopped as soon as possible. In our research, we used machine learning algorithms that were able to show strong results, with accuracies of approximately 98.6% to 99.7%. As a result of this research, people who spend much of their time on the internet have less reason to be afraid of fake news. Finally, the presented paper has some limitations: if the dataset is unbalanced or has not been cleaned, the models will not give accurate results and may be ineffective. A big data framework such as Spark machine learning could achieve better results in terms of processing time [40][41][42][43][44][45]. Furthermore, deep learning-enabled big data models, such as the recently proposed LSTM-based approaches, could also be applied to fake news datasets [46][47][48][49][50].