Fighting the COVID-19 Infodemic in News articles and False Publications: The NeoNet Text Classifier, a Supervised Machine Learning Algorithm

Abstract: The spread of the Coronavirus pandemic has been accompanied by an infodemic. The false information embedded in the infodemic affects people's ability to access safety information and follow proper procedures to mitigate the risks. This research targets the falsehood part of the infodemic, which prominently proliferates in news articles and false medical publications. Here, we present NeoNet, a novel supervised machine learning text mining algorithm that analyzes the content of a document (a news article, a medical publication) and assigns a label to it. The algorithm is trained on TFIDF bigram features, which constitute a network training model. The algorithm is tested on two different real-world datasets from the CBC news network and COVID-19 publications. In five different fold comparisons, the algorithm predicted the label of an article with a precision of 97-99%. When compared with prominent algorithms such as Neural Networks, SVM, and Random Forests, NeoNet surpassed them. The analysis highlighted the promise of NeoNet in detecting disputed online content that may contribute negatively to the COVID-19 pandemic.


Introduction
Without doubt, the Coronavirus pandemic has affected the world around us in unprecedented ways. In particular, an emerging infodemic of news articles, social media posts, and publications has accompanied the global pandemic and circulated a vast volume of information, some of which is misleading [1]-[7]. According to the World Health Organization, an infodemic is "an overabundance of information - some accurate and some not" [8]. This means that our digital world is riddled with an enormous amount of misinformation and disinformation resulting from fake news articles, careless social media posts, or publications that have not gone through a rigorous peer-review process [9]. As a result, rumors, conspiracy theories, and stigma are linked to the ongoing COVID-19 pandemic and circulated on social media platforms and news networks. The impact of the infodemic on the general public is unquestionable, as it makes it hard for people to identify reliable guidelines from trustworthy sources [10]. Clearly, the spread of misinformation and disinformation existed long before the pandemic. It has also been considered a social determinant of health due to its impact [11]. The coronavirus infodemic has many aspects: (1) the spread of rumors across the world has led to inappropriate behavior and has caused adverse effects on people's physical and mental health [2], [12]; (2) conspiracy theories have become widespread during the pandemic in an attempt to explain the unusual circumstances [13]; in fact, similar theories emerged during the SARS outbreak in China and the Ebola outbreak in the Congo [14]; (3) stigma has been attached to people and communities associated with the outbreak. Our approach analyzes article content and concludes with a label-prediction process whose final outcome generates a COVID-19 SAFE/DISPUTED label accordingly. In the past few months, Twitter started flagging widely shared content of political dispute with the notice "this claim is disputed". Twitter has also taken more advanced measures and applied filters to remove vaccine misinformation from the platform [20].
In the ideal scenario, we envision that our mechanism would be adopted by all major social media platforms to flag shared news articles as SAFE or DISPUTED COVID-19 articles.

The Role of Machine Learning and Text Mining in Misinformation and Fake News Classification
From the motivation presented in the Introduction section above, it has become clear that computational science in general, and machine learning and computational linguistics in particular, must be at the forefront of the fight against the infodemic. Prior to the COVID-19 infodemic, machine learning and natural language processing had already played an essential role in fighting misinformation and fake news [21]. We believe that innovating new solutions that leverage the power of both fields is the right step to take in this fight.
The literature is rich with valuable methods and algorithms that demonstrate machine learning and natural language processing approaches, individually or as hybrids. Here we share the background and approaches that represent the backbone of the methods of this paper. In the early 2000s, Soon et al. claimed that training a machine learning algorithm with specific linguistic features holds promise for classifying text in general. The authors claimed that their algorithm was the first learning-based system trained on bigram features to achieve results comparable to non-learning methods [22]. Mackey et al., in their efforts to identify suspected fake content on social media, combined natural language processing and machine learning. The approach identified keywords associated with the pandemic and suspected marketing [23]. By analyzing millions of social media posts, the authors adopted a deep learning algorithm that detected high volumes of suspicious and untrustworthy products.
Liu et al. presented a survey-like paper to demonstrate the various applications of combining natural language processing and machine learning, specifically the method of training algorithms using word features (bigrams) [24]. Bigrams are sequences of two words that appear in the text (e.g., "global pandemic") [25]. They provide more valuable and richer textual features than their single high-frequency word counterparts. Aphiwongsophon et al. demonstrated how well-known ML algorithms (e.g., Naive Bayes [26], [27], and Support Vector Machines [28]-[30]) can be used to detect fake news. Their results showed promise, with an accuracy of 96% or better [31].
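As a minimal illustration of the bigram features discussed above (a hypothetical sketch, not the authors' code), consecutive word pairs can be extracted from a sentence as follows:

```python
# Extract bigrams (consecutive word pairs) from a sentence.
def bigrams(text):
    words = text.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

pairs = bigrams("the global pandemic changed public health policy")
# Each pair links two adjacent words, e.g. ('global', 'pandemic').
```

A seven-word sentence yields six such pairs; richer pipelines would filter these by TFIDF rank, as described later in the paper.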
Following a similar path, H. Ahmed et al. also used classical machine learning algorithms (i.e., a variation of the support vector machine), but trained them using n-gram features [32]. The accuracy of their algorithm was lower than that of the previous methods (92%). The authors, however, argued that training the algorithm with n-grams yields better feature quality than high-frequency features that do not contribute to the context of the dataset.
Another interesting approach is that of Conroy et al., who also used machine learning to detect deception when identifying fake news. The approach combined machine learning, linguistic features (e.g., n-grams), and network analysis for networks of linked data. The authors claim that both linguistic and network analysis methods have shown high accuracy in classification tasks for detecting fake news. The authors conclude their research by making the following recommendations: (1) achieving maximum performance requires deeper linguistic analysis, and (2) the utilization of linked data and its corresponding format will assist in achieving up-to-date fact checking [33].

Limitations of Related Studies
The introduction above explained the related methods, motivated the subject, and presented the current state of the art. It is clear that both machine learning and text mining present the cornerstones for text classification and anomaly detection [34]-[40]. However, regardless of the underlying algorithmic classification method (naive Bayes, support vector machines), these approaches were all trained from a static set of textual features such as bigrams. Once the features were derived, there was no further work on how the features relate to each other to tell a much bigger story. Our network training model, however, connects the features in the way the bigrams are naturally connected in the text. This offers the following advantages: (1) it makes the model extensible with new datasets without redoing the entire training, (2) a network model allows pruning (i.e., getting rid of the noise) using inherent centrality measures (degree, betweenness, closeness, etc.), and (3) if necessary, a network model allows multi-label classification by applying network clustering techniques, as becomes apparent in the PubMed Case Study section onward.

Contributions
The main contributions of this work can be summarized as follows: (1) an extensible network model that can be trained with new datasets without retraining on and inclusion of previous datasets, (2) a network model that enables binary classification when used as is, (3) a network model that can support multi-label classification if it is further analyzed using a network clustering algorithm (e.g., the Girvan-Newman algorithm), and (4) since some journal publications have proven to be not credible, the NeoNet algorithm is designed to classify plain-text articles; this is further discussed in the PubMed Case Study section below, where we show that it is possible to train it with publication articles without any changes. Clearly, such contributions make the algorithm a general-purpose text classifier that may be utilized in various applications (such as social media like PatientsLikeMe [41] and online medical forums such as Doctor's Lounge [42]).

Materials and Methods
With the previous introduction and the recommendations made to fight the COVID-19 infodemic, we present a novel supervised machine learning algorithm, which we call NeoNet. The algorithm is specifically designed for COVID-19 news classification. The overall approach of the NeoNet algorithm is centered around a bigram network. We applied the TFIDF algorithm [43] to extract bigrams (pairs of words), which are the bridge to identifying discriminant features. The bigram features naturally present themselves as a network, which we use as a training model. Hence, the role of feature selection using TFIDF to identify bigrams is significant. TFIDF features serve two purposes: (1) they provide discriminant features that contribute significantly to the training phase of the algorithm, and (2) they provide linked features that take the mere article contents to a connectivity level. The result is an interesting network model that offers a platform for testing whether a new article is relevant in terms of both content and connectivity. In this section, we discuss how the algorithm is designed, implemented, and tested. In particular, we present the cornerstone steps that lead to determining the class of news articles: (1) TFIDF feature selection, which is used to extract bigram features from news articles [23], [44], (2) a TFIDF bigram-based network model, which we use for training the algorithm before it is able to predict the label of new articles, and (3) a supervised machine learning algorithm, which predicts the final label for each news article as a SAFE/DISPUTED COVID-19 news article.
The main dataset used in this work is the COVID-19 News Articles Open Research Dataset, which is available on Kaggle [45]. The data exists as a Comma Separated Value (CSV) file comprised of seven columns; the ones of interest here are the article title, the article description, and the full article. The dataset contains 6782 articles, collected from the website of the Canadian CBC News network. The preprocessing of the text is done using the Pandas [46] framework, and the linguistic analysis is done using TextBlob [47]. We split the dataset into 10 folds, where each partition contained 500 articles. We used one fold for training the algorithm and another five to test it. For each test fold, we set the minimum support parameter to a certain threshold and compared the performance. Figure 1 shows this process.
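The fold-splitting step described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the column names and in-memory articles are stand-ins for the Kaggle CSV, which would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

# Illustrative stand-in for the 6782-article CBC dataset; a real run
# would load the Kaggle CSV with pd.read_csv(path).
articles = pd.DataFrame({
    "title": [f"article {i}" for i in range(6782)],
    "text": ["covid 19 news body ..."] * 6782,
})

FOLD_SIZE = 500
folds = [articles.iloc[i:i + FOLD_SIZE]
         for i in range(0, len(articles), FOLD_SIZE)]

train_fold = folds[0]      # one fold trains the model
test_folds = folds[1:6]    # five folds are held out for testing
```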

TFIDF Feature Selection and Model Construction
Every training model starts with a good representation of data items. For text classification in particular, feature selection is the prerequisite step for such a task. Various approaches are designed around the idea of selecting a set of words that best represents a document (or a set of documents). The most common text feature selection approach is based on the idea of term frequency. Specifically, the Term Frequency-Inverse Document Frequency (TFIDF) method [32], [43] has been the most dominant. In this section, we discuss how we extracted the bigram features needed for training the NeoNet classifier. For this task, we used a COVID-19 news dataset that is trustworthy and publicly available (published on Kaggle [45]). Because raw text presents inherent issues (e.g., format, encoding, and punctuation), we performed a preprocessing step to address them.
We split the list of articles into 10 folds of 500 articles and used one fold for feature selection with the TFIDF algorithm. Given that a TFIDF feature can be one word or more, we calibrated the algorithm to capture features of exactly two words (i.e., bigrams). TFIDF scores each feature and ranks them accordingly. When TFIDF was run against the training fold (500 articles), it produced 193914 bigrams. Clearly, keeping them all would make the model noisy, which could also lead to an overfitting problem. Therefore, we selected only the top 500 ranked bigrams and ignored the rest. Table 1 shows a sample of the top-ranked features selected from the training dataset before the noise removal. The most significant and most highly ranked term in the dataset is "covid 19"; its corresponding score according to TFIDF is 20.9. The scores of the remaining terms fluctuated between 3 and 7, which shows their relevance when compared with the score of the "covid 19" bigram. The remaining three subplots contained bigrams such as "tested positive", "physical distancing", "coronavirus cases", "social distancing", and "federal governments". Clearly, such bigrams depict an accurate picture of the pandemic in terms of mitigation, represented by "social distancing" and "physical distancing". The global impact of the pandemic is also represented by bigrams such as "federal governments", "health organizations", and "public health".
Bigrams, as a means of network construction, are widely used in various computational problems [48]. We present an incremental network construction approach that is well-known in the literature in prominent algorithms (e.g., Prim's algorithm [49], [50]), which starts with an empty set of nodes and incrementally adds new nodes, one node at a time. We follow the same method of construction: our goal is to add all the bigrams that meet a certain criterion. The bigram extraction step, which was discussed above, produced the set of length-two features. Length-two features not only capture the core features necessary for classification but also offer a network model that can be used for training a classification algorithm. They offer a source-target mechanism where the source and target are nodes in the network connected by an edge. The continued addition of new bigrams forges incremental linkage. The final outcome of this process is a graph whose structure and characteristics depend on the dataset being analyzed (i.e., healthcare, politics, business, etc.). For the COVID-19 domain, following the incremental process ensures the upfront production of high-quality features. The network ensures that classified bigrams are related to the content and are not the result of a verbatim exact match.
The following example demonstrates how a TFIDF feature of length two can provide the foundation for constructing the needed network. Consider a sentence such as: Top U.S. health official Dr. Anthony Fauci said it has a "clear cut, significant, positive effect in diminishing the time to recovery" [51] after favorable results of a clinical trial. Performing the TFIDF feature extraction step on it produces the bigrams (health official) and (clinical trials). These two bigrams contribute four different nodes (unigrams), namely health, official, clinical, and trials. They also contribute two edges: one between health and official, and another between clinical and trials. As we analyze more sentences, we encounter the mention of "strict public health measures", which contributes another bigram (health measures). Putting these bigrams together and connecting them based on the bigram relationship forms a graph where the node health is connected to both official and measures. The continuous addition of bigrams extracted from the training dataset results in a much larger network, as Figure 3 illustrates. Some bigrams share a common word: for example, (health issues) and (health problems) have the node health in common. While such bigram features would be used as-is in other machine learning algorithms, a network model such as ours naturally prunes repeated features that can lead to overfitting, a problem that other algorithms suffer from. Table 1 summarizes the training model, which was constructed from the top 500 TFIDF features selected, and Figure 4 shows a simplified version of the network training model constructed from 100 bigram features (also produced by TFIDF).
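The incremental construction just described can be sketched with NetworkX (an illustrative sketch; the paper does not name its graph library). Each bigram contributes two unigram nodes and one edge, and bigrams sharing a word attach to the same node automatically.

```python
import networkx as nx

# Bigram features from the worked example above.
bigram_features = [
    ("health", "official"),
    ("clinical", "trials"),
    ("health", "measures"),
]

# Incrementally grow the training network, one bigram at a time.
G = nx.Graph()
for left, right in bigram_features:
    G.add_edge(left, right)  # adds any missing nodes automatically

# "health" now links to both "official" and "measures".
neighbors = set(G.neighbors("health"))
```

Because `Graph.add_edge` is idempotent, repeated bigrams collapse into a single edge, which is the natural pruning of duplicate features mentioned above.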

The Design of NeoNet Classifier
The previous step explained how a network-based training model is derived from a given set of news articles. Here we present the algorithmic steps that lead to labeling a new, previously unseen article. The algorithm is controlled by a configuration parameter which we call minimum support, inspired by the Apriori algorithm [52]-[54]. The minimum support guarantees that a certain number of bigrams is present in each article; otherwise, the article is labeled as suspicious. It ensures that the article contains sufficient content that contributes to the training model. If the article does not meet this condition, it will not be classified as SAFE. Clearly, an article that does not have a minimum number of TFIDF features is unlikely to communicate significant facts worthy of reading [55]. As for the percentage generated by the minimum confidence, it guides the setting of the minimum support and helps set it to a sufficient level. This becomes significant for long versus short articles: long articles are expected to have a higher number of TFIDF features than shorter articles. If the minimum support parameter is set too low, this percentage helps correct the issue and ensures that news articles are not classified as "SAFE" when they should be classified as "SUSPICIOUS". The NeoNet algorithmic steps are described below and are also expressed in pseudocode in Algorithm 1.
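The interplay between the support count and the confidence percentage can be sketched as follows. This is a hypothetical sketch: the paper does not give an exact confidence formula, so the ratio of matched bigrams to total bigrams is our assumed length normalization, and `label_article` is our illustrative name.

```python
def label_article(article_bigrams, model_edges,
                  min_sup=15, min_conf=0.05):
    """Sketch of the support/confidence test described above.

    article_bigrams: bigrams extracted from the article
    model_edges: set of (left, right) bigram edges in the training model
    min_conf: a hypothetical matched/total ratio used to normalize
    for article length (the paper's exact formula is not stated).
    """
    matched = sum(1 for b in article_bigrams if b in model_edges)
    total = max(len(article_bigrams), 1)
    confidence = matched / total
    if matched >= min_sup and confidence >= min_conf:
        return "SAFE"
    return "SUSPICIOUS"
```

A long article can clear `min_sup` by volume alone, which is precisely the case where the confidence ratio guards against spurious "SAFE" labels.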

Experiments and Results
Using the training model resulting from the bigram feature selection step, we conducted a series of classification (testing) experiments. Using five different folds of the dataset and different configurations of the minimum support parameter, we measured the precision of the NeoNet algorithm. The rationale behind this is to come up with a threshold that produces the best outcome. The minimum support parameter is based on the number of bigrams produced by each article. The higher the number of matching bigrams, the higher the chance of an article being classified as a positive COVID-19 class. The experimentation guides the algorithm toward a reasonable threshold: a very high number of required bigrams would lead to classifying only articles that are extremely similar in content, so the algorithm would miss articles that belong to the COVID-19 class but are less similar to the training model. On the other hand, a very low threshold would classify any article with a slight overlap as positive and would make the algorithm imprecise. We used the following min-sup configurations: [5, 10, 15, 20, 25] bigrams.

Algorithm 1 NeoNet: A Noun-phrase Bigram-based Classification Algorithm

repeat
    d ← next document to classify
    sup_count ← 0
    bigram_list ← extract_bigrams(d)
    for each bigram ∈ bigram_list do
        if left_unigram ∈ G then
            sup_count ← sup_count + 1
        else if right_unigram ∈ G then
            sup_count ← sup_count + 1
    if sup_count ≥ min_sup then
        positive ← positive + 1
        d ← COVID-19 label
until no more documents to classify

Figure 5 shows the five different test sets and how they were classified by NeoNet with various minimum support levels. The analysis shows that requiring at least 5 bigrams to match against the training set produced a precision value between 99.6% and 100%. With a more demanding parameter (min-sup = 25), the classification results fluctuate between 91.18% and 95.59%. The experiments showed that a value in the middle is reasonable (min-sup = 15); it produced classification precision that fluctuates between 97.99% and 99.20%. Each fold is plotted using five different curves, one for each minimum support value. Figure 5 shows the performance of NeoNet with different configurations of the min-sup parameter against the CBC News dataset using min-sup levels [5, 10, 15, 20, 25].
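Algorithm 1 can be sketched as runnable Python. This is an illustrative rendering, not the authors' implementation: `model_nodes` stands in for the node set of the trained bigram network, and the support test mirrors the left/right unigram membership check in the pseudocode.

```python
def neonet_classify(documents, model_nodes, min_sup=15):
    """Sketch of Algorithm 1: count bigrams whose unigrams appear
    in the training network G, and label the document SAFE when the
    support count reaches min_sup."""
    labels = []
    for doc in documents:
        words = doc.lower().split()
        bigram_list = list(zip(words, words[1:]))
        sup_count = 0
        for left_unigram, right_unigram in bigram_list:
            # A bigram supports the COVID-19 label when either of its
            # unigrams is already a node in the training model.
            if left_unigram in model_nodes or right_unigram in model_nodes:
                sup_count += 1
        labels.append("SAFE" if sup_count >= min_sup else "SUSPICIOUS")
    return labels
```

Raising `min_sup` makes the labeler stricter, which reproduces the precision trade-off explored in the experiments.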
We showed above how NeoNet can be controlled using the minimum support to make it flexible for various scenarios. However, the algorithm also performs exceptionally well without tuning such configurations. In this section, we show its performance compared with the most prominent algorithms (e.g., SVM, neural networks, random forests). Given that such algorithms do not necessarily utilize a similar notion of minimum support/confidence, we set the parameter to min-sup = 15. The algorithm is trained using one fold of 500 articles, and the rest of the dataset was split into 5 folds of 500 articles each.
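A comparison like the one described here can be sketched with scikit-learn estimators standing in for the baselines (the paper itself uses the Orange toolbox; the synthetic features below are an illustrative stand-in for the TFIDF article vectors).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for one fold of TFIDF article features.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

baselines = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Neural Net": MLPClassifier(max_iter=1000, random_state=0),
    "SGD": SGDClassifier(random_state=0),
}

# 5-fold precision for each baseline, mirroring the paper's setup.
results = {name: cross_val_score(clf, X, y, cv=5, scoring="precision").mean()
           for name, clf in baselines.items()}
```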
Each fold of the dataset was tested against NeoNet and compared with a counterpart algorithm. The algorithm was tested on each of the five folds and compared against all other algorithms by measuring the precision achieved across all 5 folds. Figure 6 below shows that NeoNet's performance (shown on the far left of the x-axis) outperforms all other algorithms except when a perfect classification result is achieved by both algorithms (e.g., NeoNet vs. Neural Net).
The diagram shows a common theme: except for NeoNet, Fold-3 (depicted in grey) scored the lowest among all the algorithms. Another noteworthy observation is that Fold-1 and Fold-5 scored the highest (99.2%) among several algorithms, including Stochastic Gradient Descent, SVM, Random Forests, and Neural Nets. This is as close as they get when their precision is compared with NeoNet. Whenever the other algorithms failed to achieve perfect precision, NeoNet showed dominance and outperformed them all.
We conducted the experiment using the Orange Data Mining Toolbox in Python [56]. Figure 6 shows the performance analysis of NeoNet versus the most prominent machine learning algorithms and other common statistical-based methods.

The PubMed Case Study
The Experiments and Results section above presented the new methods and approaches this research has taken to produce a label for never-before-seen news articles. The premise of the algorithm, however, is that it can work the same way with other textual inputs such as medical publications, doctors' notes, etc. In this section, we demonstrate another case scenario showing that NeoNet is a general-purpose text-classification algorithm that performs the same way regardless of the input source type (news, online medical forums, doctors' notes, or medical publications). We start with a publication dataset extracted from PubMed after searching the web portal for keywords such as "Covid 19" and "coronavirus". Though the search results produced more than 2500 medical abstracts (approx. 100,000), we only used 500 abstracts for training and 5 folds of 500 abstracts for testing. This is consistent with the experiments performed on the CBC news dataset.
Following the same processes explained above and demonstrated in Figure 1, we extracted the TFIDF bigrams and constructed a similar training network model. The generated network has the following characteristics: (1) number of nodes: 467, (2) number of edges: 330, and (3) average degree: 2.83. Table 2 below draws a comparison between the two training models generated from the two datasets. While the two datasets produced a relatively close number of nodes, the PubMed model has a significantly smaller number of edges. This is explained by the fact that publications cover various "clusters" of public health issues, such as vaccine development, drug treatment, and the COVID-19 disease and its impact on the human body, among other things. News articles, on the other hand, address the general public in much less domain-specific but commonly related terms. As for the actual classification results, we performed similar experiments against the 5 test folds of the PubMed dataset. Each fold was tested using minimum support values of 5, 10, 15, 20, and 25. We observed a very similar pattern: the higher the number of bigrams needed to classify a document as COVID-19, the lower the precision of the classification. For example, when the classification required 5 bigrams (i.e., 10 connected terms in the training model), the classification precision fluctuated between 98% and 99.4%. When the minimum support was set to 25 bigrams (i.e., 50 connected terms in the training model), the classification precision fluctuated between 89.98% and 91.98%. Figure 7 shows the entire analysis of each fold and the precision obtained with the various minimum support configurations. Comparing with the precision derived from the 5 folds of the CBC News dataset, we find that precision there fluctuated between 99.0% and 100% when the minimum support was set to 5 bigrams, and between 94.18% and 95.59% when it was set to 25 bigrams.
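The model characteristics reported above (node count, edge count, average degree) are standard graph statistics; a sketch of how they would be computed, on a toy graph rather than the real PubMed model, follows.

```python
import networkx as nx

# Toy bigram network echoing the PubMed "clusters" mentioned above.
G = nx.Graph()
G.add_edges_from([
    ("vaccine", "development"), ("drug", "treatment"),
    ("covid", "19"), ("covid", "disease"), ("public", "health"),
])

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
# Each edge contributes two degree endpoints, hence the factor of 2.
avg_degree = 2 * n_edges / n_nodes
```

On the actual PubMed training model these statistics come out to 467 nodes, 330 edges, and an average degree of 2.83.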
Clearly, the drop in precision in the PubMed 5 folds was due to the training model being less connected, owing to its significantly lower number of edges. We believe the precision can be enhanced in the case of multi-label classification (vaccine development, drug treatment, COVID-19 disease and symptoms, etc.). This finding requires further investigation in the future to assess how the training model can offer more insights using the underlying inherent subtopics. Figure 7 plots each fold under the min-sup levels [5, 10, 15, 20, 25]; the plots are displayed from left to right respective to the values of the configurations.

Discussions & Future Directions
In this article, we discussed how the COVID-19 pandemic has been accompanied by an infodemic. In particular, we discussed the various aspects of the infodemic and how it presents a serious health threat to the general public due to the misinformation and disinformation that may exist at the source (e.g., scientific publications, fake news, and social media posts). For instance, we presented evidence of disinformation in a publication that was eventually presumed to be from a suspicious source. The article reported health issues associated with vitamin D. Once the article was published, it was also highlighted by a reputable news organization (i.e., the DailyMail). The matter was made worse when the DailyMail news article [57] was also shared widely on Facebook and Twitter. Clearly, such misinformation (or disinformation, in this case) threatens the world's public health.
This paper also highlighted the various efforts that have been taken by the scientific community in the fight against the infodemic, along with its recommendations. One specific reference, Eysenbach, addressed the infodemic and introduced four pillars that must be observed in order to win this fight. The recommendations included information monitoring and encouraging knowledge refinement and quality improvement processes. Our research has taken such recommendations seriously and implemented them accordingly. Specifically, we presented an information monitoring and knowledge refinement solution that addresses the problem at the source. The research also performed a diligent literature review on which specific tools and research methods should be used. The technical recommendations were influenced by advances in machine learning, computational linguistics, and network science.
Indeed, this paper has presented a novel machine learning algorithm that utilizes knowledge refinement produced by natural language processing to produce training features. We then empowered the algorithm with a network model. Such a model offers both the structural components (i.e., nodes and edges) and the node degree centrality to perform knowledge refinement when constructing the training model. This led to the generation of highly representative features and eliminated noise by using the degree centrality as a heuristic.
As for the actual step of training and testing the algorithm, we selected a trustworthy set of news articles published on Kaggle [45] and divided it into five folds. We performed five different experiments to come up with a reasonable min-sup threshold; each experiment was performed against the 5 folds with a given configuration of the minimum support. The experiment was repeated with the values [5, 10, 15, 20, 25] and showed that a threshold of 15 produced the best results without being too strict or too noisy, yielding a classification precision between 97.99% and 99.20%. The introduction of a network approach based on TFIDF features for training the model has taken text classification from a mere content-matching level to a connectivity level expressed by the underlying relationships that make up the training model. Future work will consider developing an adaptive approach to set such a configuration automatically. We will also consider making the algorithm multi-lingual and testing it against various datasets from various news organizations.
It is worth mentioning that the algorithm can function on text sources other than news. As demonstrated in the PubMed Case Study, we expect the algorithm to function the same way, without any modifications, on online medical forums or doctors' notes; the setting of the minimum support parameter will require calibration by experimentation. Our reason to believe that NeoNet will be successful is that it was already tested on two different types of data (news articles and medical publications) and produced comparable results. This is due to the fact that the algorithm is trained using bigram features extracted from full text. Such features are highly significant in the context of medical publications since they may reference entities such as organs, diseases, genes, proteins, indications, symptoms, etc. Eventually, the training model will be rich, and suspect sources will fail to classify positively against the model. We also believe that adding features from doctors' notes to the training model of the scientific literature will help eliminate suspicious ones, such as the false reference that promoted vitamin D. This is yet to be explored in future publications.
Testing the algorithm against an entirely new set of automatically generated fake news is another interesting future direction the authors are considering. Using means of automatic text generation, we would use one of the same datasets to generate such text. Such a test would provide future insight into how to continue fighting the COVID-19 infodemic.

Conclusions
To conclude, we presented NeoNet, a general-purpose supervised machine learning algorithm that analyzes textual content and produces an extensible network training model. The methods and analysis of this research confirmed the invasion and wide spread of misinformation in major news organizations (e.g., the DailyMail), social media, and even academic publishers. Our research highlighted the dire need for and importance of adopting new defensive mechanisms and procedures against such misinformation and other forms of disputed digital content. While prominent social media platforms (e.g., Twitter) have taken initial steps (flagging disputable content from influential figures), such issues remain dire and urgent. The findings of our research call for adopting new digital publishing models and an overall digital transformation movement that guarantees both freedom of speech and credibility. As demonstrated above, our algorithm provides an intelligent tool for all online text-producing organizations. Given the promise demonstrated by the analysis in this work, we believe that such a tool provides a deeper defensive mechanism than existing counterparts once new policies are made and ready to be adopted.

Authors Contributions:
• First Author: Dr. Mohammad AR Abdeen: The Principal Investigator, who secured the initial funding of this research. The proposal ideas were the core components that were translated into research components. He guided the research directions through the various phases.
• Corresponding Author: Dr. Ahmed Abdeen Hamed: The NeoNet algorithm designer and the hands-on scientist who implemented the algorithm, tested it, suggested new directions to the PI and senior author, and performed all evaluations during the lifecycle of this work. He is the main manuscript writer and was in charge of communication both within the team and with the respected reviewers.
• Senior Author: Dr. Xindong Wu: The senior author of this work. He guided the overall research directions and provided significant insights during the various phases of algorithm evaluation. Dr. Wu also provided directions on how to respond to the reviewers in a satisfactory fashion.