Analysis and Prediction of User Sentiment on COVID-19 Pandemic Using Tweets

Abstract: The novel coronavirus disease (COVID-19) has dramatically affected people's daily lives worldwide. More specifically, since there is still insufficient access to vaccines and no straightforward, reliable treatment for COVID-19, every country has taken appropriate precautions (such as physical separation, masking, and lockdowns) to combat this extremely infectious disease. As a result, people spend much time on online social networking platforms (e.g., Facebook, Reddit, LinkedIn, and Twitter) and express their feelings and thoughts regarding COVID-19. Twitter is a popular social networking platform that enables anyone to post short messages known as tweets. This research used Twitter datasets to explore user sentiment from the COVID-19 perspective. We used a dataset of COVID-19 Twitter posts from nine states in the United States covering fifteen days (from 1 April 2020 to 15 April 2020) to analyze user sentiment. We focus on exploiting machine learning (ML) and deep learning (DL) approaches to classify user sentiments regarding COVID-19. First, we labeled the dataset into three groups based on the sentiment values, namely positive, negative, and neutral, to train several popular ML algorithms and DL models to predict the user concern label on COVID-19. Additionally, we compared the traditional bag-of-words and term frequency-inverse document frequency (TF-IDF) representations for converting text to numeric vectors in the ML techniques. Furthermore, we contrasted an encoding methodology and various word embedding schemes, such as word to vector (Word2Vec) and global vectors for word representation (GloVe), with three sets of dimensions (100, 200, and 300), for converting text to numeric vectors in the DL approaches. Finally, we compared COVID-19 infection cases and COVID-19-related tweets during the COVID-19 pandemic.


Introduction
Coronavirus disease, also known as COVID-19, is a recent viral disease that emerged in 2019 [1]. Many patients with pneumonia of unexplained origin appeared in Wuhan, China, in December 2019. These patients were traced back to the Wuhan seafood and wet animal wholesale market through contact tracing [2]. Chinese authorities performed a deep sequence analysis of samples from these patients, providing ample evidence that the novel coronavirus was the disease's causative agent. Since then, COVID-19 has spread rapidly in China and other countries worldwide. The World Health Organization has declared COVID-19 a public health emergency of international concern. The main contributions of this work are as follows:
1.
We examined people's emotions about COVID-19 by considering neutral, positive, and negative labels.

2.
We used ML models to calculate the accuracy of various ML approaches to classify the user's feelings about COVID-19 and show that the random forest provides a better result than other ML models.

3.
We have expanded our focus to exploring DL models to classify the user's sentiment about COVID-19, computing the DL models' predictive performance, comparing the results with those of the ML models, and showing that the DL models provide better results than the ML models in most cases.

4.
We try to relate COVID-19 outbreak cases to COVID-19-related tweets among the nine states in the USA.
The remainder of the paper is arranged as follows. Section 1 begins with a brief introduction. Section 2 gives a quick rundown of the related literature. A precise explanation of the whole methodology is given in Section 3. Section 4 discusses the experimental findings. Finally, Section 5 outlines the conclusion and possible future work.

Related Works
Data from social networks are used in analytics to understand human behaviors [11][12][13][14][15][16][17]. During the COVID-19 pandemic, the general public has faced a significant psychological burden because of long-term financial and social crises. It is essential to analyze public opinion to understand people's sentiments and feelings during the pandemic. Sentiment analysis is an efficient text-analysis approach that automatically mines sentiment from unstructured sources such as social media posts, emails, and customer service tickets. Machine learning (ML) approaches can use various kinds of data to mine information automatically [18][19][20][21][22][23]. For example, Jain et al. [18] explore different measures for Twitter sentiment analysis using ML algorithms. A comprehensive methodology is specified for sentiment analysis. The multinomial Naïve Bayes and decision tree models are employed as analysis tools. The decision tree obtains the best results, with evaluations showing 100% accuracy, precision, recall, and F1-score. Researchers from various countries are trying to collect and distribute COVID-19 Twitter datasets [19,24]. Based on COVID-19-specific tweets, the authors of [11] use three different Twitter datasets to perform sentiment analysis. After the datasets are collected, the data is preprocessed, TF-IDF is used for vector representation, and several ML models are used to predict sentiments. In their evaluation, the decision tree provides the best accuracy compared to the others, at 93%. The authors in [12] extract opinions from Twitter based on particular keywords and then use the Naïve Bayes classifier (NBC) algorithm to identify tweet sentiments.
Pokharel et al. [25] describe Nepalese citizens' sentiments about the coronavirus outbreak. They collect tweets using the keywords CORONAVIRUS and COVID-19. The sentiment analysis is performed on tweets shared in Nepal from 21 May to 31 May 2020. In [26,27], the authors have developed a mediative fuzzy correlation technique that can relate the increments in COVID-19-positive patients to the passage of time.
G.E. Hinton first proposed DL in 2006. DL is now a subset of ML that refers to deep neural networks [28]. At present, DL algorithms yield effective natural language processing performance in sentiment analysis over multiple datasets. For instance, the authors of [29] proposed a model that combines a convolutional neural network (CNN) and long short-term memory (LSTM) to predict the sentiment of Arabic tweets. They gain an F1-score of about 64.46%, compared to about 53.6% for the state-of-the-art DL model. Goularas et al. [30] propose models that combine CNN and LSTM networks. They also compare two popular word embedding systems for vector representation, the Word2Vec and GloVe models. Their main contribution is analyzing sentiment on the same dataset to compare performances and evaluate the process under a single testing framework. Ain et al. [31] critique and review several papers using DL techniques such as convolutional neural networks, recursive neural networks, and recurrent neural networks/LSTMs for analyzing user sentiment.
Cliche [32] develops two DL models (CNN and LSTM) to predict binary sentiment classes using a pre-trained model and obtains less than 73% accuracy. Chen et al. [33] propose a combined advanced LSTM-CNN model based on the model proposed by Sosa P. M., compare it with other combined LSTM-CNN models, and achieve 78.6% accuracy. Ali et al. [34] apply sentiment analysis to a dataset of English movie reviews (the IMDb dataset [35]) using DL techniques to classify the reviews as positive or negative.
Sosa P. M. [36] combines two deep learning models, long short-term memory (LSTM) and convolutional neural networks (CNN), to perform sentiment analysis on Twitter data and compares their accuracy against regular CNN and LSTM networks. The combined CNN-LSTM model gained 3% better accuracy than the standard CNN but 3.2% worse accuracy than the standard LSTM. Another proposed LSTM-CNN model gains 8.5% and 2.7% better accuracy than the regular CNN and LSTM models, respectively.

Methodology
This section describes the overall study design and all materials and processes used. After collecting the raw data, we pre-processed it to eliminate any irregularities. We used sentiment analysis to evaluate the sentiment of each document. We then extracted features using various techniques. Finally, we used machine learning and deep learning models to classify user sentiments. Our approach is depicted in Figure 1 and outlined below.

Data Acquisition
In this work, we look at two distinct datasets. These two datasets are as follows: Dataset-I. Dataset-I was obtained from Kaggle [37] and contains a large number of tweet texts on COVID-19 that include the keywords "Corona", "Covid19", and "Coronavirus" (case ignored). We took 15 days of data from this dataset, from 1 April to 15 April 2020, belonging to nine states in the United States (Arizona, Washington, Florida, Georgia, Nevada, California, New York, Texas, and Illinois) for our research purposes. We considered these nine states of the United States since Twitter is the most popular social media site in the United States, and the greatest number of tweets were posted by users from these nine states. Due to computing resource constraints, we only used 15 days of data (1 April to 15 April) to conduct our research. Table 1 shows the number of tweets gathered from these nine states. Dataset-II. To explore the association between the number of tweets and the number of COVID-19 cases, we obtained another dataset from Kaggle that contains the number of COVID-19 cases in each state of the United States, as shown in Table 2. To conduct the investigation, we used the same nine states as in Dataset-I: Arizona, Washington, Florida, Georgia, Nevada, California, New York, Texas, and Illinois. For a comprehensive analysis, we also looked at the number of COVID-19 cases identified in the 15 days between 1 April and 15 April 2020.

Data Processing
Twitter's language model has its own set of properties. Raw tweets typically include considerable noise, misspelled words, and many abbreviations and slang phrases that limit model accuracy. To improve accuracy and remove noisy features, we pre-processed the data. The following steps were performed to pre-process the dataset: 1.
Firstly, we removed all forms of symbols such as #,@,!,$,%,&, HTML tags, and numbers included in the whole dataset. We used a regular expression module from the Python language to perform these steps.

2.
Our collected dataset contains both lower-case and upper-case words. We converted all words into lower case.

3.
Then, we performed tokenization on our whole text data. Tokenization means dividing a comprehensive text document into smaller units, such as individual terms or phrases [38].

4.
Finally, we applied stemming to the whole text dataset to get clean tweet text. Stemming is an approach for obtaining the root form of terms by eliminating their affixes [39]. We used the NLTK library in Python to perform tokenization and stemming.

Sentiment Analysis
Analyzing a text and evaluating its sentiment is known as sentiment analysis. The aim is to assess whether a user's text conveys positive, negative, or neutral sentiment. We use the TextBlob library, which can perform this three-way classification [40].
To obtain a classification, TextBlob provides polarity (P) and subjectivity (S) values. When the polarity value is greater than 0 (p > 0), the text is positive; when the polarity value is equal to 0 (p = 0), it is neutral; otherwise, it is negative. The subjectivity is a floating-point number in the range [0.0, 1.0], with 0.0 being highly objective and 1.0 being highly subjective. Each tweet is labeled with a sentiment after these measures are computed.
For the sentiment label results, we can take some real-life COVID-19 examples. Let us consider the tweets tweet1 and tweet2 in Example 1 and Example 2 to determine which label they belong to. Calling print(format(tweet1.sentiment)) yields:
Sentiment(polarity = −0.2113, subjectivity = 0.625) Labeling manually is generally difficult, but TextBlob makes it easy. Here the polarity value is −0.2113 and the subjectivity value is 0.625. Since the polarity is −0.2113, the tweet is negative, and the subjectivity score of 0.625 suggests that it is fairly subjective.
Sentiment(polarity = 0.0, subjectivity = 0.0) This sentiment has a polarity score of 0.0 and a subjectivity score of 0.0, indicating the statement is neutral and highly objective. In a manual approach, it would be hard to decide whether such a tweet is positive or neutral. For this reason, we use the TextBlob library to obtain the labels for our dataset.
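The polarity-thresholding rule above can be written as a small function. Only the thresholding is sketched here (taking the polarity score as input), so the example runs without TextBlob installed; in the actual pipeline the score comes from TextBlob's sentiment property.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1.0, 1.0] to a class label using the
    thresholds described above: p > 0 positive, p == 0 neutral,
    otherwise negative."""
    if polarity > 0:
        return "positive"
    if polarity == 0:
        return "neutral"
    return "negative"

# The two example tweets above: polarity -0.2113 -> negative,
# polarity 0.0 -> neutral.
print(label_sentiment(-0.2113), label_sentiment(0.0), label_sentiment(0.35))
# → negative neutral positive
```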

Feature Extraction
Feature extraction enhances the performance of trained models by extracting features from input data. We have performed several feature extraction techniques that convert text data into numeric vectors.
Traditional Bag-of-words (BoW): The BoW model is widely used for encoding text in natural language processing and information retrieval (IR). This model represents a text as the bag of its terms, ignoring grammar and word order while preserving multiplicity [41].
Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a scoring metric used in information retrieval (IR) and summarization [42]. The primary objective of TF-IDF is to quantify the significance of a word in a given text. The score is computed by combining the following metrics: (i) the number of times a word appears in a text and (ii) the word's inverse document frequency over a set of documents.
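The two metrics above combine into a score per (document, term) pair. The following is a bare-bones sketch using only the standard library; production toolkits (e.g., scikit-learn's TfidfVectorizer, which the study's pipeline would typically use) add smoothing and normalization on top of this definition. The toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF scores for a list of tokenized documents.

    tf = raw count of the term in the document;
    idf = log(N / df), with N documents and df the number of
    documents containing the term.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term at most once
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]

docs = [["covid", "mask", "mask"], ["covid", "vaccine"]]
scores = tf_idf(docs)
# "covid" appears in every document, so its idf (and score) is 0;
# "mask" occurs twice in doc 0 with idf log(2), giving score 2*log(2).
```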
Word Embeddings: Word embedding is the art of vector representation. The underlying goal is to capture as much relevant semantic and syntactic information as possible. Every word is represented as a numeric vector in a predefined vector space. Popular methods for converting words into numeric vectors include BoW, TF-IDF, Word2Vec, GloVe, and fastText. For our purposes, we used two word embedding methods, Word2Vec and GloVe.
Word to Vector (Word2Vec): Mikolov et al. [43] proposed the well-known word embedding technique Word2Vec, which maps words with similar meanings close to each other. The technique uses two types of models. The first is the skip-gram model, which accepts the center word as input, sends it to an embedding layer, and then predicts the context words; it works well on small datasets. The continuous bag-of-words (CBOW) model is the second; it uses the context words as input, sends them to an embedding layer, and finally predicts the original (center) word. CBOW trains very fast and provides better representations for the most frequent words.
Global Vectors for Word Representation (GloVe): Pennington et al. [44] proposed a model very similar to Word2Vec that can also be used to obtain dense word vectors. However, GloVe's methodology differs slightly from Word2Vec's: it is trained on an aggregated word-word co-occurrence matrix, which, for a given corpus, records how frequently words co-occur with each other. The basic approach of the GloVe model is to build a substantial word-context co-occurrence matrix in which every entry corresponds to a word pair.
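The co-occurrence matrix GloVe is trained on can be built with a simple sliding window. This sketch shows only the counting step (the actual GloVe training additionally weights pairs by distance and fits vectors to the log counts); the two-sentence corpus is a toy example.

```python
from collections import defaultdict

def cooccurrence(tokenized_docs, window=2):
    """Count symmetric word-word co-occurrences within a fixed window,
    i.e., the matrix entries that GloVe fits its vectors to."""
    counts = defaultdict(int)
    for doc in tokenized_docs:
        for i, w in enumerate(doc):
            # every other word within `window` positions co-occurs with w
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if i != j:
                    counts[(w, doc[j])] += 1
    return counts

corpus = [["covid", "cases", "rise"], ["covid", "cases", "fall"]]
counts = cooccurrence(corpus)
# ("covid", "cases") co-occurs in both sentences, so its count is 2.
```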

Classifier Models
Several classification methods have already been used to analyze user sentiment in online social networks. It is worth noting that the classifiers are primarily associated with (i) ML and (ii) DL techniques. We use nine classification models in this study, including seven ML and two DL classifiers, as described below.

Machine Learning (ML) Techniques
We used several ML algorithms in our study, as described below. Logistic Regression (LR): LR is a statistical method that, in its simplest form, employs a logistic function to model a binary dependent variable, although many more complex variations exist [45]. In regression analysis, logistic regression is used to estimate the parameters of a logistic model.

Support Vector Machine (SVM):
The SVM is a plane-based classification algorithm that constructs a separating hyperplane in the descriptive space of the training data [46]. Instances are categorized according to which side of the hyperplane they fall on. In a linearly separable dataset, SVM places the hyperplane through the middle of the two groups, separating them. SVM's main objective is to find the best hyperplane between the two data groups in the training data. SVM finds this hyperplane by solving the following dual optimization problem [47]: maximize Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C for i = 1, 2, . . . , n and Σ_i α_i y_i = 0. k-Nearest Neighbour (k-NN): The k-NN method is one of the most straightforward machine learning algorithms available. It is based on supervised learning. It stores all the available data and classifies a new data point based on its similarity to the stored points, so newly acquired data can easily be classified with the k-NN method [48].
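The store-and-vote behavior of k-NN described above fits in a few lines. This is a minimal sketch of the idea, not the library implementation the study would typically rely on; the training points and labels are toy data.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points, using squared Euclidean distance (monotonic with the
    Euclidean distance, so the neighbour ranking is identical)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors with sentiment labels:
train = [((0, 0), "negative"), ((0, 1), "negative"),
         ((5, 5), "positive"), ((6, 5), "positive"), ((5, 6), "positive")]
print(knn_predict(train, (5, 5)))  # nearest neighbours are all "positive"
```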
Multinomial Naïve Bayes: The Naïve Bayes method uses the Bayes theorem to handle classification problems. It is a probabilistic classifier, which means it predicts based on the probability of an object [49]. The probability of an observation X belonging to class Y_k (with X being, for example, a vector of word occurrences or word counts) is calculated using the following equation [50]: P(Y_k | X) ∝ P(Y_k) Π_i P(x_i | Y_k). The multinomial Naïve Bayes classifier is a variant of the Naïve Bayes classifier that is primarily used for text [51].
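The Bayes-rule formulation above, applied to word counts with Laplace smoothing, yields a compact multinomial Naïve Bayes. This is an illustrative sketch rather than the library classifier the study would use in practice; the four labeled documents are toy data.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels):
    """Fit a multinomial Naive Bayes on tokenized documents and return
    a predict function. Uses log P(Y_k) plus Laplace-smoothed
    log P(word | Y_k), mirroring Bayes' rule for word counts."""
    prior = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
        vocab.update(doc)

    def predict(doc):
        best, best_lp = None, -math.inf
        for y in prior:
            lp = math.log(prior[y] / len(labels))           # log P(Y_k)
            total = sum(word_counts[y].values()) + len(vocab)
            for w in doc:                                    # + sum of log P(w | Y_k)
                lp += math.log((word_counts[y][w] + 1) / total)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

    return predict

predict = train_mnb(
    [["stay", "safe"], ["great", "news"], ["rising", "deaths"], ["sad", "news"]],
    ["positive", "positive", "negative", "negative"],
)
print(predict(["great", "safe"]))
# → positive
```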
Decision Tree (DT): A DT is a tree structure resembling a flowchart, with core nodes marked by rectangles and ovals indicating the leaf nodes [52]. The Decision Tree method pertains to supervised learning methods.
Random Forest (RF): RF is a renowned supervised learning method based on an ensemble learning approach that brings together various classification elements to solve a complex issue and increase the performance of the model [53]. It is a multi-decision tree ensemble classifier that uses a randomly chosen subset of training data and parameters to generate multiple decision trees [54].
Extreme Gradient Boosting (XGBoost): XGBoost is a recent algorithm that has dominated applied ML [55]. XGBoost is a gradient-boosted decision tree implementation optimized for speed and efficiency. XGBoost models require more careful tuning than techniques such as random forests to reach optimum performance.

Deep Learning (DL) Techniques
In previous research, DL techniques, with their automatic feature extraction, achieved very high performance compared to traditional ML techniques and successfully executed sentiment analysis. We implement two DL models that are increasingly applied to sentiment analysis: (i) convolutional neural networks (CNN) and (ii) long short-term memory (LSTM).
Convolutional Neural Networks (CNN): CNN is a particular type of neural network used in various areas, including natural language processing, speech processing, and computer vision. We used a CNN model to analyze user sentiment in the Twitter dataset. Kim [56] first proposed a 1D-CNN model suitable for one-dimensional patterns; a model of this kind is helpful in natural language processing, as it takes input sentences of various lengths and produces fixed-length output vectors. Severyn et al. [57] proposed a CNN model with primary elements such as sentence matrix, activation, convolutional, pooling, and softmax layers. Our CNN architecture follows that of Kim [56] with minor modifications. It has three layers: a convolution (CONV) layer, a pooling (POOL) layer, and a fully connected (FC) layer. First, the CONV layer receives the input data, and the dot product of the filter and the input data is computed. We used tweets as the input of the network. The tweets are tokenized into words and mapped to word vectors via GloVe embedding (with 100, 200, and 300 dimensions), Word2Vec, or an encoding technique, so each tweet is mapped to a matrix of size s × d, where s is the number of words in the tweet and d is the dimension of the embedding space. To make all matrices the same size, we follow a padding strategy so that X ∈ R^(s×d). A single convolution involves a filtering matrix w ∈ R^(h×d), where h is the size of the convolution. The convolution operation can be defined as c_i = f(b + Σ_(r,j) w_(r,j) X_(i+r,j)) [32], where b ∈ R is a bias term and f(x) is a nonlinear function; we chose the ReLU function as the activation function. The output is the concatenation of the convolution over all window positions in the tweet, c ∈ R^(s−h+1). For each convolution, c_max = max(c) is computed by the max-pooling operation.
We concatenate the c_max values of all filters into one vector c_max ∈ R^m, where m is the total number of filters. This vector is passed through a fully connected layer and a softmax layer, and a dropout layer is used to reduce overfitting.
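The convolution-plus-max-pooling step described above can be traced by hand for a single filter. This plain-Python sketch uses toy numbers instead of learned weights and GloVe vectors, purely to make the s × d matrix, the h × d filter, the ReLU, and the s − h + 1 window positions concrete.

```python
def convolve_maxpool(matrix, filt, bias=0.0):
    """Slide a filter of h rows over an s x d tweet matrix, apply a
    ReLU after adding the bias, and max-pool over the s - h + 1
    window positions, returning c_max for this one filter."""
    s, h = len(matrix), len(filt)
    relu = lambda x: max(0.0, x)
    c = [
        relu(bias + sum(matrix[i + r][j] * filt[r][j]
                        for r in range(h) for j in range(len(filt[0]))))
        for i in range(s - h + 1)
    ]
    return max(c)

# A 4-word "tweet" embedded in d = 2 dimensions, and one 2x2 filter:
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(convolve_maxpool(tweet, filt))
# → 2.0  (the first window matches the filter pattern best)
```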

Long Short Term Memory (LSTM):
A recurrent neural network (RNN) is a class of artificial neural networks. One special type of RNN, the long short-term memory (LSTM) network, can explore and learn long-term dependencies. The main applications of LSTM are speech recognition, language modeling, sentiment analysis, and text prediction. Wang et al. [58] first introduced LSTM networks for tweet sentiment analysis. LSTM introduces a memory cell that can preserve state over long periods and thus overcome the problem of long-distance dependence [59]. The memory cell is the core of the LSTM; it is denoted by c_t and is recurrently connected to itself. The three multiplicative units of the LSTM are (i) an input gate i_t, (ii) a forget gate f_t, and (iii) an output gate o_t. Formally, the LSTM can be computed by [60]: i_t = σ(W_i x_t + U_i h_(t−1) + b_i), f_t = σ(W_f x_t + U_f h_(t−1) + b_f), o_t = σ(W_o x_t + U_o h_(t−1) + b_o), c̃_t = tanh(W_c x_t + U_c h_(t−1) + b_c), c_t = f_t ⊙ c_(t−1) + i_t ⊙ c̃_t, h_t = o_t ⊙ tanh(c_t), where h_t denotes the hidden unit at time step t, x_t the input at the current time step, b a bias term, σ the logistic sigmoid function, and ⊙ elementwise multiplication.
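One time step of the standard LSTM gating described above can be written out directly. This sketch uses scalar input and state with toy, untrained weights (real layers are vector-valued and trained by backpropagation); it only makes the gate arithmetic concrete.

```python
import math

def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM time step for scalar input/state. `w` maps each gate
    name to its (W, U, b) weights."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    gate = lambda name, f: f(w[name][0] * x_t + w[name][1] * h_prev + w[name][2])
    i_t = gate("i", sigmoid)        # input gate
    f_t = gate("f", sigmoid)        # forget gate
    o_t = gate("o", sigmoid)        # output gate
    c_tilde = gate("c", math.tanh)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde   # new memory cell
    h_t = o_t * math.tanh(c_t)           # new hidden state
    return h_t, c_t

# Toy weights: every gate uses W = U = 0.5, b = 0.
w = {g: (0.5, 0.5, 0.0) for g in ("i", "f", "o", "c")}
h, c = lstm_step(1.0, 0.0, 0.0, w)
```

Because the gates are sigmoids and the cell passes through a tanh, the resulting hidden state stays bounded, which is part of why LSTMs cope with long sequences.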

Evaluation Criteria
We use four standard metrics, namely accuracy, precision, recall, and F1-score [30]: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1-score = 2 × Precision × Recall / (Precision + Recall).
In the above equations, TP is the number of true positives (positive instances predicted correctly), FP the number of false positives (negative instances predicted as positive), TN the number of true negatives (negative instances predicted correctly), and FN the number of false negatives (positive instances predicted as negative).
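The four metrics follow directly from the confusion counts just defined. A short sketch for the binary case (the paper's three-class results average these per class); the counts below are made-up numbers for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1-score from the
    true/false positive/negative counts of a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=80, fp=10, tn=90, fn=20))
```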

Experimental Results Analysis
This section presents the evaluation metrics (accuracy, precision, recall, and F1-score), followed by a brief discussion of the results.

Setup for the Experiment
We utilized the Keras [61] deep learning platform in the experiments, which uses TensorFlow [23] as a back-end for the deep learning method implementation. We trained our models using Google Colab, a free cloud service with a free GPU (graphics processing unit), which comes in handy when working with big datasets.

Parameters Setting
We used TF-IDF and bag-of-words to convert our tweets into numeric vectors for the machine learning models. With both TF-IDF and bag-of-words, we ignored terms that appeared fewer than 1000 times in the documents. We used the Adam optimization algorithm to train the deep learning models; Adam combines ideas from two stochastic gradient descent extensions, AdaGrad and RMSProp. Furthermore, we used ReLU activation functions, sparse_categorical_crossentropy as the loss function, and the softmax activation function for the ternary classification.

Sentiment Analysis
We used the Twitter dataset, as seen in Table 1, for this experiment. This experiment was carried out to classify user tweets into three categories: neutral, positive, and negative. We have explored people's emotions towards COVID-19 by looking at the tweets. People are mostly curious regarding COVID-19, and such tweets fall into the neutral category. According to the experiment, 61.8% of people's emotions are neutral, 20.6% are positive, and just 17.6% are negative, as seen in Figure 2.

Machine Learning Analysis
After extracting the features, we performed a train-test split on the dataset. This process involves dividing the dataset into two subsets. We chose an 80:20 ratio, i.e., 80% of the data for the training dataset and 20% for the test dataset. We used seven machine learning algorithms to train our models: logistic regression, support vector machine (SVM), decision tree, random forest, Naïve Bayes, k-nearest neighbors (k-NN), and XGBoost. For each algorithm, the accuracy on the test dataset was determined. Table 3 presents the confusion matrix, precision, recall, and F1-score values used to verify performance, and Figure 3a shows the accuracy of the ML models.

Deep Learning Analysis
After extracting the DL features, we adopted the same 80:20 train-test split to divide our dataset and evaluate the deep neural network models. For the CNN and LSTM algorithms, the accuracy on the test dataset was determined. Table 4 presents the confusion matrix, precision, recall, and F1-score values used to verify performance. Figure 4a shows the accuracy of the two DL models, CNN and LSTM, with different techniques. The accuracy of CNN using GloVe embedding with 100 dimensions is 98.5%, and with 200 and 300 dimensions it is 98.9% and 99.1%, respectively. We also find that CNN's accuracy using Word2Vec embedding is 99.9%, and its accuracy using the encoding technique is 99.3%. The accuracy of LSTM using GloVe embedding is nearly identical across the three dimensions (100, 200, and 300), at about 61.7%. Furthermore, LSTM provides 99.9% and 99.2% accuracy using the Word2Vec and encoding techniques, respectively.

Infected COVID-19 Cases vs. Estimated COVID-19 Cases Using Twitter Dataset
We are curious about the relationship between the real COVID-19 infected cases (i.e., Dataset-II) and the estimated COVID-19 cases using the Twitter dataset (i.e., Dataset-I). Figure 5 shows the experimental results comparing COVID-19 cases and COVID-19-related tweets. In this experiment, we use a semi-log scale (i.e., a log-scale Y-axis). The results show that when COVID-19 cases increase, people post more COVID-19-related tweets on social media in all states except California and Georgia. California has the highest number of COVID-19 cases but relatively few tweets about COVID-19 on Twitter, while Georgia shows the opposite pattern. We believe this happens when tweets are posted mainly to raise awareness about COVID-19. In future work, we will focus on understanding the relationship between COVID-19-related tweets and the number of COVID-19 cases in greater depth.

Discussion and Conclusions
This research aims to evaluate user sentiment by creating ML and DL models that can effectively forecast sentiment, and to compare COVID-19 infection cases with COVID-19-associated tweets. From 1 April to 15 April 2020, we gathered Twitter data using the search keywords CORONAVIRUS and COVID-19 from nine states of the USA.
It is concluded from the research that most of the user sentiments are neutral. Both TF-IDF and the traditional bag-of-words feature extraction techniques work well for classifying user sentiments with machine learning models. Random forest worked exceptionally well with both bag-of-words and TF-IDF compared with the other ML models; it generated the most stable and reliable results when combined with the TF-IDF feature extraction technique. Logistic regression and SVM perform better with the traditional bag-of-words, while TF-IDF extracts better features for the other models. In DL, features are trained and extracted automatically, achieving higher precision and efficiency than the ML counterparts. We used GloVe embedding with three dimensions, Word2Vec embedding, and an encoding technique to convert the input data before feeding it into the DL models. CNN and LSTM architectures were examined and paired with these various representations to conduct sentiment analysis. We ran several tests on the tweet dataset to compare the CNN and LSTM models. After analysis, we identified that the DL models constructed with the Word2Vec and encoding feature extraction techniques outperformed those using GloVe embedding. The single best performance was obtained using the LSTM with the Word2Vec feature extraction technique, although across the experiments overall, the CNN surpasses the LSTM model.
In the future, we will focus on multiple social networking platforms such as Facebook, Instagram, and LinkedIn to create an effective model capable of classifying user sentiments more accurately. The constructed model would then be compared to other established models to improve sentiment classification accuracy.