Sentiment Analysis and Topic Modeling on Tweets about Online Education during COVID-19

Abstract: Amid the worldwide COVID-19 pandemic lockdowns, the closure of educational institutes led to an unprecedented rise in online learning. To limit the impact of COVID-19 and obstruct its spread, educational institutions closed their campuses immediately and moved academic activities to e-learning platforms. The effectiveness of e-learning is a critical concern for both students and parents, specifically in terms of its suitability for students and teachers and its technical feasibility with respect to different social scenarios. Such concerns must be reviewed from several aspects before e-learning can be adopted at such a large scale. This study investigates the effectiveness of e-learning by analyzing the sentiments of people about e-learning. Due to the recent rise of social media as an important mode of communication, people's views can be found on platforms such as Twitter, Instagram, and Facebook. This study uses a Twitter dataset containing 17,155 tweets about e-learning. Machine learning and deep learning approaches have shown their suitability, capability, and potential for image processing, object detection, and natural language processing tasks, and text analysis is no exception. Machine learning approaches have been widely used for annotation as well as for text and sentiment analysis. Keeping in view the adequacy and efficacy of machine learning models, this study adopts TextBlob, VADER (Valence Aware Dictionary and sEntiment Reasoner), and SentiWordNet to analyze the polarity and subjectivity scores of the tweets' text. Furthermore, bearing in mind that machine learning models display high classification accuracy, various machine learning models have been used for sentiment classification. Two feature extraction techniques, TF-IDF (Term Frequency-Inverse Document Frequency) and BoW (Bag of Words), have been used to build and evaluate the models.
All the models have been evaluated in terms of accuracy, precision, recall, and F1 score. The results reveal that the random forest and support vector machine classifiers achieve the highest accuracy of 0.95 when used with BoW features. Performance comparison is carried out for the results of TextBlob, VADER, and SentiWordNet, as well as the classification results of machine learning models and deep learning models such as CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), CNN-LSTM, and Bi-LSTM (Bidirectional LSTM). Additionally, topic modeling is performed to find the problems associated with e-learning. It indicates that uncertainty about the campus opening date, children's difficulty in grasping online education, and the lack of efficient networks for online education are the top three problems.


Introduction
The outbreak of COVID-19 transformed the daily activities of human beings from living, traveling, and working to social interactions.Like many other sectors, the education system experiences grave implications involving students, instructors, and institutions around the globe.In the midst of worldwide COVID-19 lockdowns, educational institutes have been closed for formal face-to-face education leading to digital transformation and the unprecedented rise of online learning.Online learning, also called e-learning, is learning through synchronous or asynchronous environments involving the use of internet-enabled mobile devices such as mobile phones, laptops, tablets, etc. [1].The transition from traditional education to online education is not possible overnight and several challenges may hinder this transition.Despite its advantages, the challenges of transition may impair the full potential of online education.Several studies investigate the effectiveness and advantages of online education over conventional teaching methods.The advantages include overall flexibility, extended reach of teaching, accessibility, and non-confinement of time and place as well as the pace of learning.On the other hand, several serious challenges pose serious threats to e-learning over conventional classroom teaching methods.The limitations include the availability of communication technology infrastructure, high cost of equipment and devices, limited technical know-how of teachers and learners, and cultural change needed for successful and effective online education.
The COVID-19 pandemic affected the education system globally with conventional education activities suspended.Billions of students from different educational and training courses were not able to attend the in campus teaching sessions.Most of the educational and teaching institutions around the world switched their teaching-learning process to different e-learning platforms and communication media.Not only does online education provide significant advantages in the teaching and learning process, in the present scenario of the COVID-19 pandemic, it served as a backbone for the education system globally.While switching from face-to-face conventional teaching to e-learning, it must be ensured that the e-learning method should be at least a feasible alternative if not better than the traditional education.As some studies such as [2][3][4] argue that, even with the present technological revolution which demands the adoption of e-learning, the conventional face-to-face in campus sessions cannot be replaced fully.Furthermore, face-to-face teaching is a cornerstone for most educational institutions.According to the famous Bloom's Taxonomy, the framework for the classification of educational outcomes classifies learning outcomes in six domains: knowledge, comprehension, application, analysis, synthesis, and evaluation [5].Most modern educational institutions rely on Bloom's taxonomy for the learning outcome process.Considering the above-mentioned educational outcomes and adoption of e-learning, there is a significant need to evaluate the effectiveness and challenges of e-learning.
This study presents the analyses of the sentiments of students, teachers as well as other stakeholders gathered from Twitter.The tweets from different entities related to education such as parents, students, teachers, and other stakeholders will be covering most of the aspects of online education.Such aspects include advantages, disadvantages, challenges, and difficulties faced in adopting the e-learning approach.Sentiment analysis, a field in text analysis, holds great potential to extract and analyze the sentiments, and opinions of people regarding a specific topic, idea, personality, or institution, thereby revealing its pros and cons with respect to common people.Over the past two decades, machine learning and deep learning approaches have proven their superiority in several fields such as image processing [6,7], object detection and localization [8], and NLP (Natural Language Processing) tasks [9], etc., and text analysis is no exception.Additionally, the use of machine learning models has been made to analyze the text in several different languages including Turkish, Lithuanian, and French, other than English [10][11][12].Bearing in mind that machine learning and deep learning approaches can be leveraged for text annotation, clustering, and classification, this study utilizes machine learning approaches for annotation while a machine and deep learning approach for sentiment classification.To put it in a nutshell, the primary goal of the study is to address the following:

• To analyze the effectiveness of the e-learning system in achieving the desired learning outcomes through sentiment analysis of stakeholders' tweets.
• To analyze the thoughts and experiences of learners and teachers about the transition from face-to-face education to online education.
• To find the gap between traditional education and online education by leveraging NLP approaches for text processing, feature selection for sentiment analysis, and machine learning models for sentiment classification.
• To find the problems associated with online education in terms of technology, social setup, and interaction by employing topic modeling.
• To analyze the performance of various machine learning and deep learning models for sentiment analysis using different annotation approaches such as TextBlob, VADER (Valence Aware Dictionary and sEntiment Reasoner), and SentiWordNet, as well as the efficacy of TF-IDF (Term Frequency-Inverse Document Frequency) and BoW (Bag of Words) feature extraction approaches.
The rest of the paper is organized as follows: Section 2 discusses several research works related to the current study.Section 3 contains the description of data collection, feature extraction, proposed methodology, and machine learning algorithms.Results and discussions are provided in Section 4, followed by the conclusions and future work in Section 5.

Related Work
Sentiment analysis or opinion mining is the process of extracting people's opinions, emotions, attitudes, and feelings about a topic or situation from a large amount of unstructured data.A large body of research has been done in recent years to develop methods for analyzing and describing the process of sentiment analysis in different languages.
The study [13] analyzed the emotions of educational tweets during COVID-19 on the dataset obtained using the NLP toolkit and naive-based classifier.Results show that the number of tweets with negative emotions has exceeded the number of tweets with positive emotions.Another study about online education is [14] where the dataset of 1717 tweets is collected for analysis.After cleaning, 1548 tweets are extracted and categorized as favorable, negative, or neutral with an accuracy of 74.9%.A total of 154 articles about online learning are retrieved from Google and other platforms including online reviews and blogging and sentiment analysis are performed through text mining using the dictionarybased technique of the lexicon-based approach in [15].Polarity and subjectivity of articles are obtained using the TextBlob toolkit.Similarly, comments about online learning from learners, professionals, and guardians are gathered to assess educational system reforms in [16].
The study [17] compares the efficiency of the online education system with traditional classrooms with a focus on students enrolled in higher education.Research suggests that 73 percent of students have appropriate internet access and 71.4 percent of students feel well equipped to operate a computer/laptop for online classes.However, 78.6 percent of respondents believe that traditional classrooms are more effective than online learning.Althagafi et al. [18] investigate sentiment analysis of tweets to grasp better understanding of people's sentiments and opinions about online education in the mid of COVID-19.The study performs experiments using NB (Naïve Bayes), KNN (K-Nearest Neighbour), and RF (Random Forest) classifiers.In comparison to NB and KNN, the RF multi-class classification technique shows the best classification accuracy due to its ability to work well with high-dimensional data such as text categorization.Hogenboom et al. [19] proposed a model that accurately classifies the sentiments into positive, negative, and neutral.Furthermore, three basic approaches are used for sentiment analysis.First, a lexicon-based approach is used in which the sentiment lexicon is to describe the polarity and subjectivity score of textual data into positive, negative, and neutral.Machine learning algorithms are easy to implement and understand but require human efforts for labeling.Secondly, the machine learning-based approach requires labeled data to train the classifier manually for better performance.Three, a hybrid approach is a combination of machine learning and a lexicon-based approach.
The authors in [20] analyze movie reviews using KNN, NB, and LR (Logistic Regression).The dataset is gathered from several sources for analysis, and LR provides the highest accuracy.In both short and lengthy text content, many classifiers are tested.For brief text, NB and LR produce average outcomes of 91 and 74 percent, respectively.Both models do poorly on long texts [21].Machine Learning models produce good results when it comes to categorizing product reviews.For camera reviews, NB has an accuracy of 98.17 percent and SVM (Support Vector Machine) an accuracy of 93.54 percent [22].Furthermore, according to [23], sentiment analysis is the analysis of opinions involving NLP, computer science, theory of computation, and artificial intelligence.Subjectivity and polarity are two components of sentiment analysis.Polarity expresses emotions that can be positive or negative scores while subjectivity identifies the attitudes, feelings, and opinions [24].
Another study [25], performs sentiment analysis on COVID-19 tweets using machine learning and lexicon-based techniques.The data are extracted from Twitter and annotated using TextBlob, while TF-IDF and BoW features are used for machine learning models.Results indicate that the ETC models achieve the best performance with BoW features and Textblob.
Keeping in view the superior performance of deep learning models, several studies adopt deep learning models for sentiment classification.For example, Ref. [26] uses deep learning and NLP tools to determine how people feel about the COVID-19 vaccination in the UK (United Kingdom) and the US (United States).The data are collected from Facebook and Twitter using various COVID-19 and vaccine-related keywords.Afterward, the data are preprocessed and two lexicon-based techniques including VADER and Text Blob are applied for sentiments.The study shows that average positive, negative, and neutral emotions in the UK are better than in the US.Similarly, the study [27] analyzes the articles about the emergence of infectious diseases such as COVID-19 and MERS (Middle East Respiratory Syndrome) pandemics, etc., and analyze the main findings.The study discusses the classification models, lexicon-based approaches, and machine learning approachesboth individual and hybrid-as well as the application language.The authors perform sentiment analysis on tweets related to COVID-19 in [28] using deep learning models.A multi-layer LSTM (Long Short Term Memory) model is proposed for the classification of sentiment polarity and emotions.The study [29] uses a deep learning approach for COVID-19 tweets' sentiment analysis.It leverages LSTM and BERT (Bidirectional Encoder Representations from Transformers) models for sentiment classification.BERT achieves an 89% accuracy while LSTM achieves only 65% accuracy for sentiment classification.
Research findings indicate that the knowledge process is not anticipated; rather, it is viewed as a last-minute learning technique [30].To understand the need of the hour, many schools have started online courses.Almost everywhere there are two major issues; e-learning has little effect and learning through digital platforms is not as effective as traditional teaching methods are in achieving learning goals and focusing on educational priorities [31].Table 1 provides a comprehensive summary of the discussed related works.

Table 1. Summary of the discussed related works.

Ref. | Model / Approach | Aim | Dataset | Limitations
[13] | Naive-based classifier (model) | Sentiment analysis of tweets on education during COVID-19 | The area of study has generated nearly 90,000 tweets. | Study did not perform topic modeling and accuracy is not significant.
[14] | Web analytics approach | Find sentiment on educational posts | A total of 1717 tweets collected from Twitter. | Study did not use a machine learning approach.
[…] | | | Google Forms is used to collect data. | Study did not perform topic modeling and accuracy is not significant.
[18] | Naïve Bayes, KNN, and random forest | Sentiment analysis of online education during coronavirus | 10,445 tweets were gathered using the Twitter API. | Study did not perform topic modeling to discuss the reason behind negative sentiments.
[20] | KNN, Naïve Bayes, and logistic regression | Sentiment analysis of movie reviews | The dataset is compiled from a variety of sources. | Study is not about online education sentiment analysis.
[…] | | | | Study is about general COVID-19 tweets sentiment analysis, not about online education.
[22] | Machine learning (SVM and Naive Bayes) | Sentiment analysis on product reviews | Over 13,000 tweets obtained from six product reviews. | Study is not about online education sentiment analysis.
[29] | Deep learning (BERT and LSTM) | Sentiment analysis on COVID-19 | A total of 3090 tweets related to COVID-19. | Accuracy is not significant and it is about general COVID-19 tweets.

Materials and Methods
This section presents the description of the dataset and its visualization, the sentiment analysis process, and the proposed methodology for performing sentiment analysis on the selected dataset.

Dataset Description
The dataset for this study has been collected from Twitter and contains 17,155 records. The primary dataset, called online-education-during-COVID-19, is unlabeled. For data collection, several relevant keywords are used to obtain the desired tweets, such as "coronaeducation", "covidneducation", "distancelearning", "Onlineclasses", and "onlinelearning". Table 2 shows a sample subset from the dataset with the corresponding username and location. After data gathering, preprocessing is carried out to clean the dataset and remove superfluous information, and the TextBlob Python package is then used to obtain the polarity score of each tweet. The sentiment score is divided into three categories: positive, neutral, and negative. The criterion used for defining the sentiment of a tweet based on its polarity score is shown in Table 3 with sample tweets and the assigned sentiment.
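The labeling rule can be sketched in a few lines. This is a minimal illustration, assuming the thresholds this paper states for TextBlob (negative below zero, neutral at zero, positive above zero). The function name `label_from_polarity` and the example scores are hypothetical; in practice each score would come from TextBlob's `sentiment.polarity` attribute.

```python
from collections import Counter

def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if polarity < 0:
        return "negative"
    if polarity > 0:
        return "positive"
    return "neutral"

# Hypothetical polarity scores, standing in for values such as
# TextBlob(tweet_text).sentiment.polarity (not computed here).
example_scores = [0.5, 0.0, -0.25, 0.8, 0.0]

labels = [label_from_polarity(score) for score in example_scores]
class_counts = Counter(labels)  # class distribution, useful for spotting imbalance

print(labels)
print(class_counts)
```

Counting the labels right after annotation gives an early view of how unevenly the three classes are represented.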

Methodology
This subsection contains an explanation of various phases of the methodology and the approaches used in each phase.
The sequential workflow of the methodology, along with the methods, algorithms, and state of data in each phase, is illustrated in Figure 1. The workflow starts with the extraction of tweets from Twitter into the "online-education-during-COVID-19" dataset. The next phase is cleaning the dataset using several preprocessing steps, followed by a lexicon-based approach to annotate the data with the corresponding sentiment labels. The labeled dataset is then divided into training and testing sets, which are used to train and test the machine learning models, respectively. In this regard, BoW and TF-IDF features are used. A brief description of each of these phases is given in the following sections.

Preprocessing
Data analysis applications require data preprocessing to remove superfluous information, which improves the learning process of classification models and increases their accuracy. Superfluous information refers to any data that contribute little or nothing to predicting the target class but increase the size of the feature vector and thus introduce unnecessary computational complexity. Consequently, the performance of classification models is degraded if no or improper preprocessing is carried out. Thus, data cleaning or preprocessing is performed before encoding [32]. Python's NLP toolkit has been used for preprocessing the tweet data in this study. Initially, the text is converted into lowercase, followed by the removal of links, HTML (HyperText Markup Language) tags, and punctuation. Then, stemming and lemmatization are performed, and stopwords are removed at the end.

• Convert to lowercase: Models treat upper and lower case forms of a word as different features, which affects the training process and classification performance. For example, 'go' and 'Go' are taken as different features, so converting the text to lowercase maps both terms to 'go' and reduces the complexity of the feature set.
• URL links, tags, punctuation, and number removal: URL links, tags, punctuation, and numbers do not contribute to improving the classification performance because they provide no additional meaning for learning models and increase the complexity of the feature space, so removing them helps to reduce the feature space.
• Stemming and lemmatization: The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form [33]. For example, 'walks', 'walking', and 'walked' are converted to the root word 'walk' in this process.
• Stopwords removal: Stop words are frequently used words that give no useful information for analysis. Stop words such as 'the', 'is', 'a', and 'an' are removed [34].
Table 4 shows samples of raw text from tweets and cleaned text after applying the preprocessing steps.
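The preprocessing steps listed above can be sketched as a single function. This is an illustrative sketch using only the Python standard library, not the toolkit-based pipeline used in the study: the stopword list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re
import string

# Tiny illustrative stopword list (assumption); a real pipeline would use a full list.
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of", "in", "for", "are"}

def naive_stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                   # convert to lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URL links
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [naive_stem(token) for token in text.split()]
    # Stopwords are removed last, following the order described in the text.
    return " ".join(token for token in tokens if token not in STOPWORDS)

raw = "Walks, walking and WALKED to the online course 2020! <b>Join</b> https://t.co/abc"
print(preprocess(raw))
```

URLs and tags are removed before punctuation so that their delimiters are still intact when the patterns are applied.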

TextBlob
TextBlob is a lexicon-based tool that can be used for different NLP tasks including part-of-speech tagging, sentiment analysis, noun phrase extraction, paraphrasing, and sorting [35]. We used it in this study for sentiment annotation. The TextBlob sentiment function provides a polarity score between −1 and 1. Tweets with a polarity score less than zero are labeled negative, those with a score equal to zero are labeled neutral, and those with a score greater than zero are labeled positive [36]. Table 3 shows the results of TextBlob on sample tweets with the polarity score and corresponding sentiment.

SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is used to solve the imbalanced dataset problem by balancing the number of samples for all the classes of a dataset [37]. Balancing is achieved by generating synthetic samples of minority classes so that the number of minority class samples becomes almost equal to that of the majority class. The ratio of sentiments after applying TextBlob is not equal, so models can over-fit on the imbalanced dataset. To avoid this over-fitting problem, SMOTE is used to balance the dataset by generating artificial data for the minority class. The ratio of sentiments before and after applying SMOTE is shown in Figure 2.
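The core idea behind SMOTE, generating a synthetic minority sample on the line segment between an existing minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration, not the implementation used in the study; the function name, the value of `k`, and the toy points are assumptions. A library implementation such as the `SMOTE` class in imbalanced-learn would normally be used instead.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy two-dimensional minority class (hypothetical feature vectors).
minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote_like_oversample(minority_samples, n_new=4)
print(new_samples)
```

Because every synthetic point is an interpolation, it always lies between two real minority samples rather than duplicating one of them.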

Data Splitting
This study uses a 75:25 split ratio, where 75% of the data are used for training the models and 25% are used for testing. Before the split, the data are shuffled to reduce variance and support the generalizability of the models. Shuffling also helps to make the training data more representative of the overall distribution of the data and reduces the risk of overfitting. The number of tweets in the training and testing sets is shown in Table 5, with and without the SMOTE technique.
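A shuffled 75:25 split can be sketched with the standard library alone. This is a hedged illustration: the function name `shuffle_split`, the seed, and the placeholder data are assumptions, and the study does not state which utility it used for splitting.

```python
import random

def shuffle_split(samples, labels, test_ratio=0.25, seed=42):
    """Shuffle the data, then hold out `test_ratio` of it for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    n_test = round(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

# Hypothetical placeholder data standing in for cleaned tweets and their labels.
tweets = [f"tweet {i}" for i in range(100)]
sentiments = ["positive" if i % 2 else "negative" for i in range(100)]

train_set, test_set = shuffle_split(tweets, sentiments)
print(len(train_set), len(test_set))
```

Fixing the seed keeps the split reproducible across runs, which matters when several models are compared on the same test set.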

Feature Engineering
To extract features from tweets, two widely used feature extraction methods are applied: BoW and TF-IDF.
Bag of Words: BoW is a simple technique to extract features from text and is commonly used in natural language processing and information retrieval [38]. For text classification, BoW counts the occurrences of each word in a text and forms a feature vector containing the number of occurrences of each unique word. BoW thus builds a vocabulary of all unique words and trains the learning models on their frequencies. BoW feature vectors for the following sample statements are shown in Table 6.
S1: england high school face mask lift
S2: wear mask right way

Table 6. BoW feature vectors for the sample statements. Columns: S (statement), England, High, School, Face, Mask, Lift, Wear, Right, Way, Length.
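The BoW counts for the two sample statements can be reproduced with a short sketch. The helper name `bag_of_words` is hypothetical, and the vocabulary is ordered by first appearance, which is an assumption chosen to match the column order of Table 6.

```python
def bag_of_words(documents):
    """Build a vocabulary in order of first appearance and count each term per document."""
    vocabulary = []
    for doc in documents:
        for token in doc.split():
            if token not in vocabulary:
                vocabulary.append(token)
    vectors = [[doc.split().count(term) for term in vocabulary] for doc in documents]
    return vocabulary, vectors

docs = [
    "england high school face mask lift",  # S1
    "wear mask right way",                 # S2
]
vocab, vectors = bag_of_words(docs)

print(vocab)
for name, vector in zip(("S1", "S2"), vectors):
    print(name, vector)
```

Only 'mask' occurs in both statements, so it is the only column that is non-zero in both vectors.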
Term Frequency-Inverse Document Frequency: TF-IDF is a feature extraction technique used to extract weighted features from text data. It provides the weight of each term in the corpus to improve the performance of learning models [39]. TF-IDF is the product of TF and IDF. TF can be calculated as:

$$\mathrm{TF}(t, d) = \frac{n_t}{N_{(T,d)}}$$

where $n_t$ represents the number of occurrences of term $t$ in a document $d$, while $N_{(T,d)}$ indicates the total number of terms $T$ in that document. The IDF of a term indicates how important it is in the whole corpus [40], and it can be calculated as:

$$\mathrm{IDF}(t) = \log\left(\frac{D}{n_d}\right)$$

where $D$ is the total number of documents in the corpus, whereas $n_d$ is the number of documents in which the term $t$ appears. Using TF and IDF, TF-IDF can be calculated as:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

For a better understanding of TF-IDF, Table 7 shows the results of TF-IDF on two pre-processed data samples.
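The TF-IDF definition above can be sketched directly in code. This is a minimal illustration that assumes the unsmoothed form, TF as the term count divided by the document length and IDF as the natural logarithm of D divided by the document frequency; the base of the logarithm is an assumption. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ from this sketch.

```python
import math

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)  # D: total number of documents
    vocabulary = sorted({token for doc in tokenized for token in doc})
    # n_d: number of documents in which each term appears
    doc_freq = {term: sum(1 for doc in tokenized if term in doc) for term in vocabulary}
    weights = []
    for doc in tokenized:
        row = {}
        for term in vocabulary:
            tf = doc.count(term) / len(doc)            # n_t / N_(T,d)
            idf = math.log(n_docs / doc_freq[term])    # log(D / n_d)
            row[term] = tf * idf
        weights.append(row)
    return weights

docs = [
    "england high school face mask lift",  # S1
    "wear mask right way",                 # S2
]
weights = tf_idf(docs)
print(weights[0]["england"], weights[0]["mask"], weights[1]["wear"])
```

Under this form, a term that appears in every document, such as 'mask' here, receives a weight of zero because its IDF is log(1).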

Topic Modeling
Topic modeling is a popular and important technique in machine learning and natural language processing. It is an approach to extract hidden topics from large collections of documents. With the increase in the popularity of social media platforms, many researchers are interested in extracting ideas from these platforms. Because tweets contain unorganized short text, topic modeling has to be performed to discover the topics they discuss. In this paper, the LSA (Latent Semantic Analysis) method has been used. LSA captures the relationship between documents and the expressions they contain. Several research works suggest that LSA performs well in short sentence classification [41,42]. Compared with other methods for automatically indexing and retrieving information, LSA represents similar meanings in a low-dimensional space and consumes less computational power.

Several supervised machine learning models have been employed, each with its own set of parameters. The models are selected with respect to their wide use for sentiment analysis. A brief description of the models used is provided in Table 8, while the parameter settings of the models are given in Table 9.

Evaluation Measures
The performance of the supervised machine learning models has been assessed using four evaluation measures: sensitivity (recall), precision, F1 score, and accuracy. The maximum and minimum values of each score are 1 and 0, respectively. To compute these performance evaluation metrics, TP (True Positive), TN (True Negative), FP (False Positive), and FN (False Negative) are used. A prediction is a TP when the model predicts the positive class correctly, while a TN is a result in which the model correctly predicts the negative class. On the other hand, an FP is a prediction in which the model incorrectly predicts a negative sample as positive, and an FN is a sample of the positive class predicted as negative.
Accuracy shows the ratio of correct predictions to total predictions. Sensitivity (recall) refers to the capability of a model to correctly predict a sample of the positive class, while precision is used to evaluate the exactness of a classifier. Precision and recall alone may not be appropriate to evaluate the model, so an F1 score is used that incorporates both precision and recall:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
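The four measures described in this subsection can be computed from the TP, TN, FP, and FN counts as sketched below. The function name and the example counts are hypothetical, and the zero-division guards are an added assumption. For the three-class problem in this study, such counts would be obtained per class in a one-vs-rest fashion and then averaged; the averaging scheme is not specified in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a single class.
metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```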

Models Description
SVM
SVM is one of the most widely used models for sentiment analysis [43]. It performs classification by locating the hyperplane that best separates the classes. SVM is a linear model, used here with a sigmoid kernel and the parameter C = 3.0 (see Table 9).

LR
LR is a supervised machine learning algorithm used to determine the probabilities of the output variable [44]. It performs well when the output or dependent variable is binary, but it can also be used for multi-class classification. It uses the logistic function to categorize the data.
DT
DT organizes data in the form of a tree, which may alternatively be expressed as a collection of discrete rules [45]. Decision trees can handle big data well. The DT algorithm splits the records according to an attribute selection measure and selects the best set of attributes.

RF
RF is a supervised learning algorithm that can be used for both classification and regression. It is also flexible and easy to use [46]. A forest is made of trees, and the more trees it contains, the stronger its predictions tend to be. RF builds random trees from randomly selected data samples, obtains a prediction from each tree, and votes for the best solution.

SGD
The SGD classifier is a linear classifier that fits regularized linear models using stochastic gradient descent as the optimization method [47]. The SGD classifier works well with large-scale datasets, and it is efficient and easy to implement. SGD is implemented using the scikit-learn library.

KNN
KNN is a supervised machine learning model used for the classification of data [48]. It is a simple model that is easy to implement and interpret. KNN is also known as a lazy learner because it builds no explicit model during training and instead makes each prediction from the nearest neighbors of a sample, found by computing distances. It performs well when the size of the data is not too large.

GNB
The GNB algorithm is a variant of the Naive Bayes algorithm that is mostly used with continuous features. It assumes that all of the features follow a Gaussian (normal) distribution. Naive Bayes algorithms work on the basis of the Bayes theorem. If the data contain strongly correlated features, the performance of Naive Bayes might suffer [49].
AdaBoost
AdaBoost, short for adaptive boosting, is a supervised machine learning model used for the classification of data. It uses a boosting mechanism to improve the classification accuracy. AdaBoost uses DT as the base learner ("weak learner") by default. The output of each weak learner is associated with a weight, and the weighted outputs are combined to produce the final prediction [50].

ETC
ETC is a tree-based ensemble model used for the classification of data. It trains a large number of weak learners (randomized decision trees) on distinct samples of the dataset and uses the majority voting criterion to enhance prediction accuracy [25]. It is an ensemble learning model that works similarly to RF. The only difference between ETC and RF is how the trees in the forest are constructed.

Results and Discussion
Several experiments are performed involving the use of BoW and TF-IDF, as well as imbalanced data and SMOTE balanced data.In addition, the combinations of models and feature extraction techniques have been permuted.

Results Using BoW and without the SMOTE Technique
Initially, experiments are performed on the original dataset with class imbalance using BoW features. The results of all models in terms of accuracy, precision, recall, and F1 score are shown in Table 10. SVM and SGD outperform the other models with an accuracy of 0.94 each, followed by LR with an accuracy of 0.93. Results indicate that linear models perform better on the dataset when BoW features are used. The primary reason is that the BoW technique produces a large feature space, and linear models perform well when a large feature set is available for training. In contrast, KNN, GNB, and AdaBoost show poor performance, as they require a small feature set for a good fit and need categorical data for significant results. The tree-based models RF, ETC, and DT show average accuracy scores.

Results Using BoW with the SMOTE Technique
The second set of experiments involves using BoW on the SMOTE balanced dataset. Experimental results are provided in Table 11, which indicates significantly better performance as compared to the results on the imbalanced dataset. On the balanced dataset, the performance of tree-based models as well as linear models improved significantly because of the increase in the feature set. The tree-based models RF and DT achieved the highest accuracy score of 0.95, and SVM shares this highest accuracy score with them. SGD and LR are just behind with accuracy scores of 0.94 and 0.93, respectively. Using the SMOTE technique, the accuracy of ETC improved from 0.80 to 0.89. Similarly, the accuracy of RF, DT, KNN, and AdaBoost improved from 0.86, 0.83, 0.52, and 0.69 to 0.95, 0.95, 0.62, and 0.78, respectively. This significant improvement in the models' performance is due to class balance and an increase in the feature set after balancing. The use of SMOTE for data balancing also reduces the probability of a model over-fitting on the majority class and helps to improve the performance.

Results Using TF-IDF Features on the Original Dataset
For this set of experiments, machine learning models are trained using TF-IDF features from the original dataset. TF-IDF gives weighted features, which can be useful for better training of the models. The results of the machine learning models with TF-IDF features on the original data are shown in Table 12. Results show that SVM and SGD outperform all other models with an accuracy score of 0.94 each, followed by LR with an accuracy score of 0.93. Linear models again perform well on the imbalanced dataset with TF-IDF features, similar to BoW features. KNN, GNB, and AdaBoost remain the worst performers on imbalanced data using TF-IDF features, and only a 1% improvement in the AdaBoost results is observed with TF-IDF on the imbalanced dataset.

Results Using TF-IDF Features and the SMOTE Technique
Experiments are also performed using TF-IDF features on the dataset balanced with SMOTE, which generates synthetic samples for the minority class. Table 13 shows the results in terms of accuracy, precision, recall, and F1 score for all the machine learning classifiers used in this study. Results indicate that the models' performance improves significantly compared with their performance on imbalanced data when TF-IDF features are used for training. Analogous to the results using BoW with SMOTE, SVM shows superior performance with an accuracy of 0.95 and high precision, recall, and F1 scores. The difference between accuracy and the other metrics is small, which indicates that the model has a good fit. The accuracy of RF and SGD is marginally lower than that of SVM at 0.94 each, followed by DT with an accuracy of 0.93. On average, the performance of all models improves substantially when TF-IDF features from the SMOTE-balanced dataset are used instead of the imbalanced dataset. In addition to giving each class an approximately equal number of samples, balancing the dataset enlarges the feature set because artificial data are generated. This data generation creates more features for the learning models, and a linear learner such as SVM is the best performer on large feature sets. Consequently, the models achieve good accuracy when SMOTE is used to generate synthetic samples of the minority class.
A comparative analysis of the BoW and TF-IDF results indicates that there is no significant difference in the performance of the machine learning models when they are trained using BoW or TF-IDF features on the original dataset, which contains a different number of samples for the three classes. The similarity in the models' performance can be seen in Figure 3, which shows that the difference for RF and AdaBoost is marginal while the results for LR, DT, KNN, SVM, GNB, ETC, and SGD are the same. Similarly, Figure 4 shows the comparative accuracy of the models using BoW and TF-IDF features from the SMOTE-balanced data. Although performance improves substantially, the difference between BoW and TF-IDF features is small except for GNB, where the accuracy with BoW and TF-IDF is 0.78 and 0.68, respectively. Table 14 summarizes the average accuracy for the positive, negative, and neutral classes for the machine learning models with BoW and TF-IDF on the original and balanced datasets. Results indicate that the use of SMOTE to balance the dataset leads to higher classification accuracy with both BoW and TF-IDF.
In this study, different machine learning models are used with two feature extraction techniques, BoW and TF-IDF, each applied with and without SMOTE. Analysis of the experimental results shows that SVM achieves the highest accuracy among all the models across the different features. The accuracy of SVM is 0.94 with BoW and TF-IDF features on the original dataset and as high as 0.95 with BoW and TF-IDF features when applied along with SMOTE. Table 15 shows the number of CP (correct predictions) and WP (wrong predictions) for the machine learning models with both feature types, with and without SMOTE. The highest number of CP, 5610 with only 183 wrong predictions, is achieved by SVM using TF-IDF and SMOTE. Using BoW with SMOTE, the highest number of CP is 5440, achieved by the SGD classifier. Although these classifiers also perform well on the original dataset, the number of correct predictions is higher when they are used on SMOTE-balanced data.
To show the adequacy and efficacy of the models, this study performs 10-fold cross-validation with both BoW and TF-IDF features. The 10-fold cross-validation is applied after annotating the dataset using TextBlob. The results are shown in Table 16. The models perform well in 10-fold cross-validation: SVC achieves the highest accuracy of 0.94 with a standard deviation of ±0.03 using SMOTE with both BoW and TF-IDF features. SVC and RF also perform well without SMOTE, with an accuracy of 0.93 and a standard deviation of 0.04 with both BoW and TF-IDF features.
To analyze the performance of TextBlob, VADER and SentiWordNet are also adopted in this study. VADER is used to find the polarity of social media posts and categorize them by sentiment as positive, negative, or neutral [25]. It is a rule-based technique that shows the intensity of positive or negative emotion in text. Another lexicon-based technique, SentiWordNet, is used for comparison with TextBlob and VADER. SentiWordNet finds the polarity score of the text to categorize the data into positive, negative, and neutral sentiment [51]. The ratio of positive, negative, and neutral sentiments with VADER and SentiWordNet is shown in Table 17. Table 18 shows the results using VADER and SentiWordNet, which indicate that VADER performs slightly better than SentiWordNet. VADER is particularly suited to social media posts and shows better performance. ETC and SGD achieve the highest accuracy of 0.90 using TF-IDF features with VADER and SMOTE, while RF achieves 0.90 accuracy with VADER and BoW features. In the case of SentiWordNet, the highest accuracy is 0.88, achieved by RF using TF-IDF features with SMOTE. The comparison between TextBlob, VADER, and SentiWordNet using BoW and TF-IDF features with and without SMOTE is shown in Figure 5.
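The lexicon-based annotation discussed above reduces, in each case, to mapping a polarity score to one of three labels. A minimal sketch of that mapping is given below. The cut-offs of ±0.05 follow the convention commonly recommended for VADER's compound score and are an assumption here, as are the function name `label_from_score` and the example scores; the thresholds actually applied to each lexicon in this study are not stated in this section.

```python
from collections import Counter


def label_from_score(score, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a polarity score in [-1, 1] to a sentiment label."""
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, as a lexicon tool might return for six tweets.
scores = [0.62, -0.40, 0.0, 0.03, -0.05, 0.05]
labels = [label_from_score(s) for s in scores]

# Class ratio of the annotated tweets, analogous to a sentiment-ratio table.
ratio = Counter(labels)
print(labels)
print(ratio)
```

Because the labels produced this way become the ground truth for the classifiers, a different choice of thresholds changes both the class ratio and the downstream accuracy.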

Experimental Results Using Deep Learning Models
This section contains the results of the deep learning models with each lexicon technique. Table 19 shows the results of all models, which reveal that the deep learning models perform better with TextBlob sentiments than with VADER and SentiWordNet. For the experiments, LSTM, CNN (Convolutional Neural Network), CNN-LSTM [52], and Bi-LSTM (Bi-directional LSTM) are utilized. The implementation details of these deep learning models are given in Figure 6. All the models are compiled using the 'categorical_crossentropy' loss function because the dataset is multi-class, and the 'Adam' optimizer is used for optimization. The models are fit for 200 epochs with a batch size of 32. Results suggest that, on average, Bi-LSTM outperforms all models with TextBlob sentiments, achieving the highest accuracy of 0.94. Bi-LSTM performs well with each lexicon technique. The performance of LSTM is marginally lower, with accuracy scores of 0.94, 0.91, and 0.85 with TextBlob, VADER, and SentiWordNet, respectively.
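For readers unfamiliar with the loss named above, the following sketch shows what categorical cross-entropy computes for a single three-class example. It is a plain-Python illustration of the formula, not the deep learning framework's implementation; the helper names, the clipping constant `eps`, and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores to class probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true_one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true_one_hot, probs))


# One-hot target for a hypothetical tweet whose true class is the first of three.
target = [1, 0, 0]

probs_uniform = softmax([0.0, 0.0, 0.0])    # model has no preference
probs_confident = softmax([5.0, 0.0, 0.0])  # model strongly favours the true class

loss_uniform = categorical_crossentropy(target, probs_uniform)
loss_confident = categorical_crossentropy(target, probs_confident)
print(loss_uniform, loss_confident)
```

An uninformative prediction over three classes gives a loss of ln 3, and the loss approaches zero as the probability of the true class approaches one.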

Topic Modeling Analysis
Topic modeling is a text-mining tool frequently used for discovering the semantic structures of a given text. It is a statistical modeling technique with potential applications in NLP domains such as sentiment analysis. This study applies topic modeling to reveal the potential benefits of online education, as well as to uncover the problems associated with it. The required preprocessing and data cleaning procedures are carried out on the dataset before applying topic modeling. The data from the tweets are transformed into an appropriate structural format for topic modeling. TF-IDF features are used to help identify the most significant terms in the corpus, and a total of 4000 features are utilized. Topic modeling is performed on the tweets from the positive and negative classes to identify the pros and cons of online education. Table 20 shows the LSA (Latent Semantic Analysis) results for positive tweets. LSA is the most commonly used topic modeling approach; it relies on the distributional hypothesis, which holds that the semantics of words can be obtained by analyzing the contexts in which they appear. That is, if words appear in similar contexts, their semantics are similar [53]. LSA can be used with different features; this study uses TF-IDF.
LSA results show that students, while learning through online education during the COVID-19 pandemic, protect themselves from the disease. The most frequent terms in the LSA topics are online education, online courses, and COVID-19. The positive opinions about online education are summarized in Table 20. Similarly, topic modeling with LSA for negative tweets is shown in Table 21. The issues that students have with online education are highlighted in this table by topic keywords. The major issues discussed are the lack of technical skills and network challenges in rural regions. Similarly, children's inability to grasp online education is a serious threat to its efficacy.
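The LSA step described above amounts to a truncated singular value decomposition of the TF-IDF document-term matrix, with each topic read off from the terms that load most heavily on a singular vector. The sketch below illustrates this on a toy matrix using power iteration with deflation on the term-term Gram matrix. It is an assumption-laden illustration, not the procedure used in the study: the matrix values, vocabulary, and function names are hypothetical, and in practice a library routine for truncated SVD would normally be used on the full 4000-feature matrix.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_eigenpair(sym, iters=200):
    """Power iteration for the dominant eigenpair of a symmetric PSD matrix."""
    v = [1.0] * len(sym)
    for _ in range(iters):
        w = matvec(sym, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, matvec(sym, v)))
    return eigval, v


def lsa_topics(doc_term, terms, n_topics=2, top_n=2):
    """Return the top_n highest-loading terms for each of n_topics LSA topics."""
    n = len(terms)
    # Eigenvectors of A^T A are the right singular vectors of the doc-term matrix A.
    gram = [[sum(row[i] * row[j] for row in doc_term) for j in range(n)]
            for i in range(n)]
    topics = []
    for _ in range(n_topics):
        eigval, v = top_eigenpair(gram)
        ranked = sorted(range(n), key=lambda i: abs(v[i]), reverse=True)
        topics.append([terms[i] for i in ranked[:top_n]])
        # Deflate: remove the component just found before extracting the next one.
        gram = [[gram[i][j] - eigval * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return topics


# Hypothetical TF-IDF-like weights: rows are tweets, columns follow `terms`.
terms = ["online", "education", "network", "rural", "exam"]
doc_term = [
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.8, 0.9, 0.0, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.7, 0.0],
]
topics = lsa_topics(doc_term, terms)
print(topics)
```

Terms that co-occur across tweets end up loading on the same component, which is how topic keywords such as those in Tables 20 and 21 can be grouped.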

Conclusions
The COVID-19 pandemic led to the closure of traditional face-to-face teaching institutions and the rise of the online education system. Although online education has served as the backbone of education during the pandemic, stakeholders such as parents, teachers, and students have serious concerns about its effectiveness and suitability. For this reason, such concerns must be analyzed to find the problems faced by students and to suggest modifications that realize the full potential of online education. This study investigates the effectiveness of online education by analyzing the sentiments of its stakeholders using social media data. The dataset used in this study was obtained through the Twitter API using keywords related to the topic. Various text preprocessing methods, such as stemming, normalization, tokenization, and stop word removal, are used to clean the tweets. Afterwards, lexicon-based approaches are used to determine the sentiments and label the tweets. Two feature engineering techniques, BoW and TF-IDF, are used to classify positive, negative, and neutral tweets using several machine learning algorithms. Results indicate that data balancing with SMOTE enhances the classification accuracy. DT, SVM, and RF perform very well and achieve an accuracy of 0.95 using BoW and SMOTE, while SVM achieves 0.95 accuracy using TF-IDF with SMOTE. VADER and SentiWordNet are also used for performance comparison with TextBlob, and results indicate that TextBlob shows superior results for data annotation. Deep learning models are used in comparison with machine learning models, and results suggest the superior performance of machine learning models, primarily due to the small size of the dataset. Topic modeling through LSA suggests that uncertainty about the opening date of institutions is among the most concerning topics for students. Additionally, the lack of technical skills and network challenges in rural areas are major concerns for the students.

Figure 1. Architecture of the proposed methodology.

Figure 2. Ratio of sentiment with and without SMOTE.

Figure 3. Models' performance comparison using BoW and TF-IDF on the original imbalanced dataset.

Figure 4. Models' performance comparison on BoW and TF-IDF features when we used the SMOTE technique.

Figure 5. Models' performance comparison using TextBlob, VADER, and SentiWordNet techniques, (a) with BoW features on the original dataset; (b) with BoW features on the SMOTE balanced dataset; (c) with TF-IDF features on the original imbalanced dataset; and (d) with TF-IDF features on the SMOTE balanced dataset.

Figure 6. Architecture of deep learning models used for sentiment classification.

Table 1. A summary of related work.

Table 2. Sample tweets from the collected dataset.
educationblog | USA | #EDUCATION: #Children read longer #books of greater difficulty during #lockdown periods last year, and reported tha… https://t.co/S9UbQtKWZL (accessed on 1 September 2021)
Student | Gujarat, India | We havenot been given online education,so we r in severe depression
brenda11831 | USA | 8.4 million fewer jobs than in February 2020, just before #coronavirus shut down large swaths of the U.S. economy… https://t.co/DevQfUWDMW (accessed on 1 September 2021)

Table 4. Sample tweets from the dataset before and after preprocessing.

Table 5. Train and test count after data splitting.

Table 6. Two sample tweets from the dataset are taken for Bag of Words features on preprocessed data.

Table 7. TF-IDF features on preprocessed data taken from the dataset.

Table 8. Brief description of machine learning models used in this study.

Table 9. The hyper-parameter settings of machine learning models.

Table 10. Results using BoW features on the original dataset.

Table 11. Results using BoW features and the SMOTE technique.

Table 12. Results using TF-IDF features on the original dataset.

Table 13. Results using TF-IDF features and the SMOTE technique.

Table 14. Summary of models' performance with BoW and TF-IDF.

Table 15. Confusion Matrix of a model using TF-IDF and BoW without SMOTE and using SMOTE.

Table 17. VADER and SWN train and test count after data splitting.

Table 18. Model results using the VADER and SentiWordNet technique.

Table 19. Performance of deep learning models with each lexicon technique.

Table 20. Topic modeling with LSA for positive tweets.

Table 21. Topic modeling with LSA for negative tweets.