Sentiment Analysis and Topic Modeling Regarding Online Classes on the Reddit Platform: Educators versus Learners

Abstract: The world is witnessing an unpredictable COVID-19 pandemic that has impacted all levels of online education and is shaping future trends. However, this shift was so sudden and drastic that open questions remain about the public’s authentic opinion of online classes, even though three years have passed. Many experts and policymakers have conducted qualitative and quantitative research to explore effective pedagogies, the satisfaction of different stakeholders, and the factors that influence learners’ performance. However, few studies have examined the personal opinions and concerns about online classes hidden behind people’s anonymous posts on social media. This research investigates the sentiments and concerns of learners and educators regarding online classes, and how they vary with time, on Reddit, which is a dominant social network among them. Data were collected via the official API from identified relevant subreddits and keyword search results across Reddit. Sentiment analysis was applied to reveal their emotions and how these changed. Topic modeling was conducted to discover the concerns hidden in the posts. The results revealed concerns about online classes, such as severe cheating behaviors, and cast doubt on previous strategies for addressing the disadvantages of online classes. In addition, the results verified the differences in habits and motivations of social media usage between educators and learners.


Introduction
Due to the COVID-19 pandemic, most countries implemented lockdown and social distancing measures, resulting in the closure of schools, training institutions, and further educational activities [1][2][3]. The paradigm shift from traditional face-to-face to online classes poses various difficulties for educators and learners, leading to various pedagogical innovations supporting continuing online classes [4]. Yet, the reliability and efficiency of pedagogy for online classes heavily depend on educators' and learners' utilization of and exposure to information and communications technology (ICT) [5,6]. Microsoft Teams, Google Classroom, Zoom, Canvas, and Blackboard are widely used communication and collaboration platforms that facilitate video meetings, file sharing, storage, quizzes, and rubric-based assessments [7,8]. Additionally, the flipped classroom is widely adopted so that learners review learning resources such as articles, pre-recorded videos, and YouTube tutorials before classes and deepen their understanding through discussion with peers and educators during online classes [9,10].
However, there are concerns about the effectiveness and adaptability of online classes [6]. For example, educators struggle with conducting online classes due to a lack of proper training and infrastructure [11]. While open-minded learners quickly adapt to a new learning environment, those with fixed mindsets encounter difficulties adjusting and adapting. Without a widely adopted pedagogy for online classes, the readiness of various stakeholders needs to be flexible and supported accordingly [5]. Research investigating factors impacting online classes indicates that motivation is the most significant factor, contributing to an almost 50% impact on learner performance in online classes. Thus, the biggest challenge for educators is motivating learners to focus in classes [12]. Many learners prefer conventional to online classes because they face challenges in getting a stable Internet connection, understanding the teaching content, interacting with classmates and educators [13], and building up their social networks for social support [14]. Many even believe online classes cannot completely replace traditional schooling [6,11,15].
Given the diverse views and stakeholders, it is necessary to look beyond conventional empirical studies and investigate the broad opinions expressed on social media, where similar studies are lacking. Thus, this study employs a social media analytical approach [16][17][18] to investigate the sentiments and concerns of learners and educators regarding online classes, and how they vary with time.

Literature Review
Reddit is among the most popular social networks, with over 50 million daily active users and more than 100,000 active topical communities called "subreddits." Its image is associated with the people's forum, a warehouse for the dankest memes, the weapon against Wall Street, and "the front page of the Internet" [19]. Being valued at up to USD 15 billion, Reddit is on the way to going public (possibly as soon as this year), with preliminary IPO registration statements already filed under the guidance of Morgan Stanley and Goldman Sachs [20].
Although the demographic features of the 330 million active users are unavailable because of its policy of not collecting personal information, evidence shows that a majority of users are young people between 18 and 34 years old (57%) [21], corresponding to the age of college learners. In addition, research on educators on Reddit reveals that educators widely use the site and that its use might be influential in training preservice educators [22].
Given Reddit's accessibility and popularity and the capability to collect high-quality data [23], a growing volume of research has used Reddit as a data source in the past decade. These studies used different types of data, including the original content, comments and comment threads, meta-data, links or media, upvoting/downvoting information, characteristics of subreddits, as well as surveys with users and moderators [21]. However, few researchers share the datasets drawn from Reddit. Notably, Proferes et al. [21] provide procedure descriptions for gathering their data, the metadata of the final dataset, and even the list of subreddits selected as the sources.
The subreddits of interest vary with the specific research context, with political subreddits, mental health subreddits, and drug use subreddits being more prominent data sources [23]. However, Reddit has evolved into a widespread forum with diverse topics, including education [24], and much of the research comes from the computer science, engineering, and mathematics disciplines, using computationally driven textual analysis [21].
Discussions on Reddit are open to anyone with or without a Reddit account unless the subreddit is set to private. The visibility of both the original post and discussion comments is determined by users' "upvote" and "downvote" behaviors. Registration for a Reddit account only requires a unique username and a password without email authentication [21]. Users are allowed to post content anonymously under a pseudonymous account or multiple accounts [25], facilitated by "throwaway accounts" when the topic is particularly sensitive or personal, encouraging individuals to engage in more sensitive, potentially stigmatizing conversations [26].

Research Gap
On the other hand, studies on online classes during COVID-19 mainly focus on innovative pedagogies and examination effectiveness. Few investigate satisfaction and concerns from different stakeholders. A few studies examined some aspects of education-related issues on forums to help better understand various stakeholders' opinions, such as COVID-19 vaccination-related issues [27]. However, to our knowledge, no study examines online educational issues in the forum context. Moreover, most studies adopted traditional approaches to collect empirical data, such as surveys and interviews, which may suffer from response bias and limited generalizability. Given that educators and learners spend more time online, more effort is necessary to gather online self-disclosure posts for further analysis.
Reddit is one of the most popular social media among younger generations and education practitioners. Its unique anonymity mechanism gives users enough safety to express their genuine feelings and opinions. However, previous studies concentrate on the technology field, undertreating the potential of Reddit as a data source in the education discipline. As such, there is a need to comprehend the authentic sentiments among different stakeholders in online classes and the obstacles to conducting effective online classes through Reddit. Three research questions were defined to guide this research and reach the research goals as listed below:
RQ1: What topics can be detected from Reddit posts about online classes?
RQ2: What are the differences in sentiments and topics between different stakeholders?
RQ3: How do the sentiments and topics vary with time?

Methodology

Data Collection
There are two ways to request data from Reddit: the third-party archive Pushshift (https://github.com/pushshift; accessed on 12 May 2022) and the official Reddit API. Pushshift is a social media archiving tool that gathers Reddit data and makes it accessible to researchers. Data are copied into Pushshift as they are submitted to Reddit, so the archive is complete and updated in real time [28]. Anyone can access Reddit data through the Pushshift API or the downloadable dataset without restrictions. The Reddit API is maintained by Reddit itself to encourage the developer community to build products on the platform. It enables programmatic control of nearly every function users can perform on the site. Using this API requires an authorization application, which is straightforward, consists of several steps, and involves no lengthy review process. Table 1 summarizes the differences between the two approaches.

• The downloadable dataset aligns with the FAIR principles.
• The data structure and contents are not consistent over time owing to the changes in Reddit itself.
• The data is more reliable and more regularly updated since it is derived from the official storage.

In summary, Reddit API is easier to use and more regularly updated, but it only allows the extraction of data from up to 1000 posts at a time. Using Pushshift API is slightly more cumbersome, but a larger amount of historical data is available. Since this research covered data over two years, Pushshift was selected for data collection.

Data Source Selection
The initial examination of subreddits relevant to the broad topic of online classes involved a subreddit search using the official search API with the term *online class, in which '*' serves as the wildcard character. Fifty-seven subreddits were identified as associated with online classes. However, some results were not useful for this research, e.g., r/softwaregore is for sharing funny software vulnerabilities, and r/memes content is in the form of images, both infeasible for textual analysis. Note that on the Reddit website, the symbol 'r/' denotes a subreddit of the name following it. For example, r/softwaregore is a subreddit called softwaregore.
Next, each identified subreddit was further assessed for its appropriateness for further mining and analysis based on: (1) whether subscribers are representative of different stakeholders; (2) whether a large amount of content was posted by a variety of users rather than by bots or specific user groups; and (3) the average quality of content measured by the length of text, the number of comments, and the upvote and downvote score. As a result, six subreddits were finally chosen as the data sources, as detailed in Table 2. Additionally, threads about online classes across Reddit as a whole (referred to as "global") were also collected to reflect public perspectives using the post-search feature of Pushshift API. The period covered 1 January 2020 to 30 April 2022.
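The post search described above can be sketched as the construction of a query URL. This is a minimal illustration rather than the collection script used in the study: the endpoint URL, the helper names `to_epoch` and `build_query`, and the page size are assumptions, and the availability of the public Pushshift endpoint may have changed since the data were collected. Only the study period and the idea of combining a subreddit filter with a keyword come from the text.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint of the public Pushshift submission search (may no longer be open).
PUSHSHIFT_SUBMISSION_URL = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp in seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_query(subreddit=None, keyword=None, size=100):
    """Build a search URL covering 1 January 2020 to 30 April 2022."""
    params = {
        "after": to_epoch(2020, 1, 1),
        # Start of 1 May 2022, so that all of 30 April is included.
        "before": to_epoch(2022, 5, 1),
        "size": size,  # hypothetical page size, not taken from the paper
    }
    if subreddit:
        params["subreddit"] = subreddit
    if keyword:
        params["q"] = keyword
    return PUSHSHIFT_SUBMISSION_URL + "?" + urlencode(params)


# Example: posts in r/Teachers that mention "online class".
url = build_query(subreddit="Teachers", keyword="online class")
print(url)
```

Omitting the `subreddit` argument yields a Reddit-wide keyword query, which corresponds to the "global" collection.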

Data Analysis
We analyzed our data using three processes, i.e., data preprocessing, sentiment analysis, and topic modeling.
Data preprocessing was performed with the Regex library (https://regexlib.com; accessed on 12 May 2022) to tokenize, clean, and format the text of each subset. The first step was to tokenize the data, removing special characters and hyperlinks and segmenting sentences into individual words. Then, all stopwords, i.e., commonly used non-emotional words (e.g., I, a, the), were removed. We adapted NLTK's stopwords corpus to identify and filter the stopwords. Next, the remaining tokens were standardized, with all capital letters converted into lowercase, contractions expanded into their component words (e.g., don't to do not), and numbers replaced by their corresponding words (e.g., 1 to one). We adapted NLTK's tokenize package (https://www.nltk.org/api/nltk.tokenize.html; accessed on 12 May 2022) to implement this standardization process. The final step, called lemmatization, reduced the standardized tokens to their root word form (e.g., taught to teach).
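The preprocessing steps above can be illustrated with a small, self-contained sketch. It is not the authors' pipeline: the stopword set, contraction table, number table, and lemma table below are tiny hypothetical stand-ins for the NLTK resources mentioned in the text, and the sketch lowercases and expands contractions before filtering stopwords, a reordering assumed here for simplicity.

```python
import re

# Toy resources for illustration only (assumptions, not the NLTK corpora used in the study).
STOPWORDS = {"i", "a", "the", "to", "is", "in", "my", "and", "of"}
CONTRACTIONS = {"don't": "do not", "can't": "can not", "it's": "it is"}
NUMBER_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three"}
LEMMAS = {"taught": "teach", "classes": "class", "exams": "exam"}

URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def preprocess(text):
    """Return a list of cleaned, standardized, and lemmatized tokens."""
    text = URL_PATTERN.sub(" ", text)          # drop hyperlinks
    text = text.lower().replace("\u2019", "'")  # lowercase, normalize curly apostrophes
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    tokens = TOKEN_PATTERN.findall(text)        # keeps letters and digits, drops special characters
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]   # dictionary lookup stands in for a lemmatizer


print(preprocess("I taught 2 classes, don't miss https://example.com the exams!"))
```

In practice, a full stopword corpus and a proper lemmatizer would replace the toy dictionaries.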
Sentiment analysis is an approach for automatically extracting and classifying sentiment from textual documents using natural language processing (NLP), textual analysis, and computational techniques [30]. This approach can avoid taking keywords out of context and consider users' feelings at the post level. This study selected VADER (Valence Aware Dictionary and sEntiment Reasoner) because it is specifically tuned to the social media context and supports web expressions such as emoticons and shorthand [31]. The result returned by VADER was a dictionary of four keys: neg, neu, pos, and compound. The compound score is computed by summing the valence scores of the words in the text, adjusting them according to VADER's rules, and normalizing the result to lie between -1 and 1. By applying a threshold, tuned to the dataset, to the compound value, each complete post could be classified as negative, neutral, or positive.
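The thresholding step can be sketched as follows. The cut-off of 0.05 is the value conventionally suggested for VADER and is only an assumption here, since the threshold tuned for this dataset is not reported; the function names are hypothetical. The second function assumes the `vaderSentiment` package is installed and is imported lazily so that the thresholding logic can be used on its own.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment class."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_post(text, threshold=0.05):
    """Score a post with VADER and classify it (requires the vaderSentiment package)."""
    # Lazy import: label_from_compound stays usable without the dependency.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    return label_from_compound(scores["compound"], threshold)


print(label_from_compound(0.6), label_from_compound(-0.6), label_from_compound(0.0))
```

Widening the threshold enlarges the neutral band, which is one way a threshold could be adjusted to a particular dataset.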
To evaluate the effectiveness of the application of VADER to our dataset, we sampled 100 posts from the global data set, labeled them manually into three classes, i.e., 'negative,' 'neutral,' and 'positive,' and calculated the prediction accuracy. We used accuracy as the metric because the classes of our data set were relatively balanced in terms of label distributions. In addition, there was no difference between the classes for decision-making implications in our research context, as we only used the classification results to plot the sentiment-time line chart. Therefore, we did not calculate precision and recall. We used the following equation to calculate the accuracy:

$$\text{Accuracy} = \frac{cp + 0.5 \times ipn}{\text{total number of predictions}} \qquad (1)$$
In Equation (1), cp denotes the number of correct predictions, and ipn denotes the number of incorrect predictions involving the 'neutral' manual labels. The 0.5ipn term is added to account for the ordinal relation among the three classes, i.e., 'negative,' 'neutral,' and 'positive.' It is therefore fairer to assign half of the marks to the failed 'neutral' cases than to give them the same zero mark as the failed cases that predict 'negative' when the true value is 'positive' and vice versa.
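The half-credit accuracy described above can be sketched in a few lines. The function name is hypothetical, and one interpretation is assumed: an incorrect prediction counts toward ipn when exactly one of the predicted and manual labels is 'neutral', so that only negative/positive confusions receive zero credit. The denominator is assumed to be the total number of labeled samples.

```python
def half_credit_accuracy(predicted, actual):
    """Accuracy with half credit for errors that are one class away from the truth."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    cp = 0   # correct predictions
    ipn = 0  # incorrect predictions involving a 'neutral' label (assumed interpretation)
    for p, a in zip(predicted, actual):
        if p == a:
            cp += 1
        elif "neutral" in (p, a):
            ipn += 1
        # negative/positive confusions add nothing
    return (cp + 0.5 * ipn) / len(predicted)


predicted = ["positive", "negative", "neutral", "positive"]
actual = ["positive", "positive", "positive", "neutral"]
print(half_credit_accuracy(predicted, actual))  # (1 + 0.5 * 2) / 4 = 0.5
```

In this toy case there is one exact match, two near misses involving 'neutral', and one negative/positive confusion, giving (1 + 1) / 4 = 0.5.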
Topic modeling is a statistical approach for discovering "topics" hidden in the text corpus, which is the cleaned dataset in this research. Latent Dirichlet Allocation (LDA) is one of the most popular topic models. It creates two Dirichlet distribution models: one represents the distribution of topics in each document, and the other represents the distribution of words in each topic. The Python machine-learning module scikit-learn (https://scikit-learn.org; accessed on 12 May 2022) was imported to create the LDA topic model. First, the DTM (Document-Term Matrix) was calculated to indicate how frequently each term occurs in each document. Then, TF-IDF (Term Frequency-Inverse Document Frequency) was computed to evaluate a term's relevance to a certain document by two metrics: how many times the term appeared in that document (TF) and how rare the term was across the entire document set (IDF). The final LDA topic model was generated by populating the DTM with the TF-IDF scores and optimizing the result by modifying parameters. To visualize the optimal result, pyLDAvis (https://github.com/bmabey/pyLDAvis; accessed on 11 June 2022) was applied to create an interactive figure in which the top 30 most relevant terms were displayed according to frequency once a topic was selected, in addition to static charts and word clouds drawn by the pandas and seaborn libraries.
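To make the DTM and TF-IDF steps concrete, the sketch below computes both with the standard library only. It is an illustration under assumptions, not the study's implementation: the function names are hypothetical, and the classic weighting tf × log(N / df) is used, whereas scikit-learn applies a smoothed variant by default.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return the sorted vocabulary and a matrix of raw term counts per document."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf(matrix):
    """Weight each count by log(N / df), where df is the number of documents containing the term."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Three tiny tokenized "posts" (made-up examples).
docs = [["exam", "help", "exam"], ["meme", "exam"], ["meme", "teacher"]]
vocab, dtm = document_term_matrix(docs)
weights = tfidf(dtm)
print(vocab)
print(dtm)
```

In scikit-learn, `CountVectorizer` and `TfidfVectorizer` produce these matrices, and `LatentDirichletAllocation` can then be fitted on the result.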

Data Collected
After the collecting and filtering process, 19,818 posts with title and body text, published from 1 January 2020 to 30 April 2022, finally remained (retrievable at https://github.com/hellotum/learnerEducatorData, with the data set description in the file README.md). Table 3 illustrates the number of posts collected from the six subreddits and global. Additional attributes of posts include post time, number of comments, and author. Nine datasets were built and saved into CSV files, among which seven were directly built upon data collected from global and subreddits, and two were further integrated to represent different stakeholders. The learner dataset was created by combining the data from three subreddits, namely r/teenagers, r/college, and r/CollegeRant, while the educator dataset was from r/Teachers and r/Professors. Table 4 shows the first five rows of the global data set as a sample of the posts.
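The construction of the two stakeholder datasets can be sketched as a simple grouping by source subreddit. The mapping follows the text; the record layout (a dictionary with a `subreddit` key) and the function name are assumptions made for illustration.

```python
# Subreddit-to-stakeholder mapping as described in the text.
STAKEHOLDER_BY_SUBREDDIT = {
    "teenagers": "learner",
    "college": "learner",
    "CollegeRant": "learner",
    "Teachers": "educator",
    "Professors": "educator",
}


def split_by_stakeholder(posts):
    """Group post records into learner and educator lists; other sources are skipped."""
    groups = {"learner": [], "educator": []}
    for post in posts:
        role = STAKEHOLDER_BY_SUBREDDIT.get(post["subreddit"])
        if role is not None:
            groups[role].append(post)
    return groups


# Made-up records; the "subreddit" and "title" fields are assumed column names.
posts = [
    {"subreddit": "college", "title": "Online finals are exhausting"},
    {"subreddit": "Professors", "title": "Tips for breakout rooms"},
    {"subreddit": "funnyonlineclasses", "title": "Cat on camera"},
]
groups = split_by_stakeholder(posts)
print(len(groups["learner"]), len(groups["educator"]))
```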

Topics Detected (RQ1)
By observing the terms associated with each topic and the inter-topic distance when assigning different topic numbers, batch sizes, and random states, six topics were finally determined as the optimal result. Figure 1 shows that three of the six topics, Topics 3, 5, and 6, are distinct from each other, while there is an overlap between the three remaining topics (Topics 1, 2, and 4). Figure 2 displays the fifteen most relevant terms for each topic. Figure 3 depicts the word clouds generated from the six topics.
Among the three distinctive topics, Topic 3 is associated with teacher, thing, happen, and various adjective terms such as funniest, weirdest, and embarrassing. It reflects the abnormal, strong, and diverse incidents and interactions between learners and educators during online classes. Topic 5 features the terms help, exam, and post, which implies that learners seek help in exams and assignments when taking online classes. The salient terms math, statistic, and software indicate that learners experience more difficulties in science and engineering classes. In Topic 6, meme, made, time, and today are the most dominant terms, illustrating that online classes have become a popular resource for meme-making. In other words, they have become a part of the subculture among the young generation.
Among the three overlapping topics, Topic 1 covers the largest number of words, featuring teacher, student, and time, and concerns small things that frequently happen in online classes. Topics 2 and 4 are closely associated with each other. Topic 4 includes bored, tifu, and camera, which emphasize the negative aspect of daily online classes, while Topic 2 is more positive since it includes like, friend, love, and fun.

Sentiment Difference between Stakeholders (RQ2)
We conducted the sentiment analysis, which resulted in a 78% prediction accuracy with our samples. We then used the classification result for further analysis. Figure 4 shows the sentiment distributions of posts in the six subreddits. Interestingly, none of the six distributions follows the normal distribution. Instead, there are sharp sentiment polarizations in five of the subreddits, the exception being r/funnyonlineclasses. Users seem to express intense and differentiated emotions towards online classes on Reddit. This result verified the previous finding that sharing beliefs and opinions to find others who share similar ideas is a significant motivation for Reddit users to post content [32]. Figures 5 and 6 display more detailed sentiment distributions of each subreddit, with a single post's sentiment value counted and indicated by a dot and bar. Compared with the other five subreddits, r/funnyonlineclasses shows the distribution closest to normal, where most posts display neutral and slightly positive sentiment, which appears reasonable. Surprisingly, among the three subreddits composed of learners of different ages, the sentiment of r/teenagers is the mildest. The median sentiment value of this subreddit is neutral, even though a significant amount of sentiment is either very positive or negative. Yet, the other two subreddits, r/college and r/CollegeRant, show extreme polarization. Notably, r/college demonstrates a similar amount and extent of positive and negative emotions, whereas r/CollegeRant shows a higher number of negative emotions. This result can be explained by this subreddit's focus on the negative aspects of college life.
On the educators' side, positive sentiment exceeds negative sentiment. Moreover, with a higher median sentiment value and higher density of positive sentiment value, r/Teachers expresses stronger negative emotions than r/Professors.
After cleaning and tokenizing the sentences in the two representative datasets, the sentiment of different words in the posts was calculated based on negative, neutral, and positive word classification. The negative words and positive words in the two datasets were ranked according to their frequencies.
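The word-level ranking can be sketched as a frequency count restricted to words of known polarity. The two small word sets below are hypothetical stand-ins, seeded with words mentioned in this paper, for whatever sentiment lexicon was actually used; the function name is also an assumption.

```python
from collections import Counter

# Toy polarity lexicon for illustration only (an assumption, not the study's lexicon).
NEGATIVE_WORDS = {"bad", "hate", "fail", "cheat"}
POSITIVE_WORDS = {"best", "love", "great"}


def rank_sentiment_words(token_lists, top_n=5):
    """Return the most frequent negative and positive words as (word, count) pairs."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    ranked = counts.most_common()  # sorted by descending frequency
    negative = [(w, c) for w, c in ranked if w in NEGATIVE_WORDS][:top_n]
    positive = [(w, c) for w, c in ranked if w in POSITIVE_WORDS][:top_n]
    return negative, positive


# Made-up tokenized posts.
token_lists = [
    ["hate", "exam", "cheat"],
    ["cheat", "love"],
    ["cheat", "hate", "great", "love", "love"],
]
negative, positive = rank_sentiment_words(token_lists)
print(negative)
print(positive)
```

Running the same function separately on the learner and educator token lists would give the two rankings that are compared in the text.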
Further analysis was conducted on the learner and educator datasets to understand the two types of stakeholders. Figure 7 shows the kernel density estimate (KDE) plot of both sides for comparison. Overall, learners express more extreme sentiments, while educators tend to have more positive sentiments. The difference in motivations behind social media usage could explain this: educators generally use social media for self-development purposes, such as sharing resources and seeking collaborations with potential colleagues [22]. Figure 8 shows the sentiment of words constituting posts, extracted and ranked by frequency. Notably, educators and learners shared a similar set of sentimental words, such as bad, hate, shit, fail, and hell, widely used for expressing negative feelings, and, in contrast, best, love, great, kind, and free to express positive feelings. However, the term cheat is particularly concerning since it ranks high on both sides. Posts regarding cheating were extracted, and Appendix A shows several representative examples. Educators observed widespread cheating behavior, and this should be taken seriously as a systemic problem. Accordingly, we propose various practical approaches to prohibit cheating and care for learners' mental health. On the learners' side, motivations for cheating are complicated. Cheating may be due to their difficulties studying online and fear of exam failure. Furthermore, cheating is also a social behavior driven by others' influence: because classmates around them cheat, learners fear that they may fall behind if they do not do so. Some are even under pressure to help others cheat. In addition to the identified topic of asking for help in exams and assignments, results indicated that cheating behavior is much more severe than expected.
Notably, results showed several unique sentimental words on both sides, e.g., panic, successful, and awesome for educators and depression, depressed, dumb, trust, and laugh for learners. These words may explain the nuanced difference in the sentiment between educators and learners. Posts containing the above unique keywords were extracted for further qualitative analysis, with some examples summarized in Appendix B.
The term panic reflects the following worries. Extra class preparation procedures introduced by software result in panic among many educators. Learners' panic and scarce participation further worsen it. The terms successful and awesome are associated with those who actively share effective experiences, give encouragement, and ask for suggestions. Among learners, the terms depress and dumb revealed their terrible mental state. They felt it was difficult to finish assignments and pass exams and blamed themselves for this. At the same time, a lack of communication among classmates made them feel lost and left behind. The positive term trust commonly came from supportive posts, which indicated hard work and self-confidence. Behind the term laugh, learners try to make the class enjoyable.

Figure 10 compares the number of posts by educators and learners over time. In general, the features of the post number, including the peaks, bottoms, increases, and decreases in both parties, were similar to the global trend. Nevertheless, the learners' trend was much closer to the global trend, with a much lower number of posts in the latter two years. In contrast, the post number of educators was maintained at a certain level, without any downward trend, indicating that online classes are still a focus of attention among educators, though not as prevalent as before.

Discussions, Conclusions, and Limitations
This research has identified various online class topics from Reddit posts. As a daily activity, educators and learners discuss their experiences, including funny and embarrassing incidents, comfortable aspects, school life memories, and creative memes inspired by the class. Therefore, the overall sentiment is neutral. However, sentiments are strong and polarized [33] when diving into posts focused on online classes. Learners' sentiment tends to be negative, expressing a sense of depression and self-doubt, which reflects approval-seeking behaviors. Educators' sentiment is much more rational and positive in contrast. Diverse informal, self-directed learning activities beyond schools' requirements are identified in relevant subreddits, revealing educators' and learners' different motivations for social media usage [8]. This study provides hints on how to improve online classes to fulfill the needs of educators and learners and on how social media can aid online education, as highlighted below.
Results have reflected that cheating in online classes is alarming as a systematic and social issue. Educators should propose solutions on various aspects to lower cheating motivation by cultivating a sense of guilt to combat stimulation from others. Particularly, academic libraries should include related ethics and academic integrity in library instruction [34]. Concerning curriculum design, educators may increase the weight of the grade for assignments in online classes and continuous assessment to decrease the effect of exam cheating on overall course assessment. For example, most subjects of the information management degree (both undergraduate and postgraduate) at The University of Hong Kong have no exams and adopt continuous assessment, group projects, and individual essays [35]. Using similarity-checking tools to detect possible cheating issues through submitted assignments and examination scripts also helps deter cheating through plagiarism [36]. Education and human-computer interaction researchers may work on dealing with the online cheating issue to make the online mode more assessable in formal teaching.
In addition, the time series analysis casts doubt on the effectiveness of the strategies for solving the problems identified in online classes. The decreasing number of posts about online classes is mainly due to the less frequent discussion of this topic among learners because there are fewer online classes. This suggests a longer tracking period is necessary. Moreover, we observed that the fluctuations in educators' and learners' post numbers shared a similar pattern, which may indicate that educators' posts are crucial in facilitating discussion. Therefore, we suggest that educators post more often to encourage learners to exchange ideas and eventually form a community of practice to engage learners [37].
This research has revealed the concerns behind online classes and verified the different motivations and behavior of social media usage between educators and learners. The research shows the potential of using Reddit as a data source in education research and as a reference for follow-up research. Yet, a limitation is that this research could not fully utilize the special structure of Reddit submissions and the corresponding meta-data. Future efforts are necessary to consider more attributes of submissions when analyzing Reddit data. Therefore, as a future research direction, we plan to use more advanced machine-learning and deep-learning sentiment analysis algorithms to avoid analyzing keywords out of context and further investigate related issues, such as improving the accuracy of the sentiment ratings.

Figure 2. Top 15 most salient terms of each topic.

Figure 3. Word cloud of each topic.

Figure 5. Box plot of posts' sentiment distributions of six subreddits.

Figure 6. KDE plot of posts' sentiment distributions of six subreddits.

Figure 7. KDE plot of sentiment distributions of educators and learners.

Figure 8. Word sentiment in educators and learners ranked by frequency.

Figure 9 illustrates the change in post numbers over time on Reddit global. The dramatic increase in online-class-related posts started at the beginning of March 2020 when the worldwide lockdown of educational institutions spread. After peaking at the end of that month, the number kept decreasing until it reached its lowest point in June 2020, when summer vacation was about to start. The subsequent two apparent fluctuations coincided with the arrangement of the second online semester. The first peak occurred at the end of August 2020, the start of a new semester, and the other peak in the middle of November 2020 was the time of final exams. The result triangulates our findings that learners struggle with online examinations. In 2021, with the highest number at the start of the first semester, the number gradually decreased as the lockdown was loosened worldwide. In 2022, the number remained at a much lower level. The general decreasing trend in the number of posts is probably due to the smaller amount of time for online classes instead of the improvement in class effectiveness.
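The series behind such a plot can be sketched by bucketing post timestamps by calendar month. The function name is hypothetical, and the input is assumed to be Unix timestamps in seconds (as in Reddit's `created_utc` field); the three timestamps in the example are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_post_counts(timestamps):
    """Count posts per calendar month (UTC), returned in chronological order."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        for ts in timestamps
    )
    # "YYYY-MM" keys sort chronologically as plain strings.
    return dict(sorted(counts.items()))


# 1 March 2020, 15 March 2020, and 1 September 2020 (all 00:00 UTC).
timestamps = [1583020800, 1584230400, 1598918400]
print(monthly_post_counts(timestamps))
```

Applying the same function to the posts of each sentiment class, or of each stakeholder group, would yield the per-group series.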

Figure 9. Number of posts varying with time in Reddit global.

Figure 10. Number of posts for educators and learners.

Figures 11 and 12 show the changes in global post numbers by sentiment type. Combined with Figure 8, neutral posts constitute the majority of posts, thereby decreasing the entire post number. Except for the fluctuation at the initial stage, there is no strong fluctuation in the number of negative and positive posts. Overall, the number of neutral posts decreases as time goes by, with fewer online classes, whereas the numbers of negative and positive posts are relatively stable. This observation further validates the analysis that the post number is correlated with the popularity of online classes and indicates that there are still unsolved concerns about online classes.

Figure 11. Global number of posts in different sentiments.

Figure 12. Two sides' stakeholders' number of posts in different sentiments.

Table 3. Number of posts from different sources.