Sentiment Analysis of Students’ Feedback with NLP and Deep Learning: A Systematic Mapping Study

Abstract: In the last decade, sentiment analysis has been widely applied in many domains, including business, social networks and education. Particularly in the education domain, where dealing with and processing students’ opinions is a complicated task due to the nature of the language used by students and the large volume of information, the application of sentiment analysis is growing yet remains challenging. Several literature reviews reveal the state of the application of sentiment analysis in this domain from different perspectives and contexts. However, the body of literature is lacking a review that systematically classifies the research and results of the application of natural language processing (NLP), deep learning (DL), and machine learning (ML) solutions for sentiment analysis in the education domain. In this article, we present the results of a systematic mapping study to structure the published information available. We used a stepwise PRISMA framework to guide the search process and searched for studies conducted between 2015 and 2020 in the electronic research databases of the scientific literature. We identified 92 relevant studies out of 612 that were initially found on the sentiment analysis of students’ feedback in learning platform environments. The mapping results showed that, despite the identified challenges, the field is rapidly growing, especially regarding the application of DL, which is the most recent trend. We identified various aspects that need to be considered in order to contribute to the maturity of research and development in the field. Among these aspects, we highlighted the need for structured datasets, standardized solutions and increased focus on emotional expression and detection.


Introduction
The present education system represents a landscape that is continuously enriched by a massive amount of data that is generated daily in various formats and most often hides useful and valuable information. Finding and extracting the hidden "pearls" from the ocean of educational data constitutes one of the great advantages that sentiment analysis and opinion mining techniques can provide. Sentiments and opinions expressed by students are a valuable source of information not only for analyzing students' behavior towards a course, topic, or teachers but also for reforming policies and institutions for their improvement. Although sentiment analysis and opinion mining seem similar, there is a slight difference between the two: the former refers to finding sentiment words and phrases exhibiting emotions, whereas the latter refers to extracting and analyzing people's opinions for a given entity. For this study, we use the two terms interchangeably. The sentiment/opinion polarity, which can be positive, negative, or neutral, represents one's attitude towards a target entity. Emotions, on the other hand, are one's feelings expressed regarding a given topic. Since the 1960s, several theories about emotion detection and classification have been developed. The study conducted by Plutchik [1] categorizes emotions into eight categories: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. Sentiment analysis can be conducted at a word, sentence, or document level. However, due to the large number of documents, manual handling of sentiments is impractical. Therefore, automatic data processing is needed. Sentiment analysis of text-based corpora, at the sentence or document level, is performed using natural language processing (NLP). Most research papers found in the literature published until 2016-2017 employed pure NLP techniques, including lexicon and dictionary-based approaches for sentiment analysis. Few of those papers used conventional machine learning classifiers. Recent years have seen a shift from pure NLP-based approaches to deep learning-based modeling in recognizing and classifying sentiment, and the number of papers published recently on this topic has increased significantly.
The popularity and importance of students' feedback have also increased recently, especially during the COVID-19 pandemic, when most educational institutions moved from traditional face-to-face learning to the online mode. Figure 1 shows the country-wise breakdown of interest over the past six years in the use of sentiment analysis for analyzing students' attitudes towards teacher assessment. The number of papers published recently indicates a growing interest in the application of NLP/DL/ML solutions for sentiment analysis in the education domain. However, to the best of our knowledge, the body of literature is lacking a review that establishes the state of evidence by systematically classifying and categorizing research and results and by showing the frequencies and visual summaries of publications, trends, etc. This gap in the body of literature necessitated a systematic mapping of the use of sentiment analysis to study students' feedback. Thus, this article aims to map how this research field is structured by answering research questions through a step-wise framework for conducting systematic reviews. In particular, we formulated multiple research questions that cover general issues regarding investigated aspects in sentiment analysis, models and approaches, trends regarding evaluation metrics, bibliographic sources of publications in the field, and the solutions used, among others.
The main contributions of this study are as follows:
• A systematic map of 92 primary studies based on the PRISMA framework;
• An analysis of the investigated educational entities/aspects and bibliographical and research trends in the field;
• A classification of reviewed papers based on approaches, solutions, and data representation techniques with respect to sentiment analysis in the education domain;
• An overview of the challenges, opportunities, and recommendations of the field for future research exploration.
The rest of the paper is organized as follows. Section 2 provides some background information on sentiment analysis and related work, while Section 3 describes the search strategy and methodology adopted in conducting the study. Section 4 presents the systematic mapping study results. Challenges identified from the investigated papers are described in Section 5. Section 6 outlines recommendations and future research directions for the development of effective sentiment analysis systems. Furthermore, in Section 7, we highlight the potential threats to the validity of the results. Lastly, the conclusion is drawn in Section 8.

Overview of Sentiment Analysis
Sentiment analysis is a task that focuses on polarity detection and the recognition of emotion toward an entity, which could be an individual, topic, and/or event. In general, the aim of sentiment analysis is to find users' opinions, identify the sentiments they express, and then classify their polarity into positive, negative, and neutral categories. Sentiment analysis systems use NLP and ML techniques to discover, retrieve, and distill information and opinions from vast amounts of textual information [2].
In general, there are three different levels at which sentiment analysis can be performed: the document level, sentence level, and aspect level. Sentiment analysis at the document level aims to identify the sentiments of users by analyzing the whole document. Sentence-level analysis is more fine-grained as the goal is to identify the polarity of sentences rather than the entire document. Aspect-level sentiment analysis focuses on identifying aspects or attributes expressed in reviews and on classifying the opinions of users towards these aspects.
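The difference between document-level and sentence-level analysis can be illustrated with a minimal sketch. It assumes a toy polarity lexicon; the word list, the `polarity` helper, and the example comment are hypothetical and are not taken from any reviewed study.

```python
import re

# Toy polarity lexicon (hypothetical; real systems rely on curated lexicons).
LEXICON = {"clear": 1, "helpful": 1, "great": 1,
           "boring": -1, "confusing": -1, "late": -1}

def polarity(text):
    """Sum the lexicon scores of the words in `text` and map the sum to a label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

feedback = ("The lectures were clear and helpful. "
            "The grading was late and confusing. "
            "The room was on the second floor.")

# Document level: a single label for the whole comment.
document_label = polarity(feedback)

# Sentence level: one label per sentence (naive split after ., ! or ?).
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", feedback) if s.strip()]
sentence_labels = [polarity(s) for s in sentences]

print(document_label)
print(sentence_labels)
```

In this example the positive and negative words cancel out at the document level, whereas the sentence-level labels keep the praise and the complaint apart. Aspect-level analysis would additionally require identifying the target of each opinion (here, the lectures and the grading), which this sketch does not attempt.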
As can be seen from Figure 2, the general architecture of a generic sentiment analysis system includes three steps [3].
Step 1 represents the input of a corpus of documents into the system in various formats. This is followed by the second step, which is document processing. At this step, the entered documents are converted to text and pre-processed by utilizing different linguistic tools, such as tokenization, stemming, PoS (Part of Speech) tagging, and entity and relation extraction. Here, the system may also use a set of lexicons and linguistic resources. The central component of the system architecture is the document analysis module (step 3), which also makes use of linguistic resources to annotate the preprocessed documents with sentiment annotations. Annotations represent the output of the system (i.e., positive, negative, or neutral) and are presented using a variety of visualization tools. Depending on the sentiment analysis form, annotations may be attached differently. For document-based sentiment analysis, the annotations may be attached to entire documents; for sentence-based sentiment analysis, the annotations may be attached to individual sentences; whereas for aspect-based sentiment analysis, they are attached to specific topics or entities.
Sentiment analysis has been widely applied in different application domains, especially in business and social networks, for various purposes. Some well-known sentiment analysis business applications include product and services reviews [4], financial markets [5], customer relationship management [6], and marketing strategies and research [5], among others. Regarding social network applications, the most common uses of sentiment analysis are to monitor the reputation of a specific brand on Twitter or Facebook [7] and to explore the reaction of people to a crisis, e.g., COVID-19 [8]. Another important application domain is politics [9], where sentiment analysis can be useful for the election campaigns of candidates running for political positions.
Recently, sentiment analysis and opinion mining have also attracted a great deal of research attention in the education domain [2]. In contrast to the above-mentioned fields of business or social networks, which focus on a single stakeholder, the research on sentiment analysis in the education domain considers multiple stakeholders of education including teachers/instructors, students/learners, decision makers, and institutions. Specifically, sentiment analysis is mainly applied to improve teaching, management, and evaluation by analyzing learners' attitudes and behavior towards courses, platforms, institutions, and teachers.
From the learners' perspective, there are a number of papers [10][11][12] that have applied sentiment analysis to investigate the correlation of attitude and performance with learners' sentiments as well as the relationship between learners' sentiments and drop-out rates in Massive Open Online Courses (MOOCs). Regarding teachers' perspectives, sentiment analysis has been widely adopted by researchers [13][14][15] to examine various teacher-associated aspects expressed in students' reviews or comments in discussion forums. These aspects include teaching pedagogy, behavior, knowledge, assessment, and experience, to name a few. Sentiment analysis was also used in a number of studies [16,17] to analyze students' attitudes towards various aspects related to an institution, e.g., tuition fees, financial aid, housing, food, and diversity. Regarding courses, aspect-based sentiment analysis systems have been implemented to identify key aspects that play a critical role in determining the effectiveness of a course as discussed in students' reviews and then examine the attitudes and opinions of students towards these aspects. These aspects primarily include course content, course design, the technology used to deliver course content, and assessment, among others.

Related Work
Referring to past literature, we found that one study [18] on sentiment analysis (SA) in the education domain focused on detecting the approaches and resources used in SA and identifying the main benefits of using SA on education data. Our study extends that article by presenting information along additional dimensions, including bibliographical sources, research trends and patterns, and the latest tools used to perform SA. Instead of listing the data sources, we present the four categories of education-based data sources that are mostly used for SA. Furthermore, for the convenience of researchers in this domain, we group the studies by learning approach, most frequently used techniques, and most widely used education-related lexicons for sentiment analysis.
Another review study [19] provided an overview of sentiment analysis techniques for education. The authors of this study provided a sentiment discovery and analysis (SDA) framework for multimodal fusion. Rather than the text, audio, and visual signals considered in [19], our review article presents, in a systematic way, the aspects related to the sentiment analysis of educational information with a focus on textual information only. Furthermore, we also provide a long list of current approaches employed for sentiment discovery and the results obtained by them. Similarly, [20] aimed to review the scientific literature of SA on education data and revealed future research prospects in this direction. The authors of [20] focused on the area in more depth, including the design of sentiment analysis systems, the investigation of topics of concern for learners, the evaluation of teachers' teaching performance, etc., from almost 41 relevant research articles. In contrast, to conduct our scientific literature review study, we initially identified 612 research articles from different journals and conferences. At the final stage of filtering, we included 92 of the most related and high-quality scientific articles published between 2015 and 2020 in this work. The main aim of this paper is to provide most of the available information regarding the sentiment analysis of educational information in a systematic way in a single place.
Review studies of this kind are greatly helpful for readers in this domain. This review study will assist researchers, academicians, practitioners, and educators who are interested in sentiment analysis with a classification of the approaches to the sentiment analysis of education data, different data sources, experimental results from different studies, etc.

Research Design
To conduct this study, we applied systematic mapping as the research methodology for reviewing the literature. Since this method requires an established search protocol and rigorous criteria for the screening and selection of the relevant publications, we utilized the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, as indicated in [21]. The primary goal of a systematic mapping review (SMR) is to provide an overview of the body of knowledge and the research area and identify the number of publications and the type of research and results available. Furthermore, an SMR aims to map the frequencies of publications over time to determine trends, forums or venues, and the relevant authors by which the research has been conducted and published. In contrast to the classical systematic literature review (SLR), which focuses on the identification of best practices based on empirical evidence, the focus of an SMR is on establishing the state of evidence. It is also worth mentioning that, from the methodology standpoint, an SLR is characterized by narrow and specific research questions, and the studies are evaluated in detail regarding their quality. On the other hand, an SMR deals with multiple broader research questions, and studies are not assessed in detail regarding their quality.
To ensure that all relevant studies were located and reviewed, our search strategy involved a stepwise PRISMA approach, consisting of four stages. The overall process of the search strategy is shown in Figure 3. The first stage in the PRISMA entailed the development of a research protocol by determining research questions, defining the search keywords, and identifying the bibliographic databases for performing the search. The second stage involved applying inclusion criteria, which was followed by stage three, in which the exclusion criteria were applied. The last stage was data extraction and analysis. The research questions (RQs) devised for this study were as follows:
• RQ1. What are the most investigated aspects in the education domain with respect to sentiment analysis?
• RQ2. Which approaches and models are widely studied for conducting sentiment analysis in the education domain?
• RQ3. What are the most widely used evaluation metrics to assess the performance of sentiment analysis systems?
• RQ4. In which bibliographical sources are these metrics published, and what are the research trends and patterns?
• RQ5. What are the most common sources used to collect students' feedback?
• RQ6. What are the solutions with respect to the packages, tools, frameworks, and libraries utilized for sentiment analysis?
• RQ7. What are the most common data representation techniques used for sentiment analysis?

Search Strategy
To develop a comprehensive set of search terms, we used the PICO(C) framework. PICO (Population, Intervention, Comparison, and Outcomes) helps researchers design a comprehensive set of search keywords for quantitative research in terms of these elements [22]. As suggested by [23], to avoid missing possibly relevant articles, we also added a "context" section to the PICO schema.
First, for all the sections of PICO(C) in Table 1, we identified the adequate keywords, and then we constructed the search string by applying Boolean operators, as shown in Table 2. To ensure that no possibly relevant article would be omitted in the study, we also used the context criterion.
Context: ("MOOC" OR "SPOC" OR "distance learning" OR "online learning" OR "e-learning" OR "digital learning")
AND
Intervention: ("Sentiment analysis" OR "opinion mining")
AND
Outcome: ("Students' feedback" OR "teacher assessment" OR "user feedback" OR "feedback assessment" OR "students' reviews" OR "learners' reviews" OR "learners' feedback")
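The construction of the search string can be sketched in a few lines: terms within a PICO(C) section are joined with OR, and the sections are joined with AND. The keyword groups are copied from the search string above; the `PICOC_TERMS` dictionary and the `build_query` helper are illustrative names, and the exact query syntax accepted by each bibliographic database may differ.

```python
# Keyword groups copied from the search string; the data layout is illustrative.
PICOC_TERMS = {
    "Context": ["MOOC", "SPOC", "distance learning", "online learning",
                "e-learning", "digital learning"],
    "Intervention": ["Sentiment analysis", "opinion mining"],
    "Outcome": ["Students' feedback", "teacher assessment", "user feedback",
                "feedback assessment", "students' reviews", "learners' reviews",
                "learners' feedback"],
}

def build_query(groups):
    """OR the quoted terms inside each group, then AND the groups together."""
    clauses = []
    for terms in groups.values():
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)

query = build_query(PICOC_TERMS)
print(query)
```

Keeping the keywords in a structure like this makes it straightforward to adapt the same term groups to databases that expect a different field or operator syntax.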

Time Period and Digital Databases
The time period selected for this study was from 2015 to 2020, inclusive. The research was conducted in 2020; therefore, it covered papers published until 30 September 2020.
For our search purposes, we used the following online research databases and engines:

Identification of Primary Studies
As of September 2020, the search in Stage 1 yielded 612 papers without duplicates. In Figure 4, we present the total number of selected studies distributed per bibliographic database, identified during the first stage.

Study Selection/Screening
Screening was stage 2 of the search strategy process and involved the application of inclusion criteria. At this stage, the relevant studies were selected based on the following criteria: (a) the type of publication needed to be a peer-reviewed journal or a conference paper, (b) papers needed to have been published between 2015 and 2020, and (c) papers needed to be in English. In addition, as can be seen in Figure 3, at this stage, we also checked the suitability of papers by examining the keywords, title, and abstract of each paper. After we applied the mentioned criteria, out of 612 papers, 443 records were accepted as relevant studies for further exploration. Table 3 presents the screened and selected studies distributed according to year and database source. The distribution of conference and journal papers reviewed in this study is illustrated in Figure 5. As can be seen from the chart, there has been an increasing trend of research works published in journals in the last two years, in contrast to the previous years, when most of the studies were published in conferences.
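The three inclusion criteria amount to a simple filter over bibliographic records. The following is a minimal sketch under stated assumptions: the `Record` fields and the example records are hypothetical, peer-review status is approximated by the publication type, and the manual check of keywords, title, and abstract is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    pub_type: str   # e.g., "journal", "conference", "book chapter"
    year: int
    language: str

def meets_inclusion_criteria(record):
    """Apply criteria (a)-(c) of the screening stage to one record."""
    return (
        record.pub_type in {"journal", "conference"}  # (a) journal or conference paper
        and 2015 <= record.year <= 2020               # (b) published between 2015 and 2020
        and record.language == "English"              # (c) written in English
    )

# Hypothetical records, one violating each criterion.
records = [
    Record("Paper A", "journal", 2019, "English"),
    Record("Paper B", "book chapter", 2018, "English"),
    Record("Paper C", "conference", 2013, "English"),
    Record("Paper D", "conference", 2020, "Spanish"),
]

screened = [r.title for r in records if meets_inclusion_criteria(r)]
print(screened)
```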

Eligibility Criteria
In Stage 3, we applied the exclusion criteria, eliminating studies that (a) were not within the context of education, (b) were not about sentiment analysis, or (c) did not employ the techniques of natural language processing, machine learning, or deep learning. At this stage, all the titles, abstracts, and keywords were also examined once more to determine the relevant records for the next stage. This stage resulted in 137 identified papers, which were divided among the four authors in equal number to proceed to the final stage. The authors agreed to encode the data using three different colors: (i) green: papers that passed the eligibility threshold, (ii) red: papers that did not pass the eligibility threshold, and (iii) yellow: papers for which the authors were unsure of the category (green or red). The authors were located in three different countries, and the whole discussion was organized online. Initially, an online meeting was held to discuss the green and red lists of papers, and then the main discussion focused on papers listed in the yellow category. For those papers, a thorough discussion among the involved authors took place, and once a consensus was reached, those papers were classified into either the green or red category. In the final stages, a fifth author was invited to increase the level of criticism of the discussion among the authors, to double-check all of the followed stages, and to help distinguish the current contribution from the previous ones.
After we applied these criteria, only 92 papers were considered for further investigation in the last stage of analysis.
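The attrition across the stages follows directly from the reported counts (612, 443, 137, and 92). A short sketch of the arithmetic is given below; the stage labels and the `attrition` helper are illustrative.

```python
# Record counts reported for the successive stages of the search strategy.
stages = [
    ("identified (Stage 1)", 612),
    ("after inclusion criteria (Stage 2)", 443),
    ("after exclusion criteria (Stage 3)", 137),
    ("after the authors' eligibility review", 92),
]

def attrition(stages):
    """For each stage after the first, return (name, records removed,
    percentage of the initially identified records still retained)."""
    initial = stages[0][1]
    rows = []
    for (_, before), (name, after) in zip(stages, stages[1:]):
        rows.append((name, before - after, round(100 * after / initial, 1)))
    return rows

flow = attrition(stages)
for name, removed, retained_pct in flow:
    print(f"{name}: {removed} removed, {retained_pct}% of the initial set retained")
```

Under these counts, 169 records were removed at screening, 306 by the exclusion criteria, and 45 in the final eligibility review, so that about 15% of the initially identified records were mapped.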

Systematic Mapping Study Results
This section is divided into two parts: the first part presents the findings of the RQs, whereas the second highlights the relevant articles based upon the quality metrics.

Findings Concerning RQs
For the purposes of the analysis, the 92 papers remaining after the exclusion criteria were reviewed in detail by the five authors; in this section, the results are presented in the context of the research questions listed in Section 3.

RQ1. What are the most investigated aspects in the education domain with respect to sentiment analysis?
Students' feedback is an effective tool that provides valuable insights concerning various educational entities, including teachers, courses, and institutions, and the teaching aspects related to these entities. The identification of these aspects as expressed in the textual comments of students is of great importance, as it helps decision makers to take the right action to specifically improve them. In this context, we examined and classified the reviewed papers based on the aspects that concerned students and that the authors aimed to investigate. In particular, we found three categories and their related teaching aspects which were objects of investigation in these papers: the first category comprised studies dealing with the comments of students concerning various aspects of the teacher entity, including the teacher's knowledge, pedagogy, behavior, etc.; the second category contained papers concerning various aspects of three different entities, namely courses, teachers, and institutions. Course-related aspects included dimensions such as course content, course structure, assessment, etc., whereas aspects associated with the institution entity were tuition fees, the campus, student life, etc.; the third category included papers dealing with capturing the opinions and attitudes of students toward institution entities. The findings illustrated in Figure 6 show that 81% of reviewed papers focused on extracting opinions, thoughts, and attitudes toward teachers, with 6% corresponding to institutions, whereas 13% presented a more general approach by investigating students' opinions toward teachers, courses, and institutions.

RQ2. Which approaches and models are widely studied for conducting sentiment analysis in the education domain?
Numerous approaches and models have been employed to conduct sentiment analysis in the education domain, which generally can be categorized into three groups. Table 4 shows the papers grouped by the learning approaches that the authors applied. In total, 36 (out of 92) papers used a supervised learning approach, 8 used an unsupervised learning approach, and 20 used a lexicon-based approach. A further seven papers used both supervised and unsupervised approaches. Twenty papers used lexicon-based and supervised learning, whereas seven papers used lexicon-based and unsupervised learning.
In total, three (out of 92) articles used all three learning approaches as a hybrid approach, whereas five other articles did not specify any learning approach. Table 5 shows that the Naive Bayes (NB) and Support Vector Machines (SVM) algorithms, as part of the supervised learning approach, were used most often in the reviewed studies, followed by Decision Tree (DT), k-Nearest Neighbor (k-NN) and Neural Network (NN) algorithms. Furthermore, the use of a lexicon-based learning approach, also known as rule-based sentiment analysis, was common in a number of studies, as shown in Table 4, and was very often combined with either supervised or unsupervised learning approaches. Table 6 lists the lexicons most frequently used in the reviewed articles; the Valence Aware Dictionary and Sentiment Reasoner (VADER) and SentiWordNet were used more often than TextBlob, MPQA, SentiStrength, and Semantria.

RQ3. What are the most widely used evaluation metrics to assess the performance of sentiment analysis systems?
Information retrieval-based evaluation metrics were widely used to assess the performance of systems developed for sentiment analysis. The metrics include the precision, recall, and F1-score. In addition to this, some studies employed statistical-based metrics to assess the accuracy of systems.
It is also informative to compare the number of articles that used a specific evaluation metric to assess the performance of systems with the number of articles that either did not perform any evaluation or did not report the metrics used. Figure 7 illustrates the evaluation metrics used and the percentage of articles that used each metric. As can be seen from Figure 7, 68% of the articles reported either the F1-score alone or a combination of metrics including the F1-score, precision, recall, and accuracy. Only 3% of the studies used Kappa, 2% used the Pearson r-value, and the remaining 27% did not specify any evaluation metrics.
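For reference, the metrics named in this section can be computed from gold and predicted labels as in the following sketch. The label lists are made-up examples, the function names are illustrative, and "Kappa" is assumed here to mean Cohen's kappa, which the reviewed studies may or may not have used in this form; the Pearson r-value is not covered.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1-score for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    """Share of items whose predicted label equals the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).
    Undefined when p_e equals 1; that case is not handled in this sketch."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Made-up gold and predicted polarity labels for eight comments.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu", "neu", "pos"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "neu", "neg", "pos"]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive="pos")
acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(p, r, f1, acc, kappa)
```

For multi-class polarity tasks, the per-class values are usually averaged (for example, macro-averaged over positive, negative, and neutral); which averaging a given study used needs to be checked in that study.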

RQ4. In which bibliographical sources are the metrics published and what are the research trends and patterns?
The publication trend during the review period covered in this paper indicated that there was a variation regarding the distribution of publications across years and bibliographic sources. According to our findings, as illustrated in Figure 8, the majority of the papers were published during 2019, with Springer and IEEE being the most represented bibliographical sources. It is also interesting to note that during 2017, there were only three sources in which papers on sentiment analysis were published.
For a better overview, we present the absolute number of publications across years with the publishers' details in Table 7. This will help readers to swiftly identify the time period and place of publication of the reviewed articles. Regarding the applied techniques, there were only two major categories of techniques used to conduct sentiment analysis in the education domain between 2015 and 2017: NLP and ML. The first efforts [12,32] towards applying DL were presented during 2018, as shown in Figure 9. Moreover, an increasing research pattern of DL application appeared in 2019 and 2020, especially in 2020, when an equal distribution of DL versus the other techniques can be observed.

RQ5. What are the most common sources used to collect students' feedback?
Based on the literature review in preparing this study, we came across several data sources, and based on their characteristics, we divided them into the four following categories for the convenience of our readers and the researchers working in this domain. The categories are as follows:
• Social media, blogs and forums: This category of datasets consists of data collected from online social networking and micro-blogging sites, such as Facebook and Twitter, discussion forums, etc.;
• Survey/questionnaires: This category comprises data that were mostly collected by conducting surveys among students and teachers or by providing questionnaires to collect feedback from the students;
• Education/research platforms: This category contains the data extracted from online platforms providing different courses such as Coursera, edX, and research websites such as ResearchGate, LinkedIn, etc.;
• Mixture of datasets: In this category, we grouped all those studies which used several datasets to conduct their experiments.
As can be seen in Figure 10, only 64 (69.57%) papers reported the sources from which the data were collected, whereas almost one-third of the papers failed to show any information regarding the sources of datasets. Table 8 shows papers that reported the sources of the datasets used for conducting experiments along with their corresponding categories and description.

RQ6. What are the solutions with respect to the packages, tools, frameworks and libraries utilized for sentiment analysis?
Sentiment analysis is still a new field, and therefore there is no single solution/approach that dominates in sentiment analysis systems. In fact, there are dozens of solutions in terms of packages, frameworks, libraries, tools, etc. that are widely used across application domains in general, and the education domain in particular. Figure 11 shows the findings of articles reviewed in this study with respect to the most commonly used packages, tools, libraries, etc. for the sentiment analysis task.
Figure 11. Packages/libraries/tools used to conduct sentiment analysis in the reviewed papers.
As shown in the treemap in Figure 11, Python-based NLP and machine learning packages, libraries, and tools (colored in blue) are among the most popular solutions due to the open-source nature of the Python programming language. Specifically, the NLTK (Natural Language Toolkit) package is the dominant solution, and it was used in 12 different articles for pre-processing tasks including tokenization, part-of-speech tagging, normalization, and text cleaning.
Java-based NLP and machine learning packages, frameworks, libraries, and tools constitute the second group of solutions used for sentiment analysis. These solutions are colored in orange in Figure 11. Rapidminer is the most common Java-based framework and was used in three articles.
The third group is composed of NLP and machine learning solutions based on the R programming language. Only three studies used solutions in this group to conduct the sentiment analysis task.
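To make the pre-processing tasks mentioned above concrete, the following sketch performs text cleaning, normalization, tokenization, and stop-word removal using only the Python standard library. It is a stand-in for illustration, not the NLTK API or the pipeline of any reviewed study; the stop-word list, the function names, and the example comment are hypothetical, and part-of-speech tagging is not covered.

```python
import re

# Tiny illustrative stop-word list (hypothetical; NLP toolkits ship much larger ones).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "but", "to", "of", "it", "this"}

def clean(text):
    """Cleaning and normalization: drop URLs, lowercase the text, and
    replace everything except letters, digits, apostrophes, and spaces."""
    text = re.sub(r"https?://\S+", " ", text)
    text = text.lower()
    return re.sub(r"[^a-z0-9'\s]", " ", text)

def tokenize(text):
    """Whitespace tokenization of already-cleaned text."""
    return text.split()

def preprocess(text):
    """Clean, tokenize, and remove stop words."""
    return [tok for tok in tokenize(clean(text)) if tok not in STOP_WORDS]

tokens = preprocess(
    "The course was GREAT, but the quizzes... see https://example.org/forum !!"
)
print(tokens)
```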

RQ7. What are the most common data representation techniques used for sentiment analysis?
To provide our readers with more information on sentiment discoveries and analysis, we briefly present the commonly used word embedding techniques for the sentiment analysis task.
From the related reviewed articles, we observed that very few studies employed word embedding techniques to represent textual data collected from different sources. Only one article [48] employed the Word2Vec embedding model to learn the numeric representation and supply it as an input to the long short-term memory (LSTM) network. In addition to Word2Vec, GloVe and FastText models were used in two articles [14,45] to generate the embeddings for an input layer of CNN and compare the performance of the proposed aspect-based opinion mining system.
As presented above, word embedding techniques were seen in very few papers (3) out of all the references (92), particularly regarding sentiment analysis in the education domain for students' feedback. Therefore, more focus is needed to bridge this gap by incorporating and testing different embedding techniques while analyzing the sentiment, emotion, or aspect of a student-related text.
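As a reminder of what an embedding-based representation provides, the sketch below maps tokens to dense vectors and compares words by cosine similarity. The three-dimensional vectors are invented for illustration; models such as Word2Vec, GloVe, or FastText learn much higher-dimensional vectors from large corpora, and the `encode` helper is a hypothetical stand-in for the embedding lookup that feeds the input layer of an LSTM or CNN.

```python
import math

# Invented 3-dimensional embeddings (illustrative values only).
EMBEDDINGS = {
    "good":      [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "boring":    [-0.7, 0.3, 0.1],
    "lecture":   [0.0, 0.9, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def encode(tokens, dim=3):
    """Map tokens to a sequence of vectors; unknown words become zero vectors."""
    return [EMBEDDINGS.get(tok, [0.0] * dim) for tok in tokens]

sequence = encode(["excellent", "lecture", "overall"])
sim_close = cosine(EMBEDDINGS["good"], EMBEDDINGS["excellent"])
sim_far = cosine(EMBEDDINGS["good"], EMBEDDINGS["boring"])
print(sequence)
print(round(sim_close, 2), round(sim_far, 2))
```

With these toy vectors, words of similar polarity lie close together and words of opposite polarity point in opposite directions, which is the property a downstream classifier exploits.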

Most Relevant Articles
To present the readers with a selection of the good-quality articles presented in this survey paper, we further narrowed down and short-listed 19 journal and conference articles. In particular, only articles published from 2018 to 2020 in Q1/Q2 level (https://www.scimagojr.com/journalrank.php) journals and A/B ranked (http://www.conferenceranks.com) conferences were identified as relevant, and these are summarized in Table 9. Table 9 depicts pivotal aspects that were examined in the reviewed articles, including publication year and type, techniques, approaches, models/algorithms, evaluation metrics, and the sources and size of the datasets used to conduct the experiments. It can be seen that it is almost impossible to directly compare the articles in terms of performance due to the variety of algorithms/models and datasets applied to conduct the sentiment analysis task. However, it is interesting to note that the performance of sentiment analysis systems has generally improved over the years, achieving an accuracy of up to 98.29% thanks to the recent advancements of deep learning models and NLP representation techniques.

Identified Challenges and Gaps
Based on the systematic mapping study, we found that there is still a wide gap in some areas concerning the sentiment analysis of students' feedback that need further research and development. The following list shows some of the prominent issues, as presented in Table 10.
• Limited resources: there is a lack of resources such as lexica, corpora, and dictionaries for low-resource languages (most of the studies were conducted in the English or Chinese language);
• Unstructured format: most of the datasets found in the studies discussed in this survey paper were unstructured. Identifying the key entities to which the opinions were directed is not feasible until an entity extraction model is applied, which makes the existing datasets' applicability very limited;
• Unstandardized solutions/approaches: we observed in this review study that a vast variety of packages, tools, frameworks, and libraries are applied for sentiment analysis.

Recommendations and Future Research Directions
This section provides various recommendations and proposals for suitable and effective systems that may assist in developing generalizable solutions for sentiment analysis in the education domain. We consider that the recommendations appropriately address the challenges identified in Section 5. An illustration of the proposed recommendations is given in Figure 12.

Datasets Structure and Size
There is a need for a structured format to represent feedback datasets, whether they are captured at the sentence level or document level via a survey or a questionnaire form. A structured format in either an XML or a JSON file would be highly useful to standardize dataset generation for sentiment analysis in this domain. Furthermore, there is a need to associate the meta-data acquired at the time of the feedback responses. The meta-data would help to provide a descriptive analysis of the opinions expressed by a group of people for a given subject (aspect). Moreover, more than half (56.7%) of the datasets used in the reviewed papers were small, with merely 5000 samples or fewer, which affects the reliability and relevance of the results [102]. Additionally, most of these datasets are not publicly available, meaning that the results are not reproducible. Therefore, we recommend the collection of large-scale labeled datasets [14] to develop generalized deep learning models that could be utilized for various sentiment analysis tasks and for big data analysis in the education domain.
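One possible shape for such a structured record is sketched below as JSON handled from Python. The schema is purely hypothetical: field names such as `feedback_id`, `level`, `metadata`, and `annotations`, as well as the course and term values, are assumptions made for illustration and do not correspond to any existing standard or to a dataset from the reviewed studies.

```python
import json

# Hypothetical record: every field name and value below is illustrative only.
record = {
    "feedback_id": "fb-0001",
    "text": "The lectures were clear, but the exam was too long.",
    "level": "document",                      # "sentence" or "document"
    "language": "en",
    "metadata": {                             # captured when the response is given
        "course": "CS101",
        "term": "2020-autumn",
        "source": "end-of-course survey",
    },
    "annotations": [                          # aspect-level sentiment labels
        {"aspect": "lectures", "polarity": "positive"},
        {"aspect": "exam", "polarity": "negative"},
    ],
}

REQUIRED = {"feedback_id", "text", "level", "metadata", "annotations"}
ALLOWED_POLARITY = {"positive", "negative", "neutral"}

def validate(rec: dict) -> bool:
    """Check required fields and polarity labels; raise ValueError otherwise."""
    missing = REQUIRED - rec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for ann in rec["annotations"]:
        if ann["polarity"] not in ALLOWED_POLARITY:
            raise ValueError(f"unknown polarity: {ann['polarity']}")
    return True

serialized = json.dumps(record, indent=2)     # what would be stored or shared
restored = json.loads(serialized)
print(validate(restored))
```

Keeping the annotated aspect next to each polarity label addresses the entity-identification problem noted for unstructured datasets, and the `metadata` block carries the contextual information needed for descriptive analysis.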

Emotion Detection
We found only a small number of articles focused on emotion detection. We feel that there is a greater need to take into consideration the emotions expressed in opinions to better identify and address the issues related to the target subject, as has been investigated in many other text-based emotion detection works [103]. Furthermore, there are standard publicly available datasets such as ISEAR (https://www.kaggle.com/shrivastava/isearsdataset) and SemEval-2019 [104] that can be used to train deep learning models for text-based emotion detection tasks utilizing the Plutchik model [1] coupled with emoticons [8]. People often use emoticons to express emotions; thus, one aspect that researchers could explore is to make use of emoticons to identify the emotions expressed in an opinion.
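A minimal sketch of the emoticon idea follows: a small lookup table maps emoticons to Plutchik's categories and counts the emotions found in a comment. The mapping itself is an assumption made for illustration (it is not drawn from [1] or [8]), and the whitespace-based matching is deliberately naive, so emoticons glued to punctuation or words would be missed.

```python
from collections import Counter

# Plutchik's eight basic emotion categories.
PLUTCHIK = {"anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust"}

# Hypothetical emoticon lexicon (illustrative; not from a published resource).
EMOTICON_TO_EMOTION = {
    ":)": "joy", ":-)": "joy", ":D": "joy",
    ":(": "sadness", ":-(": "sadness",
    ":o": "surprise", ">:(": "anger",
}

def emoticon_emotions(text: str) -> Counter:
    """Count Plutchik emotions signalled by whitespace-separated emoticons."""
    counts = Counter()
    for token in text.split():
        emotion = EMOTICON_TO_EMOTION.get(token)
        if emotion is not None:
            counts[emotion] += 1
    return counts

counts = emoticon_emotions("Loved the labs :) but the exam :( :(")
print(counts.most_common())
```

Such counts could serve as additional input features alongside the text itself, rather than as a replacement for a trained emotion classifier.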

Evaluation Metrics
Our study showed that researchers have used various evaluation metrics to measure the performance of sentiment analysis systems and models. Additionally, a considerable number of papers (27%) failed to report the metrics used to assess the accuracy of their systems. Therefore, we consider that special emphasis should be placed on reporting the utilized metrics in order to enhance the transparency of the research results. Using information retrieval evaluation metrics such as precision, recall, and F1-score would be good practice for evaluating sentiment analysis systems that rely on imbalanced datasets. Accuracy is another metric that could be used to evaluate the performance of systems trained on balanced datasets. Statistical metrics such as the Kappa statistic and Pearson correlation can also be used to measure the agreement or correlation between the output of sentiment analysis systems and data labeled as ground truth. Moreover, consistent reporting would benefit other researchers when conducting comprehensive and comparative performance analyses between different sentiment analysis systems.
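The following sketch shows why accuracy alone can mislead on imbalanced data. It computes accuracy, precision, recall, F1-score, and a kappa statistic for a toy classifier that always predicts the majority class. The labels are invented for illustration, and Cohen's kappa is assumed here as the specific kappa variant, since the reviewed studies are not tied to one.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def cohen_kappa(y_true, y_pred):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# Invented, imbalanced toy labels: 8 negative and 2 positive comments.
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 10          # a classifier that always answers "neg"

acc = accuracy(y_true, y_pred)
prf = precision_recall_f1(y_true, y_pred, positive="pos")
kappa = cohen_kappa(y_true, y_pred)
print(acc, prf, kappa)
```

Here the accuracy looks respectable, whereas recall and F1 for the minority class and the kappa statistic all expose that the classifier has learned nothing about positive feedback.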

Standardized Solutions
We have shown that the current landscape of sentiment analysis is characterized by a wide range of solutions that are yet to mature, as the field is novel and rapidly growing. These solutions were generally (programming) language-dependent and have been used to accomplish specific tasks (e.g., tokenizing and part-of-speech tagging) in different scenarios. Thus, standardization will play an important role as a means for assuring the quality, safety, and reliability of the solutions and systems developed for sentiment analysis.

Contextualization and Conceptualization of Sentiment
Machine learning/deep learning approaches and techniques developed for sentiment analysis should pay more attention to embedding the semantic context using lexical resources such as WordNet, SentiWordNet, and SenticNet, or semantic representation using ontologies [105], to capture users' opinions, thoughts, and attitudes from a text more effectively. In addition, state-of-the-art static and contextualized word embedding approaches such as fastText, GloVe, BERT, and ELMo should be further explored by researchers in this field, as they have proven to perform well in other NLP-related tasks [106,107].

Potential Threats to Validity
There are several aspects that need to be taken into account when assessing this systematic mapping study as they can potentially limit the validity of the findings. These aspects include the following:
• The study includes papers collected from a set of digital databases, and thus we might have missed some relevant papers due to them not being properly indexed in those databases or having been indexed in other digital libraries;
• The search strategy was designed to search for papers using terms appearing in keywords, titles, and abstracts, and due to this, we may have failed to locate some relevant articles;
• Only papers that were written in English were selected in this study, and therefore some relevant papers that are written in other languages might have been excluded;
• The study relies on peer-reviewed journals and conferences and excludes scientific studies that are not peer-reviewed (i.e., book chapters and books). Furthermore, a few studies that conducted a systematic literature review were excluded as they would not provide reliable information for our research study;
• Screening based on the title, abstract, and keywords of papers was conducted at stage 2 to include the relevant studies. There are a few cases in which the relevance of an article cannot be judged by screening these three dimensions (title, abstract, keywords) and instead a full paper screening is needed; thus, it is possible that we might have excluded some papers with valid content due to this issue.

Conclusions
In the last decade, sentiment analysis enabled by NLP, machine learning, and deep learning techniques has also been attracting the attention of researchers in the educational domain in order to examine students' attitudes, opinions, and behavior towards numerous teaching aspects. In this context, we provided an analysis of the related literature by applying a systematic mapping study method. Specifically, in this mapping study, we selected 92 relevant papers and analyzed them with respect to different dimensions such as the investigated entities/aspects in the education domain, the most frequently used bibliographical sources, the research trends and patterns, the tools utilized, and the most common data representation techniques used for sentiment analysis.
We have shown an overall increasing trend of publications investigating this topic throughout the studied years. In particular, there was a significant growth in the number of articles published during 2020, in which DL techniques were the most represented.
The mapping of the included articles showed that there is a diversity of interest from researchers on issues such as the approaches/techniques and solutions applied to develop sentiment analysis systems, evaluation metrics to assess the performance of the systems, and the variety of datasets with respect to their size and format.
In light of the findings highlighted by the body of knowledge, we have identified a variety of challenges regarding the application of sentiment analysis to examine students' feedback. Consequently, recommendations and future directions to address these challenges have been provided. We believe that this study's results will inspire future research and development in sentiment analysis applications to further understand students' feedback in an educational setting.
In future work, our plan is to further deepen the analysis that we performed in this mapping study by conducting systematic literature reviews (SLRs), as also suggested by [108].
Author Contributions: Conceptualization, Z.K. and A.S.I.; methodology, F.D. and Z.K.; investigation and data analysis; writing: original draft preparation; writing: review and editing; supervision, Z.K., F.D., A.S.I., K.P.N. and M.A.W.; project administration, Z.K. and F.D. All authors have read and agreed to the published version of the manuscript.

Funding:
The APC was funded by an Open Access Publishing Grant provided by Linnaeus University, Sweden.