Leveraging AI and Machine Learning for National Student Survey: Actionable Insights from Textual Feedback to Enhance Quality of Teaching and Learning in UK’s Higher Education

Students’ evaluation of teaching, for instance, through feedback surveys, constitutes an integral mechanism for quality assurance and enhancement of teaching and learning in higher education. These surveys usually comprise both the Likert scale and free-text responses. Since the discrete Likert scale responses are easy to analyze, they feature more prominently in survey analyses. However, the free-text responses often contain richer, detailed, and nuanced information with actionable insights. Mining these insights is more challenging, as it requires a higher degree of processing by human experts, making the process time-consuming and resource intensive. Consequently, the free-text analyses are often restricted in scale, scope, and impact. To address these issues, we propose a novel automated analysis framework for extracting actionable information from free-text responses to open-ended questions in student feedback questionnaires. By leveraging state-of-the-art supervised machine learning techniques and unsupervised clustering methods, we implemented our framework as a case study to analyze a large-scale dataset of 4400 open-ended responses to the National Student Survey (NSS) at a UK university. These analyses then led to the identification, design, implementation, and evaluation of a series of teaching and learning interventions over a two-year period. The highly encouraging results demonstrate our approach’s validity and broad (national and international) application potential, covering tertiary education, commercial training, and apprenticeship programs, etc., where textual feedback is collected to enhance the quality of teaching and learning.


Introduction
Curriculum and testing have a significant impact on the lives and careers of young people. Decisions made by schools influence their students' potential outcomes and chances, and the delayed effects of public examinations and evaluations are perhaps more critical [1]. More recently, it has been argued that evaluation methods in higher education require a change to include elements within the university didactic assessment strategies, such as co-assessment and self-assessment [2,3].
Nevertheless, it is currently widely accepted that the systematic collection and analysis of student feedback constitutes an important quality assurance and enhancement exercise in teaching and learning in higher education. Nationwide student feedback surveys, such as the UK's National Student Survey (NSS) [4], have become the standard in many countries, and their results are used to inform the development of improved teaching interventions and practices [5,6]. The NSS is intended to help students make decisions about where to study and to assist institutions' planning and quality improvement methods, as well as to offer a measure of public accountability. Despite this systematic collection and analysis of student feedback, the evidence [7,8] suggests that, over the past decade, there has been only an insignificant increase in overall student satisfaction. This is attributed to the fact that higher education institutions fail to address in a timely manner the negative aspects of students' learning experiences that are recorded in student satisfaction feedback reports [9,10].
While its analysis provides valuable insights into students' thoughts and perceptions, a number of scholars [11][12][13] argue that Student Evaluations of Teaching (SET), widely used in academic personnel decisions as a measure of teaching effectiveness, cannot accurately measure the effectiveness of teaching and learning; therefore, it does not constitute a reliable indicator of educational quality. For example, one study [14] shows that the more objective the evaluation process adopted by SET, the less likely it is to relate to the teaching and learning benchmark evaluation standards. Another suggests that there is little or no correlation between SET and teaching effectiveness and student learning [15]. In other words, a wide range of studies have investigated the effectiveness of SET, yet have been unable to indicate specifically how student learning is enhanced.
Furthermore, the vast majority of existing SET studies use Likert-scale responses to capture student feedback, yet, by its very nature, the response to a close-ended question cannot provide as detailed and as in-depth information as that to an open-ended question [16][17][18] and may sometimes result in an ambiguous finding [19]. For example, one study that aimed to analyze student responses to the NSS 2005/07 open-ended questions [20] revealed that the most frequent issue raised in students' free-text responses pertains to the 'Overall quality of teaching', reporting that the NSS's open-ended questions help to improve the overall response rate. Such findings could not be easily achieved using closed questions due to the necessity of a long list of options.
Despite the many advantages of open-ended questions, the manual analysis of free-text responses requires a considerable expenditure of time and money. As a result, most studies analyze only a small amount of data (in common with other qualitative research). For example, Deeley et al.'s study [21], which explored students' dissatisfaction with assessment feedback, featured just 44 participants at a college of 5000 undergraduates owing to recruitment difficulties; MacKay et al. [22] applied social identity theory to their institutional NSS free-text data, processing just 60 responses in total; and, due to time and cost constraints, Richardson, Slater, and Wilson [20] analyzed only 8% of a total of 10,000 free-text responses in their NSS study. Langan et al. [23] processed more than a thousand text comments in their study of the coherence of NSS questions' ratings; although this sample was bigger, the work included only categorization, not in-depth qualitative text analysis. By contrast, this study aims to use text mining to demonstrate a way to analyze a large volume of qualitative data (e.g., students' free-text responses) without vastly increased resources (i.e., human effort and time).
Our contributions are as follows. First, the study proposes a novel automatic analysis framework that can be used to mine the text responses to open-ended questions in student feedback questionnaires. It aims to improve teaching practices and foster positive learning experiences in higher education through the adoption of machine learning methods, employing advanced text-mining techniques to identify automatically the important problems, issues, and recommendations in students' open feedback on a large scale. The second contribution is the application of both supervised learning (classification) and unsupervised learning (topic modeling) to analyze and interpret automatically students' responses to the open-ended questions in the NSS. The third and foremost contribution of this paper is the application of the models to real-world data as a case study, comprising four academic years of NSS data (2014-2017) at a northwest UK university. Subsequently, the identified issues were developed into teaching interventions (two at Level 5, and two at Level 6), and the responses to these teaching interventions were collected via the department's end-of-module surveys. The evaluation results showed a significant improvement in student satisfaction (22% higher scores than average), demonstrating the effectiveness of our teaching interventions and the underpinning machine learning methods. Finally, we conclude that the proposed automatic analysis framework was effective in this case and could be applied in the future to substantially reduce the time and cost demanded in practice by this process (e.g., by institution policy and decision-makers). Furthermore, because the issues and topics that the model extracts from free-text responses are of use for a variety of other purposes, our automatic text analysis framework can be easily adapted to wider studies and used to track student feedback, influencing institutional policies.
The rest of the paper is organized as follows: Section 2 presents a detailed review of relevant studies on student evaluation methods and text mining approaches. Section 3 presents the detailed methodology, followed by the case study in Section 4. Finally, Sections 5 and 6 discuss the implications of the case study and the conclusions, respectively.

Student Evaluation of Teaching
The student evaluation of teaching, or SET, a kind of customer satisfaction survey for higher education, has been widely accepted as a standardized evaluation exercise in North American, UK, and Australian higher education [13]. SET provides a reliable means of assessing the quality of teaching and learning [24] and has been used to: (a) provide benchmark data to compare the quality of teaching and learning systematically across institutions [25]; (b) inform undergraduate applicant choices [26]; and (c) make recommendations to individual institutions on enhancing and improving the quality of their teaching practices [20].
SET is normally conducted in the form of a questionnaire with a list of ordinal-scale questions (for example, 'Rate your overall satisfaction with the course'), using a Likert scale for the response. Students' responses are then analyzed by a wide range of methodologies. SET's reliability as a teaching and learning quality indicator has been debated among pedagogical researchers [27,28], especially regarding whether an ordinal scale can capture quality in teaching and learning, which is considered to be a multidimensional attribute [29][30][31].
To provide a more comprehensive measurement, many models take multilevel factors into consideration. For example, Toland and De Ayla [32] use a three-factor model to mitigate the correlation effect on students' 'sharing common perceptions to their teachers'; Macfadyen et al. [33] conducted multilevel analysis (combining simple logistic regression and multilevel linear modeling) to test various factors of students' response/non-response to SET; and Rocca et al. [34] demonstrate the effectiveness of adopting an integrated approach (combining IRT and multilevel models) to reveal a student's characteristics (latent traits).

Text Mining
Text-mining algorithms may take one of two broad approaches: supervised [35] or unsupervised [36]. A high-level overview of a supervised text-mining method is illustrated in Figure 1. In a supervised text-mining setting, a human coder is first required to annotate/label manually a sample of the documents against a pre-defined list of categories. The text-mining algorithms are then trained on the manually annotated samples and learn to associate the textual content of a document with its underlying label (e.g., textual content expresses positive/negative sentiment, or relevance to politics/sports/news/technology).
Once the training process is complete, the algorithm is used to categorize any unlabeled documents automatically into the same pre-defined list of categories. A supervised algorithm can substantially reduce the time needed to analyze a large document collection, whereby a human coder manually categorizes a sample of the documents (normally a small percentage of the whole data), and the remaining unlabeled ones are automatically categorized by the trained text-mining algorithm [37]. By contrast, unsupervised approaches (illustrated in Figure 2) aim to identify the hidden patterns in a collection of documents (e.g., clusters of semantically similar documents or latent topics) and do not involve a supervised training phase (no need for manual labeling), so can be readily applied to any unlabeled document collection. Unsupervised text-mining methods are widely used to facilitate exploratory analyses of document collections by extracting information that is not immediately evident to experts [38,39]. For example, the most commonly used unsupervised approaches use document clustering [40] or topic modeling [41].
Systematic analysis of open-ended survey responses involves a considerable workload in terms of manual annotation; therefore, open-ended responses are often excluded from survey experiments [42][43][44]. Nonetheless, survey researchers recognize the importance of analyzing open-ended data, pointing out that such free-text responses to open-ended questions record detailed and useful information that may not be captured by close-ended questions [45,46].
To reduce the workload associated with processing open-ended questions, several studies have explored the use of text mining to automate the underlying analysis process. Reference [47] presented one of the earliest text-mining approaches to the analysis of a dataset of open-ended responses. In their study, the data were collected from a survey that investigated employees' work-related perceptions of working in a large corporation. A dictionary-based text-mining method was used to assign a sentiment score automatically to each free-text response (a high score indicating a positive sentiment, and a low score a negative sentiment). The results showed a high correlation between the automatically computed sentiment scores of the survey's responses to the open-ended questions and the quantitative scores of responses to its close-ended questions. Reference [43] used an unsupervised text-mining method, namely a term-clustering model, to identify thematically coherent groups of the terms used in responses to open-ended survey questions. For evaluation purposes, their method was applied to a large-scale dataset of approximately 2000 free-text responses about consumers' preferences. The results showed that the method was able to identify informative and meaningful clusters of the terms that were discussed in the free-text comments.
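As a rough illustration of the dictionary-based scoring idea used in [47], the following Python sketch assigns each response a score from a small word list. The lexicon, the scoring rule (positive matches minus negative matches, divided by the number of tokens), and the function name are assumptions made for illustration; they are not taken from the cited study.

```python
import re

# Hypothetical mini-lexicon; a real study would use a curated sentiment dictionary.
POSITIVE = {"good", "helpful", "clear", "excellent", "supportive"}
NEGATIVE = {"poor", "late", "unclear", "confusing", "disorganized"}


def sentiment_score(response):
    """Return (positive hits - negative hits) / token count, a value in [-1, 1]."""
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)


# Made-up example responses.
responses = [
    "The lecturers were helpful and the notes were clear",
    "Feedback was late and the marking criteria were unclear",
]
scores = [sentiment_score(r) for r in responses]
```

A higher score indicates a more positive response. Such a score could then be correlated with the Likert-scale ratings of the same respondents, which is the kind of comparison the study above reports.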
In another study [44], a statistical topic-modeling method (i.e., an unsupervised text-mining approach) was developed to analyze the topical content of open-ended data automatically. The topical content of open-ended data was represented by a finite set of topics and each, in turn, was represented by a finite set of words. Moreover, to reflect its importance within the open-ended responses, each topic was assigned a weight. This study further demonstrates that the topic-modeling method is able to reveal topics that are semantically similar to hand-coded categories (although some deviations were observed). However, a limitation of such an approach is that the underlying topic-modeling method is totally unsupervised; thus, it naturally ignores any readily available hand-coded categories; therefore, the automatically computed topics may not align perfectly to the humanly annotated (i.e., ground truth) categories.
To address the above-mentioned needs in open-text analysis and the limitations of existing text-mining approaches, we developed a novel method that uses both a supervised and an unsupervised component to analyze students' open responses automatically in order to design effective teaching interventions to enhance the student learning experience. To the authors' knowledge, the study represents the first attempt to investigate the use of text-mining methods to analyze student responses to open-ended questions automatically, as well as to reinforce the teaching activities with those outcomes and evaluate their effectiveness.

Methodology: Information Extraction and Teaching Intervention
The study was conducted in two distinct phases, both with several components:

• Phase 2: Teaching Intervention
Design: to make an action list of teaching interventions to address the issues identified above.
Implementation: to implement these actions in selected modules.
Evaluation: to conduct an end-of-semester survey in each selected module and evaluate its results.

Annotation Scheme Development
The first stage was to develop an annotation scheme to categorize students' free-text responses, not only to provide a better understanding of the areas that concerned them but also to enhance the performance of the later text-mining methods in identifying issues automatically. Although there is no existing scheme for text mining (as this study is considered to be the first attempt to analyze NSS open-text responses automatically), Richardson et al.'s [20] findings on NSS free-text data provide a good starting point. After manually analyzing about a thousand responses (both positive and negative) to the NSS open-ended questions, Richardson et al.'s study devised 29 categories to cover the whole range of data, with the top-10 most common categories covering 80% of the data. In our study, we too developed an annotation scheme with 10 categories, as shown in Table 1. Our annotation scheme (Table 1) provided full coverage of the NSS core questionnaire categories (The NSS questionnaire was revised in 2017, after it was first introduced in 2005. The National Student Survey website (https://www.thestudentsurvey.com, accessed on 20 August 2021) provides only the latest version. The pre-2016 version is available online: http://www.bristol.ac.uk/academic-quality/ug/nss/nssqs05-16.html/ (accessed on 20 August 2021)). This is because, upon manually examining 10% of the free-text responses in our NSS dataset (used as a case study from a UK university, consisting of 4400 responses from 2014-2017), we found that students often used the free-text response to elaborate on their thoughts rather than to discuss new topics, since the NSS questionnaire uses a Likert scale that provides no additional fields for further comments. We added three additional categories (Categories 6-8) for more specific topics, based on our examination of our NSS dataset, as well as Richardson et al.'s findings [20].
Although these open-ended questions invited negative comments (positive comments were already covered), we found that a large proportion of responses (29%) were either positive or general, for instance, 'Overall, it's been a good course' or 'No, negatives at this time'. We considered that this was due to students tending to want to start with a positive statement or perhaps thinking that a comment was mandatory. Therefore, since we were targeting negative feedback, we created Category 10 for those statements that were nonessential to our study. We used the annotation scheme that we had developed to categorize about 10% of the responses in our NSS dataset. The distribution of negative statements is shown in Figure 3. The proposed scheme is well balanced, which supports its effectiveness and validity (apart from 'Student life and social support', a category that was included since the impact of social support is often overlooked, and action to improve student wellbeing needs to be more specific) [48][49][50][51].
Appl.Sci.2022, 12, x FOR PEER REVIEW 7 of 22

Free-Text Analysis: Machine Learning Approach

Framework Overview
This study aimed to identify automatically in the NSS data the issues reported in free-text responses that make negative comments in answer to the open-ended questions. Figure 4 shows the overall architecture of the proposed text-mining method, consisting of two learning components: (1) Classification: responses were grouped into smaller categories (e.g., teaching quality, assessment, etc.) to reveal the finer-grained issues to fit the NSS categories, as well as to yield a better performance in topic modeling. (2) Topic modeling: for each category, important topics were drawn and interpreted into meaningful issues upon which teaching interventions could be developed. Specifically, the analysis process was broken down into five steps (Figure 4): Step 1 is initiated by a survey researcher manually annotating a small sample of approximately 20% of the NSS open-ended comments, according to pre-defined categories. Step 2 uses the manually annotated samples to train a classification model to 'learn' to discriminate between the various categories of problem statement. In Step 3, the trained model automatically annotates the remaining 80% of the dataset's open-ended comments. In Step 4, an unsupervised topic modeling method automatically identifies the important topics within each problem statement category. Finally, in Step 5, the automatically generated topics are manually inspected and validated, and the results of the analysis inform the development of effective teaching interventions to address the issues raised by the students.

Automated Text Response Categorization: Classification
To undertake the classification, we divided the full NSS dataset into two subsets: (1) Labeled data: the text responses that were categorized manually against the annotation scheme, consisting of 20% of the total data, used to train the classifier for the supervised text-mining task. (2) Unlabeled data: the remaining 80% of the text responses, automatically categorized by the classification algorithm.
In our study, we first evaluated three common classification algorithms (i.e., Support Vector Machines (SVM) [52], Multinomial Naïve Bayes (MNB) [53], and Random Forest (RF) [54]) against the labeled data. The best-performing algorithm was then used to categorize the unlabeled data. To evaluate the algorithm performance (i.e., predictive accuracy), the labeled dataset was first divided into a training set (70% of labeled data) and a validation set (30% of labeled data). Secondly, three classification components, one per algorithm, were trained on the training set, using the words in the data as predictive features. Thirdly, each trained component was used separately to categorize the responses in the validation set. Lastly, the results were compared to the labels in the validation set, and the component with the best predictive accuracy was chosen.
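The evaluation procedure above (a 70/30 split of the labeled data, bag-of-words features, and selection of the most accurate model) can be sketched in plain Python. This is a minimal sketch under stated assumptions: it implements only a small Multinomial Naïve Bayes classifier with add-one smoothing and compares it with a majority-class baseline. SVM and RF would normally come from a machine learning library and are omitted here. All class names, function names, and example sentences are made up for illustration and are not the authors' implementation.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class MultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, data):
        self.class_counts = Counter(label for _, label in data)
        self.word_counts = defaultdict(Counter)
        for text, label in data:
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / n_docs)  # log prior of the class
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, data):
        self.label = Counter(label for _, label in data).most_common(1)[0][0]
        return self

    def predict(self, text):
        return self.label


def split_70_30(data, seed=0):
    """Shuffle and split into 70% training and 30% validation data."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def accuracy(model, data):
    return sum(model.predict(text) == label for text, label in data) / len(data)


# Made-up labeled sentences standing in for the manually annotated sample.
labeled = (
    [(f"placement support was poor in year {i}", "Placement") for i in range(10)]
    + [(f"feedback on coursework was late in week {i}", "Assessment") for i in range(10)]
)
train, validation = split_70_30(labeled)
candidates = {"MNB": MultinomialNB(), "Majority": MajorityBaseline()}
results = {name: accuracy(model.fit(train), validation)
           for name, model in candidates.items()}
best = max(results, key=results.get)  # model kept for the unlabeled data
```

In this toy setting each category has a distinctive keyword, so the Naïve Bayes model separates the classes easily. Real survey sentences are far noisier, so accuracies on real data would be expected to be much lower.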

Issue Extraction and Summarization: Topic Modeling
In order to extract latent topics (i.e., to identify the issues in the text responses), for each of the 10 categories we employed the widely used Latent Dirichlet Allocation (LDA) [55], a probabilistic topic-modeling algorithm. Specifically, LDA assumes that each document is a distribution over K latent topics, and each topic is a distribution over M words. In our approach, we used the MALLET toolkit to implement the algorithm, training it for 500 Gibbs sampling iterations and setting the number of latent topics to K = 100. For each topic, we examined the top-seven words (i.e., those most strongly associated with the topic) to summarize the content of that topic. For a better interpretation, the summarization was undertaken jointly by senior lecturers at the institute from which the NSS data came.
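For reference, the standard generative model behind LDA can be written as follows. The symbols $\alpha$, $\beta$ (Dirichlet hyperparameters), $D$ (number of documents), $\theta_d$, $\phi_k$, $z_{d,n}$, and $w_{d,n}$ are the usual textbook notation and are assumptions here, since the text above names only $K$ and $M$.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(topic $k$: distribution over the $M$ words)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(document $d$: distribution over the $K$ topics)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) &&
  && \text{(topic of the $n$-th word in document $d$)} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}\!\left(\phi_{z_{d,n}}\right) &&
  && \text{(observed word)}
\end{align*}
```

Under this notation, the top-seven words of topic $k$ are the seven words $w$ with the largest estimated $\phi_{k,w}$.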
As the topic-modeling results were challenging to evaluate [56] and contained only word-based forms, manual interpretation was needed to produce human understandable issues. The results were used later in the teaching intervention.
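The step of summarizing each topic by its top-seven words can be sketched as below. The word weights are invented for a single hypothetical topic, and the helper name is an assumption. In practice the weights would come from the fitted topic model.

```python
def top_words(word_weights, n=7):
    """Return the n highest-weighted words, breaking ties alphabetically."""
    ranked = sorted(word_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Made-up word weights for one hypothetical topic.
topic = {
    "feedback": 0.21, "late": 0.15, "marks": 0.12, "weeks": 0.10,
    "returned": 0.08, "assignment": 0.07, "time": 0.06,
    "module": 0.03, "exam": 0.02,
}
summary = top_words(topic)
```

A human reader would then turn such a word list into an issue statement, which is the manual interpretation step described above.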

Teaching Intervention
To address the issues identified in Phase 1, Phase 2 involved designing, implementing, and evaluating a series of actions to improve students' experience of teaching and learning. For the implementation, we selected four modules run by the Department of Computer Science (at the institute from which the NSS data came): two at Level 6, as their students would subsequently participate in the NSS survey; and two at Level 5 to provide a more comprehensive evaluation. An end-of-module survey was conducted, using as its baseline the previous year's module evaluation results to measure the effectiveness of our teaching intervention. As Phase 2 uses the result of Phase 1, it is introduced and discussed in the later section, 'Phase 2 Design and Result'.

Case Study: NSS Data from a UK University
The dataset was collected over four academic years of NSS (2014-2017) at a northwest UK university. In this study, we used only the responses to the question about negative learning experiences, which had 4400 responses. It should be noted that, in this study, the statement categorization was performed on individual sentences rather than on whole responses. This accounts for the fact that a single response may contain statements in multiple categories. For example, Figure 5 shows a response that falls into four problem statement categories. Accordingly, these responses were broken down into 10,328 sentences. About 20% of sentences were randomly selected and manually annotated against the annotation scheme (Table 1). As 10% had been done at Stage 1, this step annotated another 10% and produced 2000 labeled sentences. The annotation in both stages was undertaken by two senior lecturers at the institute from which the NSS data came. The unlabeled data consisted of the remaining 8328 sentences, which were automatically analyzed later by text mining.
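The sentence-level preparation described above can be sketched as follows: split each response into sentences, then randomly draw about 20% of the sentences for manual annotation. The regular-expression splitter, the random seed, and the example responses are assumptions for illustration. The source does not state how sentences were segmented.

```python
import random
import re


def split_sentences(response):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]


# Made-up responses standing in for the NSS free-text data.
responses = [
    "Feedback was often late. The placement office was hard to reach.",
    "Timetabling changed at short notice! Lectures clashed with labs.",
    "No negatives at this time.",
]
sentences = [s for r in responses for s in split_sentences(r)]

# Randomly select about 20% of the sentences for manual annotation.
rng = random.Random(42)
n_labeled = max(1, round(0.2 * len(sentences)))
labeled_idx = set(rng.sample(range(len(sentences)), n_labeled))
to_annotate = [s for i, s in enumerate(sentences) if i in labeled_idx]
unlabeled = [s for i, s in enumerate(sentences) if i not in labeled_idx]
```

With the figures reported above, the same bookkeeping gives 10,328 - 2000 = 8328 unlabeled sentences.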
Furthermore, we trained (with 70% of the labeled data, i.e., 1400 sentences) the text-mining component with the following algorithms: Random Forest (RF); Multinomial Naïve Bayes (MNB); and Support Vector Machines (SVM). Their prediction accuracies (against a held-out 30% of the labeled data) are shown in Figure 6.
Overall, the SVM algorithm yielded the best average accuracy (50% macro-average across all classes), although its performance improvement over MNB (+1.3%) was low. It should be noted that the accuracy obtained by the three algorithms varied across the 10 problem statement categories. For example, the 'Placement and employability' category showed the best predictive accuracy of all the categories with the RF algorithm, achieving a robust performance of 92.5%. The high performance in this category could be explained by the fact that many student responses relevant to 'Placement and Employability' contained the word 'placement', which was used by the underlying algorithm as a discriminating feature to categorize such problem statements accurately. Two other categories, namely 'Overall dissatisfaction' and 'Student life and social support', achieved a very low accuracy of approximately 2-14% and 17-20%, respectively, which we attribute to the limited number of training examples available in these two categories.
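To make the macro-average figure concrete, the sketch below computes a per-class accuracy and its unweighted mean. It assumes that per-class accuracy means the fraction of each category's validation sentences that were predicted correctly. The source does not define the measure explicitly, and the labels and predictions are invented.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}


def macro_average(scores):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    return sum(scores.values()) / len(scores)


# Made-up validation labels and predictions.
y_true = ["placement"] * 4 + ["assessment"] * 4 + ["student life"] * 2
y_pred = (["placement"] * 4
          + ["assessment", "assessment", "placement", "placement"]
          + ["assessment", "assessment"])

by_class = per_class_accuracy(y_true, y_pred)
macro = macro_average(by_class)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy example the small 'student life' class is always misclassified, which pulls the macro-average below the overall accuracy. This mirrors how categories with few training examples can depress a macro-averaged score.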
As a result, the trained SVM component was chosen to categorize the responses in the unlabeled data. Figure 7 shows the distribution of the statements in each category. Although the classification accuracy of the trained SVM was not high for all classes, we still consider this useful for increasing the amount of silver-standard labeled data that we are able to provide to our topic-modeling algorithm. The results shown in the next section indicate that the topic-modeling algorithm was able to produce coherent outputs when using our automatically labeled data.

Results of Topic Modeling on NSS Data
This step automatically lists the important topics among the categorized statements. For each category, we applied the LDA topic-modeling method to summarize the top-10 topics. This generated a hundred topics. For brevity, Figure 8 shows the results for the 'Assessment and feedback' category as an example; the rest are in the Appendix. The first observations were that the topic-modeling method was able to identify thematically coherent topics (with a topic weighting to reflect their relevance) and also represent the underlying category. However, human interpretation was required since each topic was described by a set of seven words, and some topics could be better summarized as a single issue. For example, both Topics 3 and 10 indicated 'Timely feedback to assessment'; Topics 2, 3, and 8 suggested 'Clearer assessment criteria'; and Topics 5, 7, 8, and 9 related to 'Assessment deadlines'.

Using human interpretation and a holistic view of the hundred auto-summarized topics, we were able to summarize the top-10 issues (from more than 10 topics, as demonstrated above) across all categories. Although this process was not made fully automatic, we consider that human involvement was far less than that required to process all 10,328 statements manually. Finally, the top-10 issues identified are labeled I1-I10 and are discussed further in the next stage to inform the teaching interventions.

Discussion on Issues Identified
The automatically identified topics/issues were considered to reflect the authors' teaching experience in the department and were in line with previous studies [4,10,57]. Specifically, I1, I2, and I3 revealed students' concerns over the teaching and learning materials. We considered that such concerns were caused by insufficient communication rather than an inadequate quality of material, because the department had undergone a rigorous process (including staff training, use of templates and moderation) to ensure the quality of all teaching and learning materials; in addition, it had always received positive feedback from external examiners. Moreover, the students were of diverse backgrounds (i.e., more BTech and converted students than at a traditional university), and their familiarity with the higher education learning environment varied significantly. For example, the academic writing in the materials may have constituted a barrier to their learning, as revealed by I3, the topic about marking criteria. To address such issues, we suggested that the teaching interventions should focus on providing students with guidance on accessing relevant information and that during the semester each module should have more checkpoints for students (regarding self-assessment and identifying misunderstandings).
I4, I5, and I6 relate to the modules' content. While I4 was a generic description of how the content needs to be more interesting and engaging, I5 and I6 revealed more detail, clearly suggesting that the students would like to gain more practical skills upon successful completion of the various modules. Some might argue that Computer Science is a field that combines both theoretical knowledge and practical skills and that students may not have the ability to judge just how practical a Computer Science module should be; however, it has been reported elsewhere that Computer Science graduates lack practical skills [58] and that work experience is important to them [59]. To address these issues, more practical tasks and real-life examples should be integrated into modules. In addition, when delivering lectures, whenever possible, tutors should explain to the students how the theoretical concepts are applied in industry (ideally, with real-life examples) and how the acquired practical skills contribute to their career.
I7 and I8 refer to the timing and quality of feedback on students' coursework. Both issues initially appear surprising, as the delivery of feedback at the institution follows a rigorous process in the Computer Science department, especially as, in previous years, it introduced an additional checkpoint (a marking mentor to oversee the process and monitor the quality of feedback). Based on the authors' extensive experience of marking and moderation (over a range of over twenty modules) and the comments by other senior teaching members in the department, we concluded that the feedback given was actually both detailed and constructive, and that some even exceeded the department's requirements; therefore, it was worth exploring possible alternative explanations of these issues.
After careful consideration and discussion with student representatives, it was suggested that both issues may relate to the response time of official feedback (i.e., four weeks). First, for major coursework assignments that are usually submitted at the end of a module, students receive feedback only when the module is completed, after the end of term. Second, for minor coursework assignments that are submitted frequently (e.g., the weekly portfolio), students do not receive feedback before starting their next assignment. The existing mechanism for providing feedback to students is not ideal, as students receive no direct guidance on how to improve their marks, potentially frustrating them and resulting in negative survey responses.
While the official feedback response time cannot be shortened and the existing assessment arrangements cannot be changed, we recommend that, for major assignments, several checkpoints should be introduced to monitor students' progress and provide informal feedback and/or formative exercises as sub-assignment tasks. For minor assignments, informal/oral feedback should be provided, and common mistakes should be discussed in class prior to the next assignment (this requires tutors to go quickly through all submissions). For all assignments, the marking criteria should be clearly explained and linked to the task, which will help the students to understand better the feedback that is provided.
I9 and I10 relate to the experience of the support that students receive from tutors out of class. In this department, the official response to a student email must be within three working days, and there are designated office contact hours for students. The follow-up investigation found that these policies were clear to all staff members, and there was no evidence of staff violating any policies. While this initial finding contradicts the issues identified, further exploration suggested that the issues could be due to inconsistencies in the level of support provided by tutors. For example, some staff tended to reply to student emails within minutes, and sometimes even outside of office hours, while others tended to respond only during office hours. When a student dropped in during non-contact hours, some staff tended to put aside any task that they were working on, while others followed the policy and booked a later appointment with the student. Such varied behaviors gave students the false impression that every staff member should respond or offer help at all times (i.e., the students were unclear about the policies and took the instant response as the norm), a finding that was later confirmed by the student representatives. To address those issues, it is suggested that, at the beginning of a module, all tutors should explicitly clarify: (a) what to expect when emailing a tutor; (b) how a tutor's office hours work; and (c) the best way of contacting them.

Teaching Intervention: Design and Implementation
Our analysis revealed that some issues in student feedback required additional action (e.g., enrichment of module content by including more practical tasks, more frequent checkpoints, feedback, etc.), while others required only amendments or a new delivery approach (e.g., better communication or clearer explanation). As a result, we propose the actions shown in Table 2.
The actions (in Table 2) aim to address all the issues identified by the text-mining method. While Table 3 gives a detailed description of the top-10 issues, Table 4 provides a detailed overview of how each identified issue is addressed by the actions.
The selection criteria for the modules were that the modules at both levels should cover all the student pathways and have more than 30 students. It should be noted that Module B enrolled only four students; because the module choice event was held after the study had been designed, this became known only when it was too late to arrange another module. The three other modules were able to cover all pathways.

Evaluation: Discussion on Questionnaire Results
We chose a questionnaire as our evaluation method as it is used by NSS and is an effective way to gather feedback from large numbers of students [60]. We extended the department's end-of-year module survey for the purpose of our evaluation. The survey design is in alignment with the NSS survey, offering five Likert-scale options: Strongly disagree; Disagree; Neither agree nor disagree; Agree; and Strongly agree. The full questionnaire, corresponding issues and actions are shown in Table 5. Questions 1 to 7 are part of the set of standard questions from the department's end-of-year survey. Questions 8 to 18 are the additional ones put to the students in the modules under study. Finally, the surveys were conducted at the end of the spring semester, and the results are shown in Table 6. The satisfaction rate was calculated by combining the 'Agree' and 'Strongly agree' responses (note: the Level 5 result does not include Module B, since only one of its four students returned the questionnaire). The results of the Level 6 modules show an average rating of 92%; 13 of 18 questions received 90%+ ratings, and five questions achieved 100%. The results of the Level 5 modules show an average rating of 74%, and three questions achieved 90%+.
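The satisfaction rate described above can be sketched as follows. This is only an illustration of the stated rule (the share of 'Agree' and 'Strongly agree' among all returned responses to a question); the function name and the sample responses are hypothetical and do not reproduce the survey data.

```python
from collections import Counter

# The two Likert options that count as "satisfied", as stated in the text.
POSITIVE = ("Agree", "Strongly agree")


def satisfaction_rate(responses):
    """Percentage of responses that are 'Agree' or 'Strongly agree'."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    positive = sum(counts[option] for option in POSITIVE)
    return 100.0 * positive / len(responses)


# Hypothetical responses to a single question (not the actual survey data).
answers = (
    ["Strongly agree"] * 9
    + ["Agree"] * 6
    + ["Neither agree nor disagree"] * 2
    + ["Disagree"] * 1
)
rate = satisfaction_rate(answers)  # 15 of 18 responses are positive
```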
As they were part of the NSS questionnaire, for Questions 2, 3, 4, 6, and 7, we listed the department's previous year's NSS scores (an average rating of 74%) to provide a direct comparison: the results for the Level 6 modules show substantial improvement (all responses were rated higher), while Level 5 yielded mixed results. Since the actions were designed for Level 6 students, the survey results indicate a significant improvement in students' experience of teaching and learning in the department: on average, 22% more satisfaction. The Level 5 results are for comparison and will be discussed further in a later section.

Effectiveness of the Teaching Intervention
Overall, the interventions designed are considered to be effective, achieving an average satisfaction rate of 92% and, for Q7 (overall satisfaction), 100%. Note that only Level 6 results are included in this discussion. To measure the effectiveness of each action, we took the average rating of related questions, as defined in Table 6 (the rating for Q7 is excluded, as it is an overall rating). For example, A1 has two related questions (Q1 and Q14), whose ratings are 100% and 94%. Therefore, the rating for A1 is (100 + 94)/2 = 97 (%). The detailed measurements of each action are listed in Table 7. It was concluded that most actions were well received by students: 11 actions achieved a 90%+ rating. A5 ('Explanation of how assignment works') showed an 85% rating and A11 ('Explanation of how email contact works') 83%, which appears to suggest that the clarification of existing documentation is less helpful to students. However, A13 ('Provisional marking criteria'), which had a 100% rating, is also about explaining documentation (i.e., the marking criteria). Therefore, we conclude that negative responses were probably caused by how the teaching was delivered rather than by the clarity of documentation and policies. In other words, clarification should be carried out by linking to the context rather than by simply referring to the documents. For example, instead of directly explaining each of the marking criteria in Week 1, it is more effective to link a difficult/key concept to the marking criteria, alongside teaching, and explain how it will be assessed, as this will help students to plan their after-class study better, enhancing their learning experience.
To measure the effectiveness of how each issue was addressed, we calculated the average rating of all questions related to that issue, as defined in Table 6 (Q7's rating is excluded). For example, I2 is related to Q1, Q2, Q14, and Q17, whose ratings are 100%, 94%, 94%, and 100%. Therefore, the rating for I2 is (100 + 94 + 94 + 100)/4 = 97 (%). Table 8 shows the measurements of all issues. While most issues were addressed well and eight of the 10 scored a 90%+ average rating, we would like to discuss further I3 and I7, which were rated at under 90%.
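The averaging rule used for actions (and, in the same way, for issues) can be written down in a few lines. The sketch below reproduces only the A1 example given in the text; the dictionary layout and function name are assumptions made for illustration, and the full question-to-action mapping would have to come from Tables 5 to 7.

```python
from statistics import mean

# Level 6 ratings (%) for the two questions named in the A1 example in the text.
question_ratings = {"Q1": 100, "Q14": 94}

# Questions related to each action; only A1 is filled in here, as given in the text.
related_questions = {"A1": ["Q1", "Q14"]}


def average_rating(questions, ratings):
    """Unweighted mean rating (%) of the questions related to one action or issue."""
    return mean(ratings[q] for q in questions)


a1_rating = average_rating(related_questions["A1"], question_ratings)  # (100 + 94) / 2
```

The same function applies to issues by passing the list of questions related to that issue; Q7 is simply left out of every list, since it is an overall rating.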
For I3 ('Assessment criteria to be made clearer'), which shows a rating of 88%, the related questions can be categorized into two groups: (a) whether the students feel it is helpful (Q1, Q2, and Q17); and (b) whether the students feel that it is better than in previous years (Q11 and Q12). The former shows an average rating of 98%, and the latter 73%. This suggests that the students are less sensitive to the improvements made to documents than to how the teaching materials were delivered; that is, they prefer explanations about their actual assignments to documents giving the marking criteria. This analysis is in line with the previous finding, indicating that students prefer in-context explanation. As a result, we recommend that future actions should continue to focus on practical and contextual items.
I7 ('Timely feedback to be provided') was considered to be difficult to address. This is because the institution guidelines stipulate that marks and feedback on coursework assignments should be returned to students within four weeks so that, by the time that written formative feedback reaches students, it offers them little support for improving their work for their current module. While the feedback's high quality was confirmed by the survey results (Q3, Q5, Q13, Q15, and Q16 had an average rating of 90%), only 73% considered that the feedback was more helpful than in previous years (Q11 and Q12). In particular, 94% of students (Q13) praised the in-class feedback, suggesting that further actions should focus on oral and informal formative feedback. This is supported by the 100% rating for Action 13 ('In-class check to a provisional score').

Comparison of Level 5 and Level 6 Results
The average rating of Level 5 (74%) is much lower than that of Level 6 (92%). Although it can be argued that the teaching interventions were based on Level 6 students' feedback (NSS involves only Level 6 students), this is a highly interesting result that will be discussed from several aspects.
The first question is whether the actions were appropriately implemented at both levels. For finer measurement, we split the questions into three categories. (a) Four questions directly asked whether a certain action was useful: Q1 (assessment brief); Q2 (marking criteria explanation); Q8 (using real-life examples); and Q14 (coursework deadline). The Level 5 results show an average rating of 91%. (b) Four questions related to tutors' supportiveness: Q4 (overall); Q6 (available time); Q17 (expected outcome); and Q18 (improvement). The Level 5 results show an average rating of 82%. (c) Four questions asked whether the actions were sufficient: Q3 (feedback); Q9 (example numbers); Q13 (in-class support); and Q16 (in-class check). The Level 5 results show an average rating of 67%.
The above statistics might indicate that, while the actions were helpful, they should be implemented more frequently or that tutors should improve their overall quality of teaching. However, it should be noted that on the chosen Level 5 module the lecturing hours were reduced during the year from four to three per week, yet the module content was unchanged (the department made the decision just before the start of the semester, leaving no time to amend the teaching material). We observed that the reduced teaching time resulted in increasing pressure on students and a weaker understanding (the module average mark dropped from the previous year). In view of this negative issue (it is notable that only 53% of students considered that they were satisfied with the module), it is considered that the actions were implemented well, but that their effects were overshadowed. To resolve this issue, action should be taken either to increase staffing or to modify the module.
The second question to be asked is whether the actions designed efficiently address the issues raised by the students. As mentioned above, the Level 5 ratings on whether the actions are useful were similar to those on the Level 6 modules, yet the overall rating was much lower. Furthermore, Q10 asked 'whether the real-life examples increased the motivation', to which students at Level 5 gave 55% while those at Level 6 gave 83%. It seems that such an action ('Embedding real-life examples') had no effect on students' learning experience at Level 5. Bear in mind that the actions were designed by analyzing feedback from final-year students (Level 6), who might have had a different perspective from that of the Level 5 students. We consider that the sample data might not be sufficient to draw a conclusion (i.e., whether an action can improve overall satisfaction), but it was clear that to some extent those actions could improve students' experience (i.e., 91% considered the actions were helpful).

Conclusions

Significance of Using Automatic Textual Analysis
We first measured the efficiency of the proposed automatic framework in terms of how it accelerates the process of analyzing students' responses to the open-ended questions. Specifically, at the stage of assigning the responses (10,328 sentences) into the 10 categories, our classification model required 2000 sentences to be manually processed, saving 80% of the human effort (roughly two weeks' work); at the subsequent stage of further summarizing the key issues within each category, we used topic modeling to weight the top-10 topics for each category, involving only a trivial amount of human effort compared to reviewing and prioritizing the whole dataset manually. Assuming a similar time for human review, using our automatic analysis saved four weeks' work. Furthermore, future studies can apply the trained classification model directly, obviating the need for initial manual categorization.
Second, the efficiency of the proposed automatic analysis can be measured by the quality of the identified issues. As we did not have the resources (two weeks for each stage, as estimated above) to go through the entire dataset, we used an indirect measurement to validate the identified issues against the existing literature and the authors' local teaching experience (as the responses were from students at the same institute), as discussed in the section 'Discussion on Issues Identified'. The outcomes were meaningful and were developed into teaching interventions. The survey results show that the students' learning experience improved as a result. It should be noted that, although the average accuracy of classification was relatively low, topic modeling was able to tolerate this and produce validated outputs, suggesting that the two-stage analysis framework that was designed is resilient. Topic modeling is designed to identify salient features in a text and is robust to noise introduced by misclassifications.
Third, automatic analysis becomes more efficient on a large scale. This is because the requirement for training data does not increase linearly. In other words, human analysis requires 10 times the hours to handle 10 times the amount of data; however, in text mining, while more labeled samples lead to better performance, the effect of increasing the extent of the training data becomes less apparent when the samples are large enough. In this study, we manually labeled 20% of the data, amounting to 2000 sentences. While it is hard to estimate how many samples are required to achieve an optimal performance in auto-analyzing students' open responses, it is not difficult to predict that, to analyze a million student responses, there is no need to manually label 20% of the data (0.2 million). This is because the trained model has already achieved a reasonable outcome with 2000 sentences.
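The sentence counts quoted in this paper fit together as the short calculation below shows. It uses only figures stated in the text (10,328 sentences, 2000 manually labeled, a 70/30 split); the variable names are illustrative, and the conversion of the saved share into weeks of work is the authors' estimate and is not reproduced here.

```python
total_sentences = 10_328    # all sentences extracted from the NSS responses
labeled_sentences = 2_000   # manually annotated across both annotation stages
train_fraction = 0.70       # share of labeled data used for training

unlabeled_sentences = total_sentences - labeled_sentences   # classified automatically
labeled_share = labeled_sentences / total_sentences         # just under 20%
effort_saved = 1 - labeled_share                            # just over 80%

train_size = round(labeled_sentences * train_fraction)      # training sentences
test_size = labeled_sentences - train_size                  # held-out sentences
```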

Limitation and Future Work
Due to the limited time and resources, our study has the following limitations. First, although the text-mining results were automatically generated, human interpretation of the obtained results was still required. In addition, the supervised component of our method showed a relatively low average predictive accuracy of 50% across the 10 pre-defined categories. This was due to our inability to produce a large amount of quality training data. Since the aim of this study was to provide a proof of concept for the effectiveness of our novel text-mining method in identifying the issues expressed in students' feedback and thereafter to design and implement interventions, a small training set was considered sufficient. A future work plan is to collect a larger set of training data for better performance, as well as to improve the quality of the labeled data by adopting a more rigorous annotation methodology, as suggested by other researchers [61,62].
Second, while our intervention and evaluation were implemented in four selected modules at two levels, we understand that the NSS data used in this study provide programme-based feedback. Although the modules were carefully chosen to include students from all programmes, it became evident that the evaluation results varied across the modules and levels. For example, the action of 'Embedding real-life examples' was generally well received by Level 6 students but received lower ratings from Level 5 students. It is suggested that, when designing interventions in future, module-level feedback should be taken into account.
Lastly, the survey results showed the positive effect of our intervention, which is considered sufficient to justify the effectiveness of our novel approach. However, statistically speaking, such a result might not be considered to be robust enough to inform the development of university policymaking or teaching guidelines. At the 95% confidence level, the average rating is 92%, suggesting that the results are encouraging but not statistically significant. In the next stage of our study, to obtain a more reliable result, we will encourage more students to participate in the module evaluation survey.
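To illustrate why a larger number of respondents would make such a rating more reliable, the sketch below computes a 95% Wilson score interval for a satisfaction proportion. This is a standard interval for a binomial proportion and is not claimed to be the analysis used in the study; the respondent counts are hypothetical, since the number of survey returns is not reported here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 corresponds to about 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical respondent counts, both giving a 92% satisfaction rate.
small_sample = wilson_interval(23, 25)     # 23 of 25 satisfied
large_sample = wilson_interval(230, 250)   # 230 of 250 satisfied

small_width = small_sample[1] - small_sample[0]
large_width = large_sample[1] - large_sample[0]
```

With the same observed rate, the interval from the larger sample is considerably narrower, which is the practical motivation for encouraging more students to return the module evaluation survey.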

Summary
In this article, we present a full-length pedagogy study on improving students' experience of teaching and learning, covering all the stages from the initial text analysis, issue identification, teaching intervention design, and implementation to the final evaluation. The significant improvements in student satisfaction demonstrate the effectiveness of our teaching intervention, further supporting the accuracy of the text analysis of students' free-text feedback. In contrast to traditional manual text analysis, we took a novel machine learning approach, enabling us to analyze the free-text comments on a wider scale (i.e., 4400 NSS responses), taking much less time and human resources. Technically speaking, this automatic analysis framework demonstrates its efficiency in 'closing the gap' between the analysis of student feedback and the implementation of timely teaching interventions. Moreover, we consider that this framework has huge potential to be extended: the information extracted from the free-text responses could be used for a wider range of studies; and the annotation scheme developed in Phase 1 could serve as the baseline for future classification studies. A wider implication of this work extends beyond higher education, as the methods employed in this research can be leveraged for case studies from tertiary education, commercial training, and apprenticeship programs, etc., where textual feedback is collected to improve the quality of teaching and learning.

Figure 1. High-level overview of a supervised text-mining approach.

Figure 2. High-level overview of an unsupervised text-mining approach.

Figure 3. Relative frequency (%) of negative responses per category in the annotation scheme developed.

Figure 4. Architecture of the proposed text-mining method to identify automatically the problem statements in the NSS free-text responses.

Figure 5. Example of a single student response consisting of multiple problem statement categories.

Figure 6. Predictive accuracy (%) of the Multinomial Naïve Bayes (MNB), Random Forest (RF), and Support Vector Machines (SVM) algorithms when applied to the evaluation sample of the NSS dataset.

Figure 7. Relative frequency (%) of statements (manually annotated and auto-classified) per category using the annotation scheme developed.

Figure 8. The 10 most important topics in the 'Assessment and feedback' category.

Table 1. Annotation scheme of problem statement categories used in this study and how it covers the NSS categories.

Table 2. Actions to improve student experience.

Table 4. Issues addressed by actions.

Table 5. Questions used in the survey.

Table 6. End of module survey results. The ratings were calculated by averaging all programmes (weighted by student numbers). Data refer to the previous year and the same department in which the teaching intervention took place. NSS survey results available online: http://www.hefce.ac.uk/lt/nss/results/ (accessed on 20 August 2021).

Table 7. Actions and average rating of related questions.

Table 8. Average rating of issues and related questions.