Progressive Teaching Improvement For Small Scale Learning: A Case Study in China

Abstract: Learning data feedback and analysis have been widely investigated in all aspects of education, especially for large-scale remote learning scenarios such as Massive Open Online Courses (MOOCs). On-site teaching and learning nevertheless remains the mainstream form for most teachers and students, and learning data analysis for such small-scale scenarios is rarely studied. In this work, we first develop a novel user interface, implemented as a WeChat mini program and inspired by the evaluation mechanism of popular shopping websites, to progressively collect students' feedback after each class of a course. Collected data are then visualized for teachers and pre-processed. We also propose a novel artificial neural network model to conduct progressive study performance prediction. The prediction results are reported to teachers for next-class and further teaching improvement. Experimental results show that the proposed neural network model outperforms other state-of-the-art machine learning methods and reaches a precision of 74.05% on a 3-class classification task at the end of the term.


Introduction
One of the targets of educational scientists is to develop high-quality education that is closely linked with sustainable development goals. Owing to high dropout rates and low academic success rates in education, learning data analysis has received significant attention in recent years, especially for large-scale remote learning scenarios such as Massive Open Online Courses (MOOCs). These studies tend to focus on education resource prediction [1], aiming to track students' learning activities in order to make predictions and recommendations for online platforms. However, few studies address small-scale learning data analysis, especially for on-site teaching and learning. Recently in China, the usage of internet-driven learning platforms has increased sharply due to the coronavirus outbreak. However, teachers and students mostly use instant chatting services to continue their teaching and learning online, which is essentially the same as the on-site form and illustrates the significance of small-scale learning data analysis. Although the dropout rate of traditional classes is 10-20% lower than that of online courses, the analysis of small-scale learning data for on-site education institutions and organizations should not be ignored [2].
Currently in China, with the purpose of improving curriculum teaching quality, students in most universities are required to evaluate all courses of the semester within a very limited time at the end of each term. Students tend to finish the evaluation arbitrarily and casually because the procedure is tedious and repetitive, which results in inaccurate feedback and makes it difficult to improve courses. Moreover, Data Mining (DM) approaches designed for MOOCs are not suited to small-scale problems. Thus, this paper focuses on progressive learning feedback and analysis through a case study in China, i.e., a Data Structure course during the 2019 Summer semester at the School of Educational Science and Technology, Nanjing University of Posts and Telecommunications. In this study, students could give satisfaction feedback instantly after each class in a convenient and time-saving manner. Accordingly, this study addresses the following questions: (1) How could instructors gain direct knowledge of learners via their feedback data and make the necessary interventions? (2) What kind of algorithm could perform well in small-scale data processing, and how can it be implemented? (3) What positive influences will this study bring to future education development? The conceptual graph of this study is shown in Figure 1.
Compared with small-scale learning such as on-site education, the data of massive online learners are relatively convenient to collect from virtual learning environments (VLEs) and learning management systems. Nevertheless, the collection of such datasets also has several unavoidable limitations. For instance, researchers have no access to some datasets due to privacy issues, e.g., online-course platforms refused to publish users' data due to confidentiality and privacy issues in the work done by Dalipi et al. [3]. May et al. [4] also showed that promising absolute privacy, confidentiality, and anonymity is impractical. For small-scale on-site education, however, these limitations are mostly absent because learning data are collected by teachers or universities only for the purpose of course improvement.
In this paper, an innovative approach is proposed for students to submit their feedback after each class instantly and conveniently. Inspired by the evaluation mechanism of e-commerce websites, where consumers can leave a review of their shopping experience after every transaction, the proposed method develops a WeChat mini program with a novel user interface to gather students' feedback after each class. Students do not need to visit websites via browsers on computers to give feedback; instead, they take the survey in the WeChat mini program on a smartphone. Furthermore, all the collected data used in data analysis are anonymized, protecting the privacy, confidentiality, and anonymity of students' information.
To summarize, the main contributions of this study are as follows:
• An innovative learning feedback mechanism built on the WeChat mini program platform widely used in China, which conveniently collects students' evaluations and suggestions after each class.
• A novel artificial neural network model customized to a small quantity of learning data, which progressively predicts students' final academic performance. These predictions help teachers give specific advice to different students and improve their teaching.
• A comprehensive comparison with other state-of-the-art machine learning methods.
The rest of the paper is organized as follows: Section 2 briefly reviews the work most relevant to ours. Section 3 describes the methods of course evaluation data collection, data pre-processing, and the neural network model adopted. Section 4 presents the experiments and discussions, covering data analysis and visualization, experimental results, and comparisons with other state-of-the-art machine learning methods. Finally, Section 5 concludes the paper and outlines future work.

Educational Data Mining
Data Mining (DM) is a technique mainly targeted at analyzing gathered data. It refers to a procedure of discovering hidden information from a large amount of data through some algorithm [5]. The information and knowledge obtained via DM can be widely used for strengthening the decision making procedure [6]. By using various algorithms, DM tends to build data patterns [7], which has been proved to be important for fields like education, network security, and business [8][9][10].
Recently, a sub-field called Educational Data Mining (EDM) has emerged for the analysis and processing of educational data. Traditional DM methods show great performance in EDM. To promote students' performance [11] and refine study and instruction behavior [6], researchers design personalized learning and course recommendations for students. EDM often makes use of students' performance data, administrative data, and activity data [12], most of which come from web-based learning environments. Romero et al. [13] carried out a survey in 2007, which was further improved in 2010 and 2013, to provide comprehensive resources for studies in EDM. These studies show that DM techniques like classification, clustering, and text mining are widely used in educational institutions. With the rapid evolution of machine learning techniques, there has been a proliferation of research in EDM using Deep Learning (DL) architectures, which were first introduced to the field in 2015.

Student Performance Prediction
Technology-enhanced learning platforms have provided teachers with sufficient students' behavior data, and allow them to study students' performance [14,15] and optimize the learning environment [16].
Various machine learning models have shown great ability to analyze students' interactions and identify students at risk of failure. Decision Tree is widely used in many academic performance prediction tasks [17]. Ahmed et al. [18] applied the ID3 model to predict the final grade of students. Hussain et al. [19] adopted Gradient Boosting Decision Tree to identify students who have fewer engagements in the VLE. Logistic Regression is another extensively used approach for learning data analysis. Marbouti et al. [20] adopted Logistic Regression to identify students' outcomes in advance of the end of the course by incorporating attributes like their attendance and assessment behavior. They achieved better predictive performance for the last few weeks. Moreover, Logistic Regression has often been used as the baseline model to evaluate student performance [21]. Leitner et al. [22] showed that historical information, such as students' entry tests and grades in previous courses, can help the model classify an individual's outcome.
Deep learning is a branch of machine learning that has developed rapidly in recent years, especially in image understanding and Natural Language Processing (NLP). It also shows promising results in EDM tasks, e.g., predicting and classifying the performance of successful and at-risk students. Deep learning models contain multiple layers. Each layer tries to extract more abstract information and sends it to the next layer, so that the model can learn a complex representation of the input data [23]. De Albuquerque et al. [24] applied artificial neural networks (ANNs) to identify the outcome of students and achieved very high accuracy (85%). Corrigan et al. [25] deployed a Long Short Term Memory (LSTM) model to assess the performance of participants based on students' interactive activities with the VLE. Although these models outperform the traditional baseline approaches, a large amount of data is needed to train a deep learning model, which is not feasible for common on-site learning.

Text Analysis
Human languages can be analyzed and understood by NLP algorithms. Sentiment analysis aims to parse sentiment from textual information and extract its polarity and viewpoint [26]. Singla et al. [27] proposed a method to analyze Amazon mobile phone reviews, which are categorized into negative and positive polarity. They used SVM to classify sentiments and achieved an accuracy of 84.9%. Zhao et al. [28] applied Weakly-Supervised Deep Embedding-LSTM to extract features from review text. This model obtained an accuracy of 87.9% on the Amazon dataset. An unsupervised attention model was proposed by He et al. [29] for sentiment analysis, using attention to remove words that are irrelevant to the sentiment. Wang et al. [30] employed attentional-graph neural networks for Twitter sentiment analysis.
Similar to word embedding [31], sentence embedding is adopted to encode the semantic meaning of a sentence into a feature vector. Kiros et al. [32] proposed an unsupervised sentence embedding method using two separate decoders to reconstruct the surrounding sentence from the surrounded one. Utilizing the capability of LSTM to capture long-distance dependency, Palangi et al. [33] used RNN with LSTM cells for modeling sentences. The LSTM-RNN model generates semantic vectors for each word in a sentence sequentially. In 2018, Devlin et al. from Google released BERT, a contextualized word representation that has achieved state-of-the-art performance in many NLP tasks. Wang et al. [34] developed a new BERT-based method for sentence embedding, called SBERT-WK, which combined the advantage of both parameterized [32,35] and non-parameterized methods [36,37]. This model consistently outperforms state-of-the-art approaches with low computational cost and good interpretability.

Method
To resolve the aforementioned issues in small-scale learning feedback and teaching improvement, a novel pipeline is proposed to progressively collect students' feedback after each class, visualize the raw data for instructors, and make performance predictions to help teachers improve their further teaching. Figure 2 illustrates the whole pipeline of the proposed method. The system first collects students' feedback after each lesson and saves these data into a database. The submitted data are recorded after every class, so the records for the $k$-th lesson consist of students' feedback on that specific lesson. In Figure 2, $F_{ij}$ denotes the $j$-th feature value for student $i$ before the $k$-th lesson. After data pre-processing, the processed data are visualized to give teachers an intuitive sense of teaching effects. Finally, an ANN model is adopted to predict the performance of every student for further teaching improvements. The following subsections describe each module in the proposed pipeline in detail.

Feedback Data Collection
Because the main goal of this research is to provide a fully automatic visualization and analysis system for teachers after each class, the system needs to collect students' feedback data after each class. Inspired by the evaluation mechanism of popular e-commerce websites that collect customers' reviews on each transaction and make corresponding improvements, a WeChat mini program (Figure 2, top left) is developed to collect students' instant responses after each class.
In this case study, data were collected from the Data Structure course offered in 2019 Summer by the School of Educational Science and Technology, Nanjing University of Posts and Telecommunications, China, with 113 students enrolled in this course. At the end of each lesson, the teacher presents the QR code of the WeChat mini program so that students can give feedback about the lesson without downloading any app. Students can use this mini program to fill in the questionnaire conveniently on a smartphone. The questionnaire has only 10 questions, and 9 of them are multiple choice. Thus, the participants can fill in the form quickly. Keeping the procedure short is intended to avoid tedium and thereby help ensure the quality of the feedback. At the end of the course, 1089 records had been collected from students across 16 lessons. As shown in Figure 3, not every student submitted feedback after each lesson, and in some cases students did not submit comments. The data collected from each student's feedback include different data types. English translations of feedback samples are shown in Table 1. For each feedback record, the first item records the knowledge point taught in the lesson. Items from the second to the tenth hold the answers to the multiple-choice questions. The last item is the comment, i.e., what the student wants to suggest for this lesson, which is required to contain at least 10 Chinese characters to ensure comment quality. Text clustering has been applied to estimate the quality of the collected comments. All comments are roughly clustered into three classes. Among them, comments containing negative phrases such as 'So hard to understand' form a category whose students were more likely to perform poorly. In contrast, students whose comments contained more positive phrases like 'the teacher made it very clear' tended to achieve better academic performance, which supports the quality of the collected feedback. The detailed translated questionnaire is listed in Appendix A.
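The minimum-length rule for comments can be illustrated with a short check. This is only a sketch: the function names are hypothetical, and it assumes that "Chinese characters" means code points in the basic CJK Unified Ideographs block, which may differ from the rule actually used in the mini program.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is counted;
# rarer extension blocks and punctuation are ignored.
_CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def count_chinese_characters(text):
    """Number of Chinese characters in a comment."""
    return len(_CHINESE_CHAR.findall(text))


def is_valid_comment(text, minimum=10):
    """True if the comment reaches the minimum Chinese-character count."""
    return count_chinese_characters(text) >= minimum
```

Counting characters with a pattern rather than using the raw string length keeps digits, Latin letters, and punctuation from inflating the count.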

Data Pre-Processing
As previously mentioned, the proposed model is expected to predict students' final performance with progressively higher accuracy as more feedback data are introduced. Another crucial issue is missing data, i.e., not every student remembers to submit his/her feedback after each class. Thus, each student's feature vector has to be of fixed length and contain historical information regardless of how much feedback he/she has submitted. The procedure of data pre-processing is shown in Figure 4. Specifically, after the $k$-th lesson, all the feedback is downloaded from the database. For each student $u_i$, his/her feedback record for the $k$-th lesson is denoted $R_i^k$, where $q_{ij}^k$ indicates his/her selection on question $j$ and $c_i^k$ refers to his/her comment, which has to be converted to a fixed-length feature vector. The pre-processing first removes meaningless characters in $c_i^k$, such as single English characters and punctuation. After that, sentiment analysis is adopted to estimate an overall sentiment score for each sentence [38], giving an emotion value $e_i^k$. A value of $e_i^k$ greater than 0 means the comment is positive, while a value less than or equal to 0 indicates a negative attitude. In addition, a higher absolute value of $e_i^k$ means the feeling is more intense. Afterwards, Sentence-BERT [34] is applied to convert the processed $c_i^k$ to a sentence embedding $f_i^k$. For multiple-choice questions, each answer is encoded as the index of the selected option. For example, for the question "What do you think of the overall difficulty of this lesson?", the options Easy, Medium, and Hard are encoded as 0, 1, and 2, so a larger index indicates a more difficult class. Therefore, the proposed algorithm uses the average of each student's option indexes to represent his/her average selection on each question.
Similarly, the average of each student's emotion values and the average of his/her comment feature vectors are joined to represent the student's comments. Finally, the averaged feature vector for student $u_i$ is obtained as $F_i$. Samples of processed feedback are displayed in Table 2. For performance prediction, each student's final exam score (0-100) is collected. Because fewer than 10 students failed or scored above 90, students' final performances are separated into three categories to mitigate the data imbalance problem: students with a score of less than 70 are labeled 'Worse', those with a score over 70 but less than 80 are labeled 'Good', and those with a score of more than 80 are labeled 'Excellent'. The statistics of each category are listed in Table 3.
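The averaging and labeling steps can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the record layout, the function names, and the order in which the parts are joined are assumptions, and the handling of scores of exactly 70 and 80 (which the text does not specify) is a choice made here.

```python
from statistics import mean

# Option order for one question, taken from the example in the text.
DIFFICULTY_OPTIONS = ["Easy", "Medium", "Hard"]


def encode_option(answer, options):
    """Map a multiple-choice answer to its option index (0, 1, 2, ...)."""
    return options.index(answer)


def student_feature_vector(records):
    """Average one student's submitted records into a fixed-length vector.

    `records` holds only the lessons the student actually answered. Each
    record is a dict with hypothetical keys:
      "choices"   -> list of option indexes, one per multiple-choice question
      "emotion"   -> sentiment score of the comment
      "embedding" -> sentence embedding of the comment (list of floats)
    """
    if not records:
        raise ValueError("student has no feedback yet")
    # zip(*...) transposes: one tuple per question or embedding dimension.
    avg_choices = [mean(col) for col in zip(*(r["choices"] for r in records))]
    avg_emotion = mean(r["emotion"] for r in records)
    avg_embedding = [mean(col) for col in zip(*(r["embedding"] for r in records))]
    # Assumed join order: choices, then emotion value, then embedding.
    return avg_choices + [avg_emotion] + avg_embedding


def label_score(score):
    """Bucket a 0-100 exam score into the three classes.

    Assumption: scores of exactly 70 and 80 fall into the higher class.
    """
    if score < 70:
        return "Worse"
    if score < 80:
        return "Good"
    return "Excellent"
```

Because the average runs only over the lessons a student actually answered, the vector length is the same whether a student submitted one record or sixteen, which is how the fixed-length requirement is met despite missing feedback.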

Artificial Neural Network Model
Progressive prediction of students' performance can help teachers adjust and improve their future teaching. This work aims to learn a classification model that can achieve early prediction of each student's outcome in three categories: Worse, Good, and Excellent. The predictor is designed to reach better classification accuracy as the course goes on, because the gathered data accumulate. Furthermore, the predictions can also suggest how teachers might improve their teaching methods and tailor the assignments. To achieve this goal, this work proposes an Artificial Neural Network model as the predictor. The architecture of the proposed ANN model is illustrated in Figure 5. Because the collected data are small in scale, the model needs to make full use of all input data and of the features of every hidden layer in order to improve on a traditional ANN. Thus, this work uses concatenation for the fully connected layers, which aggregates features from prior layers. This helps the model use both the low-level features generated by earlier layers and the condensed high-level features. A dense layer follows the concatenated features to generate the softmax output. For the activation function, Leaky ReLU is applied after each hidden layer, since Leaky ReLU avoids the dying-neuron problem of the ReLU activation function. Experimental results in the next section show that the proposed ANN with concatenation and Leaky ReLU outperforms vanilla neural network models.
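The concatenation idea can be illustrated with a small forward pass written in plain Python so that it runs without a deep learning framework. It is a sketch under stated assumptions, not the authors' model: it reads the description as a chain of hidden layers whose outputs are all concatenated before the softmax layer, the weights are random (no training is shown), the negative slope of 0.01 is an assumed value, and all function names are hypothetical.

```python
import math
import random


def leaky_relu(v, slope=0.01):
    """Leaky ReLU: a small non-zero slope for negative inputs (slope assumed)."""
    return [x if x > 0 else slope * x for x in v]


def dense(v, weights, bias):
    """Fully connected layer; `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def softmax(v):
    """Numerically stable softmax."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_features, hidden_sizes, n_classes, seed=0):
    """Create hidden layers plus an output layer fed by their concatenation."""
    rng = random.Random(seed)
    hidden, n_in = [], n_features
    for n_out in hidden_sizes:
        hidden.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    # The output layer sees the outputs of every hidden layer, not just the last.
    output = init_layer(sum(hidden_sizes), n_classes, rng)
    return hidden, output


def forward(model, features):
    """Return class probabilities for one feature vector."""
    hidden, (out_w, out_b) = model
    activations, v = [], features
    for weights, bias in hidden:
        v = leaky_relu(dense(v, weights, bias))
        activations.append(v)
    concatenated = [x for layer_out in activations for x in layer_out]
    return softmax(dense(concatenated, out_w, out_b))
```

For example, `forward(build_model(10, [8, 6, 4], 3), [0.5] * 10)` returns three class probabilities computed from an 18-dimensional concatenated feature (8 + 6 + 4), so low-level features from the first layer reach the classifier directly.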

Data Visualization
In this small scale dataset, a total of 1089 course feedback records have been collected from 16 lessons with 113 students enrolled. Since the first goal of this study is to give teachers a first impression after each class, feedback data are visualized immediately to teachers. Three data analysis and visualization techniques are adopted to help teachers understand students' feedback.
Multiple Choice Selection. Multiple-choice questions make up most of the feedback form. A bar chart is used to display the number of students selecting each option. Thus, teachers can see the distribution of students' choices on every question and fine-tune their teaching plans for the next lesson. As shown in Figure 6a, 48 students said they needed to review knowledge points after class, 6 stated that they did not understand these points, and 22 believed they understood what they had learned. Based on this bar chart, teachers could decide to ask students questions about these points to check their understanding.
Before-and-After Comparison. For each lesson, the average of the multiple-choice answer indexes is calculated and shown in a line chart to evaluate the difficulty of different knowledge point sections. Thus, teachers can pay more attention to the sections that students found harder. Figure 6b indicates that 'Linked List' got the lowest difficulty score, which means that it was the easiest part for students, whereas 'AVL Tree' caused students the most trouble, as it showed the highest difficulty score. Other information, such as students' spirit status in every lesson (spirit score) and the fun of the course (fun score), is calculated as the per-lesson average of students' option indexes on questions 3 and 4, respectively. These scores are also displayed in Figure 6b, which gives teachers a good understanding of the effect of the course.
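The per-lesson scores behind such a line chart can be computed with a simple group-and-average step. The sketch below assumes a hypothetical record layout of `(lesson_name, answers)` pairs with made-up question keys; it is not taken from the authors' code.

```python
from collections import defaultdict
from statistics import mean


def lesson_scores(records, question):
    """Mean option index per lesson for one question.

    `records` is an iterable of (lesson_name, {question: option_index}) pairs;
    both the layout and the question keys are hypothetical.
    """
    by_lesson = defaultdict(list)
    for lesson, answers in records:
        if question in answers:  # skip records that lack this answer
            by_lesson[lesson].append(answers[question])
    return {lesson: mean(values) for lesson, values in by_lesson.items()}


def easiest_and_hardest(scores):
    """Return the lessons with the lowest and the highest mean score."""
    return min(scores, key=scores.get), max(scores, key=scores.get)
```

Calling `lesson_scores` once per question (difficulty, spirit, fun) yields one series per score, each of which can be drawn as a line over the lessons.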
Comments Analysis. A word cloud is a visual representation of keyword frequency and value [39]. A word that appears more frequently in a given text is displayed in the word cloud image at a larger size. This visualization strategy helps teachers get instant insight into the most important terms in the comments based on the size of the words. Bigger words are more noticeable to teachers and may lead them to adjust their teaching plans to address the difficulties that students mention most often. A word cloud example is shown in Figure 6c. The largest Chinese word in the word cloud means 'coding', indicating that most students found reading and writing code a problem in this lesson.
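The frequency-to-size mapping that underlies a word cloud can be sketched without a plotting library. The example assumes the comments have already been segmented into tokens (Chinese text needs a word segmenter upstream), and the linear size scaling and the function names are illustrative choices rather than the behavior of any particular word cloud tool.

```python
from collections import Counter


def keyword_frequencies(tokenized_comments, stopwords=frozenset(), top_n=20):
    """Count how often each keyword occurs across all comments.

    `tokenized_comments` is a list of token lists; segmentation is assumed
    to have happened before this step.
    """
    counts = Counter(
        token
        for tokens in tokenized_comments
        for token in tokens
        if token not in stopwords
    )
    return counts.most_common(top_n)


def font_sizes(frequencies, min_size=10, max_size=60):
    """Scale counts linearly to font sizes, as a word cloud would."""
    if not frequencies:
        return {}
    top = max(count for _, count in frequencies)
    low = min(count for _, count in frequencies)
    span = max(top - low, 1)  # avoid division by zero when all counts match
    return {
        word: min_size + (count - low) * (max_size - min_size) / span
        for word, count in frequencies
    }
```

The most frequent token receives `max_size`, so a term such as 'coding' that recurs across many comments would dominate the rendered image.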

Experimental Settings
In this paper, a three-class classification experiment is conducted for course outcome prediction. Stratified 10-fold cross-validation is applied to train and validate the proposed model and the baseline models, so that in each fold 90% of the data are used for training and 10% for validation. Grid search was adopted to find the optimal hyperparameters for the traditional machine learning methods. Three fully connected layers are used to extract features, with widths decreasing from 256 to 64 units. A dropout layer with rate 0.3 is placed between the layers to reduce overfitting. Leaky ReLU is applied as the activation function after each fully connected layer, except the last layer, which uses the softmax function. Adam is used as the optimizer with the learning rate set to 0.00002. Each run is trained for 2500 epochs with a batch size of 113 (the number of students). Figure 7 illustrates the metrics of the training procedure, where early stopping was used to prevent overfitting.
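Stratified splitting keeps the class proportions of each fold close to those of the whole dataset, which matters when only 113 samples are available. The following sketch shows one way to build such folds in plain Python; the round-robin dealing scheme and the function name are assumptions made for illustration, not the splitting code used in the experiments.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Samples of each class are shuffled and dealt round-robin to the folds,
    so every fold keeps roughly the overall class proportions.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)  # continue dealing where the last class stopped

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation
```

With 113 labels and `k=10`, each validation fold holds 11 or 12 students, and every student appears in exactly one validation fold.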

Learning Performance Prediction
As mentioned above, an ANN model is developed to predict students' outcomes in three categories, based on their historical feedback after each lesson. By reviewing these prediction results, teachers may improve students' future performance by providing special guidance to those who are potentially at risk. Below, the proposed ANN model is compared with traditional machine learning methods and with alternative ANN configurations, showing the best parameters for our model.

Comparison with State-of-the-Art Machine Learning Methods
In this part, the proposed ANN model is compared with other state-of-the-art machine learning methods, including Logistic Regression, Random Forest, Decision Trees, and SVM. These models have demonstrated excellent performance in predicting students' outcomes on large-scale datasets and are computationally cheap. However, their hyperparameters have to be fine-tuned to get the best results, which can be time-consuming. Moreover, for different input data, researchers usually have to repeat experiments to find the best hyperparameters. In this experiment, four metrics, i.e., Accuracy, Precision, Recall, and F1 score, are adopted to evaluate the different algorithms. Table 4 shows that the proposed ANN model obtains better results than the other machine learning methods. Accuracy measures the proportion of samples that are classified correctly. As shown in Table 4, the ANN achieved the best accuracy among all classifiers, 73.69% with 16 lessons, followed by Random Forest with 65.27%.
Precision and recall are widely used in data science to evaluate the performance of models. For a given class, precision is the fraction of samples predicted as that class that truly belong to it, and recall is the fraction of samples truly belonging to that class that the model identifies correctly. Since the task is to classify students' final scores into 3 classes, the precision and recall values are calculated separately for each category. The macro-average strategy is used because the three classes are of almost equal size. The precision and recall values of all compared models are displayed in Table 4. It can be observed that the ANN outperforms the other methods significantly once data from four lessons are used. At the same time, Logistic Regression shows relatively good results only when data from very few lessons are available.
The F1 score is defined as the harmonic mean of precision and recall, which provides a more balanced measure of a model's performance than either metric alone. The F1 scores of all models are listed in Table 4. The ANN again shows relatively good results as the course progresses and performs best when data from 4, 8, 12, and 16 lessons are used. Logistic Regression outperforms the other methods only with fewer than 4 lessons.
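The macro-averaged metrics can be computed directly from the true and predicted labels. The sketch below is illustrative only: it takes macro F1 to be the mean of the per-class F1 scores (a common convention, assumed here because the text does not say which variant is used) and treats any metric with a zero denominator as 0.

```python
def macro_metrics(y_true, y_pred, classes):
    """Return macro-averaged (precision, recall, F1) over `classes`."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Assumed convention: a metric with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights the three classes equally regardless of how many students fall into each, which is reasonable here because the classes are of similar size.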
The F1 scores of all compared models running with different amounts of data are illustrated in Figure 8a. It shows that traditional methods like Random Forest and SVM outperform the proposed ANN model only before Lesson 4, because too few data are involved. After Lesson 4, the proposed model outperforms the other methods and performs better as more data are introduced. At the end of the course, the F1 score of the proposed method reaches 0.7372 in Lesson 16. Figure 8b shows the F1 scores of the proposed ANN model and the baselines when comment features are not used. All models perform poorly in this setting. The traditional machine learning methods perform about the same as in Figure 8a, whereas the proposed ANN model shows reasonable results only when both comment features and more lessons' data are used. This indicates that comment information is crucial to improving the ANN's performance. A likely reason for the difference between the ANN and the traditional machine learning models is that the latter cannot readily exploit the sentence embeddings, which are high-dimensional representations of text generated by deep neural networks. Thus, they fail to extract useful information from comments without specific feature engineering. In contrast, the ANN can use multiple layers and non-linear activation functions to learn and exploit the sentence embedding representation, resulting in better performance.

Comparison with Different ANN Configurations
To demonstrate the necessity of concatenation and the Leaky ReLU activation function, Figure 9 compares the F1 scores of different ANN architecture configurations. The proposed configuration achieves the best F1 score at the four selected lessons, reaching 0.7372 at Lesson 16. The ANN that does not use concatenation performs worst on this task. Dropout with a rate of 0.3 was applied after each layer except the last one, meaning that some units were temporarily ignored during training. Concatenation was applied to the layer outputs without dropout, enabling the output layer to fully use all features extracted in previous layers. Concatenation helps the model fully use uncondensed low-level features, which may contain important information ignored by high-level layers or removed by dropout. Leaky ReLU can also preserve information from the initial layers in this task.

Conclusions and Future Work
In this study, we propose a novel approach to perform progressive class feedback, qualitative visualization, and student performance prediction, especially for small-scale learning. Such analysis could also help teachers adjust and improve their teaching strategies throughout the whole course. A case study on the Data Structure course held in the 2019 Summer term with 113 students is investigated using an Artificial Neural Network model. The precision begins at 30.00%, progressively improves during the term, and finally reaches 74.05% on this small dataset.
In the future, more machine learning methods could be investigated for such a small dataset. RNN-based models (vanilla RNN, GRU, and LSTM) could be applied to extract information from sequential feedback data. Data completion to fill in missing feedback after each class is also an interesting direction, because missing submissions from some students negatively influence the whole dataset, especially when the dataset is this small.