Online At-Risk Student Identification Using RNN-GRU Joint Neural Networks

Abstract: Although online learning platforms are becoming commonplace, learners' high dropout rates and poor academic performance within the virtual learning environment (VLE) require more attention. This study aims to predict students' performance in a specific course while it is still running, using static personal biographical information and sequential behavior data from the VLE. To achieve this goal, a novel recurrent neural network (RNN)-gated recurrent unit (GRU) joint neural network is proposed to fit both static and sequential data, and a data completion mechanism is adopted to fill in missing stream data. To incorporate the sequential relationships in learning data, three time-series deep neural network algorithms (simple RNN, GRU, and LSTM) are first considered as baseline models, and their performance in identifying at-risk students is compared. Experimental results on the Open University Learning Analytics Dataset (OULAD) show that simpler methods such as GRU and simple RNN achieve better results than the relatively complex LSTM model. The results also reveal that different models reach peak performance at different times, which motivates the proposed joint model; it achieves over 80% prediction accuracy for at-risk students at the end of the semester.


Introduction
With the rapid evolution of science and technology, educational tools have changed dramatically in recent decades. Virtual Learning Environments (VLEs) like Massive Open Online Courses (MOOCs), which provide lecture videos, online assessments, discussion forums, and even live video discussions via the Internet [1], have become commonplace, especially during the COVID-19 outbreak. Two benefits account for the increasing adoption of online learning. Firstly, VLEs make it convenient for participants to enroll in courses by removing time and distance limitations. Moreover, Internet-based online learning platforms are able to record trace data [2], that is, data from a user's VLEs and other learning systems, which, after the necessary analysis, greatly helps in providing personalized educational services. However, online learning suffers from high dropout rates and frequent academic failure. Research on distance education claims that the completion rate of courses is usually less than 7% [3]. For instance, the dropout rate of Coursera ranges from 91% to 93% [4] and similar conditions happened in the Open
• Firstly, this study proposes a novel joint neural network model framework to identify at-risk students accurately based on their demographic information and interaction stream data.
• Secondly, a data completion method is adopted to fill in missing stream data, which enables the model to be trained and validated on courses of varying length.
• Thirdly, the experiments show that the gated recurrent unit (GRU) and simple RNN perform better in analyzing academic stream information than the LSTM model.
The organization of this paper is as follows. Section 2 briefly reviews the most relevant work via a literature survey. Section 3 presents the methods of data pre-processing and the neural networks used in the experiment. Section 4 describes the experimental setup and discusses the results. Section 5 concludes with a summary of the whole work and future directions.

Educational Data Mining
Data Mining (DM) is used to extract data and discover useful information from a dataset [6]. It has developed rapidly in recent decades as more data have become available for analysis. Fayyad et al. [14] used DM to analyze collected data and enhance the decision-making process based on the analyzed results. DM has been widely applied in many fields, such as anomaly detection, intrusion detection, and domain detection [15][16][17].
In education, educational data mining (EDM) is concerned with designing methods and algorithms to analyze data collected from teaching environments [18]. Romero et al. [19] conducted a review in 2007 covering DM techniques used in different teaching environments, which analyzed how specific DM techniques, like text mining and statistics, have been applied in online lessons. Their further research involved more factors associated with the participant groups, the type of educational settings, and the data offered. It illustrated the most common tasks handled by EDM techniques in the educational environment [20].
EDM techniques are widely used in solving learning problems. Perera et al. [21] applied EDM in computer-supported collaborative learning to extract collaborative patterns in discussions and education environments, and Gaudioso et al. [22] used EDM technology to support teachers in collaborative participant modeling. EDM can also be used to identify aspects related to participants' dropout intention and to classify students who tend to drop out based on their historical data [23][24][25]. Researchers have designed personalized learning and course recommendations for students based on EDM results used to predict students' outcomes [26] and enhance learning and teaching behaviour [14]. Based on data from log event files, Ben-Zadok et al. [27] demonstrated the use of EDM to increase students' exposure to different topics, enabling teachers to analyze students' learning processes according to their preferences and actual behavior and to meet their diverse learning requirements. A study by Sabourin et al. [28] used EDM on self-regulated learning behaviors to predict students' self-regulation capabilities.

Student Performance Prediction
Many studies have applied machine-learning and deep-learning approaches to predict students' performance [29,30] and optimize learning settings [31], as more student behavior data have become available through technology-enhanced learning environments like MOOCs and Learning Management Systems (LMSs). Predicting students' outcomes during the course is crucial in MOOCs and LMSs because it can help teachers recognize at-risk students and assist them in passing the course. Several works have addressed at-risk student identification.
A variety of previous studies on predicting students' performance used traditional machine learning approaches to fit demographic information, interaction logs, or both. Logistic Regression (LR) was typically employed in models predicting students at risk of failure and showed promising predictive results. Wilson et al. [32] applied LR to participants' demographic information that corresponds to their writing tasks and personal abilities, and the resulting model showed a promising area under the curve (AUC) score of 0.89. Marbouti et al. [33] also employed LR to evaluate student performance in advance of the course using attributes of their attendance and assessment behavior. Silveira et al. [2] compared LR, SVM, Naive Bayes and J48 in predicting academic success/failure based on institutional data and trace data generated by a VLE; J48 presented the best classification accuracy and had the best execution time (excluding Naive Bayes). These machine learning methods show promising results in predicting students' performance with fixed-length data.
Because much of the data generated by VLEs are time series, such as clickstreams, assessment streams, and interaction event logs, traditional machine learning methods cannot use such variable-length data for predicting outcomes. Many deep-learning-based models have recently been used to handle these kinds of data. Aljohani et al. [34] deployed a deep LSTM model to classify students' outcomes from sequential data; the proposed LSTM model achieved the best result, with a recall of 0.7579 and a precision of 0.9346, and outperformed the baseline LR and ANNs in accuracy by 18.48% and 12.31%, respectively. Karimi et al. [35] developed a model called Deep Online Performance Evaluation (DOPE) for performance prediction, which represented the online learning system as a knowledge graph and used a relational graph neural network to learn student and course embeddings from historical data. An LSTM was utilized to condense the student behavior data into a compact encoding. Their experiments showed the feasibility of DOPE, which could identify at-risk students in ongoing courses. All these models used sequential data from interaction event logs but ignored the assessment data, which are also available while courses are in progress. In contrast, our model uses both of these data streams and thus achieves better predictive results.

Method
The dataset we used is the Open University Learning Analytics dataset (OULAD), which comprises demographic information, course information, mutual information, and assessment performance of 32,593 participants in courses lasting no more than nine months, during 2013 and 2014 [12]. It is composed of seven different courses, and each course was presented at least twice, starting in different months of the year. The participants' final performance was grouped into four classes: withdrawal, fail, pass, and distinction.
Mutual information describes the interaction of students with the VLE, logged as the number of clicks per day for each course. The interactions are categorized into 20 types of click actions, such as visiting a recommended URL or resource, completing quizzes, and filling in questionnaires. Assessment information presents the type, weight, and expected deadline of each assessment, and the results of students' submissions. Figure 1 shows the distribution of student numbers with pass or fail outcomes based on students' interaction with the VLE. It is clear that students who get higher scores in assessments and interact with the learning environment frequently are more likely to pass the course, while those who fail the course tend to have lower assessment scores and fewer clicks in the VLE. As a result, assessment performance history and mutual history can be used to predict whether a student is at risk during an online course. Since part of a course's nature varies from semester to semester, such as the length of the course and the type and number of its assignments, time-series models are well suited to these variable-length data. Because not every student submitted all assignments for each course and VLE history was not logged every week, data completion is needed during pre-processing to fill in the unsubmitted assessments and unrecorded weekly VLE interactions. Many approaches on this dataset train and test on a single course after the selected course has ended, which makes them less meaningful in practice. Our proposed approach can be trained and validated on information from past courses and shows promising results on the current course. The following subsections describe each module in the proposed pipeline in detail.

Data Pre-Processing
As mentioned above, we require the assessment stream and the clickstream to have the same length for every student so that the data can be used in time-series models. To achieve this, we used data completion to fill in the missing data. Specifically, because the number and type of assessments are fixed for each course, we added each student's unsubmitted assessments to the assessment table and assigned them a zero score. The VLE data were organized week by week, as the sum of clicks for each type of VLE activity in one week. Because some students did not access the VLE for several weeks, we supplemented the interaction data for the missed weeks and assigned zero clicks. Demographics in this paper refer to a student's personal information, such as geographic region, gender and highest education. Because most personal attributes are unordered, we converted student information into one-hot encodings. Since we aim to use the historical information of one course to predict student outcomes in the current course, we applied the above operations to each course separately, because courses in different semesters varied in assessments and length. During the training and validation procedures, we trained and validated the model on every past course iteratively. This study predicts whether or not a participant will eventually fail based on his/her information and interaction in the current course. We merged the 'distinction' labels into the 'pass' labels and ignored the 'withdrawal' instances.
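The completion and encoding steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the `(week, activity, clicks)` row layout, and the assessment identifiers are all hypothetical.

```python
def complete_weekly_clicks(click_rows, n_weeks, n_activity_types):
    """Build a week-by-activity click table, leaving zeros for unlogged weeks.

    click_rows: iterable of (week_index, activity_index, clicks) tuples
    (a hypothetical layout; the raw logs are daily and would be summed first).
    """
    stream = [[0] * n_activity_types for _ in range(n_weeks)]
    for week, activity, clicks in click_rows:
        stream[week][activity] += clicks  # sum clicks per activity type per week
    return stream


def complete_assessments(submitted, assessment_ids):
    """Return one score per course assessment; unsubmitted ones get zero."""
    return [submitted.get(a, 0.0) for a in assessment_ids]


def one_hot(value, categories):
    """Encode an unordered attribute as a one-hot vector."""
    return [1.0 if value == c else 0.0 for c in categories]


clicks = complete_weekly_clicks([(0, 1, 5), (0, 1, 2), (3, 0, 4)],
                                n_weeks=5, n_activity_types=2)
scores = complete_assessments({"A1": 78.0, "A3": 55.0}, ["A1", "A2", "A3"])
gender = one_hot("F", ["F", "M"])
```

After completion, every student in the same course has a click table with the same number of weeks and a score list with the same number of assessments, which is what the sequence models require.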

Approach
RNNs are the default choice for sequence modeling tasks because of their exceptional ability to capture temporal dependencies in sequential data. Several variants of RNNs, such as LSTM and GRU, are capable of capturing long-term dependencies in sequences and have achieved state-of-the-art performance in many sequence modeling tasks.
The vanilla RNN is one of the simplest time-series models, in which the input and the hidden state are simply passed through a single tanh layer. The computation of the hidden state $h_t$ and the output $y_t$ at time $t$ is described mathematically in Equations (1) and (2), where $W_h$, $W_x$ and $W_y$ are the weights of the simple RNN, and $x_t$ is the input feature at time $t$. All the weights are applied using matrix multiplication, and biases are added to the resulting output.
The GRU uses a reset gate and an update gate to control the data stream: the reset gate determines how the previous memory affects the new input, and the update gate indicates how much of the previous information is passed along to the future. The hidden state $h_t$ in GRUs is calculated as shown in Equations (3)-(6), where $U_z$, $W_z$, $U_r$, $W_r$, $U_h$, $W_h$ are the corresponding weights of the GRU, $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ denotes the candidate hidden state, and $\sigma$ denotes the component-wise logistic sigmoid function.
The LSTM is more complicated than the GRU, with more gates and a cell state. The cell state at time $t$ preserves information from long before $t$. LSTMs can remember longer sequences than GRUs, so they achieve better results in tasks that require understanding long-distance relations.
Since courses in MOOCs are relatively short and we aim to predict student performance as early as possible, a powerful LSTM is not necessary for capturing very long-term dependencies in the week-wise assessment and VLE interaction data. Experimental results show that simple RNN and GRU converge faster and achieve relatively better performance than LSTM.
The proposed work developed a new deep network to identify participants' outcomes based on their demographics, assessment stream, and clickstream. To achieve this purpose, we divided the model into four modules: Assessment Module, Demographics Module, Click Module, and Prediction Module, as depicted in Figure 2. Specifically, for a student $i$, $D_i$, $A_i$ and $C_i$ denote his/her pre-processed demographics, assessment-wise assessment stream information and week-wise interaction stream information, respectively. In the Demographics Module, $D_i$ is converted into a demographic feature vector $F_{D_i}$ using a fully-connected network (FCN). The Assessment Module and Click Module, in which RNNs are implemented, extract the assessment-wise features $F_{A_i}$ and week-wise features $F_{C_i}$. Finally, $F_{D_i}$, $F_{A_i}$ and $F_{C_i}$ are concatenated into a feature vector $F_i$, representing the historical features of student $i$, which is used in the Prediction Module for performance prediction.
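For reference, one common textbook formulation that matches the symbols named above is given below. It is a sketch under assumptions, not necessarily the exact form intended for Equations (1)-(6): the input weight of the simple RNN is written $W_x$, the bias terms $b_h$ and $b_y$ are shown only for the simple RNN, $\odot$ denotes element-wise multiplication, and the roles of $z_t$ and $1 - z_t$ in the last line are swapped in some papers. Note that $W_h$ denotes the hidden-to-hidden weight in the simple RNN but the input-to-candidate weight in the GRU, following the symbol lists in the text.

```latex
\begin{align}
h_t &= \tanh\left(W_h h_{t-1} + W_x x_t + b_h\right) \tag{1} \\
y_t &= W_y h_t + b_y \tag{2} \\
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) \tag{3} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) \tag{4} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right)\right) \tag{5} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{6}
\end{align}
```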
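The data flow through the four modules can be illustrated with a small pure-Python forward pass. It is only a sketch under several assumptions: a single fully-connected layer stands in for each FCN, a bias-free single-layer GRU is used in both stream modules, the output is a sigmoid probability of failure, and all weights are random, so the numbers carry no meaning. All function and parameter names are hypothetical.

```python
import math
import random


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def leaky_relu(v, slope=0.01):
    return [a if a > 0 else slope * a for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_last_state(seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a bias-free GRU over a sequence of any length; return the last state."""
    h = [0.0] * len(Wz)
    for x in seq:
        z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(Wh, x), matvec(Uh, rh))]
        h = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
    return h


def make_gru(in_dim, hid, rng):
    # Order matches gru_last_state: Wz, Uz, Wr, Ur, Wh, Uh
    return [rand_matrix(hid, d, rng) for d in (in_dim, hid, in_dim, hid, in_dim, hid)]


def predict_fail_probability(D, A, C, params):
    F_D = leaky_relu(matvec(params["fc_demo"], D))   # Demographics Module
    F_A = gru_last_state(A, *params["gru_assess"])   # Assessment Module
    F_C = gru_last_state(C, *params["gru_click"])    # Click Module
    F = F_D + F_A + F_C                              # concatenated feature vector
    return sigmoid(matvec(params["fc_out"], F)[0])   # Prediction Module


rng = random.Random(0)
hid = 4
params = {
    "fc_demo": rand_matrix(hid, 3, rng),
    "gru_assess": make_gru(1, hid, rng),
    "gru_click": make_gru(2, hid, rng),
    "fc_out": rand_matrix(1, 3 * hid, rng),
}
D = [1.0, 0.0, 1.0]                                    # one-hot demographics
A = [[0.78], [0.0], [0.55]]                            # one score per assessment
C = [[5.0, 0.0], [0.0, 0.0], [3.0, 1.0], [0.0, 2.0]]   # weekly clicks per activity
p = predict_fail_probability(D, A, C, params)
```

Because each stream is reduced to its final hidden state before concatenation, the same parameters accept assessment and click streams of different lengths, which is what allows courses of varying length to share one model.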

Experiments and Discussions
In this section, we conduct experiments to verify our proposed method and compare it with baselines. We first explain the experimental settings and baseline methods and then present the experimental results and discussions.

Experimental Settings
In this paper, we performed binary classification for online course outcome prediction. The classes 'pass' (including 'distinction') and 'fail' were considered; the 'withdrawal' class was ignored. Since we intended to use historical course information to predict performance in the current course, we used 80% of the historical data for training and 20% for validation and for fine-tuning the hyperparameters. As shown in Table 1, the code module indicates the course id, 2013 and 2014 indicate the year of the course, and 'B' indicates that the course starts in February and 'J' in October.
Two fully-connected layers with 128 neurons were implemented in the Demographics Module, and three layers were used in the Prediction Module, from 384 to 1536 units. The architecture of the Assessment Module and Click Module was the same; the RNN in both modules contained seven hidden layers with 256 units. Leaky ReLU was applied as the activation function after each fully-connected layer, except the last layer in the Prediction Module. Adam [36] was used as the optimizer, and each simulation ran for 250 steps, with the learning rate set to 0.00002 and a batch size of 256 (students).
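The label handling and the 80/20 split can be sketched as follows. The label strings, the use of 1 for the at-risk (fail) class, and the seeded random shuffle are assumptions made for illustration; the text does not state how the historical data were partitioned.

```python
import random

# Assumed label strings; 'Withdrawn' is deliberately absent so it is dropped.
LABEL_MAP = {"Pass": 0, "Distinction": 0, "Fail": 1}


def binarize(records):
    """Keep pass/distinction/fail students and map them to 0 (pass) or 1 (fail)."""
    return [(sid, LABEL_MAP[result]) for sid, result in records if result in LABEL_MAP]


def split_train_val(items, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed and split into training and validation parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


results = ["Pass", "Fail", "Withdrawn", "Distinction", "Pass", "Fail",
           "Pass", "Withdrawn", "Fail", "Pass", "Distinction", "Pass"]
labelled = binarize(enumerate(results))
train, val = split_train_val(labelled)
```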

Evaluation with Baseline
We compare the simple RNN with the GRU and LSTM as the sequence model used in the Assessment Module and Click Module. Traditional machine learning methods are not included because they cannot be trained and tested on data of varying length. The following comparison and discussion are based on results averaged over all predicted courses.
After training and validation, the model was evaluated on the test data. To predict at week k, the demographics, the click data before the k-th week, and the assessment data for each deadline before the k-th week are fed into the model. As shown in Figure 3a, predictions are made on the test set with 5 to 39 weeks of data, and with additional weeks the models predict performance more accurately. Student achievement is difficult to determine from behavior at the very beginning of a course, when little data is available, and none of the models obtained promising results at that stage. The RNN-based model achieved an average accuracy of 60% in the 5th week, and the GRU-based model a somewhat lower 53%. In contrast, the LSTM-based model failed to predict accurately in the early stage and obtained an accuracy of less than 40% in the 5th week. As courses progress and more weeks of interaction and assessment data become available, the accuracy of the RNN-based model rose from 60% at the 5th week to over 90% at the 39th. The GRU-based model predicted less accurately than the RNN-based one before the 10th week but achieved a much better result in the middle of the course, and it also reached 90% in the last week. The LSTM-based model performed much worse than the other two methods at all course stages, rising from around 40% at week 5 to only 85% in the last week.
Since the vanilla RNN cannot capture long-term dependence, it failed to make good use of long-term interaction data; it paid less attention to the initial data and performed worse during the middle of courses. Still, it achieved a relatively good result at the beginning of courses, when it was able to discover information in a small amount of stream data. LSTM uses gates and a cell state to preserve more long-term historical details. It tends to rely more heavily on long-term historical data than GRU and cannot predict with high accuracy from limited historical data. In contrast, the GRU structure is simpler than LSTM and shows less long-term dependence and some short-term dependence, which means it can focus on recent interaction data while still utilizing historical data.
As illustrated in Figure 3b, the RNN-based model converges faster than the other methods. LSTM and GRU are more complex than the basic RNN and have more weights to update, and they try to capture long-term relationships in the assessment stream and clickstream. However, such long-term relationships are only weakly present in the interaction data, and the outcome tends to be driven by the latest behavior, so these models need more epochs to learn hidden connections in the interaction stream. The GRU merges the input gate and forget gate of the LSTM into an update gate and omits the separate memory unit, so it is simpler than LSTM and converges faster.
Precision and recall are also frequently used to evaluate predictive models. In this task, precision is the proportion of students labeled as failures by the model who actually fail, while recall is the proportion of failed participants in the test data that the model identifies as at-risk. As shown in Figure 4a, the LSTM-based model presented the best average precision at all course stages, showing that it is best at avoiding labeling students who will pass the course as at-risk. The other methods had lower precision at the beginning and achieved higher scores as courses went on. Although LSTM showed better precision, recall is more important in our task, since we want all at-risk students to be detected as early as possible so that teachers can give them guidance to help them pass the course. Figure 4b shows that both the RNN-based model and the GRU-based model outperformed the LSTM-based one in recall throughout the courses; the vanilla RNN scored higher before week 15, and the GRU performed better than the simple RNN after week 15. After the 35th week, the simple RNN model and the GRU model had a recall of over 0.85, meaning that these two models can identify more than 85% of at-risk students. The following evaluation is calculated on the average metric score over all the mentioned test datasets.
Additionally, different sources of data (assessment stream, demographics, and clickstream) are compared using the GRU-based model. As illustrated in Figure 5, with more stream data applied, the model can identify at-risk students more accurately. Because assessment and click information is sparse in the early stage of courses, and the resulting less relevant stream data introduces noise into the Prediction Module, models using these streams perform much worse than the model that uses only demographics before the 15th week. After 15 weeks, models that use assessment data outperform those that use click data, because assessment performance is more closely related to students' outcomes.
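The week-k evaluation protocol amounts to truncating each stream before it reaches the model. The sketch below is one reading of that protocol and rests on assumptions: "before the k-th week" is taken to mean the first k weeks of clicks, assessment deadlines are given in days from the course start, and the function name and data layout are hypothetical.

```python
def inputs_at_week(demographics, weekly_clicks, assessments, k, days_per_week=7):
    """Return the inputs visible after k weeks of a running course.

    weekly_clicks: list with one entry per week (already zero-filled).
    assessments: list of (deadline_day, score) pairs sorted by deadline.
    """
    clicks_k = weekly_clicks[:k]                 # first k weeks of interaction
    cutoff_day = k * days_per_week
    scores_k = [score for deadline_day, score in assessments
                if deadline_day < cutoff_day]    # only deadlines already passed
    return demographics, clicks_k, scores_k


demo = [1.0, 0.0]
weekly = [[5, 0], [0, 0], [3, 1], [0, 2], [4, 4], [1, 0]]
assess = [(10, 80.0), (24, 0.0), (40, 65.0)]
d, c, s = inputs_at_week(demo, weekly, assess, k=4)
```

Demographics are static, so they pass through unchanged, while both streams grow as k increases; this is why the accuracy curves are reported as a function of the week.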
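With 'fail' as the positive class, the two metrics can be computed as in this short sketch; the encoding of fail as 1 and the toy labels are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (at-risk) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = student actually failed
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 1 = model flagged the student as at-risk
precision, recall = precision_recall(y_true, y_pred)
```

In this toy case two of the three flagged students actually fail (precision 2/3), but only two of the four failing students are flagged (recall 0.5), which is the kind of miss that matters most for early intervention.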

Implication of Results
This research aimed to identify participants at risk of failure in VLEs at an early stage. We compared different deep time-series models in predicting students' performance based on their click behavior and assessment history in the VLE. Experimental results showed that the most complicated model, the LSTM-based one, achieved worse predictive performance than the simple RNN-based and GRU-based models, especially in the early stages, meaning that excessive long-term dependencies were less useful in predicting students' outcomes. Those two models also converged quickly and required fewer memory resources than the LSTM-based model. Our predictive method enables online learning platforms to use historical interaction data to identify students at risk of failure in all ongoing courses. It achieves better accuracy and recall scores as courses go on. These predictions help administrative authorities, the educational community, and teachers to support at-risk participants as early as possible, helping them pass the course in the end.

Conclusions and Future Work
Students' demographic features and their time-series logs are both valuable information sources for at-risk student identification. Existing studies have applied various traditional machine learning models and deep learning techniques and achieved promising prediction results. However, they failed to use historical course information for current course prediction. In this study, we formulate this problem as a sequence modeling task and propose a novel joint neural network that combines sequential features with static features. Experimental results show that the proposed method makes good use of assessment and clickstream data and performs well when identifying at-risk students.
In the future, a unified time-varying deep neural network model is an interesting research direction that could remove the need to combine two different models.