Predicting At-Risk Students Using Clickstream Data in the Virtual Learning Environment

Abstract: In higher education, predicting the academic performance of students is associated with formulating optimal educational policies that strongly impact economic and financial development. In online educational platforms, the captured clickstream information of students can be exploited to ascertain their performance. In the current study, the time-series sequential classification problem of students' performance prediction is explored by deploying a deep long short-term memory (LSTM) model using the freely accessible Open University Learning Analytics dataset. In the pass/fail classification task, the deployed LSTM model outperformed the state-of-the-art approaches with 93.46% precision and 75.79% recall. Encouragingly, our model surpassed the baseline logistic regression and artificial neural networks by 18.48% and 12.31%, respectively, with 95.23% learning accuracy. We demonstrated that the clickstream data generated by students' interaction with online learning platforms can be evaluated at a week-wise granularity to improve the early prediction of at-risk students. Interestingly, our model can predict the pass/fail class with around 90% accuracy within the first 10 weeks of student interaction in a virtual learning environment (VLE). A contribution of our research is an informed approach to advanced higher education decision-making towards sustainable education. It is a bold effort for student-centric policies, promoting the trust and the loyalty of students in courses and programs.


Introduction
The abundance of available educational data provides opportunities to utilize it for various purposes, such as tapping the learning behaviors of the stakeholders involved, improving these behaviors by addressing the issues, and optimizing the learning environment [1]. With educational data readily accessible, several research communities have shown noticeable interest in predicting students' patterns and extracting meaningful insights from these patterns. Such information extraction is not confined to the data mining community: new communities have emerged that focus not only on improving students' performance but also on optimizing the learning environment as a whole, a field referred to as learning analytics [2]. Educational data, accumulated through the interaction between learners and instructors, has become a multidisciplinary field of study involving researchers from various research communities, which has led to the introduction of numerous terms associated with the exploration of educational data, such as academic analytics, predictive analytics, and learning analytics [3].
With the emergence of the learning analytics research community, much emphasis has been laid on the investigation of students' behavior, assembling methods to improve understanding to
• Firstly, we intended to leverage deep learning models by transforming the dataset into a sequential format by assembling students' engagement with the virtual learning environment (VLE) weekly.
• Secondly, we delivered an understanding of the behavior of students at risk of failure, contributing to decision-making policies for devising early intervention strategies that improve student performance and reinforce student retention.
• Lastly, we ascertained the effectiveness of the deployed deep LSTM model in the early prediction of at-risk students compared to conventional approaches.
The organization of this paper is as follows. Section 2 briefly discusses the existing studies relating deep learning to various online learning platforms. Section 3 presents the data and the methods deployed for its analysis. Section 4 presents the experimental setup and results, their discussion, and evaluation. The concluding remarks, with the limitations and future directions, are presented in the final section.

Literature Review
In several studies, the problem of predicting the academic performance of students is framed either as a regression problem, where the learners' prospective scores are predicted, or as a classification problem, where a learner's final result is predicted in terms of pass, fail, or dropout. The behavior of students in online learning platforms differs from that in traditional classroom settings: learners' inherent motivation to succeed and excel plays a significant role in their performance, making them solely responsible for their good or bad performance [17]. In the literature, a substantial number of studies emphasize the prediction of student performance by applying various data analytic techniques, highlighting various factors impacting a learner's performance and categorizing different features contributing to low student retention [18]. In online educational platforms, the attribute of time is considered the vital feature impacting learners and their performance, followed by the incoherent support provided by the instructors. Moreover, effective curriculum content is paramount to driving an intrinsic motivation in learners, encouraging positive and active participation and ultimately influencing a learner's performance and intention to complete a particular program [19,20].
Another dimension involves predicting students' grades for potential courses in the next term [5,21]. Morsy et al. [22] characterized the course space as a latent vector and suggested regression models for next-term grade prediction. Similarly, another study predicted course-specific next-term grades through Markovian models [4]. Marbouti et al. [23] deployed logistic regression to assess students at risk of failure by incorporating attributes of their attendance, quizzes, and examination behavior. They identified the students at risk of low performance in several weeks of the courses, and their predictions improved for the final weeks. Furthermore, logistic regression was also applied as the baseline evaluation technique to predict at-risk students [24]. The previous history of students, encompassing past grades in previous courses, assessments, and entry tests, is also an essential element in classifying and predicting an individual's performance [25].
The connection between the learning analytics community and deep learning techniques is still in its preliminary stages, with little evidence observed in the existing literature. Deep learning, constituting several non-linear layers, enables self-adaptive models through hierarchical representation, where each layer passes the learned information, in a more abstract form, to the successive layers [26]. Corrigan & Smeaton [13] deployed a variation of recurrent neural networks (RNN) to predict the success of students by incorporating interactional activities of students with the VLE. Their deployed deep learning approach outperformed the traditional baseline approach. Similarly, another study predicted the success rate by including student attendance and tapped student behavior through log data information [27]. They predicted the grades of students, including their engagement patterns and interactions with the VLE, through the application of RNN and LSTM models. Through the deployed sequential model, they intended to predict the grades of students and identify those at risk at early stages. The deployed technique was compared with conventional regression analysis and was found to be more effective in the early prediction of grades. Fei and Yeung [28] employed a feature set consisting of the lectures watched and downloaded, assessment scores and attempts, forum activities, and the number of comments posted on forums to predict student performance and assess at-risk students. They implemented an array of techniques, such as support vector machines, logistic regression, the input-output hidden Markov model, RNN, and LSTM, and found LSTM to surpass the other techniques.
An exciting dimension in the educational community has been the use of virtual reality in the learning context of distance education, facilitating teacher-to-student interaction more conveniently and elaborately [29]. This implicitly assists instructors in customized teaching practices for different categories of students. Furthermore, Kohler provided a theoretical framework for effective teaching to produce optimized learning outcomes in traditional and blended classroom settings [30]. E-portfolios are another tool established to facilitate the instructors in personalizing their teaching methods and supporting student interventions [31]. Overall, in a VLE, we only found a limited number of studies embarking on the adoption of deep learning techniques to evaluate and comprehend student behavior in a rigorous manner. Our study leverages the power of deep learning for the early prediction of at-risk students in a VLE.

Data and Experimentation
The procured Open University Learning Analytics dataset (OULAD) constituted student demographics, clickstream history, and assessment submission information of 32,593 students over a course duration of 9 months, from 2014 to 2015 [32]. The data were composed of several courses, with each course being taught at different intervals in a year. Four distinct performance classes were defined: distinction, pass, fail, and withdrawal.
The OULAD comprised students' information regarding their interaction with the VLE: their assessments, quizzes, and course performances. The interaction with the VLE was further categorized into 20 different activity types, with each activity referring to a specific action, such as downloading or viewing lectures, course content, or quizzes. The names of these activity types are as follows: dataplus, dualpane, externalquiz, folder, forumng, glossary, homepage, htmlactivity, oucollaborate, oucontent, ouelluminate, ouwiki, page, questionnaire, quiz, repeatactivity, resource, sharedsubpage, subpage, and url.
The average clicks per student were aggregated weekly to visualize the students' weekly interactions. Figure 1 depicts the number of clicks for the two classes, pass and fail, where the aggregated activities for each class were normalized by the number of students in that class. It can be observed that the two classes are separated in terms of the interaction level in the VLE, with 'pass' instances being more active than 'fail' instances.
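The weekly, per-student normalization described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the record layout (student, day, activity type, clicks), the toy values, and the helper name `weekly_mean_clicks` are assumptions, and days are mapped to weeks by integer division by 7.

```python
from collections import defaultdict

# Hypothetical click log: (student_id, day_of_course, activity_type, clicks).
# The field layout and the values are illustrative assumptions.
LOG = [
    ("s1", 0, "homepage", 4),
    ("s1", 3, "quiz", 2),
    ("s1", 8, "oucontent", 5),
    ("s2", 1, "homepage", 1),
    ("s2", 9, "forumng", 3),
]
RESULT = {"s1": "pass", "s2": "fail"}  # hypothetical final results


def weekly_mean_clicks(log, result):
    """Return {label: {week: mean clicks per student with that label}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for student, day, _activity, clicks in log:
        week = day // 7 + 1  # days 0-6 -> week 1, days 7-13 -> week 2, ...
        totals[result[student]][week] += clicks

    # Number of students in each class, used for the normalization.
    counts = defaultdict(int)
    for label in result.values():
        counts[label] += 1

    return {
        label: {week: total / counts[label] for week, total in weeks.items()}
        for label, weeks in totals.items()
    }


means = weekly_mean_clicks(LOG, RESULT)
```

Plotting `means["pass"]` and `means["fail"]` against the week index would give curves of the kind the text attributes to Figure 1.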


Data Preprocessing
The OULAD was procured in a raw structured format with several data files. The log-file data were processed to obtain features capturing the various actions that signify students' interactions with the VLE. These features were formulated by processing the provided data tables in the database. The data were computed in a week-wise manner, with each week comprising the same activity features and the same set of students; that is, students in week i were also present in week (i-1), and so on. Each student was identified with a unique ID in the data. Only 5.8% of the students took more than one course and hence appeared more than once; however, we did not intend to compare the academic performance of a student across multiple courses. Therefore, each student was identified by a unique ID for each course. Similarly, we did not intend to analyze student performance at a course-granular level; therefore, students repeating the same course were ignored. The unique student IDs were computed by combining their original IDs, the course taken, and the interval in which the course was presented. This study intended to analyze 'pass' and 'fail' instances, where the 'distinction' instances were merged with the 'pass' instances to form one class. Hence, the formulated data consisted of 22,437 instances.
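The ID construction and class merging described above might look like the sketch below. The row layout, the example values, the underscore-joined key format, and the function name `build_instances` are all assumptions made for illustration; dropping withdrawn students is inferred from the statement that only 'pass' and 'fail' instances were analyzed, and the handling of students repeating the same course is not shown.

```python
# Hypothetical rows: (student_id, course, presentation, final_result).
ROWS = [
    (101, "AAA", "2014J", "Pass"),
    (101, "BBB", "2014J", "Distinction"),
    (102, "AAA", "2014J", "Fail"),
    (103, "AAA", "2014J", "Withdrawn"),
]


def build_instances(rows):
    """Map a per-course unique ID to a binary 'pass' / 'fail' label."""
    instances = {}
    for student, course, presentation, result in rows:
        if result == "Withdrawn":
            continue  # assumed: withdrawn students are outside the pass/fail task
        # Unique ID = original ID + course + presentation interval.
        key = f"{student}_{course}_{presentation}"
        # 'Distinction' is merged into the 'pass' class.
        instances[key] = "fail" if result == "Fail" else "pass"
    return instances


instances = build_instances(ROWS)
```

Student 101 yields two separate instances here, one per course, which mirrors the decision to treat each student-course pair independently.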

Approach
The structure of deep learning methods is composed of many non-linear levels, where each level passes the learned representation, in a more abstract form, to its higher layers, consequently assisting in learning complex functions [33]. As opposed to conventional statistical approaches, such self-adaptive techniques effectively determine the underlying associations in the data by generalizing the input sequence and learning from it [34]. RNNs are designed to learn the long-term dependencies in sequential data through a recursive loop at each cell that allows the cell to take the previous input data into account along with the current input. RNN weights are updated by backpropagation through time, which transmits the error and the gradient over the whole sequence [35]. Though RNNs were designed for long-term dependencies, empirical research demonstrates the opposite. Due to the vanishing and exploding gradient problems, the error in an RNN can only be back-propagated over a short distance [36]. To resolve these issues, LSTM was introduced, where a memory cell is added to the network, enabling it to retain longer sequences.
An LSTM unit comprises three gates and a memory cell. At a particular time instance $t$, the input gate $i_t$ administers the writing of data into the LSTM unit, the forget gate $f_t$ controls the amount of past data to be retained, the memory cell $C_t$ is responsible for retaining the past information, and the output gate $o_t$ manages the representation of the delivered output. The forget gate $f_t$ at time $t$, responsible for administering the amount of data to be retained, is described mathematically in Equation (1):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1}$$

Further on, the input gate determines the relevant information to be retained by multiplying the candidate values computed from the input $x_t$ with the activation of the input gate, as shown in Equations (2) and (3):

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{2}$$

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{3}$$

A layer of LSTM constitutes multiple blocks, each with the required gates and memory component [37]. The memory component $C_t$ is updated at each interval, as shown in Equation (4):

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$

Three parameters (the input, the memory component, and the previous hidden state $h_{t-1}$) cumulatively update the current hidden state $h_t$ via the output gate $o_t$, as shown in Equation (5):

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t) \tag{5}$$

where $W_f$, $W_i$, $W_C$, $W_o$ and $b_f$, $b_i$, $b_C$, $b_o$ are the corresponding weights and biases for the gates, $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input, $\odot$ is the element-wise product, and $\sigma$ is the component-wise logistic sigmoid function. The tanh in Equations (2) and (5) is the activation function; in Equation (2), it computes the candidate values that are added to the memory component $C_t$.
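To make the gate computations concrete, the following is a minimal sketch of a single LSTM time step in plain Python, assuming the standard LSTM formulation (forget gate, input gate, candidate values, memory update, output gate). The parameter names (`W_f`, `b_f`, and so on), the list-based vectors, and the all-zero demonstration weights are illustrative assumptions, not the trained model from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and memory cell."""
    v = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], v)]      # forget gate
    c_hat = [math.tanh(z) for z in affine(params["W_C"], params["b_C"], v)]  # candidate values
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], v)]      # input gate
    # Memory update: keep part of the old memory, add part of the candidates.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], v)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # hidden state
    return h_t, c_t


# Demonstration with all-zero parameters (hidden size 2, input size 3):
# every gate evaluates to 0.5 and the candidate values to 0.
hidden, inputs = 2, 3
params = {name: [[0.0] * (hidden + inputs) for _ in range(hidden)]
          for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_C", "b_o")})

h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights the forget gate halves the previous memory, which makes the role of each term in the memory update easy to check by hand.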
The memory cell in LSTM blocks provides the model with a lookback window that is flexible enough to retain longer sequences, which ultimately assists in making decisions on the basis of both past and current input sequences [38]. The proposed study deployed a deep sequential model for the early prediction of at-risk students on the basis of their week-wise clickstream interactions with the VLE. For this purpose, the OULAD was processed to retrieve weekly clickstream data for each activity. The clickstream information for each student was recorded weekly, where the ith week consisted of the interactions and activities of students for that specific week, as illustrated in Figure 2, where S1, S2, ..., Sn represent the unique students, who are the same for all weeks. This week-wise stack of students forms a vector consisting of the appended sequence of weeks, which was passed to the sequential model. For each ith week, the sequence vector contained the data of all weeks up to and including the ith week, where i ranged from the first week to the 38th and last week of the course. Because the length of the week-wise vector depended on i, padding was applied to obtain vectors of equal length. These padded values were masked before being passed to the model, so that the model ignored them and did not learn from them. Thus, LSTM layers were implemented with a flexible lookback window, enabling early prediction for each ith week. Multiple LSTM layers were implemented in the architecture, depicted in Figure 2, assisting the model in learning the intricate details of the student engagement vector, with each layer passing its output to the next higher layer [39]. A layered architecture assisted the model in learning the inherent data representations more accurately.
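The cumulative week-wise sequences with padding and masking can be illustrated with the short sketch below. It assumes zero-valued padding appended after the observed weeks and a boolean mask marking the real time steps; the padding value, the padding side, the function name `sequence_until`, and the toy feature values are assumptions, since the text does not specify them.

```python
MAX_WEEKS = 38  # course length in weeks, as stated in the text
PAD = 0.0       # assumed padding value


def sequence_until(weekly_features, week, max_weeks=MAX_WEEKS):
    """Return (padded sequence, mask) built from weeks 1..week only.

    weekly_features holds one feature vector per week for a single student.
    """
    n_features = len(weekly_features[0])
    observed = [list(row) for row in weekly_features[:week]]
    padding = [[PAD] * n_features for _ in range(max_weeks - week)]
    # True marks a real time step; False marks padding to be ignored.
    mask = [True] * week + [False] * (max_weeks - week)
    return observed + padding, mask


# Toy student with two activity features per week over the full course.
weekly = [[float(w), 2.0 * w] for w in range(1, MAX_WEEKS + 1)]
seq, mask = sequence_until(weekly, week=5)
```

Every sequence has the same length regardless of the prediction week, so a single model input shape serves all weeks, while the mask tells the model which time steps carry real data.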

Experimentation and Evaluation
This section presents the experimental setup for the evaluation of the deployed deep LSTM model. To predict the students' academic behavior early, the problem was first converted to a binary classification problem using the 'fail', 'distinction', and 'pass' categories. A total of 22,437 instances were included, where the classes 'pass' and 'distinction' were merged into the 'pass' class. Three layers were implemented in the deep LSTM architecture, and each layer comprised between 100 and 300 units. Dropout was applied between the layers to reduce overfitting. This also inhibited the inter-dependency between neurons and enabled the model to learn more effectively and rigorously [40]. At each time instant $t_i$, the data vector for that instant was passed to the deep LSTM model. The Adam optimizer, an adaptive learning rate optimization algorithm designed specifically for training deep neural networks, was used with some hyper-parameter tuning, where the learning rate was set from 0.0 to 0.0001. To measure the difference between the actual and predicted values, binary cross-entropy was applied as the loss function. Because the problem of early prediction was framed as a binary problem, with a student either passing or failing a course, this loss function was a natural fit. The equation for binary cross-entropy is provided in Equation (7), with $p_i$ representing the likelihood of the actual data and $q_i$ representing the likelihood of the predicted data to pass or fail, over $N$ instances:

$$H(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ p_i \log q_i + (1 - p_i) \log (1 - q_i) \right] \tag{7}$$
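As a worked illustration of the loss, the sketch below evaluates the standard binary cross-entropy, averaged over instances, with p holding the actual labels and q the predicted probabilities as described above. The clipping constant `eps` and the example values are assumptions added to keep the logarithm finite and the example small.

```python
import math


def binary_cross_entropy(p, q, eps=1e-12):
    """Mean binary cross-entropy between labels p and predictions q."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        q_i = min(max(q_i, eps), 1.0 - eps)  # avoid log(0)
        total += p_i * math.log(q_i) + (1 - p_i) * math.log(1.0 - q_i)
    return -total / len(p)


labels = [1, 0, 1]  # hypothetical actual outcomes
loss_good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])  # confident and correct
loss_bad = binary_cross_entropy(labels, [0.2, 0.7, 0.4])   # mostly wrong
```

The first set of predictions gives a loss of about 0.145, while the second gives a clearly larger value, which is the behavior the optimizer exploits during training.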
The predictions of the deployed deep LSTM model tended to improve over the weeks, as illustrated in Figures 3 and 4: with additional weeks, the model gradually improved its prediction of academic performance. The accuracy is reported from week 5 to week 38 and increases over this range, as shown in Figure 3a. A particular behavior of students could not be determined in the initial weeks. Therefore, the model was deployed on the dataset from the 5th week onwards. Results are displayed for selected weeks to depict the refinement and improvement of the model. As previously mentioned, with additional history of students' interaction and their clickstream behavior, the model predicted the academic performance of each student (pass/fail) with an accuracy of 69.69% in the 1st week, 80.82% in the 5th week, and 95.23% in the last week. Figure 3a also depicts an increasing trend in the accuracy of the predictions after the 5th week, showing that the deployed deep LSTM performs better with accumulated clickstream data. From the 10th week, the model can predict a student as passing or failing with a reasonable accuracy of over 85%. Therefore, this pattern of learning is a vital determinant in the early prediction of students' academic performance. As accuracy increases with additional weeks, the loss values tend to decrease with additional week-wise information, indicating the robustness of the model, which increases as the student behavior is determined with additional week-wise engagement patterns. Note that Figure 3b shows the learning loss of the model.

Figure 4. Week-wise validation metric: precision and recall curves.
The precision and recall curves of the validation data, depicted in Figure 4, tend to improve with additional week-wise information. Precision was defined as the ratio of the at-risk students identified correctly to the total number of students identified as at-risk by the model. Recall was defined as the ratio of the students captured as at-risk by the model to the total number of at-risk students in the actual data. As illustrated in Figure 4, the precision and recall curves rose significantly after the 5th week, indicating the significance of additional information for the model. As the model was fed more information about students' interaction and their engagement activities, it tended to learn their behavior and produce better results. Precision improved from 59.36% in the 1st week to 93.46% in the last week. Similarly, recall improved from 60.99% in the 1st week to 75.79% in the last week. Moreover, the learning accuracy and loss for all the weeks, from the 1st to the 38th, are depicted in Figure 5, illustrating the improvement in the accuracy values with additional weeks and the week-wise decrease in the loss values, implying the robustness of the prediction model. These values were obtained at the 60th epoch of the trained model.
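The two definitions above translate directly into code. The sketch below treats 'fail' as the at-risk (positive) class; the label strings, the example predictions, and the function name `precision_recall` are illustrative assumptions.

```python
def precision_recall(actual, predicted, positive="fail"):
    """Precision and recall for the at-risk class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # at-risk, flagged
    fp = sum(a != positive and p == positive for a, p in pairs)  # not at-risk, flagged
    fn = sum(a == positive and p != positive for a, p in pairs)  # at-risk, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = ["fail", "fail", "fail", "fail", "pass", "pass"]
predicted = ["fail", "fail", "pass", "pass", "fail", "pass"]
precision, recall = precision_recall(actual, predicted)
```

In this toy case two of the three flagged students are truly at risk (precision 2/3), while only two of the four at-risk students are caught (recall 1/2), the same kind of gap between precision and recall that the reported figures show.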


Evaluation with Baseline
To evaluate the deployed deep LSTM model for the early prediction of at-risk students, several machine learning algorithms were deployed as baselines. Artificial neural networks (ANN), support vector machines (SVM), and logistic regression (LR) have been frequently adopted in the educational research community for evaluating other proposed models [23]. Aggregated data were prepared for the machine learning algorithms, where, for each ith week, the data were aggregated over all weeks up to and including that week. A week-wise flat vector was hence computed for each student and passed to the model. The results in comparison to LSTM are illustrated in Figure 6, where W5, W10, W20, W30, and W38 represent the 5th, 10th, 20th, 30th, and 38th weeks, respectively. It can be observed that LSTM performed significantly better than the baseline models in predicting at-risk students.
LSTM works well for sequential data; hence, it efficiently analyzed the behavior of students in a weekly manner and produced better results than the baseline techniques. Because the conventional non-sequential models worked with the aggregated data, which comprised the entirety of information up to that particular time, such models were unable to analyze the behavior at the level of time granularity. With the complete week-wise information aggregated into single values, these models were unable to learn the weekly behavior or predict the at-risk students from their collective interactions aggregated into a vector. Such aggregation hindered the predictive ability of the baseline models; in contrast, the sequential model, with a good fit, captured the learning behavior of students and efficiently predicted the at-risk students on the basis of their interactions and engagement patterns.
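The loss of temporal information under aggregation can be seen in a small example. The sketch below sums each activity feature over the weeks up to the prediction week; summation is an assumed aggregation (the text does not state whether sums or averages were used), and the function name `aggregate_until` and the toy students are hypothetical.

```python
def aggregate_until(weekly_features, week):
    """Sum each activity feature over weeks 1..week into one flat vector."""
    return [sum(column) for column in zip(*weekly_features[:week])]


# Two toy students with the same total clicks but opposite timing:
# one is active only in week 1, the other only in week 3.
early_active = [[10, 0], [0, 0], [0, 0]]
late_active = [[0, 0], [0, 0], [10, 0]]

flat_early = aggregate_until(early_active, week=3)
flat_late = aggregate_until(late_active, week=3)
```

Both students collapse to the same flat vector, so a non-sequential baseline cannot tell them apart, whereas a sequential model receives the week-by-week rows and can.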

Figure 6. Evaluation with the baseline techniques. LR: logistic regression, ANN: artificial neural network, SVM: support vector machine. W5, W10, W20, W30, W38, represent the 5th, 10th, 20th, 30th, and 38th weeks, respectively.

Implication of Results
This research addressed the critical concern of identifying students at risk of failure. Different computational methods were deployed to predict the students at risk of failure on the basis of their behavior and engagement patterns with the VLE. The deep LSTM improved the prediction of students' outcomes and can assist the educational community in developing guidelines for helping at-risk students. Such behavioral models pave the way for administrative authorities to formulate policies and strategies to implement timely interventions, regulate the decision-making process, and ultimately assist students through the provision of support systems. Moreover, such settings will also help establish guidance committees and regular student counseling sessions for maintaining a motivational infrastructure and tapping the behavior of students for data-driven decision-making processes.

Concluding Remarks, Limitations, and Future Extensions
This study examined at-risk students by converting the problem into a sequential weekly format and measuring the effectiveness of the deep sequential model versus the conventional machine learning baseline models. We intended to deliver an understanding of the behavior of students at risk of failure, contributing to decision-making policies for devising early intervention strategies that improve student performance and reinforce student retention. The deep LSTM monitored the sequential week-wise pattern of students and their activities, performing better than conventional classifiers that deal with students' interactions in a collective and aggregated manner. Such early predictions will enable institutions to intervene in a timely manner for the students at risk by providing them a support system through counseling and alert emails. Such data-driven analysis will also assist decision-makers in formulating optimal policies for students' success on the basis of their behavior and interaction patterns. The deployed criteria for the early identification of at-risk students will assist the educational community in capturing their activities and behavior for student retention by intervening on time and providing pedagogical support through the formation of guidance committees and corrective strategies.
This study does not account for the variation in the performance of students who repeat their courses. Therefore, analyzing their behavior is another dimension that requires sufficient data on such students. Similarly, a course-level analysis is also required to capture the behavioral differences of the same students in different courses and identify the elements that influence their academic performance. Because the deployed dataset did not have sufficient course-level records, this is a limitation of our study. Moreover, analyzing the behavior of repeating students in one course and examining the differences between their first and second attempts is another crucial area of research. In the future, we plan to enrich our model prediction by including students' assessment scores and analyzing the association between their assessment submission pattern and performance.
Furthermore, we also seek to investigate activities having an influential impact on performance by mining textual data [41][42][43] of students' feedback, employing advanced deep learning [44] and natural language processing techniques [45]. A framework catering to the prominent attributes, both extrinsic and intrinsic, associated with students' performance may enable the learning analytics community to move towards more effective decision-making systems. Moreover, analyzing students' behavior on a day-to-day basis is another dimension of interest that will assist the educational community in identifying the most influential phase in which students tend to demonstrate positive performance.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.