
Customized Rule-Based Model to Identify At-Risk Students and Propose Rational Remedial Actions

Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirates University, Al Ain 15551, United Arab Emirates
Big Data Analytics Center, United Arab Emirates University, Al Ain 15551, United Arab Emirates
Department of Computer Science, College of Computing and Informatics, University of Sharjah, Sharjah 27272, United Arab Emirates
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2021, 5(4), 71;
Submission received: 3 October 2021 / Revised: 13 November 2021 / Accepted: 24 November 2021 / Published: 29 November 2021
(This article belongs to the Special Issue Educational Data Mining and Technology)


Detecting at-risk students offers substantial benefits: improved student retention rates, effective enrollment management, alumni engagement, better-targeted marketing, and greater institutional effectiveness. The success of educational institutions depends in part on the accurate and timely identification and prioritization of the students who require assistance. The main objective of this paper is to detect at-risk students as early as possible so that appropriate corrective measures can be taken, considering the most important and influential attributes in students’ data. This paper presents a customized rule-based system (RBS) that identifies and visualizes at-risk students early in the course delivery using a Risk Flag (RF). It can also serve as a warning tool that helps instructors identify students who may struggle to grasp the learning outcomes. The module provides the instructor with a dashboard that graphically depicts students’ performance in the different coursework components. An at-risk student is flagged, and remedial actions are communicated to the student, the instructor, and other stakeholders. The system suggests remedial actions based on the severity of the case and the time at which the student is flagged. It is expected to improve students’ achievement and success, and it could also have positive impacts on under-performing students, educators, and academic institutions in general.

1. Introduction

Modern learning institutions, notably higher education institutions, operate in a highly competitive and complex environment shaped by the availability of the World Wide Web. Many online courses allow students to study after hours or to learn something entirely new through e-learning platforms such as intelligent tutoring systems (ITS), learning management systems (LMS), and massive open online courses (MOOCs). This competitive environment creates both opportunities and pressures for the survival of higher education institutions. Despite these advancements in the education field, universities face increased student dropout rates, academic underachievement, graduation delays, and other persistent challenges [1,2]. Therefore, most institutions focus on developing automated systems for analyzing their performance, providing high-quality education, formulating strategies for evaluating students’ academic performance, and identifying future needs.
Globally, particular attention was focused on the concept of special education just over ten years ago, immediately after the 48th session of the International Conference on Education, held in Geneva in 2008 under the theme “Inclusive Education: The Way of the Future”. The conference focused on providing education to the hundreds of people worldwide with special needs who had little or no access to educational opportunities, or who are at risk of dropping out due to their mental, physical, or emotional state [3]. The long-term goal was to assist the Member States of UNESCO in providing the political and social education that every individual requires to exercise their human rights and avoid fifth-generation warfare [4]. Globally, a significant concern of education stakeholders, such as policymakers and practitioners in the educational domain, is the rate of student dropout [5,6,7].
Chung and Lee [8,9] suggested that students who drop out add to social costs through conflicts with peers and family members, lack of interest, poverty, antisocial behavior, or an inability to adapt to society. OuahiMariame et al. [10] argued that students’ academic performance is the primary concern of educational institutions. Failing to sustain, let alone improve, the retention rate negatively impacts students, parents, academic institutions, and society as a whole [11]. This makes student performance a vital factor that must be enhanced by all necessary means.
Educational data mining (EDM) can help to improve students’ academic performance in academic institutions [10]. One of the most valuable ways to increase the student retention rate in any academic institution is monitoring students’ progress, identifying students with special needs, and detecting at-risk students for early and effective intervention [12]. According to OuahiMariame et al. [10], EDM can help predict student achievement in advance, which facilitates academic performance tracking. Early identification of such students prompts educators to propose relevant remedial actions and effectively prevent unnecessary dropout [5,13]. Thus, the identification of at-risk students has been a significant focus of several studies [12,14,15,16,17,18].
A plethora of work can be found on the subject. For many years, government officials, policymakers, and the heads of educational institutions have sought a robust mechanism to assist teachers in identifying special or at-risk students; examples such as [5,6,12] are prominent. However, most complex methods require expensive hardware, are extremely data-dependent, and make predictions only. What is lacking is a robust, low-complexity system that can serve as a warning system for teachers to identify students with special needs without relying on complex databases and expensive hardware. Moreover, since the end goal is to ensure that every student gets an equal opportunity for education and a better future, a precise warning system with few false detections and high accuracy should be made available at low cost. To the best of our knowledge, no existing rule-based system satisfies all of these constraints for student monitoring.
In this paper, we propose a rule-based model that directly considers the student’s achievement in each assessment component as it is received. It adopts a dynamic approach that is visualized, adaptive, and generalizable to any dataset. The proposed solution will allow universities to evaluate and predict students’ performance (especially of those at risk) and to develop plans to review and enhance the alignment between the planned, delivered, and experienced curriculum. The objectives of this paper are as follows:
  • Propose a customized rule-based model that enables instructors to identify at-risk students and take appropriate remedial actions.
  • Propose a warning system for instructors to identify at-risk students and offer timely intervention using a visualization approach.
The rest of the paper is organized as follows. Section 2 reviews the literature on identifying students’ performance. Section 3 describes the dataset in terms of data collection and pre-processing, and provides implementation details of the exploratory data analysis. Section 4 presents the proposed customized rule-based model to identify at-risk students, together with a visualization of the results. Section 5 offers discussion and future work. Finally, Section 6 concludes the paper.

2. Literature Review

Accurate prediction of student performance, as highlighted in [12,19], helps instructors identify under-performing students who require additional assistance. Early prediction systems for at-risk students have been applied successfully in several educational contexts [20]. The authors in [12] proposed and evaluated an intuitive approach to detecting academically at-risk students using log data retrieved from learning management systems (LMS), which have commonly been used to identify at-risk students in learning institutions [21,22]. Students use the LMS to manage classes, take online exams, communicate with the instructor, use an e-portfolio system, and access self-learning content. A private liberal arts university in Japan uses an LMS across all of its programs; whenever a student operates the system, a log file containing the student ID, the date of access, and the type of activity is recorded. All log files collected between 1 April and 5 August 2015 were considered, giving a total of 200,979 records. The level of LMS usage is expected to indicate the level of students’ commitment to learning.
Meanwhile, the authors of [5] sought to develop an early-warning system for spotting at-risk students using eBook interaction logs. The system considers digital learning materials such as eBooks a core instrument of modern education. The data were collected from Book-Roll, an eBook service that over 10,000 university students in Asia use to access course materials. The study involved 65,000 click-stream records from 90 first-year university students taking an elementary informatics course. A total of 13 prediction algorithms, trained on data retrieved within the 16 weeks of course delivery, were used to determine the best-performing model and the optimum time for early intervention. The models were tested using 10-fold cross-validation. The findings showed that all models attained their highest performance with the data from the 15th week, and that students could be successfully classified with an accuracy of 79% as early as the third week of the semester.
Similarly, Berens et al. [6] studied the early detection of at-risk students using students’ administrative data and machine learning (ML) techniques. They developed a self-adjusting early detection system (EDS) that can be executed at any time in a student’s academic life. The EDS uses regression analysis, neural networks, decision trees, and the AdaBoost algorithm to identify essential student characteristics that signal potential dropouts. The EDS was developed and tested at two universities: (1) a state university (SU) in Germany with 23,000 students offering 90 different bachelor programs, and (2) a private university of applied science (PUAS) with 6700 students and 26 bachelor programs. At the end of the first semester, the prediction accuracy for SU and PUAS was 79% and 85%, respectively, improving to 90% for SU and 95% for PUAS by the end of the fourth semester. Thus, the utilization of readily available administrative data is a cost-effective approach to identifying at-risk students. However, this technique may not identify the difficulties faced by these students, which is needed for successful intervention.
At the same time, Aguiar et al. [9] proposed a framework to predict at-risk high school students. Through a partnership with a large school district, they gained access to an extensive dataset of 11,000 students and used logistic regression for classification, Cox regression for survival analysis, and ordinal regression methods to identify and classify which students are at high academic risk. Hussain et al. [15] investigated the most suitable ML algorithms for predicting student difficulty based on the grades students would earn in subsequent sessions of a digital design course. This study developed an early-warning system that allows teachers to use technology-enhanced learning (TEL) systems to monitor students’ performance in problem-solving exercises and laboratory assignments. The digital electronics education and design suite (DEEDs) logged input data while the students solved digital design exercises of different difficulty levels. The results indicated that artificial neural networks (ANNs) and support vector machines (SVMs) achieved higher accuracy than the other algorithms and can easily be integrated into the TEL system [12].
Similarly, Oyedeji et al. [13] tested three models: linear regression for supervised learning, linear regression for deep learning, and a neural network with five hidden layers. The data were sourced from Kaggle and included personal and demographic features. Linear regression showed the best mean absolute error (MAE) of 3.26. The study of Safaa et al. [21] investigated students’ interaction data in an online platform to establish whether students’ end-term academic performance could be determined weeks earlier. The study involved 76 second-year university students taking a computer hardware course and evaluated the algorithms and features that best predict end-term academic performance, comparing different classification algorithms and pre-processing techniques. The key result was that the K-nearest neighbors (KNN) algorithm predicted 89% of the unsuccessful students at the end of the term; three weeks earlier, these unsuccessful students could be predicted at a rate of 74%. A later study by the same authors [23] investigated a dataset of university students from various majors (49,235 records with 15 features); however, there was no clear emphasis on the predictive model. Although the findings of Gökhan et al. [21] identified essential features for an effective early-warning system, the method is specific to online learning systems, and the sample size evaluated is small compared to the number of students taking online courses.
Meanwhile, Chung and Lee [8] used a random forest to predict the dropout of high school students, and the model achieved 95% accuracy. The data were sourced from the Korean National Education Information System; the dropout status was derived from the “school register change” and “the reason for dropout” fields in the dataset, and 12 features were used to predict it. At the same time, Gafarov et al. [24] experimented with various machine learning methods, including neural networks, to predict students’ academic performance. They compared the accuracy of the models with and without the inclusion of the Unified State Exam (USE) results. The suggested models performed relatively well, except for Gaussian Naive Bayes. They also tested a variety of neural network architectures, the best of which achieved an accuracy of 90.1%.
The researchers in [19,25] experimented with student performance in multiple courses to predict the final GPA from students’ grades in previous courses. They used a decision tree classifier fed with transcript data; the classification rules obtained helped to identify the courses that have a significant effect on the GPA. Similarly, Iqbal et al. [26] adopted three ML approaches, collaborative filtering (CF), matrix factorization (MF), and restricted Boltzmann machines (RBM), to predict students’ GPA. They also proposed a feedback model to estimate a student’s understanding of a specific course, using the hidden Markov model, which is widely used to model student learning. Their dataset was split into 70% for training the model and 30% for testing. The performance of the ML-based classifiers was RMSE = 0.3, MSE = 0.09, and mean absolute error (MAE) = 0.23.
Data mining techniques are also considered relevant analytical tools for decision-making in the educational system. The contribution of Baneres et al. [7] is twofold. First, a new adaptive predictive model is presented, based only on students’ grades and trained explicitly for each course; a deep analysis was performed across the whole institution to evaluate its predictive accuracy. Second, an early-warning system was developed that provides dashboard visualizations for stakeholders and an early feedback prediction system for early intervention in identifying at-risk students. A case study based on a first-year undergraduate course in computer science was used to evaluate the early-warning system. They demonstrated accurate identification of at-risk students, the students’ appraisal, and the most common factors that lead to the at-risk level. There is a consensus among researchers that the earlier the intervention, the better for the success of the at-risk student [27]. The investigation of Mduma et al. [28] is critical to this study, as it collected, organized, and synthesized existing knowledge related to ML approaches for predicting student dropout.
Several researchers have shown that understanding student performance through EDM can help disclose at-risk students [2,11,29,30]. The authors in [31] proposed predictive models derived from different classification techniques based on nine available remedial courses. In the study of OuahiMariame et al. [10], students’ success in an online course was predicted using data from the Open University Learning Analytics (OULA) dataset. Multiple feature selection techniques (SFS, SFFS, LDA, RFE, and PCA) were tested with different classification methods (SVM, Naive Bayes, Random Forest, and ANN). The best accuracy, 58%, was achieved by the combination of SFS and Naive Bayes.
Aggarwal et al. [32] compared prediction results with and without exploiting students’ demographic data. They ran the experiment on 6807 student records from a technical college in India. An ensemble predictor was proposed, and it showed a significant increase in performance when demographic data were used compared to when only academic data were used. The highest accuracy achieved was 81.75%.
However, the study leaves room for improvement, as the model could be extended to suggest a combination of remedial actions. Other work provides results for an early detection strategy for students in need of assistance, offering a starting point and clear guidance for applying ML to naturally accumulating programming process data. Using source code snapshot data recorded from students’ programming process, ML methods can detect high- and low-performing students with high accuracy after the first week of an introductory programming course. The authors compared their results to well-known methods for predicting students’ performance from source code snapshot data. This early information on students’ performance is beneficial from multiple viewpoints [33,34]: instructors can target their guidance to struggling students early on and provide more challenging assignments to high-performing students [35]. The limitation is that this solution is applicable to programming courses only.
A recent systematic review conducted in [36] indicated that various ML techniques are used to understand and overcome the underlying challenges of predicting students at risk and student dropout. Overall, the review achieved its objective of enhancing students’ performance through at-risk and dropout prediction, highlighting the importance of using both static and dynamic data. However, only a few studies proposed remedial solutions that provide in-time feedback to students, instructors, and educators to address the problems.

3. Methodology

This section describes the dataset gathered for this study and the pre-processing activities conducted to make the data more suitable for data mining and ML applications and tools. Finally, we use exploratory data analysis to study the correlations between the various attributes that could impact students’ performance.

3.1. Data Collection and Dataset Description

The data were collected from a programming course taught to undergraduate students at the College of Information Technology (CIT), United Arab Emirates University (UAEU). Students must take this course to fulfill their graduation requirements; students from other colleges may also take it as part of their academic study plans, mainly as a free elective. Due to the gender segregation policy at UAEU, each section is offered either to male or to female students. The data represent students’ performance in the introductory programming course for the academic periods 2016/2017 (Fall and Spring), 2017/2018 (Fall and Spring), and 2018/2019 (Fall and Spring). The original dataset was collected from comma-separated value (CSV) files (one file per section), each consisting of a list of de-identified student records, one student per row.

3.2. Data Pre-Processing

Pre-processing was applied to the data to bring it to a format readable by the ML techniques. All files were first unified under one structure to overcome any inconsistencies. Because several instructors teach sections of the programming course, there are some differences in file structure, e.g., the number of quiz or homework components varies. To deal with this inconsistency, the maximum number of each assessment type was identified, and files were modified by creating additional columns for any missing checkpoints. The new columns were populated with the average of the corresponding assessment type. For example, the number of HWs varied between three and four across files; therefore, the “HW4” column was added to every file in which it was missing, and the average of the first three homework assignments was calculated and assigned to it. The overall average of the coursework component, the final course mark, and the final grade are not affected. Other columns were created to hold features needed for the mining process: the number of quizzes (NmQuiz) given in the section, the Gender of the student, the Instructor, the section number (SecNm), the academic period (Semester), and the Grade.
Afterward, the files were integrated into one CSV file using a Python script, and the marks were normalized using min-max normalization (rescaling each feature to the range [0, 1]) according to the following formula:
x′ = (x − x_min) / (x_max − x_min)      (1)
Finally, missing values were treated with imputation methods and substituted with the average value of the same coursework component. Once the pre-processing step is completed, the normalized dataset is stored in CSV format for further analysis.
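These steps can be sketched in plain Python; the record layout and helper names below are illustrative assumptions, not the authors’ actual script:

```python
# Sketch of the pre-processing described above, assuming each section file
# has already been parsed into per-student records (dicts) and per-checkpoint
# columns (lists). Column names HW1..HW4 follow the structure in Section 3.2.

def fill_missing_hw4(record):
    """If a section had only three homeworks, add HW4 as the mean of HW1-HW3
    so all files share one structure (overall averages are unaffected)."""
    if "HW4" not in record:
        record["HW4"] = sum(record[k] for k in ("HW1", "HW2", "HW3")) / 3.0
    return record

def minmax_normalize(column):
    """Rescale a column of marks to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def impute_missing(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]
```

For example, `minmax_normalize([5, 7.5, 10])` yields `[0.0, 0.5, 1.0]`, and `impute_missing([2.0, None, 4.0])` yields `[2.0, 3.0, 4.0]`.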
A typical data file structure is shown in Table 1, where:
  • C_i: the name of the checkpoint predefined earlier;
  • g_{i,j}: the grade of the j-th student at checkpoint C_i;
  • max(g_{C_i}): the maximum possible grade for checkpoint C_i;
  • m: the number of students;
  • n: the number of checkpoints in the course;
  • i, j: indices, i = 1, …, n, j = 1, …, m.
Here, we have homework components (HW_i, i = 1, …, 4; HWs = (1/4) Σ_{i=1}^{4} HW_i), quizzes (Qz_i, i = 1, …, 6; Qzs = (1/6) Σ_{i=1}^{6} Qz_i), the mid-term (MT) and final (Final) exam grades, and the total score (Total). Therefore, according to our data, C = {HW1, Qz1, HW2, Qz2, MT, Qz3, HW3, Qz4, Qz5, HW4, Qz6, Final, Total}. All checkpoints up to the final exam were used cumulatively as input variables to the model. This allows us to calculate the output variable, the risk flag (RF), each time new checkpoint values are fed to the proposed model.

3.3. Data Exploratory Analysis

We started our exploratory analysis by examining the data’s separability with respect to the total grade. For this, we employed the Mann–Whitney U test, as the features were non-normally distributed. As shown in Table 2, the performance of the students differs significantly among groups. The data are reported as the median and interquartile range for the “Total” columns and as median ± standard deviation for the “High risk” and “Low risk” groups. There are no significant differences between performance groups in terms of gender (p > 0.05).
To assess the predictive value of each feature, we evaluated the bi-variate (linear) correlation between every pair of attributes in the dataset using Pearson correlation coefficients. The heat-map in Figure 1 indicates a positive correlation between all attributes and the final course grade, which was expected. Darker colors on the heat-map correspond to higher correlation coefficients. As expected, the Midterm (MT), Quizzes (Qzs), and Final grades have correlation values close to one with the Total score. However, the correlation values of the homework assignments (HWs) are significantly lower. This may indicate that the homework tasks are not challenging enough for the students, and that the uniqueness of the answers is questionable, as almost all students score better on HWs (mean = 87.7%, std = 14%) than on the other checkpoints, such as Qzs (mean = 79.3%, std = 16.4%), MT (mean = 77.1%, std = 17.3%), and Final (mean = 57.5%, std = 23%), where the percentages denote the performance level out of 100%. We used a scatter-plot matrix to further explore the distribution of each attribute and the relationships between pairs of variables. As the total course grade is composed of four components (the average homework mark, the average quiz mark, the mid-term exam mark, and the final exam mark), we used all of these features to investigate their joint and univariate distributions.
Additionally, as normality is a prerequisite for many parametric statistical tests, we checked whether our data are normally distributed using the Shapiro–Wilk test. The test shows that the attributes are not normally distributed in the entire population (p < 0.05). Presumably, male and female students perform differently depending on the type of assessment. To check this, we mapped the plot aspects of male and female students to different colors. Figure 2 shows that male students perform slightly worse on HWs and Qzs but outperform female students on the mid-term and final exam scores. We must note that the data are not balanced by gender (M/F 21.1%/78.9%), so this tendency may change. We then employed the non-parametric counterpart of ANOVA, the Kruskal–Wallis test, to check whether the population medians of the gender groups are equal; this test is applicable to groups of different sizes. The test reveals significant differences between sexes (p < 0.05) for the final exam grade.
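These are standard tests (in practice, `scipy.stats.mannwhitneyu`, `shapiro`, `kruskal`, and `pearsonr` cover all four); for transparency, the two quantities used most heavily here, the Mann–Whitney U statistic and Pearson’s r, can be computed directly in a few lines:

```python
import math

def mann_whitney_u(xs, ys):
    """U statistic for two independent samples: the number of pairs (x, y)
    with x > y, plus half the tied pairs (the basis of the test in Table 2)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient, as used for the heat-map."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For instance, two perfectly linearly related columns give `pearson_r([1, 2, 3], [2, 4, 6]) == 1.0`.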
According to CIT practice, a satisfactory level of student performance should be greater than or equal to the cutoff value of 0.7. However, to efficiently identify and notify the instructor of students’ low performance, we suggest employing a heat-map technique and highlighting low performance levels with different colors (see Figure 3).
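The highlighting rule itself reduces to a small mapping from a normalized score to a display color. In this sketch the 0.7 cutoff is the CIT value stated above, while the extra orange/red band boundary (0.5) is an illustrative assumption:

```python
def performance_color(score, threshold=0.7):
    """Map a normalized checkpoint score to a heat-map highlight color.
    Scores at or above the threshold are unremarkable (green); the split
    of below-threshold scores into orange and red is illustrative."""
    if score >= threshold:
        return "green"
    if score >= 0.5:
        return "orange"
    return "red"
```

Applying it across a student’s row of checkpoint scores yields the per-cell colors of the heat-map.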

4. Customized Rule-Based Model to Identify At-Risk Students

Figure 4 illustrates the detailed flowchart of our proposed model. We start with data collection as the first phase, followed by data pre-processing to clean the data and make it ready to feed our customized model. The output results are then evaluated and visualized through a heat-map. Finally, a decision on whether to take remedial action is made.
To identify under-performing students, we designed a model that evaluates each student’s performance directly as assessment results are received. Figure 5 illustrates the model: a sequence of n input vectors x(t_i), i = 1, …, n, is fed to the model, which classifies students into two groups, at-risk and not at-risk, where x is any entity from the set of quizzes, homework assignments, and the mid-term exam, and t_i identifies the time at which the assessment was conducted. The instructor can change the set of entities according to the course and program requirements. The model’s output, o(t_i), is a rule-based outcome that analyses the value and type of the entity and returns the weighted value of the student being at risk. A threshold giving the cutoff level of risk is introduced to identify at-risk students; it is set to 0.7.
To determine when a remedial action should take place, a Risk Flag (RF) indicator is introduced. Whenever the student’s performance drops below the threshold, the cumulative value of RF is updated according to the following formula:
RF_i = RF_{i−1} + a_i · W_i,      (2)
where RF_0 = 0, W_i is the weight of the specific type of checkpoint x(t_i), and a_i is a weighted coefficient that increases the risk score when the student’s performance drops significantly. If RF exceeds the value of 1, the list of remedial actions (RA) is invoked, the corresponding RA counter (RA_count) is incremented by 1, and 1 is deducted from the current RF value (see formula (3)):
RA_count = RA_count + 1 and RF = RF − 1, if RF > 1.      (3)
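As a minimal sketch, the two formulas can be implemented as a single update step; the 0.7 threshold comes from the model description, while the specific weight and coefficient values passed in below are illustrative (in the prototype they are instructor-set JSON parameters):

```python
THRESHOLD = 0.7  # cutoff level of risk, set by the instructor

def update_risk(rf, ra_count, performance, weight, coeff):
    """One step of the rule-based model: apply formula (2) when the
    student's score at a checkpoint falls below the threshold, then
    formula (3) if the risk flag exceeds 1 (a remedial action is
    invoked and one unit of risk is released)."""
    if performance < THRESHOLD:
        rf = rf + coeff * weight   # formula (2): RF_i = RF_{i-1} + a_i * W_i
    if rf > 1.0:
        ra_count += 1              # formula (3): remedial action invoked
        rf -= 1.0
    return rf, ra_count
```

For instance, `update_risk(0.9, 0, 0.5, 0.3, 1.1)` mirrors the worked “Qz4” step discussed in this section: RF rises to 1.23, a remedial action fires, and RF falls back to 0.23.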
The pseudo-code of the proposed model is shown in Figure 6. The model is designed to be fully customizable by the instructor; therefore, the checkpoints, weights, weighted coefficients, the list of remedial actions at each checkpoint, and the threshold value are model parameters that can be changed at the initialization stage. The prototype of the proposed model is implemented in Python, where the instructor sets all customized parameters as JSON objects. The input to the model is the performance levels of m students in a comma-delimited CSV file, as shown in Table 1.


To visualize the at-risk students, the color map introduced earlier was employed. Green indicates no issues with the student’s learning curve, whereas a more orange color shows that the student may be struggling to grasp the learning outcomes and needs assistance (Figure 7). When the value of RF exceeds one, the list of remedial actions specified by the instructor is invoked and one is subtracted from the risk flag. This step gives the student some time to improve his/her marks and return to green on the map. Furthermore, we examine how the risk flag evolves with the upcoming checkpoints. Figure 3 shows the student performance values, and Figure 7 reports the risk flag value calculated at each checkpoint.
Let us illustrate the process of building the risk flag values. First, since our proposed model is customizable, the instructor sets weights for each checkpoint based on the importance of the input attribute for the final prediction. In this case, the “information gain” feature selection technique [37] was used, and the retrieved weights served as initialization values for the vector W. Thus, we obtained W_HW = 0.1, W_Qz = 0.3, W_MT = 0.7. The vector of weighted coefficients a was set to the values shown in Table 3; its values range from 1.0 to 1.5, reflecting the inverse dependency between students’ performance and the RF value: the lower the performance range, the higher the weighted coefficient a.
Using the proposed method, Table 4 illustrates the RF values calculated at every checkpoint for Stud_0009 using Formula (2). For example, at checkpoint "Qz4" the student's performance equals 0.5, meaning the student obtained 50% of the maximum grade for "Qz4". According to the risk condition, all grades below 70% are automatically considered potentially critical, and the RF recalculation procedure is invoked. The RF is calculated from its previous value: RF_Qz4 = RF_Qz3 + a · W_Qz = 0.9 + 1.1 × 0.3 = 1.23, where the value of the weighted coefficient a is taken from Table 3 and W_Qz is the weight of the quiz checkpoint, set by the instructor at the initialization stage of the model. Since the RF exceeded 1.0, the list of remedial actions is invoked, and the RF counter (which indicates the number of times remedial actions were invoked) is incremented by one. To keep the visualization technique valid, we subtract one from the RF value whenever it reaches one.
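The risk-flag recurrence described above can be sketched as a single update step. The function names are ours, the weights and coefficient ranges follow Table 3 and the worked example, and the coefficient of 1.5 for performance below 20% is an assumption inferred from the Qz5 row of Table 4 (x = 0 uses a = 1.5); the paper's own implementation may differ in boundary handling.

```python
def coefficient(perf):
    """Weighted coefficient a from Table 3: lower performance -> higher a.
    Performance below 20% is assumed to map to 1.5 (consistent with the
    Qz5 example, where x = 0 uses a = 1.5)."""
    if perf < 0.2:
        return 1.5
    if perf < 0.3:
        return 1.4
    if perf < 0.4:
        return 1.3
    if perf < 0.5:
        return 1.2
    if perf < 0.6:
        return 1.1
    return 1.0  # 60% <= perf <= 70%, per the Qz1 and Qz6 rows of Table 4

def update_rf(rf, counter, perf, weight, threshold=0.7):
    """One checkpoint step of the recurrence RF_new = RF_old + a * W.
    If RF reaches 1.0, remedial actions are invoked (counter incremented)
    and 1.0 is subtracted to keep the heat-map scale valid."""
    if perf > threshold:          # no risk: RF is carried over unchanged
        return rf, counter
    rf += coefficient(perf) * weight
    if rf >= 1.0:
        counter += 1              # remedial actions would be invoked here
        rf -= 1.0
    return rf, counter
```

Replaying the Stud_0009 trace from Table 4 through this step reproduces the reported values (RF = 1.23 at Qz4, triggering one remedial-action invocation, and RF = 0.98 after Qz6).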

5. Discussion and Future Work

Identifying at-risk students as early as possible is considered a key indicator of student success and a monitoring factor that warrants instructors taking appropriate remedial and corrective actions to mitigate such risk. To validate the proposed framework, we assessed the distribution of the total grade values with regard to the number of remedial actions. Figure 8 shows that a greater number of remedial actions corresponds to lower total grades. The linear relationship between the number of remedial actions and the total grade was also assessed with the Pearson correlation coefficient; the calculated value of 0.803 is statistically significant (p = 1.54 × 10^−50). Therefore, the proposed customized model can be used as an effective warning system to identify at-risk students at early stages. In future work, the framework presented in this paper will be incorporated into an interactive module that supports students' progress and success in their studies. The module allows the instructor to view a dashboard that graphically depicts the students' performance in the different coursework components. At-risk students will be distinguished (flagged), and remedial actions will be communicated to the student, instructor, and stakeholders. The system suggests remedial actions based on the severity of the case and the time at which the student is flagged. Moreover, we plan to collect qualitative and quantitative feedback from instructors about the usefulness of the system, focusing on ensuring that this work provides a useful framework for evaluating students. The system will be deployed with general checkpoints and will be easy for any instructor to use. We will make use of reinforcement learning to optimize the system's design and parameters according to instructor feedback.
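The validation statistic above is a standard Pearson product-moment correlation between the remedial-action counts and the total grades. A minimal sketch of the computation follows; the data here are synthetic placeholders for illustration, not the study's actual values.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Synthetic illustration of the observed pattern:
# more remedial actions invoked -> lower total grade.
actions = [0, 0, 1, 1, 2, 3, 4, 5]
totals = [0.95, 0.90, 0.82, 0.80, 0.72, 0.65, 0.55, 0.48]
r = pearson(actions, totals)  # strongly negative for this monotone data
```

In practice a library routine such as `scipy.stats.pearsonr` would also return the associated p-value directly.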

6. Conclusions

Student performance is an essential concern of higher learning institutions; one of the criteria for a high-quality university is an excellent record of academic achievement. The application of statistical and machine learning techniques to predict student outcomes has made significant progress in recent years, providing academic leadership with useful information for positive and timely interventions. A rule-based model can be used as an early warning system to identify at-risk students in a course and to inform both the instructor and the student. Instructors can then use a variety of strategies to communicate with at-risk students and provide them with recommendations for improving their course performance. Using an early warning system and accompanying procedures can increase students' success in a course. Students' success plays a vital role in the growth and prosperity of a country; therefore, students should be motivated and supported in every aspect of their lives. Identifying at-risk students as early as possible is a key indicator of student success and a monitoring factor that warrants instructors taking appropriate remedial and corrective actions to mitigate such risk. If remedial actions are taken on time and at-risk students are handled with proper care, they can become a great asset for social and economic growth. Previous studies have applied many methods to identify at-risk students; however, the results still lack efficiency in terms of time and suffer from high false-prediction rates. In this study, we propose a customized rule-based model that offers timely identification of at-risk students. It responds to the student's performance in each assessment component as it happens. Additionally, it visualizes the at-risk students to help the educator recognize these students and offer timely intervention.

Author Contributions

Conceptualization, methodology, software, statistical analysis, writing—original draft preparation: B.A., T.H. and Z.S.; data curation: S.H.; writing—review and editing: all authors; visualization: B.A. and T.H.; supervision: M.A.S.; data analysis, literature review, discussion: M.A.S., N.Z. and S.H. All authors have read and agreed to the published version of the manuscript.


Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of United Arab Emirates University (ERS_2020_6085, 14 March 2020). No potentially identifiable personal information is presented in the study.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The code and software developed in this study are available upon request.


Acknowledgments

The authors would like to acknowledge the continuous support from the College of Information Technology, UAEU.

Conflicts of Interest

The authors declare no conflict of interest.


The following abbreviations are used in this manuscript:
CSV	Comma-Separated Values
HW	Homework Assignment
JSON	JavaScript Object Notation
KNN	K-Nearest Neighbours
MAE	Mean Absolute Error
ML	Machine Learning
MT	Mid-Term Exam
RA	Remedial Action
RBM	Rule-Based Model
RF	Risk Flag


  1. Namoun, A.; Alshanqiti, A. Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Appl. Sci. 2021, 11, 237. [Google Scholar] [CrossRef]
  2. Hellas, A.; Ihantola, P.; Petersen, A.; Ajanovski, V.V.; Gutica, M.; Hynninen, T.; Knutas, A.; Leinonen, J.; Messom, C.; Liao, S.N. Predicting academic performance: A systematic literature review. In Proceedings of the Companion of the 23rd Annual ACM Conference on Innovation and Technology in Computer Science Education, Larnaca, Cyprus, 2–4 July 2018; pp. 175–199. [Google Scholar]
  3. Watkins, M. “Inclusive education: The way of the future”—A rebuttal. Prospects 2009, 39, 215–225. [Google Scholar] [CrossRef]
  4. Acedo, C.; Ferrer, F.; Pamies, J. Inclusive education: Open debates and the road ahead. Prospects 2009, 39, 227–238. [Google Scholar] [CrossRef] [Green Version]
  5. Akçapınar, G.; Hasnine, M.N.; Majumdar, R.; Flanagan, B.; Ogata, H. Developing an early-warning system for spotting at-risk students by using eBook interaction logs. Smart Learn. Environ. 2019, 6, 4. [Google Scholar] [CrossRef]
  6. Berens, J.; Schneider, K.; Görtz, S.; Oster, S.; Burghoff, J. Early detection of students at risk–predicting student dropouts using administrative student data and machine learning methods. JEDM 2018, 11, 1–41. [Google Scholar] [CrossRef]
  7. Baneres, D.; Rodríguez-Gonzalez, M.E.; Serra, M. An early feedback prediction system for learners at-risk within a first-year higher education course. IEEE Trans. Learn. Technol. 2019, 12, 249–263. [Google Scholar] [CrossRef]
  8. Chung, J.Y.; Lee, S. Dropout early warning systems for high school students using machine learning. Child. Youth Serv. Rev. 2019, 96, 346–353. [Google Scholar] [CrossRef]
  9. Aguiar, E.; Lakkaraju, H.; Bhanpuri, N.; Miller, D.; Yuhas, B.; Addison, K.L. Who, when, and why: A machine learning approach to prioritizing students at risk of not graduating high school on time. In Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, New York, NY, USA, 16–20 March 2015; pp. 93–102. [Google Scholar]
  10. OuahiMariame, S.K. Feature Engineering, Mining for Predicting Student Success based on Interaction with the Virtual Learning Environment using Artificial Neural Network. Ann. Rom. Soc. Cell Biol. 2021, 25, 12734–12746. [Google Scholar]
  11. Almutairi, F.M.; Sidiropoulos, N.D.; Karypis, G. Context-aware recommendation-based learning analytics using tensor and coupled matrix factorization. IEEE J. Sel. Top. Signal Process. 2017, 11, 729–741. [Google Scholar] [CrossRef]
  12. Sweeney, M.; Rangwala, H.; Lester, J.; Johri, A. Next-term student performance prediction: A recommender systems approach. arXiv 2016, arXiv:1604.01840. [Google Scholar]
  13. Oyedeji, A.O.; Salami, A.M.; Folorunsho, O.; Abolade, O.R. Analysis and prediction of student academic performance using machine learning. JITCE J. Inf. Technol. Comput. Eng. 2020, 4, 10–15. [Google Scholar] [CrossRef] [Green Version]
  14. Asif, R.; Hina, S.; Haque, S.I. Predicting student academic performance using data mining methods. Int. J. Comput. Sci. Netw. Secur. 2017, 17, 187–191. [Google Scholar]
  15. Hussain, M.; Zhu, W.; Zhang, W.; Abidi, S.M.R.; Ali, S. Using machine learning to predict student difficulties from learning session data. Artif. Intell. Rev. 2019, 52, 381–407. [Google Scholar] [CrossRef]
  16. Chen, Y.; Johri, A.; Rangwala, H. Running out of stem: A comparative study across stem majors of college students at-risk of dropping out early. In Proceedings of the 8th International Conference on Learning Analytics and Knowledge, Sydney, Australia, 7–9 March 2018; pp. 270–279. [Google Scholar]
  17. Lee, S.; Chung, J.Y. The machine learning-based dropout early warning system for improving the performance of dropout prediction. Appl. Sci. 2019, 9, 3093. [Google Scholar] [CrossRef] [Green Version]
  18. Lakkaraju, H.; Aguiar, E.; Shan, C.; Miller, D.; Bhanpuri, N.; Ghani, R.; Addison, K.L. A machine learning framework to identify students at risk of adverse academic outcomes. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 1909–1918. [Google Scholar]
  19. Al-Barrak, M.A.; Al-Razgan, M. Predicting students final GPA using decision trees: A case study. Int. J. Inf. Educ. Technol. 2016, 6, 528. [Google Scholar] [CrossRef] [Green Version]
  20. Kavipriya, P. A review on predicting students’ academic performance earlier, using data mining techniques. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 2016, 6, 101–105. [Google Scholar]
  21. Akçapınar, G.; Altun, A.; Aşkar, P. Using learning analytics to develop early-warning system for at-risk students. Int. J. Educ. Technol. High. Educ. 2019, 16, 1–20. [Google Scholar] [CrossRef]
  22. Mwalumbwe, I.; Mtebe, J.S. Using learning analytics to predict students’ performance in Moodle learning management system: A case of Mbeya University of Science and Technology. Electron. J. Inf. Syst. Dev. Ctries. 2017, 79, 1–13. [Google Scholar] [CrossRef] [Green Version]
  23. Alhusban, S.; Shatnawi, M.; Yasin, M.B.; Hmeidi, I. Measuring and Enhancing the Performance of Undergraduate Student Using Machine Learning Tools. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; pp. 261–265. [Google Scholar]
  24. Gafarov, F.; Yu, R.Y.B.S.U.; PM, T.A.B. Analysis of Students’ Academic Performance by Using Machine Learning Tools. In International Scientific Conference “Digitalization of Education: History, Trends and Prospects”(DETP 2020); Atlantis Press: Amsterdam, The Netherlands, 2020; pp. 570–575. [Google Scholar]
  25. Al Breiki, B.; Zaki, N.; Mohamed, E.A. Using Educational Data Mining Techniques to Predict Student Performance. In Proceedings of the 2019 International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, United Arab Emirates, 19–21 November 2019; pp. 1–5. [Google Scholar]
  26. Iqbal, Z.; Qadir, J.; Mian, A.N.; Kamiran, F. Machine learning based student grade prediction: A case study. arXiv 2017, arXiv:1708.08744. [Google Scholar]
  27. Shuqfa, Z.; Harous, S. Data Mining Techniques Used in Predicting Student Retention in Higher Education: A Survey. In Proceedings of the 2019 International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, United Arab Emirates, 19–21 November 2019; pp. 1–4. [Google Scholar]
  28. Mduma, N.; Kalegele, K.; Machuve, D. A survey of machine learning approaches and techniques for student dropout prediction. Data Sci. J. 2019, 18, 14. [Google Scholar] [CrossRef] [Green Version]
  29. Al-Sudani, S.; Palaniappan, R. Predicting students’ final degree classification using an extended profile. Educ. Inf. Technol. 2019, 24, 2357–2369. [Google Scholar] [CrossRef] [Green Version]
  30. Iam-On, N.; Boongoen, T. Generating descriptive model for student dropout: A review of clustering approach. Hum.-Centric Comput. Inf. Sci. 2017, 7, 1–24. [Google Scholar] [CrossRef] [Green Version]
  31. Jenhani, I.; Brahim, G.B.; Elhassan, A. Course learning outcome performance improvement: A remedial action classification based approach. In Proceedings of the 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016; pp. 408–413. [Google Scholar]
  32. Aggarwal, D.; Mittal, S.; Bali, V. Significance of Non-Academic Parameters for Predicting Student Performance Using Ensemble Learning Techniques. Int. J. Syst. Dyn. Appl. IJSDA 2021, 10, 38–49. [Google Scholar] [CrossRef]
  33. John, L.K. Machine learning for performance and power modeling/prediction. In Proceedings of the ISPASS, Santa Rosa, CA, USA, 24–25 April 2017. [Google Scholar] [CrossRef]
  34. González, A. Turning a traditional teaching setting into a feedback-rich environment. Int. J. Educ. Technol. High. Educ. 2018, 15, 1–21. [Google Scholar] [CrossRef]
  35. Ahadi, A.; Lister, R.; Haapala, H.; Vihavainen, A. Exploring machine learning methods to automatically identify students in need of assistance. In Proceedings of the Eleventh Annual International Conference on International Computing Education Research, Omaha, NE, USA, 9–13 August 2015; pp. 121–130. [Google Scholar]
  36. Albreiki, B.; Zaki, N.; Alashwal, H. A Systematic Literature Review of Student’Performance Prediction Using Machine Learning Techniques. Educ. Sci. 2021, 11, 552. [Google Scholar] [CrossRef]
  37. Kononenko, I.; Hong, S.J. Attribute selection for modelling. Future Gener. Comput. Syst. 1997, 13, 181–195. [Google Scholar] [CrossRef]
Figure 1. Pearson product-moment correlation coefficient computed for each attribute in the data set.
Figure 2. The distribution of each single feature and the relationships between pairs of attributes (HWs, Qzs, MT, Final, Total).
Figure 3. The heat-map of students' performance levels, visually indicating students with a performance level below 70%.
Figure 4. High-level flowchart of the proposed rule-based model.
Figure 5. Sequential model design employed to continuously identify at-risk students and propose remedial actions to close the loop.
Figure 6. Pseudo-code of the proposed sequential model.
Figure 7. Visualization of the at-risk students model using the heat-map technique.
Figure 8. Dependency between the total grade in the course and the number of remedial actions invoked, based on the parameters reported in the paper.
Table 1. Example of the student performance file the model takes as an input.

Student ID | C_1 | C_2 | C_3 | … | C_n
Student_1 | g_{1,1} | g_{1,2} | g_{1,3} | … | g_{1,n}
Student_2 | g_{2,1} | g_{2,2} | g_{2,3} | … | g_{2,n}
Student_3 | g_{3,1} | g_{3,2} | g_{3,3} | … | g_{3,n}
Student_m | g_{m,1} | g_{m,2} | g_{m,3} | … | g_{m,n}
Max grade | max(g_{C_1}) | max(g_{C_2}) | max(g_{C_3}) | … | max(g_{C_n})
Table 2. Comparison of the students' performance at the checkpoints with regard to their performance level at the total grade.

 | Total | High Risk | Low Risk | p-Value
n | 218 | 99 (45.41%) | 119 (54.59%) |
Gender | | | | 0.13328
Female | 172 (78.9%) | 83 (83.84%) | 89 (74.79%) |
Male | 46 (21.1%) | 16 (16.16%) | 30 (25.21%) |
HW1 | 0.89 [0.86–1.0] | 0.84 ± 0.2 | 0.93 ± 0.08 | 3.75974 × 10^−8
Qz1 | 0.76 [0.65–0.91] | 0.67 ± 0.18 | 0.84 ± 0.18 | 5.48951 × 10^−14
HW2 | 0.87 [0.8–0.95] | 0.8 ± 0.2 | 0.92 ± 0.08 | 2.68999 × 10^−10
Qz2 | 0.75 [0.63–0.95] | 0.6 ± 0.25 | 0.88 ± 0.12 | 1.24295 × 10^−19
MT | 0.75 [0.64–0.89] | 0.61 ± 0.13 | 0.87 ± 0.1 | 8.1207 × 10^−31
Qz3 | 0.79 [0.69–0.96] | 0.65 ± 0.23 | 0.9 ± 0.14 | 3.70452 × 10^−21
HW3 | 0.85 [0.82–1.0] | 0.81 ± 0.25 | 0.89 ± 0.13 | 0.0190915
Qz4 | 0.74 [0.6–0.94] | 0.58 ± 0.24 | 0.88 ± 0.13 | 2.69329 × 10^−22
Qz5 | 0.67 [0.52–0.9] | 0.55 ± 0.27 | 0.77 ± 0.27 | 6.04346 × 10^−12
HW4 | 0.89 [0.9–1.0] | 0.84 ± 0.25 | 0.94 ± 0.13 | 6.81128 × 10^−5
Qz6 | 0.77 [0.67–0.92] | 0.63 ± 0.18 | 0.88 ± 0.11 | 2.11131 × 10^−25
HWs | 0.88 [0.86–0.95] | 0.82 ± 0.18 | 0.92 ± 0.07 | 5.93254 × 10^−10
Qzs | 0.79 [0.7–0.92] | 0.66 ± 0.15 | 0.9 ± 0.08 | 4.84271 × 10^−29
Final | 0.58 [0.42–0.78] | 0.37 ± 0.13 | 0.75 ± 0.16 | 1.43249 × 10^−33
Total | 0.72 [0.61–0.86] | 0.58 ± 0.11 | 0.84 ± 0.09 | 2.54565 × 10^−37

The high-risk group corresponds to students whose total grade was below the threshold value of 70%, while the low-risk group's total grade is above 70%. Significant differences between groups (p < 0.05) are marked in bold.
Table 3. Initialization values of weighted coefficients of vector a.

Performance Range | a Value
(20%; 30%) | 1.4
(30%; 40%) | 1.3
(40%; 50%) | 1.2
(50%; 60%) | 1.1
(60%; 70%) | 1.0
Table 4. Illustrative execution example of the customized rule-based model.

Checkpoint | RF Value Calculated at Each Checkpoint | Risk Condition
HW1 | RF_HW1 = RF_0 = 0 | x_HW1 = 0.85 > 0.7
Qz1 | RF_Qz1 = RF_HW1 + a · W_Qz = 0 + 1 × 0.3 = 0.3 | x_Qz1 = 0.7 ≤ 0.7
HW2 | RF_HW2 = RF_Qz1 = 0.3 | x_HW2 = 0.95 > 0.7
Qz2 | RF_Qz2 = RF_HW2 + a · W_Qz = 0.3 + 1 × 0.3 = 0.6 | x_Qz2 = 0.7 ≤ 0.7
MT | RF_MT = RF_Qz2 = 0.6 | x_MT = 0.78 > 0.7
Qz3 | RF_Qz3 = RF_MT + a · W_Qz = 0.6 + 1 × 0.3 = 0.9 | x_Qz3 = 0.7 ≤ 0.7
HW3 | RF_HW3 = RF_Qz3 = 0.9 | x_HW3 = 0.9 > 0.7
Qz4 * | RF_Qz4 = RF_Qz3 + a · W_Qz = 0.9 + 1.1 × 0.3 = 1.23 | x_Qz4 = 0.5 ≤ 0.7
Qz5 | RF_Qz5 = RF_Qz4 + a · W_Qz = 1.23 − 1 + 1.5 × 0.3 = 0.68 | x_Qz5 = 0 ≤ 0.7
HW4 | RF_HW4 = RF_Qz5 = 0.68 | x_HW4 = 0.9 > 0.7
Qz6 | RF_Qz6 = RF_Qz5 + a · W_Qz = 0.68 + 1 × 0.3 = 0.98 | x_Qz6 = 0.65 ≤ 0.7

* The list of remedial actions is invoked.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Albreiki, B.; Habuza, T.; Shuqfa, Z.; Serhani, M.A.; Zaki, N.; Harous, S. Customized Rule-Based Model to Identify At-Risk Students and Propose Rational Remedial Actions. Big Data Cogn. Comput. 2021, 5, 71.

