1. Introduction
Educational Data Mining (EDM) is defined as the intersection between large areas of statistics, data mining and education [
1]. EDM is becoming a source of new knowledge and patterns in student academic data for teachers and managers of educational institutions, supporting decision-making in the face of the new challenges of education in the digital age [
2].
Among EDM’s applications, prediction of school performance and dropout has been gaining prominence since it detects a possible dropout or failure in academic activity [
3,
4,
5,
6,
7]. It then becomes possible to intervene and avoid low performance or even student dropout. It is important to emphasize that dropout leads to wasted life-changing opportunities, less skilled labor on the market, and fewer chances of social mobility [
8]. To illustrate the scale of the problem, in Brazil alone it is estimated that 2 billion dollars per year are invested in 1.9 million young people aged 15 to 17 who drop out of high school before the end of the year or fail at the end of the year [
9]. This investment is equivalent to the cost of all federal institutes and universities in Brazil in 2017 [
10].
Given this scenario, data mining and data visualization tools can help to discover the relationships between the variables available to management (usually extracted from academic control systems) and school dropout. These relationships can support better decision-making to address the dropout problem [
11,
12,
13]. In these works, the prediction of school dropout is characterized as a classification problem between two groups of students: (i) those with a tendency to persist, and (ii) those with a tendency to drop out. However, it is important to consider that several databases used in these studies are imbalanced: the number of students who drop out is significantly smaller than the number who persist in the course [
14,
15,
16,
17]. When such imbalance occurs, it is important to use techniques that mitigate it, in order to achieve more reliable results and avoid the “Accuracy Paradox”, a phenomenon in which a high accuracy value does not correspond to a high-quality model, because the model is skewed toward the majority class and the accuracy masks its poor performance on the minority class [
18].
Due to the relevance of this problem, this work presents a performance analysis of algorithms for school dropout prediction, with and without the use of data balancing techniques. Decision Tree (DT) and Multilayer Perceptron (MLP) neural networks were chosen as target algorithms because they are the most common techniques in the literature for school dropout prediction [
19], and Balanced Bagging was included as a newer approach for comparison [
20,
21,
The data-balancing techniques adopted are downsampling [
20], SMOTE [
23] and ADASYN [
24]. This work also investigates whether the “Accuracy Paradox” occurs and which performance metrics are better suited to assess classifiers, such as G-mean [
25] and UAR [
26,
27]. To validate the study, this work analyzes educational data, updated in January 2018, of students from the Integrated Courses (high school combined with professional education through technical courses) of the Federal Institute of Rio Grande do Norte (IFRN), Brazil.
As contributions of this work, the experimental results indicate that:
The use of data balancing techniques can significantly increase the performance of predictive models when data are imbalanced (as in the case of school dropout);
Precision, Recall, F1 and AUC are not adequate performance metrics for the imbalanced database used in this work;
UAR, G-mean and confusion matrices are adequate performance metrics for imbalanced databases, avoiding the “Accuracy Paradox”;
Balanced Bagging outperformed MLP and DT on the G-mean and UAR metrics.
This paper is organized as follows.
Section 2 presents the concept of the “Accuracy Paradox”, balancing techniques, and performance metrics.
Section 3 describes the related works.
Section 4 presents the database used to validate the model, the development environment, and the methodology adopted for predictive model training and evaluation.
Section 5 describes the impact of Balanced Bagging and of the balancing techniques, and compares the metrics Precision, Recall, AUC, F1, UAR and G-mean. Finally,
Section 6 discusses the importance of balancing techniques for predictive models and of choosing appropriate evaluation metrics when the data are imbalanced, and outlines future work.
3. Related Work
This section presents related works on predictive models applied to the school dropout problem. From the literature review, we highlight the variables and data mining techniques adopted, as well as the performance evaluation metrics of the models.
The authors argue in [
19] that the most used input attributes for predictive models applied to the school dropout problem are variables related to student performance, such as Cumulative Grade Point Average (CGPA), quizzes, lab work, class tests, and attendance. Another category of widely used variables is the demographic data of the students, such as gender, age, family background, and disability. Finally, some papers use variables related to extra-curricular activities, e.g., high school background and social interaction network. The algorithms used to generate the models were Decision Tree, Artificial Neural Networks, Naive Bayes, K-Nearest Neighbor, and Support Vector Machine. Among them, those with the best accuracy were Neural Network (98%) and Decision Tree (91%).
A Neural Network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity to store experiential knowledge and make it available for use [
40]. Artificial neural networks were developed to resemble biological neural structures because of their capacity to store knowledge. Learning takes place through the connections, or synaptic weights, between neurons. The best-known and most widely used neural network is the multilayer perceptron (MLP), which arranges densely connected neurons in layers. The number of neurons, like the number of layers, depends directly on the problem. However, some studies show that a three-layer MLP (input, hidden, and output) is capable of approximating any function, either linear or nonlinear [
41].
Decision Tree (DT) is a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from a set of pre-selected input data, using a divide-and-conquer strategy [
32].
In another paper, the authors used Logistic Regression to create a predictive model of dropout, considering only the academic data of students. Accuracy and confusion matrices were adopted as performance measures. The model was used to support decision-making in a student retention policy and obtained a 14% reduction in the dropout rate [
11].
In another model applied to the school dropout problem, the authors used data from e-learning courses and combined machine learning techniques, namely MLP, support vector machines and probabilistic ensemble simplified fuzzy ARTMAP, through three decision schemes. Demographic data (gender, residency, working experience) and prior academic performance (educational level, multiple choice test grade, project grade, project submission date, section activity) were suggested as input variables. Accuracy, sensitivity, and precision metrics were used in the evaluations [
42].
In another paper, decision trees and hierarchical clustering are used to predict student performance. The method has two stages: predicting students’ academic achievement at the end of a four-year study program, and studying typical progressions and combining them with the prediction results. The input variables of the model are related only to student performance, such as admission marks from high school and final marks of first- and second-year courses at university. The model was evaluated with accuracy, kappa and confusion matrices [
43].
In the work proposed in [
15], the C4.5 technique is used to identify students at risk of failing during the first four weeks of the semester. The model adopted as input data the engagement ratio, the Bangor engagement metric, the student’s program, the school, and the year of study. The model was evaluated using the following metrics: true positives, false positives, precision, and the area under the ROC curve.
As evidenced in
Table 3, none of the studies surveyed addressed data balancing, even though some of them used imbalanced databases. Unlike the studies mentioned above, this paper analyzes the influence of balancing techniques on model performance, verifies the “Accuracy Paradox”, and examines how to measure the performance of the predictive model more reliably.
4. Methodology
The data used in this study cover 7718 students of Integrated Education (secondary education combined with professional education through four-year, face-to-face technical courses) of the Federal Institute of Rio Grande do Norte (IFRN). This educational institution is located in northeastern Brazil and is distributed across 20 campuses in different cities of the state of Rio Grande do Norte (RN). The available database was extracted from the Unified Public Administration System (SUAP (
suap.ifrn.edu.br)), developed by IFRN, and contains demographic information, socioeconomic characterization and the final averages of students in the subjects. The last update was in January 2018.
In total, 25 attributes were selected, of which six are related to academic performance in the Portuguese and Mathematics subjects, since these subjects are common to all courses in the first year of enrollment. The remaining 19 attributes are related to demographic and socioeconomic characteristics of the students (
Table 4). Before training, the data were divided into a training set (75% of the data) and a test set (25% of the data). Instances with NULL values were removed.
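As a hedged illustration of this preparation step, the sketch below removes instances with NULL values and performs a 75%/25% split with pandas and scikit-learn. The data are synthetic and the column names are hypothetical; the order of the two operations and the use of stratification are assumptions, since the text does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the academic records; column names are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "math_grade": rng.uniform(0, 100, n),
    "portuguese_grade": rng.uniform(0, 100, n),
    "age": rng.integers(14, 19, n).astype(float),
    "dropout": (np.arange(n) % 20 == 0).astype(int),  # 5% minority class
})
# Inject NULL values into 20 rows so that there is something to remove.
df.loc[df.index % 20 == 1, "math_grade"] = np.nan

# Remove instances with NULL values.
clean = df.dropna()

X = clean.drop(columns="dropout")
y = clean["dropout"]

# 75% training / 25% test. Stratification keeps the class ratio in both
# parts; it is an assumption, not something stated in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, which matters when several balancing techniques are later compared on the same test set.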
The development environment used was the Python programming language with the following packages: Pandas and NumPy, for data manipulation; scikit-learn, for classic supervised learning [37]; imbalanced-learn [22], for supervised learning with imbalanced classes; and Seaborn and Matplotlib, for the graphics. All code is available in [44].
Figure 5 describes the pipeline for the predictive model of school dropout, considering the challenge of imbalanced data. The goal is to create a prediction model with an emphasis on accurately predicting students who drop out.
In summary, the pipeline follows these steps:
Balance Data: downsampling, SMOTE and ADASYN were used to generate balanced data and produce models that avoid the Accuracy Paradox. The original training set had 5788 instances, of which 262 belonged to the minority class (dropout students) and 5526 to the majority class (persistent students). After downsampling, the class of persistent students was reduced, and the new data set consisted of 524 equally distributed instances. With SMOTE, the minority class was increased to 5526 instances, so the new data set has 11052 instances. With ADASYN, the new set had 5537 minority class and 5526 majority class instances.
Model/Tuning: machine learning techniques (DT, MLP, Balanced Bagging) are applied to the balanced data to predict dropout. To tune the parameters, we used an exhaustive search over specified parameter values for each model through the grid search of scikit-learn [
37]. For the DT we searched over: the function that defines the node split (gini, entropy), the maximum tree depth (None, 3, 5), and the minimum number of samples required at a leaf (None, 5, 10, 20). For the MLP the parameters were: the optimization function (Limited-memory BFGS), the maximum number of iterations (200), the regularization term (0.1, 0.01, 0.001), the number of neurons in the hidden layer (5, 8, 11, 14, 17), the seed used by the random number generator (0, 42), and the rectified linear unit as activation function. Finally, for Balanced Bagging we searched over the number of DTs that make up the ensemble (10, 30, 100, 200, 300).
Metrics/Evaluation: the trained models are evaluated using the metrics precision, recall, F1, UAR, AUC and G-mean, and the confusion matrix.
As seen in [
19], the most common and most accurate models for prediction in the school dropout context are MLP and Decision Tree. Thus, both are considered in this work. Additionally, the downsample technique is adopted because it presented better results when combined with Neural Networks [
30] and Decision Trees [
29]. Furthermore, the downsample technique has a small computational cost [
30]. For comparison, in this paper we also use SMOTE and ADASYN as balancing techniques, and Balanced Bagging as a hybrid model.
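The exhaustive parameter search of the Model/Tuning step can be sketched for the Decision Tree with `GridSearchCV` from scikit-learn. The data are synthetic, and the scoring function and number of folds are assumptions, since the text does not state them. scikit-learn requires a numeric `min_samples_leaf`, so the "None" option listed in the text is represented here by the library default of 1.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic imbalanced dataset (about 10% minority) as a stand-in.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Grid mirroring the DT values listed in the text; 1 stands in for "None"
# in min_samples_leaf (assumption, see the paragraph above).
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="balanced_accuracy",  # equals UAR; the scoring choice is assumed
    cv=5,                         # number of folds is assumed
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the MLP and Balanced Bagging grids by swapping the estimator and the parameter dictionary.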
To validate whether there was a difference in performance between the models, we used the Kruskal-Wallis statistical test [
45]. This test was performed to check whether there are significant differences among the medians of the methods, considering a p-value threshold of 0.05.
5. Results and Discussion
This section presents the performance comparison of the classic MLP and DT methods and Balanced Bagging when applied to the prediction of school dropout. Scenarios with downsampling, SMOTE and ADASYN, and without any balancing technique, are examined. After training the classification algorithms, the confusion matrix (
Figure 6), precision, recall, F1, G-mean, UAR and AUC over the entire test set (
Table 5) are used to evaluate the models. It is important to highlight that the minority class represents the group of students who drop out, and the majority class the group of students who persist in the course.
In
Figure 6a,e there is a high minority class error (38 errors out of a total of 87 instances for the DT, and 46 errors for the MLP) when no sampling technique is used, as is usual in the literature. However, this poor result was masked by the excellent values of the Precision, Recall and F1 metrics (all close to 1.0 in rows I and II, columns I, II and III of
Table 5) for DT and MLP. In contrast, the AUC, UAR and G-mean metrics were able to detect the high minority class error, with scores all below 0.74 (rows I and II, columns IV, V, VI).
When the downsample technique was used, the results in
Figure 6b,f show a significant decrease in the minority class error (16 errors out of a total of 87 instances for both DT and MLP). As expected, the G-mean and UAR metrics increased (both 0.811 for DT and both 0.844 for MLP), but AUC remained lower (0.736 for DT and 0.798 for MLP). However, it is essential to highlight the increase in the FN error when using the downsample technique (from 43 to 357 for DT, and from 129 to 236 for MLP). This behavior decreased Recall (from 0.977 to 0.806 for DT and from 0.930 to 0.872 for MLP), since this metric emphasizes the accuracy of the majority class. The Precision and F1 metrics remained high.
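As a worked check of these figures, the sketch below recomputes UAR (the mean of the per-class recalls) and G-mean (their geometric mean) from the error counts quoted above. The 87 minority test instances come from the text. The 1843 majority test instances are an assumption, derived as 7718 − 5788 − 87, and the function name is illustrative.

```python
import math


def uar_and_gmean(minority_total, minority_errors, majority_total, majority_errors):
    """Compute UAR and G-mean from per-class totals and error counts."""
    recall_minority = (minority_total - minority_errors) / minority_total
    recall_majority = (majority_total - majority_errors) / majority_total
    uar = (recall_minority + recall_majority) / 2          # mean of recalls
    gmean = math.sqrt(recall_minority * recall_majority)   # geometric mean
    return uar, gmean


# 87 minority test instances (from the text); 1843 majority test instances
# is an assumption derived as 7718 - 5788 - 87.
MINORITY, MAJORITY = 87, 1843

dt_uar, dt_gmean = uar_and_gmean(MINORITY, 16, MAJORITY, 357)    # DT + downsample
mlp_uar, mlp_gmean = uar_and_gmean(MINORITY, 16, MAJORITY, 236)  # MLP + downsample
print(f"DT:  UAR={dt_uar:.3f}, G-mean={dt_gmean:.3f}")
print(f"MLP: UAR={mlp_uar:.3f}, G-mean={mlp_gmean:.3f}")
```

Under that assumption, both metrics round to 0.811 for DT and 0.844 for MLP, in agreement with the values reported above. The two metrics nearly coincide here because the two per-class recalls are close to each other.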
When using the SMOTE technique, we noticed a decrease in minority class errors for MLP (
Figure 6c), but for DT the error was maintained (
Figure 6g) when compared to the model without a balancing technique. Similarly to the downsample scenario, the DT with SMOTE, which had a high minority class error, showed high values for Precision, Recall, and F1 (row V, columns I, II, III), while G-mean, UAR, and AUC scored low (row V, columns IV, V, VI). For the MLP with SMOTE, which had few minority class errors, the G-mean and UAR metrics increased, whereas the AUC remained low (row VI).
When using ADASYN, a situation similar to SMOTE occurred: the minority class predictions improved for MLP (as identified by the G-mean and UAR metrics), but there was no score improvement for DT.
In all the experiments described above, the AUC score varied little, regardless of the improvement in minority class accuracy. In other words, the AUC could not reflect the increase in accuracy for student dropout, the focus of this work. However, the UAR and G-mean metrics were able to identify the increase in minority class accuracy, with values close to each other in all models, as seen in
Figure 7.
Finally, when analyzing the Balanced Bagging technique (last row of
Table 5), it obtained the best results in all imbalance-robust metrics (UAR: 0.860, G-mean: 0.859 and AUC: 0.929). Looking at the confusion matrix (
Figure 6i), this excellent result is due to the reduction of the minority class error with a smaller majority class error when compared to the other balancing techniques.
To verify that the metrics have statistically different values, we applied 10-fold cross-validation over the test set. In addition, the Kruskal-Wallis statistical test was performed for the UAR and G-mean metrics.
Figure 8 presents the boxplots of the 10-fold cross-validation results for each learning model: item (a) uses the G-mean metric and item (b) the UAR metric. In both graphs, Balanced Bagging obtained the best median. For all Kruskal-Wallis tests, the
p-value was close to 0 and below 0.05 between Balanced Bagging and all other models. This indicates that at least one model differs significantly from the others, in this case Balanced Bagging.
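The statistical comparison can be sketched as follows: per-fold UAR scores are collected with 10-fold cross-validation and then compared with the Kruskal-Wallis test from SciPy. The data are synthetic, the two models and their parameters are illustrative assumptions, and the cross-validation is run over the whole synthetic set rather than over a held-out test set.

```python
from scipy.stats import kruskal
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for the real records.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
models = {
    "DT": DecisionTreeClassifier(max_depth=5, random_state=42),
    "MLP": MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                         max_iter=200, random_state=42),
}

# One UAR (balanced accuracy) value per fold and per model.
fold_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    for name, model in models.items()
}

# Kruskal-Wallis H-test across the per-fold score samples.
statistic, p_value = kruskal(*fold_scores.values())
print(f"H={statistic:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```

A p-value below 0.05 would indicate that at least one model's score distribution differs from the others. The test itself does not say which one, so the medians in the boxplots are still needed to identify the better model.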
It becomes evident after analyzing the results that data imbalance makes metrics such as Recall, Precision, and F1 more likely to emphasize only the accuracy of the majority class, thus falling into the Accuracy Paradox. On the other hand, the G-mean and UAR metrics proved to be better candidates for evaluating predictive models on imbalanced data because their calculation takes the accuracy of the minority class into account. It is also evident that, with Balanced Bagging, downsampling for MLP and DT, SMOTE for MLP and ADASYN for MLP, the ability of the model to predict the minority class increased, as shown by the decrease in the FP error. Nevertheless, the improvement in the prediction of the minority class worsened the accuracy of the majority class, as shown by the increase in the FN error. In the judgment of the authors, the increase in FN does not present significant problems, since an FN means that the model predicted that the student would drop out but this did not occur. This does not bring a significant burden to the educational institution under study. On the other hand, the FP error has a significant impact, since it means that the model predicted that the student would stay in school but the student actually dropped out.
6. Conclusions
After analyzing the results, we concluded that the Accuracy, Recall and F1 metrics failed to detect the high number of minority class errors (students who dropped out) when the data were imbalanced. The AUC metric remained stable even when minority class accuracy increased. However, the G-mean and UAR metrics were able to capture the minority class error for the two classifiers. We also concluded that applying a data balancing technique before training the predictive model significantly improves the results measured by the G-mean and UAR metrics. In other words, the prediction of students who drop out improved. However, the best model for the problem addressed in this paper was Balanced Bagging. Therefore, for imbalanced data contexts, we recommend using the G-mean and UAR metrics to measure model quality more reliably and avoid the Accuracy Paradox. Data balancing techniques can also increase the performance of the predictive model, but better results can be obtained with Balanced Bagging. As future work, we plan to consider other advanced machine learning techniques, such as Deep Learning and Probabilistic Programming, and to test other balancing techniques, such as k-means balancing and probabilistic sampling.