Forecasting · Article · Open Access · 18 October 2025

Can Simple Balancing Algorithms Improve School Dropout Forecasting? The Case of the State Education Network of Espírito Santo, Brazil

1 Department of Economics, Federal University of Espírito Santo, Vitória 29075-910, Brazil
2 Education Center, Jones dos Santos Neves Institute, Vitória 29052-015, Brazil
* Author to whom correspondence should be addressed.

Abstract

This study evaluates the effect of simple data-level balancing techniques on predicting school dropout across all state public high schools in Espírito Santo, Brazil. We trained Logistic Regression with LASSO (LR), Random Forest (RF), and Naive Bayes (NB) models on first-quarter data from 2018–2019 and forecasted dropouts for 2020, with additional validation in 2022. Facing strong class imbalance, we compared three balancing methods—RUS, SMOTE, and ROSE—against models trained on the original data. Performance was assessed using accuracy, sensitivity, specificity, precision, F1, AUC, and G-mean. Results show that the imbalance severely harmed RF and NB trained without balancing, while Logistic Regression remained more stable. Overall, balancing techniques improved most metrics: RUS and ROSE were often superior, while SMOTE produced mixed results. Optimal configurations varied by year and metric, and RUS and ROSE made up most of the best combinations. Although most configurations benefited from balancing, some decreased performance; therefore, we recommend systematic testing of multiple balancing strategies and further research into SMOTE variants and algorithm-level approaches.

1. Introduction

School dropout, defined as the premature interruption of formal education, arises from a complex interplay of socioeconomic, personal, and institutional factors. Dropping out of school has several negative consequences. At the individual level, it limits earning potential, employability, and social mobility. At the macroeconomic level, it impacts the distribution of human capital, contributing to income inequality and hindering socioeconomic advancement, thus perpetuating cycles of poverty and inequality []. As UNICEF [] notes, disrupting the educational process deprives children and adolescents of the essential skills necessary to develop their potential, achieve economic independence, and actively participate in society. School dropout is also associated with a range of social problems, including a lower quality of life, poor health, and gender inequality. Furthermore, Wood et al. [] highlight that high school dropout has been associated with higher rates of unemployment, incarceration, and mortality.
Developing early warning systems for dropout identification empowers educators by facilitating the identification of at-risk students. This, in turn, allows educators to plan effective interventions to mitigate this serious problem. Such models have been successfully implemented worldwide [,,,], and their importance is underscored by [].
One of the inherent challenges in predicting student dropout is the imbalance between classes. The number of students who drop out is typically much lower than those who remain in school, creating a significant data imbalance. As Fernández et al. [] point out, this imbalance can negatively impact predictive performance. According to Chen et al. [], classification models typically assume a balanced class distribution. However, class imbalance poses a challenge, as traditional algorithms tend to favor more frequent classes, potentially neglecting the less common ones. This imbalance leads to a variation in the performance of the learning algorithms, known as performance bias [].
To address the challenges posed by imbalanced datasets, numerous methodologies have been proposed, intervening at different stages of the modeling process. These solutions can be broadly classified into data-level approaches, algorithm-level techniques, or hybrid strategies [,].
Data-level approaches directly modify the training data to rebalance class distributions, often through techniques like oversampling minority classes or undersampling majority classes. A key advantage of these methods is their independence from specific learning algorithms, allowing easy integration with a variety of models []. However, Altalhan et al. [] point out that this approach may discard potentially useful information or introduce noise through synthetic data.
Algorithm-level approaches, in contrast, focus on adapting or developing machine learning algorithms to be more responsive to underrepresented groups []. Methods within this class include cost-sensitive learning, ensemble methods, and algorithm-specific adjustments []. While effective, these approaches may not fully exploit the available data or may require complex adjustments for different algorithms. Finally, hybrid approaches strategically combine the advantages of both data-level and algorithm-level techniques to further improve performance. However, combining multiple techniques makes hybrid approaches complex, potentially leading to overfitting, increased computational demands, and a greater need for careful design. For a comprehensive overview of approaches for tackling imbalanced problems, see [,].
In this paper, we focus on data-level approaches for four reasons: (i) they can be used with any classifier; (ii) ease of implementation; (iii) direct control over class distribution; and (iv) widespread use, as indicated by the literature review. Numerous balancing techniques have been developed, including Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE) [], and Random Oversampling Examples (ROSE) [], among others. SMOTE is frequently considered a benchmark balancing technique [], as it attempts to address class imbalance by creating synthetic data points for the minority class.
Despite the inclination to use balancing techniques, some dropout classification models do not address the imbalance directly. These models, nevertheless, have demonstrated a satisfactory ability to identify at-risk students [,,,,,,]. In contrast, other researchers automatically incorporate balancing methods into their dropout models [,,]. Some have even investigated the best balancing method by comparing distinct approaches [,]. Still other works examine the efficiency of balancing methods by comparing models trained with and without these algorithms [,,,,,,,]. These studies suggest that the effect of the balancing methods is still an ongoing question.
A key conclusion from these works is that there is no universally optimal solution to the problem of imbalanced data. Some models built without balancing achieve strong results, while other studies find evidence supporting or refuting the use of balancing methods. Thus, testing various methods to choose the most appropriate for the data at hand would be the safest approach.
This research investigates the potential of simple balancing methods to improve the performance of classification models in predicting school dropouts for all state public high schools in Espírito Santo, Brazil, focusing on students in grades one, two, and three. Given the potential impact of imbalanced data on classifier performance, we tested three distinct balancing methods: RUS, SMOTE, and ROSE. RUS was chosen because of its simplicity and computational efficiency, while we chose SMOTE and ROSE to compare a traditionally used technique with an innovative one. While SMOTE is widely used and often considered a benchmark [], ROSE offers a different approach by creating new data using statistical kernels, potentially expanding the decision region []. Furthermore, we noted a lack of applications of ROSE in the context of student dropout prediction. For classification in this study, we selected three prominent machine learning models: Logistic Regression (LR), Random Forest (RF), and Naive Bayes (NB). Logistic Regression was chosen for its inherent interpretability, allowing for a clear understanding of feature importance through its coefficients. Random Forest was selected for its ability to achieve high predictive accuracy by leveraging ensemble learning techniques. Finally, Naive Bayes was included due to its computational efficiency, ease of estimation, and demonstrated effectiveness in high-dimensional problems. These three models have been widely applied in the literature for predicting student dropout. We evaluated the performance of these classifiers after calibrating them with and without data balancing algorithms.
The results indicated that the inherent class imbalance in dropout data presents a significant challenge to effective model training. However, results also demonstrated that data balancing techniques, particularly RUS and ROSE, can improve model performance in terms of the majority of the metrics considered. While LR exhibits notable stability across different balancing techniques, RF models show greater variability, with RUS and ROSE generally producing the best results. Ultimately, we highlight the importance of carefully considering class imbalance and selecting appropriate balancing techniques, as the optimal choice may vary depending on the academic year and performance metric.

3. Methodology

In this section, we present the classifiers used (LR, RF, and NB), as well as a brief introduction to the balancing methods employed (RUS, SMOTE, and ROSE). The dependent variable is defined as $Y \in \{0, 1\}$, where 1 indicates a dropout and 0 otherwise. As predictors, we used a set of personal information, school infrastructure variables, and administrative data on student performance and attendance during the first quarter of the year. The goal is to forecast whether the student will drop out at any point during the year. Table 1 lists all the predictors considered.
Table 1. Considered variables.
It is important to note that several models were estimated. Some were calibrated using the original imbalanced data, while others were trained with data balanced using the methods employed in this research. For this reason, for each specific grade of secondary school, twelve configurations were tested: (I) LR with LASSO, calibrated using the original imbalanced data; (II) RF, calibrated with the original data (imbalanced); (III) NB, calibrated with the original data (imbalanced); (IV) LR with LASSO, calibrated using RUS; (V) RF, calibrated with RUS; (VI) NB, calibrated with RUS; (VII) LR with LASSO, calibrated with SMOTE; (VIII) RF, calibrated with SMOTE; (IX) NB, calibrated with SMOTE; (X) LR with LASSO, calibrated with ROSE; (XI) RF, calibrated with ROSE; (XII) NB, calibrated with ROSE.
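For concreteness, the full set of configurations is simply the Cartesian product of the three classifiers and the four data treatments. The short R sketch below only enumerates that grid; the object and label names (classifiers, treatments, configs) are ours and not part of the original pipeline.

```r
# Illustrative enumeration of the twelve configurations (classifier x data treatment).
classifiers <- c("LR_LASSO", "RF", "NB")
treatments  <- c("original", "RUS", "SMOTE", "ROSE")

configs <- expand.grid(classifier = classifiers,
                       treatment  = treatments,
                       stringsAsFactors = FALSE)
configs$id <- as.roman(seq_len(nrow(configs)))  # labels I through XII, matching the order above
print(configs)
```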

3.1. Classifiers

3.1.1. Logistic Regression

Logistic Regression is a common classifier and has been applied to several classification problems, such as dropout forecasting [,,], credit risk prediction [,], bankruptcy prediction [], oil market efficiency [], and other applications. Mathematically, considering a binary dependent variable $Y \in \{0, 1\}$, the logistic regression is given by

$$\Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta' X}}{1 + e^{\beta_0 + \beta' X}},$$

where $\beta_0$ is a constant, $\beta = (\beta_1, \ldots, \beta_k)'$ is a column vector of coefficients, $X = (X_1, \ldots, X_k)'$ is a column vector of $k$ predictors, and $\beta_0 + \beta' X$ represents the linear combination of the predictor variables. We used Logistic Regression in conjunction with the LASSO (least absolute shrinkage and selection operator), a technique for selecting relevant variables. Its primary advantage is the simultaneous estimation of the model parameters and selection of the most relevant variables, while discarding irrelevant ones. Formally, the penalized log-likelihood function is given by

$$\sum_{i=1}^{N} \left[ y_i (\beta_0 + \beta' x_i) - \log\!\left(1 + e^{\beta_0 + \beta' x_i}\right) \right] - \lambda \sum_{j=1}^{k} |\beta_j|,$$
where $i = 1, \ldots, N$ indexes the observations, $N$ is the sample size, and $\lambda \geq 0$ is the tuning parameter that controls the shrinkage, usually obtained by k-fold cross-validation. For more details, see [,]. In this research, we employed 5-fold cross-validation to select both the optimal hyperparameters (by maximizing the area under the ROC curve) and the probability threshold for converting predicted probabilities into binary outcomes.
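As an illustration of this step, the following R sketch fits the LASSO-penalized logistic regression with 5-fold cross-validation using caret (the package cited later for model fitting) and glmnet. It assumes a training data frame train_df with a two-level factor dropout (levels "no"/"yes") holding the Table 1 predictors; these names, and the lambda grid, are our own assumptions rather than the paper's exact code.

```r
# Minimal sketch: LASSO logistic regression tuned by 5-fold CV on AUC (assumed setup).
library(caret)
library(glmnet)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)   # reports ROC (AUC), sensitivity, specificity

lasso_fit <- train(dropout ~ ., data = train_df,
                   method    = "glmnet",
                   metric    = "ROC",                     # choose lambda by cross-validated AUC
                   tuneGrid  = expand.grid(alpha  = 1,    # alpha = 1 corresponds to the LASSO penalty
                                           lambda = 10^seq(-4, 0, length.out = 30)),
                   trControl = ctrl)

# Dropout probabilities for new students; a probability threshold is then chosen
# separately to convert these into binary predictions.
p_hat <- predict(lasso_fit, newdata = test_df, type = "prob")[, "yes"]
```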

3.1.2. Random Forest

Random Forests are tree-based ensemble methods built on the concept of stratification and segmenting predictor space []. They can be used for regression and classification problems [,]. Applications of RF are, for instance, credit modeling [,], dropout identification [,], stock market trading prediction [], water temperature forecasting [], and others. In essence, Random Forests aggregate a large number of Decision Trees and combine their individual predictions to obtain a final classification or prediction. Bootstrap and other ensemble methods are commonly employed in the construction of these trees.
Calibrating a Random Forest model typically involves tuning three key hyperparameters: the number of trees (ntree), the minimum node size (nodesize), and the number of features sampled at each split (mtry). The number of trees determines the size of the ensemble, with too few trees potentially limiting the ability to capture complex patterns and too many trees potentially leading to overfitting []. The number of features sampled at each split controls the randomness of the model, promoting diversity among the trees. The minimum node size dictates the minimum number of observations required in a terminal node for it to be considered for further splitting [].
We employed a grid search to explore different hyperparameter values and computed the error on the validation set, using 5-fold cross-validation to find the optimal hyperparameters by maximizing the area under the ROC curve. We tested ensemble sizes in steps of 100, ranging from 100 to 1000 trees, and fitted the best model using the caret package in R []. The hyperparameter mtry was varied in steps of 1, from 1 to the total number of variables minus one. Finally, the node size varied from 1 to 20.
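A hedged sketch of this search is shown below. Because caret's built-in "rf" method only tunes mtry, ntree and nodesize are looped over explicitly and passed through to randomForest(); the original study may have organized the search differently. The train_df/dropout objects are the same assumptions as in the previous sketch.

```r
# Hedged sketch of the grid search over ntree, nodesize, and mtry (assumed setup).
library(caret)
library(randomForest)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

p         <- ncol(train_df) - 1                      # number of predictors
mtry_grid <- expand.grid(mtry = seq_len(p - 1))      # 1, 2, ..., p - 1 features per split

best_fit <- NULL
best_roc <- -Inf
for (ntree in seq(100, 1000, by = 100)) {            # ensemble sizes 100, 200, ..., 1000
  for (nodesize in 1:20) {                           # minimum terminal node sizes
    fit <- train(dropout ~ ., data = train_df,
                 method = "rf", metric = "ROC",
                 tuneGrid = mtry_grid, trControl = ctrl,
                 ntree = ntree, nodesize = nodesize)  # forwarded to randomForest()
    if (max(fit$results$ROC) > best_roc) {
      best_roc <- max(fit$results$ROC)
      best_fit <- fit
    }
  }
}
```

This grid is computationally demanding (200 outer combinations times the mtry values, each with 5 folds); in practice the loops can be parallelized or the grid coarsened.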

3.1.3. Naive Bayes

According to [], the Naive Bayes classifier is particularly suitable for high-dimensional explanatory variables. Applications of NB models include dropout forecasting [], lemon disease detection [], basketball game outcomes [], and medical problems [], among others. It is based on Bayes' theorem, which, for a binary classification problem with $Y \in \{0, 1\}$ and $X = (X_1, \ldots, X_k)$, states that

$$\Pr(Y = 1 \mid X_1, \ldots, X_k) = \frac{\Pr(X_1, \ldots, X_k \mid Y = 1)\,\Pr(Y = 1)}{\Pr(X_1, \ldots, X_k \mid Y = 1)\,\Pr(Y = 1) + \Pr(X_1, \ldots, X_k \mid Y = 0)\,\Pr(Y = 0)}.$$

To apply this theorem, we can use training data to estimate $\Pr(X \mid Y)$ and $\Pr(Y)$. These estimates then allow us to determine $\Pr(Y \mid X)$ for any new instance of the predictor vector $X$. However, estimating the joint distribution $\Pr(X \mid Y)$ can be challenging, especially in high-dimensional spaces. The Naive Bayes classifier addresses this challenge by making a key assumption: given the class $Y$, the features in $X$ are conditionally independent. This simplification allows the classifier to estimate the class-conditional marginal densities separately for each feature. For continuous features, this is typically done using one-dimensional kernel density estimation; for discrete predictors, an appropriate histogram estimate can be used []. Under this independence assumption, Equation (3) can be rewritten as

$$\Pr(Y = 1 \mid X_1, \ldots, X_k) = \frac{\Pr(Y = 1)\prod_{i=1}^{k}\Pr(X_i \mid Y = 1)}{\Pr(Y = 1)\prod_{i=1}^{k}\Pr(X_i \mid Y = 1) + \Pr(Y = 0)\prod_{i=1}^{k}\Pr(X_i \mid Y = 0)}.$$
In this research, we employed a grid search approach for hyperparameter tuning of the Naive Bayes model. The tuned hyperparameters included the bandwidth of the kernel density estimation and the Laplace smoothing parameter, which addresses the issue of zero probabilities for unseen categories during training. To ensure robust model evaluation and selection, we implemented a 5-fold cross-validation procedure. For the Naive Bayes model, we did not optimize a threshold for transforming probabilities into binary predictions.
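The sketch below illustrates this tuning step under the same assumed setup as before. caret's "naive_bayes" method (from the naivebayes package) exposes the Laplace smoothing parameter as laplace and the kernel bandwidth multiplier as adjust; the specific grid values are illustrative, not those of the paper.

```r
# Hedged sketch: Naive Bayes with kernel densities, tuned by 5-fold CV on AUC.
library(caret)
library(naivebayes)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

nb_grid <- expand.grid(laplace   = c(0, 0.5, 1),   # Laplace smoothing for unseen categories
                       usekernel = TRUE,           # kernel density estimates for numeric predictors
                       adjust    = c(0.5, 1, 2))   # bandwidth multiplier of the kernel estimator

nb_fit <- train(dropout ~ ., data = train_df,
                method = "naive_bayes", metric = "ROC",
                tuneGrid = nb_grid, trControl = ctrl)
```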

3.2. Balancing Methods

In this section, we briefly introduce the balancing methods employed in this research: Random Undersampling (RUS), the Synthetic Minority Oversampling Technique (SMOTE) [,], and Random Oversampling Examples (ROSE) []. For a more comprehensive treatment of these methods, see [,].
Random Undersampling involves constructing a balanced sample by randomly removing individuals from the majority class, thereby reducing its size while preserving all elements of the minority class. This approach decreases the overall sample size. Figure 1 illustrates this simple approach.
Figure 1. Example of a balanced sample. The green elements are eliminated via random sampling. The resulting sample contains, in general, the same number of elements from both classes. After this step, we can estimate the classifier based on a balanced dataset.
Chawla et al. [] proposed the SMOTE, which interpolates neighboring elements of the minority class to create synthetic elements. Consequently, the algorithm augments the minority class by incorporating simulated data while randomly removing elements from the majority class. According to the authors, this approach results in a larger and less specific decision region, thus enhancing the quality of the estimation. Figure 2 illustrates the construction of the final sample.
Figure 2. An example of a balanced sample using SMOTE. The elements represented by the light red colour are generated via SMOTE and added to the dataset. Simultaneously, some elements of the majority class are discarded through random sampling.
While SMOTE effectively addresses class imbalance by generating synthetic minority class instances, it suffers from key limitations highlighted in [,]. The potential for generating duplicate synthetic samples, especially within complex or overlapping datasets, can lead to redundancy and overfitting. Furthermore, the original SMOTE overlooks the underlying data distributions, potentially creating synthetic samples that do not accurately represent the true minority class and are not well suited to high-dimensional data. To address these shortcomings, numerous variations of SMOTE have been developed, such as Borderline-SMOTE, ADASYN, and SMOTE-N, among others [,].
ROSE was proposed by [] and is based on the principles of SMOTE. The main distinction lies in the generation of the synthetic samples: ROSE replaces the interpolation step with statistical kernels, thus overcoming SMOTE's drawback of neglecting the underlying data distribution. This allows the decision region to expand beyond what interpolation permits, producing newly generated data that differ from the original observations but remain statistically equiprobable. As a result, the decision region is expanded [].
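To make the three data treatments concrete, the sketch below shows one standard way to produce them in R with the ROSE and smotefamily packages; the calls and default parameters are illustrative and may differ from those used in the study (SMOTE, in particular, assumes numeric predictors here).

```r
# Hedged sketch of the three data-level treatments applied to an assumed train_df.
library(ROSE)         # ovun.sample() for undersampling, ROSE() for kernel-based generation
library(smotefamily)  # SMOTE()

# Random Undersampling: keep every dropout, randomly discard non-dropouts until balanced.
rus_data <- ovun.sample(dropout ~ ., data = train_df, method = "under", seed = 1)$data

# ROSE: draw a new balanced sample from smoothed (kernel) class-conditional densities.
rose_data <- ROSE(dropout ~ ., data = train_df, seed = 1)$data

# SMOTE: interpolate between k-nearest minority-class neighbours (numeric predictors only).
X          <- train_df[, setdiff(names(train_df), "dropout")]
smote_out  <- SMOTE(X = X, target = train_df$dropout, K = 5)
smote_data <- smote_out$data   # the generated target column is named "class"
```

Alternatively, caret's trainControl() accepts sampling = "down", "smote", or "rose" (the latter two require the corresponding packages), which applies the chosen treatment inside each cross-validation fold and so leaves the validation folds untouched.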

4. Experiments

4.1. Dataset

The dataset used in this study comprises variables collected from the Sistema Estadual de Gestão Escolar do Espírito Santo (SEGES) (State School Management System of Espírito Santo, translated freely) and the Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira (INEP) (National Institute for Education Studies and Research Anísio Teixeira, translated freely). Table 1 lists all 49 variables used.
The objective is to classify students for the 2020 academic year, using data acquired shortly after the conclusion of the first academic quarter. This classification enables the calculation of student dropout probabilities during 2020 for students enrolled in the first, second, and third years of high school across all state schools in Espírito Santo, Brazil. To this end, the model was calibrated using data from the first academic quarters of 2018 and 2019. Furthermore, the calibrated model was also employed to forecast the dropout rate for 2022, thereby evaluating the predictive accuracy of the model against two distinct datasets. The results for 2022 are presented in Appendix A.1 through Table A1, Table A2, Table A3 and Table A4.

4.2. Results

The methods described above were used to forecast student dropout in 2020 for the first, second, and third years of high school in all public state schools in Espírito Santo, Brazil. To assess whether balancing methods improve forecasting metrics, we estimated twelve distinct models for each grade: (I) LR with LASSO, calibrated using the original imbalanced data; (II) RF, calibrated with the original data (imbalanced); (III) NB, calibrated with the original data (imbalanced); (IV) LR with LASSO, calibrated using RUS; (V) RF, calibrated with RUS; (VI) NB, calibrated with RUS; (VII) LR with LASSO, calibrated with SMOTE; (VIII) RF, calibrated with SMOTE; (IX) NB, calibrated with SMOTE; (X) LR with LASSO, calibrated with ROSE; (XI) RF, calibrated with ROSE; (XII) NB, calibrated with ROSE. The results presented in this section refer to models calibrated with data from 2018 and 2019, with the aim of forecasting dropout in 2020. In addition to this dataset, we also used the same calibrated models to predict school dropout in 2022. The results of this second exercise are presented in the Appendix A.1.
Table 2 displays the number of dropouts and non-dropouts in our data set for the first, second, and third years of high school in 2018, 2019, and 2020. As can be observed, the dataset shows a significant class imbalance.
Table 2. Number of dropouts and non-dropouts in our dataset.
Table 3 and Figure 3 present the results for all twelve models. The highest value for each metric is highlighted in bold. The metrics for the RF and NB models trained without balancing and for RF with SMOTE were not considered when identifying these best values, as these models have significant limitations in accurately identifying dropout cases. The mathematical definition of each metric is based on the confusion matrix (Table A5) and the ROC curve (Figure A1); for more details, see Appendix A.2. Comparing the three models estimated with the original data (LR, RF, and NB), it is evident that RF and NB struggled to identify students at risk of dropping out, exhibiting sensitivities approaching or equal to zero. This also prevented the calculation of precision and F1 scores for the RF. This issue stemmed from the severe imbalance in the dataset, as shown in Table 2. The LR model performed better with the imbalanced data, achieving a sensitivity of 0.791 and an AUC of 0.819.
Table 3. Metrics for the first year of high school in 2020.
Figure 3. Metrics for the first year of high school. Each group of bars represents the results of the four models evaluated across all metrics considered: accuracy (Accur.), sensitivity (Sens.), specificity (Spec.), precision (Prec.), area under the ROC curve (AUC), and G-mean. The metrics are presented in a tripartite layout: LR classifier results on the left, RF in the center, and NB on the right.
When training the models with balanced data, we observed an increase in sensitivity across all models, except for the RF with SMOTE. The RF with SMOTE still faced considerable challenges in identifying students who dropped out, with a sensitivity of only 0.021. This resulted in very high accuracy (0.967) and perfect specificity, but at the expense of practical usefulness for identifying at-risk students.
Considering the F1, AUC, and G-mean metrics, which are commonly used for imbalanced datasets and which balance different aspects of model performance, the best results were achieved by NB with SMOTE for F1 and by RF with ROSE for both AUC and G-mean: NB with SMOTE obtained an F1 of 0.318 (AUC: 0.788, G-mean: 0.788), while RF with ROSE obtained an AUC of 0.822 and a G-mean of 0.821 (F1: 0.215). While AUC scores varied across most models, the differences were relatively small (ranging from 0.778 to 0.822). Notable exceptions were the RF models trained on the original and SMOTE-balanced data and the NB model trained on the original data, all of which exhibited poor performance. The LR and NB models demonstrated notable stability across the different balancing techniques, with AUC scores ranging from 0.778 to 0.821 and from 0.788 to 0.806, respectively, indicating low variance in performance. In contrast, the Random Forest (RF) models exhibited high variability, performing well with RUS and ROSE but poorly with the original data and with SMOTE.
Taking into account the results for the second year of high school, Figure 4 and Table 4 show the results for the twelve models. The highest value for each metric is highlighted in bold, following the criterion adopted in Table 3. Comparing the LR, RF, and NB models trained on the original (unbalanced) data, we observed a similar pattern to that found in the first grade. The RF and NB models again exhibited poor performance in identifying at-risk students, with near-zero sensitivity. However, the LR model performed substantially better on the imbalanced data, achieving AUC and G-mean values of 0.855 and 0.854, respectively. Furthermore, LR with RUS yielded the best F1-score.
Figure 4. Metrics for the second year of high school. Each group of bars represents the results of the four models evaluated across all metrics considered: accuracy (Accur.), sensitivity (Sens.), specificity (Spec.), precision (Prec.), area under the ROC curve (AUC), and G-mean. The metrics are presented in a tripartite layout: LR classifier results on the left, RF in the center, and NB on the right.
Table 4. Metrics for the second year of high school in 2020.
Finally, Table 5 and Figure 5 present the results for the third year of high school for 2020. Using the original imbalanced data, the LR model maintained a similar level of performance as in the first and second grades, with an accuracy of 0.806, sensitivity of 0.852, and AUC of 0.829. This consistency suggests the LR model is robust across different high school grades. In contrast, the RF and NB models continued to perform poorly on the original data, exhibiting near-zero sensitivity. This mirrors the findings from previous grades, suggesting that RF and NB models struggle to identify at-risk students when the data is highly imbalanced.
Table 5. Metrics for the third year of high school in 2020.
Figure 5. Metrics for the third year of high school. Each group of bars represents the results of the four models evaluated across all metrics considered: accuracy (Accur.), sensitivity (Sens.), specificity (Spec.), precision (Prec.), area under the ROC curve (AUC), and G-mean. The metrics are presented in a tripartite layout: LR classifier results on the left, RF in the center, and NB on the right.
When applying the balancing techniques, we found that the highest AUC was achieved by the RF with RUS. Conversely, the best accuracy, specificity, precision, and F1 were attained by LR with ROSE. Regarding G-mean, there is a tie between the two models (RF with RUS and NB with RUS). Once again, Random Forest with SMOTE showed a poor performance in identifying dropouts.
Table 6 presents the optimal configuration for each metric and academic year. As observed, data balancing techniques improved the large majority (approximately 95%) of the metrics analyzed. The only exceptions were the AUC and G-mean metrics for the second year of high school, which showed no improvement with any balancing method.
Table 6. Optimal configuration for each metric in 2020.
Considering only the best-performing configurations, models without balancing algorithms achieved the top result in just 10% of cases. SMOTE was among the best in 19% of the cases, RUS in 33%, and ROSE in 38%. Regarding classifiers, LR was the most prevalent, appearing in 52% of the top configurations, while RF and NB comprised 23% and 28%, respectively.

4.3. Conclusions

This paper aims to evaluate the effect of data balancing algorithms on the performance of predictive models applied to a highly imbalanced education dataset. To achieve this goal, we explore three classification approaches (LR, NB, and RF) in conjunction with three balancing techniques (RUS, SMOTE, and ROSE). In addition, the classifiers were trained on the original unbalanced data, allowing a direct comparison of the effectiveness of each approach.
The methods described above were trained with data from the first quarter of 2018 and 2019 and, subsequently, were used for forecasting student dropout in 2020 for the first, second, and third grade of high school in all state schools in Espírito Santo, Brazil. In addition, as a second exercise, we also employed the same estimated models to forecast student dropout in 2022.
Our results demonstrated that the inherent class imbalance in dropout data poses a significant challenge to effective model training. In particular, the RF and NB models struggled to identify at-risk students when trained on the original imbalanced data, as evidenced by their low sensitivity.
However, the application of data balancing techniques, particularly RUS and ROSE, could improve the performance of the models. Restricting analysis to the optimal configurations identified, for the year 2020, Random Undersampling (RUS) was observed in 33% of instances, with ROSE present in 38%. The SMOTE algorithm was utilized in 19% of the configurations, while classifiers trained directly on imbalanced data yielded superior results in only 10% of cases. In 2022, the optimal strategies consisted exclusively of ROSE (24%) and RUS (76%).
The findings of this study highlight the potential of balancing methods to improve predictive performance. However, choosing the best algorithm is not straightforward and depends on the metrics considered. In most cases, RUS and ROSE showed the best performance, but it is important to note that all methods could potentially decrease performance. This could be because the algorithms considered, such as those employing Random Undersampling, potentially discard valuable information from the majority class, hindering accurate estimation. Similarly, oversampling techniques risk generating duplicate or unrepresentative synthetic samples, particularly within complex or overlapping datasets. To address these limitations, further investigation into alternative data balancing techniques is recommended, with particular emphasis on techniques derived from SMOTE. Moreover, the exploration of algorithm-level methods for highly imbalanced data, such as cost-sensitive learning or ensemble methods, warrants further study. Future research should also consider exploring alternative modeling approaches tailored to imbalanced data. Specifically, testing different link functions within the framework of Logistic Regression presents a promising avenue for improvement. By employing link functions that better accommodate skewed class distributions, it may be possible to develop more robust and accurate classification models, minimizing the need for aggressive data balancing.
These findings have practical implications for the development of early warning systems for school dropout. The use of data balancing techniques and careful model selection can significantly improve the ability to accurately identify at-risk students, enabling targeted interventions. It is important to note that the optimal balancing technique can vary depending on the academic year and the specific performance metric. In general, this research underscores the importance of carefully considering class imbalance and selecting appropriate balancing techniques when developing accurate and effective school dropout prediction models.

Author Contributions

Conceptualization, G.A.d.A.P. and K.d.D.D.; methodology, G.A.d.A.P. and K.d.D.D.; software, G.A.d.A.P.; validation, G.A.d.A.P. and K.d.D.D.; formal analysis, G.A.d.A.P.; investigation, G.A.d.A.P. and K.d.D.D.; resources, G.A.d.A.P. and K.d.D.D.; data curation, G.A.d.A.P. and K.d.D.D.; writing—original draft preparation, G.A.d.A.P.; writing—review and editing, G.A.d.A.P. and K.d.D.D.; visualization, G.A.d.A.P.; supervision, K.d.D.D.; project administration, K.d.D.D.; funding acquisition, K.d.D.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Espírito Santo Research and Innovation Support Foundation (FAPES), Department of Education of the State of Espírito Santo (SEDU-ES), and the Jones dos Santos Neves Institute (IJSN) through the Cooperation Agreement SEDU/IJSN/FAPES n° 015/2021, Grant Agreement FAPES and IJSN n° 141/2021, and CCAF Resolution n° 284/2021-Educational Studies.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset containing sensitive data was provided by the Secretariat of Education of Espírito Santo (SEDU-ES). The authors are not permitted to share this data. Requests for data access must be submitted through the official SEDU-ES website: https://sedu.es.gov.br.

Acknowledgments

The authors would also like to thank SEDU-ES for providing the data.

Conflicts of Interest

The authors declare that there are no competing interests.

Abbreviations

The following abbreviations are used in this manuscript:
ANN    Artificial Neural Network
CART   Classification and Regression Trees
LR     Logistic Regression
RF     Random Forest
ROC    Receiver Operating Characteristic Curve
ROSE   Random Oversampling Examples
RUS    Random Undersampling
SMOTE  Synthetic Minority Oversampling Technique
SVM    Support Vector Machine

Appendix A

Appendix A.1. Results for 2022

Table A1, Table A2 and Table A3 present the metrics for the first, second, and third years of high school in 2022. The purpose of this analysis is to verify whether the pattern observed in 2020 was repeated. As can be seen, RF and NB trained without balancing, as well as RF with SMOTE, again showed significantly lower predictive performance. Furthermore, the balancing methods were able to improve the metrics considered, with particular emphasis on RUS and ROSE.
Across all models and metrics, RUS was present in 76% of the best configurations, while ROSE appeared in only 24%. These two methods dominated the best-performing configurations. Regarding classifiers, LR appeared in 52% of cases, RF in 43%, and NB in only 5%.
Table A1. Metrics for the first year of high school in 2022 for the Logistic Regression, Random Forest, and Naive Bayes classifiers.

Classifier | Balancing | Accur. | Sens. | Spec. | Prec. | F1 | AUC | G-Mean
Logistic | - | 0.971 | 0.211 | 0.989 | 0.311 | 0.251 | 0.600 | 0.456
RF | - | 0.977 | 0.000 | 1.000 | - | - | 0.500 | -
NB | - | 0.977 | 0.002 | 1.000 | 1.000 | 0.004 | 0.501 | 0.043
Logistic | RUS | 0.975 | 0.150 | 0.995 | 0.399 | 0.218 | 0.572 | 0.385
RF | RUS | 0.934 | 0.543 | 0.943 | 0.183 | 0.274 | 0.743 | 0.715
NB | RUS | 0.955 | 0.475 | 0.967 | 0.251 | 0.328 | 0.721 | 0.678
Logistic | SMOTE | 0.972 | 0.231 | 0.990 | 0.350 | 0.278 | 0.610 | 0.478
RF | SMOTE | 0.977 | 0.000 | 1.000 | - | - | 0.500 | -
NB | SMOTE | 0.814 | 0.726 | 0.816 | 0.085 | 0.152 | 0.771 | 0.770
Logistic | ROSE | 0.965 | 0.327 | 0.980 | 0.277 | 0.300 | 0.654 | 0.566
RF | ROSE | 0.863 | 0.765 | 0.865 | 0.118 | 0.205 | 0.815 | 0.813
NB | ROSE | 0.890 | 0.701 | 0.894 | 0.135 | 0.226 | 0.797 | 0.791

The highest value for each metric is highlighted in bold. The metrics for RF (without balancing), NB (without balancing), and RF with SMOTE were not considered, as these models have significant limitations in accurately identifying dropout cases.
Table A2. Metrics for the second year of high school in 2022 for the Logistic Regression, Random Forest, and Naive Bayes classifiers.

Classifier | Balancing | Accur. | Sens. | Spec. | Prec. | F1 | AUC | G-Mean
Logistic | - | 0.916 | 0.683 | 0.922 | 0.182 | 0.287 | 0.803 | 0.793
RF | - | 0.975 | 0.000 | 1.000 | - | - | 0.500 | -
NB | - | 0.975 | 0.019 | 1.000 | 0.579 | 0.036 | 0.509 | 0.137
Logistic | RUS | 0.960 | 0.536 | 0.970 | 0.314 | 0.396 | 0.753 | 0.721
RF | RUS | 0.852 | 0.813 | 0.853 | 0.123 | 0.213 | 0.833 | 0.832
NB | RUS | 0.801 | 0.783 | 0.801 | 0.091 | 0.162 | 0.792 | 0.792
Logistic | SMOTE | 0.958 | 0.510 | 0.969 | 0.296 | 0.374 | 0.740 | 0.703
RF | SMOTE | 0.975 | 0.024 | 0.999 | 0.378 | 0.045 | 0.511 | 0.154
NB | SMOTE | 0.616 | 0.911 | 0.609 | 0.056 | 0.105 | 0.760 | 0.745
Logistic | ROSE | 0.930 | 0.659 | 0.937 | 0.210 | 0.318 | 0.798 | 0.786
RF | ROSE | 0.705 | 0.913 | 0.700 | 0.071 | 0.132 | 0.806 | 0.799
NB | ROSE | 0.749 | 0.897 | 0.745 | 0.082 | 0.150 | 0.821 | 0.817

The highest value for each metric is highlighted in bold. The metrics for RF (without balancing), NB (without balancing), and RF with SMOTE were not considered, as these models have significant limitations in accurately identifying dropout cases.
Table A3. Metrics for the third year of high school in 2022 for the Logistic Regression, Random Forest, and Naive Bayes classifiers.

Classifier | Balancing | Accur. | Sens. | Spec. | Prec. | F1 | AUC | G-Mean
Logistic | - | 0.953 | 0.629 | 0.957 | 0.158 | 0.252 | 0.793 | 0.776
RF | - | 0.987 | 0.000 | 1.000 | - | - | 0.500 | -
NB | - | 0.784 | 0.860 | 0.783 | 0.048 | 0.091 | 0.821 | 0.821
Logistic | RUS | 0.977 | 0.482 | 0.983 | 0.265 | 0.342 | 0.732 | 0.688
RF | RUS | 0.859 | 0.831 | 0.859 | 0.070 | 0.130 | 0.845 | 0.845
NB | RUS | 0.716 | 0.932 | 0.713 | 0.040 | 0.076 | 0.822 | 0.815
Logistic | SMOTE | 0.935 | 0.673 | 0.938 | 0.122 | 0.207 | 0.805 | 0.794
RF | SMOTE | 0.986 | 0.022 | 0.999 | 0.176 | 0.038 | 0.510 | 0.146
NB | SMOTE | 0.897 | 0.745 | 0.899 | 0.086 | 0.155 | 0.822 | 0.818
Logistic | ROSE | 0.894 | 0.748 | 0.896 | 0.084 | 0.151 | 0.822 | 0.818
RF | ROSE | 0.651 | 0.946 | 0.647 | 0.033 | 0.064 | 0.797 | 0.782
NB | ROSE | 0.699 | 0.917 | 0.696 | 0.037 | 0.071 | 0.807 | 0.799

The highest value for each metric is highlighted in bold. The metrics for RF (without balancing), NB (without balancing), and RF with SMOTE were not considered, as these models have significant limitations in accurately identifying dropout cases.
In Table A4, it is evident that balancing improved all metrics across the different high school years. Furthermore, the ROSE and RUS balancing methods proved to be the most effective. These results are similar to those previously observed.
Table A4. Optimal configuration for each metric in 2022.

Metric | First Year | Second Year | Third Year
Accuracy | Logistic & RUS | Logistic & RUS | Logistic & RUS
Sensitivity | RF & ROSE | RF & ROSE | RF & ROSE
Specificity | Logistic & RUS | Logistic & RUS | Logistic & RUS
Precision | Logistic & RUS | Logistic & RUS | Logistic & RUS
F1 | NB & RUS | Logistic & RUS | Logistic & RUS
AUC | RF & ROSE | RF & RUS | RF & RUS
G-mean | RF & ROSE | RF & RUS | RF & RUS

Appendix A.2. Evaluation Metrics

Most of the metrics employed in this research are based on the confusion matrix. Table A5 illustrates all possible classifications that our model can make. A true negative occurs when the model classifies a student as a non-dropout and they indeed do not leave the school. Conversely, a true positive happens when the model correctly identifies a student who drops out. Errors occur in two situations: false positives and false negatives. In the former, the model classifies a student as a dropout, but they do not leave the school; false negatives occur when the model predicts a student as a non-dropout, but they leave the school.
Table A5. Confusion matrix.

         | Predicted 0         | Predicted 1
Actual 0 | TN (True Negative)  | FP (False Positive)
Actual 1 | FN (False Negative) | TP (True Positive)
Mathematically, the metrics are defined as:
  • Accuracy: (TN + TP)/(TN + FP + FN + TP);
  • Sensitivity: TP/(FN+TP);
  • Specificity: TN/(TN+FP);
  • Precision: TP/(TP+FP);
  • F1-Score: 2TP/(2TP + FP + FN);
  • G-mean: $\sqrt{\text{sensitivity} \times \text{specificity}}$.
Accuracy measures the percentage of correct classifications. Sensitivity measures the percentage of students who leave the school who were correctly identified, whereas specificity is the percentage of students who do not leave the school who were correctly identified. Precision is the proportion of correctly classified dropout students among all students classified as dropouts (predictions equal to 1). Finally, the F1-Score is the harmonic mean of precision and sensitivity.
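As a small worked example, these metrics can be computed directly from a confusion matrix in R; truth and pred below are assumed 0/1 vectors (1 = dropout) and are not objects from the original study.

```r
# Metrics of Appendix A.2 computed from an assumed confusion matrix.
conf <- table(Actual = truth, Predicted = pred)
TN <- conf["0", "0"]; FP <- conf["0", "1"]
FN <- conf["1", "0"]; TP <- conf["1", "1"]

accuracy    <- (TN + TP) / (TN + FP + FN + TP)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f1          <- 2 * TP / (2 * TP + FP + FN)
g_mean      <- sqrt(sensitivity * specificity)
```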
According to [], accuracy and precision may not adequately capture model performance on imbalanced datasets. A classifier may reach relatively high accuracy even if the minority class is completely ignored []. For [], the AUC and F1-score are more appropriate metrics for imbalanced datasets, while [] point out that the AUC and G-mean are unaffected by imbalances in the class distribution.
In addition to these metrics, we employed the area under the receiver operating characteristic curve (AUC). A receiver operating characteristic curve (ROC curve) illustrates the relationship between the true and false positive rates for various threshold classifications. The AUC metric summarises the ROC curve by calculating the area under the curve, which ranges between 0 and 1. The greater the AUC, the better the classification. Figure A1 illustrates these concepts. In this figure the blue line represents the ROC curve itself. Each point on this curve corresponds to a specific classification threshold, showing the trade-off between the True Positive Rate and the False Positive Rate. The curve demonstrates how the classifier’s performance changes as the threshold for classifying a positive instance is varied. The red dashed line represents the line of no-discrimination (also known as random guess line). This line indicates the performance of a classifier that randomly guesses the class. An AUC of 1.0 represents a perfect classifier, while an AUC of 0.5 represents a classifier no better than random guessing.
Figure A1. The ROC curve.
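For completeness, the AUC can be obtained from predicted probabilities with the pROC package; truth (coded 0/1) and p_hat are assumed objects, as in the earlier sketches.

```r
# Hedged sketch: ROC curve and AUC from predicted dropout probabilities.
library(pROC)
roc_obj <- roc(response = truth, predictor = p_hat,
               levels = c(0, 1), direction = "<")   # 0 = controls (stay), 1 = cases (dropout)
auc(roc_obj)        # area under the ROC curve
plot(roc_obj)       # curve of the kind shown schematically in Figure A1
```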

References

  1. Burgess, S. Human Capital and Education: The State of Art in the Economics of Education; Technical Report; Institute for the Study of Labour: Bonn, Germany, 2016. [Google Scholar]
  2. UNICEF. Early Warning Systems for Students at Risk of Dropping Out; Unicef Series on Education Participation and Dropout Prevention; UNICEF: New York, NY, USA, 2017. [Google Scholar]
  3. Wood, L.; Kiperman, S.; Esch, R.C.; Leroux, A.J.; Truscott, S.D. Predicting dropout using student-and school-level factors: An ecological perspective. Sch. Psychol. Q. 2017, 32, 35. [Google Scholar] [CrossRef]
  4. Lamb, S.; Rice, S. Effective Strategies to Increase School Completion Report: Report to the Victorian Department of Education and Early Childhood Development; Technical Report; Communications Division, Department of Education and Early Childhood Development: Fredericton, NB, Canada, 2008. [Google Scholar]
  5. Knowles, J.E. Of needles and haystacks: Building an accurate statewide dropout early warning system in Wisconsin. J. Educ. Data Min. 2015, 7, 18–67. [Google Scholar] [CrossRef]
  6. Uldall, J.S.; Rojas, C.G. An application of machine learning in public policy early warning prediction of school dropout in the Chilean public education system. Multidiscip. Bus. Rev. 2022, 15, 20–35. [Google Scholar] [CrossRef]
  7. Sletten, M.A.; Toge, A.G.; Malmberg-Heimonem, I. Effects of an early warning system on student absence and completion in Norwegian upper secondary schools: A cluster-randomised study. Scand. J. Educ. Res. 2022, 67, 1151–1165. [Google Scholar] [CrossRef]
  8. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  9. Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137. [Google Scholar] [CrossRef]
  10. Altalhan, M.; Algarni, A.; Alouane, M.T.H. Imbalanced data problem in machine learning: A review. IEEE Access 2025. [Google Scholar] [CrossRef]
  11. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  12. Menardi, G.; Torelli, N. Training and assessing classification rules with imbalanced data. Data Min. Knowl. Discov. 2014, 28, 92–122. [Google Scholar] [CrossRef]
  13. Marquez-Vera, C.; Cano, A.; Romero, C.; Noaman, A.Y.M.; Fardoun, H.M.; Ventura, S. Early dropout prediction using data mining: A case study with high school students. Expert Syst. 2016, 33, 107–124. [Google Scholar] [CrossRef]
  14. Sara, N.B.; Halland, R.; Igel, C.; Alstrup, S. High-school dropout prediction using machine learning: A Danish large-scale study. In Proceedings of the ESANN 2015 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence, Bruges, Belgium, 22–24 April 2015; p. 6. [Google Scholar]
  15. Adelman, M.; Haimovich, F.; Ham, A.; Vazquez, E. Predicting school dropout with administrative data: New evidence from Guatemala and Honduras. Educ. Econ. 2018, 26, 356–372. [Google Scholar] [CrossRef]
  16. Sandoval-Palis, I.; Naranjo, D.; Vidal, J.; Gilar-Corbi, R. Early dropout prediction model: A case study of university leveling course students. Sustainability 2020, 12, 9314. [Google Scholar] [CrossRef]
  17. Freitas, F.A.d.S.; Vasconcelos, F.F.; Peixoto, S.A.; Hassan, M.M.; Dewan, M.A.A.; Albuquerque, V.H.C.; Filho, P.P.R. IoT system for school dropout prediction using machine learning techniques based on socioeconomic data. Electronics 2020, 9, 1613. [Google Scholar] [CrossRef]
  18. Pereira, G.A.A.; Demura, K.D.; Nunes, I.C.; Paula, L.C.; Lira, P.S. An early warning system for school dropout in the state of Espírito Santo: A machine learning approach with variable selection methods. Pesqui. Oper. 2024, 44, 1–18. [Google Scholar] [CrossRef]
  19. Rabelo, A.M.M.; Zárate, L.E. A model for predicting dropout of higher education students. Data Sci. Manag. 2024; in press. [Google Scholar] [CrossRef]
  20. Rovira, S.; Puertas, E.; Igual, L. Data-driven system to predict academic grades and dropout. PLoS ONE 2017, 12, e0171207. [Google Scholar] [CrossRef] [PubMed]
  21. Del Bonifro, F.; Gabbrilelli, M.; Lisanti, G.; Zingaro, S.P.P. Student dropout prediction. In Artificial Intelligence in Education, Proceedings of the AIED 2020, Ifrane, Morocco, 6–10 July 2020; Lecture Notes in Computer Science; Bittencourt, I.I., Cukurova, M., Luckin, R., Millán, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar] [CrossRef]
  22. Selim, K.S.; Rezk, S.S. On predicting school dropouts in Egypt: A machine learning approach. Educ. Inf. Technol. 2023, 28, 9235–9266. [Google Scholar] [CrossRef]
  23. Villar, A.; Andrade, C. Supervised machine learning algorithms for predicting student dropout and academic success: A comparative study. Discov. Artif. Intell. 2024, 4, 2. [Google Scholar] [CrossRef]
  24. Costa, E.B.; Fonseca, B.; Santana, M.A.; de Araújo, F.F.; Rego, J. Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Comput. Hum. Behav. 2017, 73, 247–256. [Google Scholar] [CrossRef]
  25. Lee, S.; Chung, J.Y. The machine learning-based dropout early warning system for improving the performance of dropout prediction. Appl. Sci. 2019, 9, 3093. [Google Scholar] [CrossRef]
  26. Barros, T.M.; Neto, P.A.S.; Silva, I.; Guedes, L.A. Predictive models for imbalanced data: A school dropout perspective. Educ. Sci. 2019, 9, 275. [Google Scholar] [CrossRef]
  27. Orooji, M.; Chen, J. Predicting Louisiana public high school dropout through imbalanced learning techniques. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; IEEE: New York, NY, USA, 2019. [Google Scholar] [CrossRef]
  28. Wongvorachan, T.; He, S.; Bulut, O. A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information 2023, 14, 54. [Google Scholar] [CrossRef]
  29. Cho, C.H.; Yu, Y.W.; Kim, H.G. A Study on dropout prediction for university students using machine learning. Appl. Sci. 2023, 13, 12004. [Google Scholar] [CrossRef]
  30. Psathas, G.; Chatzidaki, T.K.; Demetriadis, S.N. Predictive modeling of student dropout in MOOCs and self-regulated learning. Computers 2023, 12, 194. [Google Scholar] [CrossRef]
  31. Kim, S.; Choi, E.; Jun, Y.; Lee, S. Student dropout prediction for university with high precision and recall. Appl. Sci. 2023, 13, 6275. [Google Scholar] [CrossRef]
  32. Martinho, V.; Nunes, C.; Minussi, C.R. An intelligent system for prediction of school dropout risk group in higher education classroom based on artificial neural networks. In Proceedings of the 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; IEEE: New York, NY, USA, 2013; pp. 159–166. [Google Scholar] [CrossRef]
  33. Bastos, J.A. Predicting credit scores with boosted decision trees. Forecasting 2022, 4, 925–935. [Google Scholar] [CrossRef]
  34. Souadda, L.I.; Halitim, A.R.; Benilles, B.; Oliveira, J.M.; Ramos, P. Optimizing credit risk prediction for peer-to-peer lending using machine learning. Forecasting 2025, 7, 35. [Google Scholar] [CrossRef]
  35. Papana, A.; Spyridou, A. Bankruptcy prediction: The case of the Greek market. Forecasting 2020, 2, 505–525. [Google Scholar] [CrossRef]
  36. Dimitriadou, A.; Gogas, P.; Papadimitriou, T.; Plakandaras, V. Oil market efficiency under a machine learning perspective. Forecasting 2019, 1, 157–168. [Google Scholar] [CrossRef]
  37. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer Series in Statistics; Springer: New York, NY, USA, 2013. [Google Scholar]
  38. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Series in Statistics; Springer: New York, NY, USA, 2017. [Google Scholar]
  39. King, J.C.; Amigó, J.M. Integration of LSTM networks in random forest algorithms for stock market trading predictions. Forecasting 2025, 7, 49. [Google Scholar] [CrossRef]
  40. Ptak, M.; Sojka, M.; Szyga-Pluta, K.; Amnuaylojaroen, T. Three environments, one problem: Forecasting water temperature in Central Europe in response to climate change. Forecasting 2025, 7, 24. [Google Scholar] [CrossRef]
  41. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  42. Spiliotis, E.; Abolghasemi, M.; Hyndman, R.J.; Petropoulos, F. Hierarchical forecast reconciliation with machine learning. Appl. Soft Comput. 2021, 112, 107756. [Google Scholar] [CrossRef]
  43. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 2008, 28, 1–26. [Google Scholar] [CrossRef]
  44. Ullah, N.; Ruocco, M.; Della Cioppa, A.; De Falco, I.; Sannino, G. An explainable deep learning framework with adaptive feature selection for smart lemon disease classification in agriculture. Electronics 2025, 14, 3928. [Google Scholar] [CrossRef]
  45. Alves, J.M.; Barbosa, R.S. Machine learning for basketball game outcomes: NBA and WNBA leagues. Computation 2025, 13, 230. [Google Scholar] [CrossRef]
  46. Changpetch, P.; Pitpeng, A.; Hiriote, S.; Yuangyai, C. Integrating data mining techniques for Naïve Bayes classification: Applications to medical datasets. Computation 2021, 9, 99. [Google Scholar] [CrossRef]
  47. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  48. He, H.; Ma, Y. Imbalanced Learning: Foundations, Algorithms and Applications; IEEE Press—Wiley: Hoboken, NJ, USA, 2013. [Google Scholar]