Data Balancing Techniques for Predicting Student Dropout Using Machine Learning

Abstract: Predicting student dropout is a challenging problem in the education sector. This is due to an imbalance in student dropout data, mainly because the number of registered students is always higher than the number of dropout students. Developing a model without taking the data imbalance issue into account may lead to an ungeneralized model. In this study, different data balancing techniques were applied to improve prediction accuracy in the minority class while maintaining a satisfactory overall classification performance. Random Over Sampling, Random Under Sampling, Synthetic Minority Over Sampling, SMOTE with Edited Nearest Neighbor, and SMOTE with Tomek links were tested, along with three popular classification models: Logistic Regression, Random Forest, and Multi-Layer Perceptron. Publicly accessible datasets from Tanzania and India were used to evaluate the effectiveness of the balancing techniques and prediction models. The results indicate that SMOTE with Edited Nearest Neighbor achieved the best classification performance on the 10-fold holdout sample. Furthermore, Logistic Regression correctly classified the largest number of dropout students (57,348 for the Uwezo dataset and 13,430 for the India dataset) using the confusion matrix as the evaluation metric. The application of these models allows for the precise prediction of at-risk students and the reduction of dropout rates.


Introduction
This paper presents a novel approach for predicting student dropout using machine learning (ML) methods and data balancing techniques. The proposed method has been tested on real-world datasets collected from Tanzania and India. Additionally, this paper provides a unique contribution by suggesting the use of data balancing techniques to improve the accuracy of machine learning models for student dropout prediction. This research can contribute to environmental sustainability by providing better education planning and policymaking. It can also help in understanding the impact of climate change on student dropout by providing better predictions of the associated risk factors through supervised learning applications. The majority of supervised learning applications face the problem of classifying unbalanced datasets, where one class is underrepresented relative to another [1][2][3][4][5][6][7][8]. This problem is common in real-world applications in telecommunications, the web, finance, ecology, biology, medicine, etc., with a negative impact on the classification performance of machine learning models [2,9,10]. In the context of education, the class imbalance problem occurs in the field of student dropout because the number of students enrolled is higher than the number of dropouts [11,12]. Student dropout is one of the challenges facing several schools in developing countries [13,14]. It is more common in girls than boys, and in lower secondary schools as compared to higher levels [15]. According to [16], the imbalance ratio is around 1:10, and, in most cases, the minority class usually represents the target group [2]. Because improving the predictive accuracy of the minority class is one of the greatest learning interests, many researchers have focused on developing solutions for the problem of class imbalance. Among the developed solutions, data sampling techniques aim to balance the data before model
development [17]. These consist of undersampling techniques, i.e., Random Under Sampling (RUS); oversampling techniques, i.e., Random Over Sampling (ROS) together with the Synthetic Minority Over Sampling Technique (SMOTE); and hybrid techniques, i.e., the Synthetic Minority Over Sampling Technique with Edited Nearest Neighbor (SMOTE ENN) and the Synthetic Minority Over Sampling Technique with Tomek links (SMOTE TOMEK). RUS is a non-heuristic technique that selects a subset of the majority class to create a balanced class distribution [18]. In this technique, examples are randomly selected from the majority class for exclusion, without replacement, until the remaining number of examples matches that of the minority class. The main advantage of this technique, especially in Big Data, is the reduction in execution cost due to the decrease in data size caused by removing some examples. However, by excluding certain examples from the majority class, potential information may be lost that could have an impact on the learning process. In contrast, the ROS technique is more commonly used than the RUS technique, since undersampling tends to eliminate important information from the data. ROS randomly duplicates minority examples until the number of chosen examples plus the original examples of the minority class is roughly equal to that of the majority class [19]. Despite its ability to balance the class distribution, ROS tends to cause overfitting problems. On the other hand, SMOTE emphasizes the creation of synthetic minority examples for inclusion in the original dataset [12]. This technique forms new minority class examples by interpolating between several existing minority class examples [20]. SMOTE has become the most frequently used technique, but its limitation, similar to ROS, is that it assumes equal importance for all minority instances. The SMOTE TOMEK hybrid technique combines SMOTE and Tomek links. Tomek links were proposed to be
applied to an oversampled training set as a data cleaning technique in order to come up with better defined class clusters [21]. This technique deletes examples that form Tomek links between the two classes. Meanwhile, SMOTE ENN combines SMOTE and the Edited Nearest Neighbor (ENN) rule [22]. The motive behind this technique is similar to that of SMOTE TOMEK; however, ENN is used to expel examples from both classes, so any example that is misclassified by its three nearest neighbors is removed from the training set. This technique should help to further clean up the data, as ENN tends to eliminate more examples than Tomek links. Apart from data sampling techniques, data imbalance can also be handled by using algorithmic modification techniques, which focus on changing the learning algorithm to adapt to the imbalanced data setting [18], and cost-sensitive learning techniques, which focus on minimizing the costs associated with the learning process [23]. While there are several approaches to dealing with imbalanced datasets, data sampling techniques are simple to use for the problem of class imbalance [12].
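The two random sampling techniques described above can be sketched in a few lines of plain Python. The function names and toy class sizes below are illustrative only; in practice, a library such as imbalanced-learn provides ready-made implementations (RandomUnderSampler, RandomOverSampler):

```python
import random

def random_undersample(majority, minority, seed=0):
    """RUS: randomly select (without replacement) a subset of the
    majority class equal in size to the minority class."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept + minority

def random_oversample(majority, minority, seed=0):
    """ROS: randomly duplicate minority examples (with replacement)
    until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Toy 98:2 class ratio, similar in spirit to the imbalance in the datasets
majority = [("retained", i) for i in range(98)]
minority = [("dropout", i) for i in range(2)]

balanced_rus = random_undersample(majority, minority)
balanced_ros = random_oversample(majority, minority)
print(len(balanced_rus))  # 4   (2 majority kept + 2 minority)
print(len(balanced_ros))  # 196 (98 majority + 98 minority after duplication)
```

The contrast between the two outputs illustrates the trade-off discussed above: RUS shrinks the training set (cheaper, but discards majority examples), while ROS grows it by duplication (no information loss, but a risk of overfitting).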
On the other hand, RF is an ensemble classification model that is made up of several randomized decision trees [49][50][51][52][53]. It is a widely used model because of its efficient implementation and its ability to reduce overfitting [54][55][56][57][58]. The performance of the RF model is determined by the tuning of its parameters and by feature selection [59]. This model is a non-parametric tree model, which is desirable when dealing with high-dimensional datasets [60]. Since RF is based on the definition of several independent trees, it is straightforward to obtain a parallel and faster application of the RF method, in which many trees are built in parallel on different cores [61]. LR, in turn, is among the classification approaches used to model the probability of discrete (binary or multinomial) outcomes [62]. This model works very similarly to linear regression by analyzing the relationship between multiple independent variables and a categorical dependent variable, calculating the probability of an event by fitting the data to a logistic curve [63,64]. There are two kinds of logistic regression: binary logistic regression (as in the present study) and multinomial logistic regression [64,65]. Despite the ability of these models (MLP, RF, and LR) to predict student dropout, data imbalance was ignored in many studies and needs to be addressed in order to improve the predictive results of machine learning models.
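As a minimal illustration of how binary LR maps a linear combination of features onto a dropout probability via the logistic curve, consider the sketch below. The coefficients and features are hypothetical, chosen only to show the mechanics:

```python
import math

def logistic_probability(features, weights, bias):
    """Binary logistic regression: pass the linear score through the
    sigmoid (logistic) function to obtain P(dropout = 1 | features)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical coefficients for two illustrative features
# (e.g., a standardized pupil-teacher ratio and household size)
p = logistic_probability([0.0, 0.0], [0.8, -0.3], 0.0)
print(round(p, 2))  # 0.5 -- a zero linear score maps to probability 0.5
```

A fitted model would learn the weights and bias from the training data; a probability above a chosen threshold (commonly 0.5) would then flag the student as at risk.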
For the evaluation of the performance of machine learning models, one of the key factors guiding algorithmic modeling is the evaluation criteria. Accuracy, a statistical measure quantifying the proportion of correct predictions, has been used as a common metric by many researchers [66,67]. However, in the imbalanced data domain, this metric is no longer appropriate, as it is influenced far less by the minority class than by the majority class and cannot distinguish between the magnitudes of errors. In the context of imbalanced datasets, measures that account for the class distribution are used instead. The confusion matrix records the examples correctly and incorrectly recognized for each class in a binary class problem [68]. This matrix is an important tool for assessing prediction results in a way that is very easy to understand [69]. In addition, the Geometric Mean (Gm) of the actual rates measures the capacity of the model to balance sensitivity (TPrate) and specificity (TNrate) [1]. Gm is at a maximum when TPrate and TNrate are equal. The F-measure (Fm) is the harmonic mean of precision and recall [66]. This metric is more sensitive to changes in the positive predictive value (precision) than in the True Positive rate (TPrate). A high value of Fm shows that both precision and recall are reasonably high. To acquire the highest TPrate without excessively minimizing the TNrate, the Adjusted Geometric Mean (AGm) was introduced [2]. Despite the ability of these metrics to evaluate the performance of machine learning models, other studies have reported their limitations with respect to the minority classes in imbalanced datasets [2,70,71]; hence, the application of several metrics is highly recommended when evaluating the performance of machine learning models.
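Both Gm and Fm can be computed directly from the four cells of a binary confusion matrix. The sketch below uses a made-up confusion matrix purely to demonstrate the formulas:

```python
import math

def rates(tp, fn, fp, tn):
    """Derive the basic rates from a binary confusion matrix."""
    tp_rate = tp / (tp + fn)        # sensitivity / recall
    tn_rate = tn / (tn + fp)        # specificity
    precision = tp / (tp + fp)      # positive predictive value
    return tp_rate, tn_rate, precision

def g_mean(tp, fn, fp, tn):
    """Geometric Mean of sensitivity and specificity."""
    tp_rate, tn_rate, _ = rates(tp, fn, fp, tn)
    return math.sqrt(tp_rate * tn_rate)

def f_measure(tp, fn, fp, tn):
    """Harmonic mean of precision and recall."""
    tp_rate, _, precision = rates(tp, fn, fp, tn)
    return 2 * precision * tp_rate / (precision + tp_rate)

# Toy confusion matrix: 80 dropouts found, 20 missed,
# 100 false alarms, 900 retained students correctly classified.
print(round(g_mean(80, 20, 100, 900), 3))
print(round(f_measure(80, 20, 100, 900), 3))
```

Note how the two metrics disagree on this example: Gm is high because both class-wise rates are balanced, while Fm is pulled down by the low precision on the minority class, which is exactly why reporting several metrics is recommended.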
Therefore, this paper presents several data balancing techniques for predicting student dropout using datasets from developing countries. The research problem is to identify how to effectively use machine learning models to predict student dropout when the dataset is imbalanced. The objective of the paper is to explore the use of various data balancing techniques to improve the accuracy of machine learning models for predicting student dropout. The novelty of the paper lies in its comparison of the performance of different data balancing techniques in addressing the issue of imbalanced datasets.
The next section presents related works that applied data balancing techniques to address the problem of student dropout. Section 3 introduces the materials and methods used to conduct this study. The results are presented and discussed in Section 4. Finally, the article presents the conclusion and prospective future directions in Section 5.

Literature Review
The use of data balancing techniques to predict student dropout using machine learning has been applied in several studies, as summarized in Figure 1. A study by [11] used machine learning to predict student dropout and academic success. The study used a dataset to build machine learning models for predicting academic performance and dropout. Imbalanced data were identified, and different techniques for handling this problem were proposed, such as data-level techniques, including the Synthetic Minority Over Sampling Technique (SMOTE) and the Adaptive Synthetic Sampling Approach (ADASYN), or algorithm-level techniques, including Balanced Random Forest and SMOTE-Bagging. Another study by [72] used data balancing techniques to predict student dropout at a university in Turkey. A dataset of 1510 student records was used, and different classifiers, such as decision trees and support vector machines, were applied. Data balancing techniques such as oversampling and undersampling were used to improve the accuracy of the models. The results showed that the use of data balancing techniques improved the accuracy of the models and reduced the bias in the data.

A study by [73] used machine learning and applied data balancing techniques to predict student dropout. The study used an unbalanced dataset from a real university and applied an undersampling technique to balance it. The study used a decision tree algorithm to predict student dropout and obtained an accuracy of 83.2%.
Another study by [74] used machine learning and applied data balancing techniques to predict student dropout. The study used a dataset of student records collected from a university and applied oversampling, undersampling, and a combination of both techniques to balance it. The study applied a Random Forest algorithm to predict student dropout and obtained an accuracy of 81.2%.
A study by [75] used machine learning and applied data balancing techniques to predict student dropout. The study used an imbalanced dataset from a university and applied a combination of oversampling and undersampling techniques to balance it. The study used a decision tree algorithm to predict student dropout and obtained an accuracy of 85.3%.
One study by [76] developed predictive models for imbalanced data. The study applied data mining techniques to forecast dropout rates, using a decision tree, neural networks, and balanced bagging. Classifiers were tested with and without the use of data balancing techniques, including downsampling, SMOTE, and ADASYN. The results showed that the geometric mean and UAR provide reliable results when predicting dropout rates using balanced bagging classification techniques. Finally, a study by [77] applied data balancing techniques to predict student dropout using machine learning. The study used a dataset of 3420 student records from a university in Greece. A variety of classification algorithms were tested, including Naïve Bayes, C4.5, and Support Vector Machines. Furthermore, data balancing techniques such as undersampling and oversampling were applied to remove bias and improve the accuracy of the models. The results showed that the use of data balancing techniques improved the accuracy of the models for predicting student dropout. Despite the fact that many studies applied data balancing techniques to address the problem of student dropout, many of them were carried out in developed countries using developed countries' datasets.

Dataset
To address student dropout, this study used two publicly available datasets from developing countries. The first dataset was the Uwezo data 1 on learning at the country level in Tanzania, which was collected in 2015 with the objective of assessing children's learning levels across hundreds of thousands of households. The second dataset was collected in 2016 with the aim of assessing student dropout in India 2 . The Uwezo dataset consisted of 61,340 samples, of which 98.4% were retained and 1.6% were dropouts, and the India dataset consisted of 11,257 samples, of which 95.1% were retained and 4.9% were dropouts. Therefore, these two datasets were highly imbalanced, as presented in Figure 2a,b, respectively.


Data Pre-Processing
Data from the two datasets were pre-processed prior to obtaining a final training set. This process was carried out as a precautionary measure to ensure that the datasets were properly cleaned and accurate prior to model development. The data clean-up was carried out by removing information that could reveal the identity of individuals to the end-user. Missing values were replaced with medians and zeroes. The following variables were identified with missing values: Pupil Teacher Ratio (PTR), Pupil Classroom Ratio (PCR), Girl's Pupil Latrines Ratio (GPLR), Boy's Pupil Latrines Ratio (BPLR), Parent Teacher Meeting Ratio (PTMR), Main source of household income (Income), Enumeration Area type (EA area), Parent who checks his/her child's exercise book once a week (PCCB), Parent who discusses his/her child's progress last term with the teacher (PTD), Student who read a book with his/her parent last week (SPB), School has a privacy room for girls (SGR), and Household meals per day (MLPD). In handling the missing values, PTR, PCR, GPLR, and BPLR were imputed with medians, and PTMR, Income, EA area, PCCB, PTD, SPB, SGR, and MLPD were imputed with zeros. In addition, nominal variables were converted to numerical values to comply with Scikit-learn.
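The two-rule imputation scheme above (medians for the ratio variables, zeros for the remaining variables) can be sketched as follows. The column names and values are toy examples standing in for the real survey columns:

```python
import statistics

def impute(rows, median_cols, zero_cols):
    """Fill missing values (None): column medians for ratio variables,
    zeros for categorical/indicator variables."""
    for col in median_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        median = statistics.median(observed)
        for r in rows:
            if r[col] is None:
                r[col] = median
    for col in zero_cols:
        for r in rows:
            if r[col] is None:
                r[col] = 0
    return rows

# Toy rows: PTR is a ratio variable, Income a coded categorical variable
rows = [{"PTR": 40, "Income": 1},
        {"PTR": None, "Income": None},
        {"PTR": 50, "Income": 2}]
impute(rows, median_cols=["PTR"], zero_cols=["Income"])
print(rows[1])  # {'PTR': 45.0, 'Income': 0}
```

In a Scikit-learn workflow, the equivalent behavior is typically obtained with `SimpleImputer(strategy="median")` and `SimpleImputer(strategy="constant", fill_value=0)`.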

Data Sampling Techniques
Five data balancing techniques were employed to address the issue of data imbalance in the datasets. These techniques were employed before model development due to their ability to provide in-depth data cleaning, produce straightforward and satisfactory results when handling data imbalance, address the overfitting problem, and reduce running time and cost. RUS, ROS, SMOTE, SMOTE ENN, and SMOTE TOMEK were implemented. RUS was performed by randomly selecting examples from the majority class for exclusion, without replacement, until the remaining number of examples matched that of the minority class. This approach was chosen due to its ability to reduce the cost of execution by decreasing the size of the data through the removal of some examples. ROS was performed by randomly duplicating minority examples until the number of chosen examples plus the original examples of the minority class was roughly equal to that of the majority class. This approach was chosen based on its ability to not eliminate important information from the data. SMOTE was selected to form new minority class examples by interpolating between several existing minority class examples. Furthermore, SMOTE TOMEK was selected to remove examples that form Tomek links from both classes, and SMOTE ENN was selected to expel examples from both classes, so that any example misclassified by its three nearest neighbors was removed from the training set. This technique was anticipated to give more in-depth data cleaning, as ENN tends to eliminate more examples than Tomek links.
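The core interpolation step of SMOTE can be sketched in plain Python: a synthetic example is placed at a random point on the line segment between a minority example and one of its nearest minority neighbors. This is an illustrative sketch with toy 2-D points, not the library implementation (in practice, `imblearn.over_sampling.SMOTE` and the hybrid `SMOTEENN`/`SMOTETomek` classes are used):

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Create one synthetic minority example by interpolating between
    a randomly chosen minority point and one of its k nearest
    minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # rank the remaining minority points by squared distance to the base
    others = sorted((p for p in minority if p is not base),
                    key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # random position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

minority = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]
synthetic = smote_sample(minority)
print(synthetic)  # a new point between two of the minority examples
```

Because every synthetic point lies between two real minority points, SMOTE densifies the minority region instead of merely duplicating it, which is what distinguishes it from ROS.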

Classification Models
Three popular classification models, Logistic Regression (LR), Random Forest (RF), and Multi-Layer Perceptron (MLP), were assessed on a set of supervised classification datasets in order to see which model would perform better with consideration of the data imbalance problem. The selection of the three models took into consideration the supervised learning approach, particularly with respect to the classification problem. These models were selected because they were able to give satisfactory results in the prediction of student dropout. LR was selected to represent the linear models and was used to model the probability of binary outcomes (dropout/not dropout). In addition, RF represented the ensemble models and was chosen to reduce the overfitting problem and handle high-dimensional data. The MLP, on the other hand, represented artificial neural networks and was selected to reduce complexity.
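The ensemble idea behind RF rests on two simple mechanisms: each tree is trained on a bootstrap sample of the data, and the forest predicts by majority vote. A minimal sketch of just these two mechanisms (not of the tree-building itself) is shown below; all names and data are illustrative:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample the training set with replacement: the data one bagged
    tree in the ensemble would be trained on."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Random-Forest-style aggregation: each tree votes on the label,
    and the most common vote wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
data = list(range(10))
print(len(bootstrap(data, rng)))                # 10 (same size, drawn with replacement)
print(majority_vote(["drop", "stay", "drop"]))  # drop
```

Averaging many such votes over independently bootstrapped trees is what gives RF its resistance to overfitting relative to a single decision tree.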

Evaluation Metrics
To assess the performance of the classification models, three popular metrics were used: the Geometric Mean (Gm), the F-measure (Fm), and the Adjusted Geometric Mean (AGm). Furthermore, a confusion matrix was used to determine the best model based on the actual number of samples correctly and incorrectly classified. These metrics were chosen with an emphasis on the imbalanced domain and as standard measures that account for the class distribution. Gm was selected to measure the ability of the model to balance TPrate and TNrate. Fm was selected to measure the harmonic mean of TPrate and precision, whereas AGm was selected to measure the increase in TPrate without decreasing TNrate.

Experimental Design
In this study, MLP, RF, and LR were compared over six different structures (original, balanced with ROS, balanced with RUS, balanced with SMOTE, balanced with SMOTE ENN, and balanced with SMOTE TOMEK) using stratified 10-fold cross validation. The datasets were split into training, validation, and testing sets at 60%, 20%, and 20%, respectively, to minimize sampling bias. The methodology used to conduct this study is summarized in Figure 3.
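The stratified 60/20/20 split preserves the dropout-to-retained ratio inside each partition, which matters precisely because the classes are so unequal. A minimal sketch of such a split (illustrative function and toy labels; Scikit-learn's `StratifiedKFold` and `train_test_split(..., stratify=labels)` provide the production equivalents):

```python
import random
from collections import Counter

def stratified_split(samples, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split into train/validation/test while preserving the class
    ratio within each part (the 60/20/20 scheme used here)."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    parts = ([], [], [])
    for y, items in by_class.items():
        rng.shuffle(items)
        n = len(items)
        cut1 = int(n * fractions[0])
        cut2 = cut1 + int(n * fractions[1])
        chunks = (items[:cut1], items[cut1:cut2], items[cut2:])
        for part, chunk in zip(parts, chunks):
            part.extend((s, y) for s in chunk)
    return parts

# Toy 90:10 imbalance
labels = ["retained"] * 90 + ["dropout"] * 10
samples = list(range(100))
train, val, test = stratified_split(samples, labels)
print(len(train), len(val), len(test))    # 60 20 20
print(Counter(y for _, y in train))       # class ratio preserved in train
```

Without stratification, a random 20% test split of such data could easily contain almost no dropout examples at all, making the minority-class evaluation meaningless.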

Data balancing techniques for predicting student dropout using machine learning can help identify the key determinants of dropout more accurately. This can help schools and other educational institutions better understand the factors that lead to student dropout and take appropriate measures to prevent it. In addition, educational institutions can anticipate when students are at risk of dropping out and intervene early to provide the necessary support. This can help reduce the rate of student dropout and improve educational outcomes. By understanding the key determinants of student dropout and intervening early, educational stakeholders can provide targeted interventions to improve educational outcomes. This can help improve student success and reduce the overall dropout rate. Furthermore, data balancing techniques can also help identify disparities in educational outcomes among different groups of students, such as those from different backgrounds or those with different levels of academic achievement. This can help identify and address disparities in educational outcomes and promote equity in education.

Results and Discussion
The study used two datasets to compare data balancing techniques. The datasets used were highly imbalanced due to the fact that there are many more students still in school compared to students who drop out, which makes balancing the data very important in this study, because the focus was primarily on the minority class, in this case dropouts. The results showed that the SMOTE ENN data balancing technique provided very good solutions for achieving greater performance, followed by SMOTE TOMEK and RUS, on the Uwezo dataset. For the India dataset, the SMOTE ENN data balancing technique performed better, followed by SMOTE TOMEK and ROS (Table 1).
The SMOTE ENN data balancing technique has shown very good solutions for achieving greater performance due to its ability to provide in-depth data cleaning. Similar results were reported by [78] when assessing a number of methods to balance machine learning data. Furthermore, [79] stressed the techniques and importance of handling data imbalance when developing training sets for a machine learning model, and [80] emphasized the good performance of hybrid data balancing techniques, such as SMOTE-RSB, SMOTE-TOMEK, and SMOTE ENN, when dealing with highly imbalanced data, as in the case of student dropout. On the contrary, the RUS technique performed the worst in the study's experiment evaluating data sampling techniques. This could be due to the loss of certain potential information that could have an impact on the learning process. Similar results were reported by [81,82] when assessing multiple approaches to managing imbalanced datasets. However, other studies reported that this approach improved predictive performance compared with using no data sampling techniques at all [83,84]. Most datasets in the real world are not balanced, i.e., there is a majority and a minority class, and if data balancing is ignored when training a machine learning model, it may lead to bias towards one class: the model will learn more about the majority class and learn less about, or ignore, the minority class. Hence, handling unbalanced data is very important when developing a machine learning model.

Models Performance
Three machine learning models used with the data balancing techniques were evaluated, and the findings showed that LR was the best model at correctly classifying the highest number of student dropouts while misclassifying the lowest, followed by MLP and RF, on the Uwezo (Figure 4) and India (Figure 5) datasets.
Moreover, this study found that LR and MLP were the best models at correctly classifying the highest number of student dropouts and misclassifying the lowest. This may be due to the ability of LR to model the probability of binary outcomes and the power of MLP to produce satisfactory results for nonlinear relationships. Similar results were reported by [42,90] when determining the accuracy of their predictive models for the early prediction of stroke and student dropout, respectively. Both studies indicated that LR was the best-performing classification model relative to the others. These results, however, contradict what was reported by [91] in their study evaluating the performance of supervised machine learning models in healthcare, where K-Nearest Neighbor and Random Forest were reported to outperform other models such as Logistic Regression and Naive Bayes.

Similar metrics (Gm, Fm, and AGm) were used by [41,75–89] in evaluating the performance of the developed models in order to take the class distribution into account. In addition, accuracy has been reported as a common metric for measuring the degree of correctness of machine learning models [66,67]. However, its limitations in the imbalanced domain make it unsuitable for evaluating models with imbalanced data [2,72]. The issue of predicting student dropout using a machine learning model is an important one, and it has been addressed by many different approaches. Data balancing is one of the most promising of these methods. Data balancing techniques are designed to identify the key determinants of student dropout and then use machine learning to develop a model that can accurately predict dropout rates. Data balancing techniques involve creating a dataset that is as balanced as possible. This means that the data must be stratified to ensure that the populations being compared are equal in terms of key attributes. By ensuring that the data are balanced in terms of key attributes, the machine learning model can accurately predict dropout rates. The machine learning solution presented in this study can be used to accurately predict students at risk of dropping out of school and provide early measures for intervention.

Conclusions
Based on the analysis of the results, the study concluded that the SMOTE ENN balancing technique provides a good solution for achieving superior performance. Furthermore, LR has been considered a potential model for the type of data used due to its high accuracy in classifying the dropout class, which is the focus of this study. The study also concluded that the use of data balancing techniques before model development helps to improve the performance of the predictive results when measured by Gm, Fm, and AGm. In other words, predictive outcomes were improved when comparing the original (unbalanced) data with data balanced using sampling techniques. In a real-world environment, most datasets are imbalanced and contain a large number of expected examples with only a small number of unexpected examples, while most of the interest lies in the predictions of the unexpected examples. Machine learning models are not as precise at predicting the minority class in unbalanced datasets. Therefore, a data balancing task is required as part of the pre-processing phase to deal with this situation. This study is limited to the application of data sampling techniques to address the problem of student dropout. Prospective future directions will focus on alternative methods, including algorithmic modification and cost-sensitive learning, with the aim of improving the predictive power of machine learning models.


Figure 2. Dropout distributions for the Uwezo and India datasets: (a) Dropout distribution, Uwezo dataset; (b) Dropout distribution, India dataset. The Uwezo dataset consisted of 18 variables: Main source of household income (Income), Boy's Pupil Latrines Ratio (BPLR), School has a privacy room for girls (SGR), Region, District, Village, Student gender (Sex), Parent checks child's exercise book once a week (PCCB), Household meals per day (MLPD), Student read a book with his/her parent last week (SPB), Parent discusses child's progress last term with the teacher (PTD), Student age (Age), Enumeration Area type (EA area), Household size (HH size), Girl's Pupil Latrines Ratio (GPLR), Parent Teacher Meeting Ratio (PTMR), Pupil Classroom Ratio (PCR), Pupil Teacher Ratio (PTR), and Dropout. The India dataset consisted of the variables: Continue drop, Student id, Gender, Caste, Mathematics marks, English marks, Science marks, Science teacher, Languages teacher, Guardian, Internet, School id, Total students, Total toilets, and Establishment year.


Figure 3. Overview of the experimental design.

Figure 4. Comparison of models' performance in terms of numbers of correctly and incorrectly classified students (the Uwezo dataset).

Figure 5. Comparison of models' performance in terms of numbers of correctly and incorrectly classified students (the India dataset).


Table 1. Comparison of data balancing techniques (Uwezo and India datasets).