Predicting Academic Success of College Students Using Machine Learning Techniques

Abstract: College context and academic performance are important determinants of academic success; using students’ prior experience with machine learning techniques to predict academic success before the end of the first year reinforces college self-efficacy. Dropout prediction is related to student retention and has been studied extensively in recent work; however, there is little literature on predicting academic success using educational machine learning. For this reason, the CRISP-DM methodology was applied to extract relevant knowledge and features from the data. The dataset examined consists of 6690 records and 21 variables with academic and socioeconomic information. Preprocessing techniques and classification algorithms were analyzed. The area under the curve (AUC) was used to measure the effectiveness of the algorithms; XGBoost had an AUC of 87.75% and correctly classified eight out of ten cases, while the decision tree improved interpretation with ten rules in seven out of ten cases. Recognizing the gaps in the study and that on-time completion of college consolidates college self-efficacy, creating intervention and support strategies to retain students is a priority for decision makers. Assessing the fairness and discrimination of the algorithms was the main limitation of this work. In the future, we intend to apply the extracted knowledge and learn about its influence on university management.


Introduction
Higher education has developed a fundamental role due to the versatility and complexity of today's world, which has led to the rapid growth of scientific literature dedicated to predicting academic success or the risk of student dropout [1][2][3][4][5][6][7]. Higher education institutions (HEIs) and their traditional role of knowledge dissemination have changed; innovation in new knowledge, especially with the irruption of artificial intelligence [8], and the training of qualified professionals make many of them interact in different areas of society. In fact, their missions through teaching, research, and the ability to share and transfer this knowledge constitute central functions of their academic and cultural activity, with the aim of improving the level of knowledge in society. They have the important role of transmitting knowledge, skills, and values to students to create competitive professionals in society. Therefore, channeling students towards academic success is essential, as HEIs must continue the work undertaken and further deepen their involvement, significance, and service capacity in relation to the social, cultural, and economic framework [9]. Thus, the prediction of academic success with past information of students who have successfully completed their university studies has become a tool of interest for educational managers since it allows them to strengthen decisions and build improvement alternatives or educational policies. ICT is one of the most widely used alternatives today, especially machine learning.
Hence, advances in machine learning techniques, along with other areas of study, are precursors to educational data mining. In higher education, the academic success of students is statistically measured by the graduation rate, which is defined as the total number of students graduating divided by the total number of entering students. In fact, ref. [10] states that it is possible to think about student success more broadly by studying endogenous and exogenous factors in the student environment. Thus, the constant need to be effective in the academic success of students has led to the customization of machine learning in order to achieve specific predictive models that provide useful information.
In the last decade, many studies have addressed the problems of performance, dropout, and academic success in university students. As detailed in [11][12][13][14], the authors emphasize that university dropout or failure converges with students from disadvantaged social strata who project university dropout behavior. To sustain university permanence, the authors are inclined to consider that extra-university activities that guarantee retention should be strengthened. Therefore, early detection has become a tool for solving these problems. Academic history, university context (tangible and intangible resources), and other data were used as the input elements to predict the results [4]. For this purpose, qualitative and quantitative research methods have been used. More recently, multiple studies have employed data mining or machine learning techniques that, among other things, use algorithms and two well-known techniques to extract useful knowledge from data. The first technique, supervised classification, evaluates the data and predicts the target variable (class). The work of [6][15][16][17] has shown results related to supervised classification.
Similarly, the authors of [18,19], using another approach based on supervised classification, used a set of pre-selected algorithms that classify the data by applying the voting technique. Both approaches attempt to predict students' academic success or performance effectively. The second technique, unsupervised classification, is one in which the target variable is unknown and that focuses on finding hidden patterns among the data. In general, association rules are used to discover facts occurring within the data and are composed of two parts: antecedent and consequent; for example, the rule {A, B}⇒{C} means that, when A and B occur, then C occurs. The authors of [20][21][22] look for the occurrence of data by focusing on the association rules and evaluating the rules with metrics such as support, confidence, and lift, among others.
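As an illustration of how a rule such as {A, B}⇒{C} is evaluated, the following minimal Python sketch computes support, confidence, and lift using their standard definitions. The function name `rule_metrics` and the toy transactions are hypothetical and are not taken from the cited studies.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_ac / n                              # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0         # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0   # P(C | A) / P(C)
    return support, confidence, lift


# Toy data (hypothetical): each set is one observation.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]
support, confidence, lift = rule_metrics(transactions, {"A", "B"}, {"C"})
```

A lift above 1 suggests that the antecedent and consequent co-occur more often than would be expected if they were independent.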
In the studies of [23][24][25], related to machine learning, a convergence of objectives and techniques applied in the data preprocessing stage was observed, including feature reduction, data transformation, normalization, and instance selection, among others. At the same time, data balancing techniques and "black box" classification algorithms were analyzed. The synergy of the studies lies in the simplification of the predictive models obtained given the high degree of complexity of the extracted knowledge, for which they used decision trees, since this technique simplifies the knowledge by representing it as rules of the type (X⇒Y). To some extent, the methods applied are part of the KDD process proposed in [26]. However, data asymmetry is a typical problem in any area of study. Duplicity, ambiguity, and missing and overlapping data are frequent, especially in real-world problems. Indeed, in data mining classification techniques, problems often present an unequal distribution of examples among classes (target variable), where one or more classes (minority class) are underrepresented compared to the others (majority class) [27]. Commonly, the data balancing method defined by Chawla [28] is used in this type of problem. This work, however, aims to fill the existing gap in data balancing with educational data by using different balancing methods for multiclass problems.
The approach of this study is similar to previous work described in [6][29][30][31], where similar tasks were performed with binary and multiclass predictions. However, the main difference in our approach is the in-depth analysis of data balancing and feature selection techniques to avoid biases in predictions. Using 53% fewer variables and improving accuracy by 10% over the preliminary results with the raw data, we not only built classification models to identify the relevant factors of college students' academic success, but also obtained a general model from the decision tree to improve the readability of the predictive model. In this way, it is intended to provide additional guidance to academic decision makers. The open license software used for this work was R [32] through a customized library to visualize, preprocess, and classify the data. The Python library scikit-learn [33] was used for data balancing.
The core of the work focuses on the study of machine learning techniques that predict academic success. This has allowed us to establish the objective of the work, which is to know in advance the factors that explain the academic success of students at the end of their first year of university. To do this, it has been necessary to pose research questions, since we intend to identify the factors that contribute to the academic success of students during their first year of college. This will allow us to examine the preprocessing techniques, the predictive model, the determinants of academic success and, of course, the visualization techniques to improve its interpretation before and after obtaining the predictive model. In this sense, the research questions RQ1 to RQ3 were posed. Most studies on predicting academic success by machine learning have focused solely on finding a predictive model, which is, to some extent, highly effective. In contrast, the work presented, in line with RQ1, seeks the group of features that are most significant for the model and also seeks a balanced training dataset, using different data balancing techniques and avoiding biases in the prediction. RQ2 aims to find an effective predictive model using different supervised learning algorithms. Finally, RQ3 examines which variables were relevant in the predictive model achieved by the machine learning algorithms to then obtain another model with a better interpretation for the decision maker.
The presented work differs, among other things, by the following contributions: (i) we unveil the effectiveness of educational data mining techniques to identify academically successful students early enough to act and reduce the failure rate; (ii) the impact of data preprocessing is analyzed; (iii) the important variables underlying the best-performing predictive model are unveiled. Thus, the presented work is related to the works of [23,29,34], where the authors have examined the characteristics and impact of the best-performing algorithm. The rest of the paper is organized as follows: in Section 2, a literature review is carried out; in Section 3, the methodology used in this work is explained; in Section 4, the main results obtained by applying machine learning are presented; in Section 5, the discussion is presented; in Section 6, the relevant conclusions are drawn; in Section 7, the limitations are stated; and finally, in Section 8, future work is described.

Literature Review
In the cited literature, there are works related to the study of machine learning in higher education and its impact on the prediction of academic performance or success. In prediction, the purpose is to predict the target variable (class) of a dataset. The works cited in Table 1 employ supervised classification algorithms that focus on obtaining the predictive model. Among other works, the use of machine learning techniques to predict the success or failure of university courses or degrees stands out. The recommender system proposed by [35] suggests to computer science students the subjects they can take, in addition to predicting success or failure based on the previous experience of other university students. In that work, data preprocessing and example balancing techniques were applied. Then, the preprocessed data were used as input for the classification algorithms to learn and obtain the prediction model from the test data. The results achieved provide guidelines for university administrators to enhance educational quality. In this sense, the early provision of useful information to predict a given event in the student body is valuable. Hence, the study of academic performance is a relevant contribution in higher education. Helal [36] predicted the academic performance of the student body; the data used in that work were divided into groups, and each subgroup of data was evaluated with different classification algorithms to predict academic performance. The results suggest that external students and female students performed well in the prediction.
The work of Bertolini [29] set out to examine different classification algorithms to predict final exam grades with reasonable accuracy, considering midterm grades. Similarly, Alyahyan [23] proposed the use of decision trees to predict students' academic performance and generate an early warning when low performance is detected. Different decision tree approaches as well as relevant feature extraction were employed to obtain a simpler model for decision making by academic experts. In line with this, refs. [29,34] also examined high-impact features in the data to fit representative variables with respect to college retention and dropout, to develop interventions to help improve student academic success.
Similarly, in Beaulac [39], the prediction of the academic success of university students has been studied by applying the random forest and decision tree algorithms, the latter being very intuitive for decision making; the authors propose the use of these techniques to know whether, at the end of the first two semesters, the student would achieve the university degree. Their results have indicated that there is a strong relationship between underperforming grades and the likelihood of succeeding in a degree program, although this did not necessarily indicate a causal connection.
Several of the related articles reveal the variety of work linked to improving the educational system. The approach of Guerrero-Higueras [7], which proposes the use of the Git version control system as an evaluation methodology to observe the frequency and use of the tool to help predict the student's academic success, stands out. The variables studied describe the student's ability with tasks related to the development of the computer science subject. This methodology differs from the rest given the adaptation of the Git version control platform and the issues specific to the computer science area.
The literature cited above emphasizes a gradual approach to obtaining features that achieve high accuracy in the algorithms and yield a simple and readable model. The lack of salient features prevents obtaining an effective prediction model. This is because of the ambiguity or irrelevance of the variables [40]. On the other hand, the reduction of outliers in the data due to duplicate observations or overlapping data is of significant importance [41][42][43]. It is understood, of course, that all of this leads to the application of each stage suggested in the CRISP-DM methodology [26], which allows a reliable model to be obtained at the end. The validity of the model obtained is checked by the performance metrics of the classification algorithms. Based on what has been presented in this section, it was observed in the literature that the work focuses mainly on two fronts: identifying significant attributes to predict student performance, success, or failure in higher education, and finding the best prediction method to improve the accuracy of the predictive model achieved.

Context
The higher education institution (HEI) is located in the Municipality of Quevedo, Province of Los Ríos, Ecuador, at coordinates 1°00′46″ S, 79°28′09″ W (−1.012778, −79.469167). According to the policies of the HEI and its minimum requirements, each university course is taught in face-to-face mode, and in addition, each academic year of the university course must be passed. In this case, each academic year consists of two academic cycles (semesters). Students must enroll in the university degree program and obtain grades in each subject, with a minimum grade of seven on a scale of zero to ten. As a result of the academic activities performed and their permanence in the university degree, the academic status of the student body is determined (dependent variable/class). Academic statuses are established in three categories. The first is "Passed", when the student has completed and passed all academic courses. The second is "Change", when the student passes courses other than those of the initial degree. The third is "Dropout", when the student leaves the university completely.

Data Collection
Data collection was performed using SQL Server scripts. The data were extracted from the university's information system database server. The dataset used in this work consisted of two parts, student body and faculty, which were subsequently merged. It should be noted that the criterion for the merger was the classes taught in the first year by the faculty in the teaching process for the university degree. Thus, the first part of the information referring to the students dealt with academic and socioeconomic data, while that relating to the teaching staff referred to degrees obtained, age, and academic experience, among others. Among the diversity of professors in charge of university teaching of first-year students, there were full, associate, and occasional professors, totaling 286 professors selected for this study.
On the other hand, the number of regular students was 6690. Although the number of professors and students does not coincide, it is necessary to clarify that a professor can teach different subjects. The students selected were those who were enrolled and had completed the first year of all university courses. In short, all of the above was framed within a retrospective of six complete academic years of each university degree, that is, ten calendar years. It should also be noted that any identifying reference to both faculty and students was eliminated to obtain an anonymous dataset. Among other things, the information extracted for this work had the endorsement and permission of the competent authority of the higher education institution detailed in the Institutional Review Board Statement section. The database with the raw data had 21 variables and 6690 records (see Appendix A, Table A1 for a description of the variables used).
So far, one of the main differences between machine learning (ML) algorithms and traditional statistical methods lies in their purpose, as the former is focused on the ability to capture complex relationships between features and make predictions as accurate as possible, while the latter, especially linear regression (LR), logistic regression (LOR), generalized mixed models, relevance-based prediction, and others, aim at inferring relationships between variables. However, the key difference between traditional statistical approaches and ML is that, in ML, a model learns from examples rather than being programmed with rules. For a given task, examples are provided in the form of inputs (called features or attributes) and outputs (called labels or classes) [44,45].
In this work, we used the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology proposed by [26], which comprises six phases: understanding the problem, understanding the data, data preparation, modeling, evaluation, and implementation. Data preparation, or data preprocessing, has gained importance and become a key stage. In other words, the objective is to reduce the complexity of the original dataset to obtain a readable predictive model with useful variables. Therefore, the work is based on the best practice for data preprocessing suggested in [46][47][48]. For this reason, Appendixes B and C detail the results of the various methods used for data preprocessing using feature filtering, instance selection, and class balancing. The main advantage of efficient data preprocessing was the transfer of suitable data to classification algorithms for simple and accurate learning. First, the compacted data were cleaned and transformed and then analyzed with visualization techniques that allowed, among other things, the location of trajectories, overlaps, and data behavior. Second, the data were stratified into two subsets: training and test. Then, the training set was filtered for relevant instances and features, and the data were balanced using different methods. The balanced dataset was used as input for the classification algorithms to obtain the predictive model. Finally, this model was evaluated on the test data with the metrics proposed in this work. Figure 1 shows the activities that were performed.

Metric Assessment
The metrics referred to in this section are used to evaluate the performance of the set of algorithms used to obtain predictive models. In Equation (4), the term α represents P(Tp) = Sensitivity, and (1 − β) represents P(Tn) = Specificity [49].
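For a multiclass problem, sensitivity and specificity can be computed per class by treating each class in turn as the positive one (one-vs-rest). The following Python sketch illustrates this under that assumption; the function name `per_class_rates` and the toy label vectors are hypothetical and do not reproduce the results of this work.

```python
def per_class_rates(y_true, y_pred, labels):
    """Return {label: (sensitivity, specificity)} using a one-vs-rest view."""
    pairs = list(zip(y_true, y_pred))
    rates = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        tn = sum(1 for t, p in pairs if t != label and p != label)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
        specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
        rates[label] = (sensitivity, specificity)
    return rates


# Toy predictions (hypothetical) over the three academic statuses.
y_true = ["Passed", "Passed", "Passed", "Passed",
          "Dropout", "Dropout", "Change", "Change"]
y_pred = ["Passed", "Passed", "Passed", "Dropout",
          "Dropout", "Passed", "Change", "Dropout"]
rates = per_class_rates(y_true, y_pred, ["Passed", "Dropout", "Change"])
```

Here `rates["Passed"]` is (0.75, 0.75): three of the four "Passed" students are recovered, and three of the four students in the other classes are correctly rejected.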

Data Exploration
The importance of data exploration is that it serves to understand the activity and behavior of the data. Visualization techniques were used to detect significant information in the data; specifically, variables were examined according to each category of the class using graphs (Figure 2).

Data Preprocessing
The purpose of data preprocessing is to synthesize the data and make them suitable for fast processing. This has an important consequence for classification algorithms, since the integrity of the data is gradually assessed by the hit rate, i.e., the number of true positives that the prediction algorithm can detect. Within this context, the aim is to obtain the set of features and instances that are close to a reasonable hit rate. Data preprocessing revolves around the different search strategies, such as sequential, random, and complete, that are proposed for this task. The evaluation criterion is set with filtering (distance, information, dependency, and consistency), hybrid, and wrapper methods [50][51][52][53][54].
The data preprocessing was divided into four phases. First, missing values in the data were replaced using the k-nearest neighbors algorithm KNN_MV [55]. Second, unrepresentative instances were excluded using the "NoiseFiltersR" package. Third, feature selection was studied with different algorithms and functions that evaluate feature quality. Finally, data balancing was applied to avoid bias in the prediction model due to the small amount of minority class data.

Missing Values
Data in their original form contain inconsistent data and often have missing values. That is, when the value of a variable is not stored, it is considered missing data. Multiple techniques have been developed to replace missing values. In general, statistical techniques of central tendency are usually used; for numerical values, the mean or median is used, while for nominal values, the "mode" is usually used. Another common technique is to remove the entire record from the dataset. Deletion can cause significant loss of information. These common techniques are easy to use and solve the problem of missing values, although, in data mining practice, there is a tendency to implement algorithms that solve this problem by examining the entire dataset. Specifically, in this work, we have used the "rfImpute" function, which replaced missing values by the nearest neighbor technique that takes the class (target variable) as reference.
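The central tendency techniques mentioned above can be sketched in a few lines of Python. This is only an illustration of mean, median, and mode replacement, not of the "rfImpute" function used in this work; the function name `impute_column` and the two example columns are hypothetical.

```python
from statistics import mean, median, mode


def impute_column(values, strategy):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical columns: a numerical grade and a nominal residence area.
grades = [7.5, None, 8.0, 9.0, None, 6.5]
area = ["urban", "rural", None, "urban", "urban"]

grades_filled = impute_column(grades, "mean")  # numerical: mean (or median)
area_filled = impute_column(area, "mode")      # nominal: most frequent value
```

Unlike record deletion, this keeps all six grade records and all five area records, at the cost of shrinking the variance of the imputed variable.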

In the graphs, both the arcs and the adjacency matrix were filtered with cut-off points obtained from the weighted mean of the nodes (Pass = 0.0007804694, Dropout = 0.0061971, Change = 0.01684287). The graphs had weights associated with each of the arcs, and this weight fixed their density. Three groups of subfigures were separated according to the target variable (pass, dropout, change). Subfigure (a) showed three subgroups of variables (8, 5, 5) in which a common variable overlaps. Subfigure (b) showed three subgroups of variables (8, 3, 8) with no overlap. Subfigure (c) showed four subgroups of variables (6, 7, 4, 2) overlapped by three common variables. Red lines indicate a lower degree of association, while black lines and their thickness indicate the strength of association.

Instance Selection
Instance selection was also key in the data preprocessing, since poor-quality examples were eliminated by using the NoiseFiltersR package [41], which filtered out the 5% of examples that were not within the data standard. In other words, when a value is at an unusual distance from the rest of the values in the dataset, it is considered an outlier or noise.
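The idea of discarding examples that lie at an unusual distance from the rest can be illustrated with a simple distance-based filter. The sketch below is a stand-in chosen for clarity and is not the filter implemented in NoiseFiltersR; the function name `drop_farthest` and the toy points are hypothetical.

```python
from math import dist


def drop_farthest(points, fraction=0.05):
    """Drop the given fraction of points that lie farthest from the centroid."""
    n_features = len(points[0])
    centroid = [sum(p[j] for p in points) / len(points) for j in range(n_features)]
    n_drop = int(len(points) * fraction)
    # Rank point indices by Euclidean distance to the centroid, nearest first.
    order = sorted(range(len(points)), key=lambda i: dist(points[i], centroid))
    keep = set(order[: len(points) - n_drop])
    # Preserve the original ordering of the retained points.
    return [p for i, p in enumerate(points) if i in keep]


# Toy data (hypothetical): 19 regular points and one obvious outlier.
points = [(i * 0.1, i * 0.1) for i in range(19)] + [(100.0, 100.0)]
kept = drop_farthest(points, fraction=0.05)
```

With 20 points and a 5% fraction, exactly one point is removed, and it is the outlier at (100, 100). The centroid itself is pulled toward outliers, so a median-based center may be a more robust choice in practice.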

Feature Selection
There is an important distinction to be made in this section since the generality and accuracy of the predictive model will depend on the quality of the variables. Therefore, it is crucial to decide which variables are relevant to include in the study. For this, we used nine feature selection algorithms: "LasVegas-LVF", "Relief" [56], "selectKBest", "hillClimbing", "sequentialBackward", "sequentialFloatingForward", "deepFirst", "geneticAlgorithm", and "antColony". In turn, the algorithms used distinct functions to evaluate the attributes. Among these functions were "mutualInformation" [57], "MDLC" [58], "determinationCoefficient" [59], "GainRatio" [60], "Gini Index" [61], and "roughsetConsistency" [62,63]. The group of algorithms used for the study of significant characteristics obtained subgroups of variables that have been evaluated and are shown in Table 2 and Appendix C, Table A3.
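To make the combination of a search algorithm and an evaluation function concrete, the following Python sketch ranks discrete features by their mutual information with the class and keeps the k best. It is a simplified illustration, not the R library used in this work; the names `mutual_information` and `select_k_best` and the three toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(feature, target):
    """Mutual information (in bits) between two discrete variables."""
    n = len(target)
    joint = Counter(zip(feature, target))
    px, py = Counter(feature), Counter(target)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def select_k_best(columns, target, k):
    """Return the names of the k columns with the highest score."""
    scores = {name: mutual_information(values, target)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical column names and values).
target = ["Passed", "Passed", "Dropout", "Dropout"]
columns = {
    "first_term_grade": ["high", "high", "low", "low"],  # fully informative
    "commute": ["bus", "car", "bus", "car"],             # uninformative
    "scholarship": ["yes", "yes", "yes", "no"],          # partially informative
}
selected = select_k_best(columns, target, k=2)
```

In this toy case the fully informative column scores 1 bit, the uninformative one scores 0, and the partially informative one falls in between, so the first and third columns are retained.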

Data Balancing
Sample balancing is another important step in data preprocessing. Currently, there are several techniques for data balancing or resampling available in Python 3.9 and its scikit-learn library [33]. In this work, the following techniques have been studied: oversampling, combined, undersampling, and ensemble. The first used the methods "Smote" [28] and "KMeansSMOTE" (k-means clustering followed by SMOTE oversampling within clusters) [64]. The second used both "Smote-ENN" (oversampling with SMOTE, followed by cleaning with edited nearest neighbors) and "Smote-Tomek" (oversampling with SMOTE, followed by removal of Tomek links) [65]. The third technique was undersampling with the "RUS" (random undersampling) method [66]. Finally, the ensemble technique used "EasyEnsemble" [67] and "Bagging". Specifically, new balanced training datasets were generated from the initial training set by applying the different techniques and methods (see Table 3).
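The core idea behind SMOTE-style oversampling is to create synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbors. The following Python sketch shows only this interpolation step under simplifying assumptions (numerical features, Euclidean distance); it is not the library implementation used in this work, and the name `smote_like` and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (hypothetical): four points with two numerical features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies on a segment joining two existing minority points, the new examples stay inside the region already occupied by the minority class. The combined methods then clean the result by removing examples that fall too close to other classes.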

Classification Algorithms
The use of supervised classification techniques aims to achieve a prediction model that is highly accurate. Hence, several algorithms have been created that use different mathematical models to achieve this. In this section, we detail the types of algorithms and provide a brief description of how each works.

•
Decision Trees: Consists of building a tree structure in which each branch represents a question about an attribute. New branches are created according to the answers to the question until reaching the leaves of the tree (where the structure ends). The leaf nodes indicate the predicted class; see [35].

•
Support Vector Machine (SVM): A relatively simple supervised machine learning algorithm used in regression or classification problems. It can be used for regression, although it is most often used for classification. Basically, SVM creates a hyperplane that separates the classes of data; in a two-dimensional space, this hyperplane is nothing more than a line. In SVM, each datum in the dataset is plotted in an N-dimensional space, where N is the number of features/attributes of the data; see [68].

•
Neural Network: Multilayer perceptrons (MLP) are the best known and most widely used type of neural network. They consist of neuron-like units with multiple inputs and an output. Each of these units forms a weighted sum of its inputs, to which a constant term is added. This sum is then passed through a nonlinearity, usually called an activation function. Most of the time, the units are interconnected in such a way that they form no loop; see [69].

•
Random Forest: A combination of tree predictors, where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The use of random feature selection to split each node produces error rates that compare favorably with "Adaboost" but are more robust with respect to noise. The internal estimates monitor error, strength, and correlation, and are used to show the response to increasing the number of features used in the split. Internal estimates are also used to measure the importance of variables; see [70].

•
Gradient Boosting Machine: Gradient boosting is a machine learning technique used to solve regression or classification problems, which builds a predictive model in the form of an ensemble of decision trees. It develops a general gradient descent "boosting" paradigm for additive expansions based on any fitting criteria. Gradient boosting of regression trees produces competitive, very robust, and interpretable regression and classification procedures, especially suitable for the extraction of not-so-clean data; see [71].

•
XGBoost: XGBoost is a distributed and optimized gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate manner; see [72].

•
Bagging: Predictor bagging is a method of generating multiple versions of a predictor and using them to obtain an aggregate predictor. Bagging averages the versions when predicting a numerical outcome and performs a plurality vote when predicting a class. Multiple versions are formed by making bootstrap replicas of the learning set and using them as new learning sets. Tests on real and simulated datasets show that bagging can provide a substantial increase in accuracy; see [73].

•
Naïve Bayes: A probabilistic machine learning model used for classification tasks. The core of the classifier is based on Bayes' theorem: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, which is the probability of A occurring, given that B has occurred. Here, B is the evidence, and A is the hypothesis. The assumption made here is that the predictors/features are independent. That is, the presence of a particular feature does not affect the others; see [74].
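As a worked illustration of the last item, the following Python sketch applies Bayes' theorem with the independence assumption to categorical features: the prior P(A) is multiplied by one conditional probability per feature and then normalized by the evidence P(B). It is a minimal sketch without smoothing, so an unseen feature value yields a zero probability; the function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(label, j)][value] += 1
    return class_counts, feature_counts


def predict_proba(row, class_counts, feature_counts):
    """Posterior P(class | row) under the feature independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(A)
        for j, value in enumerate(row):
            score *= feature_counts[(label, j)][value] / count  # P(B_j | A)
        scores[label] = score
    evidence = sum(scores.values())  # P(B)
    return {label: s / evidence for label, s in scores.items()}


# Toy records (hypothetical): (first-term grade level, scholarship) -> status.
rows = [("high", "yes"), ("high", "no"), ("low", "yes"),
        ("low", "no"), ("high", "no"), ("low", "no")]
labels = ["Passed", "Passed", "Passed", "Dropout", "Dropout", "Dropout"]

model = fit_naive_bayes(rows, labels)
posterior = predict_proba(("high", "no"), *model)
```

For the query ("high", "no"), the unnormalized scores are 1/9 for "Passed" and 1/6 for "Dropout", which normalize to posteriors of 0.4 and 0.6.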

Results
In response to the research questions posed, different data preprocessing algorithms have been employed to reduce the dimensionality of the dataset, so that the classification algorithms obtain a simple and accurate predictive model. In the following sections, we first study data preprocessing for feature selection. Second, we study data balancing using different data balancing algorithms and, finally, the results using the metrics calculated from the confusion matrix, with which the performance of the algorithms was evaluated. Prior to preprocessing, the dataset was separated into two parts: 75% of the total was selected for training data, and the other 25% for testing. The latter were used to evaluate the predictive model achieved by the classification algorithms, while the training set was subjected to preprocessing techniques to reduce dimensionality and obtain adequate data. In this sense, the work has focused on achieving simplicity and improving the accuracy of the predictive model, for which different feature selection methods and filters have been configured. Table 2 shows the results of the algorithm that obtained the fewest features; the rest of the runs of other algorithms and their results can be found in Appendix C.
In line with the studies of [15,28], which examined relevant features in the data to improve the predictive model, Table 2 presents the results for the pre-selected feature set, where each evaluative filter and method rated the variables according to the performance metric. Specifically, the Relief method together with the "bestk" evaluative filter achieved better efficiency, i.e., higher accuracy with fewer variables. Based on these results, a new dataset with the selected features was established and used as input data for the data balancing phase described in the next section.
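A 75%/25% partition that preserves the class proportions in both subsets can be sketched as follows. This is an illustrative Python version rather than the R code used in this work; the function name `stratified_split`, the seed, and the toy class counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train indices, test indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical counts) with a clear class imbalance.
labels = ["Passed"] * 80 + ["Dropout"] * 12 + ["Change"] * 8
train_idx, test_idx = stratified_split(labels)
```

With these counts, the test set receives 20 "Passed", 3 "Dropout", and 2 "Change" examples. Only the training indices would then go through instance selection, feature selection, and balancing, so the test set remains untouched for evaluation.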

Data Balancing
Data balancing is fundamental to classification algorithms, since the disparity of examples between one class and another can lead to bias in the prediction model. There are two common techniques for data balancing. The first is oversampling, in which the minority classes are augmented until they have the same number of examples as the majority class. The second is undersampling, in which the other classes are reduced to the same number of examples as the minority class. Both techniques, although not very efficient, are useful for obtaining preliminary results, since the redistribution of the data depends on the judgment and experience of the data analyst. To some extent, this personal judgment is avoided by using algorithms that perform data balancing. These algorithms augment, reduce, or equalize the examples depending on the technique applied. Table 3 shows the imbalance ratio obtained with each of the algorithms used. Each algorithm generated a new balanced dataset that was used to train the classification algorithms.
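The two basic techniques described above, together with the imbalance ratio (IR) between the majority and minority classes, can be sketched in a few lines of Python. This is plain random resampling under assumed class counts (60/10/30), not the "EasyEnsemble" algorithm or any other balancing algorithm evaluated in this study; the function names are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Size of the majority class divided by the size of the minority class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(rows, labels, seed=0):
    """Duplicate random examples until every class matches the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized like the minority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.sample(pool, target))
        out_labels.extend([cls] * target)
    return out_rows, out_labels


# Hypothetical class counts, for illustration only.
labels = ["Dropout"] * 60 + ["Change"] * 10 + ["Pass"] * 30
rows = list(range(len(labels)))  # stand-ins for student records

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(imbalance_ratio(labels), Counter(over_labels), Counter(under_labels))
```

In this toy setting the IR drops from 6.0 to 1.0 in both cases; oversampling grows the data to 180 rows, while undersampling shrinks it to 30.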

Classification Algorithms
In this section, we examine the effectiveness of the set of classification algorithms proposed for this work, which is framed as a multiclass problem, that is, a dependent variable (class) with three types of outputs: Dropout, Change and Pass. For this reason, and as is common in supervised classification problems, two datasets were used: the first for the algorithms to learn and obtain a prediction model, and the second to evaluate the effectiveness of the model obtained. Two types of analysis were carried out: the first with the original data (without data preprocessing) and the second with the different datasets generated by the preprocessing techniques used.
Data preprocessing is important because it provides classification algorithms with balanced and clean datasets. Obtaining the predictive model requires the algorithm to learn from the provided data (training set), and the effectiveness of the model depends on it. Therefore, for the algorithm to achieve adequate learning, k-fold cross-validation (CV) was applied; this approach randomly subdivided the training set into 10 folds of approximately equal size, and each fold, in turn, was split into two sections: training and test. At the end of training, the mean prediction across the folds was obtained. To check what the algorithms had learned, the metrics proposed in the methodology section were used, which helped to discriminate the most effective predictive models. While effectiveness is fundamental to evaluating the predictive model, the comprehensibility of the model obtained is also important, since the experts evaluate the simplicity of the model.
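A minimal sketch of the usual form of 10-fold cross-validation is given below: the indices are shuffled, dealt into 10 folds of nearly equal size, and each fold is held out once while the remaining nine are used for training. The function names, the sample size of 103, and the placeholder scoring function are assumptions made for illustration; the exact splitting procedure used in this study may differ.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Return k (training_indices, validation_indices) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((training, validation))
    return splits


def cross_validated_mean(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score over the k held-out folds."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = k_fold_splits(n_samples=103, k=10)

# Placeholder scorer: a real one would train a classifier on `tr` and
# evaluate it on `va`. Here it only reports the training fraction.
mean_score = cross_validated_mean(103, lambda tr, va: len(tr) / (len(tr) + len(va)))
print(len(splits), round(mean_score, 3))
```

Every sample is used for validation exactly once, which is what allows the fold scores to be averaged into a single estimate.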
Here, we present the best result of the classification algorithms, which was achieved using the dataset balanced by the "EasyEnsemble" algorithm, and the performance assessment of the classifiers using the ROC curve presented in Figure 3. The rest of the results, obtained with the datasets derived from the other data balancing algorithms, are presented in Appendix B, Table A2. Table 4 (raw data) and Table 5 (preprocessed data) show differences in the performance of the algorithms: negative values of −0.0214 and −0.0222 are evident for precision and AUC, respectively. This negative effect between raw data and preprocessed data is a consequence of preprocessing, so data preprocessing should be interpreted not as a contradictory process but as an improvement of the predictive model by using fewer variables from the original set. Therefore, the advantage of applying data preprocessing has been observed. It should be noted that the logloss was lower with the original data than with the preprocessed data. The increase with the latter was due to the smaller imbalance between classes. That is, the smaller the imbalance between classes, the greater the logloss, due to the smaller proportion of observations in the minority class. Table 3 shows the imbalance ratio of the original set and of the dataset preprocessed with "EasyEnsemble" (column IR: 7.18 and 1.775, respectively).
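For readers unfamiliar with the logloss metric mentioned above, the sketch below uses its standard multiclass definition: the mean negative logarithm of the probability that the model assigns to the true class. The function name and the predicted probabilities are hypothetical and are not outputs of the models in this study.

```python
from math import log


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= log(p)
    return total / len(true_labels)


# Hypothetical predictions for two students, for illustration only.
true_labels = ["Pass", "Dropout"]
confident = [
    {"Dropout": 0.10, "Change": 0.10, "Pass": 0.80},
    {"Dropout": 0.90, "Change": 0.05, "Pass": 0.05},
]
hesitant = [
    {"Dropout": 0.25, "Change": 0.25, "Pass": 0.50},
    {"Dropout": 0.25, "Change": 0.25, "Pass": 0.50},
]
loss_confident = multiclass_log_loss(true_labels, confident)
loss_hesitant = multiclass_log_loss(true_labels, hesitant)
print(loss_confident, loss_hesitant)
```

The metric rewards confident correct probabilities and penalizes low probabilities on the true class, so the second set of predictions scores worse even though its most probable class is right for one of the two students.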
Table 6 presents the confusion matrix of the best-scoring algorithm (XGBoost), which explains the values predicted for the test dataset by the prediction model obtained. First, the type II error or β type error was analyzed, where (a) the "Dropout" class had predicted values of 868 cases, of which 741 were correct and 127 were classified as "Pass"; (b) the "Change" class had 126 cases, of which 115 were correct and 11 were classified as "Pass"; (c) the "Pass" class had 679 predicted cases, of which 474 were correct, four were classified as "Change", and 201 were classified as "Dropout". Secondly, the type I error or α type error was analyzed, where (a) the class "Dropout" had 942 cases, of which 741 were correct and 201 were "Pass"; (b) the class "Change" had 119 cases, of which 115 were correct and four were classified as "Pass"; (c) the class "Pass" had 612 cases, of which 474 were correct, 11 were classified as "Change", and 127 were classified as "Dropout". Overall, a more efficient predictive model was obtained with the XGBoost classification algorithm. The authors of [75] highlight that the random forest algorithm obtained a better result in accuracy (ACC: 0.81) using only 10 features of the original dataset, pointing out the importance of improving academic performance and increasing the graduation rate of the students of the educational center. Consequently, it is necessary to consider that, as the accuracy of the model increases, its complexity needs to remain explainable as well. In this context, we looked for a way to apply a simple and readable method. The decision tree provides a simple rule-based model that improves comprehensibility. Although less efficient, the decision tree is very easy to interpret. Figure 4 shows the decision tree generated from the training data and Figure 5 shows the important variables.
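The counts quoted above can be arranged as a 3 × 3 matrix and checked with a short Python sketch. The cell values are transcribed from the text; the zero cells are inferred from the reported totals, and the row/column orientation follows the caption of Table 6 (actual values in rows, predicted values in columns), which is an assumption about how the text maps onto the table.

```python
LABELS = ["Dropout", "Change", "Pass"]

# Rows and columns follow LABELS; counts transcribed from the text above.
CONFUSION = [
    [741, 0, 127],  # "Dropout" row: 868 cases
    [0, 115, 11],   # "Change" row: 126 cases
    [201, 4, 474],  # "Pass" row: 679 cases
]

row_totals = [sum(row) for row in CONFUSION]
col_totals = [sum(col) for col in zip(*CONFUSION)]
total = sum(row_totals)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total

# Share of each row and of each column that lies on the diagonal.
per_class = {
    label: {
        "row_rate": CONFUSION[i][i] / row_totals[i],
        "column_rate": CONFUSION[i][i] / col_totals[i],
    }
    for i, label in enumerate(LABELS)
}

print(row_totals, col_totals, total, round(accuracy, 4))
for label, rates in per_class.items():
    print(label, {name: round(value, 3) for name, value in rates.items()})
```

The diagonal holds 1330 of the 1673 test cases, which agrees with the reported accuracy of roughly 79.5% and with the statement that about eight out of ten cases were classified correctly.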

Statistical Comparison of Several Classifiers
Formally, statistical significance is defined as a probability measure to assess experiments or studies. Ronald Fisher promoted the use of the null hypothesis [76], establishing a significance threshold of 0.05 (1/20) to determine the validity of the results obtained in empirical tests. This threshold limits the probability that the results are due to chance coincidences. In the work of Demšar [77], the statistical significance of different classification algorithms on real-world datasets was validated by different empirical tests. In this context, the nonparametric Friedman and Wilcoxon tests were used, which are suitable for this type of analysis because neither assumes a normal distribution of the data or homogeneity of variances, making them suitable for studies with real or unmanipulated data.

Figure 5. The importance of a variable is calculated by summing the decrease in error obtained when the data are split on that variable. Thus, the higher the value, the more the variable contributes to improving the model; the values are bounded between 0 and 1.
Prior to the calculation of the nonparametric tests, the results matrix of the group of algorithms and the datasets was organized, using the area under the curve (AUC, see Appendix D, Table A6) as the metric. The significance threshold was set at 0.05 for the Friedman and Wilcoxon tests. The Friedman test determines whether there are significant differences between more than two dependent groups. To perform the empirical tests, we used the null hypothesis H0: there are no significant differences between the groups of algorithms, and the alternative hypothesis Ha: there is at least one significant difference between the groups of algorithms. The Friedman test yielded a chi-square ($\chi^2$) of 52.305 with 8 degrees of freedom and a p-value of $1.47 \times 10^{-8}$ (see Appendix D, Table A4). Since the p-value was below the threshold, the null hypothesis was rejected and the alternative hypothesis was accepted, confirming the existence of significant differences.
Because the above analysis established that there were significant differences, each pair of algorithms was then compared using the Wilcoxon test as a post hoc test for the Friedman test; the p-values obtained are presented in Table 7.
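The reported p-value can be reproduced from the reported statistic with a short, dependency-free Python sketch. It computes the Friedman chi-square from a matrix of scores (rows are datasets, columns are algorithms; no tie correction is applied) and evaluates the chi-square survival function with the closed form that holds for even degrees of freedom. The function names and the toy score matrix are assumptions for illustration; only the values 52.305 and 8 come from the text.

```python
from math import exp, factorial


def average_ranks(values):
    """Ranks from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(score_matrix):
    """Friedman chi-square for n datasets (rows) and k algorithms (columns)."""
    n, k = len(score_matrix), len(score_matrix[0])
    rank_sums = [0.0] * k
    for row in score_matrix:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


# Statistic and degrees of freedom as reported in the text.
p_reported = chi2_sf_even_df(52.305, 8)
print(f"{p_reported:.3e}")

# Toy AUC matrix (hypothetical): 4 datasets, 3 algorithms, identical ordering.
toy_scores = [[0.70, 0.80, 0.90]] * 4
toy_stat = friedman_statistic(toy_scores)
toy_p = chi2_sf_even_df(toy_stat, df=2)
print(toy_stat, toy_p)
```

The first call gives a value of about 1.47 × 10⁻⁸, in agreement with the reported p-value. Since the Friedman statistic has k − 1 degrees of freedom, the reported 8 degrees of freedom correspond to nine compared algorithms.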

Discussion
This paper explores and discusses three research questions related to the machine learning techniques applied to achieve a predictive model with greater accuracy and readability, in addition to the study of factors that lead to the academic success of university students when they finish the first year. The answers to the questions posed are detailed below.
RQ1: Which balancing and feature selection technique is relevant for supervised classification algorithms? In general, as the number of variables increases, the accuracy of the model increases, and so does its complexity: the classification algorithms improve performance, but the readability of the model decreases. In contrast, Alwarthan [24] applied recursive feature elimination (RFE) with the Pearson correlation coefficient, RFE with mutual information, and GA to find relevant features, in addition to class balancing using SMOTE-TomekLink, to build the final prediction model. The relevant variables were related to English courses and GPA, as well as students' social variables. Alwarthan [24] used 68 features and achieved 93% accuracy with the initial results, while feature filtering detected 44 relevant variables with 90% accuracy. In addition, they analyzed eight relevant characteristics that achieved 77% accuracy; these variables were directly related to the academic performance of the student body.
In [6], the filtering of characteristics using the Gini index was proposed, from which seven characteristics were selected, achieving 79% accuracy using the random forest algorithm. These results were very similar to ours, but far from being explainable, due to the bias derived from the imbalance of the data. In the proposal made in this study, different data processing techniques were used to obtain an expeditious dataset. On the one hand, the instance filtering method was considered to reduce duplicate or noisy observations by 5%. On the other hand, for feature group filtering, six methods were used and five filters were applied, with which an accuracy between 58% and 78% was achieved. When applying the "ReliefF" method, 10 features were obtained with an accuracy of 79% (algorithm C4.5). In contrast, in the literature presented, the analyzed datasets had accuracy values below 84% and 32 features on average. The difference with respect to what is proposed in this work is greater than 5% in accuracy, which is initially attractive. However, the handling of 22 additional features generates a robust but poorly explainable model for decision support.
Consequently, data balancing as part of data preprocessing was crucial to achieve a robust predictive model. The literature reviewed generally places data balancing as a step prior to feature filtering. The approach taken here is to obtain a filtered dataset (instances and features) first and then apply data balancing. The best classification accuracies achieved with the data balancing methods ranged between 73% and 79%. The "EasyEnsemble" method obtained the best accuracy, AUC and logloss. The latter was far from that of the original data, as the imbalance rate of the original data was high. For example, the proportions of the undergraduate academic statuses (dropout, change and pass) in the original data (IR 7.35) were 57%, 7% and 36%, while in the balanced data (IR 1.75) they were 23%, 40% and 37% with synthetic observations. The accuracy of the XGBoost model with balanced data was approximately 80%. In summary, the proposed data preprocessing made the dataset unbiased and the predictive model simple and explainable.
RQ2: Which predictive model best discriminates students' academic success? Currently, several supervised algorithms are used in higher education to make predictions in different educational contexts. Specifically, the best discrimination was achieved by the XGBoost algorithm. This criterion was based first on the values obtained with the predictive model, where the accuracy was 79.49% and the AUC was 87.75%. Sensitivity was 84.25%, which indicates the rate of positive examples that the algorithm was able to classify correctly, while specificity was 87.53% for negative examples. Next, the logloss metric, which measures the cost of the predicted class probabilities, was 0.3736 with the original dataset and an imbalance rate of 7.18. However, the logloss value rose to 6.34 with the preprocessed dataset and an imbalance rate of 1.775, i.e., a lower cost and a higher data imbalance rate were inversely related to the performance of the predictive model. Although the predictive model obtained using XGBoost is poorly explainable due to its high complexity, it performed better when classifying examples from the test set. Explainability was obtained by applying the decision tree to the training set, which yielded a rule-based (If, Then) predictive model that is readable for decision makers.
Similarly, [6,16–19,24,75] converge in their predictions on higher education data using classifiers such as Random Forest (RF), SVM, Neural Networks and decision trees. Likewise, linear regression or logistic regression was used to obtain predictive models that detect failure, success, or academic performance early enough [1,81], and semi-supervised learning was used to obtain patterns in students who managed to pass the courses for a university degree [22]. Since the main objective is to achieve attractive and reliable accuracies, accuracy undoubtedly goes hand in hand with the quantity and quality of the data. For example, Gil [38] obtained accuracy rates with random forest of 77%, 91% and 94% with 30, 44 and 68 features, respectively, which evidenced the positive correlation between the number of features and accuracy. That said, in our results, accuracies very close to 80% were achieved with only 10 features and a completely readable model (10 rules).
RQ3: Which factors are determinants of students' academic success? As part of the development of this study, variables that play a significant role in the academic success of students were found. Specifically, the variables ChangeDegree, RateApproval, Average, and Degree were determinants for the prediction model obtained. These findings are close to the results obtained by Alturki [34], where individual results from the third and fourth semesters were examined, with accuracies of 63.33% (six variables) and 92.6% (nine variables), respectively. The influential variables were grade point average, number of credits taken and academic assessment performance, applying the selection of characteristics for each academic semester. Similarly, Alyahyan [23] identified variables related to GPA and key subjects that detect student performance early enough. Beaulac [39] identified variables associated with undergraduate degree completion as a first group of variables, whereas the second group was related to the type of major. In summary, first-year students opt for computer- and English-related subjects to reach their academic achievement, i.e., characteristics related to academic performance.
Specifically, data preprocessing provides as input an expedited dataset for classification algorithms to achieve an adequate predictive model. The results in the reviewed literature resemble ours, and they can be improved by inducing endogenous or exogenous variables so that the model achieves more optimal results; the results can also be improved by over-fitting parameters in the algorithms. It is also worth mentioning that, for example, Ismanto [82] obtained an RF prediction model with an accuracy higher than 90% without preprocessing the data, which resulted in a complex predictive model that is difficult to explain. Therefore, even if the model obtains the highest accuracy, the prediction bias can also be extended if the parameters are over-fitted or the data preprocessing phase is omitted.
Kaushik [83] defined feature selection as increasing the quality of the data to facilitate better results, in accordance with their proposed set of techniques for feature selection in educational data. The approach applied in this paper fits Kaushik's perspective.
It is important to make predictions early enough, and with features of good general quality, to take effective countermeasures, providing timely warnings that help students achieve academic success. In this way, the percentage of underachieving students can be reduced, and the college can provide them with appropriate counseling and intervention.
The results provide conclusive support for the anticipation of college completion [84–86], which is essential to assist students in the learning process and ensure their academic success. Predictions made early enough by machine learning can reveal possible difficulties or improvements from students' historical data, but their effective use requires building specific strategies [84]. Consequently, the knowledge obtained from the data is leveraged, for example, in constant monitoring or continuous tracking that acts as a tool to assess progress in academic performance, class attendance, extracurricular activities and other key indicators [87]. Other strategies include personalized tutorial support or intervention plans, remediation and other resources for students who have demonstrated compelling needs [88,89]. Machine learning, along with other data analysis techniques, offers valuable suggestions for targeted interventions for the benefit of students, with the goal of helping them achieve academic success in the shortest possible time. The results presented support the authenticity of the analyses performed, as the information is not based on mere coincidences, but on real data. In this context, significance tests were performed using statistical methods such as the nonparametric Friedman and Wilcoxon tests, which are widely recognized for comparing the performance of machine learning algorithms [77,90,91]. Although these tests are not recommended for a comprehensive study, due to the need to conform to other assumptions, some authors have deepened their analysis and proposed alternatives to the tests [92,93]. In summary, significance tests are essential for a solid and objective interpretation of the results obtained.

Conclusions
In response to the research questions, the effectiveness of the prediction model lies in good practice in the data preprocessing phase. Hence, obtaining an expeditious dataset is crucial. Unlike the methodologies reviewed in the literature, our applied methodology avoided bias in the accuracy rates of the predictive model, as well as in the academic status (class). In fact, both the robust predictive model achieved by means of XGBoost and the simplified decision tree model proved to be effective. The simplified predictive model was able to detect students with high potential for academic success in seven out of ten cases, while the robust model detected them in eight out of ten cases. The simplification and explainability of the model were based on a set of rules obtained from the decision tree, which makes them understandable and allows them to be provided to academic experts as suggestions for decision making. Overall, this study provides valuable information on the factors underlying college students' academic success expectations and highlights the importance of effective data preprocessing and model simplification techniques for making accurate, meaningful, and understandable predictions about college students' academic success.

Limitations
The main limitation of this work was the absence of variables such as gender, scholarships, and financial aid, which would allow consistent measurements in the classification algorithms and are needed to evaluate equity and discrimination in the decisions made by the algorithms when building the predictive model.

Future Work
Looking ahead, we intend to explore how the knowledge extracted in this work and the university practices applied with this knowledge can influence classroom management, with the aim of improving students' academic outcomes and reducing the disparity in educational opportunities.To this end, we propose studies related to (i) examining how the personalization of predictive models can be adapted to the phenotype (charac-

Data 2024, 9, 28

Figure 1. Diagram of activities performed. The processes conducted are described in four stages.

Figure 2. Undirected graph calculated from the correlation matrix (Pearson's method). Both the arcs and the adjacency matrix were filtered with cut-off points obtained from the weighted mean of the nodes (Pass = 0.0007804694, Dropout = 0.0061971, Change = 0.01684287). The graphs had weights associated with each of the arcs, and this weight fixed their density. Three groups of subfigures were separated according to the target variable (pass, dropout, change). Subfigure (a) showed three subgroups of variables (8, 5, 5) where a common variable overlaps. Subfigure (b) showed three subgroups of variables (8, 3, 8); this subfigure lacks overlap. Subfigure (c) showed four subgroups of variables (6, 7, 4, 2) overlapped by three common variables. Red lines indicate a lower degree of association, while black lines and their thickness indicate the strength of association.

Figure 3. Performance of the group of algorithms, plotted as ROC curves with the area under the curve (AUC). The ordinate axis shows the true positive rate, and the abscissa axis shows the false positive rate. The classifier lines above the diagonal (dashed line) represent good classification results (better than random), while those below represent bad results (worse than random). The best performance in classifying the test data examples was obtained by the XGBoost algorithm; two algorithms had an AUC above 0.87, and the rest performed below 0.86. This performance indicates the effectiveness of the predictive model against the test set.

Figure 4. The decision tree drawn is based on the rules obtained. The nodes represent the class. The three decimal values within the node represent the probability of each class with respect to the evaluation of the rule. In turn, the total percentage of cases for the rule (cover) is shown. Below the node, the condition of the rule is displayed.

Table 1. Summary of papers related to the prediction of academic performance or success of university students.

Table 2. Feature filtering by the "Relief" algorithm using different k and bestk filters. The lowest feature selection and the highest accuracy achieved by the C4.5 classification algorithm were established with the "bestk" filtering (10 variables).

Table 3. Distribution of data per class using different data balancing techniques, along with the corresponding imbalance ratio (IR) between the majority and minority classes. A higher IR indicates a more severe class imbalance problem.

Table 4. Preliminary results for the original dataset, omitting data preprocessing.

Table 5. Evaluation results of the predictive models obtained by the classification algorithms. The training set was balanced with the "EasyEnsemble" technique. Model validation was performed on the test dataset. The data were sorted according to the AUC column.

Table 6. Confusion matrix of the XGBoost algorithm. Here, the actual values (rows) are shown versus the values predicted by the classifier (columns).

Table A1. Cont.
This variable is related to the usufruct of the housing where the student and his family live.
Weighting of the effort in the exams to pass the subjects; the first exam (recovery) has a value of 0.25, while the second one has a value of 0.75.