# Supporting Decision-Making Process on Higher Education Dropout by Analyzing Academic, Socioeconomic, and Equity Factors through Machine Learning and Survival Analysis Methods in the Latin American Context

## Abstract


## 1. Introduction

## 2. Formulation of the SDP Problem

- ${\overrightarrow{X}}_{i}$ is the $n$-dimensional vector of attributes of student $i$.
- ${Y}_{i}$ is the output variable.
- ${E}_{i}$ is the event variable, with ${E}_{i}=1$ if the dropout happens and ${E}_{i}=0$ otherwise.
- ${T}_{i}$ is the time variable, i.e., the permanence time at the university.

#### 2.1. SDP Problem as a Classification Model

#### 2.2. SDP Problem as a Survival Analysis Model

- (a) The dropout occurred, and we can measure when it occurred (${T}_{i}$).
- (b) The dropout did not occur while we observed the student; we only know the number of semesters in which it did not occur (${C}_{i}$), named the censoring time.
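These two cases define right-censored data: each student contributes an observed time $\min({T}_{i},{C}_{i})$ and the event flag ${E}_{i}$. A minimal sketch of this censoring logic, together with a Kaplan-Meier survival estimate, is shown below (the student data are hypothetical; this is not the paper's implementation):

```python
import numpy as np

def observed(dropout_time, censor_time):
    """Right-censored observation: T = min(T_i, C_i), E = 1 iff dropout was seen."""
    t = np.minimum(dropout_time, censor_time)
    e = (dropout_time <= censor_time).astype(int)
    return t, e

def kaplan_meier(t, e):
    """Kaplan-Meier survival estimate S(t) at each distinct event time."""
    order = np.argsort(t)
    t, e = t[order], e[order]
    times, surv, s = [], [], 1.0
    for u in np.unique(t[e == 1]):
        at_risk = np.sum(t >= u)               # students still enrolled at time u
        events = np.sum((t == u) & (e == 1))   # dropouts observed at time u
        s *= 1.0 - events / at_risk
        times.append(u)
        surv.append(s)
    return np.array(times), np.array(surv)

# four hypothetical students: true dropout semesters and a 6-semester window
T = np.array([3, 8, 5, 10])
C = np.array([6, 6, 6, 6])
t, e = observed(T, C)          # t = [3, 6, 5, 6], e = [1, 0, 1, 0]
times, S = kaplan_meier(t, e)  # S drops at semesters 3 and 5
```

Students 2 and 4 are censored: we only know they were still enrolled at semester 6.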

## 3. Related Work

#### 3.1. Machine Learning Algorithms

#### 3.2. Survival Analysis Methods

#### 3.3. Deep Learning Methods

## 4. Research Questions

- RQ1: How do we understand the impact of academic, socioeconomic, and equity variables on the SDP problem?
- RQ2: What is the most-efficient classification algorithm for the SDP problem?
- RQ3: What is the most-efficient survival analysis method for the SDP problem?
- RQ4: How influential is academic performance in estimating dropout risk?

## 5. Materials and Methods

#### 5.1. Population and Sample

Each person's identification number (`ID`) is unique; however, the `ID` is different from the masked Student Identity (`SID`), which is not necessarily unique for each person. A person has a unique `ID` but can have one or more `SIDs`. This situation occurs when a student withdraws from the university and rejoins. In practice, however, we limited our dataset to one `SID` per `ID`; this means that we used only one observation for each person but stored whether she/he was previously enrolled in any program at this university.

#### 5.2. Data Preprocessing

We created the attribute `Changed_SID`, which identifies whether or not the person changed his/her `SID`. Similarly, some demographic attributes were transformed into categorical variables, such as `Female`, `Married`, `Public`, and `Scholarship`. Numerical variables were also defined, such as the student's age when she/he was admitted to the university (`Age_Admission`). Linking students' locations of provenance and residence with the value of the 2019 Human Development Index (HDI), we defined the socioeconomic variables labeled `HDI_Provenance` and `HDI_Residence`, respectively. Our dataset records the name of the admission semester, which can be regular or not. For example, 2001-01 and 2001-02 are regular semesters, and other configurations are non-regular semesters; these are offerings, such as summer courses, that satisfy the required number of hours and credits.

Among the academic variables, we computed the final grade point average (`Final_GPA`). We used the number of enrolled semesters, the number of hours of absence, the number of approved courses, and the total number of courses to define proportional variables between them. For example, `Courses_Sem` represents the proportion of enrolled courses with respect to the number of enrolled semesters; `Absences_Courses`, `Approved_Courses`, and `NonReg_Courses` were computed analogously. We processed the student status and assumed that students drop out when their status is separated, retired, or transferred; otherwise, the student did not drop out. Consequently, we defined an event variable labeled `Dropout`. Finally, we employed the number of completed semesters as the time variable, labeled `Completed_Sem`. Data cleaning and filtering were performed using Pandas and NumPy. All attributes employed in this work are summarized in Table 3.
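The derivation of the proportional variables and the event variable can be sketched with Pandas as follows. The raw column names and records are hypothetical; only the derived attribute names and the dropout rule come from the text above:

```python
import pandas as pd

# hypothetical raw records; the real dataset's raw column names may differ
raw = pd.DataFrame({
    "SID": [101, 102],
    "enrolled_semesters": [8, 4],
    "enrolled_courses": [40, 18],
    "approved_courses": [36, 9],
    "absence_hours": [30, 55],
    "course_hours": [600, 300],
    "nonregular_courses": [2, 0],
    "status": ["graduated", "retired"],
})

df = pd.DataFrame({"SID": raw["SID"]})
# proportional variables, as defined in Section 5.2 / Table 3
df["Courses_Sem"] = raw["enrolled_courses"] / raw["enrolled_semesters"]
df["Approved_Courses"] = raw["approved_courses"] / raw["enrolled_courses"]
df["Absences_Courses"] = raw["absence_hours"] / raw["course_hours"]
df["NonReg_Courses"] = raw["nonregular_courses"] / raw["enrolled_courses"]
# event variable: dropout iff status is separated, retired, or transferred
df["Dropout"] = raw["status"].isin(["separated", "retired", "transferred"]).astype(int)
# time variable
df["Completed_Sem"] = raw["enrolled_semesters"]
```

Using ratios rather than raw counts keeps attributes comparable across students with different permanence times, which matters later when classification and survival methods are evaluated jointly.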

#### 5.3. Data Exploration

From Table 4, Edu has the highest percentage of students from public schools (`Public`) and even stands out for having the highest rate of students according to the variable `Scholarship`. This is because, in Latin America, low-income people usually study in public schools. The percentage of married students (`Married`) is low in all departments; we notice the highest value in Psy, with 1.8%. Furthermore, the highest values of `Changed_SID` occur in Eng, and the opposite happens in Edu.

From Table 5, the distribution of `HDI_Provenance`, `HDI_Residence`, and `Absences_Courses` is similar for each academic department. However, in Edu, we found that the mean of `Final_GPA` is higher than the mean value of the other departments. In contrast, this does not happen in STEM areas such as CS and Eng, which makes us think that STEM careers tend to be more demanding. A similar pattern occurs with other academic variables, such as `Approved_Courses` and `Courses_Sem`.

#### 5.4. Computational Techniques and Evaluation Metrics

- Reserving a subset of the data.
- Using the rest of the dataset to train the model.
- Testing the model using the reserved subset of data.
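The three holdout steps above can be sketched as follows (a minimal sketch; the paper does not specify its exact splitting code, and the fraction reserved for testing is an assumption):

```python
import numpy as np

def holdout_split(n_samples, test_frac=0.2, seed=0):
    """Reserve a random test subset; the remaining indices form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle all sample indices
    n_test = int(n_samples * test_frac)       # size of the reserved subset
    return idx[n_test:], idx[:n_test]         # (train indices, test indices)

train_idx, test_idx = holdout_split(100, test_frac=0.2)
```

The model is then fit on `train_idx` rows only and evaluated on the reserved `test_idx` rows, so the reported metrics reflect generalization rather than memorization.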

## 6. Results

#### 6.1. RQ1: How Do We Understand the Impact of Academic, Socioeconomic, and Equity Variables on the SDP Problem?

Figure 2 shows that `Completed_Sem` has a strong negative correlation with `Dropout` in all cases. Furthermore, `Dropout` has a strong negative correlation with `Final_GPA` and `Approved_Courses`, whereas `Dropout` and `Absences_Courses` have a moderate positive correlation. In some departments, there is also a noticeable correlation between `NonReg_Courses` and `Dropout`; however, this correlation is almost null in the rest of the departments. In summary, we concluded from Figure 2 that the academic variables (`Final_GPA` and `Approved_Courses`) present the highest correlation with `Dropout`. In contrast, the socioeconomic (`HDI_Provenance` and `HDI_Residence`) and equity (`Female`) variables show only a weak correlation in all cases; we concluded that these factors do not significantly influence the prediction of the dropout status.
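Per-department correlation matrices of the kind shown in Figure 2 are plain pairwise Pearson correlations. A minimal sketch on toy synthetic data (the data-generating process below is invented purely to mimic the reported pattern, not taken from the paper):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# toy data mimicking the reported pattern: higher GPA and approved-course
# proportion lower dropout risk; higher absences raise it
final_gpa = rng.normal(12, 3, n)
approved = np.clip(rng.normal(0.65, 0.2, n), 0, 1)
absences = np.clip(rng.normal(0.13, 0.1, n), 0, 1)
score = -0.8 * (final_gpa - 12) / 3 - 0.6 * approved + 0.4 * absences
dropout = (score + rng.normal(0, 0.3, n) > 0).astype(int)

df = pd.DataFrame({"Final_GPA": final_gpa, "Approved_Courses": approved,
                   "Absences_Courses": absences, "Dropout": dropout})
corr = df.corr()  # Pearson correlation matrix, one per department in Figure 2
```

Rendering one such matrix per department as a diverging-color heat map reproduces the Figure-2-style view.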

#### 6.2. RQ2: What Is the Most-Efficient Classification Machine Learning Method for the SDP Problem?

- Logistic Regression (LR) considers `C` = 0.1.
- Support Vector Machine (SVM) considers `C` = 10 and `gamma` = 0.01.
- Gaussian Naive Bayes (GNB) considers a variance smoothing equal to 0.001.
- K-Nearest Neighbor (KNN) considers seven neighbors.
- Decision Tree (DT) considers a minimum number of samples required to be at a leaf node equal to fifty and a maximum tree depth equal to nine.
- Random Forest (RF) considers a minimum number of samples required to be at a leaf node equal to fifty, a maximum tree depth equal to nine, and no bootstrap.
- Multilayer Perceptron (MLP) considers three layers in the sequence (13, 8, 4, 1), an activation function defined by tanh, and $\alpha = 0.001$.
- Convolutional Neural Network (CNN) considers two layers in the sequence (13, 6, 1) and the activation functions ReLU and sigmoid.
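The configurations listed above map naturally onto scikit-learn estimators. The sketch below assumes scikit-learn (the paper does not name its implementation library), omits the CNN (which would require a deep learning framework), and uses (8, 4) as the MLP's hidden layers since 13 and 1 are the input and output dimensions; `max_iter` is an added convergence detail, not a reported hyperparameter:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# hyperparameters as reported in the list above
models = {
    "LR": LogisticRegression(C=0.1, max_iter=1000),
    "SVM": SVC(C=10, gamma=0.01),
    "GNB": GaussianNB(var_smoothing=0.001),
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "DT": DecisionTreeClassifier(min_samples_leaf=50, max_depth=9),
    "RF": RandomForestClassifier(min_samples_leaf=50, max_depth=9, bootstrap=False),
    "MLP": MLPClassifier(hidden_layer_sizes=(8, 4), activation="tanh", alpha=0.001),
}
```

Each estimator exposes the same `fit`/`predict` interface, so all seven can be trained and scored in one loop over the holdout split.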

#### 6.3. RQ3: What Is the Most-Efficient Survival Analysis Method for the SDP Problem?

- The parametric methods (Weibull and Gompertz) consider a learning rate equal to 0.01, an L2 regularization parameter equal to 0.001, the initialization method given by zeros, and the number of epochs equal to 2000.
- The Cox Proportional Hazards model (CPH) considers a learning rate equal to 0.5 and an L2 regularization parameter equal to 0.01. The significance level is $\alpha =0.95$, and the initialization method is given by zeros.
- Random Survival Forest (RSF) considers two-hundred trees, a maximum depth equal to twenty, a minimum number of samples required to be at a leaf node equal to ten, and a percentage of original samples used in each tree building equal to 0.85.
- Conditional Survival Forest (CSF) considers two-hundred trees, a maximum depth equal to five, a minimum number of samples required to be at a leaf node equal to twenty, a percentage of original samples used in each tree building equal to 0.65, and a lower quantile of the covariate distribution for splitting equal to 0.1.
- Multi-Task Logistic Regression (MTLR) considers twenty bins, a learning rate equal to 0.001, and the initialization method given by tensors with an orthogonal matrix.
- Neural Multi-Task Logistic Regression (N-MTLR) considers three layers with the activation functions defined by ReLU, tanh, and sigmoid. Furthermore, N-MTLR uses 120 bins, an L2 smoothing equal to 0.001, five-hundred epochs, and an initialization method given by tensors with an orthogonal matrix.
- Nonlinear Cox regression (DeepSurv) considers three layers with the activation functions defined by ReLU, tanh, and sigmoid. Furthermore, DeepSurv employs a learning rate equal to 0.001, and Xavier's uniform initializer is the initialization method.
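These methods are compared in Table 7 using, among other metrics, the concordance index (C-index): the fraction of comparable student pairs whose ordering by predicted risk matches their ordering by permanence time. A self-contained sketch with hypothetical data (real evaluations typically rely on a library implementation):

```python
import itertools

def concordance_index(times, events, risk_scores):
    """C-index for right-censored data: a pair (i, j) is comparable only if
    the earlier time is an observed event; it is concordant when the student
    who dropped out earlier was assigned the higher risk score."""
    concordant, comparable = 0.0, 0
    for i, j in itertools.combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i                       # ensure i has the earlier time
        if times[i] == times[j] or events[i] == 0:
            continue                          # tied times or censored earlier time
        comparable += 1
        if risk_scores[i] > risk_scores[j]:
            concordant += 1.0                 # correct ordering
        elif risk_scores[i] == risk_scores[j]:
            concordant += 0.5                 # ties in risk count half
    return concordant / comparable if comparable else float("nan")

# four hypothetical students: permanence times, event flags, predicted risks
t = [2, 4, 6, 8]
e = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.2]  # perfectly anti-ordered with time -> C-index 1.0
```

A C-index of 0.5 corresponds to random ordering, so values around 0.9, as in Table 7, indicate strong risk ranking.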

#### 6.4. RQ4: How Influential Is Academic Performance in Estimating Dropout Risk?

In Table 8, we observe a low impact of the variables `Female`, `Married`, `Public`, `Age_Admission`, `HDI_Provenance`, and `HDI_Residence`. We found a moderate impact for the variable `Changed_SID`; however, its percentage of importance depends on the department. For example, in Edu, the importance percentage of `Changed_SID` is 4.65%; in contrast, in EBS, its importance level is 17.87%.

Among the academic variables, `Approved_Courses` and `Final_GPA` are the most influential. In most cases, `Approved_Courses` has the highest percentage of importance, and it is only lower than `Final_GPA` when we analyze Edu. These results corroborate the strong negative correlation of these variables with `Dropout`, as illustrated in Figure 2. Although this confirms a strong and meaningful impact of the academic variables, we do not know to what extent they influence the different departments.

Figure 4 presents, for each department, the relation between the proportion of approved courses (`Approved_Courses`) and the logarithm of the risk score, denoted by `Log_Risk`. For each subfigure, we define the x-axis as `Approved_Courses` and the y-axis as `Log_Risk` and color the data points according to the dropout status (`Dropout`): a student who has dropped out is highlighted in black, while a student who has not dropped out is in pink.

In all departments, we observe an inverse relation between `Approved_Courses` and `Log_Risk`. We note a particular case in Edu, in which all students with a proportion of approved courses less than 0.6 ($\mathtt{Approved}\_\mathtt{Courses}<0.6$) are dropouts. However, this situation did not occur in the other departments. With this brief analysis, we found indications that the impact of `Approved_Courses` is more influential in Edu compared to the other departments. Furthermore, each department's predicted values of `Log_Risk` differ considerably. In Edu, we found on the y-axis that the range of values assumed by `Log_Risk` goes from −10 to 4; however, this does not happen in the other departments, which generally range between −8 and 2.

In STEM programs such as CS and Eng, by contrast, many students who did not drop out nevertheless have a low proportion of approved courses (`Approved_Courses` < 0.6). Generally, these programs are challenging due to their predominant curricula based on exact sciences in the first semesters. Moreover, there is a tendency to normalize the effect of failing some courses. Complementing our analysis with the values of `NonReg_Courses` from Table 5 and Table 8, we deduced that many students in STEM programs take courses in non-regular semesters to recover the failed courses. This is usually considered a characteristic of the persistence of these students.

This behavior is also reflected in the predicted values of `Log_Risk`. In this context, we can complement the persistence of these students with the variable `Changed_SID`: although it does not have a prominent percentage of presence, as described in Table 4, the importance of this variable in the model is one of the most relevant, as revealed in Table 8.

Considering the importance percentages of `Final_GPA` computed in Table 8 for Edu and Psy, we note that these values exceed 24%, which are the highest values in our dataset. Unlike measuring the influence of academic performance from the perspective of approved courses, in Edu and Psy, we found that the grades are decisive, which led us to think that students in these programs generally have higher GPAs than those in other careers. School scholarships are widely granted in Edu, where more than 12% of our sample has a scholarship, and scholarship students generally seek to maintain high grades to avoid losing this study funding. On the other hand, in Edu and Psy, the hours of absence show high importance; that is, the impact of being absent from courses (`Absences_Courses`) in these careers is very relevant compared with the other departments.

## 7. Discussion

The academic variables `Final_GPA` and `Approved_Courses` stand out with a strong negative correlation with the event variable given by `Dropout`, as we can see in Figure 2. Similarly, we noticed a robust negative correlation between `Dropout` and `Completed_Sem` (see Figure 2). Therefore, based on these values, it is evident that the academic and temporal variables have a predominant role in predicting student dropout, as various works in the literature have concluded [5,11,23,40].

We used proportional variables (e.g., `Approved_Courses`) instead of solely the total number of approved courses. Using this approach, we aimed to mitigate the temporal influence of an attribute. This preliminary step is decisive for evaluating the classification and survival methods jointly.

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Bernardo, A.; Esteban, M.; Fernández, E.; Cervero, A.; Tuero, E.; Solano, P. Comparison of Personal, Social and Academic Variables Related to University Drop-out and Persistence. Front. Psychol. **2016**, 7, 1610.
2. Tinto, V. Dropout from Higher Education: A Theoretical Synthesis of Recent Research. Rev. Educ. Res. **1975**, 45, 89–125.
3. Nicoletti, M. Revisiting the Tinto's Theoretical Dropout Model. High. Educ. Stud. **2019**, 9, 52–64.
4. Gutierrez-Pachas, D.A.; Garcia-Zanabria, G.; Cuadros-Vargas, A.J.; Camara-Chavez, G.; Poco, J.; Gomez-Nieto, E. How Do Curricular Design Changes Impact Computer Science Programs?: A Case Study at San Pablo Catholic University in Peru. Educ. Sci. **2022**, 12, 242.
5. Rovira, S.; Puertas, E.; Igual, L. Data-driven system to predict academic grades and dropout. PLoS ONE **2017**, 12, 171–207.
6. Da Costa, F.J.; de Souza Bispo, M.; de Cássia de Faria Pereira, R. Dropout and retention of undergraduate students in management: A study at a Brazilian Federal University. RAUSP Manag. J. **2018**, 53, 74–85.
7. Del Bonifro, F.; Gabbrielli, M.; Lisanti, G.; Zingaro, S.P. Student Dropout Prediction. In Artificial Intelligence in Education, 21st International Conference, AIED 2020, Ifrane, Morocco, 6–10 July 2020, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2020; pp. 129–140.
8. Mduma, N.; Kalegele, K.; Machuve, D. A Survey of Machine Learning Approaches and Techniques for Student Dropout Prediction. Data Sci. J. **2019**, 18, 14.
9. Prenkaj, B.; Velardi, P.; Stilo, G.; Distante, D.; Faralli, S. A Survey of Machine Learning Approaches for Student Dropout Prediction in Online Courses. ACM Comput. Surv. **2020**, 53, 57.
10. De Oliveira, C.F.; Sobral, S.R.; Ferreira, M.J.; Moreira, F. How Does Learning Analytics Contribute to Prevent Students' Dropout in Higher Education: A Systematic Literature Review. Big Data Cogn. Comput. **2021**, 5, 64.
11. Aulck, L.S.; Nambi, D.; Velagapudi, N.; Blumenstock, J.; West, J. Mining University Registrar Records to Predict First-Year Undergraduate Attrition. In Proceedings of the 12th International Conference on Educational Data Mining, Montreal, QC, Canada, 2–5 July 2019; International Educational Data Mining Society: Worcester, MA, USA, 2019.
12. Ameri, S.; Fard, M.J.; Chinnam, R.B.; Reddy, C.K. Survival Analysis Based Framework for Early Prediction of Student Dropouts. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 903–912.
13. Wang, P.; Li, Y.; Reddy, C.K. Machine Learning for Survival Analysis: A Survey. ACM Comput. Surv. **2019**, 51, 110.
14. Spooner, A.; Chen, E.; Sowmya, A.; Sachdev, P.; Kochan, N.A.; Trollor, J.; Brodaty, H. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Sci. Rep. **2020**, 10, 20410.
15. Katzman, J.L.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; Kluger, Y. DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. **2018**, 18, 24.
16. Yu, C.N.; Greiner, R.; Lin, H.C.; Baracos, V. Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors. In Proceedings of the Advances in Neural Information Processing Systems; Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2011; Volume 24.
17. Fotso, S. Deep Neural Networks for Survival Analysis Based on a Multi-Task Framework. arXiv **2018**, arXiv:1801.05512.
18. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. **2008**, 2, 841–860.
19. Wright, M.N.; Dankowski, T.; Ziegler, A. Unbiased split variable selection for random survival forests using maximally selected rank statistics. Stat. Med. **2017**, 36, 1272–1284.
20. Pan, F.; Huang, B.; Zhang, C.; Zhu, X.; Wu, Z.; Zhang, M.; Ji, Y.; Ma, Z.; Li, Z. A survival analysis based volatility and sparsity modeling network for student dropout prediction. PLoS ONE **2022**, 17, e0267138.
21. Lee, C.; Zame, W.; Yoon, J.; van der Schaar, M. DeepHit: A Deep Learning Approach to Survival Analysis with Competing Risks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
22. Hu, S.; Fridgeirsson, E.A.; van Wingen, G.; Welling, M. Transformer-Based Deep Survival Analysis. In Proceedings of the AAAI Spring Symposium 2021 (SP-ACA), Palo Alto, CA, USA, 22–24 March 2021.
23. Gutierrez Pachas, D.A.; Garcia-Zanabria, G.; Cuadros-Vargas, A.J.; Camara-Chavez, G.; Poco, J.; Gomez-Nieto, E. A comparative study of WHO and WHEN prediction approaches for early identification of university students at dropout risk. In Proceedings of the 2021 XLVII Latin American Computing Conference (CLEI), Cartago, Costa Rica, 25–29 October 2021; pp. 1–10.
24. Garcia-Zanabria, G.; Gutierrez-Pachas, D.A.; Camara-Chavez, G.; Poco, J.; Gomez-Nieto, E. SDA-Vis: A Visualization System for Student Dropout Analysis Based on Counterfactual Exploration. Appl. Sci. **2022**, 12, 5785.
25. Platt, A.; Fan-Osuala, O.; Herfel, N. Understanding and Predicting Student Retention and Attrition in IT Undergraduates. In Proceedings of the 2019 on Computers and People Research Conference, SIGMIS-CPR'19, Nashville, TN, USA, 20–22 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 135–138.
26. Vásquez Verdugo, J.; Miranda, J. Student Desertion: What Is and How Can It Be Detected on Time? In Data Science and Digital Business; García Márquez, F.P., Lev, B., Eds.; Springer: Cham, Switzerland, 2019; pp. 263–283.
27. Tanner, T.; Toivonen, H. Predicting and preventing student failure - using the k-nearest neighbour method to predict student performance in an online course environment. Int. J. Learn. Technol. **2010**, 5, 356–377.
28. Medina, E.C.; Chunga, C.B.; Armas-Aguirre, J.; Grandón, E.E. Predictive model to reduce the dropout rate of university students in Perú: Bayesian Networks vs. Decision Trees. In Proceedings of the 2020 15th Iberian Conference on Information Systems and Technologies (CISTI), Sevilla, Spain, 24–27 June 2020; pp. 1–7.
29. Siri, D. Predicting Students' Dropout at University Using Artificial Neural Networks. Ital. J. Sociol. Educ. **2015**, 7, 225–247.
30. Buchhorn, J.; Wigger, B.U. Predicting Student Dropout: A Replication Study Based on Neural Networks; CESifo Working Paper No. 9300; Munich Society for the Promotion of Economic Research - CESifo GmbH: Munich, Germany, 2021.
31. Mezzini, M.; Bonavolontà, G.; Agrusti, F. Predicting university dropout by using convolutional neural networks. In Proceedings of the INTED2019 Proceedings, 13th International Technology, Education and Development Conference, IATED, Valencia, Spain, 11–13 March 2019; pp. 9155–9163.
32. Wu, N.; Zhang, L.; Gao, Y.; Zhang, M.; Sun, X.; Feng, J. CLMS-Net: Dropout Prediction in MOOCs with Deep Learning. In Proceedings of the ACM Turing Celebration Conference—China, ACM TURC'19, Chengdu, China, 17–19 May 2019; Association for Computing Machinery: New York, NY, USA, 2019.
33. Mubarak, A.A.; Cao, H.; Hezam, I.M. Deep analytic model for student dropout prediction in massive open online courses. Comput. Electr. Eng. **2021**, 93, 107271.
34. Zheng, P.; Yuan, S.; Wu, X. SAFE: A Neural Survival Analysis Model for Fraud Early Detection. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19, Honolulu, HI, USA, 27 January–1 February 2019.
35. Juajibioy, J.C. Study of University Dropout Reason Based on Survival Model. Open J. Stat. **2016**, 6, 908–916.
36. Csalódi, R.; Abonyi, J. Integrated Survival Analysis and Frequent Pattern Mining for Course Failure-Based Prediction of Student Dropout. Mathematics **2021**, 9, 463.
37. Cox, D.R. Regression Models and Life-Tables. J. R. Stat. Soc. Ser. B (Methodological) **1972**, 34, 187–220.
38. Bani, M.; Haji, M. College Student Retention: When Do We Losing Them? In Proceedings of the World Congress on Engineering and Computer Science, Tehran, Iran, 26–28 April 2017.
39. Agrusti, F.; Mezzini, M.; Bonavolontà, G. Deep learning approach for predicting university dropout: A case study at Roma Tre University. J. E-Learn. Knowl. Soc. **2020**, 16, 44–54.
40. Rodríguez-Muñiz, L.J.; Bernardo, A.B.; Esteban, M.; Díaz, I. Dropout and transfer paths: What are the risky profiles when analyzing university persistence with machine learning techniques? PLoS ONE **2019**, 14, e0218796.

**Figure 1.** Considering students A, B, C, and D, we illustrate the (**a**) permanence times and (**b**) survival curves for each of them.

**Figure 2.** Heat map of the correlations between the attributes for each department. We illustrate the positive correlations (in green scale) and negative correlations (in brown scale). (**a**) Education (Edu). (**b**) Computer Science (CS). (**c**) Psychology (Psy). (**d**) Law and Political Sciences (LPS). (**e**) Economic and Business Sciences (EBS). (**f**) Engineering (Eng).

**Figure 3.** Comparison of predicted survival curves. The actual curve is displayed in blue, while the predicting methods, each shown in a different color, are Weibull, Gompertz, CPH, RSF, CSF, MTLR, N-MTLR, and DeepSurv. (**a**) Education (Edu). (**b**) Computer Science (CS). (**c**) Psychology (Psy). (**d**) Law and Political Sciences (LPS). (**e**) Economic and Business Sciences (EBS). (**f**) Engineering (Eng).

**Figure 4.** Scatter plot between the proportion of approved courses and the logarithm of the risk score. We highlight a student who dropped out in black; otherwise, the student is shown in pink. (**a**) Education (Edu). (**b**) Computer Science (CS). (**c**) Psychology (Psy). (**d**) Law and Political Sciences (LPS). (**e**) Economic and Business Sciences (EBS). (**f**) Engineering (Eng).

**Table 1.**Summary of references focused on the SDP problem and grouped according to classification algorithms and survival analysis methods. Additionally, we detail if the method uses a traditional approach (Trad) or a deep learning variant (Deep) in the column Type.

| Family | Type | Method | Reference |
|---|---|---|---|
| Classification Algorithms | Trad | Logistic Regression (LR) | [11,23,25,26] |
| | | K-Nearest Neighbor (KNN) | [11,27] |
| | | Support Vector Machine (SVM) | [11,23,26] |
| | | Gaussian Naive Bayes (GNB) | [23,25,36] |
| | | Decision Tree (DT) | [23,25,26,28] |
| | | Random Forest (RF) | [11,23] |
| | Deep | Artificial Neural Networks (ANNs) | [26,29,39] |
| | | Convolutional Neural Networks (CNNs) | [8,9,10,32,33] |
| Survival Analysis Methods | Trad | Non-parametric methods (KM estimator) | [23,35,36] |
| | | Parametric methods (Gompertz distribution) | [20] |
| | | Cox Proportional Hazards regression (CPH) | [4,6,12,15,20,23,35,38] |
| | | Time-Dependent Cox regression (TD-Cox) | [12] |
| | | Random Survival Forest (RSF) | [14,15] |
| | | Conditional Survival Forest (CSF) | [14,20] |
| | Deep | Nonlinear Cox regression (DeepSurv) | [15,20] |

**Table 2.** Sample size for each academic department.

| Department | Sample Size |
|---|---|
| Education (Edu) | 312 |
| Computer Science (CS) | 768 |
| Psychology (Psy) | 1146 |
| Law and Political Sciences (LPS) | 2456 |
| Economic and Business Sciences (EBS) | 4100 |
| Engineering (Eng) | 4914 |

**Table 3.** Summary of the attributes employed in this work.

| Attribute Name | Attribute Description | Attribute Type |
|---|---|---|
| SID | Masked student identifier | Numerical and anonymized |
| Department | Academic department's name | Categorical (Edu, CS, Psy, LPS, EBS, or Eng) |
| Changed_SID | Whether the student changed SID | Categorical (Yes or No) |
| Female | Whether the student's gender is female | Categorical (Yes or No) |
| Married | Whether the student's marital status is married | Categorical (Yes or No) |
| Public | Whether the student's high school is public | Categorical (Yes or No) |
| Scholarship | Whether the student had a scholarship | Categorical (Yes or No) |
| Age_Admission | Student's age when admitted by the university | Discrete numerical |
| HDI_Provenance | Human Development Index of the student's location of provenance | Continuous numerical |
| HDI_Residence | Human Development Index of the student's location of residence | Continuous numerical |
| Final_GPA | Final grade point average | Continuous numerical |
| Courses_Sem | Proportion of enrolled courses in relation to the number of enrolled semesters | Continuous numerical |
| Absences_Courses | Proportion of the number of hours of absence in relation to the total number of hours in courses | Continuous numerical |
| Approved_Courses | Proportion of approved courses in relation to the total number of enrolled courses | Continuous numerical |
| NonReg_Courses | Proportion of non-regular courses in relation to the total number of enrolled courses | Continuous numerical |
| Completed_Sem | Number of completed semesters | Discrete numerical |
| Dropout | Dropout status | Categorical (Yes or No) |

**Table 4.**Percentage distribution of categorical attributes. We highlight the best (in green) and worst (in brown) percentage with the categorical value “Yes”.

| Categorical Attribute | Edu (Yes) | Edu (No) | CS (Yes) | CS (No) | Psy (Yes) | Psy (No) | LPS (Yes) | LPS (No) | EBS (Yes) | EBS (No) | Eng (Yes) | Eng (No) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Changed_SID | 11.9% | 88.1% | 23.7% | 76.3% | 14.2% | 85.8% | 19.6% | 80.4% | 23.0% | 77.0% | 24.3% | 75.7% |
| Female | 91.7% | 8.30% | 16.8% | 83.2% | 76.2% | 23.8% | 61.0% | 39.0% | 55.4% | 44.6% | 44.6% | 55.4% |
| Married | 1.60% | 98.4% | 0.40% | 99.6% | 1.80% | 98.2% | 0.90% | 99.1% | 0.90% | 99.1% | 0.30% | 99.7% |
| Public | 30.1% | 69.9% | 23.1% | 76.9% | 18.7% | 81.3% | 22.6% | 77.4% | 19.8% | 80.2% | 23.2% | 76.8% |
| Scholarship | 12.2% | 87.8% | 4.70% | 95.3% | 3.90% | 96.1% | 4.00% | 96.0% | 1.80% | 98.2% | 5.80% | 94.2% |

**Table 5.**Mean and standard deviation (std) of numerical attributes. We highlight the best (in green) and worst (in brown) mean values.

| Numerical Attribute | Edu (mean) | Edu (std) | CS (mean) | CS (std) | Psy (mean) | Psy (std) | LPS (mean) | LPS (std) | EBS (mean) | EBS (std) | Eng (mean) | Eng (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age_Admission | 20.68 | 3.19 | 19.21 | 2.87 | 19.35 | 3.43 | 18.52 | 2.53 | 18.91 | 2.72 | 18.39 | 2.08 |
| HDI_Provenance | 0.71 | 0.10 | 0.71 | 0.09 | 0.71 | 0.09 | 0.71 | 0.10 | 0.71 | 0.10 | 0.70 | 0.10 |
| HDI_Residence | 0.66 | 0.12 | 0.66 | 0.11 | 0.67 | 0.11 | 0.66 | 0.11 | 0.67 | 0.11 | 0.66 | 0.11 |
| Final_GPA | 12.67 | 3.12 | 11.12 | 3.36 | 11.94 | 3.34 | 11.99 | 2.93 | 11.72 | 2.85 | 11.22 | 2.85 |
| Courses_Sem | 6.83 | 3.02 | 5.58 | 2.48 | 6.63 | 3.12 | 6.49 | 2.94 | 6.07 | 0.11 | 5.67 | 2.40 |
| Absences_Courses | 0.14 | 0.11 | 0.13 | 0.12 | 0.12 | 0.11 | 0.14 | 0.10 | 0.14 | 0.10 | 0.12 | 0.11 |
| Approved_Courses | 0.72 | 0.27 | 0.62 | 0.28 | 0.65 | 0.31 | 0.66 | 0.28 | 0.65 | 0.27 | 0.61 | 0.27 |
| NonReg_Courses | 0.04 | 0.06 | 0.06 | 0.08 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.08 | 0.06 |

**Table 6.**Evaluation metrics using classification algorithms. We highlight the best (in green) and worst (in brown) values by academic department.

| Metric | Department | LR | SVM | GNB | KNN | DT | RF | MLP | CNN |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | Edu | 0.855 | 0.860 | 0.817 | 0.791 | 0.860 | 0.862 | 0.830 | 0.913 |
| | CS | 0.829 | 0.833 | 0.823 | 0.809 | 0.839 | 0.859 | 0.818 | 0.987 |
| | Psy | 0.883 | 0.882 | 0.847 | 0.872 | 0.865 | 0.895 | 0.873 | 0.946 |
| | LPS | 0.879 | 0.875 | 0.788 | 0.856 | 0.866 | 0.884 | 0.880 | 0.922 |
| | EBS | 0.858 | 0.864 | 0.822 | 0.845 | 0.856 | 0.868 | 0.852 | 0.908 |
| | Eng | 0.889 | 0.894 | 0.854 | 0.876 | 0.885 | 0.902 | 0.892 | 0.920 |
| AUC | Edu | 0.913 | 0.914 | 0.908 | 0.898 | 0.890 | 0.931 | 0.890 | 0.936 |
| | CS | 0.908 | 0.909 | 0.908 | 0.875 | 0.895 | 0.921 | 0.892 | 0.884 |
| | Psy | 0.940 | 0.939 | 0.917 | 0.930 | 0.928 | 0.954 | 0.935 | 0.937 |
| | LPS | 0.938 | 0.930 | 0.912 | 0.909 | 0.925 | 0.944 | 0.930 | 0.916 |
| | EBS | 0.919 | 0.926 | 0.906 | 0.907 | 0.921 | 0.938 | 0.919 | 0.918 |
| | Eng | 0.949 | 0.954 | 0.931 | 0.940 | 0.947 | 0.963 | 0.953 | 0.943 |
| MSE | Edu | 0.145 | 0.140 | 0.183 | 0.209 | 0.140 | 0.138 | 0.170 | 0.065 |
| | CS | 0.171 | 0.167 | 0.177 | 0.109 | 0.161 | 0.141 | 0.182 | 0.012 |
| | Psy | 0.117 | 0.118 | 0.153 | 0.128 | 0.135 | 0.105 | 0.127 | 0.043 |
| | LPS | 0.121 | 0.125 | 0.212 | 0.144 | 0.134 | 0.116 | 0.120 | 0.057 |
| | EBS | 0.142 | 0.136 | 0.178 | 0.155 | 0.144 | 0.132 | 0.148 | 0.069 |
| | Eng | 0.111 | 0.106 | 0.146 | 0.124 | 0.115 | 0.098 | 0.108 | 0.059 |

**Table 7.**Evaluation metrics using survival machine learning methods. We highlight the best (in green) and worst (in brown) values by academic department.

| Metric | Department | Weibull | Gompertz | CPH | RSF | CSF | MTLR | N-MTLR | DeepSurv |
|---|---|---|---|---|---|---|---|---|---|
| C-index | Edu | 0.867 | 0.869 | 0.871 | 0.899 | 0.902 | 0.863 | 0.901 | 0.916 |
| | CS | 0.880 | 0.882 | 0.887 | 0.864 | 0.872 | 0.888 | 0.881 | 0.891 |
| | Psy | 0.931 | 0.925 | 0.928 | 0.860 | 0.874 | 0.932 | 0.937 | 0.940 |
| | LPS | 0.911 | 0.911 | 0.915 | 0.882 | 0.904 | 0.909 | 0.918 | 0.923 |
| | EBS | 0.910 | 0.908 | 0.911 | 0.879 | 0.908 | 0.916 | 0.928 | 0.935 |
| | Eng | 0.933 | 0.931 | 0.936 | 0.899 | 0.924 | 0.941 | 0.949 | 0.952 |
| IBS | Edu | 0.094 | 0.099 | 0.085 | 0.109 | 0.093 | 0.082 | 0.082 | 0.081 |
| | CS | 0.105 | 0.105 | 0.095 | 0.117 | 0.105 | 0.086 | 0.089 | 0.087 |
| | Psy | 0.070 | 0.077 | 0.060 | 0.085 | 0.073 | 0.053 | 0.045 | 0.048 |
| | LPS | 0.081 | 0.087 | 0.074 | 0.092 | 0.080 | 0.068 | 0.066 | 0.063 |
| | EBS | 0.081 | 0.087 | 0.075 | 0.086 | 0.075 | 0.067 | 0.059 | 0.054 |
| | Eng | 0.070 | 0.076 | 0.062 | 0.080 | 0.065 | 0.049 | 0.043 | 0.041 |
| MSE | Edu | 0.083 | 0.054 | 0.042 | 0.074 | 0.065 | 0.088 | 0.096 | 0.040 |
| | CS | 0.111 | 0.070 | 0.043 | 0.122 | 0.093 | 0.082 | 0.092 | 0.034 |
| | Psy | 0.114 | 0.070 | 0.045 | 0.085 | 0.059 | 0.104 | 0.109 | 0.052 |
| | LPS | 0.107 | 0.070 | 0.047 | 0.081 | 0.052 | 0.091 | 0.094 | 0.036 |
| | EBS | 0.089 | 0.063 | 0.048 | 0.081 | 0.047 | 0.084 | 0.103 | 0.038 |
| | Eng | 0.112 | 0.076 | 0.050 | 0.095 | 0.053 | 0.101 | 0.110 | 0.041 |
| MAE | Edu | 0.267 | 0.212 | 0.165 | 0.271 | 0.258 | 0.223 | 0.238 | 0.169 |
| | CS | 0.313 | 0.252 | 0.194 | 0.339 | 0.305 | 0.236 | 0.285 | 0.188 |
| | Psy | 0.317 | 0.249 | 0.178 | 0.294 | 0.245 | 0.276 | 0.243 | 0.199 |
| | LPS | 0.303 | 0.245 | 0.176 | 0.285 | 0.217 | 0.247 | 0.215 | 0.158 |
| | EBS | 0.275 | 0.233 | 0.182 | 0.285 | 0.208 | 0.228 | 0.232 | 0.167 |
| | Eng | 0.312 | 0.258 | 0.171 | 0.307 | 0.227 | 0.228 | 0.229 | 0.166 |

**Table 8.**Importance percentage of the predictor variables. We highlight the best (in green) and worst (in brown) percentage values by academic department.

| Attribute Name | Edu | CS | Psy | LPS | EBS | Eng |
|---|---|---|---|---|---|---|
| Changed_SID | 4.65% | 9.88% | 10.4% | 15.66% | 17.87% | 15.92% |
| Female | 0% | 0% | 0% | 2.88% | 1.28% | 0.68% |
| Married | 0% | 0% | 2.01% | 0% | 0% | 0% |
| Public | 0% | 0% | 0.54% | 0% | 2.52% | 2.98% |
| Scholarship | 3.27% | 2.67% | 2.36% | 4.33% | 0% | 8.13% |
| Age_Admission | 2.41% | 3.09% | 0% | 1.47% | 0.23% | 2% |
| HDI_Provenance | 0% | 0.03% | 0% | 0% | 0.14% | 2.83% |
| HDI_Residence | 0% | 4.81% | 2.17% | 0% | 0.66% | 0.36% |
| Final_GPA | 29.04% | 19.18% | 24.55% | 19.85% | 21.16% | 17.19% |
| Courses_Sem | 10.89% | 11.77% | 11.7% | 11.66% | 12.09% | 9.78% |
| Absences_Courses | 13.83% | 12.33% | 12.47% | 10.04% | 9.18% | 9.02% |
| Approved_Courses | 25.54% | 24.15% | 25.84% | 22.24% | 21.92% | 21.55% |
| NonReg_Courses | 10.66% | 12.11% | 7.96% | 11.87% | 12.94% | 9.56% |

Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gutierrez-Pachas, D.A.; Garcia-Zanabria, G.; Cuadros-Vargas, E.; Camara-Chavez, G.; Gomez-Nieto, E.
Supporting Decision-Making Process on Higher Education Dropout by Analyzing Academic, Socioeconomic, and Equity Factors through Machine Learning and Survival Analysis Methods in the Latin American Context. *Educ. Sci.* **2023**, *13*, 154.
https://doi.org/10.3390/educsci13020154

**AMA Style**

Gutierrez-Pachas DA, Garcia-Zanabria G, Cuadros-Vargas E, Camara-Chavez G, Gomez-Nieto E.
Supporting Decision-Making Process on Higher Education Dropout by Analyzing Academic, Socioeconomic, and Equity Factors through Machine Learning and Survival Analysis Methods in the Latin American Context. *Education Sciences*. 2023; 13(2):154.
https://doi.org/10.3390/educsci13020154

**Chicago/Turabian Style**

Gutierrez-Pachas, Daniel A., Germain Garcia-Zanabria, Ernesto Cuadros-Vargas, Guillermo Camara-Chavez, and Erick Gomez-Nieto.
2023. "Supporting Decision-Making Process on Higher Education Dropout by Analyzing Academic, Socioeconomic, and Equity Factors through Machine Learning and Survival Analysis Methods in the Latin American Context" *Education Sciences* 13, no. 2: 154.
https://doi.org/10.3390/educsci13020154