# Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities

## Abstract

## 1. Introduction

## 2. Literature Review

#### 2.1. Explanatory Approaches

#### 2.2. Predictive Approaches

#### 2.3. Machine Learning Approaches

#### 2.3.1. Decision Trees

#### 2.3.2. Logistic Regression

#### 2.3.3. Naive Bayes

#### 2.3.4. K-Nearest Neighbors (KNN)

#### 2.3.5. Neural Networks

#### 2.3.6. Support Vector Machine

#### 2.3.7. Random Forest

#### 2.3.8. Gradient Boosting Decision Tree

#### 2.3.9. Multiple Machine Learning Models Comparisons

#### 2.4. Opportunities Detected from the Literature Review

## 3. Methodology

## 4. Exploratory Data Analysis

#### 4.1. Universidad Adolfo Ibáñez

#### 4.2. Universidad de Talca

#### 4.3. Unification of Both Datasets

## 5. Analysis and Results

#### 5.1. Results

- **KNN**: combined $K=29$; UAI $K=29$; U Talca and U Talca All $K=71$.
- **SVM**: combined $C=10$; UAI $C=1$; U Talca and U Talca All $C=1$; polynomial kernel for all models.
- **Decision tree**: minimum samples at a leaf: combined 187; UAI 48; U Talca 123; U Talca All 102.
- **Random forest**: minimum samples at a leaf: combined 100; UAI 20; U Talca 150; U Talca All 20. Number of trees: combined 500; UAI 50; U Talca 50; U Talca All 500. Number of sampled features per tree: combined 20; UAI 15; U Talca 15; U Talca All 4.
- **Gradient boosting decision tree**: minimum samples at a leaf: combined 150; UAI 50; U Talca 150; U Talca All 150. Number of trees: combined 100; UAI 100; U Talca 50; U Talca All 50. Number of sampled features per tree: combined 8; UAI 20; U Talca 15; U Talca All 4.
- **Naive Bayes**: a Gaussian distribution was assumed.
- **Logistic regression**: only variable selection was applied.
- **Neural network**: hidden layers and neurons per layer: combined 2–15; UAI 1–18; U Talca 1–18; U Talca All 1–5.
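The tuned hyperparameters above can be made concrete as model objects. The sketch below expresses the combined-dataset settings as scikit-learn estimators; this is an illustration, not the authors' code, and details not reported in the paper (e.g., `random_state`, solver choices) are left at library defaults.

```python
# Sketch: the reported combined-dataset hyperparameters as scikit-learn
# estimators. Unreported settings are assumptions left at library defaults.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

combined_models = {
    "knn": KNeighborsClassifier(n_neighbors=29),          # K = 29
    "svm": SVC(C=10, kernel="poly"),                      # C = 10, polynomial kernel
    "decision_tree": DecisionTreeClassifier(min_samples_leaf=187),
    "random_forest": RandomForestClassifier(
        n_estimators=500, min_samples_leaf=100, max_features=20),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, min_samples_leaf=150, max_features=8),
    "naive_bayes": GaussianNB(),                          # Gaussian distribution assumed
    "logistic_regression": LogisticRegression(),          # variable selection applied upstream
}
```

Each estimator can then be fitted and scored on the corresponding dataset split.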

#### 5.2. Variable Analysis

#### Qualitative Analysis

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Categorical Variables Description

**Table A1.** Total and dropout frequencies for each categorical variable, per university.

| Variable | Value | UAI Total Frequency | UAI Dropout Frequency | U. Talca Total Frequency | U. Talca Dropout Frequency |
|---|---|---|---|---|---|
| Dropout | | 3750 | 536 | 2201 | 472 |
| Gender | male | 2893 | 428 | 1694 | 360 |
| | female | 857 | 108 | 507 | 112 |
| School | private | 2856 | 436 | 128 | 24 |
| | subsidized | 538 | 61 | 1172 | 251 |
| | public | 115 | 13 | 872 | 187 |
| | null | 241 | 26 | 29 | 10 |
| Admission | regular | 3457 | 491 | 2155 | 457 |
| | special | 286 | 43 | 46 | 15 |
| | null | 7 | 2 | 0 | 0 |
| Preference | first | 1972 | 255 | 1592 | 310 |
| | second | 825 | 151 | 310 | 77 |
| | third | 407 | 58 | 160 | 37 |
| | fourth or posterior | 302 | 45 | 132 | 45 |
| | null | 244 | 27 | 7 | 3 |
| Engineering degree | Bioinformatics | – | – | 137 | 49 |
| | Civil | – | – | 285 | 47 |
| | Industrial | – | – | 542 | 72 |
| | Informatics | – | – | 324 | 96 |
| | Mechanics | – | – | 208 | 34 |
| | Mechatronics | – | – | 285 | 67 |
| | Mines | – | – | 420 | 107 |
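The frequency counts above translate directly into per-category dropout rates (dropout frequency divided by total frequency). A minimal sketch, using the UAI "School" rows as an example (counts copied from the table):

```python
# Per-category dropout rates from the appendix frequency counts.
# Values are (total, dropout) pairs for the UAI "School" variable.
school_uai = {"private": (2856, 436), "subsidized": (538, 61),
              "public": (115, 13), "null": (241, 26)}

rates = {category: round(dropout / total, 3)
         for category, (total, dropout) in school_uai.items()}
print(rates)  # e.g., private: 436/2856 ≈ 0.153
```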

## Appendix B. Comparison Details

**Table A2.** Learned parameters for the logistic regression for all datasets; parameters with a p-value over 0.01 are not shown.

| Var | Both | UAI | U Talca | U Talca All Vars |
|---|---|---|---|---|
| mat | −2.28 ± 0.27 | – | −2.72 ± 0.23 | −2.94 ± 0.28 |
| pps | −1.54 ± 0.19 | −2.46 ± 0.39 | −0.80 ± 0.24 | – |
| lang | 1.23 ± 0.26 | 1.26 ± 0.38 | 0.43 ± 0.16 | – |
| optional | −1.44 ± 0.34 | – | – | – |
| nem | – | – | −0.57 ± 0.25 | −0.85 ± 0.23 |
| mechanical degree | – | – | – | −0.52 ± 0.12 |
| civil degree | – | – | – | −0.41 ± 0.16 |
| bioinformatics degree | – | – | – | 0.45 ± 0.18 |
| preference | – | – | 0.74 ± 0.34 | – |
| admission test | −0.37 ± 0.08 | – | −0.57 ± 0.27 | – |
| region Valparaiso | – | −0.67 ± 0.19 | – | – |
| region Metropolitana | – | −0.43 ± 0.15 | – | – |
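The coefficients in Table A2 feed a standard logistic model: a weighted sum of the (standardized) inputs passed through the sigmoid. The sketch below shows that mapping for the combined ("Both") column; the intercept is not reported in the table, so `beta0` here is a placeholder assumption.

```python
import math

def dropout_probability(features, coefficients, beta0=0.0):
    # Logistic model: p = 1 / (1 + exp(-(beta0 + sum_i beta_i * x_i))).
    # beta0 is a placeholder; the paper does not report the intercept.
    z = beta0 + sum(coefficients[name] * value
                    for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients for the "Both" (combined) dataset, copied from Table A2.
both_coefs = {"mat": -2.28, "pps": -1.54, "lang": 1.23,
              "optional": -1.44, "admission test": -0.37}

low_mat = dropout_probability({"mat": -1.0}, both_coefs)   # below-average score
high_mat = dropout_probability({"mat": 1.0}, both_coefs)   # above-average score
assert low_mat > high_mat  # the negative mat coefficient lowers dropout risk
```

The sign of each coefficient is what matters for interpretation: negative weights (e.g., mat, pps) reduce the predicted dropout probability as the score rises.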

**Figure A1.** Feature importance for decision trees. Black dots correspond to the means, while red bars represent one standard deviation. (**a**) Both. (**b**) UAI. (**c**) U Talca. (**d**) U Talca All Vars.

**Figure A2.** Feature importance for random forest. Black dots correspond to the means, while red bars represent one standard deviation. (**a**) Both. (**b**) UAI. (**c**) U Talca. (**d**) U Talca All Vars.

**Figure A3.** Feature importance for gradient boosting. Black dots correspond to the means, while red bars represent one standard deviation. (**a**) Both. (**b**) UAI. (**c**) U Talca. (**d**) U Talca All Vars.


**Figure 1.** Score conditional distributions based on the DROPOUT variable, with respect to each variable within the Universidad Adolfo Ibáñez dataset. (**a**) Variable nem. (**b**) Variable mat. (**c**) Variable optional. (**d**) Variable pps. (**e**) Variable ranking.

**Figure 2.** Score conditional distributions based on the DROPOUT variable, with respect to each variable within the Universidad de Talca dataset. (**a**) Variable nem. (**b**) Variable mat. (**c**) Variable optional. (**d**) Variable pps. (**e**) Variable ranking.

**Figure 3.** Score conditional distributions based on the DROPOUT variable, with respect to each variable within the combined dataset. (**a**) Variable nem. (**b**) Variable mat. (**c**) Variable optional. (**d**) Variable pps. (**e**) Variable ranking.

| Name | Description |
|---|---|
| ID | Unique identifier per student (not used within the models) |
| Year | Year in which the student entered the university |
| Gender | Either male or female |
| School | Type of school (either private, subsidized, or public) |
| Admission | Type of admission (either regular or special) |
| Nem | High school score (national standardized score) |
| Ranking | High school rank (comparison to other students within the same institution) |
| Mat | Mathematics score (national tests) |
| Lang | Language score (national tests) |
| Optional | Score from the optional national test (either history or science) |
| Pps | Weighted score from national tests |
| Preference | Order in which the student chose the university within the national university application form |
| Commune | Place of residence |
| Region | Region of origin |
| Dropout | Label variable |
| University | The student's university (either Universidad Adolfo Ibáñez or Universidad de Talca; only used in the combined dataset) |

| Model | Both | UAI | U Talca | U Talca All |
|---|---|---|---|---|
| Random model | 0.27 ± 0.02 | 0.26 ± 0.03 | 0.31 ± 0.04 | 0.29 ± 0.04 |
| KNN | 0.35 ± 0.03 | 0.30 ± 0.05 | 0.42 ± 0.05 | - |
| SVM | 0.36 ± 0.02 | 0.31 ± 0.05 | 0.42 ± 0.03 | - |
| Decision tree | 0.33 ± 0.03 | 0.28 ± 0.03 | 0.41 ± 0.05 | 0.41 ± 0.05 |
| Random forest | 0.35 ± 0.03 | 0.30 ± 0.06 | 0.41 ± 0.05 | 0.40 ± 0.04 |
| Gradient boosting | 0.37 ± 0.03 | 0.31 ± 0.04 | 0.41 ± 0.05 | 0.40 ± 0.04 |
| Naive Bayes | 0.34 ± 0.02 | 0.29 ± 0.04 | 0.42 ± 0.03 | - |
| Logistic regression | 0.35 ± 0.03 | 0.30 ± 0.05 | 0.41 ± 0.03 | 0.43 ± 0.04 |
| Neural network | 0.35 ± 0.03 | 0.28 ± 0.02 | 0.39 ± 0.05 | 0.42 ± 0.04 |

| Model | Both | UAI | U Talca | U Talca All |
|---|---|---|---|---|
| Random model | 0.63 ± 0.02 | 0.64 ± 0.01 | 0.63 ± 0.04 | 0.61 ± 0.03 |
| KNN | 0.73 ± 0.02 | 0.72 ± 0.02 | 0.76 ± 0.02 | - |
| SVM | 0.76 ± 0.02 | 0.69 ± 0.04 | 0.71 ± 0.03 | - |
| Decision tree | 0.79 ± 0.03 | 0.78 ± 0.04 | 0.73 ± 0.03 | 0.72 ± 0.04 |
| Random forest | 0.80 ± 0.02 | 0.82 ± 0.01 | 0.74 ± 0.03 | 0.72 ± 0.03 |
| Gradient boosting | 0.80 ± 0.01 | 0.73 ± 0.02 | 0.73 ± 0.04 | 0.73 ± 0.03 |
| Naive Bayes | 0.77 ± 0.01 | 0.68 ± 0.03 | 0.74 ± 0.02 | - |
| Logistic regression | 0.73 ± 0.01 | 0.72 ± 0.03 | 0.73 ± 0.01 | 0.74 ± 0.03 |
| Neural network | 0.76 ± 0.03 | 0.67 ± 0.01 | 0.73 ± 0.03 | 0.72 ± 0.08 |

| Model | Both | UAI | U Talca | U Talca All |
|---|---|---|---|---|
| Random model | 0.52 ± 0.06 | 0.58 ± 0.15 | 0.48 ± 0.10 | 0.62 ± 0.13 |
| KNN | 0.58 ± 0.06 | 0.55 ± 0.07 | 0.58 ± 0.04 | - |
| SVM | 0.57 ± 0.05 | 0.59 ± 0.10 | 0.66 ± 0.04 | - |
| Decision tree | 0.47 ± 0.08 | 0.65 ± 0.08 | 0.62 ± 0.09 | 0.65 ± 0.04 |
| Random forest | 0.48 ± 0.05 | 0.46 ± 0.07 | 0.58 ± 0.06 | 0.61 ± 0.09 |
| Gradient boosting | 0.51 ± 0.05 | 0.41 ± 0.06 | 0.57 ± 0.04 | 0.59 ± 0.05 |
| Naive Bayes | 0.50 ± 0.06 | 0.44 ± 0.08 | 0.61 ± 0.03 | - |
| Logistic regression | 0.60 ± 0.06 | 0.62 ± 0.06 | 0.61 ± 0.04 | 0.62 ± 0.03 |
| Neural network | 0.56 ± 0.12 | 0.59 ± 0.08 | 0.59 ± 0.12 | 0.64 ± 0.06 |

| Model | Both | UAI | U Talca | U Talca All |
|---|---|---|---|---|
| Random model | 0.18 ± 0.02 | 0.19 ± 0.03 | 0.20 ± 0.03 | 0.33 ± 0.06 |
| KNN | 0.25 ± 0.02 | 0.15 ± 0.02 | 0.33 ± 0.05 | - |
| SVM | 0.26 ± 0.01 | 0.20 ± 0.03 | 0.31 ± 0.03 | - |
| Decision tree | 0.26 ± 0.02 | 0.20 ± 0.04 | 0.31 ± 0.04 | 0.31 ± 0.04 |
| Random forest | 0.28 ± 0.03 | 0.21 ± 0.02 | 0.32 ± 0.05 | 0.32 ± 0.06 |
| Gradient boosting | 0.28 ± 0.02 | 0.23 ± 0.04 | 0.31 ± 0.04 | 0.31 ± 0.04 |
| Naive Bayes | 0.26 ± 0.01 | 0.23 ± 0.04 | 0.32 ± 0.04 | - |
| Logistic regression | 0.26 ± 0.02 | 0.19 ± 0.03 | 0.31 ± 0.03 | 0.32 ± 0.03 |
| Neural network | 0.26 ± 0.03 | 0.20 ± 0.04 | 0.33 ± 0.05 | 0.33 ± 0.04 |

| Model | Both | UAI | U Talca | U Talca All |
|---|---|---|---|---|
| Random model | 0.51 ± 0.02 | 0.51 ± 0.01 | 0.52 ± 0.04 | 0.49 ± 0.03 |
| KNN | 0.62 ± 0.02 | 0.60 ± 0.02 | 0.66 ± 0.03 | - |
| SVM | 0.65 ± 0.02 | 0.57 ± 0.05 | 0.61 ± 0.03 | - |
| Decision tree | 0.68 ± 0.03 | 0.66 ± 0.05 | 0.63 ± 0.03 | 0.63 ± 0.05 |
| Random forest | 0.69 ± 0.02 | 0.71 ± 0.02 | 0.64 ± 0.03 | 0.62 ± 0.04 |
| Gradient boosting | 0.69 ± 0.02 | 0.61 ± 0.03 | 0.63 ± 0.04 | 0.63 ± 0.03 |
| Naive Bayes | 0.66 ± 0.01 | 0.56 ± 0.04 | 0.64 ± 0.03 | - |
| Logistic regression | 0.62 ± 0.02 | 0.60 ± 0.03 | 0.63 ± 0.02 | 0.64 ± 0.03 |
| Neural network | 0.66 ± 0.03 | 0.55 ± 0.10 | 0.63 ± 0.08 | 0.63 ± 0.07 |
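The mean ± standard deviation entries in the tables above are the kind of summary produced by scoring a model over repeated cross-validation folds. The sketch below shows that pattern with scikit-learn on synthetic, class-imbalanced data (an illustration only: the data, fold count, and metric here are assumptions, so the numbers will not match the paper's).

```python
# Sketch: producing "mean ± std" metric summaries via cross-validation.
# Data is synthetic; fold count and scoring metric are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced binary data, mimicking a minority "dropout" class.
X, y = make_classification(n_samples=500, weights=[0.85, 0.15],
                           random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="f1", cv=cv)  # one score per fold
print(f"{scores.mean():.2f} ± {scores.std():.2f}")
```

Stratified folds keep the dropout/non-dropout ratio stable across splits, which matters when the positive class is as small as it is in these datasets.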

**Table 7.** Feature importance for each model and dataset. The pattern of each cell represents the datasets "combined, UAI, U Talca, U Talca All".

| Var | Decision Tree | Random Forest | Gradient Boosting | Naive Bayes | Logistic Regression |
|---|---|---|---|---|---|
| mat | Y,Y,Y,Y | Y,Y,Y,Y | Y,Y,Y,Y | Y,Y,Y,- | Y,N,Y,Y |
| pps | Y,Y,N,N | Y,Y,N,N | Y,Y,Y,Y | Y,N,N,- | Y,Y,Y,N |
| lang | Y,Y,Y,N | Y,Y,N,N | Y,Y,Y,Y | N,N,N,- | Y,Y,Y,N |
| ranking | N,N,Y,Y | Y,Y,N,N | Y,Y,Y,Y | Y,Y,N,- | N,N,N,N |
| optional | N,N,N,N | Y,Y,N,N | Y,Y,Y,Y | Y,Y,N,- | Y,N,N,N |
| nem | N,N,N,N | N,N,N,N | N,N,N,N | Y,N,Y,- | N,N,Y,Y |
| admission | N,N,N,N | N,N,N,N | N,N,N,N | N,N,N,- | Y,N,Y,N |
| degree | -,-,-,N | -,-,-,N | -,-,-,Y | -,-,-,- | -,-,-,Y |
| preference | N,N,N,N | N,N,N,N | N,N,N,N | N,N,N,- | N,N,Y,N |
| region | N,N,N,N | N,N,N,N | N,N,N,N | N,N,N,- | N,Y,N,N |
| fam income | -,-,-,N | -,-,-,N | -,-,-,Y | -,-,-,- | -,-,-,N |
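For the tree-based models, Y/N flags like those in Table 7 can be derived from impurity-based feature importances. A minimal sketch on synthetic data (the variable names, threshold rule, and data here are assumptions, not the paper's procedure):

```python
# Sketch: flagging features as important ("Y") when their mean impurity-based
# importance exceeds the uniform baseline 1/n_features. Data is synthetic and
# the variable names are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
names = ["mat", "pps", "lang", "ranking", "optional", "nem"]  # placeholders

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
flags = {name: ("Y" if imp > 1.0 / len(names) else "N")
         for name, imp in zip(names, forest.feature_importances_)}
print(flags)
```

Averaging `feature_importances_` over repeated fits (as the appendix figures do, with means and standard deviations) gives a more stable ranking than a single fit.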

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Opazo, D.; Moreno, S.; Álvarez-Miranda, E.; Pereira, J.
Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. *Mathematics* **2021**, *9*, 2599.
https://doi.org/10.3390/math9202599
