Factors, Prediction, Explainability, and Simulating University Dropout Through Machine Learning: A Systematic Review, 2012–2024

Quimiz-Moreira, Mauricio; Delgadillo, Rosa; Parraga-Alava, Jorge; Maculan, Nelson; Mauricio, David

doi:10.3390/computation13080198

by Mauricio Quimiz-Moreira ¹,*, Rosa Delgadillo ¹,*, Jorge Parraga-Alava ², Nelson Maculan ³ and David Mauricio ¹

¹ Facultad de Ingeniería de Sistemas e Informática, Universidad Nacional Mayor de San Marcos, Lima 15081, Peru
² Facultad de Ciencias Informáticas, Universidad Técnica de Manabí, Portoviejo 130104, Ecuador
³ Systems Engineering-Computer Science and Applied Mathematics, CT & CCMN, Campus - Ilha do Fundão, Federal University of Rio de Janeiro, Rio de Janeiro 21941-617, Brazil
* Authors to whom correspondence should be addressed.

Computation 2025, 13(8), 198; https://doi.org/10.3390/computation13080198

Submission received: 3 July 2025 / Revised: 4 August 2025 / Accepted: 6 August 2025 / Published: 12 August 2025

Abstract

College dropout represents a significant challenge for universities, and despite advances in machine learning technologies, predicting dropout remains a complex task. This literature review investigates the factors that influence college dropout, examines the models used to predict it, and highlights the most significant advances in explainability and simulation over the period 2012 to 2024, using the PRISMA methodology. The review identified 520 factors in five categories (demographic, socioeconomic, institutional, personal, and academic), with the most studied factors in each category being, respectively, gender, scholarships, infrastructure, student identification, and grades. It also identified 83 machine learning models, with the most studied being the decision tree, logistic regression, and random forest. In addition, eight explanatory models were identified, with SHAP and LIME being the most widely used. Finally, no simulation models related to university dropout were identified. This study organizes the factors related to university dropout, the key models used for prediction, and the methods used to explain the causal factors that influence university student dropout.

Keywords:

university dropout; machine learning; factors; prediction; explainability; simulation

1. Introduction

Globally, one in three students drops out of higher education (HE), a phenomenon largely influenced by personal factors and the associated social costs [1]. Additionally, UNESCO data indicate that approximately 30% of students entering HE fail to complete their studies [2]. In Europe, the Organization for Economic Co-operation and Development (OECD) reported that university dropout (UED) rates varied between 30% and 45% as of 2022 [3]. These figures have generated a growing interest in mitigating university dropout (UD) and fostering student retention, issues that have acquired strategic relevance for higher education, as reducing dropout directly contributes to the development of high-level skills and the strengthening of human capital [4].

In South Africa, the higher education system faces one of the highest student dropout rates globally, resulting in an extremely low graduation rate, estimated at only 15% [5]. This phenomenon not only significantly decreases the student population in higher education institutions (HEIs) [6] but also adversely impacts society by failing to meet the growing demand for professionals needed for economic and social development [7], constituting a critical challenge for both the education system and society as a whole [8].

Moreover, in Latin America, the impact of student dropout has also resulted in considerable economic and social losses. In 2019, approximately 26% of students dropped out of school, evidencing a structural problem that, according to some experts, stands as a key indicator of deficiencies in the quality of the education system [9]. This phenomenon has significant implications for the students themselves, who face multiple barriers to continuing their academic education. Among these barriers are the lack of financial resources to cover their studies, economic dependence on parents or external subsidies, and the limitations imposed by the labor market’s demands, which do not always allow for combining study with part-time employment [10].

A key approach to mitigating UED is to identify which students are more likely to drop out, analyze the underlying causes, and establish effective strategies to encourage their academic continuity [11]. In this context, several factors associated with UED that increase the likelihood of student dropout have been studied [12]. The implementation of machine learning (ML) models has proven to be an effective tool for predicting UED [13], while explainable artificial intelligence (XAI) models have gained relevance for their ability to identify and explain the causes of dropout. In addition, simulation is used to model early-stage UED risk scenarios, making it possible to anticipate the behavior of students at high risk of dropping out and to estimate the effects of tutoring, remedial courses, or socioeconomic support on these students [14].

The prediction process in ML encompasses several fundamental stages that ensure the efficiency and accuracy of the developed model [15]. It starts with data collection, where large volumes of relevant and representative information are gathered and stored. This is followed by data preprocessing, which includes cleaning, normalization, and transformation to address inconsistencies, outliers, or missing values, ensuring data quality and usability. Subsequently, feature selection is performed to identify and prioritize the most influential variables, thereby optimizing both the performance and simplicity of the model. With the preprocessed data and the selected features, the model is built using ML algorithms, adjusting hyperparameters and training the model with techniques appropriate for the type of problem, whether classification, regression, or clustering [16]. Finally, the model is deployed in a real environment to make predictions and support decision-making. At the same time, its performance is monitored using metrics such as accuracy, sensitivity, and specificity, allowing continuous adjustments to maintain its effectiveness under dynamic conditions [17].
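
The stages described above can be illustrated with a minimal, self-contained sketch. It is not taken from any of the reviewed studies: the records, the two features (grade average and attendance rate), the correlation threshold of 0.3, and the choice of logistic regression trained by gradient descent are all assumptions made for illustration, and the model is evaluated on its own training data only to keep the example short.

```python
import math

# Hypothetical records: (grade_average, attendance_rate, dropped_out).
# None marks a missing value; every number is invented for illustration.
RECORDS = [
    (3.1, 0.55, 1), (4.0, 0.60, 1), (None, 0.50, 1), (3.5, 0.40, 1),
    (7.8, 0.90, 0), (8.5, 0.95, 0), (6.9, None, 0), (9.1, 0.85, 0),
]

def impute_mean(column):
    """Cleaning: replace missing values with the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

def min_max(column):
    """Normalization: rescale a column to the [0, 1] range."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) for v in column]

def abs_correlation(xs, ys):
    """Feature selection score: absolute Pearson correlation with the label."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return abs(cov) / math.sqrt(var_x * var_y)

def predict_proba(weights, bias, row):
    """Logistic model: estimated probability of dropout for one student."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Model building: logistic regression fitted by batch gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        errors = [predict_proba(weights, bias, r) - y for r, y in zip(rows, labels)]
        for j in range(len(weights)):
            gradient = sum(e * r[j] for e, r in zip(errors, rows)) / len(rows)
            weights[j] -= learning_rate * gradient
        bias -= learning_rate * sum(errors) / len(rows)
    return weights, bias

def evaluate(y_true, y_pred):
    """Monitoring: accuracy, sensitivity (recall) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

labels = [r[2] for r in RECORDS]
columns = [min_max(impute_mean([r[i] for r in RECORDS])) for i in range(2)]
selected = [c for c in columns if abs_correlation(c, labels) >= 0.3]
rows = [list(values) for values in zip(*selected)]
weights, bias = train_logistic(rows, labels)
predictions = [int(predict_proba(weights, bias, r) >= 0.5) for r in rows]
metrics = evaluate(labels, predictions)
print(metrics)
```

In practice the evaluation step would use held-out data rather than the training records, and an established library would normally replace the hand-written functions.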

Several studies have been conducted on the prediction of UED using machine learning (ML). The decision tree achieved an accuracy of 99.34% in identifying the most relevant factors for predicting attrition [18]. Similarly, logistic regression was applied at the Universidade de Trás-os-Montes e Alto Douro (UTAD), yielding accuracies of 88% and 90% in two separate studies [15].

In addition, explanation methods such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) enable a better understanding of, and increased confidence in, ML models by describing the factors that influence the prediction results [19]. SHAP calculates the impact of each factor on the prediction and its interaction with others, and has been used at the Budapest University of Technology and Economics [1]. LIME provides interpretations that are independent of the model’s internal architecture. By applying local perturbations to the input data, it constructs an interpretable model, such as a linear regression, that approximates the behavior of the original model around a given prediction and enables the identification of the most influential variables [20].
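
The local-perturbation idea behind LIME can be sketched in a few lines. The code below is a simplified illustration of the principle, not the API of the `lime` package: the opaque model `black_box`, its two inputs (GPA and attendance, both scaled to [0, 1]), the Gaussian perturbation scale, and the kernel width are all hypothetical choices.

```python
import math
import random

def black_box(x):
    """Hypothetical opaque model: dropout risk from (gpa, attendance)."""
    gpa, attendance = x
    return 1.0 / (1.0 + math.exp(6.0 * gpa + 3.0 * attendance * attendance - 4.0))

def solve(a, b):
    """Solve the linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def local_surrogate(model, instance, n_samples=500, scale=0.1, width=0.25, seed=0):
    """Fit a proximity-weighted linear model around `instance`.

    Returns [intercept, coef_1, ..., coef_d]; features are centred on the
    instance, so the intercept approximates the model output at that point.
    """
    rng = random.Random(seed)
    d = len(instance)
    xtx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xty = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Perturb the instance and weight the sample by its proximity.
        z = [v + rng.gauss(0.0, scale) for v in instance]
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        weight = math.exp(-dist2 / width ** 2)
        feats = [1.0] + [zi - vi for zi, vi in zip(z, instance)]
        y = model(z)
        # Accumulate the weighted normal equations (X'WX) coef = X'Wy.
        for i in range(d + 1):
            xty[i] += weight * feats[i] * y
            for j in range(d + 1):
                xtx[i][j] += weight * feats[i] * feats[j]
    return solve(xtx, xty)

student = [0.4, 0.7]
intercept, gpa_effect, attendance_effect = local_surrogate(black_box, student)
print(f"local effect of gpa: {gpa_effect:.3f}, attendance: {attendance_effect:.3f}")
```

The sign and size of each coefficient indicate how the prediction for this particular student would change under a small change in the corresponding feature, which is the kind of local reading the text attributes to LIME.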

Regarding simulation, some studies have employed the Generalized mixed-effects random forest (GMERF) model, as noted by [14,21]. However, this approach is not a simulation model in the strict sense but rather a form of structured or implicit simulation, where GMERF allows the emulation of hierarchical relationships between different levels of the educational system, such as student, academic program, and institution, to anticipate dropout risks based on historical patterns.

Due to the proliferation of publications on UED, several authors have developed systematic reviews. For example, ref. [22] analyzed 67 papers published between 2006 and 2018, including nine from conferences, and identified 112 factors, 10 preprocessing techniques, 10 factor selection techniques, 14 prediction methods, and two tools. Similarly, ref. [23] reviewed 17 articles published between 2009 and 2021, identifying 14 machine learning (ML) models and 29 factors associated with attrition. Additionally, ref. [24] identified 23 UD-related research papers that described 373 factors and 13 ML algorithms. Ref. [25] analyzed more than 50 papers, identifying 12 models, five categories, and more than 30 factors between 2013 and 2017. Ref. [26] collected 33 papers between 2010 and 2022, identifying 50 models and 10 categories of factors. Ref. [27] reviewed 134 research papers, identifying more than five categories, 30 models, and five types of anomalies between 2016 and 2021. Ref. [28] analyzed 67 papers between 2017 and 2021, finding more than 30 prediction models and 40 factors in five categories. Ref. [29] analyzed 12 papers on the prediction of UED, in which six sensitive factors, five preprocessing techniques, and five ML models were found. Ref. [30] reviewed 36 studies between 2000 and 2023, identifying eight ML models, 15 factors, and three factor categories. Finally, ref. [24] identified 23 studies from 2000 to 2023, finding 373 factors and 16 models and highlighting the XAI limitation in black box models.

These literature reviews reveal an extensive body of work focusing on the prediction of UED, ranging from the identification of factors to the use of various models. However, these studies generally focus exclusively on predictive ability without incorporating the explainability of the results or exploring UED risk scenarios. This limitation reduces the possibility of gaining a deeper and more detailed understanding of the factors contributing to the prediction of UED, which is essential for developing effective and personalized interventions.

This research conducts a systematic review of the prediction, explanation, and simulation of UD using ML, covering publications in journals indexed in Scopus and Web of Science (WoS) from 2012 to 2024. The central objective is to answer the following research question: what are the most relevant ML models for predicting, explaining, and simulating university dropout?

The main contributions of this article are:

To provide an inventory of factors, predictive, explanatory, and simulation models for UD.
To provide the reader with a wide range of bibliographical references to understand and research UED using ML.

The paper is structured in six sections. Section 2 presents the theoretical background. Section 3 details the methodology applied in this systematic review. Section 4 presents the analysis that answers the research questions posed. The discussion of these findings is presented in Section 5, and the conclusions are drawn in Section 6.

2. Theoretical Background

According to Tinto [31,32,33], student retention depends to a large extent on the level of academic and social integration that the student achieves within the institution. The lack of links with teachers, peers, and the university environment increases the risk of dropout. Subsequent studies have supported this view, highlighting that a sense of belonging and institutional support are determinants of persistence and academic success [34]. Moreover, one of the current challenges for HEIs is to leverage the large volume of available data to design strategies that improve student retention in universities [35].

This section examines the principles of ML and their application in addressing university dropout. A detailed analysis of these principles shows why ML is a crucial element for mitigating UED and implementing institutional policies.

2.1. University Dropout

UED is the definitive interruption of the educational process in HE without the award of a qualification. This phenomenon is of significant concern to educational institutions and governments because of its negative implications for the personal development of students and the economic growth of societies [36,37].

According to [38], UED must be understood as more than prolonged absence or lack of enrollment, because it implies a total disconnection of the student from the institutional environment, motivated by multiple factors that can range from economic hardship to a lack of a sense of belonging. Furthermore, UED is a multifactorial phenomenon influenced by academic, personal, social, institutional, and economic variables [39]. Factors such as poor academic performance, dissatisfaction with the study program, financial difficulties, and lack of institutional support are key determinants in the decision to drop out [34].

2.2. Machine Learning

ML is a sub-discipline of AI that enables computer systems to learn from data and generate predictions without needing to be programmed for each situation, allowing for the analysis of large volumes of HE-related information. According to [40], ML algorithms are particularly effective at identifying hidden patterns in academic, demographic, and behavioral data, making them ideal for addressing complex phenomena such as college dropout. Similarly, refs. [41,42] highlight that the accuracy of these models enables reliable predictions of student performance and dropout risk, thereby fostering informed institutional decisions. According to [43], by integrating multiple data sources, such as grades, attendance, participation in virtual platforms, and socioeconomic variables, ML algorithms can reliably anticipate which students are at risk of dropping out.

In addition, ML has been instrumental in building smart and adaptive learning environments that respond to the individual needs of each learner [44]. These systems can dynamically adjust the content, the pace of instruction, and assessment methods according to the learner’s profile and performance, contributing to a more personalized learning experience. This adaptability has proven crucial in fostering student motivation, engagement, and retention. As stated by [45], the application of ML also enables teachers to monitor in real-time the effectiveness of their pedagogical strategies, facilitating continuous feedback that improves the quality of the teaching-learning process and, consequently, reduces the risk of dropout.

2.3. Explainable Artificial Intelligence (XAI)

XAI is a subfield of machine learning that focuses on providing understandable interpretations of predictive models, particularly in environments where decisions must be auditable and justifiable. The aim is to make the decisions of complex so-called black-box models understandable, allowing non-expert users to interpret and trust their results [46]. In addition, XAI contributes to algorithmic fairness by facilitating the detection of biases, which reinforces transparency and institutional accountability in the use of automated decision support systems [20].

2.4. Simulation

Simulation is a computational tool that focuses on representing, exploring, and anticipating complex scenarios based on initial conditions or hypothetical intervention strategies. Unlike traditional predictive models, which are limited to estimating the probability of event occurrence, such as university dropout, simulation-based approaches enable virtual experimentation with different system configurations, serving as a laboratory where causes, consequences, and cumulative effects can be explored. In this sense, [47] highlights the integration of artificial intelligence with the simulation of complex systems as a way to optimize overall performance, identify critical variables, and design more effective interventions. Thus, simulation in machine learning not only enhances predictive capacity but also significantly contributes to strategic decision-making, supported by evidence generated in a controlled and replicable manner.
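
As a purely illustrative sketch of the "virtual laboratory" idea, the following Monte Carlo example compares a baseline cohort with a what-if scenario in which high-risk students receive tutoring. None of it comes from the reviewed literature: the Beta(2, 5) distribution of baseline risk, the risk threshold of 0.4, and the assumed 30% relative risk reduction are hypothetical parameters chosen only to show the mechanics.

```python
import random

def simulate_cohort(n_students, tutoring_threshold=None, effect=0.3, seed=42):
    """Return the simulated dropout rate for one cohort.

    Each student draws a baseline dropout probability from Beta(2, 5)
    (an assumption). If `tutoring_threshold` is set, students whose baseline
    risk is at or above it receive tutoring, which is assumed to reduce
    their risk by the relative amount `effect`.
    """
    rng = random.Random(seed)
    dropouts = 0
    for _ in range(n_students):
        risk = rng.betavariate(2, 5)
        draw = rng.random()  # same draw in every scenario for a fair comparison
        if tutoring_threshold is not None and risk >= tutoring_threshold:
            risk *= 1.0 - effect
        dropouts += draw < risk
    return dropouts / n_students

baseline = simulate_cohort(10_000)
with_tutoring = simulate_cohort(10_000, tutoring_threshold=0.4)
print(f"baseline dropout rate: {baseline:.3f}")
print(f"with tutoring for high-risk students: {with_tutoring:.3f}")
```

Because both scenarios reuse the same seed, every simulated student has the same baseline risk and the same random draw in each run, so any difference in the dropout rate is attributable to the intervention alone.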

3. Materials and Methods

The present study is based on an adaptation of the procedure presented in [48] and used in [49], and is structured in the following phases:

Planning: in this phase, the research questions are established, and the review protocol is defined. This protocol outlines the sources of information used, the criteria for including and excluding studies, the data search strategy, and the period considered for the review.
Development: primary studies are selected according to the plan, and their quality is then assessed for data extraction and synthesis.
Results: the statistical analyses and the results, which provide answers to the research questions, are presented in Section 3.3 and Section 4, respectively.

3.1. Planning

To address the key issues related to understanding the factors, prediction, explainability, and simulation of UED, the following research questions have been formulated:

Q1. What factors exist for UED, and which are the most studied?
Q2. What machine learning models are used for predicting UED?
Q3. What are the advances of XAI in UED?
Q4. What simulation models exist for UED?

The search string specified in Table 1 was used and applied to the fields ‘title-abstract-keywords’ in Scopus and ‘topic’ in Web of Science (WoS). Articles published in journals in the period 2012–2024 were considered. The criteria for including and excluding studies are detailed in Table 2.

3.2. Development

Potential studies identified during the search were subjected to a rigorous selection process based on the inclusion and exclusion criteria detailed in Table 2. This process involved a thorough review of the content of each study to assess its relevance concerning factors, prediction, explainability, and simulation of UED using ML. Most of the studies were excluded because they focused on areas such as secondary or primary education. Figure 1 illustrates this selection process and describes the specific activities undertaken to determine whether studies are included or excluded.

3.3. Statistics

For the collection of information, a search for relevant research related to university dropout, dropout, and academic withdrawal was conducted to ensure adequate coverage of ML studies in HE. The articles that passed the initial selection were reviewed in their entirety, ensuring that they provided empirical evidence, significant theoretical contributions, and applications of machine learning in the analysis of UED.

3.3.1. Number of Potential and Selected Items

Table 3 presents the number of potential and selected articles by source. Note that the total number of selected articles corresponds to 15.52% of the total number of potential studies.

3.3.2. Trend of Articles per Year

Figure 2 illustrates the trend in the number of selected papers by year of publication on UED, highlighting a ratio of nearly 6 to 1 between the period 2019–2024, with 104 papers, and the period 2012–2018, with 18 papers. This indicates an increasing trend of studies since 2019.

3.3.3. Number of Authors by Country of Affiliation

Figure 3 shows the geographical distribution of author affiliations for the 122 selected papers on UED. Forty-seven countries have been identified, with Peru standing out as the country with the highest number of contributions to the topic, representing 19% of the total. Spain contributes 8% of the total.

3.3.4. Selected Articles by Quartile

Figure 4 shows the number of studies selected by quartile for this analysis. Notably, 46% of the articles belong to the Q2 quartile, while 34% are classified in the Q1 quartile, indicating that the selected articles are of high quality and scientific relevance.

3.3.5. Selected Articles by Publisher

Figure 5 shows the selected papers by publisher, where MDPI is predominant with 19%. IEEE Xplore accounts for 14%, and journals published by universities account for 10%. In addition, it is worth noting that the ‘Others’ category comprises 22 articles, each from a publisher with only one selected article.

4. Results

This section addresses the research questions posed in Section 3.1, based on the selected articles.

4.1. What UED Factors Exist, and Which Are the Most Studied?

For a better understanding of the factors influencing UED, these must be categorized. Therefore, five categories of factors have been identified from the selected studies, which are described in Table 4.

A total of 520 UED factors were identified in the 122 selected articles and classified according to Table 4. It should be noted that this total corresponds to all the individual mentions extracted from the articles prior to their standardization and grouping. Subsequent tables show consolidated factors and, in some cases, analyze subsets of articles according to the specific criteria of each analysis. Therefore, the differences in factor and article counts between the various tables reflect the synthesis and classification methodology employed, rather than the inclusion of sources other than those in WoS or Scopus.

4.1.1. Demographic Factors

In this category, 75 demographic factors were identified across 83 articles, with gender and age being the most frequently studied, at 50 (60%) and 43 (52%), respectively. Table 5 presents the 10 most relevant demographic factors; the complete list of 75 identified factors is available in Table A1.

4.1.2. Socioeconomic Factors

Within this category, 80 factors were identified in 64 articles, with scholarships and jobs being the most studied, with 16 (22%) and 12 (19%) mentions, respectively. Table 6 presents a summary of the 10 most relevant socioeconomic factors, while the full list of factors identified in this category is available in Table A2.

4.1.3. Institutional Factors

A total of 23 institutional factors were identified in 12 articles, with infrastructure, educational services, adequate equipment, and location being the most studied, each cited in 2 articles (17%). Table 7 summarizes the 10 most relevant institutional factors, while the complete list of all identified factors is available in Table A3.

4.1.4. Personal Factors

In this category, 138 personal factors were identified across 50 articles, with the year of entry being the most researched aspect, mentioned in 10 articles (20%). Table 8 presents the 10 most relevant personal factors, and Table A4 provides the complete list of identified factors.

4.1.5. Academic Factors

In total, 206 academic factors were identified across 90 articles, with grades and subjects being the most frequently investigated aspects, mentioned in 17 (19%) and 16 (18%) articles, respectively. Table 9 presents the 10 most relevant academic factors, while Table A5 provides the complete list of identified factors.

4.1.6. Summary of Categories

Figure 6 illustrates the number of UED factors, categorized by type (demographic, socioeconomic, institutional, personal, and academic) and frequency of occurrence in the reviewed studies. Most of the institutional and personal factors have a low frequency (1–5 studies). In comparison, the academic and demographic categories concentrate the factors with the highest recurrence (6 to 20 or more studies).

4.2. Which ML Models Are Used for Predicting UED?

To address UED, 149 ML-based prediction models (individual and hybrid) were identified in 86 selected articles, where 38% employed decision trees (DT), 26% logistic regression (LR), and 22% random forests (RFs). It is worth noting that a single study can encompass multiple models. Other models analyzed include support vector machines (SVM), present in 27 articles (21%), and artificial neural networks (ANNs), present in 21 articles (16%). Table 10 presents a synthesis of the most relevant prediction models (those used in at least two studies) that report accuracy as a metric, while the complete list of identified models is available in Table A6.

As for the preprocessing steps, the most used techniques were data cleaning and data transformation. It should be noted that 26% of the works did not use preprocessing techniques (see Table 11).

4.3. What Progress Has XAI Made in the UED?

Eight XAI models for UED (SHAP, LIME, GPI, AM, SAGE, ANFIS, PEI, and PEM-SNN) have been identified in 11 of the 122 selected studies, with SHAP and LIME standing out; both are described in Table 12. It is worth noting that SHAP assigns an importance value to each feature for a specific prediction and belongs to the class of additive feature attribution methods [149]. LIME provides a local approximation that allows for a precise understanding of the most critical specific factors contributing to the prediction of UED, which serve as inputs not only to validate the model’s behavior but also to design institutional intervention strategies [20].
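
The additive property mentioned for SHAP can be made concrete with exact Shapley values on a toy model. This is a sketch of the underlying game-theoretic definition, not the `shap` library (which relies on approximations and background datasets): the function `risk`, the three features, and the `BASELINE` and `STUDENT` profiles are hypothetical, and absent features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, features):
    """Exact Shapley values: the weighted average marginal contribution of
    each feature over all coalitions of the remaining features."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {f}) - value(coalition))
        result[f] = total
    return result

# Hypothetical reference profile and student profile (all values invented).
BASELINE = {"gpa": 0.6, "attendance": 0.8, "scholarship": 0.0}
STUDENT = {"gpa": 0.3, "attendance": 0.5, "scholarship": 1.0}

def risk(x):
    """Toy dropout-risk model with an interaction between scholarship and GPA."""
    return 0.9 - 0.5 * x["gpa"] - 0.3 * x["attendance"] - 0.1 * x["scholarship"] * x["gpa"]

def value(coalition):
    """Model output when features in `coalition` take the student's values
    and all other features keep their baseline values."""
    x = {k: (STUDENT[k] if k in coalition else BASELINE[k]) for k in BASELINE}
    return risk(x)

phi = shapley_values(value, list(BASELINE))
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.4f}")

# Additivity: the contributions sum to the gap between the student's
# prediction and the baseline prediction.
gap = risk(STUDENT) - risk(BASELINE)
print(f"sum of contributions: {sum(phi.values()):.4f}, prediction gap: {gap:.4f}")
```

The enumeration over all coalitions grows exponentially with the number of features, which is why practical tools approximate these values rather than compute them exactly.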

Table 13 presents the combination of the main factors associated with university dropout and the XAI models used to explain them in the studies reviewed. SHAP is the most widely used model, encompassing a diverse range of factors, including GPA, cumulative credits, age, family income, attendance, scholarships, gender, participation, personality, hours of study, and grades on assignments or exams. LIME is used to explain outcomes related to GPA and test scores. Models such as GPI, AM, SAGE, and PEI have focused on academic variables, including GPA, cumulative credits, and homework grades, while addressing other factors in an ad hoc manner. ANFIS has been applied to explain predictions based on age, income, and hours of study, whereas PEM-SNN stands out for its ability to integrate the explanation of multiple key factors, including GPA, accumulated credits, age, income, scholarships, and gender.

4.4. What Simulation Models Exist for the UED?

To date, no simulation studies have been identified for UED, despite its importance in analyzing scenarios for understanding dropout, such as its causes and behaviors.

5. Discussion

The result of this systematic review is a comprehensive catalog that includes factors, prediction models, explanation methods, and simulation methods focused on UED. This catalog provides an overview that contributes to the understanding of higher education attrition through ML and supports the design of strategies to maximize student retention. The quality of the results is confirmed by the fact that 65% of the selected papers are from journals in the top two quartiles.

5.1. About Factors

Five categories of factors have been identified: demographic, socioeconomic, institutional, personal, and academic. The most studied factors in each category are, respectively, gender, student scholarships, university infrastructure, year of entry, and grades. Gender influences academic behavior, career choice, and experiences of discrimination, which are exacerbated by social and cultural norms and can demotivate students and increase dropout rates. Student scholarships offer financial assistance to low-income students, enabling them to continue their studies. The quality of the university’s infrastructure significantly contributes to an enriching and fulfilling educational experience, which is essential for fostering a positive academic environment. The year of entry affects UED due to changes in educational policies and resources, which vary over time and can impact student adjustment and success. In addition, the economic and social context of each entry year matters: recessions, for example, can increase financial hardship and dropout rates. Finally, grades are key indicators of academic difficulties, allowing for timely interventions that can prevent students from dropping out. These elements are a priority for research because they can be assessed objectively, providing a solid basis for formulating effective retention policies and strategies.

The relevance of these factors is reflected in Figure 6, which shows a high concentration of academic and demographic factors (more than 11 studies). In comparison, institutional and personal factors predominate in the low frequency ranges (1–5 studies). This pattern indicates the prioritization of measurable and easily accessible factors, relegating institutional and personal aspects to the background. Thus, the heat map not only summarizes the relative weight of each group of factors but also highlights existing gaps and directs future research towards less-explored dimensions.

Furthermore, understanding the interrelationships between factors related to UED is crucial for designing more effective intervention strategies. In [154], causal inference techniques were used to analyze the impact of academic load in the first year on the risk of dropping out, showing that lower academic load significantly reduces the probability of dropping out, particularly in students with high academic vulnerability, demonstrating the usefulness of causal models to design more targeted interventions. The Qatar University (QU) study employed structural equation modeling (SEM) to identify the factors influencing the perception of the institutional image. The results showed that student services represent the factor with the most significant positive impact, followed by administrative feedback and academic services [155]. Ref. [156] develops a model based on Mixture Structural Equation Models (MSEM) to classify students who continue or drop out of university studies in Latin America. The model incorporates variables such as student health, interpersonal relationships, and class attendance, showing that adaptation to university has a positive impact on academic satisfaction.

It is noted that some factors are more challenging to assess due to their inherent complexity and variability, for example, factors related to psychological aspects. Therefore, studies prioritize more tangible and easily measurable factors, such as academic performance, financial support, and demographic characteristics [157].

5.2. About the Model

A total of 149 ML models were identified for UED. The variety of ML models applied to UED indicates a strong interest in optimizing and improving outcomes. This diversity reflects the continuous effort to better understand the causal factors that lead a student to drop out and underscores the adaptability of ML to meet specific needs in the context of higher education.

Among the most widely used models for predicting university dropout are decision trees (DT), logistic regression (LR), and random forests (RFs), with 49, 34, and 28 studies, respectively, as detailed in Table 10. These algorithms were chosen due to their ability to handle incomplete information, high inter-record variability, and complex relationships between the different groups of factors influencing UED [158]. On the other hand, although deep learning models have shown superior performance in other areas, their use in UED prediction is not always satisfactory. This is because educational datasets are usually small and present high variability in the quality of the records, conditions that negatively affect the training of deep networks [159]. Moreover, due to the high complexity of these models, which include millions of parameters distributed across multiple nonlinear layers, it is difficult to understand how each input variable contributes to the prediction of UED [46].

In terms of preprocessing techniques, the most used is data cleaning, as it ensures the accuracy, completeness, and quality of the information by eliminating anomalies and biases in the dataset. In 26% of the studies, no preprocessing techniques were used, which may be because the models can process categorical variables directly and provide improved decisions without numerical coding.

In addition, preliminary UED research, as in [82], reports an accuracy of 97.6%, but this indicator can be misleading when the data are imbalanced, that is, when the number of students who do not drop out considerably exceeds the number of dropouts. In such situations, accuracy loses relevance as an evaluation measure: a model that classifies all students as non-dropouts would achieve an accuracy of 99.5% if dropouts represent only 0.5% of the data, without correctly identifying a single dropout. Faced with this problem, we recommend more appropriate metrics, such as the F1-score, which is the harmonic mean of precision and recall (sensitivity). It balances the effective identification of the minority class (dropouts) against the reduction of false positives, making it a more reliable metric for evaluating models under data imbalance.
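The arithmetic behind this recommendation can be checked with a minimal sketch. The cohort below is hypothetical (1000 students, 5 dropouts, matching the 0.5% example), and the two sets of predictions are invented for illustration; they do not come from any reviewed study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the dropout class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of students classified correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the dropout class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical cohort: 5 dropouts (label 1) followed by 995 non-dropouts (label 0).
y_true = [1] * 5 + [0] * 995

# Trivial model: predicts "non-dropout" for everyone.
always_retained = [0] * 1000

# Hypothetical informed model: finds 4 of the 5 dropouts, raises 6 false alarms.
informed = [1] * 4 + [0] * 1 + [1] * 6 + [0] * 989

for name, y_pred in [("always_retained", always_retained), ("informed", informed)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(f1_score(y_true, y_pred), 3))
```

In this sketch the trivial model has the higher accuracy (0.995 versus 0.993) but an F1-score of 0, whereas the informed model reaches an F1-score of about 0.53, which is the ordering a dropout-detection task actually cares about.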

5.3. About Explainability

Only 11 studies were found that applied explainability models, eight in total, of which SHAP was the most widely used due to its ability to intuitively and transparently break down the contribution of each feature to the predictions of the ML models. The scarcity of studies on explainability models is due to their limited integration in education, the lack of awareness within the educational community of the importance of explainability in predictive models, and the limited availability of computational resources in educational settings.
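To make the idea of breaking down feature contributions concrete, the sketch below computes exact Shapley values, the quantity that SHAP approximates, for a three-feature toy model. It does not use the SHAP library. The function `risk_score`, its coefficients, the feature names, and both student profiles are hypothetical, and "absent" features are simply replaced by a baseline value, which is one common simplification.

```python
from itertools import combinations
from math import factorial


def risk_score(x):
    """Hypothetical dropout-risk model with one interaction term."""
    return (0.2
            + 0.15 * x["failed_courses"]
            - 0.05 * x["gpa"]
            + 0.1 * x["failed_courses"] * (1 - x["scholarship"]))


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    names = list(instance)
    n = len(names)
    values = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {f: instance[f] if f in coalition else baseline[f]
                           for f in names}
                with_feature = dict(without)
                with_feature[name] = instance[name]
                total += weight * (predict(with_feature) - predict(without))
        values[name] = total
    return values


# Hypothetical at-risk student and a hypothetical low-risk reference profile.
student = {"gpa": 2.0, "failed_courses": 3, "scholarship": 0}
reference = {"gpa": 3.0, "failed_courses": 0, "scholarship": 1}

phi = shapley_values(risk_score, student, reference)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.2f}")
```

The contributions sum to the gap between the student's score and the reference score, so each value can be read as the share of the extra risk attributed to that feature. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations.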

Furthermore, despite the increase in XAI models, most focus on a small number of academic and demographic factors, leaving institutional, personal, and behavioral variables underexplored (see Table 13). This asymmetry highlights an opportunity to expand the use of XAI models to explain less-addressed factors, thereby enriching the comprehensive understanding of university dropout.

5.4. About Simulation

In this study, no research on simulating UED was identified, indicating a lack of focus on simulating scenarios to address dropout. This could be because UED is a complex phenomenon, affected by a variety of socioeconomic, academic, personal, and institutional factors, whose complex interaction makes it difficult to model for practical simulations. Moreover, running such simulations requires access to detailed and sensitive data on student behavior, which is often restricted by privacy or unavailable.

5.5. Factors, Prediction, Explanation, and Simulation

The UED approach articulates four fundamental pillars: factors, prediction, explanation, and simulation, which interact in an integrated manner, as illustrated in Figure 7, to mitigate student dropout. The first component, factor analysis, identifies the variables that most strongly determine dropout risk, considering demographic, academic, socioeconomic, personal, and institutional dimensions. Subsequently, based on the data associated with these factors, the ML models generate individualized predictions of the probability of dropout. When the model predicts an attrition scenario, explainability mechanisms enable the outcome to be decomposed, identifying the underlying causes that contribute to the risk. This explanatory capability provides the institutional manager with an in-depth understanding of the critical factors, facilitating the formulation of intervention scenarios through simulation. At this stage, the specialist adjusts the risk factors in the models, exploring alternatives that lead to a retention scenario. This iterative cycle of factors, prediction, explanation, and simulation can be repeated until a retention scenario is reached. Finally, the analysis of the retention scenario also enables the identification of the degree of influence of each factor, providing precise inputs for the design and implementation of personalized strategies aimed at enhancing student retention.
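The cycle described above can be illustrated with a minimal what-if sketch. Everything in it is an assumption made for illustration: the logistic weights, the bias, the intervention targets, the 0.5 threshold, and the student record are hypothetical, and the `explain` step uses per-factor log-odds differences of a linear model as a simple stand-in for an XAI method such as SHAP.

```python
from math import exp

# Hypothetical logistic-model parameters; a real system would learn these from data.
WEIGHTS = {"failed_courses": 0.9, "absences": 0.08, "has_scholarship": -1.2}
BIAS = -0.5

# Hypothetical values that an intervention could realistically move each factor to.
TARGETS = {"failed_courses": 0, "absences": 2, "has_scholarship": 1}
THRESHOLD = 0.5  # predicted dropout probability at or above this counts as "dropout"


def predict(student):
    """Prediction step: dropout probability from the logistic model."""
    z = BIAS + sum(WEIGHTS[f] * student[f] for f in WEIGHTS)
    return 1 / (1 + exp(-z))


def explain(student):
    """Explanation step: log-odds reduction gained by moving each factor to its target."""
    return {f: WEIGHTS[f] * (student[f] - TARGETS[f]) for f in WEIGHTS}


def simulate(student):
    """Simulation step: adjust the most influential factor until retention is predicted."""
    scenario = dict(student)  # work on a copy; the original record is left unchanged
    plan = []
    while predict(scenario) >= THRESHOLD:
        gains = explain(scenario)
        factor = max(gains, key=gains.get)
        if gains[factor] <= 0:
            break  # no remaining adjustment lowers the predicted risk
        scenario[factor] = TARGETS[factor]
        plan.append(factor)
    return scenario, plan


# Hypothetical student record built from the selected factors.
student = {"failed_courses": 3, "absences": 10, "has_scholarship": 0}
scenario, plan = simulate(student)
print(f"initial risk: {predict(student):.2f}")
print(f"adjusted factors: {plan}")
print(f"risk in retention scenario: {predict(scenario):.2f}")
```

The ordered list of adjusted factors plays the role of the intervention scenario: it indicates which factors, in which order of influence, had to change before the model predicted retention.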

6. Conclusions

A systematic review of the literature on UED prediction using ML was carried out. Of the 786 articles identified, 122 were selected through detailed analysis, 81% of which were published in high-impact journals (Q1 and Q2 quartiles). A total of 520 factors were identified in 98 studies, 149 prediction models in 86 studies, and eight explanatory algorithms in 11 studies, with no articles on simulation. Unlike other studies, this work considered four important aspects: a greater number of factors, prediction models, explainability models, and simulation.

Regarding the factors, five fundamental categories were identified: demographic (75), socioeconomic (80), institutional (23), personal (137), and academic (205). The quantitative analysis, supported by the frequency heat map, reveals that academic and demographic factors are the most frequently studied and recurring topics in the literature, while institutional and personal factors are less frequently addressed and appear mainly in low-frequency ranges. Within each category, the most investigated factors were gender, student scholarships, university infrastructure, year of entry, and grades, respectively. This pattern confirms that academic and demographic factors continue to be the primary predictors of university dropout, aligning with the literature’s preference for objective and readily accessible variables. Regarding the ML models for predicting UED, DT stands out as the most used, followed by LR, RF, and ANN. There is an increasing trend in the use of hybrid models designed to enhance prediction accuracy. Additionally, data cleaning is one of the most widely used preprocessing techniques. In terms of explainability, eleven papers were identified that apply XAI techniques in education, with SHAP and LIME being the most used models. SHAP has been used to explain factors such as GPA, cumulative credits, age, family income, attendance, scholarships, gender, participation, personality, hours of study, and grades in homework or exams, showing its versatility and scope in educational analysis. LIME, although less frequent, has primarily focused on interpreting results associated with GPA and assessments. On the other hand, more recent approaches, such as GPI, AM, SAGE, and PEI, demonstrate a more focused application of academic variables. In the field of simulation, there were no studies to simulate UED using ML.

Our research demonstrates the relevance of incorporating ML models in the university environment to address UED, as they facilitate the analysis of large volumes of academic information, enabling the identification of patterns and the provision of relevant results for decision-making in universities. However, it is essential to ensure data privacy when implementing this type of solution. In addition, the incorporation of XAI will not only improve predictions but also provide clarity and transparency regarding the predictions of ML models.

This study has some important limitations. First, the search period covered 2012 to 2024, and the sources were limited to the Scopus and Web of Science (WoS) databases. While this time range covers more than a decade of scientific production, future research could extend the sources and the time period to obtain an even more comprehensive picture. Second, the search strategy used terms such as "dropout", "university", "machine learning", and "explainability", which may have excluded relevant studies that use other related terms, limiting the retrieval of relevant papers. Finally, the possible influence of publication bias is acknowledged, as studies with positive results or high-performing models are published more frequently than those with null or negative results. Therefore, the patterns identified should be interpreted considering this possible overestimation of performance.

This study opens relevant lines for future research. In terms of factors, it will be necessary to deepen holistic approaches that integrate multimodal data, allowing the incorporation of emerging factors associated with digital behavior, emotions, and social dynamics. Likewise, artificial intelligence offers new opportunities to identify variables inherent to personalized, adaptive contexts mediated by intelligent tutors. In prediction, the development of hybrid models and federated learning will facilitate the construction of more robust, scalable, and privacy-friendly systems. Explainability will need to shift towards causal, counterfactual, and interactive approaches, capable of providing interpretations that are understandable, actionable, and tailored to the user. In the realm of simulation, agent-based systems can be implemented to determine scenarios where the risk factors of dropout allow a student to change their future status from dropout to retention.

Finally, the convergence of these four components (factors, prediction, explainability, and simulation) into unified intelligent systems will allow institutions to anticipate, understand, and mitigate the risk of attrition through personalized and informed decisions. It is worth noting that this type of integration has already shown effective results in sensitive domains such as pediatric congenital cardiac surgery [160], where the combination of prediction, explainability, and simulation significantly improved clinical decision-making.

Author Contributions

Conceptualization, D.M. and M.Q.-M.; methodology, D.M.; formal analysis, D.M. and M.Q.-M.; investigation, M.Q.-M. and J.P.-A.; writing—original draft preparation, M.Q.-M.; writing—review and editing, M.Q.-M.; supervision, D.M., R.D., and N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Universidad Nacional Mayor de San Marcos—RR N° 004305-R-24 and project number C24200721.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

KPCA	Kernel Principal Component Analysis
PCA	Principal Component Analysis
LPP	Locality Preserving Projection
NPE	Neighborhood Preserving Embedding
IsoP	Isometric Projection
WCT-T	Weighted Connected Triple Transformation
WTQ-T	Weighted Triple Quality Transformation
DT	Decision Tree
LR	Logistic Regression
DT-ID3	Iterative Dichotomiser 3
SVM	Support Vector Machine
RF	Random Forest
HDBSCAN	Hierarchical Density-Based Spatial Clustering of Applications with Noise
DBSCAN	Density-Based Spatial Clustering of Applications with Noise
GMERF	Generalized Mixed-Effects Random Forest
GB	Gradient Boosting
GBM	Gradient Boosting Machine
XGBoost	Extreme Gradient Boosting
CNNs	Convolutional Neural Networks
DP-CNN	Convolutional Neural Network with Dynamic Pooling
NB	Naive Bayes
CLSA	Sentiment Analysis Model at Concept Level
KNN	K-Nearest Neighbors
ANNs	Artificial Neural Networks
AHP	Analytic Hierarchy Process
LSTM	Long Short-term Memory
GLM	Generalized Linear Model
SGD	Stochastic Gradient Descent
MLP	Multilayer Perceptron
AdaBoost	Adaptive Boosting
BAG	Bootstrap Aggregated Decision Trees
SVMSMOTE	Support Vector Machines—Synthetic Minority Over-Sampling Technique
SMOTE	Synthetic Minority Over-Sampling Technique
MMFA	Modified Mutated Firefly Algorithm
GBNs	Gaussian Bayesian Networks
FFNN	Feed Forward Neural Network
BNs	Bayesian Networks
RBF	Radial Basis Function
DT-CHAID	Decision Tree-Chi-Square Automatic Interaction Detector
LLM	Logit Leaf Model
LMT	Logistic Model Tree
LightGBM	Light Gradient Boosting Machine
SEDM	Student Educational Data Mining
PESFAM	Probabilistic Ensemble Simplified Fuzzy ARTMAP
FNN	Feed Forward Neural Network
BRF	Balanced Random Forest
EE	Easy Ensemble
RB	RUSBoost
CART	Classification and Regression Trees
CTM	Classification Tree Model
SMOTE-NC	Synthetic Minority Over-Sampling Technique for Nominal and Categorical Data
ETC	Extra Trees Classifier
CIT	Conditional Inference Tree
Bagged CART	Classification and Regression Tree Bagging
FTT	Feature Tokenizer Transformer
GMM	Gaussian Mixture Model
GBT	Gradient Boosted Trees
NNs	Neural Networks
LDA	Linear Discriminant Analysis
PR	Polynomial Regression
PEM-SNN	Piecewise Exponential Model with Structural Neural Network
ARD	Automatic Relevance Determination
LASSO	Least Absolute Shrinkage and Selection Operator
BR	Bayesian Ridge
LIRE	Linear Regression
RR	Ridge Regression
DR	Dummy Regressor
IF	Isolation Forest
DC	Data Cleaning
DTA	Data Transformation
FS	Feature Selection
SV	Standardization of Variables
VC	Variable Coding
EV	Elimination of Variables
TV	Transformation of Variables
DR	Dimensionality Reduction
CC	Categorical Coding
DS	Data Selection

Appendix A

Table A1. Demographic factors influencing the UED.

Id	Factor	References	Id	Factor	References
F001	Gender	[7,14,39,50,51,53,54,58,59,60,61,63,64,65,66,67,68,69,71,72,74,77,78,82,83,86,88,89,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110]	F039	Travel time to university	[93]
F002	Age	[7,15,51,54,59,64,65,66,67,68,69,71,72,74,77,78,80,83,84,86,89,91,94,97,98,99,100,101,105,108,109,111,112,113,114,115,116,117,118,119]	F040	University of origin	[116]
F003	Marital status	[51,59,61,68,69,71,76,80,86,91,93,97,102,106,108,114,116,118]	F041	Citizenship status	[52]
F004	Sex	[21,55,78,80,84,86,90,91,111,112,114,115,118,120,121,122,123,124]	F042	Geographical displacement	[130]
F005	Place of residence	[7,14,39,51,64,65,66,68,74,86,93,106,108,114,121,125]	F043	First-generation student	[95]
F006	Place of origin	[39,51,53,54,71,78,89,90,91,93]	F044	City	[110]
F007	Nationality	[7,21,53,55,58,61,64,65,76,93,101,104,115,126]	F045	Multidimensional Poverty Index of the school	[103]
F008	Parents’ educational level	[59,70,72,93,94,95,97,99,112,117,118]	F046	Zonal Address	[91]
F009	Type of school	[14,64,65,66,75,78,86,89,91,95,103,109]	F047	Pre-university preparation	[91]
F010	Ethnicity	[72,74,77,83,84,86,97,98,112,113,126]	F048	Housing tenure	[91]
F011	Date of birth	[53,58,65,87,120,121,127]	F049	Type of construction	[91]
F012	Registration age	[14,55,61,73,93,106]	F050	Water	[91]
F013	School	[39,58,72,78,108,129]	F051	Drain	[91]
F014	Foreign	[55,108,109,117,122]	F052	Phone	[91]
F015	Province of origin	[91,96,106,118]	F053	Colour TV	[91]
F016	Number of siblings	[114,116,129]	F054	Radio	[91]
F017	Type of housing	[72,101,114]	F055	Sound equipment	[91]
F018	Zip code	[51,88,113]	F056	Iron	[91]
F019	Country of origin	[99,101,108]	F057	Cellular	[91]
F020	Region	[39,117]	F058	Laptop	[91]
F021	Type of university	[39,70]	F059	Refrigerator	[91]
F022	Higher education centre	[101,122]	F060	Personal library	[91]
F023	Change of residence	[114,130]	F061	Wardrobe	[91]
F024	Foreigner	[78,93]	F062	Wire	[91]
F025	Displaced student	[55,61]	F063	Home environments	[91]
F026	Number of children	[80,85]	F064	Number of floors	[91]
F027	Computer	[77,91]	F065	Number of bedrooms	[91]
F028	Generation	[119]	F066	Number of kitchens	[91]
F029	Migration status	[72]	F067	Number of bathrooms	[91]
F030	Country of origin risk	[73]	F068	Number of rooms	[91]
F031	Proximity to the university	[61]	F069	Number of dining rooms	[91]
F032	Cohabitation status with parents	[61]	F070	Orphanhood	[91]
F033	Generation	[119]	F071	Lives alone	[91]
F034	Family size	[93]	F072	Breadwinner	[91]
F035	Family you live with	[116]	F073	Marital Status of Father	[74]
F036	Community support	[61]	F074	Marital Status of Mother	[74]
F037	Live in the country or the city	[106]	F075	Area of residence	[77]
F038	Family municipality	[105]

Table A2. Socioeconomic factors influencing UED.

Id	Factor	References	Id	Factor	References
F076	Scholarships	[55,61,72,75,83,93,94,99,104,106,112,120,127]	F116	Total amount of scholarships	[88]
F077	Works	[54,60,69,71,80,91,94,108,114,125,128]	F117	Suspensions	[122]
F078	Family income	[59,80,86,95,97,98,102,125,129]	F118	Single with dependents	[131]
F079	Income level	[14,58,80,89,96,108,118]	F119	Family financial support	[72]
F080	Admission score	[14,39,78,95,98,121,123]	F120	Health insurance	[104]
F081	Disability	[69,72,77,88,100,105,114]	F121	Type of insurance	[116]
F082	Educational level	[54,101,105,108,117,122]	F122	Medical insurance company	[54]
F083	Socioeconomic status	[64,77,83,95,124]	F123	Financial commitment of the firstborn son to his family	[85]
F084	Father’s profession	[15,76,93,129]	F124	Student perspective on their integration into the labor market	[85]
F085	Financial aid	[68,80,83,100,112]	F125	Economic problems	[130]
F086	Mother’s rating	[55,61,102]	F126	Lack of family support	[130]
F087	Father’s rating	[55,61,102]	F127	Mother’s highest educational level	[126]
F088	Father’s employment status	[15,55,102]	F128	Father’s highest educational level	[126]
F089	Mother’s occupation	[15,93,102]	F129	Employment status	[95]
F090	Internet access	[77,91,97,114,129]	F130	Housing situation	[95]
F091	Financial situation	[80,121,123]	F131	Monthly tuition payment	[103]
F092	Type of scholarship	[68,79,99]	F132	Social stratification of the school	[103]
F093	Total income	[66,77,91]	F133	Household composition	[91]
F094	Study financing	[80,81]	F134	Family Burden	[91]
F095	Working conditions	[59,72]	F135	Children in Higher Education	[91]
F096	Type of transport	[102,114]	F136	Economic dependence	[91]
F097	Current occupation	[113,118]	F137	Head of household	[91]
F098	School tuition cost	[78,117]	F138	Economic income modality	[91]
F099	Debtor	[55,61]	F139	Access to technology	[65]
F100	Registrations up to date	[55,61]	F140	Latitude	[74]
F101	Dependence on parents	[71,114]	F141	Longitude	[74]
F102	Source of income	[77,96]	F142	Social class	[74]
F103	Mother’s profession	[76,93]	F143	Brothers at school	[74]
F104	Economic indicator	[112]	F144	Type of license plate	[75]
F105	Professional status	[101]	F145	Type of income	[75]
F106	Political status	[106]	F146	Unemployment rate	[76]
F107	Works part-time	[133]	F147	Tuition payment up to date	[76]
F108	Student loan	[132]	F148	Economic situation	[77]
F109	Money for food	[116]	F149	Number of people in the household	[77]
F110	Study books	[116]	F150	Eligibility	[112]
F111	Scholarship percentage	[78]	F151	Academic integration	[83]
F112	Parents’ main field	[59]	F152	Subsidized loan	[83]
F113	Student’s profession	[101]	F153	Unsubsidized loan	[83]
F114	Percentage of loans	[78]	F154	Work-study	[83]
F115	Total percentage of aid	[78]	F155	Aid for merit	[83]

Table A3. Institutional factors influencing UED.

Id	Factor	References	Id	Factor	References
F156	Infrastructure	[80,81,84]	F168	Counselor’s perception of the counselor’s own expectations	[126]
F157	Educational services	[66,83]	F169	Counselor’s perception of the director’s expectations	[126]
F158	Suitable equipment	[80,81]	F170	Institutional integrity	[84]
F159	Place	[78,89]	F171	Social infrastructure	[84]
F160	Institution size	[83,84]	F172	Social aspects induction program	[84]
F161	Area	[60]	F173	Institutional control	[83]
F162	Geographical area	[78]	F174	Percentage of minorities	[83]
F163	Teacher’s commitment to the student	[85]	F175	Part-time teachers	[83]
F164	Classification of the career or institution	[85]	F176	Full-time teachers	[83]
F165	Group class	[82]	F177	Instruction expenditure	[83]
F166	School climate assessment scale	[126]	F178	Academic support expenses	[83]
F167	Counselor’s perception of teachers’ expectations	[126]

Table A4. Personal factors influencing UED.

Id	Factor	References	Id	Factor	References
F179	Year of admission	[58,64,74,75,87,90,94,95,120,127]	F248	Mobile phone addiction	[121]
F180	Motivation	[80,81,82,97]	F249	Gaming Addiction	[121]
F181	Extracurricular activities	[72,84,97]	F250	Video game addiction	[121]
F182	Commitment	[99,124,130]	F251	Shopping addiction	[121]
F183	Class participation	[84,131,132]	F252	Smoking	[82]
F184	Number of voluntary activities	[7,58,88]	F253	Student’s sense of school belonging	[126]
F185	Future time perspective	[55,130]	F254	Perception of social support	[95]
F186	Time to study	[60,80]	F255	Use of learning platform	[95]
F187	Adaptation and coexistence	[55,133]	F256	Frequency of library use	[95]
F188	Leader or president	[60,93]	F257	Participation in tutoring	[95]
F189	Addictions/vices	[85,116]	F258	Participation in mentoring	[95]
F190	Social media addiction	[119,142]	F259	Access to academic support services	[95]
F191	Club participation	[7,58]	F260	Short-term objective	[110]
F192	Stress level	[80,97]	F261	Weekly minutes on the platform	[110]
F193	Level of motivation	[64,116]	F262	Active days	[110]
F194	Participation in study groups	[84,97]	F263	Total progress	[110]
F195	Landline	[116]	F264	Device used	[110]
F196	Cellular phone	[116]	F265	Login frequency	[110]
F197	Second language	[116]	F266	Average session duration	[110]
F198	Masked student	[97]	F267	Number of sessions per week	[110]
F199	External social relations	[80]	F268	Number of accesses in the last month	[110]
F200	Desire for knowledge	[127]	F269	Hours in the last month	[110]
F201	Frankness	[125]	F270	Hours in the last 3 months	[110]
F202	Extraversion	[125]	F271	Hours in the last 6 months	[110]
F203	Neuroticism	[125]	F272	Hours in the last year	[110]
F204	Conscientiousness	[125]	F273	Interaction with tutors	[110]
F205	Emotional commitment	[125]	F274	Participation in forums	[110]
F206	Calculating commitment	[125]	F275	Interaction with multimedia resources	[110]
F207	Regulatory commitment	[125]	F276	Average time per resource	[110]
F208	Professional interests	[124]	F277	Completed activities	[110]
F209	Conference programs	[51]	F278	Number of evaluations submitted	[110]
F210	Family problems	[142]	F279	Evaluation results	[110]
F211	Mental health	[107]	F280	Response time in activities	[110]
F212	Health problem	[107]	F281	Number of clicks per session	[110]
F213	Study habits	[143]	F282	Participation in chats	[110]
F214	Depression	[143]	F283	Preferred content type	[110]
F215	Anxiety	[143]	F284	Navigation route	[110]
F216	License history	[51]	F285	Number of messages received	[110]
F217	Communications level	[113]	F286	Number of messages sent	[110]
F218	First generation to study	[119]	F287	Activity during non-business hours	[110]
F219	Extracurricular activity scores	[7]	F288	Level of self-efficacy	[65]
F220	Participation in first-year camp activities	[7]	F289	Learning strategy	[65]
F221	Frequency of computer use	[86]	F290	Rh factor	[74]
F222	Learning approach	[133]	F291	Neuroticism	[141]
F223	I wanted practical work	[63]	F292	Extraversion	[141]
F224	Disease	[63]	F293	Kindness	[141]
F225	Pregnancy	[63]	F294	Responsibility	[141]
F226	Incompatibility between career and childcare	[63]	F295	Openness to experience	[141]
F227	Self-assessment	[61]	F296	Social integration	[83,84]
F228	Time spent exercising	[118]	F297	Perception of learning	[84]
F229	Vocational training	[125]	F298	Experiences of exam disappointment	[84]
F230	Number of friends	[93]	F299	Support and guidance	[84]
F231	Kindness	[125]	F300	Quality of teaching	[84]
F232	Leisure	[61]	F301	Alignment in teaching	[84]
F233	Study hours	[93]	F302	Clarity in instruction	[84]
F234	Planned and unplanned pregnancy	[85]	F303	Feedback active learning	[84]
F235	Bullying	[85]	F304	Higher-order thinking	[84]
F236	Sexism	[85]	F305	Cooperative learning	[84]
F237	Student adaptation to university learning	[85]	F306	Introductory courses	[84]
F238	Poor interpersonal relationships with peers	[130]	F307	Student research programs	[84]
F239	Lack of study habits and techniques	[130]	F308	Perception of difficulty	[84]
F240	Demotivation	[130]	F309	Coherence between courses in the curriculum	[84]
F241	Feeling of not belonging	[130]	F310	Educational aspiration	[83]
F242	Health problems	[130]	F311	Language	[68]
F243	Internet addiction	[135]	F312	Video platform	[77]
F244	Technology addiction	[135]	F313	Physical books	[77]
F245	Alcohol addiction	[121]	F314	Reading time	[77]
F246	Addiction to emotional dependence	[121]	F315	Internet browsing time	[77]
F247	Drug addiction	[121]

Table A5. Academic factors influencing UED.

Id	Factor	References	Id	Factor	References
F316	Ratings	[39,54,65,74,76,78,82,87,90,106,111,116,117,120,126,127]	F420	Temporary withdrawal	[132]
F317	General GPA	[7,59,64,67,69,75,83,84,89,90,97,113,120,123,126,129]	F421	Order of option to apply	[101]
F318	Secondary note	[39,72,83,86,97,103,109,111,121,129]	F422	Access order	[101]
F319	Subjects taken	[15,72,73,78,94,98,108,118]	F423	Weighted historical average	[116]
F320	Credits taken	[71,89,100,101,112,120,121]	F424	Lower test results	[106]
F321	Attendance	[55,59,61,64,72,73,76,120]	F425	Years of study at the University	[116]
F322	Type of admission	[39,58,78,90,95,107]	F426	Belongs to the institute’s school	[119]
F323	Type of school	[66,86,89,99,102,103,108]	F427	Specialty	[132]
F324	School	[39,55,70,78,89,91,117]	F428	Student status	[60]
F325	Academic year	[39,78,93,100,134]	F429	Course evaluation comments	[59]
F326	Number of failed courses	[67,72,78,97,101,111]	F430	Average evaluations first semester	[117]
F327	Subjects	[87,107,121,122,143]	F431	First period average	[78]
F328	Entrance examination	[39,53,70,117,127]	F432	Attendance status	[131]
F329	Average grades throughout the career	[39,78,86,88,120]	F433	Dropping out during the semester	[114]
F330	Academic cycle	[78,87,91,111,112]	F434	Admission date	[123]
F331	Course number	[51,71,108,111]	F435	Rewards and penalties	[88]
F332	Active semester	[7,69,90,118,121]	F436	Group study	[118]
F333	Average subjects	[39,111,118,129]	F437	High school completion status	[131]
F334	Years of graduation	[15,58,71,134]	F438	Title to obtain	[123]
F335	Admission score	[55,72,78,115]	F439	Enrolled semester number	[54]
F336	Academic department	[54,85,90,106]	F440	Number of programs enrolled	[54]
F337	Course	[61,76,88,137]	F441	Graduate	[140]
F338	Subject	[86,104,112,140]	F442	Type of graduation	[64]
F339	Registered	[78,112,120]	F443	Good high school graduation	[127]
F340	Tasks	[59,60,137]	F444	Type of associate degree	[114]
F341	Type of institution	[39,87]	F445	First-generation student	[114]
F342	Student status	[90,112,121]	F446	Number of internships	[52]
F343	Repeater	[100,121,137]	F447	First registration	[100]
F344	Access note	[93,94,130]	F448	Persistence	[100]
F345	Number of courses approved	[100,101,111]	F449	Home Language	[100]
F346	Total credits	[59,88,111]	F450	Previous year’s activity	[100]
F347	Absence	[60,88,120]	F451	Final decision	[100]
F348	Academic field	[51,79,123]	F452	Specialty access	[105]
F349	Evidence	[117,134,137]	F453	Follow the path	[105]
F350	Faculty	[65,69,120]	F454	Access description	[129]
F351	Failed subjects	[64,80,118]	F455	Leveling	[113]
F352	Approved credits	[69,88,114]	F456	Quality of online teaching activities	[127]
F353	Number of semesters completed	[68,97,106]	F457	Limited knowledge of using specialized software	[85]
F354	Average of previous semesters	[79,110,117]	F458	Academic problems	[130]
F355	Career	[65,77,98]	F459	Level of previous studies	[82]
F356	Repeating course number	[51,80]	F460	Student grade point average in ninth grade	[126]
F357	Cluster	[78,116]	F461	Hours dedicated to tasks	[126]
F358	Failed exam	[107,109]	F462	Number of course withdrawals	[95]
F359	Subjects passed	[73,159]	F463	Number of disciplinary actions	[95]
F360	Credit ratio per subject	[39,87]	F464	Last school level achieved	[110]
F361	Ratio of credits to expected credits	[39,78]	F465	Income cohort	[128]
F362	Study day	[65,118]	F466	Average grades per subject	[128]
F363	Admission program	[107,117]	F467	GPA per semester	[103]
F364	Student code	[93,114]	F468	Other programs taken	[103]
F365	Credits	[14,21]	F469	Reading comprehension score	[103]
F366	Average national exam score	[55,101]	F470	Score in logical reasoning	[103]
F367	Attempts to pass the exam	[14,21]	F471	Academic admission program	[103]
F368	Entrance qualification grade	[53,99]	F472	Performance test	[111]
F369	Grade points	[88,134]	F473	Academic Department	[111]
F370	Degree exam	[53,59]	F474	Plan hours	[111]
F371	Type of study program	[53,64]	F475	Hours recorded in the last semester	[111]
F372	Full-time status	[109,112]	F476	First year average	[111]
F373	Exam	[62,134]	F477	Program duration	[111]
F374	Level enrolled	[94,95]	F478	Title name	[89]
F375	Career application option range	[75,96]	F479	Additional learning requirements	[89]
F376	Average secondary grades	[89,116]	F480	Number of honors obtained	[89]
F377	Exam grades	[92,131]	F481	Admission method	[91]
F378	Abandoned materials	[87]	F482	Type of activity	[146]
F379	Period	[81]	F483	Type of action	[146]
F380	Project rating	[140]	F484	Access frequency by day of the week	[146]
F381	Average attempts	[117]	F485	Frequency per week and month of the semester	[146]
F382	Subject code	[87]	F486	Access time	[146]
F383	Initial test note	[144]	F487	Amount and type of interaction with materials	[146]
F384	Entrance exam date	[54]	F488	Participation in evaluations	[146]
F385	Lower consolidated result	[106]	F489	Number of subjects taken	[65]
F386	Number of national exams taken	[56]	F490	Enrollment method	[65]
F387	Delay	[122]	F491	Number of times registered	[65]
F388	Type of entry qualification	[54]	F492	Entry level	[74]
F389	Degree of study	[90]	F493	Current grade	[74]
F390	First level degree	[105]	F494	Cumulative average	[75]
F391	Anonymity of the university	[63]	F495	Level of previous education	[68]
F392	Place institution	[81]	F496	Syllabus	[68]
F393	Type of student	[114]	F497	Beginning of the semester	[68]
F394	Type of study (full-part time)	[104]	F498	Accumulated credits	[68]
F395	Reason for admission	[105]	F499	Days in exchange programs	[68]
F396	Admission category	[93]	F500	Moodle Activity Count	[68]
F397	Computer knowledge	[86]	F501	Activity trend in Moodle	[68]
F398	Disciplinary infraction	[122]	F502	Course access	[92]
F399	Admission form	[99]	F503	Test results	[92]
F400	Risk via admission	[73]	F504	Tasks submitted	[92]
F401	Binary license plate	[112]	F505	Final course grade	[92]
F402	Admission option	[105]	F506	Practice grades	[134]
F403	Average score on entrance exams	[101]	F507	Project ratings	[134]
F404	Registration value	[120]	F508	Reading comprehension	[134]
F405	Course of study	[54]	F509	Cumulative GPA	[134]
F406	Times failed degree	[99]	F510	Credits earned	[134]
F407	First-choice studies	[133]	F511	Time enrolled in university	[134]
F408	Tutorials carried out	[113]	F512	Access outside of class	[141]
F409	Antique	[100]	F513	Curricular units enrolled	[76]
F410	Previous qualification	[62]	F514	Approved curricular units	[76]
F411	Average last cycle	[116]	F515	Accredited curricular units	[76]
F412	Mode	[113]	F516	Training chain	[77]
F413	Military service	[122]	F517	Current semester average	[70]
F414	Drop subject	[132]	F518	Average of subjects passed	[70]
F415	Average score	[82]	F519	Absences	[70]
F416	Failed subjects in secondary school	[79]	F520	Field of study	[83]
F417	Repeating a secondary school year	[79]
F418	Repeating the first academic year	[94]
F419	Temporary withdrawal	[132]

Table A6. Advances in ML for UED prediction.

Studies	Dataset	Preprocessing	Model	Result (%)
[64]	21,654	Random subsampling (RUS)	DT	96.20 ¹
			LR	96.60 ¹
			SVM	97.70 ¹
			ANN	95.50 ¹
			LR + SMOTE	83.20 ¹
			DT + SMOTE	92.50 ¹
			ANN + SMOTE	88.10 ¹
			SVM + SMOTE	95.40 ¹
			LR + over-sampling	85.50 ¹
			DT + over-sampling	79.30 ¹
			SVM + over-sampling	86.90 ¹
			ANN + over-sampling	85.50 ¹
			LR + under-sampling	86.00 ¹
			DT + under-sampling	86.70 ¹
			SVM + under-sampling	87.90 ¹
			ANN + under-sampling	84.70 ¹
[118]	670	Data cleaning	JRip	96.00 ¹
			NNge	95.80 ¹
			OneR	93.70 ¹
			Prism	94.40 ¹
			Ridor	93.40 ¹
			ADTree	96.60 ¹
			DT-J48	94.30 ¹
			RandomTree	94.00 ¹
			REPTree	92.70 ¹
			SimpleCart	96.60 ¹
			ICRM v1	92.10 ¹
			ICRM v2	93.70 ¹
			ICRM v3	93.40 ¹
[82]	670	Data cleansing, discretization of variables, creation of attributes	ADTree	98.20 ¹
			J48	96.70 ¹
			RandomTree	96.10 ¹
			REPTree	96.50 ¹
			SimpleCart	96.40 ¹
			Prism	99.80 ¹
			Ridor	97.90 ¹
			ICRM v1	92.10 ¹
			ICRM v2	94.40 ¹
			ICRM v3	94.00 ¹
[39]	5951	Data cleansing, dimensionality reduction, data balancing, data transformation	Random model	51.00 ¹
			KNN	62.00 ¹
			SVM	65.00 ¹
			DT	68.00 ¹
			RF	69.00 ¹
			GB	69.00 ¹
			NB	66.00 ¹
			LR	62.00 ¹
			ANN	66.00 ¹
[57]	10,554	Not performed	DT	72.80 ¹
			LR	84.50 ¹
			SVM	82.80 ¹
			RF	82.60 ¹
			ANN	77.80 ¹
			GB	83.70 ¹
			LLM	83.90 ¹
			LMT	80.10 ¹
			BAG	78.00 ¹
[97]	13,696	Data cleansing	DT	86.60 ¹
			LR	88.90 ¹
			SVM	89.40 ¹
			KNN	87.60 ¹
			RF	90.02 ¹
			MLP	89.20 ¹
			CNN	94.60 ¹
			GBN	85.40 ¹
[147]	79,186	Data cleaning, normalization, time series, matrix specifications	LR	85.70 ¹
			SVM	80.10 ¹
			CNN	86.40 ¹
			LSTM	80.10 ¹
			CNN-LSTM	84.80 ¹
			DP-CNN	84.20 ¹
			CLSA	87.40 ¹
[59]	7536	Not performed	DT	84.70 ¹
			LR	76.60 ¹
			SVM	76.60 ¹
			RF	82.90 ¹
			MLP	89.60 ¹
			MSNF	87.70 ¹
			STUD	90.10 ¹
[51]	425	Not performed	DT	97.92 ¹
			LR	99.47 ¹
			KNN	82.10 ¹
			RF	99.47 ¹
			NB	96.79 ¹
			GB	98.68 ¹
[120]	261	Feature selection, data cleaning	DT	94.00 ¹
			LR	96.00 ¹
			SVM	94.00 ¹
			RF	94.00 ¹
			NB	94.00 ¹
			ANN	97.00 ¹
[122]	60,010	Data cleansing	LightGBM	81.00 ¹
			XGBoost	83.00 ¹
			LR	50.00 ¹
			SVM	51.00 ¹
			RF	80.00 ¹
			DT	65.00 ¹
[136]	3029	Undersampling, SMOTE-Tomek	LR	89.00 ¹
			SGD	86.00 ¹
			DT	98.00 ¹
			MLP	79.00 ¹
			RF	99.00 ¹
			SVM	72.00 ¹
[132]	1650	Data transformation	DT	80.00 ¹
			LR	87.59 ¹
			SVM	85.55 ¹
			RF	88.33 ¹
			NB	77.14 ¹
			MLP	83.92 ¹
[139]	32,593	Data extraction, data cleaning, data scaling	DT	78.00 ¹
			LR	80.00 ¹
			SVM	79.00 ¹
			RF	79.00 ¹
			SELOR	84.00 ¹
			SIHMM	83.00 ¹
[148]	104	Data cleansing	SVM	36.73 ¹
			PESFAM	43.24 ¹
			FFNN	68.97 ¹
			SEDM	85.71 ¹
			LR	98.95 ¹
[72]	26	Data cleaning	DT	78.00 ¹
			SVM	80.00 ¹
			KNN	73.00 ¹
			RF	92.00 ¹
			ANN	90.00 ¹
[100]	4419	Data cleaning	DT	88.46 ¹
			SVM	86.92 ¹
			KNN	83.85 ¹
			RF	92.31 ¹
			NB	79.23 ¹
[140]	261	Data cleaning	RF	91.76 ¹
			GB	86.76 ¹
			XGBoost	91.76 ¹
			FNN + RF + GB + XGBoost	93.59 ¹
			FNN	96.76 ¹
[56]	4433	Extract, transform, load (ETL)	SMOTE + RF	87.00 ¹
			SVMSMOTE + RF	87.00 ¹
			BRF	82.80 ¹
			EE	83.20 ¹
			RB	81.30 ¹
[93]	131	Data cleaning	LR	73.20 ¹
			SVM	70.99 ¹
			RF	92.30 ¹
			NB	75.50 ¹
			MLP	92.30 ¹
			DT-J48	74.00 ¹
[105]	3425	Feature selection	DT	82.05 ¹
			LR	83.37 ¹
			SVM	82.90 ¹
			KNN	85.59 ¹
			ANN	85.11 ¹
[90]	811	KPCA, PCA, LPP, NPE, IsoP, WCT-T, and WTQ-T	KNN	93.30 ¹
			ANN	94.00 ¹
			DT-C4.5	92.60 ¹
			NB	93.80 ¹
[15]	331	Data cleaning	CatBoost	84.00 ¹
			RF	81.00 ¹
			XGBoost	82.00 ¹
			ANN	87.00 ¹
[78]	143,326	Data cleaning	DT	99.53 ¹
			LR	99.58 ¹
			ANN	97.72 ¹
			XGBoost	99.28 ¹
[54]	N/A	Not performed	LR	93.76 ¹
			Random Forest (Bagging)	93.58 ¹
			AdaBoost	95.51 ¹
			ANN	94.76 ¹
[7]	N/A	Data cleaning	DT	91.00 ¹
			LR	87.00 ¹
			NB	55.00 ¹
			MLP	90.00 ¹
[106]	77,384	Data cleaning	DT	94.63 ¹
			ANN	93.97 ¹
			BN	93.92 ¹
[98]	11,496	Not performed	RF	90.10 ¹
			ANN	89.30 ¹
			Logit	91.20 ¹
[55]	12,370	Not performed	LR + SMOTE_SVM	72.00 ¹
			RF	78.00 ¹
			ANN + SMOTE_SVM	74.00 ¹
[143]	670	Data cleaning	K-means	80.01 ¹
			HDBSCAN	65.63 ¹
			DBSCAN	95.71 ¹
[101]	3373	Not performed	SVM	76.39 ¹
			RF	80.40 ¹
			ANN	77.95 ¹
[61]	128	Data transformation	DT	84.00 ¹
			LR	82.00 ¹
			NB	84.00 ¹
			ANN	82.00 ¹
[142]	220	Not performed	DT	97.69 ¹
[113]	1861	Data cleaning, data transformation	DT-J48	91.80 ¹
			ANN	94.60 ¹
			DT + ANN	98.70 ¹
[86]	2422	Not performed	LR	76.03 ¹
[86]	2422	Not performed	AHP	64.57 ¹
[129]	5426	Data cleaning	SVM	89.041 ¹
			RF	88.312 ¹
			GB	87.103 ¹
[14]	46,000	Not performed	GMERF	93.58 ¹
			CART	87.01 ¹
			GLM	91.05 ¹
[161]	530	Data cleaning	RF	56.67 ¹
			XGBoost	70.00 ¹
			RF + XGBoost	91.52 ¹
[109]	970	Data cleaning	ANN	62.00 ¹
			DT-C4.5	65.00 ¹
			DT-ID3	62.00 ¹
[81]	160	Variable generation, data selection, data cleaning	ANN	100.00 ¹
			DT-C4.5	87.77 ¹
			DT-ID3	70.79 ¹
[87]	976	Data selection, data cleaning, generation data integration, formatting	ANN	85.00 ¹
			DT-C4.5	68.00 ¹
			DT-ID3	75.00 ¹
[94]	1022	Not performed	LR	80.00 ¹
[94]	1022	Not performed	Discriminant analysis	91.50 ¹
[88]	67,060	SMOTE, RandomOverSampler, SMOTETOMEK, SMOTEENN	LR	95.30 ¹
			ANN	98.20 ¹
			GB	98.00 ¹
			GB + RF + SVM	97.80 ¹
			XGBoost + Catboost	98.90 ¹
[137]	1862	MMFA	NB	92.85 ¹
[137]	1862	MMFA	DT	95.82 ¹
[60]	17,432	Data cleaning, data transformation, SMOTE	KNN	98.20 ¹
			CART	97.91 ¹
			NB	98.24 ¹
[85]	2670	Data cleaning	MLP	98.60 ¹
[85]	2670	Data cleaning	RBF	98.10 ¹
[138]	201	Data cleaning	DT-ID3	92.90 ¹
[138]	201	Data cleaning	DT-J48	92.90 ¹
[135]	1178	Data cleaning, data transformation, attribute selection	LR	84.90 ¹
[135]	1178	Data cleaning, data transformation, attribute selection	DT	91.70 ¹
[115]	83	Not performed	LR	89.00 ¹
[127]	176	Not performed	LR	95.80 ¹
[96]	189	Not performed	Cluster analysis	83.30 ¹
[116]	6300	SMOTE	DT	95.91 ¹
[123]	1851	Not performed	DT	87.90 ¹
[21]	41,098	Not performed	GMERF	90.80 ¹
[79]	237	Not performed	GBM	92.20 ¹
[108]	12,148	Data cleaning, variable coding	DT	71.40 ¹
[131]	24,770	Not performed	XGBoost	80.32 ¹
[144]	197	Not performed	DT	79.90 ¹
[145]	389	Data cleaning	DT	89.39 ¹
[80]	237	Not performed	DT	87.76 ¹
[102]	32,593	Lasso and ridge	LR	86.90 ¹
[130]	3773	Normalization	LR	95.00 ¹
[151]	3172	Data cleaning	RF	87.67 ¹
[121]	3162	Data cleaning, data transformation, variable extraction	DT-CHAID	98.71 ¹
[21]	24,736	Data division, variable selection	GMERF	90.85 ¹
[126]		SMOTE-NC	DT	84.29 ¹
[95]	1500	Data cleaning, removing features	DeepS3VM (RNN + S3VM)	92.54 ¹
[110]	35,000	SMOTE	XGBoost	82.00 ¹
			LightGBM	79.80 ¹
			DT	79.80 ¹
			RF	81.50 ¹
			ETC	79.00 ¹
			LR	77.00 ¹
			SVM	61.00 ¹
[128]	197	Data cleaning, categorical coding, normalization, feature selection	RF	100.00 ¹
[111]	5883	Not performed	CART	79.70 ¹
			CIT	81.90 ¹
			SVM	83.00 ¹
			GLM	82.30 ¹
			ANN	81.40 ¹
			NB	70.30 ¹
			Bagged CART	81.40 ¹
			Random Forest	83.10 ¹
			ADABOOST	80.90 ¹
			XGBoost	82.30 ¹
[89]	44,875	Variable coding, standardization of variables, class imbalance, data separation	RF	85.00 ¹
[89]	44,875		FTT	87.00 ¹
[91]	329	Feature selection, dimensionality reduction	DT	80.20 ¹
			LR	73.10 ¹
			SVM	71.00 ¹
			NB	62.40 ¹
[146]		Data transformation, data cleaning, feature selection, dimensionality reduction	BIRCH	56.50 ⁴
			DBSCAN	32.08 ⁴
			GMM	43.50 ⁴
			RF	86.00 ¹
			DT	84.00 ¹
			SVM	83.00 ¹
			LR	82.00 ¹
			KNN	81.00 ¹
[65]	985	Data cleaning, standardization of variables, SMOTE	AdaBoost	88.00 ¹
[65]	985	Data cleaning, standardization of variables, SMOTE	XGBoost	88.86 ¹
[74]	1865	Data cleaning, data transformation, categorical coding, feature selection	RF	88.00 ¹
			SVM	79.00 ¹
			GBT	92.00 ¹
[66]	4792	Data transformation	LR	85.00 ¹
[66]	4792	Data transformation	DT	87.00 ¹
[75]	17,904	Data cleaning, categorical coding, feature selection, SMOTE	DT	91.70 ¹
			NB	83.40 ¹
			KNN	96.30 ¹
[67]	1957	Data cleaning, imputation of missing values, transformation of variables	LR	98.20 ¹
			PR	98.20 ¹
			NB	98.60 ¹
			RF	98.50 ¹
			DT	98.80 ¹
			SVM	98.80 ¹
			KNN	98.00 ¹
[68]	8813	Reindexing of time series, data deletion, variable coding, standardization of variables	CatBoost	85.30 ³
			NN	84.40 ³
			LR	84.20 ³
			LDA	84.10 ³
			RF	83.90 ³
			LightGBM	83.20 ³
			XGBoost	82.30 ³
			SVM	82.30 ³
			NB	78.00 ³
			KNN	77.70 ³
[92]	321	Data cleaning, feature selection, standardization of variables	LR	83.30 ¹
[92]	321		PR	86.30 ¹
[134]	661	Data cleaning, transformation of variables, standardization of variables, balancing	LSTM	98.30 ¹
			DNN	98.10 ¹
			DT	93.40 ¹
			RF	92.00 ¹
			LR	98.00 ¹
			SVM	74.70 ¹
			KNN	99.00 ¹
[69]	129,846	Data cleaning, transformation of variables, variable coding, semantic clustering, standardization of variables	PEM-SNN	81.10 ¹
[141]	322	Elimination of variables, variable coding, standardization of variables	ARD	1.42 ⁵
			BR	1.45 ⁵
			LIRE	1.47 ⁵
			RR	1.48 ⁵
			LASSO	1.49 ⁵
			DT	1.65 ⁵
			RF	1.60 ⁵
			AdaBoost	1.62 ⁵
			XGBoost	1.63 ⁵
			CatBoost	1.64 ⁵
			SVM	1.66 ⁵
			KNN	1.68 ⁵
			MLP	1.65 ⁵
			DR	2.13 ⁵
[76]	4424	Elimination of variables, variable coding, standardization of variables	DT	81.00 ²
			RF	87.00 ²
			XGBoost	88.00 ²
			CatBoost	88.00 ²
			LightGBM	88.00 ²
			BG	85.00 ²
			SVM	76.00 ²
[77]	288	Feature selection, converting variables, elimination of variables	DT	90.51 ¹
			K-means	44.29 ¹
			IF	30.34 ¹
			LIRE	35.06 ¹
[70]	6312	Elimination of variables, variable coding, normalization	ANN	81.00 ¹

1: accuracy; 2: F1 score; 3: AUC; 4: silhouette; 5: RMSE (root mean square error). Abbreviations: alternating decision tree (ADTree); kernel principal component analysis (KPCA); principal component analysis (PCA); locality preserving projection (LPP); neighborhood preserving embedding (NPE); isometric projection (IsoP); weighted connected triple transformation (WCT-T); weighted triple quality transformation (WTQ-T); decision tree (DT); logistic regression (LR); iterative dichotomiser 3 (DT-ID3); support vector machine (SVM); random forest (RF); hierarchical density-based spatial clustering of applications with noise (HDBSCAN); density-based spatial clustering of applications with noise (DBSCAN); generalized mixed-effects random forest (GMERF); gradient boosting (GB); gradient boosting machine (GBM); extreme gradient boosting (XGBoost); convolutional neural network (CNN); convolutional neural network with dynamic pooling (DP-CNN); naive Bayes (NB); sentiment analysis model at concept level (CLSA); k-nearest neighbors (KNN); artificial neural network (ANN); analytic hierarchy process (AHP); long short-term memory (LSTM); generalized linear model (GLM); stochastic gradient descent (SGD); multilayer perceptron (MLP); adaptive boosting (ADABOOST); bootstrap aggregated decision trees (BAG); support vector machine synthetic minority over-sampling technique (SVMSMOTE); synthetic minority over-sampling technique (SMOTE); modified mutated firefly algorithm (MMFA); Gaussian Bayesian networks (GBNs); feed forward neural network (FFNN); Bayesian networks (BNs); radial basis function (RBF); decision tree-chi-square automatic interaction detector (DT-CHAID); logit leaf model (LLM); logistic model tree (LMT); light gradient boosting machine (LightGBM); student educational data mining (SEDM); probabilistic ensemble simplified fuzzy ARTMAP (PESFAM); feed forward neural network (FNN); balanced random forest (BRF); easy ensemble (EE); RUSBoost (RB); classification and regression trees (CART); classification tree model (CTM); synthetic minority over-sampling technique for nominal and categorical data (SMOTE-NC); extra trees classifier (ETC); conditional inference tree (CIT); classification and regression tree with bagging (bagged CART); feature tokenizer transformer (FTT); Gaussian mixture model (GMM); gradient boosted trees (GBT); neural networks (NNs); linear discriminant analysis (LDA); polynomial regression (PR); piecewise exponential model with structural neural network (PEM-SNN); automatic relevance determination (ARD); least absolute shrinkage and selection operator (LASSO); Bayesian ridge (BR); linear regression (LIRE); ridge regression (RR); dummy regressor (DR); isolation forest (IF).
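
The Result column of Table A6 mixes five metrics, so values are only comparable between rows that carry the same numeric marker. As a minimal sketch, not taken from any of the reviewed studies, the three simplest markers (accuracy, F1 score, and RMSE) could be computed as follows; the function names and the toy labels are illustrative assumptions.

```python
import math


def accuracy(y_true, y_pred):
    """Marker 1: fraction of labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Marker 2: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Marker 5: root mean square error for numeric predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Hypothetical, imbalanced sample: 1 = dropout, 0 = retained.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"F1 score = {f1_score(y_true, y_pred):.2f}")
print(f"RMSE     = {rmse([2.0, 4.0], [1.0, 5.0]):.2f}")
```

On this hypothetical sample, one of the two dropouts is missed: accuracy is 0.90 while the F1 score for the dropout class is about 0.67. This is one reason to be cautious when comparing accuracy figures across studies with different class balances.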

References

Baranyi, M.; Nagy, M.; Molontay, R. Interpretable Deep Learning for University Dropout Prediction. In Proceedings of the SIGITE 2020—Proceedings of the 21st Annual Conference on Information Technology Education, Virtual Event, 7–9 October 2020.
Bustamante, D.; Garcia-Bedoya, O. Predictive Academic Performance Model to Support, Prevent and Decrease the University Dropout Rate. In Communications in Computer and Information Science; Springer: Cham, Switzerland, 2021.
OECD. How many students complete tertiary education? In Education at a Glance 2022: OECD Indicators; OECD Publishing: Paris, France, 2022.
Agrusti, F.; Bonavolontà, G.; Mezzini, M. University dropout prediction through educational data mining techniques: A systematic review. J. E-Learn. Knowl. Soc. 2019, 15, 161–182.
Netanda, R.S.; Mamabolo, J.; Themane, M. Do or die: Student support interventions for the survival of distance education institutions in a competitive higher education system. Stud. High. Educ. 2019, 44, 397–414.
Felderer, B.; Kueck, J.; Spindler, M. Using Double Machine Learning to Understand Nonresponse in the Recruitment of a Mixed-Mode Online Panel. Soc. Sci. Comput. Rev. 2022, 41, 461–481.
Lee, J.H.; Kim, M.; Kim, D.; Gil, J.M. Evaluation of Predictive Models for Early Identification of Dropout Students. J. Inf. Process. Syst. 2021, 17, 630–644.
Pfau, W.; Rimpp, P. AI-Enhanced Business Models for Digital Entrepreneurship; Springer: Cham, Switzerland, 2021.
Vargas, A.V.; Palacio, G.J.L. Abandono estudiantil en una universidad privada: Un fenómeno no ajeno a los posgrados. Valoración cuantitativa a partir del análisis de supervivencia. Colombia, 2012–2016. Rev. Educ. 2020, 44, 177–191.
Buduma, N.; Locascio, N. Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms; O’Reilly Media: Sebastopol, CA, USA, 2017.
Berka, P.; Marek, L. Bachelor’s degree student dropouts: Who tend to stay and who tend to leave? Stud. Educ. Eval. 2021, 70, 100999.
Nájera, A.B.U.; Ortega, L.A.M. Predictive Model for Taking Decision to Prevent University Dropout. Int. J. Interact. Multimed. Artif. Intell. 2022, 7, 205–213.
Núñez-Naranjo, A.F.; Ayala-Chauvin, M.; Riba-Sanmartí, G. Prediction of university dropout using machine learning. In Proceedings of the International Conference on Information Technology & Systems, La Libertad, Ecuador, 4–6 February 2021; Springer: Cham, Switzerland, 2021; pp. 396–406.
Cannistrà, M.; Masci, C.; Ieva, F.; Agasisti, T.; Paganoni, A.M. Early-predicting dropout of university students: An application of innovative multilevel machine learning and statistical techniques. Stud. High. Educ. 2022, 47, 1935–1956.
Moreira da Silva, D.E.; Solteiro Pires, E.J.; Reis, A.; de Moura Oliveira, P.B.; Barroso, J. Forecasting Students Dropout: A UTAD University Study. Future Internet 2022, 14, 76.
Bertolini, R.; Finch, S.; Nehm, R. Enhancing data pipelines for forecasting student performance: Integrating feature selection with cross-validation. Int. J. Educ. Technol. High. Educ. 2021, 18, 1–23.
Lee, S.; Chung, J.Y. The machine learning-based dropout early warning system for improving the performance of dropout prediction. Appl. Sci. 2019, 9, 3093.
Blundo, C.; Fenza, G.; Fuccio, G.; Loia, V.; Orciuoli, F. A time-driven FCA-based approach for identifying students’ dropout in MOOCs. Int. J. Intell. Syst. 2022, 37, 2683–2705.
Heuillet, A.; Couthouis, F.; Díaz-Rodríguez, N. Explainability in deep reinforcement learning. Knowl.-Based Syst. 2021, 214, 106685.
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 1135–1144.
Pellagatti, M.; Masci, C.; Ieva, F.; Paganoni, A.M. Generalized mixed-effects random forest: A flexible approach to predict university student dropout. Stat. Anal. Data Min. 2021, 14, 241–257.
Alban, M.; Mauricio, D. Predicting University Dropout through Data Mining: A Systematic Literature. Indian J. Sci. Technol. 2019, 12, 10.
Albreiki, B.; Zaki, N.; Alashwal, H. A systematic literature review of student’ performance prediction using machine learning techniques. Educ. Sci. 2021, 11, 552.
Andrade-Girón, D.; Sandivar-Rosas, J.; Marín-Rodriguez, W.; Susanibar-Ramirez, E.; Toro-Dextre, E.; Ausejo-Sanchez, J.; Villarreal-Torres, H.; Angeles-Morales, J. Predicting Student Dropout based on Machine Learning and Deep Learning: A Systematic Review. EAI Endorsed Trans. Scalable Inf. Syst. 2023, 10, 1.
Mduma, N.; Kalegele, K.; Machuve, D. A survey of machine learning approaches and techniques for student dropout prediction. Data Sci. J. 2019, 18, 14.
Alalawi, K.; Athauda, R.; Chiong, R. Contextualizing the current state of research on the use of machine learning for student performance prediction: A systematic literature review. Eng. Rep. 2023, 5, e12699.
Guo, T.; Bai, X.; Tian, X.; Firmin, S.; Xia, F. Educational anomaly analytics: Features, methods, and challenges. Front. Big Data 2022, 4, 811840.
Alhothali, A.; Albsisi, M.; Assalahi, H.; Aldosemani, T. Predicting student outcomes in online courses using machine learning techniques: A review. Sustainability 2022, 14, 6199.
Idowu, J.A. Debiasing education algorithms. Int. J. Artif. Intell. Educ. 2024, 34, 1510–1540.
Venkatesan, R.G.; Karmegam, D.; Mappillairaju, B. Exploring statistical approaches for predicting student dropout in education: A systematic review and meta-analysis. J. Comput. Soc. Sci. 2024, 7, 171–196.
Tinto, V. Dropout from Higher Education: A Theoretical Synthesis of Recent Research. Rev. Educ. Res. 1975, 45, 89–125.
Tinto, V. Limits of Theory and Practice in Student Attrition. J. High. Educ. 1982, 53, 687–700.
Tinto, V. Leaving College: Rethinking the Causes and Cures of Student Attrition; University of Chicago Press: Chicago, IL, USA, 1994.
Franz, S.; Paetsch, J. Academic and social integration and their relation to dropping out of teacher education: A comparison to other study programs. Front. Educ. 2023, 8, 1179264.
Villegas-Ch, W.; Govea, J.; Revelo-Tapia, S. Improving Student Retention in Institutions of Higher Education through Machine Learning: A Sustainable Approach. Sustainability 2023, 15, 14512.
Quincho Apumayta, R.; Carrillo Cayllahua, J.; Ccencho Pari, A.; Inga Choque, V.; Cárdenas Valverde, J.; Huamán Ataypoma, D. University Dropout: A Systematic Review of the Main Determinant Factors (2020-2024) [Version 2; Peer Review: 2 Approved]. F1000Research 2024, 13, 942.
Lorenzo-Quiles, O.; Galdón-López, S.; Lendínez-Turón, A. Factors contributing to university dropout: A review. Front. Educ. 2023, 8, 1159864.
Xavier, M.; Meneses, J. A Literature Review on the Definitions of Dropout in Online Higher Education. In Proceedings of the European Distance and E-Learning Network (EDEN) Proceedings, Timisoara, Romania, 22–24 June 2020; Available online: https://femrecerca.cat/meneses/publication/literature-review-definitions-dropout-online-higher-education/literature-review-definitions-dropout-online-higher-education.pdf (accessed on 25 February 2025).
Opazo, D.; Moreno, S.; Álvarez-Miranda, E.; Pereira, J. Analysis of First-Year University Student Dropout through Machine Learning Models: A Comparison between Universities. Mathematics 2021, 9, 2599.
Dervenis, C.; Kyriatzis, V.; Stoufis, S.; Fitsilis, P. Predicting Students’ Performance Using Machine Learning Algorithms. In Proceedings of the 6th International Conference on Algorithms, Computing and Systems, ICACS ’22, Larissa, Greece, 16–18 September 2022; Association for Computing Machinery: New York, NY, USA, 2023.
Yağcı, M. Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart Learn. Environ. 2022, 9, 11.
Wang, J.; Yu, Y. Machine Learning Approach to Student Performance Prediction of Online Learning. PLoS ONE 2025, 20, e0299018.
Dabhade, P.; Agarwal, R.; Alameen, K.P.; Fathima, A.T.; Sridharan, R.; Gopakumar, G. Educational Data Mining for Predicting Students’ Academic Performance Using Machine Learning Algorithms. Mater. Today Proc. 2021, 47, 5260–5267.
Hakim, N.; Jastacia, B.; Mansoori, A.A. Personalizing Learning Paths: A Study of Adaptive Learning Algorithms and Their Effects on Student Outcomes. J. Emerg. Technol. Educ. 2024, 2, 318–330.
Alzubaidi, A.; Alzubaidi, A.; Alzubaidi, A. Assessment and Evaluation of Different Machine Learning Models for Predicting Students’ Academic Performance. J. Comput. Sci. 2023, 19, 415–427.
Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
Cotfas, L.-A.; Delcea, C.; Mancini, S.; Ponsiglione, C.; Vitiello, L. An agent-based model for cruise ship evacuation considering the presence of smart technologies on board. Expert Syst. Appl. 2023, 214, 119124.
Kitchenham, B.; Pearl Brereton, O.; Budgen, D.; Turner, M.; Bailey, J.; Linkman, S. Systematic literature reviews in software engineering—A systematic literature review. Inf. Softw. Technol. 2009, 51, 7–15.
Shiguihara, P.; Lopes, A.d.A.; Mauricio, D. Dynamic Bayesian Network Modeling, Learning, and Inference: A Survey. IEEE Access 2021, 9, 117639–117648.
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, 71.
Mutrofin, S.; Ginardi, R.V.H.; Fatichah, C.; Kurniawardhani, A. A critical assessment of balanced class distribution problems: The case of predict student dropout. Test Eng. Manag. 2019, 81, 1764–1770.
Phan, M.; De Caigny, A.; Coussement, K. A decision support framework to incorporate textual data for early student dropout prediction in higher education. Decis. Support Syst. 2023, 168, 113940.
Al-Jallad, N.T.; Ning, X.; Khairalla, M.A. An interpretable predictive framework for students’ withdrawal problem using multiple classifiers. Eng. Lett. 2019, 27, 1–8.
Berens, J.; Schneider, K.; Görtz, S.; Oster, S.; Burghoff, J. Early Detection of Students at Risk—Predicting Student Dropouts Using Administrative Student Data from German Universities and Machine Learning Methods. J. Educ. Data Min. 2019, 11, 1–41.
Velasco, C.L.R.; Villena, E.G.; Ballester, J.B.; Prados, F.Á.D.; Alvarado, E.S.; Álvarez, J.C. Forecasting of Post-Graduate Students’ Late Dropout Based on the Optimal Probability Threshold Adjustment Technique for Imbalanced Data. Int. J. Emerg. Technol. Learn. 2023, 18, 120–155.
Martins, M.V.; Baptista, L.; Machado, J.; Realinho, V. Multi-Class Phased Prediction of Academic Performance and Dropout in Higher Education. Appl. Sci. 2023, 13, 4702.
Coussement, K.; Phan, M.; De Caigny, A.; Benoit, D.F.; Raes, A. Predicting student dropout in subscription-based online learning environments: The beneficial impact of the logit leaf model. Decis. Support Syst. 2020, 135, 113325.
Oqaidi, K.; Aouhassi, S.; Mansouri, K. Towards a Students’ Dropout Prediction Model in Higher Education Institutions Using Machine Learning Algorithms. Int. J. Emerg. Technol. Learn. 2022, 17, 103–117.
Won, H.S.; Kim, M.J.; Kim, D.; Kim, H.S.; Kim, K.M. University Student Dropout Prediction Using Pretrained Language Models. Appl. Sci. 2023, 13, 7073.
Hutagaol, N.; Suharjito. Predictive modelling of student dropout using ensemble classifier method in higher education. Adv. Sci. Technol. Eng. Syst. 2019, 4, 206–211.
Sultana, S.; Khan, S.; Abbas, M.A. Predicting performance of electrical engineering students using cognitive and non-cognitive features for identification of potential dropouts. Int. J. Electr. Eng. Educ. 2017, 54, 105–118.
Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predicting Student Dropout and Academic Success. Data 2022, 7, 146.
Behr, A.; Giese, M.; Teguim Kamdjou, H.D.; Theune, K. Motives for dropping out from higher education—An analysis of bachelor’s degree students in Germany. Eur. J. Educ. 2021, 56, 325–343.
Thammasiri, D.; Delen, D.; Meesad, P.; Kasap, N. A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition. Expert Syst. Appl. 2014, 41, 321–330.
Goran, R.; Jovanovic, L.; Bacanin, N.; Stankovic, M.; Simic, V.; Antonijevic, M.; Zivkovic, M. Identifying and understanding student dropouts using metaheuristic optimized classifiers and explainable artificial intelligence techniques. IEEE Access 2024, 12, 122377–122400.
Gutiérrez, B.; Dehnhardt, M.; Cortés, R.; Matheu, A.; Cornejo, C. Modelo logístico de deserción mediante técnicas de regresión y árbol de decisión para la eficiencia en la destinación de recursos: El caso de una universidad privada chilena. Rev. Ibérica Sist. E Tecnol. Informação 2024, E68, 398–412.
Hassan, M.A.; Muse, A.H.; Nadarajah, S. Predicting student dropout rates using supervised machine learning: Insights from the 2022 National Education Accessibility Survey in Somaliland. Appl. Sci. 2024, 14, 7593.
Vaarma, M.; Li, H. Predicting student dropouts with machine learning: An empirical study in Finnish higher education. Technol. Soc. 2024, 76, 102474.
Cai, C.; Fleischhacker, A. Structural Neural Networks Meet Piecewise Exponential Models for Interpretable College Dropout Prediction. J. Educ. Data Min. 2024, 16, 279–302.
Asto-Lazaro, M.S.; Cieza-Mostacero, S.E. Web Application Based on Neural Networks for the Detection of Students at Risk of Academic Desertion. TEM J. 2024, 13, 2581.
Isleib, S.; Woisch, A.; Heublein, U. Causes of higher education dropout: Theoretical basis and empirical factors. Z. Erzieh. 2019, 22, 1047–1076.
Guerra, L.; Rivero, D.; Ortiz, A.; Diaz, E.; Quishpe, S. Prediction model of university dropout through data analytics: Strategy for sustainability. RISTI—Rev. Iber. Sist. E Tecnol. Inf. 2020, 2020, 38–47.
Hinojosa, M.; Derpich, I.; Alfaro, M.; Ruete, D.; Caroca, A.; Gatica, G. Student clustering procedure according to dropout risk to improve student management in higher education. Texto Livre 2022, 15, e37275.
Zapata-Medina, D.; Espinosa-Bedoya, A.; Jiménez-Builes, J.A. Improving the Automatic Detection of Dropout Risk in Middle and High School Students: A Comparative Study of Feature Selection Techniques. Mathematics 2024, 12, 1776.
Arthana, I.K.R.; Maysanjaya, I.M.D.; Pradnyana, G.A.; Dantes, G.R. Optimizing Dropout Prediction in University Using Oversampling Techniques for Imbalanced Datasets. Int. J. Inf. Educ. Technol. 2024, 14, 1052–1060.
Villar, A.; de Andrade, C.R.V. Supervised machine learning algorithms for predicting student dropout and academic success: A comparative study. Discov. Artif. Intell. 2024, 4, 2.
Diaz, J.; Moreira, F. Toward Educational Sustainability: An AI System for Identifying and Preventing Student Dropout. IEEE Rev. Iberoam. Tecnol. Aprendiz. 2024, 19, 100–110.
Kuz, A.; Morales, R. Educational Data Science and Machine Learning: A Case Study on University Student Dropout in Mexico. Educ. Knowl. Soc. 2023, 24, 14.
Villarreal-Torres, H.; Ángeles-Morales, J.; Cano-Mejía, J.; Mejía-Murillo, C.; Flores-Reyes, G.; Palomino-Márquez, M.; Marín-Rodriguez, W.; Andrade-Girón, D. Classification model for student dropouts using machine learning: A case study. EAI Endorsed Trans. Scalable Inf. Syst. 2023, 10, 1–12.
Díaz, B.; Marín, W.; Lioo, F.; Baldeos, L.; Villanueva, D.; Ausejo, J. Student desertion, factors associated with decision trees: The case of a graduate school at a public university in Peru. RISTI—Rev. Iber. Sist. E Tecnol. Inf. 2022, 2022, 197–211.
Bedregal-Alpaca, N.; Cornejo-Aparicio, V.; Zarate-Valderrama, J.; Yanque-Churo, P. Classification models for determining types of academic risk and predicting dropout in university students. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 7.
Marquez-Vera, C.; Morales, C.R.; Soto, S.V. Predicting School Failure and Dropout by Using Data Mining Techniques. IEEE Rev. Iberoam. Tecnol. Aprendiz. 2013, 8, 7–14.
Chen, R. Institutional characteristics and college student dropout risks: A multilevel event history analysis. Res. High. Educ. 2012, 53, 487–505.
Qvortrup, A.; Lykkegaard, E. The malleability of higher education study environment factors and their influence on humanities student dropout—Validating an instrument. Educ. Sci. 2024, 14, 904.
Alban, M.; Mauricio, D. Neural networks to predict dropout at the universities. Int. J. Mach. Learn. Comput. 2019, 9, 149–153.
Silva, H.A.; Quezada, L.E.; Oddershede, A.M.; Palominos, P.I.; O’Brien, C. A Method for Estimating Students’ Desertion in Educational Institutions Using the Analytic Hierarchy Process. J. Coll. Stud. Retent. Res. Theory Pract. 2020, 25, 101–125.
Bedregal-Alpaca, N.; Tupacyupanqui-Jaén, D.; Cornejo-Aparicio, V. Analysis of the academic performance of systems engineering students, desertion possibilities and proposals for retention. Ingeniare 2020, 28, 668–683.
Kim, S.; Choi, E.; Jun, Y.K.; Lee, S. Student Dropout Prediction for University with High Precision and Recall. Appl. Sci. 2023, 13, 6275.
Zanellati, A.; Zingaro, S.P.; Gabbrielli, M. Balancing performance and explainability in academic dropout prediction. IEEE Trans. Learn. Technol. 2024, 17, 2086–2099.
Iam-On, N.; Boongoen, T. Improved student dropout prediction in Thai University using ensemble of mixed-type data clusterings. Int. J. Mach. Learn. Cybern. 2017, 8, 497–510.
Quispe, J.O.Q.; Toledo, O.C.; Toledo, M.C.; Llatasi, E.E.C.; Saira, E. Early prediction of university student dropout using machine learning models. Nanotechnol. Percept. 2024, 20, 659–669.
Bouihi, B.; Bousselham, A.; Aoula, E.; Ennibras, F.; Deraoui, A. Prediction of Higher Education Student Dropout based on Regularized Regression Models. Eng. Technol. Appl. Sci. Res. 2024, 14, 17811–17815.
Aggarwal, D.; Mittal, S.; Bali, V. Prediction model for classifying students based on performance using machine learning techniques. Int. J. Recent Technol. Eng. 2019, 8, 496–503.
Alvarez, N.L.; Callejas, Z.; Griol, D. Factors that affect student desertion in careers in Computer Engineering profile. Rev. Fuentes 2020, 22, 105–126.
Cam, H.N.T.; Sarlan, A.; Arshad, N.I. A hybrid model integrating recurrent neural networks and the semi-supervised support vector machine for identification of early student dropout risk. PeerJ Comput. Sci. 2024, 10, e2572.
Castelo Branco, U.V.; Jezine, E.; Santos Diniz, A.V.; Silva, G.T. Sistema de Alerta para la Identificación de Posibles Factores de Deserción de Estudiantes de Grado en Período de Pandemia en Paraíba (Brasil). Res. Educ. Learn. Innov. Arch. 2022, 29, 83–101.
Gutierrez-Pachas, D.A.; Garcia-Zanabria, G.; Cuadros-Vargas, E.; Camara-Chavez, G.; Gomez-Nieto, E. Supporting Decision-Making Process on Higher Education Dropout by Analyzing Academic, Socioeconomic, and Equity Factors through Machine Learning and Survival Analysis Methods in the Latin American Context. Educ. Sci. 2023, 13, 154.
Hoffait, A.S.; Schyns, M. Early detection of university students with potential difficulties. Decis. Support Syst. 2017, 101, 1–11.
Lacave, C.; Molina, A.I.; Cruz-Lemus, J.A. Learning Analytics to identify dropout factors of Computer Science studies through Bayesian networks. Behav. Inf. Technol. 2018, 37, 993–1007.
Lottering, R.; Hans, R.; Lall, M. A Machine Learning Approach to Identifying Students at Risk of Dropout: A Case Study. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 417–422.
Martins, M.; Migueis, V.; Fonseca, D.; Gouveia, P. Prediction of academic dropout in a higher education institution using data mining. RISTI—Rev. Iber. Sist. E Tecnol. Inf. 2020, 2020, 188–203. [Google Scholar]
Radovanović, S.; Delibašić, B.; Suknović, M. Predicting dropout in online learning environments. Comput. Sci. Inf. Syst. 2021, 18, 957–978. [Google Scholar] [CrossRef]
Rivera-Baena, O.D.; Patiño-Rodríguez, C.E.; Úsuga-Manco, O.C.; Hernández-Barajas, F. ADHE: A tool to characterize higher education dropout phenomenon. Rev. Fac. Ing. Univ. Antioq. 2024, 64–75. [Google Scholar] [CrossRef]
Schneider, K.; Berens, J.; Burghoff, J. Early detection of student dropout: What is relevant information? Z. Erzieh. 2019, 22, 1121–1146. [Google Scholar] [CrossRef]
Segura, M.; Mello, J.; Hernández, A. Machine Learning Prediction of University Student Dropout: Does Preference Play a Key Role? Mathematics 2022, 10, 3359. [Google Scholar] [CrossRef]
Tan, M.; Shao, P. Prediction of student dropout in E-learning program through the use of machine learning method. Int. J. Emerg. Technol. Learn. 2015, 10, 11. [Google Scholar] [CrossRef]
Wainipitapong, S.; Chiddaycha, M. Assessment of dropout rates in the preclinical years and contributing factors: A study on one Thai medical school. BMC Med. Educ. 2022, 22, 461. [Google Scholar] [CrossRef]
Yasmin. Application of the classification tree model in predicting learner dropout behaviour in open and distance learning. Distance Educ. 2013, 34, 218–231. [Google Scholar] [CrossRef]
Zárate-Valderrama, J.; Bedregal-Alpaca, N.; Cornejo-Aparicio, V. Classification models to recognize patterns of desertion in university students. Ingeniare 2021, 29, 168–177. [Google Scholar] [CrossRef]
Zerkouk, M.; Mihoubi, M.; Chikhaoui, B.; Wang, S. A machine learning based model for student’s dropout prediction in online training. Educ. Inf. Technol. 2024, 29, 15793–15812. [Google Scholar] [CrossRef]
Alfahid, A. Algorithmic Prediction of Students On-Time Graduation from the University. TEM J. 2024, 13, 692–698. [Google Scholar] [CrossRef]
Mealli, F.; Rampichini, C. Evaluating the effects of university grants by using regression discontinuity designs. J. R. Stat. Soc. Ser. A Stat. Soc. 2012, 175, 775–798. [Google Scholar] [CrossRef]
Daza, A. A stacking based hybrid technique to predict student dropout at universities. J. Theor. Appl. Inf. Technol. 2022, 100, 1–12. [Google Scholar]
Lackner, E. Community College Student Persistence During the COVID-19 Crisis of Spring 2020. Community Coll. Rev. 2023, 51, 193–215. [Google Scholar] [CrossRef] [PubMed]
Willging, P.A.; Johnson, S.D. Factors that influence students’ decision to dropout of online courses. Online Learn. J. 2019, 13, 115–127. [Google Scholar] [CrossRef]
Vega, H.; Sanez, E.; De La Cruz, P.; Moquillaza, S.; Pretell, J. Intelligent System to Predict University Students Dropout. Int. J. Online Biomed. Eng. 2022, 18, 27–43. [Google Scholar] [CrossRef]
Fontana, L.; Masci, C.; Ieva, F.; Paganoni, A.M. Performing learning analytics via generalised mixed-effects trees. Data 2021, 6, 74. [Google Scholar] [CrossRef]
Márquez-Vera, C.; Cano, A.; Romero, C.; Ventura, S. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl. Intell. 2013, 38, 315–330. [Google Scholar] [CrossRef]
Alvarado-Uribe, J.; Mejía-Almada, P.; Masetto Herrera, A.L.; Molontay, R.; Hilliger, I.; Hegde, V.; Montemayor Gallegos, J.E.; Ramírez Díaz, R.A.; Ceballos, H.G. Student Dataset from Tecnologico de Monterrey in Mexico to Predict Dropout in Higher Education. Data 2022, 7, 119. [Google Scholar] [CrossRef]
Dasi, H.; Kanakala, S. Student Dropout Prediction Using Machine Learning Techniques. Int. J. Intell. Syst. Appl. Eng. 2022, 10, 408–414. [Google Scholar]
Albán, M.; Mauricio, D.; Albán, M. Decision trees for the early identification of university students at risk of desertion. Int. J. Eng. Technol. 2018, 7, 51. [Google Scholar] [CrossRef]
Song, Z.; Sung, S.H.; Park, D.M.; Park, B.K. All-Year Dropout Prediction Modeling and Analysis for University Students. Appl. Sci. 2023, 13, 1143. [Google Scholar] [CrossRef]
Fauszt, T.; Erdélyi, K.; Dobák, D.; Bognár, L.; Kovács, E. Design of a Machine Learning Model to Predict Student Attrition. Int. J. Emerg. Technol. Learn. 2023, 18, 184–195. [Google Scholar] [CrossRef]
Meyer, J.; Leuze, K.; Strauss, S. Individual Achievement, Person-Major Fit, or Social Expectations: Why Do Students Switch Majors in German Higher Education? Res. High. Educ. 2022, 63, 222–247. [Google Scholar] [CrossRef]
Wild, S.; Schulze Heuling, L. Student dropout and retention: An event history analysis among students in cooperative higher education. Int. J. Educ. Res. 2020, 104, 101687. [Google Scholar] [CrossRef]
Wongvorachan, T.; Bulut, O.; Liu, J.X.; Mazzullo, E. A Comparison of Bias Mitigation Techniques for Educational Classification Tasks Using Supervised Machine Learning. Information 2024, 15, 326. [Google Scholar] [CrossRef]
Sacală, M.D.; Pătărlăgeanu, S.R.; Popescu, M.F.; Constantin, M. Econometric research of the mix of factors influencing first-year students’ dropout decision at the faculty of agri-food and environmental economics. Econ. Comput. Econ. Cybern. Stud. Res. 2021, 55, 203–220. [Google Scholar] [CrossRef]
Kok, C.L.; Ho, C.K.; Chen, L.; Koh, Y.Y.; Tian, B. A Novel Predictive Modeling for Student Attrition Utilizing Machine Learning and Sustainable Big Data Analytics. Appl. Sci. 2024, 14, 9633. [Google Scholar] [CrossRef]
Fernandez-Garcia, A.J.; Preciado, J.C.; Melchor, F.; Rodriguez-Echeverria, R.; Conejero, J.M.; Sanchez-Figueroa, F. A real-life machine learning experience for predicting university dropout at different stages using academic data. IEEE Access 2021, 9, 133076–133090. [Google Scholar] [CrossRef]
Alban, M.; Mauricio, D. Factors that influence undergraduate university desertion according to students perspective. Int. J. Eng. Technol. 2019, 10, 1585–1602. [Google Scholar] [CrossRef]
Huo, H.; Cui, J.; Hein, S.; Padgett, Z.; Ossolinski, M.; Raim, R.; Zhang, J. Predicting Dropout for Nontraditional Undergraduate Students: A Machine Learning Approach. J. Coll. Stud. Retent. Res. Theory Pract. 2023, 24, 1054–1077. [Google Scholar] [CrossRef]
Nuanmeesri, S.; Poomhiran, L.; Chopvitayakun, S.; Kadmateekarun, P. Improving Dropout Forecasting during the COVID-19 Pandemic through Feature Selection and Multilayer Perceptron Neural Network. Int. J. Inf. Educ. Technol. 2022, 12, 851–857. [Google Scholar] [CrossRef]
Zamora Menéndez, Á.; Gil Flores, J.; de Besa Gutiérrez, M.R. Learning approaches, time perspective and persistence in university students. Educ. XX1 2020, 23, 17–39. [Google Scholar] [CrossRef]
Vives, L.; Cabezas, I.; Vives, J.C.; Reyes, N.G.; Aquino, J.; Cóndor, J.B.; Altamirano, S.F.S. Prediction of students’ academic performance in the programming fundamentals course using long short-term memory neural networks. IEEE Access 2024, 12, 5882–5898. [Google Scholar] [CrossRef]
Alban, M.; Mauricio, D. Prediction of university dropout through technological factors: A case study in Ecuador. Rev. Espac. 2018, 39, 8. [Google Scholar]
Cedeño-Valarezo, L.; Morales-Carrillo, J.; Quijije-Vera, C.P.; Palau-Delgado, S.A.; López-Mora, C.I. Machine learning to predict school dropout in the context of COVID-19. RISTI—Rev. Iber. Sist. E Tecnol. Inf. 2023, 2023, 370–377. [Google Scholar]
Gamao, A.O.; Gerardo, B.D. Prediction-based model for student dropouts using modified mutated firefly algorithm. Int. J. Adv. Trends Comput. Sci. Eng. 2019, 8, 3461–3469. [Google Scholar] [CrossRef]
Heredia, D.; Amaya, Y.; Barrientos, E. Student Dropout Predictive Model Using Data Mining Techniques. IEEE Lat. Am. Trans. 2015, 13, 3127–3134. [Google Scholar] [CrossRef]
Mubarak, A.A.; Cao, H.; Zhang, W. Prediction of students’ early dropout based on their interaction logs in online learning environment. Interact. Learn. Environ. 2022, 30, 1414–1433. [Google Scholar] [CrossRef]
Niyogisubizo, J.; Liao, L.; Nziyumva, E.; Murwanashyaka, E.; Nshimyumukiza, P.C. Predicting student’s dropout in university classes using two-layer ensemble machine learning approach: A novel stacked generalization. Comput. Educ. Artif. Intell. 2022, 3, 100066. [Google Scholar] [CrossRef]
Rico-Juan, J.R.; Cachero, C.; Macià, H. Study regarding the influence of a student’s personality and an LMS usage profile on learning performance using machine learning techniques. Appl. Intell. 2024, 54, 6175–6197. [Google Scholar] [CrossRef]
Selvan, M.P.; Navadurga, N.; Prasanna, N.L. An efficient model for predicting student dropout using data mining and machine learning techniques. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 750–752. [Google Scholar] [CrossRef]
Valles-Coral, M.A.; Salazar-Ramírez, L.; Injante, R.; Hernandez-Torres, E.A.; Juárez-Díaz, J.; Navarro-Cabrera, J.R.; Pinedo, L.; Vidaurre-Rojas, P. Density-Based Unsupervised Learning Algorithm to Categorize College Students into Dropout Risk Levels. Data 2022, 7, 165. [Google Scholar] [CrossRef]
Figueroa-Canas, J.; Sancho-Vinuesa, T. Early prediction of dropout and final exam performance in an online statistics course. Rev. Iberoam. Tecnol. Aprendiz. 2020, 15, 86–94. [Google Scholar] [CrossRef]
Nuankaew, P. Dropout situation of business computer students, University of Phayao. Int. J. Emerg. Technol. Learn. 2019, 14, 115–131. [Google Scholar] [CrossRef]
Pecuchova, J.; Drlik, M. Enhancing the Early Student Dropout Prediction Model Through Clustering Analysis of Students’ Digital Traces. IEEE Access 2024, 12, 159336–159367. [Google Scholar] [CrossRef]
Fu, Q.; Gao, Z.; Zhou, J.; Zheng, Y. CLSA: A novel deep learning model for MOOC dropout prediction. Comput. Electr. Eng. 2021, 94, 107315. [Google Scholar] [CrossRef]
Burgos, C.; Campanario, M.L.; Peña, D.d.l.; Lara, J.A.; Lizcano, D.; Martínez, M.A. Data mining for modeling students’ performance: A tutoring action plan to prevent academic dropout. Comput. Electr. Eng. 2018, 66, 541–556. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf (accessed on 25 February 2025).
Melo, E.; Silva, I.; Costa, D.G.; Viegas, C.M.D.; Barros, T.M. On the use of explainable artificial intelligence to evaluate school dropout. Educ. Sci. 2022, 12, 845. [Google Scholar] [CrossRef]
Dass, S.; Gary, K.; Cunningham, J. Predicting student dropout in self-paced mooc course using random forest model. Information 2021, 12, 476. [Google Scholar] [CrossRef]
Karlos, S.; Kostopoulos, G.; Kotsiantis, S. Predicting and interpreting students’ grades in distance higher education through a semi-regression method. Appl. Sci. 2020, 10, 8413. [Google Scholar] [CrossRef]
Torres, J.A.O.; Santiago, A.M.; Izaguirre, J.M.V.; Garduza, S.H.; García, M.A.; Alejandro, G.F. Multilayer fuzzy inference system for predicting the risk of dropping out of school at the high school level. IEEE Access 2024, 12, 137523–137532. [Google Scholar] [CrossRef]
Karimi-Haghighi, M.; Castillo, C.; Hernández-Leo, D. A Causal Inference Study on the Effects of First Year Workload on the Dropout Rate of Undergraduates. In Artificial Intelligence in Education; Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 15–27. [Google Scholar]
Alhaza, K.; Abdel-Salam, A.-S.G.; Mollazehi, M.D.; Ismail, R.M.; Bensaid, A.; Johnson, C.; Al-Tameemi, R.A.N.; A Hasan, M.; Romanowski, M.H. Factors affecting university image among undergraduate students: The case study of Qatar University. Cogent Educ. 2021, 8, 1977106. [Google Scholar] [CrossRef]
Viloria, A.; Lezama, O.B.P. Mixture structural equation models for classifying university student dropout in Latin America. Procedia Comput. Sci. 2019, 160, 629–634. [Google Scholar] [CrossRef]
Ishii, T.; Tachikawa, H.; Shiratori, Y.; Hori, T.; Aiba, M.; Kuga, K.; Arai, T. What kinds of factors affect the academic outcomes of university students with mental disorders? A retrospective study based on medical records. Asian J. Psychiatry 2018, 32, 67–72. [Google Scholar] [CrossRef]
Pecuchova, J.; Drlik, M. Predicting students at risk of early dropping out from course using ensemble classification methods. Procedia Comput. Sci. 2023, 225, 3223–3232. [Google Scholar] [CrossRef]
Brigato, L.; Iocchi, L. A Close Look at Deep Learning with Small Data. arXiv 2020, arXiv:2003.12843. [Google Scholar]
Mauricio, D.; Cárdenas-Grandez, J.; Uribe Godoy, G.V.; Rodríguez Mallma, M.J.; Maculan, N.; Mascaro, P. Maximizing Survival in Pediatric Congenital Cardiac Surgery Using Machine Learning, Explainability, and Simulation Techniques. J. Clin. Med. 2024, 13, 6872. [Google Scholar] [CrossRef]
Ananthi Claral Mary, T.; Arul Leena Rose, P.J. Ensemble Machine Learning Model for University Students’ Risk Prediction and Assessment of Cognitive Learning Outcomes. Int. J. Inf. Educ. Technol. 2023, 13, 948–958. [Google Scholar] [CrossRef]

Figure 1. Systematic review process according to PRISMA [50].

Figure 2. Number of articles selected per year.

Figure 3. Number of authors by country of affiliation.

Figure 4. Number of articles per quartile.

Figure 5. Number of articles per publisher.

Figure 6. Frequency of UED factors by category.

Figure 7. Process to minimize UED.

Table 1. Database search string.

Database	Search String
Scopus	title-abstract-keywords ((“student desertion” OR “student abandonment” OR “student retreat” OR “student withdrawal” OR “desertion university” OR “dropout university” OR “desertion dropout” OR desertion OR “college dropout” OR “academic desertion” OR “academic dropout” OR “student dropout” OR “university withdrawal” OR “college withdrawal”) AND (explication OR factor OR prediction OR simulation OR methods OR framework OR forecast OR predict OR process OR explanation OR interpretation OR patterns OR analysis OR identify OR estimate OR know OR architecture OR establish OR proposal OR discover OR explainability OR predicting OR performance OR models) AND (“machine learning” OR “deep Learning” OR “decision tree” OR “Bayesian” OR “neural network” OR arn OR regression OR clustering OR “association rules” OR “automatic learning”))
Web of Science (WoS)	(“student desertion” OR “student abandonment” OR “student retreat” OR “student withdrawal” OR “desertion university” OR “dropout university” OR “desertion dropout” OR desertion OR “college dropout” OR “academic desertion” OR “academic dropout” OR “student dropout” OR “university withdrawal” OR “college withdrawal”) AND (explication OR factor OR prediction OR simulation OR methods OR framework OR forecast OR predict OR process OR explanation OR interpretation OR patterns OR analysis OR identify OR estimate OR know OR architecture OR establish OR proposal OR discover OR explainability OR predicting OR performance OR models) AND (“machine learning” OR “deep Learning” OR “decision tree” OR “Bayesian” OR “neural network” OR arn OR regression OR Clustering OR “association rules” OR “automatic learning”) (topic)

Table 2. Inclusion and exclusion criteria.

Inclusion Criteria	Exclusion Criteria
Articles addressing at least one of the key dimensions of this review: factors, prediction, explanation, or simulation of UED using ML. Journal articles. Area related to “Engineering” or “Computer Science”. Articles published in journals indexed in Scopus and Web of Science. Period: 2012–2024.	Pre-publications. Articles in the field of secondary or primary education. Articles that are not within the context of ML. Articles that identify factors associated with UED without empirical or statistical validation.

Table 3. Potentially Eligible and Selected Articles.

Source	Potentially Eligible Studies	Selected Studies
Scopus	620	102
Web of Science (WoS)	166	20
Total	786	122 *

* 61 studies removed from WoS for being duplicates in Scopus.

Table 4. Category of factors influencing UED.

Category	Description	References
Demographics	Characteristics or attributes that describe the structure and composition of a population.	[51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71]
Socioeconomic	The economic and social situation of each student.	[51,52,55,56,57,61,62,64,66,71,72,73,74,75,76,77,78]
Institutional	Elements related to the physical structure and functioning of the institution.	[59,78,79,80,81,82,83,84,85]
Personal	Elements within the student’s family circle.	[54,60,69,72,74,79,80,86,87,88,89]
Academic	Elements related to the student’s academic performance.	[51,53,54,55,56,57,58,59,60,62,64,65,66,67,68,69,72,73,74,75,76,77,80,86,88,90,91,92]

Table 5. Most relevant demographic factors of the UED.

Id	Factor	References
F001	Gender	[7,14,39,50,51,53,54,58,59,60,61,63,64,65,66,67,68,69,71,72,74,77,78,82,83,86,88,89,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110]
F002	Age	[7,15,51,54,59,64,65,66,67,68,69,71,72,74,77,78,80,83,84,86,89,91,94,97,98,99,100,101,105,108,109,111,112,113,114,115,116,117,118,119]
F003	Marital status	[51,59,61,68,69,71,76,80,86,91,93,97,102,106,108,114,116,118]
F004	Sex	[21,55,78,80,84,86,90,91,111,112,114,115,118,120,121,122,123,124]
F005	Place of residence	[7,14,39,51,64,65,66,68,74,86,93,106,108,114,121,125]
F006	Place of origin	[39,51,53,54,71,78,89,90,91,93]
F007	Nationality	[7,21,53,55,58,61,64,65,76,93,101,104,115,126]
F008	Parents’ educational level	[59,70,72,93,94,95,97,99,112,117,118]
F009	Type of school	[14,64,65,66,75,78,86,89,91,95,103,109]
F010	Ethnicity	[72,74,77,83,84,86,97,98,112,113,126]

Table 6. Most relevant socioeconomic factors of the UED.

Id	Factor	References
F076	Scholarships	[55,61,72,75,83,93,94,99,104,106,112,120,127]
F077	Works	[54,60,69,71,80,91,94,108,114,125,128]
F078	Family income	[59,80,86,95,97,98,102,125,129]
F079	Income level	[14,58,80,89,96,108,118]
F080	Admission score	[14,39,78,95,98,121,123]
F081	Disability	[69,72,77,88,100,105,114]
F082	Educational level	[54,101,105,108,117,122]
F083	Socioeconomic status	[64,77,83,95,124]
F084	Internet access	[15,76,93,129]
F085	Financial aid	[68,80,83,100,112]

Table 7. Most relevant institutional factors of the UED.

Id	Factor	References
F156	Infrastructure	[80,81,84]
F157	Educational services	[66,83]
F158	Suitable equipment	[80,81]
F159	Place	[78,89]
F160	Institution size	[83,84]
F161	Area	[60]
F162	Geographical area	[78]
F163	Teacher’s commitment to the student	[85]
F164	Classification of the career or institution	[85]
F165	Group class	[82]
F166	School climate assessment scale	[126]
F167	Counselor’s perception of teachers’ expectations	[126]

Table 8. Most relevant personal factors of the UED.

Id	Factor	References
F179	Year of admission	[58,64,74,75,87,90,94,95,120,127]
F180	Motivation	[80,81,82,97]
F181	Extracurricular activities	[72,84,97]
F182	Commitment	[99,124,130]
F183	Class participation	[84,131,132]
F184	Number of voluntary activities	[7,58,88]
F185	Future time perspective	[55,130]
F186	Time to study	[60,80]
F187	Adaptation and coexistence	[55,133]
F188	Leader or president	[60,93]

Table 9. Most relevant academic factors of the UED.

Id	Factor	References
F317	Grades	[39,54,65,74,76,78,82,87,90,106,111,116,117,120,126,127]
F318	General GPA	[7,59,64,67,69,75,83,84,89,90,97,113,120,123,126,129]
F319	Secondary school grade	[39,72,83,86,97,103,109,111,121,129]
F320	Subjects taken	[15,72,73,78,94,98,108,118]
F321	Credits taken	[71,89,100,101,112,120,121]
F322	Attendance	[55,59,61,64,72,73,76,120]
F323	Type of admission	[39,58,78,90,95,107]
F324	Type of school	[66,86,89,99,102,103,108]
F325	School	[39,55,70,78,89,91,117]
F326	Academic year	[39,78,93,100,134]

Table 10. Most relevant developments in ML for the prediction of UED.

Model	Min	Max	References
DT	62	99.53	[7,39,56,58,60,65,66,71,72,75,76,77,78,79,81,82,87,90,91,99,100,101,102,106,108,110,111,114,116,118,119,120,121,124,129,131,132,135,136,137,138,139,140,141,142,143]
LR	50	99.58	[39,53,56,58,60,65,66,67,71,79,86,88,91,92,96,99,102,105,106,108,113,118,120,125,129,131,135,141,142,143,144,145]
RF	56.67	100	[15,39,54,56,58,66,72,74,89,93,100,102,104,106,108,118,120,126,127,129,131,134,135,141,143,146,147]
SVM	36.73	98.8	[39,56,58,66,67,71,72,74,76,91,93,99,100,102,106,108,110,118,120,127,129,131,135,141,143,144,145]
ANN	62	100	[15,39,53,56,60,69,71,72,78,79,87,88,90,93,99,101,104,110,111,118]
NB	55	98.6	[7,39,59,60,66,67,75,90,91,100,102,110,118,129,136]
KNN	62	99	[39,59,66,67,72,75,90,99,100,106,131,132,135]
XGBoost	70	99.28	[15,64,67,71,76,79,108,110,120,132,134,146]
MLP	79	98.60	[7,58,85,102,106,129,132,141]
GB	69	98.68	[39,56,88,127,134]
AdaBoost	80.9	95.51	[53,64,110,132]
GMERF	90.8	93.58	[14,21]
CART	79.7	97.91	[14,59,110]
Ridor	93.4	97.90	[82,116]
ICRM v2	93.7	94.40	[82,116]
ICRM v1	92.1	92.10	[82,116]
GLM	82.3	91.05	[14,110]
ADTree	96.6	98.20	[82,116]
CNN	86.4	94.60	[106,144]
RandomTree	94	96.10	[82,116]
LightGBM	79.8	81.00	[108,120]
Prism	94.4	99.80	[82,116]
REPTree	92.7	96.50	[82,116]
PR	86.3	98.20	[66,92]
SimpleCart	96.4	96.60	[82,116]
K-means	44.29	80.01	[77,133]

Table 11. Preprocessing techniques.

Preprocessing	References
DC	[7,15,39,59,64,66,67,68,72,74,75,78,79,82,85,87,92,97,100,101,102,106,108,111,116,118,119,120,126,127,131,133,134,135,138,139,144,145,146,147,148]
DTA	[39,59,60,65,74,111,119,129,135,142]
FS	[74,75,77,91,92,99,118,126,135]
SV	[64,67,68,76,89,92,131,132]
VC	[67,69,76,89,108,132]
SMOTE	[59,64,75,88,108,114]
Normalization	[69,126,128,144]
EV	[69,76,77,132]
TV	[66,68,131]
DR	[39,91,135]
CC	[74,75,126]
DS	[78,87]
Not performed	[14,21,53,54,56,58,71,80,81,86,93,96,98,104,110,113,121,125,137,140]

Data cleaning (DC); data transformation (DTA); feature selection (FS); standardization of variables (SV); variable coding (VC); synthetic minority oversampling technique (SMOTE); elimination of variables (EV); transformation of variables (TV); dimensionality reduction (DR); categorical coding (CC); data selection (DS).
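
Of the techniques listed in Table 11, SMOTE is the one whose mechanics are least evident from its name: it balances the classes by creating synthetic minority samples (here, dropouts) that lie between existing minority samples. The block below is a minimal standard-library sketch of that idea, not the procedure of any reviewed study. The function name `smote`, its parameters, and the toy GPA and attendance values are all assumptions made for illustration, and the sampling loop is simplified relative to the original algorithm, which generates a fixed number of synthetic points per minority sample.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: build n_new synthetic samples, each interpolated
    between a random minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: [GPA, attendance rate] for dropouts (minority class).
dropouts = [[1.8, 0.55], [2.1, 0.60], [1.5, 0.40], [2.4, 0.70], [2.0, 0.50]]
retained_count = 15  # assumed size of the majority class

# Oversample the minority class until both classes have the same size.
synthetic = smote(dropouts, n_new=retained_count - len(dropouts))
balanced_dropouts = dropouts + synthetic
print(len(balanced_dropouts))
```

Because every synthetic point is an interpolation between two real dropouts, it stays inside the range of observed minority values. In practice, oversampling of this kind is normally applied to the training split only, so that evaluation data contain no synthetic students.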

Table 12. Explanation progress for the UED.

Studies	Case	Model
[150]	The authors developed a black box model to predict and explain student dropout at the Federal Institute of Rio Grande do Norte (IFRN) in Brazil, with an explanatory level of 78% for SHAP and 57% for LIME. Characteristics such as academic performance, parents’ educational level, and family income are determinants of school dropout.	SHAP
[150]		LIME
[151]	UED in a college algebra course at Arizona State University was predicted using RF in conjunction with SHAP, which identified the number of topics mastered, variability in performance, and activity tendencies as carrying the most weight in predicting attrition.	SHAP
[152]	The authors implement a semi-supervised regression algorithm that uses multi-view learning, in conjunction with SHAP, to improve the prediction of undergraduate students’ final grades. The analysis identifies participation in optional contact sessions, grades on written assignments, and student interactions as influential on UED.	SHAP
[110]	SHAP was utilized in an online education environment at a vocational training institute in Algeria to assess the contribution of active minutes per week, days online, and demographic data to the prediction, aiming to build a more accurate and understandable model.	SHAP
[111]	The study applied SHAP and LIME to the RF model at Majmaah University. SHAP identified that the number of hours logged in the last semester, first-year GPA, and length of study are the factors that have the most significant influence on predicting graduation. LIME analyzed an individual case where the main variables influencing not graduating on time were a low first-year GPA, a long duration of studies, and a low number of hours registered in the last semester. This local interpretation enabled a clear understanding of the reasons behind the prediction, providing valuable information for potential personalized interventions.	SHAP
[111]		LIME
[89]	Explainability techniques were applied at the University of Trento (Italy) to analyze the UED prediction models. Grouped Permutation Importance (GPI), used in RF through the analysis of the impact of groups of variables, measures the loss in model performance by randomly permuting the values of each group and determines that among the most relevant variables are cumulative credits and weighted grade point average. Attention Map (AM) applied to FTT visualized the areas of the input vector to which the model paid the most attention, again highlighting academic factors as the most influential. SHAP provided local explanations by calculating the individual contribution of each variable in predicting UED, showing that students with lower academic loads and low achievement were at greater risk. These complementary techniques strengthened the transparency of the models and confirmed the relevance of academic performance in the early identification of university dropouts.	GPI
[89]		AM
[89]		SHAP
[65]	Researchers from Singidunum University and the Information Technology School in Serbia applied SHAP and Shapley additive global importance (SAGE) to XGBoost and AdaBoost. The most influential factors in predicting UED were identified at both the local and global levels, highlighting failed curricular units, tuition payment status, and age at entry.	SHAP
[65]		SAGE
[153]	A neuro-fuzzy model (ANFIS) was applied in Mexican institutions, combining neural networks with fuzzy logic to predict UED risk. Using linguistic rules and visual response surfaces, the study identified how age, income, and internet access influenced risk. In specific cases, high-vulnerability profiles were identified, facilitating the model’s use as a diagnostic tool in marginalized areas.	ANFIS
[68]	Permutation Importance (PEI) was applied at the Finnish University of Applied Sciences to explain the predictions of UED in three models: CatBoost, NN, and LR. PEI measured the drop in model performance (F1-score) by randomly altering the values of each feature, concluding that the three most important variables were cumulative credits, number of failed subjects, and Moodle activity count.	PEI
[69]	In the study conducted at the University of Delaware, an explainable model called PEM-SNN was developed to predict UED. The process involved structuring the neural network so that academic, economic, and sociodemographic factors were grouped into independent hidden layers, generating three representative neurons that combine linearly to estimate dropout risk. The results indicated that academic integration was the most influential factor, clearly differentiating between dropouts and retained students, while economic and social factors had a minor impact. This structured design facilitated an understanding of the model, allowing for the identification not only of who is at risk but also why.	PEM-SNN
[141]	The study, applied at the Polytechnic University of Bucharest (Romania), used SHAP over Automatic Relevance Determination (ARD) to explain the prediction of academic performance. Explainability identified from the fourth week of the semester that the most influential variables were previous academic performance, personality traits such as openness and conscientiousness, and the use of the LMS outside class hours. This facilitated the early detection of students at risk of UED.	SHAP
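
Table 12 describes permutation importance (PEI) as the drop in F1-score after randomly altering the values of a feature, and grouped permutation importance (GPI) as the same measurement applied to groups of variables. The sketch below illustrates both with the standard library only. It is not the code of the cited studies: the rule-based `predict` function stands in for a fitted model, and the feature names, thresholds, and synthetic data are assumptions chosen so that the result is easy to interpret.

```python
import random


def f1_score(y_true, y_pred):
    """F1-score for the positive class (1 = dropout)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def permutation_importance(predict, X, y, groups, n_repeats=20, seed=0):
    """Mean drop in F1 when the columns of each group are shuffled together.

    A one-column group gives plain permutation importance; a multi-column
    group gives the grouped variant.
    """
    rng = random.Random(seed)
    baseline = f1_score(y, [predict(row) for row in X])
    importances = {}
    for name, cols in groups.items():
        drops = []
        for _ in range(n_repeats):
            order = list(range(len(X)))
            rng.shuffle(order)
            X_perm = [list(row) for row in X]
            # Row i receives the group's values from a randomly chosen row.
            for i, j in enumerate(order):
                for c in cols:
                    X_perm[i][c] = X[j][c]
            permuted = f1_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances[name] = sum(drops) / n_repeats
    return baseline, importances


# Hypothetical data: [cumulative credits, failed subjects, unrelated noise].
data_rng = random.Random(42)
X = [[data_rng.randint(0, 60), data_rng.randint(0, 5), data_rng.random()]
     for _ in range(200)]
y = [1 if (credits < 30 or failed >= 3) else 0 for credits, failed, _ in X]


def predict(row):
    """Stand-in for a fitted model; it ignores the noise column."""
    credits, failed, _ = row
    return 1 if (credits < 30 or failed >= 3) else 0


groups = {"credits": [0], "failed": [1], "noise": [2], "academic": [0, 1]}
baseline, importances = permutation_importance(predict, X, y, groups)
for name, drop in importances.items():
    print(f"{name}: mean F1 drop = {drop:.3f}")
```

In this toy setting the noise column receives an importance of zero because the stand-in model never reads it, while shuffling either academic column lowers the F1-score. Shuffling the columns of a group with one shared row order, as done here, keeps the within-group correlations intact.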

Table 13. Most commonly used factors in XAI models.

Factor	SHAP	LIME	GPI	AM	SAGE	PEI	ANFIS	PEM-SNN
Average	X	X	X	X				X
Accumulated credits	X		X	X	X	X		X
Age	X		X		X		X	X
Employment status	X				X		X	X
Attendance/presence	X
Scholarships	X							X
Gender	X							X
Moodle activity	X					X
Personality	X
Study hours	X						X
Assignment/exam mark	X	X				X		X
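
SHAP is marked for every factor in Table 13. The quantity it approximates, the Shapley value of each feature for a single prediction, can be computed exactly for a small model by enumerating all feature coalitions. The sketch below does that under stated assumptions: the linear risk model, its coefficients, the student profile, and the cohort baseline are invented for illustration, and features missing from a coalition are replaced by their baseline values, which is only one of several possible conventions. The enumeration grows exponentially with the number of features, so it is an explanatory device rather than a substitute for the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def risk_model(z):
    """Hypothetical dropout-risk score; the coefficients are invented."""
    gpa, credits, failed = z
    return 0.9 - 0.2 * gpa - 0.005 * credits + 0.1 * failed


student = [1.5, 20, 3]   # assumed profile: GPA, accumulated credits, failed subjects
cohort = [3.0, 45, 0]    # assumed baseline profile

phi = shapley_values(risk_model, student, cohort)
for name, contribution in zip(["GPA", "credits", "failed"], phi):
    print(f"{name}: {contribution:+.3f}")
print(f"sum of contributions: {sum(phi):.3f}")
print(f"risk difference:      {risk_model(student) - risk_model(cohort):.3f}")
```

The contributions add up to the difference between the student's score and the baseline score, which is the additivity property that makes these values usable as local explanations. For a linear model such as this one, each contribution reduces to the coefficient times the feature's deviation from the baseline.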

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Quimiz-Moreira, M.; Delgadillo, R.; Parraga-Alava, J.; Maculan, N.; Mauricio, D. Factors, Prediction, Explainability, and Simulating University Dropout Through Machine Learning: A Systematic Review, 2012–2024. Computation 2025, 13, 198. https://doi.org/10.3390/computation13080198
