All-Year Dropout Prediction Modeling and Analysis for University Students

Abstract: The core of dropout prediction lies in the selection of predictive models and feature tables. Machine learning models have been shown to predict student dropouts accurately. Because students may drop out in any semester, the student history data recorded in the academic management system vary in length. This variation poses a challenge for generating feature tables. Most current studies predict student dropouts in the first academic year and therefore avoid discussing this issue. The central assumption of these studies is that more than 50% of dropouts leave school in the first academic year. However, based on a dataset from a Korean university, we found that dropouts are distributed evenly across all academic years. This result suggests that the characteristics of the Korean students' data in our dataset may differ from those of other developed countries. More specifically, the even distribution of dropouts across academic years indicates the importance of dropout prediction for students in any academic year. Based on this, we explore universal feature tables applicable to dropout prediction for university students in any academic year. We design several feature tables and compare the performance of six machine learning models on these feature tables. We find that the mean value-based feature table exhibits better generalization, and that the model based on the gradient boosting technique performs better than the other models. This result reveals the importance of students' historical information in predicting dropout.


Introduction
Dropping out of school has become a significant global challenge for most universities. According to a recent study by UNESCO [1], the disruption to education caused by COVID-19 has put 24 million learners at risk of not being able to continue their studies. The study notes that higher education is likely to experience the highest dropout rate and a projected 3.5% decline in enrolment, which is expected to result in 7.9 million fewer students. Many students experience various difficulties in school life, which eventually cause them to give up studying. However, these difficulties cannot be solved just by dropping out. After dropout students enter society, the difficulties they faced at school turn into social pressure due to their premature entry into society and academic failure. This may result in employment difficulties, low income, and even crime [2].
The reasons for dropping out can be attributed to school, family, society, and psychology [3][4][5][6][7]. A number of studies have shown that a mono-causal approach cannot accurately explain why students drop out; a multitude of factors must be considered [8,9]. Studies on online courses indicated that time-dependent data on student trajectories could be used to predict dropouts [10][11][12]. Dropout prediction can be transformed into a sequence classification problem by obtaining students' continuous features over time [10,11]. However, offline education rarely uses time series to address dropout because it is challenging to obtain data streams as large as those of online education.
Machine learning has been shown to effectively extract features from data [13]. Using machine learning techniques to identify students at risk of dropping out has also proven effective [14][15][16]. Decision tree-based models such as XGBoost [17] and random forest [18] achieve excellent results with a small number of features and small sample sizes [14,15]. The authors in [14] reported an XGBoost model with 90% accuracy in predicting dropout. In another study [15], the authors reported a 93% accuracy of the random forest model in predicting dropout. In these studies [14][15][16], grades proved to be the most important feature. However, a detailed discussion of the way features are generated is lacking. The impact of feature-generation methods on model performance is unclear.
How features are generated determines the prediction model's performance [19]. The feature table represents how the features are organized and determines the information that the model can obtain. Students in different semesters have data of different lengths; for example, a student in the third semester will have three semesters of scholarship records, while a student in the second semester will have only two. How to process these data effectively to generate feature tables for dropout prediction has not been studied quantitatively. Most current research predicts student dropouts in the first academic year [20][21][22][23][24] and therefore avoids discussing this issue. However, a model's generalizability is limited if it only considers the first or a specific academic year. The main challenge in providing dropout predictions for students across all academic years is therefore the varying length of the data, and the question of how feature tables should be extracted to obtain the best predictions remains open.
Therefore, the purpose of this study is to explore the most applicable feature table generation methods for university dropout prediction in all academic years. In this study, based on data from 60,010 students from a Korean university, we (i) analyze dropout-related factors hidden in student trajectory data, and (ii) design four sets of feature tables to summarize student trajectory data and compare the performance of six machine learning models on these feature tables. The main contributions of this paper are summarized as follows: (1) To the best of our knowledge, this is a pioneering study exploring how feature tables for machine learning are generated when conducting dropout prediction for university students in all academic years; (2) We identified the dropout-related factors hidden in student trajectory data; (3) We explored the temporal distribution of dropout students' characteristics based on the analysis of a large dataset (n = 60,010); (4) We evaluated 6 dropout prediction models on 4 feature tables using the F1 Score, precision, recall, and accuracy.
The rest of the paper is organized as follows: In Section 2, we introduce the contributing factors of student dropout and current research on dropout prediction. In Section 3, we describe the dataset and the data pre-processing methods, followed by the feature generation methods and the machine learning models we used. At the end of Section 3, we introduce the SMOTE method, which addresses data imbalance. In Section 4, we analyze the feature correlations and the temporal distribution of dropout students, and we then evaluate the performance of the proposed models. In Section 5, the conclusions are given.

The Contribution Factors of Student Dropout
Studies over the past two decades have provided important information on the relationship between student performance and student dropout. The majority of the studies reveal a series of common characteristics and center their analyses on the following group of variables: the grade point average (GPA), the number of late arrivals to class, the number of absences from class, the number of talks with the professor, and the student's major and discipline. Most studies regard grades as the most critical factor influencing students' decisions to drop out.
Respondek et al. [25] indicated that the longitudinal linkages between perceived academic control and university grades revealed their influence on subsequent dropouts. Rovira et al. [26] performed similar research to show the relationship between GPA and dropouts and revealed that academic data could be used to predict course grades and dropouts. The authors in [15] analyzed data from 261 students and revealed that grades and attendance are the essential factors that predict student dropout.
Although school performance can be seen as a crucial contributing factor to student dropout, social and family factors also play an essential role. A large volume of published studies describes the role of social integration and family factors in student dropout. A recent study by [27] showed that students who were farther away from family support were 1.32 times more likely to drop out each semester. On the other hand, the financial ability of the student's family and the financial support or scholarship that the university can provide are considered to be key factors affecting the student's decision to drop out [28]. Rising tuition costs may exacerbate the economic impact of student dropouts [29].
Recent research has revealed some personal factors that contribute to dropping out of university. Stinebrickner et al. [30] reported that a student's major influences dropout intentions and that students' excessive optimism about completing a science degree might lead to higher dropout rates. Moreira da Silva et al. [14] reported that age is an essential factor in academic dropout. The authors claim that the successful completion of the course depends on the maturity of the students (age).
These studies together provide important insight into the role of school factors, family factors, social factors, and student personal factors in student dropout. Therefore, this study comprehensively collected the features of students' school performance, scholarship, and personal background from the dataset to predict student dropout. The features used in this study are described in Section 3.1.

Student Dropout Prediction
The key factors associated with student dropouts have been described in the previous section; however, the effective use of these features for dropout prediction requires machine learning techniques. The literature has revealed that the patterns hidden in educational data can be used to predict student dropout using machine learning technologies and that better predictive models can be developed by combining knowledge from other fields [31]. Sivakumar et al. [32] improved the traditional decision tree algorithm using Rényi Entropy, Information Gain, and Association Function. The authors reported that the accuracy of predicting student dropout was 97.50%, significantly higher than the traditional decision tree's 92.50%. The authors in [33] proposed the Bayesian profile regression approach and emphasized the importance of students' performance, motivation, and resilience in identifying students at risk of academic failure.
The key factors determining model performance are features and sample size. Table 1 sums up the models, sample sizes, features, metrics used, and the results obtained in previous studies.
High school grades and performance are often used to predict dropout probability for first-year students. Nagy and Molontay [24] used personal information and high school grades to predict the dropout probability for first-academic-year freshmen. The study used a large dataset containing data on 15,825 students. However, the insufficient accuracy (0.74) indicates that the features or the model need improvement. Cardona and Cudney [34] used high school data and academic grades as features and reported an accuracy of 0.78. This result shows that adding academic grade information can improve the accuracy of the model's prediction. Del Bonifro et al. [16] also demonstrated this result with a larger dataset. Plagge [22] used academic performance to predict first-year student dropout based on 5955 students' data. The author claimed that accuracy is directly related to dataset size. Other studies reveal that better predictive models can be developed by combining knowledge from other features. Kemper et al. [35] reported a 0.89 accuracy of the decision tree based on 3176 students' data. The authors found a strong correlation between the average exam pass rate and dropout, which makes it possible to reduce the model complexity and still obtain good results. Kabathova and Drlik [15] proposed a more fine-grained model based on course-level features. Although the dataset used is very small, a dropout prediction model tailored to individuals may be more accurate. Moreira da Silva et al. [14] found that students' personal details, like age, are also an important factor in student dropout.
However, the complexity of dropout prediction is reflected not only in the selection of features but also in the processing of time-dependent features. As stated in the fourth paragraph of Section 1, there have been no quantitative studies on how to handle student data of different lengths in different semesters. Some studies [14,34,35] do not use features that vary across academic years or do not discuss how data from different years are treated. Other studies [15,16] used data from the first academic year or a specific academic year to predict dropout and therefore do not discuss this issue. In similar areas of research, for example, predicting online course dropout, the impact of time-dependent features on dropout has been identified [10,11,36]. However, dropout prediction methods based on online courses are not suitable for offline dropout prediction due to different data structures. Therefore, it is necessary to quantitatively study the processing methods of time-related features.

Data Description
The sample group consists of 60,010 students enrolled in a major university in South Korea from 2010 to 2021. The student data contain each student's attendance history, grades for all courses, scholarships for each semester, family income, gender, age, number of leaves of absence, and tuition payment history. All student data are anonymized. Table 2 shows the 23 features used in this study. We use the cohort dropout rate method [37] to calculate the dropout rate of students in the dataset. Table 3 presents the summary statistics for the student dropout rate from 2010 to 2021, calculated according to the cohort method. The university had a total of 60,010 students during these 12 years. Among them, 29,099 students have graduated, 6963 students have dropped out of school, and the remaining 23,948 students are in school or suspended from school. The significant drop in the dropout rate after 2015 occurs because some students in these cohorts have not yet graduated. It usually takes six years for male students in South Korea to graduate from university due to South Korea's compulsory military service system.

1. Mean value-based feature extraction approach. This method calculates the longitudinal average of each feature in the student data. For example, if a student in the 4th semester has four records of the number of scholarship awards, the mean value-based feature extraction approach calculates the mean of these four scholarship awards;
2. Median value-based feature extraction approach. This method calculates the longitudinal median of each feature in the student data;
3. Last semester data-based feature extraction approach. This method considers only the last valid semester data in the student data;
4. First-semester data-based feature extraction approach. This method considers only the first valid semester data in the student data.
Some features are seen as fixed attributes of students, so they are fixed in the feature table. These features are (1) Professional classification, (2) Sex, (3) Birth, and (4) Access year.
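The four extraction approaches above can be illustrated with a minimal sketch. It assumes each student's history is a list of per-semester records, with `None` marking a semester without valid data; the record layout, the field names, and the helper `build_feature_row` are hypothetical and are not taken from the study's implementation.

```python
from statistics import mean, median

# Hypothetical per-semester records for one student, ordered by semester.
# None marks a semester without valid data (an assumption for this sketch).
records = [
    {"gpa": 3.5, "scholarships": 1, "absences": 2},
    {"gpa": 3.0, "scholarships": 0, "absences": 5},
    None,
    {"gpa": 2.0, "scholarships": 0, "absences": 9},
]

# Fixed attributes are copied into the row unchanged.
fixed = {"sex": "F", "access_year": 2018}


def build_feature_row(records, fixed, method):
    """Collapse a variable-length semester history into one fixed-length row."""
    valid = [r for r in records if r is not None]
    if not valid:
        raise ValueError("student has no valid semester data")
    if method == "mean":
        row = {k: mean(r[k] for r in valid) for k in valid[0]}
    elif method == "median":
        row = {k: median(r[k] for r in valid) for k in valid[0]}
    elif method == "last":
        row = dict(valid[-1])
    elif method == "first":
        row = dict(valid[0])
    else:
        raise ValueError(f"unknown method: {method}")
    row.update(fixed)
    return row


for method in ("mean", "median", "last", "first"):
    print(method, build_feature_row(records, fixed, method))
```

Whatever the number of semesters a student has completed, every method returns a row with the same columns, which is what allows one model to serve students in any academic year.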

Machine Learning Models Used
This study uses tree-based models, kernel-based models, and linear models for student dropout prediction, which is a binary classification problem: whether or not a student will drop out.
Tree-based models use if-then-else rules to solve problems. All tree-based models can be used for classification (predicting categorical values) as well as regression (predicting numerical values). Kernel-based models transform nonlinear problems into linear problems in a feature space. Linear models can be generalized as functions that make predictions from linear combinations of features. In this paper, six commonly used classification models have been used:

• Four tree-based models: Decision Tree [38], which draws the different solutions of the decision as branches of the tree and uses the branching and pruning method to find the optimal solution; Random Forest [18], which uses a bootstrap aggregation method that combines the predictions of many trees; LightGBM [39], which uses histogram-based algorithms that bucket continuous feature (attribute) values into discrete bins; and XGBoost [17], which provides parallel tree boosting to solve problems quickly;
• One linear model: Logistic Regression [40], which is a linear model for classification often used as a baseline model;
• One kernel-based model: Support Vector Machine [41], which transforms a linearly inseparable problem in the original feature space into a linearly separable problem in a high-dimensional feature space.
Since student dropout prediction is a binary classification problem, this study used four performance metrics to evaluate our models: accuracy, precision, recall, and F1 Score, all based on the confusion matrix shown in Table 4. In the following, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
The accuracy is the percentage of the total sample that the model correctly predicted, defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The recall (or true positive rate) measures the ability of the model to detect positive samples, defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The precision (or positive predictive value) is the ratio between the number of samples correctly classified as positive and the total number of samples classified as positive (correct or incorrect), defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
The F1 score comprehensively evaluates the classification performance of the classifier with precision and recall, defined as follows:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
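As a small worked example of these four metrics, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual dropouts (positives) among 1000 students.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(metrics)
```

With imbalanced classes such as these, accuracy stays high even when a fifth of the dropouts are missed, which is why precision, recall, and F1 are reported alongside it.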

SMOTE
A significant challenge faced by dropout early warning systems (DEWS) is the data imbalance problem. Students who drop out make up only about 15% of the total number of students (according to Section 4.2), which can cause the model to over-fit non-dropout students and fail to accurately identify students who would drop out. SMOTE (Synthetic Minority Over-sampling Technique) has been used as a data balancing algorithm for student dropout prediction [21,42,43]. SMOTE inserts artificially synthesized minority samples between a minority sample and its nearest minority-class neighbors, thereby increasing the number of minority class samples to balance the dataset. In our study, only the training dataset was rebalanced with the SMOTE algorithm, to 50% non-dropout students and 50% dropout students; the test dataset was left unchanged.
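The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration using only the standard library, not the implementation used in this study; the function name `smote_sketch`, the neighbor count `k`, and the sample points are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbor))
        )
    return synthetic


# Hypothetical training data: 3 dropout (minority) and 5 non-dropout samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
n_majority = 5

# Generate enough synthetic dropouts for a 50/50 training set.
new_points = smote_sketch(minority, n_new=n_majority - len(minority))
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region already occupied by dropout students. Applying this only to the training split keeps the test set representative of the real class ratio.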

Feature Analysis
Figure 1 shows the heat maps of the features generated by the four feature tables. What stands out in Figure 1 is that the GPA, the number of absences, and Diff Credits have a high correlation with dropout in the four heat maps. This reveals that school performance may play a significant role in student dropout. Since the correlation coefficients of the number of scholarships and the tuition fee are also relatively high, it could be considered that the financial situation of students also affects their decision to drop out. In addition, the number of absences and GPA are highly correlated, which suggests that students who are absent more often have relatively poorer school performance.
Although the four feature tables generate similar heat maps, some differences are worth mentioning. In Figure 1C, there is a negative correlation (−0.2) between military service and dropout, while positive correlations (0.2) appear in the rest of the graphs. The heat map in Figure 1C represents the generation method of the feature table based on the student data of the first semester. This means that in this feature table, only the data for each student's first semester are used. If these students have served in the military, they did so before formally enrolling in the university. Therefore, it can be argued that these students' university studies were not "interrupted" by military service, and thus they were less likely to drop out (military service is negatively associated with dropout).

Kaplan-Meier Curve for Student Dropout
The Kaplan-Meier curve [44] measures the nonparametric empirical distribution of the occurrence of events in ordered discrete occurrence times. Figure 2 represents the Kaplan-Meier curve for student dropout. The x-axis represents the survival time of dropout students, and the y-axis represents the remaining proportion or survival probability of dropout students. The proportion of dropout students still enrolled is one at the time of enrollment and gradually decreases to zero as these students leave. Therefore, the slope of the curve indicates the rate at which dropout students leave.
In detail, the curve shows that about 26% of dropout students leave in the first 12 months. After that, about 12% leave in the next 12 months. A further 15% leave between 24 and 36 months, 17% between 36 and 48 months, and 18% between 48 and 60 months. The remaining 12% leave after 60 months.
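To make the link between these interval shares and the curve concrete, the following sketch implements the Kaplan-Meier product-limit estimate and applies it to 100 hypothetical dropout students distributed according to the percentages above. Placing each dropout at the end of its 12-month interval, and using 72 months for the "after 60 months" group, are simplifying assumptions; the function name `kaplan_meier` is hypothetical.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate; returns [(time, survival probability)]."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Subjects whose event (dropout) is observed at time t.
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        # All subjects leaving the risk set at time t (events and censored).
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# 100 hypothetical dropout students following the reported shares.
shares = {12: 26, 24: 12, 36: 15, 48: 17, 60: 18, 72: 12}
durations = [month for month, n in shares.items() for _ in range(n)]
observed = [True] * len(durations)  # every student here did drop out

for month, s in kaplan_meier(durations, observed):
    print(f"{month:>2} months: {s:.2f} of dropout students remain")
```

Since the sample contains only students who eventually dropped out, there is no censoring and the estimate reduces to one minus the cumulative share of dropouts, ending at zero.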
In short, the distribution of dropout probability is relatively uniform, although the dropout probability in the first school year is somewhat higher than in other school years. This result is inconsistent with previous studies claiming that nearly 50% of all dropouts left college between 6 and 18 months [45][46][47]. This result demonstrates the importance of dropout prediction for students in all semesters rather than for a specific semester [15,16,21]. Therefore, we extract features from student data of different lengths to predict dropouts in all academic years and quantitatively investigate the impact of different feature table generation methods on the performance of the prediction models. The results are reported in Section 4.3.

Model Test Results
There are a total of 60,010 student records in our dataset. Among them, the number of graduates is 29,099, the number of dropouts is 6963, and the remaining 23,948 students are in school or suspended from school. We divide the dataset as follows:

• Training set: 70% of all graduates and dropouts, a total of 25,244 pieces of student data;
• Test set: 30% of all graduates and dropouts, a total of 10,818 pieces of student data;
• Prediction set: a total of 23,948 students in school or suspended from school, used to predict possible future dropouts.
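The split sizes above can be checked with a short sketch. Rounding the 70% share up reproduces the reported counts, but the rounding rule, the random shuffling, and the helper `split_indices` are assumptions; the text does not state how students were assigned to each set.

```python
import random

graduates, dropouts, enrolled = 29_099, 6_963, 23_948
labelled = graduates + dropouts  # students with a known outcome

# Ceiling of 70% using integer arithmetic (assumed rounding rule).
n_train = -(-labelled * 7 // 10)
n_test = labelled - n_train


def split_indices(n, n_train, seed=0):
    """Shuffle record indices and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


train_idx, test_idx = split_indices(labelled, n_train)
print(labelled, n_train, n_test, enrolled)
```

The 23,948 students still in school carry no outcome label, so they are excluded from both training and testing and are only scored by the trained model.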
Table 5 shows the test results obtained by Logistic Regression, Decision Tree, Random Forest, LightGBM, Support Vector Machines, and XGBoost algorithms, reporting on the most popular indicators of success: accuracy, precision, recall, and F1 Score.
As shown in Table 5, the LightGBM model in the mean value-based feature table obtained the highest F1 score and accuracy on the test dataset with 79% and 94%, respectively. The precision value is 81%, which is only 2% different from the highest value of 79%. More specifically, the LightGBM model has relatively balanced precision and recall values, which means that the model can accurately distinguish dropout students from non-dropout students without much bias in the case of unbalanced samples. The XGBoost model in the median value-based feature table obtained the best precision and the second-best F1 score and accuracy. Figure 3 reveals the performance of the feature tables and models in more detail. LightGBM and XGBoost have similar performance, while Logistic Regression, Support Vector Machine, and Decision Tree are insufficient (Figure 3A). Because both the LightGBM and XGBoost models are based on the gradient boosting technique, the result reveals the superiority of gradient boosting in predicting student dropout. In Figure 3B, the mean value-based feature table has the highest F1 Score (70%), precision (66%), and recall (76%). In contrast, the average F1 Score, precision, and recall of the first- and last-semester-based feature tables are significantly lower than those of the mean and median value-based feature tables. This demonstrates that using features that include historical student data yields better predictive performance than using data from a particular point in time (semester). This result is also intuitive; for example, if a student in the third semester had excellent grades in the first two semesters but declined in the third semester, it would be inaccurate to consider only the first two semesters or only the third semester when predicting dropout.
Figure 4 presents the feature importance of the LightGBM model trained on the mean value-based feature table. The three most important features are (1) tuition fee, (2) the average number of scholarships per semester, and (3) entry year. This suggests that, for the Korean university students included in this study, the economic aspect may be an important factor influencing whether they drop out of school. Tuition and scholarships reflect the financial pressure borne by students' families. This result is contrary to the study by [48], which reported that grades are the most important influencing factor. We believe this is due to the high tuition fees of private universities in South Korea, making it easier for students who cannot get scholarships to drop out. Our findings suggest that increasing scholarships and reducing tuition fees may be effective intervention measures to reduce the dropout rate.
Compared to previous studies [14,16,21,42,44] on dropout prediction centered on student achievement, the findings of this study reveal the importance of features that are unrelated or not directly related to grades in predicting dropout. As shown in Figure 4, among the top ten features by feature importance, there are 6 features that are unrelated or not directly related to grades (tuition fee, access year, birth, professional classification, absence, and living area). Since current research on dropout prediction is mainly centered on academic performance, these features may be overlooked, resulting in a portion of students being left out of the dropout prediction system. Therefore, incorporating these features that are not, or not directly, related to grades into the dropout prediction system may improve the system's performance.
We have integrated the LightGBM model based on the mean value-based feature table with the university's academic management system to predict dropouts.The proposed system has been put into operation.To interpret the model results for professors and students, we generate the dropout risk report, as shown in Figure 5. Compared to the average of the overall students, this student with a 92% probability of dropping out has a low average number of scholarship awards, while his average number of absences is high.This also illustrates that the mean value-based feature generation methods can identify students at risk of dropping out of school.
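The comparison behind such a risk report can be sketched in a few lines of Python. This is only an illustration and not the deployed system: the function name `risk_report`, the feature names, and all values are hypothetical, and the dropout probability is assumed to come from an already trained model.

```python
from statistics import mean

def risk_report(student_features, population, probability):
    """Compare one student's mean-based features with the overall average.

    student_features: dict of feature name -> value (hypothetical names)
    population: list of such dicts, one per student
    probability: dropout probability from an already trained model (assumed)
    """
    lines = [f"Predicted dropout probability: {probability:.0%}"]
    for name, value in student_features.items():
        overall = mean(row[name] for row in population)
        if value > overall:
            direction = "above"
        elif value < overall:
            direction = "below"
        else:
            direction = "equal to"
        lines.append(f"{name}: {value:.2f} ({direction} the overall average {overall:.2f})")
    return lines

# Hypothetical values, for illustration only.
population = [
    {"scholarships": 1.0, "absences": 2.0},
    {"scholarships": 2.0, "absences": 4.0},
]
student = {"scholarships": 0.5, "absences": 6.0}
report = risk_report(student, population, 0.92)
print("\n".join(report))
```

Reporting each feature next to the overall average keeps the output readable for professors and students, who need to see which factors stand out rather than the raw model score alone.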

Conclusions
This study explored the most applicable feature table generation methods for university dropout prediction. We analyzed the factors associated with dropout in the student history data. We then designed four generation methods for feature tables and compared the performance of six machine learning models on these feature tables. Our results revealed that dropouts are distributed evenly across all academic years. This demonstrates the importance of dropout prediction for students in all semesters rather than in one specific semester.
Furthermore, our comparative study of feature table generation methods revealed that the mean value-based feature generation method outperforms the other methods when predicting dropout for university students in all academic years. This provides a basis for future research on predicting university dropout across academic years. In addition, one of the strengths of our study is the completeness of the dataset. Compared to previous dropout studies with small samples [7,14], the complete data (n = 60,010) from one university and the detailed description of the feature generation methods make the model results reliable. Some limitations of this study are worth noting. To compare the effects of different feature table generation methods on the model results, we did not perform feature combinations. Future research can therefore consider feature combinations based on the mean value-based feature table to obtain higher prediction performance.

Figure 1. The heat map based on (A) the mean value-based feature table, (B) the median value-based feature table, (C) the first-semester data-based feature table, and (D) the final-semester data-based feature table.

Figure 3. (A) Average performance of the six models. (B) Average performance of the four feature tables.

Figure 4. Feature importance of the LightGBM model trained on the mean value-based feature table.

Author Contributions: Conceptualization, Formal analysis, Investigation, Methodology, Writing - original draft, Writing - review & editing, Z.S.; Formal analysis, Writing - review & editing, S.-H.S.; Formal analysis, Investigation, D.-M.P.; Supervision, Validation, Data acquisition, Revising - review & editing, B.-K.P. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by the Dong-A University research fund.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the

Figure 5. The Student Dropout Risk Report.

Table 1. Model, Sample Size, Features, and Metrics Used and Results Obtained.

Table 2. Feature Table from Student Data.

Table 3. Dropout Rate Calculated according to the Cohort Method.

Irrelevant, noisy, and inconsistent data are removed in the data preprocessing stage. Null values are filled using the median value. Because students in different semesters have different historical data lengths, we propose the following four feature extraction methods to generate feature tables in a uniform format.
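The four feature tables compared in this study (mean value-based, median value-based, first-semester, and final-semester) all reduce a variable-length semester history to one fixed-length row per student. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the data layout, the feature names, and the functions `fill_nulls_with_median` and `build_feature_table` are hypothetical, and the median used for null filling is assumed to be taken per feature over all semester records.

```python
from statistics import mean, median

def fill_nulls_with_median(history):
    """Replace None values with the per-feature median over all records.

    Assumes every feature has at least one observed (non-null) value.
    """
    observed = {}
    for semesters in history.values():
        for record in semesters:
            for name, value in record.items():
                if value is not None:
                    observed.setdefault(name, []).append(value)
    medians = {name: median(values) for name, values in observed.items()}
    return {
        student: [
            {name: medians[name] if value is None else value
             for name, value in record.items()}
            for record in semesters
        ]
        for student, semesters in history.items()
    }

def build_feature_table(history, method):
    """Collapse each student's semester history into one row.

    method: "mean", "median", "first" (first semester) or "final" (last semester).
    """
    reducers = {
        "mean": mean,
        "median": median,
        "first": lambda values: values[0],
        "final": lambda values: values[-1],
    }
    reduce_values = reducers[method]
    table = {}
    for student, semesters in history.items():
        names = semesters[0].keys()
        table[student] = {
            name: reduce_values([record[name] for record in semesters])
            for name in names
        }
    return table

# Hypothetical histories of different lengths, in chronological order.
history = {
    "s1": [{"gpa": 3.0, "absences": 2}, {"gpa": 3.4, "absences": 0}, {"gpa": 3.8, "absences": 1}],
    "s2": [{"gpa": 2.0, "absences": None}],
}
clean = fill_nulls_with_median(history)
mean_table = build_feature_table(clean, "mean")
final_table = build_feature_table(clean, "final")
```

Whatever the history length, each student ends up with the same set of columns, which is what allows a single model to score students in any academic year.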

Table 4. The Confusion Matrix.

Table 5. F1 Score, Precision, Recall, and Accuracy for the dropout class in the test dataset.