Integrated Survival Analysis and Frequent Pattern Mining for Course Failure-Based Prediction of Student Dropout

A data-driven method is proposed to identify frequent sets of course failures that students should avoid in order to minimize the likelihood of dropping out of their university training. The overall probability distribution of dropout is determined by survival analysis. This result can only describe the mean dropout rate of the undergraduates. However, depending on which courses are failed, the chances of dropout can vary widely, so the traditional survival model should be extended with event analysis. The study paths of students are represented as events related to the lack of completion of the required subjects in every semester. Frequent patterns of backlogs are discovered by mining frequent sets of these events. The prediction of dropout is personalised by classifying the success of the transitions between the semesters. Based on the explored frequent itemsets and classifiers, association rules are formed that provide estimates of the success of the continuation of the studies in the form of confidence metrics. The results can be used to identify critical study paths and courses. Furthermore, based on the patterns of individual uncompleted subjects, the method is suitable for predicting the chance of continuation in every semester. The analysis of the critical study paths can be used to design personalised actions that minimize the risk of dropout, or to redesign the curriculum with the aim of reducing the dropout rate. The applicability of the method is demonstrated based on the analysis of the progress of chemical engineering students at the University of Pannonia in Hungary. The method is also suitable for the examination of more general problems that assume the occurrence of a set of events whose combinations may trigger a set of critical events.


Introduction
Student dropout in higher education is a worldwide problem that deserves attention. The problem is especially significant in the United States, where one third of students give up their studies before the second year, causing significant financial damage to the government [1]. A significant proportion of students do not complete their studies in Latin American countries either, especially in Chile [2]. Another issue is that dropout is significant at different levels of education, so it also appears among students pursuing doctoral studies [3]. Therefore, the analysis of student dropout is an important task from an international point of view. This is further confirmed by the fact that the prestige of educational institutions lies in the success of their students, and the successful completion of the training they have started is of crucial importance to the students as well.
Educational data mining focuses on analysing the impact of various factors in this area. The impact of artificial intelligence on education has already been reviewed [4]. The study found that artificial intelligence has been adopted and used in various areas of educational institutions. These areas include administrative functions, grading assignments, [...] lead to student failure. This method is thus able to predict dropouts up to several semesters ahead and to show critical subjects and critical subject sequences based on the requirements of a subject. Association rule mining rests on easily understandable probability theory, and it seems to be analogous to survival analysis [25]. The initial idea was to represent the uncompleted subjects as sequences over the total study time, but it turned out that this requires huge computing capacity. The proposed methodology, which combines association rules with the Kaplan-Meier estimate, has been compared with the Naive Bayes classification method.
Association rule mining has already been used to examine dropout. One study examined failure using several methods, including association rule mining [26]. The difference from the current study is that its predictions are based on factors influencing the student: family problems, health problems, personal problems and institutional problems. Only students who failed were observed, and dropout was attributed to the individual influencing factors. In contrast, this study also considers the proportion of students who graduated and thus gives information about success. Machine learning methods for predicting dropout in the first year based on student-specific features such as gender and high school ID were also compared [27]. That study also covered admission tests, which are not considered here; if such a test is failed, the student has to attend and pass further specific courses. The results show that the prediction is more accurate if the proper features are selected.
The novelties of the paper are: (i) it uses a different aspect to predict dropout, namely the uncompleted subjects; (ii) it integrates survival analysis and machine learning methods to explore the interrelations and correlations more deeply; (iii) the methodology is able to predict dropout over a long time range. The method was developed based on the data of approximately 350 students of the chemical engineering undergraduate program of the University of Pannonia in Hungary.

Integration of Survival Analysis and Frequent Itemset Mining
This section presents the developed methodology in a generalized form as it is suitable for the examination of more general problems assuming the occurrence of a set of events whose combinations may trigger a set of critical events.
The methodology starts with the integration of the various data sources needed for the identification of the triggering and consequential events, whose probabilities are considered as competing risks, in order to obtain, by means of survival analysis, a general model that is valid for the whole dataset (population). As the obtained model cannot provide predictions or risk assessments for a specific individual, in-depth event analysis is performed based on the frequent itemsets of the triggering events.
Among the large number of itemsets generated by frequent itemset mining algorithms, only a few are informative regarding their ability to predict the consequential events. The applicable itemsets are filtered by forming association rules that describe how a specific consequential event is caused by certain sets of triggering events.
The probability of the consequential events is calculated based on the integrated analysis of the identified association rules. By aggregating the calculated probabilities for the whole population, the resultant estimate is suitable for the validation of the model based on the results of the survival analysis.
The following subsections provide the details of the method.

Empirical Survival Function of the Occurrence Times
The proposed method studies the nonparametric empirical distribution of the occurrence of events at ordered discrete occurrence times $t_0 = 0, t_1, \dots, t_f, \dots, t_n$. The survival function $S(t_f)$ represents the probability that an event occurs later than $t_f$:
$$S(t_f) = P(T > t_f) \tag{1}$$
Let $q(t_f) = P(T > t_f \mid T > t_{f-1})$ be the conditional probability that the event occurs later than $t_f$, provided that it has not yet occurred until the time $t_{f-1}$. This probability gives a recursive description of the survival function:
$$S(t_f) = S(t_{f-1})\, q(t_f) = \prod_{k=1}^{f} q(t_k), \qquad S(t_0) = 1 \tag{2}$$
The value of $q(t_k)$ can be estimated based on the number $m_k$ of events that occurred at time $t_k$ and the number $n_k$ of cases in which the event has not yet occurred until time $t_{k-1}$ (which means that $n_k$ represents the size of the risk set at time $t_k$):
$$\hat{q}(t_k) = 1 - \frac{m_k}{n_k} \tag{3}$$
Substituting Equation (3) into Equation (2), the Kaplan-Meier empirical distribution of the occurrence of the events can be obtained [28]:
$$\hat{S}(t_f) = \prod_{k=1}^{f} \left(1 - \frac{m_k}{n_k}\right) \tag{4}$$
An example of the resulting distribution function is shown in Figure 1. In this example, the probability that the event (e.g., the dropout) will occur after the second time instance (e.g., semester) is 0.8, while the probability that the event will occur later than the sixth time instance is 0.35.
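The product-limit estimate described above can be illustrated with a minimal Python sketch. The function name `kaplan_meier` and the semester counts are hypothetical and are not taken from the analysed dataset; the treatment of an empty risk set (the curve is simply carried forward) is also an assumption of this sketch.

```python
from typing import List, Sequence


def kaplan_meier(events: Sequence[int], at_risk: Sequence[int]) -> List[float]:
    """Return [S(t_0), S(t_1), ..., S(t_n)] with S(t_0) = 1.

    events[k-1]  is m_k, the number of events observed at time t_k
    at_risk[k-1] is n_k, the size of the risk set at time t_k
    """
    if len(events) != len(at_risk):
        raise ValueError("events and at_risk must have the same length")
    survival = [1.0]
    for m_k, n_k in zip(events, at_risk):
        # q(t_k) = 1 - m_k / n_k; an empty risk set leaves the curve unchanged
        q_k = 1.0 - m_k / n_k if n_k > 0 else 1.0
        survival.append(survival[-1] * q_k)
    return survival


# Hypothetical counts for three semesters (illustration only).
m = [20, 16, 8]
n = [100, 80, 64]
curve = kaplan_meier(m, n)
print([round(s, 3) for s in curve])
```

With these illustrative counts, the conditional survival probabilities are 0.8, 0.8 and 0.875, so the curve decreases step by step as their running product.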

Handling Competing Risks in Survival Analysis
The presented Kaplan-Meier model cannot be directly applied when there is more than one consequential event; e.g., besides dropping out, students can also successfully graduate (and graduated students cannot be expelled by the university). Our key idea is that the probabilities of occurrence of these consequential events should be handled as competing risks. Depending on what type of competing risks exist and which survival analysis procedure is used, there are several methods to handle competing risks. In the case of the Kaplan-Meier survival analysis, the calculation of the Cumulative Incidence Curves is the obvious way of extending the method to handle competing risks:
$$S(t_f) = \prod_{k=1}^{f} \left(1 - \frac{\sum_{c=1}^{C} m^c_k}{n_k}\right) \tag{5}$$
where $m^c_k$ is the number of occurrences of the $c$th competing risk ($c = 1, \dots, C$) at time $t_k$, and $C$ represents the number of competing risks.
The hazard function of the $c$th examined risk, $h_c(t_k)$, represents the probability of the occurrence of the $c$th consequential event:
$$h_c(t_k) = \frac{m^c_k}{n_k} \tag{6}$$
The Incidence Curve $I_c(t_k)$ can be calculated from the survival function and the hazard function as:
$$I_c(t_k) = S(t_{k-1})\, h_c(t_k) \tag{7}$$
By aggregating the values of the Incidence Curve, we obtain the Cumulative Incidence Curve $CIC_c(t_f)$ [28]:
$$CIC_c(t_f) = \sum_{k=1}^{f} I_c(t_k) = \sum_{k=1}^{f} S(t_{k-1})\, h_c(t_k) \tag{8}$$
One of the significant advantages of the presented empirical distribution is that it can be easily applied even if the problem also requires the management of competing risks. However, the disadvantage of this method is that the whole dataset is treated as one, and no additional information on individual cases, such as the impact of different uncompleted subjects, is provided. For applications where there may be a variety of causes of an event, it is advisable to explore the impact of the sets of possible causes and their contribution to the risk of a consequential event. The following subsection presents how such frequent itemsets of events and association rules can be explored.
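The calculation of the hazard, the Incidence Curve and the Cumulative Incidence Curve can be sketched in Python as follows. This is a minimal illustration under the assumption that the overall survival function is reduced by the events of all competing risks together; the function name `cumulative_incidence`, the risk labels and the counts are hypothetical.

```python
from typing import Dict, List, Sequence


def cumulative_incidence(
    events_by_risk: Dict[str, Sequence[int]], at_risk: Sequence[int]
) -> Dict[str, List[float]]:
    """Cumulative Incidence Curve of each competing risk.

    events_by_risk[c][k-1] is m^c_k, the events of risk c at time t_k
    at_risk[k-1]           is n_k, the size of the risk set at time t_k
    """
    cic: Dict[str, List[float]] = {c: [] for c in events_by_risk}
    running = {c: 0.0 for c in events_by_risk}
    overall_survival = 1.0  # S(t_0) = 1
    for k, n_k in enumerate(at_risk):
        total_events = 0
        for c, counts in events_by_risk.items():
            hazard = counts[k] / n_k if n_k > 0 else 0.0  # h_c(t_k)
            running[c] += overall_survival * hazard        # adds I_c(t_k)
            cic[c].append(running[c])
            total_events += counts[k]
        # S(t_k) is reduced by the events of all competing risks together
        if n_k > 0:
            overall_survival *= 1.0 - total_events / n_k
    return cic


# Hypothetical counts: dropouts and graduations over three time instances.
n = [100, 80, 60]
cic = cumulative_incidence({"fail": [20, 15, 6], "grad": [0, 5, 30]}, n)
print({c: [round(v, 3) for v in curve] for c, curve in cic.items()})
```

In this illustrative example, the final values of the two curves and the remaining survival probability add up to one, which is a useful sanity check for any implementation.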

Frequent Event Pattern Mining for Survival Analysis
The formalisation of the frequent itemset mining-based event analysis is based on the following definitions.
Similarly to the survival analysis, the studied events can occur at discrete time instances $t_0 = 0, t_1, \dots, t_f, \dots, t_n$. Let $e^i_k$ denote the occurrence of the $i$th event at time $t_k$. We study a set of $j = 1, \dots, n_k$ cases at time $t_k$, so when the $i$th event occurs at time $t_k$ in the $j$th case, it is denoted as $e^i_k(j)$. The set $X^j_k = \{e^i_k(j), \dots, e^l_k(j)\}$ contains the events that occur at time $t_k$ (the $k$th time period) in case $j$, while the set of these sets, $X_k = \{X^1_k, \dots, X^{n_k}_k\}$, represents all the events at time $t_k$. In our analysis, a case is the set of uncompleted subjects of a specific student or, in more general terms, the event trace in process mining.
The purpose of frequent itemset mining is to reveal a set of informative event patterns $\varphi^p_k$ occurring in $X_k$, where $p = 1, \dots, P$ is the index of the mined patterns. A pattern is supported by the case $X^j_k$ when $\varphi^p_k \subseteq X^j_k$. The importance of a pattern is measured by its support $supp(\varphi^p_k)$, which is the relative number of cases in which the pattern $\varphi^p_k$ occurs:
$$supp(\varphi^p_k) = \frac{\left|\{\, j : \varphi^p_k \subseteq X^j_k \,\}\right|}{n_k}$$
The pattern $\varphi^p_k$ is frequent if its support reaches a specified threshold: $supp(\varphi^p_k) \geq minsup$. Frequent pattern mining algorithms aim to find all the frequent patterns. Therefore, the higher the $minsup$ value is, the smaller the number of generated patterns, which improves the interpretability of the model, while at a smaller $minsup$ value, more itemsets are extracted that represent more specific cases, and a more accurate, yet less interpretable, model is produced.
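We are looking for frequent patterns that can be split into a set of triggering events and a consequential event as follows:
$$\varphi^p_k = \varphi^{p*}_k \cup \{e^c_k\}, \qquad \varphi^{p*}_k \rightarrow e^c_k$$
where the antecedent part of the association rule is the set $\varphi^{p*}_k$ of triggering events and the consequential part $e^c_k$ is the triggered consequential event.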
The confidence of the association rule $\varphi^{p*}_k \rightarrow e^c_k$ is the conditional probability $P(e^c_k \mid \varphi^{p*}_k)$, which describes the probability that the set $\varphi^{p*}_k$ of triggering events causes the consequential event $e^c_k$:
$$conf(\varphi^{p*}_k \rightarrow e^c_k) = P(e^c_k \mid \varphi^{p*}_k) = \frac{supp(\varphi^{p*}_k \cup \{e^c_k\})}{supp(\varphi^{p*}_k)}$$
Based on the support and confidence measures of the association rules, the probability of the consequential events can be calculated as presented in the next subsection.
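The support and confidence measures can be illustrated with a short Python sketch. It assumes that each case is stored as a set of event labels and that the consequential event is stored in the same set; the labels (`math1`, `phys1`, `fail`) and the five cases are hypothetical.

```python
from typing import FrozenSet, Iterable, List

Case = FrozenSet[str]


def support(pattern: Iterable[str], cases: List[Case]) -> float:
    """Relative number of cases that contain every event of the pattern."""
    pattern = frozenset(pattern)
    if not cases:
        return 0.0
    return sum(1 for case in cases if pattern <= case) / len(cases)


def confidence(antecedent: Iterable[str], consequent: str, cases: List[Case]) -> float:
    """Estimate of P(consequent | antecedent) as a ratio of two supports."""
    antecedent = frozenset(antecedent)
    supp_antecedent = support(antecedent, cases)
    if supp_antecedent == 0.0:
        return 0.0  # the rule never applies, so no estimate is available
    return support(antecedent | {consequent}, cases) / supp_antecedent


# Hypothetical cases of one semester: uncompleted subjects plus the outcome.
cases = [
    frozenset({"math1", "phys1", "fail"}),
    frozenset({"math1", "fail"}),
    frozenset({"math1"}),
    frozenset({"phys1"}),
    frozenset(),
]
supp_math = support({"math1"}, cases)
conf_math = confidence({"math1"}, "fail", cases)
print(round(supp_math, 3), round(conf_math, 3))
```

Here three of the five illustrative cases contain `math1`, and two of those three also contain `fail`, so the rule `math1 -> fail` has a support of 0.6 for its antecedent and a confidence of two thirds.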

Integrated Analysis of the Association Rules
As in most cases several frequent itemsets $\varphi^{p*}_k$ are generated, the proper aggregation of the corresponding association rules is a key step of the analysis for calculating the probability of certain consequential events.
Naturally, each rule $\varphi^{p*}_k \rightarrow e^c_k$ associates a different probability (risk) with the occurrence of the event $e^c_k$. A logical conclusion is that the rule with the highest probability will have the greatest impact on the fate of a specific student; therefore, the rule with the highest probability, $P(e^c_k(j))$, is considered in the case of each student:
$$P(e^c_k(j)) = \max_{p \,:\, \varphi^{p*}_k \subseteq X^j_k} conf(\varphi^{p*}_k \rightarrow e^c_k)$$
The next step is to calculate the probability of dropout generalized for all students. In this case, it is advisable to take the maximum of the maximum probability values of the individual students:
$$P(e^c_k) = \max\left(P(e^c_k(1)), \dots, P(e^c_k(n_k))\right) \tag{12}$$
This probability defines the hazard function $h_c(t_k)$ for the competing risk $e^c_k$ of the survival analysis:
$$h_c(t_k) = P(e^c_k) \tag{13}$$
which can be used to estimate the number $m^c_f$ of events $e^c_f$:
$$\hat{m}^c_f = h_c(t_f)\, n_f \tag{14}$$
Then, substituting Equation (13) into Equation (8), the Cumulative Incidence Curve for survival is as follows:
$$CIC_c(t_f) = \sum_{k=1}^{f} S(t_{k-1})\, P(e^c_k) \tag{15}$$
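The maximum-confidence aggregation can be sketched in Python as follows. The sketch assumes that a rule applies to a student when its antecedent is a subset of the student's set of events, and that a student without any matching rule receives a probability of zero; both the rule base and the cases are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], float]  # (antecedent pattern, confidence)


def student_probability(case: FrozenSet[str], rules: List[Rule]) -> float:
    """Highest confidence among the rules whose antecedent the case supports."""
    matching = (conf for antecedent, conf in rules if antecedent <= case)
    return max(matching, default=0.0)  # assumption: 0 if no rule applies


def population_probability(cases: List[FrozenSet[str]], rules: List[Rule]) -> float:
    """Maximum of the per-student maxima, following Equation (12)."""
    return max((student_probability(case, rules) for case in cases), default=0.0)


# Hypothetical rule base and cases of a single semester.
rules: List[Rule] = [
    (frozenset({"math1"}), 0.4),
    (frozenset({"math1", "phys1"}), 0.7),
    (frozenset({"chem1"}), 0.3),
]
cases = [
    frozenset({"math1", "phys1"}),
    frozenset({"math1"}),
    frozenset({"chem1"}),
    frozenset(),
]
per_student = [student_probability(case, rules) for case in cases]
overall = population_probability(cases, rules)
print(per_student, overall)
```

The per-student values can also serve as personalised risk estimates, while the population-level value is the quantity that enters the hazard function.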

Application to Student Dropout Prediction
To set up the model, the course completion data of former chemical engineering students at the University of Pannonia who had already either graduated or been expelled from the university were used. Active and passive students were excluded from the study because there is no information about their outcome. Reapplied students were also excluded from the analysis. The students were completely anonymized. It was not necessary to obtain permissions, as we use data from our own university. The input of the method was created by integrating the student log files and the sample curriculum. The provided data were recorded between 2011 and 2018 and included approximately 350 students. During the data processing, care had to be taken to exclude students who had already applied and dropped out before 2011. If such students reapply after 2011, they introduce confounding effects, such as a student appearing to graduate implausibly early. It was also challenging to formalize every case of the patterns of uncompleted subjects.

The Description of the Analysed Dataset of Course Completions
All data were anonymized prior to access and analysis. The studied data can be downloaded from the website of the authors (https://www.abonyilab.com/about-us/software-and-data, accessed on 22 October 2018).
The integrated student log file consists of two components. The student database records each attempt to complete a subject as an elementary event. There is also a binary variable that distinguishes graduation from unsuccessful completion (dropout). Combining these with the information extracted from the sample curriculum, an integrated student log file can be created. A sample of this log file is shown in Table 1.
Table 1. A sample of the student log file, which integrates the student-specific data and the sample curriculum.

The subject failures of the students are represented as events. An example is shown in the Gantt chart in Figure 2. Let $\tau_i$ be the semester in which the student should complete the $i$th subject according to the sample curriculum, and $\tau^j_i$ be the semester in which the first successful completion of the subject by the $j$th student was recorded. The elementary event $e^i_k(j)$ is the lack of completion of the $i$th subject by the $j$th student in the $k$th semester, which occurs if $\tau_i < \tau^j_i$. These events can be grouped according to semesters. The consequential events $e^{fail}_k(j)$ (whose triggering causes are to be found) represent that the $j$th student does not continue his/her studies in semester $k + 1$ and leaves the university due to failure. As will be presented in the next subsection, this event is treated as a competing risk in the survival analysis.
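The derivation of the lack-of-completion events from the log file can be sketched in Python. The sketch rests on one reading of the condition above: a subject counts as uncompleted in semester $k$ if it was recommended in semester $k$ or earlier and its first successful completion happened later than $k$ or never. The function name, the data layout and the completion semesters are hypothetical; only the subject names follow the sample curriculum.

```python
from typing import Dict, Set


def uncompleted_subjects(
    recommended: Dict[str, int], completed: Dict[str, int], semester: int
) -> Set[str]:
    """Subjects that are due by `semester` but not yet successfully completed.

    recommended[s] is the semester proposed by the sample curriculum
    completed[s]   is the semester of the first successful completion;
                   a missing key means the subject was never completed
    """
    backlog = set()
    for subject, due in recommended.items():
        done = completed.get(subject)
        if due <= semester and (done is None or done > semester):
            backlog.add(subject)
    return backlog


# Hypothetical student record (illustration only).
recommended = {"Physics I.": 1, "Mathematical analysis I.": 1, "Physics II.": 2}
completed = {"Physics I.": 1, "Mathematical analysis I.": 3}

backlog_by_semester = {
    k: uncompleted_subjects(recommended, completed, k) for k in (1, 2, 3)
}
print(backlog_by_semester)
```

Grouping the resulting sets by semester yields one set of events per student and semester, which is the input format assumed by the frequent itemset mining step.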

Investigation of Student Dropout with Survival Analysis Taking into Account the Competing Risks
Examining the study path of a university student, it is clear that if someone successfully graduates, no other outcome can happen to that person. However, if someone interrupts his/her studies or is expelled for any reason, that person can re-enrol in the training. These students are excluded from the study. Thus, the unsuccessful completion of the program and the successful graduation are competing risks that need to be handled. In this case, by determining the Cumulative Incidence Curve of the dropout event, the exact dropout rate of students can be estimated. To obtain this measure, it is necessary to identify the number of students who dropped out and the number of students who graduated in a given semester. The number of graduates in the $f$th semester is denoted by $m^{grad}_f$ and the number of students who dropped out is denoted by $m^{fail}_f$. Then, substituting the parameters mentioned above into Equation (15), the Cumulative Incidence Curve can be calculated as follows:
$$CIC_{fail}(t_f) = \sum_{k=1}^{f} S(t_{k-1})\, \frac{m^{fail}_k}{n_k}, \qquad S(t_{k-1}) = \prod_{l=1}^{k-1} \left(1 - \frac{m^{fail}_l + m^{grad}_l}{n_l}\right)$$
The calculation process of the individual results over time is collected and explained in Table 2.
The comparison of the function estimated by the Kaplan-Meier method and the function estimated by the Cumulative Incidence method can be seen in Figure 3. The competing risk emerges in the seventh semester. Since this is the length of the sample curriculum, this is the moment when the other possible outcome, graduation, appears. If there is no other competing risk, the Cumulative Incidence Curve is the same as the Kaplan-Meier empirical distribution, which is clearly visible in the figure until the seventh semester, and the two functions begin to differ only after that. The relation between the two functions is also as expected: because the Kaplan-Meier distribution typically overestimates the risks, the probability of survival it gives is lower than in the case of the Cumulative Incidence Curve. The difference between the two functions corresponds to the graduated students.
As mentioned earlier, the disadvantage of the Kaplan-Meier model extended to manage competing risks is that it describes the entire population at once. However, it must be recognized that considerable differences can occur when students follow different subject (in)completion pathways during their university years. The consequences of failing Mathematics or Chemistry in the first semester can be completely different. This is the reason why event analysis is introduced by means of frequent itemset and association rule mining.

Event Analysis with the Mining of Frequent Itemsets and Association Rules
Based on the previously presented concepts, in this case study the event $e^i_k$ denotes the missing completion of the $i$th subject in the $k$th semester, $X^j_k = \{e^i_k(j), \dots, e^l_k(j)\}$ is the pattern of missing subjects of the $j$th student in the $k$th semester, and $X_k = \{X^1_k, \dots, X^{n_k}_k\}$ is the collection of the patterns of missing subject completions of all students in the $k$th semester. It should be highlighted that these sets are extended to also contain the triggered consequential events $e^c_k$, that is, $e^{fail}_k$ when the given student fails at the end of the $k$th semester. As each case study has different types of relevant information, it is important to note that in the case of student dropout, constraints should be imposed on the mining of frequent itemsets. In some cases, the support of a certain uncompleted subject is the same as the support of that subject together with some other subjects. In this case, the other subjects do not affect the dropout and may lead to poor results after aggregation. To avoid this phenomenon, we use the Closed Frequent Itemset Mining method [29]. The frequent itemsets are mined based on the set $X_k$ of the patterns $X^j_k$. The method has an important hyper-parameter, which is the minimum support of the frequent itemset mining algorithm. A smaller minimum support results in a higher number of rules, so the complexity of the rule base can be fine-tuned by this parameter. Similarly to other machine learning tasks, the optimal complexity of the model can be determined by cross-validation, as will be presented in the following section.
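The idea of closed frequent itemsets can be illustrated with a brute-force Python sketch. It uses the standard definition, according to which a frequent itemset is closed if no proper superset has the same support; it is not the algorithm of [29] and is only practical for a handful of items. The item labels and cases are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def frequent_itemsets(cases: List[Itemset], minsup: float) -> Dict[Itemset, float]:
    """Level-wise brute-force search for all itemsets with support >= minsup."""
    if not cases:
        return {}
    items = sorted(set().union(*cases))
    frequent: Dict[Itemset, float] = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            supp = sum(1 for case in cases if pattern <= case) / len(cases)
            if supp >= minsup:
                frequent[pattern] = supp
                found_any = True
        if not found_any:
            break  # no frequent set of this size, so none of a larger size
    return frequent


def closed_itemsets(frequent: Dict[Itemset, float]) -> Dict[Itemset, float]:
    """Keep the itemsets that have no proper superset with equal support."""
    closed: Dict[Itemset, float] = {}
    for pattern, supp in frequent.items():
        absorbed = any(
            pattern < other and other_supp == supp
            for other, other_supp in frequent.items()
        )
        if not absorbed:
            closed[pattern] = supp
    return closed


# Hypothetical cases: A and B are always missing together, C is independent.
cases = [frozenset("AB"), frozenset("AB"), frozenset("ABC"), frozenset("C")]
closed = closed_itemsets(frequent_itemsets(cases, minsup=0.5))
print(closed)
```

In this illustrative example, the single items A and B are absorbed by the pair {A, B}, because they never occur without each other, so only {A, B} and {C} remain in the closed set.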

Integrated Analysis of Student Dropout
In order to verify the reliability of the results and to handle the over-fitting issue, we used five-fold cross-validation. After performing the steps mentioned in the previous sections, the analysis of the results can be performed. The five most critical rules of every semester are summarized in Table 3. Based on the critical dropout rules, the subjects are summarized by name in Table 4, using the IDs and names of the subjects according to Appendix A. Apparently, every semester has subjects that seem to be critical; in the first semester, for example, these are the core subjects providing the basic engineering knowledge, such as mathematics, physics and chemistry. Moreover, there are uncompleted subjects that reoccur over multiple semesters. Examples are the comprehensive exam in chemistry, which appears from the fifth semester and lasts until the end of the analysis, and transport phenomena, which is a critical subject in three semesters.
The Cumulative Incidence Curve generated from the association rules and the Cumulative Incidence Curve generated from the survival analysis are shown in Figure 4. With the maximum-confidence aggregation strategy, the model approximates the Cumulative Incidence Curve of the survival analysis very well. In Hungary, a student can easily be admitted to an engineering course, even to one supported by the government, as the profession suffers from a considerable labour shortage. Therefore, many students attempt the course, but they soon realize that they cannot complete it. More than half of the students abandon their studies by the end of the fifth semester. In the first two semesters, the students who leave are those who realize on their own that the course is too hard for them. A higher dropout rate is seen in the third semester, because there are requirements for continuing the course: every student must complete all subjects recommended by the sample curriculum for the first semester by the end of the third semester. However, it is possible to submit a so-called fairness request once, which allows one subject to be completed in the fourth semester. The dropout in the fourth semester usually affects those who could not make use of this request either. The last significant dropout is seen in the fifth semester, where there is also a requirement for continuing the course. Another dropout phenomenon is that students can decide to reapply for the course at any time. This is done to obtain better chances by erasing their previous bad results and resetting the requirement system. Thus, as the method examines only the first attempt at completing the training, these students are also considered as dropped out. Previous studies have shown that few students complete the training after reapplying.
Experience shows, however, that it is not worth applying again, because the failure rate remains significant. Once students reach the 5th semester, they are less likely to drop out. Finally, based on the 11th semester, it can be stated that approximately 40% of students graduate on their first attempt. The obtained results suggest that university management should reconsider some functional elements. First, it would be essential to reschedule the subjects of the sample curriculum. There are subjects that primarily develop skills for subjects recommended in later semesters. Since many students dropped out in the 3rd semester due to the requirement there, it would be important to rethink its terms. Furthermore, it can be noticed that, in many cases, there is a connection between the given subject and the teacher. In this regard, it would be important to organize useful training for these educators based on Section 1.5 of the European Standards and Guidelines [30].
In order to present the effectiveness of the developed methodology from several perspectives, we also performed a comparative analysis. The Naive Bayes classification method was selected for comparison. Based on the results, the classifier estimates dropout from uncompleted subjects very poorly. The Cumulative Incidence Curves of the Naive Bayes classifier and the survival analysis are compared in Figure 5 for one fold. The Naive Bayes model overestimated the number of failures, so the method proved to be weak for prediction. However, in the case of failed students, the model was accurate, so the method may still be suitable as an alerting system. To illustrate the effectiveness of the two methods, for both the Naive Bayes classifier and the proposed model we determined the mean of the absolute difference between the Cumulative Incidence Curve derived from the model and that of the survival analysis, as can be seen in Table 5.
Based on the confidence of the association rules, the proposed method is also suitable for estimating the probability of dropout of an active student who is still in training, based on his/her current uncompleted subjects. Since the student already has a given pattern $\varphi^{p*}_k$ of uncompleted subjects, the conditional probability $conf(\varphi^{p*}_k \rightarrow e^c_k) = P(e^c_k \mid \varphi^{p*}_k)$ must be calculated. Based on the missing subject completions, personalized predictions can be made by looking for the new uncompleted subjects that are most likely to follow the pattern $\varphi^{p*}_k$ of uncompleted subjects. Thus, the developed method also answers what kind of uncompleted subjects can be expected of the student.
Like any methodology, this one also has its limitations. It can be observed that after a given semester, the majority of students who have not dropped out will graduate.
There are very few students who reached the 11th semester, so proportionally far fewer data are available, which results in uncertainty in the forecast for the last semesters. If much more data were available, more accurate results could be obtained, but the proportions would still leave only a small amount of data for these semesters.
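The comparison measure used for the two models, the mean absolute difference between two Cumulative Incidence Curves evaluated at the same semesters, can be sketched in Python as follows. The function name and the curve values are hypothetical and do not reproduce the results of Table 5.

```python
from typing import Sequence


def mean_absolute_difference(curve_a: Sequence[float], curve_b: Sequence[float]) -> float:
    """Mean of |a_k - b_k| over curves sampled at the same time instances."""
    if len(curve_a) != len(curve_b) or len(curve_a) == 0:
        raise ValueError("curves must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)


# Hypothetical curves over three semesters (illustration only).
reference_cic = [0.10, 0.25, 0.40]  # e.g., from the survival analysis
model_cic = [0.12, 0.20, 0.43]      # e.g., from a rule-based or Bayes model
mad = mean_absolute_difference(reference_cic, model_cic)
print(round(mad, 4))
```

A smaller value indicates that the model-based curve follows the reference curve more closely; because the last semesters rest on few students, their contribution to the mean should be interpreted with care.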

Conclusions
Student drop-out is one of the problems of our age, causing significant economic loss and social tension. Despite the fact that more and more researchers analyse the issue, to our knowledge, so far no method has been developed that would predict the student's academic success based on the student's uncompleted subjects.
The present paper illustrates that the survival analysis based on a competing risk model effectively provides an estimate of the probability of graduation. The disadvantage of survival analysis, however, is that by itself it cannot incorporate the impact of different (currently) uncompleted subjects into the probability of drop out from the course. However, deviations from the sample curriculum can be present in innumerable permutations and can show significant differences in terms of risk. After identifying the problem, it was highlighted that it is expedient to extend the survival analysis model with event analysis methods. Representing subject completion deficiencies as events, frequent patterns can be identified by frequent itemset mining, from which association rules are formed to discover the lack of subject completions that leads to the dropout of a student. A method to estimate the probability of a student progressing from semester to semester and obtaining a degree based on the characteristics of the pattern of uncompleted subjects was also developed.
The probability of surviving (remaining an active student in the next semester) calculated by the model approximates well the results of the survival analysis, that is, the Kaplan-Meier estimate of the empirical distribution. By extending the method, it is also possible to estimate which subjects are likely to remain uncompleted in the future by an active student who is still in training. The method can be further developed into an automated personalized counselling system.
The model may also be suitable for examining a wide class of problems. An important characteristic of such applications is the presence of overlapping process steps and the occurrence of transitions caused by triggering phenomena. Examples include development activities, so the method seems to be suitable for supporting capability maturity model integration processes, which will be one of our future research avenues.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Denotation | Meaning
--- | ---
$h_i(t_k)$ | Hazard function of the ith competing risk
$I_i(t_k)$ | Incidence Curve of the ith competing risk
$CIC_i(t_k)$ | Cumulative Incidence Curve of the ith competing risk
$e$ | Event
$e^i_k$ | The ith event occurs at time $t_k$
$X^j$ | The set of events of the jth case
$X^j_k$ | The set of events of the jth case at time $t_k$
$X_f$ | The set of events at time $t_f$
$\varphi_f$ | The set of typical series of events at time $t_f$
$\varphi^p_k$ | The pth typical series of events at time $t_k$
$supp(\varphi_k)$ | The support of the set of events at time $t_k$
$minsup$ | Threshold of minimal support
$\varphi^{p*}_k$ | Left side of the association rules of the frequent itemset $\varphi_k$ associated with a consequential event $e^c_k$

Appendix A. Information about the Sample Curriculum

ID | Subject | Recommended semester
--- | --- | ---
[...] | [...] | [...]
[...] | Electronics laboratory practice | 4
6 | Process design I. | 4
7 | Physics I. | 1
8 | Physics (problem solving practice) | 1
9 | Physics II. | 2
10 | Physics lab. Pract. | 2
11 | Physical chemistry I. | 2
12 | Physical chemistry II. | 3
13 | Laboratory practice in physical chemistry | 3
14 | Problem solving practice in physical chemistry | 3
15 | Process control | 4
16 | Machine elements and presentation | 1
17 | Process dynamics and control | 4
18 | Introduction to law | 4
19 | Corrosion Basics | 4
20 | Comprehensive exam in chemistry | 5
21 | Chemical analysis | 3
22 | Chemical analysis laboratory practice | 4
23 | Economics | 1
24 | Mathematical analysis I. | 1
25 | Mathematical analysis I. Practice | 1
26 | Mathematical analysis II. | 2
27 | Mathematical analysis I. Practice | 2
28 | Quality assurance | 2
29 | Industrial quality management | 6
30 | Effective technical communication | 6
31 | Effective technical communication practice | 6
32 | IT tools for effective technical communication | 6
33 | Engineering thermodynamics | 3
34 | Technical thermodynamics | 3
[...] | [...] | [...]
[...] | Laboratory practice on organic chemistry | 4
46 | Computer science for engineers I. | 1
47 | Modeling of chemical processes | 5
48 | Modeling of chemical processes (laboratory practice) | 5
49 | Design of technological systems | 6
50 | Design project I. | 6
51 | Design project II. | 7
52 | Transportphenomena | 3
53 | Chemical process engineering laboratory practice | 5
54 | Chemical Engineering BSc Field Practice | 7
55 | Chemical process safety | 6
56 | Selected chemical technologies | 5
57 | Selected chemical technologies (laboratory practice) | 5
58 | Process design II. | 5
59 | Process design III. | 6
60 | General and inorganic chemistry | 1
61 | Problem solving in general and inorganic chemistry I. | 1
62 | Problem solving in general and inorganic chemistry II. | 2
63 | Laboratory practice in general and inorganic chemistry | 2
64 | Hydrocarbons and petrochemical technologies | 5