Student Dataset from Tecnologico de Monterrey in Mexico to Predict Dropout in Higher Education

Abstract: High dropout rates and delayed completion in higher education are associated with considerable personal and social costs. In Latin America, 50% of students drop out, and only 50% of the remaining ones graduate on time. Therefore, there is an urgent need to identify students at risk and understand the main factors of dropping out. Together with the emergence of efficient computational methods, the rich data accumulated in educational administrative systems have opened novel approaches to promote student persistence. In order to support research related to preventing student dropout, a dataset has been gathered and curated from Tecnologico de Monterrey students, consisting of 50 variables and 143,326 records. The dataset contains non-identifiable information of 121,584 High School and Undergraduate students belonging to the seven admission cohorts from August-December 2014 to 2020, covering two educational models. The variables included in this dataset consider factors mentioned in the literature, such as sociodemographic and academic information related to the student, as well as institution-specific variables, such as student life. This dataset provides researchers with the opportunity to test different types of models for dropout prediction, so as to inform timely interventions to support at-risk students.


Introduction
High dropout rates and delayed completion in higher education are associated with considerable personal and social costs. Dropping out of higher education represents a cost for the government and society, an unnecessary expense for the family, and an experience of failure for the university student [1,2]. Therefore, the early identification of at-risk students and the understanding of the main factors behind dropping out have recently attracted a great deal of research interest [3-5]. Early detection of at-risk students allows higher education institutions to offer individualized assistance in varied forms, including remedial courses and tutoring sessions, to mitigate academic failure.
In the dataset given through this descriptor, non-identifiable information is provided for 121,584 High School and/or Undergraduate students who have enrolled at Tecnologico de Monterrey. The information corresponds to seven admission cohorts to the institution from 2014 to 2020; that is, August-December 2014 (AD14), August-December 2015 (AD15), August-December 2016 (AD16), August-December 2017 (AD17), August-December 2018 (AD18), August-December 2019 (AD19), and August-December 2020 (AD20).
The dropout rates in the institution decreased from 8.8% in High School and 10.1% in Undergraduate in 2014 to 5.5% in High School and 7.9% in Undergraduate in 2020. However, the decline was not steady: the High School rate rose from 7.3% to 7.6% between 2015 and 2016, and the Undergraduate rate rose from 7.5% to 9.4% between 2018 and 2019. Therefore, it is necessary to continue researching and developing models and strategies for student retention.
Among the categories of information available in this dataset are:
• Sociodemographic information, such as age, gender, and type of zone to which the student's address belongs.
• Enrollment information, such as program, school, and educational model.
• Academic information related to the student, such as the average of the previous level, the average in the first term or midterm of the first semester, and the number of failed subjects.
• Information associated with scores on admission tests, such as the admission test, standardized English proficiency test, and Mathematics grade.
• Academic history, such as type of school of provenance, national/international student, and relationship with the Tecnologico de Monterrey system.
• Student life, such as participation in sports, cultural, and leadership activities.
• Scholarship and financial aid information, such as type of scholarship, percentage of scholarship, and percentage of scholarship loan.
• Academic information related to the student's parents, such as educational level and whether the parents were students of the Tecnologico de Monterrey.
• Information on the student's retention or dropout in the first year.
Tables 1-3 provide a detailed description of the variables constituting the student dataset. The dataset covers two educational models implemented at Tecnologico de Monterrey. The previous model, corresponding to the AD14-AD18 generations, is based on the teaching-learning process, while the current model, called the "TEC21 Model" and corresponding to the AD19-AD20 generations, is based on challenges and competencies [29]. In this dataset, the average obtained in the first term or midterm, the number of subjects failed, and the number of subjects dropped by the student are only provided for the AD19-AD20 generations. Hence, the data can also be analyzed from the perspective of the educational model.
In the same way, co-curricular activities related to the integrated learning of students have also evolved in accordance with the new educational model ("TEC21 Model"). Students of the AD14-AD17 generations could enroll in one or more of the three categories of activities offered: (1) physical education, (2) cultural diffusion, and (3) student society. For the AD18-AD20 generations, the offer of activities increased, since they are now part of the well-rounded education of the student and contribute to the development of transversal skills for all students [30,31]. This evolution is called the LiFE (Leadership and Student Education) program, which goes hand in hand with the TEC21 educational model [31] and is made up of the following categories: athletic or sports activities, art or culture activities, student society activities, life or work mentoring, and wellness activities.
The variable dropout.semester is a value indicating the semester when the student dropped out, where 0 = the student continues studying, 1 = the student dropped out during the first semester, 2 = the student did not enroll in the second semester, 3 = the student dropped out during the second semester, and 4 = the student did not enroll in the third semester.
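The coding of the dropout semester variable described above can be illustrated with a small lookup. This is only a sketch that uses the Python standard library; the helper names describe_dropout and has_dropped_out are hypothetical and are not part of the dataset or its data dictionary, and the example codes are invented.

```python
# Code-to-label lookup for dropout.semester, copied from the variable description.
DROPOUT_SEMESTER_LABELS = {
    0: "continues studying",
    1: "dropped out during the first semester",
    2: "did not enroll in the second semester",
    3: "dropped out during the second semester",
    4: "did not enroll in the third semester",
}


def describe_dropout(code: int) -> str:
    """Return the human-readable meaning of a dropout.semester code."""
    if code not in DROPOUT_SEMESTER_LABELS:
        raise ValueError(f"unknown dropout.semester code: {code!r}")
    return DROPOUT_SEMESTER_LABELS[code]


def has_dropped_out(code: int) -> bool:
    """Any valid non-zero code means the student left the institution."""
    describe_dropout(code)  # raises ValueError for codes outside 0-4
    return code != 0


# Hypothetical example codes, not taken from the dataset.
for code in (0, 2, 3):
    print(code, has_dropped_out(code), describe_dropout(code))
```

Keeping the labels in one dictionary means that both the validation and the derived dropout flag rely on a single definition of the codes.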

Materials and Methods
The methodology used in this research is based on the Data Life Cycle used in the field of Research Data Management shown in Figure 1. The Data Life Cycle illustrates the research process and its different phases, as well as the stages associated with the data generation, use, and dissemination [32].

Data Planning
The first 40 variables shown in Tables 1-3 were defined according to the related work cited in this descriptor and to the input of the Analytics and Business Intelligence Department of Tecnologico de Monterrey, drawing on its experience in the early alerts program (student retention). The following ten variables (listed from 41 to 50 in Table 3), related to the student's dropout semester and the student's co-curricular activities, were gathered after receiving the proposals of the researchers participating in the call for proposals. The dataset, along with its data dictionary, was built in Excel files to allow downloading them through the Tecnologico de Monterrey's Data Hub (https://datahub.tec.mx/dataverse/tec, accessed on 24 August 2022). Taking into account the sensitivity of the data, the dataset will be made available to researchers who request it through the Data Hub.

Data Collection
The data was extracted in two phases. Firstly, data was collected from the Tecnologico de Monterrey's Data Warehouse by the Analytics and Business Intelligence Department through the SAP BusinessObjects Web Intelligence (WebI) tool. This first dataset includes personal and academic information on Undergraduate and High School students, such as gender, age, tests, schooling background of parents, among others. The variables related to retention and the socioeconomic level of the students were calculated by the same department with the purpose of designing a model to identify students at risk, used in the early alerts program. Secondly, the co-curricular activities of the students from 2014 to 2020 were obtained from the Tecnologico de Monterrey's LiFE Department.

Data Assurance
For the dataset that was extracted from the WebI tool, the following preprocessing steps were performed:

1. Considering the privacy of students and faculty, it is important to emphasize that the data must be de-identified before it is made available for institutional use and research purposes [22]. Therefore, the student's enrollment identifier (student.id) was de-identified.
2. All records were translated into the English language.
3. An exhaustive exploration was carried out to find inconsistencies in the values of variables 1 to 40 (described in Tables 1-3) and in the relationships among them.
4. Spelling and typographical errors were checked for the categorical values of each variable.
5. Missing values for the variables socioeconomic.level and social.lag were filled in with "No information".
6. The empty values corresponding to admission.test for the Undergraduate level were replaced by "Does not apply" when the variable tec.no.tec has the value "TEC", that is, when the student is a graduate of the Tecnologico de Monterrey's High School.
7. The variable dropout.semester was categorized according to the period in which the student dropped out: before or during the semester.
8. The values of the variables scholarship.perc, loan.perc, and total.scholarship.loan were multiplied by 100 to represent a percentage.
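Steps 5, 6, and 8 above are simple per-record rules, which the following sketch illustrates with the Python standard library. It is not the procedure actually run by the department: the function name clean_record, the "level" key, the representation of missing values as None or an empty string, and the sample record are all assumptions made for illustration.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def clean_record(raw: Record) -> Record:
    """Apply preprocessing steps 5, 6, and 8 to one record (returns a copy)."""
    rec = dict(raw)

    # Step 5: label missing socioeconomic information explicitly.
    for col in ("socioeconomic.level", "social.lag"):
        if rec.get(col) in (None, ""):
            rec[col] = "No information"

    # Step 6: Undergraduate students who graduated from the institution's
    # High School (tec.no.tec == "TEC") have no admission test on record.
    # The "level" key is a hypothetical column name.
    if (
        rec.get("level") == "Undergraduate"
        and rec.get("tec.no.tec") == "TEC"
        and rec.get("admission.test") in (None, "")
    ):
        rec["admission.test"] = "Does not apply"

    # Step 8: convert fractions (assumed to lie in 0-1) to percentages.
    for col in ("scholarship.perc", "loan.perc", "total.scholarship.loan"):
        if rec.get(col) is not None:
            rec[col] = round(rec[col] * 100, 2)

    return rec


# Hypothetical record, not taken from the dataset.
sample = {
    "level": "Undergraduate",
    "tec.no.tec": "TEC",
    "admission.test": None,
    "socioeconomic.level": None,
    "social.lag": "",
    "scholarship.perc": 0.5,
    "loan.perc": 0.25,
    "total.scholarship.loan": 0.75,
}
print(clean_record(sample))
```

Working on a copy of each record keeps the raw extract untouched, so the rules can be re-run or audited later.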

Data Description
The dataset was described in detail in Section 2.

Data Preservation
This dataset will be available upon request through the Tecnologico de Monterrey's Data Hub repository for its long-term preservation. The metadata was properly described and a specific Digital Object Identifier (DOI) was assigned so that the data can be easily traced and correctly cited. This dataset is protected by the Creative Commons Zero (CC0) waiver and is governed by Tecnologico de Monterrey's Terms of Use and Data Policy.

Data Discovery
Based on the proposals received from the researchers, information on co-curricular activities and on the dropout semester was identified as potentially valuable for student dropout prediction models and was added to the original dataset.

Data Integration
The first dataset consisting of 40 variables was merged with the co-curricular activities database and semester dropout information based on the variables student.id and generation to create a single data file. As a result, the final dataset is made up of 50 attributes to test and predict student dropout at the High School and Undergraduate levels.
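The merge described above can be sketched as a join on the composite key (student.id, generation). The example below uses only the Python standard library; the function name merge_on_key, the choice of a left join, the None fill for unmatched rows, and the sample values are assumptions, since the descriptor does not state how the merge was implemented.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]
KEY = ("student.id", "generation")  # join variables named in the text


def merge_on_key(base: List[Record], extra: List[Record], fill: Any = None) -> List[Record]:
    """Left-join `extra` onto `base` using (student.id, generation)."""
    index = {tuple(row[k] for k in KEY): row for row in extra}
    extra_cols = sorted({col for row in extra for col in row if col not in KEY})

    merged = []
    for row in base:
        match = index.get(tuple(row[k] for k in KEY), {})
        out = dict(row)
        for col in extra_cols:
            # Unmatched base rows receive the fill value (an assumption).
            out[col] = match.get(col, fill)
        merged.append(out)
    return merged


# Hypothetical rows, not taken from the dataset.
base = [
    {"student.id": 1, "generation": "AD14", "retention": 1},
    {"student.id": 2, "generation": "AD19", "retention": 0},
]
activities = [
    {"student.id": 1, "generation": "AD14", "physical.education": "Yes"},
]
for row in merge_on_key(base, activities):
    print(row)
```

Joining on both variables rather than on student.id alone matters because one student can appear in more than one record, as the difference between 143,326 records and 121,584 students suggests.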

Data Analysis
Firstly, a descriptive analysis of the dataset variables was performed using the Pandas library version 1.4.3 and the Scikit-learn library version 1.1.2 in Python 3; the results are shown in Tables 4 and 5. Secondly, a data visualization was carried out using Tableau Desktop Professional Edition 2021.4.4.
On the one hand, Table 4 describes the numerical variables of the dataset through their unique, mean, minimum, and maximum values. The identifier of each variable corresponds to the identifier assigned in Tables 1-3. The information gain is also included to show the dependency between each feature in the dataset and the target variable, retention. The information gain was calculated using a mutual information classifier. The values "Does not apply" and "No information" were excluded from the calculation of the statistics for the variables admission.test, general.math.eval, and total.life.activities, since they do not represent numerical values, and records containing null values were also excluded from the information gain calculation. It is important to remember that for the variables average.first.period, failed.subject.first.period, and dropped.subject.first.period the data is only available for AD19 and AD20. In addition, a correlation matrix is provided in Figure 2 to show the correlation coefficients between the numerical attributes in the dataset. Due to the considerations mentioned above, the dataset used for these analyses was reduced to 25,061 records. From this matrix, it can be seen that the correlation coefficient between the variables total.scholarship.loan and scholarship.perc is 0.94, which means that these variables are strongly correlated. Between the variables average.first.period and failed.subject.first.period, the coefficient is −0.43, which indicates that they are associated in the opposite direction.
On the other hand, Table 5 describes the categorical variables of the dataset through their unique and mode values, and the frequency of the mode. The identifier of each variable corresponds to the identifier assigned in Tables 1-3. Regarding the co-curricular activities, the mode and frequency were calculated according to the generation to which they correspond.
For example, for the variables physical.education, cultural.diffusion, and student.society, only the values corresponding to the generations AD14 to AD17 were considered. Similarly, for the LiFE activities, only the values of the generations AD18 to AD20 were considered. Furthermore, the "Does not apply" value was ignored for all generations. As with the numerical variables, the information gain is included to show the dependency between each feature in the dataset and the target variable, retention. The information gain was calculated using a mutual information classifier; for this, the features were encoded with an OrdinalEncoder, while the target variable, retention, was encoded with a LabelEncoder. From this calculation, it can be deduced that retention depends most on the students' co-curricular activities, such as cultural.diffusion, student.society, and physical.education, while the variables online.test and dropped.subject.first.period show less dependency on retention.
A more thorough analysis of these factors is recommended, since the gain values may vary depending on the data preprocessing and the approach that each researcher adopts in their experiments.
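To make the idea of information gain between a categorical feature and the retention target concrete, the following sketch computes a plug-in estimate of mutual information for discrete variables using only the Python standard library. It is a simplified stand-in and not the Scikit-learn pipeline used for Tables 4 and 5 (Scikit-learn offers, for example, mutual_info_classif, whose estimator may give different values). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def ordinal_encode(values: Sequence[Hashable]) -> List[int]:
    """Map each distinct category to an integer code (sorted by string form)."""
    codes = {v: i for i, v in enumerate(sorted(set(values), key=str))}
    return [codes[v] for v in values]


def mutual_information(x: Sequence[int], y: Sequence[int]) -> float:
    """Plug-in estimate of I(X;Y) in nats for two discrete sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # I(X;Y) = sum_{a,b} p(a,b) * log( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


# Toy data, invented for illustration only.
activity = ["Yes", "Yes", "No", "No", "Yes", "No"]
retention = ["retained", "retained", "dropout", "dropout", "retained", "dropout"]

# The same integer encoding is applied to the feature and to the target.
gain = mutual_information(ordinal_encode(activity), ordinal_encode(retention))
print(f"estimated information gain: {gain:.4f} nats")
```

A value of zero indicates that the feature and the target are statistically independent in the sample, while larger values indicate stronger dependency. Because the estimate depends on how values are encoded and which records are excluded, it is sensitive to the preprocessing choices mentioned above.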
Subsequently, graphical representations were produced for the variables related to the dropout rates and for the institution-specific variables (student life). Figure 3 illustrates the number of High School and Undergraduate students who dropped out during their first year of study from AD14 to AD20. In general, the number of students enrolled increased over time for both levels. Figure 3 shows that the number of High School students who dropped out was higher in AD14 than in the other generations. In AD15 the dropout rate decreased slightly to 7.28%, but over the following three generations, from AD16 to AD18, it increased and ranged between 7.61% and 7.98%. In AD19, when the TEC21 model started, the rate fell to 6.48%, and it fell again to 5.51% in AD20, which is the lowest dropout rate of the seven generations.
Although the number of students enrolled at the Undergraduate level seems to increase year after year, the number of dropouts does not behave the same way. The orange line of Figure 3 shows that the highest student dropout also occurred in the AD14 generation, with a dropout rate of 10.09%. According to the graph, a downward trend began with the AD15 generation, which had a dropout rate of 9.20%; the rate then decreased with minimal variation across the AD16 and AD17 generations, at 8.82% and 8.71%, respectively. In AD18, the dropout rate continued to decrease, to 7.53%. Despite this decreasing trend, in AD19 the dropout rate rose to 9.43% even though the number of students enrolled increased, and in AD20 it decreased to 7.95%.
Moreover, Figure 4 presents information on the number of High School and Undergraduate students who participated in different co-curricular activities during the fall semesters between 2014 and 2017. The total number of students enrolled in those years was 78,715. The graph shows that the majority of the students (58,701) were involved in physical education activities, with a dropout rate of 7.10%, followed by cultural diffusion with 40,768 students enrolled and a dropout rate of 7.10%, while a smaller number of students (25,115) participated in some student society activity, with a dropout rate of 6.31%.
Figure 5 shows the information on the co-curricular activities that belong specifically to the Tecnologico de Monterrey's LiFE program implemented since AD18. The number of students enrolled in these three generations was 64,611. According to the graph, more than half of the students (36,908) participated in Athletic Sports, with a dropout rate of 6.09%. Student Society Leadership was the second activity, with a participation of 21,429 students and a dropout rate of 6.10%, followed by Art Culture with 20,849 students and a dropout rate of 6.02%. Slightly fewer students (20,052) participated in the Wellness activities, with a dropout rate of 5.91%. Life-Work Mentoring was the least preferred activity, with a participation of 12,863 students, but it had the highest dropout rate, 7.40%. It is worth mentioning that a student could have participated in one or more activities at the same time.
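The participation counts reported for Figures 4 and 5 can be turned into shares of each enrolled population with a few lines of arithmetic. The sketch below uses only the counts quoted in the text; the function name participation_share and the dictionary layout are hypothetical.

```python
from typing import Dict

# Counts quoted in the text for Figure 4 (AD14-AD17) and Figure 5 (AD18-AD20).
AD14_AD17_TOTAL = 78_715
LIFE_TOTAL = 64_611

AD14_AD17 = {
    "physical education": 58_701,
    "cultural diffusion": 40_768,
    "student society": 25_115,
}
LIFE = {
    "athletic sports": 36_908,
    "student society leadership": 21_429,
    "art culture": 20_849,
    "wellness": 20_052,
    "life-work mentoring": 12_863,
}


def participation_share(participants: Dict[str, int], total: int) -> Dict[str, float]:
    """Percentage of enrolled students taking part in each activity."""
    return {name: round(100 * count / total, 1) for name, count in participants.items()}


for label, counts, total in (
    ("AD14-AD17", AD14_AD17, AD14_AD17_TOTAL),
    ("AD18-AD20", LIFE, LIFE_TOTAL),
):
    print(label, participation_share(counts, total))
```

The shares within each period add up to more than 100%, which is consistent with the remark that a student could take part in several activities at the same time.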

Conclusions
Through this data descriptor, a non-identifiable dataset of 121,584 High School and Undergraduate students from Tecnologico de Monterrey was provided in order to contribute to the scientific community with data that will allow it to generate more accurate models to predict student dropout in higher education institutions. The generation of an appropriate model based on this dataset would benefit the students, by having timely and personalized strategies from their institution that support their permanence in their career, as well as the institution, by improving their statistics of student degree completion and their student investment costs.
The dataset is made up of variables reported in the literature as good predictors of school dropout, as well as institution-specific variables that are part of student life. Because the data on the literature-based variables come from a different institution, researchers can use them to test models already developed at their own institution, which may lead to new findings or improvements to those models.
On the other hand, the new variables (student life) could provide new relationships between the factors already studied that could enhance the development of new or improved models to predict student performance and identify at-risk students. Most papers use traditional Machine Learning algorithms (e.g., logistic regression, k-nearest neighbors, and decision tree-based ensemble models) [13,34]. However, only 5% of the studies have applied unsupervised learning algorithms [16]. Furthermore, the emergence of Explainable Artificial Intelligence (XAI) tools has made it possible to use advanced Machine Learning algorithms for interpretable dropout prediction [35-37].