Article

Identifying the Determinants of Academic Success: A Machine Learning Approach in Spanish Higher Education

by Ana María Sánchez-Sánchez 1,*, Jorge Daniel Mello-Román 2, Marina Segura 3 and Adolfo Hernández 3

1 Faculty of Statistical Studies, Universidad Complutense de Madrid, 28223 Madrid, Spain
2 Faculty of Exact and Technological Sciences, Universidad Nacional de Concepción, 010123 Concepción, Paraguay
3 Department of Financial and Actuarial Economics & Statistics, Universidad Complutense de Madrid, 28223 Madrid, Spain
* Author to whom correspondence should be addressed.
Systems 2024, 12(10), 425; https://doi.org/10.3390/systems12100425
Submission received: 13 September 2024 / Revised: 30 September 2024 / Accepted: 10 October 2024 / Published: 12 October 2024

Abstract

Academic performance plays a key role in assessing the quality and equity of a country’s educational system. Studying the aspects or factors that influence university academic performance is therefore an important research opportunity. This article builds on research that employs machine learning techniques to identify the determinants of academic performance in first-year university students. A total of 8700 records from the Complutense University of Madrid, corresponding to all incoming students in the academic year 2022–2023, have been analyzed, with information available on 28 variables related to university access, first-year academic performance, and socioeconomic characteristics. The methodology included feature selection using Random Forest and Extreme Gradient Boosting (XGBoost) to identify the main predictors of academic performance and avoid overfitting in the models, followed by analysis with four machine learning techniques: Linear Regression, Support Vector Regression, Random Forest, and XGBoost. The models showed similar predictive performance, and the predictors of academic performance coincided both at the end of the first semester and at the end of the first academic year. Our analysis detects the influence of two variables that had not previously appeared in the literature: the admission option and the number of enrolled credits. This study contributes to understanding the factors that impact academic performance, providing key information for implementing educational policies aimed at achieving excellence in university education. This includes, for example, peer tutoring and mentoring programs in which high- and low-performing students could participate.

1. Introduction

The term “academic performance” refers to the competence (knowledge and skills) acquired by students in a given area or subject of knowledge. Its evaluation aims to provide a pedagogically rigorous account of the progress made by the student within an educational process [1,2,3].
Considering the above, the academic performance of higher education students refers to the evaluation of the knowledge acquired by these students as a result of the teaching and learning process in which they participate during their university studies [4]. It is assumed that students who demonstrate good academic performance are those who achieve high grades in the subjects in which they are enrolled.
Student academic performance is one of the key issues in addressing the quality of higher education [5]. Within the framework of the 2030 Agenda approved in 2015, the United Nations establishes, as goal number 4 of the 17 Sustainable Development Goals (SDGs) that make up this agenda, the aim to ‘ensure inclusive, equitable and quality education and promote lifelong learning opportunities for all’ (Sustainable Development Goals, n.d.). In the same year, the World Education Forum declared that inclusion and equity in education are the cornerstones of a transformative education agenda, with a commitment to addressing all forms of exclusion and marginalization and focusing efforts on access, equity, inclusion, quality, and learning outcomes within a lifelong learning approach [6].
There is a wide variety of perspectives and approaches using data, analytics, and machine learning techniques in research strategies. The literature reviewed focuses partly on investigating how contextual and educational factors such as course design and institutional policies, among others, can influence student performance [7]. Some of these works include data on classroom participation, online behaviors, and demographic characteristics, among other variables like interactions in a learning management system (LMS) [8,9,10].
Some articles relate the student-to-teacher ratio and academic performance, which can provide information about the effectiveness of policies to reduce the number of students per classroom [11]. Self-efficacy, motivation, procrastination, the use of learning management systems, and other study behaviors have also been studied to observe student academic performance in online environments [12,13].
The most common indicators of academic performance are the grades and scores obtained in exams, assignments, projects, or other evaluated activities, as well as the overall average of all grades over a specific period and the proportion of courses passed or failed by a student [14]. Grades proposed by the students themselves (self-assessments), and their comparison with the grades provided by the teachers, can also be used to measure academic performance.
In addition to grades, other aspects have been observed to establish a relationship with academic performance. Among them, it is worth noting that student retention in a course or academic program [15], class attendance, participation in academic activities, and the time a student invests in completing tasks or courses compared to deadlines can be performance indicators. Some papers have applied psychometric tests to measure cognitive or emotional characteristics, such as self-efficacy or motivation, and relate them to academic performance [3]. Other works have incorporated demographic variables such as gender, age, ethnic origin, or other personal factors to evaluate their relationship with this type of performance.
In online learning environments, student interactions with the platform can be used as a data source to assess their academic performance, taking into account different variables: login frequency, connection time, heat maps, material downloads, participation in forums, use of chats and email, or other online educational and communication resources with teachers and classmates [2,16,17,18,19,20].
In this work, the average of the grades of the first semester and of the first academic year has been considered as the measure of academic performance. This is a classic approach that appears frequently in the analysis of academic performance as a universally recognized metric [21], although some authors use a related variable, the time it takes the student to graduate, as a measure of academic performance [22].
The purpose of this work is to develop predictive models of academic performance using an educational data mining approach, based on the Integrated Institutional Data System (Sistema Integrado de Datos Institucionales—SIDI), an educational database of the Complutense University of Madrid (Universidad Complutense de Madrid—UCM). This university has the highest number of on-site students in Spain and the third highest in Europe, currently totaling 71,806 students. This study seeks to identify the most significant variables for academic performance at the end of the first semester and the first academic year across different areas of knowledge: Social Sciences and Law, Arts and Humanities, Sciences, Health Sciences, and Engineering.
We have formulated the following research questions:
RQ1: What are the most relevant variables in university academic performance?
RQ2: Are first-semester grades a determining factor in student academic performance?
RQ3: Are first-year grades significant in a student’s academic performance?
RQ4: Do socioeconomic variables affect university academic performance?
The construction of predictive models of academic performance at the university level based on institutional databases, in addition to updating the state of the art in the field of educational data mining, has significant practical relevance because it allows for the early identification of students at risk, the development of educational strategies taking into account individual or sectoral needs, the optimization of resource allocation, and the introduction of improvements in educational policies. Based on the information generated by these models, universities can generate timely institutional interventions and improve both academic outcomes and the overall student experience.
The rest of the paper is organized as follows: Section 2 presents a literature review. Section 3 presents the dataset and introduces the different machine learning methods applied. Section 4 includes the main results of the study in the Complutense University of Madrid. Finally, the discussion and conclusions are presented in Section 5. This section also includes the practical and theoretical implications and limitations.

2. Literature Review

In the field of educational data mining (EDM), the predictive modeling of academic performance seeks to develop models to predict student academic performance using historical data and socioeconomic, demographic, educational, and behavioral characteristics [2]. Access to institutional educational databases, along with the application of advanced machine learning techniques, allows for the identification of the patterns and factors that influence academic performance. The identification of the determinant factors of academic performance, whether they are individual, environmental, or institutional, is crucial for understanding the complex interactions that affect academic performance [23]. Some papers have adopted an interdisciplinary approach, integrating knowledge from psychology, sociology, economics, and education to identify and understand this socio-educational phenomenon [24,25].
Furthermore, the development of early warning systems for detecting academic risks is an important area of research in educational data mining. These systems aim to identify the students at risk of low academic performance or dropout before it occurs, which requires the setting up of predictive models that can detect the early signs of academic difficulties and provide timely and personalized interventions [26].
Since 2010, various statistical and machine learning techniques have been applied to the analysis of student data to predict student academic performance or, at an early stage, their propensity to drop out of their studies. The use of e-learning platforms has led to a massive increase in data from the digital footprint that the students provide when they use these platforms and, consequently, the possibilities of making predictions related to their academic behavior have increased [27,28].
Classification algorithms, regression, and other methods have been employed to identify patterns in the data that can predict academic success [25,29]. Studies of this kind use different statistical models and techniques, from linear and logistic regressions to neural networks and decision tree algorithms. In several cases, they utilize multiple variables to obtain a more comprehensive predictive modeling of student academic performance and their likelihood of dropping out of studies [30]. Other research has applied dimension reduction techniques such as Principal Component Analysis (PCA) to identify the underlying patterns in academic data [31].
A recurrent aspect of educational data is also the presence of the hierarchical or clustering structure of the data. According to Valle et al. [32], when data present a hierarchical structure, multilevel regression analysis is effective in modeling the relationships between the predictor variables and academic performance, taking into account the clustering structures [33].
Regression models in machine learning demonstrate a robust ability to handle hierarchically nested data, depending on the complexity of the data structure and the model’s ability to capture such complexity [34]. This ability has been observed in several works that have analyzed the relationships between predictor variables and academic performance, such as the paper by Páez and Ramírez [35] applied to engineering students. Machine learning techniques such as decision trees, Random Forests (RFs), and Support Vector Machines (SVMs) have proven to be effective in capturing the complex relationships between the variables and predicting academic performance [36].
Within the realm of academic achievement research and the prediction of student success, other studies have incorporated qualitative methods, such as interviews and focus groups, to gain a deeper understanding of student perceptions and behaviors. For example, ref. [37] uses interviews to explore how resilience and academic self-efficacy influence student performance, providing a detailed analysis of individual perceptions. Similarly, ref. [38] employs focus groups as part of their research to develop a predictive model of student success, using group discussions as a key source of information to enrich their analysis. Furthermore, ref. [39] integrates semi-structured interviews to assess students’ perceptions of their self-regulation and learning strategies. These qualitative approaches allow the traditional predictive models to be complemented with subjective data, providing a more holistic and contextualized framework of academic performance and student behavior.

3. Materials and Methods

The first phase of the study involves applying two feature selection techniques, RF and Extreme Gradient Boosting (XGBoost), to identify the key variables of academic performance at two points in time: the end of the first semester and the end of the first year. Subsequently, machine learning regression models will be trained: Linear Regression (LR), Support Vector Regression (SVR), Random Forest (RF), and XGBoost; a comparative analysis of their performance will then be conducted based on the evaluation metrics obtained [40].
The three machine learning techniques chosen, Random Forest, XGBoost, and Support Vector Regression, were selected based on their proven effectiveness in predicting academic performance and handling the types of data used in this study [40,41,42]. Linear Regression was included for its simplicity and interpretability, providing clear insights into how each predictor variable influences the target outcome and serving as a baseline for comparing the predictive performance of other techniques.

3.1. Dataset Description, Access, and Extraction

The initial dataset consisted of 11,211 first-year students, corresponding to the students enrolled in the 2022–2023 academic year at the UCM in a total of 74 degrees and 21 double degrees. The data were obtained from the SIDI. The article is framed within a research project, and a confidentiality agreement was signed with the Institutional Intelligence Center of the Complutense University of Madrid to comply with the General Data Protection Regulation (GDPR). At no time has there been access to the students’ identification data.
After a cleaning process for missing data, the analysis was carried out with a sample of 8700 students from the initial set of 11,211; records with missing data were deleted. Although around 2500 students were discarded, the remaining sample of nearly 9000 students is still large. Several tests were run to check that this deletion did not have an impact on the results for this particular dataset, comparing whether the distributions, means, or medians (where appropriate) of the different variables were the same when the dataset was divided into two groups, defined by the existence (or not) of missing data in a particular variable. In all cases, the decision of the hypothesis test (chi-square, Kolmogorov–Smirnov, Student’s t, Mann–Whitney) was not to reject the null hypothesis of equal distribution, mean, or median.
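These comparisons can be scripted directly in R, the software used later in this study. The sketch below is illustrative only: the data frame `sidi` and the column names are hypothetical placeholders, not the actual SIDI field names.

```r
# Illustrative sketch: compare students with and without a missing value in
# `miss_var` on another variable `test_var`, using the tests mentioned above.
check_missingness_impact <- function(df, miss_var, test_var) {
  grp <- is.na(df[[miss_var]])          # TRUE = record has a missing value in miss_var
  x   <- df[[test_var]]
  if (is.numeric(x)) {
    list(
      kolmogorov_smirnov = ks.test(x[grp], x[!grp]),   # equality of distributions
      student_t          = t.test(x ~ grp),            # equality of means
      mann_whitney       = wilcox.test(x ~ grp)        # shift in medians/distribution
    )
  } else {
    list(chi_square = chisq.test(table(grp, x)))       # independence for categorical variables
  }
}

# Hypothetical call:
# check_missingness_impact(sidi, miss_var = "fathers_education", test_var = "access_grade")
```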

3.2. Preliminary Analysis: Main Statistics

The 28 variables used for the analysis of academic performance can be classified into socioeconomic variables, variables related to university access, and academic variables corresponding to the first year. The list and descriptions of all the variables used are detailed in Appendix A.
The most relevant information on the categorical variables analyzed in this study is presented below. The largest proportion of students in the sample belongs to the area of Arts and Humanities, representing 39.37%. Regarding the gender distribution, women account for 64.78% of the total number of students, and the majority come from the city of Madrid (73.52%) and were born in Spain (92.79%). Concerning the level of education of the parents, higher education is the most common, with 64.29% for fathers and 56.06% for mothers. Looking at the schools of origin, the most common profiles are schools offering high school only (94.13%), public schools (58.60%), and schools located in Madrid (73.17%). In the group of variables related to academic access, 65.46% of the students are studying their first-choice degree, and the most common access specialty is Social Sciences and Humanities, which represents 47.82% of the total. It is also noteworthy that the ordinary call is the most frequent (94.62%) and that the most prevalent reason for admission is the general quota (71.63%). The vast majority of students study the degree full-time (98.15%), and 70.44% do not have a scholarship. Table 1 displays the totals for the categorical variables analyzed, along with their values.
Table 2 contains the main statistics for each of the quantitative variables analyzed (mean, median, standard deviation, and skewness coefficient). The data provided in the table show several significant aspects of the academic and demographic profile of the students. With regard to age, the mean is 18.64 years old, with a median of 18 and a standard deviation of 1.95, indicating a relatively symmetric distribution. The mean academic amount is 582.46 euros, with a median of 584 and a standard deviation of 541.37, suggesting some variability in the costs associated with education. In contrast, the mean administrative fee is 30.08 euros with a median of 35, but a standard deviation of 9.00, indicating a tighter distribution of these costs. In terms of grades, the average access grade is 10.72, with a median of 11, and the numbers of ECTS (European Credit Transfer System) credits enrolled and passed in the first year have means of 60.76 and 50.79, respectively. However, the median number of ECTS passed in the first year is the same as the median number of credits enrolled in the first year (60). Concerning the skewness coefficients, they are mostly negative (except for the variables age and academic amount), indicating left-skewed distributions, that is, distributions with a longer tail of values below the mean. These data provide detailed insight into the academic performance and demographic characteristics of the students, which may be useful to better understand their situation. Finally, regarding the dependent variables, the first-semester grade and the first-year grade have means of 6.31 and 6.56, respectively.
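As an illustration of how the statistics reported in Table 2 can be computed, the following R sketch produces the four measures for a set of quantitative columns; the data frame `students` and the column names are hypothetical placeholders.

```r
library(e1071)  # provides skewness()

quant_vars <- c("age", "academic_amount", "administrative_fee", "access_grade",
                "ects_enrolled_1st_year", "ects_passed_1st_year")

stats_table <- t(sapply(students[quant_vars], function(x) c(
  mean     = mean(x, na.rm = TRUE),
  median   = median(x, na.rm = TRUE),
  sd       = sd(x, na.rm = TRUE),
  skewness = skewness(x, na.rm = TRUE)
)))
round(stats_table, 2)   # one row per variable, as in Table 2
```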

3.3. Feature Selection

Feature selection is the process of choosing the most relevant variables from a dataset before applying machine learning models, and it is crucial for several reasons. First, it reduces the dimension of the feature space, which can enhance model generalization by avoiding overfitting [43]. Second, it facilitates the interpretation and understanding of the model by focusing on relevant features [44]. Third, it improves model performance by removing the irrelevant features [45], thereby reducing the training time.
Feature selection in machine learning has been applied in various fields, such as medicine [46], fraud detection [47], and crop yield prediction [45], among others. In the field of educational data mining, feature selection has proven to be effective in improving the accuracy of predictive models and understanding the factors with an influence on student performance [23], as well as enhancing model generalization by avoiding overfitting when predicting student performance [48].
Recent research has demonstrated the value of the RF and XGBoost techniques in feature selection for improving the prediction of academic performance in educational environments [49]. For example, Pujianto et al. [50] applied the RF method to predict the acceptance of science students in secondary education, and Almutairi et al. [41] used XGBoost to predict student academic performance. In this work, RF and XGBoost have been chosen to be implemented as feature selection techniques as a preliminary step before developing the machine learning predictive models.
In this work, we chose to implement Random Forest (RF) and Extreme Gradient Boosting (XGBoost) for feature selection as a preliminary step in developing predictive machine learning models. These techniques are effective in handling high-dimensional data and capturing complex interactions, making them ideal for educational data. By identifying the key predictors of academic performance, RF and XGBoost enable the creation of accurate and generalizable models. The following subsections provide a brief reference for their technical formulations.

3.3.1. Random Forest

RF is a machine learning algorithm based on the construction of multiple decision trees from random subsets of the training data. Each tree in the forest makes independent decisions and the final prediction is obtained by combining the predictions of all trees [51].
Given a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ represents the input features and $y_i$ the corresponding target value, $B$ decision trees are generated by random sampling with replacement from $D$. Each tree $k$ is trained with a randomly selected subset of features at each node. The final prediction for a regression problem is calculated as the average of the predictions of all trees [52]:
$$\hat{y} = \frac{1}{B}\sum_{k=1}^{B}\hat{y}_k$$
Here, $\hat{y}_k$ represents the prediction of the k-th tree. Several studies have demonstrated the effectiveness of RF as a feature selection technique in educational data mining, showing that it is essential for eliminating irrelevant variables and improving the accuracy of predictive models of academic performance [53,54].
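For reference, a minimal randomForest regression fit of this kind in R could look as follows; the data frame `train_set`, the test set, and the target column name are hypothetical placeholders.

```r
library(randomForest)

set.seed(123)
rf_model <- randomForest(
  first_year_grade ~ .,     # all other columns used as candidate predictors
  data       = train_set,
  ntree      = 500,         # number of trees B in the ensemble
  importance = TRUE         # store permutation importance (%IncMSE) for later use
)

# The prediction for each new observation is the average of the 500 tree predictions
pred_rf <- predict(rf_model, newdata = test_set)
```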

3.3.2. Extreme Gradient Boosting

XGBoost is a machine learning technique based on building a set of decision trees, where each tree is trained sequentially to correct the errors of the previous trees. The mathematical formulation seeks to minimize a regularized loss function consisting of two terms: one measures the discrepancy between the predictions and the actual values, and another penalizes the complexity of the regression tree model used in the ensemble [55].
Given a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ represents the input features and $y_i$ the corresponding continuous output, XGBoost aims to minimize the following objective function:
$$L(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
Here, $l$ measures the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$, and $\Omega(f_k)$ penalizes the complexity of the model. The final prediction for an input $x$ is obtained by summing the predictions of all $K$ trees:
$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$
The use of XGBoost has been shown as an effective machine learning technique in the analysis of educational data [22], particularly in predicting student academic performance [56]. As a feature selection technique, it has been applied in various research areas, including the educational field, highlighting its effectiveness in improving model performance and extracting relevant information from complex datasets [57,58,59].
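A minimal xgboost regression fit corresponding to the objective above might look as follows in R; the matrices `X_train`/`X_test`, the label vector `y_train`, and the specific parameter values are illustrative assumptions, not the study’s exact configuration.

```r
library(xgboost)

dtrain <- xgb.DMatrix(data = X_train, label = y_train)   # X_train: numeric predictor matrix

xgb_model <- xgb.train(
  params = list(
    objective = "reg:squarederror",   # the loss term l(y_i, y_hat_i)
    eta       = 0.3,                  # learning rate (shrinkage of each tree's contribution)
    max_depth = 6,                    # limits tree complexity, part of the penalty term
    lambda    = 1                     # L2 regularization on leaf weights
  ),
  data    = dtrain,
  nrounds = 100                       # number of boosting rounds, i.e., trees K
)

# X_test: predictor matrix for new observations (illustrative)
pred_xgb <- predict(xgb_model, xgb.DMatrix(data = X_test))
```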

3.3.3. Importance of Variables

In RF, the “Increase in Mean Squared Error” (%IncMSE) is a metric used to evaluate the significance of variables in regression problems. This measure quantifies how much the Mean Squared Error (MSE) increases, on average, when the values of a variable are randomly permuted in the training data. In particular, it calculates the difference in the MSE before and after permutation, averaging these differences across all trees in the forest. A greater increase in MSE after permutation indicates a greater importance of the variable in predicting the response variable in a regression problem [60].
The metric commonly used in XGBoost to evaluate the importance of variables is the Gain, for both classification and regression problems. The Gain reflects the improvement that the model experiences by incorporating a particular variable in terms of the reduction of the objective function [61]. During the training process, an exhaustive search is performed on all variables and possible split points for each variable. The Gain metric is calculated for each potential split point, evaluating how it affects the objective function of the model [62].
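With fitted models such as those in the previous sketches, both importance measures can be extracted directly in R; the snippet below assumes the hypothetical `rf_model` and `xgb_model` objects defined earlier.

```r
library(randomForest)
library(xgboost)

# %IncMSE: permutation importance from the Random Forest (type = 1)
rf_imp     <- importance(rf_model, type = 1)
rf_ranking <- rf_imp[order(rf_imp[, "%IncMSE"], decreasing = TRUE), , drop = FALSE]
head(rf_ranking, 10)                               # ten most relevant predictors

# Gain: split-based importance from XGBoost
xgb_imp <- xgb.importance(model = xgb_model)
head(xgb_imp[, c("Feature", "Gain")], 10)          # ten most relevant predictors
```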

3.4. Machine Learning Regression Models

Over the years, various regression methods in machine learning have been used to predict academic performance. Linear Regression (LR) remains one of the most commonly used techniques, primarily because it allows for the interpretation of how each variable affects performance [63]. Support Vector Regression (SVR) has also been used when the data did not present linear characteristics [42]. Another technique that has demonstrated high performance in educational data mining is RF, and more recently, XGBoost has been implemented in academic performance prediction [64].
Based on the results already achieved by the cited research, this paper proposes to compare the performance of the four techniques, LR, SVR, RF, and XGBoost, in the task of predicting academic performance. The variables selected according to their importance by RF and XGBoost will be taken as the predictor variables, generating up to eight combinations of techniques and integrating the stages of feature selection and modeling.
The hyperparameters of the selected models were optimized using a grid search approach combined with 10-fold cross-validation. This is a widely used method for finding the best combination of hyperparameters by exhaustively searching through a specified parameter grid [65].
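As an example of this procedure, the following sketch tunes the Random Forest with caret using a grid search under 10-fold cross-validation; the candidate grid values and object names are illustrative, not the exact grid searched in the study.

```r
library(caret)

ctrl    <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
rf_grid <- expand.grid(mtry = c(2, 4, 6, 8))          # candidate values of mtry

set.seed(123)
rf_tuned <- train(
  first_year_grade ~ .,
  data      = train_set,
  method    = "rf",
  trControl = ctrl,
  tuneGrid  = rf_grid,
  ntree     = 500
)

rf_tuned$bestTune    # combination with the lowest cross-validated RMSE
```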
The main evaluation metrics for regression methods in machine learning are R2, or the coefficient of determination, and the Root Mean Square Error (RMSE). R2 measures the goodness-of-fit between the model and the observed data, indicating how well the formulated regression equation describes the data distribution:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $y_i$ represents the actual values, $\hat{y}_i$ represents the predicted values, and $\bar{y}$ is the mean of the actual values.
RMSE is another standard measure in regression model evaluation, representing the square root of the average of all squared errors:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
These metrics provide a comprehensive evaluation of model performance. R2 represents the proportion of variance explained by the model, while RMSE provides insight into the average prediction error [63].
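The two formulas translate directly into R; the short helper functions below implement them and can be used when evaluating test-set predictions.

```r
# Root Mean Square Error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Coefficient of determination (1 - SS_residual / SS_total)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
```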

3.4.1. Linear Regression

Linear Regression is a classical statistical modeling technique used in the field of educational data mining to analyze and predict student academic performance based on predictor variables. In this approach, the goal is to establish a linear relationship between a dependent variable, such as the Cumulative Grade Point Average (CGPA) of students, and a set of independent variables or factors [50,66,67].
The mathematical formulation of LR is as follows. Given a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, the Linear Regression model can be expressed as
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$$
where $\beta_0$ is the intercept, $\beta_j$ are the coefficients for the predictor variables $x_{ij}$, and $\varepsilon_i$ is the error term. The advantages of Linear Regression are its simplicity and ease of interpretation; however, the assumption of linearity in the data can be a significant limitation for datasets of a nonlinear nature.
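In R, such a model is fitted with lm(); the sketch below uses hypothetical object and column names, and the coefficient signs indicate the direction of each predictor’s association with the grade.

```r
lr_model <- lm(first_semester_grade ~ ., data = train_set)

summary(lr_model)$coefficients      # estimates, standard errors, t-values, p-values
pred_lr <- predict(lr_model, newdata = test_set)
```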

3.4.2. Support Vector Regression

SVR is a machine learning technique also used in the field of educational data mining to predict student academic performance. It is an extension of SVMs, but employed specifically for regression problems. SVR essentially seeks to find, in a high-dimensional feature space or hyperplane, a function that best fits the data, minimizing the prediction error and model complexity subject to a tolerance margin [68].
The mathematical formulation is as follows. SVR aims to find a function $f(x)$ that deviates from the actual targets $y_i$ by a value no greater than $\varepsilon$ for each training point, and is as flat as possible. The SVR model is
$$f(x) = \langle w, x \rangle + b$$
where $\langle w, x \rangle$ is the dot product of the weight vector $w$ and the input features $x$, and $b$ is the bias term. The optimization problem for SVR is as follows:
$$\min \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
subject to $y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i$, $\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$. Here, $\xi_i$ and $\xi_i^*$ are slack variables that allow for some errors, and $C$ is a regularization parameter that controls the trade-off between the model’s complexity and the number of allowed errors.
An advantage of using SVR is its efficiency in handling data with nonlinear features, while a drawback of SVR could be its sensitivity to the selection of the model hyperparameters.
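A minimal epsilon-SVR fit with the e1071 package (part of the software stack listed in Section 4) could look as follows; the kernel and hyperparameter values are illustrative. Note that e1071 parameterizes the RBF kernel with gamma, whereas kernlab (used through caret) calls the analogous parameter sigma.

```r
library(e1071)

svr_model <- svm(
  first_year_grade ~ .,
  data    = train_set,
  type    = "eps-regression",
  kernel  = "radial",
  cost    = 1,       # regularization parameter C
  gamma   = 0.1,     # RBF kernel coefficient
  epsilon = 0.1      # width of the tolerance margin around the regression function
)

pred_svr <- predict(svr_model, newdata = test_set)
```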

3.4.3. Random Forest and XGBoost

Although the use of RF and XGBoost has already been mentioned in the context of educational data mining, particularly for feature selection according to variable importance, both have also been widely used as regression methods that seek to predict student academic performance from a set of predictors or factors [69,70,71].
RF is characterized by its ability to handle datasets with multiple predictor variables and its robustness as a predictive model due to the partitioning it performs on several subsets of data during the training stage. XGBoost has demonstrated high predictive performance on large and complex datasets; however, its interpretation may be more complex than other simpler models. Both methods require proper tuning of their hyperparameters.

4. Results

The results were obtained following the implementation of several stages. The first stage was data preprocessing, which consisted of scaling the quantitative and ordinal qualitative variables to eliminate the influence of their different scales [72].
The quantitative predictors (e.g., “Age”, “Academic Amount”, “Administrative Fee”, “Access Grade”, “Number of ECTS Enrolled”) were standardized using Z-score normalization to eliminate the influence of different variable scales and ensure that each feature contributes equally to the model. For the ordinal variables (e.g., “Admission Option”, “Mother’s Education Level”, “Father’s Education Level”), Min–Max scaling was applied to preserve the order and range of these variables. This method transforms the variables into a [0, 1] range, making them compatible with machine learning models without distorting their ordinal nature. The categorical variables were encoded using appropriate methods depending on their nature. For the binary variables (e.g., “Gender”), we used binary encoding (0 or 1). For the multi-category variables without inherent ordering, we employed one-hot encoding to create a new binary feature for each category, allowing the models to handle them appropriately [73].
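A hedged sketch of this preprocessing in R is shown below; the data frame `students` and the column names are illustrative placeholders, and the ordinal variables are assumed to be stored as numeric codes.

```r
library(caret)

quant_vars   <- c("age", "academic_amount", "administrative_fee",
                  "access_grade", "ects_enrolled_1st_year")
ordinal_vars <- c("admission_option", "mothers_education", "fathers_education")

# Z-score standardization of the quantitative predictors
students[quant_vars] <- scale(students[quant_vars])

# Min-max scaling of the ordinal predictors to [0, 1], preserving their order
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
students[ordinal_vars] <- lapply(students[ordinal_vars], min_max)

# Binary encoding for Gender (0 = Women, 1 = Men, as coded in this study)
students$gender <- ifelse(students$gender == "Men", 1, 0)

# One-hot encoding of nominal multi-category variables (e.g., area, access specialty)
dummies  <- dummyVars(~ area + access_speciality, data = students)
students <- cbind(students, predict(dummies, newdata = students))
```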
The next stage was feature selection using two techniques, RF and XGBoost, in which the ten most relevant variables were chosen based on the %IncMSE and Gain metrics, respectively, generating two subsets of predictors.
The regression models were trained by dividing the datasets into training (80%) and testing (20%) sets. The hyperparameters of the RF, SVR, and XGBoost techniques were adjusted, and the models were then trained with the best hyperparameters. Predictions of the first-year grade and first-semester grade were made on the test set using the trained models, and the evaluation metrics RMSE (Root Mean Squared Error) and R2 were calculated to assess the models’ performance.
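The protocol just described corresponds roughly to the following R sketch (hypothetical object names), reusing the rmse() and r_squared() helpers and the rf_model sketched in Section 3.4.

```r
library(caret)

set.seed(123)
idx       <- createDataPartition(students$first_year_grade, p = 0.80, list = FALSE)
train_set <- students[idx, ]    # 80% for training
test_set  <- students[-idx, ]   # 20% held out for testing

# rf_model is fitted on train_set as in the earlier sketch
pred <- predict(rf_model, newdata = test_set)

c(RMSE = rmse(test_set$first_year_grade, pred),
  R2   = r_squared(test_set$first_year_grade, pred))
```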
The computations were performed using R software version 4.4.1 with the e1071, randomForest, caret, xgboost, and kernlab libraries. The following sections describe in detail the results of the feature selection and regression method evaluation stages.

4.1. Feature Selection

Taking the First-year grade as the target variable, which is the average grade obtained by students at the end of the first year of their university studies, the RF and XGBoost techniques were used to select the most important variables for predicting this variable. Table 3 presents the results of the %IncMSE and Gain metrics for RF and XGBoost, respectively, and ranks the predictor variables in order of importance.
From Table 3, it can be seen that the Access grade is the main predictor of the First-year grade, while, to a lesser degree, the number of credits enrolled, both for the first semester and for the entire year, also plays a prominent role in a student’s average grade at the end of the academic year. Two other variables that appear to be significant in estimating the First-year grade are Gender and the Admission option.
Taking instead the First-semester grade as the target variable, which is obtained by students at the end of the first semester, and running both techniques, the results presented in Table 4 are obtained. It can be observed from the table that the Access grade remains the primary predictor, followed by the number of credits enrolled for both the first semester and the entire year.
A comparison of Table 3 and Table 4 shows that the most important predictors of the First-year grade and First-semester grade are the same for each of the techniques, although with slight changes in the order of importance from the fifth variable onwards. Figure 1 illustrates the importance of the variables as determined by RF and XGBoost for predicting academic performance. Specifically, Figure 1a shows the importance of the variables for predicting the First-year grade, while Figure 1b displays the importance for predicting the First-semester grade.

4.2. Machine Learning Methods Evaluation

As a step prior to the evaluation of the machine learning methods, and in order to ensure the optimal predictive performance of the algorithms, the hyperparameters of three techniques were adjusted: RF, SVR, and XGBoost. Different configurations were tested, considering both the First-year grade and the First-semester grade as dependent variables, and using the two subsets of data generated. The first subset consisted of the predictors selected by RF, and the second subset comprised the predictors selected by XGBoost. The results were consistent for all tested configurations and subsets of data.
We fine-tuned the hyperparameters using a grid search approach in R. For RF, we set the number of trees to 500 and determined the optimal mtry value to be 2. For SVR, we optimized the regularization parameter (C) to 1 and the kernel coefficient (sigma) to 0.1.
For XGBoost, we set the learning rate (eta) to 0.3, the maximum tree depth to 6, and the minimum child weight to 1. These hyperparameters were determined using a 10-fold cross-validation process, ensuring robust model evaluation and minimizing overfitting by iterative training and testing the model on different subsets of the data.
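Expressed as caret training calls, these final settings would look roughly as follows; values not reported in the text (such as the number of boosting rounds or the subsampling rates) are placeholders, and the variable names are hypothetical.

```r
library(caret)
ctrl <- trainControl(method = "cv", number = 10)

rf_fit  <- train(first_year_grade ~ ., data = train_set, method = "rf",
                 trControl = ctrl, tuneGrid = data.frame(mtry = 2), ntree = 500)

svr_fit <- train(first_year_grade ~ ., data = train_set, method = "svmRadial",
                 trControl = ctrl, tuneGrid = data.frame(sigma = 0.1, C = 1))

xgb_fit <- train(first_year_grade ~ ., data = train_set, method = "xgbTree",
                 trControl = ctrl,
                 tuneGrid = data.frame(nrounds = 100, eta = 0.3, max_depth = 6,
                                       min_child_weight = 1, gamma = 0,
                                       colsample_bytree = 1, subsample = 1))
```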
After the hyperparameter setting, the machine learning regression methods were evaluated in the task of predicting academic performance, considering both the First-year grade and First-semester grade as dependent variables, discriminating the set of predictors for both the subsets selected by RF and XGBoost. The evaluation metrics R2 and RMSE are presented in Table 5.
From Table 5, it can be observed that for predicting the First-year grade, the RF and XGBoost methods show similar performances, with an R2 of 0.22 and an RMSE of 1.42–1.47. SVR shows a lower performance, with an R2 of 0.18 and an RMSE of 1.47. Although not far behind, the lowest performance is obtained by LR, with an R2 of 0.17 and an RMSE of around 1.48. Considering the subset of predictors selected by Random Forest or XGBoost, no significant differences in the evaluation metrics are noticed.
In the case of the First-semester grade prediction, the RF and XGBoost methods show similar but slightly better performances compared to the First-year grade prediction, with an R2 of around 0.23–0.24 and an RMSE of around 1.56–1.59. The SVR model also shows a better performance compared to the First-year grade prediction, with an R2 of 0.21–0.23 and an RMSE around 1.56–1.57.
As can be seen, the general predictive strength, although statistically significant, is not extremely high. The current models will likely need to be supplemented in some way to achieve better predictive quality. Thus, factors such as attendance might be important, as might personal attributes such as individual positive psychology (i.e., resilience, self-efficacy, etc.) [30].
The RF model shows an overall slightly better performance in terms of R2 and RMSE compared to the other techniques for the prediction of both dependent variables, First-year grade and First-semester grade. However, XGBoost shows a competitive performance relative to RF, with similar results in terms of R2 and RMSE.
Figure 2 presents the scatter plots of the actual values versus predicted values in the test set. The two configurations with the best predictive performances were considered. For predicting the First-year grade, the RF regression method, after feature selection by the same technique, showed a slightly superior performance compared to the other configurations. The scatter plot is presented in Figure 2a. For predicting the First-semester grade, the configuration with the best predictive performance was again Random Forest, but with feature selection using XGBoost. The scatter plot of the actual versus predicted values in the test set is shown in Figure 2b.
To determine the effect of the variables on academic performance, reference can be taken from the results obtained by Linear Regression in the prediction of the First-semester grade, where this technique obtained a better predictive performance with an R2 between 0.20 and 0.21, approaching the metrics obtained by the other methods. LR allows the interpretation of the direction of the relationship between the predictors and the dependent variable, through the coefficients shown in Table 6.
Table 6 shows the results and coincident coefficients for the two subsets of predictors, both the ones selected by RF and the ones selected by XGBoost. It is observed that the Access grade, the main predictor, has a direct relationship with the student’s academic performance in the first semester, i.e., the higher the Access grade, the higher the First-semester grade the model predicts for the student. Specifically, a higher Access grade consistently predicts a higher First-semester grade, implying that the students entering the university with a stronger academic background tend to perform better initially.
It is also noteworthy that, given the coding of the variable Gender (0 = Women, 1 = Men), the model predicts that female students would obtain a better academic performance in the first semester than male students. This finding aligns with existing research, which often notes gender differences in academic outcomes during early university education (Almutairi et al. [41]).
Another variable to highlight is the Admission option, which takes ordinal values between 1 and 12, where 1 indicates that the student entered the degree that was his or her first choice. In other words, the higher the value of the Admission option variable, the lower the student’s preference for the degree in which he or she is currently enrolled. The results in Table 6 indicate that, as students are enrolled in degrees further away from their first choice, the model predicts a lower academic performance at the end of the first semester. That is, students who take their first-choice programs are more likely to excel academically, potentially due to greater motivation and interest in their field of study.

5. Discussion

This study provides several contributions to the analysis of academic performance. First, it stands out for using an exceptionally large and recent sample: 8700 incoming students corresponding to 95 degree programs in all areas of knowledge. This makes it one of the most exhaustive analyses to date, in contrast to previous studies that tend to analyze a smaller number of students and/or degrees [21,67]. In addition, the quality of the data used, the result of collaboration with the academic authorities of the UCM, is noteworthy. An additional contribution is the comparison between the most advanced machine learning techniques, revealing similar results in the prediction of academic performance.
In this research, we have chosen to implement RF and XGBoost as feature selection techniques as a previous stage to the development of predictive machine learning models, following the line of similar studies [23,48] that have demonstrated the contribution of the feature selection process to identifying the factors that most affect student performance, as well as the improvement in model generalization, avoiding overfitting.
When the academic data corresponding to the first semester and the first year are taken into account, the best predictor variable is again the access grade, as has been observed in previous works [9,64,73]. The gender variable also appears as important, which coincides with the result of Almutairi [41].
Nevertheless, unlike previous studies, it is observed that the number of credits enrolled, as well as the Admission option variables, play a significant role in predicting academic performance. None of these variables appear as relevant in the more than 300 articles reviewed in [9].
The best results are obtained with the RF technique, as indicated by Almutairi et al. [42] and Qasrawi et al. [70], achieving an R2 of 0.22, similar to the result obtained with the XGBoost technique (R2 = 0.21). These results are also similar to those obtained in the article by Roman et al. [67], despite the greater homogeneity and smaller size of their data [21]. The comparison of the RF and XGBoost techniques reveals promising results in terms of the predictive ability of both models, providing further evidence of their usefulness in the analysis of academic performance.
The consistent results from RF and XGBoost validate these findings. They also confirm the effectiveness of these machine learning techniques in feature selection for building predictive models. This provides a solid basis for the implementation of educational interventions aimed at improving student performance, such as personalized learning pathways, collaborative tutoring, and early intervention programs, or providing emotional support to the student.

Research Questions

This study addresses four fundamental research questions:
RQ1: What are the most relevant variables in university academic performance?
The findings of the study confirm that the entrance grade is one of the most predictive variables of academic performance, in line with previous research [9,64,73]. In addition, gender, number of credits enrolled, and admission option were also identified as significant factors in predicting performance, although the latter had not been widely recognized in previous studies [9]. These results reflect the importance of a combination of personal and academic factors in explaining student academic success.
RQ2: Are first-semester grades a determining factor in student academic performance?
The grades obtained during the first semester are confirmed as a crucial indicator of future academic performance, according to the studies by Vandamme et al. [74] and Subiros et al. [75]. The results of this study support the idea that these early grades reflect a student’s level of adaptation to the university environment, which allows for predicting their performance in later years.
RQ3: Are first-year grades significant in a student’s academic performance?
The results obtained indicate that first-year grades are a significant predictor of future academic performance, coinciding with the existing literature [21,41,67]. This underlines the importance of monitoring academic performance from the beginning of the university career to implement early intervention measures, which can contribute to improving the success rate and reducing dropouts.
RQ4: Do socioeconomic variables affect university academic performance?
This study also suggests that socioeconomic variables can have a significant impact on academic performance. Previous research has shown that factors such as parents’ educational level, family income, and occupation affect student performance [32,34]. In particular, the students from more disadvantaged socioeconomic backgrounds tend to face greater challenges that can negatively influence their academic performance. These findings highlight the importance of considering the socioeconomic context when developing strategies to improve academic performance and foster equity in higher education.
In summary, this study confirms that first-semester and first-year grades, along with factors such as entrance grade and personal variables, are key determinants of university academic performance.
By analyzing the variables that determine academic performance, the results become highly relevant in practical terms for both universities and students. Educational institutions can use these findings to identify the key factors that affect student performance and design personalized interventions, such as tutoring programs, emotional support, and pedagogical strategies tailored to individual needs. In addition, promoting a collaborative culture, based on an active dialogue between students, staff, and the community at large, becomes a central strategy for improving academic performance. The principle of student participation and ongoing collaboration in formal and informal settings reinforces an environment where students feel supported and heard, which enhances their academic success.
For students, the predictive models can be useful in identifying behavioral patterns or circumstances that may negatively impact their performance, allowing them to receive guidance and resources in a preventive manner. However, implementing these models in the real world presents challenges. Contextual and multi-dimensional factors influence performance, so it can be difficult to generalize the results to all institutions. Furthermore, technological infrastructure, staff training, and institutional willingness to invest in continuous improvement are critical aspects to ensure that these findings translate into concrete improvements. Despite these limitations, fostering active collaboration between all actors in the academic environment is key to transforming student performance in a sustainable way.

6. Conclusions

It is important to note that this study provides a comprehensive view of academic performance by considering a wide range of predictor variables. The inclusion of factors such as access grade, number of credits enrolled, gender, and Admission option provides a more complete understanding of the determinants of academic success. These findings can be highly valuable for educational institutions and decision-makers in implementing strategies to improve student retention, academic progress, and timely graduation. Ultimately, this research not only contributes to the advancement of knowledge in the field of academic performance, but also has the potential to positively impact the educational experience and academic development of university students.
As for the limitations of the study, it is important to point out its geographical scope: despite the wide variety of degrees offered by the UCM, the study is limited to the Autonomous Community of Madrid and does not cover the rest of Spain or other countries. For this reason, it would be interesting to extend it to other universities in the future.
Other important limitations are the potential for overfitting, the limited timeframe of the study (only first-year performance), the potential for biased or incomplete data, and the limitations of the chosen ML techniques. It has also been detected that the models work best for the profiles of students who are enrolled full-time, who come from public schools, and who accessed university through high school studies.
Another possible future line of work would be to carry out this study not in a static way, limited to a single academic year, but to carry out the study either over two years, incorporating the average grades of the first and second year, or over the life of the students’ academic careers, trying to predict the final average grade of their studies.

Author Contributions

A.M.S.-S., J.D.M.-R., M.S. and A.H., conceptualization and methodology; A.M.S.-S., data curation; A.M.S.-S. and J.D.M.-R., writing—original draft and formal analysis; M.S. and A.H., supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministerio de Ciencia e Innovación de España [Research project PID2020-116293RB-I00]. The APC was funded by the Department of Financial and Actuarial Economics & Statistics, Universidad Complutense de Madrid.

Data Availability Statement

The datasets were obtained from the Integrated Institutional Data System (Sistema Integrado de Datos Institucionales—SIDI), which belongs to the Institutional Intelligence Center of the Complutense University of Madrid (http://www.ucm.es/cii, accessed on 5 November 2023).

Acknowledgments

The datasets were obtained thanks to the Institutional Intelligence Center of the Complutense University of Madrid (http://www.ucm.es/cii). We also thank the reviewers for their suggestions on improving the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Variables and explanations.

Student ID: The ID that identifies the student.
Academic amount: The cost of the student’s enrollment.
Administrative fee: The administrative and insurance costs.
Degree: The degree program the student is studying; the variable used to determine the area.
Area: The area to which the student’s degree belongs (Social Sciences and Law, Sciences, Health Sciences, Engineering, Arts and Humanities).
Family township: A dichotomous variable that identifies whether the student’s family resides in the region of Madrid or not.
Admission option: The Spanish public university access system is competitive on the basis of student performance. A student can rank up to 12 degree–university options when applying for university studies.
Gender: A dichotomous variable identifying the sex of the student.
Country of birth: A dichotomous variable that identifies whether the student is Spanish or foreign.
Admission study: A dichotomous variable that identifies whether the student entered university from high school or from a professional training degree.
Access speciality: In the last years of school, the student must choose between subjects from different areas that will determine the specialty with which they mainly enter university (Social Sciences and Humanities, Arts, Technical Sciences, Health Sciences). However, this requirement is not compulsory; a science student can enter social science degrees and vice versa.
Time commitment: A dichotomous variable that identifies whether the student is enrolled full-time in the first year or not.
Access grade: The university entrance grade, an average of the marks of the last two years of high school and the entrance exam (out of 14).
Mother’s or guardian’s level of studies: The mother’s or guardian’s level of studies (illiterate, no education, primary education, secondary education, higher education).
Father’s or guardian’s level of studies: The father’s or guardian’s level of studies (illiterate, no education, primary education, secondary education, higher education).
Scholarship holder: A dichotomous variable that identifies whether the student receives any scholarship or not.
Type of school: A variable that identifies whether the school is a comprehensive school, only an upper secondary school, or only a professional degree school.
School holder: A variable that identifies whether the school is public, private, or private with public subsidy.
Location of the school: A dichotomous variable that identifies whether the student attended school in the region of Madrid or not.
PAU Call: The university entrance examination has two calls, ordinary and extraordinary.
Admission Reason: A student can be accepted under different quotas (general, disabled, elite athletes, appeal).
Age: The age of the student in the first year of university.
First-semester grade: The average first-semester grade at university (out of 10).
First-year grade: The average first-year grade at university (out of 10).
No. of ECTS Passed 1st semester: The number of ECTS credits passed in the first semester at university.
No. of ECTS enrolled 1st semester: The number of ECTS credits enrolled in the first semester at university.
Ratio of subject passes 1st semester: The ratio between ECTS credits passed and enrolled in the first semester.
No. of ECTS Passed 1st year: The number of ECTS credits passed in the first year at university.
No. of ECTS enrolled 1st year: The number of ECTS credits enrolled in the first year at university.
Ratio of subject passes 1st year: The ratio between ECTS credits passed and enrolled in the first year.

References

  1. Anderson, L.; Krathwohl, D. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives: Complete Edition; Addison Wesley Longman, Inc.: Boston, MA, USA, 2001. [Google Scholar]
  2. Bravo-Agapito, J.; Romero, S.J.; Pamplona, S. Early prediction of undergraduate student’s academic performance in completely online learning: A five-year study. Comput. Hum. Behav. 2021, 115, 106595. [Google Scholar] [CrossRef]
  3. Richardson, M.; Abraham, C.; Bond, R. Psychological correlates of university students’ academic performance: A systematic review and meta-analysis. Psychol. Bull. 2012, 138, 353–387. [Google Scholar] [CrossRef] [PubMed]
  4. Bowen, H.R.; Fincher, C. Goals: The intended outcomes of higher education. In Investment in Learning; Routledge: London, UK, 1996; pp. 31–60. [Google Scholar] [CrossRef]
  5. Hattie, J. Visible Learning: A Synthesis of over 800 Meta-Analyses Relating to Achievement; Taylor & Francis Group: New York, NY, USA, 2008; pp. 1–378. [Google Scholar]
  6. World Education Forum. Incheon Declaration: Education 2030: Towards Inclusive and Equitable Quality Education and Lifelong Learning for All. 2015. Available online: https://unesdoc.unesco.org/ark:/48223/pf0000233137 (accessed on 1 July 2024).
  7. Marzano, R.J. Marzano Levels of School Effectiveness; Research Laboratory: Bloomington, IN, USA, 2012. [Google Scholar]
  8. You, J.W. Identifying significant indicators using LMS data to predict course achievement in online learning. Internet High. Educ. 2016, 29, 23–30. [Google Scholar] [CrossRef]
  9. Schneider, M.; Preckel, F. Variables associated with achievement in higher education: A systematic review of meta-analyses. Psychol. Bull. 2017, 143, 565–600. [Google Scholar] [CrossRef] [PubMed]
  10. Chrysikos, A.; Ahmed, E.; Ward, R. Analysis of Tinto’s student integration theory in first-year undergraduate computing students of a UK higher education institution. Int. J. Comp. Educ. Dev. 2017, 19, 97–121. [Google Scholar] [CrossRef]
  11. McMillan, J.H.; Myran, S.; Workman, D. Elementary teachers’ classroom assessment and grading practices. J. Educ. Res. 2002, 95, 203–213. [Google Scholar] [CrossRef]
  12. McMillan, J.H.; Schumacher, S. Research in Education: Evidence-Based Inquiry, 7th ed.; Pearson: Hoboken, NJ, USA, 2010. [Google Scholar]
  13. Cervero, A.; Castro-López, A.; Álvarez-Blanco, L.; Esteban, M.; Bernardo, A. Evaluation of educational quality performance on virtual campuses using fuzzy inference systems. PLoS ONE 2020, 15, e0232802. [Google Scholar] [CrossRef]
  14. Papadogiannis, I.; Poulopoulos, V.; Platis, N.; Vassilakis, C.; Lepouras, G.; Wallace, M. First grade GPA as a predictor of later academic performance in high school. Knowledge 2023, 3, 513–524. [Google Scholar] [CrossRef]
  15. Kondo, N.; Okubo, M.; Hatanaka, T. Early detection of at-risk students using machine learning based on LMS Log Data. In Proceedings of the 2017 6th IIAI International Congress on Advanced Applied Informatics, IIAI-AAI 2017, Hamamatsu, Japan, 9–13 July 2017; pp. 198–201. [Google Scholar] [CrossRef]
  16. Brooks, C.; Thompson, C.; Teasley, S. Who you are or what you do: Comparing the predictive power of demographics vs. activity patterns in massive open online courses (MOOCs). In Proceedings of the L@S 2015–2nd ACM Conference on Learning at Scale, Vancouver, BC, Canada, 14–18 March 2015; pp. 245–248. [Google Scholar] [CrossRef]
  17. Romero, C.; López, M.I.; Luna, J.M.; Ventura, S. Predicting students’ final performance from participation in on-line discussion forums. Comput. Educ. 2013, 68, 458–472. [Google Scholar] [CrossRef]
  18. Alves, P.; Miranda, L.; Morais, C. The influence of virtual learning environments in Students’ performance. Univers. J. Educ. Res. 2017, 5, 517–527. [Google Scholar] [CrossRef]
  19. Pascual-Miguel, F.; Chaparro-Peláez, J.; Hernández-García, Á.; Iglesias-Pradas, S. A characterisation of passive and active interactions and their influence on students’ achievement using Moodle LMS logs. Int. J. Technol. Enhanc. Learn. 2011, 3, 403–414. [Google Scholar] [CrossRef]
  20. Abuzinadah, N.; Umer, M.; Ishaq, A.; Hejaili, A.; Alsubai, S.; Eshmawi, A.; Mohamed, A.; Ashraf, I. Role of convolutional features and machine learning for predicting student academic performance from MOODLE data. PLoS ONE 2023, 18, e0166111. [Google Scholar] [CrossRef] [PubMed]
  21. Alabduljabbar, A.; Almana, L.; Almansour, A.; Alshunaifi, A.; Alobaid, N.; Alothaim, N.; Shaik, S.A. Assessment of fear of failure among medical students at King Saud University. Front. Psychol. 2022, 13, 794700. [Google Scholar] [CrossRef] [PubMed]
  22. Aiken, J.M.; de Bin, R.; Hjorth-Jensen, M.; Caballero, M.D. Predicting time to graduation at a large enrollment American university. PLoS ONE 2020, 15, e0242334. [Google Scholar] [CrossRef] [PubMed]
  23. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; University of Chicago Press: Chicago, IL, USA, 1980; Volume 1. [Google Scholar]
  24. Mello-Román, J.D.; Gómez-Chacón, I.M. Creencias y rendimiento académico en matemáticas en el ingreso a carreras de ingeniería. Aula Abierta 2022, 51, 407–415. [Google Scholar] [CrossRef]
  25. Segura, M.; Mello, J.; Hernández, A. Machine learning prediction of university student dropout: Does preference play a key role? Mathematics 2022, 10, 3359. [Google Scholar] [CrossRef]
  26. Balfanz, R.; Byrnes, V. Early warning indicators and intervention systems: State of the field. In Handbook of Student Engagement Interventions: Working with Disengaged Students; Academic Press: Cambridge, MA, USA, 2019; pp. 45–55. [Google Scholar] [CrossRef]
  27. Lodge, J.M.; Corrin, L. What data and analytics can and do say about effective learning. NPJ Sci. Learn. 2017, 2, 5. [Google Scholar] [CrossRef]
  28. Macfadyen, L.P.; Dawson, S. Numbers are not enough. Why e-learning analytics failed to inform an institutional strategic plan. J. Educ. Technol. Soc. 2012, 15, 149–163.
  29. Tinto, V. Leaving College: Rethinking the Causes and Cures of Student Attrition, 2nd ed.; University of Chicago Press: Chicago, IL, USA, 1993. [Google Scholar] [CrossRef]
  30. Xing, W.; Guo, R.; Petakovic, E.; Goggins, S. Participation-based student final performance prediction model through interpretable genetic programming: Integrating learning analytics, educational data mining and theory. Comput. Hum. Behav. 2015, 47, 168–181. [Google Scholar] [CrossRef]
  31. Galvez, C. Análisis de co-palabras aplicado a los artículos muy citados en Biblioteconomía y Ciencias de la Información (2007–2017). Transinformação 2018, 30, 277–286. [Google Scholar] [CrossRef]
  32. Valle, A.; Rodríguez, S.; Núñez, J.C.; Cabanach, R.G.; González-Pienda, J.A.; Rosario, P. Motivación y Aprendizaje Autorregulado. Interam. J. Psychol. 2010, 44, 86–97. [Google Scholar]
  33. Gil-Vera, V.D.; Quintero-López, C.; Gil-Vera, V.D.; Quintero-López, C. Predicción del rendimiento académico estudiantil con redes neuronales artificiales. Inf. Tecnológica 2021, 32, 221–228. [Google Scholar] [CrossRef]
  34. Peñaloza, J.L.; Vargas, C.G.; Mello, J. The Hierarchical nesting effect in the study and interpretation of academic performance in the social sciences: A 2-level multinivel application. In Proceedings of the 18th Annual International Technology, Education and Development Conference, Valencia, Spain, 4–6 March 2024; pp. 6520–6526. [Google Scholar] [CrossRef]
  35. Páez, A.R.; Ramírez, N.D.G. Modelos predictivos del rendimiento académico a partir de características de estudiantes de ingeniería. IE Rev. Investig. Educ. Rediech 2022, 13, e1426. [Google Scholar] [CrossRef]
  36. Fernández-Lasarte, O.; Ramos-Díaz, E.; Sáez, I.A. Academic performance, perceived social support and emotional intelligence at the university. Eur. J. Investig. Health Psychol. Educ. 2019, 9, 39–49. [Google Scholar] [CrossRef]
  37. Cassidy, S. Resilience building in students: The role of academic self-efficacy. Front. Psychol. 2015, 6, 1781. [Google Scholar] [CrossRef]
  38. Long, C.; Ferrier, F. Actuarial models in higher education research: The use of focus groups for developing a predictive model of student success. J. Appl. Res. High. Educ. 2012, 4, 28–44. [Google Scholar]
  39. Cleary, T.J.; Zimmerman, B.J. Self-regulation empowerment program: A school-based program to enhance self-regulated and strategic learning. Psychol. Sch. 2004, 41, 537–550. [Google Scholar] [CrossRef]
  40. Ochoa, L.L.; Rosas Paredes, K.; Baluarte Araya, C. Evaluación de técnicas de minería de datos para la predicción del rendimiento académico. In Proceedings of the LACCEI International Multi-Conference for Engineering, Education and Technology, Boca Raton, FL, USA, 19–21 July 2017. [Google Scholar] [CrossRef]
  41. Almutairi, S.; Shaiba, H.; Bezbradica, M. Predicting students’ academic performance and main behavioral features using data mining techniques. Commun. Comput. Inf. Sci. 2019, 1097, 245–259. [Google Scholar] [CrossRef]
  42. Viswanathan, S.; Vengatesh, S. Study of students’ performance prediction models using machine learning. Turk. J. Comput. Math. Educ. 2021, 12, 3085–3091. [Google Scholar] [CrossRef]
  43. Han, Z.; Wu, J.; Huang, C.; Huang, Q.; Zhao, M. A review on sentiment discovery and analysis of educational big-data. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1328. [Google Scholar] [CrossRef]
  44. Nagawa, K.; Kishigami, T.; Yokoyama, F.; Murakami, S.; Yasugi, T.; Takaki, Y.; Inoue, K.; Tsuchihashi, S.; Seki, S.; Okada, Y.; et al. Diagnostic utility of a conventional MRI-based analysis and texture analysis for discriminating between ovarian thecoma-fibroma groups and ovarian granulosa cell tumors. J. Ovarian Res. 2022, 15, 65. [Google Scholar] [CrossRef] [PubMed]
  45. Whitmire, C.D.; Vance, J.M.; Rasheed, H.K.; Missaoui, A.; Rasheed, K.M.; Maier, F.W. Using Machine Learning and Feature Selection for Alfalfa Yield Prediction. AI 2021, 2, 71–88. [Google Scholar] [CrossRef]
  46. Luo, H.; Hansen, A.S.L.; Yang, L.; Schneider, K.; Kristensen, M.; Christensen, U.; Christensen, H.B.; Du, B.; Özdemir, E.; Feist, A.M.; et al. Coupling S-adenosylmethionine–dependent methylation to growth: Design and uses. PLoS Biol. 2019, 17, e2007050. [Google Scholar] [CrossRef] [PubMed]
  47. Masrom, S.; Rahman, R.A.; Mohamad, M.; Sani, A.; Rahman, A.; Baharun, N. Machine learning of tax avoidance detection based on hybrid metaheuristics algorithms. IAES Int. J. Artif. Intell. 2022, 11, 1153–1163. [Google Scholar] [CrossRef]
  48. Shahiri, A.M.; Husain, W.; Rashid, N.A. A review on predicting student’s performance using data mining techniques. Procedia Comput. Sci. 2015, 72, 414–422. [Google Scholar] [CrossRef]
  49. Kaliappan, J.; Srinivasan, K.; Mian Qaisar, S.; Sundararajan, K.; Chang, C.Y.; Suganthan, C. Performance evaluation of regression models for the prediction of the COVID-19 reproduction rate. Front. Public Health 2021, 9, 729795. [Google Scholar] [CrossRef]
  50. Pujianto, D.; Nopiyanto, Y.E.; Wibowo, C. High school student-athletes: Their motivation, study habits, self-discipline, academic support and academic performance. Phys. Educ. Theory Methodol. 2024, 7989, 22–31. [Google Scholar] [CrossRef]
  51. Jin, Z.; Shang, J.; Zhu, Q.; Ling, C.; Xie, W.; Qiang, B. RFRSF: Employee turnover prediction based on random forests and survival analysis. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2020; Volume 12343, pp. 503–515. [Google Scholar] [CrossRef]
  52. Scornet, E.; Biau, G.; Vert, J.P. Consistency of random forest. Ann. Stat. 2015, 43, 1716–1741. [Google Scholar] [CrossRef]
  53. Sokkhey, P.; Okazaki, T. Developing web-based support systems for predicting poor-performing students using educational data mining techniques. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 23–32. [Google Scholar] [CrossRef]
  54. Deepika, K.; Sathyanarayana, N. Relief-F and budget tree random forest based feature selection for student academic performance prediction. Int. J. Intell. Eng. Syst. 2019, 12, 30–39. [Google Scholar] [CrossRef]
  55. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  56. Jeganathan, S.; Lakshminarayanan, A.R.; Ramachandran, N.; Tunze, G.B. Predicting academic performance of immigrant students using XGBoost Regressor. Int. J. Inf. Technol. Web Eng. 2022, 17, 1–19. [Google Scholar] [CrossRef]
  57. Wang, J.; Xu, J.; Zhao, C.; Peng, Y.; Wang, H. An ensemble feature selection method for high-dimensional data based on sort aggregation. Syst. Sci. Control Eng. 2019, 7, 32–39. [Google Scholar] [CrossRef]
  58. An, H.; Ren, J.; Wu, S. XGBDeepFM for CTR Predictions in mobile advertising benefits from ad context. Math. Probl. Eng. 2020, 2020, 1747315. [Google Scholar] [CrossRef]
  59. Woo, H.; Kim, J.M. Impacts of learning orientation on the modeling of programming using feature selection and XGBoost: A gender-focused analysis. Appl. Sci. 2022, 12, 4922. [Google Scholar] [CrossRef]
  60. Wu, H.; Wu, C.; Lu, Q.; Ding, Z.; Xue, M.; Lin, J. Spatial-temporal characteristics of severe fever with thrombocytopenia syndrome and the relationship with meteorological factors from 2011 to 2018 in Zhejiang Province, China. PLoS Neglected Trop. Dis. 2021, 14, e0008186. [Google Scholar] [CrossRef]
  61. Li, C.; Zhou, L.; Xu, W. Estimating aboveground biomass using sentinel-2 msi data and ensemble algorithms for grassland in the shengjin lake wetland, China. Remote Sens. 2021, 13, 1595. [Google Scholar] [CrossRef]
  62. Zhai, B.; Chen, J. Development of a stacked ensemble model for forecasting and analyzing daily average PM2.5 concentrations in Beijing, China. Sci. Total Environ. 2018, 635, 644–658. [Google Scholar] [CrossRef]
  63. El Guabassi, I.; Bousalem, Z.; Marah, R.; Qazdar, A. A recommender system for predicting students’ admission to a graduate program using machine learning algorithms. Int. J. Online Biomed. Eng. 2021, 17, 135–147. [Google Scholar] [CrossRef]
  64. Alhazmi, E.; Sheneamer, A. Early predicting of students performance in higher education. IEEE Access 2023, 11, 27579–27589. [Google Scholar] [CrossRef]
  65. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  66. Adekitan, A.I.; Salau, O. The impact of engineering students’ performance in the first three years on their graduation result using educational data mining. Heliyon 2019, 5, e01250. [Google Scholar] [CrossRef] [PubMed]
  67. Román, J.D.M.; Estrada, A.H. A study on academic achievement in mathematics. Rev. Electron. Investig. Educ. 2019, 21, 1–10. [Google Scholar] [CrossRef]
  68. Drucker, H.; Burges, C.J.C.; Kaufman, L.; Smola, A.; Vapnik, V. Support vector regression machines. Adv. Neural Inf. Process. Syst. 1997, 9, 155–161. [Google Scholar]
  69. Qasrawi, R.; VicunaPolo, S.; Al-Halawa, D.A.; Hallaq, S.; Abdeen, Z. Predicting school children academic performance using machine learning techniques. Adv. Sci. Technol. Eng. Syst. J. 2021, 6, 8–15. [Google Scholar] [CrossRef]
  70. Rifat, M.R.I.; Imran, A.; Al Badrudduza, A.S.M. Educational performance analytics of undergraduate business students. Int. J. Mod. Educ. Comput. Sci. 2019, 11, 44–53. [Google Scholar] [CrossRef]
  71. Makombe, F.; Lall, M. A predictive model for the determination of academic performance in private higher education institutions. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 415–419. [Google Scholar] [CrossRef]
  72. Xu, H.; Kim, M. Combination prediction method of students’ performance based on ant colony algorithm. PLoS ONE 2024, 19, 1–18. [Google Scholar] [CrossRef]
  73. Corti, F.; Llanes, J.; Alcaraz, I.D.; Niella, M.F. Initial adaptation among university student: The case of the social sciences. PLoS ONE 2023, 18, e0294440. [Google Scholar] [CrossRef]
  74. Vandamme, J.; Meskens, N.; Superby, J.F. Predicting academic performance by data mining methods. Educ. Econ. 2007, 15, 405–419. [Google Scholar] [CrossRef]
  75. Subiros, J.; Rius, A.; Llorente, A.; Lozano, J. Early prediction of university dropout and academic performance using machine learning techniques. IEEE Access 2020, 8, 20900–20910. [Google Scholar]
Figure 1. Bar plots showing importance of variables as determined by RF and XGBoost for predicting academic performance: (a) importance of variables for predicting First-year grade; (b) importance of variables for predicting First-semester grade.
Figure 2. Scatter plot of actual versus predicted values in test set: (a) First-year grade prediction with RF—feature selection with RF; (b) First-semester grade prediction with RF—feature selection with XGBoost.
Table 1. Preliminary results of qualitative variables.
Variables | Variables’ Values
Area | Social Science and Law (21.93%), Arts and Humanities (39.37%), Sciences (14.05%), Health Sciences (18.43%) and Engineering (6.23%)
Gender | Men (35.22%), women (64.78%)
Family township | Madrid (73.52%), outside Madrid (26.48%)
Country of birth | Spain (92.79%), foreign country (7.21%)
Admission option | First (65.46%), second (12.90%), third (6.28%), fourth (4.66%), fifth (3.74%), sixth (2.25%), seventh (1.37%), eighth (1.29%), ninth (0.75%), tenth (0.54%), eleventh (0.44%), twelfth (0.34%)
Access specialty | Social Sciences and Humanities (47.82%), Arts (5.97%), Health Sciences (2.07%), Technical Sciences (44.15%)
Admission study | Professional training degree (6.64%), high school (93.36%)
PAU call | Ordinary (94.62%), extraordinary (5.38%)
Mother’s or guardian’s level of studies | Illiterate (0.29%), no education (1.17%), primary education (6.82%), secondary education (27.44%), higher education (64.29%)
Father’s or guardian’s level of studies | Illiterate (0.59%), no education (1.37%), primary education (10.71%), secondary education (31.28%), higher education (56.06%)
Time commitment | Part-time student (1.85%), full-time student (98.15%)
Type of school | Only a professional degree school (1.24%), a comprehensive school (4.63%), only an upper secondary school (94.13%)
School holder | Public subsidy (1.24%), public (58.60%), private (36.69%)
Location of the school | Madrid (73.17%), outside Madrid (26.83%)
Admission reason | Appeal (0.56%), general (97.91%), disabled (0.79%), elite athletes (0.74%)
Scholarship holder | No grant holder (70.44%), grant holder (29.56%)
Table 2. Preliminary results of quantitative variables.
Variable | Mean | Median | Standard Deviation | Skewness
Age | 18.64 | 18 | 1.95 | 15.11
Academic amount | 582.46 | 584 | 541.37 | 1.34
Administrative fee | 30.08 | 35 | 9.00 | −1.78
Access grade | 10.72 | 11 | 1.93 | −0.55
No. of ECTS enrolled 1st year | 60.76 | 60 | 5.49 | −0.21
No. of ECTS Passed 1st year | 50.79 | 60 | 16.98 | −1.42
Ratio of subject passes 1st year | 0.83 | 1 | 0.26 | −1.75
No. of ECTS enrolled 1st semester | 27.81 | 30 | 6.40 | −1.37
No. of ECTS Passed 1st semester | 20.74 | 24 | 9.56 | −0.45
Ratio of subject passes 1st semester | 0.75 | 0.80 | 0.30 | −1.07
First-year grade | 6.56 | 6.79 | 1.63 | −1.32
First-semester grade | 6.31 | 6.54 | 1.82 | −0.99
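The summary statistics in Table 2 (mean, median, standard deviation and skewness per variable) can be reproduced with standard data-frame operations. The following is a minimal, hedged sketch using pandas; the example frame and its column names are placeholders, not the study's data or code.

```python
# Hedged sketch (illustrative only): a Table 2-style summary of quantitative variables.
import pandas as pd

def describe_quantitative(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, standard deviation and skewness for each column."""
    return pd.DataFrame({
        "Mean": df.mean(),
        "Median": df.median(),
        "Standard Deviation": df.std(),
        "Skewness": df.skew(),
    }).round(2)

# Tiny illustrative frame with hypothetical values.
example = pd.DataFrame({
    "Age": [18, 18, 19, 18, 25],
    "Access grade": [10.5, 12.0, 9.8, 11.1, 13.2],
})
print(describe_quantitative(example))
```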
Table 3. Importance of variables for RF and XGBoost.
Importance Order | Variables (Random Forest 1) | %IncMSE | Variables (XGBoost 1) | Gain
1 | Access grade | 101.14 | Access grade | 0.4768
2 | Academic amount | 51.55 | Academic amount | 0.0683
3 | No. of ECTS enrolled 1st semester | 44.49 | No. of ECTS enrolled 1st year | 0.0661
4 | No. of ECTS enrolled 1st year | 37.86 | No. of ECTS enrolled 1st semester | 0.0631
5 | Gender | 34.19 | Admission option | 0.0560
6 | Admission option | 33.44 | Gender | 0.0543
7 | Family township | 33.23 | Age | 0.0462
8 | Location of the school | 31.19 | Administrative fee | 0.0461
9 | Scholarship holder | 28.93 | Father’s or guardian’s level of studies | 0.0273
10 | Administrative fee | 25.34 | Family township | 0.0201
1 Dependent Variable = First-year grade.
Table 4. Importance of variables for RF and XGBoost.
Importance Order | Variables (Random Forest 1) | %IncMSE | Variables (XGBoost 1) | Gain
1 | Access grade | 114.36 | Access grade | 0.4856
2 | Academic amount | 50.32 | Academic amount | 0.0816
3 | No. of ECTS enrolled 1st semester | 48.89 | No. of ECTS enrolled 1st semester | 0.0747
4 | No. of ECTS enrolled 1st year | 39.19 | No. of ECTS enrolled 1st year | 0.0605
5 | Scholarship holder | 32.98 | Age | 0.0511
6 | Gender | 32.15 | Admission option | 0.0502
7 | Admission option | 31.13 | Administrative fee | 0.0410
8 | Family township | 28.27 | Gender | 0.0397
9 | Location of the school | 27.72 | Father’s or guardian’s level of studies | 0.0220
10 | Admission study | 24.75 | Family township | 0.0217
1 Dependent Variable = First-semester grade.
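Tables 3 and 4 rank predictors by the Random Forest %IncMSE measure and the XGBoost gain measure. The sketch below is a hedged illustration of how such rankings can be produced, not the authors' code: %IncMSE is the permutation-based metric reported by R's randomForest package, and scikit-learn's permutation importance with an MSE-based scorer is used here as a rough analogue; the stand-in data and column names are hypothetical.

```python
# Hedged sketch: ranking predictors by RF permutation importance and XGBoost gain.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from xgboost import XGBRegressor

# Illustrative stand-in data; in the study, X would hold the encoded predictors
# and y the first-year (or first-semester) grade.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "access_grade": rng.uniform(5, 14, 500),
    "academic_amount": rng.uniform(0, 2000, 500),
    "ects_enrolled_sem1": rng.integers(6, 31, 500),
})
y = 0.5 * X["access_grade"] + 0.01 * X["ects_enrolled_sem1"] + rng.normal(0, 1, 500)

# Random Forest: permutation importance (MSE increase when a feature is shuffled),
# a scikit-learn analogue of randomForest's %IncMSE.
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
perm = permutation_importance(rf, X, y, scoring="neg_mean_squared_error",
                              n_repeats=10, random_state=0)
rf_rank = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)

# XGBoost: "gain" is the average loss reduction contributed by splits on each feature.
xgb = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=0).fit(X, y)
xgb_rank = pd.Series(xgb.get_booster().get_score(importance_type="gain")).sort_values(ascending=False)

print(rf_rank)
print(xgb_rank)
```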
Table 5. Machine learning evaluation metrics.
Feature Selection | Machine Learning Method | First-Year Grade 1 (R2 / RMSE) | First-Semester Grade 2 (R2 / RMSE)
Random Forest | LR | 0.17 / 1.48 | 0.20 / 1.59
Random Forest | SVR | 0.18 / 1.47 | 0.23 / 1.57
Random Forest | RF | 0.22 / 1.42 | 0.23 / 1.56
Random Forest | XGBoost | 0.21 / 1.46 | 0.23 / 1.58
XGBoost | LR | 0.17 / 1.48 | 0.21 / 1.59
XGBoost | SVR | 0.18 / 1.47 | 0.21 / 1.59
XGBoost | RF | 0.22 / 1.43 | 0.24 / 1.57
XGBoost | XGBoost | 0.22 / 1.47 | 0.22 / 1.59
1 Dependent Variable = First-year grade. 2 Dependent Variable = First-semester grade.
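Table 5 compares the four regressors by R2 and RMSE on held-out data. A minimal, hedged sketch of such an evaluation loop is shown below; the synthetic data, split ratio, and hyperparameters are placeholders rather than the values used in the study.

```python
# Hedged sketch (illustrative only): R2 / RMSE comparison of LR, SVR, RF and XGBoost.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

# Stand-in data; in the study, X_sel would hold the selected predictors and y the grade.
rng = np.random.default_rng(1)
X_sel = pd.DataFrame(rng.normal(size=(500, 5)),
                     columns=[f"feature_{i}" for i in range(5)])
y = X_sel.sum(axis=1) + rng.normal(0, 1, 500)

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.2, random_state=1)

models = {
    "LR": LinearRegression(),
    "SVR": SVR(kernel="rbf"),
    "RF": RandomForestRegressor(n_estimators=300, random_state=1),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=1),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE as reported in Table 5
    print(f"{name}: R2 = {r2_score(y_test, pred):.2f}, RMSE = {rmse:.2f}")
```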
Table 6. LR coefficients by subsets of predictors (RF and XGBoost). Dependent variable: First-semester grade.
Feature Selection RF 1 | First-Semester Grade | Feature Selection XGBoost 2 | First-Semester Grade
Intercept | 6.33 | Intercept | 6.19
Access grade | 0.60 | Access grade | 0.61
Academic amount | −0.14 | Academic amount | −0.22
No. of ECTS enrolled 1st semester | 0.05 | No. of ECTS enrolled 1st semester | 0.06
No. of ECTS enrolled 1st year | 0.19 | No. of ECTS enrolled 1st year | 0.20
Scholarship holder | 0.02 | Age | 0.12
Gender | −0.36 | Admission option | −0.88
Admission option | −0.91 | Administrative fee | 0.21
Family township | 0.35 | Gender | −0.36
Location of the school | 0.07 | Father’s or guardian’s level of studies | 0.01
Admission study | −0.14 | Family township | 0.43
1 Random Forest selected variables; 2 XGBoost selected variables.
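Table 6 reports the intercept and coefficients of a linear regression fitted on each subset of selected predictors. The following is a hedged sketch of how such coefficients can be extracted for one subset; the stand-in data, column names, and coefficient values are hypothetical, not those of the study.

```python
# Hedged sketch (illustrative only): intercept and coefficients of an LR model
# fitted on a subset of selected predictors, as summarised in Table 6.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical predictor subset; in the study this would be the RF- or
# XGBoost-selected variables, already encoded, with y the first-semester grade.
rng = np.random.default_rng(2)
X_subset = pd.DataFrame({
    "access_grade": rng.uniform(5, 14, 300),
    "admission_option": rng.integers(1, 13, 300),
    "ects_enrolled_sem1": rng.integers(6, 31, 300),
})
y = 4 + 0.6 * X_subset["access_grade"] - 0.3 * X_subset["admission_option"] + rng.normal(0, 1, 300)

lr = LinearRegression().fit(X_subset, y)
coefs = pd.Series(lr.coef_, index=X_subset.columns).round(2)
print(f"Intercept: {lr.intercept_:.2f}")
print(coefs.to_string())
```

The same fit, repeated on the second subset of predictors, yields the right-hand columns of Table 6.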
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
