Article

Investigating the Efficacy and Interpretability of ML Classifiers for Student Performance Prediction in the Small-Data Regime

Formazione Base, Dipartimento Tecnologie Innovative (DTI), Scuola Universitaria Professionale della Svizzera Italiana (SUPSI), 6962 Lugano-Viganello, Switzerland
Educ. Sci. 2026, 16(1), 149; https://doi.org/10.3390/educsci16010149
Submission received: 20 November 2025 / Revised: 6 January 2026 / Accepted: 9 January 2026 / Published: 19 January 2026

Abstract

Despite the extensive application of machine learning (ML) methods to educational datasets, few studies have provided a systematic benchmarking of the available algorithms with respect to both predictive performance and interpretability of the resulting models. In this work, we address this gap by comparing a range of supervised learning methods on a freely available dataset concerning two high schools, where the goal is to predict student performance by modeling it as a binary classification task. Given the high feature-to-sample ratio, the problem falls within the small-data learning regime, which often challenges ML models by diluting informative features among many irrelevant ones. The experimental results show that several algorithms can achieve robust predictive performance, even in this scenario and in the presence of class imbalance. Moreover, we show how the output of ML algorithms can be interpreted and used to identify the most relevant predictors, without any a priori assumption about their impact. Finally, we perform additional experiments by removing the two most dominant features, revealing that ML models can still uncover alternative predictive patterns, thus demonstrating their adaptability and capacity for knowledge extraction under small-data conditions. Future work could benefit from richer datasets, including longitudinal data and psychological features, to better profile students and improve the identification of at-risk individuals.

1. Introduction

Machine learning (ML) techniques have found widespread application in the educational domain, ranging from the evaluation and development of pedagogical strategies (Kushik et al., 2020), to the analysis of students’ learning patterns with the aim of developing ad hoc support techniques (Pallathadka et al., 2023), to the prediction of students’ performance according to their grades and other explanatory variables related to their background (Pallathadka et al., 2022). In the latter research branch, several studies relied on the application of data mining techniques to extract a large number of features, which could then be used to infer a predictive model (Kabakchieva, 2013; Minaei-Bidgoli et al., 2003; Osmanbegovic & Suljic, 2012). Generally speaking, students’ performance prediction can be modeled as either a regression or a classification task: while the former is aimed at estimating the numerical value of a response variable (e.g., the student grade at the end of a course) through a set of explanatory features (Arsad et al., 2013; Sweeney et al., 2015), the latter focuses on the subdivision of the input data (e.g., the students) into different classes, labeled according to an objective, a-priori-established criterion (Shrestha & Pokharel, 2019; Zhang et al., 2018). However, the application of ML algorithms to educational datasets may not be straightforward, since predicting student performance often falls into the small-data learning regime due to the discrepancy between the large number of potential explanatory variables (e.g., individual characteristics, parents’ occupation and education, socioeconomic status, academic records) and the naturally constrained number of students (Zohair & Mahmoud, 2019). The small-data regime arises when the number T of observations of the response variable is relatively small when compared to the much larger number D of explanatory features. Specifically, when the ratio T / D falls below a certain threshold, model overfitting1 becomes likely, and even state-of-the-art ML methods struggle to correctly predict the classification labels. Empirical efforts have attempted to identify a numerical value for this threshold: for example, Horenko (2020) showed that entropy-based ML methods tend to significantly outperform deep learning (DL) methods when the number of data instances is less than roughly 14 times the number of features. However, depending on the problem domain and the inherent characteristics of the features, this threshold can vary, giving rise to different degrees of “smallness”, from mild to extreme (Vecchi et al., 2023). Regardless of the specific threshold, a robust model for predicting students’ performance usually relies on high-dimensional datasets, which need to be processed carefully to avoid overfitting (Ying, 2019) and to ensure the identification of the features that are relevant for the prediction of the response variable (Vecchi et al., 2022). High-dimensional datasets are usually tackled by reducing the size of the feature space, either by dismissing the data dimensions that contribute least to the observed variability, as in principal components analysis (Hotelling, 1936), or by embedding the data in a lower-dimensional space, as in t-SNE (Van der Maaten & Hinton, 2008). For example, Roy et al. (2023) applied different dimensionality reduction techniques to four different datasets concerning students’ performance, with the aim of improving the predictive performance of the employed ML models.
However, reducing the dimensionality of the training data may lead to the loss of relevant information (Gracia et al., 2014) and to the introduction of artificial features that cannot be directly interpreted in terms of the original variables, thus limiting the explainability of the resulting models (Bo et al., 2023).
Furthermore, in addition to the challenges stemming from the high dimensionality of the feature space, the datasets used in educational classification problems often exhibit a relatively high class imbalance (i.e., a discrepancy between one group, such as high-performing students, and another, such as low-performing students). If not properly addressed, this imbalance often leads to low or biased predictive performance (Thammasiri et al., 2014), and its effects can be further exacerbated in the small-data learning regime (Vecchi et al., 2024a). However, while data augmentation is often proposed as a solution to class imbalance and is successfully applied in image and signal domains through geometric transformations of the original data, it frequently performs poorly on tabular data (Espinosa & Figueira, 2023; Miletic & Sariyar, 2024). Due to the presence of many categorical features, this limitation is particularly evident in educational datasets, where data augmentation often produces unrealistic samples, which add little information to the classifier training (Machado et al., 2022). On top of this structural challenge, recent studies specifically performed on educational data have highlighted that data augmentation requires enough samples (e.g., hundreds to thousands) in order to create new instances that would not overfit the limited sample or, even worse, violate the original data structure by fabricating spurious or unrealistic combinations of features (M. Li et al., 2019; Mathew & Gunasundari, 2021; Wongvorachan et al., 2023).
In their review, Albreiki et al. (2021) examined the breadth of applications of ML techniques in education and found that the most recurrent challenges are assessing students’ risk of failing to graduate and predicting the dropout rate (Colak Oz et al., 2023; Hegde & Prageeth, 2018). This is usually achieved through datasets concerning high school students, university academic records (Yılmaz & Sekeroglu, 2020), or online learning platforms. These datasets, however, both for ethical and structural reasons—such as the fact that they may contain sensitive information or refer to different countries, schools, or age ranges—are usually neither accessible nor directly comparable. Nevertheless, information like personal social statistics on students and educational records can efficiently be used to predict the students’ performance (Rastrollo-Guerrero et al., 2020). Moreover, these studies tend to prioritize predictive accuracy over transparency, often utilizing black-box models (e.g., random forests or neural networks) with limited attention to explainability. For example, Wen and Juan (2023) addressed the early prediction of students’ performance from online learning activity sequences, but the use of deep neural networks and an autoencoder for latent feature extraction and dimensionality reduction limits the model’s transparency. In the educational domain, where interpretability is vital for policymaking and pedagogical intervention, the ability to trace a model’s decision back to meaningful input features is critical. Several recent works have highlighted the need for interpretable ML in educational settings, particularly in contexts where prediction must inform human decision making (Lundberg & Lee, 2017; Molnar, 2020; Ribeiro et al., 2016). This has led to a series of studies where interpretable AI models have been applied to various educational problems, ranging from career counseling (Guleria & Sood, 2023) to prediction of students’ adaptability (Nnadi et al., 2024) or assessment of students’ cognitive abilities (Niu et al., 2025).
However, despite this growing awareness, comparative studies that evaluate the trade-off between classification accuracy and model explainability across ML methods remain sparse in the context of students’ performance prediction in the small-data learning regime. This is particularly true when considering the application of explainable AI to secondary education data, since the vast majority of the existing studies are focused on higher education datasets (Alamri & Alharbi, 2021). For example, a dataset provided by the Federal Board of Intermediate and Secondary Education Islamabad was used to predict students’ performance in the context of secondary education, but the primary focus was on predictive accuracy, while the explainability of the resulting models was neglected (Yousafzai et al., 2020). Other recent studies have adopted explainable AI (XAI) methods such as feature importance ranking, Shapley additive explanations (SHAP), local interpretable model-agnostic explanations (LIME), and permutation-based analysis to improve model transparency and to provide policymakers with solid elements to support their decision process (Ahmed et al., 2025; Gunasekara & Saarela, 2025; Villegas et al., 2025). However, all these studies, besides focusing on higher education, explored datasets with thousands of observations and very few features, which do not belong to the small-data regime. This represents a gap in the literature that is addressed in our work.
In this paper, we make a twofold contribution to the existing literature on ML prediction of students’ performance in secondary education: (i) we conduct a comprehensive benchmark for assessing the predictive capability of a wide range of ML models under limited data availability, and (ii) we investigate the interpretability of these models and their ability to identify the features that are relevant for the prediction task without any a priori assumption or dimensionality reduction of the input data. Analyzing and comparing the available ML methods for educational prediction problems can provide useful insights both on which models are most effective and on which features are most informative, particularly given the challenges of the small-data learning regime. However, many existing benchmarks are limited to a small selection of algorithms or rely on datasets with few features and thousands of observations, conditions that do not reflect typical secondary education scenarios (Ahmed et al., 2025; Gajwani & Chakraborty, 2021; Rastrollo-Guerrero et al., 2020). Here, we instead consider a broad range of different ML algorithms, including support vector machine, random forest, k-nearest neighbours, lasso generalized linear model, neural network, shallow neural network, deep learning with long short-term memory, and entropic scalable probabilistic approximation. Concerning explainability, prior studies that investigated feature relevance in educational prediction tasks often focused specifically on higher education outcomes (Alamri & Alharbi, 2021; Camacho-Miñano et al., 2020), such as college graduation or final course grade prediction (Daud et al., 2017; Nachouki et al., 2023), or emphasized the impact of specific behavioral factors like attendance and course engagement on undergraduate students’ performance (Fadelelmoula & Colleges, 2018; Ha et al., 2024). Here, we consider instead the UCI Student Performance dataset (Cortez, 2014), which specifically deals with secondary education students and has often been used in the literature as a benchmark for educational data mining research. However, existing studies tend to preprocess the original data either through data augmentation to improve predictive performance (Mohammad et al., 2023) or through SMOTE to overcome class imbalance, while subordinating explainability to fairness considerations and limiting it to post hoc explanation tools applied to a restricted set of algorithms, rather than relying on intrinsically interpretable learning models (Kesgin et al., 2025). In contrast, the present study explicitly prioritizes method explainability, combining surrogate explanation methods like SHAP with intrinsically interpretable models, such as random forest and, in particular, the entropic scalable probabilistic approximation, which assigns clear weights indicating the relative importance of each feature in the prediction task. This is particularly important since a careful assessment of explanatory feature relevance can foster research aimed at developing targeted intervention measures and ad hoc learning environments, tailored to the needs, strengths, and weaknesses of the individual student. While the dataset considered here limits this intervention-focused analysis, since it contains mainly demographic and socioeconomic data, some of the methods used in the experiments could certainly be applied to individual-specific data (e.g., cognitive abilities, study habits, performance on specific tasks) to support personalized educational policies.
Building on this rationale, the present work addresses the following three research questions:
RQ1.
How do a broad range of ML models perform in predicting student performance in a secondary education setting under small-data constraints?
RQ2.
How do these models differ in terms of interpretability and their ability to identify the most relevant explanatory features without prior dimensionality reduction or specific feature importance assumptions?
RQ3.
To what extent do ML models rely on prior grade features, and how does their removal affect predictive performance in a small-data setting?
The rest of this paper is organized as follows. In Section 2, we provide details on the dataset used in the experiments, as well as on the different ML methods introduced in the benchmark. In Section 3, we present and discuss the comparison of the ML methods for students’ performance prediction and the identification of the most relevant features for solving the prediction task. Section 4 concludes the paper by highlighting the key points of this study as well as the limitations and future challenges we plan to address.

2. Materials and Methods

In this section, we provide additional details on both the dataset and the ML methods used in the experiments. It is important to note that finding a suitable and freely available dataset for predicting students’ performance in a secondary school context proved challenging. Indeed, many of the datasets cited in the literature were not available, due to the fact that data dealing with minors are sensitive and usually not shared with the general public. In this analysis, we therefore focus on two datasets retrieved from the UCI Machine Learning Repository (Cortez, 2014), containing the grades achieved by the students of two Portuguese secondary schools in the Mathematics and Portuguese courses. The dataset was selected not only because it is one of the few freely available examples, but also because it closely aligns with the objectives of this study. Specifically, it targets secondary school students, providing a rare instance of publicly accessible data in this educational context. Furthermore, it includes a well-balanced set of demographic, socioeconomic, and academic features (such as parental education, family size, past grades, and study habits), enabling a comprehensive assessment of the main drivers of student performance. Its limited sample size and the presence of both categorical and continuous variables make it particularly suitable for benchmarking ML methods in the small-data learning regime and for evaluating the explainability of the resulting models.

2.1. Datasets Description and Preprocessing

The two datasets retrieved from the UCI repository (Cortez, 2014) describe the performance of high school students in the Mathematics and Portuguese courses at two Portuguese secondary schools. In their original form, the datasets comprise 29 explanatory features capturing demographic, socioeconomic, family-related, and academic characteristics. For each student, information is provided on parental background as well as study and leisure habits. In the present analysis, the final number of explanatory features is D = 45 , and this increase with respect to the original dimensionality stems from the preprocessing stage performed prior to the classification task. Specifically, all categorical variables were transformed using one-hot encoding, a widely adopted technique that represents each categorical feature through a set of binary indicator variables, one for each possible category. This transformation avoids the introduction of artificial ordinal relationships between categories and is particularly suitable for algorithms that cannot directly handle categorical data, such as many variants of gradient boosting methods or support vector machines (Bolikulov et al., 2024; Hancock & Khoshgoftaar, 2020). The complete list of features used in the analysis is reported in Appendix A.
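As a minimal illustration of this preprocessing step, the following MATLAB sketch shows how the categorical columns of the UCI file can be expanded into binary indicators with dummyvar; the file name, the variable names, and the column handling are assumptions for illustration and do not reproduce the exact pipeline used in the experiments:

% Minimal preprocessing sketch (Statistics and Machine Learning Toolbox).
% Assumes the semicolon-separated UCI file 'student-mat.csv' is in the working directory.
T  = readtable('student-mat.csv', 'Delimiter', ';');
G3 = T.G3;        % final grade (0-20), used later as the response variable
T.G3 = [];        % remove it from the explanatory features

X = [];           % numeric design matrix with one-hot encoded categorical variables
for j = 1:width(T)
    col = T{:, j};
    if isnumeric(col)
        X = [X, double(col)];                 %#ok<AGROW>
    else
        % each categorical column becomes a set of binary indicator columns
        X = [X, dummyvar(categorical(col))];  %#ok<AGROW>
    end
end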
As the response variable, we focus on the results achieved by the students in the Mathematics and in the Portuguese language courses, with the latter representing their primary language. In the Portuguese school grading system, the grades span from a minimum of 0 to a maximum of 20 (with 10 being the passing grade), and we further process the grades achieved by the students in each course to model the problem as a binary classification task. The choice of opting for binary classification, rather than for multi-class classification, stems from the fact that the latter is typically more challenging and entails a series of complications, further exacerbated by the small-data regime, that are beyond the scope of this contribution (Grandini et al., 2020). While we acknowledge that reducing student performance prediction to a binary outcome simplifies a complex construct and may entail some loss of information, we also believe that this approach allows a robust modeling of key performance categories (e.g., passing vs. failing or above-average vs. below-average grades). Certainly, this problem could also have been formulated as a regression task with a discrete response variable, and then solved, for example, with one of the recently developed entropic regression learning algorithms (Vecchi et al., 2024b). However, we think that, for this kind of educational data, clustering of the students in different groups with homogeneous characteristics is more informative, beyond the mere prediction of the numerical value of their final grade. Moreover, binary classification may help reduce the influence of noise or latent confounders (arising, for example, from differences in schools, teachers, and class composition) that could have a more significant impact on multi-class or regression-based approaches (Alamri & Alharbi, 2021). In addition, in the context of small-data learning, limiting the classification output to two classes prevents having too few training instances per class, which could result in a serious degradation of the model’s performance.
Thus, the discrete response variable represented by the grade is reduced to a binary response variable according to two different perspectives, which can be succinctly labeled as ‘sufficient’ and ‘average’ grade. Specifically, we consider two scenarios with the following classes: (i) we first divide the students into two groups, based on whether they passed or failed the course (i.e., those with a sufficient final grade are labeled as 1, while the others are labeled as 0); (ii) we then compute the average grade achieved in each specific subject by the whole group, and then divide the students into outperforming (labeled as 1) and underperforming (labeled as 0) classes. Regarding the dimensionality of the dataset, for the Mathematics course, there are T = 394 observed instances (i.e., students) and D = 45 features, representing the explanatory variables used to perform the classification task according to the two criteria outlined above. For the Portuguese course, while the number of features remains the same, we have a larger number of students, resulting in T = 649 instances. In both cases, the ratio T / D between the number of observations and the feature space dimension indicates that the classification problems belong to the small-data learning regime (Horenko, 2020; Vecchi et al., 2022). Consequently, the dataset may potentially not provide a sufficient number of instances of the response variable to derive an efficient and robust predictive model of students’ performance using the considered ML methods. We then need to be careful during the experiments to avoid overfitting of the training data and to ensure that the output models are able to generalize to unseen data.
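As a minimal sketch of the two label definitions (assuming the vector G3 from the preprocessing sketch above holds the final grades on the 0-20 scale; the treatment of grades exactly equal to the cohort mean is an illustrative choice):

% Scenario (i): 'sufficient' grade, pass (1) vs. fail (0)
y_sufficient = double(G3 >= 10);

% Scenario (ii): 'average' grade, above (1) vs. below (0) the cohort mean
y_average = double(G3 >= mean(G3));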

2.2. Benchmark Analysis

The datasets described in Section 2.1 are used to set up the two experimental scenarios indicated as ‘sufficient’ and ‘average’ grade. Since we consider both cases for each subject, we end up with a total of four classification experiments, two for Mathematics and two for the Portuguese language. Starting from the input datasets, we perform 50 cross-validations for each ML method. The purpose of the cross-validation procedure is twofold: to counterbalance the risk of overfitting the limited training data, and to reduce the dependence of the results on a single, potentially favorable, split between training and test sets. For each cross-validation, the input dataset is initially split into a training set and a test set, with 75% of the data used for training and the remaining 25% reserved for testing. ML studies on educational tabular data often employ an 80%/20% split (Mohammad et al., 2023; Villegas et al., 2025). We choose this configuration to be consistent with small-data learning practices, which seek a balance between a sufficient number of training instances and an adequate test set for an unbiased performance evaluation (Horenko, 2020; Vecchi et al., 2023). For those methods that require an explicitly provided validation set for hyperparameter tuning, such as the entropic scalable probabilistic approximation algorithm, the input data are instead subdivided into 50% training, 25% validation, and 25% test splits (Vecchi et al., 2022). All splits are obtained through stratified sampling, thus ensuring that both classes are always adequately represented across all subsets.
The general procedure adopted in each experimental scenario is the following: first, the training set is used to train a model for each method, and then the validation set is used to evaluate the performance of the model under different configurations of the hyperparameters2, in order to find the one that yields the best trade-off between performance and overfitting. Finally, the test set is used to compute the performance metrics used in the comparative analysis, thus providing an unbiased assessment of the methods’ performance in the classification task. Repeating this procedure across 50 independent cross-validation runs for each method allows us to assess the variability of model performance induced by different data splits and to report results in terms of empirical distributions (e.g., box plots), rather than relying on a single-point estimate. Even if this approach does not constitute a formal statistical hypothesis-testing framework, it still provides a robust and informative basis for comparative analysis in a small-data learning scenario, where standard parametric assumptions may be difficult to justify (Horenko, 2020). A schematic overview of the adopted benchmarking procedure is provided in Figure 1.
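A skeleton of this repeated stratified hold-out procedure is sketched below; the inner training, tuning, and scoring steps are method-specific and therefore only indicated as comments, and the variable names are illustrative:

% Repeated stratified hold-out (50 runs), using the labels and features
% produced in the preprocessing sketches above.
nRuns = 50;
auc   = zeros(nRuns, 1);
for r = 1:nRuns
    % cvpartition with a grouping variable yields a stratified 75/25 split
    cv     = cvpartition(y_sufficient, 'HoldOut', 0.25);
    Xtrain = X(training(cv), :);   ytrain = y_sufficient(training(cv));
    Xtest  = X(test(cv), :);       ytest  = y_sufficient(test(cv));

    % ... train each ML model on (Xtrain, ytrain), tune its hyperparameters,
    %     and compute AUC and F-score on (Xtest, ytest) ...
    % auc(r) = ...;
end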

2.3. Methods Selection, Hyperparameter Tuning, and Further Implementation Details

In this section, we briefly describe the ML methods used in the experiments, together with the grid search performed over a range of suitable candidates to identify the best model hyperparameters. To ensure a comprehensive evaluation in the context of educational datasets with limited sample size, we selected a diverse subset of ML algorithms spanning multiple families, including linear models, regularized approaches, kernel-based methods, ensemble techniques, shallow and deep neural networks, and specialized small-data learning algorithms. While a wide range of ML algorithms has been proposed for student performance prediction, the selection of models in this study was guided by the intrinsic characteristics of the problem, favoring model stability, robustness to overfitting, and explainability. For example, classical models such as logistic regression are often appreciated for their parsimony and interpretability, but their effectiveness on tabular data is limited (Fernández-Delgado et al., 2014; Grinsztajn et al., 2022), especially when the size of the training set is small (Motrenko et al., 2014). To address this issue, this study includes an L1-regularized generalized linear model with a logit link (lasso GLM), which corresponds to an L1-penalized logistic regression model. The latter formulation has been shown to be more robust than the baseline model in small-data learning scenarios, by shrinking irrelevant coefficients to zero and performing implicit feature selection (Vecchi et al., 2023). For the same reason, single decision trees have been excluded from the current analysis, since previous studies on educational data have already shown that other methods achieve superior performance, even at the cost of explainability (Gunasekara & Saarela, 2025). It is also important to stress that deep learning with long short-term memory is included not as a theoretically optimal choice, but as a negative control. Indeed, despite its recent application in educational performance prediction (Lin et al., 2025), recent ML literature has shown that DL and recurrent models underperform tree-based methods on tabular datasets (Shwartz-Ziv & Armon, 2022). However, its inclusion allows us to empirically assess whether an increase in model complexity, at the cost of explainability, can have a positive impact on model performance.
From a technical perspective, the experiments have been implemented entirely in MATLAB (R2024a) and run on a machine with two Intel Xeon Platinum 8360Y “Ice Lake” processors (36 cores per chip) running at a base frequency of 2.4 GHz with 54 MB of shared L3 cache per chip, and 256 GB of DDR4 RAM. Computations were accelerated via parallelization across all available cores whenever possible. To ensure a fair comparison of the selected ML methods, we conducted an extensive tuning procedure, performing a grid search with several candidate values for all the relevant hyperparameters provided by the MATLAB implementations. This approach was aimed at finding near-optimal configurations, so that any differences in model performance are attributable to the chosen methodology and not to an uneven tuning depth. The selected candidate values for each hyperparameter are listed in the corresponding method description.

2.3.1. Support Vector Machine (SVM)

This kernel-based method (Noble, 2006) discriminates between the two classes by finding a separating hyperplane after projecting the data points onto a higher-dimensional space. The adopted implementation is based on the MATLAB function fitcsvm, included in the Statistics and Machine Learning Toolbox. Given the problem size, the kernel chosen for this learning task is a second-order polynomial. Furthermore, we take into account the misclassification cost (i.e., the impact of a wrong prediction of student performance) through the BoxConstraint parameter, whose candidate values are {10^-4, 10^-3, 10^-2, 10^-1, 1, 10, 10^2, 10^3, 10^4}.
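A sketch of the corresponding grid search is shown below, assuming training and validation matrices (Xtrain, ytrain, Xval, yval) obtained as in Section 2.2; the loop structure and the selection criterion are illustrative rather than the exact tuning code:

% Grid search over the BoxConstraint candidates with a quadratic polynomial kernel.
boxGrid = 10.^(-4:4);
bestAUC = -Inf;
for C = boxGrid
    mdl = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'polynomial', ...
                  'PolynomialOrder', 2, 'BoxConstraint', C);
    [~, score] = predict(mdl, Xval);               % score columns follow mdl.ClassNames
    [~, ~, ~, aucVal] = perfcurve(yval, score(:, 2), 1);
    if aucVal > bestAUC
        bestAUC = aucVal;
        bestC   = C;
    end
end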

2.3.2. Random Forest (RF)

An ensemble of bagged decision trees (Breiman, 2001) labels observations through a voting process among all trees. The implementation uses the MATLAB function TreeBagger within the Statistics and Machine Learning Toolbox. Among the available hyperparameters, we performed an extensive grid search considering different numbers of decision trees {64, 128, 256, 512} and several values for the minimum number of observations per tree leaf {3, 5, 10, 15, 20}.
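A minimal sketch of one configuration from this grid (256 trees, minimum leaf size of 5), with out-of-bag predictor importance enabled because it is reused later in the interpretability analysis; variable names are illustrative:

% Random forest via bagged decision trees (Statistics and Machine Learning Toolbox).
rf = TreeBagger(256, Xtrain, ytrain, 'Method', 'classification', ...
                'MinLeafSize', 5, 'OOBPredictorImportance', 'on');
[yhatCell, score] = predict(rf, Xtest);    % predicted labels (as text) and class scores
yhat = str2double(yhatCell);               % convert back to numeric 0/1 labels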

2.3.3. k-Nearest Neighbours (kNN)

This method maps the data instances contained in the training set to abstract data points in a reference space, and then assigns binary labels to new unseen instances by measuring their closeness to the previously analyzed data (Steinbach & Tan, 2009). The chosen implementation relies on the function fitcknn, provided by the MATLAB Statistics and Machine Learning Toolbox. The hyperparameters considered during the tuning phase are the distance metric, defined as the Jaccard distance, and the number of neighbors used to assign each instance to the class to which it is estimated to belong, which is set equal to the odd number closest to the square root of the training-set size.
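A minimal sketch of this configuration, assuming the training/test matrices from Section 2.2 (the tie-breaking rule for an even square root is an assumption made for illustration):

% k is the odd number closest to the square root of the training-set size.
k = round(sqrt(size(Xtrain, 1)));
if mod(k, 2) == 0
    k = k + 1;     % force an odd neighbourhood size
end

knn  = fitcknn(Xtrain, ytrain, 'Distance', 'jaccard', 'NumNeighbors', k);
yhat = predict(knn, Xtest);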

2.3.4. Lasso Generalized Linear Model (Lasso GLM)

This is a generalization of standard linear regression for non-linear cases, which relies on lasso regularization to estimate the model coefficients (James et al., 2013). The implementation used in the experiments is given by the function lassoglm within the Statistics and Machine Learning Toolbox. Considering that the problem concerns a binary classification task, we assume that the underlying distribution of the response variable is binomial, and thus we consider a logit link function. This link function allows us to handle a binary classification problem such as the one considered in this study, even though the GLM is a generalization of a regression method and would normally be used for regression tasks. Concerning the hyperparameters, we considered the following values for the regularization coefficient λ: {10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10, 100}.
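A sketch of the corresponding fit over the λ grid is given below; thresholding the predicted probabilities at 0.5 is an illustrative decision rule, not necessarily the one used in the experiments:

% Lasso-regularized logistic regression over the candidate lambda values.
lambdaGrid = [10.^(-7:-1), 1, 10, 100];
[B, FitInfo] = lassoglm(Xtrain, ytrain, 'binomial', 'Link', 'logit', ...
                        'Lambda', lambdaGrid);

% Predicted probabilities on the test set for the regularization value FitInfo.Lambda(1).
eta  = FitInfo.Intercept(1) + Xtest * B(:, 1);
prob = 1 ./ (1 + exp(-eta));               % inverse logit
yhat = double(prob >= 0.5);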

2.3.5. Neural Network (NN)

This is a widespread ML method inspired by the natural arrangement of neurons, which is based on the interactions between a sequence of connected artificial neurons passing information to one another along a sequential path (Y.-c. Wu & Feng, 2018). The MATLAB function fitcnet, provided by the Statistics and Machine Learning Toolbox, has been employed for the implementation of a feedforward neural network. The main hyperparameters tuned during the experiments are the number of sequential layers and the number of neurons forming each layer, which allow us to define alternative configurations of the learning paths. Specifically, in the experiments we considered one configuration formed by three sequential layers, with [10, 8, 4] neurons, respectively, and two configurations formed by four sequential layers, with [32, 16, 8, 4] and [12, 8, 4, 2] neurons.
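A sketch of how these configurations can be evaluated with fitcnet (layer sizes as listed above; the validation split and the standardization option are illustrative choices):

% Candidate layer configurations for the feedforward network.
layerGrid = {[10 8 4], [32 16 8 4], [12 8 4 2]};
for i = 1:numel(layerGrid)
    nnMdl = fitcnet(Xtrain, ytrain, 'LayerSizes', layerGrid{i}, 'Standardize', true);
    [~, score] = predict(nnMdl, Xval);
    % ... compute AUC / F-score on the validation split and keep the best configuration ...
end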

2.3.6. Shallow Neural Network (SNN)

This feedforward neural network involves a small number of hidden layers, potentially only one (Bianchini & Scarselli, 2014). In this contribution, we rely on the MATLAB function feedforwardnet from the Deep Learning Toolbox to implement a shallow neural network with a single hidden layer, for which we consider, as the tunable hyperparameter, the number of constituent neurons, chosen from the set {4, 8, 10, 12, 16}.
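A minimal sketch with one candidate hidden-layer size (10 neurons); feedforwardnet expects samples arranged column-wise, and thresholding the continuous network output at 0.5 is an illustrative decision rule:

% Single-hidden-layer feedforward network (Deep Learning Toolbox).
net = feedforwardnet(10);
net.trainParam.showWindow = false;      % suppress the interactive training window
net = train(net, Xtrain', ytrain');     % inputs and targets arranged by column
scores = net(Xtest');                   % continuous network outputs for the test set
yhat   = double(scores >= 0.5)';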

2.3.7. Deep Learning System with Long Short-Term Memory (DL with LSTM)

This is a neural network architecture with several layers in which not only forward but also feedback connections are allowed (Hochreiter & Schmidhuber, 1997). In the experiments, we assembled a DL system through the MATLAB function trainNetwork contained within the Deep Learning Toolbox. The hyperparameters that have been tuned are the number of hidden units, chosen from the set {2, 5, 10, 16, 20, 32, 50}, and the size of the mini-batch, chosen among the three values {8, 16, 32}. Given the nature of the classification problems considered in this contribution, which are always modeled as binary classification problems, the number of components in the fully connected layer has been set equal to the number of classes (i.e., 2).
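The sketch below illustrates one such architecture (16 hidden units, mini-batch size 16); representing each student as a length-one sequence of the 45 features is an assumption made here for illustration, since the dataset is tabular rather than sequential:

% LSTM-based classifier assembled with trainNetwork (Deep Learning Toolbox).
XtrainSeq = num2cell(Xtrain', 1)';      % T-by-1 cell array of D-by-1 "sequences"
YtrainCat = categorical(ytrain);

layers = [ ...
    sequenceInputLayer(size(Xtrain, 2))
    lstmLayer(16, 'OutputMode', 'last')
    fullyConnectedLayer(2)              % two output components, one per class
    softmaxLayer
    classificationLayer];

opts = trainingOptions('adam', 'MiniBatchSize', 16, 'MaxEpochs', 30, 'Verbose', false);
lstmNet = trainNetwork(XtrainSeq, YtrainCat, layers, opts);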

2.3.8. Entropic Scalable Probabilistic Approximation Algorithm (eSPA)

This supervised classification method is aimed at solving an ensemble of entropy-optimal Bayesian network inference and feature space segmentation problems. The employed implementation is written in MATLAB and is based on eSPA+, an advanced and computationally efficient variant of the algorithm, which relies on the closed-form solution of each optimization sub-problem (Vecchi et al., 2022). The eSPA+ algorithm presents three main tunable hyperparameters: the number K of discretization boxes in which the input data are clustered, chosen from the set {2, 5, 7, 10, 15, 20, 25, 30, 40, 50}, and the constants ϵ_E and ϵ_CL, which tune, respectively, the impact of information entropy and of the conditional probabilities inferred from the input data on the overall classification task. For the constant ϵ_E, we consider the values {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 5, 10}, while for the parameter ϵ_CL, we take into account as potential candidates the values {10^-8, 10^-7, 10^-6, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}.

2.4. Performance Metrics

The choice of the most appropriate metrics for assessing the performance of a classifier is not trivial. This is particularly true in the small-data learning regime, where the limited size of the training and test sets has a major impact on the ability of ML models to discriminate between the classes (Althnian et al., 2021). In our experiments, we consider two different performance metrics: the area under the ROC curve (AUC) and the F-score. In a binary classification scenario, both metrics estimate the methods’ predictive performance under the assumption that one group takes the role of the positive class and the other of the negative class. According to this interpretation of the results, the instances of the positive class and negative class that are correctly predicted are labeled, respectively, as true positives and true negatives. More specifically, the AUC measures the area beneath the curve of the true positive rate against the false positive rate, and it evaluates the ability of the trained model to discriminate between the two classes across different thresholds. In general, an AUC value equal to 1 indicates that the model is perfectly able to correctly assign unseen instances to the positive or negative class, while a value of 0.5 indicates a random assignment to one of the two classes, with the model being completely unable to discriminate between the two groups. On the other hand, the F-score measures the model performance by incorporating the concepts of precision and recall: precision is the ratio between the number of true positives and the total number of instances classified as positive, while recall is the ratio between the number of true positives and the total number of instances that effectively have a positive label. In the small-data regime, both performance metrics can carry significant informative potential, but they may be too dependent on the structure of the test set and on the input data split. To address this issue, we perform a large number of cross-validations with different stratified splits of the data, in order to avoid any potential sample selection bias. In all figures included in Section 3, we report the average value of each performance metric together with the corresponding confidence interval.
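For reference, both metrics can be computed as follows, given the true test labels ytest, the predicted labels yhat, and the class scores score produced in the sketches above (the column index of the positive-class score is an assumption):

% Area under the ROC curve, with class 1 as the positive class.
[~, ~, ~, aucVal] = perfcurve(ytest, score(:, 2), 1);

% F-score from the confusion matrix (rows: true class, columns: predicted class).
cm = confusionmat(ytest, yhat);
tp = cm(2, 2);  fp = cm(1, 2);  fn = cm(2, 1);
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
fscore    = 2 * precision * recall / (precision + recall);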
Even if not directly tied to students’ performance prediction, we also briefly discuss the computational cost of the algorithms (i.e., the time required to solve the classification task) as an additional comparison metric. Indeed, given a similar classification performance, we will usually opt, ceteris paribus, for the method with the lowest computational cost, in particular when solving problems with thousands of feature dimensions.

2.5. Feature Importance

A crucial part of our analysis is the assessment of the interpretability of the different ML models and of their ability to identify which of the 45 explanatory features are actually relevant for solving the classification problem. To discuss the feature importance ranking provided by the different ML algorithms, we propose an analysis similar to the one performed by Vecchi et al. (2023), to which we refer for further technical details about the interpretation of entropy-based methods like eSPA. In this contribution, we consider two main metrics to evaluate the impact of the explanatory features on student performance prediction: (i) the output weight vectors returned by the ML methods that are directly interpretable (i.e., eSPA and RF); (ii) the Shapley additive explanations (SHAP) values, which allow the explainability of an ML method prediction by estimating the importance assigned to each feature through a procedure based on game theory (Lundberg & Lee, 2017). The inclusion of directly interpretable methods strengthens our analysis in the context of a small-data learning scenario, since SHAP value estimates may be unstable when computed from a limited amount of data (Molnar, 2020). However, as confirmed by our experimental results through the comparison with directly explainable methods, SHAP values are a reliable proxy of feature importance (Lundberg & Lee, 2017). This allows us to safely interpret the results of methods such as NN and SVM, which achieve a good predictive performance but do not explicitly assign importance scores to the explanatory features. Indeed, one of the main drawbacks of particularly complicated ML models is the lack of interpretability, which severely limits the possibility of identifying and changing the features with a direct impact on students’ performance.
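As a sketch of the two routes for obtaining importance scores, the snippet below shows the permutation-based importance returned directly by the random forest trained in the TreeBagger sketch above (with 'OOBPredictorImportance' enabled) and the SHAP values computed with the shapley object of the Statistics and Machine Learning Toolbox for a model that does not expose importance scores; the choice of the explained model and query instance is illustrative:

% (i) Direct importance output of the random forest.
rfImportance = rf.OOBPermutedPredictorDeltaError;   % one score per explanatory feature

% (ii) SHAP values for a black-box model, explained on a single test instance.
explainer = shapley(mdl, Xtrain);                   % mdl: e.g., the fitted SVM model
explainer = fit(explainer, Xtest(1, :));
shapVals  = explainer.ShapleyValues;                % table of per-feature contributions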
It is also worth mentioning that, in the small-data regime, the feature importance values can potentially change in every cross-validation and are heavily dependent on the split of the original data. However, the fact that a subset of features tends to be selected more often than others across different cross-validations hints at their relevance in the prediction task. Evidence of this behaviour for the eSPA+ algorithm has been provided by Vecchi et al. (2023), where the authors showed that some feature importance patterns kept repeating even if the entropic weights changed in every cross-validation. However, a detailed discussion of this phenomenon would be purely methodological and beyond the scope of this paper, and is therefore left to future work.

3. Discussion of the Results

In this section, we present the experimental results and discuss the research questions posed in Section 1. Before doing so, we would like to understand the key features of our data through a compact visualization of their main characteristics. However, given the high dimensionality of the dataset (i.e., 45 independent dimensions), visualization through standard means becomes impractical, and we need to rely on alternative solutions, like reducing the data dimensionality. This can be done either by plotting the data in the two- or three-dimensional space spanned by the features deemed most relevant, or by using an ad hoc dimensionality reduction technique to obtain a compact representation of the data space. Since the arbitrary selection of two or three features could potentially introduce some bias in our analysis, we use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to process the input data and to map them onto a lower-dimensional space (Van der Maaten & Hinton, 2008). Recent studies have highlighted that t-SNE can be sensitive both to the initialization and to the hyperparameter values, in particular the perplexity (a measure of the effective number of neighbors), and this can lead to substantially different two-dimensional representations (Kobak & Berens, 2019; Kobak & Linderman, 2021). To mitigate these issues, especially in a small-data scenario, we generate the t-SNE plots using the standard MATLAB implementation with default values for all hyperparameters (perplexity = 30 ). Still, the t-SNE plots here provided should be interpreted as exploratory tools, without attempting to provide a faithful representation of the global data structure.
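An exploratory projection of this kind can be generated as in the following sketch (default MATLAB settings with perplexity 30; the fixed seed and the colour coding are illustrative choices):

% Two-dimensional t-SNE embedding of the 45-dimensional feature space.
rng(0);                                   % fix the seed for a reproducible embedding
Y2 = tsne(X);                             % default perplexity (30) and initialization
gscatter(Y2(:, 1), Y2(:, 2), y_sufficient, 'br', '.', 8);
xlabel('t-SNE dimension 1');
ylabel('t-SNE dimension 2');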
In Figure 2, we can find the t-SNE plots in a two-dimensional space for the Mathematics and Portuguese datasets in the four experimental scenarios, while Table 1 summarizes the class distribution in each case. For both courses, the red dots correspond to the students with a sufficient grade or above-average performance, while the blue dots indicate the students with an insufficient grade or a below-average performance. As reported in Table 1, the Mathematics dataset is approximately balanced when partitioned by the population average (roughly 53% vs. 47%), whereas the sufficient-grade threshold introduces a moderate imbalance (67% vs. 33%). As we can notice from the t-SNE plots in the upper panels of Figure 2, in both cases, the two classes can be distinguished relatively easily and appear to be linearly separable. However, it is important to distinguish between the separability shown by t-SNE and the actual complexity of the classification task. This apparent ease of classification relies on the fact that the chosen ML model can effectively identify the informative features inside the 45-dimensional space. As shown in the upcoming experiments, if a method fails to identify these relevant variables, even a seemingly trivial problem may become intractable. This risk persists in the Portuguese dataset (T = 649, D = 45), despite the roughly 65% increase in sample size with respect to the Mathematics dataset. Indeed, the ratio T/D remains close to the small-data overfitting threshold empirically estimated in the literature (Horenko, 2020), and inevitably falls below it after splitting the input data between training and test sets. The t-SNE projections for the Portuguese course (lower panels of Figure 2) mirror the Mathematics case, showing contiguous yet non-overlapping clusters. According to the sufficient-grade criterion, the class imbalance is considerably more severe, with the minority class representing a mere 15% of the population.
Through stratified sampling, we ensure that this skew in the data distribution is preserved across all experimental folds, and that both training and test sets contain a sufficient number of instances of the minority class. Indeed, particularly in the case of imbalanced classification in the small-data regime, a lack of instances of the minority class in the training set can result either in a collapse towards the majority class or in severe overfitting (Vecchi et al., 2024a). While there are standard techniques to address class imbalance, such as undersampling (Liu et al., 2009) or oversampling (Mohammed et al., 2020), they are generally ill-suited for the small-data regime. Indeed, undersampling would discard relevant information from an already limited pool, while oversampling would generate synthetic instances of the minority class from very few observations, potentially leading to the propagation of spurious characteristics that cannot be found in the general population. Thus, in order to avoid the introduction of any additional bias, we restrict our experiments to the original data, without further preprocessing.

3.1. Evaluation of ML Methods’ Performance

In this section, we address RQ1 by evaluating the predictive performance of a diverse ensemble of ML models within the specific constraints of the small-data regime. By benchmarking these methods across two distinct subjects and two classification thresholds, we aim to identify which ML methods maintain robustness even when the observation-to-feature ratio is suboptimal. The empirical results for the Mathematics course, considering both the sufficiency threshold and the average grade criterion, are reported in Figure 3. Specifically, Figure 3a shows the AUC values across the tested ML algorithms. With the exception of DL with LSTM, most methods perform better when discriminating between students above or below the average than when predicting sufficient or insufficient grades. This could be related to the fact that the dataset that considers the average grade is more balanced, and the models are then less prone to overfit the majority class. While RF, lasso GLM, eSPA, and SVM consistently achieve the highest scores, the LSTM model struggles significantly, exhibiting the lowest average performance and the widest confidence intervals. We interpret the width of these confidence intervals—derived from 50 independent cross-validations—as a high sensitivity of the chosen ML method to each specific training/test split or as an underlying lack of generalization. For LSTM, the AUC lower bound occasionally falls below 0.5 , thus indicating an inverse behavior where the model effectively performs worse than a random guess, misidentifying failing students as passing students and vice versa. Interestingly, this instability is less visible in the F-score results (Figure 3b), where intervals appear more compact. This discrepancy highlights that relying on a single metric can be deceptive and taking into consideration different quality metrics can provide more insights on the overall model performance. Notably, even a typically robust method like eSPA becomes more dependent on the training/test split of the input data when evaluated according to this performance metric.
In Figure 4, we report the results of the same analysis for the Portuguese course dataset, with panel Figure 4a summarizing the AUC of the chosen ML methods and panel Figure 4b dealing with the F-score. As shown in Table 1, the main difference between the two datasets lies in the severe class imbalance in the sufficient/insufficient grade scenario. Theoretically, we expect the class imbalance to hinder the classification task and bias the methods towards overfitting the majority class. However, the results in Figure 4a are surprisingly similar to those of the Mathematics course dataset. We attribute this stability to the larger sample size of the Portuguese data, which partially mitigates the challenges induced by the small-data regime. If we compute the ratio T/D in the two cases, while taking into account that we can only use half of the data for training, we obtain a value of 4.38 for the Mathematics dataset and a value of 7.2 for the Portuguese dataset. In both cases, the ratios fall well below the empirical threshold of 13.8 suggested by Horenko (2020) for reliable feature extraction in the small-data regime. Below this threshold, we expect complex architectures like DL with LSTM to overfit the training data and fail to generalize, a claim that is indeed consistent with the experimental results. While one could theoretically increase the T/D ratio by allocating more data to the training set and less to the test set (e.g., an 80/20 or 90/10 split), we argue against this potential solution. Indeed, reducing the test set would compromise the statistical significance of our assessment and hinder the replicability of the results, without necessarily improving the models’ performance and ability to generalize. Finally, Figure 4b confirms that, in the case of the highly imbalanced Portuguese dataset (15% minority class when dealing with sufficient/insufficient grades), the F-score provides better results. This apparent discrepancy stems from the fact that the latter metric is more appropriate when dealing with imbalanced classification.
After evaluating the predictive performance of the various models, it is essential to also consider their computational cost and stability. This analysis directly addresses the second part of RQ1, which asks how these models perform under the specific constraints of the small-data regime in a secondary education setting. In this practical application scenario, an efficient deployment often requires a careful balance between predictive accuracy and resource efficiency. As shown in Table 2, the ranking of the methods, from the fastest to the slowest, remains roughly the same across all experimental setups. This suggests that, for educators looking to use these models, the ML architecture itself is the main driver of computational resource consumption, regardless of the specific subject or grading threshold being analyzed. Among the tested methods, the eSPA algorithm is notably the most efficient: with training times as low as 0.0002 s, it is significantly faster than its competitors, while maintaining the high predictive accuracy noted earlier. This efficiency makes it an ideal candidate for school-based applications where immediate feedback is required without the need for high-performance computing resources. Conversely, the lasso GLM proved to be the most time-consuming and least stable method in the small-data regime across all the experimental scenarios. Indeed, beyond requiring the longest training phase (up to 5.50 s), it also exhibited the widest confidence intervals. This high variability suggests that lasso GLM is particularly sensitive to data scarcity, and its behavior depends heavily on the specific way in which the student data are partitioned between training and test sets. For a single school dataset, such instability certainly represents a drawback, as it means that the model is less able to generalize to unseen data. Finally, although the models based on neural networks (NN, SNN, and DL with LSTM) are currently dominant in ML research, our results suggest that they are not the optimal choice for this specific domain. Indeed, given that unprocessed secondary education datasets are typically low-dimensional, the inherent structural complexity of these models introduces a significant increase in computational cost without a corresponding gain in predictive performance, and simpler and more efficient methods like SVM or kNN actually prove to be more practical. This concludes our response to RQ1: while several models offer satisfactory predictive performance, the most robust and efficient choices for small-data educational classification are those that maintain high stability and low computational overhead, such as eSPA and RF.

3.2. Feature Importance Analysis

To address RQ2, in this section we evaluate the interpretability of the ML models and their ability to identify relevant explanatory features. Building upon the predictive performance results established in Section 3.1, we focus our attention on eSPA, RF, NN, and SVM. In order to ease the comparison between the four methods considered, we report for RF and eSPA both the feature weights produced as direct output by the algorithms and the feature importance estimation obtained through the SHAP values.
In Figure 5, we observe the results for the Mathematics course dataset when considering the sufficient grade threshold. It is interesting to notice that eSPA selects the second-period grade (feature 28) as the sole predictor of student performance among the 45 available features. In other words, since all other features receive negligible weights, the most efficient model to predict whether a student is going to pass or fail the Mathematics course is based only on the grades achieved in the previous semester. While eSPA obtained very good results in the classification task, it should be noted that the selection of a single feature can indicate either that the model is overfitting a subset of the training data or that the input data could be less informative than expected. Furthermore, this high selectivity raises a critical pedagogical question concerning whether the eSPA model is uncovering the best proxy for student performance prediction or is, on the other hand, overlooking other socio-economic factors that could be less impactful but still relevant.
Concerning the other methods in Figure 5, we can notice that RF partially validates the eSPA results but introduces a broader context by also including the first-period grades and the number of past school failures as predictors. While the inclusion of these variables potentially makes the model more robust and interpretable, it also introduces nuanced pedagogical considerations regarding both the introduction of potential bias and the granularity of the input data. From an educational intervention perspective, the relevance of past failures is certainly informative, since a student who has previously failed and chooses to re-enroll may be less inclined to fail again, being already familiar with the subjects and the material tackled during the lectures. On the other hand, the statistical weight assigned to past failures lacks crucial elements that would allow us to develop truly personalized support. Indeed, since the dataset does not specify the precise reason for a student’s prior failure, we cannot determine whether Mathematics was the subject where the student struggled the most or the one in which they achieved the best performance. This limitation highlights a critical gap between predictive modeling and actionable education science: to formulate effective ad hoc interventions, we need to provide the ML models with informative variables that give a complete picture of each student’s performance across all school years. A similar observation can be made regarding the reliance on first- and second-period grades: even if they are identified as the strongest predictors of academic momentum, they remain static averages of performance over several weeks. In other words, they do not convey meaningful underlying trends concerning the student’s performance during the semester (like a mid-term decline or a late-semester recovery), which could instead be visible if more granular individual evaluations were recorded.
To conclude the analysis of the results reported in Figure 5, we consider also the SHAP values for SVM and NN. In general, SVM mirrors the previous methods by identifying the first- and second-period grades as the primary predictive features, but it also hints at a certain relevance of feature 24, which corresponds to student alcohol consumption during the weekends. From an educational standpoint, the emergence of this variable suggests that lifestyle choices and social behaviors can have a measurable impact on academic achievements, as highlighted also in other studies (Maniaci et al., 2023). On the other hand, the NN model adopts a more holistic view, including among the predictors a wider array of features such as student age, weekly study time, number of past failures, free time after school, and frequency of going out with friends. While including this broader set of features could theoretically make the model more robust by accounting for the complexity of student life and of socio-behavioral factors, we must also carefully consider the inherent trade-offs. Indeed, in the small-data regime, a more complex set of features increases the risk of incorporating spurious correlations, whose relevance is artificially induced by the limited size of the training set. For education researchers, this is a reminder that it is necessary to seek a balance: an ML model needs to be able to capture the multifaceted reality of the classroom without over-interpreting certain behavioral aspects that might not necessarily generalize to the broader student population. Furthermore, from a practical perspective, complicated models are both more computationally expensive to estimate and harder to communicate to non-technical stakeholders, such as parents or school administrators.
In Figure 6, we consider the feature importance of the same ML models when trying to identify the students who perform above or below the group average. This classification task is inherently more challenging than the classification with respect to the sufficiency threshold, since performing better than one's peers usually depends on the group composition and not on individual characteristics alone. Unlike what we observed in Figure 5, the eSPA algorithm now selects a significantly higher number of features as relevant. This behavior was observed consistently across the 50 cross-validations performed for this experiment, apart from occasional instances in which a smaller number of features was selected. While this increased model complexity could be induced by the limitations of the small-data learning regime, the differentiation in feature weights suggests that the model can still discriminate between relevant and redundant information. Pedagogically, this shift in feature importance indicates that, while passing a course may depend on a small set of academic markers, performing better than one's peers is a more difficult phenomenon to model. As for the other methods, there remains a consistent consensus that the first- and second-period grades are the most reliable predictors of above-average performance. However, in the RF model, weekend alcohol consumption is now assigned a negative weight: this is a critical insight, showing that certain behaviors, while not necessarily resulting in student failure, can still have a negative impact on overall academic performance, as established in prior literature (Balsa et al., 2011). Finally, the SHAP values for SVM and NN show a generalized decrease in the number of important features compared to the sufficiency scenario, although NN maintains, as before, a more holistic view that also includes parental education and home-to-school travel time as potential predictors. For education scientists, this reinforces the idea that external environmental and socio-economic factors (such as the family's educational background or the impact of daily commuting) may become deciding factors in helping a student perform above the average (Sirin, 2005).
The results for the Portuguese course dataset are reported in Table 3, which summarizes the top 5 influential features for each ML model in each experimental scenario. The classification with respect to the sufficiency threshold compounds the challenges of the small-data regime with those induced by a severe class imbalance. The latter has a major impact on the performance of an entropy-based method like eSPA, which tends to assign very similar weights to all explanatory features. This kind of behavior is related to the concept of maximum entropy, i.e., a scenario in which it is hard to distinguish the relevant from the irrelevant information, and the most reliable approach consists in conservatively assigning the same weight to each feature. However, eSPA is still able to identify the first- and second-period grades as the most relevant predictors, and a further tuning of its hyperparameters could potentially widen the gap between these and the less relevant features. On the other hand, RF identifies the number of past school failures and nursery school attendance as relevant predictors beyond previous grades. Pedagogically, the relevance of nursery school attendance is highly significant in the context of a native language course, since the development of literacy and verbal expression begins in early childhood education (Isaacs, 2006). Finally, while NN and SVM consistently prioritize both second- and first-period grades, they exhibit a certain degree of dispersion in the SHAP values. This indicates, especially for NN, a certain difficulty in identifying the relevant features for the minority class and a tendency of the model to overfit the majority class (i.e., sufficient students). As in the case of eSPA, this lack of generalization corresponds to a slight decrease in model performance, highlighting how small-data classification can become even more challenging in the presence of severe class imbalance.
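The maximum-entropy behavior described above can be quantified with a simple diagnostic. The sketch below is an illustration rather than part of the paper's pipeline: it computes the normalized Shannon entropy of a feature-weight vector, where values close to 1 indicate nearly uniform weights (the conservative behavior eSPA exhibits under severe imbalance) and values well below 1 indicate a few dominant features.

```python
# Minimal sketch (illustrative diagnostic, not part of the paper's pipeline): normalized
# Shannon entropy of a non-negative feature-weight vector.
import numpy as np

def normalized_entropy(weights):
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                                           # treat weights as a probability vector
    logs = np.log(p, out=np.zeros_like(p), where=p > 0)       # log(0) terms contribute zero
    return -(p * logs).sum() / np.log(len(p))                 # 1.0 for perfectly uniform weights

print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))           # 1.0: all features weighted equally
print(normalized_entropy([0.94, 0.02, 0.02, 0.02]))           # well below 1: one dominant feature
```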
To conclude the feature analysis, we briefly consider the classification with respect to the average grade in the Portuguese course (bottom part of Table 3). Unlike the sufficiency threshold, this scenario deals with almost perfectly balanced classes, allowing the models to avoid conservative feature weighting. Indeed, in this case, the eSPA algorithm identifies the second-period grade as the sole dominant proxy for predicting student performance. From an educational perspective, a model relying solely on previous grades can be particularly appropriate for a native language course. Indeed, while Mathematics involves a high degree of conceptual variance due to its hierarchical structure, language courses usually maintain a more cumulative difficulty level, and student success is more closely tied to the willingness to engage with the material and to complete exercises (Bernstein, 2006). Regarding the other models, RF incorporates additional socio-behavioral features to predict above-average performance, such as the intention to pursue higher education and social interactions. The inclusion of these variables is consistent with the literature, since high-level linguistic competence is both a requirement for and a result of academic ambition, while socializing provides a natural environment for improving these skills (Ampofo & Osei-Owusu, 2015). Finally, while SVM and NN SHAP values remain largely consistent with previous experiments, NN focuses on family-oriented variables like parental status and health, reinforcing the idea that a stable domestic environment may have a positive impact on students’ academic achievements.
To finalize the answer to RQ2, we synthesize the findings across both subjects and classification scenarios. In general, the results highlight that different ML architectures provide complementary perspectives on student data in the small-data regime. From a computational standpoint, eSPA emerges as the superior model, since it maintains high accuracy while using only the most informative and parsimonious set of predictors (i.e., the first- and second-period grades). In contrast, models like RF, NN, and SVM retain a higher degree of redundancy in their feature selection, but they have the advantage of capturing secondary explanatory variables that eSPA intentionally filters out. Therefore, while eSPA provides the most efficient classification rule, the other models seem to offer a broader view of environmental factors that, although less dominant in the performance prediction, remain actionable for targeted pedagogical interventions.

3.3. Additional Experiments on Feature Importance Discrimination

The experimental results discussed in Section 3.2 consistently identified first- and second-period grades as the most important predictors of student performance. While the predictive power of past academic performance is intuitively high, the weight assigned to it effectively masks the relevance of other explanatory variables more closely related to students' personal background or socio-economic environment. To answer RQ3 and to better evaluate these latent factors, we repeat the experiments performed in Section 3.1 and Section 3.2 after removing the features corresponding to the first- and second-period grades. While the removal of these features slightly increases the T / D ratio, its value remains well below the empirical overfitting threshold defined by Horenko (2020), and the problem therefore remains in the small-data learning regime.
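As a quick illustration of this setup, the sketch below drops the two period grades from a one-hot-encoded version of the Mathematics dataset and prints the resulting T / D ratio. The file name and column names follow the raw UCI distribution and are assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch (assumptions: raw UCI file and column names; one-hot encoding as in Appendix A):
# remove the period grades and check the sample-to-feature ratio T / D.
import pandas as pd

df = pd.read_csv("student-mat.csv", sep=";")
X = pd.get_dummies(df.drop(columns=["G3"]), drop_first=True)

X_reduced = X.drop(columns=["G1", "G2"])                      # drop first- and second-period grades
T, D = X_reduced.shape
print(f"T = {T}, D = {D}, T/D = {T / D:.1f}")                 # remains well below the threshold of ~14
```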
First of all, in Figure 7 we examine the impact of this feature removal on the data structure using the t-SNE visualization, and we can notice that the data distribution in the two-dimensional space changes significantly with respect to the one reported in Figure 2. In particular, the two groups of students now overlap and cannot be easily separated, neither with respect to the sufficiency threshold nor with respect to the average-grade threshold. From a data science perspective, this indicates that the response variable can no longer be predicted through a few dominant features; the chosen model needs to rely on a larger set of variables. However, even if t-SNE is unable to neatly separate the two classes in a two-dimensional space, the other ML methods could still be able to find an efficient classification rule, even if the task is more challenging. In the Portuguese dataset with the sufficiency threshold (Figure 7c), due to the class imbalance, the few insufficient instances are scattered among those of the majority class. This could potentially induce classification biases in models that rely on the spatial proximity of the input data, since we expect instances that are close in space to have roughly the same characteristics.
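A two-dimensional embedding like the one in Figure 7 can be reproduced in spirit with scikit-learn's t-SNE implementation; the perplexity, initialization, and color map below are illustrative choices and not necessarily those used for the figure.

```python
# Minimal sketch (illustrative hyperparameters): t-SNE embedding of the standardized features
# after removing the period grades, colored by the sufficiency label (cf. Figure 7).
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("student-mat.csv", sep=";")
y = (df["G3"] >= 10).astype(int)
X = pd.get_dummies(df.drop(columns=["G3", "G1", "G2"]), drop_first=True)

Z = StandardScaler().fit_transform(X)
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(Z)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="coolwarm", s=15)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.show()
```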
After assessing the changes in the problem structure, we then consider how the performance of the ML methods changes in the four classification scenarios. Removing the first- and second-period grades fundamentally alters the classification task, as shown by the performance metrics for the Mathematics (Figure 8) and Portuguese datasets (Figure 9). Across all scenarios, we observe a significant decline in both AUC and F-score, confirming that past grades serve as the main drivers of future academic achievement. Furthermore, all models show a significant increase in volatility, as reflected in the widening of the confidence intervals obtained across the 50 independent cross-validation splits. This result suggests that, without strong predictive features, model stability is heavily dependent on the specific variance of the training split. Notably, the lower whiskers of the box plots in the Mathematics results indicate that several ML models perform worse than random guessing and, in some cases, predict the opposite label, thus systematically misclassifying the data.
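The evaluation protocol can be approximated as follows: 50 stratified random splits, a model refit on each training portion, and AUC and F-score recorded on the held-out portion. The split sizes, the choice of RF as the example model, and the normal-approximation confidence interval are simplifying assumptions; in particular, the validation stage used in the paper for hyperparameter tuning is omitted.

```python
# Minimal sketch (simplified protocol: no separate validation split or hyperparameter tuning):
# 50 stratified random splits, AUC and F-score on the held-out part, and a 95% CI per metric.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.read_csv("student-mat.csv", sep=";")
y = (df["G3"] >= df["G3"].mean()).astype(int)                 # above/below-average scenario
X = pd.get_dummies(df.drop(columns=["G3", "G1", "G2"]), drop_first=True)

splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
aucs, f1s = [], []
for train_idx, test_idx in splitter.split(X, y):
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    proba = model.predict_proba(X.iloc[test_idx])[:, 1]
    aucs.append(roc_auc_score(y.iloc[test_idx], proba))
    f1s.append(f1_score(y.iloc[test_idx], model.predict(X.iloc[test_idx])))

for name, values in [("AUC", aucs), ("F-score", f1s)]:
    values = np.asarray(values)
    half_width = 1.96 * values.std(ddof=1) / np.sqrt(len(values))
    print(f"{name}: {values.mean():.3f} +/- {half_width:.3f}")
```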
In the Portuguese sufficiency scenario (Figure 9), a clear divergence emerges between the AUC, which drops significantly and indicates a diminished ability to discriminate between classes, and the F-score, which remains disproportionately high. This discrepancy is a characteristic artifact of class imbalance: since the vast majority of students pass the Portuguese course, the models default to the majority class, which keeps precision and recall for that class, and hence the F-score, artificially high. Consequently, the predictive power for the minority class (corresponding to the at-risk students) is rather low, and this underscores the necessity of using the AUC as the primary metric on imbalanced educational datasets to avoid overestimating model utility.
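The mechanism behind this divergence can be reproduced with a trivial majority-class baseline. The sketch below is purely illustrative (the file name and sufficiency threshold are assumptions) and shows how such a classifier reaches a high F-score for the sufficient class while its AUC stays at chance level.

```python
# Minimal sketch illustrating the artifact discussed above: a classifier that always predicts the
# majority (sufficient) class scores a high F-score for that class but an AUC of 0.5.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("student-por.csv", sep=";")                     # Portuguese course dataset
y = (df["G3"] >= 10).astype(int)                                 # roughly 85% of students are sufficient
X = pd.get_dummies(df.drop(columns=["G3", "G1", "G2"]), drop_first=True)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

print("F-score:", f1_score(y_te, dummy.predict(X_te)))               # high, driven by the majority class
print("AUC:", roc_auc_score(y_te, dummy.predict_proba(X_te)[:, 1]))  # 0.5: no discriminative power
```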
To address RQ3 and identify potential latent drivers of student success in the small-data regime, we narrow our feature importance analysis to two scenarios providing complementary perspectives: the Mathematics/average and the Portuguese/sufficiency scenarios. The former represents a relatively balanced classification task and allows us to observe which socio-behavioral features the different ML models prioritize when they are not forced to compensate for class imbalance. On the other hand, the latter represents a highly imbalanced dataset where the number of failing students is small, and it allows us to assess the models’ ability to identify infrequent markers of potential academic failure within a generally successful cohort—a task of high importance for early educational intervention strategies. Preliminary analysis of the remaining scenarios (Mathematics/sufficiency and Portuguese/average) indicated a largely uniform redistribution of feature weights without the emergence of distinct, actionable pedagogical patterns. Consequently, we concentrate on the two scenarios indicated above, as they effectively cover the full spectrum of small-data learning regime challenges encountered after the removal of prior grades.
In the Mathematics scenario (Figure 10), where student performance is classified relative to the group average, the eSPA algorithm exhibits a distinct change in behavior after the removal of prior grades. Indeed, while the algorithm previously distributed weights across a broader set of variables, it now focuses on a subset of socio-economic predictors, such as the desire to pursue higher education and the father's employment status (particularly in healthcare or as a stay-at-home parent). Consistently with the relevant literature, this result links family background with student performance (Bolu-steve & Sanni, 2013; Z. Li & Qiu, 2018), and other studies go even further by highlighting the relationship between the mother's education and the academic performance of her children (Awan & Kauser, 2015; McGowan & Johnson, 1984). Conversely, RF selects more features, prioritizing the number of past school failures alongside lifestyle variables such as available free time and the frequency of social outings with peers. These aspects of a student's background have been considered in educational studies (Ackerman & Gross, 2003; J. Wu et al., 2023), and a high level of peer-group integration seems to have a dual impact on student performance, depending on the reference peer group's own academic orientation.
For the Portuguese sufficiency scenario (Figure 11), characterized by high class imbalance, eSPA identifies parental employment in teaching and healthcare, alongside the use of extra-curricular support, as the most influential predictors. This seems to point towards the importance of a pedagogical synergy within the household: parents in highly skilled or educational professions may provide a linguistically rich environment at home and possess the financial means to secure private tuition whenever necessary. Interestingly, RF and SVM consistently highlight the number of past school failures as the most significant predictor, while RF, unlike eSPA, assigns a negative score to all features related to the parents' economic background. However, while informative in this specific dataset, the number of past school failures does not carry any additional information about the problems previously encountered by the student and therefore does not support the development of targeted interventions. In this context, the eSPA method seems better suited for educational diagnostics, since its prioritization of socio-environmental factors over historical performance markers allows a better design of early intervention strategies.
The findings from these additional experiments provide a definitive answer to RQ3: while the removal of prior academic grades leads to a noticeable decrease in overall ML model performance and stability, it simultaneously uncovers a deeper layer of socio-pedagogical insights. In other words, even when operating in the small-data regime and deprived of previous grade information, ML methods are able to identify structural and behavioral drivers of student performance. The results suggest that above-average performance in Mathematics is largely sustained by higher-education aspirations and household stability, whereas sufficiency in Portuguese is heavily dependent on external scaffolding and on parents' professional backgrounds that favor linguistic development. Furthermore, the divergence between eSPA and RF reveals that different algorithms capture different dimensions of the student experience, and they should be used together to provide educators with a holistic view of student needs. Ultimately, this demonstrates that, even after dismissing prior grades, ML models can highlight actionable factors beyond past academic performance, allowing educators to influence them before academic failure becomes a historical record.

4. Conclusions

In this paper, we provided a comprehensive analysis and benchmarking of how different ML methods can be applied to educational data, specifically addressing RQ1 concerning the feasibility of student performance prediction in the small-data regime. We considered the Mathematics and Portuguese courses in two secondary schools, modeling the problem as a binary classification task based on two criteria: (i) sufficiency of the grade, and (ii) performance relative to the group average. The student performance prediction relies on several explanatory features, including academic achievements, socioeconomic status, and extracurricular activities. Given the discrepancy between the number of instances and the number of explanatory features, the problem falls in the small-data learning regime, which proves particularly challenging for standard ML tools due to the risk of overfitting the training data while losing the ability to generalize to unseen data. However, the experimental results showed that the chosen ML models can successfully predict students' performance with varying degrees of accuracy. The developed predictive models take into account different subsets of the available features, which directly impacts both their parsimony and their robustness.
To address RQ2, in the second part of the experiments we focused on the interpretability of these models, analyzing which features are relevant for solving the classification task. This analysis offers insights into how the ML algorithms discriminate relevant from irrelevant information and provides a direct assessment of how the different variables interact with each other. Our results confirmed findings from the existing literature, but we arrived at these conclusions not by a priori assuming and testing the validity of a specific theoretical model, but by extracting the relevant information directly from the input data without any filtering. Finally, in order to make our experiments even more robust and address RQ3, we investigated ML model performance after removing the first- and second-period grades, which were previously identified as the strongest predictors. The experimental results showed that ML algorithms can still devise a valid predictive model without this information. Crucially, this unmasked the role of latent socio-behavioral variables (such as desire to pursue higher education, family background, and number of past school failures), whose impact on student performance was previously overshadowed by previous grades.
In this work, we established a foundation for future educational research within the small-data regime, showing that ML models can successfully extract meaningful patterns directly from unprocessed socio-behavioral data. While the accessibility of the employed datasets and our proposed analysis pipeline offer a solid starting point for the community, this study also highlights significant limitations inherent to the data itself. Gathering high-quality educational data remains a challenge, due both to the complexities of human-related data collection and to privacy concerns, especially in secondary education. Consequently, the analyzed dataset, which is one of the few freely available examples, suffers from a low level of detail in several variables. An example is the student travel time to reach the school in Table A3: while it reports the duration, it lacks context regarding how the students reach the school (e.g., by car, on foot, or by bike). This missing detail prevents any inference about student lifestyle (e.g., being active or sedentary), which has proved relevant in comparable studies (Taras, 2005). Thus, future research should prioritize the acquisition of more comprehensive datasets, ideally collected as longitudinal panels, since the introduction of this temporal component would yield a significantly more informative analysis. Finally, an intriguing extension of this work involves the inclusion of psychological features, which could enhance the distinction of students at risk of insufficiency, thereby leading to proactive ML tools for preventing school failure and reducing dropout rates.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available on the UCI data repository (Cortez, 2014).

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A. Description of Dataset Explanatory Features

In this section, we describe the dataset employed in our experiments, which was downloaded from the UCI repository (Cortez, 2014). The data have been kept in their original raw form, with the exception of the non-numerical categorical features (i.e., those corresponding to indices 29 to 45 in Table A1), which have been transformed into binary variables. This approach led to some improvements in the predictive performance of the different ML methods considered in the experimental part.
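A minimal sketch of this kind of preprocessing is given below, assuming the raw UCI column names (Mjob, Fjob, reason, guardian) and pandas one-hot encoding; the exact encoding and the naming used in Table A1 may differ in detail.

```python
# Minimal sketch (assumptions: raw UCI column names; the paper's exact encoding may differ):
# expand the nominal features into binary indicators and map yes/no answers to 1/0.
import pandas as pd

df = pd.read_csv("student-mat.csv", sep=";")
nominal = ["Mjob", "Fjob", "reason", "guardian"]              # non-numerical categorical features
df_encoded = pd.get_dummies(df, columns=nominal, dtype=int)   # e.g., Mjob -> Mjob_at_home, Mjob_health, ...
df_encoded = df_encoded.replace({"yes": 1, "no": 0})          # binary yes/no features become 1/0

print(df_encoded.filter(like="Mjob").columns.tolist())
```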
Table A1. Explanatory features of the considered datasets. The second column (Idx) indicates the feature index following the removal of the two features concerning first- and second-period grades.
| Idx | Idx | Label | Name | Type | Description |
|---|---|---|---|---|---|
| 1 | 1 | school | School | Binary | MS: 1, GP: 0 |
| 2 | 2 | sex | Sex | Binary | Male: 1, Female: 0 |
| 3 | 3 | age | Age | Numerical | [15, 22] |
| 4 | 4 | address | Home address type | Binary | Rural: 1, Urban: 0 |
| 5 | 5 | famSize | Family size | Binary | Size ≥ 3: 1, Size < 3: 0 |
| 6 | 6 | pStatus | Parent's cohabitation status | Binary | Apart: 1, Together: 0 |
| 7 | 7 | Medu | Mother's education (Table A2) | Categorical | {0, 1, 2, 3, 4} |
| 8 | 8 | Fedu | Father's education (Table A2) | Categorical | {0, 1, 2, 3, 4} |
| 9 | 9 | traveltime | Home to school travel time (Table A3) | Categorical | {1, 2, 3, 4} |
| 10 | 10 | studytime | Weekly study time (Table A4) | Categorical | {1, 2, 3, 4} |
| 11 | 11 | failures | Number of past school failures | Numerical | n if 1 ≤ n ≤ 3, else 4 |
| 12 | 12 | schoolsup | Extra educational school support | Binary | Yes: 1, No: 0 |
| 13 | 13 | famsup | Extra educational family support | Binary | Yes: 1, No: 0 |
| 14 | 14 | paid | Extra paid classes | Binary | Yes: 1, No: 0 |
| 15 | 15 | activities | Extracurricular activities | Binary | Yes: 1, No: 0 |
| 16 | 16 | nursery | Attendance at nursery school | Binary | Yes: 1, No: 0 |
| 17 | 17 | higher | Aim to pursue higher education | Binary | Yes: 1, No: 0 |
| 18 | 18 | internet | Internet access at home | Binary | Yes: 1, No: 0 |
| 19 | 19 | romantic | Romantic relationship | Binary | Yes: 1, No: 0 |
| 20 | 20 | famrel | Quality of family relationship | Numerical | [1, 5] |
| 21 | 21 | freetime | Free time after school | Numerical | [1, 5] |
| 22 | 22 | goout | Going out with friends | Numerical | [1, 5] |
| 23 | 23 | Dalc | Work-day alcohol consumption | Numerical | [1, 5] |
| 24 | 24 | Walc | Week-end alcohol consumption | Numerical | [1, 5] |
| 25 | 25 | health | Current health status | Numerical | [1, 5] |
| 26 | 26 | absences | Number of absences | Numerical | [0, 93] |
| 27 | - | grade_1 | First period grade | Numerical | [0, 20] |
| 28 | - | grade_2 | Second period grade | Numerical | [0, 20] |
| 29 | 27 | Mjob_home | Mother's job field: At home | Binary | Yes: 1, No: 0 |
| 30 | 28 | Mjob_health | Mother's job field: Healthcare | Binary | Yes: 1, No: 0 |
| 31 | 29 | Mjob_other | Mother's job field: Other | Binary | Yes: 1, No: 0 |
| 32 | 30 | Mjob_serv | Mother's job field: Services | Binary | Yes: 1, No: 0 |
| 33 | 31 | Mjob_teach | Mother's job field: Teaching | Binary | Yes: 1, No: 0 |
| 34 | 32 | Fjob_home | Father's job field: At home | Binary | Yes: 1, No: 0 |
| 35 | 33 | Fjob_health | Father's job field: Healthcare | Binary | Yes: 1, No: 0 |
| 36 | 34 | Fjob_other | Father's job field: Other | Binary | Yes: 1, No: 0 |
| 37 | 35 | Fjob_serv | Father's job field: Services | Binary | Yes: 1, No: 0 |
| 38 | 36 | Fjob_teach | Father's job field: Teaching | Binary | Yes: 1, No: 0 |
| 39 | 37 | reason_course | School choice reason: Course | Binary | Yes: 1, No: 0 |
| 40 | 38 | reason_near | School choice reason: Closeness | Binary | Yes: 1, No: 0 |
| 41 | 39 | reason_rep | School choice reason: Reputation | Binary | Yes: 1, No: 0 |
| 42 | 40 | reason_other | School choice reason: Other | Binary | Yes: 1, No: 0 |
| 43 | 41 | guardian_f | Student's guardian: Father | Binary | Yes: 1, No: 0 |
| 44 | 42 | guardian_m | Student's guardian: Mother | Binary | Yes: 1, No: 0 |
| 45 | 43 | guardian_o | Student's guardian: Other | Binary | Yes: 1, No: 0 |
Table A2. Categories of parents' education.
| Category | Description |
|---|---|
| 0 | None |
| 1 | Primary education (4th grade, basic education) |
| 2 | 5th to 9th grade |
| 3 | Secondary education |
| 4 | Higher education |
Table A3. Categories of home to school travel time.
| Category | Description |
|---|---|
| 1 | Less than 15 min |
| 2 | 15 to 30 min |
| 3 | 30 min to 1 h |
| 4 | More than 1 h |
Table A4. Categories of weekly study time.
| Category | Description |
|---|---|
| 1 | Less than 2 h |
| 2 | 2 to 5 h |
| 3 | 5 to 10 h |
| 4 | More than 10 h |

Notes

1. In ML, overfitting denotes the phenomenon by which a model adapts too closely to the training data and loses the ability to generalize to new, unseen data.
2. In ML applications, a hyperparameter is any input variable needed to tune or modify the modeling strategy of an algorithm. Hyperparameters usually cannot be determined a priori, but need to be inferred from the specific learning domain and from the intrinsic characteristics of the data under consideration.

References

  1. Ackerman, D. S., & Gross, B. L. (2003). Is time pressure all bad? Measuring the relationship between free time availability and student performance and perceptions. Marketing Education Review, 13(2), 21–32. [Google Scholar] [CrossRef]
  2. Ahmed, W., Wani, M. A., Plawiak, P., Meshoul, S., Mahmoud, A., & Hammad, M. (2025). Machine learning-based academic performance prediction with explainability for enhanced decision-making in educational institutions. Scientific Reports, 15(1), 26879. [Google Scholar] [CrossRef]
  3. Alamri, R., & Alharbi, B. (2021). Explainable student performance prediction models: A systematic review. IEEE Access, 9, 33132–33143. [Google Scholar] [CrossRef]
  4. Albreiki, B., Zaki, N., & Alashwal, H. (2021). A systematic literature review of student’ performance prediction using machine learning techniques. Education Sciences, 11(9), 552. [Google Scholar] [CrossRef]
  5. Althnian, A., AlSaeed, D., Al-Baity, H., Samha, A., Dris, A. B., Alzakari, N., Abou Elwafa, A., & Kurdi, H. (2021). Impact of dataset size on classification performance: An empirical evaluation in the medical domain. Applied Sciences, 11(2), 796. [Google Scholar] [CrossRef]
  6. Ampofo, E. T., & Osei-Owusu, B. (2015). Students’ academic performance as mediated by students’ academic ambition and effort in the public senior high schools in Ashanti Mampong Municipality of Ghana. International Journal of Academic Research and Reflection, 3(5), 19–35. [Google Scholar]
  7. Arsad, P. M., Buniyamin, N., & Manan, J.-l. A. (2013, November 25–27). A neural network students’ performance prediction model (NNSPPM). 2013 IEEE International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA) (pp. 1–5), Kuala Lumpur, Malaysia. [Google Scholar] [CrossRef]
  8. Awan, A. G., & Kauser, D. (2015). Impact of educated mother on academic achievement of her children: A case study of District Lodhran-Pakistan. Journal of Literature, Languages and Linguistics, 12(2), 57–65. [Google Scholar]
  9. Balsa, A. I., Giuliano, L. M., & French, M. T. (2011). The effects of alcohol use on academic achievement in high school. Economics of Education Review, 30(1), 1–15. [Google Scholar] [CrossRef] [PubMed]
  10. Bernstein, B. (2006). Vertical and horizontal discourse: An essay. In Education and society (pp. 53–73). Routledge. [Google Scholar]
  11. Bianchini, M., & Scarselli, F. (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8), 1553–1565. [Google Scholar] [CrossRef]
  12. Bo, D., Hwangbo, H., Sharma, V., Arndt, C., & TerMaath, S. (2023). A randomized subspace-based approach for dimensionality reduction and important variable selection. Journal of Machine Learning Research, 24(76), 1–31. [Google Scholar]
  13. Bolikulov, F., Nasimov, R., Rashidov, A., Akhmedov, F., & Young-Im, C. (2024). Effective methods of categorical data encoding for artificial intelligence algorithms. Mathematics, 12(16), 2553. [Google Scholar] [CrossRef]
  14. Bolu-steve, F., & Sanni, W. (2013). Influence of family background on the academic performance of secondary school students in Nigeria. IFE PsychologIA: An International Journal, 21(1), 90–100. Available online: https://hdl.handle.net/10520/EJC150670 (accessed on 19 November 2025).
  15. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. [Google Scholar] [CrossRef]
  16. Camacho-Miñano, M.-d.-M., del Campo, C., Urquía-Grande, E., Pascual-Ezama, D., Akpinar, M., & Rivero, C. (2020). Solving the mystery about the factors conditioning higher education students’ assessment: Finland versus Spain. Education+ Training, 62(6), 617–630. [Google Scholar] [CrossRef]
  17. Colak Oz, H., Güven, Ç., & Nápoles, G. (2023). School dropout prediction and feature importance exploration in Malawi using household panel data: Machine learning approach. Journal of Computational Social Science, 6(1), 245–287. [Google Scholar] [CrossRef]
  18. Cortez, P. (2014). Student performance. UCI Machine Learning Repository. [Google Scholar] [CrossRef]
  19. Daud, A., Aljohani, N. R., Abbasi, R. A., Lytras, M. D., Abbas, F., & Alowibdi, J. S. (2017). Predicting student performance using advanced learning analytics. In Proceedings of the 26th international conference on world wide web companion (pp. 415–421). Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee. [Google Scholar] [CrossRef]
  20. Espinosa, E., & Figueira, A. (2023). On the quality of synthetic generated tabular data. Mathematics, 11(15), 3278. [Google Scholar] [CrossRef]
  21. Fadelelmoula, T., & Colleges, R. S. A. A. (2018). The impact of class attendance on student performance. International Research Journal of Medicine and Medical Sciences, 6(2), 47–49. [Google Scholar] [CrossRef]
  22. Fernández-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1), 3133–3181. [Google Scholar]
  23. Gajwani, J., & Chakraborty, P. (2021). Students’ performance prediction using feature selection and supervised machine learning algorithms. In D. Gupta, A. Khanna, S. Bhattacharyya, A. E. Hassanien, S. Anand, & A. Jaiswal (Eds.), International conference on innovative computing and communications (pp. 347–354). Springer Singapore. [Google Scholar] [CrossRef]
  24. Gracia, A., González, S., Robles, V., & Menasalvas, E. (2014). A methodology to compare dimensionality reduction algorithms in terms of loss of quality. Information Sciences, 270, 1–27. [Google Scholar] [CrossRef]
  25. Grandini, M., Bagli, E., & Visani, G. (2020). Metrics for multi-class classification: An overview. arXiv, arXiv:2008.05756. [Google Scholar] [CrossRef]
  26. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507–520. [Google Scholar]
  27. Guleria, P., & Sood, M. (2023). Explainable AI and machine learning: Performance evaluation and explainability of classifiers on educational data mining inspired career counseling. Education and Information Technologies, 28(1), 1081–1116. [Google Scholar] [CrossRef] [PubMed]
  28. Gunasekara, S., & Saarela, M. (2025). Explainable AI in education: Techniques and qualitative assessment. Applied Sciences, 15(3), 1239. [Google Scholar] [CrossRef]
  29. Ha, W., Ma, L., Cao, Y., Feng, Q., & Bu, S. (2024). The effects of class attendance on academic performance: Evidence from synchronous courses during COVID-19 at a Chinese research university. International Journal of Educational Development, 104, 102952. [Google Scholar] [CrossRef]
  30. Hancock, J. T., & Khoshgoftaar, T. M. (2020). Survey on categorical data for neural networks. Journal of Big Data, 7(1), 28. [Google Scholar] [CrossRef]
  31. Hegde, V., & Prageeth, P. P. (2018, January 19–20). Higher education student dropout prediction and analysis through educational data mining. 2018 2nd International Conference on Inventive Systems and Control (ICISC) (pp. 694–699), Coimbatore, India. [Google Scholar] [CrossRef]
  32. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  33. Horenko, I. (2020). On a scalable entropic breaching of the overfitting barrier for small data problems in machine learning. Neural Computation, 32(8), 1563–1579. [Google Scholar] [CrossRef]
  34. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377. [Google Scholar] [CrossRef]
  35. Isaacs, S. (2006). The educational value of the nursery school. Early Years Education: Major Themes in Education, 1, 134. [Google Scholar]
  36. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer. [Google Scholar] [CrossRef]
  37. Kabakchieva, D. (2013). Predicting student performance by using data mining methods for classification. Cybernetics and Information Technologies, 13(1), 61–72. [Google Scholar] [CrossRef]
  38. Kesgin, K., Kiraz, S., Kosunalp, S., & Stoycheva, B. (2025). Beyond performance: Explaining and ensuring fairness in student academic performance prediction with machine learning. Applied Sciences, 15(15), 8409. [Google Scholar] [CrossRef]
  39. Kobak, D., & Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1), 5416. [Google Scholar] [CrossRef] [PubMed]
  40. Kobak, D., & Linderman, G. C. (2021). Initialization is critical for preserving global data structure in both t-SNE and UMAP. Nature Biotechnology, 39(2), 156–157. [Google Scholar] [CrossRef] [PubMed]
  41. Kushik, N., Yevtushenko, N., & Evtushenko, T. (2020). Novel machine learning technique for predicting teaching strategy effectiveness. International Journal of Information Management, 53, 101488. [Google Scholar] [CrossRef]
  42. Li, M., Huang, C., Wang, D., Hu, Q., Zhu, J., & Tang, Y. (2019). Improved randomized learning algorithms for imbalanced and noisy educational data classification. Computing, 101(6), 571–585. [Google Scholar] [CrossRef]
  43. Li, Z., & Qiu, Z. (2018). How does family background affect children’s educational achievement? Evidence from Contemporary China. The Journal of Chinese Sociology, 5(1), 1–21. [Google Scholar] [CrossRef]
  44. Lin, Y., Chen, H., Xia, W., Lin, F., Wang, Z., & Liu, Y. (2025). A comprehensive survey on deep learning techniques in educational data mining. Data Science and Engineering, 10(4), 564–590. [Google Scholar] [CrossRef]
  45. Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539–550. [Google Scholar] [CrossRef]
  46. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, Long Beach, CA, USA, December 4–9 (pp. 4768–4777). Curran Associates Inc. [Google Scholar]
  47. Machado, P., Fernandes, B., & Novais, P. (2022, November 24–26). Benchmarking data augmentation techniques for tabular data. International Conference on Intelligent Data Engineering and Automated Learning (pp. 104–112), Manchester, UK. [Google Scholar]
  48. Maniaci, G., La Cascia, C., Giammanco, A., Ferraro, L., Palummo, A., Saia, G. F., Pinetti, G., Zarbo, M., & La Barbera, D. (2023). The impact of healthy lifestyles on academic achievement among Italian adolescents. Current Psychology, 42(6), 5055–5061. [Google Scholar]
  49. Mathew, R. M., & Gunasundari, R. (2021, March 4–5). A review on handling multiclass imbalanced data classification in education domain. 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE) (pp. 752–755), Greater Noida, India. [Google Scholar]
  50. McGowan, R. J., & Johnson, D. L. (1984). The mother-child relationship and other antecedents of academic performance: A causal analysis. Hispanic Journal of Behavioral Sciences, 6(3), 205–224. [Google Scholar] [CrossRef]
  51. Miletic, M., & Sariyar, M. (2024). Challenges of using synthetic data generation methods for tabular microdata. Applied Sciences, 14(14), 5975. [Google Scholar] [CrossRef]
  52. Minaei-Bidgoli, B., Kashy, D., Kortemeyer, G., & Punch, W. (2003, November 5–8). Predicting student performance: An application of data mining methods with an educational Web-based system. 33rd Annual Frontiers in Education, 2003, FIE 2003 (Vol. 1, p. T2A-13), Westminster, CO, USA. [Google Scholar] [CrossRef]
  53. Mohammad, A. S., Al-Kaltakchi, M. T., Alshehabi Al-Ani, J., & Chambers, J. A. (2023). Comprehensive evaluations of student performance estimation via machine learning. Mathematics, 11(14), 3153. [Google Scholar] [CrossRef]
  54. Mohammed, R., Rawashdeh, J., & Abdullah, M. (2020, April 7–9). Machine learning with oversampling and undersampling techniques: Overview study and experimental results. 2020 11th International Conference on Information and Communication Systems (ICICS) (pp. 243–248), Irbid, Jordan. [Google Scholar] [CrossRef]
  55. Molnar, C. (2020). Interpretable machine learning. Lulu.com. [Google Scholar]
  56. Motrenko, A., Strijov, V., & Weber, G.-W. (2014). Sample size determination for logistic regression. Journal of Computational and Applied Mathematics, 255, 743–752. [Google Scholar] [CrossRef]
  57. Nachouki, M., Mohamed, E. A., Mehdi, R., & Abou Naaj, M. (2023). Student course grade prediction using the random forest algorithm: Analysis of predictors’ importance. Trends in Neuroscience and Education, 33, 100214. [Google Scholar] [CrossRef]
  58. Niu, T., Liu, T., Luo, Y. T., Pang, P. C.-I., Huang, S., & Xiang, A. (2025). Decoding student cognitive abilities: A comparative study of explainable AI algorithms in educational data mining. Scientific Reports, 15(1), 26862. [Google Scholar] [CrossRef] [PubMed]
  59. Nnadi, L. C., Watanobe, Y., Rahman, M. M., & John-Otumu, A. M. (2024). Prediction of students’ adaptability using explainable AI in educational machine learning models. Applied Sciences, 14(12), 5141. [Google Scholar] [CrossRef]
  60. Noble, W. S. (2006). What is a support vector machine? Nature biotechnology, 24(12), 1565–1567. [Google Scholar] [CrossRef]
  61. Osmanbegovic, E., & Suljic, M. (2012). Data mining approach for predicting student performance. Economic Review: Journal of Economics and Business, 10(1), 3–12. [Google Scholar]
  62. Pallathadka, H., Sonia, B., Sanchez, D. T., De Vera, J. V., Godinez, J. A. T., & Pepito, M. T. (2022). Investigating the impact of artificial intelligence in education sector by predicting student performance. Materials Today: Proceedings, 51, 2264–2267. [Google Scholar] [CrossRef]
  63. Pallathadka, H., Wenda, A., Ramirez-Asís, E., Asís-López, M., Flores-Albornoz, J., & Phasinam, K. (2023). Classification and prediction of student performance data using various machine learning algorithms. Materials Today: Proceedings, 80, 3782–3785. [Google Scholar] [CrossRef]
  64. Rastrollo-Guerrero, J. L., Gómez-Pulido, J. A., & Durán-Domínguez, A. (2020). Analyzing and predicting students’ performance by means of machine learning: A review. Applied Sciences, 10(3), 1042. [Google Scholar] [CrossRef]
  65. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August 13–17). “Why should I trust you?” Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135–1144), San Francisco, CA, USA. [Google Scholar]
  66. Roy, K., Nguyen, H.-H., & Farid, D. M. (2023). Impact of dimensionality reduction techniques on student performance prediction using machine learning. CTU Journal of Innovation and Sustainable Development, 15, 93–101. [Google Scholar] [CrossRef]
  67. Shrestha, S., & Pokharel, M. (2019, November 5). Machine Learning algorithm in educational data. 2019 Artificial Intelligence for Transforming Business and Society (AITB) (Vol. 1, pp. 1–11), Kathmandu, Nepal. [Google Scholar] [CrossRef]
  68. Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90. [Google Scholar] [CrossRef]
  69. Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417–453. [Google Scholar] [CrossRef]
  70. Steinbach, M., & Tan, P.-N. (2009). kNN: K-nearest neighbors. In The top ten algorithms in data mining (pp. 151–162). Chapman and Hall/CRC. [Google Scholar] [CrossRef]
  71. Sweeney, M., Lester, J., & Rangwala, H. (2015, October 29–November 1). Next-term student grade prediction. 2015 IEEE International Conference on Big Data (Big Data) (pp. 970–975), Santa Clara, CA, USA. [Google Scholar] [CrossRef]
  72. Taras, H. (2005). Physical activity and student performance at school. Journal of School Health, 75(6), 214–218. [Google Scholar] [CrossRef]
  73. Thammasiri, D., Delen, D., Meesad, P., & Kasap, N. (2014). A critical assessment of imbalanced class distribution problem: The case of predicting freshmen student attrition. Expert Systems with Applications, 41(2), 321–330. [Google Scholar] [CrossRef]
  74. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86), 2579–2605. [Google Scholar]
  75. Vecchi, E., Bassetti, D., Graziato, F., Pospíšil, L., & Horenko, I. (2024a). Gauge-optimal approximate learning for small data classification. Neural Computation, 36(6), 1198–1227. [Google Scholar] [CrossRef]
  76. Vecchi, E., Berra, G., Albrecht, S., Gagliardini, P., & Horenko, I. (2023). Entropic approximate learning for financial decision-making in the small data regime. Research in International Business and Finance, 65, 101958. [Google Scholar] [CrossRef]
  77. Vecchi, E., Kardoš, J., Lechekhab, M., Wächter, A., Horenko, I., & Schenk, O. (2024b). Structure-exploiting interior-point solver for high-dimensional entropy-sparsified regression learning. Journal of Computational Science, 76, 102208. [Google Scholar] [CrossRef]
  78. Vecchi, E., Pospíšil, L., Albrecht, S., O’Kane, T. J., & Horenko, I. (2022). eSPA+: Scalable entropy-optimal machine learning classification for small data problems. Neural Computation, 34(5), 1220–1255. [Google Scholar] [CrossRef]
  79. Villegas, W., Guevara-Reyes, R., Ortiz-Garcés, I., Andrade, R., & Cox-Riquetti, F. (2025). Machine learning models for academic performance prediction: Interpretability and application in educational decision-making. Frontiers in Education, 10, 1632315. [Google Scholar] [CrossRef]
  80. Wen, X., & Juan, H. (2023). Early prediction of students’ performance using a deep neural network based on online learning activity sequence. Applied Sciences, 13(15), 8933. [Google Scholar] [CrossRef]
  81. Wongvorachan, T., He, S., & Bulut, O. (2023). A comparison of undersampling, oversampling, and SMOTE methods for dealing with imbalanced classification in educational data mining. Information, 14(1), 54. [Google Scholar] [CrossRef]
  82. Wu, J., Zhang, J., & Wang, C. (2023). Student performance, peer effects, and friend networks: Evidence from a randomized peer intervention. American Economic Journal: Economic Policy, 15(1), 510–542. [Google Scholar] [CrossRef]
  83. Wu, Y.-c., & Feng, J.-w. (2018). Development and application of artificial neural network. Wireless Personal Communications, 102, 1645–1656. [Google Scholar] [CrossRef]
  84. Ying, X. (2019). An overview of overfitting and its solutions. Journal of Physics: Conference Series, 1168(2), 022022. [Google Scholar] [CrossRef]
  85. Yılmaz, N., & Sekeroglu, B. (2020). Student performance classification using artificial Intelligence techniques. In R. A. Aliev, J. Kacprzyk, W. Pedrycz, M. Jamshidi, M. B. Babanli, & F. M. Sadikoglu (Eds.), 10th international conference on theory and application of soft computing, computing with words and perceptions—ICSCCW-2019 (pp. 596–603). Springer International Publishing. [Google Scholar] [CrossRef]
  86. Yousafzai, B. K., Hayat, M., & Afzal, S. (2020). Application of machine learning and data mining in predicting the performance of intermediate and secondary education level student. Education and Information Technologies, 25(6), 4677–4697. [Google Scholar] [CrossRef]
  87. Zhang, X., Xue, R., Liu, B., Lu, W., & Zhang, Y. (2018, July 28–30). Grade prediction of student academic performance with multiple classification models. 2018 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (pp. 1086–1090), Huangshan, China. [Google Scholar] [CrossRef]
  88. Zohair, A., & Mahmoud, L. (2019). Prediction of student’s performance by modelling small dataset size. International Journal of Educational Technology in Higher Education, 16(1), 27. [Google Scholar] [CrossRef]
Figure 1. Scheme of the benchmark pipeline.
Figure 2. t-SNE visualization of the UCI student performance datasets in a two-dimensional space: (a) students categorized by achieving a sufficient (1) or insufficient (0) final grade in Mathematics; (b) students categorized by performing above (1) or below (0) the average in Mathematics; (c) students categorized by achieving a sufficient (1) or insufficient (0) final grade in Portuguese; (d) students categorized by performing above (1) or below (0) the average in Portuguese.
Figure 3. Evaluation of ML prediction of students’ performance for the Mathematics course dataset. The box plots are obtained from 50 cross-validations (split: 50% training, 25% validation, 25% test). (a) AUC of the students’ performance prediction for the Mathematics course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right). (b) F-score of the students’ performance prediction for the Mathematics course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right).
Figure 4. Evaluation of ML prediction of students’ performance for the Portuguese course dataset. The box plots are obtained from 50 cross-validations (split: 50% training, 25% validation, 25% test). (a) AUC of the students’ performance prediction for the Portuguese course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right). (b) F-score of the students’ performance prediction for the Portuguese course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right).
Figure 5. SHAP values and feature importance weights of a selection of the ML classification models for the prediction of students’ performance in the Mathematics course dataset, with the two classes being separated as sufficient/insufficient grades.
Figure 6. SHAP values and feature importance weights of a selection of the ML classification models for the prediction of students’ performance in the Mathematics course dataset, with the two classes formed—respectively—by the students with a grade above or below the population average grade.
Figure 7. Visualization with t-SNE in a two-dimensional space of the UCI student performance dataset after removing the features related to the first- and second-period grades: (a) students categorized by achieving a sufficient (1) or insufficient (0) final grade in Mathematics; (b) students categorized by performing above (1) or below (0) the average in Mathematics; (c) students categorized by achieving a sufficient (1) or insufficient (0) final grade in Portuguese; (d) students categorized by performing above (1) or below (0) the average in Portuguese.
Figure 8. Evaluation of ML prediction of students’ performance for the Mathematics course dataset, after the removal of features 27 (first-period grade) and 28 (second-period grade). The box plots are obtained from 50 cross-validations (split: 50% training, 25% validation, 25% test). (a) AUC of the students’ performance prediction for the Mathematics course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right). (b) F-score of the students’ performance prediction for the Mathematics course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right).
Figure 9. Evaluation of ML prediction of students’ performance for the Portuguese course dataset, after the removal of features 27 (first-period grade) and 28 (second-period grade). The box plots are obtained from 50 cross-validations (split: 50% training, 25% validation, 25% test). (a) AUC of the students’ performance prediction for the Portuguese course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right). (b) F-score of the students’ performance prediction for the Portuguese course dataset, with the two classes being separated as sufficient/insufficient grades (left) or above/below the average (right).
Figure 10. SHAP values and feature importance weights of a selection of the ML classification models for the prediction of students’ performance in the Mathematics course dataset, after the removal of features 27 (first-period grade) and 28 (second-period grade). The two classes are formed—respectively—by the students with a grade above or below the population average grade.
Figure 11. SHAP values and feature importance weights of a selection of the ML classification models for the prediction of students’ performance in the Portuguese course dataset, after the removal of features 27 (first-period grade) and 28 (second-period grade). The two classes are separated according to whether the students achieved a sufficient or insufficient grade in the subject.
Table 1. Class balance in the Mathematics and Portuguese course datasets. Percentages indicate the proportion of students in each class for the respective target threshold.
| Course | Grade Threshold | Class 0 | Class 1 |
|---|---|---|---|
| Mathematics | Sufficient | 129 (32.74%) | 265 (67.26%) |
| Mathematics | Average | 185 (46.95%) | 209 (53.05%) |
| Portuguese | Sufficient | 100 (15.41%) | 549 (84.59%) |
| Portuguese | Average | 301 (46.38%) | 348 (53.62%) |
Table 2. Mean training time (s) and 95% confidence intervals across 50 cross-validations for the Mathematics and Portuguese datasets in the four classification scenarios. For each problem instance, the most computationally efficient model is shown in bold.
| Method | Mathematics: Average | Mathematics: Sufficient | Portuguese: Average | Portuguese: Sufficient |
|---|---|---|---|---|
| SVM | 0.0040 ± 0.0002 | 0.0045 ± 0.0003 | 0.0053 ± 0.0004 | 0.0067 ± 0.0007 |
| RF | 0.4385 ± 0.1050 | 0.5293 ± 0.1144 | 0.5255 ± 0.1065 | 0.4700 ± 0.0974 |
| kNN | 0.0028 ± 0.0002 | 0.0059 ± 0.0025 | 0.0043 ± 0.0004 | 0.0039 ± 0.0003 |
| Lasso GLM | 2.2828 ± 0.6870 | 4.2894 ± 1.0392 | 5.1766 ± 1.6717 | 5.5041 ± 1.9936 |
| NN | 0.0753 ± 0.0205 | 0.0455 ± 0.3539 | 0.1580 ± 0.0440 | 0.2524 ± 0.0859 |
| SNN | 0.3073 ± 0.0329 | 0.0108 ± 0.0397 | 0.4348 ± 0.0303 | 0.3863 ± 0.0325 |
| DL (LSTM) | 0.8366 ± 0.1675 | 0.9001 ± 0.1334 | 0.8847 ± 0.1122 | 1.4121 ± 0.2109 |
| eSPA | **0.0002 ± 0.0001** | **0.0002 ± 0.0001** | **0.0003 ± 0.0001** | **0.0003 ± 0.0001** |
Table 3. Top 5 influential features identified by eSPA, RF, SVM, and NN for the Portuguese course dataset, split by scenario. (*) indicates that all these feature weights are very similar.
Sufficient Grade

| Rank | eSPA (*) | RF | NN | SVM |
|---|---|---|---|---|
| 1 | grade_2 | grade_2 | grade_2 | grade_2 |
| 2 | grade_1 | grade_1 | grade_1 | grade_1 |
| 3 | traveltime | failures | Walc | failures |
| 4 | failures | nursery | failures | school |
| 5 | schoolsup | school | school | higher |

Average Grade

| Rank | eSPA | RF | NN | SVM |
|---|---|---|---|---|
| 1 | grade_2 | grade_2 | grade_2 | grade_2 |
| 2 | - | grade_1 | grade_1 | grade_1 |
| 3 | - | higher | health | failures |
| 4 | - | failures | failures | paid |
| 5 | - | goout | Pstatus | health |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
