1. Introduction
The accelerated expansion of artificial intelligence (AI) within contemporary educational systems is profoundly transforming processes of academic assessment, student classification, institutional resource allocation, and pedagogical decision-making. Tools grounded in learning analytics, predictive models of academic performance, and automated systems designed to support academic guidance are progressively being incorporated as socio-technical infrastructures that mediate multiple dimensions of educational functioning. Within this context, AI is no longer merely a technical instrument but is increasingly becoming a structural component of the mechanisms through which educational governance is exercised.
However, the growing automation of decision-making processes in education raises important normative questions. A substantial body of research has demonstrated that machine learning systems may reproduce, or even amplify, inequalities embedded within the data on which they are trained (Barocas et al., 2019; Suresh & Guttag, 2021). When such systems are deployed in socially sensitive domains, such as the identification of students with high academic potential, the allocation of educational support, or the prediction of school dropout risk, their consequences extend beyond the technical sphere and intersect with broader concerns related to distributive justice and equality of opportunity.
Within this broader debate, gender equity represents a particularly significant dimension. Empirical studies conducted across different domains of artificial intelligence have documented systematic disparities in automated systems for recognition, classification, and recommendation that affect women and other historically underrepresented groups in differentiated ways (Buolamwini & Gebru, 2018; West et al., 2019). Importantly, such inequalities do not necessarily arise from explicit discriminatory intentions. Rather, they frequently emerge from the statistical internalisation of pre-existing social patterns embedded within training data. As a consequence, algorithmic models may inadvertently reinforce structural stereotypes while presenting an appearance of technical neutrality.
In the educational domain, the incorporation of predictive modelling techniques has been strongly influenced by developments in learning analytics and the field of Educational Data Mining. Numerous studies have demonstrated that academic outcomes can be predicted with considerable accuracy through the analysis of socioeconomic, demographic, and educational trajectory variables (Romero & Ventura, 2020; Siemens & Baker, 2012). Nevertheless, predictive accuracy alone does not guarantee normative legitimacy. If algorithmic systems systematically identify particular groups as possessing greater academic potential (or, conversely, as being at greater risk), these predictions may indirectly influence institutional decisions that shape students’ future educational pathways.
From an ethical and political perspective, these concerns are closely linked to fundamental principles such as non-discrimination, transparency, and institutional accountability. International organisations have emphasised that the development and use of artificial intelligence systems in education must be aligned with standards of social justice and respect for human rights (OECD, 2019; UNESCO, 2021). Similarly, emerging regulatory frameworks, such as the European Artificial Intelligence Act, recognise that certain algorithmic systems used in educational contexts may pose significant risks when they influence assessment processes or access to learning opportunities (European Parliament & Council of the European Union, 2024).
Despite the growing scholarly interest in these issues, the academic literature remains characterised by a persistent fragmentation. On the one hand, technical research on algorithmic fairness has largely focused on the development of formal fairness metrics and statistical methods for bias mitigation. On the other hand, socio-political analyses have emphasised the ethical and structural implications of automation in education, often without incorporating detailed empirical validation of specific predictive models. This analytical separation constrains a comprehensive understanding of how technical mechanisms of prediction interact with pre-existing social inequalities and how such interactions may affect the institutional legitimacy of educational systems.
Against this background, the present study examines the potential of machine learning techniques to identify patterns associated with academic resilience in contexts of socioeconomic disadvantage, using data derived from the international assessment PISA 2022. Academic resilience is not merely an empirical classification derived from large-scale assessments; it is also a theoretically grounded construct within educational research. It refers to the capacity of students to achieve high levels of academic performance despite exposure to socio-economic disadvantage, reflecting complex interactions between individual agency and structural conditions (Agasisti & Longobardi, 2017; OECD, 2019). From this perspective, resilience should be understood as a relational and context-dependent phenomenon, rather than as a fixed individual trait.
The decision to focus on academic resilience—rather than on achievement alone—is analytically motivated. While achievement-based classifications (e.g., top versus bottom quartiles) allow for the identification of performance differences, they do not explicitly capture the intersection between disadvantage and success. By contrast, the resilience framework enables the examination of students who succeed against structural constraints, thereby providing a more suitable lens for analysing equity and social mobility within educational systems.
Drawing upon a broad set of educational and contextual variables, predictive models are developed to estimate the probability that students from socioeconomically disadvantaged backgrounds achieve high levels of academic performance. In addition, the study evaluates the presence of potential predictive disparities associated with gender, with the aim of examining the algorithmic fairness of the models employed. Finally, the research discusses the implications of these findings for the development of governance frameworks capable of guiding the responsible use of artificial intelligence within educational systems, particularly with regard to transparency, accountability, and the promotion of equity.
1.1. Research Questions and Hypotheses
Building upon the conceptual framework outlined above and the increasing incorporation of machine learning techniques in the analysis of large-scale educational data, this study seeks to examine two fundamental dimensions simultaneously: the predictive performance of different artificial intelligence models and their potential implications in terms of algorithmic fairness. More specifically, the research aims to determine whether models used to identify academically resilient students display differential predictive patterns when analysing student groups defined by gender.
In line with this general objective, the study is structured around the following research questions:
RQ1. To what extent can machine learning models identify academically resilient students within populations characterised by unfavourable socioeconomic conditions?
RQ2. Which educational, familial, and contextual factors exert the greatest influence on the prediction of academic resilience according to the models developed?
RQ3. Do systematic differences exist in the rates of positive prediction between students of different genders that could be interpreted as potential indications of algorithmic bias?
RQ4. What implications do the results obtained have for the design of governance frameworks aimed at regulating the use of artificial intelligence tools in the educational domain?
Taken together, these questions enable an integrated examination of both the technical dimension of predictive modelling and its potential social and ethical implications within educational contexts.
In order to guide the empirical analysis, the following research hypotheses are proposed:
Hypothesis H1. Machine learning models based on non-linear ensemble methods will exhibit superior predictive performance compared with traditional linear models in the identification of academically resilient students.
Hypothesis H2. Predictions generated by the models may present differences in the rates of positive classification between student groups defined by gender.
Hypothesis H3. Factors associated with students’ socioeconomic context and educational trajectories will constitute more influential predictors of academic resilience than individual demographic characteristics.
Collectively, these hypotheses enable an empirical examination of the interaction between predictive capacity, model interpretability, and potential implications for algorithmic fairness in the analysis of large-scale educational data.
1.2. Contributions of the Study
This study contributes to the emerging field of artificial intelligence in education by integrating three analytical dimensions that are often examined separately: predictive modelling of academic performance, the evaluation of algorithmic fairness, and reflection on the institutional implications of the use of artificial intelligence systems within educational contexts.
First, the study develops a predictive modelling approach aimed at identifying academically resilient students within populations facing socioeconomic disadvantage. To this end, it employs data derived from the international assessment PISA 2022, enabling the analysis of educational patterns at a large scale and facilitating comparisons across multiple educational systems. This approach contributes to expanding the existing literature on academic resilience by incorporating contemporary machine learning techniques into the analysis of international educational datasets.
Second, the research incorporates a systematic assessment of potential disparities in the predictions generated by the models according to students’ gender. This analysis makes it possible to examine the extent to which predictive systems may reproduce, amplify, or mitigate inequalities embedded within educational data. By introducing tools from algorithmic fairness analysis into the study of academic resilience, the research contributes to strengthening the dialogue between educational research and contemporary debates on ethics and responsibility in artificial intelligence.
Third, the study employs model interpretation techniques that enable the exploration of the relative contribution of different variables to the predictions generated by the algorithms. This approach seeks to enhance the analytical transparency of predictive modelling by facilitating a clearer understanding of the educational and contextual factors associated with resilient academic trajectories. While student achievement has traditionally been used as the primary outcome in educational research, the concept of academic resilience offers a more nuanced analytical perspective by explicitly focusing on students who succeed despite socioeconomic disadvantage. Academic resilience is therefore not simply a measure of high performance, but a relational construct that captures the interaction between structural constraints and individual educational outcomes. This perspective has been widely adopted in comparative education research using large-scale assessments, particularly within the OECD PISA framework (Agasisti & Longobardi, 2017; Agasisti et al., 2018). Accordingly, the present study retains academic resilience as its central outcome variable in order to examine how predictive models capture patterns of educational success under conditions of disadvantage, rather than modelling achievement alone.
Finally, the study discusses the implications of these findings for the development of governance frameworks aimed at regulating the use of artificial intelligence tools within educational systems. In particular, it argues that the incorporation of mechanisms for evaluating fairness, transparency, and accountability constitutes a fundamental element in ensuring that predictive analytics technologies effectively contribute to the promotion of educational equity.
2. Literature Review
2.1. Bias and Fairness in Artificial Intelligence Systems
Contemporary debates on fairness in artificial intelligence have progressively evolved from approaches focused primarily on predictive accuracy towards normative frameworks that recognise the ethical and social dimensions of automated systems. In some of the most influential formulations, bias in machine learning models is defined as the systematic production of unfavourable outcomes for particular social groups, even in the absence of explicit discriminatory intent (Barocas et al., 2019). This conceptualisation shifts the focus of the discussion from intention to impact, acknowledging that discriminatory effects may emerge from statistical dynamics that appear formally neutral.
Suresh and Guttag (2021) propose an analytical framework that identifies multiple sources of harm throughout the life cycle of machine learning systems. These sources include biases arising during data collection, variable measurement, model construction, and system deployment. This perspective is particularly relevant in educational contexts, where datasets often reflect pre-existing structural inequalities and where automated decisions may directly influence students’ educational trajectories. Related research has also demonstrated that predictive systems may satisfy certain formal fairness criteria while simultaneously producing disparate impacts across groups, thereby illustrating the inherent tension between predictive accuracy and distributive fairness (Chouldechova, 2017).
One of the central debates in the technical literature concerns the multiplicity of formal definitions of fairness. Kleinberg et al. (2017) demonstrated that certain mathematical criteria of fairness, such as calibration and equality of error rates across groups, cannot be satisfied simultaneously when there are baseline differences in the prevalence of the outcome across those groups. This result reveals that algorithmic fairness inevitably entails normative choices regarding which conception of justice should be prioritised. Consequently, the discussion of bias cannot be reduced to a purely mathematical optimisation problem; rather, it requires explicit ethical grounding and careful consideration of broader social values.
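The incompatibility can be made concrete with a small numerical sketch. It relies on a closely related identity reported by Chouldechova (2017), not on the Kleinberg et al. theorem itself: for a binary classifier, the false positive rate (FPR) is fixed by the prevalence p, the positive predictive value (PPV), and the false negative rate (FNR) through FPR = p/(1 − p) · (1 − PPV)/PPV · (1 − FNR). The function name, the two groups, and all numerical values below are hypothetical and chosen only for illustration.

```python
def implied_fpr(prevalence: float, ppv: float, fnr: float) -> float:
    """False positive rate implied by prevalence, PPV and FNR.

    Follows the identity FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).
    """
    return (prevalence / (1 - prevalence)) * ((1 - ppv) / ppv) * (1 - fnr)


# Hypothetical groups: identical PPV and FNR, different outcome prevalence.
ppv, fnr = 0.7, 0.3
fpr_group_a = implied_fpr(prevalence=0.10, ppv=ppv, fnr=fnr)
fpr_group_b = implied_fpr(prevalence=0.20, ppv=ppv, fnr=fnr)

print(f"Implied FPR, group A: {fpr_group_a:.4f}")
print(f"Implied FPR, group B: {fpr_group_b:.4f}")
```

Under these assumed values, holding PPV and FNR equal across groups forces the false positive rates apart whenever prevalence differs, which is the kind of trade-off the text describes.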
2.2. Gender and the Reproduction of Inequalities in Artificial Intelligence
The problem of bias acquires a specific and particularly significant dimension when examined from a gender perspective. A growing body of research has documented substantial disparities in automated systems that affect women and racialised groups in differentiated ways. Buolamwini and Gebru (2018), for instance, demonstrated that commercial facial classification systems exhibited significantly higher error rates for women with darker skin tones when compared with lighter-skinned men. This finding highlighted how non-representative training datasets may translate into technical inequalities with tangible social consequences.
In the cultural and symbolic domain, Noble (2018) showed how search engines can reproduce gender and racial stereotypes, reinforcing degrading or biased representations through algorithmic indexing mechanisms. These dynamics illustrate that artificial intelligence systems do not operate in a social vacuum; rather, they internalise historical patterns of inequality embedded in the data environments from which they learn. Complementarily, West et al. (2019) emphasised that the underrepresentation of women within technology development teams may limit the early identification of discriminatory risks, thereby reinforcing institutional biases within the design of algorithmic systems.
From a structural perspective, gender interacts with socioeconomic and cultural variables that shape educational and labour opportunities. When predictive models are trained using data that reflect such inequalities, they may replicate statistical associations between gender and academic performance that derive from historical conditions rather than from intrinsic capacities. This phenomenon generates a tension between statistical prediction and distributive justice, particularly in contexts where automated decisions may influence access to educational resources, academic programmes, or differentiated support mechanisms.
2.3. Artificial Intelligence in Education and Predictive Analytics
The fields of learning analytics and Educational Data Mining have experienced sustained growth over the past decade. Siemens and Baker (2012) identified the convergence between educational data mining and learning analytics as an effort aimed at understanding patterns of student behaviour and improving educational processes. The development of these research areas has significantly expanded the capacity to analyse large-scale educational datasets and identify behavioural and performance patterns among students (Baker & Inventado, 2014; Siemens & Baker, 2012). Subsequently, Romero and Ventura (2020) documented the expansion of predictive models capable of anticipating academic performance, student dropout, and participation in digital learning environments.
These advances have enabled the development of early warning systems and personalised intervention strategies designed to support student success. Nevertheless, the recent literature has cautioned that an exclusive emphasis on predictive accuracy may obscure differentiated distributive effects. Holmes et al. (2023) argue that artificial intelligence in education should be evaluated not solely in terms of technical effectiveness but also in relation to its alignment with principles of equity and justice. In the absence of systematic fairness assessments, predictive systems may inadvertently reinforce existing inequalities while presenting an appearance of quantitative neutrality.
Moreover, the increasing integration of intelligent tools within educational environments extends the reach of automation to decisions that were traditionally shaped by pedagogical deliberation. This transformation alters the very nature of educational governance by partially shifting decision-making processes towards data-driven systems. The evaluation of such systems therefore requires an analytical approach that combines technical assessment with normative reflection on their broader social implications.
2.4. Academic Resilience and OECD Operationalisation
Within the field of comparative education, academic resilience has emerged as a central construct for understanding how students from disadvantaged socio-economic backgrounds achieve high levels of academic performance despite structural constraints. The concept is particularly relevant in large-scale international assessments, where it enables the identification of individuals who succeed against the odds and thus provides insight into the mechanisms that may mitigate educational inequality (OECD, 2023c; Reardon, 2011). Early sociological contributions highlighted that educational success cannot be fully explained by structural determinants alone, as individual aspirations, institutional contexts, and cultural capital also play a significant role in shaping outcomes (Kao & Tienda, 1995; Salikutluk, 2016).
In contemporary empirical research, the operationalisation of academic resilience has been strongly influenced by the framework developed by the Organisation for Economic Co-operation and Development (OECD) within the Programme for International Student Assessment (PISA). According to this approach, resilient students are defined as those who belong to the lowest quartile of the index of economic, social, and cultural status (ESCS) within their country, yet achieve performance levels situated in the highest quartile of academic outcomes (OECD, 2023b, 2023d). This definition offers a standardised and internationally comparable criterion, facilitating cross-country analyses of equity and educational effectiveness (Agasisti & Longobardi, 2017; Agasisti et al., 2018).
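The quartile-based rule can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the study: it assumes that both cut-offs are computed within each country, it ignores PISA survey weights and plausible values, and the function name, field names, and toy records are all hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def label_resilient(students):
    """Return one boolean per student, True when the student is flagged resilient.

    Assumed rule: ESCS at or below the country's first quartile AND
    score at or above the country's third quartile.
    """
    by_country = defaultdict(list)
    for s in students:
        by_country[s["country"]].append(s)

    # Country-specific cut-offs: (ESCS first quartile, score third quartile).
    cutoffs = {}
    for country, group in by_country.items():
        escs_q1 = quantiles([s["escs"] for s in group], n=4)[0]
        score_q3 = quantiles([s["score"] for s in group], n=4)[2]
        cutoffs[country] = (escs_q1, score_q3)

    return [
        s["escs"] <= cutoffs[s["country"]][0]
        and s["score"] >= cutoffs[s["country"]][1]
        for s in students
    ]


# Hypothetical toy data: two countries with the same ESCS values
# but different score distributions.
escs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
scores = {
    "A": [560, 400, 420, 450, 480, 500, 530, 550],
    "B": [480, 330, 350, 370, 390, 410, 440, 470],
}
students = [
    {"country": c, "escs": e, "score": s}
    for c, vals in scores.items()
    for e, s in zip(escs, vals)
]
flags = label_resilient(students)
print(sum(flags), "students flagged as resilient out of", len(flags))
```

In this toy data, a score of 480 earns the resilient label for the most disadvantaged student in country B but would fall below the top-quartile cut-off in country A, which illustrates how the label depends on relative position within a national distribution.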
While the OECD operational definition has been widely adopted due to its clarity and comparability, it also entails important conceptual and methodological implications. First, it reduces a multidimensional and dynamic phenomenon to a threshold-based classification, thereby privileging extreme cases of success over more nuanced trajectories of improvement. Academic resilience, in a broader sense, may encompass gradual progress, adaptive coping strategies, and context-dependent forms of achievement that are not fully captured by dichotomous indicators. Consequently, the OECD framework should be understood as an analytical simplification rather than a comprehensive representation of the construct.
A second implication concerns the composition of the reference category. The classification of “non-resilient” students encompasses a heterogeneous population that includes both disadvantaged low achievers and socio-economically advantaged students whose performance does not meet the resilience threshold. This heterogeneity complicates the interpretation of empirical results, as the comparison group does not represent a uniform baseline. Instead, it aggregates individuals with fundamentally different structural conditions and educational trajectories. As a result, the binary distinction between resilient and non-resilient students should not be interpreted as a strict dichotomy, but rather as a pragmatic device for modelling extreme outcome patterns within large-scale datasets.
These considerations are particularly relevant in studies that employ machine learning techniques to predict academic outcomes. Predictive models trained on large educational datasets may internalise the statistical structure underlying the OECD classification, including the distributional asymmetries between socio-economic status and performance. As highlighted in the literature on algorithmic bias, such models can reproduce existing inequalities if the target variable reflects historically conditioned patterns rather than neutral measures of ability (Baker & Hawn, 2022; Mehrabi et al., 2021). In this context, the operationalisation of resilience becomes not only a measurement issue but also a normative one, as it shapes the patterns that algorithms are designed to learn and reproduce.
Moreover, recent studies have begun to integrate machine learning approaches with the analysis of academic resilience, revealing both opportunities and challenges. For instance, large-scale modelling efforts using PISA data have demonstrated the potential to identify complex interactions between socio-economic variables, school characteristics, and student performance across diverse national contexts (Cheung et al., 2024). However, these approaches also raise concerns regarding interpretability, fairness, and the risk of reinforcing structural disparities if predictive systems are deployed without adequate safeguards (Holmes et al., 2023; Pham et al., 2025).
The operationalisation of academic resilience in cross-national datasets such as PISA also involves conceptual and methodological limitations. As a socially and culturally embedded construct, academic resilience may be interpreted differently across educational systems. In addition, both socioeconomic status and achievement thresholds are defined in relative terms within each country, which implies that students classified as “high-performing” may not necessarily demonstrate equivalent levels of knowledge and skills across contexts. Consequently, cross-country comparisons should be interpreted with caution, as they reflect relative positions within national distributions rather than absolute measures of academic performance.
From a methodological standpoint, the use of a binary resilience indicator aligns with common practices in supervised learning, where classification tasks require clearly defined outcome variables. Nevertheless, this alignment may obscure the underlying complexity of educational processes. Techniques designed to address class imbalance, such as synthetic over-sampling methods (Chawla et al., 2002), may improve predictive performance but do not resolve the conceptual limitations associated with the definition of the target variable. Similarly, advances in interpretable machine learning (Lundberg & Lee, 2017; Molnar, 2022) provide tools to examine model behaviour, yet they depend fundamentally on the validity of the constructs being modelled.
In light of these considerations, the OECD operationalisation of academic resilience should be interpreted as a useful but partial framework that enables large-scale comparative analysis while simultaneously introducing simplifications that must be explicitly acknowledged. Its integration into studies of artificial intelligence in education requires careful reflection on both measurement validity and ethical implications. In particular, researchers must remain attentive to the ways in which definitional choices influence not only empirical findings but also the broader narratives surrounding educational equity and student potential.
By situating academic resilience at the intersection of socio-economic inequality, educational measurement, and algorithmic modelling, this study adopts a critical perspective that recognises both the analytical value and the limitations of the OECD framework. This approach allows for a more nuanced interpretation of predictive results and contributes to the ongoing dialogue on fairness, transparency, and accountability in data-driven educational systems.
2.5. Democratic Governance and the Regulation of Artificial Intelligence
The political dimension of bias in artificial intelligence is closely linked to the concept of democratic governance. When automated systems influence access to rights, opportunities, or public resources, the legitimacy of their decisions depends on the existence of mechanisms for transparency, accountability, and institutional oversight. Pasquale (2015) warned that the opacity of complex computational systems may generate power asymmetries and restrict the capacity for public scrutiny.
In response to these challenges, several international frameworks have established guiding principles for the responsible development and use of artificial intelligence. The Organisation for Economic Co-operation and Development (OECD, 2019), for example, proposed guidelines centred on fairness, transparency, and accountability. In a similar vein, UNESCO (2021) emphasised that artificial intelligence must respect human rights and promote social justice, highlighting gender equality as a transversal principle in the governance of digital technologies.
More recently, Regulation (EU) 2024/1689 of the European Parliament and of the Council establishes specific obligations for high-risk artificial intelligence systems, including those deployed in educational contexts when they may influence evaluation processes or access to educational opportunities (European Parliament & Council of the European Union, 2024). This regulatory framework explicitly recognises that artificial intelligence can produce structural impacts and therefore requires systematic risk assessments and oversight mechanisms.
The convergence between technical literature, gender analysis, and emerging regulatory frameworks demonstrates that predictive bias in education cannot be understood as an isolated technical problem. Rather, it lies at the intersection of statistical modelling, structural inequalities, and democratic legitimacy. Despite this recognition, a significant gap remains between empirical studies focused on predictive performance and normative analyses concerned with governance and institutional accountability.
The present study seeks to contribute to closing this gap by adopting a research design that integrates the quantitative evaluation of predictive models with the formal analysis of gender fairness metrics and a broader reflection on institutional implications. The following section develops the theoretical framework that articulates these analytical levels within a multilevel conceptual model.
3. Theoretical Framework
3.1. Artificial Intelligence as a Sociotechnical Infrastructure
Artificial intelligence applied to educational contexts cannot be understood solely as a set of computational techniques designed to optimise predictive performance. From a sociotechnical perspective, AI systems constitute complex assemblages that integrate historical data, design choices, institutional objectives, and normative frameworks (Kitchin, 2017; Selbst et al., 2019). Consequently, their outputs should not be interpreted as purely neutral mathematical products; rather, they represent situated configurations that reflect broader social structures and organisational priorities.
This perspective is particularly relevant in educational environments, where the datasets used to train predictive models are frequently shaped by historical inequalities related to gender, socioeconomic status, and cultural context. As a result, any analysis of algorithmic bias must be situated within a framework that simultaneously considers technical and structural dimensions, thereby avoiding overly reductive statistical interpretations.
Understanding artificial intelligence as a sociotechnical infrastructure implies recognising that predictive systems participate in the governance of educational processes. Through the classification of students, the identification of performance patterns, and the estimation of risk indicators, such systems influence institutional decision-making and contribute to shaping the allocation of educational opportunities. In this sense, algorithmic outputs should be interpreted not merely as analytical results but as elements embedded within broader institutional ecosystems.
3.2. Formalisation of the Predictive Problem
Let $\mathcal{X}$ denote a feature space representing academic and socio-demographic variables associated with students. Let $A$ represent the sensitive attribute corresponding to gender, and let $Y \in \mathcal{Y}$ denote the target variable indicating high academic performance.
A predictive model can be defined as a function:
$$f_\theta : \mathcal{X} \rightarrow \mathcal{Y}$$
parameterised by $\theta$, which produces a prediction $\hat{Y} = f_\theta(X)$ for a feature vector $X \in \mathcal{X}$.
The training process consists of minimising an empirical loss function over the $n$ training observations $(x_i, y_i)$:
$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big)$$
where $\ell$ denotes the loss function (for instance, cross-entropy loss in binary classification).
From a strictly predictive perspective, the objective is to maximise performance metrics such as accuracy or the area under the ROC curve (AUC–ROC). However, this formulation does not explicitly account for potential dependencies between the prediction $\hat{Y}$ and the sensitive attribute $A$. When the joint distribution $P(X, A, Y)$ reflects structural inequalities, the model may internalise correlations between gender and academic performance that do not necessarily correspond to normatively justifiable differences.
Thus, evaluating predictive performance alone is insufficient for assessing the broader implications of algorithmic systems in socially sensitive domains such as education. A comprehensive evaluation must also consider whether the predictive mechanism generates disparate outcomes across groups defined by sensitive attributes.
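The empirical loss described in this section can be sketched as follows, taking binary cross-entropy as the loss and assuming that the model outputs a probability strictly between 0 and 1. The function names, predicted probabilities, and labels are hypothetical and serve only to illustrate the computation.

```python
import math


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Cross-entropy for one observation with predicted probability p and label y."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def empirical_loss(probs, labels) -> float:
    """Mean per-observation loss over the training sample."""
    return sum(binary_cross_entropy(p, y) for p, y in zip(probs, labels)) / len(labels)


def accuracy(probs, labels, threshold: float = 0.5) -> float:
    """Share of observations classified correctly at the given threshold."""
    hits = sum(int(p >= threshold) == y for p, y in zip(probs, labels))
    return hits / len(labels)


# Hypothetical predicted probabilities and true labels.
probs = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 1]

loss_value = empirical_loss(probs, labels)
acc_value = accuracy(probs, labels)
print(f"empirical loss = {loss_value:.4f}, accuracy = {acc_value:.2f}")
```

Neither function receives the sensitive attribute as an input, so a model can score well on both quantities while its predictions still differ systematically across gender groups.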
3.3. A Multilevel Model of Algorithmic Bias
Drawing upon the literature on sociotechnical systems and algorithmic fairness (Barocas et al., 2019; Selbst et al., 2019), this study proposes an analytical model composed of four interrelated levels.
3.3.1. Technical Level
The technical level refers to design decisions embedded within the predictive model itself, including variable selection, model architecture, loss functions, and classification thresholds. Technical bias may emerge under several conditions, such as:
The presence of proxy variables correlated with gender.
The prioritisation of overall predictive accuracy over fairness across subgroups.
The application of uniform classification thresholds that generate differentiated error rates between groups.
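Formally, technical bias may be present if, for two gender groups $a$ and $b$, the prediction $\hat{Y}$ satisfies:
$$P(\hat{Y} = 1 \mid A = a) \neq P(\hat{Y} = 1 \mid A = b)$$
or if conditional error rates differ between groups defined by the sensitive attribute.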
This level corresponds to the traditional focus of algorithmic fairness research, where bias is analysed through statistical disparities in predictions or classification errors across demographic groups.
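The effect of a uniform classification threshold on group-specific error rates can be sketched as follows. The scores, labels, and group codes are hypothetical, and the helper `error_rates` is an illustrative name rather than part of the study's code.

```python
def error_rates(y_true, scores, groups, threshold=0.5):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    rates = {}
    for g in sorted(set(groups)):
        fp = fn = pos = neg = 0
        for y, s, a in zip(y_true, scores, groups):
            if a != g:
                continue
            pred = 1 if s >= threshold else 0
            if y == 1:
                pos += 1
                fn += pred == 0   # missed positive case
            else:
                neg += 1
                fp += pred == 1   # false alarm
        rates[g] = (fp / neg, fn / pos)
    return rates

# Hypothetical scores: group "b" positives receive systematically lower scores
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.4, 0.3, 0.1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = error_rates(y_true, scores, groups)
```

With the same threshold of 0.5, group "a" has no errors while half of the positive cases in group "b" are missed, which is the pattern the third condition above describes.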
3.3.2. Structural Level
Structural bias originates from the underlying distribution of the data itself. If, for two gender groups $a$ and $b$:
$$P(Y = 1 \mid A = a) \neq P(Y = 1 \mid A = b)$$
the model will inevitably learn statistical patterns that reflect historical inequalities embedded in the dataset. In such cases, disparities in predictions do not necessarily arise from flaws in model design but from the computational reproduction of pre-existing social structures.
From this perspective, artificial intelligence may function as a statistical amplifier of structural inequalities. The critical analytical distinction lies between statistical correlation and normative legitimacy: not every empirical difference between groups can be considered ethically or socially justified within a framework of distributive justice.
3.3.3. Institutional Level
The institutional level refers to the organisational conditions under which the predictive system is deployed and used. This includes aspects such as:
Institutional policies governing the use of predictive models.
The interpretation of model outputs by decision-makers.
Mechanisms for auditing and evaluating algorithmic performance.
The presence of human oversight and review processes.
Even when a model exhibits acceptable fairness metrics at the technical level, its implementation may generate discriminatory outcomes if predictions are interpreted deterministically or applied without contextual evaluation. Institutional bias may therefore emerge when organisations lack mechanisms of accountability or when automated outputs substitute for pedagogical deliberation rather than supporting it.
This level highlights that fairness in artificial intelligence cannot be guaranteed solely through technical adjustments; institutional practices play a critical role in shaping the real-world consequences of predictive systems.
3.3.4. Democratic Level
The democratic level examines the aggregate effects of predictive systems on equality of opportunity and institutional legitimacy. When predictive models influence access to educational programmes, scholarships, or targeted interventions, they may alter the distribution of educational opportunities within a system.
If predictive disparities systematically affect a particular gender group, a tension arises with fundamental principles of political equality and social justice. Democratic legitimacy requires that decisions with significant social impact be subject to transparency, public scrutiny, and mechanisms of review (Pasquale, 2015; UNESCO, 2021).
Within this framework, the evaluation of artificial intelligence systems in education must extend beyond technical performance and include broader considerations related to democratic accountability and the governance of automated decision-making.
3.4. Formal Fairness Metrics
In order to empirically evaluate bias at the technical level, the study adopts formal fairness metrics widely recognised in the literature.
3.4.1. Demographic Parity
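Demographic parity measures the difference in the rate of positive predictions between groups $a$ and $b$ of the sensitive attribute $A$:
$$\Delta_{DP} = P(\hat{Y} = 1 \mid A = a) - P(\hat{Y} = 1 \mid A = b)$$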
A value close to zero indicates the absence of disparity in the rate of positive predictions between gender groups.
3.4.2. Equality of Opportunity
Equality of opportunity evaluates whether the model identifies high-performing students with equal probability across groups:
$$\Delta_{EO} = TPR_{a} - TPR_{b}$$
where:
$$TPR_{g} = P(\hat{Y} = 1 \mid Y = 1, A = g)$$
This metric focuses specifically on differences in true positive rates across sensitive groups.
3.4.3. Disparate Impact Ratio
The disparate impact ratio is defined as:
$$DI = \frac{P(\hat{Y} = 1 \mid A = a)}{P(\hat{Y} = 1 \mid A = b)}$$
Values outside the approximate interval [0.8, 1.25] may indicate a potentially relevant disparity according to criteria commonly used in discrimination analysis.
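The three metrics can be computed directly from binary predictions, as in the following sketch. The data are hypothetical, the group labels "a" and "b" are placeholders for the two gender categories, and the choice of which group appears in the numerator of the ratio is an assumption made here for illustration.

```python
def positive_rate(y_pred, groups, g):
    """Share of positive predictions within group g."""
    preds = [p for p, a in zip(y_pred, groups) if a == g]
    return sum(preds) / len(preds)

def true_positive_rate(y_true, y_pred, groups, g):
    """Share of truly positive cases in group g that are predicted positive."""
    preds = [p for y, p, a in zip(y_true, y_pred, groups) if a == g and y == 1]
    return sum(preds) / len(preds)

# Hypothetical labels, predictions and group membership for ten students
y_true = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
groups = ["a"] * 5 + ["b"] * 5

rate_a = positive_rate(y_pred, groups, "a")
rate_b = positive_rate(y_pred, groups, "b")

dp_difference = rate_a - rate_b                      # demographic parity
eo_difference = (true_positive_rate(y_true, y_pred, groups, "a")
                 - true_positive_rate(y_true, y_pred, groups, "b"))
di_ratio = rate_a / rate_b                           # disparate impact ratio
```

In this toy example the positive prediction rates are 0.6 and 0.2, so the ratio falls well outside the [0.8, 1.25] interval.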
Several methodological approaches have been proposed to operationalise fairness constraints in machine learning systems. Among them are algorithmic reduction techniques that transform fairness requirements into optimisation problems within supervised learning frameworks, allowing fairness constraints to be incorporated directly into the model training process (Agarwal et al., 2018).
This theoretical framework directly informs the empirical strategy adopted in this study, particularly through the implementation of formal fairness metrics and the interpretation of predictive disparities as potential reflections of both technical and structural sources of bias.
4. Data and Methodology
4.1. Data Source
This study utilises data derived from the 2022 cycle of the Programme for International Student Assessment (PISA). This international assessment, coordinated by the Organisation for Economic Co-operation and Development (OECD), aims to evaluate the competencies of students aged approximately fifteen across several domains of knowledge while also examining contextual factors associated with learning processes.
The dataset employed in this research contains information from more than six hundred thousand students across multiple educational systems worldwide. In addition to performance outcomes in the assessed domains—primarily mathematics, reading, and science—the dataset includes variables related to socio-demographic characteristics, educational trajectories, and socioeconomic conditions within the family environment. This combination of academic and contextual information enables a more comprehensive examination of the factors associated with educational performance across different national contexts.
The use of large-scale international datasets such as PISA is particularly valuable for the comparative analysis of complex educational phenomena, as they provide standardised empirical evidence that facilitates the identification of general patterns across diverse educational systems (OECD, 2023b, 2023c).
The analysis includes all educational systems available in the PISA 2022 database for which complete information on the selected variables was available. No explicit exclusion was made based on OECD membership status; therefore, both OECD and non-OECD participating systems are represented in the analytical sample. This approach ensures broad international comparability while maximising the size and diversity of the dataset.
4.2. Definition of Academic Resilience
One of the central concepts in this study is academic resilience. Within educational research, this term refers to the capacity of certain students to achieve relatively high levels of academic performance despite facing unfavourable socioeconomic conditions.
Following the approach commonly adopted in studies based on PISA data, this research defines resilient students as those who belong to the lowest quartile of the Economic, Social and Cultural Status (ESCS) index while simultaneously achieving scores within the highest quartile of academic performance in mathematics within their respective educational contexts. This definition enables the identification of students who, despite starting from structurally disadvantaged conditions, succeed in attaining outstanding academic outcomes.
It is important to note that this operationalisation makes the category of non-resilient students heterogeneous: it comprises both socioeconomically advantaged students and disadvantaged students who do not achieve high academic performance. Consequently, the classification does not distinguish between different forms of educational disadvantage or underachievement but rather reflects a relative positioning of students within their national context. In addition, because resilience is defined using country-specific thresholds, the classification is inherently comparative, meaning that students are evaluated relative to their peers within each educational system rather than against a universal benchmark.
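The quartile-based definition of resilience can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name, the country codes, and the toy records are hypothetical, and the percentiles are computed with NumPy's default linear interpolation, which may differ from the procedure actually used in the study.

```python
import numpy as np

def resilience_flag(country, escs, math_score):
    """Return 1 where a student is in the bottom ESCS quartile and the top
    mathematics quartile of their own country, else 0."""
    country = np.asarray(country)
    escs = np.asarray(escs, dtype=float)
    math_score = np.asarray(math_score, dtype=float)
    flag = np.zeros(len(country), dtype=int)
    for c in np.unique(country):
        idx = country == c
        escs_q25 = np.percentile(escs[idx], 25)        # country-specific threshold
        math_q75 = np.percentile(math_score[idx], 75)  # country-specific threshold
        flag[idx] = ((escs[idx] <= escs_q25)
                     & (math_score[idx] >= math_q75)).astype(int)
    return flag

# Hypothetical records: two countries with four students each
country = ["AAA"] * 4 + ["BBB"] * 4
escs = [-2.0, -0.5, 0.3, 1.1, -1.5, -1.0, 0.0, 0.8]
pv1math = [610, 450, 480, 520, 400, 420, 560, 430]
resilient = resilience_flag(country, escs, pv1math)
```

Only the first student is flagged: the most disadvantaged student in the second country does not reach that country's top mathematics quartile and therefore falls into the heterogeneous non-resilient category.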
The identification of resilient students represents a valuable analytical perspective for understanding the mechanisms that may contribute to reducing educational inequalities, as it allows researchers to explore which individual, school-related, or contextual factors are associated with successful educational trajectories under conditions of disadvantage (Agasisti & Longobardi, 2017).
4.3. Variables Used
In order to analyse the factors associated with academic resilience, a set of explanatory variables was selected to capture different dimensions of students’ educational and social contexts.
The variables considered include basic demographic characteristics such as gender and students’ migration status. Students’ migration background (IMMIG) was included as a categorical variable distinguishing between native students, second-generation students, and first-generation students, following the coding scheme provided in the PISA dataset. This variable captures differences in students’ demographic and socio-cultural contexts and is particularly relevant given prior evidence suggesting heterogeneous educational outcomes among students with migrant backgrounds. In addition, indicators related to educational trajectories were incorporated, including whether a student had experienced grade repetition. These variables make it possible to examine the influence of structural and school-related factors on academic performance.
Furthermore, measures of academic performance in reading and science were included, together with the mathematics score that constitutes the principal dimension used for identifying academic resilience in this study. With respect to mathematics performance, the first plausible value provided by PISA (PV1MATH) was used as an indicator of achievement in this domain. While the official OECD methodology recommends the use of all plausible values in order to obtain unbiased population estimates and to properly account for measurement error, prior research in machine learning and predictive analytics has shown that in classification tasks it is common practice to employ a single plausible value as a representative proxy for student performance.
This approach is primarily motivated by the structural requirements of supervised learning algorithms, which operate on fixed feature matrices and individual-level observations. Incorporating multiple plausible values would require repeated model estimation and aggregation procedures that are not always compatible with standard machine learning pipelines. Therefore, the use of PV1MATH facilitates computational efficiency and model implementation while preserving the overall distributional properties of achievement scores.
Nevertheless, this simplification implies that the uncertainty associated with latent proficiency estimates is not fully captured within the modelling framework. As a result, the findings should be interpreted within a predictive context rather than as precise estimates of student ability.
It is important to note that the sampling weights provided in the PISA dataset were not incorporated into the modelling process. This decision reflects the predictive orientation of the present study, which is primarily concerned with identifying statistical patterns and maximising classification performance rather than producing population-level estimates. In contrast to traditional inferential analyses—where sampling weights are essential to ensure representativeness—machine learning models are typically designed to learn relationships directly from the observed data structure.
Recent contributions in the field of educational data mining suggest that, in predictive modelling contexts, the omission of sampling weights may be methodologically acceptable when the objective is not statistical inference but pattern recognition and classification performance. Nevertheless, this choice entails certain trade-offs, particularly regarding the generalisability of the results to the broader population. Consequently, the findings of this study should be interpreted within a predictive analytical framework rather than as nationally representative estimates of educational outcomes.
Moreover, because the dependent variable of academic resilience is partially defined on the basis of performance in mathematics, the inclusion of reading and science scores as explanatory variables may raise concerns regarding potential correlations among the different cognitive domains assessed by PISA. Prior research has consistently documented moderate to strong associations between mathematics, reading, and science performance, reflecting the influence of underlying general cognitive abilities and shared educational experiences (OECD, 2019, 2023b).
However, it is important to note that reading and science scores are not mechanically derived from the target variable and therefore do not constitute direct data leakage. Rather, their inclusion is intended to capture broader dimensions of students’ academic profiles, allowing the models to account for general patterns of achievement across domains. In this sense, these variables serve as proxies for multidimensional academic competence rather than as direct predictors of mathematics performance.
Nevertheless, the presence of correlated predictors may contribute to an increase in predictive performance, and this should be taken into consideration when interpreting the results. Accordingly, the findings should be understood as reflecting predictive relationships within a multidimensional framework rather than as evidence of independent causal effects.
The operationalisation of resilience adopted here is relative and country-specific. The thresholds used to identify high achievement are determined within each national distribution, meaning that students classified as resilient in one country may not be directly comparable in absolute performance terms to those in another. However, this relative approach is consistent with the objective of capturing resilience as a context-sensitive phenomenon shaped by the structure of each educational system. Furthermore, the analytical focus on academic resilience, as opposed to achievement alone, was deliberately retained in this study. While alternative specifications based solely on performance distributions could have been employed, such approaches would not explicitly account for socio-economic disadvantage as a defining dimension of the outcome. The present framework therefore prioritises the examination of equity-related patterns rather than performance differences per se.
Finally, the ESCS index developed within the PISA framework was included. This index synthesises information related to parental educational attainment, occupational status, and the cultural and material resources available within the household. It is important to acknowledge that the ESCS index plays a dual role in the analytical framework, as it is used both in the construction of the dependent variable (academic resilience) and as an explanatory variable in the predictive models. This modelling choice reflects the objective of capturing the relative position of students within the socioeconomic distribution while simultaneously examining how variation within disadvantaged groups relates to the probability of being classified as resilient. However, this specification may introduce a degree of dependency between the target variable and the predictors, and therefore the results should be interpreted as reflecting conditional associations rather than independent causal effects.
The combination of these variables enables the exploration of relationships between socioeconomic conditions, educational trajectories, and academic outcomes within a predictive analytical framework.
4.4. Data Preprocessing
Prior to the modelling stage, a process of data preparation and cleaning was conducted to ensure the analytical consistency and quality of the dataset. First, the available records were examined in order to identify missing values, invalid codes, or responses that did not meet the criteria established in the PISA technical documentation.
In order to preserve the largest possible number of observations, missing values in the selected variables were treated through median-based imputation. This method is robust to outliers and preserves the central tendency of each variable, although it may understate variability in variables with a high share of missing values.
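A minimal sketch of this preprocessing step is given below. The helper name and the placeholder value 9999 for an invalid response code are assumptions made for illustration; the actual invalid codes are those documented for each PISA variable.

```python
import numpy as np

def impute_median(values, invalid_codes=()):
    """Recode invalid codes as NaN, then fill NaN with the median of valid values."""
    x = np.asarray(values, dtype=float).copy()
    for code in invalid_codes:
        x[x == code] = np.nan          # treat invalid codes as missing
    median = np.nanmedian(x)           # median of the remaining valid values
    x[np.isnan(x)] = median
    return x

# Hypothetical ESCS-like column; 9999 is an assumed placeholder for an invalid code
raw = [-1.2, 0.4, np.nan, 9999, 0.9, -0.3]
clean = impute_median(raw, invalid_codes=(9999,))
```

Here both the missing entry and the invalid code are replaced by 0.05, the median of the four valid values.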
Subsequently, the distribution of the dependent variable corresponding to academic resilience was examined. Because the number of resilient students is typically considerably smaller than the rest of the student population, a potential class imbalance problem was identified. Such situations may affect the performance of classification algorithms, leading them to favour predictions towards the majority class.
To mitigate this issue, the Synthetic Minority Over-sampling Technique (SMOTE) was applied exclusively to the training subset after the train–test split. This procedure prevents information leakage and ensures that the test data remain representative of the original distribution. The technique generates synthetic observations of the minority class by constructing linear combinations of existing cases, thereby improving the capacity of predictive models to detect patterns associated with academic resilience.
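The interpolation idea behind SMOTE can be illustrated with the simplified sketch below. It is not the reference implementation and not necessarily the one used in this study: the function `smote_like`, the neighbourhood size, and the toy data are assumptions. The essential points it shows are that synthetic rows are linear combinations of existing minority cases and that only training data are resampled.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating between a random
    minority row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        nb = neighbours[base, rng.integers(k)]
        lam = rng.random()                   # interpolation weight in [0, 1)
        synthetic[i] = X_min[base] + lam * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical TRAINING data only: 20 majority rows and 5 minority rows
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (5, 2))])
y_train = np.array([0] * 20 + [1] * 5)

X_min = X_train[y_train == 1]
X_new = smote_like(X_min, n_new=15)          # bring the minority class up to 20
X_res = np.vstack([X_train, X_new])
y_res = np.concatenate([y_train, np.ones(len(X_new), dtype=int)])
```

Because the test subset is never passed to the resampling function, it keeps the original class proportions.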
4.5. Machine Learning Models
In order to identify patterns associated with academic resilience, a set of supervised learning algorithms widely used in educational data analysis and contemporary machine learning research was implemented.
Specifically, six classification models were considered: Logistic Regression, Random Forest, Gradient Boosting, Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and CatBoost. These algorithms represent different modelling approaches capable of capturing both linear and non-linear relationships between explanatory variables and the target variable.
Logistic regression was employed as a baseline model, given its extensive use in social science research and its relatively high degree of interpretability. In contrast, models based on decision-tree ensembles—such as Random Forest and boosting methods—offer a greater capacity to detect complex interactions among variables and to handle high-dimensional datasets.
The comparison among different algorithms allows the evaluation of which modelling approach is most suitable for predicting academic resilience within the context of large-scale educational data.
4.6. Model Performance Evaluation
The performance of the classification models was assessed using a set of metrics commonly employed in supervised learning problems. These included accuracy, precision, recall (true positive rate), the F1-score, the area under the ROC curve (AUC–ROC), and the area under the precision–recall curve (PR-AUC). The inclusion of PR-AUC is particularly important in imbalanced classification settings, as it provides a more informative evaluation of the model’s ability to identify the minority class.
Each of these metrics captures different aspects of model performance. While accuracy provides an overall measure of the proportion of correct predictions, precision and recall allow for the evaluation of the model’s capacity to correctly identify positive cases, which is particularly important in contexts characterised by class imbalance.
To ensure a reliable evaluation of predictive performance, a hold-out validation approach was employed, whereby the dataset was divided into training and testing subsets. In addition, a five-fold cross-validation procedure was conducted for the best-performing model in order to assess the robustness and stability of the results.
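The metrics listed in this section can be computed from predicted scores as in the following hand-written sketch. The data are hypothetical, the function assumes no tied scores, and PR-AUC is approximated here by average precision, which is one common estimator and an assumption about how the quantity might be calculated.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Threshold-based metrics plus two ranking metrics (assumes no tied scores)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # ROC AUC: share of (positive, negative) pairs ranked in the right order
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    auc = sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))
    # Average precision: mean precision at the rank of each true positive
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "roc_auc": auc,
        "pr_auc": ap / len(pos),
    }

# Hypothetical imbalanced test set: 2 positives among 10 students
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.55, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
metrics = classification_metrics(y_true, scores)
```

In this example accuracy is 0.7 although only half of the positive cases are recovered, which illustrates why precision, recall, and PR-AUC are reported alongside accuracy under class imbalance.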
4.7. Evaluation of Algorithmic Fairness
In addition to evaluating predictive accuracy, this study incorporates an analysis aimed at examining potential disparities in the predictions generated for different groups of students. Specifically, the analysis investigates whether the rates of positive predictions associated with academic resilience differ systematically between students of different genders.
To this end, a metric based on the principle of demographic parity was employed, which compares the proportion of positive outcomes between groups defined by a sensitive attribute. This type of analysis enables the identification of potential imbalances in model behaviour that may reflect patterns of inequality embedded in the data used during training.
The incorporation of algorithmic fairness assessments is particularly relevant in the educational domain, where the increasing use of predictive analytics tools raises important challenges related to transparency, institutional accountability, and the promotion of more equitable educational systems.
While demographic parity provides an intuitive and interpretable measure of group-level differences in model predictions, it represents only one among several competing definitions of algorithmic fairness. Alternative frameworks, such as equality of opportunity, predictive parity, and calibration, capture distinct normative perspectives on fairness and may not be simultaneously satisfiable within a single model (Kleinberg et al., 2017; Barocas et al., 2019). Consequently, the fairness analysis conducted in this study should be interpreted as a partial diagnostic rather than a comprehensive evaluation of algorithmic equity. Future extensions could incorporate multi-criteria fairness assessments and constraint-based optimisation approaches in order to explore the trade-offs inherent in fairness-aware machine learning systems.
4.8. Model Interpretability
To enhance the understanding of the behaviour of predictive models, the analysis was complemented with interpretability techniques that allow the examination of the relative contribution of each variable to the predictions generated by the algorithms.
In particular, SHAP values (SHapley Additive exPlanations) were employed. This methodology, grounded in cooperative game theory, decomposes the prediction of a model into individual contributions attributable to each explanatory variable. This approach provides a consistent interpretation of predictor importance and facilitates the identification of the factors exerting the greatest influence on the estimated probability of academic resilience.
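The game-theoretic decomposition can be illustrated by computing exact Shapley values by brute force for a small model. This sketch is not the SHAP library and does not reproduce its estimation algorithms; the scoring function, the instance, and the zero baseline used to represent an absent feature are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance. Features outside a coalition are
    replaced by their baseline value (an assumed way of 'removing' a feature)."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (value(set(subset) | {i}) - value(set(subset)))
    return phi

# Hypothetical scoring function with an interaction between features 0 and 1
def predict(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[0] * z[1] - 1.0 * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved.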
The use of interpretability tools is especially important when machine learning models are applied within educational contexts, as it enhances analytical transparency and enables a clearer understanding of how different educational and social factors influence the outcomes produced by predictive models.
5. Modelling Framework and Computational Procedure
5.1. Formulation of the Classification Problem
The analytical objective of this study is to estimate the probability that a student may be classified as academically resilient based on a set of observed educational and contextual variables. From a machine learning perspective, this task can be framed as a binary classification problem, in which the dependent variable takes one of two possible outcomes: resilient or non-resilient.
Let $Y \in \{0, 1\}$ denote a binary random variable representing the academic resilience status of a student. The value $Y = 1$ is assigned when a student satisfies the criteria established for being considered academically resilient, whereas $Y = 0$ otherwise. Furthermore, let $X$ represent the vector of explanatory variables describing the demographic, socioeconomic, and academic characteristics associated with each student.
The classification task therefore consists of estimating the conditional probability function:
$$P(Y = 1 \mid X = x)$$
which expresses the likelihood that a student is resilient given the observed feature vector. The machine learning algorithms implemented in this study aim to approximate this probability function by identifying statistical regularities embedded in the available data.
5.2. Modelling Strategy
In order to address the classification task outlined above, a comparative modelling strategy was implemented through the application of multiple supervised learning algorithms. This approach enables the predictive performance of different models to be systematically evaluated and facilitates the identification of those that demonstrate superior results in terms of predictive accuracy, stability, and interpretability.
The modelling framework incorporates both traditional statistical methods and contemporary machine learning techniques, particularly ensemble-based approaches relying on decision trees. Broadly speaking, these models operate by identifying patterns within the training data that allow the algorithm to distinguish between the two categories of the target variable.
Decision tree-based methods and boosting techniques are particularly suitable for the analysis of large-scale educational datasets, as they exhibit strong capacity for capturing non-linear relationships among variables, modelling complex interactions among predictors, and adapting effectively to heterogeneous data structures. Moreover, these approaches typically demonstrate robust generalisation performance when applied to datasets characterised by numerous explanatory variables.
In general terms, boosting algorithms construct an additive predictive model expressed as:
$$F_M(x) = \sum_{m=1}^{M} \gamma_m h_m(x)$$
where:
$h_m(x)$ denotes individual decision trees,
$\gamma_m$ represents the weight assigned to each tree,
$M$ indicates the total number of iterations performed by the algorithm.
During the training phase, each successive tree is fitted with the objective of minimising the residual errors generated by the ensemble constructed in previous iterations. Through this iterative refinement process, boosting algorithms are able to capture complex variable interactions and represent highly non-linear relationships present in educational data.
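The residual-fitting logic described above can be sketched with a deliberately small example. It uses single-split regression stumps, squared-error loss, and a constant weight for every tree; these are simplifying assumptions for illustration and do not correspond to the configuration of the boosting libraries used in the study.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:              # candidate thresholds
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def boost(x, y, n_rounds=50, gamma=0.1):
    """Additive model: initial constant plus gamma times each stump,
    where every stump is fitted to the residuals of the current ensemble."""
    f0 = y.mean()
    stumps = []
    pred = np.full_like(y, f0, dtype=float)
    for _ in range(n_rounds):
        h = fit_stump(x, y - pred)           # fit the current residual errors
        pred = pred + gamma * h(x)
        stumps.append(h)
    return lambda z: f0 + gamma * sum(h(z) for h in stumps)

# Hypothetical non-linear relationship on a single feature
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

model = boost(x, y)
mse_constant = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
```

Although each stump is a very weak learner, the accumulated ensemble follows the non-linear pattern far more closely than a constant prediction does.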
The implementation of several algorithms within a unified analytical framework enables their predictive performance to be compared using standardised evaluation metrics, thereby supporting the selection of the most appropriate model for the classification problem under investigation.
5.3. Computational Procedure
The analytical workflow followed in this study was organised into a sequence of computational stages ranging from data preparation to the interpretation of the predictive outputs generated by the models. This procedure was designed with the dual purpose of ensuring analytical reproducibility and enabling a systematic evaluation of predictive performance.
The initial stage involved the preparation of the dataset. This included the cleaning of raw records, the treatment of missing observations, and the construction of the binary variable representing academic resilience. Following data preparation, the dataset was partitioned into training (80%) and testing (20%) subsets using stratified sampling to preserve class proportions. Class imbalance was addressed by applying SMOTE exclusively to the training data. Subsequently, the machine learning models were trained on the resampled training set, and their predictive performance was evaluated on the independent test set using multiple evaluation metrics. This sequential procedure ensures a clear separation between model training and evaluation, thereby enhancing the validity of the results.
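A stratified 80/20 partition of the kind described above can be sketched as follows. The function is a hand-written illustration with a hypothetical target vector and seed; standard library routines that perform stratified splitting would normally be used in practice.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = np.random.default_rng(seed)        # fixed seed for reproducibility
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])        # same share taken from every class
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Hypothetical target with 10% resilient students
y = np.array([1] * 100 + [0] * 900)
train_idx, test_idx = stratified_split(y)
```

Both subsets retain the 10% share of positive cases, and any oversampling would then be applied to the rows indexed by `train_idx` only.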
In the following stage, the selected machine learning algorithms were trained using the predefined training subset. During this phase, each model adjusted its internal parameters according to patterns detected in the training data, thereby producing predictive functions capable of estimating the probability of academic resilience for previously unseen observations.
Once the models had been trained, their performance was evaluated using the independent test dataset. At this stage, multiple classification metrics were computed in order to compare the predictive capacity of the different algorithms. In parallel, an algorithmic fairness assessment was conducted to examine whether the predictive behaviour of the models exhibited systematic differences across student groups.
Finally, model interpretation techniques based on SHAP (SHapley Additive exPlanations) values were employed to analyse the relative contribution of the explanatory variables to the predictions generated by the algorithms. This interpretative step allows for a clearer understanding of the factors that most strongly influence the estimated probability of academic resilience.
5.4. Algorithmic Representation of the Analytical Pipeline
In order to synthesise the procedure described above, the complete analytical workflow may be represented through an algorithmic schema summarising the principal stages of the computational pipeline implemented in this study.
In general terms, the procedure begins with the loading and preparation of the dataset, followed by the construction of the academic resilience variable and the selection of relevant explanatory features. Subsequently, the workflow includes the training and evaluation of several machine learning models. The final stages involve the assessment of algorithmic fairness and the interpretation of predictive outcomes through explainability techniques.
This algorithmic representation provides a clear structure for the analytical workflow adopted in the study, thereby facilitating both replicability of the research design and transparency in the methodological process.
The pseudocode presented in Algorithm 1 summarises the principal stages of the analytical pipeline employed in this investigation, ranging from data preparation to the final interpretation of the predictive results. Among the academic variables included in the modelling framework are the plausible values for reading and science achievement, which serve as complementary indicators of the student’s academic performance profile. Additionally, the pseudocode provides a structured visualisation of the sequence of operations applied to the dataset, from the selection of relevant variables to the comparative evaluation of different classification algorithms.
Algorithm 1. Machine Learning Framework for Predicting Academic Resilience.
Input: PISA 2022 dataset containing student-level variables
Features: Gender, Immigration status, Grade repetition, ESCS index, Reading score (PV1READ), Science score (PV1SCIE)
Target Variable: Academic resilience defined according to the OECD framework: students belonging to the lowest quartile of ESCS within their country and simultaneously achieving mathematics performance in the top quartile.
Output: Trained predictive models, performance metrics, fairness evaluation, and feature importance interpretation.
Step 1: Data Preparation
    1.1 Load the PISA 2022 student dataset
    1.2 Select relevant variables for analysis
    1.3 Recode invalid response codes as missing values
    1.4 Handle missing values using median imputation
Step 2: Construction of the Target Variable
    2.1 For each country, compute the 25th percentile of the ESCS index
    2.2 Identify students belonging to the lowest ESCS quartile
    2.3 For each country, compute the 75th percentile of mathematics performance (PV1MATH)
    2.4 Assign RESILIENT = 1 if:
        • ESCS ≤ 25th percentile within country, AND
        • PV1MATH ≥ 75th percentile within country
        Otherwise assign RESILIENT = 0
Step 3: Feature Preparation
    3.1 Encode categorical variables (e.g., gender, immigration status)
    3.2 Retain numerical variables in their original scale as required by tree-based models
Step 4: Train–Test Split
    4.1 Split the dataset into training set (80%) and test set (20%) using stratified sampling
    4.2 Set random seed to ensure reproducibility of results
Step 5: Class Imbalance Handling
    5.1 Apply the Synthetic Minority Oversampling Technique (SMOTE) to the training dataset
Step 6: Model Training
    Train the following supervised learning algorithms:
        • Logistic Regression
        • Random Forest
        • Gradient Boosting
        • XGBoost
        • LightGBM
        • CatBoost
Step 7: Model Evaluation
    7.1 For each model compute the evaluation metrics listed in Section 4.6 (accuracy, precision, recall, F1-score, AUC–ROC, PR-AUC)
    7.2 Analyse classification thresholds to examine precision–recall trade-offs
    7.3 Generate predicted probabilities for evaluation metrics such as AUC and PR-AUC
Step 8: Fairness Assessment
    8.1 Evaluate predictive performance across gender groups
    8.2 Compute positive prediction rates (demographic parity)
    8.3 Estimate true positive rates (TPR) and false positive rates (FPR) by group
    8.4 Calculate fairness gaps (e.g., demographic parity ratio and equality of opportunity differences)
Step 9: Model Interpretation
    9.1 Extract feature importance for tree-based models where applicable
    9.2 Apply SHAP values to a sample of the test data to explain model predictions
Step 10: Visualisation
    10.1 Generate ROC curves
    10.2 Plot feature importance
    10.3 Visualise prediction disparities across gender groups
Return: model performance metrics, fairness indicators, and interpretability analyses.
6. Results
6.1. Description of the Dataset
The dataset employed in this study originates from the international Organisation for Economic Co-operation and Development (OECD) Programme for International Student Assessment PISA 2022 database (
OECD, 2023a). Following the selection of relevant variables and the removal of incomplete observations, the final analytical sample comprised 613,744 students across 80 educational systems.
The variables considered in the analysis included demographic, socioeconomic, and academic indicators: gender, immigration status, grade repetition, the Economic, Social and Cultural Status index (ESCS), and plausible values for reading and science achievement. Consistent with methodological recommendations for machine learning analyses based on PISA datasets and with the predictive modelling framework adopted in this study, the first plausible value of mathematics performance (PV1MATH) was used as the indicator of achievement in mathematics.
The ESCS index displayed a mean value of −0.31 with a standard deviation of 1.13, reflecting the considerable socioeconomic heterogeneity present across the educational systems represented in the dataset. Mean achievement scores were 438.22 points in reading and 450.46 points in science, values that align with the international score distributions typically reported for PISA.
Table 1 summarises the principal descriptive statistics of the dataset, presenting measures of central tendency and dispersion for the variables included in the machine learning models.
Regarding migration status, the average value reported in
Table 1 reflects the categorical coding used in the PISA dataset rather than a continuous measure. The distribution indicates that the majority of students in the sample are native, with smaller proportions corresponding to first- and second-generation immigrant students. This distinction is important for interpreting the results, as the predictive role of migration background reflects differences across these groups rather than a linear effect.
6.2. Identification of Academically Resilient Students
Following the definition proposed by the Organisation for Economic Co-operation and Development for analysing academic resilience, a student was classified as resilient if, despite belonging to the lowest quartile of the ESCS index within their country, they simultaneously achieved mathematics performance at or above the top quartile of their national distribution (
OECD, 2023c,
2023d).
Applying this criterion led to the identification of 16,363 resilient students, representing approximately 2.7% of the total sample. This proportion is consistent with prior empirical research documenting the relatively limited prevalence of academic resilience in international large-scale assessments (
Agasisti et al., 2018).
The observed distribution reveals a substantial imbalance between resilient and non-resilient students, which justified the application of class balancing techniques during the training phase of the predictive models.
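The within-country labelling rule described above can be sketched in a few lines of code. The following is a minimal illustration only, not the study's implementation: the record layout, the country key `CNT`, the helper names `percentile` and `label_resilient`, and the linear-interpolation percentile are assumptions introduced here, and the ten records are invented for demonstration.

```python
from collections import defaultdict


def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def label_resilient(students):
    """Return copies of the records with RESILIENT = 1 when, within the student's
    own country, ESCS <= 25th percentile and PV1MATH >= 75th percentile."""
    by_country = defaultdict(list)
    for s in students:
        by_country[s["CNT"]].append(s)  # "CNT" is an assumed country key

    labelled = []
    for group in by_country.values():
        escs_p25 = percentile([s["ESCS"] for s in group], 25)
        math_p75 = percentile([s["PV1MATH"] for s in group], 75)
        for s in group:
            flag = int(s["ESCS"] <= escs_p25 and s["PV1MATH"] >= math_p75)
            labelled.append({**s, "RESILIENT": flag})
    return labelled


# Invented toy records for two hypothetical countries.
students = [
    {"CNT": "AAA", "ESCS": -1.5, "PV1MATH": 560.0},
    {"CNT": "AAA", "ESCS": -0.5, "PV1MATH": 430.0},
    {"CNT": "AAA", "ESCS": 0.2, "PV1MATH": 480.0},
    {"CNT": "AAA", "ESCS": 0.9, "PV1MATH": 520.0},
    {"CNT": "AAA", "ESCS": 1.4, "PV1MATH": 500.0},
    {"CNT": "BBB", "ESCS": -2.0, "PV1MATH": 390.0},
    {"CNT": "BBB", "ESCS": -1.0, "PV1MATH": 450.0},
    {"CNT": "BBB", "ESCS": 0.0, "PV1MATH": 400.0},
    {"CNT": "BBB", "ESCS": 0.5, "PV1MATH": 420.0},
    {"CNT": "BBB", "ESCS": 1.0, "PV1MATH": 410.0},
]

labels = [s["RESILIENT"] for s in label_resilient(students)]
```

In this toy example, a score of 450 qualifies as top-quartile in country BBB but would not in country AAA, which illustrates why the thresholds are computed separately for each educational system.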
6.3. Differences in Academic Resilience by Gender
An initial exploratory analysis examined the distribution of academic resilience according to gender. The findings indicate that the proportion of resilient students varies moderately between male and female students. However, these differences must be interpreted within the broader context of socioeconomic conditions and academic trajectories.
Figure 1 illustrates the average rate of academic resilience among male and female students. This preliminary examination allows potential differential patterns to be identified, which are subsequently explored through machine learning models and algorithmic fairness analyses.
6.4. Comparative Evaluation of Machine Learning Models
To estimate the probability that a student would be classified as academically resilient, six machine learning models widely used in educational predictive analytics were implemented:
Logistic Regression;
Random Forest;
Gradient Boosting;
XGBoost;
LightGBM;
CatBoost.
The models were trained using 80% of the dataset, while the remaining 20% was reserved for evaluation. Given the strong imbalance between resilient and non-resilient students, the Synthetic Minority Oversampling Technique (SMOTE) was applied in order to balance the class distribution during model training (
Chawla et al., 2002).
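The core idea of SMOTE, generating synthetic minority cases by interpolating between a minority observation and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the study: the function name `smote`, its parameters, and the toy data are hypothetical, all features are treated as continuous, and oversampling is applied to the training portion only.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows, each lying on the segment between a randomly
    chosen minority row and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k closest other minority rows (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Invented training data: three minority (1) and five majority (0) rows.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
           [5.0, 5.0], [5.5, 5.0], [6.0, 6.0], [5.0, 6.5], [6.5, 5.5]]
y_train = [1, 1, 1, 0, 0, 0, 0, 0]

minority_rows = [x for x, y in zip(X_train, y_train) if y == 1]
n_needed = y_train.count(0) - y_train.count(1)
new_rows = smote(minority_rows, n_needed, k=2)

X_balanced = X_train + new_rows
y_balanced = y_train + [1] * len(new_rows)
```

Because the synthetic rows are created only from training observations, the held-out 20% retains the original class distribution and the evaluation metrics remain comparable with real-world prevalence.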
The results presented in Table 2 show that all models achieved high levels of predictive performance, with AUC values exceeding 0.92. Among them, the CatBoost algorithm demonstrated the strongest overall discriminative performance, achieving an AUC of 0.941, closely followed by XGBoost and LightGBM. A detailed comparison of predictive performance and fairness metrics across all evaluated models is provided in Appendix A.
In addition to ROC-AUC, PR-AUC values are reported to account for class imbalance, providing a more informative assessment of model performance in identifying the minority class. The evaluation of model performance was further strengthened through additional validation procedures. In particular, five-fold cross-validation was conducted for the best-performing model, yielding an average ROC-AUC of 0.949. This result supports the robustness and stability of the predictive performance across different data partitions.
In addition, a threshold analysis was performed in order to examine the trade-off between precision and recall at different classification cut-off points. The results indicate that lower thresholds substantially increase recall at the expense of precision, whereas higher thresholds lead to more conservative predictions with improved precision but reduced sensitivity. This analysis highlights the importance of selecting an appropriate decision threshold depending on the intended application of the model, particularly in educational contexts where the identification of resilient students may prioritise either inclusiveness or precision.
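The trade-off described above can be made concrete with a short sketch that computes precision and recall at several cut-off points. The function name `precision_recall_at` and the ten labelled probabilities are invented for illustration and do not correspond to the study's data.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when cases with probability >= threshold are flagged."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented example: three resilient (1) and seven non-resilient (0) students.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05]

for threshold in (0.3, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy case, lowering the threshold to 0.3 recovers every resilient student at the cost of as many false alarms as correct identifications, whereas a threshold of 0.8 yields no false alarms but misses two of the three resilient students.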
Consistent with the overall performance metrics, ensemble-based models demonstrate a strong capacity to capture complex non-linear relationships between socioeducational variables and academic resilience, while maintaining stable predictive behaviour across validation procedures.
6.5. ROC Curves and Discriminative Capacity of the Models
In order to further examine model discrimination performance, Receiver Operating Characteristic (ROC) curves were analysed for each algorithm. ROC curves illustrate the relationship between the true positive rate and the false positive rate across different classification thresholds.
As shown in
Figure 2, boosting-based models display curves that lie closer to the upper-left corner of the plot, indicating stronger ability to distinguish correctly between resilient and non-resilient students.
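As an illustration of how such curves are obtained, the sketch below traces the (FPR, TPR) pairs by lowering the threshold through each distinct score, and computes the AUC through its probabilistic interpretation: the chance that a randomly chosen resilient student receives a higher score than a randomly chosen non-resilient student, with ties counted as one half. The function names and the toy scores are assumptions made for this example.

```python
def roc_points(y_true, y_score):
    """(FPR, TPR) pairs obtained by lowering the threshold through each distinct score."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores for three resilient (1) and seven non-resilient (0) students.
labels_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores_demo = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1, 0.05]

curve = roc_points(labels_demo, scores_demo)
auc = roc_auc(labels_demo, scores_demo)
```

The pairwise formulation is quadratic in the number of observations and is therefore suitable only for small illustrations; for a sample of the size analysed here, a rank-based computation would be preferable.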
6.6. Importance of Predictor Variables
After identifying the best-performing model (CatBoost), the relative importance of the predictor variables was examined (
Figure 3). This analysis enables the identification of the factors that most strongly contribute to the prediction of academic resilience.
The results summarised in
Table 3 present the most influential predictors in the best-performing model. The prominent importance of immigration status should be interpreted in light of its categorical structure, which distinguishes between native, first-generation, and second-generation students. This result does not imply a uniform effect of migration background; rather, it suggests that differences between these groups contribute substantially to the model’s ability to distinguish between resilient and non-resilient students. In particular, the predictive relevance of this variable may reflect heterogeneity in educational experiences, access to resources, and motivational dynamics across student groups with different migration backgrounds.
In contrast, the gender variable exhibits considerably lower predictive importance, suggesting that academic resilience is more closely associated with socioeconomic context and educational trajectories than with gender differences alone.
6.7. Algorithmic Fairness Analysis
To investigate potential bias in the model’s predictions, an algorithmic fairness analysis disaggregated by gender was conducted (
Table 4). This analysis included group-specific performance metrics as well as the proportion of positive predictions generated by the model.
The results indicate that the model demonstrates slightly stronger predictive performance for female students, as reflected in higher AUC and F1-score values. However, the positive prediction rate was higher among male students.
The analysis indicates a demographic parity ratio of 1.40: the positive prediction rate for male students is approximately 1.4 times that for female students. In terms of equality of opportunity, the true positive rate differs across groups, with a gap of 0.053.
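The group-level quantities underlying these indicators can be computed directly from predictions, as the following sketch shows. It is a minimal illustration, assuming that the demographic parity ratio is defined as the positive prediction rate of one group divided by that of the other and that the equality of opportunity gap is the difference in true positive rates. The function names and the twenty toy predictions are invented and do not reproduce the values reported in Table 4.

```python
def group_rates(y_true, y_pred, groups):
    """Positive prediction rate, TPR and FPR for each group label."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(y, p) for y, p, gg in zip(y_true, y_pred, groups) if gg == g]
        preds_for_pos = [p for y, p in rows if y == 1]
        preds_for_neg = [p for y, p in rows if y == 0]
        stats[g] = {
            "positive_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(preds_for_pos) / len(preds_for_pos) if preds_for_pos else float("nan"),
            "fpr": sum(preds_for_neg) / len(preds_for_neg) if preds_for_neg else float("nan"),
        }
    return stats


def fairness_gaps(stats, group_a, group_b):
    """Ratio and differences of group_a relative to group_b (assumed convention).
    The ratio is undefined when group_b receives no positive predictions."""
    a, b = stats[group_a], stats[group_b]
    return {
        "demographic_parity_ratio": a["positive_rate"] / b["positive_rate"],
        "equal_opportunity_diff": a["tpr"] - b["tpr"],
        "fpr_diff": a["fpr"] - b["fpr"],
    }


# Invented predictions for ten male (M) and ten female (F) students.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
groups = ["M"] * 10 + ["F"] * 10

stats = group_rates(y_true, y_pred, groups)
gaps = fairness_gaps(stats, "M", "F")
```

In this toy example the male group receives positive predictions twice as often as the female group, with both a higher true positive rate and a higher false positive rate, which mirrors the direction, though not the magnitude, of the pattern described in this section.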
These results indicate the presence of measurable disparities in model behaviour across gender groups. From a modelling perspective, these disparities can be interpreted as a consequence of the interaction between class imbalance, feature distribution, and the optimisation objectives of the algorithms. In particular, the higher false positive rate observed among male students suggests that the model adopts a less conservative classification strategy for this group, potentially reflecting differences in the distribution of predictor variables across genders.
Conversely, the higher precision observed for female students indicates that positive predictions are assigned more selectively, which may reduce false positives but simultaneously limit the identification of truly resilient cases. This asymmetry illustrates a fundamental trade-off in classification systems: improving performance for one group or metric may inadvertently affect outcomes for another.
Importantly, these differences should not be interpreted as evidence of intentional bias within the algorithm itself. Rather, they reflect structural patterns embedded in the training data, which are subsequently learned and reproduced by the model. This observation reinforces the need to complement predictive modelling with fairness-aware evaluation frameworks in order to ensure a more balanced and context-sensitive interpretation of results.
A closer examination of error distributions reveals asymmetric patterns. Male students exhibit higher false positive rates, indicating a greater likelihood of being incorrectly classified as resilient. In contrast, predictions for female students are more conservative, as reflected in lower positive rates and higher precision.
These findings indicate that the model does not operate uniformly across demographic groups but instead produces differentiated classification patterns with potentially unequal implications. From a fairness perspective, this asymmetry reflects the presence of competing optimisation dynamics, whereby improvements in certain performance metrics may generate trade-offs in others. Importantly, the observed disparities cannot be attributed solely to algorithmic design, but rather emerge from the interaction between model structure, class imbalance, and the distributional properties of the input data. This observation reinforces the view that algorithmic bias should be understood as a systemic phenomenon arising across the machine learning pipeline, rather than as an isolated technical issue.
6.8. Model Explainability Through SHAP
In order to enhance the interpretability of the predictive model, the SHAP (SHapley Additive exPlanations) method was applied, which allows the contribution of each variable to individual predictions to be quantified (
Lundberg & Lee, 2017).
The SHAP analysis presented in
Figure 4 corroborates the findings obtained through the model’s feature importance metrics, highlighting immigration status, socioeconomic context, and grade repetition as key determinants of academic resilience. To provide additional interpretative context, the SHAP analysis indicates that higher predicted probabilities of academic resilience are more frequently associated with observations corresponding to students with a migrant background, although this pattern varies across categories and should not be interpreted as a uniform effect.
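The attribution logic behind SHAP can be illustrated with an exact, brute-force computation of Shapley values for a small model. The sketch below is not the procedure used in the study, which relies on the SHAP method for tree-based models; it assumes a single reference (baseline) observation for "absent" features, and the function `shapley_values` and the toy scoring function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).
    Features outside a coalition take their baseline value (an assumption)."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    """Invented scoring function with one interaction term (features 0 and 2)."""
    return 0.1 + 0.3 * features[0] + 0.2 * features[1] + 0.4 * features[0] * features[2]


x_obs = [1.0, 1.0, 1.0]
x_ref = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x_obs, x_ref)
```

The contributions sum to the difference between the prediction for the observation and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Exhaustive enumeration grows exponentially with the number of features, which is why practical SHAP implementations rely on model-specific or sampling-based approximations.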
6.9. International Variation in Academic Resilience
Finally, the distribution of academic resilience was examined across the different educational systems included in the PISA 2022 dataset.
Figure 5 reveals substantial variation across countries, with some educational systems exhibiting comparatively higher proportions of resilient students. These differences may be associated with institutional factors, educational policies, and structural characteristics of national school systems.
7. Discussion
The findings provide insight into the relationships between artificial intelligence, educational inequality, and academic resilience in a comparative international context. Drawing on more than 600,000 observations from the PISA 2022 dataset, the analysis shows that contemporary machine learning models are capable of identifying complex patterns associated with the academic performance of students who achieve high levels of attainment despite experiencing adverse socioeconomic circumstances.
Overall, the results confirm that artificial intelligence-based educational analytics tools possess substantial potential for examining large-scale educational phenomena. At the same time, however, the analysis reveals that such models strongly reflect the social and institutional structures embedded in the data used for training. Consequently, the application of these systems within educational settings requires critical reflection that considers not only their analytical capabilities but also their ethical and social implications.
7.1. Predictive Performance of Machine Learning Models
The comparative analysis of algorithms indicates that ensemble models based on decision trees provide the strongest predictive performance in the identification of academic resilience. In particular, the CatBoost, XGBoost, and LightGBM algorithms achieved values of the area under the ROC curve close to 0.94, demonstrating a high capacity to discriminate between resilient and non-resilient students.
This outcome is consistent with recent research in the field of educational data analytics, which has documented that machine learning techniques based on boosting strategies frequently outperform traditional statistical models when analysing large-scale educational datasets. Their advantage likely reflects a capacity to capture complex, non-linear relationships among predictors in such data.
However, the high predictive accuracy of these models should not be interpreted solely as a technical advantage. The capacity of algorithms to detect latent patterns also implies that they may faithfully reproduce structural inequalities embedded within educational systems. For this reason, evaluations of predictive performance must be accompanied by systematic assessments of fairness and transparency. It is also important to acknowledge that certain methodological choices adopted in this study—specifically the use of a single plausible value for mathematics achievement and the omission of sampling weights—introduce limitations that should be taken into account when interpreting the findings. While these decisions are consistent with common practices in machine learning-based predictive modelling, they imply that the results should be understood as reflecting statistical patterns within the analytical sample rather than as population-representative estimates.
As shown in
Appendix A, the model achieving the highest predictive performance does not necessarily correspond to the most balanced outcomes across evaluation metrics. While CatBoost, LightGBM, and XGBoost demonstrate similar levels of predictive accuracy, differences in recall and precision indicate variation in classification behaviour. In particular, the fairness analysis conducted for the CatBoost model reveals disparities across gender groups, suggesting that high predictive performance does not guarantee equitable outcomes. This finding highlights the importance of evaluating machine learning models using multiple criteria, including both performance and fairness considerations. A full comparison of fairness metrics across all models is beyond the scope of the present analysis and represents a relevant avenue for future research.
7.2. Structural Determinants of Academic Resilience
The analysis of feature importance provides valuable insights into the factors influencing the classification of resilient students. Immigration status is the most prominent predictor, followed by the Economic, Social and Cultural Status index (ESCS) and grade repetition, and the model assigns higher predicted probabilities of academic resilience to certain groups of students with a migrant background. This effect should be interpreted in relation to the categorical structure of the migration variable, which distinguishes between native, first-generation, and second-generation students. These patterns reflect group differences rather than a uniform “immigrant effect”.
This finding suggests that academic resilience cannot be understood exclusively as an individual attribute linked to personal effort or cognitive ability. Rather, it appears to be deeply shaped by structural factors associated with the broader social and educational environment.
The prominence of the ESCS index confirms results that have been widely documented in the literature on educational inequality (
OECD, 2023c;
Reardon, 2011). Numerous studies have shown that economic, social, and cultural capital continue to represent some of the most influential determinants of students’ academic performance (
Bourdieu, 1986;
OECD, 2023c). According to recent reports published by the Organisation for Economic Co-operation and Development, socioeconomic disparities remain a central factor shaping educational opportunities at the global level (
OECD, 2023c,
2023d).
Predictive models do not generate educational inequalities. Instead, they reflect structural patterns already present in the data. When interpreted within an appropriate analytical framework, the ability of these models to detect such relationships may contribute to a deeper understanding of the mechanisms through which educational inequalities are reproduced.
These findings should be interpreted in light of the conceptual limitations of the academic resilience construct. As noted earlier, resilience is defined in relative terms and does not capture the full range of processes through which students overcome disadvantage. In this sense, the results should not be interpreted as identifying fixed characteristics of “resilient students”, but rather as reflecting probabilistic patterns associated with specific educational and socioeconomic configurations. Future research may benefit from complementing this approach with frameworks that focus more explicitly on developmental pathways and protective mechanisms underlying resilient outcomes.
7.3. Algorithmic Fairness and Bias in Educational Prediction
One of the most significant contributions of the present study lies in its examination of algorithmic fairness in the prediction of academic resilience. The gender-related patterns identified in the fairness analysis should be interpreted within the broader historical context of gender differences in educational achievement. A substantial body of research has documented persistent gender disparities across subject domains, with female students typically outperforming males in reading, while males have historically shown an advantage in mathematics and science in certain contexts. However, recent international evidence indicates that these gaps have narrowed over time in many education systems, reflecting ongoing efforts to promote greater equity in learning outcomes (
OECD, 2019,
2023b). These developments highlight the importance of interpreting observed disparities not as fixed or universal patterns, but as context-dependent phenomena shaped by evolving social, cultural, and institutional dynamics. Against this background, the findings reveal differences in the rate of positive predictions between gender groups, suggesting the presence of potential disparities in the behaviour of the predictive model.
Although overall predictive performance remains relatively high for both gender groups, the empirical results reveal distinct classification patterns that merit closer examination. Specifically, the model produces a higher positive prediction rate for male students, accompanied by a higher false positive rate, whereas predictions for female students are characterised by greater precision and lower false positive rates.
These findings suggest that the model operates under asymmetric decision boundaries across groups, which may be linked to differences in the underlying distribution of socioeconomic and academic variables. In practical terms, this implies that male students are more likely to be classified as resilient even when this classification is incorrect, while female students are subject to a more conservative classification regime.
This pattern reflects a tension between competing definitions of fairness. While the model achieves relatively similar levels of overall predictive performance, it does not satisfy strict parity across key fairness metrics such as demographic parity and equality of opportunity. The observed demographic parity ratio of approximately 1.40 and the gap in true positive rates indicate that prediction outcomes are not evenly distributed across groups. At the same time, it is important to emphasise that the fairness assessment conducted in this study should be interpreted as a partial diagnostic rather than a comprehensive evaluation of algorithmic equity. The analysis focuses on a limited set of formal metrics and does not capture the full range of normative considerations or alternative fairness criteria that may be relevant in educational contexts.
These results underscore the importance of defining fairness objectives in the development of predictive models for educational applications. Different policy contexts may prioritise alternative fairness criteria, such as maximising the identification of disadvantaged high-performing students or minimising incorrect classifications. Consequently, the evaluation of algorithmic fairness should be understood not as a purely technical exercise, but as a normative decision-making process that requires careful consideration of the social implications of predictive systems.
These results are consistent with a growing body of research warning that algorithmic systems may reproduce existing social biases when applied in educational or institutional contexts. Several studies have highlighted that machine learning models can amplify inequalities when they are trained using data that reflect historical patterns of exclusion or discrimination (
Barocas et al., 2019;
Mehrabi et al., 2021).
Systematic evaluation of fairness metrics is therefore essential in the development of artificial intelligence systems for educational applications. The incorporation of such evaluation mechanisms makes it possible to detect potential biases within classification processes and supports the design of strategies aimed at mitigating their effects.
From a governance perspective, these findings suggest the importance of embedding fairness considerations throughout the lifecycle of artificial intelligence systems in education, from data collection and preprocessing to model deployment and evaluation. The implementation of fairness-aware learning algorithms, post-processing adjustment techniques, and continuous monitoring mechanisms may contribute to mitigating disparities in predictive outcomes. However, it is equally important to recognise that technical solutions alone are insufficient to resolve issues of algorithmic bias. Addressing these challenges requires an integrated approach that combines methodological rigour with institutional accountability, stakeholder participation, and context-sensitive policy design.
The relative definition of academic resilience also has implications for policy interpretation, as the identification of resilient students depends on how resilience is operationalised. Predictive models may therefore influence which students are identified as requiring support, highlighting the need for careful consideration when such tools are used in educational decision-making contexts.
7.4. Interpretation of Cross-National Differences in Academic Resilience
The cross-national analysis conducted in this study reveals substantial variation in the proportion of resilient students across educational systems. While certain jurisdictions exhibit relatively high levels of academic resilience, other educational contexts display considerably lower proportions of students capable of achieving high academic performance despite facing socioeconomic disadvantage.
These differences indicate that academic resilience cannot be explained solely by individual characteristics. Institutional and structural features specific to each educational system also appear to play a decisive role. Factors such as the quality of educational policies, the allocation of school resources, curricular organisation, and pedagogical practices may significantly influence students’ capacity to overcome socioeconomic barriers.
These findings support the view that academic resilience constitutes a multidimensional phenomenon emerging from the interaction of individual, familial, and institutional conditions.
7.5. Implications for the Governance of Artificial Intelligence in Education
The results of this study contribute to the expanding debate surrounding the governance of artificial intelligence within educational systems. These implications should be interpreted in light of the empirical scope of the study and the specific modelling framework employed. As educational analytics tools increasingly influence institutional decision-making processes, it becomes important to consider how such systems can be aligned with principles of transparency, accountability, and fairness. Previous research on responsible AI deployment has emphasised the value of organisational tools and institutional frameworks that enable practitioners to operationalise fairness in real-world systems (
Holstein et al., 2019). More recent contributions further suggest that the governance of artificial intelligence in education involves not only technical solutions but also the development of ethical guidelines, regulatory frameworks, and participatory approaches involving multiple stakeholders (
Holmes et al., 2023). Recent empirical work applying machine learning techniques to PISA 2022 data also indicates the importance of critically examining how predictive models capture patterns of academic resilience across diverse educational systems. For instance,
Cheung et al. (
2024) show that while machine learning models can effectively identify resilient students, their predictive behaviour appears to be strongly conditioned by the structural characteristics of national education systems and the underlying distribution of socioeconomic variables. This underscores the importance of interpreting model outputs within their broader institutional and comparative context.
Within this context, the use of model interpretation techniques, such as feature importance analysis and the application of SHAP values, represents an important step towards enhancing the transparency of algorithmic systems. These tools enable a clearer understanding of the mechanisms through which predictive models generate their outputs and facilitate critical evaluation of their broader social implications. Recent research suggests that explainable artificial intelligence constitutes a key component in the responsible deployment of predictive systems in education, as it enables both technical transparency and institutional accountability (
Baker & Hawn, 2022). In addition, recent research on fairness-aware machine learning in educational contexts has emphasised that transparency alone is insufficient to guarantee equitable outcomes.
Pham et al. (
2025) show that algorithmic systems may exhibit differential performance across student groups even when overall predictive accuracy appears high, thereby underscoring the importance of systematically evaluating fairness metrics alongside predictive performance. Complementary evidence is provided by
Kesgin et al. (
2025), who emphasise that the integration of fairness-aware modelling strategies with explainable artificial intelligence techniques is essential for identifying and mitigating potential biases in student performance prediction systems. Their findings suggest that combining predictive accuracy with interpretability and fairness constraints can improve both the transparency and the ethical robustness of educational machine learning applications. This perspective aligns with the present study’s findings regarding disparities in prediction rates across gender groups.
One of the findings deserving particular attention is the prominent role of immigration status among the predictors associated with academic resilience. At first glance, this result might appear counterintuitive, as numerous studies have documented that students with migrant backgrounds often face additional challenges within educational systems, including language barriers, socioeconomic inequalities, and processes of cultural adaptation (
OECD, 2023a,
2023b,
2023c,
2023d).
Nevertheless, international research has also identified phenomena that help explain such results. Several studies have described what is commonly referred to as the “immigrant paradox” or “immigrant optimism”, whereby certain groups of students from migrant families may display high levels of academic motivation and strong educational aspirations despite experiencing socioeconomic disadvantage (
Kao & Tienda, 1995;
Salikutluk, 2016). In many cases, migration processes are associated with family expectations of social mobility through education, which may translate into higher levels of academic effort and persistence.
Another possible explanation relates to the considerable heterogeneity of migration experiences across countries. Institutional conditions, educational integration policies, and support mechanisms available to migrant students vary significantly across national contexts. In some systems, targeted programmes providing language support, academic tutoring, or school integration initiatives may help mitigate the initial disadvantages associated with migrant background.
Consequently, the predictive importance of immigration status identified in the models should not be interpreted as a direct causal relationship. Rather, it may be understood as an indicator reflecting complex patterns embedded within international educational data. In this sense, the results point to the importance of further research examining the institutional and contextual factors shaping the educational trajectories of students with migrant backgrounds. At the same time, this finding illustrates how predictive models may reflect underlying structural inequalities present in the data, reinforcing concerns raised in the literature on algorithmic bias regarding the reproduction of social disparities within automated decision-making systems (
Mehrabi et al., 2021;
Baker & Hawn, 2022).
From a methodological perspective, this finding also illustrates the value of interpretability tools in machine learning models applied to educational research. Techniques such as SHAP values enable the identification of influential predictors without assuming causal relationships, thereby supporting more careful and nuanced interpretations of predictive modelling results.
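The attribution logic behind SHAP values can be illustrated with a minimal sketch. It computes exact Shapley attributions for a single prediction by averaging each feature's marginal contribution over all feature orderings. The function `exact_shapley`, the scoring function `toy_model`, and the all-zero baseline are hypothetical and chosen only for illustration; this is not the procedure used in the study, and SHAP implementations rely on faster approximations and on background data rather than a single reference input.

```python
from itertools import permutations
from math import factorial


def exact_shapley(predict, instance, baseline):
    """Exact Shapley attributions for one prediction of a small model.

    Features that have not yet been revealed keep their baseline value.
    Each attribution is the feature's marginal contribution averaged over
    every possible ordering of the features.
    """
    n = len(instance)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)            # start from the reference input
        previous_output = predict(current)
        for feature in order:
            current[feature] = instance[feature]   # reveal this feature
            new_output = predict(current)
            contributions[feature] += new_output - previous_output
            previous_output = new_output
    return [total / factorial(n) for total in contributions]


def toy_model(x):
    """Hypothetical scoring function with an interaction between x[0] and x[2]."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = exact_shapley(toy_model, instance, baseline)
print(attributions)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is shared equally between the two features involved. They describe how the model uses its inputs, not how those inputs act on the outcome in the real world.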
From a policy perspective, these findings may have relevant implications for how predictive models are used to identify students eligible for targeted support. While such models can assist in detecting patterns associated with academic resilience, their reliance on probabilistic classification introduces a potential risk of misclassification. In particular, false positive predictions may lead to the allocation of resources to students who do not meet the intended criteria, whereas false negatives may result in overlooking students who could benefit from additional support.
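The dependence of these two error types on the chosen classification cut-off can be made concrete with a short sketch. The labels and predicted probabilities below are invented for illustration and are not drawn from the study's data; `threshold_report` is a hypothetical helper.

```python
def threshold_report(y_true, scores, thresholds):
    """Count false positives and false negatives at each cut-off.

    Precision and recall are derived from the same counts; they are
    reported as NaN when their denominator is zero.
    """
    report = []
    for threshold in thresholds:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        report.append({
            "threshold": threshold,
            "false_positives": fp,
            "false_negatives": fn,
            "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
            "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        })
    return report


# Invented example: 1 = resilient, scores = predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.7, 0.4, 0.2, 0.1, 0.05]

report = threshold_report(y_true, scores, thresholds=[0.3, 0.5, 0.8])
for row in report:
    print(row)
```

In this toy example, raising the cut-off from 0.3 to 0.8 removes both false positives but leaves two of the three resilient students unidentified. Which of the two errors is more costly is a policy question rather than a property of the model.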
These limitations highlight the importance of caution when integrating artificial intelligence tools into educational decision-making processes. Predictive models may therefore be more appropriately considered as complementary instruments rather than as definitive mechanisms for allocating educational resources. Their use requires careful interpretation, institutional oversight, and alignment with broader pedagogical and policy objectives in order to minimise the risk of reinforcing existing inequalities.
Taken together, the findings of this study suggest that artificial intelligence may become a valuable instrument for analysing complex educational phenomena, provided that its implementation is accompanied by appropriate ethical and methodological frameworks. In line with recent research, the evidence presented here points to the importance of integrating fairness evaluation, interpretability, and contextual awareness into the design and application of predictive models. Under these conditions, machine learning systems may contribute not only to improving the understanding of educational processes but also to informing more reflective and context-sensitive policy discussions aimed at promoting equity within contemporary educational systems.
These implications should not be read as prescriptive recommendations or as definitive conclusions regarding institutional practice, but rather as analytically grounded reflections, derived from the empirical results, that may inform ongoing debates on the responsible use of artificial intelligence in education.
8. Limitations
Despite the empirical and methodological contributions offered by this study, several limitations must be acknowledged when interpreting the results and considering their potential application within educational contexts.
First, the data employed in this research derive from the international PISA 2022 assessment, which is based on a cross-sectional design. Although datasets of this nature enable the examination of educational patterns on a large scale and facilitate comparisons across educational systems worldwide, their cross-sectional character constrains the possibility of establishing causal relationships among the variables analysed. Consequently, the predictive models developed in this study should be interpreted primarily as analytical instruments for identifying associations and statistical regularities within the data, rather than as mechanisms capable of explaining causal processes that lead to academic resilience. This limitation has been widely acknowledged in studies drawing upon large-scale international educational assessments conducted by the Organisation for Economic Co-operation and Development (
OECD, 2023a,
2023c,
2023d). An additional limitation concerns the exclusion of sampling weights provided in the PISA dataset. Although this decision is consistent with the predictive focus of the study, it may affect the representativeness of the findings at the population level. Sampling weights are designed to account for complex survey designs and ensure that estimates accurately reflect national student populations. Their omission implies that the results should not be interpreted as population-level inferences, but rather as model-based patterns derived from the observed sample.
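The effect of omitting sampling weights can be shown with a minimal sketch that compares an unweighted share with a weighted one. The data are invented, `weighted_share` is a hypothetical helper, and the weights merely stand in for a final student weight (W_FSTUWT in the PISA student file). The sketch covers point estimates only; variance estimation in PISA additionally relies on replicate weights.

```python
def weighted_share(flags, weights=None):
    """Share of flagged cases.

    Without weights every case counts equally; with weights each case
    counts in proportion to the number of students it represents.
    """
    if weights is None:
        weights = [1.0] * len(flags)
    total = sum(weights)
    flagged = sum(w for f, w in zip(flags, weights) if f)
    return flagged / total


# Invented sample: 1 = classified as resilient.
resilient = [1, 1, 0, 0, 0]
student_weight = [10.0, 10.0, 40.0, 20.0, 20.0]  # stand-in for W_FSTUWT

unweighted = weighted_share(resilient)
weighted = weighted_share(resilient, student_weight)
print(unweighted, weighted)
```

Here the two flagged students represent comparatively few students in the population, so the weighted share (0.2) is half the unweighted share (0.4). The size and direction of such a gap in real data depend on the sampling design.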
Second, the methodological approach adopted in this research focuses on the development and evaluation of machine learning models designed for predictive purposes. While such techniques enable the identification of complex relationships between variables and often enhance predictive accuracy, they also present important challenges in terms of interpretability and generalisability. Although the present study incorporated interpretability techniques—including variable importance analysis and explanations based on SHAP values—machine learning models nevertheless remain simplified representations of highly complex educational processes. Accordingly, the results should be understood as approximate depictions of patterns embedded within the data rather than as comprehensive descriptions of the social and educational mechanisms shaping student performance (
Molnar, 2022). A further limitation relates to the use of a single plausible value (PV1MATH) as an indicator of mathematics performance. According to OECD guidelines, multiple plausible values should be used to properly account for measurement uncertainty in latent proficiency estimates. The reliance on a single plausible value, while common in predictive modelling contexts, may lead to an underestimation of variability and uncertainty in student achievement. Consequently, the results should be interpreted as approximations within a predictive framework rather than as fully robust representations of underlying proficiency distributions.
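For reference, the standard multiple-imputation combination rule (Rubin's rules) for pooling a statistic across plausible values can be sketched as follows. The function `combine_plausible_values` and the numbers are illustrative assumptions; the sketch presumes that the statistic and its sampling variance have already been computed separately for each plausible value (ten per domain in recent PISA cycles).

```python
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic estimated separately on each plausible value.

    The point estimate is the mean of the per-PV estimates. The total
    variance adds the average sampling variance to the between-PV
    (imputation) variance, inflated by (1 + 1/M).
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching lists with at least two plausible values")
    point = mean(estimates)
    within = mean(sampling_variances)     # average sampling variance
    between = variance(estimates)         # sample variance, denominator M - 1
    total = within + (1 + 1 / m) * between
    return point, total


# Invented per-PV means and sampling variances for three plausible values.
point, total_variance = combine_plausible_values(
    [500.0, 502.0, 498.0], [4.0, 4.0, 4.0]
)
standard_error = total_variance ** 0.5
print(point, total_variance, standard_error)
```

Using a single plausible value amounts to dropping the between-PV term, which is why the resulting uncertainty tends to be understated.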
Third, the operationalisation of academic resilience in this study follows the approach employed by the Organisation for Economic Co-operation and Development, which identifies students from socio-economically disadvantaged backgrounds who nevertheless achieve high levels of academic performance. Although this definition is widely adopted in the comparative education literature, it inevitably entails a degree of conceptual simplification. Educational resilience is inherently a multidimensional phenomenon encompassing personal, familial, school-level, and community-based factors, many of which cannot be fully captured through the quantitative indicators available in international databases (
OECD, 2023c,
2023d;
Agasisti & Longobardi, 2017).
Fourth, the models include academic performance variables from related domains, specifically reading and science scores, as predictors. Although these variables are not directly derived from the target variable, they are moderately correlated with mathematics performance, so part of the observed predictive accuracy may reflect shared variance across cognitive domains rather than independent predictive effects. This should be taken into account when interpreting the strength of the models’ discriminative capacity.
Finally, although the study incorporates an initial analysis of algorithmic fairness through the comparison of prediction rates across groups defined by gender, the assessment of potential algorithmic bias remains limited to a specific set of fairness metrics. The field of fairness in artificial intelligence recognises that multiple definitions and criteria exist for evaluating algorithmic justice, and these may lead to different conclusions depending on the analytical framework adopted (
Barocas et al., 2019;
Mehrabi et al., 2021). Future research could therefore extend the present analysis by incorporating additional fairness metrics, as well as bias-mitigation strategies during the model training process.
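How different fairness criteria can diverge is illustrated by the following sketch, which compares positive prediction rates (the basis of demographic parity) with true positive rates (the basis of equal opportunity) across two groups. The data are invented and `group_rates` is a hypothetical helper; the sketch does not reproduce the study's fairness analysis.

```python
def group_rates(y_true, y_pred, groups):
    """Positive prediction rate and true positive rate for each group."""
    rates = {}
    for group in sorted(set(groups)):
        members = [i for i, value in enumerate(groups) if value == group]
        actual_positive = [i for i in members if y_true[i] == 1]
        predicted_positive = sum(y_pred[i] for i in members)
        true_positive = sum(y_pred[i] for i in actual_positive)
        rates[group] = {
            "positive_prediction_rate": predicted_positive / len(members),
            "true_positive_rate": (
                true_positive / len(actual_positive)
                if actual_positive else float("nan")
            ),
        }
    return rates


# Invented example: 1 = resilient; groups are coded "F" and "M".
y_true = [1, 1, 0, 0, 1, 1, 1, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
gender = ["F", "F", "F", "F", "M", "M", "M", "M"]

rates = group_rates(y_true, y_pred, gender)
parity_gap = (rates["F"]["positive_prediction_rate"]
              - rates["M"]["positive_prediction_rate"])
opportunity_gap = (rates["F"]["true_positive_rate"]
                   - rates["M"]["true_positive_rate"])
print(rates, parity_gap, opportunity_gap)
```

In this toy example both groups receive positive predictions at the same rate, so the demographic parity gap is zero, yet one in three resilient students in group "M" is missed while none is missed in group "F". A model can therefore appear fair under one criterion and unfair under another.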
Taken together, these limitations do not invalidate the findings of the study; rather, they highlight the importance of interpreting the results within their specific methodological context. They also underscore the need for continued research that integrates quantitative approaches, interpretable methodologies, and responsible governance frameworks for the application of artificial intelligence within educational environments.
9. Future Research
The limitations identified in this study open several avenues for future research in the field of artificial intelligence applied to education.
First, future investigations could extend the present analysis through the use of longitudinal data, which would allow a more precise examination of the evolution of educational trajectories over time. Approaches of this nature would facilitate the analysis of academic resilience from a temporal perspective and would also enable researchers to identify factors that influence the persistence, transformation, or attenuation of educational inequalities throughout students’ academic pathways.
Second, it would be valuable to incorporate additional variables associated with institutional and contextual characteristics of schools. These could include dimensions such as pedagogical practices, the availability and distribution of educational resources, school climate, and institutional policies designed to support student learning. Integrating these elements into analytical frameworks could contribute to the development of more comprehensive predictive models and provide deeper insight into the interaction between individual-level characteristics and structural conditions shaping educational processes.
Another promising line of research involves advancing the study of algorithmic fairness within educational contexts. While the present study analyses differences in predictive performance across gender groups, future research could examine additional dimensions of equity, including disparities associated with socio-economic background, migration status, or membership of cultural and linguistic minority groups. Analyses of this nature would contribute to the development of more robust frameworks for assessing the fairness and social implications of artificial intelligence systems implemented in educational environments.
Moreover, future studies could investigate the transferability of predictive models across different educational systems. Given that the results of this research reveal substantial variation in levels of academic resilience across countries, it would be particularly relevant to examine whether models trained in specific national contexts maintain their predictive validity when applied to other educational systems characterised by different institutional structures, policy frameworks, and socio-cultural conditions.
Finally, an emerging line of inquiry concerns the integration of explainable artificial intelligence (XAI) tools within educational decision-making processes. The development of models that are more transparent and interpretable could strengthen institutional trust in the use of predictive technologies, while also facilitating their understanding by key stakeholders such as education policymakers, school administrators, and teachers.
Taken together, these prospective research directions suggest that the responsible development of artificial intelligence systems in education requires not only continued technical innovation but also an interdisciplinary perspective that integrates insights from educational research, the ethics of artificial intelligence, and scholarship on democratic governance.
10. Conclusions
The growing integration of artificial intelligence-based systems within educational environments is profoundly transforming the ways in which information is analysed, student performance is evaluated, and educational policies are designed. Within this context, the present study examined the potential and implications of machine learning models for identifying patterns associated with academic resilience among students who face socio-economic disadvantages. Drawing upon the analysis of more than six hundred thousand observations derived from the international PISA 2022 assessment, the research developed a predictive framework that integrates contemporary educational data science techniques with tools for model interpretability and algorithmic fairness assessment.
The empirical findings indicate that machine learning models based on ensemble methods—particularly those grounded in boosting techniques—achieve high levels of predictive performance in identifying resilient students. This result aligns with recent research in the field of educational data analytics, which has documented that algorithms capable of capturing complex and non-linear relationships among variables frequently outperform traditional statistical approaches when analysing large-scale educational datasets (
Romero & Ventura, 2020). Nevertheless, the high predictive accuracy observed should not be interpreted solely as a technical advantage. The capacity of these models to detect latent patterns also implies that they may reproduce, with considerable precision, the structural inequalities embedded within the data used for training.
In this regard, the analysis of variable importance and the explanations generated through interpretable modelling techniques indicate that academic resilience is strongly associated with structural factors linked to students’ social and educational contexts. Variables such as migration status, the socio-economic and cultural index, and grade repetition emerge as influential predictors in the classification of resilient students. These findings are consistent with the argument that academic performance cannot be understood merely as an expression of individual ability; rather, it reflects complex interactions among socio-economic conditions, educational trajectories, and the institutional characteristics of educational systems.
Furthermore, the study identified differences in positive prediction rates across groups of students defined by gender. Although such disparities do not necessarily indicate the presence of direct discrimination within the models, they suggest that algorithmic systems may reflect broader social patterns embedded within educational data. This observation is consistent with the literature on algorithmic fairness, which has emphasised that machine learning systems may reproduce structural inequalities if explicit mechanisms for bias evaluation and monitoring are not incorporated into the modelling process (
Mehrabi et al., 2021;
Barocas et al., 2019).
Based on these results, the study contributes to the contemporary debate regarding the role of artificial intelligence in educational research and in data-driven decision-making processes. In particular, the findings highlight the importance of integrating principles of transparency, interpretability, and fairness in the development of educational analytics tools. The systematic evaluation of fairness metrics, together with the use of explainable artificial intelligence techniques, can play a crucial role in understanding how algorithmic models generate their predictions and in identifying potential sources of disparity among groups of students.
Moreover, the findings suggest that the responsible integration of artificial intelligence within educational systems requires the development of institutional governance frameworks capable of guiding the design, implementation, and oversight of algorithmic technologies. Several international organisations—including the Organisation for Economic Co-operation and Development and UNESCO—have emphasised the importance of establishing normative principles to ensure that the use of artificial intelligence in socially sensitive domains such as education is grounded in criteria of equity, accountability, and respect for human rights (
OECD, 2019;
UNESCO, 2021). In this regard, the present study provides empirical evidence supporting the need to strengthen mechanisms of democratic governance for artificial intelligence applied to educational analysis.
In addition, the measurable disparities observed across student groups reinforce the need for ongoing scrutiny of algorithmic systems, particularly in domains such as education, where decisions may have long-term implications for individuals’ life trajectories. Ensuring that artificial intelligence systems operate in a manner that is not only accurate but also fair and transparent constitutes a central challenge for both researchers and policymakers.
Taken together, the findings of this study indicate that artificial intelligence-based models can serve as powerful tools for identifying complex patterns associated with academic resilience in large-scale educational data. However, predictive performance alone is insufficient as a criterion for evaluating such systems and must be balanced against fairness considerations when selecting models for educational applications. The observed trade-offs between precision and recall, as well as the disparities identified across gender groups, highlight the inherently normative dimension of algorithmic decision-making.
In this regard, the selection of classification thresholds and the interpretation of fairness metrics should not be treated as purely technical choices, but rather as context-dependent decisions that carry important educational and social implications. For instance, prioritising higher recall may support the broader identification of potentially resilient students, whereas emphasising precision may reduce the risk of misclassification but at the cost of excluding relevant cases.
Furthermore, the findings reinforce the importance of integrating interpretability and fairness assessments into the design and evaluation of machine learning models applied to education. Explainable artificial intelligence techniques, such as SHAP values, provide critical insights into the factors shaping model predictions, while fairness metrics enable the identification of potential disparities across student groups. The findings should nonetheless be interpreted within a predictive and exploratory framework, with appropriate caution regarding their generalisability to broader populations. The fairness analysis highlights potential disparities without constituting a definitive or exhaustive assessment of algorithmic fairness, and the policy implications derived from this study are likewise indicative rather than prescriptive, reflecting the analytical scope of the research.
Ultimately, the responsible deployment of artificial intelligence in educational contexts requires a balanced approach that combines predictive accuracy with transparency, fairness, and contextual awareness. By embedding these principles within analytical frameworks, machine learning can contribute not only to improved understanding of educational inequalities but also to the development of more equitable and accountable data-driven policies.