1. Introduction
PISA is the OECD’s Programme for International Student Assessment, which seeks to answer the question of “what citizens should know and be able to do” through internationally comparable measures of student performance [
1]. International large-scale assessments such as PISA provide comparative indicators of student achievement across various competence areas [
2]. In other words, one of PISA’s key advantages is that it provides international data demonstrating large-scale measures of student learning [
3]. PISA is an assessment administered to 15-year-old students worldwide every three years, showing the extent to which students have acquired the basic skills necessary for participation in social and economic life. The primary focus of the assessment is the extent to which students can apply what they have learned and adapt it to different contexts [
1]. PISA is regarded as a tool for promoting student learning and plays an important role in global basic education reform [
4]. This process can be regarded as a form of education monitoring. Wolter (2008) emphasizes that education monitoring is a comprehensive process that includes periodic comparisons of education systems using tools such as PISA and also aims to produce governance information for the field of education and its problems [
5].
In a changing and globalizing world, the growth and internationalization of capital have led to increased complexity in production, work, and governance, as well as to the emergence of new actors in public and institutional spheres. In this context, applications such as PISA, one of the OECD’s best-known programs, have gained a strategic and important place in many countries, particularly given the OECD’s role as an economic forum and as a data and information center [
6]. PISA is considered an important, valid, and reliable measure of student performance that influences educational policies and decision-making processes [
7]. The importance of PISA stems not only from ranking countries but also from providing an evidence base for policy design by revealing the strengths and weaknesses of education systems through objective indicators. Regular repetition of assessments such as PISA allows progress or regression to be monitored over time. It also sheds light on critical issues such as inequality, equal opportunity, and learning loss. Indeed, Wolter (2008) highlights the OECD’s efforts to standardize statistics, thus making different countries comparable at an international level, as an important development [
5].
A further significant development is the preference for new methods of processing and interpreting data in statistical evaluation. These methods, which support the empirical methodology of the social sciences and are used especially in labor-market and health research, can be adapted to other fields and have contributed to a better interpretation of the data obtained [
5]. It is precisely at this point that indicators come to the fore. Indicators function as a technology of governance in policymaking around the world. Kelley and Simmons (2020) argued that indicators trigger competition between states [
8]. It is stated that assessments conducted by international organizations, such as PISA, are becoming increasingly popular and accepted by countries, but that they also bring with them certain problems [
9]. PISA is an internationally recognized assessment designed to help governments monitor the outcomes of their education systems and is widely credited with assessing and reporting student achievement. With the rapid pace of globalization, competition for economic and human capital among countries has intensified, and quality education has come to be accepted as the key to improving national competitiveness [
4]. Evaluating the PISA results, which can be considered a source of big data for improving competitiveness, through indicators within a governance framework is deemed meaningful and valuable. Ensuring knowledge- and evidence-based governance and the management of education systems requires statistical information and administrative information systems. Today, this need is even more crucial than in the past [
5]. Competition between countries and societies is perhaps the most important reason for this situation.
PISAM, PISAR, and PISAS, the key components of PISA, represent the domains of mathematics, reading, and science, respectively. PISAM focuses on competencies such as mathematical reasoning, quantitative modeling, and decision-making under uncertainty, providing direct signals about individuals’ numerical literacy and problem-solving capacity. PISAR measures the ability to understand texts, make inferences, integrate information, and assess source credibility, reflecting the quality of the deep literacy that underpins lifelong learning. PISAS, in turn, reveals the level of scientific literacy through the components of scientific thinking, evidence evaluation, hypothesis generation, and scientific argumentation; it is a critical indicator of the sustainability of the STEM ecosystem and of the capacity for evidence-based decision-making. When the three domains are considered together, it becomes possible to understand not only average achievement but also the lower and upper ends of the achievement distribution, the balance between domains, and trends over time. This allows for a concrete analysis of which student groups need more support, the breadth of the talent pool, and the degree of system consistency. This multidimensional picture presented by PISA shows that educational outcomes are related not only to classroom practices but also to broader country and regional dynamics.
In this context, a SHapley Additive exPlanations (SHAP)-based explainable artificial intelligence approach makes it possible to interpret observed performance levels in the PISAM–PISAR–PISAS domains within a transparent framework. SHAP decomposes the decision logic of multi-class models, making visible which effects play a role, in what direction, and under what conditions across the low–medium–high performance bands. Through global importance rankings, class-specific impact profiles, and interaction patterns, the dynamics that increase or decrease PISA performance are presented in a clear narrative. When interpreting PISAM, PISAR, and PISAS, the main contextual lenses are indicators of governance quality (control of corruption, government effectiveness, regulatory quality, rule of law, political stability, voice and accountability), economic capacity and income level (total economic size (GDP), per capita income (GDP per capita), and an economic category index), level of democracy (democracy index), regional location/clustering (Europe, Asia–Oceania, Africa, and America), and a time parameter (differences between years). The effects of these dynamics on the low–medium–high bands of PISA performance are transparently decomposed by SHAP.
Consequently, the study positions PISA not merely as a scorecard but as a lens for understanding skill profiles and the patterns shaping them across PISAM, PISAR, and PISAS. This SHAP-supported approach interprets findings in a transparent, replicable, and action-oriented manner, allowing policymakers to see more clearly which levers to prioritize. Unlike the existing literature, where most explainable AI applications remain at the student or school level, this study aims to reframe PISA performance from the question of “who scored higher?” to “which institutional structure is associated with which achievement profile?” by considering governance quality, economic capacity, and regional institutional positioning at the country/region level within the same transparent modeling framework. This can inform policymakers by providing decision-support guidance for developing sustainable education policies and contribute to a clearer understanding of the relationship between education policies and PISA scores. Aligned with the United Nations’ 2030 Agenda (SDG 4: Quality Education), the proposed framework provides policy-relevant, evidence-based signals to support sustainable education policy prioritization.
Contributions and novelty. This study introduces an explainable, country/region-level framework for interpreting PISA performance that integrates governance quality, economic capacity, and regional/institutional clustering within a unified multi-class pipeline. Unlike prior PISA–XAI applications that predominantly emphasize student- or school-level predictors, our approach shifts the unit of explanation to macro-structural drivers and produces class-conditional SHAP profiles that map Low/Medium/High achievement bands to interpretable structural patterns. In addition, we quantify explanation reliability using Fidelity and Faithfulness, enabling a transparent assessment of how well the explanations track the model’s behavior. Together, these contributions provide a reproducible and policy-facing representation of “structural performance profiles” aligned with SDG 4 as a decision-support tool rather than a causal claim.
3. Methodology
This section outlines the analysis pipeline from data preparation to explainability verification. First, PISA math/reading/science scores were defined as the target variables, and datasets containing economic, governance, democratic, and regional indicators for each country were generated, cleaned, and scaled. Then, to increase the statistical reliability of the variables used, two-stage feature selection was applied: highly correlated variables were eliminated, and the remaining ones were reduced using the VIF to control multicollinearity. Random Forest and XGBoost models were trained; the models were evaluated on accuracy, F1, recall, precision, ROC-AUC, and train-test consistency, and overfitting and underfitting were checked. The most generalizable, robust model was explained with SHAP, and the contribution of each variable to the PISA outcome was quantified. Finally, the reliability of these explanations was tested using two measures: fidelity (how well a surrogate built on the explanation mimics the model) and faithfulness (how well the explanation’s importance ranking tracks the model’s actual sensitivity to each feature).
3.1. Creating Data Sets and Determining Target Variables
In this study, various features affecting PISA scores and consequently the sustainable education goal were examined. Our target variables in the study were PISAM, PISAR, and PISAS, and three separate datasets were created for each target variable using the same explanatory features. This approach enables the modeling of different criteria and allows for comparative analysis. The independent variables included in the datasets cover political, governance, and economic factors. These variables were selected to represent the multidimensional structure affecting educational outcomes (PISA scores) and consist of indicators frequently used in the literature. All variables used in the study, along with their abbreviations and descriptions, are presented in
Table 1, based on data obtained from sources such as the World Bank, OECD, and The Economist Intelligence Unit. In the data preprocessing stage, missing values were checked. Specifically, PISA scores (which were unavailable for every year or for some countries in certain years) were imputed using trend-based interpolation methods, taking inter-year trends into account. Furthermore, all numerical variables were scaled to make them comparable. These datasets, created independently for each target variable, constitute the primary data source for machine learning model training and explainability analyses in the subsequent stages.
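As an illustration of the trend-based imputation and scaling described above, the following sketch uses pandas on a hypothetical country-year panel; the column names and values are invented for illustration only (the paper's actual variable set is listed in Table 1):

```python
import pandas as pd

# Hypothetical panel: one row per country-year, with gaps in PISA scores
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year":    [2012, 2015, 2018, 2022] * 2,
    "PISAM":   [480.0, None, 495.0, 500.0, 420.0, 430.0, None, 445.0],
})

# Trend-based imputation: interpolate along the year axis within each country
df["PISAM"] = (
    df.sort_values(["country", "year"])
      .groupby("country")["PISAM"]
      .transform(lambda s: s.interpolate(method="linear", limit_direction="both"))
)

# Scale to a comparable range (z-score standardization as one common choice)
df["PISAM_scaled"] = (df["PISAM"] - df["PISAM"].mean()) / df["PISAM"].std()

print(df)
```

Interpolating within each country keeps one country's trend from leaking into another's, which is the point of taking inter-year trends into account.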
3.2. Correlation Analysis Between Features and Feature Selection
Multicollinearity between input features can distort feature contributions in XAI-based explanations, leading to model misinterpretation [53,54,55]. To mitigate this issue and avoid unnecessary information duplication, a two-stage feature selection method was implemented. In the first stage, the correlation matrix of the features was extracted, and one feature from each pair with an absolute correlation of 0.90 or above was eliminated by the Greedy Elimination method. In the next stage, the remaining features were subjected to Variance Inflation Factor (VIF) analysis, and features with VIF greater than 7.0 were eliminated.
A greedy approach is a class of heuristics that performs no backtracking and gradually constructs a solution by choosing the locally best option at each step [56]. The VIF measures the factor by which the coefficient variance of each feature in a regression model is inflated due to multicollinearity with the other independent features [57].
To preprocess multicollinearity, a correlation-threshold-based Greedy Elimination algorithm was applied before the VIF analysis (e.g., |ρ| ≥ 0.90). At each step, the algorithm finds the pair (x_i, x_j) with the highest absolute correlation within the current feature set S. To decide which feature of this pair to eliminate, the average absolute correlation of each candidate with all other variables in the set S is calculated by Equation (1), and the feature with the higher average correlation is eliminated by Equation (2):

ρ̄_k = (1 / (|S| − 1)) · Σ_{x_j ∈ S, j ≠ k} |ρ(x_k, x_j)|,  k ∈ {i, j}   (1)

x* = argmax_{x_k ∈ {x_i, x_j}} ρ̄_k   (2)

This process is repeated until all pairwise correlations in the set S fall below the threshold τ. Here:
S: the set of features evaluated in the relevant step of the algorithm (not yet eliminated);
x_i, x_j: individual features of the set S;
τ: the predetermined threshold value that defines high correlation (e.g., τ = 0.90);
ρ(x_i, x_j): the correlation coefficient (Pearson or Spearman) between features x_i and x_j;
ρ̄_k: the average absolute correlation of feature x_k with all other features in its set (a measure of the systemic redundancy of the feature);
|S|: the number of elements of the set S (the number of remaining features);
x*: the “greedy” target feature selected to be eliminated from the set.
The VIF value is defined in Equation (3):

VIF_i = 1 / (1 − R_i²)   (3)

Here, R_i² is the coefficient of determination obtained by regressing the ith feature against all other features. Features with high VIF values were eliminated, and the process was repeated for the remaining features. The process was completed when all VIF values fell below the specified threshold.
In the second stage, a distance value d_ij = 1 − |ρ(x_i, x_j)| is calculated from the absolute correlation between the features remaining after the VIF step, as in Equation (4). With these distance values, a fully connected weighted graph G is created with all features as nodes.
Consequently, this two-stage flow, which reduces high pairwise correlations (e.g., τ = 0.90) with Greedy Elimination and then removes multicollinearity in the remaining variables with the VIF criterion (VIF > 7.0), makes the feature set more independent, statistically more stable, and more informative with respect to the target variable. This supports predictive performance and improves the reliability of XAI-based explanations, and significantly reduces the risk of interpretations being influenced by data-driven dependencies in subsequent analysis steps. The full step-by-step procedure is provided in
Supplementary Algorithm S1.
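The two-stage flow can be sketched as follows. This is a minimal reimplementation for illustration, not the authors' Supplementary Algorithm S1; the toy data, function names, and the VIF regression via ordinary least squares are assumptions, while the thresholds mirror the text (τ = 0.90, VIF > 7.0):

```python
import numpy as np
import pandas as pd

def greedy_corr_elimination(X: pd.DataFrame, tau: float = 0.90) -> list:
    """Drop one feature from the most-correlated pair until all |rho| < tau."""
    keep = list(X.columns)
    while len(keep) > 1:
        corr = X[keep].corr().abs().to_numpy()
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < tau:
            break
        # Equation (2): eliminate the pair member with the higher mean |rho|
        keep.pop(i if corr[i].mean() >= corr[j].mean() else j)
    return keep

def vif(X: pd.DataFrame, col: str) -> float:
    """VIF_i = 1 / (1 - R_i^2), from regressing feature i on the others."""
    y = X[col].to_numpy()
    A = np.column_stack([X.drop(columns=col).to_numpy(), np.ones(len(X))])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / max(1.0 - r2, 1e-12)

def vif_elimination(X: pd.DataFrame, threshold: float = 7.0) -> list:
    """Iteratively drop the feature with the largest VIF above the threshold."""
    keep = list(X.columns)
    while len(keep) > 2:
        vifs = {c: vif(X[keep], c) for c in keep}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:
            break
        keep.remove(worst)
    return keep

# Toy data: "b" nearly duplicates "a", so one of the pair should be dropped
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame({"a": base,
                  "b": base + rng.normal(scale=0.05, size=200),
                  "c": rng.normal(size=200)})
selected = vif_elimination(X[greedy_corr_elimination(X)])
print(selected)
```

Running the two stages in this order first removes near-duplicates cheaply, then catches the subtler many-variable dependencies that pairwise correlation misses.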
3.3. Classification with the Machine Learning Model
Each correlation-optimized dataset was used to train two tree-based models (Random Forest and XGBoost) separately for each target (PISAM, PISAR, and PISAS), and the best-performing configuration was selected. The model configuration process was carried out independently for each target variable. To monitor the risk of overfitting, the difference between training and test accuracy was calculated, with the aim of keeping this difference low. In addition, Recall, Precision, F1, and ROC-AUC were calculated to evaluate the models from different perspectives.
Definition of Low/Medium/High classes and threshold rationale. The three achievement classes were constructed by applying fixed score thresholds to each PISA domain, yielding Low/Medium/High labels for the multi-class prediction task. Consistent with the implementation shared for reproducibility, we used domain-specific cutoffs: for PISAM, Low < 420, Medium 420–490, High ≥ 490; for PISAR, Low < 420, Medium 420–488, High ≥ 488; and for PISAS, Low < 425, Medium 425–495, High ≥ 495. These thresholds were selected after inspecting the empirical score distributions and ensuring that each class retains sufficient support for stable model training and interpretation. To reduce the risk that overall performance is driven by a dominant class, model development and selection emphasize balanced class-wise performance (reported via class-wise Precision/Recall/F1 and Macro-F1), so that the learned separation remains meaningful across all three profiles.
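The banding rule above can be expressed directly in code; the cutoffs below are the ones stated in the text, while the function and variable names are illustrative:

```python
import pandas as pd

# Domain-specific cutoffs from the text: Low < low, Medium [low, high), High >= high
CUTOFFS = {
    "PISAM": (420, 490),
    "PISAR": (420, 488),
    "PISAS": (425, 495),
}

def to_band(score: float, domain: str) -> str:
    """Map a raw PISA domain score to its Low/Medium/High class label."""
    low, high = CUTOFFS[domain]
    if score < low:
        return "Low"
    return "Medium" if score < high else "High"

scores = pd.Series([390, 450, 510], name="PISAM")
bands = scores.map(lambda s: to_band(s, "PISAM"))
print(list(bands))  # ['Low', 'Medium', 'High']
```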
Accuracy: the proportion of all predictions that the model classifies correctly.
F1 score: the harmonic mean of Precision and Recall, measuring the balance between the two [58].
ROC-AUC: reflects the success of the model in distinguishing positive and negative classes under different thresholds [
59].
Recall: In binary or multi-class classification, it measures how many of the actual positives were correctly detected. This metric is expressed by the formula Recall = TP / (TP + FN). Here, TP represents true positives (positives correctly predicted for the class), and FN represents false negatives (actual positives incorrectly predicted as negative) [60]. It answers the question, “How many of the actual positives were correctly predicted?”
Precision: Measures how many of the model’s positive predictions are truly positive. This metric is expressed by the formula Precision = TP / (TP + FP), where FP represents false positives (negatives incorrectly predicted as positive) [61]. It answers the question, “How many positive predictions are truly positive?”
The primary goal of the model’s hyperparameter optimization was not only to increase accuracy but also to establish a balance between F1, Recall, and Precision. This fine-tuning process was applied to both Random Forest and XGBoost models for each of the PISAM, PISAR, and PISAS target variables. The measurement strategy was multifaceted. The Macro-F1 score was used as the primary selection criterion. When Recall decreased, particularly in the middle and minority classes, improvement was prioritized with this metric. When the model produced excessive false positives, a balance was established with the Precision metric. Additionally, the ROC-AUC score was monitored to assess the model’s threshold-independent discriminatory power. Generalization was critical. Therefore, overfitting was prevented by always keeping the difference between training and test accuracies low.
Consequently, models with high Macro-F1 and strong ROC-AUC scores were selected for each target. These final models also exhibited a balanced Precision-Recall profile and robust generalization ability. This careful selection process laid a reliable foundation for the subsequent SHAP-based explanations. Using Macro-F1 as the primary criterion also reduces the risk of majority-class dominance by enforcing balanced performance across all achievement classes.
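A minimal sketch of this selection logic, assuming scikit-learn and a synthetic stand-in dataset (the paper uses real country-level features and also tunes XGBoost, which is omitted here for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class stand-in for a Low/Medium/High PISA band dataset
X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Macro-F1 as the primary hyperparameter selection criterion
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 150], "max_depth": [4, None]},
    scoring="f1_macro", cv=3,
)
search.fit(X_tr, y_tr)
best = search.best_estimator_

# Generalization check: the train-test accuracy gap should stay small
train_acc = accuracy_score(y_tr, best.predict(X_tr))
test_acc = accuracy_score(y_te, best.predict(X_te))
macro_f1 = f1_score(y_te, best.predict(X_te), average="macro")
print(f"train={train_acc:.3f} test={test_acc:.3f} macroF1={macro_f1:.3f}")
```

`scoring="f1_macro"` weights every class equally, which is what keeps a dominant class from inflating the selection criterion.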
3.4. Explainable Artificial Intelligence Algorithm Used: SHAP
After checking for overfitting and underfitting and selecting a model with a relatively high ROC-AUC value, SHAP was used to make the structure of the model interpretable and to evaluate the contribution levels of the features to the model. SHAP is a local explainability tool that calculates the marginal contribution of each feature for each prediction based on the Shapley values, allowing for the feature-by-feature decomposition of the model output [
62]. In this study, global SHAP values were used to examine the general trends of the features. Equation (5) presents the basic mathematical definition of the method and the functional descriptions of the components included in this definition.
The SHAP (SHapley Additive exPlanations) method defines the contribution of each feature to the output of a machine learning model based on cooperative game theory. This approach aims to fairly and consistently quantify the contribution of each feature to the model’s prediction through Shapley values [63]. The SHAP value φ_i for the ith feature is calculated as in Equation (5):

φ_i = Σ_{S ⊆ F \ {i}} [ |S|! · (|F| − |S| − 1)! / |F|! ] · [ f_{S ∪ {i}}(x_{S ∪ {i}}) − f_S(x_S) ]   (5)

Here, f is the prediction function (e.g., the XGBoost model), x is the model instance to be explained, F is the entire feature set, S ranges over the subsets of F that do not contain the ith feature, f_S(x_S) is the prediction made with only the features in S, and φ_i is the contribution value of the ith feature. This equation calculates the contribution of a feature by comparing all subset combinations with and without the relevant feature. Since the calculation weights all subset orderings equally, it ensures a fair distribution. As defined in Equation (5), SHAP is essentially a method for generating local explanations. However, the global contribution level of each feature is obtained by averaging these local values over all observations, Φ_i = (1/N) · Σ_{n=1}^{N} |φ_i^(n)|. Here, Φ_i is the average of the absolute individual contribution values of the ith feature across the instances to be explained.
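Equation (5) can be verified on a toy model by enumerating all coalitions exactly. In this sketch, features outside a coalition S are replaced by baseline values, a common simplification of f_S(x_S); the full enumeration is only feasible for small |F|, which is why practical SHAP implementations use efficient approximations such as TreeSHAP:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for f(x), following Equation (5)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                # Coalition weight |S|! (|F| - |S| - 1)! / |F|!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

# Toy linear model: Shapley values recover each term's contribution
f = lambda z: 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]
phi = shapley_values(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # approximately [2.0, 3.0, -1.0]
```

The global importance used in this study would then be the mean of |φ_i| over all observations.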
3.5. Evaluation of the Explainability Methods: Fidelity and Faithfulness
The success of explainable artificial intelligence (XAI) methods should be evaluated not only by their intuitive plausibility but also by their quantitatively measurable reliability. In this context, the validity of the SHAP explanations was examined using two primary metrics: Fidelity and Faithfulness. These metrics indicate how well the explanations fit the model and how faithfully they reflect its decision process. Fidelity measures how well a model created with the features identified as “important” by the explanation method mimics the decisions of the original model [
64]. Faithfulness assesses the extent to which the features identified as important using the explanation method actually influence the model’s decisions. To do this, the decrease in the model’s predictive accuracy is measured by changing the value of each feature. The magnitude of these decreases is then compared with the importance ranking assigned using the explanation method. If the features identified as most important are the ones that most significantly influence the model’s predictions, the explanation is concluded to be faithful to the model’s decisions [
65]. The full step-by-step procedures for these Fidelity (surrogate-model agreement) and Faithfulness (permutation-based sensitivity) checks are provided in
Supplementary Algorithms S2 and S3.
Faithfulness is operationalized as the agreement between two rankings: the “impact” ranking, defined by the performance drop observed when a feature is perturbed (reflecting its actual influence on the trained model), and the “explanatory power” ranking, defined by the importance scores assigned by the explanation method (estimated importance). A higher rank agreement (i.e., stronger correlation between these rankings) indicates that the explanation method assigns higher importance to the features that truly affect the model’s predictions, and is therefore interpreted as higher faithfulness.
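The two checks can be sketched as follows, assuming scikit-learn and using the model's built-in feature importances as a stand-in for SHAP global values. Fidelity is measured as surrogate-model agreement and faithfulness as the rank correlation between permutation-induced accuracy drops and the importance scores; this mirrors Supplementary Algorithms S2 and S3 only in spirit, not in implementation detail:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Stand-in importance scores (SHAP global values in the paper)
importance = model.feature_importances_

# Fidelity: can a surrogate trained on the top-k "important" features
# reproduce the original model's predictions?
top_k = np.argsort(importance)[::-1][:4]
surrogate = RandomForestClassifier(random_state=0).fit(
    X_tr[:, top_k], model.predict(X_tr))
fidelity = np.mean(surrogate.predict(X_te[:, top_k]) == model.predict(X_te))

# Faithfulness: rank agreement between permutation-induced accuracy drops
# (actual influence) and the importance ranking (estimated importance)
base_acc = np.mean(model.predict(X_te) == y_te)
drops = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(base_acc - np.mean(model.predict(X_perm) == y_te))

rank = lambda a: np.argsort(np.argsort(a))
faithfulness = np.corrcoef(rank(np.array(drops)), rank(importance))[0, 1]
print(f"fidelity={fidelity:.2f} faithfulness={faithfulness:.2f}")
```

Note that fidelity is computed against the model's predictions, not the true labels: a high-fidelity surrogate can imitate even a wrong model.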
4. Results
This section reports the findings at four levels. First, by eliminating high correlation and multicollinearity among the features, a statistically more stable feature set was obtained, and this set served as the primary input for the analysis. Second, XGBoost models achieved the highest classification success for the PISA math/reading/science targets; performance was assessed using both general metrics (Accuracy, ROC-AUC, etc.) and class-based Precision–Recall–F1 results. Third, the SHAP analysis revealed which economic, governance, and regional indicators were most influential in model decisions for each target variable and how this influence differed across achievement classes. Finally, the reliability of the resulting SHAP explanations was measured; Fidelity and Faithfulness scores indicated that the explanations closely matched the model outputs and tracked the model’s decision logic with a high degree of agreement.
4.1. Choices Based on Correlation Analysis Among Features
The dataset comprises 18 numerical features. Feature selection was conducted in two stages: (i) correlation-based elimination (|r| ≥ 0.90, Pearson) and (ii) iterative VIF elimination (threshold VIF > 7.0). Year_C was retained in the correlation step. In the correlation step, the highest correlations were: PISAR–PISAS (|r| = 0.977), PISAM–PISAS (0.970), RL–GE (0.944), RQ–GE (0.906). Therefore, PISAR, PISAM, RL, and RQ were eliminated; the remaining ones were: Year_C, CoC, GE, PS, VA, PISAS, DEM, RCAT_AF, RCAT_AME, RCAT_AO, RCAT_EU, GDP, GDPPC, and ECAT.
In the VIF step, VIF = 7.63 was determined for GE and it was eliminated because the threshold was exceeded. In the next iteration, the maximum VIF dropped to 6.42 and the process was stopped. Final feature set: Year_C, CoC, PS, VA, PISAS, DEM, RCAT_AF, RCAT_AME, RCAT_AO, RCAT_EU, GDP, GDPPC, ECAT. Final VIF values (first five in decreasing order): RCAT_EU = 6.42, RCAT_AME = 4.27, RCAT_AO = 4.24, CoC = 4.12, PS = 3.49. Others: VA = 3.38, ECAT = 2.73, GDPPC = 2.64, PISAS = 2.50, RCAT_AF = 2.38, DEM = 1.53, GDP = 1.22, Year_C = 1.19. In summary, after the correlation and VIF eliminations, a final set of 13 features was obtained.
4.2. Machine Learning Model Results
For each target variable (PISAM, PISAR, and PISAS), the best performing machine learning model was found to be XGBoost. Class-level metrics (Precision, Recall, F1-Score, Support) are presented in
Table 2; general/summary metrics (Train Accuracy, Test Accuracy, ROC-AUC, Macro Avg, Weighted Avg, Support) are given in
Figure 1.
4.3. Class-Distinguished SHAP Feature Importance Based on Target Variable
Figure 2 shows the top 10 features that most significantly impact model predictions for the target variables in PISAM (a), PISAR (b), and PISAS (c), and the distribution of this effect by class. The horizontal axis shows the mean absolute SHAP value, which represents the average effect size of each feature on the model output. Higher values indicate that the model relies more on that feature when making decisions. The features listed on the vertical axis (e.g., VA, CoC, GDP, RCAT_EU, etc.) are the most influential features in this respect. Each bar consists of three colors, and these colors correspond to the model’s classes. The blue portion represents the average absolute SHAP contribution for Class 0, the pink portion for Class 2, and the olive/green portion for Class 1. Thus, while the total length of the bar for a single feature represents that feature’s overall importance, the division of the bar into colored components shows the class in which this importance is concentrated. For example, if the blue portion of a feature’s bar is dominant, that feature is particularly strong at explaining Class 0 predictions. If the pink part is dominant, it is more critical in explaining Class 2 decisions. If the green part is dominant, it is more decisive in Class 1 predictions.
The graphs provide two levels of information. First, they show the overall importance ranking of the features, where the top-ranked features are the primary determinants of the model’s decision-making process. Second, they reveal a class-based decomposition, meaning that the same feature can have different impacts across different classes.
Figure 2a–c illustrate which structural, economic, governance, or regional indicators the model uses to distinguish between PISAM (a), PISAR (b), and PISAS (c), and which class these indicators play a greater role in distinguishing.
4.4. Target Variable and Class-Level SHAP Data
Figure 3a–i presents the class-based SHAP distributions for the PISAM, PISAR, and PISAS target variables. These graphs illustrate which features influence class prediction and in what direction. Each row represents a target variable (PISAM, PISAR, and PISAS, respectively), and each column represents the Class 0, Class 1, and Class 2 predictions for the corresponding target. For the first 10 features in each panel, the dots represent the SHAP values at the observation level. The horizontal position of the dot indicates the direction and magnitude of the effect of the feature on the model output, either toward (+) or away from (−). For example, if red dots corresponding to high values of a feature are stacked to the left (negative SHAP), this decreases the model’s ability to predict that class; if the same red dots are stacked to the right (positive SHAP), this increases the model’s ability to predict that class. The color scale indicates the level of the feature value (blue: low, pink/red: high). This visual presents the most influential features of the model for each target and each class, along with the contribution distribution of these features across the samples.
4.5. SHAP Fidelity and Faithfulness Results
Fidelity and Faithfulness values were calculated to assess the reliability of the SHAP explanations for the PISAM, PISAR, and PISAS target variables. Fidelity indicates the extent to which the SHAP explanations are consistent with the model predictions, while Faithfulness indicates the extent to which the explanations faithfully reflect the model’s decision logic. Both metrics are shown as separate bars for each target. For PISAM, Fidelity was 0.95 and Faithfulness was 0.85; for PISAR, Fidelity was 0.89 and Faithfulness was 0.92; and for PISAS, both metrics were 0.89.
5. Discussion
This section interprets the findings and discusses their practical implications. First, the collinearity structure among the variables is examined to explain which indicators move together and why the final feature set was reduced to its present form. Second, the performance of the machine learning models is evaluated, and it is discussed whether this performance is suitable for reliably classifying PISA achievement levels. Third, using the SHAP results, it is examined which governance, economic, and regional factors affect which targets (math, reading, science) in the general framework and how these factors differ across classes. Fourth, it is detailed how SHAP values behave on a class-by-class basis, showing each class’s distinctive institutional/economic signature. Finally, the reliability of the explanations (Fidelity and Faithfulness), the limitations of the study, and how the method could be extended for policy purposes are discussed.
5.1. Collinearity Detection and Analysis
Excessive multicollinearity is particularly concentrated in the PISA subscales and certain governance indicators; therefore, retaining PISAS as the sole representative reduces information redundancy while adequately summarizing cognitive performance. The exclusion of GE due to VIF > 7 indicates that its marginal explanatory power remains weak because it carries a strong common signal with RL/RQ. In contrast, CoC, PS, and VA remain below the threshold, continuing to carry distinct channels of governance into the model. Although RCAT_EU has the highest VIF among the regional dummies, its sub-threshold value implies that regional effects cannot be fully explained by GDP/GDPPC and contain independent variance. ECAT’s lack of complete overlap with the continuous economic indicators supports the complementary power of the “level + continuous” combination. Year_C’s low VIF indicates that the time trend has limited overlap with the other indicators and can be reliably retained. Overall, the smaller, 13-variable final set provides an interpretable foundation that is expected to yield practical gains in coefficient stability, smaller train–test differences, and smoother ROC-AUC profiles with more stable PR/F1 scores. However, it should be noted that structurally collinear common-source indices may remain, and the findings should be interpreted with a focus on prediction/interpretation rather than causality.
5.2. Machine Learning Model Analysis
The results in
Table 2 and
Figure 1 reveal that the models (PISAM, PISAR, and PISAS) perform quite robustly for each target. When
Table 2 is examined, most of the class-level Precision, Recall, and F1-Score values are above 0.90, and some (e.g., PISAS Low-class Recall = 1.00) are close to perfect. Even for the medium classes (e.g., PISAR Medium F1 = 0.93), performance is satisfactory. Among the overall performance metrics in
Figure 1, Train Acc is 1.0 and Test Acc is 0.93–0.95, indicating good generalization ability. ROC-AUC values in the 0.98–1.00 band confirm very high discriminative power. The close agreement between the Macro Avg and Weighted Avg (≈0.93–0.95) indicates that the models maintain strong performance despite class imbalance. These results are promising for practical applications with SHAP and demonstrate a generally successful model output.
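The per-class metrics and the macro vs. support-weighted averages discussed above follow directly from the confusion matrix. A minimal numpy sketch (function and variable names are ours, illustrative only):

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes=3):
    """Per-class precision/recall/F1 plus macro and support-weighted averages,
    all computed from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    prec = np.divide(tp, cm.sum(axis=0), out=np.zeros_like(tp), where=cm.sum(axis=0) > 0)
    rec = np.divide(tp, cm.sum(axis=1), out=np.zeros_like(tp), where=cm.sum(axis=1) > 0)
    f1 = np.divide(2 * prec * rec, prec + rec, out=np.zeros_like(tp), where=(prec + rec) > 0)
    support = cm.sum(axis=1)                      # true count per class
    macro = f1.mean()                             # unweighted mean across classes
    weighted = (f1 * support).sum() / support.sum()  # support-weighted mean
    return f1, macro, weighted
```

When the classes are balanced (or each class is classified about equally well), the macro and weighted averages nearly coincide, which is the pattern reported for these models.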
5.3. Class-Distinct SHAP Feature Importance Analysis
Figure 2a shows that the strongest determinants of the model for PISAM are governance indicators. The “Voice and Accountability (VA)” and “Control of Corruption (CoC)” features have the highest total SHAP values and are more dominant than the economic indicators (e.g., GDP). In addition, the fact that political stability (PS) and regional location signals (e.g., RCAT_EU) are in the top rankings indicates that not only economic capacity but also institutional quality, stability, and specific regional clusters are used to explain PISAM levels. This indicates that class differences in PISAM are decomposed along the axis of “institutional core + contextual location”.
Figure 2b shows that for PISAR, VA and CoC are again in the first place, but this time, economic scale (GDP) and welfare level (GDPPC) come into play more prominently. This structure suggests that the PISAR result is sensitive to both institutional qualities and economic capacity together. In other words, it implies that high PISAR values are associated with the combination of “good governance + high income”.
Figure 2c reiterates the importance of the governance indicators (VA, CoC) for PISAS, but here the regional category features (e.g., RCAT_EU, RCAT_AO) move to the forefront. This shows that PISAS scores decompose systematically across geographical blocs and that the model actively exploits this regional structure. Furthermore, the contribution of political stability (PS) and democracy (DEM) indicators is particularly evident in separating the medium class.
When
Figure 2a–c are read together, the following pattern emerges: governance quality (especially VA and CoC) is the most salient predictor in the model across all targets. However, PISAR is more closely linked to economic capacity, whereas PISAS is more closely linked to regional location. This indicates that each target variable explains the same institutional core with different complementary dimensions (economy or geography).
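The global rankings read off Figure 2a–c correspond to the standard mean-|SHAP| aggregation per class. A sketch of that aggregation, assuming the multi-class SHAP values are stored as an array of shape (classes, samples, features):

```python
import numpy as np

def global_importance_by_class(shap_values, feature_names):
    """shap_values: array of shape (n_classes, n_samples, n_features).
    Returns, per class, the features ranked by mean |SHAP| (global importance)."""
    mean_abs = np.abs(shap_values).mean(axis=1)   # (n_classes, n_features)
    rankings = {}
    for c in range(mean_abs.shape[0]):
        order = np.argsort(mean_abs[c])[::-1]     # descending importance
        rankings[c] = [(feature_names[j], float(mean_abs[c, j])) for j in order]
    return rankings
```

A shared "institutional core" in the sense used above would show up as the same governance features topping the ranking for every class and target, with the economy/region features reordering below them.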
Cross-domain consistency and domain-specific complements. A key contribution of our findings is the consistent emergence of a shared institutional core across all three domains: governance-related signals—most notably Voice and Accountability (VA) and Control of Corruption (CoC)—remain primary drivers for PISAM, PISAR, and PISAS. Domain differences arise not because the core disappears, but because complementary dimensions become more salient by domain: economic capacity indicators are more prominent for reading (PISAR), whereas regional-bloc differentiation is more pronounced for science (PISAS). This “stable core + domain-specific complement” structure provides an evidence-based interpretation of why drivers differ across domains while remaining coherent within a single country-level framework.
Comparison with prior cross-national evidence. The pattern we observe is consistent with the cross-national literature that links PISA performance to governance quality (WGI dimensions such as control of corruption, political stability, voice and accountability), economic capacity (income/GDP per capita), and regional clustering and shared institutional heritage. In this sense, our global SHAP rankings provide an interpretable confirmation that institutional quality forms a common "institutional core" behind achievement profiles, while economy and region act as complementary lenses that become more salient depending on the competency domain.
Importantly, our class-distinct explanations refine this prior evidence by showing how the same structural family decomposes differently across domains: reading (PISAR) more strongly integrates economic capacity into the institutional core, whereas science (PISAS) differentiates more clearly across regional blocs, together with governance signals. These results should be interpreted as predictive/explanatory associations at the country level rather than causal effects, but they provide a transparent mapping of which macro signals systematically push observations toward low/medium/high profiles.
5.4. Target Variable and Class-Level SHAP Analysis
Table 3,
Table 4 and
Table 5 summarize the institutional, economic, and regional signals by which the model distinguishes classes (Class 0 = low-level, Class 1 = medium-level, Class 2 = high-level) for each target variable (PISAM, PISAR, PISAS). The rules, presented in tabular form, indicate which class the model pushes an observation toward when the relevant feature is at a certain threshold/feature level (positive SHAP = push toward that class, negative SHAP = push away from that class). This structure allows the results to answer not only the question “which feature matters?” but also the question “which profile corresponds to which level of achievement?”
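The "push toward / push away" reading of the tables can be operationalized by comparing mean SHAP values when a feature is above versus below a reference level. The sketch below uses the feature's median as an illustrative split point (the study's actual thresholds are data-based and are not reproduced here):

```python
import numpy as np

def class_push_rules(shap_c, X, feature_names):
    """For one class, compare mean SHAP when a feature is above vs. below
    its median, yielding 'push toward / push away' directions.
    shap_c: (n_samples, n_features) SHAP values for that class;
    X: the matching feature matrix. Assumes both sides of each split are non-empty."""
    rules = []
    for j, name in enumerate(feature_names):
        hi = X[:, j] >= np.median(X[:, j])
        mean_hi = shap_c[hi, j].mean()            # average push when the feature is high
        mean_lo = shap_c[~hi, j].mean()           # average push when the feature is low
        direction = "toward" if mean_hi > 0 else "away"
        rules.append((name, round(float(mean_hi), 3), round(float(mean_lo), 3), direction))
    return rules
```

A rule such as ("VA", +0.4, −0.3, "toward") would then read: high VA pushes an observation toward this class, low VA pushes it away, which is exactly the tabular logic described above.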
Policy relevance and SDG 4 alignment. The class-conditional profiles in
Table 3,
Table 4 and
Table 5 can be read as decision-support “levers” that help translate model explanations into actionable priorities aligned with SDG 4 (Quality Education). In particular, the consistent prominence of governance signals (e.g., Voice and Accountability and Control of Corruption) suggests that strengthening transparency, accountability, and anti-corruption capacity can be interpreted as a cross-cutting enabling condition for sustained improvement. The domain-specific complements provide a second layer of policy focus: reading profiles more strongly co-move with economic capacity, highlighting the importance of targeted investment capacity for literacy-related system improvements, whereas science profiles show stronger regional-bloc differentiation, indicating that regional peer-learning, policy benchmarking, and institutional diffusion mechanisms may be especially relevant in STEM-oriented capacity building. To make this link measurable, we emphasize that progress can be monitored using established system-level indicators already used in this study (e.g., WGI sub-scores such as Voice and Accountability and Control of Corruption, alongside comparable education-system resource and equity indicators). These implications are intentionally framed as actionable hypotheses for prioritization and further investigation, rather than causal prescriptions, and should be triangulated with contextual evidence when used for high-stakes decisions.
Table 3 Comments (↓ denotes a decrease, ↑ an increase): Mathematics performance (PISAM) is linked to three distinct profiles. Class 0 is the lowest-level group, clustered in non-European contexts, particularly those associated with the American bloc, with low accountability/political participation, weak control of corruption, lower political stability, and limited economic capacity. Class 1 is the intermediate profile, characterized by relatively higher per capita income and a pronounced European affiliation, but not yet the largest economic scale or the highest institutional density. Class 2 is the highest-level group, characterized by higher political stability and concentrated regionally in the European and Asian/Oceanic blocs. While clearly above the lowest economic scale, this group does not necessarily rely on the "highest accountability/strongest control of corruption" signal. This framework demonstrates that PISAM levels are not based solely on income; they are also differentiated by regional location, political stability, and the form of institutional capacity.
Table 4 Comments (↓ denotes a decrease, ↑ an increase): Reading performance (PISAR) is divided into three distinct profiles. Class 0 represents the lowest level, with limited economic capacity, weaker institutional control, and low accountability; it predominantly consists of non-European countries. Class 1 is the intermediate level, with relatively high per capita income and a European context, while institutional indicators are above a certain threshold but do not reach the highest institutional density. Class 2 is the highest level, combining high control of corruption, strong political accountability, and high political stability; it is not weak in economic capacity and is particularly evident in the Asian/Oceanic context. This structure demonstrates that performance levels are not reduced solely to income; rather, institutional capacity, political stability, and regional architecture play a combined differentiating role.
Table 5 Comments (↓ denotes a decrease, ↑ an increase): Science performance (PISAS) is divided into three distinct profiles. Class 0 is the lowest-level group, characterized by low accountability and political stability, weak control of corruption, limited economic capacity, and regional clustering outside Europe, particularly in association with the American bloc. Class 1 is the intermediate-level profile, characterized by a prominent European context, relatively high per capita income, and institutional indicators above a certain threshold, but not yet the highest level of stability or size. Class 2 is the highest-level group, characterized by strong control of corruption, high accountability, political stability, and large-scale economic capacity, particularly in the Asian/Oceanic context. This structure indicates that science outcomes are not explained solely by income; governance capacity, stability, and regional location are predictive signals used jointly by the model.
5.5. Reliability of SHAP and the Interpretability Trade-Off
Figure 4 shows that the reliability metrics (Fidelity and Faithfulness) of the SHAP explanations capture the behavior of the models for the three target variables (PISAM, PISAR, and PISAS) with high accuracy and consistency. For PISAM, the very high Fidelity value of 0.95 indicates that the SHAP explanations are almost perfectly consistent with the model's predictions, and the Faithfulness value of 0.85 indicates that the explanations largely capture the model's decision-making logic, so the contribution of multiple institutional and regional signals can be meaningfully traced through individual features. For PISAR, the Faithfulness value of 0.92 exceeding the Fidelity value (0.89) indicates that SHAP represents the feature combinations the model uses to target specific classes quite well and that the decision logic can be traced clearly; it also implies that this model has a stable structure in terms of explainability. The equal and balanced Fidelity and Faithfulness values for PISAS (0.89/0.89) indicate that the SHAP explanations for this model account for both the predicted outcome and the underlying rationale to a similar degree, providing consistent transparency across the two dimensions. Overall, PISAM stands out in predictive consistency (Fidelity), PISAR in traceability of the decision logic (Faithfulness), and PISAS in the balanced combination of the two; this demonstrates that the SHAP explanations obtained for all models are quantitatively reliable and interpretable.
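Fidelity and Faithfulness can be operationalized in several ways; the sketch below shows one common pair of definitions, reconstruction agreement for fidelity and the perturbation-based faithfulness correlation, under a zero-baseline assumption. It is illustrative and not necessarily the exact metrics computed in this study:

```python
import numpy as np

def fidelity(f, X, base_value, shap_vals):
    """Fidelity: how closely base_value + sum(SHAP) tracks the model output,
    reported as 1 - normalized mean absolute error."""
    recon = base_value + shap_vals.sum(axis=1)
    pred = np.array([f(x) for x in X])
    return 1.0 - np.abs(recon - pred).mean() / (np.abs(pred).mean() + 1e-12)

def faithfulness(f, x, shap_x, baseline):
    """Faithfulness correlation: replace each feature with its baseline value
    and correlate the resulting prediction drop with that feature's attribution."""
    drops, attrs = [], []
    fx = f(x)
    for j in range(len(x)):
        x_pert = x.copy()
        x_pert[j] = baseline[j]
        drops.append(fx - f(x_pert))   # how much the prediction falls
        attrs.append(shap_x[j])        # how important SHAP said the feature was
    return float(np.corrcoef(drops, attrs)[0, 1])
```

For a linear model with exact SHAP values, both quantities are 1 by construction; values such as 0.85–0.95 therefore indicate that the explanations track the fitted nonlinear models closely but not perfectly.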
5.6. Limitations and Future Work
These findings should be read in light of several important limitations. First, the analysis relies on aggregate country-level indicators; within-country regional, school-based, or socioeconomic differences are therefore not visible. The patterns identified here should not be assumed to be homogeneous across all subgroups within each country, and the inferences apply only at the national level.
Second, temporal information is included in the model, but the study is not a causal effects analysis. SHAP contributions mean “the model used this information to assign this class,” not “this factor caused this outcome.” This means that the findings offer policy-guiding signals but do not guarantee that a particular intervention will directly improve scores. This framework aims to clarify which structural signals appear with which achievement profiles for decision support, rather than to provide a causal prescription.
Algorithmic decision support has ethical implications, even when designed for transparency. Country-level profiling risks being interpreted as normative judgments, potentially reinforcing stigmatization or simplistic rankings. Therefore, outputs should be viewed as preliminary decision-support indicators, to be triangulated with qualitative insights and interpreted cautiously given uncertainties, measurement errors, and structural differences across countries.
Third, indicators such as governance quality, corruption control, political stability, economic capacity, and regional institutional frameworks have historically evolved together. This coevolutionary relationship limits how uniquely the model can attribute importance to any single component. In practice, what policymakers encounter are already packaged institutional structures.
Finally, the “low/medium/high” achievement classes are based on this study’s data-based thresholds. Different threshold definitions may alter the labeling of some marginal countries. However, we expect the high-level interpretation of the profiles to remain qualitatively similar.
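The sensitivity of the low/medium/high labels to the threshold definition can be quantified as the share of observations whose label changes between two cut-point choices. An illustrative quantile-based sketch (the cut points below are hypothetical, not the study's actual data-based thresholds):

```python
import numpy as np

def label_by_quantiles(scores, q_low, q_high):
    """Assign 0/1/2 (low/medium/high) using two quantile thresholds."""
    lo, hi = np.quantile(scores, [q_low, q_high])
    return np.where(scores < lo, 0, np.where(scores < hi, 1, 2))

def threshold_sensitivity(scores, cuts_a=(1/3, 2/3), cuts_b=(0.30, 0.70)):
    """Share of observations whose class label changes between two
    threshold definitions -- i.e., the 'marginal' countries."""
    a = label_by_quantiles(scores, *cuts_a)
    b = label_by_quantiles(scores, *cuts_b)
    return float((a != b).mean())
```

A small disagreement share under plausible alternative cuts would support the expectation that the high-level interpretation of the profiles remains qualitatively similar.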
Bias and responsible interpretation. As with any AI/ML-based pipeline, the risk of biased conclusions may arise if model performance collapses into a dominant class or if explanation attributions are distorted by strong feature dependencies. In this study, we mitigate these risks in two practical ways. First, model selection prioritized Macro-F1 and monitored class-level Precision/Recall/F1 (
Table 2), which helps prevent majority-class dominance and makes potential performance disparities across the low/medium/high profiles visible. Second, to reduce attribution instability in SHAP under correlated macro-indicators, we applied a two-stage feature elimination procedure (correlation-based filtering followed by VIF elimination), which improves the statistical independence of the feature set and thus the reliability of additive explanations. Nevertheless, the results should be used as predictive/explanatory decision-support signals at the aggregate country level, rather than as a fairness audit or a causal basis for high-stakes interventions. Because the unit of analysis is the aggregate country/region level and the study does not rely on individual-level sensitive attributes, formal demographic fairness metrics are not directly applicable; instead, we treat "bias risk" primarily as class-imbalance/majority dominance, attribution instability under feature dependence, and potential interpretive misuse in policy narratives.
External validity and generalizability. The proposed framework is designed to be transferable because it relies on widely available, cross-national indicators and an explicitly documented preprocessing and explainability pipeline. Nevertheless, generalizability depends on the stability of indicator definitions and measurement practices across contexts and time. The approach is expected to generalize most directly to future PISA cycles where comparable governance, economic, and regional indicators can be compiled; it can also be adapted to other international learning assessments, provided that outcome scales and country coverage are comparable. At the same time, the framework may require recalibration when applied to substantially different assessment regimes, missingness patterns, or shifts in the operationalization of governance and development metrics.
Three concrete directions for future work stand out.
First, moving down to the sub-national scale (regions, school clusters, and socio-economic segments) would increase the policy applicability of this approach.
Second, incorporating the time dimension into causal or comparative frameworks would allow testing of how institutional change relates to transitions between achievement classes.
Third, counterfactual scenarios (such as "Would this country move from a medium profile to a high profile if corruption control improved to a certain extent?") could transform the explainable prediction pipeline presented in this study into a forward-looking simulation tool.
6. Conclusions
This study shows that the patterns observed in the PISA Mathematics (PISAM), Reading (PISAR), and Science (PISAS) targets are best explained in the model by the joint contribution of governance quality (e.g., accountability, control of corruption, political stability), economic capacity (macroeconomic scale and resource access), and regional/institutional context (the institutional frameworks of different regional blocs), rather than by a unilinear causality. Class-based SHAP analyses clearly reveal that this joint effect comes into play with different weights for each target: in PISAM, the institutional indicators (especially accountability and control of corruption) are decisive, while the upper class is additionally differentiated by political stability and regional location; in PISAR, economic capacity is strongly articulated with this institutional core; and in PISAS, the governance signal works together with regional-bloc differentiation. Thus, "high performance" is not explained by a single variable threshold but emerges through combinations of institutional quality, capacity, and regional location, and the class profiles for each target are defined by different combinations of the same variable family. The question "which class is differentiated under what conditions and by what signals?" has been answered in a concrete and reproducible manner. Specifically, in the PISAS context, rather than being frozen in a single socioeconomic cross-section, classes are differentiated by interactive combinations of governance indicators, regional-bloc identity, and specific economic profiles, suggesting that the model's decisions are largely explained by combinations of multiple conditions rather than by univariate thresholds.
Methodologically, the study presents a multi-target, multi-class structure with class-conditional SHAP contributions and quantitative reliability measures (Fidelity/Faithfulness). The values obtained were high and consistent: Fidelity 0.95/Faithfulness 0.85 for PISAM; 0.89/0.92 for PISAR; and 0.89/0.89 for PISAS. This profile demonstrates that predictive fit is strongest in PISAM, the model's decision logic is most directly traceable in PISAR, and the two dimensions are balanced in PISAS. Therefore, SHAP-based explanations provide a framework that is not only consistent with the output but also faithful to the model. Consequently, the proposed flow (multi-target + class-based SHAP + Fidelity/Faithfulness) offers a reusable and transparent decision-support standard that addresses performance and interpretability together in policy and practice contexts.
Positioning relative to prior work. Prior PISA–machine learning and XAI studies have predominantly focused on student- or school-level predictors within two-level structures; our contribution complements this line by shifting the unit of explanation to the country/region level and integrating governance quality, economic capacity, and regional/institutional context within a single explainable, multi-class framework. By providing class-conditional SHAP profiles, the study goes beyond reporting "what matters" on average and instead clarifies "which structural configuration corresponds to which achievement profile," thereby translating the macro-level evidence discussed in the literature into a directly interpretable decision-support output.