1. Introduction
PISA is the OECD’s Programme for International Student Assessment, which seeks to answer the question of “what citizens should know and be able to do” through internationally comparable measures of student performance [
1]. International large-scale assessments such as PISA provide comparative indicators of student achievement across various competence areas [
2]. In other words, one of PISA’s key advantages is that it provides international data demonstrating large-scale measures of student learning [
3]. PISA is an assessment administered to 15-year-old students worldwide every three years, showing the extent to which students have acquired the basic skills necessary for participation in social and economic life. The primary focus of the assessment is the extent to which students can apply what they have learned and adapt it to different contexts [
1]. PISA is regarded as a tool for promoting student learning and plays an important role in global basic education reform [
4]. This process can be regarded as a form of education monitoring. Wolter (2008) emphasizes that education monitoring is a comprehensive process that includes periodic comparisons of education systems using tools such as PISA and also aims to produce governance information for the field of education and its problems [
5].
In a changing and globalizing world, the growth and internationalization of capital have led to increased complexity in production, work, and governance, as well as to the emergence of new actors in public and institutional spheres. In this context, applications such as PISA, one of the OECD’s best-known programs, have gained a strategic and important place in many countries, particularly given the OECD’s role as an economic forum and as a data and information center [
6]. PISA is considered an important, valid, and reliable measure of student performance that influences educational policies and decision-making processes [
7]. The importance of PISA stems not only from ranking countries but also from providing an evidence base for policy design by revealing the strengths and weaknesses of education systems through objective indicators. Regular repetition of assessments such as PISA allows progress or regression to be monitored over time. It also sheds light on critical issues such as inequality, equal opportunity, and learning loss. Indeed, Wolter (2008) highlights the OECD’s efforts to standardize statistics, thus making different countries comparable at an international level, as an important development [
5].
A further significant development is the preference for new methods of processing and interpreting data in statistical evaluation. These methods, which support the empirical methodology of the social sciences and are used especially in labor-market and health research, can be adapted to other fields and have contributed to a better interpretation of the data obtained [
5]. It is precisely at this point that indicators come to the fore. Indicators function as a technology of governance in policymaking around the world. Kelley and Simmons (2020) argued that indicators trigger competition between states [
8]. It is stated that assessments conducted by international organizations, such as PISA, are becoming increasingly popular and accepted by countries, but that they also bring with them certain problems [
9]. PISA is an internationally recognized assessment designed to help governments monitor the outcomes of their education systems and is widely credited with assessing and reporting student achievement. With the rapid pace of globalization, competition for economic and human capital among countries has intensified, and quality education has come to be accepted as the key to improving national competitiveness [
4]. Evaluating the PISA results, which can be considered a source of big data for improving competitiveness, through indicators within a governance framework is deemed meaningful and valuable. Ensuring knowledge- and evidence-based governance and the management of education systems requires statistical information and administrative information systems. Today, this need is even more crucial than in the past [
5]. Competition between countries and societies is perhaps the most important reason for this situation.
PISAM, PISAR, and PISAS, the key components of PISA, represent the domains of mathematics, reading, and science, respectively. PISAM focuses on competencies such as mathematical reasoning, quantitative modeling, and decision-making under uncertainty, providing direct signals about individuals’ numerical literacy and problem-solving capacity. PISAR measures the ability to understand texts, make inferences, integrate information, and assess source credibility, reflecting the quality of the deep literacy that underpins lifelong learning. PISAS, in turn, reveals the level of scientific literacy through the components of scientific thinking, evidence evaluation, hypothesis generation, and scientific argumentation; it is a critical indicator of the sustainability of the STEM ecosystem and of the capacity for evidence-based decision-making. When the three domains are considered together, it becomes possible to understand not only average achievement but also the lower and upper ends of the achievement distribution, the balance between domains, and trends over time. This allows for a concrete analysis of which student groups need more support, the breadth of the talent pool, and the degree of system consistency. This multidimensional picture presented by PISA shows that educational outcomes are related not only to classroom practices but also to broader country and regional dynamics.
In this context, a SHapley Additive exPlanations (SHAP)-based explainable artificial intelligence approach makes it possible to interpret observed performance levels in the PISAM–PISAR–PISAS domains within a transparent framework. SHAP decomposes the decision logic of multi-class models, making visible which effects play a role, in what direction, and under what conditions across the low–medium–high performance bands. Through global importance rankings, class-specific impact profiles, and interaction patterns, the dynamics that increase or decrease PISA performance are presented in a clear narrative. When interpreting PISAM, PISAR, and PISAS, the main contextual lenses are indicators of governance quality (control of corruption, government effectiveness, regulatory quality, rule of law, political stability, voice and accountability), economic capacity and income level (total economic size (GDP), per capita income (GDP per capita), and an economic category index), level of democracy (democracy index), regional location/clustering (Europe, Asia–Oceania, Africa, and America), and a time parameter (differences between years). The effects of these dynamics on the low–medium–high bands of PISA performance are transparently decomposed by SHAP.
Consequently, the study positions PISA not merely as a scorecard but as a lens for understanding skill profiles and the patterns shaping them across PISAM, PISAR, and PISAS. This SHAP-supported approach interprets findings in a transparent, replicable, and action-oriented manner, allowing policymakers to see more clearly which levers to prioritize. Unlike the existing literature, where most explainable AI applications remain at the student or school level, this study aims to reframe PISA performance from the question of “who scored higher?” to “which institutional structure is associated with which achievement profile?” by considering governance quality, economic capacity, and regional institutional positioning at the country/region level within the same transparent modeling framework. This can inform policymakers by providing decision-support guidance for developing sustainable education policies and contribute to a clearer understanding of the relationship between education policies and PISA scores. Aligned with the United Nations’ 2030 Agenda (SDG 4: Quality Education), the proposed framework provides policy-relevant, evidence-based signals to support sustainable education policy prioritization.
Contributions and novelty. This study introduces an explainable, country/region-level framework for interpreting PISA performance that integrates governance quality, economic capacity, and regional/institutional clustering within a unified multi-class pipeline. Unlike prior PISA–XAI applications that predominantly emphasize student- or school-level predictors, our approach shifts the unit of explanation to macro-structural drivers and produces class-conditional SHAP profiles that map Low/Medium/High achievement bands to interpretable structural patterns. In addition, we quantify explanation reliability using Fidelity and Faithfulness, enabling a transparent assessment of how well the explanations track the model’s behavior. Together, these contributions provide a reproducible and policy-facing representation of “structural performance profiles” aligned with SDG 4 as a decision-support tool rather than a causal claim.
3. Methodology
This section outlines the analysis pipeline from data preparation to explainability verification. First, PISA math/reading/science scores were defined as the target variables, and datasets containing economic, governance, democratic, and regional indicators for each country were generated, cleaned, and scaled. Then, to increase the statistical reliability of the variables used, two-stage feature selection was applied: highly correlated variables were eliminated, and the remaining ones were reduced using the VIF to control multicollinearity. Random Forest and XGBoost models were trained; the models were evaluated on accuracy, F1, recall, precision, ROC-AUC, and train-test consistency, and overfitting and underfitting were checked. The most generalizable, robust model was explained with SHAP, and the contribution of each variable to the PISA outcome was quantified. Finally, the reliability of these explanations was tested using two measures: fidelity (how well a surrogate built on the explanation mimics the model) and faithfulness (how well the explanation’s importance ranking tracks the model’s actual sensitivity to each feature).
3.1. Creating Data Sets and Determining Target Variables
In this study, various features affecting PISA scores and consequently the sustainable education goal were examined. Our target variables in the study were PISAM, PISAR, and PISAS, and three separate datasets were created for each target variable using the same explanatory features. This approach enables the modeling of different criteria and allows for comparative analysis. The independent variables included in the datasets cover political, governance, and economic factors. These variables were selected to represent the multidimensional structure affecting educational outcomes (PISA scores) and consist of indicators frequently used in the literature. All variables used in the study, along with their abbreviations and descriptions, are presented in
Table 1, based on data obtained from sources such as the World Bank, OECD, and The Economist Intelligence Unit. In the data preprocessing stage, missing values were checked. Specifically, PISA scores (which were unavailable for every year or for some countries in certain years) were imputed using trend-based interpolation methods, taking inter-year trends into account. Furthermore, all numerical variables were scaled to make them comparable. These datasets, created independently for each target variable, constitute the primary data source for machine learning model training and explainability analyses in the subsequent stages.
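As an illustration of the trend-based imputation and scaling described above, the following sketch uses pandas on a hypothetical country-year panel; the column names and values are invented for illustration only (the paper's actual variable set is listed in Table 1):

```python
import pandas as pd

# Hypothetical panel: one row per country-year, with gaps in PISA scores
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year":    [2012, 2015, 2018, 2022] * 2,
    "PISAM":   [480.0, None, 495.0, 500.0, 420.0, 430.0, None, 445.0],
})

# Trend-based imputation: interpolate along the year axis within each country
df["PISAM"] = (
    df.sort_values(["country", "year"])
      .groupby("country")["PISAM"]
      .transform(lambda s: s.interpolate(method="linear", limit_direction="both"))
)

# Scale to a comparable range (z-score standardization as one common choice)
df["PISAM_scaled"] = (df["PISAM"] - df["PISAM"].mean()) / df["PISAM"].std()

print(df)
```

Interpolating within each country keeps one country's trend from leaking into another's, which is the point of taking inter-year trends into account.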
3.2. Correlation Analysis Between Features and Feature Selection
Multicollinearity between input features can distort feature contributions in XAI-based explanations, leading to model misinterpretation [53,54,55]. To mitigate this issue and avoid unnecessary information duplication, a two-stage feature selection method was implemented. In the first stage, the correlation matrix of the features was extracted, and one feature from each pair with an absolute correlation of 0.90 or above was eliminated by the Greedy Elimination method. In the next stage, the remaining features were subjected to Variance Inflation Factor (VIF) analysis, and features with VIF greater than 7.0 were eliminated.
A greedy approach is a class of heuristics that performs no backtracking and gradually constructs a solution by choosing the locally best option at each step [56]. The VIF measures the factor by which the coefficient variance of each feature in a regression model is inflated due to multicollinearity with the other independent features [57].
To preprocess multicollinearity, a correlation-threshold-based Greedy Elimination algorithm was applied before the VIF analysis (e.g., |ρ| ≥ 0.90). At each step, the algorithm finds the pair (x_i, x_j) with the highest absolute correlation within the current feature set S. To decide which feature of this pair to eliminate, the average absolute correlation of each candidate with all other variables in the set S is calculated by Equation (1), and the feature with the higher average correlation is eliminated by Equation (2):

ρ̄_k = (1 / (|S| − 1)) · Σ_{x_j ∈ S, j ≠ k} |ρ(x_k, x_j)|,  k ∈ {i, j}   (1)

x* = argmax_{x_k ∈ {x_i, x_j}} ρ̄_k   (2)

This process is repeated until all pairwise correlations in the set S fall below the threshold τ. Here:
S: the set of features evaluated in the relevant step of the algorithm (not yet eliminated);
x_i, x_j: individual features of the set S;
τ: the predetermined threshold value that defines high correlation (e.g., τ = 0.90);
ρ(x_i, x_j): the correlation coefficient (Pearson or Spearman) between features x_i and x_j;
ρ̄_k: the average absolute correlation of feature x_k with all other features in its set (a measure of the systemic redundancy of the feature);
|S|: the number of elements of the set S (the number of remaining features);
x*: the “greedy” target feature selected to be eliminated from the set.
The VIF value is defined in Equation (3):

VIF_i = 1 / (1 − R_i²)   (3)

Here, R_i² is the coefficient of determination obtained by regressing the ith feature against all other features. Features with high VIF values were eliminated, and the process was repeated for the remaining features. The process was completed when all VIF values fell below the specified threshold.
In the second stage, a distance value d_ij = 1 − |ρ(x_i, x_j)| is calculated from the absolute correlation between the features remaining after the VIF step, as in Equation (4). With these distance values, a fully connected weighted graph G is created with all features as nodes.
Consequently, this two-stage flow, which reduces high pairwise correlations (e.g., τ = 0.90) with Greedy Elimination and then removes multicollinearity in the remaining variables with the VIF criterion (VIF > 7.0), makes the feature set more independent, statistically more stable, and more informative with respect to the target variable. This supports predictive performance and improves the reliability of XAI-based explanations, and significantly reduces the risk of interpretations being influenced by data-driven dependencies in subsequent analysis steps. The full step-by-step procedure is provided in
Supplementary Algorithm S1.
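The two-stage flow can be sketched as follows. This is a minimal reimplementation for illustration, not the authors' Supplementary Algorithm S1; the toy data, function names, and the VIF regression via ordinary least squares are assumptions, while the thresholds mirror the text (τ = 0.90, VIF > 7.0):

```python
import numpy as np
import pandas as pd

def greedy_corr_elimination(X: pd.DataFrame, tau: float = 0.90) -> list:
    """Drop one feature from the most-correlated pair until all |rho| < tau."""
    keep = list(X.columns)
    while len(keep) > 1:
        corr = X[keep].corr().abs().to_numpy()
        np.fill_diagonal(corr, 0.0)
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        if corr[i, j] < tau:
            break
        # Equation (2): eliminate the pair member with the higher mean |rho|
        keep.pop(i if corr[i].mean() >= corr[j].mean() else j)
    return keep

def vif(X: pd.DataFrame, col: str) -> float:
    """VIF_i = 1 / (1 - R_i^2), from regressing feature i on the others."""
    y = X[col].to_numpy()
    A = np.column_stack([X.drop(columns=col).to_numpy(), np.ones(len(X))])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / max(1.0 - r2, 1e-12)

def vif_elimination(X: pd.DataFrame, threshold: float = 7.0) -> list:
    """Iteratively drop the feature with the largest VIF above the threshold."""
    keep = list(X.columns)
    while len(keep) > 2:
        vifs = {c: vif(X[keep], c) for c in keep}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:
            break
        keep.remove(worst)
    return keep

# Toy data: "b" nearly duplicates "a", so one of the pair should be dropped
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = pd.DataFrame({"a": base,
                  "b": base + rng.normal(scale=0.05, size=200),
                  "c": rng.normal(size=200)})
selected = vif_elimination(X[greedy_corr_elimination(X)])
print(selected)
```

Running the two stages in this order first removes near-duplicates cheaply, then catches the subtler many-variable dependencies that pairwise correlation misses.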
3.3. Classification with the Machine Learning Model
Each correlation-optimized dataset was used to train two tree-based models (Random Forest and XGBoost) separately for each target (PISAM, PISAR, and PISAS), and the best-performing configuration was selected. The model configuration process was carried out independently for each target variable. To monitor the risk of overfitting, the difference between training and test accuracy was calculated, with the aim of keeping this difference low. In addition, Recall, Precision, F1, and ROC-AUC were calculated to evaluate the models from different perspectives.
Definition of Low/Medium/High classes and threshold rationale. The three achievement classes were constructed by applying fixed score thresholds to each PISA domain, yielding Low/Medium/High labels for the multi-class prediction task. Consistent with the implementation shared for reproducibility, we used domain-specific cutoffs: for PISAM, Low < 420, Medium 420–490, High ≥ 490; for PISAR, Low < 420, Medium 420–488, High ≥ 488; and for PISAS, Low < 425, Medium 425–495, High ≥ 495. These thresholds were selected after inspecting the empirical score distributions and ensuring that each class retains sufficient support for stable model training and interpretation. To reduce the risk that overall performance is driven by a dominant class, model development and selection emphasize balanced class-wise performance (reported via class-wise Precision/Recall/F1 and Macro-F1), so that the learned separation remains meaningful across all three profiles.
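The banding rule above can be expressed directly in code; the cutoffs below are the ones stated in the text, while the function and variable names are illustrative:

```python
import pandas as pd

# Domain-specific cutoffs from the text: Low < low, Medium [low, high), High >= high
CUTOFFS = {
    "PISAM": (420, 490),
    "PISAR": (420, 488),
    "PISAS": (425, 495),
}

def to_band(score: float, domain: str) -> str:
    """Map a raw PISA domain score to its Low/Medium/High class label."""
    low, high = CUTOFFS[domain]
    if score < low:
        return "Low"
    return "Medium" if score < high else "High"

scores = pd.Series([390, 450, 510], name="PISAM")
bands = scores.map(lambda s: to_band(s, "PISAM"))
print(list(bands))  # ['Low', 'Medium', 'High']
```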
Accuracy: the proportion of all predictions that the model classifies correctly.
F1 score: the harmonic mean of Precision and Recall, measuring the balance between the two [58].
ROC-AUC: reflects the success of the model in distinguishing positive and negative classes under different thresholds [
59].
Recall: In binary or multi-class classification, it measures how many of the actual positives were correctly detected. This metric is expressed by the formula Recall = TP / (TP + FN). Here, TP represents true positives (positives correctly predicted for the class), and FN represents false negatives (actual positives incorrectly predicted as negative) [60]. It answers the question, “How many of the actual positives were correctly predicted?”
Precision: Measures how many of the model’s positive predictions are truly positive. This metric is expressed by the formula Precision = TP / (TP + FP), where FP represents false positives (negatives incorrectly predicted as positive) [61]. It answers the question, “How many positive predictions are truly positive?”
The primary goal of the model’s hyperparameter optimization was not only to increase accuracy but also to establish a balance between F1, Recall, and Precision. This fine-tuning process was applied to both Random Forest and XGBoost models for each of the PISAM, PISAR, and PISAS target variables. The measurement strategy was multifaceted. The Macro-F1 score was used as the primary selection criterion. When Recall decreased, particularly in the middle and minority classes, improvement was prioritized with this metric. When the model produced excessive false positives, a balance was established with the Precision metric. Additionally, the ROC-AUC score was monitored to assess the model’s threshold-independent discriminatory power. Generalization was critical. Therefore, overfitting was prevented by always keeping the difference between training and test accuracies low.
Consequently, models with high Macro-F1 and strong ROC-AUC scores were selected for each target. These final models also exhibited a balanced Precision-Recall profile and robust generalization ability. This careful selection process laid a reliable foundation for the subsequent SHAP-based explanations. Using Macro-F1 as the primary criterion also reduces the risk of majority-class dominance by enforcing balanced performance across all achievement classes.
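A minimal sketch of this selection logic, assuming scikit-learn and a synthetic stand-in dataset (the paper uses real country-level features and also tunes XGBoost, which is omitted here for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class stand-in for a Low/Medium/High PISA band dataset
X, y = make_classification(n_samples=400, n_features=13, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Macro-F1 as the primary hyperparameter selection criterion
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 150], "max_depth": [4, None]},
    scoring="f1_macro", cv=3,
)
search.fit(X_tr, y_tr)
best = search.best_estimator_

# Generalization check: the train-test accuracy gap should stay small
train_acc = accuracy_score(y_tr, best.predict(X_tr))
test_acc = accuracy_score(y_te, best.predict(X_te))
macro_f1 = f1_score(y_te, best.predict(X_te), average="macro")
print(f"train={train_acc:.3f} test={test_acc:.3f} macroF1={macro_f1:.3f}")
```

`scoring="f1_macro"` weights every class equally, which is what keeps a dominant class from inflating the selection criterion.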
3.4. Explainable Artificial Intelligence Algorithm Used: SHAP
After checking for overfitting and underfitting and selecting a model with a relatively high ROC-AUC value, SHAP was used to make the structure of the model interpretable and to evaluate the contribution levels of the features to the model. SHAP is a local explainability tool that calculates the marginal contribution of each feature for each prediction based on the Shapley values, allowing for the feature-by-feature decomposition of the model output [
62]. In this study, global SHAP values were used to examine the general trends of the features. Equation (5) presents the basic mathematical definition of the method and the functional descriptions of the components included in this definition.
The SHAP (SHapley Additive exPlanations) method defines the contribution of each feature to the output of a machine learning model based on cooperative game theory. This approach aims to fairly and consistently quantify the contribution of each feature to the model’s prediction through Shapley values [63]. The SHAP value φ_i for the ith feature is calculated as in Equation (5):

φ_i = Σ_{S ⊆ F \ {i}} [ |S|! · (|F| − |S| − 1)! / |F|! ] · [ f_{S ∪ {i}}(x_{S ∪ {i}}) − f_S(x_S) ]   (5)

Here, f is the prediction function (e.g., the XGBoost model), x is the model instance to be explained, F is the entire feature set, S ranges over the subsets of F that do not contain the ith feature, f_S(x_S) is the prediction made with only the features in S, and φ_i is the contribution value of the ith feature. This equation calculates the contribution of a feature by comparing all subset combinations with and without the relevant feature. Since the calculation weights all subset orderings equally, it ensures a fair distribution. As defined in Equation (5), SHAP is essentially a method for generating local explanations. However, the global contribution level of each feature is obtained by averaging these local values over all observations, Φ_i = (1/N) · Σ_{n=1}^{N} |φ_i^(n)|. Here, Φ_i is the average of the absolute individual contribution values of the ith feature across the instances to be explained.
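Equation (5) can be verified on a toy model by enumerating all coalitions exactly. In this sketch, features outside a coalition S are replaced by baseline values, a common simplification of f_S(x_S); the full enumeration is only feasible for small |F|, which is why practical SHAP implementations use efficient approximations such as TreeSHAP:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for f(x), following Equation (5)."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                # Coalition weight |S|! (|F| - |S| - 1)! / |F|!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

# Toy linear model: Shapley values recover each term's contribution
f = lambda z: 2.0 * z[0] + 3.0 * z[1] - 1.0 * z[2]
phi = shapley_values(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # approximately [2.0, 3.0, -1.0]
```

The global importance used in this study would then be the mean of |φ_i| over all observations.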
3.5. Evaluation of the Explainability Methods: Fidelity and Faithfulness
The success of explainable artificial intelligence (XAI) methods should be evaluated not only by their intuitive plausibility but also by their quantitatively measurable reliability. In this context, the validity of the SHAP explanations was examined using two primary metrics: Fidelity and Faithfulness. These metrics indicate how well the explanations fit the model and how faithfully they reflect its decision process. Fidelity measures how well a model created with the features identified as “important” by the explanation method mimics the decisions of the original model [
64]. Faithfulness assesses the extent to which the features identified as important using the explanation method actually influence the model’s decisions. To do this, the decrease in the model’s predictive accuracy is measured by changing the value of each feature. The magnitude of these decreases is then compared with the importance ranking assigned using the explanation method. If the features identified as most important are the ones that most significantly influence the model’s predictions, the explanation is concluded to be faithful to the model’s decisions [
65]. The full step-by-step procedures for these Fidelity (surrogate-model agreement) and Faithfulness (permutation-based sensitivity) checks are provided in
Supplementary Algorithms S2 and S3.
Faithfulness is operationalized as the agreement between two rankings: the “impact” ranking, defined by the performance drop observed when a feature is perturbed (reflecting its actual influence on the trained model), and the “explanatory power” ranking, defined by the importance scores assigned by the explanation method (estimated importance). A higher rank agreement (i.e., stronger correlation between these rankings) indicates that the explanation method assigns higher importance to the features that truly affect the model’s predictions, and is therefore interpreted as higher faithfulness.
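The two checks can be sketched as follows, assuming scikit-learn and using the model's built-in feature importances as a stand-in for SHAP global values. Fidelity is measured as surrogate-model agreement and faithfulness as the rank correlation between permutation-induced accuracy drops and the importance scores; this mirrors Supplementary Algorithms S2 and S3 only in spirit, not in implementation detail:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Stand-in importance scores (SHAP global values in the paper)
importance = model.feature_importances_

# Fidelity: can a surrogate trained on the top-k "important" features
# reproduce the original model's predictions?
top_k = np.argsort(importance)[::-1][:4]
surrogate = RandomForestClassifier(random_state=0).fit(
    X_tr[:, top_k], model.predict(X_tr))
fidelity = np.mean(surrogate.predict(X_te[:, top_k]) == model.predict(X_te))

# Faithfulness: rank agreement between permutation-induced accuracy drops
# (actual influence) and the importance ranking (estimated importance)
base_acc = np.mean(model.predict(X_te) == y_te)
drops = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    drops.append(base_acc - np.mean(model.predict(X_perm) == y_te))

rank = lambda a: np.argsort(np.argsort(a))
faithfulness = np.corrcoef(rank(np.array(drops)), rank(importance))[0, 1]
print(f"fidelity={fidelity:.2f} faithfulness={faithfulness:.2f}")
```

Note that fidelity is computed against the model's predictions, not the true labels: a high-fidelity surrogate can imitate even a wrong model.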
4. Results
This section reports the findings at four levels. First, by eliminating high correlation and multicollinearity among the features, a statistically more stable feature set was obtained, and this set served as the primary input for the analysis. Second, XGBoost models achieved the highest classification success for the PISA math/reading/science targets; performance was assessed using both general metrics (Accuracy, ROC-AUC, etc.) and class-based Precision–Recall–F1 results. Third, the SHAP analysis revealed which economic, governance, and regional indicators were most influential in model decisions for each target variable and how this influence differed across achievement classes. Finally, the reliability of the resulting SHAP explanations was measured; Fidelity and Faithfulness scores indicated that the explanations closely matched the model outputs and tracked the model’s decision logic with a high degree of agreement.
4.1. Choices Based on Correlation Analysis Among Features
The dataset comprises 18 numerical features. Feature selection was conducted in two stages: (i) correlation-based elimination (|r| ≥ 0.90, Pearson) and (ii) iterative VIF elimination (threshold VIF > 7.0). Year_C was retained in the correlation step. In the correlation step, the highest correlations were: PISAR–PISAS (|r| = 0.977), PISAM–PISAS (0.970), RL–GE (0.944), RQ–GE (0.906). Therefore, PISAR, PISAM, RL, and RQ were eliminated; the remaining ones were: Year_C, CoC, GE, PS, VA, PISAS, DEM, RCAT_AF, RCAT_AME, RCAT_AO, RCAT_EU, GDP, GDPPC, and ECAT.
In the VIF step, VIF = 7.63 was determined for GE and it was eliminated because the threshold was exceeded. In the next iteration, the maximum VIF dropped to 6.42 and the process was stopped. Final feature set: Year_C, CoC, PS, VA, PISAS, DEM, RCAT_AF, RCAT_AME, RCAT_AO, RCAT_EU, GDP, GDPPC, ECAT. Final VIF values (first five in decreasing order): RCAT_EU = 6.42, RCAT_AME = 4.27, RCAT_AO = 4.24, CoC = 4.12, PS = 3.49. Others: VA = 3.38, ECAT = 2.73, GDPPC = 2.64, PISAS = 2.50, RCAT_AF = 2.38, DEM = 1.53, GDP = 1.22, Year_C = 1.19. In summary, after the correlation and VIF eliminations, a final set of 13 features was obtained.
4.2. Machine Learning Model Results
For each target variable (PISAM, PISAR, and PISAS), the best performing machine learning model was found to be XGBoost. Class-level metrics (Precision, Recall, F1-Score, Support) are presented in
Table 2; general/summary metrics (Train Accuracy, Test Accuracy, ROC-AUC, Macro Avg, Weighted Avg, Support) are given in
Figure 1.
4.3. Class-Distinguished SHAP Feature Importance Based on Target Variable
Figure 2 shows the top 10 features that most significantly impact model predictions for the target variables in PISAM (a), PISAR (b), and PISAS (c), and the distribution of this effect by class. The horizontal axis shows the mean absolute SHAP value, which represents the average effect size of each feature on the model output. Higher values indicate that the model relies more on that feature when making decisions. The features listed on the vertical axis (e.g., VA, CoC, GDP, RCAT_EU, etc.) are the most influential features in this respect. Each bar consists of three colors, and these colors correspond to the model’s classes. The blue portion represents the average absolute SHAP contribution for Class 0, the pink portion for Class 2, and the olive/green portion for Class 1. Thus, while the total length of the bar for a single feature represents that feature’s overall importance, the division of the bar into colored components shows the class in which this importance is concentrated. For example, if the blue portion of a feature’s bar is dominant, that feature is particularly strong at explaining Class 0 predictions. If the pink part is dominant, it is more critical in explaining Class 2 decisions. If the green part is dominant, it is more decisive in Class 1 predictions.
The graphs provide two levels of information. First, they show the overall importance ranking of the features, where the top-ranked features are the primary determinants of the model’s decision-making process. Second, they reveal a class-based decomposition, meaning that the same feature can have different impacts across different classes.
Figure 2a–c illustrate which structural, economic, governance, or regional indicators the model uses to distinguish between PISAM (a), PISAR (b), and PISAS (c), and which class these indicators play a greater role in distinguishing.
4.4. Target Variable and Class-Level SHAP Data
Figure 3a–i presents the class-based SHAP distributions for the PISAM, PISAR, and PISAS target variables. These graphs illustrate which features influence class prediction and in what direction. Each row represents a target variable (PISAM, PISAR, and PISAS, respectively), and each column represents the Class 0, Class 1, and Class 2 predictions for the corresponding target. For the first 10 features in each panel, the dots represent the SHAP values at the observation level. The horizontal position of the dot indicates the direction and magnitude of the effect of the feature on the model output, either toward (+) or away from (−). For example, if red dots corresponding to high values of a feature are stacked to the left (negative SHAP), this decreases the model’s ability to predict that class; if the same red dots are stacked to the right (positive SHAP), this increases the model’s ability to predict that class. The color scale indicates the level of the feature value (blue: low, pink/red: high). This visual presents the most influential features of the model for each target and each class, along with the contribution distribution of these features across the samples.
4.5. SHAP Fidelity and Faithfulness Results
Fidelity and Faithfulness values were calculated to assess the reliability of the SHAP explanations for the PISAM, PISAR, and PISAS target variables. Fidelity indicates the extent to which the SHAP explanations are consistent with the model predictions, while Faithfulness indicates the extent to which the explanations faithfully reflect the model’s decision logic. Both metrics are shown as separate bars for each target. For PISAM, Fidelity was 0.95 and Faithfulness was 0.85; for PISAR, Fidelity was 0.89 and Faithfulness was 0.92; and for PISAS, both metrics were 0.89.
5. Discussion
This section interprets the findings and discusses their practical implications. First, the collinearity structure among the variables is examined to explain which indicators move together and why the final feature set was reduced to its present form. Second, the performance of the machine learning models is evaluated, and it is discussed whether this performance is suitable for reliably classifying PISA achievement levels. Third, using the SHAP results, it is examined which governance, economic, and regional factors affect which targets (math, reading, science) in the general framework and how these factors differ across classes. Fourth, it is detailed how SHAP values behave on a class-by-class basis, showing each class’s distinctive institutional/economic signature. Finally, the reliability of the explanations (Fidelity and Faithfulness), the limitations of the study, and how the method could be extended for policy purposes are discussed.
5.1. Collinearity Detection and Analysis
Excessive multicollinearity is particularly concentrated in the PISA subscales and certain governance indicators; therefore, retaining PISAS as the sole representative reduces information redundancy while adequately summarizing cognitive performance. The exclusion of GE due to VIF > 7 indicates that its marginal explanatory power remains weak because it carries a strong common signal with RL/RQ. In contrast, CoC, PS, and VA remain below the threshold, continuing to carry distinct channels of governance into the model. Although RCAT_EU has the highest VIF among the regional dummies, its sub-threshold value implies that regional effects cannot be fully explained by GDP/GDPPC and contain independent variance. ECAT’s lack of complete overlap with the continuous economic indicators supports the complementary power of the “level + continuous” combination. Year_C’s low VIF indicates that the time trend has limited overlap with the other indicators and can be reliably retained. Overall, the smaller, 13-variable final set provides an interpretable foundation that is expected to yield practical gains in coefficient stability, smaller train–test differences, and smoother ROC-AUC profiles with more stable PR/F1 scores. However, it should be noted that structurally collinear common-source indices may remain, and the findings should be interpreted with a focus on prediction/interpretation rather than causality.
5.2. Machine Learning Model Analysis
The results in
Table 2 and
Figure 1 reveal that the models (PISAM, PISAR, and PISAS) perform quite robustly for each target. When
Table 2 is examined, most of the class-level Precision, Recall, and F1-Score values are above 0.90, and some (e.g., PISAS Low-class Recall = 1.00) are close to perfect. Even for the medium classes (e.g., PISAR Medium F1 = 0.93), performance is satisfactory. Among the overall performance metrics in
Figure 1, Train Acc is 1.0 and Test Acc is 0.93–0.95, indicating good generalization ability. ROC-AUC values in the 0.98–1.00 band confirm very high discriminative power. The close agreement between the Macro Avg and Weighted Avg (≈0.93–0.95) indicates that the models maintain strong performance despite class imbalance. These results are promising for practical applications with SHAP and demonstrate a generally successful model output.
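The per-class metrics and the macro vs. support-weighted averages discussed above follow directly from the confusion matrix. A minimal numpy sketch (function and variable names are ours, illustrative only):

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes=3):
    """Per-class precision/recall/F1 plus macro and support-weighted averages,
    all computed from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    prec = np.divide(tp, cm.sum(axis=0), out=np.zeros_like(tp), where=cm.sum(axis=0) > 0)
    rec = np.divide(tp, cm.sum(axis=1), out=np.zeros_like(tp), where=cm.sum(axis=1) > 0)
    f1 = np.divide(2 * prec * rec, prec + rec, out=np.zeros_like(tp), where=(prec + rec) > 0)
    support = cm.sum(axis=1)                      # true count per class
    macro = f1.mean()                             # unweighted mean across classes
    weighted = (f1 * support).sum() / support.sum()  # support-weighted mean
    return f1, macro, weighted
```

When the classes are balanced (or each class is classified about equally well), the macro and weighted averages nearly coincide, which is the pattern reported for these models.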
5.3. Class-Distinct SHAP Feature Importance Analysis
Figure 2a shows that the strongest determinants of the model for PISAM are governance indicators. The “Voice and Accountability (VA)” and “Control of Corruption (CoC)” features have the highest total SHAP values and are more dominant than the economic indicators (e.g., GDP). In addition, the fact that political stability (PS) and regional location signals (e.g., RCAT_EU) are in the top rankings indicates that not only economic capacity but also institutional quality, stability, and specific regional clusters are used to explain PISAM levels. This indicates that class differences in PISAM are decomposed along the axis of “institutional core + contextual location”.
Figure 2b shows that for PISAR, VA and CoC are again in the first place, but this time, economic scale (GDP) and welfare level (GDPPC) come into play more prominently. This structure suggests that the PISAR result is sensitive to both institutional qualities and economic capacity together. In other words, it implies that high PISAR values are associated with the combination of “good governance + high income”.
Figure 2c reiterates the importance of the governance indicators (VA, CoC) for PISAS, but here the regional category features (e.g., RCAT_EU, RCAT_AO) move to the forefront. This shows that PISAS scores decompose systematically across geographical blocs and that the model actively exploits this regional structure. Furthermore, the contribution of political stability (PS) and democracy (DEM) indicators is particularly evident in separating the medium class.
When
Figure 2a–c are read together, the following pattern emerges: governance quality (especially VA and CoC) is the most salient predictor in the model across all targets. However, PISAR is more closely linked to economic capacity, whereas PISAS is more closely linked to regional location. This indicates that each target variable explains the same institutional core with different complementary dimensions (economy or geography).
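The global rankings read off Figure 2a–c correspond to the standard mean-|SHAP| aggregation per class. A sketch of that aggregation, assuming the multi-class SHAP values are stored as an array of shape (classes, samples, features):

```python
import numpy as np

def global_importance_by_class(shap_values, feature_names):
    """shap_values: array of shape (n_classes, n_samples, n_features).
    Returns, per class, the features ranked by mean |SHAP| (global importance)."""
    mean_abs = np.abs(shap_values).mean(axis=1)   # (n_classes, n_features)
    rankings = {}
    for c in range(mean_abs.shape[0]):
        order = np.argsort(mean_abs[c])[::-1]     # descending importance
        rankings[c] = [(feature_names[j], float(mean_abs[c, j])) for j in order]
    return rankings
```

A shared "institutional core" in the sense used above would show up as the same governance features topping the ranking for every class and target, with the economy/region features reordering below them.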
Cross-domain consistency and domain-specific complements. A key contribution of our findings is the consistent emergence of a shared institutional core across all three domains: governance-related signals—most notably Voice and Accountability (VA) and Control of Corruption (CoC)—remain primary drivers for PISAM, PISAR, and PISAS. Domain differences arise not because the core disappears, but because complementary dimensions become more salient by domain: economic capacity indicators are more prominent for reading (PISAR), whereas regional-bloc differentiation is more pronounced for science (PISAS). This “stable core + domain-specific complement” structure provides an evidence-based interpretation of why drivers differ across domains while remaining coherent within a single country-level framework.
Comparison with prior cross-national evidence. The pattern we observe is consistent with the cross-national literature that links PISA performance to governance quality (WGI dimensions such as control of corruption, political stability, voice and accountability), economic capacity (income/GDP per capita), and regional clustering and shared institutional heritage. In this sense, our global SHAP rankings provide an interpretable confirmation that institutional quality forms a common "institutional core" behind achievement profiles, while economy and region act as complementary lenses that become more salient depending on the competency domain.
Importantly, our class-distinct explanations refine this prior evidence by showing how the same structural family decomposes differently across domains: reading (PISAR) more strongly integrates economic capacity into the institutional core, whereas science (PISAS) differentiates more clearly across regional blocs, together with governance signals. These results should be interpreted as predictive/explanatory associations at the country level rather than causal effects, but they provide a transparent mapping of which macro signals systematically push observations toward low/medium/high profiles.
5.4. Target Variable and Class-Level SHAP Analysis
Table 3,
Table 4 and
Table 5 summarize the institutional, economic, and regional signals by which the model distinguishes classes (Class 0 = low-level, Class 1 = medium-level, Class 2 = high-level) for each target variable (PISAM, PISAR, PISAS). The rules, presented in tabular form, indicate which class the model pushes an observation toward when the relevant feature is at a certain threshold/feature level (positive SHAP = push toward that class, negative SHAP = push away from that class). This structure allows the results to answer not only the question “which feature matters?” but also the question “which profile corresponds to which level of achievement?”
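The "push toward / push away" reading of the tables can be operationalized by comparing mean SHAP values when a feature is above versus below a reference level. The sketch below uses the feature's median as an illustrative split point (the study's actual thresholds are data-based and are not reproduced here):

```python
import numpy as np

def class_push_rules(shap_c, X, feature_names):
    """For one class, compare mean SHAP when a feature is above vs. below
    its median, yielding 'push toward / push away' directions.
    shap_c: (n_samples, n_features) SHAP values for that class;
    X: the matching feature matrix. Assumes both sides of each split are non-empty."""
    rules = []
    for j, name in enumerate(feature_names):
        hi = X[:, j] >= np.median(X[:, j])
        mean_hi = shap_c[hi, j].mean()            # average push when the feature is high
        mean_lo = shap_c[~hi, j].mean()           # average push when the feature is low
        direction = "toward" if mean_hi > 0 else "away"
        rules.append((name, round(float(mean_hi), 3), round(float(mean_lo), 3), direction))
    return rules
```

A rule such as ("VA", +0.4, −0.3, "toward") would then read: high VA pushes an observation toward this class, low VA pushes it away, which is exactly the tabular logic described above.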
Policy relevance and SDG 4 alignment. The class-conditional profiles in
Table 3,
Table 4 and
Table 5 can be read as decision-support “levers” that help translate model explanations into actionable priorities aligned with SDG 4 (Quality Education). In particular, the consistent prominence of governance signals (e.g., Voice and Accountability and Control of Corruption) suggests that strengthening transparency, accountability, and anti-corruption capacity can be interpreted as a cross-cutting enabling condition for sustained improvement. The domain-specific complements provide a second layer of policy focus: reading profiles more strongly co-move with economic capacity, highlighting the importance of targeted investment capacity for literacy-related system improvements, whereas science profiles show stronger regional-bloc differentiation, indicating that regional peer-learning, policy benchmarking, and institutional diffusion mechanisms may be especially relevant in STEM-oriented capacity building. To make this link measurable, we emphasize that progress can be monitored using established system-level indicators already used in this study (e.g., WGI sub-scores such as Voice and Accountability and Control of Corruption, alongside comparable education-system resource and equity indicators). These implications are intentionally framed as actionable hypotheses for prioritization and further investigation, rather than causal prescriptions, and should be triangulated with contextual evidence when used for high-stakes decisions.
Table 3 Comments (↓ denotes a decrease, ↑ an increase): Mathematics performance (PISAM) is linked to three distinct profiles. Class 0 is the lowest-level group, clustered in non-European contexts, particularly those associated with the American bloc, with low accountability/political participation, weak control of corruption, lower political stability, and limited economic capacity. Class 1 is the intermediate profile, characterized by relatively higher per capita income and a pronounced European affiliation, but not yet the largest economic scale or the highest institutional density. Class 2 is the highest-level group, characterized by higher political stability and concentrated regionally in the European and Asian/Oceanic blocs. While clearly above the lowest economic scale, this group does not necessarily rely on the "highest accountability/strongest control of corruption" signal. This framework demonstrates that PISAM levels are not based solely on income; they are also differentiated by regional location, political stability, and the form of institutional capacity.
Table 4 Comments (↓ denotes a decrease, ↑ an increase): Reading performance (PISAR) is divided into three distinct profiles. Class 0 represents the lowest level, with limited economic capacity, weaker institutional control, and low accountability; it predominantly consists of non-European countries. Class 1 is the intermediate level, with relatively high per capita income and a European context, while institutional indicators are above a certain threshold but do not reach the highest institutional density. Class 2 is the highest level, combining high control of corruption, strong political accountability, and high political stability; it is not weak in economic capacity and is particularly evident in the Asian/Oceanic context. This structure demonstrates that performance levels are not reduced solely to income; rather, institutional capacity, political stability, and regional architecture play a combined differentiating role.
Table 5 Comments (↓ denotes a decrease, ↑ an increase): Science performance (PISAS) is divided into three distinct profiles. Class 0 is the lowest-level group, characterized by low accountability and political stability, weak control of corruption, limited economic capacity, and regional clustering outside Europe, particularly in association with the American bloc. Class 1 is the intermediate-level profile, characterized by a prominent European context, relatively high per capita income, and institutional indicators above a certain threshold, but not yet the highest level of stability or size. Class 2 is the highest-level group, characterized by strong control of corruption, high accountability, political stability, and large-scale economic capacity, particularly in the Asian/Oceanic context. This structure indicates that science outcomes are not explained solely by income; governance capacity, stability, and regional location are predictive signals used jointly by the model.
5.5. Reliability of SHAP and the Interpretability Trade-Off
Figure 4 shows that the reliability metrics (Fidelity and Faithfulness) of the SHAP explanations capture the behavior of the models for the three target variables (PISAM, PISAR, and PISAS) with high accuracy and consistency. For PISAM, the very high Fidelity value of 0.95 indicates that the SHAP explanations are almost perfectly consistent with the model's predictions, and the Faithfulness value of 0.85 indicates that the explanations largely capture the model's decision-making logic, so the contribution of multiple institutional and regional signals can be meaningfully traced through individual features. For PISAR, the Faithfulness value of 0.92 exceeding the Fidelity value (0.89) indicates that SHAP represents the feature combinations the model uses to target specific classes quite well and that the decision logic can be traced clearly; it also implies that this model has a stable structure in terms of explainability. The equal and balanced Fidelity and Faithfulness values for PISAS (0.89/0.89) indicate that the SHAP explanations for this model account for both the predicted outcome and the underlying rationale to a similar degree, providing consistent transparency across the two dimensions. Overall, PISAM stands out in predictive consistency (Fidelity), PISAR in traceability of the decision logic (Faithfulness), and PISAS in the balanced combination of the two; this demonstrates that the SHAP explanations obtained for all models are quantitatively reliable and interpretable.
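Fidelity and Faithfulness can be operationalized in several ways; the sketch below shows one common pair of definitions, reconstruction agreement for fidelity and the perturbation-based faithfulness correlation, under a zero-baseline assumption. It is illustrative and not necessarily the exact metrics computed in this study:

```python
import numpy as np

def fidelity(f, X, base_value, shap_vals):
    """Fidelity: how closely base_value + sum(SHAP) tracks the model output,
    reported as 1 - normalized mean absolute error."""
    recon = base_value + shap_vals.sum(axis=1)
    pred = np.array([f(x) for x in X])
    return 1.0 - np.abs(recon - pred).mean() / (np.abs(pred).mean() + 1e-12)

def faithfulness(f, x, shap_x, baseline):
    """Faithfulness correlation: replace each feature with its baseline value
    and correlate the resulting prediction drop with that feature's attribution."""
    drops, attrs = [], []
    fx = f(x)
    for j in range(len(x)):
        x_pert = x.copy()
        x_pert[j] = baseline[j]
        drops.append(fx - f(x_pert))   # how much the prediction falls
        attrs.append(shap_x[j])        # how important SHAP said the feature was
    return float(np.corrcoef(drops, attrs)[0, 1])
```

For a linear model with exact SHAP values, both quantities are 1 by construction; values such as 0.85–0.95 therefore indicate that the explanations track the fitted nonlinear models closely but not perfectly.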
5.6. Limitations and Future Work
These findings should be read in light of several important limitations. First, the analysis relies on aggregate country-level indicators; within-country regional, school-based, or socioeconomic differences are therefore not visible. The patterns identified here should not be assumed to be homogeneous across all subgroups within each country, and the inferences apply only at the national level.
Second, temporal information is included in the model, but the study is not a causal effects analysis. SHAP contributions mean “the model used this information to assign this class,” not “this factor caused this outcome.” This means that the findings offer policy-guiding signals but do not guarantee that a particular intervention will directly improve scores. This framework aims to clarify which structural signals appear with which achievement profiles for decision support, rather than to provide a causal prescription.
Algorithmic decision support has ethical implications, even when designed for transparency. Country-level profiling risks being interpreted as normative judgments, potentially reinforcing stigmatization or simplistic rankings. Therefore, outputs should be viewed as preliminary decision-support indicators, to be triangulated with qualitative insights and interpreted cautiously given uncertainties, measurement errors, and structural differences across countries.
Third, indicators such as governance quality, corruption control, political stability, economic capacity, and regional institutional frameworks have historically evolved together. This coevolutionary relationship limits how uniquely the model can attribute importance to any single component. In practice, what policymakers encounter are already packaged institutional structures.
Finally, the “low/medium/high” achievement classes are based on this study’s data-based thresholds. Different threshold definitions may alter the labeling of some marginal countries. However, we expect the high-level interpretation of the profiles to remain qualitatively similar.
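The sensitivity of the low/medium/high labels to the threshold definition can be quantified as the share of observations whose label changes between two cut-point choices. An illustrative quantile-based sketch (the cut points below are hypothetical, not the study's actual data-based thresholds):

```python
import numpy as np

def label_by_quantiles(scores, q_low, q_high):
    """Assign 0/1/2 (low/medium/high) using two quantile thresholds."""
    lo, hi = np.quantile(scores, [q_low, q_high])
    return np.where(scores < lo, 0, np.where(scores < hi, 1, 2))

def threshold_sensitivity(scores, cuts_a=(1/3, 2/3), cuts_b=(0.30, 0.70)):
    """Share of observations whose class label changes between two
    threshold definitions -- i.e., the 'marginal' countries."""
    a = label_by_quantiles(scores, *cuts_a)
    b = label_by_quantiles(scores, *cuts_b)
    return float((a != b).mean())
```

A small disagreement share under plausible alternative cuts would support the expectation that the high-level interpretation of the profiles remains qualitatively similar.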
Bias and responsible interpretation. As with any AI/ML-based pipeline, the risk of biased conclusions may arise if model performance collapses into a dominant class or if explanation attributions are distorted by strong feature dependencies. In this study, we mitigate these risks in two practical ways. First, model selection prioritized Macro-F1 and monitored class-level Precision/Recall/F1 (
Table 2), which helps prevent majority-class dominance and makes potential performance disparities across the low/medium/high profiles visible. Second, to reduce attribution instability in SHAP under correlated macro-indicators, we applied a two-stage feature elimination procedure (correlation-based filtering followed by VIF elimination), which improves the statistical independence of the feature set and thus the reliability of additive explanations. Nevertheless, the results should be used as predictive/explanatory decision-support signals at the aggregate country level, rather than as a fairness audit or a causal basis for high-stakes interventions. Because the unit of analysis is the aggregate country/region level and the study does not rely on individual-level sensitive attributes, formal demographic fairness metrics are not directly applicable; instead, we treat "bias risk" primarily as class-imbalance/majority dominance, attribution instability under feature dependence, and potential interpretive misuse in policy narratives.
External validity and generalizability. The proposed framework is designed to be transferable because it relies on widely available, cross-national indicators and an explicitly documented preprocessing and explainability pipeline. Nevertheless, generalizability depends on the stability of indicator definitions and measurement practices across contexts and time. The approach is expected to generalize most directly to future PISA cycles where comparable governance, economic, and regional indicators can be compiled; it can also be adapted to other international learning assessments, provided that outcome scales and country coverage are comparable. At the same time, the framework may require recalibration when applied to substantially different assessment regimes, missingness patterns, or shifts in the operationalization of governance and development metrics.
Three concrete directions for future work stand out.
First, moving down to the sub-national scale (regions, school clusters, and socio-economic segments) would increase the policy applicability of this approach.
Second, incorporating the time dimension into causal or comparative frameworks would allow testing of how institutional change relates to transitions between achievement classes.
Third, counterfactual scenarios (such as "Would this country move from a medium profile to a high profile if corruption control improved to a certain extent?") could transform the explainable prediction pipeline presented in this study into a forward-looking simulation tool.
6. Conclusions
This study shows that the patterns observed in the PISA Mathematics (PISAM), Reading (PISAR), and Science (PISAS) targets are best explained in the model by the joint contribution of governance quality (e.g., accountability, control of corruption, political stability), economic capacity (macroeconomic scale and resource access), and regional/institutional context (the institutional frameworks of different regional blocs), rather than by a unilinear causality. Class-based SHAP analyses clearly reveal that this joint effect comes into play with different weights for each target: in PISAM, the institutional indicators (especially accountability and control of corruption) are decisive, while the upper class is additionally differentiated by political stability and regional location; in PISAR, economic capacity is strongly articulated with this institutional core; and in PISAS, the governance signal works together with regional-bloc differentiation. Thus, "high performance" is not explained by a single variable threshold but emerges through combinations of institutional quality, capacity, and regional location, and the class profiles for each target are defined by different combinations of the same variable family. The question "which class is differentiated under what conditions and by what signals?" has been answered in a concrete and reproducible manner. Specifically, in the PISAS context, rather than being frozen in a single socioeconomic cross-section, classes are differentiated by interactive combinations of governance indicators, regional-bloc identity, and specific economic profiles, suggesting that the model's decisions are largely explained by combinations of multiple conditions rather than by univariate thresholds.
Methodologically, the study presents a multi-target, multi-class structure with class-conditional SHAP contributions and quantitative reliability measures (Fidelity/Faithfulness). The values obtained were high and consistent: Fidelity 0.95/Faithfulness 0.85 for PISAM; 0.89/0.92 for PISAR; and 0.89/0.89 for PISAS. This profile demonstrates that predictive fit is strongest in PISAM, the model's decision logic is most directly traceable in PISAR, and the two dimensions are balanced in PISAS. Therefore, SHAP-based explanations provide a framework that is not only consistent with the output but also faithful to the model. Consequently, the proposed flow (multi-target + class-based SHAP + Fidelity/Faithfulness) offers a reusable and transparent decision-support standard that addresses performance and interpretability together in policy and practice contexts.
Positioning relative to prior work. Prior PISA–machine learning and XAI studies have predominantly focused on student- or school-level predictors within two-level structures; our contribution complements this line by shifting the unit of explanation to the country/region level and integrating governance quality, economic capacity, and regional/institutional context within a single explainable, multi-class framework. By providing class-conditional SHAP profiles, the study goes beyond reporting "what matters" on average and instead clarifies "which structural configuration corresponds to which achievement profile," thereby translating the macro-level evidence discussed in the literature into a directly interpretable decision-support output.