1. Introduction
The Sustainable Development Goals (SDGs) adopted by the United Nations in 2015 have generated a large ecosystem of composite indices, dashboards, and annual reports designed to track where countries stand [1,2]. The Sustainable Development Goals Index (SDGI) published by the Sustainable Development Solutions Network aggregates performance across all 17 goals into a single score [3,4], while the Human Development Index (HDI), maintained by the United Nations Development Programme, captures health, education, and income dimensions that reflect foundational human capabilities [5]. Together, these two indices offer complementary perspectives on national sustainability, with the SDGI reflecting policy goals and the HDI capturing structural capabilities.
Although the analysis of interconnections among the SDGs has advanced, most of this work takes a diagnostic perspective. A large body of work on pairwise interactions between the SDGs shows that synergies between goals are more prevalent than trade-offs, although the balance varies across income groups, regions, or policy contexts [6,7,8,9,10]. Network analyses have shown that the SDGs interact differently across income groups [11], and governance scholars have argued that implementing policy across goals requires restructuring institutions [12]. These studies have produced important insights into where progress is lagging and how goals interact with one another [13,14]. They share, however, a common analytical orientation, characterizing current states or static relationships rather than asking whether the trajectory itself can be anticipated from the past. Forecasting methods can complement these diagnostic efforts by establishing empirical baselines against which structural shifts can be detected.
A different question has received far less attention. Can a country’s sustainability trajectory be predicted from its development history, and what does it reveal when the actual path diverges from what historical patterns would lead us to expect?
The limited forecasting literature that does exist operates under important constraints. Chenary et al. [15] forecast SDG scores through 2030 using ARIMAX and linear regression smoothed with the Holt–Winters multiplicative method, selecting predictors based on their relevance to SDG targets. Their analysis covered six world regions aggregated from national data, and the results projected that OECD countries would reach a score of approximately 80, while Sub-Saharan Africa would remain below 60. However, the regional aggregation forgoes country-specific trajectories altogether, and the statistical models used do not capture nonlinear or cross-indicator dynamics. Xing et al. [16] made a substantial contribution by projecting 117 individual SDG indicators for 167 countries using a neural network time-series approach, comparing average annual change rates before and after SDG adoption in 2015 to classify country–goal pairs as advancing, regressing, or stagnating. Their work projected that global SDG achievement would reach approximately 63 percent by 2030. However, each indicator was modeled independently in a univariate framework, meaning the approach cannot discover cross-domain predictive relationships or leverage the shared information among hundreds of development variables. Rabbi [17] developed a more integrated approach that combines Random Forest for feature selection, LSTM for temporal forecasting, and SHAP for interpretability, and applied it to EUROSTAT sustainability indicators across seven EU countries from 2014 to 2023. Their study identified global value chain participation, social protection expenditure, and municipal recycling as key drivers of sustainability outcomes. Although methodologically rich, the study’s scope is limited to seven countries over ten years, and the composite index used is constructed with fixed weights (50 percent economic, 30 percent social, 20 percent environmental), which constrains its generalizability. A recent cross-sectional study [18] applied K-Means clustering to the 2025 SDG Index for 166 countries, grouping nations into five sustainability performance clusters and validating the groupings with Random Forest, SVM, and ANN classifiers. While useful for snapshot-based profiling, this work contains no temporal dimension and cannot address questions about trajectory. Meanwhile, deep learning and artificial intelligence have been applied successfully to sustainability-adjacent forecasting tasks, including ESG index prediction in emerging markets [19], energy demand and solar radiation forecasting [20,21], municipal waste volume prediction [22], and demand forecasting for eco-friendly vehicles [23], demonstrating that these methods are mature enough for complex time-series applications in sustainability contexts.
However, a clear gap remains. Few studies have jointly used high-dimensional development indicators to predict composite sustainability scores with deep learning, quantified deviations from predicted trajectories using calibrated uncertainty bounds, or identified which types of indicators carry the most predictive weight across domains.
This study addresses these gaps by shifting the analytical stance from diagnosis to prognosis. We ask whether a country’s past development can predict its current sustainability path, what it means when a country strays from that predicted path, and which sustainability goals are most relevant in predictive models. We address four research questions: (1) How predictable are national SDGI and HDI trajectories when modeled with multivariate deep learning, and does predictability vary across income groups? (2) Which countries and regions deviate significantly from their predicted trajectories, and do deviations show a systematic discrepancy between the two targets? (3) Which development indicators are the strongest predictors and do they differ for the SDGI and HDI? (4) Is governance quality associated with the predictability of sustainability trajectories? To answer these questions, we first evaluate model performance and predictability across income groups, then identify surprise countries and cross-target patterns, then extract predictive signals by domain, and finally examine the relationship between governance quality and forecast accuracy.
The Temporal Fusion Transformer (TFT) [24], an interpretable deep learning architecture for multi-horizon time-series forecasting that incorporates learned variable selection, temporal attention, and distributional output mechanisms, is employed as the core forecasting model in this study. The model is developed on World Development Indicators for 184 countries and regions from 2003 to 2017 (training and validation). Of the 749 available indicators, the 200 with the highest variance are retained after screening, and the model is evaluated on a held-out 2018 to 2022 test period. Prediction intervals are constructed using conformal inference methods [25,26,27,28], which produce distribution-free uncertainty bounds from model residuals. Countries whose actual trajectories consistently fall outside these intervals are classified as “surprises” in this study, that is, cases where observed outcomes deviate meaningfully from predicted paths.
This study makes three contributions to the sustainability monitoring literature that distinguish it from prior work. First, it constructs a global predictability mapping across 184 countries and regions that quantifies, for the first time, the degree to which national sustainability trajectories follow historical patterns and identifies where those patterns break down. Second, it introduces a framework to detect surprising cases based on conformal prediction intervals and tests its robustness at multiple coverage levels (80, 90, and 95 percent), thereby providing a method for identifying countries and regions whose development paths diverge from expectations. Third, it identifies which development indicators best predict sustainability outcomes by comparing the drivers of progress on specific SDGI goals with broader human capability measures in the HDI.
2. Materials and Methods
Our research design consists of four stages (Figure 1). Data assembly and preprocessing (Figure 1a) produce a balanced panel of 200 development indicators across 184 countries and regions. Three forecasting models (Figure 1b) generate point predictions for the SDGI and HDI. Conformal prediction intervals and surprise detection (Figure 1c) classify each country as on-track or in surprise. Predictive signal extraction and governance analysis (Figure 1d) interpret the model’s learned structure.
2.1. Data
The predictor variables in this study come from the World Development Indicators (WDIs) database maintained by the World Bank, which covers 749 development indicators across 186 countries and regions for the period from 2003 to 2022. We use a preprocessed and gap-filled version of the WDIs prepared by Li et al. [29], who removed variables and countries with more than 70 percent missingness, applied Probabilistic Principal Component Analysis to impute the remaining gaps, and standardized all retained indicators as z-scores across countries and years. The resulting WDI panel contains no missing values. This preprocessing was performed independently of the present study and applies only to the predictor variables. The two prediction targets, the SDGI and HDI, are published by their respective organizations and were not generated or imputed by the same procedure. To reduce dimensionality while retaining the most informative features for forecasting, we ranked indicators by cross-country variance computed exclusively on training-period data and selected the top 200. Variance-based ranking prioritizes indicators that most strongly differentiate countries and regions, making it a useful screening criterion for cross-country forecasting. The threshold of 200 was chosen to balance dimensionality reduction against information loss, and a sensitivity analysis confirms that model rankings are preserved under alternative thresholds of 150 and 250 features (see the Supplementary Material, Table S7). The 200 selected features cover the four thematic domains defined by the World Bank’s WDI classification: economic, social, environmental, and institutional. Each retained indicator was tagged with its domain label based on its WDI code prefix, enabling domain-level interpretation of variable importance results. The impact of full-period standardization on feature selection and model inputs is analyzed in the Supplementary Material (Tables S1–S3).
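The screening step can be illustrated with a minimal sketch. It assumes that "cross-country variance" means the pooled sample variance of each standardized indicator over all training-period country–years, which is one plausible reading and not necessarily the exact computation used here. The function name, the toy panel, and the indicator names are hypothetical.

```python
import numpy as np


def select_top_variance_features(X, years, names, train_end_year=2015, k=200):
    """Rank features by variance over training-period rows and keep the top k.

    X     : (n_obs, n_features) array of standardized indicator values
    years : (n_obs,) array giving the year of each row
    names : list of feature names, one per column of X
    """
    train_mask = np.asarray(years) <= train_end_year
    # Pooled sample variance over training rows only (assumed definition).
    variances = np.asarray(X)[train_mask].var(axis=0, ddof=1)
    order = np.argsort(-variances, kind="stable")[:k]
    return [names[i] for i in order]


# Toy panel: 4 countries x 6 years (2012-2017), four hypothetical indicators.
years = np.tile(np.arange(2012, 2018), 4)
base = np.linspace(-1.0, 1.0, years.size)
X = np.column_stack([
    0.1 * base,
    1.0 * base,
    5.0 * base,
    np.where(years > 2015, 100.0 * base, 0.0),  # varies only after the training window
])
names = ["ind_low", "ind_mid", "ind_high", "late_shock"]

selected = select_top_variance_features(X, years, names, train_end_year=2015, k=2)
print(selected)
```

In this toy example the `late_shock` column has zero variance inside the training window, so it is never selected even though it varies strongly afterwards. This is the property that restricting the ranking to training-period data is meant to guarantee.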
The SDGI and HDI composite indices serve as prediction targets. The SDGI is a score on a 0 to 100 scale that aggregates performance across all 17 SDGs [3,4], with observed values in the study period ranging from approximately 36 to 87. SDGI data are available for 184 of the 186 countries and regions in the WDI panel, with complete annual coverage from 2003 to 2022 for all 184. Two regions, Hong Kong SAR and Macao SAR, were excluded because no SDGI data were available. We then filtered the HDI data to retain only the 184 countries and regions for which SDGI data were available. The HDI ranges from 0 to 1 and captures health, education, and income dimensions [5]. Both targets are used in their original measurement scales so that prediction intervals remain directly interpretable. The final panel is balanced across 184 countries and regions over a 20-year period, yielding 3680 observations with complete coverage of all retained features and both targets.
2.2. Temporal Split and Preprocessing
The data were partitioned into three non-overlapping time periods. The training set covers 2003 to 2015, a span of 13 years. The validation set covers 2016 and 2017, used for hyperparameter tuning and early stopping. The held-out test set covers 2018 to 2022, the period for which all results are reported. This partition balances a long enough training window for learning temporal patterns, a two-year validation period for model selection, and a five-year test horizon that covers both stable (2018–2019) and disrupted (2020–2022) conditions. This strict temporal holdout ensures that no future information enters model training. The test window encompasses a period of significant global disruption, including a pandemic, armed conflicts, and energy market volatility, which lends substantive meaning to the surprise analysis without framing the study around any single event.
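As a quick check on the partition sizes, the split can be written as three year masks over the balanced panel. This is an illustrative sketch only; the constant and function names are hypothetical, and the panel is represented by its year column alone.

```python
import numpy as np

# Year ranges as stated in the text (inclusive bounds).
SPLITS = {
    "train": (2003, 2015),
    "validation": (2016, 2017),
    "test": (2018, 2022),
}


def temporal_split(years):
    """Return one boolean mask per split; the masks do not overlap."""
    years = np.asarray(years)
    return {name: (years >= lo) & (years <= hi) for name, (lo, hi) in SPLITS.items()}


# Balanced panel: 184 countries and regions x 20 years (2003-2022).
years = np.tile(np.arange(2003, 2023), 184)
masks = temporal_split(years)
counts = {name: int(mask.sum()) for name, mask in masks.items()}
print(counts)
```

The test mask covers 184 × 5 = 920 country–year observations and the three masks together cover all 3680 rows of the panel.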
2.3. Forecasting Models
Forecasting sustainability trajectories from high-dimensional development data requires a model that can handle heterogeneous panel structure, learn from hundreds of input features, and produce interpretable outputs that reveal which indicators drive predictions. The TFT meets all three requirements and has demonstrated state-of-the-art performance on multi-horizon forecasting benchmarks across energy demand, retail, and traffic domains [24]. It has been applied in environmental contexts, including air quality prediction [30]. A comprehensive survey of deep learning for time-series forecasting is provided by Lim and Zohren [31]. Three components of the architecture are directly relevant to this study. The Variable Selection Network applies learned softmax-normalized gates to each input feature, producing per-variable importance weights that indicate which indicators the model relies on most. The Interpretable Multi-Head Attention mechanism learns temporal weights over past time steps, revealing which historical periods the model considers most informative. The quantile regression output layer produces distributional forecasts at multiple quantile levels.
Rather than building a separate model for each country, we train a single model on the pooled panel and let it distinguish countries through a learned country identifier. This design allows the model to share temporal dynamics across the full sample while capturing country-specific baseline patterns, an approach that has been shown to improve forecasting accuracy for panels of short individual time series [24,32]. World Bank income group and UN region are included as additional static covariates. The model was implemented using the PyTorch Forecasting library (version 1.6.1). The hidden size was set to 64, the number of attention heads to 4, dropout to 0.1, the learning rate to 0.001, and the batch size to 64. These hyperparameters follow the recommendations in the original TFT paper for datasets of comparable size [24], and no additional hyperparameter search was performed. Training proceeded for a maximum of 100 epochs with early stopping at a patience of 10 epochs on validation loss. Separate models were trained for the SDGI and HDI targets. The encoder length was set to 13 years and the prediction length to 5 years. During the test period, the model generates all five annual predictions (2018–2022) in a single forward pass using only data observed up to 2017. No test-period target values or WDI covariate values are fed back. All WDI covariates are designated as time-varying unknown inputs, so the model receives no future covariate information during the prediction horizon and only the time index is known in advance.
Two baseline models are included to contextualize the TFT’s performance. The TFT captures long-range temporal dependencies through multi-head attention, while XGBoost provides a strong tabular baseline using gradient-boosted trees, and the linear trend model serves as the simplest extrapolation benchmark. The first baseline is a per-country linear trend, which fits an ordinary least squares regression of the target variable on time using the training period and extrapolates into the test window. Linear extrapolation is a standard benchmark in sustainability forecasting studies [15,16], and because many development indicators evolve smoothly over time, it is a deliberately hard baseline to beat. The second baseline is XGBoost [33], a gradient-boosted tree ensemble that has consistently ranked among the top tabular learning algorithms in benchmarking studies [34]. XGBoost uses the three most recent target values plus the 200 WDI features at time t − 1 to predict the target at time t. During the test period, prediction proceeds autoregressively, with predicted values from prior test years serving as lag inputs and actual WDI features from the preceding year used as covariates. This provides a strong non-sequential tabular baseline. XGBoost was configured with 100 estimators, a maximum depth of 4, a learning rate of 0.1, and L1 and L2 regularization parameters both set to 1.0.
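The two baseline mechanics can be sketched as follows. The first function fits and extrapolates a per-country linear trend. The second shows the autoregressive roll-forward in which predictions replace observed values as lag inputs. Both are illustrative: the function names are hypothetical, and `predict_one_step` is a stand-in callable for any fitted one-step model (such as a gradient-boosted regressor), not the configuration used in this study.

```python
import numpy as np


def linear_trend_forecast(train_years, train_values, test_years):
    """Fit y = a + b * (year - first training year) by OLS and extrapolate."""
    train_years = np.asarray(train_years, dtype=float)
    origin = train_years[0]  # centering the years improves numerical conditioning
    slope, intercept = np.polyfit(train_years - origin, train_values, deg=1)
    return intercept + slope * (np.asarray(test_years, dtype=float) - origin)


def autoregressive_forecast(predict_one_step, history, lagged_features, n_lags=3):
    """Roll a one-step model forward, feeding its own predictions back as lags.

    predict_one_step : callable(lags, features) -> float (hypothetical stand-in)
    history          : observed target values up to the forecast origin
    lagged_features  : one feature vector per forecast year (features at t - 1)
    """
    values = list(history)
    predictions = []
    for features in lagged_features:
        lags = np.asarray(values[-n_lags:], dtype=float)
        y_hat = float(predict_one_step(lags, features))
        predictions.append(y_hat)
        values.append(y_hat)  # the prediction becomes a lag for the next year
    return np.array(predictions)


# Toy country whose score rises by 0.5 points per year from 50.0 in 2003.
train_years = np.arange(2003, 2016)
train_values = 50.0 + 0.5 * (train_years - 2003)
trend_forecast = linear_trend_forecast(train_years, train_values, np.arange(2018, 2023))

# Stand-in one-step model: repeat the last observed change (ignores features).
def repeat_last_change(lags, features):
    return lags[-1] + (lags[-1] - lags[-2])

ar_forecast = autoregressive_forecast(repeat_last_change, [1.0, 2.0, 3.0], [None, None])
```

Because the autoregressive loop reuses its own outputs, any one-step error is carried into later years, which is one way such a formulation can compound noise on smooth series.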
2.4. Conformal Prediction Intervals
We use conformal prediction to convert point forecasts into prediction intervals, so that surprise detection can be conducted based on calibrated uncertainty. Identifying countries and regions whose sustainability trajectories deviate from expectations requires prediction intervals with known statistical properties. Quantile outputs generated by deep learning models can suffer from miscalibration, producing intervals that are either too narrow or too wide in practice [27]. Conformal inference [25,26] addresses this problem by providing distribution-free prediction intervals without requiring assumptions about the error distribution. Conformal prediction has been adopted increasingly in applied machine learning [27,28] and has been extended to time-series settings in energy forecasting, financial prediction, and environmental modeling [35,36].
The conformal interval is built in three steps. First, each forecasting model generates a point forecast for each country–year. Second, we compute the residuals, that is, the differences between actual and predicted values, on the test set (920 country–year observations per target). Third, we set the interval as the point forecast plus or minus a threshold derived from these residuals at the desired coverage level (e.g., 80 percent), following the conformal method introduced by Vovk et al. [25] and Romano et al. [26]. To account for differences in forecast difficulty across countries, we apply a locally adaptive variant [37] in which each country’s residuals are scaled by its own median absolute deviation (MAD). This scaling produces narrower intervals for countries with stable trajectories and wider intervals for volatile ones. These intervals serve as retrospective uncertainty bands calibrated over the completed test period to identify countries whose trajectories deviate from model expectations. The prediction intervals are calibrated and evaluated on the same test set, so overall empirical coverage matches the nominal level by construction. This means that prediction interval width, not coverage, is the metric that differentiates models (Table 1). Empirical coverage by year and by income group is reported in the Supplementary Material (Tables S4 and S5). Coverage declines from 96 percent (SDGI) and 98 percent (HDI) in 2018 to 57 percent for both targets in 2022 and is relatively uniform across income groups.
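A minimal sketch of a MAD-scaled interval of this kind is given below. It rests on several assumptions that the text does not pin down: the conformity score is the absolute residual divided by the country’s MAD, the threshold is the plain empirical quantile of those scores (without a finite-sample correction, since calibration and evaluation share the same set), and a small floor guards against a zero MAD. The function name and toy data are hypothetical.

```python
import numpy as np


def adaptive_conformal_intervals(y_true, y_pred, country_ids, coverage=0.80, eps=1e-8):
    """Symmetric intervals whose half-width scales with each country's residual MAD."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    country_ids = np.asarray(country_ids)
    residuals = y_true - y_pred

    # Per-country scale: median absolute deviation of that country's residuals.
    scales = np.empty_like(residuals)
    for country in np.unique(country_ids):
        idx = country_ids == country
        r = residuals[idx]
        mad = np.median(np.abs(r - np.median(r)))
        scales[idx] = max(mad, eps)  # assumed floor to avoid division by zero

    # Threshold: empirical quantile of the scaled absolute residuals.
    q = np.quantile(np.abs(residuals) / scales, coverage)
    half_width = q * scales
    return y_pred - half_width, y_pred + half_width


# Toy data: three countries, five years each, with increasingly volatile residuals.
country_ids = np.repeat(["A", "B", "C"], 5)
pattern = np.array([0.1, -0.2, 0.3, -0.1, 0.2])
residuals = np.concatenate([pattern, 10.0 * pattern, 5.0 * pattern])
y_pred = np.full(15, 60.0)
y_true = y_pred + residuals

lower, upper = adaptive_conformal_intervals(y_true, y_pred, country_ids, coverage=0.80)
inside = (y_true >= lower) & (y_true <= upper)
widths = upper - lower
```

In the toy data, country B receives a wider interval than country A because its residuals are more dispersed, while the pooled share of observations inside the intervals matches the nominal level.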
2.5. Surprise Detection
A country is classified as exhibiting a sustainability surprise when its actual trajectory persistently falls outside the calibrated prediction interval (PI). Our main definition of a surprise requires the actual value to lie outside the adaptive 80 percent PI for at least two consecutive years during the 2018 to 2022 test period. The 80 percent level is chosen as a balance between detection power and false positive control, consistent with standard practice in conformal prediction [27,28]. The two-consecutive-year requirement filters out single-year anomalies that may reflect measurement noise or transient shocks rather than genuine trajectory shifts. Values above the upper bound are labeled positive surprises, and values below the lower bound are labeled negative surprises. The surprise classification is therefore a retrospective identification of countries whose trajectories deviated from model expectations during the completed test period.
Two robustness variants are evaluated to ensure that findings are not sensitive to the choice of coverage level. These use the 90 percent and 95 percent PIs with the same two-consecutive-year requirement.
A cross-target typology is constructed by intersecting the SDGI and HDI surprise classifications, producing four categories. These are countries and regions on track for both targets, those with SDGI-only surprise, those with HDI-only surprise, and those with both targets in surprise. To examine whether surprises concentrate in the later portion of the test window, we compare the rate of country–year observations falling outside the prediction interval between 2018–2019 and 2020–2022.
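The classification rule and the typology can be sketched as follows. The sketch assumes that the two consecutive years must fall on the same side of the interval, and that a country with qualifying runs on both sides would be labeled by the upper-bound run first. Neither tie-breaking detail is specified in the text, and all names are hypothetical.

```python
import numpy as np


def has_consecutive_run(flags, min_run=2):
    """True if `flags` contains at least `min_run` consecutive True values."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= min_run:
            return True
    return False


def classify_surprise(actual, lower, upper, min_run=2):
    """Label one country's test-period series as positive, negative, or on_track."""
    actual = np.asarray(actual, dtype=float)
    if has_consecutive_run(actual > np.asarray(upper), min_run):
        return "positive"
    if has_consecutive_run(actual < np.asarray(lower), min_run):
        return "negative"
    return "on_track"


def cross_target_type(sdgi_label, hdi_label):
    """Intersect the two per-target labels into the four-way typology."""
    sdgi_surprise = sdgi_label != "on_track"
    hdi_surprise = hdi_label != "on_track"
    if sdgi_surprise and hdi_surprise:
        return "double_surprise"
    if sdgi_surprise:
        return "sdgi_only"
    if hdi_surprise:
        return "hdi_only"
    return "on_track_both"


lower = [58.0] * 5
upper = [63.0] * 5
sustained = classify_surprise([60.0, 61.0, 65.0, 66.0, 62.0], lower, upper)
isolated = classify_surprise([60.0, 65.0, 60.0, 65.0, 60.0], lower, upper)
declining = classify_surprise([57.0, 56.0, 60.0, 60.0, 60.0], lower, upper)
```

The second toy series exceeds the upper bound twice but never in adjacent years, so it stays on track under the two-consecutive-year rule.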
2.6. Predictive Signal Extraction
The Variable Selection Network within the TFT produces learned, softmax-normalized weights for each input feature, indicating its contribution to the forecast [24]. These weights are intrinsic to the model and are optimized jointly with the forecasting objective, as opposed to post hoc attribution methods such as SHAP [38]. We extract the global encoder variable importance, averaged across samples, for each target and map each feature to its thematic domain to compute domain-level cumulative predictive weight.
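The domain-level aggregation amounts to summing feature weights within each domain and normalizing. The short sketch below illustrates this with made-up weights and hypothetical feature names; it does not reproduce the extraction of weights from the trained model.

```python
def domain_shares(weights, domains):
    """Sum feature weights within each domain and normalize the totals to 1."""
    totals = {}
    for feature, weight in weights.items():
        domain = domains[feature]
        totals[domain] = totals.get(domain, 0.0) + weight
    grand_total = sum(totals.values())
    return {domain: value / grand_total for domain, value in totals.items()}


# Hypothetical averaged variable selection weights and domain tags.
weights = {"feat_a": 0.25, "feat_b": 0.25, "feat_c": 0.30, "feat_d": 0.20}
domains = {
    "feat_a": "economic",
    "feat_b": "economic",
    "feat_c": "social",
    "feat_d": "institutional",
}

shares = domain_shares(weights, domains)
```

Normalizing by the grand total keeps the shares comparable across targets even when only a subset of features, such as the top 20, is aggregated.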
These weights reflect each predictor’s relevance within the model, not causal relationships, and we make no claim that changing any identified feature would alter sustainability trajectories [39].
2.7. Governance and Predictability Analysis
If sustainability trajectories are more predictable in some countries and regions than others, a natural question is whether institutional quality helps explain this variation. We examine this by regressing country-level prediction error on governance quality. The Worldwide Governance Indicators (WGIs) published by the World Bank cover six dimensions of governance: Voice and Accountability, Political Stability, Government Effectiveness, Regulatory Quality, Rule of Law, and Control of Corruption [40]. Following standard practice in cross-country empirical research [40], we compute a composite governance score as the arithmetic mean of the six dimensions. The composite is measured using 2015 values to avoid temporal overlap with the test period. An ordinary least squares regression is estimated with country-level mean absolute error as the dependent variable and the WGI composite as the independent variable, with income group and region as control variables and robust standard errors. This analysis is correlational and not intended to identify a causal governance effect.
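The regression can be sketched with plain NumPy as below. The text does not say which robust estimator is used, so the HC1 heteroskedasticity-consistent variant here is an assumption. The income group and region controls are omitted for brevity (they would enter as additional dummy columns), and the data are simulated purely for illustration.

```python
import numpy as np


def ols_robust(X, y):
    """OLS coefficients with HC1 heteroskedasticity-robust standard errors.

    An intercept column is added automatically. Returns (beta, robust_se).
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), X])
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1, scaled by n / (n - k).
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = xtx_inv @ meat @ xtx_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))


# Simulated example: six hypothetical governance dimensions for 184 units.
rng = np.random.default_rng(42)
wgi_dimensions = rng.normal(size=(184, 6))
composite = wgi_dimensions.mean(axis=1)  # arithmetic mean of the six dimensions
country_mae = 2.0 - 0.5 * composite + rng.normal(scale=0.1, size=184)

beta, robust_se = ols_robust(composite, country_mae)
```

Here `beta[1]` is the estimated association between the governance composite and prediction error. As in the main analysis, it describes a correlation and carries no causal interpretation.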
3. Results
3.1. Model Performance and Interval Calibration
The forecasting performance of all three models on the held-out test set is reported in Table 1, using the mean absolute error (MAE) and root mean squared error (RMSE), standard accuracy metrics in time-series forecasting [31] that allow direct comparison with prior SDG projection studies [15,16]. The TFT achieves the lowest error on both targets, with an MAE of 1.10 for the SDGI and 0.008 for the HDI. These represent improvements of 19 percent and 60 percent over the linear trend baseline, respectively. The linear model outperforms XGBoost on both targets. Sustainability trajectories exhibit strong path dependence, and the dominant signal is a smooth trend-like change that a simple linear extrapolation already captures effectively. XGBoost, although a strong tabular learner, appears to amplify noise through its autoregressive lag formulation when applied to these highly inertial series. The TFT improves upon the linear model by capturing modest nonlinear dynamics, but the improvement for SDGI is limited, suggesting that SDGI trajectories are close to linear over the training horizon.
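For reference, the two accuracy metrics and the relative improvement over a baseline can be computed as in the sketch below. The numbers are toy values chosen for easy checking and are unrelated to Table 1. The function names are hypothetical.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def relative_improvement(baseline_error, model_error):
    """Fractional error reduction of the model relative to the baseline."""
    return (baseline_error - model_error) / baseline_error


y_true = [60.0, 62.0, 64.0, 66.0]
model_pred = [61.0, 62.0, 63.0, 66.0]
baseline_pred = [62.0, 62.0, 62.0, 62.0]

model_mae = mae(y_true, model_pred)
baseline_mae = mae(y_true, baseline_pred)
gain = relative_improvement(baseline_mae, model_mae)
```

The RMSE weights large errors more heavily than the MAE, so reporting both indicates whether a model’s advantage comes from fewer large misses or from uniformly smaller errors.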
The conformal prediction intervals are calibrated at the 80 percent level for all three models. The models differ in PI width, which reflects the precision of their point predictions. The TFT produces the narrowest 80 percent intervals (4.01 SDGI points, 0.035 HDI), followed by the linear model (5.07, 0.072), with XGBoost producing the widest (9.78, 0.184). At the same nominal coverage level, a narrower interval represents a more informative uncertainty estimate. Surprise detection in subsequent sections uses the TFT intervals at the 80 percent level.
3.2. Global Predictability Landscape
The distribution of country-level prediction errors by income group is shown in Figure 2. For the SDGI, the median country-level MAE is 0.88, with a mean of 1.10 and a standard deviation of 0.83. The highest individual error is observed for Seychelles at 5.41 SDGI points. For the HDI, the median MAE is 0.006, with a mean of 0.008 and a standard deviation of 0.007. Prediction errors are modestly higher for low-income countries, consistent with the expectation that institutional volatility and external shocks reduce trajectory predictability in less stable settings. The differences across income groups are, however, smaller than might be anticipated, reflecting the strong global inertia in development trends even among the poorest countries. This suggests that the framework is applicable across countries with diverse institutional settings, not only those with strong institutions.
When scaled to the observed range of each index, HDI errors are proportionally smaller (MAE/range is about 1.2 percent for HDI versus 2.2 percent for SDGI), and the HDI error distribution is more tightly concentrated. This is consistent with what the two indices capture. The HDI reflects slow-moving capability dimensions, such as life expectancy and educational attainment, which resist year-to-year fluctuations. The SDGI includes policy-sensitive indicators such as renewable energy share and institutional quality scores that can shift more abruptly in response to reforms or crises.
3.3. Sustainability Surprises and Cross-Target Typology
3.3.1. Surprise Counts and Asymmetry
The surprise detection results under the main definition (80 percent PI, two consecutive years) and under stricter coverage levels (90 and 95 percent) are presented in Table 2. Under the main definition, 35 countries and regions exhibit positive SDGI surprises and nine exhibit negative surprises, with 140 on track. For the HDI, the pattern is inverted, with 11 positive surprises and 23 negative surprises, leaving 150 on track. The asymmetry between the two targets is one of the central findings of this study. The SDGI surprises skew positive, reflecting sustainability acceleration among developing nations such as Benin (+4.59), Togo (+3.23), and Rwanda (+1.88). The HDI surprises skew negative, reflecting disruptions to health and income dimensions in countries such as Venezuela (−0.040), Libya (−0.023), and Iran (−0.013) during the 2020 to 2022 period. When the two targets are combined in the cross-target typology, 115 countries and regions (62 percent) are on track for both, 35 (19 percent) show SDGI-only surprises, 25 (14 percent) show HDI-only surprises, and nine (5 percent) are double surprises. To assess whether the observed surprise counts exceed what would be expected by chance, we estimated the null expectation under an independence model in which each country–year has a 10 percent probability of falling above the upper bound and 10 percent below the lower bound. Under this model, approximately 6.8 countries per direction would be flagged as surprises by chance alone (see the Supplementary Material, Table S6 and Figure S1). The 35 SDGI-positive surprises (p < 0.0001) and 23 HDI-negative surprises (p < 0.0001) far exceed this expectation. The 11 HDI-positive surprises are slightly above the null expectation, although not significantly so at the 5 percent level (p = 0.08). The nine SDGI-negative surprises fall within the null range (p = 0.24), suggesting that some of these deviations may reflect random variation rather than genuine trajectory shifts. Widening the interval to 90 or 95 percent coverage reduces surprise counts progressively, while the core set of surprise countries remains consistent, with all countries classified as surprises at the 95 percent level also appearing at the 80 percent level (Table 2).
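The null expectation can be checked by direct enumeration. With a 10 percent per-year probability of exceeding a given bound and five independent test years, the sketch below computes the probability of at least two consecutive exceedances and multiplies it by 184. Treating the number of flagged countries as binomial to obtain tail probabilities is an assumption made here for illustration; the procedure behind Table S6 may differ.

```python
from itertools import product
from math import comb


def prob_consecutive_run(p, n_years=5, min_run=2):
    """Probability of at least `min_run` consecutive exceedances in `n_years` years."""
    total = 0.0
    for sequence in product([0, 1], repeat=n_years):
        run = longest = 0
        for outcome in sequence:
            run = run + 1 if outcome else 0
            longest = max(longest, run)
        if longest >= min_run:
            k = sum(sequence)
            total += p ** k * (1 - p) ** (n_years - k)
    return total


def binomial_upper_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


p_flag = prob_consecutive_run(0.10)   # chance that one country is flagged in one direction
expected_flags = 184 * p_flag         # expected number of flagged countries per direction

p_nine = binomial_upper_tail(184, p_flag, 9)
p_eleven = binomial_upper_tail(184, p_flag, 11)
p_thirty_five = binomial_upper_tail(184, p_flag, 35)
```

Under these assumptions the expected count is about 6.8 per direction, and the tail probabilities for nine and 11 flagged countries come out close to the reported 0.24 and 0.08.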
The spatial distribution of the four-way typology is shown in Figure 3. Positive SDGI surprises cluster in Sub-Saharan Africa and parts of South and Southeast Asia, while negative HDI surprises are more geographically dispersed, spanning the Middle East, Eastern Europe, and Latin America.
3.3.2. Surprise Countries
The top positive and negative surprises for each prediction target are listed in Table 3. Complete lists of all surprise countries with mean residuals are provided in the Supplementary Material (Tables S8 and S9). Among positive SDGI surprises, developing nations dominate. Benin, Togo, Rwanda, China, and Ethiopia all exceeded their historical trajectory predictions by substantial margins, suggesting sustained improvements in sustainability that began in several cases before 2020. Negative SDGI surprises concentrate in states experiencing political collapse or armed conflict, including Venezuela, Yemen, and Syria. For the HDI, negative surprises extend beyond fragile states to include middle-income nations such as Iran, North Macedonia, Argentina, and Mexico, whose human development was disrupted by a combination of health system strain, economic contraction, and regional instability.
The temporal distribution of surprises is uneven across the test window. The average annual rate of observations falling outside the prediction interval rises from 0.06 per country in 2018–2019 to 0.29 in 2020–2022. This confirms that the later part of the test window, which includes multiple overlapping global disruptions, showed more trajectory deviations. However, not all surprises are attributable to post-2020 events. Countries such as Benin and Togo show sustained positive SDGI deviations that begin before 2020, pointing to structural acceleration rather than crisis-driven fluctuation.
Selected trajectory plots for the top surprise and on-track countries are shown in Figure 4.
3.3.3. Double Surprises and Goal–Capability Decoupling
Among the nine double-surprise countries, a dominant pattern is visible. Six of the nine, specifically Bulgaria, Cameroon, Cabo Verde, Iran, Ukraine, and the United States, exhibit positive SDGI surprise combined with negative HDI surprise. This combination suggests a decoupling between goal-level and capability-level sustainability, in which policy-driven SDG indicators, including institutional reform metrics and renewable energy targets, can advance even as foundational human capabilities in health and income are disrupted.
Ukraine provides a clear example. Its negative HDI deviation coincides with the armed conflict beginning in 2022 [41] and the associated disruption to health infrastructure and real incomes. At the same time, wartime institutional reforms and accelerated alignment with European governance standards may have contributed to the positive SDGI deviation. The United States shows a different pattern with a similar outcome. The negative HDI deviation coincides with a sharp decline in life expectancy during 2020 and 2021 [42], a period marked by the COVID-19 pandemic and the ongoing opioid crisis, while clean energy legislation during the same period may help explain the modest positive SDGI deviation.
Portugal stands alone as the only country with positive surprises on both targets. This outcome coincides with a period of fiscal consolidation, rapid renewable energy deployment, and European Union recovery fund investment [43]. Brunei is the sole double negative, with its deviations occurring during a period of oil price volatility and limited economic diversification. Mali shows the reverse of the dominant pattern, with negative SDGI surprise and positive HDI surprise. This may reflect the impact of political instability and armed conflict on governance and institutional indicators captured by the SDGI, while basic health and education outcomes continued to improve with support from international development assistance.
3.4. Early Predictive Signals
The top 20 features by variable selection weight for each target, together with the domain-level share of cumulative predictive weight (inset pie charts), are presented in
Figure 5.
The predictive pattern for the HDI is heavily concentrated in the economic domain. Seven of the top ten features are economic indicators, including fuel exports as a share of merchandise trade, the poverty gap index, petroleum rents as a share of GDP, private consumption growth, and lending interest rates. Social indicators such as adult male mortality probability and national unemployment enter only from the eighth rank onward. This concentration is consistent with the HDI’s structural composition: income enters directly, and health and education outcomes in developing countries depend heavily on economic resources.
The profile for the SDGI is more diverse. The top ten include social indicators such as the female lower secondary out-of-school rate, access to basic drinking water services, and lower secondary completion rates. They also include economic indicators such as the import unit value index and foreign direct investment inflows, as well as institutional indicators related to the control of corruption. No single domain monopolizes the predictive signal.
At the domain level (
Figure 5, inset pie charts), economic indicators account for a much larger share of the total predictive weight for the HDI than for the SDGI. For the SDGI, the four domains contribute more evenly. This implies that monitoring systems built around economic indicators alone may track the HDI well but will miss important signals of broader sustainability change. Effective sustainability monitoring therefore requires indicators from multiple domains that reflect the multi-dimensional nature of sustainability [
44,
45,
46].
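As an illustration of how a domain-level share of cumulative predictive weight can be obtained from feature-level weights, the following sketch sums the weights within each domain and normalizes by the total. The function name, the feature labels, the weights, and the domain assignments are all hypothetical placeholders, not values from Figure 5.

```python
from collections import defaultdict


def domain_shares(weights: dict[str, float], domains: dict[str, str]) -> dict[str, float]:
    """Aggregate feature weights by domain and normalize so shares sum to 1."""
    totals: dict[str, float] = defaultdict(float)
    for feature, weight in weights.items():
        totals[domains[feature]] += weight
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {domain: total / grand_total for domain, total in totals.items()}


# Hypothetical variable selection weights (illustrative only).
example_weights = {
    "fuel_exports_share": 0.5,
    "adult_male_mortality": 0.25,
    "fdi_inflows": 0.125,
    "control_of_corruption": 0.125,
}
# Hypothetical domain assignment for each feature.
example_domains = {
    "fuel_exports_share": "economic",
    "adult_male_mortality": "social",
    "fdi_inflows": "economic",
    "control_of_corruption": "institutional",
}
shares = domain_shares(example_weights, example_domains)
```

A target whose shares are concentrated in one domain, as reported for the HDI, would show one large value, whereas a more even spread across domains would correspond to the SDGI profile.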
The model’s temporal attention weights differ between the two targets (see the
Supplementary Material, Figure S2). For the HDI, attention increases steadily toward the most recent encoder years, indicating that the recent trajectory is most informative for prediction. For the SDGI, the pattern is U-shaped, with the highest attention at the earliest encoder year and a secondary rise toward the most recent years, suggesting that the model draws on both long-term baseline levels and recent dynamics when forecasting the SDGI.
3.5. Governance and Predictability Nexus
The relationship between governance quality and prediction error is shown in
Figure 6 and
Table 4. The relationship is negative and statistically significant for both targets. For the SDGI, the estimated coefficient is −0.196, with a standard error of 0.067 and a
p-value of 0.004. For the HDI, the coefficient is −0.002, with a standard error of 0.001 and a
p-value below 0.001. Countries and regions with stronger governance exhibit more predictable sustainability trajectories.
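To show how a coefficient and standard error of this kind are typically obtained, the sketch below fits a bivariate ordinary least squares regression of prediction error on a governance score. This is an assumption made for illustration: the specification behind Table 4 may include controls or robust standard errors. The function name and the data are hypothetical.

```python
import math


def ols_slope(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Return (slope, standard error, t-ratio) for y = a + b*x fitted by OLS."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom.
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    # A perfect fit gives a zero standard error; the t-ratio is then unbounded.
    t_ratio = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, std_err, t_ratio


# Made-up governance scores and prediction errors (illustrative only).
governance = [0.0, 1.0, 2.0, 3.0]
pred_error = [3.0, 2.0, 2.0, 0.0]
slope, std_err, t_ratio = ols_slope(governance, pred_error)
```

For the SDGI estimate reported above, the ratio of the coefficient to its standard error is −0.196 / 0.067 ≈ −2.9.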
Greater predictability does not imply that well-governed countries are static. They develop and change, but they do so through systematic, policy-driven processes that produce regular trajectories amenable to forecasting from historical trends. Countries with weaker governance experience more stochastic trajectories shaped by conflict, resource price shocks, and institutional instability, which are more difficult for any model to anticipate. The governance effect is somewhat stronger for the HDI, possibly because the health and income components of the HDI are more directly dependent on the quality of public service delivery. This relationship is correlational rather than causal, and we cannot rule out confounding from factors such as conflict exposure or resource dependence that co-vary with governance quality.
5. Conclusions
National sustainability trajectories are largely predictable from historical development data, with 62 percent of 184 countries and regions remaining within calibrated prediction intervals for both the SDGI and HDI over a five-year forecast horizon. Among the remainder, the SDGI surprises skew positive while the HDI surprises skew negative, pointing to a gap between goal-level progress and capability-level resilience.
Unlike previous SDG forecasting studies that rely on univariate projections or regional aggregations, this framework jointly models 200 development indicators at the country level with interpretable deep learning and calibrated uncertainty quantification. More specifically, this study offers three major contributions. First, the trajectory intelligence framework shifts sustainability monitoring from static diagnosis to forecasting with calibrated uncertainty. Second, the asymmetric surprise structure documents the differential vulnerability of sustainability dimensions to external shocks, with direct implications for recovery assessment. Third, the structural difference in predictive patterns between the SDGI and HDI informs the design of cross-domain monitoring systems that track both goal-level and capability-level dimensions.
Future work should extend this framework to forecasting at the level of individual SDGs, incorporate remote sensing inputs such as nighttime light imagery as complementary features, expand the training period as more years of WDI data become available, and apply causal inference methods to identify the drivers of sustainability trajectory deviations rather than merely predicting their occurrence.