All experiments were conducted on Databricks Runtime 16.4 ML (Apache Spark 3.5.2, Scala 2.12) with GPU acceleration enabled. Computations were executed on a g4dn.xlarge instance equipped with four virtual CPUs, 32 GB RAM, and an NVIDIA T4 GPU.
The NRMSE facilitates comparisons across series, cross-sectional levels, and temporal frequencies with different scales, while the macro-averaging scheme prevents any temporal frequency from dominating the global metric solely because it contains more forecast steps.
Each gray cell represents the average of the cutoff-level NRMSE values for a given combination of temporal frequency and cross-sectional hierarchy level. Blue cells represent cross-sectional summaries obtained by averaging NRMSE over all frequencies and cutoffs within each hierarchy level. Green cells represent temporal summaries obtained by averaging NRMSE over all cross-sectional hierarchy levels and cutoffs within each temporal frequency. The orange cell represents the global summary obtained by averaging NRMSE over all temporal frequencies, cross-sectional hierarchy levels, and cutoffs. Accordingly, the orange cell should be interpreted as a balanced global summary rather than as a pooled error over all forecasted time points. It should therefore be read together with the frequency-specific summaries shown in the green cells.
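The aggregation behind the colored cells can be sketched as follows. This is a minimal illustration, assuming a balanced design (the same number of cutoffs in every frequency-level cell); the records and the helper name `summarize` are hypothetical and not taken from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cutoff-level NRMSE records: (frequency, level, cutoff, nrmse).
records = [
    ("weekly", "national", 1, 0.10), ("weekly", "national", 2, 0.14),
    ("weekly", "provincial", 1, 0.20), ("weekly", "provincial", 2, 0.24),
    ("hourly", "national", 1, 0.50), ("hourly", "national", 2, 0.54),
    ("hourly", "provincial", 1, 0.70), ("hourly", "provincial", 2, 0.74),
]

def summarize(records):
    """Return (gray, blue, green, orange) summaries for a balanced design."""
    by_cell = defaultdict(list)
    for freq, level, _cutoff, value in records:
        by_cell[(freq, level)].append(value)
    # Gray cells: mean over cutoffs for each (frequency, level) pair.
    gray = {cell: mean(values) for cell, values in by_cell.items()}
    freqs = sorted({freq for freq, _ in gray})
    levels = sorted({level for _, level in gray})
    # Blue cells: one summary per hierarchy level, averaging over frequencies.
    blue = {lvl: mean(gray[(f, lvl)] for f in freqs) for lvl in levels}
    # Green cells: one summary per frequency, averaging over hierarchy levels.
    green = {f: mean(gray[(f, lvl)] for lvl in levels) for f in freqs}
    # Orange cell: every gray cell has equal weight in the global summary.
    orange = mean(gray.values())
    return gray, blue, green, orange

gray, blue, green, orange = summarize(records)
```

Because every gray cell enters the global mean with the same weight, a frequency with many more forecast steps (such as hourly) does not outweigh the others.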
5.1. Error Diagnostic Analysis
To evaluate whether the impact of cross-temporal reconciliation on predictive accuracy varies across forecasting paradigms, NRMSE values were analyzed using a multi-factor analysis of variance (ANOVA) based on the aligned rank transform (ART) framework. This approach provides a non-parametric alternative for factorial experimental designs, enabling the assessment of main and interaction effects without requiring the assumption of normally distributed residuals [54,55]. The Bottom-Up (BU) strategy was included as a benchmark reconciliation approach, as it guarantees both temporal and cross-sectional coherence by construction through the aggregation of forecasts from the hourly level to coarser temporal frequencies and from bottom-level series to higher-level aggregates. Separate analyses were performed for weekly, daily, and hourly frequencies and for each spatial aggregation level (national, regional, and provincial). In each case, forecasting accuracy was evaluated using ART-based mixed-effects models, where cutoff was specified as a random effect and degrees of freedom were adjusted using the Kenward–Roger approximation [56].
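The alignment-and-ranking step of ART can be illustrated with a short sketch. It assumes two fixed factors (here standing for model and reconciliation method) in a balanced design, and it covers only the transform: the mixed-effects fit with cutoff as a random effect and the Kenward–Roger adjustment are not shown. The function names and the toy data are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, assigning the average rank to ties."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size)
    i = 0
    while i < x.size:
        j = i
        while j + 1 < x.size and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def art_aligned_ranks(y, a, b, effect):
    """Align the response for one effect ('A', 'B', or 'A:B'), then rank it."""
    y = np.asarray(y, dtype=float)
    a, b = np.asarray(a), np.asarray(b)
    grand = y.mean()
    mean_a = np.array([y[a == ai].mean() for ai in a])
    mean_b = np.array([y[b == bi].mean() for bi in b])
    mean_ab = np.array([y[(a == ai) & (b == bi)].mean() for ai, bi in zip(a, b)])
    residual = y - mean_ab  # strips all fixed effects
    if effect == "A":
        estimated = mean_a - grand
    elif effect == "B":
        estimated = mean_b - grand
    elif effect == "A:B":
        estimated = mean_ab - mean_a - mean_b + grand
    else:
        raise ValueError("effect must be 'A', 'B', or 'A:B'")
    # Rounding is a pragmatic choice to keep ties stable under float noise.
    aligned = np.round(residual + estimated, 10)
    return aligned, average_ranks(aligned)

# Toy NRMSE values: 2 models x 2 reconciliation methods x 2 cutoffs.
y = [1.0, 1.2, 2.0, 2.4, 3.0, 3.1, 6.0, 6.5]
a = ["m1", "m1", "m1", "m1", "m2", "m2", "m2", "m2"]
b = ["r1", "r1", "r2", "r2", "r1", "r1", "r2", "r2"]
aligned_ab, ranks_ab = art_aligned_ranks(y, a, b, "A:B")
```

Each effect gets its own aligned-and-ranked response, and a standard factorial model is then fitted to those ranks, with only the aligned effect interpreted. In a balanced design the aligned values sum to zero, which is a useful sanity check.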
At the global level, we applied a repeated-measures ART ANOVA.
Table 4 indicates significant effects of model, reconciliation method, and their interaction on NRMSE. This shows that predictive accuracy varies across forecasting models, that reconciliation variants do not produce equivalent errors, and that their effect depends on the base forecaster. The same pattern was observed across all hierarchical levels and temporal frequencies, where all effects remained statistically significant (
) in every case.
To complement the analysis, Figure 6 presents a Multiple Comparisons with the Best (MCB) analysis based on average NRMSE for the six baseline forecasting models, both globally and separately by temporal frequency and cross-sectional level. This comparison is relevant because the effectiveness of reconciliation depends, to a large extent, on the quality of the underlying base forecasts. LightGBM ranked first globally and at all three cross-sectional levels. The second to fourth positions were occupied by NHITS, NBEATSx, and KAN, with only minor changes in their ordering across the national, regional, and provincial levels, whereas TBATS and TimeGPT consistently appeared at the bottom of the ranking. Across frequencies, the weekly panel showed the weakest separation, with all baseline models falling within the critical band, indicating statistically similar performance. In the daily panel, LightGBM ranked first, followed by NHITS and NBEATSx. The hourly panel showed the clearest separation, with LightGBM alone occupying the top position.
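One common way to build an MCB comparison is to rank the models within each evaluation block, average the ranks, and compare the resulting intervals using a Nemenyi-type critical distance. The sketch below follows that construction as an assumption; it is not necessarily the exact procedure behind Figure 6. The error values are made up, and `q_alpha` must be supplied by the user (2.343 is the tabulated 5% Nemenyi value for three methods).

```python
import math

def mean_ranks(errors):
    """errors[i][j] is the error of model j in block i (lower is better)."""
    n_models = len(errors[0])
    totals = [0.0] * n_models
    for row in errors:
        order = sorted(range(n_models), key=lambda j: row[j])
        pos = 0
        while pos < n_models:
            end = pos
            while end + 1 < n_models and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg = (pos + end) / 2 + 1  # average rank shared by tied models
            for k in range(pos, end + 1):
                totals[order[k]] += avg
            pos = end + 1
    return [t / len(errors) for t in totals]

def critical_distance(n_models, n_blocks, q_alpha):
    """Nemenyi critical distance for mean ranks."""
    return q_alpha * math.sqrt(n_models * (n_models + 1) / (6.0 * n_blocks))

def within_band_of_best(ranks, cd):
    """True where a model's interval (rank +/- cd/2) overlaps the best one."""
    best = min(ranks)
    return [r - best <= cd for r in ranks]

errors = [
    [0.10, 0.20, 0.30],
    [0.10, 0.30, 0.20],
    [0.20, 0.10, 0.30],
    [0.10, 0.20, 0.30],
]
ranks = mean_ranks(errors)
cd = critical_distance(3, len(errors), q_alpha=2.343)
flags = within_band_of_best(ranks, cd)
```

With only four blocks the critical distance is wide and all three models fall inside the band, which mirrors the weak separation seen in the weekly panel. The band narrows as the number of blocks grows.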
Figure 7 extends the ranking analysis to reconciled forecasts. The MCB procedure was computed on the full set of 270 reconciled configurations generated across forecasting models, reconciliation approaches, and estimators. For visualization, however, we restricted our attention to a representative subset of 36 configurations: for each of the six forecasting models, the best- and worst-performing reconciled variants under the direct, univariate, and iterative cross-temporal approaches (six configurations per model). From this subset, the plot displays only the 10 highest-ranked configurations to preserve readability and facilitate comparisons between forecasting models and reconciliation strategies.
Globally, the ranking was dominated by LightGBM, whose univariate tCROSS:cBU and iterative tCROSS:cWLS configurations consistently occupied the first two positions, followed by its direct COV variant. Importantly, 117 reconciled configurations overlapped the MCB critical difference band. The top part of this group was dominated by LightGBM variants across all reconciliation approaches and estimators, followed by the best-performing iterative and univariate deep learning configurations.
The global top 10 included both the best- and worst-performing reconciled variants of LightGBM for each of the three cross-temporal approaches. This indicates that, across the direct, univariate, and iterative schemes, and across all associated estimators, LightGBM consistently remained ahead of the other forecasting models, even when compared against their best reconciled variants. KAN and NHITS also entered the global top 10: KAN mainly through its best iterative and univariate variants, and NHITS through its direct and univariate variants.
A similar pattern was observed across cross-sectional levels. At the national and regional levels, the leading positions were still dominated by LightGBM variants. At the provincial level, however, only the three strongest LightGBM configurations remained at the top, while the best NHITS, NBEATSx and KAN variants became more prominent. The frequency-specific panels revealed a clearer contrast. At the weekly level, all plotted configurations lay within the critical band, indicating very similar performance among the best reconciled models. At the daily and hourly levels, the separation became more evident: the two LightGBM tCROSS variants remained the top-ranked configurations, followed by other LightGBM variants and then the best NHITS and KAN alternatives. Meanwhile, TBATS and TimeGPT consistently occupied the lowest positions in the global, frequency-specific, and cross-sectional MCB rankings, indicating weaker reconciled performance relative to the other forecasting models. Together, these results confirm that the strongest reconciled performance was repeatedly achieved by LightGBM, especially under the univariate and iterative tCROSS specifications.
Table 5 reports the global NRMSE of the baseline forecasts together with the best- and worst-performing reconciled variants for each model under the direct, univariate, and iterative cross-temporal approaches. This condensed presentation was adopted to preserve table readability and facilitate comparisons across forecasting models and reconciliation strategies. The complete set of results for all reconciliation methods, separately reported for the direct, univariate, and iterative approaches and for each base model, is provided in Appendix A. For the univariate and iterative cases, the notation in parentheses identifies the order in which the temporal and cross-sectional dimensions are handled, together with the corresponding estimators. In the univariate case, either temporal reconciliation is applied first and cross-sectional coherence is then imposed through aggregation, denoted by tX:cBU, or cross-sectional reconciliation is applied first and temporal coherence is then imposed through aggregation, denoted by cX:tBU. In the iterative case, tX:cY indicates that temporal reconciliation is applied using temporal estimator X, followed by cross-sectional reconciliation using cross-sectional estimator Y.
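The labelling convention can be made concrete with a small parser. This is a hypothetical helper written only to illustrate the notation. It assumes that model names never start with a lowercase `t` or `c`, and that a single `ct` prefix (as in `ctSS`) marks a direct cross-temporal estimator.

```python
def parse_config(label):
    """Split a label such as 'LightGBM:tCROSS:cWLS' into its parts."""
    tokens = label.split(":")
    model = None
    # An optional leading token that is not a step is read as the model name.
    if tokens[0][0] not in ("t", "c"):
        model = tokens.pop(0)
    # Assumption: a lone 'ct...' token denotes the direct approach.
    if len(tokens) == 1 and tokens[0].startswith("ct"):
        return {"model": model, "scheme": "direct",
                "steps": [("cross-temporal", tokens[0][2:])]}
    dims = {"t": "temporal", "c": "cross-sectional"}
    # Steps are listed in the order in which they are applied.
    steps = [(dims[tok[0]], tok[1:]) for tok in tokens]
    return {"model": model, "scheme": "two-step", "steps": steps}

example = parse_config("tCROSS:cWLS")
```

The label alone does not say whether a two-step configuration is univariate or iterative, so the sketch reports both as `two-step`. In the univariate case the second step is always BU.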
LightGBM showed the largest error reductions under reconciliation, with its best results obtained under the univariate variant (tCROSS:cBU) and the iterative variant (tCROSS:cWLS). More importantly, it remained the strongest model overall: even its worst reconciled variants yielded lower global NRMSE values than the best reconciled configurations obtained by any of the other forecasting models.
KAN also benefited substantially, with its best results obtained under the univariate and iterative schemes, both of which relied on variance-based information and delivered very similar errors. For the NHITS and NBEATSx models, the most favorable results were associated with structural scaling. In NBEATSx, the best direct and iterative configurations reached the same error, whereas NHITS achieved its lowest value under the univariate scheme. TBATS attained its best performance with the iterative variant (tOLS:cCOVSh), and TimeGPT improved mainly under covariance-based specifications in the univariate and iterative settings. A broader pattern also emerges from the table: the lowest errors were concentrated almost entirely in the univariate and iterative approaches. For five of the six forecasting models, the best reconciled result came from one of these two schemes; NBEATSx was the only exception, with a tie between the direct and iterative approaches.
For no base model did the direct cross-temporal BU benchmark outperform the best reconciled variant obtained under the direct, univariate, or iterative approaches. Thus, although BU provides a coherent and useful reference, it was surpassed by at least one alternative reconciliation specification for every forecasting model considered.
The effect of reconciliation was not consistently favorable. Although it reduced error for selected configurations, it also produced clear deteriorations in several cases. The largest degradations appeared under the iterative scheme for NBEATSx, NHITS, TimeGPT, and especially TBATS, where some combinations led to dramatic increases in NRMSE. The highest-error cases were concentrated in covariance-based iterative variants such as (tOLS:cCOV) and (tCROSS:cCOV), suggesting that the main challenge lies in the estimation and use of full covariance matrices within the iterative procedure. These estimators could deliver substantial gains in some settings, but they also produced the least favorable outcomes in others. By comparison, simpler variance-based and structural estimators tended to yield more favorable behavior across models. Taken together, the results identify LightGBM and KAN as the models that made the most effective use of cross-temporal reconciliation, while for the remaining models the outcome depended much more on the specific reconciliation sequence and estimator.
These findings are reinforced by the heatmaps in Figure 8, which report the percentage change in global NRMSE for each model–reconciliation combination relative to the BU benchmark. In the direct approach, the most consistent gains were obtained with SS, which reduced error for all models, with the largest decreases observed for NBEATSx (
), NHITS (
), and TBATS (
). COV also produced marked improvements for LightGBM (
) and NHITS (
), followed by smaller reductions for TimeGPT and NBEATSx, whereas its effect on TBATS and KAN was limited. By contrast, OLS increased error for KAN, LightGBM, NBEATSx, and NHITS, with the largest deterioration observed for KAN (
), whereas it reduced error for TBATS (
) and TimeGPT (
), making these the only two models that benefited from this estimator under the direct approach.
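The percentage changes shown in such a heatmap are relative differences with respect to the BU benchmark. A minimal sketch follows, using made-up NRMSE values and a hypothetical helper name, and assuming the sign convention in which negative values denote error reductions:

```python
def pct_change_vs_benchmark(nrmse, benchmark="BU"):
    """Percentage change of each method's NRMSE relative to the benchmark."""
    changes = {}
    for model, by_method in nrmse.items():
        base = by_method[benchmark]
        changes[model] = {
            method: 100.0 * (value - base) / base
            for method, value in by_method.items()
            if method != benchmark
        }
    return changes

# Hypothetical global NRMSE values, not taken from the study.
nrmse = {
    "ModelA": {"BU": 0.50, "SS": 0.45, "OLS": 0.55},
    "ModelB": {"BU": 0.80, "SS": 0.76, "OLS": 0.72},
}
changes = pct_change_vs_benchmark(nrmse)
```

Here SS lowers the error of the hypothetical `ModelA` by 10% relative to BU, while OLS raises it by 10%.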
The univariate approach reveals a clearer separation across reconciliation methods. When cross-sectional reconciliation was applied first and BU was then used temporally (e.g., cCOV:tBU, cOLS:tBU), the reductions were generally small and most combinations increased error, reaching for TBATS under cOLS:tBU and for NBEATSx under cCOV:tBU. In contrast, the reverse order, in which temporal reconciliation was applied first and BU was then used cross-sectionally, was much more effective, particularly for variance- and covariance-based estimators. A plausible explanation is that temporal reconciliation corrects within-node inconsistencies across hourly, daily, and weekly resolutions before cross-sectional aggregation is imposed. This allows lower-frequency information to regularize provincial hourly forecasts, so that the subsequent BU aggregation to regional and national levels propagates a less distorted error structure. By contrast, when cross-sectional reconciliation is applied first at the hourly level and temporal coherence is imposed afterward only through aggregation, residual temporal misspecification at the lower level is carried into daily and weekly totals rather than explicitly corrected. Under this scheme, LightGBM achieved the largest reductions, especially with tAUTOCOV:cBU () and tCROSS:cBU (). Deep learning models also benefited, particularly with variance-based estimators such as WLSS, WLSH, and WLSV, with reductions ranging from to . TimeGPT attained its largest decrease under tAUTOCOV:cBU, whereas TBATS did so under tOLS:cBU, reaching reductions of up to and , respectively. The main exceptions were the more noticeable error increases under tOLS:cBU for KAN () and, to a lesser extent, LightGBM ().
The iterative approach produced the best improvements for some models, but it also generated the largest failures. LightGBM again showed the clearest benefit, especially under temporal covariance-based estimators combined with cross-sectional WLS-type methods. Its best result was obtained with tCROSS:cWLS, which reduced global NRMSE by relative to BU. KAN improved under temporal WLSH combined with cross-sectional WLS, SS, COVSh, and OLS, with reductions between and . NHITS and NBEATSx performed best when WLSS was used temporally and paired with each cross-sectional estimator, producing decreases of roughly to . TBATS benefited across all combinations in which OLS was used as the temporal estimator, with reductions of up to regardless of the cross-sectional estimator applied afterward.
At the same time, the iterative heatmap shows that this approach could also produce the least favorable outcomes. Several combinations caused error increases above , particularly for NBEATSx, NHITS, and TBATS. The most problematic cases involved covariance-based cross-sectional reconciliation applied after temporal reconciliation, including variants such as tAUTOCOV:cCOV, tCROSS:cCOV, and tWLSS:cCOV. This pattern suggests that, under the iterative scheme, the main difficulty may lie in the estimation and inversion of the cross-sectional covariance matrix rather than in the temporal estimator itself. KAN also showed its largest deterioration under this approach, with error increases of up to when OLS was used as the temporal estimator and combined with cross-sectional estimators. Overall, these results indicate that cross-sectional covariance reconciliation was the main source of severe degradations in the iterative setting, whereas temporal OLS appeared to be an additional model-specific risk for KAN.
Table 6 extends the global comparison by reporting the baseline forecasts and the representative best- and worst-performing reconciled variants separately for the weekly, daily, and hourly frequencies. The same model-specific patterns observed at the global level remained visible across frequencies, although their magnitude changed with the temporal aggregation level. As in the global analysis, the best results were concentrated mainly in the univariate and iterative approaches. Weekly and hourly forecasts were dominated by these two schemes, whereas the daily frequency showed a weaker and more heterogeneous response. Across all three frequencies, LightGBM remained the best-performing model. This frequency-dependent behavior is consistent with prior evidence from temporal hierarchical reconciliation on the same Belgian PV dataset, where the clearest benefits of reconciliation were observed at the weekly and hourly resolutions, while the daily level showed only modest and model-dependent improvements [22].
Weekly forecasts showed the clearest gains from reconciliation. The differences between the baseline and the best reconciled variant were negligible for TBATS and TimeGPT, but much larger for LightGBM and the deep learning models. The strongest reductions relative to the baseline were obtained by KAN () under the univariate and iterative approaches, NBEATSx () under the direct and iterative approaches, and NHITS () under the univariate approach. LightGBM still achieved the lowest weekly NRMSE overall, with tCROSS:cBU reducing error by , followed closely by tCROSS:cWLS with a reduction. Even its worst reconciled weekly variants remained below the baseline error, indicating that reconciliation was beneficial for this model across all weekly specifications considered.
Daily forecasts showed a different pattern. For LightGBM, NBEATSx, and NHITS, the baseline forecasts already yielded the lowest errors, even when compared with the best reconciled variants of KAN, TBATS, and TimeGPT. For these latter three models, the reductions relative to the baseline were also small, ranging only from to . Taken together, these results indicate that the daily level was the least responsive to reconciliation, with improvements that were generally modest and more dependent on the specific forecasting model.
The hourly forecasts again were favored by reconciliation, although less markedly than weekly forecasts. LightGBM achieved the lowest error with the iterative variant tCROSS:cWLS, which reduced NRMSE by . KAN ranked second, also under the iterative approach, with a reduction of . The remaining models exhibited more modest gains relative to their baselines, with best-case reductions of for NHITS, for NBEATSx and for TimeGPT and TBATS.
Table 7 reports the baseline forecasts and the representative best- and worst-performing reconciled variants at the national, regional, and provincial levels. Unlike the frequency results, the best reconciled variant improved upon the baseline for every model at all three cross-sectional levels.
LightGBM yielded the lowest NRMSE at every level, in all cases under the iterative approach, with values of , , and at the national, regional, and provincial levels, corresponding to reductions of , , and , respectively. KAN also showed clear reductions, again mainly under the iterative approach, with improvements of at the national level, at the regional level, and at the provincial level.
For the deep learning models, NBEATSx attained the same best value under the direct and iterative approaches at the national and regional levels, whereas its best provincial result was obtained under the univariate approach. NHITS showed the same shift, with the direct approach yielding the lowest error at the national and regional levels and the univariate approach doing so at the provincial level. TimeGPT also improved at all three levels, with the same best value under the univariate and iterative approaches, whereas TBATS consistently reached its lowest error under the iterative approach. However, TBATS also showed the most inflated worst-case reconciled values across all three levels. The largest deteriorations again came from iterative variants combining temporal OLS or CROSS with cross-sectional COV, reproducing the same instability already observed in the global and frequency-specific analyses.
Figure 9 summarizes the best-performing reconciled cross-temporal configuration of each forecasting model across hierarchy levels and temporal frequencies. The best configurations were predominantly iterative for LightGBM, KAN, and TBATS, univariate for NHITS and TimeGPT, and direct only for NBEATSx. In the x-axis labels, the abbreviation after ‘|’ identifies the selected reconciliation scheme: ite for iterative, uni for univariate, and dir for direct.
Rather than altering the ranking substantially across hierarchy levels, the figure shows a much stronger separation across temporal frequencies. Weekly forecasts consistently occupy the lowest NRMSE range, daily forecasts lie at an intermediate level, and hourly forecasts display the largest errors for every model. Within each frequency, national and regional results are very close, whereas the provincial level is systematically associated with the highest NRMSE, indicating that error increases as the series become more spatially disaggregated. This pattern is most evident at the hourly frequency, where the provincial level consistently exhibited the highest NRMSE values.
LightGBM:tCROSS:cWLS remains the top-performing specification throughout, with the lowest NRMSE at the weekly, daily, and hourly levels and for all three hierarchy levels. KAN:tWLSH:cWLS follows as the second-best model overall. NHITS:tWLSS:cBU and NBEATSx:ctSS form a middle group with very similar errors. TimeGPT:tAUTOCOV:cBU and TBATS:tOLS:cCOVSh occupy the upper part of the error scale, although their ordering changes with frequency: at the weekly and daily levels, TBATS is generally below TimeGPT, whereas at the hourly level TimeGPT becomes slightly more favorable than TBATS. Overall, the figure reinforces two main findings: first, the relative ranking of the best reconciled configurations is remarkably stable across hierarchy levels; second, forecast difficulty is driven primarily by temporal resolution and spatial disaggregation, with hourly provincial series representing the most challenging setting.
These patterns are also reflected in Figure 10, which shows forecasts during the summer period, when PV generation reaches its highest levels in this dataset.
At the weekly and daily frequencies, the best reconciled LightGBM configuration followed the temporal evolution of PV generation more closely than the other models, although it still tended to underestimate the observed series, particularly around the largest peaks. The deep learning models displayed broadly similar trajectories, but with a less accurate representation of short-term fluctuations and local turning points in generation. By contrast, TimeGPT and TBATS showed the weakest performance at the weekly and daily levels, with a clearer underestimation of major production peaks and a poorer reconstruction of the overall trajectory. At the hourly frequency, all models reproduced the pronounced intraday cycle, but differences in peak amplitude, day-to-day modulation, and local variability remained visible. In line with the previous results, the best reconciled LightGBM forecasts remained among the closest to the observed series across frequencies.
5.3. Numerical Diagnostic of Unstable Iterative Configurations
Some iterative configurations produced error increases that were too large to be treated as ordinary losses in accuracy. To examine these cases, we performed a targeted diagnostic analysis of the most unstable covariance-based iterative variants and their shrinkage counterparts. The purpose was to check whether the error explosions were linked to covariance conditioning, convergence behavior, or excessive amplification by the reconciliation operator.
Table 9 summarizes the diagnostic results. Each configuration is reported with two rows: cCOV corresponds to the sample covariance estimator, while cCOVSh corresponds to the Ledoit–Wolf shrinkage estimator. NRMSE measures the actual damage in forecasting accuracy under the same global evaluation protocol used in the main experiments. Convergence (Conv.) reports the percentage of cutoffs that reached the numerical tolerance. Iter. med/max gives the median and maximum number of iterations. Patience stop (Pat. stop) reports the percentage of cutoffs stopped because the objective stopped improving before reaching the tolerance.
The conditioning metric is
, where
This is the cross-sectional GLS matrix involved in the reconciliation projection. Large values indicate poor numerical conditioning. The metric
is the spectral norm of the cross-sectional reconciliation operator and measures its capacity to amplify forecast perturbations. The amplification ratio (Amp. ratio) is the maximum absolute reconciled forecast divided by the maximum observed value. It gives a direct measure of forecast explosion.
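The three diagnostics can be sketched numerically as follows. The sketch assumes the standard GLS (MinT-type) cross-sectional form, with matrix S'W⁻¹S and reconciliation operator S(S'W⁻¹S)⁻¹S'W⁻¹, and it assumes that conditioning is reported as the base-10 logarithm of the condition number. These are assumptions about the exact definitions, and all names and numbers below are illustrative.

```python
import numpy as np

def gls_diagnostics(S, W, base_forecasts, observed):
    """Conditioning, projection norm, and amplification for one GLS step."""
    W_inv = np.linalg.inv(W)
    M = S.T @ W_inv @ S                  # assumed cross-sectional GLS matrix
    G = np.linalg.solve(M, S.T @ W_inv)  # maps base forecasts to bottom level
    P = S @ G                            # reconciliation (projection) operator
    reconciled = P @ base_forecasts
    return {
        "log10_cond": float(np.log10(np.linalg.cond(M))),
        "proj_norm": float(np.linalg.norm(P, 2)),  # largest singular value
        "amp_ratio": float(np.max(np.abs(reconciled)) / np.max(np.abs(observed))),
        "reconciled": reconciled,
    }

# Toy hierarchy: total = A + B.
S = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
base = np.array([10.0, 4.0, 5.0])      # incoherent base forecasts
observed = np.array([9.0, 4.0, 5.0])   # hypothetical observed values
diag = gls_diagnostics(S, np.eye(3), base, observed)
```

With an identity weight matrix the operator is an orthogonal projection, so its spectral norm equals 1. For a general W the projection is oblique and its norm can be much larger, which is the amplification risk the metric is meant to flag.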
The diagnostic results show that the problem is not simply that some base models forecast worse. The large failures are concentrated in the cCOV variants, where the sample covariance matrix leads to poorly conditioned cross-sectional GLS projections. This is reflected in the NRMSE values. For example, TBATS:tCROSS:cCOV reaches an NRMSE of 94.8124, NHITS:tOLS:cCOV reaches 7.9416, and NBEATSx:tOLS:cCOV reaches 7.0525. After replacing cCOV with cCOVSh, these values decrease to 0.9288, 0.7207, and 0.6965, respectively.
The conditioning diagnostics explain this behavior. For NBEATSx, decreases from 10.10 to 3.36 after shrinkage. For NHITS, it decreases from 10.17 to 2.70. These values indicate that the cross-sectional matrix involved in the GLS projection was severely ill-conditioned under cCOV and became much better conditioned under cCOVSh. TBATS shows a smaller conditioning value under cCOV, 6.69, but it produced the largest practical explosion. This indicates that conditioning alone does not explain the failures; the amplification induced by the reconciliation operator also matters.
The projection norm shows this amplification effect. Under cCOV, is 9804.73 for NBEATSx and 11,876.24 for NHITS. After shrinkage, these values decrease to 13.86 and 17.31. For TBATS, the norm decreases from 268.28 to 10.51. These reductions mean that the cross-sectional reconciliation operator under cCOV had a high capacity to magnify small changes in the forecasts. The amplification ratio gives the same message in practical terms. The clearest case is TBATS:tCROSS:cCOV, where the largest reconciled forecast was more than 3000 times larger than the maximum observed value. With cCOVSh, this ratio decreased to 0.78.
The convergence diagnostics also point to numerical instability under cCOV. All cCOVSh variants reached the tolerance in 100% of the cutoffs and none stopped by patience. Under cCOV, convergence was incomplete in several cases. TBATS:tCROSS:cCOV reached the tolerance in only 30.8% of the cutoffs and stopped by patience in 69.2%. NBEATSx with AUTOCOV or CROSS reached the tolerance in 84.6% of the cutoffs and stopped by patience in 15.4%. NHITS with OLS reached the tolerance in 82.7% of the cutoffs and stopped by patience in 17.3%.
The iteration counts show that coherence convergence and numerical reliability are not the same. Some configurations reached their stopping criterion quickly but still produced unrealistic forecasts. For example, NBEATSx:tOLS:cCOV has a median/maximum of 2/11 iterations, but its amplification ratio is 64.56. Similarly, NHITS:tOLS:cCOV has 2/11 iterations and an amplification ratio of 63.17. In these cases, the failure occurs almost immediately after applying the covariance-based projection. Other cases, such as NBEATSx:tAUTOCOV:cCOV and NBEATSx:tCROSS:cCOV, require many more iterations in the worst cutoffs, with maximum values of 168 and 199, showing an additional convergence issue.
These results support a more cautious interpretation of full sample covariance estimators in iterative cross-temporal reconciliation. In this experiment, cCOVSh acts as a numerical regularization of cCOV and prevents the largest forecast explosions. The results also show that reaching coherence is not enough to guarantee useful forecasts. For this reason, shrinkage regularization, early stopping, conditioning checks, projection-norm checks, and fallback to simpler diagonal estimators such as cWLS should be treated as practical safeguards when covariance-based iterative reconciliation is used.
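One way to combine some of these safeguards is a conditioning check with successive fallbacks. The sketch below is only an illustration: it shrinks toward the diagonal with a fixed intensity `lam` instead of the data-driven Ledoit–Wolf intensity, it checks the conditioning of the weight matrix itself, and the threshold `max_log10_cond` is an arbitrary hypothetical choice.

```python
import numpy as np

def shrink_to_diagonal(cov, lam):
    """Convex combination of the covariance and its diagonal part."""
    target = np.diag(np.diag(cov))
    return lam * target + (1.0 - lam) * cov

def safe_weight_matrix(cov, lam=0.5, max_log10_cond=8.0):
    """Return the first acceptably conditioned candidate and its label."""
    candidates = (("cov", cov), ("shrunk", shrink_to_diagonal(cov, lam)))
    for name, W in candidates:
        if np.log10(np.linalg.cond(W)) <= max_log10_cond:
            return name, W
    # Last resort: keep only the variances (a WLS-type diagonal estimator).
    return "diagonal", np.diag(np.diag(cov))

# A singular sample covariance, as can occur with few residuals.
singular_cov = np.array([[1.0, 1.0], [1.0, 1.0]])
choice, W = safe_weight_matrix(singular_cov)
```

In this toy case the sample covariance is rejected and the shrunk matrix, whose condition number is 3, is used instead. A check on the projection norm or the amplification ratio could be added to the same loop.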