1. Introduction
Digital Elevation Models (DEMs), as the digital representation of the Earth’s surface morphology, serve as an indispensable foundational dataset supporting a wide range of Earth science applications, including hydrological modeling [1], geohazard assessment [2], ecosystem research [3], and global change monitoring [4]. In recent years, the proliferation of various global open-source DEM products, such as SRTM GL1 DEM [5], TanDEM-X EDEM [6,7], ALOS World 3D [8], and Copernicus DEM [9], has significantly advanced research in related fields. However, it is crucial to clarify that most of these products are essentially Digital Surface Models (DSMs), representing the reflective surface of canopy and structures rather than the bare terrain [10,11]. The data acquisition and processing steps for these products introduce inherent systematic and random errors. These errors are influenced by multiple factors, including sensor type (e.g., InSAR, optical photogrammetry), terrain complexity, surface cover (especially vegetation), and atmospheric conditions [12]. Recent global validation studies highlight that these vertical errors are not randomly distributed but exhibit strong correlations with geomorphometric variables such as slope and aspect [13]. Consequently, high-precision error correction of these datasets remains a critical research priority.
The advent of high-precision global altimetry missions, particularly the successful operation of NASA’s Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2), has provided unprecedented opportunities for DEM error correction. ICESat-2’s advanced photon-counting lidar system delivers globally distributed ground elevation control points with centimeter-level vertical accuracy [14,15]. While high-resolution airborne LiDAR DTMs are often considered the gold standard for validation, such data are scarce or unavailable in large-scale, complex terrains like Western China. In this context, ICESat-2 serves as the most viable “ground truth” alternative. Utilizing these high-fidelity laser points as “ground truth,” per-point error assessment and correction of DEMs has become the mainstream technical approach in this domain [16,17,18].
The methodology for DEM error correction has witnessed a paradigm shift. Early efforts primarily relied on global trend surface analysis and classical geostatistical interpolation (e.g., Kriging), which explicitly models the spatial autocorrelation of errors. Despite the rapid development of new algorithms, Kriging remains a robust and widely used benchmark in recent high-precision DEM assessments [19]. However, the field has progressively transitioned toward data-driven machine learning, driven by the need to capture more complex non-linear patterns. Currently, a widely validated and highly successful paradigm is Machine Learning driven by Feature Engineering (FE-ML). Researchers consistently find that DEM errors correlate strongly with multiple quantifiable geomorphometric and geophysical factors. In this study, we define this paradigm as modeling based on “Explicit Physical Features”—variables governed by clear physical laws, such as slope (gravitational potential), aspect (solar illumination angle), and vegetation cover fraction. By constructing such an explicit feature set and applying Gradient Boosting Decision Tree (GBDT) models, such as XGBoost [20] or LightGBM [21], to learn the complex non-linear relationship between these features and the elevation error, exceptional correction performance has been achieved [22,23,24]. The essence of this paradigm lies in encoding extensive geo-scientific domain knowledge into structured, numerical features that the model can directly utilize.
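The core of this feature-engineering step can be illustrated with a minimal, self-contained sketch. The data below are entirely synthetic, and an ordinary least-squares fit stands in for a GBDT regressor purely for brevity; the key idea shown is encoding the circular aspect variable as a (sin, cos) pair so that 0° and 360° map to the same point in feature space before regressing the elevation error on explicit features.

```python
import numpy as np

# Hypothetical illustration: a sinusoidal, aspect-dependent elevation bias
# (e.g., from sensor look geometry) plus noise. All values are synthetic.
rng = np.random.default_rng(0)
aspect_deg = rng.uniform(0.0, 360.0, 500)          # terrain aspect, degrees
aspect_rad = np.deg2rad(aspect_deg)
true_error = 2.0 * np.sin(aspect_rad) - 1.0 * np.cos(aspect_rad) + 0.5
observed = true_error + rng.normal(0.0, 0.1, aspect_deg.size)

# Explicit feature engineering: aspect_sin and aspect_cos columns plus an
# intercept term, mirroring the circular encoding described in the text.
X = np.column_stack([np.sin(aspect_rad), np.cos(aspect_rad),
                     np.ones_like(aspect_rad)])

# Least-squares stand-in for a GBDT regressor: recover the bias model.
coef, *_ = np.linalg.lstsq(X, observed, rcond=None)
print(np.round(coef, 2))   # approximately [2.0, -1.0, 0.5]
```

A real pipeline would replace the linear fit with XGBoost or LightGBM and add further explicit features (slope, vegetation fraction), but the encoding step is the same.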
Nevertheless, the FE-ML paradigm traditionally treats error points as independent samples. From a geostatistical perspective, this neglects the inherent “spatial non-stationarity” and anisotropy of terrain errors, where error distribution varies directionally and clusters spatially [25]. To overcome this inherent constraint, an emerging paradigm from the field of deep learning—Geometric Deep Learning, specifically Graph Neural Networks (GNNs)—has demonstrated immense potential [26,27,28]. Unlike FE-ML, GNNs rely on “Implicit Spatial Relations”—the latent topological dependencies between adjacent pixels learned through message-passing mechanisms [29]. From a geomorphometric perspective, elevation errors are rarely isolated point-wise anomalies; instead, they exhibit strong spatial autocorrelation [30]. For instance, systematic elevation biases—such as those caused by radar shadowing or canopy penetration limits—often act as artificial dams, blocking overland flow paths and causing severe distortions in downstream hydrological drainage networks [31]. These topographically constrained biases create a continuous chain of spatially dependent errors across the landscape. Theoretically, by structuring ICESat-2 points as a geospatial graph, GNNs should capture spatially dependent error patterns missed by FE-ML models [32,33,34].
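The graph-structuring idea above can be sketched with a toy example. This is not the study's implementation: coordinates and features are synthetic, the neighbourhood size k is arbitrary, and a single GraphSAGE-style mean-aggregation step (concatenating each node's feature vector with the mean of its neighbours') stands in for a full trained GNN.

```python
import numpy as np

# Treat scattered altimetry footprints as graph nodes, connect each node to
# its k nearest neighbours, and perform one mean-aggregation message pass.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 1000, (200, 2))   # footprint positions, metres
feats = rng.normal(size=(200, 4))         # per-node features (synthetic)

k = 8
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)              # exclude self-loops
nbrs = np.argsort(d2, axis=1)[:, :k]      # indices of k nearest neighbours

neigh_mean = feats[nbrs].mean(axis=1)     # mean aggregation over neighbours
h = np.concatenate([feats, neigh_mean], axis=1)   # (200, 8) node embeddings
print(h.shape)
```

In a real GNN the aggregated vector would pass through learned weight matrices and non-linearities over several layers; the sketch only shows how spatial topology enters the representation.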
Despite the theoretical advantages of GNNs, a notable gap remains in the literature. While contemporary machine learning models have significantly improved global error metrics such as RMSE, their stability and interpretability in a geospatial context remain poorly characterized; in effect, they are a notorious “black box.” It is often unclear whether these deep learning architectures are genuinely learning universal topographic laws or merely overfitting to spatial noise. Therefore, a pivotal question arises: is the increasing complexity of these deep learning models justified by their marginal performance gains? In a scenario where feature engineering is already highly refined, can the “Implicit Spatial Relations” of GNNs genuinely deliver a performance advantage over state-of-the-art models that depend on “Explicit Physical Features”? Or does a robust physical feature set effectively “short-circuit” the need for complex spatial convolution?
To address these critical scientific questions, this study designs and executes a comprehensive comparative experimental framework. Instead of a dispersed global analysis, we focus on the structurally complex terrain of Sichuan Province, China. This region serves as a unique “natural laboratory” containing diverse geomorphological units (including basins, high mountains, and plateaus) within a consistent tectonic framework. We conduct a rigorous performance and mechanistic comparison of multiple hyperparameter-tuned GNN variants—including the foundational GraphSAGE [35], the advanced Graph Attention Network (GAT) [36], and a GNN-XGBoost hybrid model—against an equally and systematically optimized XGBoost [20] strong baseline across four mainstream DEM products. To ensure rigorous validation across different spatial supports, we employ a strict point-to-surface matching strategy to align ICESat-2 footprints with DEM grids.
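One common way to align discrete footprints with a continuous raster grid is bilinear sampling of the DEM at each footprint coordinate; the sketch below illustrates this (the study's exact matching strategy may differ, and the 2 × 2 grid values are invented for the example).

```python
import numpy as np

def bilinear_sample(dem, x, y):
    """Bilinearly interpolate a DEM (row 0 at y = 0, unit grid spacing)
    at fractional pixel coordinates (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    z00, z10 = dem[y0, x0], dem[y0, x0 + 1]      # top pair of cells
    z01, z11 = dem[y0 + 1, x0], dem[y0 + 1, x0 + 1]  # bottom pair
    top = z00 * (1 - dx) + z10 * dx
    bot = z01 * (1 - dx) + z11 * dx
    return top * (1 - dy) + bot * dy

dem = np.array([[10.0, 12.0],
                [14.0, 16.0]])
print(bilinear_sample(dem, 0.5, 0.5))   # centre of the four cells -> 13.0
```

In practice the footprint's geographic coordinates must first be projected into the DEM's pixel coordinate system before sampling.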
To systematically open this “black box,” the specific objectives of this study are threefold: (1) to evaluate whether the topological complexity of GNNs yields statistically significant accuracy improvements over optimized FE-ML models under sparse supervision; (2) to decode the internal decision-making mechanisms of both paradigms to verify if their logic aligns with physical geomorphological laws; and (3) to assess the explanatory stability of these models to prevent underspecification risks in Geo-AI.
The primary contributions of this research are as follows:
- (1)
We present the first systematic comparison of these two prevailing paradigms for DEM error correction, revealing a stable and, to some extent, counter-intuitive performance hierarchy where “Explicit Physics” rivals “Implicit Geometry.”
- (2)
Through the use of multi-source explainability tools (SHAP [37] and GNNExplainer [38]), we uncover for the first time the fundamental differences in the decision-making logic between these two modeling paradigms.
- (3)
We reveal a decoupling between predictive stability and explanatory stability in GNN-based models, highlighting the risk of “Categorical Overfitting”.
- (4)
Drawing upon these findings, we offer a critical and broadly applicable reflection on Pareto Optimality in model selection strategies within the geospatial AI domain.
4. Results
4.1. The Performance Hierarchy
As illustrated in Table 4 and Figure 6, a stable performance hierarchy consistently emerges across all four DEM datasets: Hybrid > XGBoost > GraphSAGE > MLP > GAT > IDW > Kriging. This clear stratification validates our core hypothesis regarding feature utility:
- (1)
The Robustness of Physical Features against Geometric Over-Complexity (MLP vs. GAT): As shown in Table 4, while GraphSAGE holds a slight edge over MLP, the attention-based GAT performs significantly worse than both, occupying a lower tier (e.g., RMSE 6.581 m for MLP vs. 7.793 m for GAT on AW3D). This directly explains why overly complex geometric approaches struggle in this specific task. In sparse, high-relief reference scenarios, explicit physical features (e.g., aspect) strictly govern and correct the dominant radar/optical distortions with high robustness. Conversely, as geometric models increase in complexity, they suffer a severe “complexity penalty.” GAT faces attention saturation and overfits to the sparse, irregular spatial topologies, failing to extract reliable implicit spatial relationships. Thus, a robust physical baseline (MLP) naturally surpasses a brittle, overly complex spatial model (GAT), further proving that established physical laws are far more reliable than forced geometric inferences in data-sparse environments.
- (2)
The Failure of Pure Spatial Autocorrelation (Kriging & IDW): Geostatistical baselines occupy the bottom tier (RMSE > 8 m), confirming that DEM errors in complex topography are deterministic and physically driven, not merely a function of spatial proximity.
Diminishing Returns and the Complexity Trade-Off
While the Hybrid model consistently achieves the lowest absolute RMSE across datasets (e.g., improving upon XGBoost by 0.013 m on AW3D), the improvement is remarkably small (<0.05 m). We must therefore evaluate whether this gain justifies the computational overhead. To assess this “RMSE parity” strictly, we conducted a Wilcoxon signed-rank test between the Hybrid (k = 8) and XGBoost results. Despite the small absolute margin, the differences are statistically significant (Bonferroni-corrected p < 0.001). However, assessing this statistical improvement through a performance-versus-complexity lens reveals a stark trade-off. As detailed in the computational cost analysis (tested on identical hardware), the pure XGBoost model achieves near-optimal correction utilizing only CPU resources with a rapid training time (~36.86 s). In contrast, introducing the GNN encoder necessitates GPU allocation and increases the pre-training time to roughly 666.23 s, a nearly 18-fold increase in computational cost for an RMSE reduction of less than 0.7%. This massive computational asymmetry, combined with marginal accuracy gains, compellingly demonstrates that explicit physical features (the “Physics Skeleton”) already explain the vast majority of the error variance. The implicit spatial embeddings provide merely a negligible supplementary effect, solidifying the operational superiority of the purely physical XGBoost model for large-scale DEM correction tasks (see Table 5).
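The paired significance test described above can be sketched as follows. The residual arrays here are synthetic placeholders (the noise levels and sample size are invented for illustration); the point is the mechanics of the Wilcoxon signed-rank test on per-point paired residuals, followed by a Bonferroni correction across the four DEM products.

```python
import numpy as np
from scipy.stats import wilcoxon

# Illustrative sketch: compare per-point absolute residuals of two models
# evaluated on the same test points. All values are synthetic.
rng = np.random.default_rng(2)
resid_xgb = np.abs(rng.normal(0.0, 6.6, 2000))                 # baseline
resid_hybrid = resid_xgb - 0.013 + rng.normal(0, 0.05, 2000)   # tiny shift

stat, p = wilcoxon(resid_xgb, resid_hybrid)
n_tests = 4                          # one test per DEM product (Bonferroni)
p_corrected = min(1.0, p * n_tests)
print(p_corrected < 0.001)           # small but consistent shift -> significant
```

This illustrates how a difference can be statistically significant (the shift is consistent across thousands of points) while remaining practically marginal in magnitude.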
4.2. Beyond Point Precision: Geomorphometric Fidelity Evaluation
Beyond vertical point accuracy, a critical objective is determining whether the model genuinely restores the structural fidelity of the terrain rather than merely removing a systematic global bias. To evaluate the correction of high-frequency topographic errors, we implemented a Lagrangian Along-Track Approach. This method mitigates the spatial support inconsistency between discrete laser footprints and continuous raster grids. Specifically, we calculated the instantaneous along-track slope directly between adjacent ICESat-2 footprints and compared it with the slope between the corresponding coordinates on the DEM surfaces. This guarantees an identical spatial baseline for a strict terrain structure comparison.
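The along-track slope comparison can be sketched numerically. The elevations and spacings below are invented for the example; the computation simply takes the arctangent of the elevation difference over the horizontal spacing for consecutive footprints, for both the ICESat-2 and DEM elevation series, and differences the two slope profiles.

```python
import numpy as np

# Synthetic along-track profile: four footprints with 100 m spacing.
dist_m = np.array([100.0, 100.0, 100.0])             # along-track spacing
h_icesat = np.array([500.0, 510.0, 505.0, 515.0])    # footprint elevations
h_dem = np.array([502.0, 509.0, 507.0, 513.0])       # DEM at same points

# Instantaneous along-track slope between consecutive points, in degrees.
slope_icesat = np.degrees(np.arctan2(np.diff(h_icesat), dist_m))
slope_dem = np.degrees(np.arctan2(np.diff(h_dem), dist_m))

slope_err = slope_dem - slope_icesat                 # per-segment slope error
slope_rmse = np.sqrt(np.mean(slope_err ** 2))
print(np.round(slope_rmse, 2))
```

Because both slope series share the same horizontal baseline, differences reflect terrain-structure disagreement rather than spatial-support mismatch.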
Evaluating 7007 valid slope pairs from the AW3D dataset (selected for its high sensitivity to roughness, detailed in Section 4.5), the results provide compelling evidence of geomorphometric restoration (Figure 7).
The Kernel Density Estimation (KDE) in Figure 7a illustrates that the original DEM’s slope error distribution exhibits heavy tails. Conversely, the XGBoost-corrected distribution is significantly sharper and centered near zero. Quantitatively, the overall Slope RMSE decreased from 3.97° to 3.06° (a 23.07% relative improvement). This substantial reduction confirms that the model effectively corrects non-stationary, spatially varying errors rather than applying a simple mean-shift.
While traditional smoothing models typically fail in high-gradient areas by underestimating peaks and valleys, Figure 7b demonstrates pervasive correction across all terrain complexities. The explicit feature-based model maintains highly robust performance even in the most challenging topography (Slope > 20°).
These findings firmly reinforce our core objective and hypothesis: explicit physical learning dynamically adjusts the correction magnitude based on local morphometric contexts (e.g., Slope). Rather than suffering from the low-pass filtering effect inherent to implicit spatial interpolation, the XGBoost model acts as a “texture synthesizer,” successfully re-sculpting distorted landforms and preserving topographic realism.
4.3. Decision Mechanisms: A Tale of Three “Worldviews”
To investigate the underlying causes of the performance disparities described in Section 4.1, we conducted an in-depth explainability analysis using SHAP (for XGBoost) and GNNExplainer (for GraphSAGE and GAT). Selecting the CopDEM dataset as a representative case, the results (Figure 8) reveal that despite receiving an identical set of input features, the models exhibit three fundamentally different decision-making “worldviews” that dictate their performance.
- (a)
The “Discriminative Physicist” Worldview of XGBoost
XGBoost exhibits a highly discriminative and sparse decision logic, aggressively isolating the dominant error drivers (Figure 8a). It strictly prioritizes sensor geometry, with the trigonometric components of aspect (aspect_sin, aspect_cos) emerging as the paramount determinants (mean SHAP ~3.3). Vegetation cover (gfcc_percent) follows as the secondary tier, while local morphometric factors (slope, RRI) are assigned lower supplementary weights. This rigid hierarchy (Sensor Geometry > Surface Medium > Local Texture) aligns perfectly with the physical principles of remote sensing distortions.
- (b)
The “Isotropic Mean Aggregator” Worldview of GraphSAGE
Conversely, GraphSAGE demonstrates a “flattened” perception of physical reality (Figure 8b). The importance scores of its top features (aspect_cos, slope, RRI, gfcc_percent) are tightly clustered within a negligible margin (0.37–0.40). By averaging feature information across local neighborhoods, its isotropic aggregation mechanism effectively “washes out” the distinct, high-frequency signals of specific physical drivers. It creates a generalized “feature soup” in which no single physical law dominates.
- (c)
The “Attention Homogenization” Worldview of GAT
The GAT model suffers from profound attention saturation (Figure 8c). Its importance scores are even more compressed than GraphSAGE’s, with the top five features all falling within a narrow range of 0.42–0.44. Struggling with feature collinearity, GAT attempts to leverage all correlated physical signals simultaneously with equal weight. This indecisiveness introduces information redundancy and prevents the model from locking onto the primary error source.
By systematically comparing these mechanisms, it becomes clear why explicit physical learning outperforms implicit geometry in this context: XGBoost accurately isolates physical causality, whereas complex GNNs suffer from feature homogenization when applied to sparse, high-relief geomorphometric data.
4.4. Stable Performance, Drifting Explanations
As established in the “Diminishing Returns and the Complexity Trade-Off” analysis above, the inherent stochasticity in training the GNN-XGBoost Hybrid model does not affect its macroscopic predictive accuracy; the RMSE remains highly stable across independent runs (Table 6). However, an in-depth analysis of the model’s internal decision logic reveals a significant methodological phenomenon: the model achieves consistent predictions despite relying on highly variable, drifting feature representations.
To rigorously investigate this, we conducted three independent training sessions using both the pure XGBoost and Hybrid models. We then evaluated the feature attribution consistency globally across the dataset and locally within a specific high-relief geographic feature (a high-mountain sub-region from Figure 2a).
At the global scale, while explicit physical features (Aspect, Vegetation) remained consistently ranked at the top, the ranking of GNN-generated latent features fluctuated wildly across runs for the Hybrid model (Figure 9). Pairwise Spearman rank correlation (ρ) analysis [74] across experiments (Table 7) confirmed that the pure XGBoost model achieved near-perfect stability (ρ > 0.97), whereas the Hybrid model exhibited significant inconsistency (ρ ≈ 0.63–0.77).
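The stability check described above reduces to comparing feature-importance vectors from independent runs with Spearman's rank correlation. The importance values below are synthetic and chosen only to mimic a stable pair of runs versus a drifted pair.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic mean-|SHAP| importances for six features across training runs.
run1 = np.array([3.3, 2.8, 1.5, 0.9, 0.4, 0.2])          # reference run
run2_stable = np.array([3.2, 2.9, 1.4, 1.0, 0.5, 0.1])   # same ordering
run2_drifted = np.array([3.1, 2.7, 0.3, 1.6, 0.2, 1.0])  # latent feats swap

rho_stable, _ = spearmanr(run1, run2_stable)    # identical ranking -> 1.0
rho_drift, _ = spearmanr(run1, run2_drifted)    # partial rank reshuffle
print(round(rho_stable, 2), round(rho_drift, 2))
```

A ρ near 1 indicates deterministic, reproducible attributions; values in the 0.6–0.8 range, as reported for the Hybrid model, indicate that the latent features trade places between runs.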
To illustrate the geographic implications of this drifting explanation, we mapped the local mean SHAP values for the high-mountain test region using a heatmap (Figure 10).
The pure XGBoost model demonstrates absolute determinism (Figure 10a). Across all three runs, the primary physical drivers—such as aspect_sin (~3.20), aspect_cos (~2.84), and slope (~1.48)—maintain nearly identical SHAP magnitudes. The model consistently recognizes the exact physical causes of elevation error.
In stark contrast, the Hybrid model exhibits severe local attribution drift (Figure 10b). While the overarching physical features remain present, the implicit spatial embeddings “take turns” explaining the local residuals. In Run 1, the model heavily relies on gnn_feat_26 (1.10); in Run 2, this feature drops to 0.01 and gnn_feat_22 emerges (0.96); in Run 3, an entirely different feature (gnn_feat_9) takes over.
This drift is not a computational error but a fundamental characteristic of applying over-parameterized geometric models to sparse geomorphometric data. The GNN possesses excessive degrees of freedom, allowing it to find multiple, equally valid mathematical representations of the same local spatial noise. Therefore, while the Hybrid model can marginally improve accuracy by opportunistically fitting local residuals, its explanatory logic is mathematically non-unique. The resolution to this methodological instability is to rely on the explicitly constrained physical feature space (XGBoost), which guarantees both high accuracy and deterministic, physically interpretable causality.
4.5. The Opportunism of GNN Features
Having established the stochasticity of the Hybrid model’s feature attribution within a single dataset, we further analyzed the average SHAP feature importance across all four DEM products (Figure 11) to determine whether the extracted implicit spatial relations represent universal physical laws or merely dataset-specific adaptations. The systematic cross-dataset comparison reveals a clear behavioral dichotomy between explicit and implicit features.
The Universality of Explicit Physics: Across all four fundamentally different DEMs, explicit physical variables—specifically Acquisition Geometry (aspect_sin, aspect_cos) and Forest Cover (gfcc_percent)—rigidly dominate the Top-3 positions. This absolute consistency proves that the primary systematic elevation errors (e.g., radar foreshortening and canopy penetration) are universally governed by these explicit geomorphometric descriptors. They function as a highly robust and transferable error correction model.
The Dataset-Dependent Opportunism of GNN Features: In stark contrast, the contribution of GNN-derived implicit spatial features (gnn_feat_xx) exhibits significant instability and opportunistic behavior. First, their specific latent identities vary randomly across datasets (e.g., relying on gnn_feat_97 for AW3D versus gnn_feat_43 for SRTM), lacking any universal spatial motif. Second, their importance strictly scales with the dataset’s inherent noise level. For instance, a GNN feature rises to Rank 5 in the noisier SRTM dataset, but drops below Rank 8 in high-precision products like CopDEM.
This systematic evaluation firmly concludes that the GNN component does not learn a universal terrain correction logic; rather, it acts as a dataset-dependent “local residual scavenger.” It merely identifies specific latent vectors that opportunistically correlate with the remaining noise in the training data. Consequently, while implicit spatial relations can offer marginal local improvements, the explicitly engineered physical features unequivocally form the indispensable, generalizable “skeleton” of the correction model.
6. Conclusions
To address the unresolved question regarding the superiority of the two prevailing paradigms in DEM error correction—explicit feature engineering versus implicit geometric deep learning—this study executed a rigorous comparative framework across geomorphologically diverse landscapes. By integrating high-precision ICESat-2 altimetry data with explainable machine learning, we elucidate the mechanistic boundaries of these models and arrive at the following core conclusions:
- (1)
First, Physics Trumps Geometry in Sparse Data Regimes.
Contrary to the prevailing assumption that graph-based architectures superiorly capture terrain topology, our results demonstrate that the GNN-XGBoost Hybrid model yields mathematically marginal gains that fail to justify its computational overhead. We attribute this to a fundamental scale mismatch: the large average spacing between ICESat-2 footprints (~485 m) severely violates the dense connectivity assumptions of standard GNNs, inducing detrimental over-smoothing. Conversely, the success of the baseline XGBoost model is grounded in an explicit “Physics Skeleton.” By isolating deterministic drivers such as acquisition geometry (Aspect) and signal penetration (Vegetation), it proves that robustly mapping physical gradients is inherently superior to inferring implicit topologies from fragmented reference data.
- (2)
Second, Deep Learning Suffers from “Underspecification” in Geo-Regression.
This study provides the first quantitative evidence of the “Decoupling Hypothesis” in DEM error correction. While the Hybrid model demonstrates high predictive reproducibility (stable RMSE), it suffers from severe attribution stochasticity. Specifically, the Spearman rank correlation of its feature importance oscillates violently (ρ ≈ 0.63–0.77), compared to the near-perfect stability of the purely explicit model (ρ > 0.97). This proves that without explicit physical constraints, geometric models devolve into residual-dependent latent feature learners, capturing opportunistic spatial noise (“ID Chaos”) rather than discovering universal topographic laws.
- (3)
Finally, Explanatory Stability Must Become the New Accuracy.
Our findings serve as a critical caveat to the burgeoning Geospatial AI community: performance is not a proxy for trustworthiness. The phenomenon of “stable prediction, drifting explanation” highlights the epistemic risk of deploying unchecked “Black Box” models in physical geography. We advocate that for geospatial tasks endowed with strong physical priors, the Feature-First paradigm remains the gold standard. Moving forward, the community must pivot towards Grey Box systems (e.g., Physics-Informed Neural Networks or Graph Transformers) and institutionalize explanatory stability metrics, ensuring that AI-driven discoveries are not merely mathematically accurate, but fundamentally true to the underlying physical realities.