1. Introduction
The normalized digital surface model (nDSM) is a fundamental geospatial product that plays a critical role in numerous remote sensing applications, including urban planning, disaster assessment, resource investigation, and ecological monitoring. Conceptually, an nDSM is obtained by subtracting a Digital Terrain Model (DTM) from a Digital Surface Model (DSM) [1], thereby removing the influence of terrain elevation and isolating the absolute height of above-ground objects such as buildings and vegetation. This terrain-normalized representation provides a direct and interpretable description of object heights relative to the ground surface, which makes nDSM an indispensable data source for fine-scale three-dimensional analysis. Consequently, developing reliable and scalable nDSM acquisition methods has become a long-standing research focus in the remote sensing and geospatial information science communities.
At present, the most accurate approach for high-precision nDSM generation relies on LiDAR point cloud data [
2]. By actively emitting laser pulses and measuring their return times, LiDAR systems directly capture three-dimensional point clouds containing both ground and non-ground objects. Through point cloud classification and interpolation, DSM and DTM products can be generated, and nDSM is subsequently obtained via pixel-wise subtraction. Owing to its direct 3D measurement mechanism, LiDAR-based methods achieve centimeter-to-decimeter scale height accuracy and exhibit strong robustness to illumination conditions, surface texture variations, and atmospheric interference. These advantages make LiDAR particularly effective in complex urban environments and densely vegetated areas [
3]. However, LiDAR systems suffer from significant limitations, including high equipment and acquisition costs, long data update cycles, and substantial demands on technical expertise during preprocessing stages such as point cloud filtering and registration [
4]. These constraints severely restrict their large-scale and frequent deployment, especially in resource-limited or rapidly changing regions.
To reduce acquisition costs and improve scalability, optical remote sensing-based stereo reconstruction has emerged as a mainstream alternative for large-area nDSM generation. This passive remote sensing paradigm reconstructs DSMs from overlapping optical images and derives nDSM by combining the reconstructed DSM with open-source or measured DTMs. Existing approaches can be broadly categorized into stereo matching and multi-view stereo (MVS) methods [
5,
6]. Stereo matching techniques estimate disparity maps from image pairs through epipolar rectification and dense matching algorithms, while MVS methods extend to multiple views by jointly estimating camera poses and reconstructing dense point clouds. These approaches benefit from the wide availability of optical imagery and enable cost-effective nDSM production over extensive regions. Nevertheless, their performance is highly dependent on the availability of high-quality overlapping images with suitable viewing geometry. In practice, stereo and MVS methods often fail in texture-less or shadowed areas, suffer from DSM voids, and generally achieve meter-level accuracy [
7], which remains inferior to LiDAR-based solutions. Moreover, the requirement for qualified stereo or multi-view datasets significantly limits their applicability in remote, underdeveloped, or geographically complex regions where such data are scarce [
8].
Against this background, single-image remote sensing height estimation, also referred to as monocular height estimation (MHE), has attracted increasing attention as a data-efficient alternative. MHE aims to infer pixel-wise absolute heights from a single optical remote sensing image, thereby eliminating the dependence on stereo pairs or multi-view imagery [9]. Early studies in this direction largely drew inspiration from monocular depth estimation in computer vision, while recent works have increasingly tailored network architectures and training strategies to accommodate the overhead viewing geometry and unique characteristics of remote sensing imagery [10]. Representative solutions include residual convolution-deconvolution networks [11], encoder–decoder convolutional neural networks designed to stabilize training under long-tailed height distributions [12], CNN–Transformer hybrid architectures that capture long-range contextual dependencies [13], approaches that leverage synthetic data and unsupervised domain adaptation [14], and multi-task learning frameworks that jointly perform height estimation and semantic segmentation to improve generalization [15].
Despite these advances, single-image height estimation remains fundamentally challenging due to the ill-posed nature of inferring three-dimensional structure from two-dimensional observations. In particular, absolute height estimation suffers from severe scale ambiguity and long-tailed height distributions, leading to reduced accuracy for tall structures and complex scenes. Some state-of-the-art Transformer-based models also incur high computational costs, limiting their practicality for large-scale or real-time applications [16]. As a result, although MHE methods offer clear advantages in terms of data availability and scalability, their accuracy has generally lagged behind that of stereo reconstruction approaches [17,18].
Recent progress in large pre-trained models has opened new opportunities to address these limitations. Trained on massive and diverse datasets, such models have demonstrated remarkable capability in learning robust and transferable depth representations. To address the uniqueness of remote sensing imagery, semantically enhanced pre-trained models specifically developed for remote sensing scenarios have further improved the adaptability of pre-training techniques in monocular height estimation. For example, the SkySense++ model employs a paradigm of multi-granularity contrastive learning and masked semantic learning to strengthen the model’s understanding and extraction of pixel-level and region-level semantic context in remote sensing imagery [
19]. In particular, large-scale depth estimation models can produce dense relative depth maps that preserve consistent ordinal height relationships among scene elements, even though their outputs lack explicit metric scale [
20]. This observation suggests that relative depth information, while not directly interpretable as absolute height, can serve as a powerful structural prior for single-image nDSM inference when appropriately integrated with image appearance cues.
Several existing height estimation methods based on depth priors and auxiliary guidance have harnessed the capabilities of large-scale depth estimation models. However, they still exhibit limitations regarding data dependency and model flexibility. Existing low-overlap photogrammetry methods relying on monocular depth estimation [
21] leverage tie points from aerial triangulation to recover metric depth. While addressing scene completeness, this workflow depends heavily on the availability of tie points and multi-view geometry constraints, limiting its applicability in pure single-image scenarios and achieving only meter-level accuracy. Depth2Elevation [
20] integrates the Depth Anything model with scale modulation for elevation estimation; however, its training process requires the joint optimization of the foundation model and scale modulator, imposing high computational hardware requirements. This approach fails to decouple the extraction of depth priors from the adaptation to the target domain, resulting in a rigid training framework. Sparse LiDAR-guided correction methods [
22] leverage ICESat-2 photon data to refine prediction residuals, yet they rely on external sparse LiDAR supervision. When a target test image lacks corresponding external supervision, these methods are forced to utilize linear fitting parameters derived from other images, leading to significant performance degradation. These limitations highlight the urgent need for a more robust and adaptable approach.
To enable the effective utilization of depth priors from large models under limited computational resources, this paper proposes a novel network for single-image nDSM generation. By learning an effective transformation from relative depth space to absolute height space through a dedicated network architecture, the proposed method aims to overcome the scale ambiguity inherent in traditional monocular approaches and improve height estimation accuracy without relying on stereo imagery or LiDAR data. The main contributions of this work can be summarized as follows:
We identify and formalize the intrinsic geometric inconsistency between monocular relative depth and absolute terrain height in remote sensing imagery, clarifying why direct depth-to-height regression is fragile under terrain heterogeneity.
We propose RDAH-Net, a depth-to-height translation framework that decouples relative geometry from absolute height anchoring, enabling accurate single-image nDSM reconstruction.
Extensive experiments across diverse geographic regions demonstrate strong generalization and robustness, supporting scalable deployment for large-area mapping and rapid elevation updates.
3. Experimental Results
This section presents the experimental design and results to validate the proposed framework. We first introduce the dataset construction and preprocessing pipeline, followed by the experimental settings (training protocol, baselines, and evaluation metrics). We then report quantitative and qualitative comparisons on three datasets, assess cross-domain generalization, and finally conduct ablation studies to verify the contribution of each key module.
3.1. Dataset Construction
To evaluate the proposed method across different geographic regions and sensor conditions, we conduct experiments on one public benchmark and two self-constructed datasets. Geographically, the Swiss dataset covers two regions (Lat N–N, Long E–E; and Lat N–N, Long E–E), while the HK dataset covers two distinct scenes (Lat N–N, Long E–E; and Lat N–N, Long E–E). These datasets exhibit significant domain gaps beneficial for validation. As illustrated in Figure 4, differences in architectural styles lead to distinct feature representations. Furthermore, Table 1 highlights that the spatial resolution of the Swiss and HK datasets differs considerably from that of DFC2019-Track1; this variance causes buildings to appear at smaller scales, effectively enriching feature diversity. In terms of height distribution, Figure 5 shows that the Swiss dataset contains a higher proportion of tall buildings (above 12 m), whereas the HK dataset exhibits a higher proportion of building areas above 2 m compared to DFC2019-Track1. These variations in geographic coverage, spatial resolution, and height distribution validate the universality of our method and provide a robust foundation for cross-domain generalization testing. In this work, we define samples acquired from the same city and the same sensor as belonging to the same domain. As summarized in Table 1, we use the public DFC2019-Track1 [33] dataset and construct two additional datasets from Switzerland (Swiss) and Hong Kong (HK). For each dataset, samples are split into training and test sets with a ratio of 4:1. During training, we further randomly sample 20% of the training set as the validation set.
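The split protocol described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `split_samples`, the fixed seed, and the choice to hold validation samples out of the training pool are assumptions, not details stated in the text.

```python
import random


def split_samples(sample_ids, seed=0):
    """Split IDs into train/val/test: 4:1 train/test, then 20% of train as val.

    Assumptions (not specified in the text): shuffling uses a fixed seed, and
    validation samples are held out from the samples used for weight updates.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)

    n_test = len(ids) // 5          # 4:1 ratio -> one fifth for testing
    test_ids = ids[:n_test]
    train_pool = ids[n_test:]

    n_val = len(train_pool) // 5    # 20% of the training set for validation
    val_ids = train_pool[:n_val]
    train_ids = train_pool[n_val:]
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_samples(range(1000))
print(len(train_ids), len(val_ids), len(test_ids))  # 640 160 200
```

For 1000 patches this gives 640 training, 160 validation, and 200 test samples. With heavily overlapping patches, a real pipeline might prefer to split by tile or region rather than by patch, which is a design choice the text does not address.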
For the Swiss and HK datasets, we follow a standardized pipeline to ensure strict geographic alignment between remote sensing images and nDSM labels.
(1) DSM/DTM and nDSM Ground Truth generation. We collect open-source LiDAR point clouds and generate DSM by raster interpolation of the original point cloud. For DTM generation, ground points are first extracted and then interpolated. Specifically, we adopt the Cloth Simulation Filter (CSF) [
34] for ground/non-ground separation. The DTM is generated from the extracted ground points, and the nDSM is obtained via pixel-wise subtraction between the aligned DSM and DTM.
(2) Orthorectification. Using DSM as elevation support, the remote sensing images are orthorectified based on the sensor orientation parameters, where collinearity equations are solved to correct terrain-induced displacements. The orthorectified image is produced by resampling, ensuring strict spatial correspondence with the nDSM [
35].
(3) Patch generation. We crop the overlapping area between the orthophoto and nDSM, then apply a sliding-window strategy to generate paired patches. The patch size is set to 512 × 512 with an overlap of 256 pixels to preserve boundary structures.
(4) Quality filtering. Invalid samples are removed by two criteria: (i) nDSM patches with ≥50% no-data values (often caused by large water bodies) are discarded; (ii) orthophoto patches with more than 20% fully black pixels (RGB all zeros) are discarded to avoid invalid margins introduced during image generation. Only valid patches are retained.
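The nDSM subtraction, patch generation, and quality filtering steps above can be sketched in a few lines. This is an illustrative sketch in plain Python on nested lists, not the authors' implementation: the function names, the no-data marker `-9999.0`, and the decision to drop windows that would run past the raster edge are all assumptions. A real pipeline would more likely operate on raster arrays.

```python
def compute_ndsm(dsm, dtm):
    """Pixel-wise nDSM = DSM - DTM for two aligned rasters (lists of rows)."""
    return [[s - t for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(dsm, dtm)]


def window_origins(length, patch=512, overlap=256):
    """Start offsets of sliding windows along one axis.

    Assumption: windows that would run past the raster edge are dropped.
    """
    stride = patch - overlap
    return list(range(0, length - patch + 1, stride))


def patch_grid(height, width, patch=512, overlap=256):
    """(row, col) origins of all full patches in a height x width raster."""
    return [(r, c)
            for r in window_origins(height, patch, overlap)
            for c in window_origins(width, patch, overlap)]


def keep_patch(ndsm_values, rgb_pixels, nodata=-9999.0):
    """Apply the two quality criteria to one patch.

    ndsm_values: flat iterable of heights; rgb_pixels: flat iterable of
    (r, g, b) triples. The no-data marker is a hypothetical placeholder.
    """
    ndsm_values = list(ndsm_values)
    rgb_pixels = list(rgb_pixels)
    nodata_frac = sum(v == nodata for v in ndsm_values) / len(ndsm_values)
    black_frac = sum(tuple(p) == (0, 0, 0) for p in rgb_pixels) / len(rgb_pixels)
    # (i) discard if >= 50% no-data; (ii) discard if > 20% fully black pixels
    return nodata_frac < 0.5 and black_frac <= 0.2


print(compute_ndsm([[105.0, 120.0]], [[100.0, 101.5]]))  # [[5.0, 18.5]]
print(len(patch_grid(1280, 1024)))                       # 4 x 3 = 12 patches
```

Note the asymmetry in the two thresholds, which follows the text: a patch with exactly 50% no-data is discarded, while a patch with exactly 20% black pixels is kept.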
3.2. Experimental Settings
We describe the experimental settings in terms of data usage, model inputs, baselines, and training configurations:
Training/testing protocols. DFC2019-Track1 is used as the primary benchmark for in-domain evaluation (training and testing within the dataset split). To assess cross-domain generalization, we additionally evaluate the model trained on DFC2019-Track1 directly on the Swiss and HK test sets without retraining. Moreover, to verify that the observed superiority is consistent across datasets, we also train and test on Swiss and HK separately under the same protocol.
Relative depth prior generation (frozen). For each input image, a relative depth prior is generated by Depth Anything v2. Importantly, Depth Anything v2 is used only as a fixed prior generator: its parameters are kept frozen and it is not fine-tuned on any dataset in this study. The predicted relative depth maps are then provided as an additional input modality to RDAH-Net.
Input preprocessing and normalization. The patch sizes are 1024 × 1024 for DFC2019-Track1 and 512 × 512 for Swiss/HK. Since the proposed network supports variable input sizes, no resizing is applied. The uint16 orthophotos are linearly normalized to the uint8 range for stable intensity distribution, and then normalized using the ImageNet mean and standard deviation per channel. For nDSM labels, we adopt a fixed-point encoding strategy: nDSM values are multiplied by 500 and stored as uint16 during data I/O, and are divided by 500 to recover the original height scale for loss computation and evaluation. All reported metrics are computed on the recovered nDSM values.
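The normalization and fixed-point label encoding described above can be illustrated per pixel as follows. This is a sketch under stated assumptions: the uint16-to-uint8 mapping is taken to use the full uint16 range (the text does not say whether a per-image min/max stretch is used instead), out-of-range heights are clipped, heights are taken to be in metres, and all function names are hypothetical. The ImageNet mean and standard deviation are the commonly used RGB values.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
HEIGHT_SCALE = 500          # fixed-point factor stated in the text
UINT16_MAX = 65535


def to_uint8(value_u16):
    """Linearly map a uint16 intensity to the uint8 range.

    Assumption: scaling uses the full uint16 range rather than a
    per-image min/max stretch.
    """
    return round(value_u16 * 255 / UINT16_MAX)


def normalize_pixel(rgb_u16):
    """uint16 RGB -> uint8 -> [0, 1] -> ImageNet mean/std normalization."""
    return tuple((to_uint8(v) / 255 - m) / s
                 for v, m, s in zip(rgb_u16, IMAGENET_MEAN, IMAGENET_STD))


def encode_height(height_m):
    """Store a height as a uint16 fixed-point value (x500).

    Assumption: values outside the representable range are clipped.
    """
    return min(max(round(height_m * HEIGHT_SCALE), 0), UINT16_MAX)


def decode_height(stored):
    """Recover the height from the stored uint16 value."""
    return stored / HEIGHT_SCALE


print(encode_height(12.34))                  # 6170
print(decode_height(encode_height(12.34)))   # 12.34
print(decode_height(UINT16_MAX))             # 131.07
```

With a factor of 500, the encoding resolves height steps of 1/500 = 0.002 and saturates at 65535 / 500 = 131.07, so any taller object would need a different factor or data type.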
Comparison methods. We compare the proposed RDAH-Net with (i) IM2ELEVATION [36] and IM2HEIGHT [11], representative encoder-decoder methods for single-image height estimation; (ii) Baseline-MLP, which takes only the relative depth prior as input and learns a direct projection to nDSM; (iii) SynRS3D [14], which represents approaches that leverage synthetic data and unsupervised domain adaptation; and (iv) NRF [22], a sparse LiDAR-guided correction method that leverages ICESat-2 photon data to refine prediction residuals. For ablation studies, we remove one module at a time while keeping the remaining components unchanged.
Training configurations. All experiments are conducted on a machine with 12 vCPU Intel(R) Xeon(R) Platinum 8375C CPUs @ 2.90 GHz and an NVIDIA GeForce RTX 4080 GPU with 32 GB available memory. We use the Adam optimizer with a learning rate of , batch size 8, and train for 100 epochs. The checkpoint with the best validation performance is selected for testing. All models converge within 100 epochs under this configuration.
Loss function. We optimize the proposed method using the L1 loss,
$$\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\left| y_i - \hat{y}_i \right|,$$
where $y_i$ and $\hat{y}_i$ denote the ground-truth and predicted nDSM values, respectively, and $N$ is the number of pixels. L1 loss is adopted due to its robustness to outliers and its direct interpretability in the height estimation task.
Evaluation metric. We report the mean absolute error (MAE) for quantitative evaluation,
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| y_i - \hat{y}_i \right|.$$
In addition to MAE, we provide qualitative visualizations to assess structural fidelity (e.g., building boundaries and height hierarchy), since MAE alone may not fully reflect the preservation of fine-grained structures in dense urban scenes.
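A small toy example shows why MAE alone can hide a loss of structure. The scene, the two predictions, and the use of the standard deviation as a crude indicator of structure are all illustrative assumptions, not data or metrics from the experiments.

```python
from statistics import fmean, pstdev


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences of heights."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


# Toy 100-pixel scene: 80 ground pixels at 0 m and a 20-pixel building at 10 m.
truth = [0.0] * 80 + [10.0] * 20

# A collapsed prediction: every pixel equals the mean height of the scene.
constant = [fmean(truth)] * len(truth)

# A structured prediction: correct layout, but biased upward by 4 m everywhere.
structured = [4.0] * 80 + [14.0] * 20

print(mae(truth, constant), pstdev(constant))      # 3.2 0.0
print(mae(truth, structured), pstdev(structured))  # 4.0 4.0
```

The collapsed prediction scores a lower MAE (3.2 versus 4.0) even though its standard deviation is zero, meaning it contains no building at all, while the biased prediction preserves the full height contrast of the ground truth.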
3.3. Results
3.3.1. Relative Depth Prior Versus Absolute Height Range
We first examine the relationship between the large-model relative depth prior and the absolute nDSM range predicted by RDAH-Net. As shown in
Figure 6, the numerical range of the relative depth output is not proportional to the range of the corresponding nDSM ground truth. For example, although the ground-truth height range of OMA-315-023 [
33] is substantially larger than that of Swiss-051-161, the relative depth output shows an opposite pattern. Moreover, the relative depth values typically span tens to hundreds, while the nDSM values are near zero to tens, indicating that the prior is not directly interpretable in metric height units. These observations support the necessity of a dedicated network that jointly leverages image appearance and relative depth cues to infer physically meaningful absolute heights. As shown in
Figure 6, RDAH-Net effectively maps the relative depth prior to an output range consistent with the ground-truth nDSM.
3.3.2. In-Domain Quantitative Comparison
We compare RDAH-Net with IM2ELEVATION, IM2HEIGHT, Baseline-MLP, SynRS3D, and NRF by training and testing each method separately on each dataset. As reported in
Table 2, RDAH-Net achieves the best overall MAE across the three datasets. IM2HEIGHT performs significantly worse than the proposed method across all three datasets. SynRS3D achieves suboptimal performance, as it was trained without the semantic segmentation branch for a fair comparison, thereby diminishing its advantage and limiting it to the height prediction branch. NRF also yields unsatisfactory results due to the limitations of its random forest-based methodology. Baseline-MLP performs competitively on HK but degrades on DFC2019-Track1 and Swiss, reflecting the limitation of relying solely on the depth prior without incorporating image semantics. IM2ELEVATION achieves comparable performance on DFC2019-Track1, but both IM2ELEVATION and IM2HEIGHT exhibit severe degradation on some datasets due to their sensitivity to radiometric variations, which can lead to abnormal predictions under low-brightness conditions (see
Section 4.1 for detailed discussion).
In terms of model complexity, RDAH-Net contains only 5.3716 M trainable parameters, approximately 3.38% of IM2ELEVATION and 1.46% of SynRS3D. Since the relative depth prior is produced by a fixed pre-trained model and does not require training, the trainable memory footprint of the proposed approach remains substantially smaller. In terms of computational efficiency, NRF is excluded from the comparison because it is a traditional machine learning method rather than a deep learning one. RDAH-Net requires far fewer FLOPs than all deep learning-based comparison methods except Baseline-MLP, and achieves higher FPS at both 512 × 512 and 1024 × 1024 input resolutions. Notably, despite this computational efficiency, RDAH-Net still delivers the best prediction performance across all tested datasets.
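As a quick consistency check on the stated ratios, the baseline sizes they imply can be back-computed. The resulting figures are derived here from the rounded percentages and are not values reported in the text; the helper name `implied_size_m` is hypothetical.

```python
def implied_size_m(params_m, ratio):
    """Baseline size (in millions) implied by a stated parameter ratio."""
    return params_m / ratio


RDAH_PARAMS_M = 5.3716  # trainable parameters of RDAH-Net, in millions

# Stated ratios of RDAH-Net's parameter count to each baseline's count.
STATED_RATIOS = {"IM2ELEVATION": 0.0338, "SynRS3D": 0.0146}

for name, ratio in STATED_RATIOS.items():
    size = implied_size_m(RDAH_PARAMS_M, ratio)
    print(f"{name}: roughly {size:.0f} M parameters implied")
```

Under these ratios the baselines come out at roughly 159 M and 368 M parameters, which gives a sense of the size gap behind the percentages.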
Figure 7 shows that RDAH-Net produces the most faithful predictions across datasets. Baseline-MLP produces sharp edges but substantial height deviations because it relies entirely on the relative depth prior. On the DFC2019-Track1 samples from Omaha and Jacksonville, RDAH-Net recovers precise shapes and heights on JAX-004-015, OMA-329-039, and OMA-367-012, despite the minor vegetation overestimate marked by the white box; it balances building and vegetation prediction on JAX-280-004; and it reconstructs complete bridge and building shapes and heights on OMA-315-023. Figure 8 extends the analysis to the Swiss and HK datasets. Even on the shadowed HK-072-466, where all models deviate, RDAH-Net yields the smallest errors, which we attribute to the robust zero-shot depth prior and multi-modal fusion. In vegetation areas beyond urban regions, RDAH-Net is the only method that faithfully restores the shape of vegetation on Swiss-051-2. For HK-062-11, although none of the compared methods perfectly recover the vegetation morphology, RDAH-Net yields a predicted height range closest to the ground truth. In comparison, Baseline-MLP recovers partial vegetation heights but lacks detailed textures due to its complete reliance on relative depth. Furthermore, NRF produces grainy height predictions that lack precision, owing to inherent limitations of its algorithmic principles.
3.3.3. Cross-Domain Generalization
To assess generalization, we directly evaluate the models trained on DFC2019-Track1 on the Swiss and HK test sets. As shown in Table 3, RDAH-Net achieves the best MAE on HK, while IM2ELEVATION achieves the lowest MAE on Swiss. However, the quantitative results should be interpreted together with the visualizations in Figure 9. We observe that IM2ELEVATION can produce outputs that are nearly constant across large regions under certain radiometric conditions, which may yield a deceptively low MAE if the constant prediction happens to be close to the dataset mean height. In such cases, MAE alone does not reflect the loss of structural fidelity.
In real-world deployments, such extensive cross-domain settings, covering multiple cities and heterogeneous sensors, are both more challenging and relatively uncommon. When training and testing are conducted within a single region (Table 2), RDAH-Net consistently achieves the best overall performance, which is sufficient for most practical applications.
3.4. Ablation Study
We conduct ablation experiments to evaluate the contributions of five settings: MobileViT backbone, CBAM, bidirectional cross-attention fusion, PixelShuffle-based upsampling, and depth (inputting only remote sensing images). To ensure fair comparison, our dual-branch network keeps the same structure by feeding remote sensing images into both branches without the depth setting. It should be noted that the conventional encoder used for replacement in the experiments is a cascaded sequential structure, which gradually increases the channel dimension of single-channel depth feature maps from 1 to 32, 64, and finally to d-model through three convolutional layers with strides set to 2, 2, and 4 in sequence (each convolutional layer is followed by batch normalization and ReLU activation function). For a controlled comparison, each ablated variant is obtained by removing (or replacing) one component while keeping all other settings unchanged. Following the main generalization protocol, all ablated models are trained and tested on DFC2019-Track1, and are additionally evaluated on HK and Swiss (trained on DFC, tested on HK/Swiss) to assess robustness under domain shifts.
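The downsampling behaviour of the conventional replacement encoder described above can be traced with the standard convolution output-size formula. In this sketch the 3 × 3 kernel with padding 1 and the value `d_model = 128` are assumptions chosen for illustration; only the channel progression 1 → 32 → 64 → d-model and the strides 2, 2, 4 come from the text. Batch normalization and ReLU do not change tensor shapes, so they are omitted from the trace.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_encoder(size, d_model=128):
    """Return (channels, spatial size) after each conv + BN + ReLU stage."""
    stages = [(32, 2), (64, 2), (d_model, 4)]   # (out_channels, stride)
    shapes = [(1, size)]                        # single-channel depth input
    for channels, stride in stages:
        size = conv_out_size(size, stride=stride)
        shapes.append((channels, size))
    return shapes


print(trace_encoder(512))   # [(1, 512), (32, 256), (64, 128), (128, 32)]
print(trace_encoder(1024))  # [(1, 1024), (32, 512), (64, 256), (128, 64)]
```

Under these assumptions the three stages reduce the spatial resolution by a total factor of 2 × 2 × 4 = 16 for both patch sizes used in the experiments.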
Table 4 reports the ablation results, where the component abbreviations are MV (MobileViT), CB (CBAM), BCA (Bi-Cross Attention), PS (PixelShuffle), and DP (Depth Prediction), and “w/o” denotes “without”, i.e., the ablation of the corresponding module. Removing any module leads to performance degradation on DFC2019-Track1, indicating that each component contributes to the final accuracy. Among the variants, removing MobileViT or depth results in the largest performance drop on DFC2019-Track1, suggesting that both strong feature extraction and the depth input are critical for accurate height inference. Removing CBAM produces a smaller degradation, consistent with its lightweight role as an attention refinement module. Removing the bidirectional cross-attention module degrades performance, supporting the necessity of cross-modal reasoning between image and depth prior. Replacing PixelShuffle with standard upsampling also reduces accuracy and leads to less sharp outputs, consistent with the decoder design goal of artifact-free and detail-preserving reconstruction.
The ablation visualizations in Figure 10 and Figure 11 clarify each module’s role. Removing MobileViT impairs feature extraction and degrades large-building reconstruction, e.g., reduced internal height variations and accuracy in the white-boxed region of HK-062-674, plus shape deviations in the white boxes of Swiss-051-161 and OMA-239-039. Omitting CBAM weakens attention enhancement, causing extreme overprediction in HK-072-466 and indistinguishable relative heights among the three buildings in the red box of Swiss-051-161. Without bi-cross-attention, image-depth fusion falters, yielding overall overestimation in JAX-280-004 and OMA-329-039. Replacing PixelShuffle with standard upsampling sacrifices height precision and induces blurry “creamy melting” effects on the buildings in Swiss-083-920 and HK-072-466. When the depth input is removed, the network fails to effectively exploit cross-attention due to the lack of necessary relative depth information. As a result, relying solely on remote sensing images cannot yield meaningful nDSM prediction maps.
We further investigate the MobileViT ablation, since the cross-domain MAE on Swiss appears better than the full model in
Table 4. To verify whether this phenomenon is caused by domain shift and the evaluation protocol (training on DFC but testing on Swiss), we additionally train and test RDAH-Net and the MobileViT-ablated variant within the HK and Swiss datasets, respectively. As shown in
Table 5, RDAH-Net remains superior under the in-domain setting, indicating that MobileViT is beneficial when the training and testing distributions are consistent, and the apparent improvement under cross-domain testing is not representative of the module’s overall contribution.
4. Discussion
4.1. Analysis of Abnormal Cases in Comparative Methods
We observe that both IM2ELEVATION and IM2HEIGHT suffer from severe performance degradation under direct cross-domain evaluation without adaptation. Specifically, they occasionally produce nearly constant predictions across the entire image, leading to structurally meaningless nDSM outputs (
Figure 12), while our proposed method remains robust and free of such anomalies. Since both are representative encoder-decoder architectures, we select IM2ELEVATION for the following failure analysis. This phenomenon indicates a strong sensitivity to domain shifts, particularly to radiometric differences and intensity distribution changes between training and testing data.
A possible explanation is that the feature representation learned by the model is highly coupled with the training-domain appearance statistics. When the test data exhibit different sensor characteristics or radiometric properties, the extracted multi-scale features may deviate from the operating range of the decoder, and the subsequent decoding process tends to collapse into low-variance outputs. In addition, cross-domain differences in spatial texture distribution and scene composition can further amplify feature misalignment in encoder-decoder fusion, making dense prediction unstable. In our experiments, this issue becomes more noticeable when the input images undergo intensity normalization from uint16 to uint8, which may compress contrast and shift the overall intensity distribution compared with the training set. We attempted simple brightness adjustments, but the degradation pattern remained, suggesting that the failure is not merely due to global brightness shifts.
In contrast, Baseline-MLP takes the relative depth prior as input, and RDAH-Net jointly uses both the relative depth prior and the image modality. Since the relative depth prior is produced by a fixed pre-trained model and is less sensitive to raw image intensity range alone, these approaches are comparatively more robust under low-contrast or radiometrically shifted test images. This observation supports the motivation of this work: introducing a depth prior can provide additional geometric cues that reduce the reliance on appearance statistics, while cross-modal fusion further stabilizes height inference under domain variations.
4.2. General Discussion
This study proposes RDAH-Net for single-image remote sensing height estimation by integrating a relative depth prior from a large pre-trained model with a dedicated cross-modal fusion network. The core contribution is to explicitly address the mismatch between dimensionless relative depth and metric absolute height, which is a key bottleneck in monocular nDSM generation.
Methodological Implications
Existing single-image methods that rely only on the image modality are often sensitive to appearance variations, such as illumination differences and sensor-dependent radiometric shifts, because the inferred height cues are largely entangled with texture and intensity statistics. Conversely, methods that depend solely on a depth-like input (e.g., directly projecting relative depth) lack semantic constraints from the image and can exhibit systematic absolute-value deviations. By introducing a robust relative depth prior and explicitly modeling cross-modal complementarity through bidirectional attention, RDAH-Net reduces over-reliance on a single modality and enables reciprocal enhancement between image semantics and depth cues. The ablation results further support that each component contributes to accuracy and structural fidelity: the backbone provides strong multi-scale representations, CBAM refines salient height-related cues, bidirectional cross-attention enables effective inter-modal reasoning, and PixelShuffle supports detail-preserving reconstruction with reduced upsampling artifacts.
4.3. Practical Considerations and Limitations
From an application perspective, nDSM acquisition methods face a trade-off among accuracy, data requirements, and deployment cost. LiDAR-based pipelines provide high accuracy but are expensive and slow to update; stereo/MVS-based pipelines are more scalable but require qualified overlapping imagery that may be unavailable in many regions. RDAH-Net offers a practical alternative by predicting nDSM from a single orthophoto, while leveraging a pre-trained depth model to mitigate the scarcity of remote-sensing-specific depth annotations. Moreover, the proposed model is lightweight in trainable parameters, which lowers the computational barrier for deployment. The experiments on multiple datasets demonstrate stable in-domain performance and competitive cross-domain generalization.
Nevertheless, several practical limitations exist. First, the method relies on a two-stage pipeline: relative depth prior generation followed by metric height estimation, which introduces extra inference latency compared with end-to-end frameworks. Second, the current evaluation is limited to urban and suburban scenes. The effectiveness of RDAH-Net in natural landscapes such as mountainous areas and farmland has not been fully verified, since these regions exhibit more complex topography, weaker textures, and sparser structural patterns that may challenge both the depth prior and cross-modal fusion. These limitations indicate promising directions for future optimization toward more efficient inference and broader scene generalization.
4.4. On Evaluation Under Domain Shift
We also note that cross-domain evaluation can reveal different failure modes that are not fully captured by a single scalar metric such as MAE. For example, near-constant predictions may occasionally achieve a deceptively low MAE if the constant value coincides with the mean height distribution of the test set, while the structural fidelity is severely compromised. Therefore, in addition to reporting MAE, qualitative visual inspection remains necessary for understanding whether the model preserves meaningful building boundaries and height hierarchy in practical nDSM applications.
Overall, this work suggests that combining large-model priors with an explicitly designed cross-modal mapping network is a promising direction for improving both accuracy and robustness in single-image remote sensing height estimation, and the proposed fusion paradigm may be informative for other multi-source remote sensing tasks that benefit from complementary priors.
5. Conclusions
This paper presents RDAH-Net, a single-image remote sensing height estimation framework that integrates a relative depth prior from a large pre-trained model with a dedicated cross-modal reasoning network to enable accurate nDSM generation in regions where stereo imagery is unavailable. Specifically, a relative depth map is first produced by Depth Anything v2 as a fixed prior, and RDAH-Net learns an effective transformation from dimensionless relative depth to metric absolute height by jointly exploiting the depth prior and the original orthophoto. The proposed network adopts a lightweight MobileViT-S-Light backbone for efficient feature extraction, enhances height-relevant representations with CBAM, performs deep cross-modal fusion via a bidirectional attention mechanism with positional encoding, and refines high-resolution outputs using PixelShuffle-based artifact-reduced upsampling.
Experiments on the public DFC2019-Track1 benchmark and two self-constructed datasets from Hong Kong and Switzerland demonstrate that RDAH-Net achieves strong performance under in-domain training/testing settings and exhibits robust behavior under cross-domain evaluation. Compared with representative baselines, the proposed method provides improved accuracy and more stable structural reconstruction in challenging scenes, while maintaining a small number of trainable parameters (5.37 M), substantially reducing the deployment burden. Ablation studies further validate the contribution of each key component, including the backbone, attention refinement, cross-modal fusion, and refined decoding.
In summary, RDAH-Net offers a practical and effective solution for single-image nDSM generation and provides evidence that large-model priors, when combined with explicit cross-modal mapping and fusion design, can improve both accuracy and robustness in remote sensing height estimation. This paradigm may also benefit other remote sensing applications that require integrating complementary priors across modalities.