RDAH-Net: Bridging Relative Depth and Absolute Height for Monocular Height Estimation in Remote Sensing

Jiang, Liting; Wang, Feng; Jiao, Niangang; Zhu, Jingxing; Xiang, Yuming; You, Hongjian

doi:10.3390/rs18071024


by Liting Jiang ^1,2,3, Feng Wang ^1,2, Niangang Jiao ^1,2, Jingxing Zhu ^1,2, Yuming Xiang ^4,5,* and Hongjian You ^1,2,3

¹ Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
² Key Laboratory of Target Cognition and Application Technology, Chinese Academy of Sciences, Beijing 100190, China
³ School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
⁴ College of Surveying and Geoinformatics, Tongji University, Shanghai 200092, China
⁵ Shanghai Key Laboratory for Planetary Mapping and Remote Sensing for Deep Space Exploration, Shanghai 200092, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(7), 1024; https://doi.org/10.3390/rs18071024

Submission received: 27 January 2026 / Revised: 23 March 2026 / Accepted: 24 March 2026 / Published: 29 March 2026

(This article belongs to the Special Issue 3D City Modeling and Observation Using Remote Sensing and Artificial Intelligence)


Highlights

What are the main findings?

Developed RDAH-Net for single-image nDSM generation using relative depth prior from Depth Anything v2.
Cross-modal fusion of depth prior and orthophoto achieves robust absolute height recovery across domains.

What is the implication of the main finding?

Provides accurate nDSM for large-scale mapping where stereo imagery is unavailable, with strong cross-domain generalization.

Abstract

Generating high-precision normalized digital surface models (nDSMs) from a single remote sensing image remains a challenging and ill-posed problem due to the absence of reliable geometric constraints. In this work, we show that monocular depth provides structurally stable cues of local geometry but lacks the global scale and vertical reference required for absolute height recovery. This intrinsic mismatch limits direct depth-to-height regression, particularly when transferring across heterogeneous terrains, land-cover compositions, and imaging conditions. Building on this idea, we propose the Relative Depth–Absolute Height Prediction Network (RDAH-Net), a framework that exploits relative depth as a geometry-aware prior while learning terrain-dependent height mappings from image appearance to absolute height. As the backbone, we employ a lightweight MobileNetV2 enhanced with a Convolutional Block Attention Module (CBAM), and further incorporate a cross-modal bidirectional attention fusion scheme with positional encoding to achieve a deep and effective fusion of image appearance and depth prior cues. Finally, a PixelShuffle-based upsampling strategy is used to sharpen prediction details and mitigate typical upsampling artifacts. Extensive experiments across diverse regions demonstrate that RDAH-Net achieves robust and generalizable height estimation, providing a practical alternative for large-scale mapping and rapid update scenarios.

Keywords:

monocular height estimation; normalized digital surface model (nDSM); shadow information; large model; transformer

1. Introduction

The normalized digital surface model (nDSM) is a fundamental geospatial product that plays a critical role in numerous remote sensing applications, including urban planning, disaster assessment, resource investigation, and ecological monitoring. Conceptually, an nDSM is obtained by subtracting a Digital Terrain Model (DTM) from a Digital Surface Model (DSM) [1], thereby removing the influence of terrain elevation and isolating the absolute height of above-ground objects such as buildings and vegetation. This terrain-normalized representation provides a direct and interpretable description of object heights relative to the ground surface, which makes nDSM an indispensable data source for fine-scale three-dimensional analysis. Consequently, developing reliable and scalable nDSM acquisition methods has become a long-standing research focus in the remote sensing and geospatial information science communities.

At present, the most accurate approach for high-precision nDSM generation relies on LiDAR point cloud data [2]. By actively emitting laser pulses and measuring their return times, LiDAR systems directly capture three-dimensional point clouds containing both ground and non-ground objects. Through point cloud classification and interpolation, DSM and DTM products can be generated, and nDSM is subsequently obtained via pixel-wise subtraction. Owing to its direct 3D measurement mechanism, LiDAR-based methods achieve centimeter-to-decimeter scale height accuracy and exhibit strong robustness to illumination conditions, surface texture variations, and atmospheric interference. These advantages make LiDAR particularly effective in complex urban environments and densely vegetated areas [3]. However, LiDAR systems suffer from significant limitations, including high equipment and acquisition costs, long data update cycles, and substantial demands on technical expertise during preprocessing stages such as point cloud filtering and registration [4]. These constraints severely restrict their large-scale and frequent deployment, especially in resource-limited or rapidly changing regions.

To reduce acquisition costs and improve scalability, optical remote sensing-based stereo reconstruction has emerged as a mainstream alternative for large-area nDSM generation. This passive remote sensing paradigm reconstructs DSMs from overlapping optical images and derives nDSM by combining the reconstructed DSM with open-source or measured DTMs. Existing approaches can be broadly categorized into stereo matching and multi-view stereo (MVS) methods [5,6]. Stereo matching techniques estimate disparity maps from image pairs through epipolar rectification and dense matching algorithms, while MVS methods extend to multiple views by jointly estimating camera poses and reconstructing dense point clouds. These approaches benefit from the wide availability of optical imagery and enable cost-effective nDSM production over extensive regions. Nevertheless, their performance is highly dependent on the availability of high-quality overlapping images with suitable viewing geometry. In practice, stereo and MVS methods often fail in texture-less or shadowed areas, suffer from DSM voids, and generally achieve meter-level accuracy [7], which remains inferior to LiDAR-based solutions. Moreover, the requirement for qualified stereo or multi-view datasets significantly limits their applicability in remote, underdeveloped, or geographically complex regions where such data are scarce [8].

Against this background, single-image remote sensing height estimation, also referred to as monocular height estimation (MHE), has attracted increasing attention as a data-efficient alternative. MHE aims to infer pixel-wise absolute heights from a single optical remote sensing image, thereby eliminating the dependence on stereo pairs or multi-view imagery [9]. Early studies in this direction largely drew inspiration from monocular depth estimation in computer vision, while recent works have increasingly tailored network architectures and training strategies to accommodate the overhead viewing geometry and unique characteristics of remote sensing imagery [10]. Representative solutions include residual convolution-deconvolution networks [11], encoder–decoder convolutional neural networks designed to stabilize training under long-tailed height distributions [12], CNN–Transformer hybrid architectures that capture long-range contextual dependencies [13], distinct paradigms that leverage synthetic data and unsupervised domain adaptation [14], and multi-task learning frameworks that jointly perform height estimation and semantic segmentation to improve generalization [15].

Despite these advances, single-image height estimation remains fundamentally challenging due to the ill-posed nature of inferring three-dimensional structure from two-dimensional observations. In particular, absolute height estimation suffers from severe scale ambiguity and long-tailed height distributions, leading to reduced accuracy for tall structures and complex scenes. Some state-of-the-art Transformer-based models also incur high computational costs, limiting their practicality for large-scale or real-time applications [16]. As a result, although MHE methods offer clear advantages in terms of data availability and scalability, their accuracy has generally lagged behind that of stereo reconstruction approaches [17,18].

Recent progress in large pre-trained models has opened new opportunities to address these limitations. Trained on massive and diverse datasets, such models have demonstrated remarkable capability in learning robust and transferable depth representations. To address the uniqueness of remote sensing imagery, semantically enhanced pre-trained models specifically developed for remote sensing scenarios have further improved the adaptability of pre-training techniques in monocular height estimation. For example, the SkySense++ model employs a paradigm of multi-granularity contrastive learning and masked semantic learning to strengthen the model’s understanding and extraction of pixel-level and region-level semantic context in remote sensing imagery [19]. In particular, large-scale depth estimation models can produce dense relative depth maps that preserve consistent ordinal height relationships among scene elements, even though their outputs lack explicit metric scale [20]. This observation suggests that relative depth information, while not directly interpretable as absolute height, can serve as a powerful structural prior for single-image nDSM inference when appropriately integrated with image appearance cues.

Several existing height estimation methods based on depth priors and auxiliary guidance have harnessed the capabilities of large-scale depth estimation models. However, they still exhibit limitations regarding data dependency and model flexibility. Existing low-overlap photogrammetry methods relying on monocular depth estimation [21] leverage tie points from aerial triangulation to recover metric depth. While addressing scene completeness, this workflow depends heavily on the availability of tie points and multi-view geometry constraints, limiting its applicability in pure single-image scenarios and achieving only meter-level accuracy. Depth2Elevation [20] integrates the Depth Anything model with scale modulation for elevation estimation; however, its training process requires the joint optimization of the foundation model and scale modulator, imposing high computational hardware requirements. This approach fails to decouple the extraction of depth priors from the adaptation to the target domain, resulting in a rigid training framework. Sparse LiDAR-guided correction methods [22] leverage ICESat-2 photon data to refine prediction residuals, yet they rely on external sparse LiDAR supervision. When a target test image lacks corresponding external supervision, these methods are forced to utilize linear fitting parameters derived from other images, leading to significant performance degradation. These limitations highlight the urgent need for a more robust and adaptable approach.

To enable the effective utilization of depth priors from large models under limited computational resources, this paper proposes a novel network for single-image nDSM generation. By learning an effective transformation from relative depth space to absolute height space through a dedicated network architecture, the proposed method aims to overcome the scale ambiguity inherent in traditional monocular approaches and improve height estimation accuracy without relying on stereo imagery or LiDAR data. The main contributions of this work can be summarized as follows:

We identify and formalize the intrinsic geometric inconsistency between monocular relative depth and absolute terrain height in remote sensing imagery, clarifying why direct depth-to-height regression is fragile under terrain heterogeneity.
We propose RDAH-Net, a depth-to-height translation framework that decouples relative geometry from absolute height anchoring, enabling accurate single-image nDSM reconstruction.
Extensive experiments across diverse geographic regions demonstrate strong generalization and robustness, supporting scalable deployment for large-area mapping and rapid elevation updates.

2. Methodology

This work reformulates single-image nDSM generation as a learnable transformation from relative depth priors to absolute height, enabled by large pre-trained depth models. Based on the analysis in the previous sections, the key challenge in monocular elevation inference lies in bridging the gap between relative geometric perception and absolute elevation representation. Direct regression from monocular depth to height often entangles these two heterogeneous quantities, leading to limited robustness under diverse terrain conditions [23]. To address this issue, we design a depth-to-height translation framework that leverages relative depth as a geometry-aware prior rather than a metrically constrained input. The proposed methodology explicitly separates the extraction of relative terrain structure from the learning of absolute elevation anchoring, allowing the model to adapt to terrain-dependent variations while preserving structural consistency.

2.1. Relative Depth Prior from Pre-Trained Models

Recent advances in monocular depth estimation have shown that large pre-trained foundation models are capable of inferring reliable relative depth relationships from a single image. As demonstrated in the original Depth Anything v2 work [24], the model is explicitly designed to produce dense relative depth maps that preserve the spatial ordering and geometric structure of scene elements, even in the absence of metric scale supervision. This capability stems from its large-scale training on both precisely annotated synthetic datasets and massive collections of real-world images, enabling robust depth prior learning across diverse visual domains.

In this study, Depth Anything v2 [24] is adopted as the relative depth prior generator due to its favorable balance among accuracy, efficiency, and generalization ability. Compared with representative monocular depth estimation models such as MiDaS [25] and DPT [26], Depth Anything v2 produces more spatially consistent and structurally detailed relative depth maps on standard benchmarks. In contrast to diffusion-based depth estimation approaches (e.g., Marigold [27]), it offers substantially higher inference efficiency, which is critical for large-scale remote sensing applications. Moreover, the availability of multiple model scales allows flexible trade-offs between computational cost and prediction quality.

A key advantage of Depth Anything v2 lies in its strong zero-shot generalization capability to unseen domains, including overhead remote sensing imagery [28]. Given the scarcity of large-scale, accurately annotated depth datasets in the remote sensing field, this property enables the extraction of meaningful depth priors without extensive domain-specific fine-tuning. As a result, the predicted relative depth maps retain clear object contours and local structural characteristics, even when applied to satellite imagery acquired under imaging conditions significantly different from those of the training data.

Despite these advantages, the relative depth output of large pre-trained models cannot be directly used for absolute height estimation in remote sensing scenarios. Satellite-borne sensors observe the Earth from extremely high altitudes; for example, the Gaofen-7 satellite used in this study operates at an orbital height of approximately 506 km [29]. Under such conditions, typical height variations of ground objects (e.g., 5–10 m for urban buildings) account for only about 0.0099–0.0198‰ (roughly 1–2 × 10⁻⁵) of the sensor-to-ground distance. This extreme scale compression makes it inherently difficult for monocular models to accurately recover global relative height relationships over large areas, even though elevated structures may still be locally distinguishable, as illustrated in Figure 1. Furthermore, the numerical range of the predicted relative depth values differs fundamentally from the physical height range of terrain objects and lacks a direct correspondence to real-world elevation units.

Therefore, while large pre-trained models provide powerful relative depth priors that encode valuable geometric and structural information, an explicit and learnable mapping mechanism is required to transform these dimensionless depth representations into accurate absolute height estimates. This observation motivates the design of a dedicated network that integrates relative depth priors with image appearance information to enable reliable single-image nDSM generation.

2.2. Relative Depth–Absolute Height Prediction Network

To effectively convert the relative depth prior into accurate absolute height estimates, a dedicated Relative Depth–Absolute Height Prediction Network (RDAH-Net) is designed. As shown in Figure 2, RDAH-Net consists of three tightly coupled components: (1) a high-precision feature extraction and enhancement module, (2) an attention-based cross-modal reasoning module, and (3) a refined output mechanism integrated into the decoder. These components collaboratively enable robust cross-modal feature alignment and progressive height refinement.

2.2.1. Feature Extraction and Enhancement Module

This module extracts modality-specific structural and semantic features from the remote sensing image and the relative depth map. Its role is to capture building texture from the image and relative height variation from the depth map, laying the feature foundation for converting relative depth to absolute building height. MobileNetV2 [30] is adopted as the backbone network for feature extraction from both the remote sensing image and the corresponding relative depth map. Owing to its inverted residual structure and linear bottleneck design, MobileNetV2 provides an efficient yet expressive representation of multi-scale semantic and structural information, which is critical for capturing subtle height variations in complex urban environments. Notably, the feature extractors for the image and depth modalities do not share weights, allowing each branch to learn modality-specific characteristics.

In practice, we employ a lightweight MobileNetV2 variant to balance computational efficiency and representational capacity, making the network suitable for resource-constrained deployment scenarios. To further enhance discriminative feature learning, a Convolutional Block Attention Module (CBAM) [31] is inserted after each feature extraction stage. CBAM applies channel attention followed by spatial attention: channel attention emphasizes height-correlated feature channels, while spatial attention highlights building regions and suppresses non-building background, sharpening the perception of fine height differences. In this way, CBAM adaptively emphasizes height-relevant features while suppressing redundant or noisy responses. This mechanism is particularly effective for identifying small buildings and fine structural elements embedded in cluttered urban backgrounds, as illustrated in Figure 3.
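The sequential channel-then-spatial attention described above can be illustrated with a minimal NumPy sketch. It follows the general CBAM design (average- and max-pooled descriptors, a shared two-layer MLP, and a single convolution over pooled channel maps); the reduction ratio, the 7 × 7 kernel size, the weight shapes, and all function names are assumptions for illustration, not the configuration used in RDAH-Net.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W); w1: (C//r, C); w2: (C, C//r). Returns weights of shape (C, 1, 1)."""
    avg = x.mean(axis=(1, 2))          # global average pooling per channel
    mx = x.max(axis=(1, 2))            # global max pooling per channel
    def mlp(v):
        # shared two-layer MLP with a ReLU in between
        return w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k) with odd k. Returns weights of shape (1, H, W)."""
    k = kernel.shape[-1]
    pad = k // 2
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])           # (2, H, W)
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))    # zero padding keeps H, W
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, (k, k), axis=(1, 2))                             # (2, H, W, k, k)
    response = np.einsum("chwij,cij->hw", windows, kernel)       # cross-correlation
    return sigmoid(response)[None]

def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention (both multiplicative)."""
    x = x * channel_attention(x, w1, w2)
    return x * spatial_attention(x, kernel)
```

Because both attention maps lie in (0, 1), the module can only rescale features, which is why it is cheap to insert after each backbone stage without changing tensor shapes.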

2.2.2. Attention-Based Cross-Modal Reasoning Module

This module fuses two complementary cues: the remote sensing image (I), which encodes rich spectral and texture semantics with intact geometry, and the relative depth map (D), which supplies explicit depth-related priors on spatial height variation. Its purpose is to align and fuse building semantics with relative height features, remove misalignment between the modalities, and convert uncalibrated depth priors into spatially correlated height information. To fully exploit this complementarity, an attention-based cross-modal reasoning module is introduced to facilitate deep semantic alignment between the two modalities. The dual-branch cross-modal attention leverages semantic cues to identify homogeneous ground regions and calibrate depth discrepancies. Combined with spatial positional encoding, it ensures cross-region alignment for consistent large-scale elevation prediction.

Specifically, bidirectional attention propagation is implemented at multiple hierarchical feature levels. In the $D \to I$ direction, features from D are treated as Queries, while features from I serve as Keys and Values, enabling image semantics to compensate for structural ambiguities in the depth representation. Conversely, in the $I \to D$ direction, image features act as Queries and depth features as Keys and Values, allowing relative depth priors to inject spatial depth awareness into image representations. This bidirectional design ensures reciprocal information exchange, yielding synergistic feature enhancement across modalities.

The cross-modal fusion follows a coarse-to-fine hierarchical strategy, enabling progressive refinement from low-level structural cues to high-level semantic abstractions. In addition, positional encoding is incorporated into the attention computation, because attention by itself is insensitive to spatial location. The encoding introduces the absolute 2D pixel coordinates of the remote sensing image, which correspond to actual geographic locations, into cross-modal attention. It gives the network explicit spatial awareness, so that cross-modal alignment relies on physical position rather than feature similarity alone, consistent with the correlation between height and location. Specifically, we employ non-learnable sinusoidal positional encoding, which applies sine and cosine functions of varying wavelengths to the absolute 2D coordinates to generate a unique signature for each spatial position, thereby ensuring accurate spatial correspondence during cross-modal reasoning.
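To make the two attention directions and the positional encoding concrete, the following NumPy sketch operates on flattened feature maps of shape (H·W, d). It is a simplified illustration under stated assumptions: single-head attention, identity Query/Key/Value projections, a residual connection, a base wavelength constant of 10000, and hypothetical function names. The actual projections and number of heads in RDAH-Net are not specified in the text.

```python
import numpy as np

def sinusoidal_pe_2d(h, w, dim):
    """Non-learnable 2D sinusoidal encoding of shape (h*w, dim); dim must be divisible by 4."""
    assert dim % 4 == 0
    d_axis = dim // 2                                   # channels devoted to each axis
    freqs = 1.0 / (10000 ** (np.arange(0, d_axis, 2) / d_axis))
    ys = np.arange(h)[:, None] * freqs[None, :]         # (h, d_axis/2)
    xs = np.arange(w)[:, None] * freqs[None, :]         # (w, d_axis/2)
    pe_y = np.concatenate([np.sin(ys), np.cos(ys)], axis=1)   # (h, d_axis)
    pe_x = np.concatenate([np.sin(xs), np.cos(xs)], axis=1)   # (w, d_axis)
    pe = np.concatenate([
        np.broadcast_to(pe_y[:, None, :], (h, w, d_axis)),    # row signature
        np.broadcast_to(pe_x[None, :, :], (h, w, d_axis)),    # column signature
    ], axis=2)
    return pe.reshape(h * w, dim)

def cross_attention(q_feat, kv_feat, pe):
    """Scaled dot-product attention: q_feat supplies Queries, kv_feat supplies Keys and Values."""
    d = q_feat.shape[1]
    q = q_feat + pe                                     # position-aware queries
    k = kv_feat + pe                                    # position-aware keys
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over key positions
    return q_feat + weights @ kv_feat, weights          # residual update (assumption)

def bidirectional_fusion(img_tokens, depth_tokens, pe):
    """D -> I: depth queries attend to image; I -> D: image queries attend to depth."""
    depth_out, _ = cross_attention(depth_tokens, img_tokens, pe)
    img_out, _ = cross_attention(img_tokens, depth_tokens, pe)
    return img_out, depth_out
```

Adding the same encoding to Queries and Keys of both modalities biases each position toward attending to co-located positions in the other modality, which is one simple way to tie alignment to pixel location.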

2.3. Refined Output Mechanism

This module generates the high-precision absolute-height nDSM from the fused features. Its role is to restore fine spatial resolution and convert abstract deep features into absolute building heights in meters, meeting the resolution and precision requirements of urban nDSMs. To reconstruct high-resolution nDSM outputs while avoiding common upsampling artifacts, a refined output mechanism is embedded within the decoder. Instead of conventional transposed convolution, the PixelShuffle operation [32] is employed to perform sub-pixel upsampling via channel rearrangement, which mitigates checkerboard artifacts, preserves the continuity of height variation, and keeps the predicted nDSM consistent with the real-world building height distribution.

The decoder adopts a cascaded refinement strategy, where each upsampling stage consists of channel-matching convolution, PixelShuffle-based resolution enhancement, batch normalization, and ReLU activation. This progressive design enables stable feature learning and gradual restoration of fine-grained spatial details. As a result, the network can reliably generate high-precision nDSM prediction maps at a resolution of $1024 \times 1024$, while preserving structural fidelity and boundary sharpness in the final output.
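The channel rearrangement behind PixelShuffle can be written in a few lines of NumPy. The sketch below follows the common channel-ordering convention (the one used by PyTorch's `nn.PixelShuffle`); the function name and the single-image (C·r², H, W) layout are illustrative assumptions, and the convolution, batch normalization, and ReLU of each decoder stage are omitted.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without interpolation.

    Output pixel [c, h*r + i, w*r + j] is taken from input channel c*r*r + i*r + j
    at location [h, w], so every output value is a learned feature, not a blend.
    """
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r^2"
    c_out = c_in // (r * r)
    x = x.reshape(c_out, r, r, h, w)        # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)          # reorder to (c, h, i, w, j)
    return x.reshape(c_out, h * r, w * r)   # merge (h, i) and (w, j)
```

Since each output pixel is produced by exactly one input channel, the uneven kernel overlap that causes checkerboard patterns in strided transposed convolution does not arise in the rearrangement step itself.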

3. Experimental Results

This section presents the experimental design and results to validate the proposed framework. We first introduce the dataset construction and preprocessing pipeline, followed by the experimental settings (training protocol, baselines, and evaluation metrics). We then report quantitative and qualitative comparisons on three datasets, assess cross-domain generalization, and finally conduct ablation studies to verify the contribution of each key module.

3.1. Dataset Construction

To evaluate the proposed method across different geographic regions and sensor conditions, we conduct experiments on one public benchmark and two self-constructed datasets. As summarized in Table 1, we use the public DFC2019-Track1 [33] dataset and construct two additional datasets from Switzerland (Swiss) and Hong Kong (HK). Geographically, the Swiss dataset covers two regions (Lat 47.39° N–47.60° N, Long 8.86° E–9.20° E; and Lat 46.93° N–47.17° N, Long 8.26° E–8.60° E), while the HK dataset covers two distinct scenes (Lat 22.15° N–22.40° N, Long 114.19° E–114.45° E; and Lat 22.28° N–22.53° N, Long 113.75° E–114.01° E). In this work, we define samples acquired from the same city and the same sensor as belonging to the same domain.

These datasets exhibit significant domain gaps, which is beneficial for validation. As illustrated in Figure 4, differences in architectural styles lead to distinct feature representations. Furthermore, Table 1 shows that the spatial resolution of the Swiss and HK datasets differs considerably from that of DFC2019-Track1; this difference causes buildings to appear at smaller scales, enriching feature diversity. In terms of height distribution, Figure 5 shows that the Swiss dataset contains a higher proportion of tall buildings (above 12 m), whereas the HK dataset exhibits a higher proportion of building areas above 2 m compared to DFC2019-Track1. These variations in geographic coverage, spatial resolution, and height distribution make the three datasets a demanding testbed for the generality of our method and provide a sound basis for cross-domain generalization testing.

For each dataset, samples are split into training and test sets with a ratio of 4:1. During training, we further randomly sample 20% of the training set as the validation set.
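The split described above (4:1 train/test, then 20% of the training portion held out for validation) corresponds to roughly 64% / 16% / 20% of all samples. A minimal sketch follows; it assumes a seeded random shuffle over sample indices, which the text does not specify for the train/test split, and the function name is hypothetical.

```python
import random

def split_indices(n, test_ratio=0.2, val_ratio=0.2, seed=0):
    """Return (train, val, test) index lists for n samples.

    test_ratio=0.2 gives the 4:1 train/test split; val_ratio is applied to the
    remaining training portion only. Random shuffling is an assumption.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_ratio)
    test, trainval = idx[:n_test], idx[n_test:]
    n_val = round(len(trainval) * val_ratio)
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test
```

For 1000 patches this yields 640 training, 160 validation, and 200 test samples.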

For the Swiss and HK datasets, we follow a standardized pipeline to ensure strict geographic alignment between remote sensing images and nDSM labels.

(1) DSM/DTM and nDSM Ground Truth generation. We collect open-source LiDAR point clouds and generate DSM by raster interpolation of the original point cloud. For DTM generation, ground points are first extracted and then interpolated. Specifically, we adopt the Cloth Simulation Filter (CSF) [34] for ground/non-ground separation. The DTM is generated from the extracted ground points, and the nDSM is obtained via pixel-wise subtraction between the aligned DSM and DTM.

(2) Orthorectification. Using DSM as elevation support, the remote sensing images are orthorectified based on the sensor orientation parameters, where collinearity equations are solved to correct terrain-induced displacements. The orthorectified image is produced by resampling, ensuring strict spatial correspondence with the nDSM [35].

(3) Patch generation. We crop the overlapping area between the orthophoto and nDSM, then apply a sliding-window strategy to generate paired patches. The patch size is set to 512 × 512 with an overlap of 256 pixels to preserve boundary structures.

(4) Quality filtering. Invalid samples are removed by two criteria: (i) nDSM patches with ≥50% no-data values (often caused by large water bodies) are discarded; (ii) orthophoto patches with more than 20% fully black pixels (RGB all zeros) are discarded to avoid invalid margins introduced during image generation. Only valid patches are retained.
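Steps (3) and (4) above can be sketched as follows. The window size, overlap, and the two rejection thresholds come from the text; the no-data sentinel value (`-9999.0`), the decision to drop windows that do not fit entirely inside the raster, and all function names are assumptions for illustration.

```python
import numpy as np

def window_origins(height, width, patch=512, overlap=256):
    """Yield (top, left) corners of sliding windows that fit fully inside the raster."""
    stride = patch - overlap
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            yield top, left

def is_valid_patch(ortho, ndsm, nodata=-9999.0):
    """Apply the two quality criteria to one (orthophoto, nDSM) patch pair."""
    nodata_frac = np.mean(ndsm == nodata)                 # share of no-data nDSM pixels
    black_frac = np.mean(np.all(ortho == 0, axis=-1))     # share of pixels with RGB all zero
    # discard if >= 50% no-data, or if more than 20% fully black
    return bool(nodata_frac < 0.5 and black_frac <= 0.2)

def make_patches(ortho, ndsm, patch=512, overlap=256, nodata=-9999.0):
    """Crop aligned patch pairs from co-registered rasters and keep only valid ones."""
    pairs = []
    for top, left in window_origins(ndsm.shape[0], ndsm.shape[1], patch, overlap):
        o = ortho[top:top + patch, left:left + patch]
        n = ndsm[top:top + patch, left:left + patch]
        if is_valid_patch(o, n, nodata):
            pairs.append((o, n))
    return pairs
```

With a 256-pixel overlap the stride is 256 pixels, so a 1024 × 1024 overlap area produces a 3 × 3 grid of candidate patches before filtering.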

3.2. Experimental Settings

We describe the experimental settings in terms of data usage, model inputs, baselines, and training configurations:

Training/testing protocols. DFC2019-Track1 is used as the primary benchmark for in-domain evaluation (training and testing within the dataset split). To assess cross-domain generalization, we additionally evaluate the model trained on DFC2019-Track1 directly on the Swiss and HK test sets without retraining. Moreover, to verify that the observed superiority is consistent across datasets, we also train and test on Swiss and HK separately under the same protocol.
Relative depth prior generation (frozen). For each input image, a relative depth prior is generated by Depth Anything v2. Importantly, Depth Anything v2 is used only as a fixed prior generator: its parameters are kept frozen and it is not fine-tuned on any dataset in this study. The predicted relative depth maps are then provided as an additional input modality to RDAH-Net.
Input preprocessing and normalization. The patch sizes are 1024 × 1024 for DFC2019-Track1 and 512 × 512 for Swiss/HK. Since the proposed network supports variable input sizes, no resizing is applied. The uint16 orthophotos are linearly normalized to the uint8 range for stable intensity distribution, and then normalized using the ImageNet mean and standard deviation per channel. For nDSM labels, we adopt a fixed-point encoding strategy: nDSM values are multiplied by 500 and stored as uint16 during data I/O, and are divided by 500 to recover the original height scale for loss computation and evaluation. All reported metrics are computed on the recovered nDSM values.
Comparison methods. We compare the proposed RDAH-Net with (i) IM2ELEVATION [36] and IM2HEIGHT [11], representative encoder-decoder methods for single-image height estimation; (ii) Baseline-MLP, which takes only the relative depth prior as input and learns a direct projection to nDSM; (iii) SynRS3D [14], representing distinct paradigms that leverage synthetic data and unsupervised domain adaptation; and (iv) NRF [22], a sparse LiDAR-guided correction method that leverages ICESat-2 photon data to refine prediction residuals. For ablation studies, we remove one module at a time while keeping the remaining components unchanged.
Training configurations. All experiments are conducted on a machine with a 12-vCPU Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90 GHz and an NVIDIA GeForce RTX 4080 GPU with 32 GB available memory. We use the Adam optimizer with a learning rate of $1 \times 10^{-5}$, a batch size of 8, and train for 100 epochs. The checkpoint with the best validation performance is selected for testing. All models converge within 100 epochs under this configuration.
Loss function. We optimize the proposed method using the L1 loss,

$L_{L1} = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|,$ (1)

where $y_{i}$ and ${\hat{y}}_{i}$ denote the ground-truth and predicted nDSM values, respectively. L1 loss is adopted due to its robustness to outliers and its direct interpretability in the height estimation task.
Evaluation metric. We report the mean absolute error (MAE) for quantitative evaluation,

$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|.$ (2)

In addition to MAE, we provide qualitative visualizations to assess structural fidelity (e.g., building boundaries and height hierarchy), since MAE alone may not fully reflect the preservation of fine-grained structures in dense urban scenes.
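The input normalization, fixed-point nDSM encoding, and MAE defined in the settings above can be sketched in NumPy. The scale factor of 500 and the ImageNet statistics follow the text; the per-image min–max stretch used for the uint16-to-uint8 step, the clipping of out-of-range heights, and the function names are assumptions, since the text only states that the mapping is linear.

```python
import numpy as np

NDSM_SCALE = 500  # heights in metres are stored as round(h * 500) in uint16
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def encode_ndsm(height_m):
    """Fixed-point encode heights; values outside [0, 131.07] m are clipped (assumption)."""
    scaled = np.round(np.asarray(height_m, dtype=np.float64) * NDSM_SCALE)
    return np.clip(scaled, 0, np.iinfo(np.uint16).max).astype(np.uint16)

def decode_ndsm(encoded):
    """Recover heights in metres for loss computation and evaluation."""
    return np.asarray(encoded).astype(np.float32) / NDSM_SCALE

def normalize_ortho(img_u16):
    """Map a (H, W, 3) uint16 orthophoto to the 0..255 range, then apply ImageNet statistics."""
    img = img_u16.astype(np.float32)
    lo, hi = img.min(), img.max()
    img8 = np.round((img - lo) / max(hi - lo, 1.0) * 255.0)   # min-max stretch (assumption)
    return (img8 / 255.0 - IMAGENET_MEAN) / IMAGENET_STD

def mae(y_true, y_pred):
    """Mean absolute error; identical in form to the L1 training loss."""
    diff = np.asarray(y_true, dtype=np.float64) - np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(diff)))
```

With this encoding the quantization step is 1/500 = 0.002 m and the largest representable height is 65535/500 = 131.07 m, so the rounding error introduced by storage is at most 0.001 m per pixel.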

3.3. Results

3.3.1. Relative Depth Prior Versus Absolute Height Range

We first examine the relationship between the large-model relative depth prior and the absolute nDSM range predicted by RDAH-Net. As shown in Figure 6, the numerical range of the relative depth output is not proportional to the range of the corresponding nDSM ground truth. For example, although the ground-truth height range of OMA-315-023 [33] is substantially larger than that of Swiss-051-161, the relative depth output shows the opposite pattern. Moreover, the relative depth values typically span tens to hundreds, while the nDSM values range from near zero to tens of meters, indicating that the prior is not directly interpretable in metric height units. These observations support the necessity of a dedicated network that jointly leverages image appearance and relative depth cues to infer physically meaningful absolute heights. Figure 6 also shows that RDAH-Net maps the relative depth prior to an output range consistent with the ground-truth nDSM.

3.3.2. In-Domain Quantitative Comparison

We compare RDAH-Net with IM2ELEVATION, IM2HEIGHT, Baseline-MLP, SynRS3D, and NRF by training and testing each method separately on each dataset. As reported in Table 2, RDAH-Net achieves the best overall MAE across the three datasets. IM2HEIGHT performs significantly worse than the proposed method across all three datasets. SynRS3D achieves suboptimal performance, as it was trained without the semantic segmentation branch for a fair comparison, thereby diminishing its advantage and limiting it to the height prediction branch. NRF also yields unsatisfactory results due to the limitations of its random forest-based methodology. Baseline-MLP performs competitively on HK but degrades on DFC2019-Track1 and Swiss, reflecting the limitation of relying solely on the depth prior without incorporating image semantics. IM2ELEVATION achieves comparable performance on DFC2019-Track1, but both IM2ELEVATION and IM2HEIGHT exhibit severe degradation on some datasets due to their sensitivity to radiometric variations, which can lead to abnormal predictions under low-brightness conditions (see Section 4.1 for detailed discussion).

In terms of model complexity, RDAH-Net contains only 5.3716 M trainable parameters, approximately 3.38% of IM2ELEVATION (158.9465 M) and 1.47% of SynRS3D (366.5392 M). Since the relative depth prior is produced by a fixed pre-trained model and does not require training, the trainable memory footprint of the proposed approach remains substantially smaller. NRF is a traditional machine learning method rather than a deep learning model, so it is excluded from the efficiency comparison. RDAH-Net has far lower FLOPs (54.40 G) than all deep learning-based comparison methods except Baseline-MLP (5.44 G). Among the deep learning-based methods other than Baseline-MLP, it achieves the highest FPS at 1024 × 1024 (44.58) and is faster than SynRS3D and IM2ELEVATION at 512 × 512, although IM2HEIGHT is faster at that resolution (82.34 vs. 54.26 FPS). Notably, despite this computational efficiency, RDAH-Net still attains the lowest MAE on all three datasets in Table 2.
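The parameter ratios quoted above can be reproduced from the counts in Table 2. The short sketch below does the arithmetic; the dictionary and function names are illustrative. The second ratio comes out at roughly 1.4655%, which reads as 1.46% when truncated and 1.47% when rounded to two decimals.

```python
# Trainable parameter counts in millions, copied from Table 2.
PARAMS_M = {
    "RDAH-Net": 5.3716,
    "IM2ELEVATION": 158.9465,
    "SynRS3D": 366.5392,
}


def param_ratio_percent(model, reference, params=PARAMS_M):
    """Parameter count of `model` as a percentage of `reference`."""
    return 100.0 * params[model] / params[reference]


for reference in ("IM2ELEVATION", "SynRS3D"):
    ratio = param_ratio_percent("RDAH-Net", reference)
    print(f"RDAH-Net / {reference}: {ratio:.4f}%")
```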

Figure 7 shows the qualitative comparison on DFC2019-Track1 (Omaha and Jacksonville), where RDAH-Net consistently outperforms the compared methods. Baseline-MLP produces sharp edges but substantial height deviations because it over-relies on the relative depth prior. RDAH-Net recovers precise shapes and heights on JAX-004-015, OMA-329-039, and OMA-367-012, despite the minor vegetation overestimate marked by the white box; it balances building and vegetation prediction on JAX-280-004; and it recovers complete bridge and building shapes and heights on OMA-315-023.

Figure 8 extends the analysis to the Swiss and HK datasets. Even on the shadowed HK-072-466, where all models deviate, RDAH-Net yields the smallest errors, which we attribute to the robust zero-shot depth prior and the multi-modal fusion. In vegetation areas beyond urban regions, RDAH-Net is the only method that restores the shape of the vegetation on Swiss-051-2. For HK-062-11, although none of the compared methods perfectly recover the vegetation morphology, RDAH-Net yields the predicted height range closest to the ground truth. In comparison, Baseline-MLP recovers partial vegetation heights but lacks detailed textures because it relies entirely on relative depth, and NRF produces grainy height predictions that lack precision due to inherent limitations of its algorithmic principles.

3.3.3. Cross-Domain Generalization

To assess generalization, we directly evaluate the models trained on DFC2019-Track1 on the Swiss and HK test sets. As shown in Table 3, RDAH-Net achieves the best MAE on HK (3.74 m), while IM2ELEVATION achieves the lowest MAE on Swiss (3.40 m, versus 4.50 m for RDAH-Net). However, the quantitative results should be interpreted together with the visualizations in Figure 9. We observe that IM2ELEVATION can produce outputs that are nearly constant across large regions under certain radiometric conditions, which may yield a deceptively low MAE if the constant prediction happens to be close to the dataset mean height. In such cases, MAE alone does not reflect the loss of structural fidelity. Cross-domain deployment across different cities and sensors is the more challenging setting; under in-domain training and testing (Table 2), RDAH-Net consistently provides the best overall performance.

In real-world deployments, such extensive cross-domain settings, covering multiple cities and heterogeneous sensors, are relatively uncommon. When both training and testing are conducted within a single region, RDAH-Net achieves the lowest MAE on all three datasets (Table 2), which we expect to satisfy the demands of most practical applications.

3.4. Ablation Study

We conduct ablation experiments to evaluate the contributions of five components: the MobileViT backbone, CBAM, bidirectional cross-attention fusion, PixelShuffle-based upsampling, and the depth input. In the variant without depth, only remote sensing images are used; to keep the comparison fair, the dual-branch structure is retained and the remote sensing image is fed into both branches. The conventional encoder used as the replacement in the experiments is a cascaded sequential structure: three convolutional layers with strides of 2, 2, and 4 gradually increase the channel dimension of the single-channel depth feature map from 1 to 32, 64, and finally d-model, and each convolutional layer is followed by batch normalization and a ReLU activation. For a controlled comparison, each ablated variant is obtained by removing (or replacing) one component while keeping all other settings unchanged. Following the main generalization protocol, all ablated models are trained and tested on DFC2019-Track1, and are additionally evaluated on HK and Swiss (trained on DFC, tested on HK/Swiss) to assess robustness under domain shifts.
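To make the geometry of the conventional encoder described above concrete, the sketch below tracks the tensor shape through its three convolutional stages. Only the channel widths (1, 32, 64, d-model) and the strides (2, 2, 4) come from the text; the 3 × 3 kernels, the padding of 1, and the value 128 for d-model are assumptions chosen for illustration. Batch normalization and ReLU do not change shapes, so they are omitted.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width, d_model=128, kernel=3, padding=1):
    """Return (channels, height, width) after each encoder stage.

    d_model, kernel and padding are hypothetical defaults; the strides
    and the 32/64 channel widths follow the description in the text.
    """
    channels = [32, 64, d_model]
    strides = [2, 2, 4]
    shapes = [(1, height, width)]  # single-channel depth input
    for out_channels, stride in zip(channels, strides):
        height = conv_out_size(height, kernel, stride, padding)
        width = conv_out_size(width, kernel, stride, padding)
        shapes.append((out_channels, height, width))
    return shapes


for shape in encoder_shapes(512, 512):
    print(shape)
```

Under these assumptions the three stages reduce the spatial resolution by a factor of 16 overall (2 × 2 × 4).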

Table 4 shows that removing any module degrades performance on DFC2019-Track1, indicating that each component contributes to the final accuracy. The component abbreviations are MV (MobileViT), CB (CBAM), BCA (Bi-Cross Attention), PS (PixelShuffle), and DP (Depth Prediction); “w/o” denotes “without”, i.e., the ablation of the corresponding module. Among the variants, removing MobileViT or depth results in the largest performance drops on DFC2019-Track1 (MAE of 2.38 m and 3.04 m, respectively, versus 1.54 m for the full model), suggesting that strong feature extraction and the depth prior are both critical for accurate height inference. Removing CBAM produces a smaller degradation (1.74 m), consistent with its lightweight role as an attention refinement module. Removing the bidirectional cross-attention module also degrades performance (2.00 m), supporting the necessity of cross-modal reasoning between the image and the depth prior. Replacing PixelShuffle with standard upsampling reduces accuracy as well (1.90 m) and leads to less sharp outputs, consistent with the decoder design goal of artifact-free and detail-preserving reconstruction.
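For readers unfamiliar with the PixelShuffle operation ablated above, the sketch below implements the standard sub-pixel rearrangement on nested lists: a tensor with C·r² channels of size H × W becomes C channels of size rH × rW. It illustrates the generic operation only and is not the authors' decoder; the function name and the toy input are illustrative.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Uses the usual sub-pixel layout: input channel c*r*r + i*r + j
    fills the output positions (h*r + i, w*r + j) of output channel c.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1 x 1 channels become one 2 x 2 channel when r = 2.
toy_input = [[[1]], [[2]], [[3]], [[4]]]
shuffled = pixel_shuffle(toy_input, 2)
print(shuffled)
```

Each output pixel is copied from exactly one learned channel value rather than interpolated from neighbours, which is the usual explanation for why this kind of upsampling tends to keep edges sharper than interpolation.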

The ablation results in Figures 10 and 11 illustrate each module’s role. Removing MobileViT impairs feature extraction and degrades large-building reconstruction: internal height variations and accuracy are reduced in the white-boxed region of HK-062-674, and shapes deviate in the white boxes of Swiss-051-161 and OMA-239-039. Omitting CBAM weakens attention enhancement, causing extreme overprediction in HK-072-466 and making the relative heights of the three buildings in the red box of Swiss-051-161 indistinguishable. Without bidirectional cross-attention, image-depth fusion falters, yielding overall overestimation in JAX-280-004 and OMA-329-039. Replacing PixelShuffle with standard upsampling sacrifices height precision and induces blurry “creamy melting” effects on the buildings in Swiss-083-920 and HK-072-466. When the depth input is removed, the network cannot effectively exploit cross-attention because the necessary relative depth information is missing; as a result, relying solely on remote sensing images does not yield meaningful nDSM prediction maps.

We further investigate the MobileViT ablation, since the cross-domain MAE on Swiss appears better than the full model in Table 4. To verify whether this phenomenon is caused by domain shift and the evaluation protocol (training on DFC but testing on Swiss), we additionally train and test RDAH-Net and the MobileViT-ablated variant within the HK and Swiss datasets, respectively. As shown in Table 5, RDAH-Net remains superior under the in-domain setting, indicating that MobileViT is beneficial when the training and testing distributions are consistent, and the apparent improvement under cross-domain testing is not representative of the module’s overall contribution.

4. Discussion

4.1. Analysis of Abnormal Cases in Comparative Methods

We observe that both IM2ELEVATION and IM2HEIGHT suffer from severe performance degradation under direct cross-domain evaluation without adaptation. Specifically, they occasionally produce nearly constant predictions across the entire image, leading to structurally meaningless nDSM outputs (Figure 12), while our proposed method remains robust and free of such anomalies. Since both are representative encoder-decoder architectures, we select IM2ELEVATION for the following failure analysis. This phenomenon indicates a strong sensitivity to domain shifts, particularly to radiometric differences and intensity distribution changes between training and testing data.

A possible explanation is that the feature representation learned by the model is highly coupled with the training-domain appearance statistics. When the test data exhibit different sensor characteristics or radiometric properties, the extracted multi-scale features may deviate from the operating range of the decoder, and the subsequent decoding process tends to collapse into low-variance outputs. In addition, cross-domain differences in spatial texture distribution and scene composition can further amplify feature misalignment in encoder-decoder fusion, making dense prediction unstable. In our experiments, this issue becomes more noticeable when the input images undergo intensity normalization from uint16 to uint8, which may compress contrast and shift the overall intensity distribution compared with the training set. We attempted simple brightness adjustments, but the degradation pattern remained, suggesting that the failure is not merely due to global brightness shifts.
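The contrast compression mentioned above can be illustrated with a toy example. The text does not state how the uint16-to-uint8 normalization was performed, so the two conversions below (a linear rescale of the full uint16 range and a per-image min-max stretch) are assumptions chosen only to show how strongly the choice affects a dim image; the pixel values are made up.

```python
def to_uint8_full_range(pixels):
    """Linearly rescale from the full uint16 range [0, 65535] to [0, 255]."""
    return [round(p * 255 / 65535) for p in pixels]


def to_uint8_min_max(pixels):
    """Stretch the observed [min, max] of this image to [0, 255]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0] * len(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


# A hypothetical low-brightness uint16 patch that uses under 2% of the range.
dim_patch = [300, 450, 600, 750, 900]

full_range = to_uint8_full_range(dim_patch)
stretched = to_uint8_min_max(dim_patch)

print("full-range rescale:", full_range)
print("min-max stretch:   ", stretched)
print("distinct gray levels:", len(set(full_range)), "vs", len(set(stretched)))
```

With the full-range rescale the five input values collapse into a few adjacent gray levels near zero, while the min-max stretch spreads them across the whole 8-bit range. A model trained on one convention and tested on the other would therefore see a very different intensity distribution.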

In contrast, Baseline-MLP takes the relative depth prior as input, and RDAH-Net jointly uses both the relative depth prior and the image modality. Since the relative depth prior is produced by a fixed pre-trained model and is less sensitive to raw image intensity range alone, these approaches are comparatively more robust under low-contrast or radiometrically shifted test images. This observation supports the motivation of this work: introducing a depth prior can provide additional geometric cues that reduce the reliance on appearance statistics, while cross-modal fusion further stabilizes height inference under domain variations.

4.2. General Discussion

This study proposes RDAH-Net for single-image remote sensing height estimation by integrating a relative depth prior from a large pre-trained model with a dedicated cross-modal fusion network. The core contribution is to explicitly address the mismatch between dimensionless relative depth and metric absolute height, which is a key bottleneck in monocular nDSM generation.

Methodological Implications

Existing single-image methods that rely only on the image modality are often sensitive to appearance variations, such as illumination differences and sensor-dependent radiometric shifts, because the inferred height cues are largely entangled with texture and intensity statistics. Conversely, methods that depend solely on a depth-like input (e.g., directly projecting relative depth) lack semantic constraints from the image and can exhibit systematic absolute-value deviations. By introducing a robust relative depth prior and explicitly modeling cross-modal complementarity through bidirectional attention, RDAH-Net reduces over-reliance on a single modality and enables reciprocal enhancement between image semantics and depth cues. The ablation results further support that each component contributes to accuracy and structural fidelity: the backbone provides strong multi-scale representations, CBAM refines salient height-related cues, bidirectional cross-attention enables effective inter-modal reasoning, and PixelShuffle supports detail-preserving reconstruction with reduced upsampling artifacts.

4.3. Practical Considerations and Limitations

From an application perspective, nDSM acquisition methods face a trade-off among accuracy, data requirements, and deployment cost. LiDAR-based pipelines provide high accuracy but are expensive and slow to update; stereo/MVS-based pipelines are more scalable but require qualified overlapping imagery that may be unavailable in many regions. RDAH-Net offers a practical alternative by predicting nDSM from a single orthophoto, while leveraging a pre-trained depth model to mitigate the scarcity of remote-sensing-specific depth annotations. Moreover, the proposed model is lightweight in trainable parameters, which lowers the computational barrier for deployment. The experiments on multiple datasets demonstrate stable in-domain performance and competitive cross-domain generalization.

Nevertheless, several practical limitations exist. First, the method relies on a two-stage pipeline: relative depth prior generation followed by metric height estimation, which introduces extra inference latency compared with end-to-end frameworks. Second, the current evaluation is limited to urban and suburban scenes. The effectiveness of RDAH-Net in natural landscapes such as mountainous areas and farmland has not been fully verified, since these regions exhibit more complex topography, weaker textures, and sparser structural patterns that may challenge both the depth prior and cross-modal fusion. These limitations indicate promising directions for future optimization toward more efficient inference and broader scene generalization.

4.4. On Evaluation Under Domain Shift

We also note that cross-domain evaluation can reveal different failure modes that are not fully captured by a single scalar metric such as MAE. For example, near-constant predictions may occasionally achieve a deceptively low MAE if the constant value coincides with the mean height distribution of the test set, while the structural fidelity is severely compromised. Therefore, in addition to reporting MAE, qualitative visual inspection remains necessary for understanding whether the model preserves meaningful building boundaries and height hierarchy in practical nDSM applications.
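The effect described above is easy to reproduce on synthetic data. In the sketch below, a flat prediction equal to the scene mean obtains a lower MAE than a prediction that preserves every height difference but carries a uniform 1.5 m bias. The scene values, the bias, and the use of the prediction's standard deviation as a crude structure indicator are all illustrative assumptions, not quantities from the experiments.

```python
from statistics import mean, pstdev


def mae(y_true, y_pred):
    """Mean absolute error between two flat lists of heights (metres)."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


# Synthetic 4 x 4 scene flattened to a list; heights in metres.
ground_truth = [4.0, 4.0, 6.0, 6.0,
                4.0, 4.0, 6.0, 6.0,
                5.0, 5.0, 5.0, 5.0,
                5.0, 5.0, 5.0, 5.0]

# Collapsed output: every pixel equals the scene mean.
constant_pred = [mean(ground_truth)] * len(ground_truth)

# Structure-preserving output with a uniform bias.
biased_pred = [v + 1.5 for v in ground_truth]

for name, pred in (("constant", constant_pred), ("biased", biased_pred)):
    print(f"{name:8s} MAE = {mae(ground_truth, pred):.2f} m, "
          f"prediction std = {pstdev(pred):.2f} m")
```

The constant map wins on MAE while containing no height variation at all, which is why a scalar error metric is best read alongside the visualizations or a structure-sensitive statistic.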

Overall, this work suggests that combining large-model priors with an explicitly designed cross-modal mapping network is a promising direction for improving both accuracy and robustness in single-image remote sensing height estimation, and the proposed fusion paradigm may be informative for other multi-source remote sensing tasks that benefit from complementary priors.

5. Conclusions

This paper presents RDAH-Net, a single-image remote sensing height estimation framework that integrates a relative depth prior from a large pre-trained model with a dedicated cross-modal reasoning network to enable accurate nDSM generation in regions where stereo imagery is unavailable. Specifically, a relative depth map is first produced by Depth Anything v2 as a fixed prior, and RDAH-Net learns an effective transformation from dimensionless relative depth to metric absolute height by jointly exploiting the depth prior and the original orthophoto. The proposed network adopts a lightweight MobileViT-S-Light backbone for efficient feature extraction, enhances height-relevant representations with CBAM, performs deep cross-modal fusion via a bidirectional attention mechanism with positional encoding, and refines high-resolution outputs using PixelShuffle-based artifact-reduced upsampling.

Experiments on the public DFC2019-Track1 benchmark and two self-constructed datasets from Hong Kong and Switzerland demonstrate that RDAH-Net achieves strong performance under in-domain training/testing settings and exhibits robust behavior under cross-domain evaluation. Compared with representative baselines, the proposed method provides improved accuracy and more stable structural reconstruction in challenging scenes, while maintaining a small number of trainable parameters (5.37 M), substantially reducing the deployment burden. Ablation studies further validate the contribution of each key component, including the backbone, attention refinement, cross-modal fusion, and refined decoding.

In summary, RDAH-Net offers a practical and effective solution for single-image nDSM generation and provides evidence that large-model priors, when combined with explicit cross-modal mapping and fusion design, can improve both accuracy and robustness in remote sensing height estimation. This paradigm may also benefit other remote sensing applications that require integrating complementary priors across modalities.

Author Contributions

Conceptualization, L.J. and Y.X.; methodology, L.J.; software, L.J. and Y.X.; validation, L.J.; formal analysis, L.J.; investigation, L.J. and Y.X.; resources, Y.X. and F.W.; data curation, L.J. and Y.X.; writing—original draft preparation, L.J.; writing—review and editing, L.J. and Y.X.; visualization, L.J.; supervision, N.J., F.W. and H.Y.; project administration, J.Z., H.Y. and F.W.; funding acquisition, N.J. and F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Our code will be uploaded to https://github.com/Elenairene/RDAH-Net (accessed on 20 January 2026) upon the publication of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Alizadeh Naeini, A.; Sheikholeslami, M.M.; Sohn, G. Advancing Physically Informed Autoencoders for DTM Generation. Remote Sens. 2024, 16, 1841.
2. Orlof, J.; Ozimek, P.; Łabędź, P.; Widłak, A.; Ozimek, A. Generating viewsheds based on the Digital Surface Model (DSM) and point cloud. PLoS ONE 2024, 19, e0312146.
3. Wan, L.; Xiang, Y.; Kang, W.; Ma, L. A Self-Supervised Learning Pretraining Framework for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5630116.
4. Pereira, L.G.; Fernandez, P.; Mourato, S.; Matos, J.; Mayer, C.; Marques, F. Quality Control of Outsourced LiDAR Data Acquired with a UAV: A Case Study. Remote Sens. 2021, 13, 419.
5. Jiang, L.; Wang, F.; Zhang, W.; Li, P.; You, H.; Xiang, Y. Rethinking the Key Factors for the Generalization of Remote Sensing Stereo Matching Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4936–4948.
6. Gu, X.; Fan, Z.; Zhu, S.; Dai, Z.; Tan, F.; Tan, P. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2492–2501.
7. Wang, X.; Jiang, L.; Xiang, Y.; Jiao, N.; Yang, W.; Wang, F. Enhancing Photogrammetric DSM Based on Multiscale and Domain-Invariant Semantic Feature Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 28677–28694.
8. Stathopoulou, E.K.; Remondino, F. A survey on conventional and learning-based methods for multi-view stereo. Photogramm. Rec. 2023, 38, 374–407.
9. Arampatzakis, V.; Pavlidis, G.; Mitianoudis, N.; Papamarkos, N. Monocular Depth Estimation: A Thorough Review. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2396–2414.
10. Xiang, Y.; Jiang, L.; Wang, F.; You, H.; Qiu, X.; Fu, K. Detector-Free Feature Matching for Optical and SAR Images Based on a Two-Step Strategy. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5214216.
11. Mou, L.; Zhu, X.X. IM2HEIGHT: Height Estimation from Single Monocular Imagery via Fully Residual Convolutional-Deconvolutional Network. arXiv 2018, arXiv:1802.10249.
12. Chen, S.; Shi, Y.; Zhu, X. Long-tailed Regression with Ensembles for Monocular Height Estimation from Single Remote Sensing Images. In Proceedings of the 2023 Joint Urban Remote Sensing Event (JURSE), Heraklion, Greece, 17–19 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–4.
13. Ma, L.; Fu, Y.; Lu, X.; Xue, Q.; Miao, J. HCTNet: Hybrid CNN-Transformer Architecture Network for Self-Supervised Monocular Depth Estimation. In Proceedings of the 2023 International Conference on Computer Science and Automation Technology (CSAT), Shanghai, China, 6–8 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 353–357.
14. Song, J.; Chen, H.; Xuan, W.; Xia, J.; Yokoya, N. SynRS3D: A synthetic dataset for global 3D semantic understanding from monocular remote sensing imagery. Adv. Neural Inf. Process. Syst. 2024, 37, 117388–117425.
15. Gao, Z.; Sun, W.; Lu, Y.; Zhang, Y.; Song, W.; Zhang, Y. Joint Learning of Semantic Segmentation and Height Estimation for Remote Sensing Image Leveraging Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614015.
16. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41.
17. Zhang, Z.; Qiao, J.; Lin, S.; Liu, H. Weakly supervised monocular depth estimation method based on stereo matching labels. J. Electron. Imaging 2020, 29, 053013.
18. Tosi, F.; Aleotti, F.; Poggi, M.; Mattoccia, S. Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9791–9801.
19. Wu, K.; Zhang, Y.; Ru, L.; Dang, B.; Lao, J.; Yu, L.; Luo, J.; Zhu, Z.; Sun, Y.; Zhang, J.; et al. A semantic-enhanced multi-modal remote sensing foundation model for Earth observation. Nat. Mach. Intell. 2025, 7, 1235–1249.
20. Hong, Z.; Wu, T.; Xu, Z.; Zhao, W. Depth2Elevation: Scale Modulation with Depth Anything Model for Single-View Remote Sensing Image Height Estimation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4504914.
21. Zhong, J.; Zhou, Q.; Li, M.; Gruen, A.; Liao, X. A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth Estimation. arXiv 2025, arXiv:2503.04513.
22. Song, J.; Chen, H.; Yokoya, N. Enhancing monocular height estimation via sparse LiDAR-guided correction. ISPRS J. Photogramm. Remote Sens. 2026, 232, 155–171.
23. Rege Cambrin, D.; Corley, I.; Garza, P. Depth Any Canopy: Leveraging Depth Foundation Models for Canopy Height Estimation. In Proceedings of the ECCV 2024 Workshops, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; p. 15624.
24. Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 10371–10381.
25. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1623–1637.
26. Chen, Z.; Zhu, Y.; Zhao, C.; Hu, G.; Zeng, W.; Wang, J.; Tang, M. Dpt: Deformable patch-based transformer for visual recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; ACM: New York, NY, USA, 2021; pp. 2899–2907.
27. Viola, M.; Qu, K.; Metzger, N.; Ke, B.; Becker, A.; Schindler, K.; Obukhov, A. Marigold-DC: Zero-shot monocular depth completion with guided diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, USA, 19–20 October 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 5359–5370.
28. Huo, C.; Chen, K.; Zhang, S.; Wang, Z.; Yan, H.; Shen, J.; Wang, Z. When Remote Sensing Meets Foundation Model: A Survey and Beyond. Remote Sens. 2025, 17, 179.
29. Zhou, P.; Tang, X. Geometric Accuracy Verification of GF-7 Satellite Stereo Imagery Without GCPs. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6509105.
30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
32. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
33. Le Saux, B.; Yokoya, N.; Hänsch, R.; Brown, M. 2019 IEEE GRSS data fusion contest: Large-scale semantic 3D reconstruction. In Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Yokohama, Japan, 28 July–2 August 2019.
34. Zhang, W.; Qi, J.; Wan, P.; Wang, H.; Xie, D.; Wang, X.; Yan, G. An easy-to-use airborne LiDAR data filtering method based on cloth simulation. Remote Sens. 2016, 8, 501.
35. Wolf, P.R.; Dewitt, B.A. Elements of Photogrammetry with Applications in GIS, 3rd ed.; McGraw-Hill: Boston, MA, USA, 2020; pp. 217–225, 233–250.
36. Liu, C.J.; Krylov, V.A.; Kane, P.; Kavanagh, G.; Dahyot, R. IM2ELEVATION: Building height estimation from single-view aerial imagery. Remote Sens. 2020, 12, 2719.

Figure 1. The relative depth predicted by the large model preserves object contours and local structural cues but exhibits notable inaccuracies in numerical scale and global height relationships. Therefore, relative depth cannot be directly interpreted as absolute height and must be further integrated with image information to enable reliable nDSM estimation. (Left): Input remote sensing image. (Middle): Dimensionless relative depth prior output by the pre-trained foundation model (unitless). (Right): Ground truth nDSM (Unit: Meters).

Figure 2. Overall architecture of the proposed Relative Depth–Absolute Height Prediction Network (RDAH-Net). The network jointly takes the original remote sensing image and the corresponding relative depth prior as inputs and predicts the absolute height through feature extraction, cross-modal reasoning, and refined decoding.

Figure 3. Structure of the Convolutional Block Attention Module (CBAM), which sequentially applies channel attention and spatial attention to enhance informative features and suppress irrelevant responses.

Figure 4. Examples of remote sensing images and nDSMs from the three datasets. The domain gaps caused by differences in architectural styles and sensor imaging conditions challenge the generalization ability of height estimation models.

Figure 5. Statistical distribution of nDSM heights across three datasets: (a) DFC2019-Track1; (b) Swiss; (c) HK.

Figure 6. RDAH-Net projects the dimensionless relative depth prior into absolute height values with a range consistent with nDSM ground truth. ((Left): nDSM ground truth; (Middle): relative depth prior; (Right): predicted nDSM).

Figure 7. Qualitative comparison of RDAH-Net and baselines when trained and tested on the DFC2019-Track1 dataset.

Figure 8. Qualitative comparison on the Swiss and HK datasets (trained and tested on each dataset individually).

Figure 9. Cross-domain qualitative comparison (models trained on DFC2019-Track1 and directly tested on HK and Swiss).

Figure 10. Qualitative visualization of ablation study results on representative samples from DFC2019-Track1. All models are trained and tested on the DFC2019-Track1 dataset to verify the effectiveness of each ablation component.

Figure 11. Qualitative visualization of generalization assessment results for ablation study models. The models (trained on DFC2019-Track1) are directly evaluated on the HK and Swiss datasets without additional fine-tuning to validate cross-dataset generalization capability.

Figure 12. Typical failure cases when evaluating Baseline-MLP and IM2ELEVATION under cross-domain settings (trained on one dataset and directly tested on another).

Table 1. Summary of datasets used in this study.

Dataset	Location	Sensor	Resolution	Sample Size (pixels)	Training Set Size	Test Set Size
DFC2019-Track1	Jacksonville, FL, USA; Omaha, NE, USA	WorldView-3	0.35 m	1024 × 1024	2226	557
Swiss	Zurich, Switzerland	Gaofen-7	1.0 m	512 × 512	8824	2205
HK	Hong Kong, China	Gaofen-7	1.0 m	512 × 512	1180	294

Table 2. In-domain performance comparison on three datasets (MAE; lower is better). “Parameters” denotes the number of trainable parameters (K/M). “GPU Memory (G)” represents the peak GPU memory usage (GB) for 512 × 512 inference. “FLOPs (G)” indicates the computational complexity in terms of floating-point operations (G) for inference at 1024 × 1024 resolution. “512 × 512 FPS” and “1024 × 1024 FPS” denote the inference speed in frames per second at respective resolutions. “DFC2019-Track1 (m)”, “HK (m)”, and “Swiss (m)” report the MAE results (meters) on the corresponding datasets. The best quantitative results are boldfaced for clarity.

Model	Parameters	GPU Memory (G)	FLOPs (G)	512 × 512 FPS	1024 × 1024 FPS	DFC2019-Track1 (m)	HK (m)	Swiss (m)
Baseline-MLP	10.35 K	0.27	5.44	1135.74	216.05	5.11	3.20	5.67
IM2HEIGHT	7.36 M	1.28	501.94	82.34	18.65	3.08	4.01	4.27
SynRS3D	366.5392 M	2.71	586.10	16.91	2.70	4.46	3.66	4.19
NRF	-	-	-	-	-	3.93	4.84	6.45
IM2ELEVATION	158.9465 M	2.2	535	28.31	5.27	1.67	3.39	133.92 ¹
RDAH-Net	5.3716 M	0.36	54.40	54.26	44.58	1.54	2.56	2.80

¹ A detailed discussion is provided in Section 4.1.

Table 3. Cross-domain generalization (trained on DFC2019-Track1, tested on HK and Swiss; MAE). The best quantitative results are boldfaced for clarity.

Model	Parameters	HK (m)	Swiss (m)
Baseline-MLP	10.35 K	5.64	7.77
IM2HEIGHT	7.36 M	3.76	5.65
SynRS3D	366.5392 M	4.19	5.97
NRF	-	5.78	6.80
IM2ELEVATION	158.9465 M	5.21	3.40
RDAH-Net	5.3716 M	3.74	4.50

Table 4. Ablation study on the effectiveness of individual components (trained/tested on DFC2019-Track1; directly evaluated on HK and Swiss for cross-domain assessment; MAE, ↓ lower is better). ✔ denotes the inclusion of the corresponding module. The best quantitative results in each column are boldfaced for clarity.

Configuration	MV	CB	BCA	PS	DP	Params	DFC19 MAE (m) ↓	HK MAE (m) ↓	Swiss MAE (m) ↓
RDAH-Net (Full)	✔	✔	✔	✔	✔	5.37 M	1.54	4.50	5.50
w/o MobileViT	–	✔	✔	✔	✔	0.43 M	2.38	5.27	3.45
w/o CBAM	✔	–	✔	✔	✔	5.37 M	1.74	4.53	6.24
w/o Bi-Cross	✔	✔	–	✔	✔	5.36 M	2.00	4.93	4.49
w/o PixelShuffle	✔	✔	✔	–	✔	5.24 M	1.90	4.70	6.15
w/o Depth	✔	✔	✔	✔	–	5.37 M	3.04	4.59	5.58

Table 5. Ablation study on MobileViT (trained/tested separately on HK and Swiss; MAE in meters, lower is better). ✔ denotes the inclusion of MobileViT. The best quantitative results in each column are boldfaced for clarity.

Configuration	MobileViT	Params	HK (m)	Swiss (m)
RDAH-Net (Full)	✔	5.37 M	2.56	2.80
w/o MobileViT	–	0.43 M	3.20	2.82

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

