1. Introduction
Unmanned Aerial Vehicles (UAVs) have become essential tools for photogrammetry and earth observation, with oblique photography (30–60° viewpoints) proving particularly valuable for high-fidelity 3D reconstruction and digital surface modeling [
1]. However, monocular depth estimation from UAV imagery remains challenging. Conventional methods like Structure-from-Motion (SfM) require extensive overlapping imagery and suffer from high computational costs, while performing poorly in textureless scenarios. These limitations prevent real-time applications such as search-and-rescue missions and dynamic obstacle avoidance, highlighting the need for more efficient depth estimation approaches.
Depth estimation from single images has been revolutionized by deep learning, particularly through supervised methods that achieve high accuracy on benchmark datasets [2,3,4]. However, these methods face critical limitations when applied to UAV oblique imagery. First, they require dense ground-truth labels that are prohibitively expensive to obtain, as LiDAR-derived depth suffers from sparsity and misalignment [5]. Second, a significant domain gap exists between terrestrial training data and UAV perspectives beyond 45° pitch, where extreme projective distortion and textureless surfaces lead to depth errors in critical areas. These limitations prevent deployment in large-scale applications like disaster assessment where dense labeling is infeasible.
To address the limitations of supervised monocular depth estimation, self-supervised approaches primarily adopt two paradigms: stereo pair-based learning and monocular video-based learning. Stereo methods leverage synchronized image pairs with known baseline distances: the disparity predicted from one image is used to reconstruct the other via warping, and the reconstruction errors are backpropagated for optimization [6]. While accurate, these methods require precise stereo calibration, which is particularly challenging for small UAVs due to their limited baseline distances (typically < 20 cm), and suffer from occlusion artifacts in complex aerial scenes [7]. Consequently, monocular video-based self-supervised learning has emerged as the preferred solution, eliminating calibration needs by jointly estimating depth and camera pose from unconstrained video sequences.
In autonomous driving scenarios, unsupervised monocular depth estimation has been extensively studied and has demonstrated considerable potential. Zhou et al. [
8] pioneered self-supervision using view synthesis losses. Godard et al. [
9] introduced minimum reprojection loss and automasking to handle dynamic objects. Casser et al. [
10] integrated semantic segmentation to improve moving object handling. Recent advances have significantly diversified this landscape. HR-Depth [
11] redesigned skip-connections and feature fusion modules to enhance high-resolution depth prediction. Lite-Mono [
12] combined CNNs and Transformers in a lightweight architecture, achieving 80% parameter reduction. MonoViT [
13] leveraged Vision Transformers for global–local reasoning. SQLdepth [
14] proposed self-query volumes to recover fine-grained scene structures.
While such methods have demonstrated remarkable success in autonomous driving applications, they are not directly transferable to unmanned aerial vehicle (UAV) oblique imagery due to several domain-specific challenges. First, projective distortion caused by extreme camera pitch angles leads to non-linear depth discontinuities. Second, weakly textured regions—such as uniform rooftops and asphalt roads—exacerbate spatial ambiguity owing to the lack of distinctive visual cues. Existing architectures struggle to holistically address these issues. For instance, convolutional neural networks (CNNs) are limited by their local receptive fields, which are insufficient for capturing long-range geometric relationships. Meanwhile, Vision Transformers (ViTs) suffer from quadratic computational complexity and the absence of explicit spatial priors, representing critical drawbacks for resource-constrained UAV platforms operating in dynamic environments.
Although some efforts have been made to adapt depth estimation to aerial contexts, research focusing specifically on UAV-based monocular depth estimation remains limited. Hermann et al. [
15] modified the Monodepth2 architecture for drone videos. Julian et al. [
16] employed style-transfer techniques to enhance generalization. Madhuanand et al. [
1] proposed a dual-depth encoder with 3D convolutions. MRFEDepth [
17] introduced multi-resolution fusion and edge aggregation to improve depth continuity in ultra-low altitude scenarios. These studies confirm the feasibility of self-supervised depth estimation from monocular UAV videos, yet a holistic solution that efficiently tackles both extreme distortions and texture scarcity is still lacking.
Building upon previous studies that have successfully demonstrated the feasibility of monocular depth estimation from UAVs, our work focuses on enhancing unsupervised depth estimation in UAV oblique imagery, aiming to improve both applicability and robustness under challenging conditions such as extreme pitch angles and weakly textured surfaces. To this end, we propose RMTDepth, a novel self-supervised framework that effectively integrates global contextual reasoning with local geometric refinement.
The main contributions of this work are summarized as follows:
We propose RMTDepth, a unified framework that synergizes global context modeling with local edge refinement through the strategic integration of two advanced components. We adopt the Retentive Vision Transformer (RMT) [18] as our backbone encoder, which leverages the Manhattan distance-driven spatial decay principle to inject explicit spatial proximity priors. This architecture efficiently models long-range dependencies via linear-complexity axial attention decomposition, providing geometrically consistent depth initialization for UAV oblique videos.
We incorporate the Neural Window Fully-Connected CRFs (NeW CRFs) [
19] module into the decoder stage. By partitioning feature maps into sub-windows and executing high-order CRF optimization within each partition, it refines spatial ambiguity in texture-deficient zones using multi-head attention-based pairwise affinity learning.
We introduce the UAV-SIM Dataset, a large-scale photorealistic synthetic benchmark developed with Unreal Engine 4 and AirSim. This platform provides programmatic access to depth buffers, ensuring pixel-perfect depth ground truth for 9000 oblique images. The dataset addresses the scarcity of reliable real-world training and evaluation data for UAV-based depth estimation, particularly in challenging scenarios where conventional SfM-based label generation methods fail.
Our method achieves depth estimation of oblique UAV videos through end-to-end self-supervised training. Extensive evaluations demonstrate RMTDepth’s superior performance across three urban-focused benchmarks: UAVID-Germany [
20], UAVID-China [
20], and UAV-SIM. Against seven state-of-the-art methods, our framework consistently achieves the most accurate depth predictions, particularly excelling in handling projective distortion at extreme pitch angles and recovering fine-grained details in texture-deficient zones. The synthesized UAV-SIM data further proves critical for validating robustness where real-world labels are unreliable.
The rest of this paper is organized as follows. In
Section 2, we introduce related work including self-supervised monocular depth estimation, self-supervised monocular depth estimation in aerial imagery, backbone architectures for monocular depth estimation, and Neural CRFs for monocular depth refinement.
Section 3 describes our proposed architecture and core modules.
Section 4 presents the experimental evaluation. Finally,
Section 5 concludes the work and discusses future research directions.
2. Related Works
Depth estimation from images is a pivotal computer vision task, particularly successful in terrestrial contexts. Traditional approaches leverage multiple scene views [21,22], stereo pairs [23], or monocular cues like illumination or texture [24] for 3D reconstruction. Single-image methods evolved from shape-from-shading [25] and shape-from-texture [26] techniques, later extended via stereo-temporal sequences. However, these methods face accuracy limitations under lighting changes, perspective shifts, and occlusions. The field was revolutionized by deep learning, where CNNs enable end-to-end depth regression through models like DenseDepth [27] and LeReS [28]. Depth estimation has also received attention in aerial remote sensing, where it supports various image-processing tasks; Li et al. [29] systematically categorized contemporary single-image depth estimation (SIDE) approaches applied to aerial remote sensing. We introduce important related research from four aspects: self-supervised monocular depth estimation of images, monocular depth estimation for aerial images, backbone architectures for monocular depth estimation, and Neural CRFs for monocular depth refinement.
2.1. Self-Supervised Monocular Depth Estimation
Self-supervised monocular depth estimation has emerged as a pivotal research direction, primarily due to its elimination of dependency on costly depth annotations. This paradigm leverages geometric constraints between consecutive video frames to synthesize target views, utilizing photometric reconstruction loss as the supervisory signal for model optimization. Self-supervised monocular depth estimation has evolved significantly since Zhou et al.’s [
8] foundational view synthesis framework, though early models faced memory constraints limiting training to low resolutions. Subsequent innovations include Yin and Shi’s [
30] optical flow integration and Godard et al.’s [
9] minimum reprojection loss in Monodepth2, which remains the dominant baseline despite limitations in handling non-Lambertian surfaces and complex boundaries. Recent architectural breakthroughs address distinct challenges. MonoViT [
13] combines CNNs with Vision Transformers to overcome local receptive field limitations, enabling global–local reasoning that sets new KITTI benchmarks. Concurrently, HR-Depth [
11] tackles high-resolution estimation through redesigned skip-connections and feature fusion SE modules, achieving SOTA accuracy with ResNet-18 while its MobileNetV3 variant maintains performance at 20% parameter count. For edge deployment, Lite-Mono [
12] introduces a lightweight hybrid architecture using dilated convolutions and attention mechanisms to reduce parameters by 80% versus Monodepth2. Meanwhile, SQLdepth [
14] revolutionizes geometric priors through self-query cost volumes that encode relative distance maps in latent space, recovering fine-grained details with unprecedented generalization. These advances complement specialized approaches like Guizilini et al.’s [
31] 3D convolution (limited by computational cost) and geometry-constrained methods [
32].
2.2. Self-Supervised Monocular Depth Estimation for UAV Images
Self-supervised monocular depth estimation shows significant promise for UAV photogrammetry due to its minimal data requirements, needing only consecutive video frames for training. While widely adopted in autonomous driving, recent advances demonstrate its adaptability to aerial platforms. Key UAV-oriented innovations include the following: Aguilar et al. [
33] pioneered real-time depth estimation via CNN on micro-UAVs, Miclea and Nedevschi [
34] developed depth redistribution techniques for enhanced accuracy, Hermann et al. [
15] adapted Godard et al.’s unsupervised paradigm with shared encoder weights for depth–pose estimation, and Madhuanand et al. [
1] introduced dual 2D CNN encoders processing consecutive frames, coupled with a 3D CNN decoder to recover spatial depth features. Their multi-loss framework combined reconstruction loss, edge-aware smoothing, and temporal consistency constraints to enhance occlusion handling. Yu et al. [
17] further addressed texture-sparse regions through scene-aware refinement, deploying multi-resolution feature fusion networks with edge aggregation modules (EIA) and specialized perceptual losses.
2.3. Network Architectures for Monocular Depth Estimation
The choice of backbone architecture has a significant impact on monocular depth estimation performance. Convolutional Neural Networks (CNNs) have long dominated monocular depth estimation, with seminal works like DenseDepth [
29] and Monodepth2 [
9] achieving remarkable accuracy through encoder–decoder designs. However, their limited receptive fields constrain long-range dependency modeling, while texture bias compromises generalization in novel environments. These issues are extensively documented by Bae et al. in cross-dataset evaluations. Inspired by Vision Transformers’ global context capabilities, recent hybrid architectures address these limitations. MonoViT [
13] integrates CNN feature extractors with transformer blocks, enabling joint local–global reasoning that achieves SOTA on KITTI but incurs quadratic complexity. Lite-Mono [
12] employs dilated convolutions and attention gating to reduce parameters by 80%, yet struggles with fine-grained detail recovery. SQLdepth [
14] innovates self-query cost volumes for implicit geometry learning, excelling in fine structure preservation but requiring intensive latent space computations. Bae et al.’s [
35] critical analysis reveals that these methods fundamentally enhance shape bias over CNNs’ texture dependency, improving cross-domain generalization. Nevertheless, pure ViT architectures remain hampered by lack of spatial priors and computational inefficiency for high-resolution UAV video processing. To overcome these constraints, we adopt the Retentive Vision Transformer (RMT) [
18] as our backbone encoder. Its Manhattan distance-driven spatial decay matrix injects explicit proximity priors, while axial decomposition achieves linear-complexity attention, thereby enabling real-time modeling of long-range geometric dependencies critical for consistent depth initialization in oblique UAV streams.
2.4. Monocular Depth Map Refinement
Depth map refinement addresses inherent limitations such as sparsity, noise, and structural inaccuracies in initial depth predictions. Traditional approaches leverage auxiliary data for enhancement. Shape-from-shading techniques [36,37] exploit light–surface interactions with registered color images to infer curvature. Multi-view photometric consistency methods [38] densify sparse MVS outputs via NeRF-based optimization, while sensor fusion strategies [39,40] complete LiDAR point clouds using neural networks. For low-resolution depth, upsampling combines temporal cues [41] and shading models [37], though noise amplification at long ranges remains challenging for time-of-flight and stereo systems [42,43]. Recent learning-based refinements focus on architectural innovations. Fink et al. [44] proposed multi-view differentiable rendering, scaling monocular depth to absolute distances via SfM and enforcing photometric and geometric consistency through mesh optimization. CHFNet [45] introduced coarse-to-fine hierarchical refinement with LightDASP modules, extracting multi-scale features while integrating edge guidance for spatially coherent outputs. SE-MDE [46] designed structure perception networks and edge refinement modules to jointly enhance global scene layout and local boundary details. Critically, Conditional Random Fields (CRFs) have demonstrated particular efficacy in structured depth optimization, with end-to-end integrations [47,48] improving spatial coherence through pairwise pixel relations. Building on this proven capability, we incorporate the Neural Window Fully-Connected CRFs (NeW CRFs [19]) as our refinement decoder. By partitioning feature maps into sub-windows and executing attention-based high-order optimization, it specifically enhances depth discontinuities in texture-deficient zones, thereby overcoming edge blurring and fragmentation in complex UAV oblique imagery.
3. Method
In this section, we present our overall method for enhanced self-supervised depth estimation applied to oblique UAV videos. The section is organized into four subsections: model inputs, network architecture design, loss functions, and model inference.
3.1. Model Inputs
The self-supervised monocular depth estimation model utilizes sequential UAV video frames for training, with three consecutive RGB images denoted as $\{I_{t-1}, I_t, I_{t+1}\}$, where $I_t$ serves as input to the depth network during the training phase. These frames ($I_{t-1}$ for previous, $I_t$ as reference, and $I_{t+1}$ for subsequent) are manually selected from captured drone footage to ensure controlled baseline spacing that preserves epipolar geometry while maintaining appropriate perspective variation within the UAV's field of view, avoiding excessive scale changes that would degrade model performance. Camera-intrinsic parameters, including the focal length and principal point obtained through calibration, are concurrently fed to the pose network. To optimize the trade-off between depth accuracy and computational efficiency, all input frames are standardized to a resolution of 352 × 640 pixels; higher resolutions improve depth fidelity but significantly increase GPU memory consumption. During inference, only the single reference frame is required for depth map generation, whereas the pose network processes the reference frame alongside the source frames exclusively during training.
3.2. Network Architecture
As shown in Figure 1, the overall architecture comprises two core networks—DepthNet and PoseNet—jointly trained using three consecutive RGB frames $I_{t'}$, where $t' \in \{t-1, t, t+1\}$. The DepthNet adopts an encoder–decoder structure. The encoder utilizes the Retentive Vision Transformer (RMT) backbone to extract hierarchical features with Manhattan distance-based spatial priors, enabling global geometric consistency modeling. The decoder incorporates the Neural Window Fully-Connected CRFs (NeW CRFs) module, which partitions feature maps into sub-windows and performs attention-based high-order optimization to refine depth discontinuities in texture-deficient zones. The depth map is generated as $D_t = \mathrm{DepthNet}(I_t)$, where $I_t$ denotes the target frame. Concurrently, PoseNet processes the target frame $I_t$ paired with the source frames $I_{t'}$, $t' \in \{t-1, t+1\}$, to estimate the relative camera transformations $T_{t \to t'}$. During self-supervised training, these outputs enable reference image reconstruction through geometric projection:

$$I_{t' \to t} = I_{t'}\big\langle \mathrm{proj}(D_t, T_{t \to t'}, K) \big\rangle, \tag{1}$$

where $K$ represents the camera intrinsics. This differentiable warping, adapted from Godard et al. [9], leverages bilinear sampling to project source views ($I_{t'}$) into the reference frame ($I_t$) using the predicted depth ($D_t$) and pose ($T_{t \to t'}$), establishing photometric consistency as the supervisory signal.
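To make the view-synthesis step concrete, the following PyTorch-style sketch illustrates the inverse warping of Equation (1) under a pinhole camera model. It is a minimal illustration only; tensor shapes, function names, and numerical details are our assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth_t, T_t2s, K, K_inv):
    """Synthesize the target view from a source view using the predicted target
    depth D_t and relative pose T_{t->t'} (illustrative sketch).

    src_img: (B, 3, H, W)  source frame I_{t'}
    depth_t: (B, 1, H, W)  predicted depth of the target frame D_t
    T_t2s:   (B, 4, 4)     relative pose from target to source
    K, K_inv: (B, 3, 3)    camera intrinsics and their inverse
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Pixel grid in homogeneous coordinates (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1),
                       torch.ones(H * W, device=device)], dim=0)
    pix = pix.unsqueeze(0).expand(B, -1, -1)

    # Back-project pixels to 3D points in the target camera frame
    cam_points = depth_t.view(B, 1, -1) * (K_inv @ pix)
    cam_points = torch.cat(
        [cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)

    # Rigidly transform into the source frame and project with K
    proj = K @ (T_t2s @ cam_points)[:, :3, :]
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    # Normalize coordinates to [-1, 1] and bilinearly sample the source image
    u = uv[:, 0, :] / (W - 1) * 2 - 1
    v = uv[:, 1, :] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)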
3.2.1. Depth Network
As typical in previous works [
8,
9], we design our DepthNet as an encoder–decoder architecture.
Depth encoder. As pointed out by recent research [1,8,9,13,17], the encoder is crucial for effective feature extraction. We draw inspiration from one of the most recent Transformer architectures, RMT [18], whose Retentive Networks Meet Vision Transformer block (shown in Figure 2) injects explicit spatial priors through Manhattan distance-driven decay mechanisms. We follow this design to build the key components of our depth encoder in five stages. Given the input image, we implement a Conv-stem block consisting of four alternating 3 × 3 convolutional layers with strides of 2 and 1, progressively transforming features while reducing the spatial dimensions to 1/4 resolution. Subsequent stages implement the core RMT design. From stage two to stage four, we employ decomposed Manhattan Self-Attention (MaSA) modules that combine axial attention decomposition with spatial decay matrices. This maintains linear complexity while modeling long-range dependencies critical for depth coherence. At stage five, we utilize full MaSA with global Manhattan distance priors to consolidate geometric context.
Manhattan Self-Attention. MaSA is a spatial attention mechanism that injects explicit spatial priors through a Manhattan distance-driven decay matrix. Its design is motivated by the spatial locality prior inherent in visual data, where neighboring pixels are more likely to exhibit similar properties (e.g., depth, texture). Standard self-attention mechanisms lack this inductive bias, requiring excessive data to learn spatial relationships from scratch. The spatial decay matrix explicitly encodes this prior by assigning higher attention weights to spatially proximate tokens and decaying weights exponentially with Manhattan distance. This ensures the model prioritizes local features while retaining global context, mimicking the behavior of convolutional operations while maintaining the flexibility of attention. This is particularly critical for tasks such as depth estimation, where the preservation of structural details (e.g., sharp edges, object boundaries) is essential.
Its core components are bidirectional spatial decay and two-dimensional decay. Hence, we initially broaden the retention to a bidirectional form, denoted BiRetention, as expressed in Equation (2). Meanwhile, we extend the one-dimensional retention to encompass two dimensions. In the 2D image plane, each token possesses unique spatial coordinates $(x_n, y_n)$. The decay matrix $D$ is modified so that each entry represents the Manhattan distance between the corresponding pair of tokens based on their 2D coordinates, and the matrix $D$ is redefined accordingly. We continue to employ Softmax to introduce nonlinearity into the model; combining the aforementioned steps yields the Manhattan Self-Attention, whose standard formulation is sketched below.
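For reference, the corresponding formulation in the original RMT paper [18] takes roughly the following form (our transcription; the notation may differ slightly from the original):

$$\mathrm{BiRetention}(X) = \big(QK^{\top} \odot D^{\mathrm{Bi}}\big)V, \qquad D^{\mathrm{Bi}}_{nm} = \gamma^{\,|n-m|},$$

$$D^{\mathrm{2d}}_{nm} = \gamma^{\,|x_n - x_m| + |y_n - y_m|},$$

$$\mathrm{MaSA}(X) = \big(\mathrm{Softmax}(QK^{\top}) \odot D^{\mathrm{2d}}\big)V,$$

where $\gamma \in (0, 1)$ controls how quickly attention decays with the Manhattan distance between token coordinates.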
Decomposed Manhattan Self-Attention. Traditional sparse attention methods or recurrent approaches disrupt the spatial decay matrix based on Manhattan distance, leading to a loss of the explicit spatial prior. MaSA therefore introduces a decomposition strategy that separately computes attention scores along the horizontal and vertical directions and applies one-dimensional bidirectional decay matrices $D^{H}$ and $D^{W}$ to encode spatial distances explicitly. This approach preserves the spatial structure while reducing computational complexity, as shown in Equation (9). Based on the decomposition of MaSA, the shape of the receptive field of each token is shown in Figure 3, which is identical to the shape of the complete MaSA's receptive field.
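A sketch of this axial decomposition, again following our reading of the RMT paper [18] (reshaping between row and column layouts is omitted):

$$\mathrm{Attn}_H = \mathrm{Softmax}(Q_H K_H^{\top}) \odot D^{H}, \qquad \mathrm{Attn}_W = \mathrm{Softmax}(Q_W K_W^{\top}) \odot D^{W},$$

$$D^{H}_{nm} = \gamma^{\,|y_n - y_m|}, \qquad D^{W}_{nm} = \gamma^{\,|x_n - x_m|},$$

$$\mathrm{MaSA}(X) = \mathrm{Attn}_H\big(\mathrm{Attn}_W\, V\big),$$

so that each token attends within its row and its column while still receiving an explicit Manhattan-distance prior.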
Following [49], MaSA incorporates a Local Context Enhancement (LCE) module based on depth-wise convolution (DWConv) to further enhance its local expression capability.
Throughout all stages, we incorporate Conditional Positional Encodings to preserve spatial relationships, following best practices from modern vision transformers. The multi-stage hierarchy progressively transforms the Conv-stem features into geometrically consistent depth representations, overcoming ViT’s quadratic complexity while enhancing spatial awareness for oblique UAV imagery.
Depth decoder. Our decoder employs multi-scale feature aggregation to progressively restore spatial resolution from the RMT encoder’s outputs. It integrates cross-scale fusion through skip connections, where higher-resolution encoder features inject fine-grained details into upsampled coarse maps. Resolution recovery is achieved via five successive upconvolution blocks, each incrementally doubling spatial resolution with bilinear upsampling applied between blocks to minimize checkerboard artifacts. At each skip connection, a NeW CRFs module refines features by partitioning maps into local windows, modeling pixel relationships via multi-head attention-based pairwise affinity, and explicitly sharpening depth discontinuities in texture-deficient regions. Finally, four disparity prediction heads (scales 1:1, 1:2, 1:4, 1:8) generate inverse depth estimates, each comprising two convolutions followed by a sigmoid activation.
3.2.2. Neural Window FC-CRFs Module
The Neural Window Fully-connected CRFs (NeW CRFs) module acts as the core refinement engine in our decoder, designed to synergize with the RMT encoder and transform its high-level, geometrically-informed features into a precise depth map. It directly leverages the multi-scale features from the RMT backbone, utilizing their rich global context and spatial priors to guide its local, window-based optimization. Embedded within the bottom-up-top-down decoding structure, this module operates at multiple scales to iteratively refine the depth estimates by resolving spatial ambiguities in textureless regions through information propagation and by sharpening depth discontinuities at object boundaries.
Figure 4 shows the architecture of the NeW CRFs module. Each neural window FC-CRFs block employs dual successive CRFs optimizations: the first operates on regular windows and the second on spatially shifted windows [19]. To maintain architectural consistency with the transformer encoder, the window size is fixed at N = 7, corresponding to 7 × 7 = 49 patches per window. The unary potentials are derived from convolutional network outputs, while the pairwise potentials are computed following Equation (10). During optimization, multi-head query (Q) and key (K) computations generate multi-head potentials, enhancing the energy function's capacity to model complex relationships. A progressive reduction in attention heads is implemented across decoding levels—utilizing 32, 16, 8, and 4 heads from top to bottom features. The resultant energy function subsequently undergoes refinement through an optimization network comprising two fully-connected layers, ultimately yielding the optimized depth map.
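For context, the energy minimized within each window combines unary and pairwise terms in the usual fully-connected CRF form (standard notation, not quoted verbatim from [19]):

$$E(x) = \sum_{i} \psi_u(x_i) + \sum_{i}\sum_{j \neq i} \psi_p(x_i, x_j),$$

where $x_i$ denotes the depth prediction at node (patch) $i$ within the window, $\psi_u$ is the unary potential, and $\psi_p$ is the pairwise potential.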
The unary potential is computed directly from the image features, such that it can be obtained by a unary network whose parameters are learned jointly with the rest of the model.
The pairwise potential combines the values of the current node and the other nodes with a weight computed from the color and position information of each node pair; the weighting function operates on the feature map. We calculate the pairwise potential node by node: for each node $i$, we sum its pairwise potentials over all other nodes in the window, with the weighting functions computed by the networks.
Finally, query and key vectors are derived from each patch's feature map within a window and aggregated into matrices $Q$ and $K$. The pairwise potential weights are computed via the dot product and then applied to the predicted values $X$. A relative position embedding $P$ is incorporated to encode spatial relationships, yielding the formulation of Equation (15):

$$\mathrm{SoftMax}\big(Q \bullet K + P\big) \bullet X,$$

where $\bullet$ denotes the dot product. The SoftMax output yields the weighting coefficients of Equation (15). The dot product $Q \bullet K$ computes affinity scores between all node pairs, determining the message-passing weights together with the relative position embedding $P$. Message propagation is then effected through the dot product between the initial prediction $X$ and these SoftMax-derived weights.
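To illustrate the window-based pairwise weighting, the following PyTorch-style sketch shows a single-head version of one message-passing step over non-overlapping N × N windows. Shifted windows, multi-head splitting, and the two-layer optimization network are omitted, and all names are ours, not those of the actual implementation.

import torch
import torch.nn.functional as F

def window_partition(x, n):
    """Split a feature map (B, C, H, W) into non-overlapping n x n windows,
    returning (B * num_windows, n*n, C). Assumes H and W are divisible by n."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // n, n, W // n, n)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, n * n, C)

def window_crf_step(feat, x_init, w_q, w_k, rel_pos_bias, n=7):
    """One window FC-CRF refinement step (single head, illustrative only).

    feat:   (B, C, H, W) decoder features used to build the pairwise affinities
    x_init: (B, C, H, W) initial prediction X to be refined
    w_q, w_k: nn.Linear(C, C) projections producing queries and keys
    rel_pos_bias: (n*n, n*n) learnable relative position embedding P
    """
    fw = window_partition(feat, n)          # (B*nw, n*n, C)
    xw = window_partition(x_init, n)        # (B*nw, n*n, C)
    q, k = w_q(fw), w_k(fw)                 # queries and keys per patch

    # Pairwise affinities: dot product of Q and K plus relative position bias P
    affinity = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + rel_pos_bias
    weights = F.softmax(affinity, dim=-1)   # message-passing weights

    # Message propagation: weighted combination of the initial predictions X
    return weights @ xw                     # (B*nw, n*n, C)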
3.2.3. Pose Network
Following Monodepth2 [9] and prior works [11,50,51], our PoseNet adopts a lightweight ResNet-18 encoder [52] for efficient 6-DoF relative pose estimation. The network takes concatenated image pairs $(I_t, I_{t'})$ as input, representing the target frame and the adjacent source frames, and outputs the relative transformations $T_{t \to t'}$ between them. The encoder generates feature maps from the reference image $I_t$ and the source images $I_{t'}$. These features undergo concatenation and processing through four convolutional layers (256 channels each) in the decoder. Spatial information is condensed via ReLU activation and global mean pooling, yielding a 6-DoF vector partitioned into axis–angle rotation and translation components. This parameterization forms the 4 × 4 camera transformation matrix $T_{t \to t'}$ used for view synthesis in Equation (1).
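The conversion from the predicted axis–angle and translation parameters to the 4 × 4 transformation used in Equation (1) can be sketched as follows, using Rodrigues' rotation formula; function and variable names are illustrative assumptions.

import torch

def pose_vec_to_mat(axisangle, translation):
    """Convert a 6-DoF pose (axis-angle rotation + translation) to a 4x4 matrix.

    axisangle:   (B, 3) rotation encoded as axis * angle
    translation: (B, 3) translation vector
    """
    B = axisangle.shape[0]
    angle = torch.norm(axisangle, dim=1, keepdim=True).clamp(min=1e-7)   # (B, 1)
    axis = axisangle / angle                                             # unit axis

    ca, sa = torch.cos(angle), torch.sin(angle)
    C = 1 - ca
    x, y, z = axis[:, 0:1], axis[:, 1:2], axis[:, 2:3]

    # Rodrigues' rotation formula, assembled row by row into (B, 3, 3)
    R = torch.stack([
        torch.cat([x * x * C + ca,     x * y * C - z * sa, x * z * C + y * sa], dim=1),
        torch.cat([y * x * C + z * sa, y * y * C + ca,     y * z * C - x * sa], dim=1),
        torch.cat([z * x * C - y * sa, z * y * C + x * sa, z * z * C + ca    ], dim=1),
    ], dim=1)

    T = torch.eye(4, device=axisangle.device).repeat(B, 1, 1)
    T[:, :3, :3] = R
    T[:, :3, 3] = translation
    return T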
3.3. Loss Functions
To enhance depth estimation quality without ground-truth supervision, we formulate depth prediction as a self-supervised image reconstruction task using unlabeled monocular videos. The depth network processes a target frame $I_t$ to predict a dense inverse depth map $d$, converted to metric depth within physical bounds $[D_{\min}, D_{\max}]$ following [9]. Model optimization is driven by multi-term losses applied to synthetically reconstructed reference views, replacing traditional supervised depth regression with photometric backpropagation. The total loss is given by

$$L_{total} = \mu_{p} L_{p} + \mu_{s} L_{s} + \mu_{m} L_{m},$$

where $\mu_{p}$, $\mu_{s}$, and $\mu_{m}$ are the weights of the different loss terms, $L_{p}$ is the reprojection loss, $L_{s}$ is the smoothing loss, and $L_{m}$ is the masking loss. The weights of the loss terms were tuned and determined after several experiments: the weights of the reprojection loss and the auto-masking loss are set to 1, while the weight of the smoothness loss is kept at 0.001.
3.3.1. Reprojection Loss
Photometric reconstruction loss drives depth quality improvement by minimizing discrepancies between synthesized reference views and the original target images. This loss combines structural similarity (SSIM) and L1 norms per Equation (17):

$$L_{p} = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_t, \hat{I}_t)\big) + (1 - \alpha)\,\lVert I_t - \hat{I}_t \rVert_{1},$$

where $\alpha = 0.85$, $I_t$ is the target image, and $\hat{I}_t$ is its reconstruction. Following Godard et al. [9], we adopt the minimum per-pixel reprojection loss across source views for edge artifact suppression.
3.3.2. Smoothness Loss
To enforce geometric coherence in depth predictions, we implement an edge-aware smoothness regularization term following Wang et al. [5]. This loss preserves depth discontinuities corresponding to image edges while promoting smoothness elsewhere. Applied to the mean-normalized depth estimates $d^{*} = d/\bar{d}$, it weights depth gradients exponentially by the reference image's color derivatives, as formalized in Equation (18):

$$L_{s} = \lvert \partial_x d^{*} \rvert\, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y d^{*} \rvert\, e^{-\lvert \partial_y I_t \rvert},$$

where $\partial_x$ and $\partial_y$ denote spatial gradients, $I_t$ is the reference image, and the depth normalization ensures scale invariance. The exponential weighting attenuates smoothness constraints at high-gradient regions, maintaining critical depth discontinuities aligned with image boundaries.
3.3.3. Masking Loss for Dynamic Objects
To address prevalent dynamic objects (e.g., moving vehicles) in the training data, we implement an adaptive masking loss inspired by Godard et al. [9]. This mechanism relaxes the static-scene assumption inherent in self-supervised monocular depth estimation. Per-pixel masking is determined by comparing reprojection errors:

$$\mu = \Big[\, \min_{t'} pe\big(I_t, I_{t' \to t}\big) < \min_{t'} pe\big(I_t, I_{t'}\big) \,\Big],$$

where $I_{t' \to t}$ denotes the reconstructed image, $I_t$ the reference frame, $I_{t'}$ the source frames, and $pe(\cdot,\cdot)$ the photometric error of Equation (17). Pixels are retained only when the reconstruction error is lower than the source–reference discrepancy, effectively excluding dynamic regions during loss computation. This approach enhances robustness in real-world urban environments compared to static datasets like KITTI.
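A compact PyTorch-style sketch of the per-pixel reprojection error, the minimum-reprojection selection, and the auto-mask described above, following the Monodepth2 formulation [9]; the simplified SSIM implementation and constants mirror common practice and are our assumptions, not the exact code.

import torch
import torch.nn.functional as F

def dssim(x, y):
    """Returns (1 - SSIM)/2 computed over 3x3 neighborhoods (simplified SSIM)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel error: (alpha/2)*(1 - SSIM) + (1 - alpha)*L1, per Equation (17)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * dssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def masked_reprojection_loss(warped_srcs, srcs, target):
    """Minimum reprojection over source views plus Monodepth2-style auto-masking."""
    reproj = torch.cat([photometric_error(w, target) for w in warped_srcs], dim=1)
    identity = torch.cat([photometric_error(s, target) for s in srcs], dim=1)

    min_reproj, _ = reproj.min(dim=1, keepdim=True)
    min_identity, _ = identity.min(dim=1, keepdim=True)

    # Keep only pixels whose reconstruction error beats the unwarped source error
    mask = (min_reproj < min_identity).float()
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1.0)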
3.4. Inference
As our research focuses on depth estimation from monocular UAV videos, we primarily target enhancements to the depth estimation model architecture. In contrast to Godard et al. [
9], our modifications include replacing the encoder backbone with an RMT block to model long-range geometric dependencies at linear complexity, enhancing depth consistency in dynamic aerial scenes; and integrating a NeW CRFs module in the decoder to refine depth edges and effectively resolve depth ambiguity in weakly-textured regions. During inference, only the depth network is utilized for depth map estimation, requiring solely a single input image. Pose transformation parameters are excluded at this stage as they are exclusively employed during training. Following Hermann et al. [
15], we adopt COLMAP-generated depth maps [
53] as reference data for comparative evaluation against baseline methods.
4. Experiments
In this section, we introduce the dataset and then describe details of our implementation. After that, we present a quantitative and qualitative comparison. Furthermore, we conduct several ablation studies.
4.1. Dataset
While numerous publicly available datasets (e.g., KITTI [
5]) capture ground-level imagery for autonomous driving scenarios, high-resolution datasets featuring oblique UAV perspectives remain scarce. To address this limitation, we employ the widely adopted UAVID Germany Dataset [
20] and UAVID China Dataset [
20], complemented by our proprietary custom-built UAV SIM Dataset. Each dataset exhibits substantial scale variations among objects at differing distances, diverse object categories, and dynamic urban street scenes with moving elements.
UAVID Germany Dataset. The UAVID Germany subset comprises 9 video sequences captured over Gronau at 50–100 m altitude with a 10 m/s flight speed. Featuring 4K-resolution imagery captured at 45° obliquity (20 fps original frame rate), it exhibits uniform scenes dominated by rooftops and vegetation with a limited depth range of 100–150 m. To mitigate motion artifacts from narrow temporal baselines, the frame rate was reduced to 10 fps during processing. The dataset's structural consistency provides a controlled environment for evaluating depth consistency in moderate-range aerial scenarios. Some of the training sample images from the dataset are shown in Figure 5.
UAVID China Dataset. Containing 34 sequences from Wuhan under identical capture specifications (50–100 m altitude, 10 m/s, 45° angle), this subset presents 4K imagery with complex urban dynamics. Challenges include (1) high-rise buildings extending the depth range beyond 500 m, (2) significant dynamic objects (vehicles, pedestrians), and (3) variable flight altitudes. The original 20 fps footage required aggressive downsampling to 1 fps to overcome narrow-baseline issues exacerbated by building occlusion and camera motion unpredictability. This dataset tests model robustness under extreme depth variations and scene complexity. Some of the training sample images from the dataset are shown in Figure 6.
Reference depth maps were generated via COLMAP structure-from-motion [
53] using optimized frame selections. From the temporally subsampled sequences, 180 German and 80 Chinese representative frames were reconstructed. During evaluation, single test images are input to the model, with output depth maps quantitatively compared against these COLMAP-derived references using standard depth metrics. This pipeline resolves narrow-baseline limitations while ensuring geometrically consistent ground truth.
UAV SIM Dataset. To mitigate inaccuracies in COLMAP-generated ground truth depth maps which introduce false positives/negatives that degrade evaluation metrics, we constructed a high-fidelity simulation environment using UE4 and AirSim. This synthetic pipeline produced the UAV SIM Dataset featuring controllable environmental parameters. We simulated diverse flight trajectories (e.g., building circumnavigation) to generate 9000 RGB images with corresponding high-resolution depth maps at 640 × 352 resolution. Data capture parameters include 10 fps frame rate, 45° obliquity, 100–150 m altitude, and 1.1–1.2 m/s velocity. Consistent flight heights minimize unsupervised training noise, while dynamic urban elements (vehicles, pedestrians) and dominant aerial features (rooftops, roads, vegetation) introduce targeted challenges: abrupt depth discontinuities and extensive low-texture regions characteristic of oblique viewpoints. This synthesis provides complementary training complexity unattainable with real-world datasets. Some of the training sample images from the dataset are shown in
Figure 7.
4.2. Implementation Details
We implement our RMTDepth in PyTorch with an input resolution of 640 × 352 pixels for all datasets. Our model is trained for 40, 20, and 30 epochs for the Germany, China, and UAV SIM Dataset, respectively. We use the Adam optimizer for training on all three datasets. The learning rate is kept at its initial value for the first 75% of the epochs and reduced for the remaining epochs. The number M of Transformer layers in each of the Transformer blocks in the RMT is set to 3, 4, 18, and 4 from stage 2 to stage 5 of the depth encoder, respectively. Both the pose encoder and the depth encoder are pre-trained on ImageNet [54]. Training takes around 10, 5, and 8 h on a single RTX 4090 GPU for the Germany, China, and UAV SIM Dataset, respectively. After hyperparameter tuning, the batch size is fixed at eight. In our experiments, we adopt the same data augmentation detailed in [9,11]. Meanwhile, predicted disparities are converted to metric depths via dataset-specific scaling, following established practices in self-supervised monocular depth estimation (SMDE) [8,9,55,56]. Minimum and maximum depth bounds derived from the reference data constrain the output range. A scaling factor is then applied to the disparity values to achieve metric consistency with the ground truth. Although pose network orientations offer a photogrammetric scaling alternative, we adopt the reference-based approach for direct comparability with SMDE benchmarks, avoiding the complexities of direct image orientation.
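A minimal sketch of this reference-based scaling step, converting sigmoid disparities to bounded metric depth and aligning predictions to the reference via a per-image ratio; the mapping convention and the median-ratio choice are illustrative assumptions.

import numpy as np

def disp_to_depth(disp, min_depth, max_depth):
    """Map a sigmoid disparity in [0, 1] to depth within [min_depth, max_depth]."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

def scale_to_reference(pred_depth, ref_depth):
    """Per-image scaling of predicted depth to the COLMAP/ground-truth reference.

    Only valid reference pixels (> 0) contribute to the scale factor.
    """
    valid = ref_depth > 0
    ratio = np.median(ref_depth[valid]) / np.median(pred_depth[valid])
    return pred_depth * ratio, ratio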
4.3. Evaluation Metrics
We benchmark our model against state-of-the-art monocular depth estimation methods [
1,
9,
11,
12,
13,
14,
17] using COLMAP-generated reference depths. Standard evaluation metrics are employed to ensure comparability with contemporary research:
Absolute Relative Error (Abs Rel): This metric normalizes per-pixel depth errors to mitigate distance-related bias, as shown in Equation (20):

$$\mathrm{Abs\,Rel} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i}.$$

Squared Relative Error (Sq Rel): This metric uses a squared term to penalize larger depth errors, as shown in Equation (21):

$$\mathrm{Sq\,Rel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(\hat{d}_i - d_i)^{2}}{d_i}.$$

Root Mean Squared Error (RMSE): This metric measures precision with sensitivity to outliers, as shown in Equation (22):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{d}_i - d_i)^{2}}.$$

Root Mean Squared logarithmic error (RMSE log): This metric compresses the error scale logarithmically to reduce large-error dominance, as shown in Equation (23):

$$\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(\log \hat{d}_i - \log d_i\big)^{2}}.$$

Accuracy with threshold ($\delta$): This metric is the ratio of pixels whose estimated depth lies within a certain threshold of the corresponding reference depth; we choose thresholds of 25%, 15%, and 5%. It is shown in Equation (24):

$$\delta = \frac{1}{N}\,\Big\lvert \Big\{\, i : \max\!\Big(\tfrac{\hat{d}_i}{d_i}, \tfrac{d_i}{\hat{d}_i}\Big) < thr \Big\}\Big\rvert,$$

where $\hat{d}_i$ and $d_i$ denote the estimated depth and the ground truth depth at pixel $i$, and $N$ is the total number of valid pixels in the ground truth. All baselines were retrained under identical conditions to ensure fair comparison.
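For completeness, these metrics can be computed as follows (NumPy sketch; the default threshold values reflect one reading of the 25%/15%/5% accuracy bands reported here and should be adjusted if a different convention is used).

import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.15, 1.05)):
    """Standard monocular depth metrics over valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    ratio = np.maximum(pred / gt, gt / pred)
    accs = {f"delta<{t}": np.mean(ratio < t) for t in thresholds}

    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log, **accs}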
4.4. Depth Evaluation
We qualitatively and quantitatively evaluated the performance of our model and compared it with seven current state-of-the-art monocular depth estimation methods, namely Monodepth2 [
9], Monocular-UAV-Videos [
1], MRFEDepth [
17], HR-Depth [
11], Lite-Mono [
12], MonoVit [
13], and SQLdepth [
14]. Monodepth2 employs minimum reprojection loss and auto-masking to handle dynamic objects and remains the standard baseline for self-supervised frameworks despite its terrestrial driving-scene orientation. Ref. [
1] features dual 2D CNN encoders and a 3D CNN decoder leveraging temporal information from UAV video sequences. A novel contrastive loss enhances reconstruction quality by addressing occlusion challenges in oblique aerial imagery. MRFEDepth proposes a scene-aware refinement architecture with Multi-Resolution Fusion and Perceptual Refinement Network (PRNet). Its Edge Information Aggregation (EIA) module specifically targets depth detail recovery in ultra-low-altitude UAV photography. HR-Depth redesigns skip connections and introduces Feature Fusion Squeeze-and-Excitation (fSE) modules to preserve spatial details in large-gradient regions. It offers both ResNet-based and lightweight MobileNetV3 variants. Lite-Mono combines CNNs and Transformers in a parameter-efficient architecture. It uses Consecutive Dilated Convolutions (CDC) for multi-scale feature extraction and Local-Global Features Interaction (LGFI) for long-range context modeling. MonoVit integrates Vision Transformers with convolutional blocks to enable global scene reasoning. It addresses receptive field limitations in pure CNN architectures through hybrid local–global feature processing. SQLdepth introduces Self Query Layer (SQL) constructing geometry-aware cost volumes. It learns fine-grained structural priors through relative distance mapping in latent space, enhancing detail recovery. The final results were obtained by comparing the depth maps generated by these models with the reference depth provided by COLMAP (version 3.11.0).
4.5. Qualitative Results
Figure 8,
Figure 9 and
Figure 10 present qualitative results comparing the seven state-of-the-art monocular depth estimation models and our model on the UAVID Germany Dataset [
20], UAVID China Dataset [
20], and UAV SIM Dataset. Visual analysis demonstrates that our approach generates depth maps with enhanced structural granularity, exhibiting superior fidelity to reference depths across diverse aerial scenarios.
As highlighted in
Figure 8 (second column) and
Figure 10 (building-vegetation boundaries), our approach shows remarkable performance in weakly-textured areas—regions with minimal visual features or repetitive patterns where conventional methods often fail due to lack of distinctive matching cues. These include vegetation areas, homogeneous rooftops, and asphalt surfaces, where our method maintains superior depth continuity and edge precision compared to all baseline methods.
The demonstrated performance advantage stems from our globally-coherent yet locally-precise architecture. The RMT encoder explicitly models long-range geometric dependencies through Manhattan-distance spatial decay priors, preserving structural consistency under extreme UAV pitch angles (45–60°). Crucially, its linear-complexity axial attention maintains global depth coherence across projective distortions, which is evident in continuous building facades. Complementarily, the NeW CRFs module in decoder executes high-order optimization within partitioned sub-windows, refining depth discontinuities in texture-deficient zones via multi-head affinity learning. This dual mechanism resolves critical UAV-specific challenges: (1) RMT mitigates motion-blur induced photometric inconsistency through motion-robust attention; (2) NeW CRFs eliminate spatial ambiguity in uniform regions (e.g., asphalt roads, rooftops) by enhancing edge precision—particularly observable in vegetation depth transitions. The synergistic integration of global context modeling and local edge refinement thus achieves unprecedented fidelity in oblique aerial depth estimation.
Additionally, we compare our model with these models (Monodepth2 [
9], MRFEDepth [
17], Monocular-UAV-Videos [
1], HR-Depth [
11], Lite-Mono [
12], MonoViT [
13], SQLdepth [
14]) through quantitative analysis.
4.6. Quantitative Results
The quantitative analysis results of each model on the UAVID Germany Dataset [
20], UAVID China Dataset [
20], and UAV SIM Dataset are shown in
Table 1,
Table 2 and
Table 3. We calculate only the valid pixels in the reference image using the median scaling method proposed by Monodepth2 [
9]. The values in the tables are reported numerically using the metrics defined in Section 4.3.
We can observe from the tables that the results on the Germany dataset differ from those on the other two datasets. The German dataset exclusively features static individual buildings and trees, satisfying the static-scene assumption and enabling models to achieve near-optimal states. Crucially, while our model demonstrates performance comparable to the baselines on Abs Rel and threshold accuracy, it exhibits significant superiority in the Sq Rel and RMSE metrics. This performance profile indicates enhanced robustness against large errors and geometric distortions, attributable to the Manhattan-distance decay mechanism in our RMT encoder, which effectively suppresses depth outliers in high-rise regions.
In contrast, the Chinese dataset is filled with moving objects and high-rise buildings, and there are significant differences in camera angles and pitch angles between the datasets, which to some extent affects model training. From Table 2, we can see that our model slightly outperforms the baselines on most performance metrics of the Chinese dataset. Even compared to the latest methods based on CNN–Transformer hybrid architectures (Lite-Mono [12], MonoViT [13]), our proposed method still has a competitive advantage. This validates the effectiveness of the collaborative architecture of global modeling (RMT) and local optimization (NeW CRFs) in addressing the core challenges of the drone perspective—projection distortion, motion blur, and weak-texture ambiguity.
For the UAV SIM Dataset, we flew a separate, distinctly different trajectory and used 900 images as our test set. Compared with the test set of 180 images from the German UAVid subset, this test set contains a wider range of complex scenes, which allows a better evaluation of the overall performance of the model. As shown in Table 3, our model outperforms the other models in all evaluation metrics.
4.7. Ablation Study
Backbone Architecture. To further validate the effectiveness of our depth architecture, we present an ablation study in
Table 4. The table compares the performance of our RMT variants (tiny, small, base) against two recent Transformer encoders, SwinT-tiny [
57] and MPViT-small [
the latter of which has a parameter count similar to that of RMT-small. Additionally, we report the number of parameters and FPS for each model. Benefiting from the Manhattan Self-Attention, the RMT backbone outperforms the other two SoTA Transformer backbones (SwinT [
57], MPViT [
58]) and the pure CNN one (ResNet34 [
52]) in the self-supervised monocular depth estimation task.
MaSA. We verify the impact of Manhattan Self-Attention on the model, as shown in Table 5. Replacing our MaSA with standard self-attention results in performance degradation across all metrics. This empirically confirms our hypothesis: standard global attention, which treats all spatial locations equally, is suboptimal for handling images captured from oblique UAV perspectives. In contrast, our
MaSA module incorporates explicit spatial priors through a Manhattan-distance-driven decay matrix. This design enforces a locality-biased attention pattern, which is more aligned with perspective geometry. It thereby enables more efficient and effective long-range geometric modeling.
LCE. The Local Context Enhancement module also plays a crucial role in the excellent performance of our model. As evidenced by the results in
Table 5 (row ‘w/o
LCE’), removing this module leads to a consistent drop in performance. While
MaSA excels at modeling spatial–geometric relationships, the
LCE module complements it by strengthening the model’s capacity for local feature representation using depth-wise convolutions. This synergistic combination ensures that the model captures fine-grained local contextual details alongside mid-to-long-range dependencies, resulting in more robust and precise feature maps for depth estimation.
NeW CRFs. The incorporation of the NeW CRFs module in the decoder provides a critical refinement step. A comparison between the rows ’w/o NeW CRFs’ and ‘Our model’ in
Table 5 demonstrates its positive contribution, with notable improvements in higher-threshold accuracy. This module acts as a post-processing optimization layer. By explicitly modeling pairwise potentials, it effectively sharpens depth edges, suppresses outliers, and enhances the local consistency of the predicted depth map, leading to an overall more refined output.
Input Resolution. The selection of input resolution involves a critical trade-off between computational efficiency and representation fidelity. Higher resolutions preserve finer details and reduce spatial ambiguity caused by mixed features within a single pixel, which is particularly important for UAV imagery containing intricate textures and structures. However, processing native high-resolution UAV images is computationally prohibitive. Due to GPU memory constraints (NVIDIA RTX 4090, Santa Clara, CA, USA), we adopted a resolution of 640 × 352 pixels, which aligns with common practice in contemporary self-supervised depth estimation methods. To quantitatively evaluate the impact of resolution reduction, we conducted an ablation study using approximately half the resolution (320 × 192 pixels). The results demonstrate a noticeable decline in model performance across all metrics, attributable to increased information loss and excessive smoothing of fine-grained details. These comparative results are systematically presented in
Table 5.