1. Introduction
Unmanned Aerial Vehicles (UAVs) have become essential tools for photogrammetry and earth observation, with oblique photography (30–60° viewpoints) proving particularly valuable for high-fidelity 3D reconstruction and digital surface modeling [
1]. However, monocular depth estimation from UAV imagery remains challenging. Conventional methods like Structure-from-Motion (SfM) require extensive overlapping imagery and suffer from high computational costs, while performing poorly in textureless scenarios. These limitations prevent real-time applications such as search-and-rescue missions and dynamic obstacle avoidance, highlighting the need for more efficient depth estimation approaches.
Depth estimation from single images has been revolutionized by deep learning, particularly through supervised methods that achieve high accuracy on benchmark datasets [2,3,4]. However, these methods face critical limitations when applied to UAV oblique imagery. First, they require dense ground-truth labels that are prohibitively expensive to obtain, as LiDAR-derived depth suffers from sparsity and misalignment [5]. Second, a significant domain gap exists between terrestrial training data and UAV perspectives beyond 45° pitch, where extreme projective distortion and textureless surfaces lead to depth errors in critical areas. These limitations prevent deployment in large-scale applications like disaster assessment where dense labeling is infeasible.
To address the limitations of supervised monocular depth estimation, self-supervised approaches primarily adopt two paradigms: stereo pair-based learning and monocular video-based learning. Stereo methods leverage synchronized image pairs with known baseline distances: the disparity predicted from one image is used to reconstruct the other via warping, and the reconstruction errors are backpropagated for optimization [6]. While accurate, these methods require precise stereo calibration, which is particularly challenging for small UAVs due to their limited baseline distances (typically < 20 cm), and suffer from occlusion artifacts in complex aerial scenes [7]. Consequently, monocular video-based self-supervised learning has emerged as the preferred solution, eliminating calibration needs by jointly estimating depth and camera pose from unconstrained video sequences.
In autonomous driving scenarios, unsupervised monocular depth estimation has been extensively studied and has demonstrated considerable potential. Zhou et al. [
8] pioneered self-supervision using view synthesis losses. Godard et al. [
9] introduced minimum reprojection loss and automasking to handle dynamic objects. Casser et al. [
10] integrated semantic segmentation to improve moving object handling. Recent advances have significantly diversified this landscape. HR-Depth [
11] redesigned skip-connections and feature fusion modules to enhance high-resolution depth prediction. Lite-Mono [
12] combined CNNs and Transformers in a lightweight architecture, achieving 80% parameter reduction. MonoViT [
13] leveraged Vision Transformers for global–local reasoning. SQLdepth [
14] proposed self-query volumes to recover fine-grained scene structures.
While such methods have demonstrated remarkable success in autonomous driving applications, they are not directly transferable to unmanned aerial vehicle (UAV) oblique imagery due to several domain-specific challenges. First, projective distortion caused by extreme camera pitch angles leads to non-linear depth discontinuities. Second, weakly textured regions—such as uniform rooftops and asphalt roads—exacerbate spatial ambiguity owing to the lack of distinctive visual cues. Existing architectures struggle to holistically address these issues. For instance, convolutional neural networks (CNNs) are limited by their local receptive fields, which are insufficient for capturing long-range geometric relationships. Meanwhile, Vision Transformers (ViTs) suffer from quadratic computational complexity and the absence of explicit spatial priors, representing critical drawbacks for resource-constrained UAV platforms operating in dynamic environments.
Although some efforts have been made to adapt depth estimation to aerial contexts, research focusing specifically on UAV-based monocular depth estimation remains limited. Hermann et al. [
15] modified the Monodepth2 architecture for drone videos. Julian et al. [
16] employed style-transfer techniques to enhance generalization. Madhuanand et al. [
1] proposed a dual-depth encoder with 3D convolutions. MRFEDepth [
17] introduced multi-resolution fusion and edge aggregation to improve depth continuity in ultra-low altitude scenarios. These studies confirm the feasibility of self-supervised depth estimation from monocular UAV videos, yet a holistic solution that efficiently tackles both extreme distortions and texture scarcity is still lacking.
Building upon previous studies that have successfully demonstrated the feasibility of monocular depth estimation from UAVs, our work focuses on enhancing unsupervised depth estimation in UAV oblique imagery, aiming to improve both applicability and robustness under challenging conditions such as extreme pitch angles and weakly textured surfaces. To this end, we propose RMTDepth, a novel self-supervised framework that effectively integrates global contextual reasoning with local geometric refinement.
The main contributions of this work are summarized as follows:
We propose RMTDepth, a unified framework that synergizes global context modeling with local edge refinement through the strategic integration of two advanced components. We adopt the Retentive Vision Transformer (RMT) [18] as our backbone encoder, which leverages the Manhattan distance-driven spatial decay principle to inject explicit spatial proximity priors. This architecture efficiently models long-range dependencies via linear-complexity axial attention decomposition, providing geometrically consistent depth initialization for UAV oblique videos.
We incorporate the Neural Window Fully-Connected CRFs (NeW CRFs) [
19] module into the decoder stage. By partitioning feature maps into sub-windows and executing high-order CRF optimization within each partition, it refines spatial ambiguity in texture-deficient zones using multi-head attention-based pairwise affinity learning.
We introduce the UAV-SIM Dataset, a large-scale photorealistic synthetic benchmark developed with Unreal Engine 4 and AirSim. This platform provides programmatic access to depth buffers, ensuring pixel-perfect depth ground truth for 9000 oblique images. The dataset addresses the scarcity of reliable real-world training and evaluation data for UAV-based depth estimation, particularly in challenging scenarios where conventional SfM-based label generation methods fail.
Our method achieves depth estimation of oblique UAV videos through end-to-end self-supervised training. Extensive evaluations demonstrate RMTDepth’s superior performance across three urban-focused benchmarks: UAVID-Germany [
20], UAVID-China [
20], and UAV-SIM. Against seven state-of-the-art methods, our framework consistently achieves the most accurate depth predictions, particularly excelling in handling projective distortion at extreme pitch angles and recovering fine-grained details in texture-deficient zones. The synthesized UAV-SIM data further proves critical for validating robustness where real-world labels are unreliable.
The rest of this paper is organized as follows. In
Section 2, we introduce related work including self-supervised monocular depth estimation, self-supervised monocular depth estimation in aerial imagery, backbone architectures for monocular depth estimation, and Neural CRFs for monocular depth refinement.
Section 3 describes our proposed architecture and core modules.
Section 4 presents the experimental evaluation. Finally,
Section 5 concludes the work and discusses future research directions.
2. Related Works
Depth estimation from images is a pivotal computer vision task, particularly successful in terrestrial contexts. Traditional approaches leverage multiple scene views [21,22], stereo pairs [23], or monocular cues like illumination or texture [24] for 3D reconstruction. Single-image methods evolved from shape-from-shading [25] and shape-from-texture [26] techniques, later extended via stereo-temporal sequences. However, these methods face accuracy limitations under lighting changes, perspective shifts, and occlusions. The field was revolutionized by deep learning, where CNNs enable end-to-end depth regression through models like DenseDepth [27] and LeReS [28]. Depth estimation has also received attention in aerial remote sensing, where it supports various image-processing tasks; Li et al. [29] systematically categorized contemporary single-image depth estimation (SIDE) approaches applied to aerial remote sensing. We introduce important related research from four aspects: self-supervised monocular depth estimation of images, monocular depth estimation for aerial images, backbone architectures for monocular depth estimation, and Neural CRFs for monocular depth refinement.
2.1. Self-Supervised Monocular Depth Estimation
Self-supervised monocular depth estimation has emerged as a pivotal research direction, primarily due to its elimination of dependency on costly depth annotations. This paradigm leverages geometric constraints between consecutive video frames to synthesize target views, utilizing photometric reconstruction loss as the supervisory signal for model optimization. Self-supervised monocular depth estimation has evolved significantly since Zhou et al.’s [
8] foundational view synthesis framework, though early models faced memory constraints limiting training to low resolutions. Subsequent innovations include Yin and Shi’s [
30] optical flow integration and Godard et al.’s [
9] minimum reprojection loss in Monodepth2, which remains the dominant baseline despite limitations in handling non-Lambertian surfaces and complex boundaries. Recent architectural breakthroughs address distinct challenges. MonoViT [
13] combines CNNs with Vision Transformers to overcome local receptive field limitations, enabling global–local reasoning that sets new KITTI benchmarks. Concurrently, HR-Depth [
11] tackles high-resolution estimation through redesigned skip-connections and feature fusion SE modules, achieving SOTA accuracy with ResNet-18 while its MobileNetV3 variant maintains performance at 20% parameter count. For edge deployment, Lite-Mono [
12] introduces a lightweight hybrid architecture using dilated convolutions and attention mechanisms to reduce parameters by 80% versus Monodepth2. Meanwhile, SQLdepth [
14] revolutionizes geometric priors through self-query cost volumes that encode relative distance maps in latent space, recovering fine-grained details with unprecedented generalization. These advances complement specialized approaches like Guizilini et al.’s [
31] 3D convolution (limited by computational cost) and geometry-constrained methods [
32].
2.2. Self-Supervised Monocular Depth Estimation for UAV Images
Self-supervised monocular depth estimation shows significant promise for UAV photogrammetry due to its minimal data requirements, needing only consecutive video frames for training. While widely adopted in autonomous driving, recent advances demonstrate its adaptability to aerial platforms. Key UAV-oriented innovations include the following: Aguilar et al. [
33] pioneered real-time depth estimation via CNN on micro-UAVs, Miclea and Nedevschi [
34] developed depth redistribution techniques for enhanced accuracy, Hermann et al. [
15] adapted Godard et al.’s unsupervised paradigm with shared encoder weights for depth–pose estimation, and Madhuanand et al. [
1] introduced dual 2D CNN encoders processing consecutive frames, coupled with a 3D CNN decoder to recover spatial depth features. Their multi-loss framework combined reconstruction loss, edge-aware smoothing, and temporal consistency constraints to enhance occlusion handling. Yu et al. [
17] further addressed texture-sparse regions through scene-aware refinement, deploying multi-resolution feature fusion networks with edge aggregation modules (EIA) and specialized perceptual losses.
2.3. Network Architectures for Monocular Depth Estimation
The choice of backbone architecture has a significant impact on monocular depth estimation performance. Convolutional Neural Networks (CNNs) have long dominated monocular depth estimation, with seminal works like DenseDepth [
29] and Monodepth2 [
9] achieving remarkable accuracy through encoder–decoder designs. However, their limited receptive fields constrain long-range dependency modeling, while texture bias compromises generalization in novel environments. These issues are extensively documented by Bae et al. in cross-dataset evaluations. Inspired by Vision Transformers’ global context capabilities, recent hybrid architectures address these limitations. MonoViT [
13] integrates CNN feature extractors with transformer blocks, enabling joint local–global reasoning that achieves SOTA on KITTI but incurs quadratic complexity. Lite-Mono [
12] employs dilated convolutions and attention gating to reduce parameters by 80%, yet struggles with fine-grained detail recovery. SQLdepth [
14] innovates self-query cost volumes for implicit geometry learning, excelling in fine structure preservation but requiring intensive latent space computations. Bae et al.’s [
35] critical analysis reveals that these methods fundamentally enhance shape bias over CNNs’ texture dependency, improving cross-domain generalization. Nevertheless, pure ViT architectures remain hampered by lack of spatial priors and computational inefficiency for high-resolution UAV video processing. To overcome these constraints, we adopt the Retentive Vision Transformer (RMT) [
18] as our backbone encoder. Its Manhattan distance-driven spatial decay matrix injects explicit proximity priors, while axial decomposition achieves linear-complexity attention, thereby enabling real-time modeling of long-range geometric dependencies critical for consistent depth initialization in oblique UAV streams.
2.4. Monocular Depth Map Refinement
Depth map refinement addresses inherent limitations such as sparsity, noise, and structural inaccuracies in initial depth predictions. Traditional approaches leverage auxiliary data for enhancement. Shape-from-shading techniques [36,37] exploit light–surface interactions with registered color images to infer curvature. Multi-view photometric consistency methods [38] densify sparse MVS outputs via NeRF-based optimization, while sensor fusion strategies [39,40] complete LiDAR point clouds using neural networks. For low-resolution depth, upsampling combines temporal cues [41] and shading models [37], though noise amplification at long ranges remains challenging for time-of-flight and stereo systems [42,43]. Recent learning-based refinements focus on architectural innovations. Fink et al. [44] proposed multi-view differentiable rendering, scaling monocular depth to absolute distances via SfM and enforcing photometric and geometric consistency through mesh optimization. CHFNet [45] introduced coarse-to-fine hierarchical refinement with LightDASP modules, extracting multi-scale features while integrating edge guidance for spatially coherent outputs. SE-MDE [46] designed structure perception networks and edge refinement modules to jointly enhance global scene layout and local boundary details. Critically, Conditional Random Fields (CRFs) have demonstrated particular efficacy in structured depth optimization, with end-to-end integrations [47,48] improving spatial coherence through pairwise pixel relations. Building on this proven capability, we incorporate the Neural Window Fully-Connected CRFs (NeW CRFs [19]) as our refinement decoder. By partitioning feature maps into sub-windows and executing attention-based high-order optimization, it specifically enhances depth discontinuities in texture-deficient zones, thereby overcoming edge blurring and fragmentation in complex UAV oblique imagery.
3. Method
In this section, we present our overall method for enhanced self-supervised depth estimation applied to oblique UAV videos. The section is organized into four subsections: model inputs, network architecture design, loss functions, and model inference.
3.1. Model Inputs
The self-supervised monocular depth estimation model utilizes sequential UAV video frames for training, with three consecutive RGB images denoted as $\{I_{t-1}, I_t, I_{t+1}\}$, where $I_t$ serves as input to the depth network during the training phase. These frames ($I_{t-1}$ for previous, $I_t$ as reference, and $I_{t+1}$ for subsequent) are manually selected from captured drone footage to ensure controlled baseline spacing that preserves epipolar geometry while maintaining appropriate perspective variation within the UAV's field of view, avoiding excessive scale changes that would degrade model performance. Camera-intrinsic parameters, including the focal length and principal point obtained through calibration, are concurrently fed to the pose network. To optimize the trade-off between depth accuracy and computational efficiency, all input frames are standardized to a resolution of 352 × 640 pixels; higher resolutions improve depth fidelity but significantly increase GPU memory consumption. During inference, only the single reference frame is required for depth map generation, whereas the pose network processes the reference frame alongside the source frames exclusively during training.
3.2. Network Architecture
As shown in Figure 1, the overall architecture comprises two core networks—DepthNet and PoseNet—jointly trained using three consecutive RGB frames $I_{t'}$, where $t' \in \{t-1, t, t+1\}$. The DepthNet adopts an encoder–decoder structure. The encoder utilizes the Retentive Vision Transformer (RMT) backbone to extract hierarchical features with Manhattan distance-based spatial priors, enabling global geometric consistency modeling. The decoder incorporates the Neural Window Fully-Connected CRFs (NeW CRFs) module, which partitions feature maps into sub-windows and performs attention-based high-order optimization to refine depth discontinuities in texture-deficient zones. The depth map is generated as $D_t = \mathrm{DepthNet}(I_t)$, where $I_t$ denotes the target frame. Concurrently, PoseNet processes the target frame $I_t$ paired with the source frames $I_{t'}$, $t' \in \{t-1, t+1\}$, to estimate the relative camera transformations $T_{t \to t'}$. During self-supervised training, these outputs enable reference image reconstruction through geometric projection:

$$I_{t' \to t} = I_{t'}\big\langle \mathrm{proj}(D_t, T_{t \to t'}, K) \big\rangle, \tag{1}$$

where $K$ represents the camera intrinsics. This differentiable warping, adapted from Godard et al. [9], leverages bilinear sampling to project source views ($I_{t'}$) into the reference frame ($I_t$) using the predicted depth ($D_t$) and pose ($T_{t \to t'}$), establishing photometric consistency as the supervisory signal.
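To make the view-synthesis step concrete, the following PyTorch-style sketch illustrates the inverse warping of Equation (1) under a pinhole camera model. It is a minimal illustration only; tensor shapes, function names, and numerical details are our assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

def warp_source_to_target(src_img, depth_t, T_t2s, K, K_inv):
    """Synthesize the target view from a source view using the predicted target
    depth D_t and relative pose T_{t->t'} (illustrative sketch).

    src_img: (B, 3, H, W)  source frame I_{t'}
    depth_t: (B, 1, H, W)  predicted depth of the target frame D_t
    T_t2s:   (B, 4, 4)     relative pose from target to source
    K, K_inv: (B, 3, 3)    camera intrinsics and their inverse
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Pixel grid in homogeneous coordinates (B, 3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1),
                       torch.ones(H * W, device=device)], dim=0)
    pix = pix.unsqueeze(0).expand(B, -1, -1)

    # Back-project pixels to 3D points in the target camera frame
    cam_points = depth_t.view(B, 1, -1) * (K_inv @ pix)
    cam_points = torch.cat(
        [cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)

    # Rigidly transform into the source frame and project with K
    proj = K @ (T_t2s @ cam_points)[:, :3, :]
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    # Normalize coordinates to [-1, 1] and bilinearly sample the source image
    u = uv[:, 0, :] / (W - 1) * 2 - 1
    v = uv[:, 1, :] / (H - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)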
3.2.1. Depth Network
As typical in previous works [
8,
9], we design our DepthNet as an encoder–decoder architecture.
Depth encoder. As pointed out by recent research [1,8,9,13,17], the encoder is crucial for effective feature extraction. We draw inspiration from one of the most recent Transformer architectures, RMT [18], whose Retentive Networks Meet Vision Transformer block (shown in Figure 2) injects explicit spatial priors through Manhattan distance-driven decay mechanisms. We follow this design to build the key components of our depth encoder in five stages. Given the input image, we implement a Conv-stem block consisting of four alternating 3 × 3 convolutional layers with strides of 2 and 1, progressively transforming features while reducing the spatial dimensions to 1/4 resolution. Subsequent stages implement the core RMT design. From stage two to stage four, we employ decomposed Manhattan Self-Attention (MaSA) modules that combine axial attention decomposition with spatial decay matrices. This maintains linear complexity while modeling long-range dependencies critical for depth coherence. At stage five, we utilize full MaSA with global Manhattan distance priors to consolidate geometric context.
Manhattan Self-Attention. MaSA is a spatial attention mechanism that injects explicit spatial priors through a Manhattan distance-driven decay matrix. Its design is motivated by the spatial locality prior inherent in visual data, where neighboring pixels are more likely to exhibit similar properties (e.g., depth, texture). Standard self-attention mechanisms lack this inductive bias, requiring excessive data to learn spatial relationships from scratch. The spatial decay matrix explicitly encodes this prior by assigning higher attention weights to spatially proximate tokens and decaying weights exponentially with Manhattan distance. This ensures the model prioritizes local features while retaining global context, mimicking the behavior of convolutional operations while maintaining the flexibility of attention. This is particularly critical for tasks such as depth estimation, where the preservation of structural details (e.g., sharp edges, object boundaries) is essential.
Its core components are bidirectional spatial decay and two-dimensional decay. Hence, we initially broaden the retention to a bidirectional form, denoted BiRetention, as expressed in Equation (2). Meanwhile, we extend the one-dimensional retention to encompass two dimensions. In the 2D image plane, each token possesses unique spatial coordinates $(x_n, y_n)$. The decay matrix $D$ is modified so that each entry represents the Manhattan distance between the corresponding pair of tokens based on their 2D coordinates, and the matrix $D$ is redefined accordingly. We continue to employ Softmax to introduce nonlinearity into the model; combining the aforementioned steps yields the Manhattan Self-Attention, whose standard formulation is sketched below.
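For reference, the corresponding formulation in the original RMT paper [18] takes roughly the following form (our transcription; the notation may differ slightly from the original):

$$\mathrm{BiRetention}(X) = \big(QK^{\top} \odot D^{\mathrm{Bi}}\big)V, \qquad D^{\mathrm{Bi}}_{nm} = \gamma^{\,|n-m|},$$

$$D^{\mathrm{2d}}_{nm} = \gamma^{\,|x_n - x_m| + |y_n - y_m|},$$

$$\mathrm{MaSA}(X) = \big(\mathrm{Softmax}(QK^{\top}) \odot D^{\mathrm{2d}}\big)V,$$

where $\gamma \in (0, 1)$ controls how quickly attention decays with the Manhattan distance between token coordinates.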
Decomposed Manhattan Self-Attention. Traditional sparse attention methods or recurrent approaches disrupt the spatial decay matrix based on Manhattan distance, leading to a loss of the explicit spatial prior. MaSA therefore introduces a decomposition strategy that separately computes attention scores along the horizontal and vertical directions and applies one-dimensional bidirectional decay matrices $D^{H}$ and $D^{W}$ to encode spatial distances explicitly. This approach preserves the spatial structure while reducing computational complexity, as shown in Equation (9). Based on the decomposition of MaSA, the shape of the receptive field of each token is shown in Figure 3, which is identical to the shape of the complete MaSA's receptive field.
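A sketch of this axial decomposition, again following our reading of the RMT paper [18] (reshaping between row and column layouts is omitted):

$$\mathrm{Attn}_H = \mathrm{Softmax}(Q_H K_H^{\top}) \odot D^{H}, \qquad \mathrm{Attn}_W = \mathrm{Softmax}(Q_W K_W^{\top}) \odot D^{W},$$

$$D^{H}_{nm} = \gamma^{\,|y_n - y_m|}, \qquad D^{W}_{nm} = \gamma^{\,|x_n - x_m|},$$

$$\mathrm{MaSA}(X) = \mathrm{Attn}_H\big(\mathrm{Attn}_W\, V\big),$$

so that each token attends within its row and its column while still receiving an explicit Manhattan-distance prior.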
Following [49], MaSA incorporates a Local Context Enhancement (LCE) module based on depth-wise convolution (DWConv) to further enhance its local expression capability.
Throughout all stages, we incorporate Conditional Positional Encodings to preserve spatial relationships, following best practices from modern vision transformers. The multi-stage hierarchy progressively transforms the Conv-stem features into geometrically consistent depth representations, overcoming ViT’s quadratic complexity while enhancing spatial awareness for oblique UAV imagery.
Depth decoder. Our decoder employs multi-scale feature aggregation to progressively restore spatial resolution from the RMT encoder’s outputs. It integrates cross-scale fusion through skip connections, where higher-resolution encoder features inject fine-grained details into upsampled coarse maps. Resolution recovery is achieved via five successive upconvolution blocks, each incrementally doubling spatial resolution with bilinear upsampling applied between blocks to minimize checkerboard artifacts. At each skip connection, a NeW CRFs module refines features by partitioning maps into local windows, modeling pixel relationships via multi-head attention-based pairwise affinity, and explicitly sharpening depth discontinuities in texture-deficient regions. Finally, four disparity prediction heads (scales 1:1, 1:2, 1:4, 1:8) generate inverse depth estimates, each comprising two convolutions followed by a sigmoid activation.
3.2.2. Neural Window FC-CRFs Module
The Neural Window Fully-connected CRFs (NeW CRFs) module acts as the core refinement engine in our decoder, designed to synergize with the RMT encoder and transform its high-level, geometrically-informed features into a precise depth map. It directly leverages the multi-scale features from the RMT backbone, utilizing their rich global context and spatial priors to guide its local, window-based optimization. Embedded within the bottom-up-top-down decoding structure, this module operates at multiple scales to iteratively refine the depth estimates by resolving spatial ambiguities in textureless regions through information propagation and by sharpening depth discontinuities at object boundaries.
Figure 4 shows the architecture of the NeW CRFs module. Each neural window FC-CRFs block employs dual successive CRFs optimizations: the first operates on regular windows and the second on spatially shifted windows [19]. To maintain architectural consistency with the transformer encoder, the window size is fixed at N = 7, corresponding to 7 × 7 = 49 patches per window. The unary potentials are derived from convolutional network outputs, while the pairwise potentials are computed following Equation (10). During optimization, multi-head query (Q) and key (K) computations generate multi-head potentials, enhancing the energy function's capacity to model complex relationships. A progressive reduction in attention heads is implemented across decoding levels—utilizing 32, 16, 8, and 4 heads from top to bottom features. The resultant energy function subsequently undergoes refinement through an optimization network comprising two fully-connected layers, ultimately yielding the optimized depth map.
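For context, the energy minimized within each window combines unary and pairwise terms in the usual fully-connected CRF form (standard notation, not quoted verbatim from [19]):

$$E(x) = \sum_{i} \psi_u(x_i) + \sum_{i}\sum_{j \neq i} \psi_p(x_i, x_j),$$

where $x_i$ denotes the depth prediction at node (patch) $i$ within the window, $\psi_u$ is the unary potential, and $\psi_p$ is the pairwise potential.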
The unary potential is computed directly from the image features, such that it can be obtained by a unary network whose parameters are learned jointly with the rest of the model.
The pairwise potential combines the values of the current node and the other nodes with a weight computed from the color and position information of each node pair; the weighting function operates on the feature map. We calculate the pairwise potential node by node: for each node $i$, we sum its pairwise potentials over all other nodes in the window, with the weighting functions computed by the networks.
Finally, query and key vectors are derived from each patch's feature map within a window and aggregated into matrices $Q$ and $K$. The pairwise potential weights are computed via the dot product and then applied to the predicted values $X$. A relative position embedding $P$ is incorporated to encode spatial relationships, yielding the formulation of Equation (15):

$$\mathrm{SoftMax}\big(Q \bullet K + P\big) \bullet X,$$

where $\bullet$ denotes the dot product. The SoftMax output yields the weighting coefficients of Equation (15). The dot product $Q \bullet K$ computes affinity scores between all node pairs, determining the message-passing weights together with the relative position embedding $P$. Message propagation is then effected through the dot product between the initial prediction $X$ and these SoftMax-derived weights.
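To illustrate the window-based pairwise weighting, the following PyTorch-style sketch shows a single-head version of one message-passing step over non-overlapping N × N windows. Shifted windows, multi-head splitting, and the two-layer optimization network are omitted, and all names are ours, not those of the actual implementation.

import torch
import torch.nn.functional as F

def window_partition(x, n):
    """Split a feature map (B, C, H, W) into non-overlapping n x n windows,
    returning (B * num_windows, n*n, C). Assumes H and W are divisible by n."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // n, n, W // n, n)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, n * n, C)

def window_crf_step(feat, x_init, w_q, w_k, rel_pos_bias, n=7):
    """One window FC-CRF refinement step (single head, illustrative only).

    feat:   (B, C, H, W) decoder features used to build the pairwise affinities
    x_init: (B, C, H, W) initial prediction X to be refined
    w_q, w_k: nn.Linear(C, C) projections producing queries and keys
    rel_pos_bias: (n*n, n*n) learnable relative position embedding P
    """
    fw = window_partition(feat, n)          # (B*nw, n*n, C)
    xw = window_partition(x_init, n)        # (B*nw, n*n, C)
    q, k = w_q(fw), w_k(fw)                 # queries and keys per patch

    # Pairwise affinities: dot product of Q and K plus relative position bias P
    affinity = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 + rel_pos_bias
    weights = F.softmax(affinity, dim=-1)   # message-passing weights

    # Message propagation: weighted combination of the initial predictions X
    return weights @ xw                     # (B*nw, n*n, C)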
3.2.3. Pose Network
Following Monodepth2 [9] and prior works [11,50,51], our PoseNet adopts a lightweight ResNet-18 encoder [52] for efficient 6-DoF relative pose estimation. The network takes concatenated image pairs $(I_t, I_{t'})$ as input, representing the target frame and the adjacent source frames, and outputs the relative transformations $T_{t \to t'}$ between them. The encoder generates feature maps from the reference image $I_t$ and the source images $I_{t'}$. These features undergo concatenation and processing through four convolutional layers (256 channels each) in the decoder. Spatial information is condensed via ReLU activation and global mean pooling, yielding a 6-DoF vector partitioned into axis–angle rotation and translation components. This parameterization forms the 4 × 4 camera transformation matrix $T_{t \to t'}$ used for view synthesis in Equation (1).
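The conversion from the predicted axis–angle and translation parameters to the 4 × 4 transformation used in Equation (1) can be sketched as follows, using Rodrigues' rotation formula; function and variable names are illustrative assumptions.

import torch

def pose_vec_to_mat(axisangle, translation):
    """Convert a 6-DoF pose (axis-angle rotation + translation) to a 4x4 matrix.

    axisangle:   (B, 3) rotation encoded as axis * angle
    translation: (B, 3) translation vector
    """
    B = axisangle.shape[0]
    angle = torch.norm(axisangle, dim=1, keepdim=True).clamp(min=1e-7)   # (B, 1)
    axis = axisangle / angle                                             # unit axis

    ca, sa = torch.cos(angle), torch.sin(angle)
    C = 1 - ca
    x, y, z = axis[:, 0:1], axis[:, 1:2], axis[:, 2:3]

    # Rodrigues' rotation formula, assembled row by row into (B, 3, 3)
    R = torch.stack([
        torch.cat([x * x * C + ca,     x * y * C - z * sa, x * z * C + y * sa], dim=1),
        torch.cat([y * x * C + z * sa, y * y * C + ca,     y * z * C - x * sa], dim=1),
        torch.cat([z * x * C - y * sa, z * y * C + x * sa, z * z * C + ca    ], dim=1),
    ], dim=1)

    T = torch.eye(4, device=axisangle.device).repeat(B, 1, 1)
    T[:, :3, :3] = R
    T[:, :3, 3] = translation
    return T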
3.3. Loss Functions
To enhance depth estimation quality without ground-truth supervision, we formulate depth prediction as a self-supervised image reconstruction task using unlabeled monocular videos. The depth network processes a target frame $I_t$ to predict a dense inverse depth map $d$, converted to metric depth within physical bounds $[D_{\min}, D_{\max}]$ following [9]. Model optimization is driven by multi-term losses applied to synthetically reconstructed reference views, replacing traditional supervised depth regression with photometric backpropagation. The total loss is given by

$$L_{total} = \mu_{p} L_{p} + \mu_{s} L_{s} + \mu_{m} L_{m},$$

where $\mu_{p}$, $\mu_{s}$, and $\mu_{m}$ are the weights of the different loss terms, $L_{p}$ is the reprojection loss, $L_{s}$ is the smoothing loss, and $L_{m}$ is the masking loss. The weights of the loss terms were tuned and determined after several experiments: the weights of the reprojection loss and the auto-masking loss are set to 1, while the weight of the smoothness loss is kept at 0.001.
3.3.1. Reprojection Loss
Photometric reconstruction loss drives depth quality improvement by minimizing discrepancies between synthesized reference views and the original target images. This loss combines structural similarity (SSIM) and L1 norms per Equation (17):

$$L_{p} = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_t, \hat{I}_t)\big) + (1 - \alpha)\,\lVert I_t - \hat{I}_t \rVert_{1},$$

where $\alpha = 0.85$, $I_t$ is the target image, and $\hat{I}_t$ is its reconstruction. Following Godard et al. [9], we adopt the minimum per-pixel reprojection loss across source views for edge artifact suppression.
3.3.2. Smoothness Loss
To enforce geometric coherence in depth predictions, we implement an edge-aware smoothness regularization term following Wang et al. [5]. This loss preserves depth discontinuities corresponding to image edges while promoting smoothness elsewhere. Applied to the mean-normalized depth estimates $d^{*} = d/\bar{d}$, it weights depth gradients exponentially by the reference image's color derivatives, as formalized in Equation (18):

$$L_{s} = \lvert \partial_x d^{*} \rvert\, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y d^{*} \rvert\, e^{-\lvert \partial_y I_t \rvert},$$

where $\partial_x$ and $\partial_y$ denote spatial gradients, $I_t$ is the reference image, and the depth normalization ensures scale invariance. The exponential weighting attenuates smoothness constraints at high-gradient regions, maintaining critical depth discontinuities aligned with image boundaries.
3.3.3. Masking Loss for Dynamic Objects
To address prevalent dynamic objects (e.g., moving vehicles) in the training data, we implement an adaptive masking loss inspired by Godard et al. [9]. This mechanism relaxes the static-scene assumption inherent in self-supervised monocular depth estimation. Per-pixel masking is determined by comparing reprojection errors:

$$\mu = \Big[\, \min_{t'} pe\big(I_t, I_{t' \to t}\big) < \min_{t'} pe\big(I_t, I_{t'}\big) \,\Big],$$

where $I_{t' \to t}$ denotes the reconstructed image, $I_t$ the reference frame, $I_{t'}$ the source frames, and $pe(\cdot,\cdot)$ the photometric error of Equation (17). Pixels are retained only when the reconstruction error is lower than the source–reference discrepancy, effectively excluding dynamic regions during loss computation. This approach enhances robustness in real-world urban environments compared to static datasets like KITTI.
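A compact PyTorch-style sketch of the per-pixel reprojection error, the minimum-reprojection selection, and the auto-mask described above, following the Monodepth2 formulation [9]; the simplified SSIM implementation and constants mirror common practice and are our assumptions, not the exact code.

import torch
import torch.nn.functional as F

def dssim(x, y):
    """Returns (1 - SSIM)/2 computed over 3x3 neighborhoods (simplified SSIM)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel error: (alpha/2)*(1 - SSIM) + (1 - alpha)*L1, per Equation (17)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * dssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def masked_reprojection_loss(warped_srcs, srcs, target):
    """Minimum reprojection over source views plus Monodepth2-style auto-masking."""
    reproj = torch.cat([photometric_error(w, target) for w in warped_srcs], dim=1)
    identity = torch.cat([photometric_error(s, target) for s in srcs], dim=1)

    min_reproj, _ = reproj.min(dim=1, keepdim=True)
    min_identity, _ = identity.min(dim=1, keepdim=True)

    # Keep only pixels whose reconstruction error beats the unwarped source error
    mask = (min_reproj < min_identity).float()
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1.0)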
3.4. Inference
As our research focuses on depth estimation from monocular UAV videos, we primarily target enhancements to the depth estimation model architecture. In contrast to Godard et al. [
9], our modifications include replacing the encoder backbone with an RMT block to model long-range geometric dependencies at linear complexity, enhancing depth consistency in dynamic aerial scenes; and integrating a NeW CRFs module in the decoder to refine depth edges and effectively resolve depth ambiguity in weakly-textured regions. During inference, only the depth network is utilized for depth map estimation, requiring solely a single input image. Pose transformation parameters are excluded at this stage as they are exclusively employed during training. Following Hermann et al. [
15], we adopt COLMAP-generated depth maps [
53] as reference data for comparative evaluation against baseline methods.
4. Experiments
In this section, we introduce the dataset and then describe details of our implementation. After that, we present a quantitative and qualitative comparison. Furthermore, we conduct several ablation studies.
4.1. Dataset
While numerous publicly available datasets (e.g., KITTI [
5]) capture ground-level imagery for autonomous driving scenarios, high-resolution datasets featuring oblique UAV perspectives remain scarce. To address this limitation, we employ the widely adopted UAVID Germany Dataset [
20] and UAVID China Dataset [
20], complemented by our proprietary custom-built UAV SIM Dataset. Each dataset exhibits substantial scale variations among objects at differing distances, diverse object categories, and dynamic urban street scenes with moving elements.
UAVID Germany Dataset. The UAVID Germany subset comprises 9 video sequences captured over Gronau at 50–100 m altitude with a 10 m/s flight speed. Featuring 4K-resolution imagery captured at 45° obliquity (20 fps original frame rate), it exhibits uniform scenes dominated by rooftops and vegetation with a limited depth range of 100–150 m. To mitigate motion artifacts from narrow temporal baselines, the frame rate was reduced to 10 fps during processing. The dataset's structural consistency provides a controlled environment for evaluating depth consistency in moderate-range aerial scenarios. Some of the training sample images from the dataset are shown in Figure 5.
UAVID China Dataset. Containing 34 sequences from Wuhan under identical capture specifications (50–100 m altitude, 10 m/s, 45° angle), this subset presents 4K imagery with complex urban dynamics. Challenges include (1) high-rise buildings extending the depth range beyond 500 m, (2) significant dynamic objects (vehicles, pedestrians), and (3) variable flight altitudes. The original 20 fps footage required aggressive downsampling to 1 fps to overcome narrow-baseline issues exacerbated by building occlusion and camera motion unpredictability. This dataset tests model robustness under extreme depth variations and scene complexity. Some of the training sample images from the dataset are shown in Figure 6.
Reference depth maps were generated via COLMAP structure-from-motion [
53] using optimized frame selections. From the temporally subsampled sequences, 180 German and 80 Chinese representative frames were reconstructed. During evaluation, single test images are input to the model, with output depth maps quantitatively compared against these COLMAP-derived references using standard depth metrics. This pipeline resolves narrow-baseline limitations while ensuring geometrically consistent ground truth.
UAV SIM Dataset. To mitigate inaccuracies in COLMAP-generated ground truth depth maps which introduce false positives/negatives that degrade evaluation metrics, we constructed a high-fidelity simulation environment using UE4 and AirSim. This synthetic pipeline produced the UAV SIM Dataset featuring controllable environmental parameters. We simulated diverse flight trajectories (e.g., building circumnavigation) to generate 9000 RGB images with corresponding high-resolution depth maps at 640 × 352 resolution. Data capture parameters include 10 fps frame rate, 45° obliquity, 100–150 m altitude, and 1.1–1.2 m/s velocity. Consistent flight heights minimize unsupervised training noise, while dynamic urban elements (vehicles, pedestrians) and dominant aerial features (rooftops, roads, vegetation) introduce targeted challenges: abrupt depth discontinuities and extensive low-texture regions characteristic of oblique viewpoints. This synthesis provides complementary training complexity unattainable with real-world datasets. Some of the training sample images from the dataset are shown in
Figure 7.
4.2. Implementation Details
We implement our RMTDepth in PyTorch with an input resolution of 640 × 352 pixels for all datasets. Our model is trained for 40, 20, and 30 epochs for the Germany, China, and UAV SIM Dataset, respectively. We use the Adam optimizer for training on all three datasets. The learning rate is kept at its initial value for the first 75% of the epochs and reduced for the remaining epochs. The number M of Transformer layers in each of the Transformer blocks in the RMT is set to 3, 4, 18, and 4 from stage 2 to stage 5 of the depth encoder, respectively. Both the pose encoder and the depth encoder are pre-trained on ImageNet [54]. Training takes around 10, 5, and 8 h on a single RTX 4090 GPU for the Germany, China, and UAV SIM Dataset, respectively. After hyperparameter tuning, the batch size is fixed at eight. In our experiments, we adopt the same data augmentation detailed in [9,11]. Meanwhile, predicted disparities are converted to metric depths via dataset-specific scaling, following established practices in self-supervised monocular depth estimation (SMDE) [8,9,55,56]. Minimum and maximum depth bounds derived from the reference data constrain the output range. A scaling factor is then applied to the disparity values to achieve metric consistency with the ground truth. Although pose network orientations offer a photogrammetric scaling alternative, we adopt the reference-based approach for direct comparability with SMDE benchmarks, avoiding the complexities of direct image orientation.
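A minimal sketch of this reference-based scaling step, converting sigmoid disparities to bounded metric depth and aligning predictions to the reference via a per-image ratio; the mapping convention and the median-ratio choice are illustrative assumptions.

import numpy as np

def disp_to_depth(disp, min_depth, max_depth):
    """Map a sigmoid disparity in [0, 1] to depth within [min_depth, max_depth]."""
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp

def scale_to_reference(pred_depth, ref_depth):
    """Per-image scaling of predicted depth to the COLMAP/ground-truth reference.

    Only valid reference pixels (> 0) contribute to the scale factor.
    """
    valid = ref_depth > 0
    ratio = np.median(ref_depth[valid]) / np.median(pred_depth[valid])
    return pred_depth * ratio, ratio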
4.3. Evaluation Metrics
We benchmark our model against state-of-the-art monocular depth estimation methods [
1,
9,
11,
12,
13,
14,
17] using COLMAP-generated reference depths. Standard evaluation metrics are employed to ensure comparability with contemporary research:
Absolute Relative Error (Abs Rel): This metric normalizes per-pixel depth errors to mitigate distance-related bias, as shown in Equation (20):

$$\mathrm{Abs\,Rel} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i}.$$

Squared Relative Error (Sq Rel): This metric uses a squared term to penalize larger depth errors, as shown in Equation (21):

$$\mathrm{Sq\,Rel} = \frac{1}{N}\sum_{i=1}^{N} \frac{(\hat{d}_i - d_i)^{2}}{d_i}.$$

Root Mean Squared Error (RMSE): This metric measures precision with sensitivity to outliers, as shown in Equation (22):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{d}_i - d_i)^{2}}.$$

Root Mean Squared logarithmic error (RMSE log): This metric compresses the error scale logarithmically to reduce large-error dominance, as shown in Equation (23):

$$\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(\log \hat{d}_i - \log d_i\big)^{2}}.$$

Accuracy with threshold ($\delta$): This metric is the ratio of pixels whose estimated depth lies within a certain threshold of the corresponding reference depth; we choose thresholds of 25%, 15%, and 5%. It is shown in Equation (24):

$$\delta = \frac{1}{N}\,\Big\lvert \Big\{\, i : \max\!\Big(\tfrac{\hat{d}_i}{d_i}, \tfrac{d_i}{\hat{d}_i}\Big) < thr \Big\}\Big\rvert,$$

where $\hat{d}_i$ and $d_i$ denote the estimated depth and the ground truth depth at pixel $i$, and $N$ is the total number of valid pixels in the ground truth. All baselines were retrained under identical conditions to ensure fair comparison.
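For completeness, these metrics can be computed as follows (NumPy sketch; the default threshold values reflect one reading of the 25%/15%/5% accuracy bands reported here and should be adjusted if a different convention is used).

import numpy as np

def depth_metrics(pred, gt, thresholds=(1.25, 1.15, 1.05)):
    """Standard monocular depth metrics over valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    ratio = np.maximum(pred / gt, gt / pred)
    accs = {f"delta<{t}": np.mean(ratio < t) for t in thresholds}

    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log, **accs}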
4.4. Depth Evaluation
We qualitatively and quantitatively evaluated the performance of our model and compared it with seven current state-of-the-art monocular depth estimation methods, namely Monodepth2 [
9], Monocular-UAV-Videos [
1], MRFEDepth [
17], HR-Depth [
11], Lite-Mono [
12], MonoVit [
13], and SQLdepth [
14]. Monodepth2 employs minimum reprojection loss and auto-masking to handle dynamic objects and remains the standard baseline for self-supervised frameworks despite its terrestrial driving-scene orientation. Ref. [
1] features dual 2D CNN encoders and a 3D CNN decoder leveraging temporal information from UAV video sequences. A novel contrastive loss enhances reconstruction quality by addressing occlusion challenges in oblique aerial imagery. MRFEDepth proposes a scene-aware refinement architecture with Multi-Resolution Fusion and Perceptual Refinement Network (PRNet). Its Edge Information Aggregation (EIA) module specifically targets depth detail recovery in ultra-low-altitude UAV photography. HR-Depth redesigns skip connections and introduces Feature Fusion Squeeze-and-Excitation (fSE) modules to preserve spatial details in large-gradient regions. It offers both ResNet-based and lightweight MobileNetV3 variants. Lite-Mono combines CNNs and Transformers in a parameter-efficient architecture. It uses Consecutive Dilated Convolutions (CDC) for multi-scale feature extraction and Local-Global Features Interaction (LGFI) for long-range context modeling. MonoVit integrates Vision Transformers with convolutional blocks to enable global scene reasoning. It addresses receptive field limitations in pure CNN architectures through hybrid local–global feature processing. SQLdepth introduces Self Query Layer (SQL) constructing geometry-aware cost volumes. It learns fine-grained structural priors through relative distance mapping in latent space, enhancing detail recovery. The final results were obtained by comparing the depth maps generated by these models with the reference depth provided by COLMAP (version 3.11.0).
4.5. Qualitative Results
Figure 8,
Figure 9 and
Figure 10 present qualitative results comparing the seven state-of-the-art monocular depth estimation models and our model on the UAVID Germany Dataset [
20], UAVID China Dataset [
20], and UAV SIM Dataset. Visual analysis demonstrates that our approach generates depth maps with enhanced structural granularity, exhibiting superior fidelity to reference depths across diverse aerial scenarios.
As highlighted in
Figure 8 (second column) and
Figure 10 (building-vegetation boundaries), our approach shows remarkable performance in weakly-textured areas—regions with minimal visual features or repetitive patterns where conventional methods often fail due to lack of distinctive matching cues. These include vegetation areas, homogeneous rooftops, and asphalt surfaces, where our method maintains superior depth continuity and edge precision compared to all baseline methods.
The demonstrated performance advantage stems from our globally-coherent yet locally-precise architecture. The RMT encoder explicitly models long-range geometric dependencies through Manhattan-distance spatial decay priors, preserving structural consistency under extreme UAV pitch angles (45–60°). Crucially, its linear-complexity axial attention maintains global depth coherence across projective distortions, which is evident in continuous building facades. Complementarily, the NeW CRFs module in decoder executes high-order optimization within partitioned sub-windows, refining depth discontinuities in texture-deficient zones via multi-head affinity learning. This dual mechanism resolves critical UAV-specific challenges: (1) RMT mitigates motion-blur induced photometric inconsistency through motion-robust attention; (2) NeW CRFs eliminate spatial ambiguity in uniform regions (e.g., asphalt roads, rooftops) by enhancing edge precision—particularly observable in vegetation depth transitions. The synergistic integration of global context modeling and local edge refinement thus achieves unprecedented fidelity in oblique aerial depth estimation.
Additionally, we compare our model with these models (Monodepth2 [
9], MRFEDepth [
17], Monocular-UAV-Videos [
1], HR-Depth [
11], Lite-Mono [
12], MonoViT [
13], SQLdepth [
14]) through quantitative analysis.
4.6. Quantitative Results
The quantitative analysis results of each model on the UAVID Germany Dataset [
20], UAVID China Dataset [
20], and UAV SIM Dataset are shown in
Table 1,
Table 2 and
Table 3. We calculate only the valid pixels in the reference image using the median scaling method proposed by Monodepth2 [
9]. The values in the tables are reported numerically using the metrics defined in Section 4.3.
We can observe from the tables that the results on the Germany dataset differ from those on the other two datasets. The German dataset exclusively features static individual buildings and trees, satisfying the static-scene assumption and enabling models to achieve near-optimal states. Crucially, while our model demonstrates performance comparable to the baselines on Abs Rel and threshold accuracy, it exhibits significant superiority in the Sq Rel and RMSE metrics. This performance profile indicates enhanced robustness against large errors and geometric distortions, attributable to the Manhattan-distance decay mechanism in our RMT encoder, which effectively suppresses depth outliers in high-rise regions.
In contrast, the Chinese dataset is filled with moving objects and high-rise buildings, and there are significant differences in camera angles and pitch angles between the datasets, which to some extent affects model training. From Table 2, we can see that our model slightly outperforms the baselines on most performance metrics of the Chinese dataset. Even compared to the latest methods based on CNN–Transformer hybrid architectures (Lite-Mono [12], MonoViT [13]), our proposed method still has a competitive advantage. This validates the effectiveness of the collaborative architecture of global modeling (RMT) and local optimization (NeW CRFs) in addressing the core challenges of the drone perspective—projection distortion, motion blur, and weak-texture ambiguity.
For the UAV SIM Dataset, we flew a separate, distinctly different trajectory and used 900 images as our test set. Compared with the test set of 180 images from the German UAVid subset, this test set contains a wider range of complex scenes, which allows a better evaluation of the overall performance of the model. As shown in Table 3, our model outperforms the other models in all evaluation metrics.
4.7. Ablation Study
Backbone Architecture. To further validate the effectiveness of our depth architecture, we present an ablation study in
Table 4. The table compares the performance of our RMT variants (tiny, small, base) against two recent Transformer encoders, SwinT-tiny [
57] and MPViT-small [
the latter of which has a parameter count similar to that of RMT-small. Additionally, we report the number of parameters and FPS for each model. Benefiting from the Manhattan Self-Attention, the RMT backbone outperforms the other two SoTA Transformer backbones (SwinT [
57], MPViT [
58]) and the pure CNN one (ResNet34 [
52]) in the self-supervised monocular depth estimation task.
MaSA. We verify the impact of Manhattan Self-Attention on the model, as shown in Table 5. Replacing our MaSA with standard self-attention results in performance degradation across all metrics. This empirically confirms our hypothesis: standard global attention, which treats all spatial locations equally, is suboptimal for handling images captured from oblique UAV perspectives. In contrast, our
MaSA module incorporates explicit spatial priors through a Manhattan-distance-driven decay matrix. This design enforces a locality-biased attention pattern, which is more aligned with perspective geometry. It thereby enables more efficient and effective long-range geometric modeling.
LCE. The Local Context Enhancement module also plays a crucial role in the excellent performance of our model. As evidenced by the results in
Table 5 (row ‘w/o
LCE’), removing this module leads to a consistent drop in performance. While
MaSA excels at modeling spatial–geometric relationships, the
LCE module complements it by strengthening the model’s capacity for local feature representation using depth-wise convolutions. This synergistic combination ensures that the model captures fine-grained local contextual details alongside mid-to-long-range dependencies, resulting in more robust and precise feature maps for depth estimation.
NeW CRFs. The incorporation of the NeW CRFs module in the decoder provides a critical refinement step. A comparison between the rows ’w/o NeW CRFs’ and ‘Our model’ in
Table 5 demonstrates its positive contribution, with notable improvements in higher-threshold accuracy. This module acts as a post-processing optimization layer. By explicitly modeling pairwise potentials, it effectively sharpens depth edges, suppresses outliers, and enhances the local consistency of the predicted depth map, leading to an overall more refined output.
Input Resolution. The selection of input resolution involves a critical trade-off between computational efficiency and representation fidelity. Higher resolutions preserve finer details and reduce spatial ambiguity caused by mixed features within a single pixel, which is particularly important for UAV imagery containing intricate textures and structures. However, processing native high-resolution UAV images is computationally prohibitive. Due to GPU memory constraints (NVIDIA RTX 4090, Santa Clara, CA, USA), we adopted a resolution of 640 × 352 pixels, which aligns with common practice in contemporary self-supervised depth estimation methods. To quantitatively evaluate the impact of resolution reduction, we conducted an ablation study using approximately half the resolution (320 × 192 pixels). The results demonstrate a noticeable decline in model performance across all metrics, attributable to increased information loss and excessive smoothing of fine-grained details. These comparative results are systematically presented in
Table 5.