Article

Self-Supervised Monocular Depth Estimation Based on Differential Attention

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(9), 590; https://doi.org/10.3390/a18090590
Submission received: 11 August 2025 / Revised: 12 September 2025 / Accepted: 16 September 2025 / Published: 19 September 2025
(This article belongs to the Special Issue Algorithms for Feature Selection (3rd Edition))

Abstract

Depth estimation algorithms are widely applied in various fields, including 3D reconstruction, autonomous driving, and industrial robotics. Monocular self-supervised algorithms for depth prediction offer a cost-effective alternative to acquiring depth through hardware devices such as LiDAR. However, current depth prediction networks, predominantly based on conventional encoder–decoder architectures, often encounter two critical limitations: insufficient feature fusion mechanisms during the upsampling phase and constrained receptive fields. These limitations result in the loss of high-frequency details in the predicted depth maps. To overcome these issues, we introduce differential attention operators to enhance global feature representation and refine locally upsampled features within the depth decoder. Furthermore, we equip the decoder with a deformable bin-structured prediction head; this lightweight design enables per-pixel dynamic aggregation of local depth distributions via adaptive receptive field modulation and deformable sampling, enhancing the decoder’s fine-grained detail processing by capturing local geometry and holistic structures. Experimental results on the KITTI and Make3D datasets demonstrate that our proposed method produces more accurate depth maps with finer details compared to existing approaches.

1. Introduction

Monocular depth estimation is a fundamental task in computer vision. Depth maps encode the distance between objects in a scene and the camera sensor, significantly enhancing the 3D spatial perception capabilities of industrial robots. These maps have diverse applications: they enable precise facial feature localization for high-accuracy facial recognition and expression analysis through 3D facial contour modeling; combined with other sensor data, they facilitate high-precision map construction and vehicle localization, improving the reliability and accuracy of autonomous driving; and additionally, they empower industrial robotic systems to efficiently perform tasks such as path planning, object grasping, and logistics sorting. Consequently, depth maps are widely utilized in 3D reconstruction [1], autonomous driving [2], and industrial robotics [3].
In the past, depth information was typically obtained by analyzing signals reflected from LiDAR or structured light on object surfaces. However, hardware like LiDAR incurs higher costs compared to monocular industrial cameras. Beyond hardware-based solutions, depth information can also be derived from images using vision algorithms, which are broadly categorized into traditional methods and deep learning-based approaches. Traditional monocular depth estimation leverages image features and cues: methods based on linear perspective [4] detect parallel lines and their intersections; defocus-based techniques [5] exploit varying image blur from scene-camera distances; and weather scattering approaches [6] introduce haze and estimate depth through transmission models. Nevertheless, depth maps from traditional methods are often coarse and sparse. With advances in deep neural networks, architectures including convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph convolutional networks (GCNs), and diffusion models have achieved more accurate depth predictions. Deep learning-based monocular depth estimation utilizes these networks to predict per-pixel depth from a single image.
Neural network-based depth estimation methods are broadly categorized into supervised [7] and self-supervised approaches. Supervised methods require sparse depth ground truth from sensors like LiDAR, yet collecting large, diverse datasets with accurate depth labels remains challenging. Conversely, self-supervised methods eliminate the costs associated with depth label acquisition and simplify deployment. Without ground truth, self-supervised models are usually trained by making use of the temporal scene consistency in monocular videos. They learn depth from monocular videos by training a depth estimation network along with a separate pose network. To address occlusion issues, where pixels visible in the target image are absent in source images, Monodepth2 [8] employs an image reprojection loss that considers only the minimum photometric error per pixel across multiple source images. However, traditional methods, including Monodepth2, suffer from inadequate explicit scene structure modeling and insufficient detail handling, resulting in performance bottlenecks and blurry artifacts. To overcome these limitations, HR-Depth [9] introduces multi-resolution feature fusion with channel attention optimization; CADepth [10] enhances scene structure representation and local detail processing through channel-wise attention, though it requires substantial computational resources due to its parameter size; SwinDepth [11] integrates a Swin Transformer with a densely cascaded multi-scale network to boost accuracy via global–local feature fusion and cross-scale connections; and Lite-Mono [12] adopts a lightweight CNN–Transformer hybrid, where stacked dilated depthwise separable convolutions and pointwise convolutions strengthen spatial–channel interactions while channel attention facilitates global context aggregation.
In general, current networks with limited feature extraction capabilities struggle to preserve detailed information during depth map reconstruction. Furthermore, the feature fusion process at skip connections suffers from a fundamental semantic discrepancy between encoder and decoder features. This mismatch is exacerbated by the widespread reliance on primitive nearest neighbor interpolation for feature map upsampling in decoder stages, which notably yields coarse feature representations.
This paper presents a novel monocular depth decoder architecture that integrates differential attention mechanisms with bin-based regression to address the critical challenge of detail loss in depth map estimation. The proposed decoder incorporates three key components: (1) A global differential self-attention module enhancing semantic feature representation through joint spatial and channel context modeling; (2) a feature fusion module combining local differential mutual attention with adaptive gated modulation, which effectively bridges semantic gaps in skip connections while refining upsampled decoder features via multi-scale context aggregation; and (3) a deformable bins module that enables adaptive depth distribution focusing within localized regions, substantially enhancing the network’s capacity to recover fine-grained geometric details.

2. Methods

In this section, we begin by reviewing self-supervised training methods for monocular depth estimation. We then introduce the architecture of our differential attention-based network and present our three main contributions: (1) the global differential self-attention module, (2) the local differential mutual attention-based feature fusion module, and (3) the deformable bins module.

2.1. Self-Supervised Training Process

As illustrated in Figure 1, the self-supervised monocular depth estimation framework comprises complementary pose and depth prediction networks, with this paper concentrating on enhancing the depth prediction component. Adopting a classical encoder–decoder architecture, our depth prediction network introduces three novel elements within the decoder: a global differential self-attention module, an optimized feature fusion module, and a deformable bins module.
The entire network is trained through the correspondence between the pixels of RGB images. Denote the target image as I_t, the source image as I_s, and the predicted depth of I_t as D_t. I_t and I_s are concatenated and fed into the pose network, which predicts the pose matrix E_{t \to s}, with s \in \{t-1, t+1\}, where t-1 and t+1 denote the previous and subsequent frames of the target image. Given the camera intrinsic matrix K, the correspondence between the pixel coordinates p_t in I_t and p_s in I_s can be calculated by the following formula:
p_s \sim K E_{t \to s} D_t(p_t) K^{-1} p_t
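This warping step can be illustrated with a short PyTorch sketch (a minimal illustration under our own assumptions about tensor shapes and helper names, not the released implementation):

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, D_t, K, K_inv, E_t_to_s):
    """Reconstruct the target view from a source image I_s (B,3,H,W) using the
    predicted target depth D_t (B,1,H,W), intrinsics K / K_inv (B,3,3) and the
    relative pose E_t_to_s (B,4,4)."""
    B, _, H, W = D_t.shape
    device = D_t.device

    # Homogeneous pixel grid p_t: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                            torch.arange(W, device=device, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project: D_t(p_t) * K^{-1} * p_t, then make homogeneous
    cam = D_t.view(B, 1, -1) * (K_inv @ pix)
    cam = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)

    # Project into the source view: K * E_{t->s} * X
    P = K @ E_t_to_s[:, :3, :]                       # (B, 3, 4)
    src = P @ cam                                    # (B, 3, H*W)
    src = src[:, :2] / (src[:, 2:3] + 1e-7)

    # Normalize to [-1, 1] and bilinearly sample the source image
    x = src[:, 0] / (W - 1) * 2 - 1
    y = src[:, 1] / (H - 1) * 2 - 1
    grid = torch.stack([x, y], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```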
The reconstructed image I_{s \to t} is obtained from the source image using bilinear interpolation. The networks are trained with the per-pixel minimum photometric reprojection loss [8]:
L_p = \min_s \, pe(I_t, I_{s \to t})
where pe consists of L1 loss and structural similarity loss [13]:
pe(I_t, I_{s \to t}) = \frac{\alpha}{2} \left( 1 - \mathrm{SSIM}(I_t, I_{s \to t}) \right) + (1 - \alpha) \left\lVert I_t - I_{s \to t} \right\rVert_1
where α = 0.85.
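A compact sketch of pe, assuming the SSIM term is computed over 3 × 3 neighborhoods with simple average pooling (our own simplified implementation, not the authors' code):

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM computed over 3x3 neighbourhoods (per channel)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x * mu_x
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y * mu_y
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x * mu_x + mu_y * mu_y + C1) * (sigma_x + sigma_y + C2)
    return num / den

def photometric_error(I_t, I_st, alpha=0.85):
    """pe = alpha * (1 - SSIM)/2 + (1 - alpha) * L1, averaged over channels -> (B,1,H,W)."""
    l1 = (I_t - I_st).abs().mean(1, keepdim=True)
    ssim_term = torch.clamp((1 - ssim(I_t, I_st)) / 2, 0, 1).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1
```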
In self-supervised training, the presence of static objects within images can result in suboptimal training efficacy; therefore, a binary mask [8] is used to filter out static pixels:
\mu = \left[ \min_s \, pe(I_t, I_{s \to t}) < \min_s \, pe(I_t, I_s) \right]
where [·] is the Iverson bracket.
Moreover, to enhance the smoothness of the predicted depth map, the edge-aware smoothness regularization term [14] Ls is employed:
L_s = \left| \partial_x d_t^* \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t^* \right| e^{-\left| \partial_y I_t \right|}
where d_t^* = d_t / \bar{d}_t is the mean-normalized inverse depth from [15], used to discourage shrinking of the estimated depth.
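The edge-aware smoothness term can be sketched as follows (a minimal version assuming disparity and image tensors of shape (B, 1, H, W) and (B, 3, H, W)):

```python
import torch

def smoothness_loss(disp, img):
    """Edge-aware smoothness on the mean-normalized disparity (inverse depth)."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)

    grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)

    # Down-weight disparity gradients wherever the image itself has strong edges.
    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```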
Furthermore, the network computes the loss at S scales, and the final loss is as follows:
L = \frac{1}{S} \sum_{i=1}^{S} \left( \mu L_p^i + \lambda L_s^i \right)
where i indexes the scale and λ = 1 × 10^{-3}.
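Putting the pieces together, the total objective might be assembled as in the following schematic sketch, which reuses the photometric_error and smoothness_loss helpers from the sketches above; the data layout of the warped reconstructions is our own assumption:

```python
import torch

def total_loss(I_t, sources, warped, disps, lam=1e-3):
    """Schematic multi-scale objective: per-pixel minimum reprojection error with
    auto-masking of static pixels, plus edge-aware smoothness at every scale.
    sources:   list of unwarped source frames I_s
    warped[i]: list of I_{s->t} reconstructions at scale i
    disps[i]:  disparity map at scale i"""
    S = len(disps)
    loss = 0.0
    for i in range(S):
        # Per-pixel minimum photometric error over source frames (handles occlusion).
        reproj = torch.cat([photometric_error(I_t, w) for w in warped[i]], dim=1)
        reproj_min = reproj.min(dim=1, keepdim=True).values

        # Auto-mask mu: keep a pixel only if warping beats the unwarped source.
        identity = torch.cat([photometric_error(I_t, s) for s in sources], dim=1)
        mu = (reproj_min < identity.min(dim=1, keepdim=True).values).float()

        L_p = (mu * reproj_min).mean()
        L_s = smoothness_loss(disps[i], I_t)
        loss = loss + L_p + lam * L_s
    return loss / S
```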

2.2. Monocular Depth Decoder Based on Differential Attention and Bin Mechanism

Vision Transformers typically rely on self-attention mechanisms to model long-range dependencies. This approach dynamically constructs affinity matrices by computing similarities between query and key vectors, an operation conventionally implemented through matrix multiplication. While effective for many vision tasks, this standard self-attention formulation proves suboptimal when deployed in decoders for monocular depth prediction. This fundamental limitation arises because depth prediction requires continuous value regression per pixel, rather than processing only key pixels through multiplicative operations. To address this constraint, we propose a novel differential attention mechanism inspired by bilateral filtering. This approach replaces multiplication with differentiation operators to model long-range dependencies, better adapting to depth regression tasks. We present two distinct implementation strategies for this mechanism, detailed comprehensively in Section 2.2.1 and Section 2.2.2.

2.2.1. Global Differential Self-Attention Module

As illustrated in Figure 2, the proposed global differential self-attention module models channel-wise interdependencies and aggregates region-specific responses within high-level feature maps.
The multi-head global differential self-attention mechanism integrates two core components: differential spatial attention and differential channel attention. Departing from conventional transformer architectures, the spatial attention module first employs a linear projection layer to compress the channel dimension of input features to match the number of attention heads M, thereby generating queries Q and keys K via dimension reduction. Subsequently, it applies a dimension-aware broadcasting operation B(x, d), which automatically expands the d-th dimension of tensor x through value replication, ensuring dimensional compatibility when constructing affinity matrices through matrix differentiation:
A = \mathrm{Softmax}\left( -\left( B(Q, 2) - B(K^{\top}, 1) \right)^2 \right)
These affinity matrices are then used as spatial attention weights:
\mathrm{Attention}(Q, K, V) = A V
The differential channel attention is placed after the differential spatial attention; it follows the same calculation process but operates along the channel dimension.
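The spatial branch can be sketched as follows. This is our own reading of the module: the residual connection, the projection layers, and the negation of the squared difference inside the softmax (chosen so that similar features receive large weights, following the bilateral-filtering analogy) are assumptions rather than confirmed implementation details:

```python
import torch
import torch.nn as nn

class GlobalDifferentialSpatialAttention(nn.Module):
    """Sketch of the differential spatial attention branch: queries and keys are
    compressed to one channel per head, and the affinity is built from (negated)
    squared differences rather than dot products. The channel branch (not shown)
    repeats the same computation along the channel dimension."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.heads = num_heads
        self.to_qk = nn.Linear(channels, 2 * num_heads)   # channel compression: M queries + M keys
        self.to_v = nn.Linear(channels, channels)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                                  # x: (B, C, H, W)
        B, C, H, W = x.shape
        M = self.heads
        tokens = x.flatten(2).transpose(1, 2)              # (B, N, C), N = H*W

        q, k = self.to_qk(tokens).chunk(2, dim=-1)         # each (B, N, M)
        q, k = q.transpose(1, 2), k.transpose(1, 2)        # (B, M, N)
        v = self.to_v(tokens).transpose(1, 2)              # (B, C, N)
        v = v.reshape(B, M, C // M, -1).transpose(2, 3)    # (B, M, N, C/M)

        # Differential affinity: broadcast, subtract, square, softmax over key positions.
        diff = q.unsqueeze(3) - k.unsqueeze(2)              # (B, M, N, N)
        attn = torch.softmax(-diff.pow(2), dim=-1)

        out = attn @ v                                       # (B, M, N, C/M)
        out = out.transpose(2, 3).reshape(B, C, -1).transpose(1, 2)   # (B, N, C)
        out = self.proj(out)
        return out.transpose(1, 2).reshape(B, C, H, W) + x   # assumed residual connection
```

Because Q and K carry only one channel per head, the dense N × N affinity stays relatively cheap, which matches the module's placement on the low-resolution, high-level feature maps.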

2.2.2. Feature Fusion Module Based on Local Differential Mutual Attention and Gated Modulation

Monodepth2 [8] is a classic self-supervised monocular depth estimation algorithm, and many subsequent methods have adopted its training strategy and evaluation protocol; we therefore use it as the benchmark method in this paper. Its traditional feature fusion module at the skip connections of the encoder–decoder architecture is shown in Figure 3.
The second row of Figure 4 compares encoder features, decoder features, and full-resolution decoder features in Monodepth2. Figure 4b reveals that while encoder features contain texture information, their lack of fine detail due to the convolutional architecture may cause depth ambiguity at object boundaries. Figure 4c further demonstrates that nearest-neighbor interpolation in up-sampling introduces aliasing artifacts at object boundaries in full-resolution decoder features, resulting in significant detail degradation. Figure 4d shows that fused features combining elements from both representations exhibit insufficient detail preservation and increased susceptibility to erroneous depth predictions.
To overcome these limitations, we introduce a feature fusion module comprising a local differential mutual attention block and a gated modulation block, as illustrated in Figure 5. Leveraging the encoder’s richer detail information, the attention block employs encoder features to guide decoder feature refinement. Inspired by TransNeXt’s pixel-focused attention (PFA) [16], we define e(u,v) as the set of the encoder’s pixels within a sliding window centered on the pixel (u,v); in our implementation, the window size is 5 × 5. Similarly, we denote the decoder’s corresponding feature set as d(u,v), forming paired regional descriptors that enable differential mutual feature interaction.
The difference between queries and keys of features can be described as follows:
S_{(u,v) \sim e(u,v)} = \mathrm{Sum}\left( \left( Q(u,v) - K_{e(u,v)} \right)^2 \right)
where Sum(·) denotes summation of the input feature vectors over the channel dimension. This module operates on the queries (Q) and keys (K), computing the sum of squared differences between central features and neighboring features within a 5 × 5 local window to generate the pixel-focused differential attention maps (A):
A_{(u,v) \sim e(u,v)} = \mathrm{Softmax}\left( \frac{-S_{(u,v) \sim e(u,v)}}{\sqrt{d}} \right)
As shown in Figure 5b, the differential attention (A) of encoder features is used to guide the extraction of the decoder feature (V):
\mathrm{Attention}\left( Q(u,v), K_{e(u,v)}, V_{d(u,v)} \right) = A_{(u,v) \sim e(u,v)} \, V_{d(u,v)}
Since encoder features contain rich texture details while decoder features exhibit smoother representations, a gated modulation block, as shown in Figure 5c, selectively filters out the encoder texture information that may negatively impact depth prediction accuracy.
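The two blocks can be sketched roughly as follows, assuming encoder and decoder features have already been brought to the same resolution and channel count. Whether the central query comes from the decoder or the encoder branch, the 1 × 1 projections, and the exact form of the gating are our own assumptions; the sketch only illustrates the windowed differential attention and the gated filtering of encoder detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDifferentialMutualAttention(nn.Module):
    """Sketch of the local differential mutual attention block plus gated modulation:
    encoder features guide the refinement of upsampled decoder features inside a 5x5 window."""

    def __init__(self, channels, window=5):
        super().__init__()
        self.window = window
        self.q = nn.Conv2d(channels, channels, 1)   # queries (here taken from decoder features)
        self.k = nn.Conv2d(channels, channels, 1)   # keys from encoder features
        self.v = nn.Conv2d(channels, channels, 1)   # values from decoder features
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, enc, dec):                    # both (B, C, H, W)
        B, C, H, W = dec.shape
        w2 = self.window ** 2
        pad = self.window // 2

        q = self.q(dec)                                            # (B, C, H, W)
        k = F.unfold(self.k(enc), self.window, padding=pad)        # (B, C*w2, H*W)
        v = F.unfold(self.v(dec), self.window, padding=pad)        # (B, C*w2, H*W)
        k = k.view(B, C, w2, H * W)
        v = v.view(B, C, w2, H * W)

        # Squared difference between the central query and each key in the window,
        # summed over channels, then softmax over the 25 window positions (scaled by sqrt(C)).
        s = (q.view(B, C, 1, H * W) - k).pow(2).sum(dim=1)         # (B, w2, H*W)
        a = torch.softmax(-s / C ** 0.5, dim=1)

        out = (a.unsqueeze(1) * v).sum(dim=2).view(B, C, H, W)     # aggregate decoder values

        # Gated modulation: decide per pixel how much raw encoder detail to let through.
        g = self.gate(torch.cat([enc, out], dim=1))
        return out + g * enc
```

Restricting the comparison to a 5 × 5 window keeps the cost linear in the number of pixels, in contrast to the dense affinity used by the global module.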
The effectiveness of our proposed module is demonstrated through visual comparisons in Figure 4. As shown in Figure 4b,e, the local differential mutual attention block preserves significantly more structural detail in wall regions compared to Monodepth2’s encoder features. Figure 4c,f reveal that our gated modulation block effectively suppresses invalid texture information, which is detrimental to depth estimation. The superior performance of our integrated feature fusion module is conclusively demonstrated in Figure 4d,g. Compared with Monodepth2’s decoder features, the features generated by our fusion module maintain essential details while filtering out texture information unrelated to depth, thus effectively bridging semantic gaps at skip connections and refining upsampled features through multi-scale context aggregation.

2.2.3. Deformable Bins Module

Many supervised monocular depth estimation methods currently utilize the AdaBins [17] architecture for their depth prediction head, as depicted in Figure 6a. This structure is predominantly utilized in supervised depth estimation methods but has remained unexplored in self-supervised approaches. In this section, we focus on enhancing this architecture to advance the predictive precision of self-supervised monocular depth estimation methods. The bin mechanism departs from the conventional approach of predicting depth directly using convolutional layers, instead employing a more structured methodology. Initially, it defines a depth interval with specific upper and lower bounds. This interval is then partitioned into a series of monotonically increasing, adaptively adjustable bin values. Subsequently, the network predicts a weight distribution across these bin feature dimensions for each pixel location. Finally, an accurate depth estimate is obtained by performing a weighted summation of the bin features using the predicted weights.
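The bin-based regression described above can be summarized in a few lines (a schematic sketch of an AdaBins-style head; the interval bounds and tensor names are illustrative):

```python
import torch

def adabins_style_depth(bin_widths_logits, pixel_logits, d_min=0.1, d_max=100.0):
    """Bin-based depth regression with a single, globally shared set of bins.
    bin_widths_logits: (B, N_bins) unnormalized widths shared across the image
    pixel_logits:      (B, N_bins, H, W) per-pixel scores over the bins"""
    # Normalize widths so the bins exactly tile the depth interval [d_min, d_max].
    widths = torch.softmax(bin_widths_logits, dim=1) * (d_max - d_min)
    edges = d_min + torch.cumsum(widths, dim=1)
    centers = edges - 0.5 * widths                              # (B, N_bins)

    # Per-pixel probability over bins, then the weighted sum of bin centers.
    probs = torch.softmax(pixel_logits, dim=1)                  # (B, N_bins, H, W)
    depth = (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
    return depth                                                # (B, 1, H, W)
```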
The advantage of AdaBins lies in its use of the bin structure to partition a scene’s depth intervals adaptively. This enables more effective capture and representation of depth values across diverse scenarios. However, a limitation of AdaBins is that it operates with a single, globally shared bin feature.
Given the inherent anisotropy of local depth information in images, a solitary bin feature cannot adequately represent optimal depth intervals per pixel. To address this limitation, we propose assigning a unique bin feature to every pixel, enabling the network to focus on depth distributions within local regions. We introduce the deformable bin module (Figure 6b) for this purpose. Unlike AdaBins, our module jointly incorporates the previous stage’s predicted depth Ds+1 and decoder features during bin generation.
As shown in Figure 6c, while the previous stage’s depth provides a coarse depth range estimate per pixel, deformable sampling addresses potential inaccuracies by adaptively sampling regions around the upsampled depth map to define bins, using offsets derived from decoder features.
Each pixel position on the decoder feature map is denoted as p(u,v), the offsets are denoted as \{\Delta p_n \mid n = 1, \ldots, N\}, and the bin weights are denoted as w_{(u,v)}(n); the prediction result D_s is obtained from the following formula:
D_s(p(u,v)) = \sum_{n} w_{(u,v)}(n) \, D_{s+1}\left( p(u,v) + \Delta p_n \right)
This lightweight design enables each pixel to dynamically aggregate depth distributions within its local neighborhood, significantly enhancing the decoder’s capacity for fine-grained spatial detail processing.
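A rough sketch of such a deformable bin block is given below, assuming the previous-stage depth D_{s+1} has already been upsampled to the current resolution; the offset/weight heads and the number of samples N are our own choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableBins(nn.Module):
    """Sketch of the deformable bin block: each pixel samples N depth values around
    its location in the previous-stage depth map and mixes them with predicted weights.
    Offsets and weights are derived from decoder features."""

    def __init__(self, channels, num_samples=9):
        super().__init__()
        self.num_samples = num_samples
        self.offsets = nn.Conv2d(channels, 2 * num_samples, 3, padding=1)   # (dx, dy) per sample
        self.weights = nn.Conv2d(channels, num_samples, 3, padding=1)

    def forward(self, feat, depth_prev):           # feat: (B,C,H,W), depth_prev: (B,1,H,W)
        B, _, H, W = feat.shape
        N = self.num_samples

        offs = self.offsets(feat).view(B, N, 2, H, W)          # per-pixel offsets in pixels
        w = torch.softmax(self.weights(feat), dim=1)           # (B, N, H, W), sums to 1 per pixel

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=feat.device),
                                torch.linspace(-1, 1, W, device=feat.device), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).view(1, 1, H, W, 2)

        # Convert pixel offsets to normalized coordinates and sample the previous depth.
        scale = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1)], device=feat.device)
        grid = base + offs.permute(0, 1, 3, 4, 2) * scale       # (B, N, H, W, 2)

        depth_rep = depth_prev.repeat_interleave(N, dim=0)      # (B*N, 1, H, W)
        sampled = F.grid_sample(depth_rep, grid.reshape(B * N, H, W, 2),
                                padding_mode="border", align_corners=True)
        sampled = sampled.view(B, N, H, W)

        # Weighted aggregation: D_s(p) = sum_n w_n * D_{s+1}(p + dp_n)
        return (w * sampled).sum(dim=1, keepdim=True)           # (B, 1, H, W)
```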

3. Results

3.1. Dataset Preprocessing, Splitting, and Implementation Details

The baseline method (Monodepth2) adopts the KITTI dataset for training and evaluation, and subsequent self-supervised monocular depth estimation algorithms commonly use it for experimental validation. To demonstrate the effectiveness of the proposed decoder, experiments were conducted on the KITTI dataset, using the split proposed by [18]: the training set contains 39,810 images and the test set contains 697 images. The network is implemented in PyTorch 1.9.0 and trained on RTX 3090 GPUs for 20 epochs. Both the pose network and the depth network use the Adam optimizer with β1 = 0.9 and β2 = 0.999. The initial learning rate is set to 1 × 10^{-4} and decays to 1 × 10^{-5} after 15 epochs. The depth network uses ResNet18 or ConvNeXt-tiny [19] as its backbone, while the pose network only uses ResNet18.
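The optimizer and schedule described above can be reproduced roughly as follows (a configuration sketch; the placeholder modules merely stand in for the actual depth and pose networks):

```python
import torch
import torch.nn as nn

# Hypothetical placeholder modules standing in for the depth and pose networks.
depth_net = nn.Conv2d(3, 1, 3, padding=1)
pose_net = nn.Conv2d(6, 6, 3, padding=1)

params = list(depth_net.parameters()) + list(pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))

# Single schedule for both networks: lr drops from 1e-4 to 1e-5 after epoch 15 of 20.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)
```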

3.2. Experiments for Self-Supervised Monocular Depth Estimation

3.2.1. Ablation Study

For the ablation study, ResNet18 serves as the backbone for both pose and depth networks, with Monodepth2’s monocular evaluation metrics providing the baseline comparison. Experiments employ a fixed image resolution of 640 × 192. Abs Rel, Sq Rel, RMSE, and RMSE log quantify the disparity between ground truth and predicted values; lower values indicate better prediction performance. Higher threshold accuracies (δ < 1.25, δ < 1.25², δ < 1.25³) mean that more predicted pixels lie close to the ground truth, indicating better overall prediction performance. As presented in Table 1, ablation results for the global differential self-attention module (GDAM), the feature fusion module incorporating local differential mutual attention and gated modulation (DGFM), and the deformable bin module (DBM) collectively demonstrate significant performance improvements. These findings confirm that each proposed module enhances the network’s depth estimation capability.
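For reference, the evaluation metrics used throughout the following tables can be computed with the standard KITTI depth-evaluation formulas (a sketch assuming gt and pred are per-image depth arrays with invalid pixels already masked out):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth metrics: Abs Rel, Sq Rel, RMSE, RMSE log and the
    three threshold accuracies delta < 1.25, 1.25^2, 1.25^3."""
    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```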

3.2.2. Comparison with Other Works

The network was trained using the KITTI dataset. The monocular training mode is denoted as M, and the monocular–stereo training mode is denoted as MS. The resolution is set to 640 × 192 (L) or 1024 × 320 (H). When the resolution is 640 × 192, the batch size is set to 12; when the resolution is 1024 × 320, the batch size is set to 8. The backbone of the network is either ResNet18 or ConvNeXt-tiny. Table 2 demonstrates the performance metrics of our method under varying experimental configurations.
Additionally, the proposed algorithm in this paper is compared with other high-performing self-supervised monocular depth estimation algorithms (Monodepth2 [8], SGDepth [20], HR-Depth [9], RMSFM6 [21], CADepth [10], Liu [22], Lite-Mono [12], SwinDepth [11]). The experimental results are shown in Table 3. Quantitative analysis shows that compared to other methods, our approach balances prediction error and prediction integrity, achieving excellent performance.
Furthermore, the proposed method is evaluated on the Make3D dataset and compared with other methods. The evaluation results are shown in Table 4. As the table shows, the proposed method also achieves superior performance across the key evaluation metrics, demonstrating good generalization ability.
Qualitative comparisons on the KITTI dataset are shown in Figure 7. From the visualization results, it can be observed that the method proposed in this paper demonstrates significant performance advantages in depth mutation regions (such as the junction of foreground and background): the edge contours of the predicted depth map are clearer and sharper, with more complete detail preservation. Especially when dealing with slender foreground objects, our method maintains better structural continuity and integrity.
Figure 8 shows the comparison of depth maps on the Make3D dataset. In the first row of the figure, it is evident that, in contrast to the other methods, the approach proposed in this paper successfully predicts the complete outline of the tree trunk even in areas where the trunk’s color closely resembles the background. This method demonstrates superior depth prediction capabilities in regions with lower image contrast. Moving to the second row, it is apparent that this method not only accurately predicts the depth of the street lamp but also captures a more comprehensive structure of the surrounding shrubbery. In the third row, it is clear that the proposed method delivers more consistent depth predictions for the road sign, and the contours between the road sign and the background are more distinct. The comparative results on the Make3D dataset underscore the robust generalization ability of the proposed method.
Among the current advanced methods, CADepth, like the method in this paper, focuses on improving the depth decoder; therefore, a further comparison between the proposed method and CADepth is conducted. Additionally, Monodepth2 is the benchmark reference. The comparative results are shown in Table 5. The experimental setup in Table 5 employs an input resolution of 640 × 192 and a batch size of 12 and exclusively uses monocular images for training.
All experiments in this paper use a ResNet18-based pose network. The results show that, compared with methods that rely on stronger pose networks, our prediction head achieves better predictions even when trained with this small pose network, which further verifies the effectiveness of our method. The pose network is not involved during inference. Table 5 presents FPS, GPU footprint, training time, and FLOP metrics, reflecting the inference speed, memory consumption, training duration, and computational complexity of the depth network. Compared to CADepth, our approach achieves faster inference, a lower memory footprint, a shorter training time, and reduced computational complexity. Furthermore, under the same experimental setup as Table 5, the FLOPs of CADepth’s SPM module and our GDAM are 50.3 M and 47.1 M, respectively. The decoder with only CADepth’s DEM module added and the decoder with only our DGFM module added have FLOPs of 30,660 M and 5509 M and parameter counts of 32.78 M and 5.42 M, respectively.
On the whole, our method has good inference speed and effectively improves the performance of the self-supervised monocular depth estimation network without significantly increasing the number of parameters and FLOPs.

4. Discussion

This paper proposes a monocular depth decoder enhanced by differential attention and deformable bins to address decoder limitations in depth prediction networks. The main innovations are the introduction of a differential attention mechanism and a deformable bins mechanism in the decoder head. The former effectively alleviates the inconsistency between image edges and depth edges while enhancing global information exchange within the decoder. The latter incorporates a multi-scale depth prediction refinement mechanism, which facilitates the generation of highly accurate depth estimation results. The proposed method achieves Abs Rel and δ < 1.25 metrics of 0.092 and 0.913 on the KITTI dataset, and experiments on the Make3D dataset show that the proposed method has good generalization ability. Our method effectively improves the predictive performance of the network without significantly increasing the number of parameters. However, our approach currently exhibits certain limitations. Occlusion handling remains challenging in self-supervised tasks: while many methods employ object detection or image segmentation techniques to localize and remove dynamic objects, this entails significant computational resource consumption. Furthermore, low-contrast illumination conditions and noisy images in datasets impede accurate reprojection loss computation for self-supervised methods, so improving and curating dataset quality is another open challenge. In future work, we aim to enhance the network’s predictive performance by refining training strategies, optimizing loss functions, and incorporating auxiliary techniques such as pseudo-labeling.

Author Contributions

M.Z.: Conceptualization, methodology, software, writing. H.Y.: Conceptualization, methodology, supervision, funding acquisition. Z.L.: Visualization, validation, writing. Y.Z.: Visualization, validation, writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This research employed publicly available datasets: KITTI: https://www.cvlibs.net/datasets/kitti/raw_data.php (accessed on 15 September 2023) and Make3D: http://make3d.cs.cornell.edu/data.html (accessed on 15 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zollhöfer, M.; Stotko, P.; Görlitz, A.; Theobalt, C.; Nießner, M.; Klein, R.; Kolb, A. State of the Art on 3D Reconstruction with RGB-D Cameras. Comput. Graph. Forum. 2018, 37, 625–652. [Google Scholar] [CrossRef]
  2. Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art. Found. Trends Comput. 2020, 12, 1–308. [Google Scholar] [CrossRef]
  3. Suzuki, R.; Karim, A.; Xia, T.; Hedayati, H.; Marquardt, N. Augmented Reality and Robotics: A Survey and Taxonomy for AR-enhanced Human-Robot Interaction and Robotic Interfaces. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 30 April–5 May 2022. [Google Scholar]
  4. Tsai, Y.-M.; Chang, Y.-L.; Chen, L.-G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proceedings of the International Symposium on Intelligent Signal Processing and Communications (ISPAC), Yonago, Japan, 12–15 December 2006. [Google Scholar]
  5. Tang, C.; Hou, C.P.; Song, Z.J. Depth recovery and refinement from a single image using defocus cues. J. Mod. Optic. 2015, 62, 441–448. [Google Scholar] [CrossRef]
  6. Guo, F.; Guo, F.; Peng, H. Adaptive estimation of depth map for two-dimensional to three-dimensional stereoscopic conversion. Opt. Rev. 2014, 21, 60–73. [Google Scholar] [CrossRef]
  7. Song, M.; Lim, S.; Kim, W. Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 2021, 32, 4381–4393. [Google Scholar] [CrossRef]
  8. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging Into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  9. Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the AAAI, Online, 2–9 February 2021. [Google Scholar]
  10. Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation. In Proceedings of the 9th International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021. [Google Scholar]
  11. Shim, D.; Kim, H.J. SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Networks. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  12. Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation. In Proceedings of the 2023 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  13. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. (TIP) 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  14. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  15. Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning Depth from Monocular Videos using Direct Methods. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  16. Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  17. Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021. [Google Scholar]
  18. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015. [Google Scholar]
  19. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  20. Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance. In Proceedings of the European Conference on Computer Vision 2020, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  21. Zhou, Z.; Fan, X.; Shi, P.; Xin, Y. R-MSFM: Recurrent Multi-Scale Feature Modulation for Monocular Depth Estimating. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  22. Liu, H.; Zhu, Y.; Hua, G.; Huang, W.; Ding, R. Adaptive Weighted Network With Edge Enhancement Module For Monocular Self-Supervised Depth Estimation. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
Figure 1. The structure of this paper’s self-supervised monocular depth estimation network.
Figure 2. Global differential self-attention module.
Figure 3. Feature fusion structure at skip connections.
Figure 4. Comparisons of feature fusion modules. (a) Input image; (b) encoder feature in Monodepth2; (c) upsampled decoder feature in Monodepth2; (d) decoder feature in Monodepth2; (e) output feature of the local differential mutual attention block; (f) output feature of the gated modulation block; (g) decoder feature of the feature fusion module based on local differential mutual attention and gated modulation.
Figure 5. Structure of the feature fusion module. (a) Feature fusion module based on local differential mutual attention and gated modulation; (b) local differential mutual attention block; (c) gated modulation block.
Figure 6. Prediction head of bin structure. (a) Prediction head of AdaBins; (b) deformable bin module; (c) deformable bin block.
Figure 7. Comparison of depth maps on the KITTI dataset. (a) Input images; (b) Monodepth2; (c) RMSFM6; (d) HR-Depth; (e) CADepth; (f) SwinDepth; (g) ours.
Figure 8. Comparison of depth maps on the Make3D dataset. (a) Input image; (b) Monodepth2; (c) CADepth; (d) ours.
Table 1. The results of the ablation study. ↓: smaller is better, ↑: larger is better, bold font: best metric, ✔: module used.

GDAM | DGFM | DBM | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
 | | | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981
 | | | 0.111 | 0.826 | 4.724 | 0.189 | 0.877 | 0.960 | 0.982
 | | | 0.112 | 0.838 | 4.705 | 0.187 | 0.880 | 0.961 | 0.982
 | | | 0.112 | 0.867 | 4.754 | 0.190 | 0.878 | 0.960 | 0.982
 | | | 0.111 | 0.811 | 4.635 | 0.186 | 0.881 | 0.962 | 0.982
 | | | 0.109 | 0.814 | 4.679 | 0.187 | 0.882 | 0.961 | 0.982
 | | | 0.111 | 0.850 | 4.706 | 0.188 | 0.882 | 0.962 | 0.982
 | | | 0.109 | 0.827 | 4.635 | 0.185 | 0.885 | 0.962 | 0.982
Table 2. The experimental results of the depth prediction network. ↓: smaller is better, ↑: larger is better.

Backbone | H/L | M/MS | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
resnet18 | L | M | 0.109 | 0.827 | 4.635 | 0.185 | 0.885 | 0.962 | 0.982
convnext-t | L | M | 0.102 | 0.829 | 4.570 | 0.180 | 0.898 | 0.965 | 0.982
resnet18 | H | M | 0.105 | 0.787 | 4.492 | 0.180 | 0.893 | 0.965 | 0.983
convnext-t | H | M | 0.097 | 0.720 | 4.340 | 0.174 | 0.907 | 0.967 | 0.984
resnet18 | L | MS | 0.108 | 0.839 | 4.650 | 0.188 | 0.888 | 0.961 | 0.981
convnext-t | L | MS | 0.097 | 0.742 | 4.412 | 0.176 | 0.905 | 0.966 | 0.983
resnet18 | H | MS | 0.101 | 0.747 | 4.417 | 0.179 | 0.897 | 0.966 | 0.983
convnext-t | H | MS | 0.092 | 0.691 | 4.278 | 0.171 | 0.913 | 0.968 | 0.984
Table 3. Comparison with other algorithms. ↓: smaller is better, ↑: larger is better, bold font: best metric.

Method | H/L | M/MS | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
SGDepth | L | M | 0.117 | 0.907 | 4.844 | 0.196 | 0.875 | 0.958 | 0.980
Monodepth2 | L | M | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981
HR-Depth | L | M | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983
RMSFM6 | L | M | 0.112 | 0.806 | 4.704 | 0.191 | 0.878 | 0.960 | 0.981
CADepth | L | M | 0.105 | 0.769 | 4.535 | 0.181 | 0.892 | 0.964 | 0.983
Liu | L | M | 0.105 | 0.765 | 4.598 | 0.185 | 0.888 | 0.963 | 0.982
Lite-Mono | L | M | 0.107 | 0.765 | 4.561 | 0.183 | 0.886 | 0.963 | 0.983
SwinDepth | L | M | 0.106 | 0.739 | 4.510 | 0.182 | 0.890 | 0.964 | 0.984
Ours | L | M | 0.102 | 0.829 | 4.570 | 0.180 | 0.898 | 0.965 | 0.982
MD2 | H | M | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982
HR-Depth | H | M | 0.106 | 0.755 | 4.472 | 0.181 | 0.892 | 0.966 | 0.984
CADepth | H | M | 0.102 | 0.734 | 4.407 | 0.178 | 0.898 | 0.966 | 0.984
Liu | H | M | 0.104 | 0.732 | 4.427 | 0.181 | 0.894 | 0.965 | 0.984
Lite-Mono | H | M | 0.097 | 0.710 | 4.309 | 0.174 | 0.905 | 0.967 | 0.984
Ours | H | M | 0.097 | 0.720 | 4.340 | 0.174 | 0.907 | 0.967 | 0.984
MD2 | L | MS | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979
HR-Depth | L | MS | 0.107 | 0.785 | 4.612 | 0.185 | 0.887 | 0.962 | 0.982
CADepth | L | MS | 0.102 | 0.752 | 4.504 | 0.181 | 0.894 | 0.964 | 0.983
Ours | L | MS | 0.097 | 0.742 | 4.412 | 0.176 | 0.905 | 0.966 | 0.983
Monodepth2 | H | MS | 0.106 | 0.806 | 4.630 | 0.193 | 0.876 | 0.958 | 0.980
HR-Depth | H | MS | 0.101 | 0.716 | 4.395 | 0.179 | 0.899 | 0.966 | 0.983
CADepth | H | MS | 0.096 | 0.694 | 4.264 | 0.173 | 0.908 | 0.968 | 0.984
Ours | H | MS | 0.092 | 0.691 | 4.278 | 0.171 | 0.913 | 0.968 | 0.984
Table 4. Experimental results on the Make3D dataset. ↓: smaller is better, bold font: best metric.

Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓
MD2 | 0.322 | 3.589 | 7.417 | 0.163
RMSFM6 | 0.334 | 3.285 | 7.212 | 0.169
CADepth | 0.312 | 3.086 | 7.066 | 0.159
Ours | 0.307 | 2.865 | 6.888 | 0.159
Table 5. Performance comparison on the KITTI dataset.

Method | Pose Network Parameters (M) | Depth Network Parameters (M) | Total Parameters (M) | FPS | Abs Rel | δ < 1.25 | GPU Footprint (M) | Training Time (h) | FLOPs of Depth Network Decoder (G)
MD2 (Res-50) | 27.26 | 34.57 | 61.83 | 91 | 0.110 | 0.883 | 15,811 | 11.5 | 6.5
CADepth | 27.26 | 58.34 | 85.60 | 53 | 0.105 | 0.892 | 18,843 | 12.0 | 30.7
Ours | 12.51 | 33.34 | 45.85 | 62 | 0.102 | 0.898 | 15,155 | 11.7 | 7.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
