Self-Supervised Monocular Depth Estimation Based on Differential Attention
Abstract
1. Introduction
2. Methods
2.1. Self-Supervised Training Process
2.2. Monocular Depth Decoder Based on Differential Attention and Bin Mechanism
2.2.1. Global Differential Self-Attention Module
2.2.2. Feature Fusion Module Based on Local Differential Mutual Attention and Gated Modulation
2.2.3. Deformable Bins Module
3. Results
3.1. Dataset Preprocessing, Splitting, and Implementation Details
3.2. Experiments for Self-Supervised Monocular Depth Estimation
3.2.1. Ablation Study
3.2.2. Comparison with Other Works
4. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zollhöfer, M.; Stotko, P.; Görlitz, A.; Theobalt, C.; Nießner, M.; Klein, R.; Kolb, A. State of the Art on 3D Reconstruction with RGB-D Cameras. Comput. Graph. Forum 2018, 37, 625–652.
- Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308.
- Suzuki, R.; Karim, A.; Xia, T.; Hedayati, H.; Marquardt, N. Augmented Reality and Robotics: A Survey and Taxonomy for AR-enhanced Human-Robot Interaction and Robotic Interfaces. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 30 April–5 May 2022.
- Tsai, Y.-M.; Chang, Y.-L.; Chen, L.-G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Yonago, Japan, 12–15 December 2006.
- Tang, C.; Hou, C.P.; Song, Z.J. Depth recovery and refinement from a single image using defocus cues. J. Mod. Opt. 2015, 62, 441–448.
- Guo, F.; Guo, F.; Peng, H. Adaptive estimation of depth map for two-dimensional to three-dimensional stereoscopic conversion. Opt. Rev. 2014, 21, 60–73.
- Song, M.; Lim, S.; Kim, W. Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4381–4393.
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging Into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
- Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-Wise Attention-Based Network for Self-Supervised Monocular Depth Estimation. In Proceedings of the 9th International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021.
- Shim, D.; Kim, H.J. SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Networks. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023.
- Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning Depth from Monocular Videos using Direct Methods. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
- Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021.
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
- Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Zhou, Z.; Fan, X.; Shi, P.; Xin, Y. R-MSFM: Recurrent Multi-Scale Feature Modulation for Monocular Depth Estimating. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.
- Liu, H.; Zhu, Y.; Hua, G.; Huang, W.; Ding, R. Adaptive Weighted Network with Edge Enhancement Module for Monocular Self-Supervised Depth Estimation. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022.
| GDAM | DGFM | DBM | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | | | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| ✔ | | | 0.111 | 0.826 | 4.724 | 0.189 | 0.877 | 0.960 | 0.982 |
| | ✔ | | 0.112 | 0.838 | 4.705 | 0.187 | 0.880 | 0.961 | 0.982 |
| | | ✔ | 0.112 | 0.867 | 4.754 | 0.190 | 0.878 | 0.960 | 0.982 |
| ✔ | ✔ | | 0.111 | 0.811 | 4.635 | 0.186 | 0.881 | 0.962 | 0.982 |
| ✔ | | ✔ | 0.109 | 0.814 | 4.679 | 0.187 | 0.882 | 0.961 | 0.982 |
| | ✔ | ✔ | 0.111 | 0.850 | 4.706 | 0.188 | 0.882 | 0.962 | 0.982 |
| ✔ | ✔ | ✔ | 0.109 | 0.827 | 4.635 | 0.185 | 0.885 | 0.962 | 0.982 |

GDAM: global differential self-attention module (Section 2.2.1); DGFM: feature fusion module based on local differential mutual attention and gated modulation (Section 2.2.2); DBM: deformable bins module (Section 2.2.3). The first row, with no modules enabled, corresponds to the Monodepth2 baseline.
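The error and accuracy columns in these tables are the standard depth-evaluation metrics introduced by Eigen and Fergus and used throughout the self-supervised literature. A minimal NumPy sketch of how they are computed (assuming `gt` and `pred` are flattened arrays of valid, positive depths, with the usual scale alignment and depth cap already applied):

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """Standard KITTI depth metrics over valid pixels.

    Assumes gt and pred are 1-D arrays of matched positive depths,
    with median scaling and depth capping already applied.
    """
    # Accuracy under relative-error thresholds delta < 1.25^k
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()
    d2 = (thresh < 1.25 ** 2).mean()
    d3 = (thresh < 1.25 ** 3).mean()

    # Error metrics
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean((gt - pred) ** 2 / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "d1": d1, "d2": d2, "d3": d3}
```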
| Backbone | H/L | M/MS | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | L | M | 0.109 | 0.827 | 4.635 | 0.185 | 0.885 | 0.962 | 0.982 |
| ConvNeXt-T | L | M | 0.102 | 0.829 | 4.570 | 0.180 | 0.898 | 0.965 | 0.982 |
| ResNet-18 | H | M | 0.105 | 0.787 | 4.492 | 0.180 | 0.893 | 0.965 | 0.983 |
| ConvNeXt-T | H | M | 0.097 | 0.720 | 4.340 | 0.174 | 0.907 | 0.967 | 0.984 |
| ResNet-18 | L | MS | 0.108 | 0.839 | 4.650 | 0.188 | 0.888 | 0.961 | 0.981 |
| ConvNeXt-T | L | MS | 0.097 | 0.742 | 4.412 | 0.176 | 0.905 | 0.966 | 0.983 |
| ResNet-18 | H | MS | 0.101 | 0.747 | 4.417 | 0.179 | 0.897 | 0.966 | 0.983 |
| ConvNeXt-T | H | MS | 0.092 | 0.691 | 4.278 | 0.171 | 0.913 | 0.968 | 0.984 |

H/L: high- or low-resolution input; M: trained on monocular video only; MS: trained on monocular video plus stereo pairs.
| Method | H/L | M/MS | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SGDepth | L | M | 0.117 | 0.907 | 4.844 | 0.196 | 0.875 | 0.958 | 0.980 |
| Monodepth2 | L | M | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
| HR-Depth | L | M | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983 |
| R-MSFM6 | L | M | 0.112 | 0.806 | 4.704 | 0.191 | 0.878 | 0.960 | 0.981 |
| CADepth | L | M | 0.105 | 0.769 | 4.535 | 0.181 | 0.892 | 0.964 | 0.983 |
| Liu et al. | L | M | 0.105 | 0.765 | 4.598 | 0.185 | 0.888 | 0.963 | 0.982 |
| Lite-Mono | L | M | 0.107 | 0.765 | 4.561 | 0.183 | 0.886 | 0.963 | 0.983 |
| SwinDepth | L | M | 0.106 | 0.739 | 4.510 | 0.182 | 0.890 | 0.964 | 0.984 |
| Ours | L | M | 0.102 | 0.829 | 4.570 | 0.180 | 0.898 | 0.965 | 0.982 |
| Monodepth2 | H | M | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 |
| HR-Depth | H | M | 0.106 | 0.755 | 4.472 | 0.181 | 0.892 | 0.966 | 0.984 |
| CADepth | H | M | 0.102 | 0.734 | 4.407 | 0.178 | 0.898 | 0.966 | 0.984 |
| Liu et al. | H | M | 0.104 | 0.732 | 4.427 | 0.181 | 0.894 | 0.965 | 0.984 |
| Lite-Mono | H | M | 0.097 | 0.710 | 4.309 | 0.174 | 0.905 | 0.967 | 0.984 |
| Ours | H | M | 0.097 | 0.720 | 4.340 | 0.174 | 0.907 | 0.967 | 0.984 |
| Monodepth2 | L | MS | 0.106 | 0.818 | 4.750 | 0.196 | 0.874 | 0.957 | 0.979 |
| HR-Depth | L | MS | 0.107 | 0.785 | 4.612 | 0.185 | 0.887 | 0.962 | 0.982 |
| CADepth | L | MS | 0.102 | 0.752 | 4.504 | 0.181 | 0.894 | 0.964 | 0.983 |
| Ours | L | MS | 0.097 | 0.742 | 4.412 | 0.176 | 0.905 | 0.966 | 0.983 |
| Monodepth2 | H | MS | 0.106 | 0.806 | 4.630 | 0.193 | 0.876 | 0.958 | 0.980 |
| HR-Depth | H | MS | 0.101 | 0.716 | 4.395 | 0.179 | 0.899 | 0.966 | 0.983 |
| CADepth | H | MS | 0.096 | 0.694 | 4.264 | 0.173 | 0.908 | 0.968 | 0.984 |
| Ours | H | MS | 0.092 | 0.691 | 4.278 | 0.171 | 0.913 | 0.968 | 0.984 |
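For the M rows above, monocular-only training recovers depth only up to an unknown scale, so it is conventional (following Monodepth2) to align each prediction to ground truth by per-image median scaling before computing the metrics. A minimal sketch:

```python
import numpy as np

def median_scale(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Align an up-to-scale monocular prediction to ground truth by the
    ratio of medians, as is conventional when evaluating M-trained models."""
    return pred * (np.median(gt) / np.median(pred))
```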
| Method | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ |
|---|---|---|---|---|
| Monodepth2 | 0.322 | 3.589 | 7.417 | 0.163 |
| R-MSFM6 | 0.334 | 3.285 | 7.212 | 0.169 |
| CADepth | 0.312 | 3.086 | 7.066 | 0.159 |
| Ours | 0.307 | 2.865 | 6.888 | 0.159 |
| Method | Pose Network Parameters (M) | Depth Network Parameters (M) | Total Parameters (M) | FPS | Abs Rel ↓ | δ < 1.25 ↑ | GPU Footprint (MB) | Training Time (h) | FLOPs of Depth Network Decoder (G) |
|---|---|---|---|---|---|---|---|---|---|
| Monodepth2 (ResNet-50) | 27.26 | 34.57 | 61.83 | 91 | 0.110 | 0.883 | 15,811 | 11.5 | 6.5 |
| CADepth | 27.26 | 58.34 | 85.60 | 53 | 0.105 | 0.892 | 18,843 | 12.0 | 30.7 |
| Ours | 12.51 | 33.34 | 45.85 | 62 | 0.102 | 0.898 | 15,155 | 11.7 | 7.9 |
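The parameter counts and throughput in this table can be reproduced for any PyTorch model along the following lines. This is a minimal sketch, not the authors' measurement protocol; the input resolution and iteration counts are illustrative assumptions:

```python
import time
import torch
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model: nn.Module,
                input_shape=(1, 3, 192, 640),  # assumed low-res KITTI input
                n_iters: int = 100) -> float:
    """Rough inference FPS on the model's current device."""
    device = next(model.parameters()).device
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):  # warm-up iterations, excluded from timing
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.time() - start)
```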