Survey on Monocular Metric Depth Estimation
Abstract
1. Introduction
1.1. Research Gap
1.2. Main Contribution
1.3. Problem Statement
1.4. Objective
2. Background
2.1. Traditional Methods
2.2. Deep Learning
2.3. Monocular Depth Estimation
2.4. Zero-Shot Depth Estimation
3. Materials and Methods
3.1. Monocular Metric Depth Estimation
3.2. Challenges and Improvements
3.2.1. Generalizability
3.2.2. Blurriness
4. Results
4.1. Research Trend
4.2. Criteria
4.3. Comparison
4.4. Datasets for MMDE
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| Name | Indoor/Outdoor | Driving Data | Synthetic/Real | Tasks | Data Categories | Relative/Metric | Description |
|---|---|---|---|---|---|---|---|
| Argoverse2 | Outdoor | Yes | Real | Trajectory Prediction, Object Detection, Depth Estimation, Semantic Segmentation, SLAM | RGB, LiDAR, GPS, IMU, 3D BBoxes, Labels | Metric | A large-scale autonomous driving dataset that provides 360-degree LiDAR and stereo data, supporting long-term tracking and motion forecasting. |
| Waymo | Outdoor | Yes | Real | Object Detection, Depth Estimation, Semantic Segmentation, Trajectory Prediction, SLAM | RGB, LiDAR, GPS, IMU, 3D BBoxes, Labels | Metric | A large-scale dataset collected in both urban and highway environments, featuring multi-sensor data for autonomous driving tasks. |
| DrivingStereo | Outdoor | Yes | Real | Stereo Matching, Depth Estimation | High-Resolution Stereo Images, Depth Maps | Metric | A stereo dataset with high-resolution binocular images and ground-truth depth maps, designed for stereo vision and depth estimation. |
| Cityscapes | Outdoor | No | Real | Semantic Segmentation, Instance Segmentation, Depth Estimation | RGB Images, Semantic Labels | Metric | A widely used benchmark of urban street scenes that supports segmentation tasks and provides derived depth information. |
| BDD100K | Outdoor | Yes | Real | Object Detection, Semantic Segmentation, Driving Behavior Prediction, Depth Estimation | RGB, Semantic Labels, Videos, GPS, IMU | Relative | A large-scale driving dataset that covers diverse scenes and supports multiple tasks including object detection and driving behavior analysis. |
| Mapillary Vistas | Outdoor | No | Real | Semantic Segmentation, Instance Segmentation, Depth Estimation | RGB Images, Semantic Labels, Depth Maps | Relative | A large-scale street-level dataset that provides diverse scenes with rich semantic annotations for segmentation and depth estimation. |
| A2D2 | Indoor & Outdoor | Yes | Real | Semantic Segmentation, Object Detection, Depth Estimation, SLAM | RGB, Semantic Labels, LiDAR, IMU, GPS, 3D BBoxes | Metric | A multi-sensor dataset that covers both indoor and outdoor scenes, offering detailed annotations for driving-related tasks. |
| ScanNet | Indoor | No | Real | 3D Reconstruction, Semantic Segmentation, Depth Estimation | RGB-D, Point Clouds, Semantic Labels | Metric | A large-scale dataset of indoor environments that provides RGB-D imagery and 3D point clouds for reconstruction and semantic understanding. |
| Taskonomy | Indoor & Outdoor | No | Real | Multi-Task Learning, Depth Estimation, Semantic Segmentation | RGB, Depth Maps, Normals, Point Clouds | Metric | A dataset designed for multi-task learning, offering a broad set of ground-truth annotations across different vision tasks. |
| SUN-RGBD | Indoor | No | Real | 3D Reconstruction, Semantic Segmentation, Depth Estimation | RGB-D, Point Clouds, Semantic Labels | Metric | A large-scale indoor dataset that provides RGB-D imagery and semantic labels for depth estimation and 3D reconstruction. |
| Diode Indoor | Indoor | No | Real | Depth Estimation, 3D Reconstruction | RGB, LiDAR Depth Maps, Point Clouds | Metric | A high-precision indoor dataset that combines LiDAR depth maps with RGB data for accurate metric depth estimation. |
| IBims-1 | Indoor | No | Real | 3D Reconstruction, Depth Estimation | RGB-D, 3D Reconstruction | Metric | A benchmark dataset that provides high-quality RGB-D data and 3D reconstructions of indoor building environments. |
| VOID | Indoor | No | Real | 3D Reconstruction, Depth Estimation, SLAM | RGB-D, Point Clouds | Metric | A dataset of indoor RGB-D scenes with a focus on occlusion handling for reconstruction and SLAM. |
| HAMMER | Indoor & Outdoor | No | Real | Depth Estimation, 3D Reconstruction, Semantic Segmentation, SLAM | RGB-D, Point Clouds, Semantic Labels | Metric | A dataset covering both indoor and outdoor environments, designed for depth estimation and geometric reconstruction with high accuracy. |
| ETH-3D | Indoor & Outdoor | No | Real | Multi-View Stereo, Depth Estimation, 3D Reconstruction | High-Res RGB, Depth Maps, Point Clouds | Metric | A benchmark for multi-view stereo and 3D reconstruction that provides high-resolution RGB images and precise depth ground truth. |
| nuScenes | Outdoor | Yes | Real | Object Detection, Trajectory Prediction, Depth Estimation, SLAM | RGB, LiDAR, Radar, GPS, IMU, 3D BBoxes, Labels | Metric | A large-scale driving dataset that integrates RGB, LiDAR, and radar sensors to support autonomous navigation tasks. |
| DDAD | Outdoor | Yes | Real | Depth Estimation, Object Detection, 3D Reconstruction, SLAM | RGB-D, Point Clouds | Metric | A driving dataset with dense annotations and multi-sensor inputs, focused on high-quality depth estimation and 3D reconstruction. |
| BlendedMVS | Indoor & Outdoor | No | Synthetic & Real | Multi-View Stereo, Depth Estimation, 3D Reconstruction | RGB, Depth Maps, 3D Reconstruction | Metric | A hybrid dataset that blends real and synthetic imagery to support depth estimation and multi-view stereo benchmarks. |
| DIML | Indoor | No | Real | Stereo Matching, Depth Estimation | RGB Images, Depth Maps | Metric | A multi-view indoor dataset designed for stereo matching and monocular depth estimation. |
| HRWSI | Outdoor | No | Real | Stereo Matching, Depth Estimation | High-Resolution RGB, Depth Maps | Metric | A high-resolution dataset that provides RGB imagery and depth maps of outdoor scenes for stereo and depth tasks. |
| IRS | Indoor | No | Real | 3D Reconstruction, Depth Estimation, SLAM | RGB-D, Point Clouds | Metric | An indoor RGB-D dataset created for reconstruction and depth estimation with emphasis on SLAM evaluation. |
| MegaDepth | Outdoor | No | Real | 3D Reconstruction, Depth Estimation | High-Res RGB, Depth Maps | Relative | A large-scale dataset of outdoor scenes that provides high-resolution imagery with relative depth annotations. |
| TartanAir | Indoor & Outdoor | No | Synthetic | SLAM, Depth Estimation, 3D Reconstruction, Semantic Segmentation | RGB, Depth Maps, Point Clouds, Labels | Metric | A synthetic dataset that offers diverse indoor and outdoor environments for SLAM, reconstruction, and depth learning. |
| Hypersim | Indoor & Outdoor | No | Synthetic | 3D Reconstruction, Scene Understanding, Semantic Segmentation | RGB, Depth Maps, Point Clouds, Labels | Metric | A photorealistic synthetic dataset designed to support 3D reconstruction and semantic scene understanding tasks. |
| vKITTI | Outdoor | Yes | Synthetic | Object Detection, Semantic Segmentation, Depth Estimation, SLAM | Synthetic RGB, Labels, Depth Maps, 3D BBoxes | Metric | A synthetic driving dataset that provides ground-truth labels for object detection, depth estimation, and segmentation tasks. |
| KITTI | Outdoor | Yes | Real | Object Detection, Stereo Matching, Depth Estimation, SLAM | RGB, LiDAR, Depth Maps, Semantic Labels | Metric | A foundational autonomous driving dataset with multi-sensor data, widely used for vision and robotics benchmarks. |
| NYU-D | Indoor | No | Real | Semantic Segmentation, Depth Estimation, 3D Reconstruction | RGB-D, Semantic Labels, Point Clouds | Metric | A benchmark indoor RGB-D dataset that provides semantic labels and depth information for depth estimation and segmentation tasks. |
| Sintel | Outdoor | No | Synthetic | Optical Flow, Depth Estimation, Semantic Segmentation | RGB, Depth Maps, Labels, Optical Flow | Metric | A synthetic dataset that resembles movie scenes, supporting depth estimation, segmentation, and optical flow tasks. |
| ReDWeb | Outdoor | No | Real | Monocular Depth Estimation | RGB Images | Relative | A dataset created from diverse web videos, designed to supervise monocular depth estimation with relative annotations. |
| Movies | Indoor & Outdoor | No | Synthetic & Real | Monocular Depth Estimation, Zero-Shot Learning | RGB Images | Relative | A blended dataset sourced from various movies, used primarily for zero-shot monocular depth estimation. |
| ApolloScape | Outdoor | Yes | Real | Object Detection, Semantic Segmentation, Depth Estimation | RGB, LiDAR | Metric | A driving dataset with dense annotations that supports object detection, semantic segmentation, and depth estimation. |
| WSVD | Indoor & Outdoor | No | Real | Monocular Depth Estimation | RGB Images | Metric | A dataset collected from web videos that emphasizes dynamic scenes and moving objects for depth learning. |
| DIW | Outdoor | No | Real | Monocular Depth Estimation | RGB Images | Relative | A dataset that provides relative depth annotations for diverse outdoor scenes. |
| ETH3D | Indoor & Outdoor | No | Real | Multi-View Stereo, Depth Estimation | High-Res RGB, Videos | Metric | A benchmark dataset for multi-view stereo and depth estimation, containing high-resolution imagery. |
| TUM | Indoor | No | Real | RGB-D SLAM | RGB-D | Metric | An indoor RGB-D dataset widely used for evaluating SLAM algorithms and monocular depth estimation. |
| 3D Ken Burns | Indoor | No | Synthetic | Depth Estimation, Image Animation | Static Images, Depth Maps | Metric | A dataset designed for producing image-based animations with depth information, such as the "Ken Burns" effect. |
| Objaverse | Indoor & Outdoor | No | Synthetic & Real | 3D Object Detection, Classification | 3D Models | Relative | A large-scale dataset of 3D objects designed for recognition, classification, and generative modeling. |
| OmniObject3D | Indoor | No | Synthetic | 3D Object Detection, Recognition, Reconstruction | 3D Models, RGB-D | Metric | A synthetic dataset that provides multi-view images and aligned depth information for 3D object understanding. |
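The Relative/Metric column above determines how predictions can be evaluated: relative-depth ground truth (or a relative-depth prediction compared against metric ground truth) is only defined up to an affine transformation, so a per-image least-squares scale and shift is usually fitted before computing errors, as in MiDaS-style evaluation protocols. The sketch below illustrates that alignment step under the assumption of dense NumPy depth maps and a boolean validity mask; the function and variable names are illustrative and not taken from any of the cited works.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Fit a per-image scale s and shift t minimizing ||s * pred + t - gt||^2
    over valid pixels, then return the aligned prediction (MiDaS-style)."""
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    A = np.stack([p, np.ones_like(p)], axis=1)   # design matrix for [s, t]
    s, t = np.linalg.lstsq(A, g, rcond=None)[0]  # least-squares solution
    return s * pred + t

if __name__ == "__main__":
    # Smoke test with synthetic data: a prediction that is off by an
    # unknown scale and shift should align almost perfectly.
    rng = np.random.default_rng(0)
    gt = rng.uniform(0.5, 10.0, size=(240, 320))             # metric depth in metres
    pred = 0.2 * gt + 0.3 + rng.normal(0.0, 0.01, gt.shape)  # relative prediction
    mask = gt > 0
    aligned = align_scale_shift(pred, gt, mask)
    print("mean absolute error after alignment:", np.abs(aligned - gt)[mask].mean())
```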
References
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Kerbl, B.; Kopanas, G.; Leimkühler, T.; Drettakis, G. 3D gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 2023, 42, 1–14. [Google Scholar] [CrossRef]
- Ye, C.; Nie, Y.; Chang, J.; Chen, Y.; Zhi, Y.; Han, X. Gaustudio: A modular framework for 3D gaussian splatting and beyond. arXiv 2024, arXiv:2403.19632. [Google Scholar] [CrossRef]
- Szeliski, R. Computer Vision: Algorithms and Applications; Springer Nature: Cham, Switzerland, 2022. [Google Scholar]
- Zheng, J.; Lin, C.; Sun, J.; Zhao, Z.; Li, Q.; Shen, C. Physical 3D adversarial attacks against monocular depth estimation in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24452–24461. [Google Scholar]
- Leduc, A.; Cioppa, A.; Giancola, S.; Ghanem, B.; Van Droogenbroeck, M. SoccerNet-Depth: A scalable dataset for monocular depth estimation in sports videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3280–3292. [Google Scholar]
- Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 3836–3847. [Google Scholar]
- Khan, N.; Xiao, L.; Lanman, D. Tiled multiplane images for practical 3D photography. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 10454–10464. [Google Scholar]
- Liew, J.H.; Yan, H.; Zhang, J.; Xu, Z.; Feng, J. Magicedit: High-fidelity and temporally coherent video editing. arXiv 2023, arXiv:2308.14749. [Google Scholar]
- Xu, D.; Jiang, Y.; Wang, P.; Fan, Z.; Wang, Y.; Wang, Z. Neurallift-360: Lifting an in-the-wild 2D photo to a 3D object with 360deg views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4479–4489. [Google Scholar]
- Shahbazi, M.; Claessens, L.; Niemeyer, M.; Collins, E.; Tonioni, A.; Van Gool, L.; Tombari, F. Inserf: Text-driven generative object insertion in neural 3D scenes. arXiv 2024, arXiv:2401.05335. [Google Scholar]
- Shriram, J.; Trevithick, A.; Liu, L.; Ramamoorthi, R. Realmdreamer: Text-driven 3D scene generation with inpainting and depth diffusion. arXiv 2024, arXiv:2404.07199. [Google Scholar]
- Deng, J.; Yin, W.; Guo, X.; Zhang, Q.; Hu, X.; Ren, W.; Long, X.X.; Tan, P. Boost 3D reconstruction using diffusion-based monocular camera calibration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7110–7121. [Google Scholar]
- Guo, J.; Ding, Y.; Chen, X.; Chen, S.; Li, B.; Zou, Y.; Lyu, X.; Tan, F.; Qi, X.; Li, Z.; et al. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4D driving scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 27231–27241. [Google Scholar]
- Daxberger, E.; Wenzel, N.; Griffiths, D.; Gang, H.; Lazarow, J.; Kohavi, G.; Kang, K.; Eichner, M.; Yang, Y.; Dehghan, A.; et al. Mm-spatial: Exploring 3D spatial understanding in multimodal llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7395–7408. [Google Scholar]
- Yu, Y.; Liu, S.; Pautrat, R.; Pollefeys, M.; Larsson, V. Relative pose estimation through affine corrections of monocular depth priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 16706–16716. [Google Scholar]
- Kim, H.; Baik, S.; Joo, H. DAViD: Modeling dynamic affordance of 3D objects using pre-trained video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 10330–10341. [Google Scholar]
- Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv 2023, arXiv:2302.12288. [Google Scholar]
- Bochkovskiy, A.; Delaunoy, A.; Germain, H.; Santos, M.; Zhou, Y.; Richter, S.; Koltun, V. Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
- Saxena, S.; Hur, J.; Herrmann, C.; Sun, D.; Fleet, D.J. Zero-shot metric depth with a field-of-view conditioned diffusion model. arXiv 2023, arXiv:2312.13252. [Google Scholar]
- Yang, L.; Kang, B.; Huang, Z.; Xu, X.; Feng, J.; Zhao, H. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10371–10381. [Google Scholar]
- Yang, L.; Kang, B.; Huang, Z.; Zhao, Z.; Xu, X.; Feng, J.; Zhao, H. Depth anything v2. In Advances in Neural Information Processing Systems 37; Curran Associates, Inc.: New York, NY, USA, 2024; pp. 21875–21911. [Google Scholar]
- Guo, Y.; Garg, S.; Miangoleh, S.M.H.; Huang, X.; Ren, L. Depth any camera: Zero-shot metric depth estimation from any camera. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 26996–27006. [Google Scholar]
- Bhoi, A. Monocular depth estimation: A survey. arXiv 2019, arXiv:1901.09402. [Google Scholar] [CrossRef]
- Khan, F.; Salahuddin, S.; Javidnia, H. Deep learning-based monocular depth estimation methods—A state-of-the-art review. Sensors 2020, 20, 2272. [Google Scholar] [CrossRef]
- Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1612–1627. [Google Scholar] [CrossRef]
- Ruan, X.; Yan, W.; Huang, J.; Guo, P.; Guo, W. Monocular depth estimation based on deep learning: A survey. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; IEEE: New York, NY, USA, 2021; pp. 2436–2440. [Google Scholar]
- Lahiri, S.; Ren, J.; Lin, X. Deep learning-based stereopsis and monocular depth estimation techniques: A review. Vehicles 2024, 6, 305–351. [Google Scholar] [CrossRef]
- Tosi, F.; Ramirez, P.Z.; Poggi, M. Diffusion models for monocular depth estimation: Overcoming challenging conditions. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 236–257. [Google Scholar]
- Vyas, P.; Saxena, C.; Badapanda, A.; Goswami, A. Outdoor monocular depth estimation: A research review. arXiv 2022, arXiv:2205.01399. [Google Scholar] [CrossRef]
- Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards real-time monocular depth estimation for robotics: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961. [Google Scholar] [CrossRef]
- Masoumian, A.; Rashwan, H.A.; Cristiano, J.; Asif, M.S.; Puig, D. Monocular depth estimation using deep learning: A review. Sensors 2022, 22, 5353. [Google Scholar] [CrossRef]
- Arampatzakis, V.; Pavlidis, G.; Mitianoudis, N.; Papamarkos, N. Monocular depth estimation: A thorough review. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2396–2414. [Google Scholar] [CrossRef]
- Rajapaksha, U.; Sohel, F.; Laga, H.; Diepeveen, D.; Bennamoun, M. Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey. ACM Comput. Surv. 2024, 56, 1–51. [Google Scholar] [CrossRef]
- Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9492–9502. [Google Scholar]
- Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018. [Google Scholar]
- Miangoleh, S.M.H.; Dille, S.; Mai, L.; Paris, S.; Aksoy, Y. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9685–9694. [Google Scholar]
- Jampani, V.; Chang, H.; Sargent, K.; Kar, A.; Tucker, R.; Krainin, M.; Liu, C. Slide: Single image 3D photography with soft layering and depth-aware inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12518–12527. [Google Scholar]
- Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al. Kinectfusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 559–568. [Google Scholar]
- Foix, S.; Alenya, G.; Torras, C. Lock-in time-of-flight (ToF) cameras: A survey. IEEE Sens. J. 2011, 11, 1917–1926. [Google Scholar] [CrossRef]
- Sarbolandi, H.; Lefloch, D.; Kolb, A. Kinect range sensing: Structured-light versus Time-of-Flight Kinect. Comput. Vis. Image Underst. 2015, 139, 1–20. [Google Scholar] [CrossRef]
- He, Y.; Liang, B.; Zou, Y.; He, J.; Yang, J. Depth errors analysis and correction for time-of-flight (ToF) cameras. Sensors 2017, 17, 92. [Google Scholar] [CrossRef]
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef]
- Kok, K.Y.; Rajendran, P. A review on stereo vision algorithm: Challenges and solutions. ECTI Trans. Comput. Inf. Technol. (ECTI-CIT) 2019, 13, 112–128. [Google Scholar] [CrossRef]
- Wofk, D.; Ranftl, R.; Müller, M.; Koltun, V. Monocular visual-inertial depth estimation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 6095–6101. [Google Scholar]
- Singh, A.D.; Ba, Y.; Sarker, A.; Zhang, H.; Kadambi, A.; Soatto, S.; Srivastava, M.; Wong, A. Depth estimation from camera image and mmwave radar point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9275–9285. [Google Scholar]
- Huang, J.; Zhou, Q.; Rabeti, H.; Korovko, A.; Ling, H.; Ren, X.; Shen, T.; Gao, J.; Slepichev, D.; Lin, C.H.; et al. Vipe: Video pose engine for 3D geometric perception. arXiv 2025, arXiv:2508.10934. [Google Scholar]
- Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 740–756. [Google Scholar]
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12179–12188. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27; Curran Associates, Inc.: New York, NY, USA, 2014. [Google Scholar]
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
- Birkl, R.; Wofk, D.; Müller, M. MiDaS v3.1—A model zoo for robust monocular relative depth estimation. arXiv 2023, arXiv:2307.14460. [Google Scholar]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
- Yin, W.; Zhang, C.; Chen, H.; Cai, Z.; Yu, G.; Wang, K.; Shen, C. Metric3d: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9043–9053. [Google Scholar]
- Guizilini, V.; Vasiljevic, I.; Chen, D.; Ambruș, R.; Gaidon, A. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 9233–9243. [Google Scholar]
- Spencer, J.; Russell, C.; Hadfield, S.; Bowden, R. Kick back & relax++: Scaling beyond ground-truth depth with SlowTV & CribsTV. arXiv 2024, arXiv:2403.01569. [Google Scholar]
- Bhat, S.F.; Alhashim, I.; Wonka, P. Localbins: Improving depth estimation by learning local distributions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 480–496. [Google Scholar]
- Li, Z.; Wang, X.; Liu, X.; Jiang, J. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Trans. Image Process. 2024, 33, 3964–3976. [Google Scholar] [CrossRef]
- Yuan, W.; Gu, X.; Dai, Z.; Zhu, S.; Tan, P. NeW CRFs: Neural window fully-connected CRFs for monocular depth estimation. arXiv 2022, arXiv:2203.01502. [Google Scholar]
- Spencer, J.; Tosi, F.; Poggi, M.; Arora, R.S.; Russell, C.; Hadfield, S.; Elder, J.H. The third monocular depth estimation challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 1–14. [Google Scholar]
- Marsal, R.; Chabot, F.; Loesch, A.; Grolleau, W.; Sahbi, H. MonoProb: Self-supervised monocular depth estimation with interpretable uncertainty. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 3637–3646. [Google Scholar]
- Haji-Esmaeili, M.M.; Montazer, G. Large-scale monocular depth estimation in the wild. Eng. Appl. Artif. Intell. 2024, 127, 107189. [Google Scholar] [CrossRef]
- Shao, S.; Pei, Z.; Chen, W.; Sun, D.; Chen, P.C.; Li, Z. MonoDiffusion: Self-supervised monocular depth estimation using diffusion model. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 3664–3678. [Google Scholar] [CrossRef]
- Wang, Y.; Liang, Y.; Xu, H.; Jiao, S.; Yu, H. Sqldepth: Generalizable self-supervised fine-structured monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5713–5721. [Google Scholar]
- Piccinelli, L.; Yang, Y.H.; Sakaridis, C.; Segu, M.; Li, S.; Van Gool, L.; Yu, F. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10106–10116. [Google Scholar]
- Sun, B.; Jin, M.; Yin, B.; Hou, Q. Depth Anything at Any Condition. arXiv 2025, arXiv:2507.01634. [Google Scholar] [CrossRef]
- Lin, H.; Peng, S.; Chen, J.; Peng, S.; Sun, J.; Liu, M.; Bao, H.; Feng, J.; Zhou, X.; Kang, B. Prompting depth anything for 4k resolution accurate metric depth estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 17070–17080. [Google Scholar]
- Wang, R.; Xu, S.; Dai, C.; Xiang, J.; Deng, Y.; Tong, X.; Yang, J. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5261–5271. [Google Scholar]
- Zhang, Z.; Yang, L.; Yang, T.; Yu, C.; Guo, X.; Lao, Y.; Zhao, H. StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 7069–7078. [Google Scholar]
- Wang, Z.; Chen, S.; Yang, L.; Wang, J.; Zhang, Z.; Zhao, H.; Zhao, Z. Depth Anything with Any Prior. arXiv 2025, arXiv:2505.10565. [Google Scholar] [CrossRef]
- Wang, Y.; Li, J.; Hong, C.; Li, R.; Sun, L.; Song, X.; Wang, Z.; Cao, Z.; Lin, G. TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-stage Fusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 10523–10533. [Google Scholar]
- Li, Z.; Bhat, S.F.; Wonka, P. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 10016–10025. [Google Scholar]
- Li, Z.; Bhat, S.F.; Wonka, P. PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 250–267. [Google Scholar]
- Duan, Y.; Guo, X.; Zhu, Z. Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 432–449. [Google Scholar]
- Zavadski, D.; Kalšan, D.; Rother, C. Primedepth: Efficient monocular depth estimation with a stable diffusion preimage. In Proceedings of the Asian Conference on Computer Vision, Hanoi, Vietnam, 8–12 December 2024; pp. 922–940. [Google Scholar]
- Patni, S.; Agarwal, A.; Arora, C. Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28285–28295. [Google Scholar]
- Fu, X.; Yin, W.; Hu, M.; Wang, K.; Ma, Y.; Tan, P.; Long, X. Geowizard: Unleashing the diffusion priors for 3D geometry estimation from a single image. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 241–258. [Google Scholar]
- Pham, D.H.; Do, T.; Nguyen, P.; Hua, B.S.; Nguyen, K.; Nguyen, R. Sharpdepth: Sharpening metric depth predictions using diffusion distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 17060–17069. [Google Scholar]
- Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]
- Hu, M.; Yin, W.; Zhang, C.; Cai, Z.; Long, X.; Chen, H.; Wang, K.; Yu, G.; Shen, C.; Shen, S. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10579–10596. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; Novotny, D. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5294–5306. [Google Scholar]



| Method | Publication | Category | Inference | Dataset | Output | Source |
|---|---|---|---|---|---|---|
| ZoeDepth [18] | arXiv | discriminative | single | real | metric | open |
| Depth Anything [21] | CVPR ’24 | discriminative | single | real | metric | open |
| PatchFusion [75] | CVPR ’24 | discriminative | multiple (patching) | real | metric | open |
| UniDepth [68] | CVPR ’24 | discriminative | single | real | metric | open |
| Marigold [35] | CVPR ’24 | generative | multiple (diffusion) | synthetic | relative | open |
| DMD [20] | arXiv | generative | multiple (diffusion) | real | metric | closed |
| Depth Anything v2 [22] | NeurIPS ’24 | discriminative | single | real + synthetic | metric | open |
| GeoWizard [80] | ECCV ’24 | generative | multiple (diffusion) | real + synthetic | relative | open |
| PatchRefiner [76] | ECCV ’24 | discriminative | multiple (patching) | real + synthetic | metric | open |
| Depth Pro [19] | ICLR ’25 | discriminative | multiple (patching) | real + synthetic | metric | open |
| Depth Anything AC [69] | arXiv | discriminative | multiple (patching) | real + synthetic | metric | open |
| MoGe [71] | CVPR ’25 | discriminative | single | real + synthetic | metric | open |
| Depth Any Camera (DAC) [23] | CVPR ’25 | discriminative | single | real | metric | open |
| SharpDepth [81] | CVPR ’25 | generative | single | real | metric | open |
| StableDepth [72] | ICCV ’25 | discriminative | single | real + synthetic | metric | open |
| Prompting DA [70] | CVPR ’25 | disc+condition | single | real + sensor | metric | open |
| TacoDepth [74] | CVPR ’25 | disc+condition | single | real + sensor | metric | open |
| Prior DA [73] | arXiv | disc+condition | single | real + sensor | metric | open |
| Method | Booster ↑ (Indoor) | ETH3D ↑ (Outdoor) | Middlebury ↑ (Outdoor) | NuScenes ↑ (Outdoor) | Sintel ↑ (Outdoor) | Sun-RGBD ↑ (Indoor) | NYU v2 ↓ (Indoor) | KITTI ↓ (Outdoor) |
|---|---|---|---|---|---|---|---|---|
| DepthAnything [21] | 52.3 | 9.3 | 39.3 | 35.4 | 6.9 | 85.0 | 4.3 | 7.6 |
| DepthAnything V2 [22] | 59.5 | 36.3 | 37.2 | 17.7 | 5.9 | 72.4 | 4.4 | 7.4 |
| Metric3D [57] | 4.7 | 34.2 | 13.6 | 64.4 | 17.3 | 16.9 | 8.3 | 5.8 |
| Metric3D v2 [83] | 39.4 | 87.7 | 29.9 | 82.6 | 38.3 | 75.6 | 4.5 | 3.9 |
| PatchFusion [75] | 22.6 | 51.8 | 49.9 | 20.4 | 14.0 | 53.6 | - | - |
| UniDepth [68] | 27.6 | 25.3 | 31.9 | 83.6 | 16.5 | 95.8 | 5.78 | 4.2 |
| ZeroDepth [58] | - | - | 46.5 | 64.3 | 12.9 | - | 8.4 | 10.5 |
| ZoeDepth [18] | 21.6 | 34.2 | 53.8 | 28.1 | 7.8 | 85.7 | 7.7 | 5.7 |
| Depth Pro [19] | 46.6 | 41.5 | 60.5 | 49.1 | 40.0 | 89.0 | - | - |
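In the table above, ↑ marks accuracy-style scores (higher is better) and ↓ marks error-style scores (lower is better). These benchmarks conventionally report the δ1 inlier ratio (share of pixels whose predicted and ground-truth depths agree within a factor of 1.25) and the absolute relative error (AbsRel), both in percent; the exact metric behind each column is not restated here, so the sketch below should be read as a generic reference implementation of those two standard measures rather than the evaluation code used for this table.

```python
import numpy as np

def depth_metrics(pred, gt, mask):
    """Compute the delta_1 inlier ratio (higher is better) and the absolute
    relative error AbsRel (lower is better) over valid pixels, in percent."""
    p = pred[mask].astype(np.float64)
    g = gt[mask].astype(np.float64)
    ratio = np.maximum(p / g, g / p)                     # symmetric depth ratio
    delta1 = float((ratio < 1.25).mean() * 100.0)        # % of pixels within 1.25x
    abs_rel = float((np.abs(p - g) / g).mean() * 100.0)  # mean |d - d*| / d* in %
    return {"delta1": delta1, "abs_rel": abs_rel}
```

For metric models these measures are computed on the raw predictions; relative-depth outputs are usually aligned first with the per-image scale-and-shift fit sketched after the dataset table above.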
| Method Type | Strengths | Weaknesses | Opportunities/Challenges |
|---|---|---|---|
| Single-pass | Efficient real-time inference; low memory footprint; simple pipeline | Loss of fine structural details; limited ability to model complex geometries; performance heavily depends on large-scale annotated datasets | Opportunity to combine with patch- or diffusion-based refinements; challenge to maintain high-frequency details while preserving efficiency |
| Patching | Improved edge and structure preservation; scalable to high-resolution inputs; supports localized refinement | Increased inference time; cross-patch consistency issues; diminishing returns with excessive patch division | Opportunity to optimize fusion strategies; challenge to balance resolution, consistency, and computation for real-time deployment |
| Generative diffusion | Captures complex geometries and fine structures; reduced reliance on dense supervision; strong zero-shot adaptability | High computational cost; iterative inference slows deployment; limited metric depth progress in some frameworks | Opportunity for accelerated diffusion and hybrid architectures; challenge to enforce geometric consistency and achieve real-time performance |
| Conditioned sensor-assisted | Leverages priors and additional modalities to enhance depth accuracy; improved robustness in difficult scenes | Dependency on sensor quality or prompt design; limited generalization without carefully curated conditions | Opportunity for unified perception pipelines in robotics and autonomous systems; challenge to balance additional input modalities with model complexity |
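To make the patching row concrete, the following sketch outlines the generic tile-and-fuse strategy behind methods such as PatchFusion and PatchRefiner: a coarse global prediction anchors the scale of fine per-tile predictions, and overlapping tiles are blended with a feathered window to suppress seams. This is a conceptual illustration rather than any published implementation; `model` is a placeholder for an arbitrary single-image depth network, and the tile size, overlap, and blending window are assumptions.

```python
import numpy as np

def tiled_depth(image, model, tile=384, overlap=128):
    """Tile-and-fuse inference sketch: predict depth for overlapping crops,
    align each crop to a coarse global prediction with a per-tile scale and
    shift, and blend the crops with a feathered (Hann) window."""
    H, W = image.shape[:2]
    coarse = model(image)                       # low-detail global prediction, (H, W)
    out = np.zeros((H, W), dtype=np.float64)
    weight = np.zeros((H, W), dtype=np.float64)
    step = tile - overlap
    win = np.outer(np.hanning(tile) + 1e-3, np.hanning(tile) + 1e-3)

    def starts(length):
        # Tile start positions that cover the full extent, including the far edge.
        if length <= tile:
            return [0]
        return list(range(0, length - tile, step)) + [length - tile]

    for y in starts(H):
        for x in starts(W):
            crop = image[y:y + tile, x:x + tile]
            d = model(crop)                     # fine-detail tile prediction
            h, w = d.shape
            ref = coarse[y:y + h, x:x + w]
            # Per-tile least-squares scale/shift towards the global prediction
            # keeps neighbouring tiles mutually consistent at the seams.
            A = np.stack([d.ravel(), np.ones(d.size)], axis=1)
            s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
            out[y:y + h, x:x + w] += (s * d + t) * win[:h, :w]
            weight[y:y + h, x:x + w] += win[:h, :w]
    return out / np.maximum(weight, 1e-6)
```

A trivial stand-in such as `model = lambda img: img[..., 0].astype(float)` is enough to smoke-test the tiling and blending logic before plugging in a real network.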
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
