MonoLENS: Monocular Lightweight Efficient Network with Separable Convolutions for Self-Supervised Monocular Depth Estimation
Abstract
1. Introduction
- We introduce MonoLENS, a novel hybrid architecture for lightweight, self-supervised monocular depth estimation, and demonstrate its effectiveness through substantial reductions in both model parameters and FLOPs compared with baseline architectures (a minimal sketch of the separable-convolution building block behind these savings follows this list).
- We evaluate the inference time of the proposed method on the NVIDIA Jetson Orin Nano platform, demonstrating a favorable trade-off between model complexity and inference speed.
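To make the source of these savings concrete, here is a minimal PyTorch sketch of a depthwise separable convolution, the building block popularized by Xception [11] and MobileNets [44] on which MonoLENS builds; the module name, channel width, and layer layout are illustrative, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Illustrative depthwise separable convolution (not the exact
    MonoLENS block): a per-channel k x k filter followed by a 1 x 1
    channel-mixing convolution."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(n_params(nn.Conv2d(128, 128, 3, padding=1)))   # 147,584
print(n_params(DepthwiseSeparableConv(128, 128)))    # 17,792
# Roughly a (1/out_ch + 1/k^2) fraction of the standard conv's cost,
# which is where the parameter and FLOP reductions come from.
```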
2. Related Work
2.1. Deep Learning–Based Monocular Depth Estimation
2.2. Supervised Monocular Depth Estimation
2.3. Self-Supervised Monocular Depth Estimation
2.4. Lightweight Models for Depth Estimation
3. Proposed Method
3.1. Design Motivation
3.1.1. Decoder Efficiency: The DS-Upsampling Block
3.1.2. Skip Connection Refinement: The MCACoder
3.1.3. Encoder Optimization: Fourth Stage Compression
3.2. Model Architecture
3.3. Encoder
3.4. Decoder
3.5. PoseNet
3.6. Self-Supervised Learning
4. Experiments
4.1. Dataset
4.1.1. KITTI Dataset
4.1.2. Cityscapes Dataset
4.2. Implementation Details
4.3. KITTI Results
4.4. Cityscapes Results
4.5. Complexity and Speed Evaluation
4.6. Ablation Study
4.6.1. Ablation Study on the DS-Upsampling Block
4.6.2. Ablation Study on the Number of Blocks in the Encoder
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Vu, H.H.; Labatut, P.; Pons, J.P.; Keriven, R. High Accuracy and Visibility-Consistent Dense Multiview Stereo. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 889–901. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep Learning for Lidar Point Clouds in Autonomous Driving: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Liu, H.; Li, Y.; Zhao, S.; Yang, Y. Building BIM Modeling Based on Multi-Source Laser Point Cloud Fusion. J. Geogr. Inf. Sci. 2021, 23, 763–772. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374. [Google Scholar]
- Fanello, S.R.; Keskin, C.; Izadi, S.; Kohli, P.; Kim, D.; Sweeney, D.; Criminisi, A.; Shotton, J.; Kang, S.B.; Paek, T. Learning to Be a Depth Camera for Close-Range Human Capture and Interaction. ACM Trans. Graph. 2014, 33, 1–11. [Google Scholar] [CrossRef]
- Wei, Y.; Liu, S.; Rao, Y.; Zhao, W.; Lu, J.; Zhou, J. NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-View Stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5610–5619. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
- Liu, S.; Yang, L.T.; Tu, X.; Li, R.; Xu, C. Lightweight Monocular Depth Estimation on Edge Devices. IEEE Internet Things J. 2022, 9, 16168–16180. [Google Scholar] [CrossRef]
- Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18537–18546. [Google Scholar]
- Higashiuchi, G.; Shimada, T.; Kong, X.; Tomiyama, H. Efficient Monocular Depth Estimation Using Depthwise Separable Convolutions and Multidimensional Cooperative Attention (in Japanese). In Proceedings of the Forum on Information Technology, Information Processing Society of Japan, Hokkaido, Japan, 3–5 September 2025. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Yu, Y.; Zhang, Y.; Cheng, Z.; Song, Z.; Tang, C. MCA: Multidimensional Collaborative Attention in Deep Convolutional Neural Networks for Image Recognition. Eng. Appl. Artif. Intell. 2023, 126, 107079. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Johnston, A.; Carneiro, G. Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4756–4765. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation. arXiv 2019, arXiv:1907.10326. [Google Scholar]
- Bauer, Z.; Li, Z.; Orts-Escolano, S.; Cazorla, M.; Pollefeys, M.; Oswald, M.R. NVS-MonoDepth: Improving Monocular Depth Prediction with Novel View Synthesis. In Proceedings of the International Conference on 3D Vision, London, UK, 1–3 December 2021; pp. 848–858. [Google Scholar]
- Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
- Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756. [Google Scholar]
- Sun, J.; Zheng, N.N.; Shum, H.Y. Stereo Matching Using Belief Propagation. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 787–800. [Google Scholar] [CrossRef]
- Poggi, M.; Tosi, F.; Mattoccia, S. Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions. In Proceedings of the International Conference on 3D Vision, Verona, Italy, 5–8 September 2018; pp. 324–333. [Google Scholar]
- Watson, J.; Firman, M.; Brostow, G.J.; Turmukhambetov, D. Self-Supervised Monocular Depth Hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2162–2171. [Google Scholar]
- Gonzalez Bello, J.L.; Kim, M. Forget About the LiDAR: Self-Supervised Depth Estimators with MED Probability Volumes. Adv. Neural Inf. Process. Syst. 2020, 33, 12626–12637. [Google Scholar]
- Zhu, S.; Brazil, G.; Liu, X. The Edge of Depth: Explicit Constraints between Segmentation and Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13116–13125. [Google Scholar]
- Peng, R.; Wang, R.; Lai, Y.; Tang, L.; Cai, Y. Excavating the Potential Capacity of Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15560–15569. [Google Scholar]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
- Shu, C.; Yu, K.; Duan, Z.; Yang, K. Feature-Metric Loss for Self-Supervised Learning of Depth and Egomotion. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 572–588. [Google Scholar]
- Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 2294–2301. [Google Scholar]
- Yin, W.; Liu, Y.; Shen, C.; Yan, Y. Enforcing Geometric Constraints of Virtual Normal for Depth Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5684–5693. [Google Scholar]
- Wofk, D.; Ma, F.; Yang, T.J.; Karaman, S.; Sze, V. FastDepth: Fast Monocular Depth Estimation on Embedded Systems. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar]
- Nekrasov, V.; Dharmasiri, T.; Spek, A.; Drummond, T.; Shen, C.; Reid, I. Real-Time Joint Semantic Segmentation and Depth Estimation Using Symmetric Annotations. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 7101–7107. [Google Scholar]
- Hu, J.; Fan, C.; Jiang, H.; Guo, X.; Gao, Y.; Lu, X.; Lam, T.L. Boosting Lightweight Depth Estimation via Knowledge Distillation. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Guangzhou, China, 16–18 August 2023; pp. 27–39. [Google Scholar]
- Sheng, F.; Xue, F.; Chang, Y.; Liang, W.; Ming, A. Monocular Depth Distribution Alignment with Low Computation. In Proceedings of the International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022; pp. 6548–6555. [Google Scholar]
- Zhou, Z.; Fan, X.; Shi, P.; Xin, Y. R-MSFM: Recurrent Multi-Scale Feature Modulation for Monocular Depth Estimating. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12777–12786. [Google Scholar]
- Zhang, Z.; Xu, C.; Yang, J.; Gao, J.; Cui, Z. Progressive Hard-Mining Network for Monocular Depth Estimation. IEEE Trans. Image Process. 2018, 27, 3691–3702. [Google Scholar] [CrossRef] [PubMed]
- Hoang, V.T.; Jo, K.H. PyDNet: An Efficient CNN Architecture with Pyramid Depthwise Convolution Kernels. In Proceedings of the International Conference on System Science and Engineering, Dong Hoi, Vietnam, 20–21 July 2019; pp. 154–158. [Google Scholar]
- Varma, A.; Chawla, H.; Zonooz, B.; Arani, E. Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics. arXiv 2022, arXiv:2202.03131. [Google Scholar]
- Zhao, C.; Zhang, Y.; Poggi, M.; Tosi, F.; Guo, X.; Zhu, Z.; Huang, G.; Tang, Y.; Mattoccia, S. MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer. In Proceedings of the International Conference on 3D Vision, Prague, Czech Republic, 12–15 September 2022; pp. 668–678. [Google Scholar]
- Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-Path Vision Transformer for Dense Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 7287–7296. [Google Scholar]
- Zhang, G.; Tang, X.; Wang, L.; Cui, H.; Fei, T.; Tang, H.; Jiang, S. RepMono: A Lightweight Self-Supervised Monocular Depth Estimation Architecture for High-Speed Inference. Complex Intell. Syst. 2024, 10, 7927–7941. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Zhang, Z.; Wang, Y.; Huang, Z.; Luo, G.; Yu, G.; Fu, B. A Simple Baseline for Fast and Accurate Depth Estimation on Mobile Devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2466–2471. [Google Scholar]
- Liu, Z.; Wang, Q. Edge-Enhanced Dual-Stream Perception Network for Monocular Depth Estimation. Electronics 2024, 13, 1652. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
- Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. LEGO: Learning Edge with Geometry All at Once by Watching Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 225–234. [Google Scholar]
- Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992. [Google Scholar]
- Watson, J.; Mac Aodha, O.; Prisacariu, V.; Brostow, G.; Firman, M. The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1164–1174. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning Depth from Monocular Videos Using Direct Methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030. [Google Scholar]
- Luo, C.; Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R.; Yuille, A. Every Pixel Counts++: Joint Learning of Geometry and Motion with 3D Holistic Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2624–2641. [Google Scholar] [CrossRef] [PubMed]
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth Prediction without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Volume 33, pp. 8001–8008. [Google Scholar]
- Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 582–600. [Google Scholar]
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Unsupervised Monocular Depth and Ego-Motion Learning with Structure and Semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Pilzer, A.; Xu, D.; Puscas, M.; Ricci, E.; Sebe, N. Unsupervised Adversarial Depth Estimation Using Cycled Generative Networks. In Proceedings of the International Conference on 3D Vision, Verona, Italy, 5–8 September 2018; pp. 587–595. [Google Scholar]
- Gordon, A.; Li, H.; Jonschkowski, R.; Angelova, A. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8977–8986. [Google Scholar]
Data: M = monocular video; Se = semantic segmentation labels. ↓ = lower is better; ↑ = higher is better.

Method | Data | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Params (M) |
---|---|---|---|---|---|---|---|---|---|
GeoNet [50] | M | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 | 31.6 |
DDVO [54] | M | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 | 28.1 |
Monodepth [21] | M | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 | 20.2 |
EPC++ [55] | M | 0.141 | 1.029 | 5.350 | 0.216 | 0.816 | 0.941 | 0.976 | 33.2 |
Struct2depth [56] | M | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 | 31.6 |
Monodepth2-Res18 [7] | M | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 | 14.3 |
Monodepth2-Res50 [7] | M | 0.110 | 0.831 | 4.642 | 0.187 | 0.883 | 0.962 | 0.982 | 32.5 |
SGDepth [57] | M + Se | 0.113 | 0.835 | 4.693 | 0.191 | 0.879 | 0.961 | 0.981 | 16.3 |
Johnston et al. [15] | M | 0.111 | 0.941 | 4.817 | 0.189 | 0.885 | 0.961 | 0.981 | 14.3+ |
Lite-HR-Depth [31] | M | 0.116 | 0.845 | 4.841 | 0.190 | 0.866 | 0.957 | 0.982 | 3.1 |
R-MSFM3 [37] | M | 0.114 | 0.815 | 4.712 | 0.193 | 0.876 | 0.959 | 0.981 | 3.5 |
R-MSFM6 [37] | M | 0.112 | 0.806 | 4.704 | 0.191 | 0.878 | 0.960 | 0.981 | 3.8 |
Lite-Mono [9] | M | 0.109 | 0.872 | 4.712 | 0.187 | 0.885 | 0.961 | 0.982 | 3.1 |
Lite-Mono-small [9] | M | 0.112 | 0.896 | 4.797 | 0.189 | 0.879 | 0.960 | 0.981 | 2.5 |
MonoLENS (ours) | M | 0.110 | 0.833 | 4.644 | 0.185 | 0.883 | 0.962 | 0.982 | 1.8 |
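For reference, the error and accuracy columns in this and the following table are the standard monocular-depth metrics of Eigen et al. [4]. A minimal NumPy implementation, assuming `gt` and `pred` are matched 1-D arrays of valid, positive depths in metres, is:

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """Standard depth-estimation metrics (Eigen et al. [4])."""
    thresh = np.maximum(gt / pred, pred / gt)   # per-pixel ratio error
    return {
        "abs_rel":  np.mean(np.abs(gt - pred) / gt),
        "sq_rel":   np.mean((gt - pred) ** 2 / gt),
        "rmse":     np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        # Accuracy: fraction of pixels whose ratio error is below a threshold.
        "delta1": np.mean(thresh < 1.25),
        "delta2": np.mean(thresh < 1.25 ** 2),
        "delta3": np.mean(thresh < 1.25 ** 3),
    }
```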
Method | Data | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
---|---|---|---|---|---|---|---|---|
Struct2Depth 2 [58] | M + Se | 0.145 | 1.737 | 7.280 | 0.205 | 0.813 | 0.942 | 0.976 |
Pilzer et al. [59] | M | 0.240 | 4.264 | 8.049 | 0.334 | 0.710 | 0.871 | 0.937 |
Monodepth2 [7] | M | 0.129 | 1.569 | 6.876 | 0.187 | 0.849 | 0.957 | 0.983 |
Videos in the Wild [60] | M + Se | 0.127 | 1.330 | 6.960 | 0.195 | 0.830 | 0.947 | 0.981 |
Lite-Mono [9] | M | 0.121 | 1.475 | 6.732 | 0.181 | 0.866 | 0.961 | 0.985 |
MonoLENS (ours) | M | 0.123 | 1.506 | 6.642 | 0.181 | 0.870 | 0.962 | 0.985 |
Method | FLOPs Total [G] | FLOPs Encoder [G] | FLOPs Decoder [G] | FPS | Inference Time Batch 2 [ms] | Inference Time Batch 6 [ms] |
---|---|---|---|---|---|---|
R-MSFM3 [37] | 16.468 | 2.449 | 14.020 | 36.3 | 43.1 | 127.7 |
Lite-Mono-small [9] | 4.746 | 4.028 | 0.718 | 37.6 | 42.7 | 127.2 |
Lite-Mono [9] | 5.032 | 4.314 | 0.718 | 36.5 | 44.7 | 134.2 |
MonoLENS (ours) | 4.008 | 3.743 | 0.266 | 38.8 | 42.6 | 124.1 |
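The exact Jetson Orin Nano benchmarking protocol is not reproduced here, but FPS and per-batch latency figures like those above are typically obtained with a timing loop of the following shape. The 640×192 input resolution, warm-up length, and iteration count are assumptions, not values from the paper.

```python
import time
import torch

@torch.no_grad()
def mean_latency_ms(model: torch.nn.Module, batch: int,
                    iters: int = 100, warmup: int = 20,
                    hw: tuple = (192, 640)) -> float:
    """Hypothetical timing loop; returns mean per-forward latency in ms."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch, 3, *hw, device=device)
    for _ in range(warmup):            # warm up clocks, caches, cuDNN autotune
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()       # CUDA launches are async; sync first
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

# FPS at a given batch size is then batch / (latency_ms / 1000).
```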
| Depthwise Separable | MCACoder | Reduced Encoder Blocks | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | FLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | 0.109 | 0.872 | 4.712 | 0.187 | 0.885 | 0.961 | 0.982 | 5.032 | 3.069 |
| ✓ | | | 0.110 | 0.868 | 4.722 | 0.186 | 0.885 | 0.961 | 0.982 | 4.559 | 2.965 |
| | ✓ | | 0.110 | 0.859 | 4.694 | 0.187 | 0.883 | 0.961 | 0.982 | 5.033 | 3.069 |
| | | ✓ | 0.110 | 0.821 | 4.664 | 0.186 | 0.881 | 0.961 | 0.982 | 4.461 | 1.875 |
| ✓ | ✓ | | 0.109 | 0.860 | 4.700 | 0.185 | 0.887 | 0.962 | 0.982 | 4.579 | 2.970 |
| ✓ | | ✓ | 0.111 | 0.854 | 4.699 | 0.188 | 0.880 | 0.960 | 0.982 | 3.988 | 1.772 |
| | ✓ | ✓ | 0.110 | 0.835 | 4.699 | 0.187 | 0.883 | 0.961 | 0.982 | 4.462 | 1.875 |
| ✓ | ✓ | ✓ | 0.110 | 0.833 | 4.644 | 0.185 | 0.883 | 0.962 | 0.982 | 4.008 | 1.777 |
| Depthwise Separable (Before Upsampling) | Depthwise Separable (After Upsampling) | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | FLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|
| | | 0.109 | 0.872 | 4.712 | 0.187 | 5.032 | 3.069 |
| ✓ | | 0.112 | 0.893 | 4.748 | 0.187 | 4.910 | 2.978 |
| | ✓ | 0.110 | 0.868 | 4.722 | 0.186 | 4.559 | 2.965 |
| ✓ | ✓ | 0.111 | 0.885 | 4.770 | 0.187 | 4.437 | 2.875 |
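To illustrate what this ablation varies, the sketch below models one decoder stage with a convolution before and a convolution after the ×2 upsampling, each independently swappable for a depthwise separable version; it is an assumption-laden stand-in, not the paper's exact DS-Upsampling block.

```python
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(ch: int, separable: bool) -> nn.Module:
    # Swap a standard 3x3 convolution for its depthwise separable counterpart.
    if separable:
        return nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise
            nn.Conv2d(ch, ch, 1),                        # pointwise
        )
    return nn.Conv2d(ch, ch, 3, padding=1)

class UpBlock(nn.Module):
    """One decoder stage; ds_before / ds_after mirror the two table columns."""
    def __init__(self, ch: int, ds_before: bool, ds_after: bool):
        super().__init__()
        self.before = conv3x3(ch, ds_before)
        self.after = conv3x3(ch, ds_after)

    def forward(self, x):
        x = self.before(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        return self.after(x)
```

Because the post-upsampling convolution runs on 4× as many pixels, making it separable yields the larger FLOP drop in the table above (5.032 → 4.559 G, versus 5.032 → 4.910 G for the pre-upsampling one).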
Blocks in the Fourth Stage | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | FLOPs (G) | Params (M) |
---|---|---|---|---|---|---|
10 | 0.109 | 0.872 | 4.712 | 0.187 | 5.032 | 3.069 |
7 | 0.112 | 0.896 | 4.797 | 0.189 | 4.746 | 2.472 |
6 | 0.110 | 0.846 | 4.693 | 0.186 | 4.651 | 2.273 |
5 | 0.109 | 0.838 | 4.658 | 0.185 | 4.556 | 2.074 |
4 | 0.110 | 0.821 | 4.664 | 0.186 | 4.461 | 1.875 |
3 | 0.112 | 0.841 | 4.697 | 0.188 | 4.366 | 1.676 |