Parallel Multi-Scale Semantic-Depth Interactive Fusion Network for Depth Estimation
Abstract
1. Introduction
- We propose a scheme that combines semantic segmentation with depth estimation in an end-to-end pipeline. Instead of the traditional U-Net-style encoder, we adopt a multi-stage feature attention network (MSFAN) to extract features from RGB images, improving the accuracy of the depth estimation task.
- We introduce a semantic segmentation task that shares the MSFAN feature extractor and add a parallel semantic-depth interactive fusion module (PSDIFM) so that feature information flows bidirectionally between the two tasks (a hedged sketch of this idea is given after this list).
- We design the total multi-task loss function to fit the new pipeline and add a metric loss based on semantic edges that refines depth around object boundaries, further improving the depth estimation results (a sketch of one plausible form of this loss also follows the list).
- We train our network pipeline on the KITTI dataset and evaluate it on KITTI and Make3D. The results show that the proposed pipeline achieves competitive performance compared with existing methods.
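To make the bidirectional complementarity mentioned above concrete, the following is a minimal PyTorch-style sketch of how a parallel semantic-depth interactive fusion block could exchange features between the depth and segmentation branches. The class name PSDIFMBlock, the cross-task gating design, and the tensor shapes are our assumptions for illustration, not the authors' actual module.

```python
import torch
import torch.nn as nn

class PSDIFMBlock(nn.Module):
    """Hypothetical sketch of one parallel semantic-depth interactive
    fusion block: each branch is refined by a gating map computed from
    the other branch's features, so information flows both ways."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions turn each branch's features into a gating map
        # for the other branch.
        self.sem_to_depth = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                          nn.Sigmoid())
        self.depth_to_sem = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                          nn.Sigmoid())

    def forward(self, depth_feat, sem_feat):
        # Cross-task gating: semantic features modulate depth features and
        # vice versa; residual connections preserve each branch's own signal.
        depth_out = depth_feat + depth_feat * self.sem_to_depth(sem_feat)
        sem_out = sem_feat + sem_feat * self.depth_to_sem(depth_feat)
        return depth_out, sem_out


if __name__ == "__main__":
    # Fuse same-resolution feature maps from the two decoders.
    block = PSDIFMBlock(channels=64)
    d = torch.randn(1, 64, 48, 160)   # depth-branch features
    s = torch.randn(1, 64, 48, 160)   # segmentation-branch features
    d2, s2 = block(d, s)
    print(d2.shape, s2.shape)
```

The design sketched here, cross-task attention with residual connections, is one common way to let each branch borrow cues from the other without overwriting its own features.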
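For the multi-task loss in the third bullet, a plausible overall form, assuming the usual self-supervised photometric and edge-aware smoothness terms plus a segmentation cross-entropy term (the exact terms and weights are not stated in this extract and are our assumption), is:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{ph}}
  + \lambda_{\text{sm}}\,\mathcal{L}_{\text{smooth}}
  + \lambda_{\text{seg}}\,\mathcal{L}_{\text{seg}}
  + \lambda_{\text{b}}\,\mathcal{L}_{\text{metric}},
```

where $\mathcal{L}_{\text{metric}}$ denotes the semantic-edge-based metric loss (Section 3.3) and the $\lambda$ weights balance the tasks.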
2. Related Work
2.1. Depth Estimation
2.2. Self-Supervised Depth Estimation
2.3. Semantic-Guided Depth Estimation
3. Proposed Approach
3.1. Problem Statement
3.2. Network Architecture
3.2.1. Pipeline
3.2.2. Multi-Stage Feature Attention Network
3.2.3. Parallel Semantic Depth Interactive Fusion
3.3. Boundary Alignment Loss
Algorithm 1: Semantically Inconsistent Boundary Pixel Extraction
Input: the synthetic semantic maps, the pseudo label, and the semantic boundary pixel set. Output: the semantically inconsistent boundary pixel set.
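The body of Algorithm 1 is not reproduced in this extract; the snippet below is a minimal sketch of how such an extraction could work, assuming the synthetic semantic maps are segmentation maps warped from adjacent source frames into the target view and the pseudo label is the per-pixel class map of the target frame. The function name extract_inconsistent_boundary_pixels and all argument names are hypothetical.

```python
import numpy as np

def extract_inconsistent_boundary_pixels(warped_sem_a, warped_sem_b,
                                          pseudo_label, boundary_mask):
    """Hypothetical sketch: keep boundary pixels whose warped semantics
    disagree with the pseudo label in either adjacent view.

    warped_sem_a, warped_sem_b : (H, W) int arrays, semantic maps warped
        from the adjacent source frames into the target view.
    pseudo_label : (H, W) int array, pseudo semantic label of the target.
    boundary_mask : (H, W) bool array, semantic boundary pixel set.
    Returns an (N, 2) array of (row, col) pixel coordinates.
    """
    disagree = (warped_sem_a != pseudo_label) | (warped_sem_b != pseudo_label)
    inconsistent = boundary_mask & disagree
    return np.argwhere(inconsistent)
```

Pixels selected this way would then feed the boundary alignment loss of Section 3.3, which, judging from the cited triplet loss [26], may contrast features across the semantic edge in a triplet-style fashion; this is an inference rather than the authors' stated formulation.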
3.4. Multi-Task Loss
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Network and Training Details
4.3. Quantitative and Qualitative Results
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27, 2366–2374. [Google Scholar]
- Li, B.; Shen, C.; Dai, Y.; Van Den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1119–1127. [Google Scholar]
- Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef] [PubMed]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5667–5675. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
- Meng, Y.; Lu, Y.; Raj, A.; Sunarjo, S.; Guo, R.; Javidi, T.; Bansal, G.; Bharadia, D. Signet: Semantic instance aided unsupervised 3d geometry perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9810–9820. [Google Scholar]
- Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 582–600. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: New York, NY, USA, 2016; pp. 239–248. [Google Scholar]
- Mancini, M.; Costante, G.; Valigi, P.; Ciarfuglia, T.A. Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; IEEE: New York, NY, USA, 2016; pp. 4296–4303. [Google Scholar]
- Yang, Z.; Wang, P.; Xu, W.; Zhao, L.; Nevatia, R. Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Poggi, M.; Aleotti, F.; Tosi, F.; Mattoccia, S. On the uncertainty of self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3227–3237. [Google Scholar]
- Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2485–2494. [Google Scholar]
- Yin, Z.; Shi, J. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992. [Google Scholar]
- Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564. [Google Scholar] [CrossRef]
- Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Chen, P.Y.; Liu, A.H.; Liu, Y.C.; Wang, Y.C.F. Towards scene understanding: Unsupervised monocular depth estimation with semantic-aware representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2624–2632. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361. [Google Scholar]
- Zhu, S.; Brazil, G.; Liu, X. The edge of depth: Explicit constraints between segmentation and depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13116–13125. [Google Scholar]
- Jung, H.; Park, E.; Yoo, S. Fine-grained semantics-aware representation enhancement for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12642–12652. [Google Scholar]
- Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. Hr-depth: High resolution self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 2294–2301. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 675–684. [Google Scholar]
- Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/3e456b31302cf8210edd4029292a40ad-Paper.pdf (accessed on 10 May 2025).
- Dong, X.; Shen, J. Triplet loss in siamese network for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 459–474. [Google Scholar]
- Saxena, A.; Sun, M.; Ng, A.Y. Make3d: Learning 3d scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840. [Google Scholar] [CrossRef] [PubMed]
- Shu, C.; Yu, K.; Duan, Z.; Yang, K. Feature-metric loss for self-supervised learning of depth and egomotion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 572–588. [Google Scholar]
- Choi, J.; Jung, D.; Lee, D.; Kim, C. Safenet: Self-supervised monocular depth estimation with semantic-aware feature extraction. arXiv 2020, arXiv:2010.02893. [Google Scholar]
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8001–8008. [Google Scholar]
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249. [Google Scholar]
- Pnvr, K.; Zhou, H.; Jacobs, D. Sharingan: Combining synthetic and real data for unsupervised geometry estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13974–13983. [Google Scholar]
- Chanduri, S.S.; Suri, Z.K.; Vozniak, I.; Müller, C. Camlessmonodepth: Monocular depth estimation with unknown camera parameters. arXiv 2021, arXiv:2110.14347. [Google Scholar]
- Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-wise attention-based network for self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; IEEE: New York, NY, USA, 2021; pp. 464–473. [Google Scholar]
- Masoumian, A.; Rashwan, H.A.; Abdulwahab, S.; Cristiano, J.; Asif, M.S.; Puig, D. GCNDepth: Self-supervised monocular depth estimation based on graph convolutional network. Neurocomputing 2023, 517, 81–92. [Google Scholar] [CrossRef]
- Liu, M.; Salzmann, M.; He, X. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 716–723. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030. [Google Scholar]
Method | Sup | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSE_log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
---|---|---|---|---|---|---|---|---|
SFMLearner [4] | N | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
Vid2Depth [5] | N | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
GeoNet [14] | N | 0.153 | 1.328 | 5.737 | 0.232 | 0.802 | 0.934 | 0.972 |
Casser [30] | N | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
CC [31] | F | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
SharinGAN [32] | N | 0.116 | 0.939 | 5.068 | 0.203 | 0.850 | 0.948 | 0.978 |
SCSFM [15] | N | 0.114 | 0.813 | 4.706 | 0.191 | 0.873 | 0.960 | 0.982 |
Monodepth2 [6] | N | 0.112 | 0.851 | 4.754 | 0.190 | 0.881 | 0.960 | 0.981 |
SGDepth [8] | Seg | 0.112 | 0.833 | 4.688 | 0.190 | 0.884 | 0.961 | 0.983 |
SAFENet [29] | Seg | 0.112 | 0.788 | 4.582 | 0.187 | 0.878 | 0.963 | 0.983 |
PackNet-sfm [13] | N | 0.111 | 0.785 | 4.601 | 0.189 | 0.878 | 0.960 | 0.982 |
HR-Depth [22] | N | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983 |
Chanduri [33] | N | 0.106 | 0.750 | 4.482 | 0.182 | 0.890 | 0.964 | 0.983 |
CADepth [34] | N | 0.105 | 0.769 | 4.535 | 0.181 | 0.892 | 0.964 | 0.983 |
FeatDepth [28] | HR | 0.104 | 0.729 | 4.481 | 0.179 | 0.893 | 0.965 | 0.984 |
GCNDepth [35] | HR | 0.104 | 0.720 | 4.494 | 0.181 | 0.888 | 0.965 | 0.984 |
FSRE [21] | Seg | 0.102 | 0.675 | 4.393 | 0.178 | 0.893 | 0.964 | 0.984 |
Ours | Seg | 0.101 | 0.718 | 4.376 | 0.176 | 0.896 | 0.966 | 0.984 |
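For reference, the columns follow the standard monocular depth evaluation protocol of Eigen et al. [1]: AbsRel, SqRel, RMSE, and RMSE_log are error measures (lower is better), while a1, a2, and a3 are the threshold accuracies δ < 1.25, δ < 1.25², and δ < 1.25³ (higher is better). With predicted depth $\hat{d}_i$, ground-truth depth $d_i$, and N valid pixels, the metrics are:

```latex
\mathrm{AbsRel} = \frac{1}{N}\sum_{i}\frac{|\hat{d}_i - d_i|}{d_i},\qquad
\mathrm{SqRel} = \frac{1}{N}\sum_{i}\frac{(\hat{d}_i - d_i)^2}{d_i},\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i}(\hat{d}_i - d_i)^2},
```
```latex
\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{N}\sum_{i}\bigl(\log\hat{d}_i - \log d_i\bigr)^2},\qquad
a_k = \frac{1}{N}\,\Bigl|\bigl\{\, i : \max\bigl(\hat{d}_i/d_i,\; d_i/\hat{d}_i\bigr) < 1.25^{k} \,\bigr\}\Bigr| .
```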
Method | Sup | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSE_log ↓ |
---|---|---|---|---|---|
Liu et al. [36] | Y | 0.462 | 6.625 | 9.972 | 0.161 |
Laina et al. [9] | Y | 0.204 | 1.840 | 5.683 | 0.084 |
Godard et al. [37] | Y | 0.443 | 7.112 | 8.860 | 0.142 |
Zhou et al. [4] | N | 0.392 | 4.473 | 8.307 | 0.194 |
DDVO [38] | N | 0.387 | 4.720 | 8.090 | 0.204 |
Monodepth2 [6] | N | 0.344 | 4.065 | 7.920 | 0.197 |
Ours | N | 0.337 | 3.842 | 7.733 | 0.190 |
Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSE_log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
---|---|---|---|---|---|---|---|
w/o MSFAN, w/o Seg, w/o PSDIFM | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983 |
+MSFAN | 0.107 | 0.776 | 4.620 | 0.185 | 0.886 | 0.962 | 0.983 |
+MSFAN+Seg | 0.107 | 0.766 | 4.511 | 0.183 | 0.892 | 0.965 | 0.983 |
+MSFAN+Seg+PSDIFM | 0.103 | 0.747 | 4.405 | 0.177 | 0.895 | 0.966 | 0.984 |
+MSFAN+Seg+PSDIFM+Metric Loss | 0.101 | 0.718 | 4.376 | 0.176 | 0.896 | 0.966 | 0.984 |