Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation
Abstract
1. Introduction
2. Related Work
2.1. Vision Transformer
2.2. Atrous Spatial Pyramid Pooling
2.3. Selective Feature Fusion
3. Proposed Methods
3.1. CNN-ViT Encoder
3.1.1. Subsampled Residual Block
3.1.2. Residual Block
3.1.3. Vision Transformers in CNN-ViT
3.2. Adaptive Fusion Decoder
3.2.1. Fusion Modules
- A. Separate Enhancement Addition Fusion Module
- B. Separate Enhancement Concatenation Fusion Module
- C. Adaptive Fusion Module
3.2.2. Up-Convolution Module
3.2.3. Deep ASPP Module

3.3. Training Loss Function
4. Experimental Results
4.1. CNN-ViT Encoder with Various ViT Configurations
4.2. Adaptive Fusion Decoder with Various Fusion Modules
4.3. Comparisons on NYU Depth V2 Dataset
4.4. Comparisons on KITTI Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fabrizio, F.; De Luca, A. Real-time computation of distance to dynamic obstacles with multiple depth sensors. IEEE Robot. Autom. Lett. 2017, 2, 56–63.
- Natan, O.; Miura, J. End-to-end autonomous driving with semantic depth cloud mapping and multi-agent. IEEE Trans. Intell. Veh. 2023, 8, 557–571.
- Kauff, P.; Atzpadin, N.; Fehn, C.; Müller, M.; Schreer, O.; Smolic, A.; Tanger, R. Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability. Signal Process. Image Commun. 2007, 22, 217–234.
- Gordon, G.G. Face recognition based on depth maps and surface curvature. In Proceedings of the SPIE 1570, Geometric Methods in Computer Vision, San Diego, CA, USA, 1 September 1991.
- Ding, M.; Huo, Y.; Yi, H.; Wang, Z.; Shi, J.; Lu, Z.; Luo, P. Learning depth-guided convolutions for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 1000–1001.
- Zhang, C.; Wang, L.; Yang, R. Semantic segmentation of urban scenes using dense depth maps, ECCV 2010. In Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314.
- Žbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 1–32.
- Pang, J.; Sun, W.; Ren, J.; Yang, C.; Yang, Q.; Yan, Q. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 878–886.
- Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5410–5418.
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: Redesigning Skip Connections to Exploit Multiscale Features in Image Segmentation. IEEE Trans. Med. Imaging 2020, 39, 1856–1867.
- Masoumian, A.; Rashwan, H.A.; Cristiano, J.; Asif, M.S.; Puig, D. Monocular Depth Estimation Using Deep Learning: A Review. Sensors 2022, 22, 5353.
- Kim, D.; Ka, W.; Ahn, P.; Joo, D.; Chun, S.; Kim, J. Global-local path networks for monocular depth estimation with vertical cutdepth. arXiv 2022, arXiv:2201.07436.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Kuznietsov, Y.; Stuckler, J.; Leibe, B. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2215–2223.
- Yang, W.-J.; Tsung, W.-N.; Chung, P.-C. Video-based depth estimation autoencoder with weighted temporal feature and spatial edge guided modules. IEEE Trans. Artif. Intell. 2024, 5, 613–623.
- Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Al Dayil, R.; Al Ajlan, N. Vision transformers for remote sensing image classification. Remote Sens. 2021, 13, 516.
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272.
- Yang, J.; An, L.; Dixit, A.; Koo, J.; Park, S.I. Depth estimation with simplified transformer. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; Available online: https://arxiv.org/abs/2204.13791v3 (accessed on 28 May 2024).
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
- Agarwal, A.; Arora, C. Depthformer: Multiscale Vision Transformer for Monocular Depth Estimation with Global Local Information Fusion. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3873–3877.
- Zhu, X.; Han, Z.; Zhang, Z.; Song, L.; Wang, H.; Guo, Q. PCTNet: Depth estimation from single structured light image with a parallel CNN-transformer network. Meas. Sci. Technol. 2023, 34, 085402.
- Zhang, Z.; Chan, R.K.; Wong, K.K. GlocalFuse-Depth: Fusing transformers and CNNs for all-day self-supervised monocular depth estimation. Neurocomputing 2024, 569, 127122.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2021, arXiv:1907.10326v5.
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 14 December 2019; Volume 32.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

| ViT positions (n1 n2 n3 n4 n5) | FLOPs (G) | Params (MB) | δ1 ↑ | δ2 ↑ | δ3 ↑ | RMSE ↓ | AbsRel ↓ |
|---|---|---|---|---|---|---|---|
| 00000 (no ViTs) | 6.367 | 1.696 | 0.622 | 0.881 | 0.966 | 0.453 | 0.225 | 
| 00005 | 13.229 | 64.815 | 0.875 | 0.967 | 0.991 | 0.371 | 0.105 | 
| 00014 | 13.522 | 89.828 | 0.879 | 0.968 | 0.991 | 0.366 | 0.106 | 
| 00023 | 13.522 | 89.828 | 0.880 | 0.971 | 0.992 | 0.360 | 0.101 | 
| 00032 | 13.522 | 89.828 | 0.881 | 0.969 | 0.991 | 0.360 | 0.102 | 
| 00041 | 13.522 | 89.828 | 0.878 | 0.969 | 0.992 | 0.365 | 0.102 | 
| 00113 | 14.357 | 102.33 | 0.879 | 0.968 | 0.991 | 0.365 | 0.106 | 
| 00122 | 14.357 | 102.33 | 0.881 | 0.968 | 0.990 | 0.361 | 0.105 | 
| 00131 | 14.357 | 102.33 | 0.876 | 0.968 | 0.990 | 0.370 | 0.109 | 
| 00212 | 14.603 | 102.33 | 0.878 | 0.969 | 0.991 | 0.363 | 0.101 | 
| 00221 | 14.603 | 102.33 | 0.880 | 0.968 | 0.991 | 0.362 | 0.103 | 
| 00311 | 14.849 | 102.33 | 0.879 | 0.970 | 0.992 | 0.363 | 0.104 | 
| 01112 | 16.797 | 108.62 | 0.878 | 0.968 | 0.991 | 0.361 | 0.103 | 
| 01121 | 16.797 | 108.62 | 0.882 | 0.970 | 0.992 | 0.357 | 0.100 | 
| 01211 | 17.043 | 108.62 | 0.874 | 0.967 | 0.990 | 0.364 | 0.104 | 
| 02111 | 18.030 | 108.62 | 0.880 | 0.968 | 0.991 | 0.358 | 0.105 | 
| 11111 | 24.317 | 112.31 | 0.878 | 0.967 | 0.991 | 0.364 | 0.103 |

| Fusion Modules | Params (MB) | δ1 ↑ | δ2 ↑ | δ3 ↑ | RMSE ↓ | AbsRel ↓ |
|---|---|---|---|---|---|---|
| SFF (baseline) | 1.665 | 0.696 | 0.907 | 0.971 | 0.651 | 0.206 | 
| SEAFM | 0.836 | 0.718 | 0.919 | 0.975 | 0.615 | 0.192 | 
| SECFM | 2.159 | 0.717 | 0.917 | 0.973 | 0.626 | 0.195 | 
| AFM | 3.320 | 0.747 | 0.930 | 0.978 | 0.589 | 0.181 |

| Network | δ1 ↑ | δ2 ↑ | δ3 ↑ | RMSE ↓ | AbsRel ↓ |
|---|---|---|---|---|---|
| BTS [30] | 0.762 | 0.940 | 0.984 | 0.565 | 0.167 | 
| GLPDepth [17] | 0.605 | 0.872 | 0.962 | 0.769 | 0.235 | 
| RVTAF Net | 0.773 | 0.942 | 0.984 | 0.560 | 0.162 | 
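
The tables above report δ1, δ2, δ3, RMSE, and AbsRel without restating how they are computed. The sketch below uses the definitions that are conventional in the monocular depth literature: δk is the fraction of pixels with max(pred/gt, gt/pred) < 1.25^k, RMSE is the root mean squared error, and AbsRel is the mean of |pred − gt| / gt. It is an illustration only, assuming flat sequences of depth values; the function name `depth_metrics` and the rule for skipping invalid pixels are hypothetical and are not taken from the paper's evaluation code.

```python
import math


def depth_metrics(pred, gt):
    """Conventional depth metrics over paired predicted / ground-truth depths.

    pred, gt: equal-length sequences of depth values in the same unit.
    Pairs with a non-positive prediction or ground truth are skipped
    (an assumption made for this sketch, not a rule from the paper).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid depth pairs")
    n = len(pairs)

    # Threshold accuracy: share of pixels whose ratio error is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    metrics = {
        f"delta{k}": sum(r < 1.25 ** k for r in ratios) / n
        for k in (1, 2, 3)
    }

    # Root mean squared error (lower is better).
    metrics["rmse"] = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Mean absolute relative error (lower is better).
    metrics["abs_rel"] = sum(abs(p - g) / g for p, g in pairs) / n
    return metrics


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 3.0]))
```

Higher δ values and lower RMSE and AbsRel indicate better predictions, which matches the ↑ and ↓ arrows in the table headers.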

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yang, W.-J.; Wu, C.-C.; Yang, J.-F. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation. Sensors 2025, 25, 80. https://doi.org/10.3390/s25010080