A Hybrid UNet with Attention and a Perceptual Loss Function for Monocular Depth Estimation
Abstract
1. Introduction
- A new hybrid model combines a ResNet18-based encoder, which extracts local spatial features, with Transformer attention blocks that capture global context and long-range dependencies (a minimal sketch of this hybrid design is given after this list).
- A Boundary-Aware Depth Consistency Loss (BADCL) function is introduced to sharpen depth boundaries and improve prediction accuracy during training.
- Comparative results show that the proposed approach achieves accuracy comparable to state-of-the-art methods while using fewer trainable parameters.
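To make the hybrid design concrete, the following is a minimal PyTorch sketch of a ResNet18 feature encoder followed by a multi-head self-attention block applied to the flattened feature map. The class name, channel counts, and the residual/MLP layout are illustrative assumptions and do not reproduce the authors' exact implementation.

```python
# Sketch: ResNet18 CNN encoder (local features) + Transformer attention (global context).
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TransformerAttention(nn.Module):
    """Multi-head self-attention applied to a flattened CNN feature map."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels), nn.GELU(),
            nn.Linear(4 * channels, channels),
        )
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        tokens = self.norm1(tokens + self.attn(tokens, tokens, tokens)[0])
        tokens = self.norm2(tokens + self.mlp(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Truncated ResNet18 backbone as the CNN encoder, followed by global attention
# over the deepest retained feature map (256 channels, 14x14 for a 224x224 input).
backbone = resnet18(weights="IMAGENET1K_V1")
encoder = nn.Sequential(*list(backbone.children())[:7])        # conv1 ... layer3
attention = TransformerAttention(channels=256)

x = torch.randn(1, 3, 224, 224)
features = attention(encoder(x))                               # (1, 256, 14, 14)
```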
2. Materials and Methods
2.1. UNet Model
2.2. Monocular Depth Estimation in Practice
2.3. NYU-Depth V2
3. Proposed Method
3.1. Transformer-Based Hybrid UNet
3.2. Hybrid Loss Function
3.2.1. Boundary-Aware Loss
3.2.2. Smoothness Regularization Loss
3.2.3. Dynamic Scaling Loss
3.3. Combined Loss Functions
3.4. Hyperparameters
- Weighting factors: control the contribution of each term to the final loss.
- The three components are combined with these weighting factors to form the Boundary-Aware Depth Consistency Loss (BADCL). BADCL is designed to sharpen depth boundaries in monocular depth maps while keeping the overall depth scale correct, yielding more reliable and accurate predictions.
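As an illustration only, the sketch below combines a gradient-based boundary term, an L1 smoothness term, and a mean-depth scale term with weighting factors, in the spirit of BADCL; the authors' exact formulations and weight values are not reproduced here.

```python
# Illustrative BADCL-style combined loss: boundary + smoothness + scale terms.
import torch
import torch.nn.functional as F

def image_gradients(d: torch.Tensor):
    """Horizontal and vertical finite differences of a (B, 1, H, W) depth map."""
    dx = d[..., :, 1:] - d[..., :, :-1]
    dy = d[..., 1:, :] - d[..., :-1, :]
    return dx, dy

def badcl_like_loss(pred, target, w_boundary=1.0, w_smooth=0.1, w_scale=0.1):
    # Boundary term: penalize gradient mismatch between prediction and target.
    pdx, pdy = image_gradients(pred)
    tdx, tdy = image_gradients(target)
    boundary = F.l1_loss(pdx, tdx) + F.l1_loss(pdy, tdy)

    # Smoothness term: discourage spurious depth gradients in the prediction.
    smooth = pdx.abs().mean() + pdy.abs().mean()

    # Scale term: keep the global depth scale consistent with the target.
    scale = F.l1_loss(pred.mean(dim=(1, 2, 3)), target.mean(dim=(1, 2, 3)))

    return w_boundary * boundary + w_smooth * smooth + w_scale * scale
```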
3.5. Performance Metrics
3.5.1. SSIM (Structural Similarity Index Measure)
3.5.2. MSE (Mean Squared Error)
3.5.3. ARE (Absolute Relative Error)
3.5.4. δ-Metrics (Threshold Accuracy)
3.5.5. Log10 Error (Mean Logarithmic Error)
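The metrics listed above follow their standard definitions in the monocular depth estimation literature; the sketch below computes ARE, RMSE, Log10, and the δ threshold accuracies for a predicted and ground-truth depth map (SSIM is omitted, as it is usually taken from a library implementation, and no validity masking or depth capping is applied).

```python
# Standard depth-estimation error and accuracy metrics (minimal sketch).
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor) -> dict:
    eps = 1e-8
    pred = pred.clamp(min=eps)
    gt = gt.clamp(min=eps)

    are = ((pred - gt).abs() / gt).mean()                        # absolute relative error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())                 # root mean squared error
    log10 = (torch.log10(pred) - torch.log10(gt)).abs().mean()   # mean log10 error

    ratio = torch.maximum(pred / gt, gt / pred)                  # threshold accuracies
    d1 = (ratio < 1.25).float().mean()
    d2 = (ratio < 1.25 ** 2).float().mean()
    d3 = (ratio < 1.25 ** 3).float().mean()

    return {"ARE": are.item(), "RMSE": rmse.item(), "Log10": log10.item(),
            "δ1": d1.item(), "δ2": d2.item(), "δ3": d3.item()}
```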
4. Experimental Results
Computational Complexity
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
AMENet | A Monocular Depth Estimation Network |
ARE | Absolute Relative Error |
BADCL | Boundary-Aware Depth Consistency Loss |
CATNet | Convolutional Attention and Transformer |
CNN | Convolutional Neural Network |
CRF | Conditional Random Field |
DAT | Dual Attention Transformer |
DB | Decoder Block |
DNet | Hierarchical Self-Supervised Monocular Absolute Depth Estimation |
EB | Encoder Block |
FLOPs | Floating-Point Operations |
LapUNet | Laplacian Residual U-shape Network |
LiDAR | Light Detection and Ranging |
MCA | Multi-dimensional Convolutional Attention |
MHSA | Multi-Head Self-Attention |
MRF | Markov Random Field |
MSE | Mean Squared Error |
PHOG | Pyramid Histogram of Oriented Gradients |
RCConv | Reducing Channel Convolution |
REL | Mean Relative Error |
ResNet | Residual Network |
RMS | Root Mean Squared Error |
SIFT | Scale-Invariant Feature Transform |
SPT-Depth | Stereoscopic Pyramid Transformer-Depth |
SSIM | Structural Similarity Index Measure |
SURF | Speeded Up Robust Features |
TB | Transformer Block |
UNet | A U-shaped encoder–decoder network architecture |
ViT | Vision Transformer |
Appendix A
Layer (Type) | Output Shape | Param # |
---|---|---|
HybridDepthEstimationModel | [8, 1, 224, 224] | – |
CNNEncoder | [8, 64, 56, 56] | – |
Sequential | [8, 64, 56, 56] | – |
Conv2d | [8, 64, 112, 112] | 9408 |
BatchNorm2d | [8, 64, 112, 112] | 128 |
ReLU | [8, 64, 112, 112] | – |
MaxPool2d | [8, 64, 56, 56] | – |
Sequential | [8, 64, 56, 56] | – |
BasicBlock | [8, 64, 56, 56] | 73,984 |
BasicBlock | [8, 64, 56, 56] | 73,984 |
Sequential | [8, 128, 28, 28] | – |
BasicBlock | [8, 128, 28, 28] | 230,144 |
BasicBlock | [8, 128, 28, 28] | 295,424 |
Sequential | [8, 256, 14, 14] | – |
BasicBlock | [8, 256, 14, 14] | 919,040 |
BasicBlock | [8, 256, 14, 14] | 1,180,672 |
TransformerAttention | [8, 128, 28, 28] | – |
MultiheadAttention | [8, 784, 128] | 66,048 |
LayerNorm | [8, 784, 128] | 256 |
Sequential | [8, 784, 128] | 131,712 |
LayerNorm | [8, 784, 128] | 256 |
TransformerAttention | [8, 256, 14, 14] | – |
MultiheadAttention | [8, 196, 256] | 263,168 |
LayerNorm | [8, 196, 256] | 512 |
Sequential | [8, 196, 256] | 525,568 |
LayerNorm | [8, 196, 256] | 512 |
UpsampleBlock | [8, 128, 28, 28] | – |
Conv2d | [8, 128, 28, 28] | 442,496 |
BatchNorm2d | [8, 128, 28, 28] | 256 |
Conv2d | [8, 128, 28, 28] | 147,584 |
BatchNorm2d | [8, 128, 28, 28] | 256 |
UpsampleBlock | [8, 64, 56, 56] | – |
Conv2d | [8, 64, 56, 56] | 110,656 |
BatchNorm2d | [8, 64, 56, 56] | 128 |
Conv2d | [8, 64, 56, 56] | 36,928 |
BatchNorm2d | [8, 64, 56, 56] | 128 |
UpsampleBlock | [8, 64, 112, 112] | – |
Conv2d | [8, 64, 112, 112] | 73,792 |
BatchNorm2d | [8, 64, 112, 112] | 128 |
Conv2d | [8, 64, 112, 112] | 36,928 |
BatchNorm2d | [8, 64, 112, 112] | 128 |
ConvTranspose2d | [8, 32, 224, 224] | 32,800 |
Conv2d | [8, 1, 224, 224] | 289 |
Total params: 4,653,313
Trainable params: 4,653,313
Non-trainable params: 0
Total mult-adds (G): 43.02
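The listing above follows the layout produced by the torchinfo package; assuming the model class from the table is available as HybridDepthEstimationModel (the constructor call is hypothetical), a summary of this form can be regenerated as follows.

```python
# Regenerating a layer/shape/parameter table in the style of the appendix.
from torchinfo import summary

model = HybridDepthEstimationModel()             # hypothetical constructor
summary(model, input_size=(8, 3, 224, 224),      # batch of 8 RGB images, 224x224
        col_names=("output_size", "num_params"))
```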
References
- Li, Y.; Ma, L.; Zhong, Z.; Liu, F.; Chapman, M.A.; Cao, D.; Li, J. Deep learning for lidar point clouds in autonomous driving: A review. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 3412–3432. [Google Scholar] [CrossRef]
- Tang, C.; Hou, C.; Song, Z. Depth recovery and refinement from a single image using defocus cues. J. Mod. Opt. 2015, 62, 441–448. [Google Scholar] [CrossRef]
- Tsai, Y.M.; Chang, Y.L.; Chen, L.G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proceedings of the 2006 IEEE International Symposium on Intelligent Signal Processing and Communications, Yonago, Japan, 12–15 December 2006; pp. 586–589. [Google Scholar] [CrossRef]
- Zhang, R.; Tsai, P.S.; Cryer, J.E.; Shah, M. Shape-from-shading: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 690–706. [Google Scholar] [CrossRef]
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar] [CrossRef]
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Computer Vision—ECCV 2006; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar] [CrossRef]
- Bosch, A.; Zisserman, A.; Munoz, X. Image classification using random forests and ferns. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio De Janeiro, Brazil, 14–21 October 2007; pp. 1–8. [Google Scholar] [CrossRef]
- Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
- Cross, G.R.; Jain, A.K. Markov random field texture models. IEEE Trans. Pattern Anal. Mach. Intell. 1983, PAMI-5, 25–39. [Google Scholar] [CrossRef]
- Güney, E.; Bayılmış, C.; Çakar, S.; Erol, E.; Atmaca, Ö. Autonomous control of shore robotic charging systems based on computer vision. Expert Syst. Appl. 2024, 238, 122116. [Google Scholar] [CrossRef]
- Yolcu, G.; Oztel, I.; Kazan, S.; Oz, C.; Palaniappan, K.; Lever, T.E.; Bunyak, F. Facial expression recognition for monitoring neurological disorders based on convolutional neural network. Multimed. Tools Appl. 2019, 78, 31581–31603. [Google Scholar] [CrossRef] [PubMed]
- Sazak, H.; Kotan, M. Automated Blood Cell Detection and Classification in Microscopic Images Using YOLOv11 and Optimized Weights. Diagnostics 2024, 15, 22. [Google Scholar] [CrossRef] [PubMed]
- Yang, W.J.; Wu, C.C.; Yang, J.F. Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation. Sensors 2024, 25, 80. [Google Scholar] [CrossRef]
- O’Shea, K. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458. [Google Scholar] [CrossRef]
- Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
- Sinha, D.; El-Sharkawy, M. Thin mobilenet: An enhanced mobilenet architecture. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; pp. 280–285. [Google Scholar] [CrossRef]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar] [CrossRef]
- Duong, H.T.; Chen, H.M.; Chang, C.C. URNet: An UNet-based model with residual mechanism for monocular depth estimation. Electronics 2023, 12, 1450. [Google Scholar] [CrossRef]
- Wang, B.; Wang, S.; Dou, Z.; Ye, D. Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation. arXiv 2024, arXiv:2309.09272. [Google Scholar] [CrossRef]
- Xue, F.; Zhuo, G.; Huang, Z.; Fu, W.; Wu, Z.; Ang, M. Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 2330–2337. [Google Scholar] [CrossRef]
- Tadepalli, Y.; Kollati, M.; Kuraparthi, S.; Kora, P. EfficientNet-B0 Based Monocular Dense-Depth Map Estimation. Trait. Signal 2021, 38, 1485–1493. [Google Scholar] [CrossRef]
- Li, R.; Yu, H.; Du, K.; Xiao, Z.; Yan, B.; Yuan, Z. Adaptive Semantic Fusion Framework for Unsupervised Monocular Depth Estimation. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Dang, Y.; Li, C.; Zhang, L.; Gao, Y. RCCNet: Reducing Channel Convolution Network for Monocular Depth Estimation. In Proceedings of the 2023 4th IEEE International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 1–4. [Google Scholar] [CrossRef]
- Shen, M.; Wang, Z.; Su, S.; Liu, C.; Chen, Q. DNA-Depth: A Frequency-Based Day-Night Adaptation for Monocular Depth Estimation. IEEE Trans. Instrum. Meas. 2023, 72, 2530112. [Google Scholar] [CrossRef]
- Xia, Z.; Wu, T.; Wang, Z.; Zhou, M.; Wu, B.; Chan, C.; Kong, L.B. Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion. Sci. Rep. 2024, 14, 7037. [Google Scholar] [CrossRef]
- Sharma, M.; Choudhary, R.; Anil, R. 2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 757–764. [Google Scholar]
- Tang, S.; Lu, T.; Liu, X.; Zhou, H.; Zhang, Y. CATNet: Convolutional attention and transformer for monocular depth estimation. Pattern Recognit. 2024, 145, 109982. [Google Scholar] [CrossRef]
- Kolbeinsson, B.; Mikolajczyk, K. UCorr: Wire Detection and Depth Estimation for Autonomous Drones. In Proceedings of the International Conference on Robotics, Computer Vision and Intelligent Systems, Rome, Italy, 25–27 February 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 179–192. [Google Scholar] [CrossRef]
- Wu, T.; Xia, Z.; Zhou, M.; Kong, L.B.; Chen, Z. AMENet is a monocular depth estimation network designed for automatic stereoscopic display. Sci. Rep. 2024, 14, 5868. [Google Scholar] [CrossRef]
- Xi, Y.; Li, S.; Xu, Z.; Zhou, F.; Tian, J. LapUNet: A novel approach to monocular depth estimation using dynamic laplacian residual U-shape networks. Sci. Rep. 2024, 14, 23544. [Google Scholar] [CrossRef]
- Zou, Y.; Luo, Z.; Huang, J.B. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 36–53. [Google Scholar]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the ECCV, Florence, Italy, 7–13 October 2012. [Google Scholar] [CrossRef]
- Wu, C.; Wu, F.; Ge, S.; Qi, T.; Huang, Y.; Xie, X. Neural news recommendation with multi-head self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6389–6394. [Google Scholar] [CrossRef]
- Borse, S.; Wang, Y.; Zhang, Y.; Porikli, F. Inverseform: A loss function for structured boundary-aware segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5901–5911. [Google Scholar]
- Ngoc, M.Ô.V.; Chen, Y.; Boutry, N.; Chazalon, J.; Carlinet, E.; Fabrizio, J.; Mallet, C.; Géraud, T. Introducing the Boundary-Aware loss for deep image segmentation. In Proceedings of the British Machine Vision Conference (BMVC) 2021, Online, 22–25 November 2021. [Google Scholar]
- Abdusalomov, A.; Umirzakova, S.; Shukhratovich, M.B.; Kakhorov, A.; Cho, Y.I. Breaking New Ground in Monocular Depth Estimation with Dynamic Iterative Refinement and Scale Consistency. Appl. Sci. 2025, 15, 674. [Google Scholar] [CrossRef]
- Mancini, M.; Costante, G.; Valigi, P.; Ciarfuglia, T.A.; Delmerico, J.; Scaramuzza, D. Toward domain independence for learning-based monocular depth estimation. IEEE Robot. Autom. Lett. 2017, 2, 1778–1785. [Google Scholar] [CrossRef]
- Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925. [Google Scholar] [CrossRef]
- Alhashim, I.; Wonka, P. High quality monocular depth estimation via transfer learning. arXiv 2018, arXiv:1812.11941. [Google Scholar] [CrossRef]
- Li, J.; Klein, R.; Yao, A. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3372–3380. [Google Scholar]
- Rudolph, M.; Dawoud, Y.; Güldenring, R.; Nalpantidis, L.; Belagiannis, V. Lightweight monocular depth estimation through guided decoding. In Proceedings of the 2022 IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2344–2350. [Google Scholar] [CrossRef]
- Lee, J.H.; Kim, C.S. Monocular depth estimation using relative depth maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9729–9738. [Google Scholar]
- Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3d packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2485–2494. [Google Scholar] [CrossRef]
- Basak, H.; Ghosal, S.; Sarkar, M.; Das, M.; Chattopadhyay, S. Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image. In Proceedings of the 2020 IEEE 7th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), Prayagraj, India, 27–29 November 2020; pp. 1–6. [Google Scholar] [CrossRef]
- Das, D.; Das, A.D.; Sadaf, F. Depth Estimation From Monocular Images with Enhanced Encoder-Decoder Architecture. arXiv 2024, arXiv:2410.11610. [Google Scholar] [CrossRef]
- Ignatov, D.; Ignatov, A.; Timofte, R. Virtually Enriched NYU Depth V2 Dataset for Monocular Depth Estimation: Do We Need Artificial Augmentation? In Proceedings of the Synthetic Data for Computer Vision Workshop @ CVPR 2024, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
Parameter | Value / Specification |
---|---|
Backbone | ResNet-18 (Pretrained on ImageNet) |
Optimizer | AdamW (using PyTorch defaults) |
Learning Rate | (constant, no schedule) |
Batch Size | 8 |
Number of Epochs | 100 |
Random Seed | 42 (for dataset splits) |
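A sketch of a training loop consistent with this configuration is given below; full_dataset, model, the 90/10 split ratio, and the L1 stand-in loss are placeholders, and the constant learning rate used in the paper (value not shown in the table) is left at the PyTorch default.

```python
# Training setup sketch: AdamW with PyTorch defaults, batch size 8, 100 epochs, seed 42.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split

torch.manual_seed(42)                                      # seed 42, as in the table
generator = torch.Generator().manual_seed(42)
# full_dataset and model are placeholders for the NYU-Depth V2 dataset object
# and the hybrid network; the 90/10 split ratio is illustrative.
train_set, val_set = random_split(full_dataset, [0.9, 0.1], generator=generator)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters())          # PyTorch default hyperparameters
# Constant learning rate: no scheduler is attached.

for epoch in range(100):                                   # 100 epochs
    for images, depths in train_loader:
        optimizer.zero_grad()
        loss = F.l1_loss(model(images), depths)            # stand-in for the BADCL loss
        loss.backward()
        optimizer.step()
```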
Error metrics (ARE, RMSE, Log10): lower is better. Accuracy metrics (δ): higher is better.
Method | ARE | RMSE | Log10 | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
---|---|---|---|---|---|---|
Mancini et al. [39] | 0.312 | 0.565 | 0.336 | 0.809 | 0.786 | 0.911 | |
Xu et al. [40] | 0.125 | 0.593 | 0.057 | 0.806 | 0.952 | 0.986 | |
Alhashim et al. [41] | 0.123 | 0.465 | 0.053 | 0.846 | 0.974 | 0.994 | |
Li et al. (VGG16) [42] | 0.152 | 0.611 | 0.064 | 0.789 | 0.955 | 0.988 | |
Li et al. (VGG19) [42] | 0.146 | 0.617 | 0.063 | 0.795 | 0.958 | 0.991 | |
Li et al. (ResNet50) [42] | 0.143 | 0.635 | 0.063 | 0.788 | 0.958 | 0.991 | |
Rudolph et al. [43] | 0.138 | 0.501 | 0.058 | 0.823 | 0.961 | 0.990 | |
Lee et al. [44] | 0.131 | 0.538 | – | 0.837 | 0.971 | 0.994 | |
Guizilini et al. [45] | 0.072 | 2.727 | 0.120 | 0.932 | 0.984 | 0.994 | |
Basak et al. [46] | 0.103 | 0.388 | – | 0.892 | 0.978 | 0.995 | |
Das et al. (Enc-Dec-IRv2) [47] | 0.064 | 0.228 | 0.032 | 0.893 | 0.967 | 0.985 | |
Ignatov et al. [48] | 0.090 | 0.322 | 0.039 | 0.929 | 0.991 | 0.998 | |
Hybrid Ensemble UNet (Proposed) | 0.063 | 0.237 | 0.026 | 0.982 | 0.996 | 0.998 |
Batch Size | Throughput (Images/s) | Batches/s | Latency (ms/Image) |
---|---|---|---|
1 | 249.84 | 249.84 | 4.003 |
4 | 1011.47 | 252.87 | 0.989 |
8 | 1087.35 | 135.92 | 0.920 |
16 | 1176.84 | 73.55 | 0.850 |
32 | 1212.19 | 37.88 | 0.825 |
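The figures above can be reproduced with a simple timing loop of the following form; the device, warm-up count, and number of timed runs are illustrative assumptions rather than the authors' exact benchmarking protocol.

```python
# Throughput/latency measurement sketch for a given batch size.
import time
import torch

@torch.no_grad()
def benchmark(model, batch_size, runs=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(10):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    images_per_s = runs * batch_size / elapsed
    return {"throughput_images_per_s": images_per_s,
            "batches_per_s": runs / elapsed,
            "latency_ms_per_image": 1000.0 / images_per_s}
```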
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Cite as: Turkmen, H.; Akgun, D. A Hybrid UNet with Attention and a Perceptual Loss Function for Monocular Depth Estimation. Mathematics 2025, 13, 2567. https://doi.org/10.3390/math13162567