Self-Supervised Monocular Depth Estimation Using Global and Local Mixed Multi-Scale Feature Enhancement Network for Low-Altitude UAV Remote Sensing
Abstract
1. Introduction
- We propose a global and local mixed multi-scale feature enhancement network for depth estimation in low-altitude remote sensing scenarios. The network processes the input image in parallel branches at different scales: each branch keeps a fixed resolution throughout, and branches exchange feature information at intersection nodes. This reduces the information loss incurred by repeated downsampling during convolution and yields more refined depth estimates (see the branch-exchange sketch after this list).
- We propose a Global Scene Attention (GSA) module for the decoder of the depth network. It establishes long-range semantic connections across the global context of the input feature map and integrates this contextual information into the feature map's channel representation, improving the model's understanding of and reasoning about the overall scene and thereby enhancing task performance (see the GSA sketch after this list).
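The exact branch topology is not given in this excerpt; the following is a minimal two-branch sketch in PyTorch, assuming an HRNet-style exchange unit. The module and parameter names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchExchange(nn.Module):
    """Illustrative two-branch exchange unit (hypothetical): each branch keeps
    its own resolution throughout; features are swapped at intersection nodes."""
    def __init__(self, ch_high: int, ch_low: int):
        super().__init__()
        # High-res -> low-res path: strided 3x3 conv halves the resolution.
        self.down = nn.Conv2d(ch_high, ch_low, kernel_size=3, stride=2, padding=1)
        # Low-res -> high-res path: 1x1 conv aligns channels; upsampled in forward().
        self.up = nn.Conv2d(ch_low, ch_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # Each branch fuses its own features with the other branch's, so
        # neither resolution is ever fully collapsed during convolution.
        high = x_high + F.interpolate(self.up(x_low), size=x_high.shape[-2:],
                                      mode="bilinear", align_corners=False)
        low = x_low + self.down(x_high)
        return high, low

# Usage: two branches, e.g. at 1/4 and 1/8 of the input resolution.
x_high = torch.randn(1, 32, 64, 64)
x_low = torch.randn(1, 64, 32, 32)
high, low = BranchExchange(32, 64)(x_high, x_low)
print(high.shape, low.shape)  # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```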
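The GSA module's internals are likewise not reproduced in this excerpt. Below is a hedged sketch of the described behavior — attention-pooling a global context vector over all spatial positions and folding it back into the channel representation — loosely following global-context/squeeze-and-excitation designs [33]; all names and the reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class GlobalSceneAttention(nn.Module):
    """Illustrative sketch: aggregate global context across all spatial
    positions, transform it, and inject it into the channel representation."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Per-pixel score used to attention-pool the map into one vector.
        self.context_score = nn.Conv2d(channels, 1, kernel_size=1)
        # Bottleneck transform of the pooled context into channel weights.
        self.transform = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Softmax over all h*w positions -> long-range attention weights.
        weights = self.context_score(x).view(b, 1, h * w).softmax(dim=-1)
        # Attention-pooled global context vector, shape (b, c).
        context = torch.bmm(x.view(b, c, h * w), weights.transpose(1, 2)).view(b, c)
        # Re-weight channels with the transformed global context.
        return x * self.transform(context).view(b, c, 1, 1)

# Usage on a decoder feature map.
feat = torch.randn(2, 64, 40, 40)
out = GlobalSceneAttention(64)(feat)
print(out.shape)  # torch.Size([2, 64, 40, 40])
```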
2. Related Work
2.1. Self-Supervised Monocular Depth Estimation
2.2. Monocular Depth Estimation for Aerial Images
3. Materials and Methods
3.1. Model Inputs
3.2. Network Architecture
3.2.1. Depth Network
3.2.2. Pose Network
3.3. Loss Functions
3.3.1. Gradient Discrimination Loss
3.3.2. Photometric Loss
3.3.3. Minimization of the Photometric Loss
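The loss formulas themselves are not reproduced in this excerpt. For reference, below is a hedged sketch of the standard photometric loss used in self-supervised depth estimation — an SSIM + L1 mix (SSIM per Wang et al. [36]) with the per-pixel minimum over reprojected source views, as popularized by Godard et al. [18]. The weight `alpha = 0.85` and all function names are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified single-scale SSIM from 3x3 average-pooled statistics [36]."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return ((1 - num / den) / 2).clamp(0, 1)

def photometric_loss(target, warped_sources, alpha=0.85):
    """SSIM + L1 error, minimized per pixel over source views [18]."""
    errors = []
    for warped in warped_sources:
        err = alpha * ssim(warped, target).mean(1, keepdim=True) \
            + (1 - alpha) * (warped - target).abs().mean(1, keepdim=True)
        errors.append(err)
    # The per-pixel minimum over views handles occlusion/disocclusion.
    return torch.cat(errors, dim=1).min(dim=1)[0].mean()
```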
3.4. Inference
4. Results
4.1. Dataset
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparison Methods
4.5. Qualitative Results in UAVid 2020
4.6. Quantitative Results in UAVid 2020
4.7. Ablation Study
5. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
UAV | Unmanned Aerial Vehicle
SLAM | Simultaneous Localization and Mapping
References
- Nex, F.; Remondino, F. UAV for 3D mapping applications: A review. Appl. Geomat. 2014, 6, 1–15.
- Berie, H.T.; Burud, I. Application of unmanned aerial vehicles in earth resources monitoring: Focus on evaluating potentials for forest monitoring in Ethiopia. Eur. J. Remote Sens. 2018, 51, 326–335.
- Noor, N.M.; Abdullah, A.; Hashim, M. Remote sensing UAV/drones and its applications for urban areas: A review. In Proceedings of the IOP Conference Series: Earth and Environmental Science, Kuala Lumpur, Malaysia, 24–25 April 2018; IOP Publishing: Bristol, UK, 2018; Volume 169, p. 012003.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
- Karsch, K.; Liu, C.; Kang, S.B. Depth extraction from video using non-parametric sampling. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 775–788.
- Zhang, R.; Tsai, P.S.; Cryer, J.E.; Shah, M. Shape-from-shading: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 690–706.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011.
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326.
- Lee, J.H.; Heo, M.; Kim, K.R.; Kim, C.S. Single-image depth estimation based on Fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 330–339.
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858.
- Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564.
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7286–7291.
- Tosi, F.; Aleotti, F.; Poggi, M.; Mattoccia, S. Learning monocular depth estimation infusing traditional stereo knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9799–9809.
- Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. LEGO: Learning edge with geometry all at once by watching videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 225–234.
- Spencer, J.; Bowden, R.; Hadfield, S. DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14402–14413.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279.
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992.
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838.
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8001–8008.
- Mou, L.; Zhu, X.X. IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network. arXiv 2018, arXiv:1802.10249.
- Hermann, M.; Ruf, B.; Weinmann, M.; Hinz, S. Self-supervised learning for monocular depth estimation from aerial imagery. arXiv 2020, arXiv:2008.07246.
- Madhuanand, L.; Nex, F.; Yang, M.Y. Self-supervised monocular depth estimation from oblique UAV videos. ISPRS J. Photogramm. Remote Sens. 2021, 176, 1–14.
- Prados, E.; Faugeras, O. Shape from shading. In Handbook of Mathematical Models in Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 375–388.
- Tsai, Y.M.; Chang, Y.L.; Chen, L.G. Block-based vanishing line and vanishing point detection for 3D scene reconstruction. In Proceedings of the 2006 International Symposium on Intelligent Signal Processing and Communications, Yonago, Japan, 12–15 December 2006; pp. 586–589.
- Tang, C.; Hou, C.; Song, Z. Depth recovery and refinement from a single image using defocus cues. J. Mod. Opt. 2015, 62, 441–448.
- Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High resolution self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2294–2301.
- Guizilini, V.; Ambrus, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2485–2494.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249.
- Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-wise attention-based network for self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 464–473.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Lyu, Y.; Vosselman, G.; Xia, G.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119.
- Yang, G.; Tang, H.; Ding, M.; Sebe, N.; Ricci, E. Transformer-based attention networks for continuous pixel-wise prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16269–16279.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Shu, C.; Yu, K.; Duan, Z.; Yang, K. Feature-metric loss for self-supervised learning of depth and egomotion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 572–588.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Schönberger, J.L.; Frahm, J.M. Structure-from-Motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 23 June 2023).
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
Quantitative comparison with existing self-supervised methods on UAVid 2020:

Method | Dataset | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---
Godard et al. [18] | UAVid 2020 | 0.1389 | 1.7943 | 4.5913 | 0.2130 | 0.8781 | 0.9537 | 0.9795 |
Yan et al. [29] | UAVid 2020 | 0.1297 | 3.2008 | 4.4344 | 0.1964 | 0.9177 | 0.9620 | 0.9782 |
Madhuanand et al. [22] | UAVid 2020 | 0.1383 | 3.2538 | 4.7721 | 0.2052 | 0.9054 | 0.9621 | 0.9792 |
Our model | UAVid 2020 | 0.0955 | 1.3705 | 3.3753 | 0.1724 | 0.9341 | 0.9730 | 0.9856 |
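The metric columns above follow the standard monocular-depth evaluation protocol, and the blank header cells were reconstructed accordingly. For reference, a minimal NumPy sketch of how these seven metrics are conventionally computed over valid ground-truth pixels (an assumption, since the paper's exact evaluation code is not shown in this excerpt):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular-depth metrics matching the table columns.
    gt, pred: 1-D arrays of valid ground-truth and predicted depths."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()       # δ < 1.25
    d2 = (thresh < 1.25 ** 2).mean()  # δ < 1.25²
    d3 = (thresh < 1.25 ** 3).mean()  # δ < 1.25³
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```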
Comparison of attention modules in the depth decoder on UAVid 2020 (GSA vs. alternatives):

Method | Dataset | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---
N/A | UAVid 2020 | 0.1210 | 1.7066 | 4.4226 | 0.1993 | 0.9033 | 0.9626 | 0.9817 |
CAM [40] | UAVid 2020 | 0.1487 | 3.3558 | 5.5283 | 0.2189 | 0.8918 | 0.9499 | 0.9722 |
SAM [40] | UAVid 2020 | 0.1341 | 3.2246 | 4.4733 | 0.2078 | 0.9091 | 0.9677 | 0.9799 |
Coordinate [41] | UAVid 2020 | 0.1060 | 1.2173 | 3.5873 | 0.1832 | 0.9190 | 0.9671 | 0.9845 |
GSA | UAVid 2020 | 0.0955 | 1.3705 | 3.3753 | 0.1724 | 0.9341 | 0.9730 | 0.9856 |
Ablation over pre-training and loss-function combinations on UAVid 2020:

Method | Pre-Train | Loss Function | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---|---
Baseline | ✗ | | 0.1797 | 3.8392 | 6.6307 | 0.2540 | 0.8377 | 0.9292 | 0.9622
Baseline | ✓ | | 0.1250 | 1.1581 | 3.7793 | 0.1987 | 0.8908 | 0.9605 | 0.9854
Our model | ✗ | | 0.1543 | 3.2377 | 4.7649 | 0.2374 | 0.8577 | 0.9439 | 0.9725
Our model | ✓ | | 0.1253 | 2.3697 | 3.9955 | 0.1949 | 0.9142 | 0.9658 | 0.9828
Our model | ✓ | | 0.1203 | 2.1244 | 3.4125 | 0.1891 | 0.9212 | 0.9693 | 0.9831
Our model | ✓ | | 0.1081 | 1.8056 | 3.3994 | 0.1809 | 0.9302 | 0.9715 | 0.9854
Our model | ✓ | | 0.0955 | 1.3705 | 3.3753 | 0.1724 | 0.9341 | 0.9730 | 0.9856