Joint Soft–Hard Attention for Self-Supervised Monocular Depth Estimation
Abstract
1. Introduction
- A novel soft attention module that integrates spatial attention and channel attention. The soft attention module not only correlates non-contiguous regions along the spatial dimensions but also models how attention is distributed across channels. In contrast to other recent attention methods [19,20], our soft attention module achieves better performance while adding only a small number of parameters (an illustrative sketch follows this list).
- Experiments show that our contributions achieve the best self-supervised performance on both the standard KITTI benchmark [2,22] and the Make3D dataset [23]. Both contributions, and in particular the proposed hard attention strategy, can be easily integrated into many existing high-performing approaches.
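As a concrete illustration of the soft attention module described above, the following PyTorch sketch pairs a non-local-style spatial attention [37] with a squeeze-and-excitation-style channel attention [35]. It is a minimal approximation under those assumptions, not the exact module proposed here; the class name SoftAttention, the 1/8 embedding ratio, and the reduction factor are illustrative choices.

```python
import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Illustrative spatial + channel attention block (not the exact module of this paper)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Spatial attention: non-local-style affinity between all positions.
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight
        # Channel attention: squeeze-and-excitation-style gating.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Spatial attention over all H*W positions.
        q = self.query(x).flatten(2).transpose(1, 2)            # B x HW x C'
        k = self.key(x).flatten(2)                               # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)                      # B x HW x HW
        v = self.value(x).flatten(2)                             # B x C x HW
        spatial = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        x = x + self.gamma * spatial
        # Channel attention from globally pooled statistics.
        weights = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # B x C x 1 x 1
        return x * weights
```

In the depth network table later in the document, a block of this kind occupies the "Spatial attention" and "Channel attention" rows between res4 and upconv4.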
2. Related Works
2.1. Monocular Depth Estimation
2.2. Attention Mechanism
3. Method
3.1. Overall Framework
3.2. Soft Attention Module
3.2.1. Spatial Attention
3.2.2. Channel Attention
3.3. Hard Attention Strategy
3.3.1. Self-Supervised Training
3.3.2. Hard Attention for Multi-Scale Estimation
4. Experiments
4.1. Implementation Details
4.2. KITTI Results
4.3. KITTI Ablation
4.4. Make3D Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhu, Z.; Su, A.; Liu, H.; Shang, Y.; Yu, Q. Vision navigation for aircrafts based on 3D reconstruction from real-time image sequences. Sci. China Technol. Sci. 2015, 58, 1196–1208.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
- El Jamiy, F.; Marsh, R. Survey on depth perception in head mounted displays: Distance estimation in virtual reality, augmented reality, and mixed reality. IET Image Process. 2019, 13, 707–712.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014.
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011.
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619.
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030.
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992.
- Zou, Y.; Luo, Z.; Huang, J.-B. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 36–53.
- Godard, C.; Aodha, O.M.; Firman, M.; Brostow, G. Digging into self-supervised monocular depth estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3827–3837.
- Vijayanarasimhan, S.; Ricco, S.; Schmid, C.; Sukthankar, R.; Fragkiadaki, K. SfM-Net: Learning of structure and motion from video. arXiv 2017, arXiv:1704.07804.
- Luo, C.; Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R.; Yuille, A. Every pixel counts ++: Joint learning of geometry and motion with 3D holistic understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2624–2641.
- Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6602–6611.
- Guizilini, V.; Hou, R.; Li, J.; Ambrus, R.; Gaidon, A. Semantically-guided representation learning for self-supervised monocular depth. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Klingner, M.; Termöhlen, J.-A.; Mikolajczyk, J.; Fingscheidt, T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 582–600.
- Guizilini, V.; Ambruș, R.; Pillai, S.; Raventos, A.; Gaidon, A. 3D packing for self-supervised monocular depth estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2482–2491.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Gao, F.; Yu, J.; Shen, H.; Wang, Y.; Yang, H. Attentional separation-and-aggregation network for self-supervised depth-pose learning in dynamic scenes. In Proceedings of the 4th Conference on Robot Learning (CoRL), Cambridge, MA, USA, 16–18 November 2020.
- Johnston, A.; Carneiro, G. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4755–4764.
- Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 740–756.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
- Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840.
- Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2024–2039.
- Lee, S.; Lee, J.; Kim, B.; Yi, E.; Kim, J. Patch-wise attention network for monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; pp. 1873–1881.
- Watson, J.; Firman, M.; Brostow, G.J.; Turmukhambetov, D. Self-supervised monocular depth hints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2162–2171.
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8001–8008.
- Gordon, A.; Li, H.; Jonschkowski, R.; Angelova, A. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8977–8986.
- Liu, C.; Gu, J.; Kim, K.; Narasimhan, S.G.; Kautz, J. Neural RGB→D sensing: Depth and uncertainty from a video camera. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10978–10987.
- Gonzalez Bello, J.L.; Kim, M. Forget about the LiDAR: Self-supervised depth estimators with MED probability volumes. Adv. Neural Inf. Process. Syst. 2020, 33, 12626–12637.
- Lyu, X.; Liu, L.; Wang, M.; Kong, X.; Liu, L.; Liu, Y.; Chen, X.; Yuan, Y. HR-Depth: High resolution self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; pp. 2294–2301.
- Patil, V.; Van Gansbeke, W.; Dai, D.; Van Gool, L. Don't forget the past: Recurrent depth estimation from monocular video. IEEE Robot. Autom. Lett. 2020, 5, 6813–6820.
- Watson, J.; Mac Aodha, O.; Prisacariu, V.; Brostow, G.; Firman, M. The temporal opportunist: Self-supervised multi-frame monocular depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 1164–1174.
- Galassi, A.; Lippi, M.; Torroni, P. Attention, please! A critical review of neural attention models in natural language processing. arXiv 2019, arXiv:1902.02181.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Wang, Y.; Yang, Y.; Yang, Z.; Zhao, L.; Wang, P.; Xu, W. Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4884–4893.
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
- Chen, Y.; Schmid, C.; Sminchisescu, C. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7063–7072.
- Luo, X.; Huang, J.-B.; Szeliski, R.; Matzen, K.; Kopf, J. Consistent video depth estimation. ACM Trans. Graph. 2020, 39, 71.
- Wang, J.; Zhang, G.; Wu, Z.; Li, X.; Liu, L. Self-supervised joint learning framework of depth estimation via implicit cues. arXiv 2020, arXiv:2006.09876v3.
- Kuznietsov, Y.; Proesmans, M.; Van Gool, L. CoMoDA: Continuous monocular depth adaptation using past experiences. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 2907–2917.
- McCraith, R.; Neumann, L.; Zisserman, A.; Vedaldi, A. Monocular depth estimation with self-supervised instance adaptation. arXiv 2020, arXiv:2004.05821.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249.
- Li, H.; Gordon, A.; Zhao, H.; Casser, V.; Angelova, A. Unsupervised monocular depth learning in dynamic scenes. arXiv 2020, arXiv:2010.16404.
- Karsch, K.; Liu, C.; Kang, S.B. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2144–2158.
- Liu, M.; Salzmann, M.; He, X. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 716–723.
Layer | k (Kernel) | s (Stride) | Channels | Downscale | Input | Activation |
---|---|---|---|---|---|---|
conv1 | 7 | 2 | 64 | 2 | image | ReLU |
maxpool | 3 | 2 | 64 | 4 | conv1 | none |
res1 | 3 | 1 | 64 | 4 | maxpool | ReLU |
res2 | 3 | 2 | 128 | 8 | res1 | ReLU |
res3 | 3 | 2 | 256 | 16 | res2 | ReLU |
res4 | 3 | 2 | 512 | 32 | res3 | ReLU |
Spatial attention | N/A | N/A | 512 | 32 | res4 | N/A |
Channel attention | N/A | N/A | 512 | 32 | Spatial attention | N/A |
upconv4 | 3 | 1 | 256 | 32 | Channel attention | ELU |
skipconv4 | 3 | 1 | 256 | 16 | upconv4↑, res3 | ELU
upconv3 | 3 | 1 | 128 | 16 | skipconv4 | ELU |
skipconv3 | 3 | 1 | 128 | 8 | upconv3↑, res2 | ELU |
scale3 | 3 | 1 | 1 | 1 | skipconv3 | Sigmoid |
upconv2 | 3 | 1 | 64 | 8 | skipconv3 | ELU |
skipconv2 | 3 | 1 | 64 | 4 | upconv2↑, res1 | ELU |
scale2 | 3 | 1 | 1 | 1 | skipconv2 | Sigmoid |
upconv1 | 3 | 1 | 32 | 4 | skipconv2 | ELU |
skipconv1 | 3 | 1 | 32 | 2 | upconv1↑, conv1 | ELU |
scale1 | 3 | 1 | 1 | 1 | skipconv1 | Sigmoid |
upconv0 | 3 | 1 | 16 | 2 | skipconv1 | ELU |
newconv0 | 3 | 1 | 16 | 1 | upconv0↑ | ELU |
scale0 | 3 | 1 | 1 | 1 | newconv0 | Sigmoid |
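The table above specifies the depth network: a ResNet encoder (conv1 to res4), the soft attention rows, and a U-Net-style decoder in which each stage upsamples ("↑" in the Input column), merges a skip connection, and optionally emits a sigmoid disparity map (the scale rows). The following is a minimal PyTorch sketch of one such decoder stage, assuming Monodepth2-style 3 × 3 ELU convolution blocks; the class names and the nearest-neighbour upsampling choice are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """3x3 conv + ELU, as used by the upconv/skipconv rows in the table (illustrative)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ELU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x))


class DecoderStage(nn.Module):
    """One upconv_i -> x2 upsample -> concat skip -> skipconv_i stage from the table."""

    def __init__(self, in_ch, skip_ch, out_ch, with_disp=False):
        super().__init__()
        self.upconv = ConvBlock(in_ch, out_ch)
        self.skipconv = ConvBlock(out_ch + skip_ch, out_ch)
        # 'scale' rows: 3x3 conv + sigmoid producing a 1-channel disparity map.
        self.disp = (
            nn.Sequential(nn.Conv2d(out_ch, 1, 3, padding=1), nn.Sigmoid())
            if with_disp else None
        )

    def forward(self, x, skip):
        x = self.upconv(x)
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # the upsampling arrow
        x = self.skipconv(torch.cat([x, skip], dim=1))
        return (x, self.disp(x)) if self.disp is not None else (x, None)
```

For example, the upconv4/skipconv4 rows correspond to DecoderStage(512, 256, 256), fed by the channel-attention output and the res3 features, while stages that feed a scale row (e.g., skipconv3/scale3) would set with_disp=True; the final upconv0/newconv0/scale0 stage has no skip input and is not covered by this sketch.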
Layer | k (Kernel) | s (Stride) | Channels | Downscale | Input | Activation |
---|---|---|---|---|---|---|
squeeze | 1 | 1 | 256 | 32 | res4 | ReLU |
pconv0 | 3 | 1 | 256 | 32 | squeeze*3 | ReLU |
pconv1 | 3 | 1 | 256 | 32 | pconv0 | ReLU |
pconv2 | 1 | 1 | 12 | 32 | pconv1 | none |
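This second table is the pose decoder, which regresses relative camera motion from the deepest encoder features. A minimal sketch consistent with it is shown below, assuming that "squeeze*3" denotes the squeezed res4 features of the three input frames concatenated along the channel dimension and that the 12 output channels of pconv2 encode an axis-angle rotation and a translation for each of the two adjacent frame pairs; the 0.01 output scaling is a common initialization convention, not something stated in the table.

```python
import torch
import torch.nn as nn


class PoseDecoder(nn.Module):
    """Illustrative pose decoder matching the table (squeeze, pconv0-2, 12 output channels)."""

    def __init__(self, enc_ch=512, num_frames=3, num_pairs=2):
        super().__init__()
        self.squeeze = nn.Conv2d(enc_ch, 256, kernel_size=1)           # squeeze, 1x1, ReLU
        self.pconv0 = nn.Conv2d(256 * num_frames, 256, 3, stride=1, padding=1)
        self.pconv1 = nn.Conv2d(256, 256, 3, stride=1, padding=1)
        self.pconv2 = nn.Conv2d(256, 6 * num_pairs, 1)                 # 12 = 6 DoF x 2 pairs
        self.relu = nn.ReLU(inplace=True)
        self.num_pairs = num_pairs

    def forward(self, feats):
        # feats: list of res4 feature maps, one per input frame.
        x = torch.cat([self.relu(self.squeeze(f)) for f in feats], dim=1)
        x = self.relu(self.pconv0(x))
        x = self.relu(self.pconv1(x))
        x = self.pconv2(x).mean(dim=(2, 3))                            # global average pool
        # Small scaling keeps the initial pose estimates close to the identity.
        x = 0.01 * x.view(-1, self.num_pairs, 6)
        return x[..., :3], x[..., 3:]                                  # axis-angle, translation
```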
Method | Test Frames | Abs Rel | Sq Rel | RMSE | RMSE Log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
---|---|---|---|---|---|---|---|---|
GLNet [45] | 3 (−1, 0, +1) | 0.099 | 0.796 | 4.743 | 0.186 | 0.884 | 0.955 | 0.979 |
Luo et al. [46] | N | 0.130 | 2.086 | 4.876 | 0.205 | 0.878 | 0.946 | 0.970 |
Struct2depth (M+R) [27] | 3 (−1, 0, +1) | 0.109 | 0.825 | 4.750 | 0.187 | 0.874 | 0.958 | 0.983 |
Patil et al. [32] | N | 0.111 | 0.821 | 4.650 | 0.187 | 0.883 | 0.961 | 0.982 |
Wang et al. [47] | 2 (−1, 0) | 0.106 | 0.799 | 4.662 | 0.187 | 0.889 | 0.961 | 0.982 |
CoMoDA [48] | N | 0.103 | 0.862 | 4.594 | 0.183 | 0.899 | 0.961 | 0.981 |
McCraith et al. [49] | 2 (0, +1) | 0.089 | 0.747 | 4.275 | 0.173 | 0.912 | 0.964 | 0.982 |
ManyDepth [33] | 2 (−1, 0) | 0.090 | 0.713 | 4.261 | 0.170 | 0.914 | 0.966 | 0.983 |
Ranjan et al. [50] | 1 | 0.148 | 1.149 | 5.464 | 0.226 | 0.815 | 0.935 | 0.973 |
EPC++ [13] | 1 | 0.141 | 1.029 | 5.350 | 0.216 | 0.816 | 0.941 | 0.976 |
Struct2depth (M) [27] | 1 | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
Videos in the wild [28] | 1 | 0.128 | 0.959 | 5.230 | 0.212 | 0.845 | 0.947 | 0.976 |
Li et al. [51] | 1 | 0.130 | 0.950 | 5.138 | 0.209 | 0.843 | 0.948 | 0.978 |
Guizilini et al. (R18) [15] | 1 | 0.117 | 0.854 | 4.714 | 0.191 | 0.873 | 0.963 | 0.981 |
Monodepth2 [11] | 1 | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
ASANet (R18) [19] | 1 | 0.112 | 0.866 | 4.693 | 0.189 | 0.881 | 0.961 | 0.981 |
Johnston et al. (R18) [20] | 1 | 0.111 | 0.941 | 4.817 | 0.189 | 0.885 | 0.961 | 0.981 |
PackNet-SfM [17] | 1 | 0.111 | 0.785 | 4.601 | 0.189 | 0.878 | 0.960 | 0.982 |
HR-Depth [31] | 1 | 0.109 | 0.792 | 4.632 | 0.185 | 0.884 | 0.962 | 0.983 |
SHdepth (R18) | 1 | 0.111 | 0.828 | 4.692 | 0.187 | 0.881 | 0.961 | 0.982 |
SHdepth (R50) | 1 | 0.108 | 0.812 | 4.634 | 0.185 | 0.887 | 0.962 | 0.982 |
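For completeness, the error and accuracy columns in this and the following tables are the standard monocular depth metrics introduced by Eigen et al. [4]. A minimal NumPy sketch of how they are typically computed for self-supervised methods (per-image median scaling, depths clipped to the usual 80 m cap) is given below; the function name and the clipping bounds follow common practice rather than this paper's exact evaluation script.

```python
import numpy as np


def depth_metrics(gt, pred, min_depth=1e-3, max_depth=80.0):
    """Abs Rel, Sq Rel, RMSE, RMSE log, and delta accuracies for valid depth pixels.

    gt, pred: 1-D arrays of ground-truth and predicted depths in metres.
    """
    pred = pred * np.median(gt) / np.median(pred)      # per-image median scaling
    pred = np.clip(pred, min_depth, max_depth)

    thresh = np.maximum(gt / pred, pred / gt)
    a1 = (thresh < 1.25).mean()
    a2 = (thresh < 1.25 ** 2).mean()
    a3 = (thresh < 1.25 ** 3).mean()

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```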
Ablation | Abs Rel | Sq Rel | RMSE | RMSE Log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
---|---|---|---|---|---|---|---|
Baseline w/o full-resolution multi-scale | 0.117 | 0.866 | 4.864 | 0.196 | 0.871 | 0.957 | 0.981
Baseline (MD2) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.981 |
Baseline + H | 0.113 | 0.865 | 4.809 | 0.191 | 0.878 | 0.960 | 0.981 |
Baseline + SA | 0.112 | 0.848 | 4.736 | 0.188 | 0.879 | 0.961 | 0.982 |
Baseline + CA | 0.114 | 0.902 | 4.855 | 0.192 | 0.878 | 0.959 | 0.981 |
Baseline + H + CA | 0.113 | 0.854 | 4.771 | 0.190 | 0.880 | 0.960 | 0.982 |
Baseline + H + SA | 0.112 | 0.826 | 4.700 | 0.189 | 0.878 | 0.961 | 0.982 |
Baseline + SA + CA | 0.111 | 0.833 | 4.704 | 0.189 | 0.881 | 0.960 | 0.981 |
Baseline + SA + CA + H (Ours) | 0.111 | 0.828 | 4.692 | 0.187 | 0.881 | 0.961 | 0.982 |
Method | Type | Abs Rel | Sq Rel | RMSE | log10 |
---|---|---|---|---|---|
Karsch [52] | D | 0.428 | 5.079 | 8.389 | 0.149 |
Liu [53] | D | 0.475 | 6.562 | 10.05 | 0.165 |
Laina [5] | D | 0.204 | 1.840 | 5.683 | 0.084 |
Monodepth [14] | S | 0.544 | 10.940 | 11.760 | 0.193 |
Monodepth2 [11] | MS | 0.374 | 3.792 | 8.238 | 0.201 |
Zhou [7] | M | 0.383 | 5.321 | 10.470 | 0.478 |
DDVO [8] | M | 0.387 | 4.720 | 8.090 | 0.204 |
Monodepth2 [11] | M | 0.322 | 3.589 | 7.417 | 0.163 |
SHdepth (Ours) | M | 0.320 | 3.031 | 7.280 | 0.163 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).