Monocular 3D Object Detection Based on Uncertainty Prediction of Keypoints
Abstract
1. Introduction
- (a) A method to predict keypoint uncertainty based on multi-clue fusion.
- (b) A strategy to optimize the 3D position by jointly considering the uncertainty.
- (c) KUP-Net outperforms previous methods on the KITTI dataset.
2. Materials
3. Methods
3.1. 2D Detection
3.2. 3D Detection
3.3. Uncertainty Prediction Module
Algorithm 1: The illustration of uncertainty prediction.
Input: feature map, keypoints offset
Output: uncertainty
1: Coordinate solution: keypoints coordinate
2: Diagonal points acquisition
3: 2D boxes acquisition
4: Feature region extraction
5: Position encoding
6: Feature fusion: concatenate the features from the preceding steps
7: Output: uncertainty
8: END
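The steps of Algorithm 1 can be sketched in NumPy as follows. This is only an illustrative reading of the step names, not the authors' implementation: the function names, the average pooling in step 4, the sin/cos encoding in step 5, the linear softplus head in step 7, and the choice of nine keypoints per object are all assumptions made for the example.

```python
import numpy as np

def fuse_keypoint_features(feature_map, centers, offsets):
    """feature_map: (C, H, W); centers: (N, 2) as (x, y); offsets: (N, K, 2)."""
    channels, height, width = feature_map.shape
    # Step 1: coordinate solution, keypoint = object center + predicted offset.
    keypoints = centers[:, None, :] + offsets
    # Step 2: diagonal points, the extreme corners spanned by the keypoints.
    top_left = np.floor(keypoints.min(axis=1))
    bottom_right = np.ceil(keypoints.max(axis=1))
    # Step 3: 2D boxes (x0, y0, x1, y1), clipped to the feature map.
    limits = np.array([width - 1, height - 1, width - 1, height - 1])
    boxes = np.concatenate([top_left, bottom_right], axis=1)
    boxes = np.clip(boxes, 0, limits).astype(int)
    fused = []
    for x0, y0, x1, y1 in boxes:
        # Step 4: feature region extraction (average pooling is an assumption).
        region = feature_map[:, y0:y1 + 1, x0:x1 + 1]
        pooled = region.mean(axis=(1, 2))
        # Step 5: position encoding of the normalised box (sin/cos is assumed).
        norm = np.array([x0, y0, x1, y1]) / limits
        encoding = np.concatenate([np.sin(np.pi * norm), np.cos(np.pi * norm)])
        # Step 6: feature fusion by concatenation.
        fused.append(np.concatenate([pooled, encoding]))
    return np.stack(fused), boxes

def predict_uncertainty(fused, weights, bias):
    # Step 7: hypothetical linear head; softplus keeps every output positive.
    return np.logaddexp(0.0, fused @ weights + bias)

# Toy inputs: 2 objects, 9 keypoints each (assumed), random features.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(16, 24, 80))
centers = np.array([[10.0, 12.0], [70.0, 5.0]])
offsets = rng.uniform(-4, 4, size=(2, 9, 2))
fused, boxes = fuse_keypoint_features(feature_map, centers, offsets)
weights = rng.normal(scale=0.1, size=(fused.shape[1], 9))
sigma = predict_uncertainty(fused, weights, np.zeros(9))
print(fused.shape, sigma.shape)
```

Each object yields one fused vector (16 pooled channels plus 8 encoding values here) and one positive uncertainty per keypoint. In a trained network the head weights would be learned rather than drawn at random.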
3.4. Loss Function
4. Results
4.1. Experimental Setting
4.2. Experimental Results
4.2.1. Qualitative Results of Cars
4.2.2. Quantitative Results of Cars
4.2.3. Results of 2D and Multi-Class Detection
4.3. Ablation Study
4.3.1. Effect of Timing Coefficient
4.3.2. Effect of Uncertainty Prediction Mode
4.3.3. Effect of Position Encoding
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
References
- Li, S.; Yan, Z.; Li, H.; Cheng, K.T. Exploring intermediate representation for monocular vehicle pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
- Yang, B.; Luo, W.; Urtasun, R. PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough Voting for 3D Object Detection in Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019.
- Peng, W.; Pan, H.; Liu, H.; Sun, Y. IDA-3D: Instance-Depth-Aware 3D Object Detection from Stereo Vision for Autonomous Driving. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
- Ferryman, J.M.; Maybank, S.J.; Worrall, A.D. Visual surveillance for moving vehicles. Int. J. Comput. Vis. 2000, 37, 187–197.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Urtasun, R. Monocular 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Xu, B.; Chen, Z. Multi-level Fusion Based 3D Object Detection from Monocular Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Manhardt, F.; Kehl, W.; Gaidon, A. ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Chen, Y.; Tai, L.; Sun, K.; Li, M. MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Ma, X.; Wang, Z.; Li, H.; Zhang, P.; Ouyang, W.; Fan, X. Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-Fine Many-Task Network for Joint 2D and 3D Vehicle Analysis from Monocular Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Brazil, G.; Liu, X. M3D-RPN: Monocular 3D Region Proposal Network for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
- Li, P.; Zhao, H.; Liu, P.; Cao, F. RTM3D: Real-Time Monocular 3D Detection from Object Keypoints for Autonomous Driving. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
- Li, P. Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training. IEEE Robot. Autom. Lett. 2021, 6, 5565–5572.
- Kendall, A.; Gal, Y. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv 2017, arXiv:1703.04977.
- Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
- Liu, C.; Gu, J.; Kim, K.; Narasimhan, S.G.; Kautz, J. Neural RGB→D Sensing: Depth and Uncertainty From a Video Camera. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10978–10987.
- Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight Uncertainty in Neural Networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015.
- Wirges, S.; Reith-Braun, M.; Lauer, M.; Stiller, C. Capturing Object Detection Uncertainty in Multi-Layer Grid Maps. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019.
- Bertoni, L.; Kreiss, S.; Alahi, A. MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019.
- Giles, M. An Extended Collection of Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation; Oxford University Computing Laboratory: Oxford, UK, 2008.
- Ionescu, C.; Vantzos, O.; Sminchisescu, C. Matrix Backpropagation for Deep Networks with Structured Layers. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2965–2973.
- Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. arXiv 2016, arXiv:1610.02242.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Simonelli, A.; Bulò, S.R.R.; Porzi, L.; López-Antequera, M.; Kontschieder, P. Disentangling Monocular 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019.
- Feng, D.; Rosenbaum, L.; Dietmayer, K. Towards Safe Autonomous Driving: Capture Uncertainty in the Deep Neural Network for Lidar 3D Vehicle Detection. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
- Yu, X.; Choi, W.; Lin, Y.; Savarese, S. Data-Driven 3D Voxel Patterns for Object Category Recognition. In Proceedings of the CVPR 2015, Boston, MA, USA, 7–12 June 2015.
- Chen, X.; Kundu, K.; Zhu, Y.; Ma, H.; Fidler, S.; Urtasun, R. 3D Object Proposals using Stereo Imagery for Accurate Object Class Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1259–1272.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
- Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Ku, J.; Pon, A.D.; Waslander, S.L. Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Qin, Z.; Wang, J.; Lu, Y. Triangulation Learning Network: From Monocular to Stereo 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Choi, J.; Chun, D.; Kim, H.; Lee, H. Gaussian YOLOv3: An Accurate and Fast Object Detector Using Localization Uncertainty for Autonomous Driving. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
- He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding Box Regression with Uncertainty for Accurate Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Zhang, Q.; Yang, Y. ResT: An Efficient Transformer for Visual Recognition. arXiv 2021, arXiv:2105.13677.
- Alaparthi, S.; Mishra, M. Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey. arXiv 2020, arXiv:2007.01127.
Method | Extra | IoU = 0.5 (E) | IoU = 0.5 (M) | IoU = 0.5 (H) | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) |
---|---|---|---|---|---|---|---|
Mono3D [7] | Mask | 25.19/ - | 18.20/ - | 15.52/ - | 2.53/ - / - | 2.31/ - / - | 2.31/ - / - |
ROI-10D [9] | Depth | 37.59/ - | 25.14/ - | 21.83/ - | 9.61/ - /12.30 | 6.63/ - /10.30 | 6.29/ - / 9.39 |
MF3D [8] | Depth | 47.88/45.57 | 29.48/30.03 | 26.44/23.95 | 10.53/ 7.85 /7.08 | 5.69/ 5.39 /5.18 | 5.39/ 4.73 /4.68 |
3DOP [33] | Stereo | 46.04/ - | 34.63/ - | 30.09/ - | 6.55/ - / - | 5.07/ - / - | 4.10/ - / - |
MonoPSR [36] | Lidar | 49.65/48.89 | 41.71/40.93 | 29.95/33.43 | 12.75/13.94/12.57 | 11.48/12.24/10.85 | 8.59/10.77/9.06 |
M3D-RPN [15] | None | 48.96/49.89 | 39.57/36.14 | 33.01/28.98 | 20.27/20.40/14.76 | 17.06/16.48/9.71 | 15.21/13.34/7.42 |
GS3D [37] | None | 32.15/30.60 | 29.89/26.40 | 26.19/22.89 | 13.46/11.63/4.47 | 10.97/10.51/2.90 | 10.38/10.51/2.47 |
Deep3DBox [12] | None | 27.04/ - | 20.55/ - | 15.88/ - | 5.85/ - / - | 4.10/ - / - | 3.84/ - / - |
MonoPair [10] | None | - / - | - / - | - / - | - / - /13.04 | - / - /9.99 | - / - /8.65 |
RTM3D [16] | None | 54.36/52.59 | 41.90/40.96 | 35.84/34.95 | 20.77/19.47/13.61 | 16.86/16.29/10.09 | 16.63/15.57/8.18 |
KM3D [18] | None | 56.02/54.09 | 43.13/43.07 | 36.77/37.56 | 22.50/22.71/16.73 | 19.60/17.71/11.45 | 17.12/16.15/9.92 |
Ours | None | 56.51/54.63 | 42.75/43.56 | 36.15/36.02 | 22.97/23.14/17.26 | 19.23/20.12/11.78 | 16.95/16.84/9.51 |
Method | Extra | IoU = 0.5 (E) | IoU = 0.5 (M) | IoU = 0.5 (H) | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) |
---|---|---|---|---|---|---|---|
Mono3D [7] | Mask | 30.05/ - | 22.39/ - | 19.16/ - | 5.22/ - / - | 5.19/ - / - | 4.13/ - / - |
ROI-10D [9] | Depth | 46.85/ - | 34.05/ - | 30.46/ - | 14.50/ - /16.77 | 9.91/ - /12.40 | 8.73/ - /11.39 |
MF3D [8] | Depth | 55.02/54.18 | 36.73/38.06 | 31.27/31.46 | 22.03/19.20/13.73 | 13.63/12.17/9.62 | 11.60/10.89/8.22 |
3DOP [33] | Stereo | 55.04/ - | 41.25/ - | 34.55/ - | 12.63/ - / - | 9.49/ - / - | 7.59/ - / - |
MonoPSR [36] | Lidar | 56.97/55.45 | 43.39/43.31 | 36.00/35.47 | 20.63/21.52/20.25 | 18.67/18.90/17.66 | 14.45/14.94/15.78 |
M3D-RPN [15] | None | 55.37/55.87 | 42.49/41.36 | 35.29/34.08 | 25.94/26.86/21.02 | 21.18/21.15/13.67 | 17.90/17.14/10.23 |
GS3D [37] | None | - / - | - / - | - / - | - / - /8.41 | - / - /6.08 | - / - /4.94 |
Deep3DBox [12] | None | 30.02/ - | 23.77/ - | 18.83/ - | 9.99/ - / - | 7.71/ - / - | 5.30/ - / - |
MonoPair [10] | None | - / - | - / - | - / - | - / - /19.28 | - / - /14.83 | - / - /12.89 |
RTM3D [16] | None | 57.47/56.90 | 44.16/44.69 | 42.31/41.75 | 25.56/24.74/ - | 22.12/22.03/ - | 20.91/18.05/ - |
KM3D [18] | None | 62.39/59.35 | 49.93/45.14 | 43.73/42.47 | 27.83/28.87/23.44 | 23.38/22.87/16.20 | 21.69/22.55/14.47 |
Ours | None | 62.57/59.73 | 49.19/49.36 | 43.61/43.18 | 28.73/28.16/23.59 | 23.27/23.41/16.63 | 21.32/21.68/14.25 |
Method (@IoU = 0.7) | E | M | H | E | M | H |
---|---|---|---|---|---|---|
AM3D [11] | 92.55 | 88.71 | 77.78 | - | - | - |
TLNet [38] | 76.92 | 63.53 | 54.58 | - | - | - |
MonoDIS [29] | 94.61 | 89.15 | 78.37 | - | - | - |
MonoPSR [36] | 93.63 | 88.50 | 73.36 | 93.29 | 87.45 | 72.26 |
M3D-RPN [15] | 89.04 | 85.08 | 69.26 | 88.38 | 82.81 | 67.08 |
KM3D [18] | 96.44 | 91.07 | 81.19 | 96.34 | 90.70 | 90.72 |
MonoPair [10] | 96.61 | 93.55 | 83.55 | 91.65 | 86.11 | 76.45 |
Ours | 96.59 | 92.47 | 83.14 | 94.28 | 89.41 | 84.23 |
Method | IoU = 0.5 (E) | IoU = 0.5 (M) | IoU = 0.5 (H) | IoU = 0.5 (E) | IoU = 0.5 (M) | IoU = 0.5 (H) |
---|---|---|---|---|---|---|
Pedestrian | 11.35 | 10.42 | 10.37 | 12.10 | 11.35 | 10.46 |
Cyclist | 15.68 | 11.64 | 11.03 | 15.97 | 11.81 | 11.14 |
Poe | Tic | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) |
---|---|---|---|---|---|---|---|
- | - | 10.13 | 4.58 | 3.07 | 15.28 | 8.51 | 7.19 |
- | ✓ | 12.42 | 6.71 | 5.14 | 18.53 | 11.57 | 9.65 |
✓ | - | 15.73 | 10.06 | 8.14 | 21.64 | 14.76 | 12.53 |
✓ | ✓ | 17.26 | 11.78 | 9.51 | 23.59 | 16.63 | 14.25 |
| | | | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) | IoU = 0.7 (E) | IoU = 0.7 (M) | IoU = 0.7 (H) |
---|---|---|---|---|---|---|---|---|
✓ | ✓ | - | 13.91 | 8.45 | 6.58 | 19.42 | 12.56 | 10.39 |
- | - | ✓ | 17.26 | 11.78 | 9.51 | 23.59 | 16.63 | 14.25 |
- | - | - | 16.38 | 10.82 | 8.74 | 22.35 | 15.41 | 13.26 |
✓ | ✓ | ✓ | 16.04 | 10.39 | 8.57 | 21.95 | 15.02 | 12.98 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, M.; Zhao, H.; Liu, P. Monocular 3D Object Detection Based on Uncertainty Prediction of Keypoints. Machines 2022, 10, 19. https://doi.org/10.3390/machines10010019