FCNet: Stereo 3D Object Detection with Feature Correlation Networks
Abstract
1. Introduction
- A simple and efficient stereo 3D object detection method, FCNet, is proposed. Compared with state-of-the-art approaches, it achieves better performance without LiDAR point clouds or other additional supervision, while maintaining an inference time of about 0.1 s per frame.
- After building a multi-scale cost volume that encodes implicit depth information, we develop a variant attention module to enhance both the global structure representation and the local detail description of the cost volume. Depth regression is supervised with a region depth loss.
- When integrating the last-layer features of the binocular images, a channel reweighting strategy strengthens the feature correlation: redundant, strongly correlated channels are removed, while weakly correlated channels that carry significant differences are retained. This balances channel information preservation against the computational burden. (A minimal sketch of the cost-volume construction and channel reweighting follows this list.)
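The following is a minimal, illustrative PyTorch sketch (not the authors' implementation) of two of the ideas above: a correlation cost volume built from left/right feature maps over a disparity range, so that depth information is encoded implicitly, and an SE-style channel reweighting applied when fusing the last-layer stereo features. All identifiers (`build_corr_cost_volume`, `ChannelReweight`, `max_disp`, `reduction`) are assumed names for illustration; the paper's multi-scale variant attention module and exact reweighting rule are not reproduced here.

```python
# Minimal sketch under stated assumptions; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_corr_cost_volume(left_feat: torch.Tensor,
                           right_feat: torch.Tensor,
                           max_disp: int) -> torch.Tensor:
    """Correlation cost volume of shape (B, max_disp, H, W).

    For each candidate disparity d, the right feature map is shifted by
    d pixels and correlated (channel-wise dot product) with the left
    feature map, so depth is encoded implicitly in the disparity axis.
    """
    b, c, h, w = left_feat.shape
    cost = left_feat.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            cost[:, d, :, d:] = (left_feat[:, :, :, d:] *
                                 right_feat[:, :, :, :-d]).mean(dim=1)
    return cost


class ChannelReweight(nn.Module):
    """SE-style channel reweighting: channels are rescaled by learned
    weights so that redundant, strongly correlated channels can be
    suppressed while distinctive, weakly correlated channels are kept."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))  # (B, C)
        return x * w.view(b, c, 1, 1)


if __name__ == "__main__":
    left = torch.randn(2, 64, 48, 160)   # last-layer stereo features (toy sizes)
    right = torch.randn(2, 64, 48, 160)
    cost = build_corr_cost_volume(left, right, max_disp=24)
    fused = ChannelReweight(64)(left)    # reweighted left-branch features
    print(cost.shape, fused.shape)       # (2, 24, 48, 160), (2, 64, 48, 160)
```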
2. Related Works
2.1. Image Depth Estimation
2.2. 3D Object Detection
3. Methods
3.1. Anchor Preprocessing
3.2. Sparse Depth Feature Extraction
3.3. Feature Correlation Model
3.4. Loss Function
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Ablation Studies
4.3. Qualitative Comparison
4.4. Quantitative Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Zhou, Y.; Sun, P.; Zhang, Y.; Anguelov, D.; Gao, J.; Ouyang, T.; Guo, J.; Ngiam, J.; Vasudevan, V. End-to-end multi-view fusion for 3d object detection in lidar point clouds. In Proceedings of the Conference on Robot Learning, Virtual Event, 16–18 November 2020; pp. 923–932.
2. Liu, Y.; Han, C.; Zhang, L.; Gao, X. Pedestrian detection with multi-view convolution fusion algorithm. Entropy 2022, 24, 165.
3. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10529–10538.
4. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. arXiv 2020, arXiv:2012.15712.
5. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11040–11048.
6. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1951–1960.
7. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
8. Zhang, Y.; Huang, D.; Wang, Y. PC-RGNN: Point Cloud Completion and Graph Neural Network for 3D Object Detection. arXiv 2020, arXiv:2012.10412.
9. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499.
10. Peng, L.; Liu, F.; Yan, S.; He, X.; Cai, D. Ocm3d: Object-centric monocular 3d object detection. arXiv 2021, arXiv:2104.06041.
11. Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2147–2156.
12. Li, P.; Zhao, H.; Liu, P.; Cao, F. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 644–660.
13. Luo, S.; Dai, H.; Shao, L.; Ding, Y. M3DSSD: Monocular 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–25 June 2021; pp. 6145–6154.
14. Bao, W.; Xu, B.; Chen, Z. Monofenet: Monocular 3d object detection with feature enhancement networks. IEEE Trans. Image Process. 2019, 29, 2753–2765.
15. Chen, H.; Huang, Y.; Tian, W.; Gao, Z.; Xiong, L. Monorun: Monocular 3d object detection by reconstruction and uncertainty propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10379–10388.
16. You, Y.; Wang, Y.; Chao, W.L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar++: Accurate depth for 3d object detection in autonomous driving. arXiv 2019, arXiv:1906.06310.
17. Li, P.; Chen, X.; Shen, S. Stereo r-cnn based 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7644–7652.
18. Liu, Y.; Wang, L.; Liu, M. Yolostereo3d: A step back to 2d for efficient stereo 3d detection. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13018–13024.
19. Shi, Y.; Guo, Y.; Mi, Z.; Li, X. Stereo CenterNet-based 3D object detection for autonomous driving. Neurocomputing 2022, 471, 219–229.
20. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 8445–8453.
21. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7074–7082.
22. Li, S.; He, J.; Li, Y.; Rafique, M.U. Distributed recurrent neural networks for cooperative control of manipulators: A game-theoretic perspective. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 415–426.
23. Li, S.; Zhang, Y.; Jin, L. Kinematic control of redundant manipulators using neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2243–2254.
24. Qian, R.; Garg, D.; Wang, Y.; You, Y.; Belongie, S.; Hariharan, B.; Campbell, M.; Weinberger, K.Q.; Chao, W.L. End-to-end pseudo-lidar for image-based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5881–5890.
25. Sun, J.; Chen, L.; Xie, Y.; Zhang, S.; Jiang, Q.; Zhou, X.; Bao, H. Disp r-cnn: Stereo 3d object detection via shape prior guided instance disparity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10548–10557.
26. Pon, A.D.; Ku, J.; Li, C.; Waslander, S.L. Object-centric stereo matching for 3d object detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8383–8389.
27. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
28. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253.
29. Huynh, L.; Nguyen, P.; Matas, J.; Rahtu, E.; Heikkilä, J. Boosting monocular depth estimation with lightweight 3d point fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12767–12776.
30. Luo, W.; Schwing, A.G.; Urtasun, R. Efficient deep learning for stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5695–5703.
31. Zbontar, J.; LeCun, Y. Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 2016, 17, 2287–2318.
32. Chang, J.R.; Chen, Y.S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5410–5418.
33. Qin, Z.; Wang, J.; Lu, Y. Triangulation learning network: From monocular to stereo 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 7615–7623.
34. Zhang, S.; Wang, Z.; Wang, Q.; Zhang, J.; Wei, G.; Chu, X. EDNet: Efficient Disparity Estimation with Cost Volume Combination and Attention-based Spatial Residual. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5433–5442.
35. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413.
36. Shi, S.; Wang, Z.; Wang, X.; Li, H. Part-A2 net: 3d part-aware and aggregation neural network for object detection from point cloud. arXiv 2019, arXiv:1907.03670.
37. Liu, Z.; Zhou, D.; Lu, F.; Fang, J.; Zhang, L. Autoshape: Real-time shape-aware monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15641–15650.
38. Wang, L.; Du, L.; Ye, X.; Fu, Y.; Guo, G.; Xue, X.; Feng, J.; Zhang, L. Depth-conditioned dynamic message propagation for monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 454–463.
39. Qin, Z.; Wang, J.; Lu, Y. Monogrnet: A general framework for monocular 3d object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5170–5184.
40. Li, P.; Su, S.; Zhao, H. RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1930–1939.
41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
42. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
43. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
44. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337.
45. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
46. Liu, C.; Gu, S.; Van Gool, L.; Timofte, R. Deep Line Encoding for Monocular 3D Object Detection and Depth Prediction. In Proceedings of the 32nd British Machine Vision Conference (BMVC 2021), Online, 22–25 November 2021; p. 354.
47. Chen, Y.; Liu, S.; Shen, X.; Jia, J. Dsgn: Deep stereo geometry network for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12536–12545.
48. Zhang, Y.; Lu, J.; Zhou, J. Objects are different: Flexible monocular 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3289–3298.
49. Liu, X.; Xue, N.; Wu, T. Learning Auxiliary Monocular Contexts Helps Monocular 3D Object Detection. arXiv 2021, arXiv:2112.04628.
50. Weng, X.; Kitani, K. A baseline for 3d multi-object tracking. arXiv 2019, arXiv:1907.03961.
51. Gao, A.; Pang, Y.; Nie, J.; Cao, J.; Guo, Y. EGFN: Efficient Geometry Feature Network for Fast Stereo 3D Object Detection. arXiv 2021, arXiv:2111.14055.
52. Königshof, H.; Salscheider, N.O.; Stiller, C. Realtime 3d object detection for automated driving using stereo vision and semantic information. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 1405–1410.
53. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497.
54. Beltrán, J.; Guindel, C.; Moreno, F.M.; Cruzado, D.; Garcia, F.; De La Escalera, A. Birdnet: A 3d object detection framework from lidar information. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3517–3523.
| Methods | Car Easy (%) | Car Moderate (%) | Car Hard (%) |
| --- | --- | --- | --- |
| w/o DLA | 63.84 | 38.73 | 30.19 |
| w/o Anchor Filtering | 63.58 | 39.19 | 30.14 |
| w/o Bounding Box prior | 62.17 | 37.43 | 29.84 |
| w/o PCA | 63.31 | 39.42 | 30.45 |
| w/o CSR | 63.94 | 40.25 | 30.12 |
| w CSR | 64.13 | 40.87 | 31.81 |
| w CSR | 64.24 | 41.09 | 31.97 |
| FCNet | 64.89 | 41.93 | 32.60 |
| Methods | Pedestrian Easy (%) | Pedestrian Moderate (%) | Pedestrian Hard (%) |
| --- | --- | --- | --- |
| w/o DLA | 29.34 | 23.07 | 17.31 |
| w/o Anchor Filtering | 31.58 | 22.89 | 19.34 |
| w/o Bounding Box prior | 29.67 | 22.51 | 19.42 |
| w/o PCA | 28.94 | 22.43 | 17.50 |
| w/o CSR | 30.47 | 23.25 | 19.17 |
| w CSR | 31.71 | 24.77 | 19.66 |
| w CSR | 31.29 | 23.79 | 19.63 |
| FCNet | 32.51 | 25.04 | 20.39 |
| Methods | Data | Time | Car Easy (%) | Car Moderate (%) | Car Hard (%) | Pedestrian Easy (%) | Pedestrian Moderate (%) | Pedestrian Hard (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MonoFlex [48] | Mono | 0.03 s | 19.94 | 13.89 | 12.07 | 9.43 | 6.31 | 5.26 |
| MonoRun [15] | Mono | 0.07 s | 19.65 | 12.30 | 10.58 | 10.88 | 6.78 | 5.83 |
| MonoCon [49] | Mono | 0.02 s | 22.50 | 16.46 | 13.95 | 13.10 | 8.41 | 6.94 |
| DLE [46] | Mono | 0.06 s | 24.23 | 14.33 | 10.30 | - | - | - |
| Complexer-YOLO [50] | LiDAR | 0.06 s | 55.93 | 47.34 | 42.60 | 17.60 | 13.96 | 12.70 |
| EGFN [51] | Stereo | 0.06 s | 65.80 | 46.39 | 38.42 | 14.05 | 10.27 | 9.02 |
| Pseudo-LiDAR [20] | Stereo | 0.4 s | 54.53 | 34.05 | 28.25 | - | - | - |
| Pseudo-LiDAR++ [16] | Stereo | 0.4 s | 61.11 | 42.43 | 36.99 | - | - | - |
| Disp R-CNN [25] | Stereo | 0.387 s | 67.02 | 43.27 | 36.43 | 35.75 | 25.40 | 21.79 |
| RT3DStereo [52] | Stereo | 0.08 s | 29.90 | 23.28 | 18.96 | 3.28 | 2.45 | 2.35 |
| Stereo-RCNN [17] | Stereo | 0.30 s | 47.58 | 30.23 | 23.72 | - | - | - |
| OC Stereo [26] | Stereo | 0.35 s | 55.15 | 37.60 | 30.25 | 24.48 | 17.58 | 15.60 |
| Stereo-CenterNet [19] | Stereo | 0.04 s | 49.94 | 31.30 | 25.62 | - | - | - |
| TLNet [33] | Stereo | 0.1 s | 7.64 | 4.37 | 3.74 | - | - | - |
| Yolostereo3D [18] | Stereo | 0.1 s | 65.68 | 41.25 | 30.42 | 28.49 | 19.75 | 16.48 |
| DSGN [47] | Stereo | 0.67 s | 73.50 | 52.18 | 45.14 | 20.53 | 15.55 | 14.15 |
| FCNet (ours) | Stereo | 0.1 s | 67.83 | 41.32 | 31.48 | 30.15 | 20.84 | 18.43 |
| Methods | Data | Car Easy (%) | Car Moderate (%) | Car Hard (%) |
| --- | --- | --- | --- | --- |
| Faster R-CNN [53] | Mono | 87.90 | 79.11 | 70.19 |
| MonoGRNet [39] | Mono | 88.65 | 77.94 | 63.31 |
| RTM3D [12] | Mono | 91.82 | 86.93 | 77.41 |
| YOLOMono3D [18] | Mono | 92.37 | 79.63 | 59.69 |
| Mono3D [11] | Mono | 90.27 | 87.86 | 78.09 |
| MV3D [27] | LiDAR | 68.35 | 54.54 | 49.16 |
| BirdNet [54] | LiDAR | 79.30 | 57.12 | 55.16 |
| Stereo-RCNN [17] | Stereo | 93.98 | 85.98 | 71.25 |
| TLNet [33] | Stereo | 76.92 | 63.53 | 54.58 |
| Pseudo-LiDAR [20] | Stereo | 85.40 | 67.79 | 58.50 |
| Yolostereo3D [18] | Stereo | 94.81 | 82.15 | 62.17 |
| FCNet (ours) | Stereo | 94.77 | 84.53 | 64.54 |