Deep Dual-Modal Traffic Objects Instance Segmentation Method Using Camera and LIDAR Data for Autonomous Driving
Abstract
1. Introduction
- (1) A novel dual-modal instance segmentation deep neural network (DM-ISDNN) is presented for target detection using fused camera and LIDAR data. The proposed method provides significant technical contributions and can be used for target detection under various complex environmental conditions (e.g., low illumination or poor visibility);
- (2) The early-, middle-, and late-stage data fusion architectures are compared and analyzed in depth, and the middle-stage fusion architecture with a weight assignment mechanism is selected for feature fusion by comprehensively considering detection accuracy and detection speed. This work has great significance for exploring the best feature fusion scheme for a multi-modal neural network;
- (3) Due to the sparseness of LIDAR point cloud data, we propose a weight assignment function that assigns different weight coefficients to the different feature pyramid convolutional layers of the LIDAR sub-network (see the sketch after this list). The weight assignment mechanism is a novel exploration for optimizing the multi-sensor feature fusion effect in deep neural networks. In addition, we apply a mask distribution function to improve the quality of the predicted masks;
- (4) We provide a manually annotated dual-modal traffic object instance segmentation dataset built from 7481 camera and LIDAR data pairs of the KITTI dataset, with 79,118 instance masks annotated. To the best of our knowledge, no existing instance annotation on the KITTI dataset matches this quality and volume;
- (5) A novel dual-modal dataset with 14,652 camera and LIDAR data pairs is acquired using our self-designed autonomous vehicle under different environmental conditions in real traffic scenarios. A total of 62,579 instance masks are obtained using a semi-automatic annotation method, which can be used to validate the effectiveness and efficiency of instance segmentation deep neural networks under complex environmental conditions.
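The weight assignment idea in contributions (2) and (3) can be illustrated with a minimal sketch. The code below assumes PyTorch and camera/LIDAR FPN feature maps with matching shapes at each pyramid level; the module name `WeightedMidFusion` and the coefficient list `lidar_weights` are hypothetical choices of ours, not the authors' implementation, and only show the idea of scaling the sparser LIDAR features per pyramid level before merging them with the camera features.

```python
# Minimal sketch (not the authors' implementation) of middle-stage feature
# fusion with a per-level weight assignment, assuming PyTorch and camera/LIDAR
# FPN feature maps with matching shapes at each pyramid level.
import torch
import torch.nn as nn


class WeightedMidFusion(nn.Module):
    """Fuse camera and LIDAR FPN features level by level.

    `lidar_weights` is a hypothetical per-level coefficient tuple that
    down-weights the sparser LIDAR features; the actual weight assignment
    function of DM-ISDNN is defined in the paper's methodology (Section 2.2).
    """

    def __init__(self, channels, lidar_weights=(0.3, 0.4, 0.5, 0.6, 0.7)):
        super().__init__()
        self.lidar_weights = lidar_weights
        # 1x1 convolutions merge the concatenated dual-modal features per level.
        self.merge = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=1) for _ in lidar_weights]
        )

    def forward(self, cam_feats, lidar_feats):
        fused = []
        for level, (cam, lid) in enumerate(zip(cam_feats, lidar_feats)):
            # Scale the LIDAR feature map by its level-specific coefficient,
            # concatenate it with the camera feature map, and merge channels.
            x = torch.cat([cam, self.lidar_weights[level] * lid], dim=1)
            fused.append(self.merge[level](x))
        return fused
```

In this sketch, smaller coefficients can be placed on the finer pyramid levels, where the projected LIDAR data is sparsest; the exact weighting function and fusion operators used by DM-ISDNN are those described in the methodology of the paper.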
2. Methodology
2.1. Input LIDAR Data Preparation
2.2. Network Architecture
2.2.1. Feature Extraction Sub-Network
2.2.2. Data Fusion Sub-Network
2.2.3. Bounding Box Prediction and Classification Sub-Network
2.2.4. Mask Prediction Sub-Network
2.2.5. Loss Function
3. Dataset and Network Training
3.1. Dual-Modal KITTI Dataset
3.2. Dual-Modal Zhi-Shan Dataset
3.2.1. Data Collection Platform
3.2.2. Distribution of the Dual-Modal Zhi-Shan Dataset
3.3. Network Training
4. Experiment and Analysis of Proposed Method
4.1. Experimental Setup
4.2. Comparison of Experimental Results and Analysis
4.2.1. Average Precision
4.2.2. Processing Time
4.2.3. Average Accuracy
4.2.4. Changes of Loss Functions
4.2.5. Classification Prediction Results
4.3. Influence of the Feature Impact Factor
5. Detection Results Comparison on the Dual-Modal Zhi-Shan Dataset
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Zhu, J.S.; Ke, S.; Sen, J.; Lin, W.D.; Hou, X.X.; Liu, B.Z.; Qiu, G.P. Bidirectional Long Short-Term Memory Network for Vehicle Behavior Recognition. Remote Sens. 2018, 10, 887.
- Stateczny, A.; Kazimierski, W.; Gronska-Sledz, D.; Motyl, W. The Empirical Application of Automotive 3D Radar Sensor for Target Detection for an Autonomous Surface Vehicle’s Navigation. Remote Sens. 2019, 11, 1156.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
- Fu, C.Y.; Shvets, M.; Berg, A.C. RetinaMask: Learning to Predict Masks Improves State-of-the-Art Single-Shot Detection for Free. Available online: https://arxiv.org/abs/1703.06870 (accessed on 13 September 2020).
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-Time Instance Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
- Khaire, P.; Kumar, P.; Imran, J. Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recognit. Lett. 2018, 115, 107–116.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGB-D images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012.
- Kosaka, N.; Ohashi, G. Vision-Based Nighttime Vehicle Detection Using CenSurE and SVM. IEEE Trans. Intell. Transp. Syst. 2015, 16, 1–10.
- Cheon, M.; Lee, W.; Yoon, C.; Park, M. Vision-Based Vehicle Detection System With Consideration of the Detecting Location. IEEE Trans. Intell. Transp. Syst. 2012, 13, 1243–1252.
- Chavez-Garcia, R.O.; Aycard, O. Multiple Sensor Fusion and Classification for Moving Object Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2015, 17, 1–10.
- Gupta, S.; Girshick, R.; Arbeláez, P. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. Lect. Notes Comput. Sci. 2014, 8695, 345–360.
- Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal Deep Learning for Robust RGB-D Object Recognition. In Proceedings of the International Conference on Intelligent Robots and Systems, Daejeon, Korea, 28 September–2 October 2015.
- He, W.; Li, Z.; Kai, Z.; Shi, Y.S.; Zhao, C.; Chen, X. Accurate and automatic extrinsic calibration method for blade measurement system integrated by different optical sensors. In Proceedings of the Optical Metrology & Inspection for Industrial Applications III, Beijing, China, 9–11 October 2014.
- Premebida, C.; Nunes, U. Fusing LIDAR, camera and semantic information: A context-based approach for pedestrian detection. Int. J. Rob. Res. 2013, 32, 371–384.
- Zhang, F.; Clarke, D.; Knoll, A. Vehicle detection based on LIDAR and camera fusion. In Proceedings of the IEEE International Conference on Intelligent Transportation Systems, Qingdao, China, 8–11 October 2014.
- Niessner, R.; Schilling, H.; Jutzi, B. Investigations on the potential of convolutional neural networks for vehicle classification based on RGB and LIDAR data. Remote Sens. Space Inform. Sci. 2017, 4, 115–123.
- Schlosser, J.; Chow, C.K.; Kira, Z. Fusing LIDAR and images for pedestrian detection using convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016.
- Xiao, L.; Wang, R.; Dai, B.; Fang, Y.Q.; Liu, D.X.; Wu, T. Hybrid conditional random field based camera-LIDAR fusion for road detection. Inf. Sci. 2018, 432, 543–558.
- Almagambetov, A.; Velipasalar, S.; Casares, M. Robust and Computationally Lightweight Autonomous Tracking of Vehicle Taillights and Signal Detection by Embedded Smart Cameras. IEEE Trans. Ind. Electron. 2015, 62, 3732–3741.
- Gneeniss, A.S.; Mills, J.P.; Miller, P.E. In-flight photogrammetric camera calibration and validation via complementary LIDAR. J. Photogramm. Remote Sens. 2015, 100, 3–13.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res. 2013, 32, 1231–1237.
- Premebida, C.; Carreira, J.; Batista, J.; Nunes, U. Pedestrian detection combining RGB and dense LIDAR data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Feng, D.; Haase-Schuetz, C.; Rosenbaum, L.; Hertlein, H. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. Available online: https://arxiv.org/abs/1902.07830?context=cs (accessed on 8 September 2020).
- Sawaragi, T.; Kudoh, T. Self-reflective segmentation of human bodily motions using recurrent neural networks. IEEE Trans. Ind. Electron. 2013, 50, 903–911.
- Chen, X.; Ma, H.; Wan, J.; Li, B. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Lin, T.Y.; Dollár, P.; Girshick, R. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Bodla, N.; Singh, B.; Chellappa, R. Soft-NMS—Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Liu, W.; Anguelov, D.; Erhan, D. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–14 September 2016.
| Hardware | Property |
|---|---|
| Laser LIDAR: HESAI Pandar 40 | Lines: 40; Range: 200 m; Angular resolution: 0.1°; Update rate: 20 Hz; Accuracy: ±2 cm |
| Radar: Delphi ESR | Range: 100 m; Viewing field: ±10°; Update rate: 20 Hz; Accuracy: ±5 cm, ±0.12 m/s, ±0.5° |
| Navigation system: NovAtel SPAN-CPT | Accuracy: ±1 cm, ±0.02 m/s, ±0.05° (Pitch/Roll), 0.1° (Azimuth); Update rate: 10 Hz |
| Camera: SY8031 | Resolution: 3264 × 2448; FPS: 15; Viewing field: 65° (vertical), 50° (horizontal) |
| Image processing: NVIDIA JTX 2 | CPU: ARM Cortex-A57 (quad-core, 2 GHz); GPU: Pascal (256-core, 1300 MHz); RAM: LPDDR4 (8 GB, 1866 MHz, 58.3 GB/s) |
| Controller: ARK-3520P | CPU: Intel Core i5-6440EQ (quad-core, 2.8 GHz); RAM: LPDDR4 (32 GB, 2133 MHz, 100 GB/s) |
| Conditions | Sunny Day-Time | Rainy | Smoggy | Night-Time |
|---|---|---|---|---|
| Data pairs | 4369 | 2315 | 3907 | 4061 |
| Modal | Fusion Stage | AP | AP50 | AP75 |
|---|---|---|---|---|
| Single | None | 25.46 | 41.19 | 23.95 |
| Dual | Early-stage fusion strategy (ESFS) | 27.32 | 44.74 | 26.61 |
| Dual | Late-stage fusion strategy (LSFS) | 32.80 | 50.64 | 30.92 |
| Dual | Middle-stage fusion strategy I (MSFS I) | 33.85 | 51.89 | 31.95 |
| Dual | MSFS II without weight assignment (MSFS II without WA) | 36.59 | 57.62 | 37.44 |
| Dual | MSFS II with weight assignment (MSFS II with WA) | 38.42 | 59.38 | 39.91 |
| Networks | Backbone | F1 | FPS | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| Mask R-CNN [24] | ResNet-101-FPN | 81.5 | 13.5 | 35.7 | 58.0 | 37.8 |
| RetinaMask [28] | ResNet-101-FPN | 79.8 | 11.2 | 34.7 | 55.4 | 36.9 |
| YOLACT | ResNet-101-FPN | 80.6 | 30.0 | 29.8 | 48.5 | 31.2 |
| SM-ISDNN (RGB) | ResNet-101-FPN | 78.7 | 36.5 | 25.5 | 41.2 | 23.9 |
| DM-ISDNN using ESFS | ResNet-101-FPN | 80.8 | 35.3 | 27.3 | 44.7 | 26.6 |
| DM-ISDNN using LSFS | ResNet-101-FPN | 81.6 | 31.6 | 32.8 | 50.6 | 30.9 |
| DM-ISDNN using MSFS I | ResNet-101-FPN | 83.8 | 33.1 | 33.8 | 51.8 | 32.9 |
| DM-ISDNN using MSFS II without WA | ResNet-101-FPN | 87.3 | 32.8 | 36.5 | 57.6 | 37.4 |
| DM-ISDNN using MSFS II with WA | ResNet-18-FPN | 80.1 | 37.3 | 26.7 | 43.8 | 26.0 |
| DM-ISDNN using MSFS II with WA | ResNet-50-FPN | 84.0 | 35.5 | 31.2 | 50.6 | 32.8 |
| DM-ISDNN using MSFS II with WA | ResNet-101-FPN | 89.5 | 27.0 | 38.4 | 59.4 | 39.9 |
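The AP, AP50, and AP75 columns in the tables above conventionally denote average precision over mask IoU thresholds, and at fixed IoU thresholds of 0.5 and 0.75, respectively. The sketch below is a simplified single-image, single-class illustration of that metric; the helper names (`mask_iou`, `average_precision`) and the greedy matching are our assumptions, not the evaluation code used in the paper.

```python
# Simplified, single-class sketch of COCO-style mask metrics; illustrative
# helper names, not the paper's evaluation code.
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two boolean instance masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0


def average_precision(pred_masks, scores, gt_masks, iou_thr=0.5):
    """AP at one IoU threshold (AP50 uses 0.5, AP75 uses 0.75);
    COCO-style AP additionally averages over thresholds 0.50:0.05:0.95."""
    order = np.argsort(scores)[::-1]              # highest confidence first
    matched = np.zeros(len(gt_masks), dtype=bool)
    tp = np.zeros(len(order))
    fp = np.zeros(len(order))
    for i, det in enumerate(order):
        ious = [mask_iou(pred_masks[det], g) for g in gt_masks]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr and not matched[best]:
            tp[i], matched[best] = 1.0, True     # first match above threshold
        else:
            fp[i] = 1.0                          # duplicate or low-IoU detection
    recall = np.cumsum(tp) / max(len(gt_masks), 1)
    precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
    # 101-point interpolated area under the precision-recall curve.
    return float(np.mean([
        precision[recall >= r].max() if (recall >= r).any() else 0.0
        for r in np.linspace(0.0, 1.0, 101)
    ]))
```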