Adopting the YOLOv4 Architecture for Low-Latency Multispectral Pedestrian Detection in Autonomous Driving
Abstract
1. Introduction
- We investigated how the detection frame rate (in frames per second, fps) influences recall in a realistic scenario in which the goal is to detect a person and brake before a collision occurs. This analysis led us to conclude that low detection latency is a key factor in pedestrian detection: detection algorithms must be fast enough to spot pedestrians and initiate safe braking in time.
- In the context of this realistic scenario, we investigated five different fusion schemes for multispectral images. They are inspired by the state of the art but are our original contributions to the YOLOv4 architecture. These fusion schemes range from very simple early fusion at the level of image data to elaborate middle and late fusion schemes.
- As a result of these investigations, we developed a new YOLOv4-based architecture that allows for middle fusion and scored best on average in the experiments while processing multispectral images at 35 fps.
- Being aware of the limited computing resources of autonomous cars, we prepared a lightweight model. This detector exceeds 400 fps on the desktop Nvidia RTX 3080 GPU, provides the lowest latency when detecting vulnerable road users from a moving vehicle, and can be deployed on edge computing devices.
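
To illustrate why frame rate matters in the braking scenario from the first contribution, the following minimal sketch computes how far a vehicle travels between two consecutive detector outputs. The function name and the 50 km/h vehicle speed are illustrative assumptions and are not taken from the experiments; the 35 fps and 400 fps frame rates are the ones quoted in the list above.

```python
def distance_per_frame_m(speed_kmh: float, fps: float) -> float:
    """Distance in metres covered between two consecutive detector outputs."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return speed_ms / fps


# Assumed vehicle speed of 50 km/h (hypothetical, for illustration only).
ASSUMED_SPEED_KMH = 50.0

for fps in (35.0, 400.0):
    travelled = distance_per_frame_m(ASSUMED_SPEED_KMH, fps)
    print(f"{fps:5.0f} fps -> {travelled:.3f} m travelled per frame")
```

At the assumed speed, the vehicle moves roughly 0.40 m per frame at 35 fps and about 0.035 m per frame at 400 fps, so a faster detector gets more chances to observe a pedestrian over the same stretch of road.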
2. Related Work
2.1. System Architectures for Pedestrian Detection
2.2. Multispectral Fusion in Pedestrian Detection
3. Pedestrian Detection with YOLOv4
3.1. YOLOv4 with RGB Images
3.2. YOLOv4 with Thermal Images
4. Sensory Fusion with the YOLO Architecture
4.1. Early Fusion Approaches
4.1.1. YOLO4-HST and YOLO4-GST Fusions
4.1.2. YOLO4-RGB-T Fusion
4.2. Late Fusion Approach (YOLO4-Late)
4.3. Middle Fusion Approach (YOLO4-Middle)
5. Experiments
5.1. KAIST Dataset
5.2. Precision Performance Comparison
5.3. Performance as a Function of Object Size
5.4. Inference Time
5.5. The Lightweight Approach: YOLO4-Tiny-Middle
5.6. Comparison to the State-of-the-Art
6. Real-World Application Viability
- We assumed that each detected person object represents only one person of consistent height and width;
- We assumed that each object detection was independent;
- We assumed that the detection rate was not limited by the camera’s frame rate;
- We focused on recall and did not consider false positives.
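
Under the assumption that each detection is independent, the probability that a pedestrian is detected at least once in n consecutive frames follows from the per-frame recall r as 1 - (1 - r)^n. The sketch below illustrates this relation for a constant per-frame recall. It is a simplification: the function name and the example values are assumptions, and it is not necessarily the accumulated recall measure used in Section 6.4.

```python
def at_least_one_detection(recall_per_frame: float, n_frames: int) -> float:
    """Probability of one or more detections in n independent frames."""
    if not 0.0 <= recall_per_frame <= 1.0:
        raise ValueError("recall_per_frame must lie in [0, 1]")
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    # Probability that every frame misses is (1 - r)^n; take the complement.
    return 1.0 - (1.0 - recall_per_frame) ** n_frames


# Hypothetical per-frame recall of 0.5, evaluated for a growing number of frames.
for n in (1, 2, 5, 10):
    print(n, round(at_least_one_detection(0.5, n), 4))
```

With a fixed per-frame recall, every additional frame processed before the braking point raises the chance of at least one detection, which is why a higher frame rate can compensate for a moderate single-frame recall.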
6.1. Distance to the Obstacle Based on Bounding Box from the Detection
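
As a rough illustration of how a bounding box can be turned into a distance, the sketch below applies the standard pinhole camera relation: distance = focal length in pixels times real height divided by bounding-box height in pixels. It relies on the assumption stated above that every detected person has a consistent height. The function name, the 1000 px focal length, and the 1.7 m person height are hypothetical values chosen for the example, and the relation is a generic one rather than a formula confirmed by this section.

```python
def distance_from_bbox_m(focal_length_px: float,
                         person_height_m: float,
                         bbox_height_px: float) -> float:
    """Pinhole-model distance estimate from the pixel height of a bounding box."""
    if bbox_height_px <= 0:
        raise ValueError("bbox_height_px must be positive")
    return focal_length_px * person_height_m / bbox_height_px


# Hypothetical camera and person parameters (assumptions for illustration only).
ASSUMED_FOCAL_LENGTH_PX = 1000.0
ASSUMED_PERSON_HEIGHT_M = 1.7

for bbox_height in (200.0, 100.0, 50.0):
    d = distance_from_bbox_m(ASSUMED_FOCAL_LENGTH_PX,
                             ASSUMED_PERSON_HEIGHT_M,
                             bbox_height)
    print(f"bbox height {bbox_height:5.0f} px -> about {d:.1f} m")
```

Halving the bounding-box height doubles the estimated distance, so small objects in the image correspond to pedestrians far from the vehicle.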
6.2. Recall as a Function of Distance
6.3. Emergency Braking Procedure
6.4. Accumulated Recall Measure for Real-World Viability
7. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Time of day | Training Img. | Training Obj. | Validation Img. | Validation Obj. | Test Img. | Test Obj. | Total Img. | Total Obj. |
---|---|---|---|---|---|---|---|---|
Day | 41 k | 54 k | 13 k | 12 k | 8 k | 4 k | 62 k | 70 k |
Night | 20.5 k | 36 k | 9 k | 5 k | 3.5 k | 3.5 k | 33 k | 44.5 k |
Total | 61.5 k | 90 k | 22 k | 17 k | 11.5 k | 7.5 k | 95 k | 114.5 k |

Model | Day | Night | Day + Night |
---|---|---|---|
YOLO4-RGB | 0.684 | 0.298 | 0.465 |
YOLO4-T | 0.641 | 0.617 | 0.625 |
YOLO4-HST | 0.672 | 0.620 | 0.639 |
YOLO4-GST | 0.673 | 0.603 | 0.627 |
YOLO4-RGB-T | 0.648 | 0.609 | 0.618 |
YOLO4-Middle | 0.751 | 0.626 | 0.686 |
YOLO4-Late | 0.666 | 0.636 | 0.645 |

Model | Original [fps] | Optimized [fps] | Original Single Inference Time [ms] | Optimized Single Inference Time [ms] |
---|---|---|---|---|
YOLO4-RGB | 28.1 | 41.0 | 35.6 | 24.4 |
YOLO4-T | 28.1 | 41.0 | 35.6 | 24.4 |
YOLO4-HST | 27.9 | 39.9 | 35.8 | 25.1 |
YOLO4-GST | 27.8 | 39.8 | 36.0 | 25.1 |
YOLO4-RGB-T | 27.6 | 39.6 | 36.2 | 25.3 |
YOLO4-Middle | 21.7 | 35.2 | 46.1 | 28.4 |
YOLO4-Late | 27.1 | 38.2 | 36.9 | 26.2 |

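
The frame rates and single inference times in the table above are reciprocals of each other: the time in milliseconds equals 1000 divided by the frame rate. The short sketch below recomputes the time columns from the fps columns as a consistency check. The dictionary and function names are illustrative; the numbers are copied from the table.

```python
# Frame rates copied from the table above: (original fps, optimized fps).
FPS = {
    "YOLO4-RGB": (28.1, 41.0),
    "YOLO4-T": (28.1, 41.0),
    "YOLO4-HST": (27.9, 39.9),
    "YOLO4-GST": (27.8, 39.8),
    "YOLO4-RGB-T": (27.6, 39.6),
    "YOLO4-Middle": (21.7, 35.2),
    "YOLO4-Late": (27.1, 38.2),
}


def inference_time_ms(fps: float) -> float:
    """Single-image inference time in milliseconds for a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for model, (original_fps, optimized_fps) in FPS.items():
    print(f"{model:13s} {inference_time_ms(original_fps):5.1f} ms -> "
          f"{inference_time_ms(optimized_fps):5.1f} ms")
```

Rounded to one decimal place, the recomputed values match the reported inference times, for example 46.1 ms and 28.4 ms for YOLO4-Middle.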
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Roszyk, K.; Nowicki, M.R.; Skrzypczyński, P. Adopting the YOLOv4 Architecture for Low-Latency Multispectral Pedestrian Detection in Autonomous Driving. Sensors 2022, 22, 1082. https://doi.org/10.3390/s22031082