
16 December 2020

Vehicle Pedestrian Detection Method Based on Spatial Pyramid Pooling and Attention Mechanism

1 School of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 Jiangsu Key Laboratory of Wireless Sensor Network High Technology Research, Nanjing 210003, China
* Author to whom correspondence should be addressed.
Current address: Nanjing University of Posts and Telecommunications, 9 Wenyuan Road, Xianlin Street, Qixia District, Nanjing 210023, China.

Abstract

Object detection for vehicles and pedestrians is extremely difficult to achieve in autonomous-driving applications of the Internet of Vehicles: the task requires locating and identifying even small targets in complex environments. This paper proposes a single-stage object detection network (YOLOv3-promote) for detecting vehicles and pedestrians in complex urban environments, which improves on the traditional You Only Look Once version 3 (YOLOv3). First, spatial pyramid pooling fuses the local and global features of an image to enrich the expressive ability of the feature map and to detect targets with large size differences more effectively; second, an attention mechanism weights each channel of the feature map, enhancing key features and removing redundant ones, which strengthens the network's ability to discriminate between target objects and background; finally, anchor boxes derived from the K-means clustering algorithm are fitted to the final prediction boxes to complete the localization and identification of target vehicles and pedestrians. The experimental results show that the proposed method achieved 91.4 mAP (mean average precision), an F1 score of 83.2, and 43.7 frames per second (FPS) on the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) dataset, and its detection performance was superior to that of the conventional YOLOv3 algorithm in both accuracy and speed.

1. Introduction

Currently, the development of the Internet of Vehicles in China is gaining increasing attention. The Internet of Vehicles integrates the Internet of Things, intelligent transportation, and cloud computing. Its most well-known and vigorously developed application is autonomous driving, built around driver assistance systems. Such a system uses cameras, lasers, and radars to collect information outside the car in real time and makes judgments to alert the driver to abnormal conditions in the surroundings. This allows the driver to identify hidden dangers promptly, thereby improving driving safety. The rapid detection of targets such as vehicles and pedestrians is an important task for driver assistance systems. In recent years, object detection methods based on deep learning have stood out among detection algorithms, attracting the attention of practitioners and researchers in industry and academia. Driver assistance systems not only require extremely high detection accuracy but also cannot afford to miss small targets that are difficult to detect in complex scenes.
Deep learning vision tasks are roughly divided into image classification [1,2,3,4], object detection, semantic segmentation [5], and instance segmentation [6]. Object detection builds on the various backbone networks of image classification to recognize objects in images or videos. Accordingly, object detection is the basis of subsequent semantic segmentation and instance segmentation, locating the target objects for both tasks. The quality of the object detection algorithm is therefore particularly important.

3. Results

In this paper, we evaluate the proposed YOLOv3-promote method on the open KITTI dataset. The experiments were based on the PyTorch deep learning framework. The hardware configuration was as follows: the processor was an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz, the memory size was 16.0 GB, and the graphics card was a single RTX 2080 Ti with 11 GB of video memory. The software environment was Windows 10 with CUDA 10.2 and cuDNN 7.6.5, and the programming language was Python 3.7.

3.1. Dataset Description

The KITTI dataset [30,31] was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It is the world's largest computer vision benchmark dataset for autonomous driving scenes. KITTI contains a variety of real-scene image data, covering urban, rural, and highway areas, and each image contains vehicles and pedestrians under various shadows, illuminations, occlusions, and truncations, which provides an effective test of an algorithm's robustness. The labels of the original KITTI dataset are divided into eight categories: Car, Van, Truck, Pedestrian, Pedestrian (sitting), Cyclist, Tram, and Misc. However, since the primary goal of autonomous driving in Internet of Vehicles applications is to detect vehicles and pedestrians, this paper merges the original eight label categories into three: Van, Truck, and Tram are merged into Car; Pedestrian and Pedestrian (sitting) are merged into Person; and the Misc category is removed. The final three categories are Car, Person, and Cyclist. This paper selected 7481 images from the dataset as the experimental data and allocated one-tenth of them as the validation set.
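As an illustration, this relabeling can be expressed as a simple class mapping. The snippet below is a minimal sketch only; the class-name spellings follow the raw KITTI label files (e.g., "Person_sitting"), and the authors' actual preprocessing script is not shown in the paper, so the exact parsing logic is an assumption.

```python
# Hypothetical sketch of the KITTI 8-class -> 3-class remapping described above.
KITTI_TO_THREE = {
    "Car": "Car", "Van": "Car", "Truck": "Car", "Tram": "Car",
    "Pedestrian": "Person", "Person_sitting": "Person",
    "Cyclist": "Cyclist",
    # "Misc" entries are dropped entirely.
}

def remap(kitti_label: str):
    """Return the merged class name, or None if the object should be removed."""
    return KITTI_TO_THREE.get(kitti_label)
```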

3.2. Execution Details

The system presented in this article was trained and tested on images of the same size, and we compared YOLOv3 as a baseline with the proposed YOLOv3-promote. The input image was resized to 608 × 608 pixels. Through the Darknet-53 backbone, the SPP module, and the attention module, the information of target vehicles and pedestrians in the image was extracted, and three feature maps with different scales were used to predict the target location and class. For anchor box selection, this paper used the K-means algorithm on the labeled images in the KITTI dataset to generate a total of nine anchor boxes: (7,66), (9,23), (13,34), (19,54), (22,161), (24,36), (35,65), (57,107), and (96,196). Figure 8 shows the distribution of the nine anchors over all ground-truth boxes.
Figure 8. Anchor distribution.
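For readers who want to reproduce the anchor generation, the following is a minimal sketch of K-means clustering over ground-truth (width, height) pairs. The 1 − IoU distance is an assumption based on standard YOLO anchor-clustering practice; the exact distance metric used by the authors is described in their method section rather than here.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming boxes and anchors share the same top-left corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor boxes using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor = highest IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```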
In the whole training process, the backbone network was initialized with the pre-trained Darknet53.conv.74 model parameters. YOLOv3-promote was trained for a total of 2000 epochs. The batch size was set to 64 with 16 subdivisions. The momentum parameter and the weight-decay regularization term were set to 0.9 and 0.0005, respectively, and the initial learning rate was set to 0.001. At 7000 and 10,000 iterations, the learning rate was reduced to one-tenth of its previous value. In addition, this paper used data augmentation to generate more training samples: by setting the saturation to 1.5, the exposure to 1.5, and the hue to 0.1, and by applying jitter and horizontal flipping, the robustness, accuracy, and generalization of the model to various real environments were improved.
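Since the experiments were implemented in PyTorch, the optimizer hyperparameters and learning-rate schedule above can be expressed roughly as follows. This is a minimal sketch, not the authors' training script; `model` is a placeholder standing in for the YOLOv3-promote network.

```python
import torch

model = torch.nn.Conv2d(3, 32, 3)  # placeholder module in place of the real network

# SGD with the momentum and weight-decay values stated in the text.
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,
                            momentum=0.9,
                            weight_decay=0.0005)

# Decay the learning rate to one-tenth at 7000 and 10,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[7000, 10000],
                                                 gamma=0.1)

# Training-loop skeleton: the scheduler is stepped once per iteration (batch), not per epoch.
# for batch in loader:
#     loss = compute_loss(model(batch))
#     loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```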

3.3. The Method of the Network Design

Based on YOLOv3, this paper adds a spatial pyramid pooling (SPP) module. The SPP module fuses the local and global feature information in the feature map to further enrich the expressive ability of the feature map and improve the detection of multiple targets. In addition, this paper adds an attention mechanism to YOLOv3 through a local cross-channel interaction method without dimensionality reduction, which autonomously learns the weight of each channel, thereby suppressing redundant features and enhancing features containing key information. The network structure of YOLOv3-promote based on spatial pyramid pooling and the attention mechanism is shown in Figure 9; the orange and purple parts of Figure 9 are the spatial pyramid pooling module and the attention module, respectively.
Figure 9. Overall network structure.
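The SPP module can be sketched in PyTorch as follows. The pooling kernel sizes (5, 9, 13) are an assumption based on the common YOLOv3-SPP configuration; the exact sizes used by the authors are given in their method section rather than here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling block: pools the same feature map at several scales
    and concatenates the results with the input along the channel axis, fusing
    local and global context. Kernel sizes are an assumption for illustration."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # Output channels = in_channels * (1 + len(kernel_sizes)); spatial size unchanged.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

# Example: a 19 x 19 feature map with 512 channels becomes 19 x 19 with 2048 channels.
feat = torch.randn(1, 512, 19, 19)
print(SPP()(feat).shape)  # torch.Size([1, 2048, 19, 19])
```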
The backbone of YOLOv3-promote is Darknet-53. The network adopts the residual structure proposed by ResNet: a total of 23 residual modules are used in the backbone to avoid the risk of overfitting caused by increasing the network depth. At the same time, YOLOv3-promote uses convolution with a stride of two to achieve down-sampling [32], abandoning the pooling layers used in many networks; the purpose is to further reduce the negative effect of pooling on gradients and to improve the accuracy of the network. The Convolutional layer in Figure 9 is composed of three components: Conv2d, Batch Normalization, and Leaky ReLU. In order to improve the accuracy of small object detection, YOLOv3-promote uses up-sampling and fusion (here called Concatenation), similar to feature pyramid networks (FPN) [33], to construct a feature pyramid with convolutional layers at three scales: 19 × 19, 38 × 38, and 76 × 76. In Figure 9, the size of the feature map is increased through the 93rd and 112th up-sampling layers, and the route layers at the 94th and 113th layers are obtained by concatenation with the shallow feature maps. For example, the 112th layer up-samples the 38 × 38 × 128 feature map into a 76 × 76 × 128 feature map and then concatenates it with the 76 × 76 × 256 feature map of the 36th layer to obtain a 76 × 76 × 384 route-layer feature.
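The channel attention described above (local cross-channel interaction without dimensionality reduction) resembles the ECA-Net design [32]. The snippet below is a minimal sketch of such a block, not the authors' implementation; the 1-D convolution kernel size k = 3 is chosen purely for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """ECA-style channel attention: global average pooling followed by a 1-D
    convolution across channels (local cross-channel interaction, no dimensionality
    reduction) and a sigmoid that re-weights each channel. k = 3 is an assumption."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                 # global average pooling -> (b, c)
        y = self.conv(y.unsqueeze(1))          # 1-D conv over the channel axis
        w = self.sigmoid(y).view(b, c, 1, 1)   # one weight per channel
        return x * w                           # enhance key channels, suppress redundant ones

feat = torch.randn(1, 256, 38, 38)
print(ChannelAttention()(feat).shape)  # torch.Size([1, 256, 38, 38])
```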
The 90th, 109th, and 128th layers in Figure 9 are the YOLO layers, i.e., the detection layers. The sizes of the three detection layers are 19 × 19 × 24, 38 × 38 × 24, and 76 × 76 × 24. Since a smaller feature map corresponds to a larger receptive field, the 19 × 19 × 24 detection layer is used to detect large targets, the 38 × 38 × 24 detection layer is used to detect medium-sized targets, and the 76 × 76 × 24 detection layer tends to detect small targets. Because each grid cell is assigned three anchor boxes, the prediction vector length of each cell is 3 × (3 + 4 + 1) = 24, where 3 corresponds to the three classes Car, Cyclist, and Person in the modified KITTI dataset used in this article, 4 represents the coordinate information (x, y, w, h) of the detection box, and 1 represents the object score.
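The 24-channel layout of each detection layer can be illustrated as follows. This is a minimal sketch of how the prediction tensor splits per anchor, following the 3 × (4 + 1 + 3) arithmetic above, and is not the authors' decoding code.

```python
import torch

num_anchors, num_classes = 3, 3
pred = torch.randn(1, num_anchors * (4 + 1 + num_classes), 19, 19)  # the 19 x 19 x 24 head

b, _, s, _ = pred.shape
pred = pred.view(b, num_anchors, 4 + 1 + num_classes, s, s).permute(0, 1, 3, 4, 2)
box_xywh   = pred[..., 0:4]   # (x, y, w, h) offsets relative to the anchor and grid cell
objectness = pred[..., 4:5]   # object score
class_prob = pred[..., 5:8]   # Car, Person, Cyclist scores
print(box_xywh.shape, objectness.shape, class_prob.shape)
```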

3.4. Detection Result

In this paper, we use mean average precision (mAP), F1 score, frames per second (FPS), and the number of model parameters as the evaluation criteria.
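For reference, precision, recall, and F1 relate as follows; this is a sketch of the standard definitions (F1 is the harmonic mean of precision and recall), not code from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```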
Table 1 lists the comparison between the method proposed in this paper and the traditional YOLOv3. Although the proposed method increases the number of model parameters by 3.7%, its mAP for object detection is much higher than that of traditional YOLOv3, so many targets that previously could not be detected are now detected; the F1 score of this method is 83.2. Since F1 is the harmonic mean of precision and recall, and both the precision and recall of YOLOv3-promote are higher than those of the original YOLOv3, its F1 value is naturally higher than that of the traditional YOLOv3. Under the same input image size, because the parameter size of the YOLOv3-promote model is 2 MB larger than that of traditional YOLOv3, the amount of computation is slightly higher and the FPS is slightly reduced; overall, however, the FPS of the improved YOLOv3-promote is essentially the same as that of YOLOv3. Figure 10 shows the mAP curve of the proposed YOLOv3-promote method after 500 epochs.
Table 1. Performance comparison of algorithms.
Figure 10. Mean average precision (mAP).
The model with the maximum mAP of 91.4 was selected as the optimal model and compared with the optimal YOLOv3 model. The comparison is shown in Figure 11, which is grouped by daytime, night, extreme weather, multiple targets, and small targets.
Figure 11. Example comparison.
As can be seen from Figure 11, in daylight the gap between the two algorithms is smallest, but YOLOv3 still misses several small target vehicles (the missed vehicles are marked with yellow arrows in Figure 11), all of which are detected by the proposed method. At night, the difference between the two algorithms is particularly obvious: lacking the attention mechanism, YOLOv3 has more difficulty making correct identifications. In extreme weather, YOLOv3 fails to detect distant small targets because of the interference of water mist on the window. In the multi-target and small-target cases, the difference between YOLOv3 and the proposed YOLOv3-promote is reflected in distant small-target detection: because the spatial pyramid pooling used in this paper effectively combines the local and global features of the feature map, both large and small targets are detected accurately.

4. Conclusions

By adding spatial pyramid pooling and an attention mechanism, the improved YOLOv3-promote network not only fuses the local and global features of the image and improves the generalization of the model to targets in various environments, but also lets each channel of the feature map learn its own weight, which makes the network more sensitive to the target objects in the image. Whether during the day, at night, or in extreme weather, the detection of multiple targets and small targets is better than with the original YOLOv3. Although traditional YOLOv3 can detect small targets, its performance on long-distance small targets in the complex situations above is limited, and it is prone to missed, false, and repeated detections; the method proposed in this paper solves these problems well. The K-means clustering method is used to automatically generate anchors that fit the dataset, which further speeds up model convergence. Using GIoU as the new loss function pays extra attention to the case in which the predicted and ground-truth boxes do not overlap, and thus better reflects their degree of overlap. Experiments on the KITTI dataset show that YOLOv3-promote achieves real-time performance and is superior to the standard YOLOv3 detection algorithm in vehicle and pedestrian detection. For autonomous-driving applications of the Internet of Vehicles, more lightweight models are needed for real deployment to reduce hardware requirements; therefore, future research will investigate how to compress the model size while slightly improving accuracy.

Author Contributions

Conceptualization, M.G., D.X., P.L. and H.X.; methodology, M.G.; software, M.G.; validation, M.G. and D.X.; formal analysis, M.G.; investigation, M.G.; writing—original draft preparation, M.G.; writing—review and editing, M.G. and H.X.; visualization, M.G.; supervision, H.X.; project administration, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R & D Program of China (No. 2019YFB2103003), the National Natural Science Foundation of P. R. China (No. 61672296, No. 61872196, No. 61872194 and No. 61902196), Scientific and Technological Support Project of Jiangsu Province (No. BE2017166, and No. BE2019740), Major Natural Science Research Projects in Colleges and Universities of Jiangsu Province (No. 18KJA520008), Six Talent Peaks Project of Jiangsu Province (RJFW-111), Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant KYCX19_0973, and the 1311 Talent Plan of the Nanjing University of Posts and Telecommunications (NUPT).

Acknowledgments

The authors would like to thank Yi Lu and Jiajie Sun for their suggestions to improve the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  2. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  4. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  5. Sultana, F.; Sufian, A.; Dutta, P. Evolution of Image Segmentation using Deep Convolutional Neural Network: A Survey. Knowledge-Based Syst. 2020, 201–202, 106062. [Google Scholar] [CrossRef]
  6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2019, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Uijlings, J.R.R.; Van De Sande, K.E.A.; Gevers, T.; Smeulders, A.W.M. Selective Search for Object Recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  12. Zhang, F.; Yang, F.; Li, C. Fast vehicle detection method based on improved YOLOv3. Comput. Eng. Appl. 2019, 55, 12–20. [Google Scholar]
  13. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  15. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  16. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  17. Ren, J.; Chen, X.; Liu, J.; Sun, W.; Pang, J.; Yan, Q.; Tai, Y.W.; Xu, L. Accurate single stage detector using recurrent rolling convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5420–5428. [Google Scholar]
  18. Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960. [Google Scholar]
  19. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  20. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  21. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  22. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  23. Liu, D.; Wu, Y. Gaussian-yolov3 target detection with embedded attention and feature interleaving module. Comput. Appl. 2020, 40, 2225–2230. [Google Scholar]
  24. Zhao, B.; Wu, X.; Feng, J.; Peng, Q.; Yan, S. Diversified Visual Attention Networks for Fine-Grained Object Classification. IEEE Trans. Multimed. 2017, 19, 1245–1256. [Google Scholar] [CrossRef]
  25. Xiao, T.; Xu, Y.; Yang, K.; Zhang, J.; Peng, Y.; Zhang, Z. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 842–850. [Google Scholar]
  26. Stollenga, M.F.; Masci, J.; Gomez, F.; Schmidhuber, J. Deep networks with internal selective attention through feedback connections. In Proceedings of the Advances in Neural Information Processing Systems, Washington, DC, USA, 10–12 June 2014; pp. 3545–3553. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2018; pp. 7132–7141. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Lecture Notes in Computer Science, Munich, Germany, 8–11 September 2018; pp. 3–19. [Google Scholar]
  29. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 1979, 28, 100. [Google Scholar] [CrossRef]
  30. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  31. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  32. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11531–11539. [Google Scholar]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
