1. Introduction
Traffic sign detection and tracking is a critical task for self-driving vehicles in real-world traffic scenarios, providing real-time decision support for the autopilot system.
Traffic sign detection methods can be broadly divided into two categories [1,2,3,4,5,6]: traditional methods based on hand-crafted features [1,2,3,4,5] and deep learning algorithms based on CNNs (Convolutional Neural Networks) [6]. Traditional methods mainly detect traffic signs from their appearance characteristics. In [2,3], RGB (Red, Green, Blue) and HSI (Hue, Saturation, Intensity) color models are used for detection, exploiting the distinct color information (red, yellow, blue) of different traffic signs. In [4], HOG (Histogram of Oriented Gradients) features are employed to describe shape for detection. However, because traffic signs are small targets, such methods are easily affected by adverse external factors such as lighting, weather, and occlusion [5]; hand-designed features based on color, texture, or context generalize poorly. In contrast, deep learning-based object detection algorithms are more accurate and adapt better to complex environments [6].
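To make the color-model idea above concrete, the following minimal sketch thresholds red-dominant pixels, a crude stand-in for the RGB/HSI rules used by traditional detectors (the function name and thresholds are illustrative assumptions, not taken from the cited works):

```python
import numpy as np

def red_sign_mask(rgb, r_min=120, dominance=1.5):
    """Boolean mask of red-dominant pixels: a pixel counts as 'red'
    when its R channel is bright and clearly dominates G and B.
    Thresholds are illustrative, not from the cited methods."""
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r >= r_min) & (r > dominance * g) & (r > dominance * b)

# Tiny synthetic image: one saturated-red pixel, one gray pixel.
img = np.array([[[200, 30, 30], [128, 128, 128]]], dtype=np.uint8)
mask = red_sign_mask(img)
# mask → [[True, False]]
```

Such fixed thresholds illustrate why these detectors degrade under changing lighting: the same physical sign can fall outside the hand-tuned color range.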
CNN-based object detection methods can be further divided into two types [7,8,9,10,11,12,13,14]: two-stage schemes and one-stage schemes. Two-stage schemes first generate candidate regions with an RPN (Region Proposal Network), then classify and regress these candidates [7]; representatives include R-CNN (Region-Convolutional Neural Network) [8], Fast R-CNN [9], and Mask R-CNN [10]. Although two-stage schemes can achieve high detection accuracy, their complicated computation makes them demanding: they are slower at detection, whereas the end-to-end one-stage schemes are faster. SSD (Single Shot Detector) [11], YOLO [12], YOLO9000 [13], and YOLOv3 [14] are typical representatives of one-stage schemes.
With the progress of deep learning, new deep learning-based object detection algorithms have been progressively released. Yang et al. [15] proposed a new traffic sign detection method: a two-stage strategy that extracts region proposals and introduces an AN (Attention Network), combining Faster R-CNN with traditional computer vision algorithms that use color characteristics to find regions of interest; their experiments showed an mAP of 80.31%. Lu et al. [16] improved the detection performance of Faster R-CNN by introducing a visual attention model that integrates a series of regions for locating and classifying small objects, reaching an mAP of 87.0% at 3.85 FPS.
The detection algorithms mentioned above have achieved excellent results on the PASCAL VOC (Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes) [17] and COCO (Common Objects in Context) [18] datasets. In [19], good performance was achieved on the most commonly used traffic sign dataset, GTSDB (German Traffic Sign Detection Benchmark) [20]. An improved YOLOv2 achieved 99.61% and 96.69% precision on CCTSDB (CSUST Chinese Traffic Sign Detection Benchmark) and GTSDB, respectively [21]. However, grouping all traffic signs into only three coarse categories (prohibitory, command, and notification signs) is too rough a categorization given the great disparities among the signs within each category, and it falls far short of the actual requirements of self-driving tasks. The TT100K (Tsinghua-Tencent 100K) [22] benchmark dataset subdivides the three types of traffic signs into 128 categories, covering a variety of lighting and climatic conditions; it is closer to real-world traffic scenarios and also contains more complex backgrounds and smaller targets.
Although [23,24] achieved better detection accuracy on TT100K, their real-time performance is poor. In [25], the real-time problem is well addressed with MF-SSD, a model with few parameters and low computational cost, but its performance on small object detection is poor. Li et al. [26] proposed a Perceptual GAN (Generative Adversarial Network) model to improve the detection of small traffic signs in TT100K by minimizing the representation differences between large and small objects. Zhang et al. [19] proposed a multiscale cascaded R-CNN with multiscale feature pyramids that fuse high-level semantic information and low-level spatial information; moreover, the features extracted from the ROI (Region of Interest) at each scale are refined and then fused to improve detection accuracy, achieving 98.7% precision and 90.5% recall. Nevertheless, they only roughly divided the small traffic signs into three types, which is clearly insufficient in practice.
In fact, road object detection has reached a bottleneck for further improvement because of the small scale of targets [27,28]. Infrared (IR) small object detection has recently been established [29], as have remote sensing radar [30] and LiDAR [31]. Ref. [32] uses infrastructure-based LiDAR sensors for the detection and tracking of both pedestrians and vehicles at intersections and obtains good results on small objects.
However, most of the above-mentioned detection methods are expensive, which strongly limits their deployment in practical applications [33], and their day-to-day use is currently impeded by several unsolved problems: (1) Low detection accuracy of small traffic signs in large images. Compared with medium and large objects, small traffic signs lack the appearance information needed to distinguish them from the background or from similar categories [34]. (2) Prior achievements in detecting small objects cannot be verified, since the vast majority of research efforts focus on large object detection [35]. Besides, the extremely uneven distribution of different traffic signs generally leads to low recognition rates for very-low-frequency samples. Ensuring both high accuracy and robustness in traffic sign detection is difficult but necessary, especially in videos [36]. (3) Efficient simultaneous detection and tracking in videos. Owing to lighting and weather interference in real traffic environments and to motion blur and bumps of the onboard camera during video detection [37,38], the bounding box is prone to flickering and losing targets [39], resulting in missed detections and false detections [40], which may threaten the safety of self-driving vehicles.
Hence, this paper proposes an improved YOLOv3 algorithm to mitigate the problems associated with small traffic signs and to increase overall YOLOv3 performance. Furthermore, motivated by MOT (Multi-Object Tracking) [41], which has recently been widely used to mark and track vehicles and pedestrians in videos from traffic surveillance systems and noisy crowd scenes [42,43], Deep-Sort (Simple Online and Real-time Tracking) [44] is adopted to overcome a series of adverse factors that camera motion brings to real-time video detection. Compared with several recent detection methods, the proposed method has higher accuracy and better real-time performance and meets the requirements of small-target traffic sign detection. The main contributions of this paper are summarized as follows:
(1) To address the low detection accuracy resulting from the exceptionally unbalanced sample distribution of different traffic signs in TT100K, several image enhancement methods, such as adding noise, snow, frost, fog, blur, and contrast changes, are applied to those categories of traffic signs that rarely appear. The enhanced images are added to the original sample database to complete data augmentation, increasing the proportion of low-frequency traffic signs in the dataset, improving the balance of the sample distribution, and thereby improving overall detection accuracy.
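Two of the listed augmentations (additive noise and contrast change) can be sketched as follows; this is a minimal illustrative implementation in NumPy, assuming standard pixel-space formulations rather than the paper's exact augmentation pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=15.0):
    """Additive Gaussian noise, one augmentation applied to rare
    sign categories (sigma is an illustrative choice)."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_contrast(img, factor=0.6):
    """Scale pixel deviations from the global mean: factor < 1
    lowers contrast, factor > 1 raises it."""
    mean = img.astype(np.float32).mean()
    out = mean + factor * (img.astype(np.float32) - mean)
    return np.clip(out, 0, 255).astype(np.uint8)

# Stand-in for a rare-category sign crop.
sample = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = [add_gaussian_noise(sample), adjust_contrast(sample)]
```

Each augmented copy keeps the original label, so rare categories gain extra training samples without new annotation effort.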
(2) A new YOLOv3 architecture is proposed to better detect small targets. Specifically, the output feature map corresponding to 32-times subsampling is removed, and an output feature map at 4-times subsampling is added and concatenated into the original network, which better fits small-target features while not reducing the detection performance on medium and large targets.
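The effect of this stride change on prediction-grid resolution can be seen with simple arithmetic (the 608×608 input size is an illustrative YOLOv3 default, not specified above):

```python
# Stock YOLOv3 heads predict at strides {8, 16, 32}; the modified
# network drops the stride-32 head and adds a stride-4 head,
# yielding finer grids better matched to small signs.
input_size = 608  # common YOLOv3 input resolution (assumption)

default_strides = (8, 16, 32)
modified_strides = (4, 8, 16)  # per the modification described above

default_grids = [input_size // s for s in default_strides]    # [76, 38, 19]
modified_grids = [input_size // s for s in modified_strides]  # [152, 76, 38]
```

A stride-4 head assigns a 152×152 grid to the image, so a sign spanning only a few pixels still covers multiple grid cells instead of vanishing into one coarse cell.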
(3) To strengthen real-time object detection and tracking, the false detections and missed detections caused by the external environment must be reduced. Deep-Sort is applied on top of object detection, using the Mahalanobis distance, the smallest cosine distance, and other metrics to associate targets across video frames. While stabilizing the bounding boxes in real videos, it effectively decreases the false detection and omission rates of video detection and enhances the anti-interference capability of the detection algorithm.
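The two association cues named above can be sketched in a few lines; this is an illustrative NumPy formulation of the standard distance definitions, not Deep-Sort's actual implementation:

```python
import numpy as np

def mahalanobis_sq(measurement, mean, cov):
    """Squared Mahalanobis distance between a detection and a
    Kalman-predicted track state: the motion cue in Deep-Sort."""
    d = measurement - mean
    return float(d @ np.linalg.inv(cov) @ d)

def min_cosine_distance(track_feats, det_feat):
    """Smallest cosine distance between a detection's appearance
    embedding and a track's gallery of past embeddings."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feat / np.linalg.norm(det_feat)
    return float(1.0 - np.max(t @ d))

# Toy 2-D example with an identity covariance.
mean, cov = np.array([0.0, 0.0]), np.eye(2)
m = mahalanobis_sq(np.array([3.0, 4.0]), mean, cov)          # 25.0
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
c = min_cosine_distance(gallery, np.array([1.0, 0.0]))       # 0.0
```

In Deep-Sort the two cues are combined: the motion distance gates implausible matches, while the appearance distance re-identifies a sign after brief occlusion, which is what suppresses bounding-box flicker in video.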
The remainder of this paper is organized as follows:
Section 2 describes the data enhancement method, detection framework, loss function, and multi-object tracking method.
Section 3 presents the experimental results, and
Section 4 concludes the paper.