1. Introduction
The detection of traffic objects in road scenes is a critical part of intelligent transport systems and a key technology for achieving autonomous driving. Accurate real-time traffic object detection and recognition is essential for environment awareness in road scenes. Traffic object detection in intelligent transportation systems is usually divided into four categories: vehicle detection, pedestrian detection, traffic sign detection and other obstacle detection. With the rapid development of deep learning in recent years, object detection methods can be broadly classified into two main categories: traditional object detection methods and deep learning-based object detection methods.
The core idea of traditional object detection methods is to generate artificial feature information from the image based on the characteristics of the target itself and then use these features for object detection. Objects in traffic scenes often exhibit a large number of regular features, such as the color and model of a car, the posture and limb structure of a pedestrian, or the shape of a traffic sign. These regularities have given rise to a number of object detection algorithms based on edge feature information. Matthews et al. [1] detect distinct vertical edges in the image and combine them with under-vehicle shadow detection to determine the left, right and lower boundaries of the vehicle for vehicle detection and recognition. You et al. [2] used HOG features and CIE-LUV histograms as low-level features and proposed an extended filter channel framework based on the concept of filter channel features, improving the accuracy of pedestrian detection on multiple datasets. Stefan et al. [3] exploited self-similarity on the color channels to improve detection performance on still images and video sequences, achieving a 20% improvement in pedestrian detection when combined with HOG features. Traditional object detection methods are built on manually designed feature representations and shallow trainable architectures, and they are prone to performance bottlenecks when multiple low-level image features must be combined with contextual information from the target detector or scene classifier.
Deep learning-based object detection methods offer a large accuracy improvement over traditional methods and are now the mainstream in this field. Deep learning methods are characterized by learnable semantic and deep-level features, which can compensate for the shortcomings of traditional object detection methods. In recent years, object detection methods based on convolutional neural networks have developed rapidly and achieved significant results [4,5,6,7,8]. The release of public datasets such as ImageNet [9], COCO [10], VOC [11] and KITTI [12] has greatly promoted the development of object detection applications. CNN-based object detectors can be divided into two types: (1) one-stage detectors: YOLO9000 [13], YOLOv3 [14], YOLOv4 [15], Scaled-YOLOv4 [16], YOLOv5, YOLOX [17], FCOS [18], DETR [19], etc.; (2) two-stage detectors: Faster R-CNN [4], VFNet [8], CenterNet2 [20], etc. Two-stage detectors use one network to find possible object regions in an image and a second network to classify the objects. Two-stage detectors such as Faster R-CNN achieve high detection accuracy, but their detection speed is slow and does not meet the real-time requirements of object detection in intelligent transportation systems. The YOLO series [5,13,14,15,16,17] is a typical one-stage detector family that demonstrates excellent performance in object detection tasks. The YOLO models balance speed and precision and are therefore our first choice for object detection tasks in traffic scenes.
However, the traditional YOLO model is designed for object detection in natural scenes, and several problems arise when previous models are applied directly to images of traffic scenes, as intuitively illustrated by the cases in Figure 1. Firstly, traffic scene images are often captured by cameras set up at various intersections, and the different camera angles lead to large variations in target size, which easily cause missed and false detections. Secondly, due to hardware limitations and lighting conditions, the captured images may have low resolution and blurred objects. Thirdly, the large coverage area of the camera produces images with complex backgrounds and extremely small objects that are difficult to detect. As a result, the traditional YOLO model performs poorly on traffic scene images and cannot be applied directly to object detection tasks in traffic scenes. Yu et al. [21] used an improved YOLOv3 model to detect traffic lights in traffic scenes and achieved good results, but the model could not be generalized to object detection in various complex traffic scenes. Zhu et al. [22] proposed a new multi-sensor, multi-level enhanced convolutional network model, MME-YOLO, for object detection in complex traffic scenes, but its detection accuracy for extremely small objects was not high. Li et al. [23] proposed Attention-YOLOv4, which introduced an attention mechanism to improve the detection accuracy of small objects, but its detection ability on low-resolution images was insufficient. Mittal et al. [24] proposed a hybrid model of Faster R-CNN and YOLO, established a rich traffic scene dataset for vehicle object detection and traffic flow detection, and achieved good results.
In this paper, we propose an improved model, multi-scale YOLOv5s, based on YOLOv5s to solve the three problems presented above. An overview of the multi-scale YOLOv5s model is shown in Figure 2. We use CSPDarknet53 [25] as the backbone and FPN [26]+PAN [27] as the neck of multi-scale YOLOv5s. The original YOLOv5 model includes three detection heads, used for the detection of small, medium and large objects, respectively. In complex traffic scenes, it is easy to miss or misdetect extremely small objects. We therefore add a detection head for detecting extremely small objects, which proves effective in complex traffic scene object detection. Then, we use a content-aware reassembly of features (CARAFE) module [28] to replace the original upsampling layer. We replace the original convolution module with a new SPD-Conv CNN module [29], dedicated to low-resolution images and extremely small object detection. Finally, to find the attention regions in images with large coverage, we adopt the Normalization-based Attention Module (NAM) [30] to suppress unimportant channels or pixels and improve detection efficiency. Compared to YOLOv5s, our improved multi-scale YOLOv5s can better deal with traffic scene images.
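As a rough illustration of the reassembly idea behind CARAFE, the sketch below performs content-aware upsampling in NumPy: each output pixel is a softmax-weighted average of a k × k input neighbourhood. In the actual module the reassembly kernels are predicted by a lightweight convolution on the input features; here they are passed in directly, and the (C, H, W) layout, the function name and the edge padding are our own illustrative assumptions.

```python
import numpy as np

def carafe_reassemble(x, kernels, sigma=2, k=3):
    """Simplified CARAFE reassembly (illustrative, not the official code).

    x        : input feature map, shape (C, H, W)
    kernels  : raw per-location reassembly kernels (predicted by a small
               conv in the real module; passed in directly here),
               shape (sigma*H, sigma*W, k, k)
    Returns an upsampled map of shape (C, sigma*H, sigma*W): each output
    pixel is a content-aware weighted average of the k x k input
    neighbourhood around its source location.
    """
    C, H, W = x.shape
    r = k // 2
    # softmax-normalise each k x k kernel so its weights sum to 1
    flat = kernels.reshape(sigma * H, sigma * W, k * k)
    flat = np.exp(flat - flat.max(axis=-1, keepdims=True))
    flat = flat / flat.sum(axis=-1, keepdims=True)
    w = flat.reshape(sigma * H, sigma * W, k, k)

    xp = np.pad(x, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((C, sigma * H, sigma * W), dtype=x.dtype)
    for i in range(sigma * H):
        for j in range(sigma * W):
            si, sj = i // sigma, j // sigma           # source location
            patch = xp[:, si:si + k, sj:sj + k]        # (C, k, k) window
            out[:, i, j] = (patch * w[i, j]).sum(axis=(1, 2))
    return out
```

With all-zero raw kernels (uniform after softmax) this reduces to plain averaging; content-dependent kernels let the module adapt the upsampling to local feature content.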
Our contributions are listed as follows:
We add a fourth detection head for the detection of extremely small objects on top of the three detection heads of the original YOLOv5, which alleviates the false and missed detections of extremely small objects in complex traffic images.
A new content-aware reassembly of features (CARAFE) module is used for feature fusion, which enhances the feature fusion capability of the neck part. It is lighter than the traditional upsampling module and requires fewer parameters and less computation.
A new SPD-Conv CNN Module is used to replace the original convolution module, which improves detection accuracy for low-resolution images and extremely small objects. It uses the space-to-depth and non-strided convolution layers to replace the original pooling and strided convolution layers.
An effective attention mechanism, Normalization-based Attention Module (NAM), is added to the neck part, which improves the accuracy and robustness of the model. It applies a weight sparsity penalty to the attention modules, making them more computationally efficient while retaining similar performance.
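To make the SPD-Conv contribution above concrete, the space-to-depth step can be sketched in a few lines of NumPy. This is an illustrative version under an assumed (C, H, W) layout, not the reference implementation; in the full module a stride-1 convolution follows this rearrangement.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Space-to-depth step of SPD-Conv (illustrative sketch).

    Instead of discarding pixels with a strided convolution or pooling,
    every scale x scale block of pixels is moved into the channel
    dimension, so a (C, H, W) map becomes (C*scale**2, H//scale, W//scale)
    with no information loss.  A stride-1 convolution would then follow.
    """
    C, H, W = x.shape
    assert H % scale == 0 and W % scale == 0
    sub = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(sub, axis=0)
```

Because every input value survives the rearrangement, fine details of low-resolution images and extremely small objects are preserved for the following layers rather than averaged or skipped away.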
2. Related Work
Object detectors usually consist of two parts. One part is the backbone for feature extraction, a convolutional neural network that aggregates and forms image features at different levels of granularity. The other part is the detection head, which makes predictions on the image features, generating bounding boxes and predicting categories. To enhance the feature extraction effect, some layers are usually added between the backbone and the head; these are called the neck of the detector. We introduce these three structures in detail below.
Backbone. Commonly used backbones include VGG [31], ResNet [32], DenseNet [33], CSPDarknet53 [25], etc. These networks have proven to have strong feature extraction capabilities for tasks such as detection and classification and are widely used in the construction of various network models.
Neck. To better exploit the features extracted by the backbone, the neck is added between the backbone and the head for feature fusion. The neck is an important link in the detection network. Usually, it consists of multiple bottom-up and top-down paths. Commonly used path-aggregation blocks in the neck include FPN [26], PANet [27], BiFPN [34], ASFF [35], etc. These modules typically perform feature fusion through operations such as upsampling, downsampling, concatenation and element-wise product.
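As a minimal illustration of such a path-aggregation scheme, the following NumPy sketch implements an FPN-style top-down pathway with nearest-neighbour upsampling and element-wise addition. The 1×1 lateral convolutions of a real FPN are omitted and the function names are our own.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """2x nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def fpn_top_down(features):
    """Minimal FPN-style top-down pathway (illustrative sketch).

    `features` are backbone outputs ordered coarse -> fine, all with the
    same channel count (the 1x1 lateral convs of a real FPN are omitted).
    Each coarser merged map is upsampled and added to the next finer one,
    so semantic information flows down to high-resolution levels.
    """
    merged = [features[0]]
    for f in features[1:]:
        merged.append(f + upsample_nearest(merged[-1]))
    return merged
```

A bottom-up pathway such as PAN would then run the same idea in the opposite direction, downsampling the merged maps and fusing them back upward.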
Head. The head applies the features extracted by the backbone to target localization and classification. Heads are generally divided into two kinds: one-stage and two-stage object detectors. The YOLO series is a typical one-stage detector that predicts the bounding box and the class of the target at the same time, giving a significant speed advantage but relatively lower detection accuracy.
YOLOv5 uses the CSPDarknet53 architecture with an SPP layer as the backbone, FPN+PANet as the neck and the YOLO detection head. YOLOv5 is available in five different models: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. Experiments show that YOLOv5x achieves much better training results than YOLOv5n, YOLOv5s, YOLOv5m and YOLOv5l, but its training and inference costs are also much higher. Since our goal is a model that remains lightweight while approaching the accuracy of YOLOv5x, we build on YOLOv5s instead.
The attention mechanism in deep learning is similar to the human visual attention mechanism: both extract the information most relevant to the current target from a large amount of information. It has become a hot topic of academic research in recent years. Squeeze-and-Excitation Networks (SENet) [36] integrate spatial information into the channel-wise feature response and use two multilayer perceptron (MLP) layers to compute the corresponding attention. Coordinate Attention (CA) [37] embeds position information into channel attention, capturing long-range dependencies along one spatial direction while retaining accurate position information along the other. The Convolutional Block Attention Module (CBAM) [38] embeds the channel and spatial attention submodules sequentially. However, these efforts ignore the information carried by the weights adjusted during training. We therefore aim to highlight salient features by using variance measures of the trained model weights.
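The idea of weighting channels by their BatchNorm scale factors, as NAM does in its channel attention branch, can be sketched as follows. This simplified NumPy version takes the learned scale factors as input and assumes the feature map has already been batch-normalized; the spatial (pixel) branch and the training-time sparsity penalty are omitted.

```python
import numpy as np

def nam_channel_attention(x, gamma):
    """Simplified NAM channel attention (illustrative sketch).

    x     : feature map, shape (C, H, W), assumed already batch-normalised
    gamma : the C BatchNorm scale factors learned during training
    Channels whose scale factors have small magnitude contribute little
    variance and are suppressed; salient channels are emphasised.
    """
    w = np.abs(gamma) / np.abs(gamma).sum()       # per-channel weight
    gated = x * w[:, None, None]                  # reweight channels
    return x * (1.0 / (1.0 + np.exp(-gated)))     # sigmoid gate
```

Because the weights come directly from parameters the network has already learned, this attention adds almost no extra computation compared with SENet- or CBAM-style modules that need dedicated MLP or convolution layers.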
5. Conclusions
In order to improve the detection accuracy of traffic objects in complex road scenes, we add a detection head for extremely small objects to the original YOLOv5s model, which significantly improves the detection accuracy of extremely small traffic objects. A content-aware reassembly of features (CARAFE) module is introduced in the feature fusion part to enhance feature fusion. A new SPD-Conv CNN module replaces the original convolutional structure, improving detection accuracy on low-resolution images and for extremely small objects while keeping the model computationally efficient. Finally, the Normalization-based Attention Module (NAM) is introduced, allowing the model to focus on more useful information during training and significantly improving detection accuracy.
The experimental results show that, compared with the original YOLOv5s algorithm, the detection accuracy of the proposed multi-scale YOLOv5s model is improved by 7.1% on the constructed diverse traffic scene datasets, which is comparable to YOLOv5x, while the model remains lightweight. Compared with current mainstream object detection algorithms, the multi-scale YOLOv5s model achieves the highest detection accuracy and is superior in detecting traffic objects in complex road scenes.