A New Deep Model for Detecting Multiple Moving Targets in Real Traffic Scenarios: Machine Vision-Based Vehicles

When performing multiple target detection, it is difficult to detect small and occluded targets in complex traffic scenes. To this end, an improved YOLOv4 detection method is proposed in this work. Firstly, the network structure of the original YOLOv4 is adjusted: the 4× down-sampled feature map of the backbone network is introduced into the neck network of the YOLOv4 model and spliced with the 8× down-sampled feature map to form a four-scale detection structure, which enhances the fusion of deep and shallow semantic information and improves the detection accuracy for small targets. Then, the convolutional block attention module (CBAM) is added to the neck network to enhance the learning of spatial and channel features. Lastly, the detection rate of occluded targets is improved by using the soft non-maximum suppression (Soft-NMS) algorithm based on the distance intersection over union (DIoU), which avoids incorrectly deleting bounding boxes. Experimental evaluation on the KITTI dataset demonstrates that the proposed model effectively improves multiple-target detection accuracy: the mean average precision (mAP) of the improved YOLOv4 model reaches 81.23%, which is 3.18% higher than the original YOLOv4, and the computation speed of the proposed model reaches 47.32 FPS. Compared with existing popular detection models, the proposed model produces higher detection accuracy and computation speed.


Introduction
Multi-target detection in traffic scenes is critical for driverless cars, object tracking, and intelligent driver assistance [1]. There are many dynamic objects around vehicles in real traffic scenarios, e.g., vehicles, pedestrians, cyclists, etc. Commonly used multi-target detection algorithms often perform poorly and have a high missed detection rate when detecting traffic targets that are too small or occluded [2,3]. Therefore, fast and accurate detection of potentially dangerous targets around the vehicle is a key issue [4,5].
Traditional target detection algorithms mainly extract features through sliding windows, such as Histogram of Oriented Gradients (HOG) features and Deformable Parts Model (DPM) features [6]. For example, Rao et al. [7] performed pedestrian detection by extracting HOG features from candidate regions, and Sun et al. [8] realized nighttime vehicle detection.
Among deep learning detectors, the one-stage methods can balance detection speed and accuracy. Therefore, an improved YOLOv4 method is proposed for multi-target detection in traffic scenes, in which the detection of small and occluded targets is addressed. Firstly, to detect small targets, shallow feature map information is added to the YOLOv4 model, expanding the original three-scale detection structure to a four-scale detection structure; then, an attention mechanism is introduced in the YOLOv4 neck network to supplement the feature information for small and occluded targets; finally, in the prediction phase, the DIoU (distance intersection over union) is introduced into the Soft-NMS (soft non-maximum suppression) to improve the detection of occluded targets. Experimental evaluation demonstrates the effectiveness of the proposed method.

Materials and Methods
To improve the detection effect in a complex traffic environment, this paper improves the original YOLOv4 by (1) expanding the original 3-scale structure to a 4-scale structure, (2) adding CBAM into each scale, and (3) introducing the DIoU-based Soft-NMS to enhance the occluded target detection rate. The improved network structure is shown in Figure 1.

Figure 1. Improved YOLOv4 algorithm framework.

Four-Scale Detection
In actual driving scenes, there are a large number of small targets, such as pedestrians and distant vehicles, which occupy few pixels and have less obvious features in the image. The largest detection scale of the original YOLOv4 is obtained by down-sampling the input image by a factor of 8, and the feature extraction layers are deep, which easily causes the loss of feature information for such small targets. As shown in Figure 1, a detection layer of size 104 × 104 is added to the 3-scale detection layer of the YOLOv4 network. The neck network structure is adjusted to perform a 2× up-sampling operation on the 52 × 52 feature map in the pyramid network structure, which is concatenated with the 104 × 104 feature map generated by the backbone network to create a feature-fusion target detection layer with 4× down-sampled output; then the 104 × 104 feature map is down-sampled through a new PAN structure and concatenated with the 52 × 52 feature map to form a new 8× down-sampling feature map, completing the bottom-up feature fusion of the neck network. Compared with the other scale detection layers, the 104 × 104 detection layer obtains more refined feature information and improves the detection of small targets.
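As a rough sketch (not the authors' code), the fusion step of the new 104 × 104 branch — 2× up-sampling the 52 × 52 neck feature map and concatenating it with the backbone's 4× down-sampled map — can be illustrated in NumPy; the channel counts used here are hypothetical:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(shallow, deep):
    """Concatenate a backbone feature map with the 2x up-sampled deeper
    map along the channel axis, as in the added 104x104 detection branch."""
    return np.concatenate([shallow, upsample2x(deep)], axis=0)

# Hypothetical channel counts, for illustration only.
backbone_104 = np.zeros((64, 104, 104))   # 4x down-sampled backbone output
neck_52 = np.zeros((128, 52, 52))         # 8x down-sampled neck feature map
fused = fuse(backbone_104, neck_52)
print(fused.shape)   # (192, 104, 104)
```

With a 416 × 416 input, the four strides 4, 8, 16, and 32 yield exactly the 104 × 104, 52 × 52, 26 × 26, and 13 × 13 detection resolutions described above.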

Introduction of CBAM
The attention mechanism in deep learning refers to focusing on important information in an image and ignoring invalid information, as in human vision. The attention mechanism can weigh the weights of different feature channels, making the network focus more on the target region of interest and guiding the network to highlight features useful for the current task [20]. Therefore, in this paper, a CBAM [21] is added to each of the four detection scales to enhance the model's ability to learn feature information and improve detection accuracy. As shown in Figure 2, the CBAM consists of a channel attention module and a spatial attention module. The channel attention module focuses on the different channels of the input feature map to enhance the weight of key features, and the spatial attention module enhances the localization of critical features on this basis. Its mathematical expression is described as:

F′ = M_C(F) ⊗ F
F″ = M_S(F′) ⊗ F′    (1)

where ⊗ denotes element-wise multiplication, F represents the input feature map, M_C represents the channel attention weight coefficients, M_S represents the spatial attention weight coefficients, F′ is the feature map output by the channel attention module, and F″ represents the feature map output by the CBAM module.

Figure 2. Architecture of CBAM. The module has two sequential sub-modules: channel and spatial.
In Figure 2, the channel attention module first performs global maximum pooling and average pooling operations on the feature map F to obtain the feature information of each channel, then performs dimension reduction and dimension expansion through a Multi-Layer Perceptron (MLP) composed of two fully connected layers to obtain two channel attention vectors; finally, the two vectors output by the MLP are summed and passed through the Sigmoid nonlinear activation function to obtain the channel attention weight coefficients M_C. The channel attention module re-weights each channel of the input feature map F by M_C to increase the effective channel weights and suppress the invalid channel weights. The attention weight coefficient M_C is described as:

M_C(F) = σ(W1(W0(F^c_avg)) + W1(W0(F^c_max)))    (2)

where σ denotes the Sigmoid activation function, W0 and W1 denote the weight matrices of the two fully connected layers in the MLP, and F^c_avg and F^c_max respectively denote the features output by average pooling and maximum pooling.
In Figure 2, the spatial attention module compresses the channel information by applying global maximum pooling and average pooling along the channel axis to the feature map F′ obtained from the channel attention module, and splices the two results together to obtain a two-channel feature, which is then reduced to a single channel using a 7 × 7 convolution and passed through the Sigmoid function to obtain the spatial attention weight coefficients M_S:

M_S(F′) = σ(f^(7×7)([F^s_avg; F^s_max]))    (3)

where σ denotes the Sigmoid activation function, f^(7×7) represents a convolution operation with a 7 × 7 kernel, and F^s_avg and F^s_max denote the features obtained by average pooling and maximum pooling along the channel axis.
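For illustration only, the channel and spatial attention computations described above can be sketched in NumPy. This is a shape-level sketch rather than the paper's implementation: the learned 7 × 7 convolution is replaced by a simple box filter, the two pooled spatial maps are averaged instead of convolved as a two-channel input, and the MLP weights are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w0, w1):
    """M_C = sigmoid(W1(W0(F_avg)) + W1(W0(F_max))) over a (C, H, W) map.
    w0: (C//r, C) reduction weights; w1: (C, C//r) expansion weights."""
    f_avg = x.mean(axis=(1, 2))    # global average pooling -> (C,)
    f_max = x.max(axis=(1, 2))     # global max pooling -> (C,)
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)   # shared two-layer MLP
    return sigmoid(mlp(f_avg) + mlp(f_max))        # (C,)

def spatial_attention(x, k=7):
    """Simplified M_S: pool along channels, then a k x k box filter stands
    in for the learned 7x7 convolution (illustration only)."""
    s = np.stack([x.mean(axis=0), x.max(axis=0)]).mean(axis=0)  # (H, W)
    pad = k // 2
    p = np.pad(s, pad, mode="edge")
    out = np.empty_like(s)
    H, W = s.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return sigmoid(out)

def cbam(x, w0, w1):
    """F'' = M_S(F') * F' with F' = M_C(F) * F (elementwise, broadcast)."""
    f1 = x * channel_attention(x, w0, w1)[:, None, None]
    return f1 * spatial_attention(f1)[None, :, :]

rng = np.random.default_rng(0)
C, r = 16, 4
x = rng.standard_normal((C, 12, 12))
w0 = rng.standard_normal((C // r, C)) * 0.1   # placeholder weights
w1 = rng.standard_normal((C, C // r)) * 0.1
y = cbam(x, w0, w1)
print(y.shape)   # (16, 12, 12)
```

The output keeps the input shape; only the per-channel and per-location weighting changes, which is what lets CBAM be dropped into each detection scale without altering the surrounding network.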

Soft-NMS
Most target detection algorithms use the NMS algorithm in the post-processing stage, where NMS filters the bounding boxes so that only a portion of them is retained for the final target location. If the Intersection over Union (IoU) value of two bounding boxes is greater than the set overlap threshold, the traditional NMS algorithm forcibly deletes the bounding box with lower confidence, which easily leads to missed detections, especially in scenes with target occlusion. Therefore, we use the Soft-NMS algorithm [22] instead of NMS. Unlike traditional NMS, the Soft-NMS algorithm uses a weight decay function to suppress the confidence of a bounding box: the confidence of each bounding box that overlaps with the highest-confidence bounding box is multiplied by a decay weight, so that low-confidence boxes belonging to occluded targets are retained. In addition, DIoU [23] considers both the center-point distance and the overlap area of the bounding boxes, which converges better; using DIoU instead of IoU to calculate the similarity between targets portrays the relative position and overlap between targets more accurately. Finally, a Gaussian penalty function is used as the confidence decay coefficient of the predicted box:

S_f = S_i,                            DIoU(M, b_i) < N_t
S_f = S_i · e^(−DIoU(M, b_i)² / σ),   DIoU(M, b_i) ≥ N_t    (4)

where b_i is the i-th bounding box to be predicted, S_i is the original score of b_i, S_f is the final score of b_i, M is the bounding box with the highest score, DIoU(M, b_i) is the distance intersection over union of b_i and M, N_t represents the threshold for screening two overlapping boxes, which is set to 0.3 [22], and σ is the parameter of the Gaussian penalty function. The greater the overlap between the predicted box b_i and M, the smaller the value of S_f and the stronger the suppression effect.
As a result, missed detections caused by the forced deletion of predicted boxes are avoided, and target detection in occlusion situations is improved.
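A minimal sketch of the DIoU-based Soft-NMS described above, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the Gaussian parameter σ = 0.5 is an illustrative choice, not taken from the paper:

```python
import numpy as np

def diou(a, b):
    """Distance-IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter)
    # squared distance between box centres
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 + \
         ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + \
         (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2

def diou_soft_nms(boxes, scores, nt=0.3, sigma=0.5):
    """Gaussian Soft-NMS with DIoU as the overlap measure: scores of boxes
    whose DIoU with the current best box exceeds nt are decayed rather
    than deleted."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        m = int(np.argmax(scores))
        best_box, best_score = boxes.pop(m), scores.pop(m)
        keep.append((best_box, best_score))
        for i, b in enumerate(boxes):
            d = diou(best_box, b)
            if d > nt:
                scores[i] *= np.exp(-(d ** 2) / sigma)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = diou_soft_nms(boxes, scores)
# the heavily overlapping second box is kept with a decayed score
# instead of being removed outright; the distant box is untouched
```

Under hard NMS the second box would simply be deleted; here it survives with a reduced score, which is exactly the behaviour that helps with occluded targets.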

Experiments and Results Analysis
The experiments on the KITTI and BDD100K datasets demonstrate that the improved YOLOv4 improves detection accuracy and speed for traffic targets over related models. All experiments were conducted in the PyTorch framework, with CUDA 10.0 and cuDNN 7.4, using an Nvidia RTX 2080Ti graphics card as the hardware configuration.

Evaluation Indicators
To verify and evaluate the effectiveness of the improved YOLOv4 network in this paper, the commonly used Average Precision (AP), mean Average Precision (mAP), and detection speed (FPS) were selected as evaluation indicators. In prediction, the IoU value between the predicted bounding box and the ground-truth bounding box is used to decide whether a sample is correctly predicted; the IoU threshold is set to 0.5, i.e., samples with an IoU value greater than 0.5 are regarded as detected positive samples, otherwise they are regarded as negative samples. AP is the average precision of the model for a certain class of target, and mAP is the mean of the AP values over all classes, calculated as in (5).
mAP = (1/N) Σ AP_i    (5)

where N is the number of all classes. The precision P represents the ratio of the number of correctly predicted samples to the total number of predicted samples, and the recall R is the ratio of the number of correctly predicted samples to the number of labelled true samples. P and R can be calculated as

P = TP / (TP + FP)    (6)
R = TP / (TP + FN)    (7)

where True Positive (TP) represents samples whose predicted target class is consistent with the true target class; False Positive (FP) represents samples whose predicted target class is inconsistent with the true target class; and False Negative (FN) represents samples whose true target exists but is not predicted by the network.
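With made-up TP/FP/FN counts for illustration, these indicators translate directly to code:

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP / (TP + FP); recall R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts and per-class AP values, for illustration only.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)                                          # 0.8 0.666...
print(mean_average_precision([0.92, 0.75, 0.77]))    # ~0.8133
```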

Experiment Based on KITTI Dataset
The KITTI object detection dataset, commonly used to evaluate computer vision algorithms in autonomous driving scenarios, is based on real data collected from urban, rural, and highway scenes and contains many complex scenarios, such as occluded vehicles, pedestrians, and cyclists. The KITTI dataset consists of 7481 training images and 7518 test images.
As the label information of the test set is not publicly available, the training set images were randomly re-divided into a new training set and test set at a ratio of 8:2, and the categories in the dataset were re-merged into three classes of objects: car, pedestrian, and cyclist. The image size in the KITTI dataset is 1242 × 375; for algorithm comparison, the input images were re-scaled to 416 × 416.
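The 8:2 re-division of the 7481 labelled images can be sketched as follows (the seed is an arbitrary placeholder, not taken from the paper):

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=42):
    """Randomly split labelled image ids into a training set and a test
    set at the given ratio, as done with the 7481 labelled KITTI images."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)   # deterministic, reproducible shuffle
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

train_ids, test_ids = split_dataset(range(7481))
print(len(train_ids), len(test_ids))   # 5984 1497
```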
The parameters were set as follows: the batch size was 8, the momentum was 0.9, the decay coefficient was 0.0005, the maximum number of iterations was set to 60,000, the initial learning rate was 0.001, and the learning rate was changed to 0.0001 and 0.00001 at 15,000 and 35,000 iterations, respectively. In addition, to prevent overfitting, early stopping was added to the training process. The variation of the loss value during training of YOLOv4 before and after improvement is shown in Figure 3; the loss value stabilizes after the number of iterations exceeds 30,000.

Figure 4 shows example plots comparing the detection results of the YOLOv4 algorithm and the improved algorithm on the KITTI test set, covering a variety of complex scenarios. In the first and second rows of Figure 4, the improved YOLOv4 algorithm detects better in a vehicle-dense environment with multiple vehicle occlusions, while the YOLOv4 algorithm misses some occluded vehicles. In the fourth figure of Figure 4, the improved YOLOv4 algorithm effectively detects small target vehicles at a distance, while the YOLOv4 algorithm produces more missed detections. These examples show that the improved YOLOv4 algorithm accurately identifies occluded and smaller targets in complex traffic scenes and further improves the detection effect compared with the YOLOv4 algorithm.
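The step learning-rate schedule described above can be expressed as a simple function:

```python
def learning_rate(iteration):
    """Step schedule used in training: 1e-3 initially, dropped to 1e-4
    at 15,000 iterations and to 1e-5 at 35,000 iterations."""
    if iteration < 15000:
        return 1e-3
    if iteration < 35000:
        return 1e-4
    return 1e-5

print(learning_rate(0), learning_rate(20000), learning_rate(40000))
```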
To further verify that the proposed method can effectively improve the accuracy of traffic multi-target detection, the improved strategies proposed in this paper were compared with the original YOLOv4 algorithm on the KITTI dataset. The improved YOLOv4 traffic target detection algorithm consists of three improvements: adding a scale detection layer, introducing a hybrid attention module, and introducing the DIoU-based Soft-NMS algorithm. Each improved strategy was combined with the original YOLOv4 algorithm in turn, and the corresponding comparison experiments were conducted on the validation set to verify the effectiveness of each strategy. Experiment A used the original YOLOv4 algorithm for training, Experiment B added a scale detection layer to the original YOLOv4 model, Experiment C added the DIoU-based Soft-NMS algorithm to the model of Experiment B, Experiment D introduced the hybrid attention module to the model of Experiment B, and Experiment E added the DIoU-based Soft-NMS algorithm to the model of Experiment D. Table 1 shows the comparative experimental results of the five experimental models on the KITTI dataset.
Table 1 shows that, in terms of the number of parameters, the improved YOLOv4 model increased by 13.1 MB relative to the original YOLOv4 model, introducing only a small amount of additional computation with little change in speed; in terms of detection accuracy, the improved YOLOv4 model significantly improved the mean average precision by 3.18%.
From the analysis of the AP values of each model: model B, with the added scale detection layer, significantly improves the detection accuracy for small objects compared to the original YOLOv4 model, especially for the relatively small pedestrian and cyclist targets; model D introduces a CBAM on top of model B, and the detection accuracy improves by 1.01%, which shows that the attention module contributes considerably to detection accuracy; models C and E introduce the DIoU-based Soft-NMS algorithm, further improving detection accuracy, with markedly higher gains in the pedestrian category than in the other two categories, which proves the effectiveness of the DIoU-based Soft-NMS algorithm in dealing with object occlusion in traffic environments.
To better visualize the impact of the CBAM module on detector performance, a visualization of the location prediction activation map is given in Figure 5. It demonstrates that the attention module can direct the network to focus more on the visible part of the target while also reducing the impact of background noise on detection performance.
In summary, all three types of improvement strategies are indispensable for improving the overall detection accuracy of the model. Compared with the original YOLOv4 model, the improved YOLOv4 model proposed in this paper can effectively detect multiple types of targets in traffic scenes while improving the detection accuracy for small and occluded targets.
To further validate the advancement of the improved YOLOv4 algorithm in this paper, it was compared with target detection algorithms such as Faster R-CNN, Cascade R-CNN, SSD, YOLOv3, and YOLOv4. Training and test evaluations were performed using the official code of each algorithm on the KITTI test data, and the results are shown in Table 2. The PR curves of the different detection methods on the test set are shown in Figure 6.

Figure 6. The precision-recall curves of different detection methods on the KITTI dataset.
As can be seen from Table 2, compared with YOLOv4, which achieves a detection speed of 51.68 FPS, the proposed algorithm improves the mean average precision (mAP) by 3.18% while still achieving real-time detection. The two-stage detection algorithms represented by Cascade R-CNN and Faster R-CNN achieve good mAP of 68.89% and 79.30%, respectively, on the KITTI test dataset, but their detection speed is too slow to meet the real-time requirements of autonomous driving perception systems. The single-stage detection algorithm SSD has a good detection speed on the test set, but its detection accuracy is too low. The YOLO series of detection algorithms perform well in both detection speed and accuracy, especially the YOLOv4 algorithm, which obtains 78.05% detection accuracy at 51.68 FPS. The improved YOLOv4 algorithm has only a small loss in detection speed and 3.18% higher detection accuracy than the YOLOv4 algorithm, with a detection speed of 47.32 FPS that exceeds the YOLOv3 algorithm. The improved YOLOv4 algorithm thus performs well in terms of both speed and accuracy and meets the requirements of detection algorithms for autonomous driving scenes.

Experiment Based on BDD100K Dataset
The BDD100K dataset is the largest open autonomous driving dataset, with 100K videos and 10 tasks for evaluating the progress of image recognition algorithms in automatic driving. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models that are less likely to be surprised by new conditions. It provides 100,000 images with a resolution of 1280 × 720 pixels. Due to the limited computing power of the device, we used only 10,000 images from the BDD100K dataset for training and evaluation of the model, with a ratio of the training set to the test set of 8:2. The training parameters of the model were set as in the above experiments. After data statistics and classification, the detection results under different algorithms are illustrated in Table 3.
Table 3 demonstrates that the target detection accuracy of the improved YOLOv4 network model is higher than that of the other detection algorithms on the BDD100K dataset. The two-stage detectors generally showed better detection accuracy than the one-stage detectors, but their detection speed was too slow to give a balanced algorithm in terms of speed and accuracy. The SSD algorithm is not the best choice for speed and accuracy because it runs detection for medium and small targets from very few layers. As a representative one-stage detection algorithm, YOLOv4 detected higher mAP values than the other general-purpose target detection algorithms; it uses CSPDarkNet-53 as a backbone, which increases the accuracy of the classifier and detector. However, the improved YOLOv4 enhanced the mAP by 2.27% over the YOLOv4 algorithm. For pedestrian and cyclist targets, the per-category AP increased by 3.4% and 1.75%, respectively, indicating that our improved algorithm effectively detects small targets.
In summary, our improved YOLOv4 model not only outperforms the other algorithms in detection and recognition accuracy but also achieves a detection speed of about 46 FPS, verifying that the algorithm in this paper achieves a balance of accuracy and speed, with strong application prospects.

Conclusions
Multi-target detection algorithms must offer high detection accuracy and real-time detection speed in traffic scenes, with detection accuracy being especially important. However, generic multi-target detection algorithms have difficulty detecting small and occluded targets in complex environments. To address this problem, we propose an improved YOLOv4 algorithm specifically designed for multi-target detection in traffic scenes. A new scale detection layer was added to the original YOLOv4 network to improve the detection of small targets, the CBAM was introduced into the feature fusion network, and finally a DIoU-based Soft-NMS was used as the post-processing algorithm to improve the detection accuracy for occluded objects. Compared with the original YOLOv4 algorithm, the mean average precision is improved by 3.18% on the KITTI dataset and 2.27% on the BDD100K dataset, which proves that the proposed strategies effectively improve the detection accuracy for small and occluded targets. Compared with other multi-target detection algorithms, the proposed algorithm achieved a mean average precision of 81.23% for the three classes of detection targets while maintaining a detection speed of 47.32 FPS, combining good detection speed with good detection accuracy, and can complete the task of multi-target detection in traffic scenes well. The proposed method applies to a wide range of traffic scenarios, but for harsh traffic environments, such as rain and fog, it needs further improvement.