Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory

Chen, Xuewen; Yan, Shilong; Xia, Chenxi

doi:10.3390/wevj16060325

Open AccessArticle

Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory

by

Xuewen Chen

^*

,

Shilong Yan

and

Chenxi Xia

College of Automobile and Traffic Engineering, Liaoning University of Technology, Jinzhou 121001, China

^*

Author to whom correspondence should be addressed.

World Electr. Veh. J. 2025, 16(6), 325; https://doi.org/10.3390/wevj16060325

Submission received: 18 April 2025 / Revised: 21 May 2025 / Accepted: 6 June 2025 / Published: 11 June 2025

(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)

Download

Browse Figures

Versions Notes

Abstract

To address the issues of missed detections and false detections of small target missed detections caused by dense occlusion in complex traffic environments, a non-maximum suppression method, Bot-NMS, is proposed to achieve accurate prediction and localization of occluded targets. In the backbone network of YOLOv7, the Ghost module, the ECA attention mechanism, and the multi-scale feature detection structure are introduced to enhance the network’s capacity to learn small target features. The SCSTD and KITTI datasets were used to train and test the improved YOLOv7 target detection network model. The results demonstrate that the improved YOLOv7 method significantly enhances the recall rate and detection accuracy of various targets. A multi-target tracking method based on target re-identification (ReID) is proposed. Utilizing deep learning theory, a ReID model for target identification is constructed to comprehensively capture global and foreground target features. By establishing the correlation cost matrix of the cosine distance and IoU overlap, the correlation between target detection objects, the tracking trajectory, and ReID feature similarity is realized. The VERI-776 vehicle re-identification dataset and MARKET1501 pedestrian re-identification dataset were used to train the proposed ReID model, and multi-target tracking performance comparison experiments were conducted on the MOT16 dataset. The results show that the multi-target tracking method by introducing the ReID model and improving the cost matrix can better deal with the dense occlusion of the target, and can effectively and accurately track the road target in the realistic complex traffic environment.

Keywords:

intelligent vehicle; deep learning; non-maximum suppression method; ECA attention mechanism; multi-target tracking; target re-identification model; simulation

1. Introduction

With the rapid development of the economy and technology, intelligent vehicles are now considered to be a multi-functional advanced driving system integrating environmental perception, path planning, and decision control, and will continue to develop towards intelligent and networked fields in the future to meet people’s various needs [1,2]. In the environment perception of intelligent vehicles, the object detection task aims to identify and locate targets of interest, such as pedestrians, non-motor vehicles, and various vehicle targets. In this task, visual sensors are often used to collect video image data, and the target is located and classified by extracting the feature information in the image. At the same time, targets of interest can subsequently be tracked on the basis of the detection results, so as to provide stable and reliable key targets for longitudinal control systems such as intelligent following or collision avoidance. Multi-target tracking is used to assign identity tags to multiple object detection objects in the video collected via vehicle equipment in the real complex road traffic scene and realize identity matching of the same object in different frames using the data association method.

From the research status at home and abroad, many experts and scholars have put forward a large number of target detection and tracking methods. Reference [3] improved the SSD network structure by introducing void convolution and feature fusion mechanisms. By expanding the acceptance domain of the convolution kernel, the network can better capture the context information of the target, improve the detection ability of dense targets by enriching shallow semantic information, and at the same time alleviate the problem of foreground–background unbalance. Reference [4] systematically evaluated mainstream target detection architectures such as Faster R-CNN, R-CNN, and SSD, as well as mainstream feature extraction networks such as ResNet, MobileNet, Inception, etc., based on a KITTI dataset from the perspective of trade-off between detection speed, accuracy, and memory usage. Based on the YOLOv5 detection system, the category consistency regularization module is proposed in reference [5], which improves the generalization performance of the model in different domains. A multi-level adaptive strategy is proposed to better adapt to the feature distribution differences in different domains, making the target detection system based on YOLOv5 more capable of dealing with domain drift. Reference [6] reports 56.8% mAP detection accuracy (50.7% for YOLOv5) on the COCO dataset by using YOLOv7, and it reaches the detection speed of 160 frames per second the fastest. In terms of small target detection and occlusion detection, reference [7] designed the backbone and detection head of CenterNet, introduced the average boundary model, and realized more accurate target positioning based on boundary characteristic information, which helped to improve the accuracy of detection. The real-time performance and robustness of the system in real driving scenarios are proved by real vehicle experiments. In reference [8], the Spatial Layout Memory (SLM) module is added to the object detection module to extract the important appearance features of the object. The constructed tracker SMILEtrack not only pays attention to the location information of the target, but also pays attention to the appearance characteristics, and achieves a high accuracy rate in multi-target tracking. Reference [9], based on the YOLOX detector, proposed BYTE, an association method based on detection frames. The IoU distance between the detection object and the tracking track was used to complete the association, and the MOTA index exceeded 80%. In reference [10], camera motion compensation is introduced into target tracking to improve the Kalman filter state prediction, and a robust two-level target association method is adopted to show good tracking performance on various MOT datasets. Reference [11], based on DeepSORT, improved the tracking method from the perspective of appearance embedding and trajectory association, improved the missing association and missing detection, and improved the robustness of the tracker. Reference [12] introduces the time step to calculate the virtual trajectory in the Kalman filter tracker, which solves the problem of accumulated noise caused by the long-term nonlinear movement of the target and improves the tracking effect of the filter. A new multi-target tracking method, Deep OC-SORT, is proposed in reference [13], which further improves the accuracy of multi-target tracking. In order to balance detection, tracking accuracy, and computing speed, a multi-target tracking method based on joint detection and the tracking paradigm is proposed. Reference [14] proposes a multi-target tracking method of joint detection, which uses a network to accomplish both target detection and multi-target tracking. In this method, the loss function of multi-task learning is used in the training process, branches are added in the detection head network to output additional features, and the target ID is associated based on these features. Reference [15] uses a single network model to process the two tasks of target detection and appearance association, and uses a short-time memory network to model the multi-target motion constraints, so as to realize the coordination of target detection and tracking under a single network framework. Reference [16] puts forward the CenterTrack method, which is based on the CenterNet detection network and uses historical frame clues to realize the recovery of occluded and disappeared objects. CenterTrack omits the feature association matching process and directly outputs the target detection and matching results. Reference [17] uses the CenterNet network framework to design an Anchor-free target detection network, which balances the competition between the two and achieves better detection and tracking accuracy. In reference [18], a SimpleTrack method is proposed to solve the problem of object loss caused by occlusion and blurring, which effectively enhances the stability of the joint detection and tracking method. Reference [19] develops bank turn trajectory primitives that minimize occlusion periods when the sensor footprint sweeps a region; analogous path-planning ideas may inform intelligent vehicle sensor placement or ego-vehicle maneuvers that reduce dense occlusion in urban scenes. Reference [20] presents a trackFormer model with end-to-end multi-object tracking and segmentation based on the encoder–decoder transformer architecture. By using self- and encoder–decoder attention mechanisms, the trackFormer achieves seamless data association between frames in a new tracking-by-attention paradigm that simultaneously reasons about location, occlusion, and object identity. Reference [21] proposes a MaskOCSORT tracking method, which uses cosine similarity, intersection over union, and velocity consistency metrics for the observation and tracking of correlation problems. The proposed tracking method was evaluated using evaluation metrics such as HOTA, MOTA, and IDF1, and the ID count generated by the tracking method was evaluated using evaluation metrics such as precision, recall, and F1-score.

The rest of the paper is organized as follows. The second part deeply studies the detection method of occluded targets and small targets in complex road traffic environments based on YOLOv7. Based on the YOLOv7 backbone network, the network structure is improved in the aspects of multi-scale learning, prior frames, and attention mechanisms to improve the detection ability of small targets. A non-maximum suppression method, Bot-NMS, is proposed to improve the deep learning ability of the network model for the densely blocked target features and solve the problem of missing targets. The third part studies and improves the multi-object re-identification and tracking method of ByteTrack for complex road traffic environments. The ReID model of target re-identification is constructed to obtain the global and foreground features of the target, and the correlation cost matrix of the cosine distance and IoU overlap is established to realize the correlation between the target detection object, tracking trajectory, and ReID feature similarity. The fourth part completes the training and testing of the improved YOLOv7 target detection network model by using SCSTD and KITTI datasets. The VERI-776 vehicle re-identification dataset and MARKET1501 pedestrian re-identification dataset were used to train the ReID model, and multi-target tracking performance comparison experiments were conducted on the MOT16 dataset. In addition, the multi-target detection and tracking of complex urban road traffic scene video images are verified. The fifth part summarizes the main research work and its conclusion.

2. Detection Method of Occluded Targets and Small Targets in Complex Road Traffic Environments Based on YOLOv7

In the complex road traffic environment, the occlusion of dense targets, such as vehicles and pedestrians, leads to the serious problem of missing targets. Aiming at the problem of missing detection caused by poor detection and the positioning ability of detection methods for dense occlusiveness, a robust non-maximum suppression method, Bot-NMS, is proposed to accurately predict and locate occlusiveness targets. Based on the YOLOv7 backbone network (Figure 1), the network structure is improved in the aspects of multi-scale learning, prior frames, and attention mechanisms, aiming to improve the detection ability of small targets.

2.1. A Bot-NMS Method for Occlusion Object Detection

The standard NMS method is a greedy algorithm. Its core idea is to find out the detection frame r_i, whose IoU is greater than or equal to the threshold θ_t between the detection frame list B and the detection frame B with the highest confidence via cyclic traversal, set its confidence c_i to zero, and remove it from B. The confidence formula is as follows:

c_{i} = \{\begin{array}{l} 0, & I o U (B, r_{i}) \geq θ_{t} \\ c_{i}, & I o U (B, r_{i}) < θ_{t} \end{array}

(1)

The DIoU-NMS method uses DIoU as an alternative to IoU from the perspective of improving the evaluation criteria. DIoU adds a penalty term to measure the distance between the center points between the two boxes. Its calculation formula is as follows:

D I o U = I o U - {(\frac{ρ^{2} (B, r_{i})}{b^{2}})}^{α}

(2)

The Soft-NMS method replaces the hard decision strategy in the standard NMS method with the penalty attenuation strategy. The core idea is to set the confidence score c_i as the weighted function of the overlap rate (IoU) between B and r_i boxes, so that the confidence score can decrease with the increase in IoU, instead of directly setting the confidence score to zero like in the standard NMS method. The confidence is shown in Formula (3).

c_{i} = \{\begin{matrix} c_{i} f (I o U (B, r_{i})), & I o U (B, r_{i}) \geq θ_{t} \\ c_{i}, & I o U (B, r_{i}) < θ_{t} \end{matrix}

(3)

f (I o U (B, r_{i})) = e x p (- \frac{I o U {(B, r_{i})}^{2}}{δ})

Combining the improved schemes of DIoU-NMS and Soft-NMS, this paper proposes a robust non-maximum suppression method, Bot-NMS, to reduce the error suppression of overlapping prediction boxes. The confidence in the Bot-NMS method is shown in Formula (4).

c_{i} = \{\begin{matrix} c_{i} f (D I o U (B, r_{i})), & D I o U (B, r_{i}) \geq θ_{t} \\ c_{i}, & D I o U (B, r_{i}) < θ_{t} \end{matrix}

(4)

f (D I o U (B, r_{i})) = e x p (- \frac{D I o U {(B, r_{i})}^{2}}{δ})

2.2. Comparison and Analysis of Different NMS Methods

In order to verify the effectiveness of the Bot-NMS method proposed in this paper, standard NMS, DIoU-NMS, Soft-NMS, and Bot-NMS methods are installed on the framework of YOLOv7, and comparative experiments are conducted on the SCTSD dataset in this paper. The evaluation indicators used were recall, mean accuracy (mAP50), and inference time (Time), and the results are shown in Table 1.

As can be seen from Table 1, the YOLOv7 detection model configured with the Bot-NMS method has the most obvious improvement in the recall rate of various targets (Car, Bus, Truck, Motor Bike, Bicycle, and Person). The recall rate of all categories increased to 92.3%, 96.1%, 96.5%, 93.8%, 87.4%, and 91.5%, respectively. The mAP50 increased from 89.4% to 91.5%.

Due to the increased complexity of Bot-NMS compared with the standard NMS method, the detection speed of the model decreased slightly, increasing by 0.31 ms. However, the improved YOLOv7 model can still meet the real-time detection requirements of intelligent vehicles in terms of detection speed.

Figure 2 shows the comparison between the detection results of Bot-NMS, proposed in this paper, and those of standard NMS.

Through the analysis of the results from Figure 2, it can be seen that the YOLOv7 detection model of the Bot-NMS method proposed in this paper can significantly reduce the error suppression of overlapping detection boxes, reduce the number of missed detections, and improve the recall rate, thus enhancing the detection ability of the model for occluded targets.

2.3. Small Target Detection Research

In order to enhance the detection ability of the YOLOv7 detection method for small targets, improvements were introduced in three aspects: (1) optimizing the ELAN module of the backbone network, using Ghost module to replace the coy of feature extraction; (2) improving the multi-scale feature detection structure, using the improved K-means common convolution in one branch, and improving the efficiency of the clustering algorithm to generate more accurate prior boxes to achieve more accurate detection and positioning of targets; and (3) the ECA attention mechanism is introduced to improve the learning and expression ability of small target features of the network model.

2.3.1. ELAN Module Optimization of the Backbone Network

The Ghost module is used to replace the common convolution in branch 1 of the ELAN module of the YOLOv7 backbone network in order to compress the number of model parameters and improve the extraction efficiency of important information (as shown in Figure 3). After improvement, the ELAN module in the backbone network is named the Elan-G module.

The Ghost module divides the composition of the output feature graph into two parts: one part is the feature graph generated by the ordinary convolution with fewer parameters

Y^{'} \in R^{m \times h^{'} \times w^{'}}

; the other part is the feature graph generated by the cheap linear transformation ϕ_ij; and finally, the two parts of the feature graph are splicing together for output. The above calculation process is shown in Figure 4 and Formulas (5) and (6).

Y^{'} = X * f^{'}

(5)

\begin{matrix} y_{i, j} = ϕ_{i, j} ({y^{'}}_{i}), & \forall i = 1, \dots, m, & j = 1, \dots, s \end{matrix}

(6)

where f′ represents the convolution filter, Y′ represents the feature graph. ϕ_ij represents the linear transformation that produces the ghost feature graph.

2.3.2. Multi-Scale Learning and Prior Frame Improvement

In order to make better use of the fine-grained information of the target in the shallow feature, it is fused with the semantic information in the deep feature to enhance the expression ability of small object features. In addition to the existing detection scale, a prediction feature graph with a size of 160 × 160 is used. The fusion mode of the Four-layer feature pyramid (PAFPN) structure not only helps the model to improve the detection ability of small targets, but also makes the detection feature map more continuous. In order to obtain a more accurate prior frame (Anchor), the optimized K-means algorithm is used for clustering to obtain the Anchor frame (as shown in Table 2).

The optimized K-means algorithm uses a more reasonable distance measurement method, which effectively alleviates the problem that the original clustering algorithm is prone to falling into with the local optimal solution. The distance measurement method before and after optimization is shown in Formula (7).

\{\begin{array}{l} o r i g i n a l, & d_{1} = f (r^{c}, r_{i}) = 1 - I o U (r^{c}, r_{i}) \\ o p t i m i z e d, & d_{2} = φ (r^{c}, r) = \sqrt{1 - I o U^{2} (r^{c}, r_{i})} \end{array}

(7)

where r^c represents the cluster center and r_i represents the sample box to be clustered.

2.3.3. Introduction of ECA Attention Mechanism

The ECA attention mechanism not only makes the network model pay more attention to the useful features of the detection task and weaken the useless information, but also enables the network model to independently learn the correlation between different positions in the input or the dependency between channels, which significantly improves the performance and efficiency of the model in the target detection task. The ECA attention module is introduced after each ELAN module in the backbone network to improve the feature extraction capability of the model, as shown in Figure 5.

On the basis of avoiding channel dimension reduction, the ECA attentional mechanism adopts one-dimensional convolution to realize local cross-channel interaction, captures the dependency between channels, and adaptively determines the optimal interaction coverage according to the channel dimension.

The attention mechanism can make the network model pay more attention to the features that are useful for the detection task, and weaken the useless information. On the SCTSD dataset, in this paper, three attention mechanism experiments were configured for the YOLOv7 model, namely, ECA (Efficient Channel Attention), CA (Coordinate Attention), and SA (Shuffle Attention). It can be seen from Table 3 that when the ECA, CA, and SA attention modules are imported into the backbone network, the detection accuracy of the model is improved, while other indicators such as the parameter variation (M), calculation variation (GLOPs), and inference time (Time) decrease (as shown in Table 3). Therefore, considering the balance between detection accuracy and detection speed, this paper finally decides to introduce the ECA attention module.

2.3.4. Detection Results of Small Targets in Complex Road Traffic Environment

The training set selected in this paper is the fusion dataset of the SCSTD dataset and part of the KITTI training set. The total number of images is 19,700, for which the ratio of training set to test set is 3:1, that is, 16,000 images are used for training verification, and 3900 images are used for test. Comparison experiments were carried out on the SCTSD dataset for the YOLOv7 model before and after improvement. The comparison results of the recall rate and accuracy rate of the YOLOv7 model before and after improvement on the SCTSD dataset are shown in Table 4 and Table 5. Table 6 lists the main training parameter settings.

As can be seen from Table 4, the optimized and improved YOLOv7 model has a more significant recall improvement for small targets (such as pedestrians, motorcycles, and bicycles). As can be seen from Table 5, the improved YOLOv7 model also shows certain improvement in the level of accuracy. The experimental results show that the improved YOLOv7 model can further improve the problem of the missing and false detection of small targets in the realistic complex road traffic environment.

In addition, a variety of datasets such as the PASCAL VOC 2007, MS-COCO 2017, and KITTI datasets were used to evaluate the performance of the optimized YOLOv7 model (as shown in Figure 6).

The specific improvement is as follows: on the SCTSD dataset, the mAP50 increased from 92.45% before the improvement to 95.42%, with an improvement of 2.97%. For the PASCAL VOC2007 dataset with low detection difficulty, the improved YOLOv7 model mAP50 was 2.26% higher than that before the improvement. On the Microsoft COCO2017 dataset, the mAP50 of the model before and after improvement reached 89.22% and 90.93%, respectively. On the KITTI dataset, the mAP50 of the model before and after improvement reached 92.57% and 95.25%, respectively. The average Time of the improved YOLOv7 model does not exceed 17 ms, which fully meets the requirements of the real-time detection of intelligent vehicles.

3. ReID Feature Track Tracking Based on Deep Learning

3.1. Design of Object Re-Identification Model Based on Deep Learning

Multi-object tracking (MOT) is a method to assign identity tags (ids) to the object detection objects in the video collected by vision sensors in the realistic complex road traffic scenes, and realize the same ID of the same object in different frames using the data association method. This paper designs a ReID model for target re-identification, and its structure is shown in Figure 7.

According to Figure 7, the overall structure of the ReID model mainly includes four parts: for input, the backbone feature extraction network, global feature selection module, and spatial local feature selection module, and finally for output, ReID’s global feature F. The ReID model adopts IBN-Net50-a as the backbone feature extraction network, removes the original adaptive average pooling layer and full connection layer behind it, and introduces the global feature selection and spatial local feature selection modules designed in this paper to finally achieve feature fusion.

3.1.1. Global Feature Selection Module

Assuming that the discriminant feature X is the output of the backbone feature extraction network, this feature is passed into the global feature selection module (see the orange module area in Figure 7), which divides the feature X into the G part along the height direction rule, that is

X = {x_{i}}_{i = 1}^{G}

. For the features of each part (x_i), a convolution with a kernel size of 1 is first used for channel adjustment, and then the global average pooling operation (AP) is used to obtain the feature importance scores (f_i) of each part. After connecting them together, the softmax function is further used to normalize each score to obtain the corresponding weights of each part. Perform the normalization operation on each score to obtain the weights (w_i) corresponding to the features of each part. The above process is expressed by Formula (8) as follows.

\{\begin{cases} f_{i} = A P (C o n v_{1 \times 1} (x_{i})) \\ w_{i} \leftarrow s o f t m a x (f_{i}) = \frac{e x p (f_{i})}{\sum_{i = 1}^{G} e x p (f_{i})} \end{cases}

(8)

The learned weight parameters are multiplied with the corresponding position elements of each part of the feature, and the initial feature X is fused by residual connection, so as to obtain a better representation feature X₁.

3.1.2. Spatial Local Feature Selection Module

The spatial local feature selection module X₂ is responsible for the multi-scale partitioning of optimized global features, selectively integrating spatial local information from multi-scale features, obtaining features that can distinguish the foreground and background, and filtering out the interference of noise and redundant information. The output X₁ of module 1 is passed into module 2 (see the blue module area in Figure 7), which is subsampled by four different multiples to obtain multi-scale features. Firstly, the multi-scale features are transformed and the features of different scales are aggregated together. For the aggregate feature F_p, a spatial enhancement mechanism is adopted, which aims to learn its corresponding foreground feature F_f adaptively by strengthening the spatial information of the aggregate feature. The above process is expressed in Formula (9) as follows. Where W stands for full connection operation, F_tr stands for dimension adjustment.

F_{f} = F_{t r} (s i g m o i d (W (F_{p})))

(9)

For the foreground feature F_f, module 2 performs element-wise multiplication with feature X₁ to obtain the feature X₂ enhanced by the foreground feature F_f.

The feature vector F_v that finally participates in ReID model training can be expressed as follows.

F_{v} = {\hat{F}}_{a} \oplus {\tilde{F}}_{L}

(10)

where

{\hat{F}}_{a}

represents the global feature obtained by the output feature X₁ of module 1 through pooling, convolution, and other operations.

{\tilde{F}}_{L}

represents the local feature obtained by the transformation of module 2’s output feature X₂.

\oplus

represents the addition operation of the corresponding element.

3.2. Trajectory Tracking Method Based on ReID Feature

3.2.1. Trajectory Update Based on ReID Feature

For each image entered into the ReID network, the network maps it into a feature space, where it determines whether targets within different images are the same identity ID by measuring the distance between features. Figure 8 shows the general flow of the trajectory update method based on ReID features.

As can be seen from Figure 8, the update process can be mainly divided into two parts, namely the offline training ReID network model and online real-time tracking. The input of the trajectory update method based on ReID features is a video sequence R, and K-frame video images are obtained after frame-by-frame processing, i.e.,

R = {r_{1}, r_{2}, \dots, r_{i}, \dots, r_{K}}

. A frame of video images is input into the tracker, and the object detection boundary box

{T R_{i}^{m}}_{m = 1}^{M}

of the current frame is obtained through the detector Det. Based on the detection of the boundary box, the image in the boundary box is captured from the current frame image and transferred to the trained ReID network model in an offline state to extract ReID features

{R F_{i}^{m}}_{m = 1}^{M}

. The ReID features of each detection target will be calculated from the ReID features on each trajectory to establish a two-dimensional distance cost matrix cms. The data association algorithm selects the detection targets and tracking trajectories that can be successfully matched according to the distance cost matrix. For the successfully matched trajectory, the detection results are used to assist trajectory updating. At the same time, the ReID feature of the target detection object successfully matched via the latest frame is added. After the track update is complete, the tracking task of the current frame is completed, and then the target prediction in the next video image is continued (Kalman Filtering (KF) is mainly used to predict the position of the tracking track of the previous frame in the current frame, and the Kalman track prediction box is output), and so on until the end of the video.

3.2.2. Similarity Measure in Target Association

The motion feature similarity and ReID feature similarity between the target detection object and the tracking trajectory are considered jointly by means of a cost matrix that fuses cosine-based ReID similarity and IoU-based motion similarity. As shown in Figure 8, the metric distance and cost matrix are shown in Formulas (11) and (12).

L_{d c o s} (X, Y) = 1 - \frac{\sum_{i = 1}^{n} (x_{i} \times y_{i})}{\sqrt{\sum_{i = 1}^{n} x_{i}^{2}} \times \sqrt{\sum_{i = 1}^{n} y_{i}^{2}}}

(11)

where

X = (x_{1}, x_{2}, \dots, x_{n})

and

Y = (y_{1}, y_{2}, \dots, y_{n})

are two n-dimensional eigenvectors.

C_{i j} = \{\begin{array}{l} m i n {0.5 \cdot L_{i j}^{d c o s}, L_{i j}^{i o u}}, & (L_{i j}^{d c o s} < L_{e m b}) \land (L_{i j}^{i o u} < L_{m o t}) \\ 1, & o t h e r w i s e \end{array}

(12)

where C_ij represents the substitution value of the association between the i trajectory and the j target detection, and also represents the value of the element located in the i row and j column in the cost matrix.

L_{i j}^{d c o s}

represents the cosine distance between the ReID feature on the i track and the ReID feature of the j target detection object, and represents the generation value based on the ReID feature association between the two.

L_{i j}^{i o u}

represents the IoU overlap between the bounding box of the i track and the bounding box of the j target detection object, and represents the generation value based on the association of motion features between the two. L_emb and L_mot are ReID feature similarity and motion position proximity thresholds, respectively, which are used to filter out tracks and detection targets that cannot be correlated and matched.

3.2.3. Trajectory Updating

In the target tracking task, the Kalman filter can predict the state of the target at k time according to the state of the target at k − 1 time, and then synthesize the predicted value and observation value to obtain the final update state of the target. The overall process is shown in Figure 9.

In Figure 9,

{\hat{x}}_{k}^{-}

is the predicted value of the target state given by the k time motion model;

{\hat{x}}_{k - 1}^{+}

is the target state at time k − 1;

Φ_{k - 1}

is the target motion parameter matrix, also known as the state transition matrix, representing the connection between the state of the target at the previous moment and the state at the current moment. w is the motion model noise, which satisfies the normal distribution;

P_{k}^{-}

is the covariance matrix of

{\hat{x}}_{k}^{-}

, representing the correlation between the elements in the state vector.

Q_{k - 1}

is the covariance matrix of the excitation noise w. K_k is the Kalman gain at time k;

{\hat{x}}_{k}^{+}

is the final updated state of the target at time k, and the error of the corrected observed values Z_k and predicted values of

{\hat{x}}_{k}^{-}

and K_k of the motion model is comprehensively considered.

P_{k}^{+}

is the covariance matrix of

{\hat{x}}_{k}^{+}

.

H_{k}

is the target observation parameter matrix, also known as the observation transfer matrix, which is responsible for mapping the high-dimensional target state

{\hat{x}}_{k - 1}^{+}

to the low-dimensional detection space.

R_{k}

is the covariance matrix of the observed model noise V.

3.3. Target Rerecognition and Tracking Experiment

In this paper, the VeRi-776 vehicle re-identification dataset and MARKET1501 pedestrian re-identification dataset are used to train the proposed ReID model. The targets are divided into two categories, which are marked as Vehicle class and Person class. Each category of targets uses its own ReID model to extract corresponding ReID features. During the training and reasoning process of the detection model, certain software and hardware configurations are required to support complex image processing and computation. The hardware configuration adopted in this article includes RTX3060 Ti GPU (8 GB video memory), i7-11800H CPU (3.2 GHz), and 16 GB of RAM. In terms of software, the Windows10 Professional Edition operating system, Python 3.7.0, torch 1.7.1+cu110, CUDA 11.0, cuDNN 8.0.2, Matplotlib 2.2.3 LabelImg 1.8.6, and Pycharm 2021.3.1 are used.

3.3.1. Vehicle Re-Identification Verification Based on ReID Model

In order to verify the effectiveness of the ReID model proposed in this paper in vehicle re-identification, a performance comparison experiment was conducted on the VERI-766 dataset with existing mainstream vehicle re-recognition methods (as shown in Table 7). In performance comparison, the Rank-1 hit ratio, Rank-5 hit ratio, and average accuracy ratio mean mAP are used as evaluation indicators, where the Rank-n hit rate refers to the probability of correctly hitting the target within n times in the Rank-n sequence returned by image library retrieval. The average accuracy mean mAP indicates that the average AP values of the category are detected. In this paper, the average of the precision values corresponding to all different recall values on the PR curve is taken to obtain AP values for this category. By synthesizing the AP values of all categories to be detected, the mAP values are calculated (mAP50 is adopted in this paper (IoU threshold is 0.5)).

Through experimental analysis, the ReID model proposed in this paper significantly improves the vehicle re-identification effect through the efficient cooperation between the global feature selection module and the spatial local feature selection module. By comparing the performance with other mainstream methods, the effectiveness of this model in vehicle re-identification is proved.

3.3.2. Pedestrian Re-Identification Verification Based on ReID Model

The performance comparison experiment with existing mainstream pedestrian re-recognition methods was conducted on the MARKET1501 dataset (as shown in Table 8). In the performance comparison, the target re-identification evaluation indexes such as the Rank-1 hit rate and average accuracy mean mAP are used.

Through the performance comparison experiments in Table 8, the model proposed in this paper has improved to a certain extent in the Rank-1 hit rate and average accuracy mean mAP, which fully proves the effectiveness of the model in pedestrian re-identification.

3.3.3. Comparison Experiment and Result Analysis of Multi-Target Tracking

In order to verify whether the ReID model proposed in this paper can significantly improve the performance of the multi-target tracker, the trained ReID models, including the ReID-vehicle and ReID-person models, are integrated into the tracker in this section. A performance comparison experiment was performed on the MOT16 dataset with the ByteTrack tracker (Baseline) without the ReID mode l (as shown in Table 9).

As can be seen from Table 9, MOTA increased from 69.4% to 72.1% before the improvement, IDF1 increased from 74.6% to 77.8%, and ID Sw. from 607 times before the improvement to 524 times, an improvement of 13.7%. In terms of real-time tracking, although it is lower than before the improvement, 34.9 frames per second is enough to meet the requirements of real-time tracking.

Figure 10 shows the tracking results of the improved ByteTrack tracker during frames 173 to 182 of the MOT16 dataset. As can be seen from Figure 10, the woman wearing a green top and black pants is accurately detected in frame 173 and successfully associated with the trajectory, whose ID is 8 (corresponding to the green detection box). However, in subsequent frames, due to being heavily obscured by other pedestrians, she completely disappeared from camera view until she was redetected in frame 180. According to the ReID features retained on the track, it is successfully associated with the correct track, and the tracker gives it the same identity ID as the original track. This shows that the multi-target tracker introduced in the ReID model performs well in the case of disappearing and retrieving in the middle of the trajectory.

Finally, the paper uses the road target video images collected via tachograph in the realistic complex traffic environment to demonstrate the verification of the target detection results (as shown in Figure 11).

The subgraphs in Figure 11 show the multi-target tracking effect before (left) and after (right) improvement. From the comparison results of corresponding frames in Figure 11, it can be found that the ID of the newly appearing Person object in the center of the image is incorrectly assigned to 8 (the real label is 7) in frame 114, which causes the ID of other newly appearing Person objects in the current frame of the video image to be incorrectly assigned. In subsequent frames (such as video frames 204, 209, etc.), the Person object ID is also incorrectly assigned. After the improvement, in frame 114, the ID of the newly appearing Person object located in the center of the image is accurately assigned to 7 (consistent with the real label), and the IDs of the other newly appearing Person objects in the current frame’s video image are accurately assigned. In subsequent frames (such as video frames 204, 209, etc.), the Person object ID is also accurately assigned and stable. The experimental results show that the improved tracking method has a good perception ability in complex road traffic environments, can better deal with the occlusion of road targets, and can detect and track road targets in real time, accurately and robustly.

4. Conclusions

(1): In order to solve the problem of the missing and false detection of small targets due to dense occlusion in complex traffic environments, a non-maximum suppression method, Bot-NMS, is proposed to improve the deep learning ability of network models for dense occlusion targets and solve the problem of missing targets. The comparison results of the SCTSD dataset show that the performance of recall and accuracy are improved;
(2): In the backbone network of YOLOv7, the overall structure of the network is improved by introducing multi-scale learning, prior frames, and attention mechanisms, and the detection ability of the network of small targets is improved. The model performance was evaluated through a variety of datasets such as PASCAL VOC2007, Microsoft COCO2017, and KITTI; the improved model had a significant improvement in the mAP50 and the average Time was less than 17 ms, which met the requirements for the real-time detection of intelligent vehicles;
(3): In order to solve the problem of target ID association failure and low tracking accuracy caused by target occlusion in complex road environments based on deep learning theory, a ReID model for target re-identification was constructed to obtain target global features and foreground features. Based on the trained ReID network model, the track update process of the tracking target was designed, and the correlation cost matrix of cosine distance and IoU overlap was established to realize the correlation between the target detection object, tracking trajectory, and ReID feature similarity. The VERI-776 vehicle re-identification dataset and MARKET1501 pedestrian re-identification dataset were used to complete ReID model training and comparative analysis with mainstream methods. The multi-target tracking performance was compared and verified on the MOT16 dataset. The accuracy of ID matching and the tracking of occlusion targets were verified by combining the road target video collected using the dashcam in the realistic complex traffic environment.

In short, this paper adopts the deep learning theory to conduct in-depth research on the learning of small target features and the accurate prediction of occlusive targets, as well as the ID association matching and tracking of multiple targets, so as to solve the problem of the dense occlusion of road targets in complex traffic environments and achieve accurate multi-target tracking. The research results of this paper can provide a reliable basis for the improvement of vision-based intelligent vehicle early warning/stage braking and collision avoidance control.

Author Contributions

X.C. is responsible for conceptualizing and writing, S.Y. is responsible for the editing and normalization of graphics and tables, and participates in the writing, C.X. is responsible for simulation verification and result analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China under Grant 62373175 and 2024 Fundamental Research Funding of the Educational Department of Liaoning Province LJZZ232410154016.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Peng, P.; Geng, K.K.; Wang, Z.W.; Liu, Z.C.; Yin, G.D. Review on Environmental Perception Methods of Autonomous Vehicles. J. Mech. Eng. 2023, 59, 281–303. [Google Scholar]
Jiang, W.H.; Xin, X.; Chen, W.W.; Cui, W.W. Multi-condition Parking Space Recognition Based on Information Fusion and Decision Planning of Automatic Parking System. J. Mech. Eng. 2021, 57, 131–141. [Google Scholar]
Xu, X.; Zhao, J.; Li, Y.; Gao, H.; Wang, X. BANet: A Balanced Atrous Net Improved From SSD for Autonomous Driving in Smart Transportation. IEEE Sens. J. 2021, 21, 25018–25026. [Google Scholar] [CrossRef]
Chen, L.; Lin, S.; Lu, X.; Cao, D.; Wang, F. Deep Neural Network Based Vehicle and Pedestrian Detection for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2021, 22, 3234–3246. [Google Scholar] [CrossRef]
Li, G.; Ji, Z.; Qu, X.; Zhou, R.; Cao, D. Cross-Domain Object Detection for Autonomous Driving: A Stepwise Domain Adaptative YOLO Approach. IEEE Trans. Intell. Veh. 2022, 7, 603–615. [Google Scholar] [CrossRef]
Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7464–7475. [Google Scholar]
Wang, H.; Xu, Y.; Wang, Z.; Cai, Y.; Chen, L.; Li, Y. CenterNet-Auto: A Multi-object Visual Detection Algorithm for Autonomous Driving Scenes Based on Improved CenterNet. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 742–752. [Google Scholar] [CrossRef]
Wang, Y.; Hsieh, J.; Chen, P.; Chang, M.; So, H.; Li, X. SSMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking. arXiv 2024, arXiv:2211.08824v4. Available online: https://arxiv.org/pdf/2211.08824.pdf (accessed on 20 October 2023).
Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2022, arXiv:2110.06864v3. Available online: https://arxiv.org/pdf/2110.06864.pdf (accessed on 20 October 2023).
Aharon, N.; Orfaig, R.; Bobrovsky, B. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651v2. Available online: https://arxiv.org/pdf/2206.14651.pdf (accessed on 20 October 2023).
Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 9686–9696. [Google Scholar]
Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification. arXiv 2023, arXiv:2302.11813v1. Available online: https://arxiv.org/pdf/2302.11813.pdf (accessed on 16 June 2024).
Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards Real-Time Multi-Object Tracking. arXiv 2020, arXiv:1909.12605v2. Available online: https://arxiv.org/pdf/1909.12605.pdf (accessed on 20 October 2023).
Chaabane, M.; Zhang, P.; Beveridge, J.; O’Hara, S. DEFT: Detection Embeddings for Tracking. arXiv 2021, arXiv:2102.02267v2. Available online: https://arxiv.org/pdf/2102.02267.pdf (accessed on 20 October 2022).
Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850v2. Available online: https://arxiv.org/pdf/1904.07850.pdf (accessed on 20 October 2022).
Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
Pang, Z.; Li, Z.; Wang, N. SimpleTrack: Understanding and Rethinking 3D Multi-object Tracking. arXiv 2021, arXiv:2111.09621v1. Available online: https://arxiv.org/pdf/2111.09621.pdf (accessed on 20 October 2023).
Machmudah, A.; Shanmugavel, M.; Parman, S.; Manan, T.S.A.; Dutykh, D.; Beddu, S.; Rajabi, A. Flight Trajectories Optimization of Fixed-Wing UAV by Bank-Turn Mechanism. Drones 2022, 6, 69. [Google Scholar] [CrossRef]
Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. arXiv 2021, arXiv:2101.02702v1. [Google Scholar]
Brillantes, A.K.; Sybingco, E.; Billones, R.K.; Bandala, A.; Fillone, A.; Dadios, E. Observation-Centric with Appearance Metric for Computer Vision-Based Vehicle Counting. J. Adv. Inf. Technol. 2023, 14, 1261–1272. [Google Scholar] [CrossRef]

Figure 1. YOLOv7 network architecture and each component module.

Figure 2. Comparison of detection results of standard NMS and Bot-NMS.

Figure 3. ELAN-G module structure diagram before and after improvement.

Figure 4. Ghost module schematic diagram.

Figure 5. ECA module diagram.

Figure 6. The mAP50 results of YOLOv7 model on various datasets before and after improvement.

Figure 7. ReID model overall framework structure diagram.

Figure 8. Trajectory update process based on ReID features.

Figure 9. Kalman filter flow chart.

Figure 10. The tracking results of the improved ByteTrack tracker.

Figure 11. Improved multi-target tracking effect before (left) and after (right picture).

Table 1. Comparison of results with non-maximum suppression methods.

Non-Maximum Suppression Method	Various Detection Classes						Evaluation Index
Non-Maximum Suppression Method	Car	Bus	Truck	Motor Bike	Bicycle	Person	mAP₅₀	Time
NMS	87.9	91.1	91.2	88.5	82.2	86.3	89.4	15.4
DIoU-NMS	91.1	95.3	95.4	92.2	86.1	90.1	91.1	15.6
Soft-NMS	90.1	94.3	94.3	91.2	84.7	88.9	90.7	15.5
Bot-NMS	92.3	96.1	96.5	93.8	87.4	91.5	91.5	15.7

Table 2. The improved feature map and prior box of YOLOv7.

Feature Map Size	Receptive Field	Prior Frame Size
20 × 20	big	(116,253), (168,111), (376,300)
40 × 40	middle	(37,29), (58,143), (78,58)
80 × 80	lesser	(8,12), (16,35), (30,74)
160 × 160	small	(2, 6), (4, 13), (6, 7)

Table 3. The improved YOLOv7 using various attention modules.

Detection Model	Parameters (M)	GLOPs	mAP50	Time
Improved YOLOv7	0.00	0.00	91.46	15.69
+ECA	+0.04	+0.095	92.56	16.09
+CA	+1.16	+0.90	91.96	18.29
+SA	+1.18	+0.88	93.26	17.69

Table 4. Comparison results of recall rate of YOLOv7 model before and after improvement.

Evaluation Index of Recall Rate	Various Detection Classes
Evaluation Index of Recall Rate	Car	Bus	Truck	Motor Bike	Bicycle	Person
YOLOv7 model	93.1	96.3	96.1	92.1	89.0	92.3
Improved YOLOv7 model	95.3	97.2	97.6	96.9	93.7	95.6

Table 5. Comparison results of accuracy of YOLOv7 model before and after improvement.

Evaluation Index of Detection Accuracy	Various Detection Classes
Evaluation Index of Detection Accuracy	Car	Bus	Truck	Motor Bike	Bicycle	Person
YOLOv7 model	95.8	96.9	97.3	96.2	93.3	95.6
Improved YOLOv7 model	96.0	97.1	97.5	97.5	94.4	96.9

Table 6. Main training parameter settings.

Parameter Name	Parameter Setting
Input image size	640 × 640 × 3
Epochs	300
Batch size	32
Optimizer	SGD
weight_decay	5 × 10⁻⁴
Init_lr	1 × 10⁻²
Momentum	0.937
Lr_scheduler	CosineAnnealingLR
Save_period	10

Table 7. Comparison of different vehicle re-identification methods.

Vehicle Re-Identification Method	Rank-1 (%)	Rank-5 (%)	mAP (%)
RAM	88.9	93.8	61.7
SAN	92.0	97.1	72.3
PNVR	94.1	98.0	74.5
Ours	94.3	98.1	77.6

Table 8. Comparison of different pedestrian re-identification methods.

Pedestrian Rerecognition Method	Rank-1 (%)	mAP (%)
PSE	87.9	69.0
MaskReID	90.0	72.9
DuATM	91.4	76.6
Ours	91.7	77.9

Table 9. MOT16 dataset tracking comparison results.

MOT Method	MOTA↑ (%)	IDF1↑ (%)	ID Sw.↓	FPS↓
Baseline	69.4	74.6	607	43.1
Ours	72.1	77.8	524	34.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Published by MDPI on behalf of the World Electric Vehicle Association. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, X.; Yan, S.; Xia, C. Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory. World Electr. Veh. J. 2025, 16, 325. https://doi.org/10.3390/wevj16060325

AMA Style

Chen X, Yan S, Xia C. Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory. World Electric Vehicle Journal. 2025; 16(6):325. https://doi.org/10.3390/wevj16060325

Chicago/Turabian Style

Chen, Xuewen, Shilong Yan, and Chenxi Xia. 2025. "Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory" World Electric Vehicle Journal 16, no. 6: 325. https://doi.org/10.3390/wevj16060325

APA Style

Chen, X., Yan, S., & Xia, C. (2025). Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory. World Electric Vehicle Journal, 16(6), 325. https://doi.org/10.3390/wevj16060325

Article Menu

Research on Multi-Target Detection and Tracking of Intelligent Vehicles in Complex Traffic Environments Based on Deep Learning Theory

Abstract

1. Introduction

2. Detection Method of Occluded Targets and Small Targets in Complex Road Traffic Environments Based on YOLOv7

2.1. A Bot-NMS Method for Occlusion Object Detection

2.2. Comparison and Analysis of Different NMS Methods

2.3. Small Target Detection Research

2.3.1. ELAN Module Optimization of the Backbone Network

2.3.2. Multi-Scale Learning and Prior Frame Improvement

2.3.3. Introduction of ECA Attention Mechanism

2.3.4. Detection Results of Small Targets in Complex Road Traffic Environment

3. ReID Feature Track Tracking Based on Deep Learning

3.1. Design of Object Re-Identification Model Based on Deep Learning

3.1.1. Global Feature Selection Module

3.1.2. Spatial Local Feature Selection Module

3.2. Trajectory Tracking Method Based on ReID Feature

3.2.1. Trajectory Update Based on ReID Feature

3.2.2. Similarity Measure in Target Association

3.2.3. Trajectory Updating

3.3. Target Rerecognition and Tracking Experiment

3.3.1. Vehicle Re-Identification Verification Based on ReID Model

3.3.2. Pedestrian Re-Identification Verification Based on ReID Model

3.3.3. Comparison Experiment and Result Analysis of Multi-Target Tracking

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI