Article

Research on Pedestrian Multi-Object Tracking Network Based on Multi-Order Semantic Fusion

Cong Liu and Chao Han
1 School of Electrical Engineering (School of Integrated Circuits), Anhui Polytechnic University, Wuhu 241000, China
2 Key Laboratory of Advanced Perception and Intelligent Control of High-end Equipment, Ministry of Education, Anhui Polytechnic University, Wuhu 241000, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2023, 14(10), 272; https://doi.org/10.3390/wevj14100272
Submission received: 1 September 2023 / Revised: 12 September 2023 / Accepted: 12 September 2023 / Published: 1 October 2023

Abstract

To address the loss of tracking accuracy caused by object occlusion in multi-object tracking, this paper proposes a multi-order semantic fusion pedestrian multi-object tracking network. Firstly, a feature pyramid attention module is used in the backbone network to enlarge the receptive field and obtain richer feature information, improving detection accuracy for objects of different scales. Secondly, a size-aware module is integrated into the pedestrian re-identification branch network to fuse semantic features from different resolutions and extract more discriminative pedestrian features, thereby improving tracking accuracy. Finally, the detection head is reconstructed and a small-object detection layer is fused in so that the proposed network adapts to objects of different sizes. Experiments on the MOT16 and MOT17 datasets show that the multi-object tracking accuracy of the proposed network reaches 75.4% (MOT16) and 74.3% (MOT17); the network effectively mitigates the low tracking accuracy caused by occlusion in autonomous driving scenarios and achieves good tracking results. The proposed network improves the tracking accuracy of pedestrians and provides a basis for further practical applications.

1. Introduction

Multiple object tracking (MOT) is one of the key problems studied by scholars worldwide [1]. At present, it is widely used in video surveillance [2], human–computer interaction [3], virtual reality [4], autonomous driving [5], and other areas. Multi-object tracking belongs to the 'perception' part of an autonomous driving system: through multi-object tracking, the system can accurately perceive multiple objects in the surrounding environment and then make corresponding decisions and plans to achieve safe and intelligent driving.
In recent years, the one-shot multi-object tracking method has attracted much attention because of its good tracking performance. Compared with two-stage multi-object tracking methods, which extract identification (ID) embeddings after object detection, the one-shot method combines the object detection part with the ID embedding extraction part so that they share a large number of object features and output the detection results and ID embeddings simultaneously. This significantly reduces the amount of computation and improves tracking speed, but tracking accuracy suffers. To improve tracking accuracy and further reduce tracking time, Mostafa et al. [6] proposed a real-time lightweight multi-object tracker, which used a simplified deep layer aggregation (DLA) network to extract object features for detection and a linear converter to extract features from the detection heat map of each input image to generate effective tracking features. Although this method further improves tracking speed, its detection of small objects in crowded scenes is poor, resulting in unsatisfactory tracking accuracy. Li et al. [7] developed an online multi-object tracking method based on a convolutional neural network and a spatial-channel attention mechanism, in which extended convolution is used to adapt to object deformation, so the network dynamically focuses on key tasks and ignores irrelevant information. The method greatly reduces the number of missed objects during tracking, but it cannot resolve the competition for semantic information between the object detection task and the re-identification task, so multi-object tracking accuracy remains limited. Because the one-shot multi-object tracking method focuses on the object detection task, object re-identification is often treated as a secondary task. Zhang et al. [8] used an anchor-free method in the object detection part and added a pedestrian re-identification (ReID) network [9] branch to distinguish different objects. This solved the unfair object feature extraction and the mismatch of feature dimensions between the detection task and the re-identification task; however, it still misses objects in dense crowds and tracks poorly under occlusion. On the basis of the one-shot learning framework, Yoon et al. [10] integrated an attention mechanism to associate newly detected objects with existing tracking trajectories, without considering the number of objects in the video frame. The proposed data association model can identify false positive objects output by the detector, but it is difficult for it to solve the object occlusion problem.
To sum up, the occlusion problem is a key challenge in multi-object tracking. Existing one-shot multi-object tracking methods fuse features of different resolutions poorly and have weak ID embedding association ability, so they handle objects of different scales and occluded objects poorly during tracking, which limits tracking accuracy. The attention mechanism in computer vision [11] mainly teaches the system to attend to regions of interest and is widely used in object detection [12], image enhancement [13], object tracking [14], image segmentation [15], and other fields. Therefore, in this paper, the coordinate attention (CA) mechanism [16] is integrated into the feature pyramid network (FPN) [17] to improve the detection accuracy of the proposed network. This paper integrates the spatial attention mechanism and channel attention mechanism and adopts a size-aware module in the ReID part to adapt to object features of different resolutions and to deal with occlusion during tracking. Since the YOLOv5s detection head performs object detection on feature maps of different scales using grid-based anchors and offers high detection speed and accuracy, it is suitable for scenarios that require rapid response. Therefore, in the detection part, the YOLOv5s detection head is improved and a small-object detection layer is fused in to obtain better detection results.

2. Proposed Network

The one-shot multi-object tracking network can be divided into three parts: backbone, neck, and head. The backbone extracts the initial features of the tracking object; the neck extracts more complex tracking object features; and the head predicts the type and location of the tracking object and produces the tracking object's ID embedding. The overall network structure is shown in Figure 1. This paper improves the one-shot multi-object tracking network in the backbone and the head and proposes a multi-order semantic fusion pedestrian multi-object tracking network. The coordinate attention mechanism can fuse position information and channel information, so the object can be located and recognized more accurately without heavy computation; therefore, in the backbone, the feature pyramid attention module is used to expand the receptive field and the coordinate attention mechanism is introduced. Because the size-aware module can improve the association ability of ID embedding and prevent the misalignment of the semantic levels of the tracking object features, it is integrated into the ReID head to deal with the object occlusion problem. In practical application scenarios, small objects occupy only a few pixels, which easily leads to missed detections, especially in crowded scenes. To improve small-object detection, in the detection head, the YOLOv5s detection head is improved and a small-object detection layer is added, so that the overall network handles objects of different sizes better.

2.1. Improved Backbone Network

The one-shot multi-object tracking network has a fast tracking speed, but its tracking accuracy is limited because the backbone network extracts insufficient features from the tracked object under occlusion. The CA attention mechanism can attend to both location features and channel features, allowing the backbone network to focus on larger areas while greatly reducing computational cost. To improve the feature extraction ability of the backbone network, this paper combines the CA attention mechanism with the FPN to form the backbone network, so that it extracts more effective tracking object features and improves tracking accuracy.
The backbone network structure is shown in Figure 2, and its working principle is as follows. The input image is passed through the FPN to obtain a feature map F of size C × H × W. Global average pooling is performed along the width and height directions, respectively, to obtain the directional feature maps F1 and F2. F1 and F2 are spliced together, and their dimension is reduced by a 1 × 1 convolution kernel; the reduced feature map is batch-normalized and passed through a nonlinear activation function to obtain the feature map F′. F′ is then convolved with 1 × 1 kernels in the width and height directions, respectively, producing feature maps Fh and Fw with the same number of channels as the original feature map F. Fh and Fw are each passed through a Sigmoid activation function, yielding the attention weight Gh in the height direction and the attention weight Gw in the width direction of the initial input feature map F. Finally, Gh and Gw are multiplied with the initial feature map F to obtain a feature map with attention weights, giving more accurate and effective tracking object features.
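To make the procedure concrete, the following PyTorch sketch implements the coordinate attention block described above, following Hou et al. [16]. It is a minimal sketch: the module name, the reduction ratio, and the choice of Hardswish as the nonlinear activation are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: pool along H and W separately, then reweight F."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # F1: average over width, keep height
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # F2: average over height, keep width
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # 1x1 dimension reduction
        self.bn = nn.BatchNorm2d(mid)                          # batch normalization
        self.act = nn.Hardswish()                              # nonlinear activation -> F'
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # restores channels: Fh
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # restores channels: Fw

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        f1 = self.pool_h(x)                          # (n, c, h, 1)
        f2 = self.pool_w(x).permute(0, 1, 3, 2)      # (n, c, w, 1)
        y = torch.cat([f1, f2], dim=2)               # splice the two directional maps
        y = self.act(self.bn(self.conv1(y)))         # reduce, normalize, activate -> F'
        y_h, y_w = torch.split(y, [h, w], dim=2)     # split back into directions
        g_h = torch.sigmoid(self.conv_h(y_h))                      # Gh: (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # Gw: (n, c, 1, w)
        return x * g_h * g_w                         # reweight the input feature map F

# e.g. out = CoordinateAttention(256)(torch.randn(1, 256, 76, 136))
```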

2.2. Improved ReID Head

In order to improve the association ability of ID embedding and handle the intermittent disappearance of pedestrians during tracking, for example due to occlusion or a brief departure from view, this paper integrates a size-aware module into the ReID head.
The size-aware module includes a spatial attention sub-module and a channel attention sub-module. The spatial attention sub-module can enhance the tracking-object-related features and suppress the background noise, so that the ReID network can better extract the effective features of the tracking object at different resolutions. The channel attention sub-module allocates the weight of each channel, increases the weight of the effective channel, and reduces the weight of the invalid channel, so that the ReID network can pay more attention to the key area where the tracking object is located. The specific implementation of the spatial attention sub-module is shown in Equation (1):
M_s(F) = \sigma\left( f^{7 \times 7}\left( \left[ \mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F) \right] \right) \right)        (1)
where F represents the input feature; Ms(F) represents the spatial attention feature map; AvgPool and MaxPool denote the average pooling and maximum pooling operations, respectively; f7×7 represents a 7 × 7 convolution operation; and σ denotes the Sigmoid activation function.
The specific implementation of the channel attention sub-module is shown in Equation (2):
M_c(F) = \sigma\left( \mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)) \right)        (2)
where F represents the input feature; Mc(F) represents the channel attention feature map; AvgPool and MaxPool denote the average pooling and maximum pooling operations, respectively; MLP stands for multilayer perceptron; and σ denotes the Sigmoid activation function.
The structure of the size-aware module is shown in Figure 3, and its working principle is as follows. Firstly, the tracking object feature maps of different scales are up-sampled to 1/8 resolution and then convolved by a 3 × 3 convolution layer. Each convolved feature map is processed by the spatial attention sub-module (maximum pooling and average pooling, a 7 × 7 convolution, and Sigmoid normalization) to obtain a spatial attention map, which is fused with the feature map initially up-sampled to 1/8 to prevent the loss of object feature information. The spatial attention maps of the different scales are then stitched together to obtain the feature map Fs. The channel attention sub-module performs global average pooling and maximum pooling on Fs, producing the feature maps Fa and Fm; these are passed through a one-dimensional convolutional layer and a fully connected layer to generate F′a and F′m, which are summed to obtain the output feature map F1. F1 is normalized by a Sigmoid layer to generate a feature map Fc with only one channel. Finally, Fc is convolved by a 3 × 3 convolution layer and mapped to 512 channels to provide identity information for the subsequent ReID task.
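The following PyTorch sketch illustrates the size-aware module under stated assumptions: the spatial sub-module follows Eq. (1) with the residual fusion described above, while the channel sub-module follows the shared-MLP form of Eq. (2) (Figure 3 describes a 1D-convolution plus fully connected variant and a single-channel map; here we use the more common per-channel reweighting). Channel widths, the reduction ratio, and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Eq. (1): Ms(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)])), pooled over channels."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)    # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)   # channel-wise maximum pooling
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn + x                          # fuse with the input to avoid losing features

class ChannelAttention(nn.Module):
    """Eq. (2): Mc(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        avg = F.adaptive_avg_pool2d(x, 1).view(n, c)
        mx = F.adaptive_max_pool2d(x, 1).view(n, c)
        attn = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(n, c, 1, 1)
        return x * attn

class SizeAwareModule(nn.Module):
    """Sketch of Figure 3: bring multi-scale maps to the 1/8 resolution, apply
    spatial then channel attention, and map to 512 ReID channels."""
    def __init__(self, in_channels, embed_channels=512):
        super().__init__()
        self.pre = nn.ModuleList(
            nn.Conv2d(c, in_channels[0], kernel_size=3, padding=1) for c in in_channels
        )
        self.spatial = SpatialAttention()
        self.channel = ChannelAttention(in_channels[0] * len(in_channels))
        self.out = nn.Conv2d(in_channels[0] * len(in_channels), embed_channels,
                             kernel_size=3, padding=1)

    def forward(self, feats):
        target = feats[0].shape[-2:]                 # spatial size of the 1/8-scale map
        maps = []
        for conv, f in zip(self.pre, feats):
            f = F.interpolate(f, size=target, mode='nearest')  # up-sample to 1/8
            maps.append(self.spatial(conv(f)))       # 3x3 conv + spatial attention per scale
        fs = torch.cat(maps, dim=1)                  # stitch the scales together: Fs
        return self.out(self.channel(fs))            # channel attention, 512-channel embedding

# e.g. feats = [torch.randn(1, c, s, s) for c, s in [(64, 80), (128, 40), (256, 20)]]
#      emb_map = SizeAwareModule([64, 128, 256])(feats)   # -> (1, 512, 80, 80)
```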
We use the ReID head to obtain the appearance and structural feature information of the objects. The extracted object features are represented as a feature vector, and the similarity or distance between objects is computed to measure how alike they are. The ReID head can thus match the currently detected object features with known object features to determine whether they belong to the same object.
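As an illustration of how such embeddings can be used for association, the sketch below matches detection embeddings to track embeddings by cosine similarity with greedy assignment. The 512-dimensional embedding follows the ReID head above; the similarity threshold and the greedy matching scheme are our assumptions, since the paper does not specify the association details.

```python
import torch
import torch.nn.functional as F

def match_embeddings(track_embs: torch.Tensor, det_embs: torch.Tensor,
                     thresh: float = 0.6):
    """Greedily pair stored track embeddings (T x 512) with new detection
    embeddings (D x 512) by cosine similarity; returns (track, detection) pairs."""
    track_embs = F.normalize(track_embs, dim=1)
    det_embs = F.normalize(det_embs, dim=1)
    sim = track_embs @ det_embs.t()                     # (T, D) cosine similarity matrix
    matches = []
    while sim.numel() and sim.max() > thresh:
        t, d = divmod(int(sim.argmax()), sim.shape[1])  # best remaining pair
        matches.append((t, d))
        sim[t, :] = -1.0                                # remove the matched track
        sim[:, d] = -1.0                                # and the matched detection
    return matches

# e.g. match_embeddings(torch.randn(5, 512), torch.randn(7, 512))
```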

2.3. Improved Detection Head

The detection head of the one-shot multi-object tracking network predicts the type and location of the tracking object. Common detection heads include the YOLO series, CenterNet, etc.; however, these heads extract insufficient feature information from small objects, so small objects may go undetected. To improve the detection ability for small objects, this paper improves the YOLOv5s detection head. Since the YOLOv5s detection head can only detect objects down to a resolution of 8 × 8, small objects below 8 × 8 are easily missed. On the basis of its three detection layers, a small-object detection layer at four-times down-sampling is therefore added to detect small objects at a resolution of 4 × 4. The improved detection head structure is shown in Figure 4. The feature map extracted by the backbone network is down-sampled by a factor of four to obtain the feature map F2. An up-sampling module is added to enlarge the deeper feature map, and F2 is spliced with the up-sampled feature map F1 to obtain the feature map F. The stitched feature map F is further convolved by a lightweight convolution module (the Conv3 module) to obtain a feature map FF for small-object detection. The Conv3 module consists of three 3 × 3 depthwise separable convolutional layers, with a batch normalization layer and a Leaky ReLU activation layer between the convolutional layers. The first convolutional layer has a stride of 2, which halves the size of the feature map; the strides of the second and third convolutional layers are both 1, which captures richer tracking object feature information. Finally, the resulting four detection layers perform the detection task.
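A minimal sketch of the Conv3 module follows, assuming each 3 × 3 depthwise separable layer is a depthwise 3 × 3 convolution followed by a pointwise 1 × 1 convolution, with batch normalization and Leaky ReLU attached after each separable layer; the channel widths and the negative slope are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def dw_separable(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """One 3x3 depthwise separable convolution: depthwise 3x3 then pointwise 1x1,
    followed by batch normalization and Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Conv3(nn.Module):
    """Lightweight Conv3 module: three 3x3 depthwise separable layers; the first
    has stride 2 (halves the feature map), the last two have stride 1."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            dw_separable(in_ch, out_ch, stride=2),   # halves the feature-map size
            dw_separable(out_ch, out_ch, stride=1),  # enriches features at stride 1
            dw_separable(out_ch, out_ch, stride=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# e.g. a hypothetical 1/4-scale 160x160 map for the small-object layer -> 80x80
# y = Conv3(128, 256)(torch.randn(1, 128, 160, 160))
```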

3. Experiments and Analysis

3.1. Experimental Environment

The experimental environment is based on the Windows 10 operating system and an NVIDIA RTX 2060 graphics card. The PyTorch 1.9.0 deep learning framework is used with Python 3.7.

3.2. Datasets and Evaluation Metrics

In this paper, the CUHK-SYSU dataset [18], the CityPersons dataset [19], and the PRW dataset [20] were selected for training. The proposed network was tested on the MOT16 and MOT17 datasets [21], and its performance was analyzed in comparison with other advanced methods.
The MOT16 dataset was proposed in 2016 to measure the performance of multi-object detection and tracking and is used specifically for pedestrian tracking. It consists of 14 challenging video sequences (7 for training, 7 for testing) captured in unconstrained environments with both static and moving cameras. Tracking and evaluation are performed in image coordinates, and all sequences are annotated with high precision following clearly defined protocols.
The MOT17 dataset is built on the MOT16 dataset but provides new, more accurate tracking ground truth. Each sequence comes with three sets of public detections: DPM, Faster R-CNN, and SDP. In this paper, the SDP detections are adopted.
In order to make the evaluation of the model more objective and accurate, the multi-object tracking evaluation metrics used in this paper include: multi-object tracking accuracy (MOTA), multi-object tracking precision (MOTP), identification F1 score (IDF1), ID switches (IDs), mostly tracked targets (MT), and mostly lost targets (ML). MOTA, MOTP, and IDF1 are defined in Equations (3)–(5), respectively:
\mathrm{MOTA} = 1 - \frac{N_{FN} + N_{FP} + N_{IDs}}{N_{GT}}        (3)
where NFN represents the total number of missed detections in the whole video sequence; NFP represents the total number of false detections in the whole video sequence; NIDs represents the total number of pedestrian ID switches; and NGT represents the number of ground-truth bounding boxes.
\mathrm{MOTP} = \frac{\sum_{i,t} d_t^i}{\sum_t c_t}        (4)
where t denotes the index of the current video frame; d_t^i represents the overlap rate between the ith predicted box and its matched ground-truth box in frame t; and c_t represents the number of successfully matched objects in frame t.
\mathrm{IDF1} = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN}        (5)
where IDTP represents the total number of true positive IDs; IDFP represents the total number of false positive IDs; and IDFN represents the total number of false negative IDs.
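For clarity, Equations (3)–(5) translate directly into the following Python helpers; the example counts in the comment are hypothetical and are not taken from the experiments.

```python
from typing import Sequence

def mota(n_fn: int, n_fp: int, n_ids: int, n_gt: int) -> float:
    """Eq. (3): 1 - (missed detections + false detections + ID switches) / GT boxes."""
    return 1.0 - (n_fn + n_fp + n_ids) / n_gt

def motp(overlaps: Sequence[float], n_matches: int) -> float:
    """Eq. (4): sum of overlaps of matched pairs divided by the number of matches."""
    return sum(overlaps) / n_matches

def idf1(id_tp: int, id_fp: int, id_fn: int) -> float:
    """Eq. (5): identification F1 score over identity-level TP/FP/FN counts."""
    return 2 * id_tp / (2 * id_tp + id_fp + id_fn)

# Hypothetical sequence-level counts, for illustration only:
# mota(n_fn=3200, n_fp=1100, n_ids=150, n_gt=20000)  # -> 0.7775
```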

3.3. Experimental Results and Analysis

3.3.1. Ablation Experiment

In order to test the role of the different modules in the proposed network, we conduct experiments with different module combinations: the baseline network without any of the proposed modules (Raw); the baseline with the CA attention mechanism (CA); the baseline with both the CA attention mechanism and the size-aware module (Scale); and the complete network proposed in this paper, which additionally fuses the small-object detection layer (Ours). The comparison results on the MOT16 dataset are shown in Table 1, and those on the MOT17 dataset are shown in Table 2.
Table 1 shows that on the MOT16 dataset, the tracking performance of the complete network is significantly improved compared with the baseline: MT increases by 10.6%, ML decreases by 6.1%, and the number of ID switches drops markedly. Compared with the complete network, the baseline (Raw) is 10.7% lower in tracking accuracy and 15.9% lower in IDF1; the CA variant is 6.9% and 10.3% lower, respectively; and the Scale variant is 4.2% and 4.6% lower, respectively.
Table 2 shows that on the MOT17 dataset, the same trend holds: overall, MOTA increases by 11.5%, IDF1 by 15.8%, and MT by 11.0%, while ML decreases by 7.3% and the number of ID switches drops. Compared with the baseline, adding the CA attention mechanism improves accuracy by 4.3%, and adding both the CA attention mechanism and the size-aware module improves it by 7.5%.
The experimental results show that the CA attention mechanism, the proposed size-aware module, and the fused small-object detection layer each contribute substantially to the tracking performance of the network; removing any of the modules reduces multi-object tracking accuracy. In practical applications, the network can improve multi-object tracking accuracy and reduce the number of ID switches. However, the complexity of real scenes may affect the generalization and applicability of these results, and further research is needed.

3.3.2. Performance Comparison Experiment with Existing Methods

In this paper, experiments are carried out on the MOT16 and MOT17 datasets and compared with other state-of-the-art multi-object trackers. The comparison results are shown in Table 3 and Table 4, and the success rate curves are shown in Figure 5 and Figure 6. It can be seen from the figures that the proposed network maintains a good success rate under occlusion and has clear advantages over the other methods.
Based on the above evaluation metrics, the proposed network has certain advantages over other advanced methods ('-' means that the value is not given in the original paper). On the MOT16 dataset, its multi-object tracking accuracy is higher than that of the FairMOT, LMOT, MOT_GM [22], CRF_RNN [23], and JDE [24] methods by 0.5%, 2.2%, 10.9%, 25.1%, and 11.0%, respectively. Compared with the JDE, MOT_GM, and CRF_RNN methods, IDF1 is higher by 16.4%, 1.3%, and 17.8%, respectively. Compared with FairMOT, ML is reduced by 0.8%, MT is increased by 0.7%, and the number of ID switches is reduced. Compared with the MOT_GM and CRF_RNN methods, the multi-object tracking precision is higher by 4.9% and 6.8%, respectively. On the MOT17 dataset, the multi-object tracking accuracy is higher than that of the JDE, FairMOT, MOT_ES [25], CRF_RNN, and LMOT methods by 11.3%, 0.6%, 1.0%, 21.2%, and 2.3%, respectively. Compared with the JDE and CRF_RNN methods, IDF1 is higher by 10.8% and 16.6%, respectively, and compared with CRF_RNN, MOTP is higher by 5.6%. Compared with the strongest baseline, FairMOT, MT is increased by 0.9%, ML is decreased by 0.7%, and MOTP is increased by 0.4%. Because the CA attention mechanism and the small-object detection layer add computation, the tracking speed is slightly lower, but it still meets real-time requirements.
In summary, the proposed network is verified to have good performance and certain advantages in actual tracking scenarios. The multi-object tracking results on the public MOT16 and MOT17 datasets are shown in Figure 7 and Figure 8.

4. Conclusions

This paper proposes a multi-order semantic fusion pedestrian multi-object tracking network with three improvements. First, the coordinate attention mechanism is used in the backbone network to extract sufficient tracking object features, which effectively improves detection accuracy under occlusion. Second, the size-aware module fuses object features of different resolutions and improves the association ability of ID embedding, which effectively improves tracking accuracy under occlusion. Third, a small-object detection layer is added to enhance the detection of small objects. Experiments show that the proposed network alleviates the object occlusion problem in pedestrian multi-object tracking to a certain extent. Compared with other advanced methods, it effectively improves multi-object tracking accuracy, significantly reduces the number of ID switches, and its tracking speed meets real-time requirements.

Author Contributions

C.H. surveyed the literature and revised the paper; C.L. constructed the overall framework and wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Open Research Fund of Anhui Key Laboratory of Detection Technology and Energy Saving Devices, Anhui Polytechnic University (Grant No. DTESD2020A06); National Natural Science Foundation of China (Grant No. 62203012); 2021 Graduate Science Research Project of Department of Education of Anhui Province (Grant No. YJS20210447); and Wuhu City Science and Technology Plan Project (Grant No. 2021cg21).

Data Availability Statement

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The contact author has declared that neither they nor their co-authors have any competing interests.

References

  1. Luo, W.; Xing, J.; Milan, A. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 58–80.
  2. Mohanapriya, D.; Mahesh, K. Multi object tracking using gradient-based learning model in video-surveillance. China Commun. 2021, 18, 169–180.
  3. Candamo, J.; Shreve, M.; Kasturi, R. Understanding transit scenes: A survey on human behavior recognition algorithms. IEEE Trans. Intell. Transp. Syst. 2010, 11, 206–224.
  4. Ikbal, M.S.; Ramadoss, V. Dynamic Pose Tracking Performance Evaluation of HTC Vive Virtual Reality System. IEEE Access 2021, 9, 3798–3815.
  5. Ravindran, R.; Santora, M.J.; Jamali, M.M. Multi-Object Detection and Tracking, Based on DNN, for Autonomous Vehicles: A Review. IEEE Sens. J. 2021, 21, 5668–5677.
  6. Mostafa, R.; Baraka, H.; Bayoumi, A. LMOT: Efficient Light-Weight Detection and Tracking in Crowds. IEEE Access 2022, 10, 83085–83095.
  7. Li, G.; Chen, X.; Li, M.J. One-shot multi-object tracking using CNN-based networks with spatial-channel attention mechanism. Opt. Laser Technol. 2022, 153, 108267.
  8. Zhang, Z.; Wang, C.; Wang, X. FairMOT: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087.
  9. Zhang, J.; Wang, N.; Zhang, L. Multi-Shot Pedestrian Re-Identification via Sequential Decision Making. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6781–6789.
  10. Yoon, K.; Gwak, J. OneShotDA: Online Multi-Object Tracker With One-Shot-Learning-Based Data Association. IEEE Access 2020, 8, 38060–38072.
  11. Guo, M.; Xu, T.; Liu, J. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
  12. Aziz, L.; Sheikh, U.U.; Ayub, S. Exploring Deep Learning-Based Architecture, Strategies, Applications and Current Trends in Generic Object Detection: A Comprehensive Review. IEEE Access 2020, 8, 170461–170495.
  13. Singh, K.; Seth, A.; Sandhu, H.S. A Comprehensive Review of Convolutional Neural Network based Image Enhancement Techniques. In Proceedings of the 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, 29–30 March 2019; pp. 1–6.
  14. Ondrasovic, M.; Tarabek, P. Siamese Visual Object Tracking: A Survey. IEEE Access 2021, 9, 110149–110172.
  15. Minaee, S.; Boykov, Y.; Porikli, F. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542.
  16. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
  17. Lin, T.; Dollar, P.; Girshick, R. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  18. Xiao, T.; Wang, B. Joint detection and identification feature learning for person search. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385.
  19. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A Diverse Dataset for Pedestrian Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4457–4465.
  20. Zheng, L.; Zhang, H.; Sun, S. Person Re-Identification in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3346–3355.
  21. Milan, A.; Leal-Taixé, L.; Reid, I. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
  22. Yoo, Y.S.; Lee, S.H.; Bae, S.H. Effective Multi-Object Tracking via Global Object Models and Object Constraint Learning. Sensors 2022, 22, 7943.
  23. Xiang, J.; Xu, G.; Ma, C.; Hou, J. End-to-End Learning Deep CRF Models for Multi-Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 275–288.
  24. Wang, Z.; Zheng, L.; Liu, Y. Towards Real-Time Multi-Object Tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 107–122.
  25. Xiang, X.; Ren, W.; Qiu, Y. Multi-object Tracking Method Based on Efficient Channel Attention and Switchable Atrous Convolution. Neural Process. Lett. 2021, 53, 2747–2763.
Figure 1. Overall network structure diagram.
Figure 2. Backbone network structure diagram.
Figure 3. Size-aware module structure diagram.
Figure 4. The improved detection head structure diagram.
Figure 5. Success rate graph on the MOT16 dataset.
Figure 6. Success rate graph on the MOT17 dataset.
Figure 7. Multi-object tracking effect on the MOT16 dataset.
Figure 8. Multi-object tracking effect on the MOT17 dataset.
Table 1. Comparative experiments of each module on MOT16 dataset.
Method | MOTA (%) | IDF1 (%) | IDs  | MT (%) | ML (%)
Raw    | 64.7     | 56.3     | 1855 | 34.8   | 21.2
CA     | 68.5     | 61.9     | 1229 | 38.3   | 19.6
Scale  | 71.2     | 67.6     | 1064 | 41.9   | 17.4
Ours   | 75.4     | 72.2     | 986  | 45.4   | 15.1
Table 2. Comparative experiments of each module on MOT17 dataset.
Method | MOTA (%) | IDF1 (%) | IDs  | MT (%) | ML (%)
Raw    | 62.8     | 54.5     | 5226 | 33.1   | 23.9
CA     | 67.1     | 60.2     | 4643 | 37.6   | 20.1
Scale  | 70.3     | 65.7     | 3976 | 40.4   | 18.5
Ours   | 74.3     | 70.3     | 3575 | 44.1   | 16.6
Table 3. Comparison with other advanced methods on the MOT16 dataset.
Method  | MOTA (%) | MOTP (%) | IDF1 (%) | IDs  | MT (%) | ML (%) | FPS
JDE     | 64.4     | -        | 55.8     | 1544 | 35.4   | 20.0   | 18.8
FairMOT | 74.9     | -        | 72.8     | 1074 | 44.7   | 15.9   | 25.9
MOT_GM  | 64.5     | 76.7     | 70.9     | 816  | 36.4   | 20.7   | 6.5
CRF_RNN | 50.3     | 74.8     | 54.4     | 702  | 18.3   | 35.7   | 1.5
LMOT    | 73.2     | -        | 72.3     | 669  | 44.0   | 17.0   | 29.6
Ours    | 75.4     | 81.6     | 72.2     | 986  | 45.4   | 15.1   | 25.8
Table 4. Comparison with other advanced methods on the MOT17 dataset.
Method  | MOTA (%) | MOTP (%) | IDF1 (%) | IDs  | MT (%) | ML (%) | FPS
JDE     | 63.0     | -        | 59.5     | 6171 | 35.7   | 17.3   | 18.8
FairMOT | 73.7     | 81.3     | 72.3     | 3303 | 43.2   | 17.3   | 25.9
MOT_ES  | 73.3     | -        | 71.8     | 3372 | 41.1   | 17.2   | 19.0
CRF_RNN | 53.1     | 76.1     | 53.7     | 2518 | 24.2   | 30.7   | 1.4
LMOT    | 72.0     | -        | 70.3     | 3071 | 45.4   | 17.3   | 29.6
Ours    | 74.3     | 81.7     | 70.3     | 3575 | 44.1   | 16.6   | 25.8