Article

Online Multiple Object Tracking Using a Novel Discriminative Module for Autonomous Driving

Jia Chen, Fan Wang, Chunjiang Li, Yingjie Zhang, Yibo Ai and Weidong Zhang
1 National Center for Materials Service Safety, University of Science and Technology Beijing, Beijing 100083, China
2 College of Nuclear Technology and Automation Engineering, Chengdu University of Technology, Chengdu 610000, China
* Authors to whom correspondence should be addressed.
Electronics 2021, 10(20), 2479; https://doi.org/10.3390/electronics10202479
Submission received: 28 July 2021 / Revised: 24 September 2021 / Accepted: 5 October 2021 / Published: 12 October 2021
(This article belongs to the Special Issue Deep Learning Methods and Applications for Unmanned Aerial Vehicles)

Abstract

Multi-object tracking (MOT) is a key technology in the environment perception system of autonomous driving and is critical to driving safety. Online MOT must accurately extend the trajectories of multiple objects without using information from future frames, and therefore faces greater challenges. Most existing online MOT methods rely on anchor-based detectors, which suffer from frequent false and missed detections and extend the trajectories of adjacent objects poorly when they occlude or overlap each other. In this paper, we propose a discriminative-learning online tracker, built on an anchor-free detector, that effectively handles occlusion. The method exploits the distinct weight characteristics of objects under occlusion and extends competing trajectories through a discrimination module, preventing ID switches. In the experiments, we compare our algorithm with other trackers on two public benchmark datasets, MOT16 and MOT17, showing that it achieves state-of-the-art performance, and we conduct a qualitative analysis on the autonomous driving dataset KITTI.

1. Introduction

The multi-object tracking (MOT) system accurately tracks the obstacles moving in front of or around an autonomous vehicle, including vehicle paths, non-motor-vehicle trajectories, pedestrian trajectories, etc. This subsystem helps self-driving cars make decisions and avoid collisions with objects that may move (for example, other vehicles and pedestrians) [1,2,3]. In these scenarios, the main task of the multi-object tracking algorithm is to track many objects simultaneously, assign and maintain a corresponding ID for each object, and record its trajectory, which cannot be achieved by an object detection algorithm or a single-object tracking algorithm alone.
The object tracking task is very important for driving safety: it predicts the trajectories of moving objects so that the control layer can make decisions such as collision warning and lane changing in advance. By the number of objects, tracking can be divided into single-object tracking (SOT) [4,5] and multi-object tracking (MOT). In real traffic scenes, MOT is more common, and the matching relationship between consecutive frames of multiple moving objects must be taken into account. An example output is shown in Figure 1. As an important branch of computer vision, MOT is also widely used in intelligent surveillance systems [6], medical image processing [7] and human-computer interaction [8].
By processing mode, MOT can be divided into offline and online tracking. Offline tracking usually uses global information to track objects, so its accuracy is relatively high; however, its high computational cost and large storage requirements make it unsuitable for autonomous driving, which demands portability and real-time performance. Online tracking, due to its real-time requirements, can only use the information of the current and historical frames, which poses more challenges for researchers. Because of the complexity of the problem, we must consider not only changes in viewpoint and illumination, but also the appearance of new objects, the disappearance of old objects, and the re-identification of lost objects. A robust tracking algorithm therefore remains a huge challenge.
Recently, deep learning based on neural networks has made great progress; representative detection algorithms include Fast R-CNN [9], SSD [10] and the YOLO family [11,12,13]. With the advancement of object detection, the tracking-by-detection paradigm has taken the lead: objects are detected in each frame and then matched to existing trajectories. A new object in the current frame starts a new trajectory, and an object leaving the field of view terminates its trajectory. However, detectors based on Faster R-CNN [14] or SDP [15] are anchor-based, and are prone to center-point offset and low accuracy of the regression box. Therefore, in this paper, we use an anchor-free detector.
In this work, to meet the requirements of real-time online tracking for autonomous vehicles, we take inspiration from the FairMOT [16] pipeline and propose an online multi-object real-time tracker based on feature extraction from ROI regions. The algorithm designs a versatile discrimination model, motivated by the different ways overlapping or adjacent objects and background affect driving in autonomous scenes. The model determines the type of trajectory by calibrating the ROI of the object detected in the previous frame, then uses the discrimination model to handle appearance changes caused by occlusion or interaction between objects, and finally obtains the global feature trajectory of the object during movement. At the same time, to meet real-time requirements, information from multiple frames is used to smooth the trajectories of objects. The main contributions of this work are as follows:
i. An online multi-object tracking algorithm suited to the environment perception process of autonomous driving is proposed.
ii. A discriminative learning model is proposed for the occlusion and overlap of different or adjacent objects while an object is moving.
iii. Our proposed MOT tracker achieves competitive performance on the MOT16 and MOT17 benchmarks and the KITTI dataset.

2. Our Proposed Tracker

In this section, we first introduce the FairMOT pipeline and the novel detection strategy, then present the proposed online MOT tracking algorithm, and finally describe in detail our optimized trajectory extension strategy for different tracked objects during tracking.

2.1. Baseline FairMOT

2.1.1. Problem Formulation

Since multi-object tracking predicts the position states of multiple objects in subsequent frames, MOT can be described as a multi-variable optimization problem. Given an image sequence, let $A_t^i$ and $X_t^i$ denote the state and the observation of the $i$-th object in frame $t$, respectively, and let $A_t = \{A_t^1, A_t^2, \ldots, A_t^{M_t}\}$ be the states of all $M_t$ objects in frame $t$. $A_{i_s:i_e}^i = \{A_{i_s}^i, \ldots, A_{i_e}^i\}$ is the trajectory of the $i$-th object, where $i_s$ and $i_e$ denote the first and last frames in which object $i$ appears, and $A_{1:t} = \{A_1, A_2, \ldots, A_t\}$ is the state sequence of all objects from the first frame to frame $t$. Similarly, $X_t = \{X_t^1, X_t^2, \ldots, X_t^{M_t}\}$ denotes the observations of all $M_t$ objects in frame $t$, and $X_{1:t} = \{X_1, X_2, \ldots, X_t\}$ the observations of all objects from the first frame to frame $t$.
The purpose of MOT is to find the optimal trajectories of all objects. Given all observations, this optimization problem can be modeled by the maximum a posteriori (MAP) probability:

$$\hat{A}_{1:t} = \arg\max_{A_{1:t}} P(A_{1:t} \mid X_{1:t})$$

The prediction and update steps are obtained by the following formulas:

$$\text{Predict:}\quad P(A_t \mid X_{1:t-1}) = \int P(A_t \mid A_{t-1}) \, P(A_{t-1} \mid X_{1:t-1}) \, dA_{t-1}$$

$$\text{Update:}\quad P(A_t \mid X_{1:t}) \propto P(X_t \mid A_t) \, P(A_t \mid X_{1:t-1})$$
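To make the recursion concrete, the following minimal Python sketch evaluates the predict and update steps on a discrete state space, where the integral reduces to a sum. The five-state grid, the transition matrix and the likelihood values are illustrative placeholders only, not the motion or observation models used in this paper.

import numpy as np

def predict(belief, transition):
    # P(A_t | X_{1:t-1}) = sum over A_{t-1} of P(A_t | A_{t-1}) P(A_{t-1} | X_{1:t-1});
    # transition[i, j] = P(state j at t | state i at t-1)
    return transition.T @ belief

def update(prior, likelihood):
    # P(A_t | X_{1:t}) is proportional to P(X_t | A_t) P(A_t | X_{1:t-1})
    posterior = likelihood * prior
    return posterior / posterior.sum()

belief = np.full(5, 0.2)                       # uniform belief over 5 states
transition = np.eye(5) * 0.8 + 0.2 / 5         # mostly-stay transition model
likelihood = np.array([0.1, 0.7, 0.1, 0.05, 0.05])
belief = update(predict(belief, transition), likelihood)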

2.1.2. FairMOT Pipeline

For detection-based multi-object tracking, the performance of the detector directly affects the tracking result. Traditional MOT algorithms mostly use anchor-based detection; however, anchor-based detectors not only carry a large number of hyperparameters but also suffer from limited detection accuracy. FairMOT adopts anchor-free detection, which effectively improves detection accuracy. The highlight of FairMOT is that it combines an anchor-free detection algorithm with Re-ID features for end-to-end tracking. The tracking process is shown in Figure 2.
Object detection is treated as a center-based bounding-box regression task on a high-resolution feature map. Three parallel regression heads are appended to the backbone to predict the heatmap, the object center offset and the box size. The loss functions of the three branches are:

$$L_{\text{heatmap}} = -\frac{1}{N} \sum_{xy} \begin{cases} \left(1 - \hat{M}_{xy}\right)^{\alpha} \log \hat{M}_{xy}, & \text{if } M_{xy} = 1 \\ \left(1 - M_{xy}\right)^{\beta} \hat{M}_{xy}^{\alpha} \log\left(1 - \hat{M}_{xy}\right), & \text{otherwise} \end{cases}$$

where $M_{xy}$ denotes the ground-truth heatmap response at location $(x, y)$ and $\hat{M}_{xy}$ the estimated response,

$$L_{\text{box}} = \sum_{i=1}^{N} \left\lVert o^i - \hat{o}^i \right\rVert_1 + \left\lVert s^i - \hat{s}^i \right\rVert_1$$

$$L_{\text{identity}} = -\sum_{i=1}^{N} \sum_{k=1}^{K} L^i(k) \log P(k)$$

Here, $\hat{S} \in \mathbb{R}^{W \times H \times 2}$ and $\hat{O} \in \mathbb{R}^{W \times H \times 2}$ are the output size and offset maps, respectively. $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ is the $i$-th ground-truth (GT) box in the image, whose size is $s^i = (x_2^i - x_1^i, \; y_2^i - y_1^i)$. In the same way, the GT offset is $o^i = \left(\frac{c_x^i}{4} - \left\lfloor \frac{c_x^i}{4} \right\rfloor, \; \frac{c_y^i}{4} - \left\lfloor \frac{c_y^i}{4} \right\rfloor\right)$. $\hat{s}^i$ and $\hat{o}^i$ are the estimated size and offset at the corresponding positions, and $L_{\text{box}}$ is their $L_1$ loss. $P(k)$ is the class distribution vector obtained by mapping the identity feature vector at the center of the GT box, and $L^i(k)$ is the one-hot encoding of the GT label. Object identification is embedded as a classification task: all object instances with the same identity in the training set are treated as one class. For each labeled box in an image, the object center $(c_x^i, c_y^i)$ is located on the heatmap, an identity feature vector $E_{x^i, y^i}$ is extracted there, and the network learns to map it to a class distribution vector $P(k)$ encoding the label $L^i(k)$.
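For readers who want to connect the three losses to code, the following PyTorch sketch implements them under stated assumptions: the focal-loss exponents alpha = 2 and beta = 4 are CenterNet-style defaults, not values confirmed in this section, and the tensor layouts are illustrative.

import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    # pred, gt: heatmaps of identical shape; gt equals 1 exactly at object centers
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = (1 - pos) * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred)
    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss + neg_loss).sum() / num_pos

def box_loss(pred_size, gt_size, pred_offset, gt_offset):
    # L1 loss on size and center offset, gathered at ground-truth center locations
    return (F.l1_loss(pred_offset, gt_offset, reduction='sum')
            + F.l1_loss(pred_size, gt_size, reduction='sum'))

def identity_loss(id_logits, id_labels):
    # Re-ID treated as classification over all K identities in the training set
    return F.cross_entropy(id_logits, id_labels)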

2.2. Discrimination Learning Model

For multi-object tracking, occlusion has always been a difficult problem to overcome, although many scholars have tried to deal with it. For example, the method in [17] treats occlusion as a trajectory association problem, analogous to the data association of detections: tracklets are fed into an optical flow network for model optimization, so that failed detections can be ignored and tracking continued. However, this approach does not achieve a good anti-occlusion effect because it ignores the importance of the samples themselves. In this article, to meet the real-time requirements of autonomous vehicles and to handle the frequent occlusions encountered while driving, we introduce a discrimination model to solve the occlusion of moving objects.
For two known competing trajectories, as shown in Figure 3, suppose there are the previous M historical trajectories and a feature map $Z_1$. To reduce the influence of ambient noise, each channel is denoised with spatial Gaussian weights. Through a 1 × 1 convolution and global max pooling, we obtain the abstract invariant features $S \in \mathbb{R}^{N \times C}$. Multiplying $S$ by its transpose and applying a softmax yields the correlation matrix $X \in \mathbb{R}^{N \times N}$, computed as:

$$X_{ij} = \frac{\exp\left(X_i \cdot X_j^T\right)}{\sum_{k=1}^{N} \exp\left(X_i \cdot X_k^T\right)}$$

Here, $X_{ij}$ represents the spatial correlation between the $j$-th and the $i$-th trajectory, and the spatial correlation map $X \in \mathbb{R}^{N \times N}$ is the matrix composed of the $X_{ij}$.

Next, the correlation map $X$ is reshaped and fed into two fully connected layers and a softmax layer, which produce the attention score $y \in \mathbb{R}^{N}$ of each position.

Finally, the output is obtained by:

$$O = \sum_{i=1}^{N} y_i Z_1^i$$
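The following PyTorch module is a sketch of our reading of Figure 3: a 1 × 1 convolution and global max pooling produce the invariant features S, a row-wise softmax over S S^T gives the correlation map X, two fully connected layers with a softmax give the attention scores y, and the output is the score-weighted sum of the trajectory features Z_1. The channel and hidden widths are illustrative, and the spatial Gaussian denoising is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminationModule(nn.Module):
    def __init__(self, in_channels, channels, n_positions, hidden=64):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, kernel_size=1)  # 1x1 conv
        self.fc = nn.Sequential(
            nn.Linear(n_positions * n_positions, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_positions),
        )

    def forward(self, z1):
        # z1: (N, C_in, H, W), one feature map per trajectory/candidate;
        # N must equal n_positions
        n = z1.size(0)
        s = self.reduce(z1)
        s = F.adaptive_max_pool2d(s, 1).flatten(1)      # global max pool -> (N, C)
        corr = F.softmax(s @ s.t(), dim=-1)             # correlation map X, (N, N)
        y = F.softmax(self.fc(corr.flatten().unsqueeze(0)), dim=-1).squeeze(0)
        return (y.view(n, 1, 1, 1) * z1).sum(dim=0)     # attention-weighted output O

# Example: four competing feature maps with 64 channels
dm = DiscriminationModule(in_channels=64, channels=32, n_positions=4)
out = dm(torch.randn(4, 64, 16, 16))                    # -> (64, 16, 16)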

2.3. Trajectory Extension Strategy

In the tracking phase, trajectory extension is one of the most challenging tasks in MOT. To overcome the problems it raises, we propose a position discrimination model that effectively separates an object from the background and from the adjacent or overlapping regions around it. Since the trajectories during tracking can be divided into isolated trajectories and competitive trajectories, we design different tracking strategies for each, while retaining the classic two-stage tracking scheme.
First, for each currently active trajectory, we extract its region of interest as a candidate region and refine its bounding box with instance segmentation. If the trajectory is isolated, it is stored as a new trajectory whenever its confidence is greater than the threshold $\sigma_t$ (Equation (7)).
$$Z_{T_n} = \begin{cases} \sum_{i}^{n} Z_i \, t_p \cdot 2\exp\left(-\vartheta t_p\right), & \text{if } t_p > 0 \\ 1, & \text{else} \end{cases}$$
Here, $t_p$ represents the continuous tracking time in the first stage and $Z_i$ refers to the refinement confidence at the $i$-th growth step. The balance parameter $\vartheta \leq \log 2 / T_{\max}$ is set according to the maximum number $T_{\max}$ of consecutive failed matches. In this experiment, all $\vartheta$ values are set to 0.1.
Secondly, for trajectories in a competitive relationship (a detection example is shown in Figure 4), the overlap between the ROI candidate region and the bounding box refined by instance segmentation is taken as the candidate object. The discrimination model computes the similarity between the competing trajectories and the candidate regions, and the deep Hungarian algorithm associates the resulting similarity matrix to extend the trajectories correctly.
The final stage allocates the trajectories of untracked objects: the IoU between the remaining detections and tracks is compared against the threshold $\tau_{iou}$, and the remaining detection results are assigned accordingly. After data association, each still-untracked trajectory is considered lost in the current frame, and a new trajectory is initialized for each unmatched detection with high response confidence. To reduce the influence of false detections, any new trajectory that is lost within its first $\tau_i$ frames is deleted. If a trajectory stays lost for more than $\tau_t$ consecutive frames or leaves the field of view, it is terminated.
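A small sketch of this two-stage association follows, under our reading of Equation (7) (including the sign of the exponential decay, which the garbled source does not state unambiguously). The similarity matrix would come from the discrimination module; here scipy's linear_sum_assignment stands in for the deep Hungarian algorithm, so this illustrates the assignment step rather than the exact method.

import numpy as np
from scipy.optimize import linear_sum_assignment

THETA = 0.1  # balance parameter, set to 0.1 in our experiments

def refinement_confidence(z, t_p, theta=THETA):
    # Eq. (7): z holds the refinement confidences over the growth steps,
    # t_p is the continuous tracking time of the first stage
    if t_p > 0:
        return np.sum(z) * t_p * 2.0 * np.exp(-theta * t_p)
    return 1.0

def associate_competing(similarity):
    # assign candidates to competing trajectories by maximizing total similarity
    rows, cols = linear_sum_assignment(-similarity)
    return list(zip(rows.tolist(), cols.tolist()))

sim = np.array([[0.9, 0.2], [0.3, 0.8]])   # 2 trajectories x 2 candidates
print(associate_competing(sim))            # [(0, 0), (1, 1)]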

2.4. Proposed Online MOT Tracking Network

Detection-based multi-object tracking can be divided into online tracking and offline tracking. Online multi-object tracking is a frame-by-frame progressive method, similar to how human eyes track in real time. First, each moving object is identified and confirmed (object detection); then its next action is predicted (trajectory prediction); finally, the object's motion direction (motion model), appearance (appearance model) and other features are associated with previous trajectories (data association).
In this section, we introduce the main tracking process of our algorithm. Because anchor-based detectors carry many hyperparameters and produce features that are not well aligned with object centers, we employ an anchor-free detector in the detection stage. As shown in Figure 5, after the $t$-th frame passes through the backbone network, the region of interest is extracted and corrected with the detection result of frame $t-1$ to obtain the trajectories of objects in the current frame. If a trajectory is isolated, it is stored and extended directly, and tracking succeeds. If a trajectory is in a competitive relationship, that is, occlusion occurs, it is fed into the discrimination learning model, which resolves the occlusion through position correlation, stores and extends the trajectory, and tracking succeeds. If the object is new, its trajectory is initialized. If the trajectory of an object does not appear for several consecutive frames, tracking stops.
To better balance speed and accuracy, we use a backbone with strong feature extraction capability based on ResNet-34, as in the FairMOT detection method. As shown in Figure 6, to better fuse the semantic and location information of different layers, we adopt Deep Layer Aggregation [18] to extract image features, and to dynamically adjust the receptive field when the scale and posture of objects change, we use deformable convolution [19] for up-sampling. The input image size is $H_{\text{image}} \times W_{\text{image}}$, and the output feature map has shape $C \times H \times W$, where $H = H_{\text{image}}/4$ and $W = W_{\text{image}}/4$. The proposed tracking flow is summarized in Algorithm 1.
Algorithm 1: The proposed method
Input: the pre-trained network model, the first frame, the initial object location bounding box $b_1$
Output: the object locations $b_2, b_3, \ldots, b_n$ of the subsequent frames
1. Input the initial frame and the initial bounding box
2. for i = 2 : n do
3.   Extract the ROI feature
4.   Calculate the correlation matrix using Equation (6)
5.   Calculate the maximum response using Equations (4) and (5)
6.   Calculate the bounding box
7. end for
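As a runnable illustration of the overall loop, the sketch below performs one frame of track extension with Hungarian matching. For brevity, plain IoU replaces the ROI features and the discrimination-module similarity, so this is a simplified stand-in for Algorithm 1, not our full method; the thresholds are illustrative.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    # boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def track_step(tracks, detections, next_id, iou_thresh=0.5, max_lost=30):
    # one frame: associate, extend matches, age unmatched tracks, start new ones
    matched_t, matched_d = set(), set()
    if tracks and detections:
        cost = np.array([[-iou(t['box'], d) for d in detections] for t in tracks])
        for r, c in zip(*linear_sum_assignment(cost)):
            if -cost[r, c] >= iou_thresh:              # accept confident matches only
                tracks[r]['box'], tracks[r]['lost'] = detections[c], 0
                matched_t.add(r); matched_d.add(c)
    for i, t in enumerate(tracks):
        if i not in matched_t:
            t['lost'] += 1                             # considered lost this frame
    tracks = [t for t in tracks if t['lost'] <= max_lost]
    for j, d in enumerate(detections):
        if j not in matched_d:                         # initialize a new trajectory
            tracks.append({'id': next_id, 'box': d, 'lost': 0})
            next_id += 1
    return tracks, next_id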

3. Experiments and Evaluation

In this section, we present the implementation details of our proposed algorithm and compare it with representative trackers on the MOT16 [20] and MOT17 [20] public benchmarks of the MOT Challenge, as well as on the autonomous driving dataset KITTI [21,22].

3.1. Experiment Implementation Details

Our algorithm is implemented in PyTorch on an Ubuntu 16.04 desktop computer with an Intel i7-9700K CPU, 16 GB RAM and two Nvidia GTX 1080Ti GPUs. We use DLA-34 with multi-layer feature fusion, pre-trained on the COCO dataset [23], as the backbone network. The Adam optimizer is used for 30 epochs of training on ETH [24], CityPersons [25] and CrowdHuman [26]. All training images are resized to 1088 × 608, giving a feature map resolution of 272 × 152.

3.2. Results on MOT16

MOT16 mainly annotates moving pedestrians and vehicles. It builds on MOT15 [27] with more detailed annotations and more bounding boxes, and offers richer imagery, different shooting angles and camera motions, as well as videos under different weather conditions. It was annotated by a group of qualified researchers in strict compliance with the corresponding annotation guidelines, and a double-check procedure ensures the high accuracy of the annotations. The trajectories annotated in MOT16 are 2D. The dataset contains 14 video sequences, of which 7 are training sequences with annotations and the other 7 are test sequences.
The public detector used with the MOT16 dataset is DPM [28], which performs well on the pedestrian category. The main information of these videos includes FPS, resolution, duration, number of tracks, number of objects, density, static or moving camera, low, medium or high shooting angle, and the weather conditions during shooting.
Table 1 compares our method with state-of-the-art algorithms on the MOT16 public benchmark. The results show that, whether compared with offline or online trackers, our algorithm obtains the best results on several important indicators, including MOTA, MOTP and IDF1.
In Table 1, FP denotes false positives during tracking; lower is better. Our algorithm produces 3095 false positives, which ranks in the middle of the field. FN denotes false negatives and ML the mostly-lost targets; smaller values of both are better, and our results perform well among the eight algorithms compared. MT denotes mostly-tracked targets, and IDF1 is the F1 score of pedestrian identity assignment across frames; larger values of both are better. MOTA and MOTP are the other two most important indicators of tracking accuracy and position error in multi-object tracking, expressed by Formulas (9) and (10) as:
$$\text{MOTA} = 1 - \frac{\sum_t \left( \text{FN}_t + \text{FP}_t + \text{IDSW}_t \right)}{\sum_t \text{GT}_t}$$

$$\text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$

where $t$ indexes the frames, $\text{GT}_t$ is the number of ground-truth objects in frame $t$, $c_t$ denotes the number of matches in frame $t$, and $d_{t,i}$ is the bounding-box overlap of target $i$ with its assigned ground-truth object.
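The two formulas translate directly into code; the following helper computes both metrics from per-frame counts, with illustrative numbers.

def mota(fn, fp, idsw, gt):
    # Formula (9): fn, fp, idsw, gt are per-frame counts
    return 1.0 - sum(f + p + s for f, p, s in zip(fn, fp, idsw)) / sum(gt)

def motp(overlaps, matches):
    # Formula (10): overlaps holds the per-frame lists of d_{t,i}, matches the counts c_t
    return sum(sum(frame) for frame in overlaps) / sum(matches)

print(mota(fn=[3, 1], fp=[2, 0], idsw=[0, 1], gt=[20, 18]))   # 0.8157...
print(motp(overlaps=[[0.9, 0.8], [0.7]], matches=[2, 1]))     # 0.8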
As shown in Table 1, our algorithm ranks first on all three of the most important multi-object tracking indicators: MOTA, MOTP and IDF1.

3.3. Results on MOT17

3.3.1. Quantitative Analysis

MOT17 is likewise built on MOT15 with more detailed annotations and more bounding boxes, mainly of pedestrians and vehicles. It offers richer imagery, different shooting angles and camera motions, and videos under different weather conditions, annotated by qualified researchers in strict compliance with the corresponding guidelines and double-checked for accuracy. The annotated motion trajectories are 2D. Compared with MOT15, the pedestrian density is higher and the dataset is more difficult; we therefore use MOT17 as a further validation set for our algorithm.
As shown in Table 2, our algorithm has significant advantages whether compared with online or offline trackers. Because an offline tracker can use global information, its overall performance is usually better than that of an online tracker; however, with the wide application of deep learning in detection and its obvious advantages, the gap between the two keeps shrinking, and online trackers will even surpass some offline ones.
In Table 2, on the two most important indicators of multi-object tracking, MOTA and MOTP, our algorithm exceeds even the offline algorithms and ranks first.
To show the performance of our tracker more intuitively, Table 3 further compares the three public detectors on the test set. Overall, SDP [15] performs best among the three: DPM is a traditional sliding-window method, while FRCNN and SDP are detectors based on convolutional neural networks.
Table 3 reports the per-sequence results of the different detectors on the MOT17 videos; our algorithm achieves good results throughout.

3.3.2. Qualitative Analysis

To show the performance of our algorithm more intuitively, we conducted a qualitative analysis, shown in Figure 7. In the first sequence of the MOT17 test set, a lady in a black skirt at a street corner is still tracked accurately with the same ID after crossing and overlapping with a nearby pedestrian. Sequence 3 is a crowded night scene, where our tracker still shows good performance. Sequence 6 is shot with a moving camera in a busy commercial block, and tracking remains good despite large deformations and occlusions. Besides occlusion and illumination changes, tracking small objects is an extremely challenging MOT task; since our algorithm uses a feature pyramid network with multi-feature fusion in the feature extraction stage, the small objects in Sequence 7 are tracked well. False detections, missed detections and occlusion have always been major challenges for MOT. To overcome them, our detection branch adopts an anchor-free detector that does not depend on hand-set anchors, which effectively reduces false and missed detections: in Sequence 7 the man in the white shirt is tracked accurately even after severe occlusion, and in Sequence 6, in a complex indoor shopping mall, the man in the black shirt who appears midway is also tracked. In real urban autonomous driving, tracking pedestrians beside and crossing the road is particularly important. Sequence 7 was taken by an in-car dash camera: it tracks not only the pedestrians on both sides of the street but also the distant small-object pedestrians crossing at the zebra crossing, which is important for subsequent avoidance maneuvers and the prevention of traffic accidents.

3.4. Results on the Autonomous Driving Dataset KITTI

The KITTI dataset is a computer vision evaluation dataset for autonomous driving scenarios, co-founded by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTIC). Its scenes mainly include urban areas, villages and highways. The part used for multi-object tracking consists of 21 training sequences and 29 test sequences. Here we select KITTI-16 and KITTI-19 for qualitative analysis, as shown in Figure 8. Since pedestrians are non-rigid objects and the hardest to track in MOT, we only show the tracking results on pedestrians.
KITTI-16 is a high-traffic intersection shot by a static camera, where crossings, overlaps and occlusions occur frequently; the DM module effectively suppresses the ID switches that occlusion would otherwise cause. KITTI-19 is a bustling urban road scene captured by a moving in-car camera, and our algorithm still accurately tracks pedestrians on and beside the road.

3.5. Ablation Experiment

The most important stages of multi-object tracking are early detection and later trajectory extension, and detection accuracy directly affects the final tracking results. The innovations of our algorithm lie in the detection and trajectory extension parts. To show their contributions more intuitively, we conducted an ablation study of each part of the proposed algorithm on the MOT16 dataset, as shown in Table 4. We list the three indicators that best reflect multi-object tracking performance.

4. Conclusions

While self-driving cars bring us much convenience, many difficulties and challenges remain in real life. To this end, we combine a multi-feature-fusion pyramid feature extractor and an anchor-free detector with the DM module, and propose a multi-object tracking algorithm that balances accuracy and speed. In particular, the DM module effectively solves the frequent ID switches that occur when an object overlaps or is occluded by the background and surrounding objects, and extends competitive trajectories well. Compared with the most advanced trackers on the MOT16 and MOT17 benchmarks, our tracker is more competitive. In the future, we will continue to study the problems of two-stage tracking, realize end-to-end multi-object tracking, and further improve the accuracy and speed of the tracker.

Author Contributions

J.C. and C.L. conceived and designed the experiments; J.C. performed the experiments; Y.A. and F.W. analyzed the data; W.Z. and Y.Z. contributed reagents/materials/analysis tools; J.C. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the financial support provided by the Fundamental Research Funds for the Central Universities of China (Grant No. FRF-GF-20-24B, FRF-MP-19-014), the Innovation Group Project of Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) (No. 311021013) and the 111 Project (Grant No. B12012).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ding, S.; Liu, L.; Park, J.H. A novel adaptive nonsingular terminal sliding mode controller design and its application to active front steering system. Int. J. Robust Nonlinear Control 2019, 29, 4250–4269.
2. Norouzi, A.; Masoumi, M.; Barari, A.; Farrokhpour Sani, S. Lateral control of an autonomous vehicle using integrated backstepping and sliding mode controller. Proc. Inst. Mech. Eng. Part K J. Multi-Body Dyn. 2019, 233, 141–151.
3. Formentin, S.; Garatti, S.; Rallo, G.; Savaresi, S.M. Robust direct data-driven controller tuning with an application to vehicle stability control. Int. J. Robust Nonlinear Control 2018, 28, 3752–3765.
4. Chen, J.; Ai, Y.; Qian, Y.; Zhang, W. A novel Siamese Attention Network for visual object tracking of autonomous vehicles. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2021.
5. Gao, M.; Jin, L.; Jiang, Y.; Guo, B. Manifold Siamese Network: A Novel Visual Tracking ConvNet for Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1612–1623.
6. Zhang, Q.N.; Sun, Y.D.; Yang, J.; Liu, H.B. Real-time multi-class moving target tracking and recognition. IET Intell. Transp. Syst. 2016, 10, 308–317.
7. Türetken, E.; Wang, X.; Becker, C.J.; Haubold, C.; Fua, P. Network flow integer programming to track elliptical cells in time-lapse sequences. IEEE Trans. Med. Imaging 2017, 36, 942–951.
8. Yan, X.; Kakadiaris, I.; Shah, A. Modeling local behavior for predicting social interactions towards human tracking. Pattern Recognit. 2014, 47, 1626–1641.
9. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
10. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer Press: New York, NY, USA, 2016; pp. 21–37.
11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society Press: Washington, DC, USA, 2015; pp. 779–788.
12. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society Press: Washington, DC, USA, 2017; pp. 6517–6525.
13. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE Computer Society Press: Los Alamitos, CA, USA, 2018; pp. 1–6.
14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
15. Yang, F.; Choi, W.; Lin, Y. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2129–2137.
16. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking. Int. J. Comput. Vis. 2020.
17. Hu, Y.; Song, R.; Li, Y. Efficient coarse-to-fine patchmatch for large displacement optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5704–5712.
18. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
19. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
20. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
21. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
22. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
23. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755.
24. Ess, A.; Leibe, B.; Schindler, K.; van Gool, L. A mobile vision system for robust multi-person tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
25. Zhang, S.; Benenson, R.; Schiele, B. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3213–3221.
26. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. CrowdHuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123.
27. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv 2015, arXiv:1504.01942.
28. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645.
29. Henschel, R.; Leal-Taixé, L.; Cremers, D.; Rosenhahn, B. Fusion of Head and Full-Body Detectors for Multi-Object Tracking. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018.
30. Peng, J.; Wang, T.; Lin, W.; Wang, J.; See, J.; Wen, S.; Ding, E. TPM: Multiple Object Tracking with Tracklet-Plane Matching. Pattern Recognit. 2020, 107, 107480.
31. Tang, S.; Andriluka, M.; Andres, B.; Schiele, B. Multiple People Tracking by Lifted Multicut and Person Re-identification. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
32. Xu, Y.; Osep, A.; Ban, Y.; Horaud, R.; Leal-Taixé, L.; Alameda-Pineda, X. How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6787–6796.
33. Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
34. Zhu, J.; Yang, H.; Liu, N.; Kim, M.; Zhang, W.; Yang, M.H. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 379–396.
35. Li, X.; Liu, Y.; Wang, K.; Yan, Y.; Wang, F.Y. Multi-Target Tracking with Trajectory Prediction and Re-Identification. In Proceedings of the 2019 Chinese Automation Congress (CAC), Hangzhou, China, 22–24 November 2019.
36. Chen, J.; Sheng, H.; Zhang, Y.; Xiong, Z. Enhancing Detection Model for Multiple Hypothesis Tracking. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 2143–2152.
37. Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Lyu, W.; Ke, W.; Xiong, Z. Long-Term Tracking With Deep Tracklet Association. IEEE Trans. Image Process. 2020, 29, 6694–6706.
38. Lee, S.; Kim, E. Multiple object tracking via feature pyramid Siamese networks. IEEE Access 2019, 7, 8181–8194.
39. Chu, P.; Ling, H. FAMNet: Joint Learning of Feature, Affinity and Multi-dimensional Assignment for Online Multiple Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.
Figure 1. Sample output of the MOT algorithm.
Figure 2. Simplified FairMOT pipeline.
Figure 3. Details of the discrimination learning model.
Figure 4. Examples of competitive trajectory tracking results, where yellow is the detection result, and red and green indicate the correct tracking result.
Figure 5. Simplified pipeline of our method.
Figure 6. Deep Layer Aggregation (DLA-34) backbone network structure, where the red arrows denote down-sampling, the yellow arrows denote up-sampling, and the blue arrows denote resolution-preserving connections.
Figure 7. Tracking results of our algorithm on the MOT17 test sequences. Rows from top to bottom show sequences 1 through 7.
Figure 8. Tracking results of our algorithm on the KITTI dataset.
Table 1. Comparison of our algorithm with other state-of-the-art algorithms on the MOT16 benchmark.
Mode | Method | MOTA | MOTP | IDF1 | MT | ML | FP | FN
Off-line | FWT [29] | 47.8 | 75.5 | 44.3 | 19.1 | 38.2 | 8886 | 85,487
Off-line | TPM [30] | 51.3 | 75.2 | 47.9 | 18.7 | 40.8 | 2701 | 85,504
Off-line | LMP [31] | 48.8 | 79.0 | 51.3 | 18.2 | 40.1 | 6654 | 86,245
On-line | DeepMOT [32] | 54.8 | 77.5 | 53.4 | 19.1 | 37.0 | 2955 | 78,765
On-line | Tracktor++ [33] | 54.4 | 78.2 | 52.5 | 19.0 | 36.9 | 3280 | 79,149
On-line | DMAN [34] | 51.4 | 76.9 | 54.0 | 16.5 | 34.9 | 21,042 | 251,873
On-line | PV [35] | 50.4 | 77.7 | 50.8 | 14.9 | 38.9 | 2600 | 86,780
On-line | Ours | 56.3 | 79.2 | 55.1 | 20.4 | 35.6 | 3095 | 79,634
Table 2. Comparison of our algorithm with other state-of-the-art algorithms on the MOT17 benchmark.
Mode | Method | MOTA | MOTP | IDF1 | MT | ML | FP | FN
Off-line | EDMT [36] | 50.9 | 76.6 | 52.7 | 17.5 | 35.7 | 24,069 | 250,768
Off-line | TT17 [37] | 54.9 | 77.2 | 63.1 | 24.4 | 38.1 | 20,236 | 233,295
Off-line | TPM [30] | 54.2 | 76.7 | 52.6 | 22.8 | 37.5 | 13,739 | 242,730
On-line | FPSN [38] | 44.9 | 76.6 | 48.4 | 16.5 | 35.8 | 33,757 | 269,952
On-line | DeepMOT [32] | 53.7 | 77.2 | 53.8 | 19.4 | 36.6 | 11,731 | 247,447
On-line | FAMNet [39] | 52.0 | 76.5 | 48.7 | 19.1 | 33.4 | 14,138 | 253,616
On-line | DMAN [34] | 48.2 | 75.7 | 55.7 | 19.3 | 38.3 | 26,218 | 263,608
On-line | Ours | 55.1 | 78.9 | 54.1 | 20.0 | 35.6 | 8524 | 241,795
Table 3. Comparison results of different detector algorithms for MOT17.
Sequence | MOTA(↑) | MOTP(↑) | IDF1(↑) | MT(↑) | ML(↓) | FP(↓) | FN(↓) | IDSW(↓)
MOT17-01-DPM | 41.7 | 78.4 | 40.3 | 5 | 11 | 23 | 3716 | 21
MOT17-03-DPM | 65.3 | 79.1 | 59.7 | 51 | 19 | 1552 | 34,530 | 216
MOT17-06-DPM | 54.0 | 80.6 | 55.9 | 47 | 86 | 120 | 5227 | 79
MOT17-07-DPM | 41.6 | 79.3 | 45.9 | 5 | 22 | 94 | 9699 | 74
MOT17-08-DPM | 26.6 | 83.5 | 32.7 | 8 | 39 | 68 | 15,375 | 64
MOT17-12-DPM | 45.9 | 82.8 | 53.8 | 16 | 43 | 26 | 4635 | 27
MOT17-14-DPM | 31.7 | 77.3 | 39.5 | 11 | 81 | 218 | 12,263 | 142
Average (DPM) | 43.83 | 80.1 | 46.8 | 20.4 | 43 | 300.1 | 12,206.4 | 89
MOT17-01-FRCNN | 43.6 | 77.9 | 41.1 | 6 | 10 | 107 | 3505 | 24
MOT17-03-FRCNN | 67.7 | 78.7 | 60.3 | 54 | 18 | 1578 | 32,032 | 198
MOT17-06-FRCNN | 57.5 | 80.0 | 58.6 | 55 | 61 | 225 | 4657 | 125
MOT17-07-FRCNN | 41.9 | 79.1 | 46.9 | 6 | 22 | 219 | 9517 | 83
MOT17-08-FRCNN | 26.2 | 83.5 | 32.1 | 8 | 40 | 94 | 15,431 | 60
MOT17-12-FRCNN | 44.8 | 82.5 | 54.7 | 15 | 44 | 34 | 4728 | 18
MOT17-14-FRCNN | 33.0 | 76.2 | 39.9 | 12 | 78 | 457 | 11,734 | 197
Average (FRCNN) | 45.0 | 79.7 | 47.7 | 22.3 | 39 | 359.1 | 11,657 | 100.7
MOT17-01-SDP | 43.9 | 77.7 | 59.7 | 6 | 10 | 104 | 3488 | 26
MOT17-03-SDP | 71.8 | 78.1 | 62.7 | 62 | 16 | 2380 | 26,774 | 333
MOT17-06-SDP | 58.0 | 80.0 | 56.9 | 58 | 65 | 282 | 4545 | 127
MOT17-07-SDP | 43.9 | 78.7 | 45.8 | 8 | 19 | 222 | 9149 | 98
MOT17-08-SDP | 27.7 | 82.7 | 32.4 | 10 | 37 | 146 | 15,057 | 74
MOT17-12-SDP | 46.3 | 82.2 | 54.4 | 17 | 44 | 97 | 4532 | 26
MOT17-14-SDP | 35.4 | 76.3 | 42.3 | 11 | 70 | 476 | 11,254 | 208
Average (SDP) | 46.7 | 79.4 | 50.6 | 24.6 | 37.3 | 529.6 | 12,114 | 151.3
Table 4. The results of ablation experiments of different models of our algorithm on the MOT16 dataset.
Method | MOTA | MOTP | MT
Anchor-based tracking | 48.7 | 67.8 | 49.2
Anchor-free tracking | 52.3 | 70.1 | 52.4
Anchor-free tracking + trajectory extension strategy (ours) | 56.3 | 79.2 | 55.1
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
