Online Multiple Object Tracking Using a Novel Discriminative Module for Autonomous Driving

Multi-object tracking (MOT) is a key technology in the environment perception system of autonomous driving and is critical to driving safety. Online MOT must accurately extend the trajectories of multiple objects without using information from future frames, which makes it more challenging. Most existing online MOT methods rely on anchor-based detectors, which suffer from false and missed detections and extend the trajectories of adjacent objects poorly when those objects occlude or overlap each other. In this paper, we propose a discrimination-learning online tracker, built on an anchor-free detector, that effectively handles occlusion. The method exploits the different weight characteristics of objects when occlusion occurs and uses a discrimination module to extend competing trajectories, which prevents ID switches. In the experiments, we compare the algorithm with other trackers on two public benchmark datasets, MOT16 and MOT17, where it achieves state-of-the-art performance, and we conduct a qualitative analysis on the autonomous driving dataset KITTI.


Introduction
A multi-object tracking (MOT) system accurately tracks obstacles moving in front of or around an autonomous vehicle, including vehicles, non-motor vehicles and pedestrians. This subsystem helps self-driving cars make decisions and avoid collisions with objects that may move (for example, other vehicles and pedestrians) [1][2][3]. In these scenarios, the main task of the MOT algorithm is to track many objects simultaneously, assign and maintain an ID for each object, and record its trajectory, which cannot be achieved with an object detection algorithm or a single-object tracking algorithm alone.
Object tracking is very important to driving safety: it makes it possible to predict the trajectory of object movement, so that the control layer can make decisions such as collision warning and lane changes in advance. In terms of the number of objects, object tracking can be divided into single-object tracking (SOT) [4,5] and multi-object tracking (MOT). In real traffic scenes, MOT is more common, and the matching relationship of multiple moving objects between consecutive frames must be taken into account. An example of the output is shown in Figure 1. As an important branch of computer vision, MOT has also been widely used in intelligent surveillance systems [6], medical image processing [7] and human-computer interaction [8]. In terms of processing mode, MOT can be divided into offline and online tracking. Offline tracking usually uses global information, so its accuracy is relatively high. However, its high computational cost and large storage requirements make it unsuitable for autonomous driving, which requires high portability and real-time performance. Online tracking, because of its real-time requirements, can use only the current frame and historical frames, which makes the problem harder. Because of the complexity of multi-object tracking, we need to consider not only changes in viewing angle and illumination, but also the appearance of new objects, the disappearance of old objects, and the re-identification of lost objects. Robust tracking therefore remains a major challenge.
Recently, deep learning based on neural networks has made great progress. Representative detection algorithms include Fast R-CNN [9], SSD [10] and YOLO [11][12][13]. With the advancement of object detection, the tracking-by-detection paradigm has taken the lead. Such an algorithm detects the objects in each frame and then matches them with the existing trajectories. A new trajectory is created for each object that newly appears in the current frame, and the trajectory of an object that leaves the field of view is terminated. However, detectors based on Faster R-CNN [14] and on SDP [15] are all anchor-based, and they are prone to center point offset and low accuracy of the regressed bounding box. Therefore, in this paper, we use an anchor-free detector.
In this work, to meet the requirements of real-time online tracking for autonomous vehicles, we take inspiration from the FairMOT [16] pipeline and propose a real-time online multi-object tracker based on feature extraction from regions of interest (ROIs). Because overlapping or adjacent objects and the background affect the driver differently in an autonomous driving scene, the algorithm designs a multifunctional discriminative model. The model determines the type of each trajectory by calibrating the ROI of the object detected in the previous frame, then uses the discriminative model to handle changes in object appearance caused by occlusion or by interaction between objects, and finally obtains the global feature trajectory of the object during its movement. At the same time, in order to meet the real-time requirements of autonomous vehicles, historical information and future information are used together to smooth the trajectories of objects over multiple frames. The main contributions of this work are as follows:
i. An online multi-object tracking algorithm suitable for environment perception in autonomous driving is proposed.
ii. A discriminative learning model is proposed for the occlusion problem that arises when different objects, or adjacent objects, overlap while moving.
iii. The proposed MOT tracker achieves competitive performance on the MOT16 and MOT17 benchmarks and on the KITTI dataset.

Our Proposed Tracker
In this section, we first introduce the FairMOT pipeline and the novel detection strategy, then introduce the proposed online MOT tracking algorithm, and finally, introduce in detail our optimized trajectory extension strategy for different tracking objects during the tracking process.

Problem Formulation
Since multi-object tracking predicts the position state of multiple objects in the next frame, MOT can be described as a multi-variable optimization problem. Given an image sequence, let $A_t^i$ and $X_t^i$ be the state and the observation of the $i$-th target in frame $t$, respectively, and let $A_t = \{A_t^1, A_t^2, \ldots, A_t^{M_t}\}$ be the states of all $M_t$ targets in frame $t$. $A_{i_s:i_e}^i = \{A_{i_s}^i, \ldots, A_{i_e}^i\}$ is the state sequence of the $i$-th target, where $i_s$ and $i_e$ denote the first and last frames in which target $i$ appears, while $A_{1:t} = \{A_1, A_2, \ldots, A_t\}$ is the state sequence of all objects from the first frame to frame $t$. Similarly, $X_t = \{X_t^1, X_t^2, \ldots, X_t^{M_t}\}$ denotes the observations of all $M_t$ objects in frame $t$, and $X_{1:t} = \{X_1, X_2, \ldots, X_t\}$ denotes the observations of all objects from the first frame to frame $t$. The goal of MOT is to find the best trajectories of all objects. Therefore, given the observations of all objects, MOT can be modeled as a maximum a posteriori (MAP) estimation problem, whose solution is obtained through a prediction and update process.
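The MAP formulation and the prediction and update steps are commonly written in the recursive Bayesian form below. This is a sketch in the notation defined above, not necessarily the exact expressions of this work: the dynamic model $P(A_t \mid A_{t-1})$ and the observation model $P(X_t \mid A_t)$ are assumptions, as are the first-order Markov property and the conditional independence of the observations.

```latex
\begin{align}
\hat{A}_{1:t} &= \arg\max_{A_{1:t}} P(A_{1:t} \mid X_{1:t}) \\
\text{Predict:}\quad P(A_t \mid X_{1:t-1}) &= \int P(A_t \mid A_{t-1})\, P(A_{t-1} \mid X_{1:t-1})\, dA_{t-1} \\
\text{Update:}\quad P(A_t \mid X_{1:t}) &\propto P(X_t \mid A_t)\, P(A_t \mid X_{1:t-1})
\end{align}
```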

FairMOT Pipeline
For detection-based multi-object tracking, the performance of the detector directly affects the tracking result. Traditional MOT algorithms mostly use anchor-based detection, which has a large number of hyperparameters and relatively low detection accuracy. FairMOT adopts anchor-free detection, which effectively improves detection accuracy. Its highlight is that it combines anchor-free detection with Re-ID features for end-to-end tracking. The tracking process is shown in Figure 2. Object detection is treated as a center-based bounding box regression task on high-resolution feature maps. Three parallel regression heads are attached to the backbone to predict the heatmap, the object center offset, and the box size, and each head has its own loss function. In the heatmap loss, $M_{xy}$ denotes the response at location $(x, y)$. $\hat{S} \in \mathbb{R}^{W \times H \times 2}$ and $\hat{O} \in \mathbb{R}^{W \times H \times 2}$ are the outputs of the size and offset heads, respectively. For each ground truth (GT) box in the image, its size and, in the same way, its offset are computed, and $\hat{s}_i$ and $\hat{o}_i$ are the estimated size and offset at the corresponding position. $L_{box}$ is the $L_1$ loss on the size and the offset. $P(k)$ is the class distribution vector to which the identity feature vector at the center of the GT box is mapped, and $L_i(k)$ is the one-hot value of the GT label. Object identification is embedded as a classification task: all object instances with the same identity in the training set are regarded as one class. For each labeled box in the image, the object center $(C_x^i, C_y^i)$ is located on the heatmap, and an identity feature vector $E_{x^i, y^i}$ is extracted at that location and mapped to the class distribution vector $P(k)$, which is compared against the encoding of the label $L_i(k)$.
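The box and identity losses described above can be sketched in a few lines. This is a minimal illustration that assumes the box loss is a plain sum of L1 distances over sizes and offsets and that the identity loss is a softmax cross-entropy against the one-hot label; the function names `l1_box_loss` and `identity_loss` and the list-based inputs are hypothetical, and the heatmap loss is not covered.

```python
import math


def l1_box_loss(sizes_gt, sizes_pred, offsets_gt, offsets_pred):
    """Sum of L1 distances between GT and estimated sizes and offsets."""
    loss = 0.0
    for s, s_hat, o, o_hat in zip(sizes_gt, sizes_pred, offsets_gt, offsets_pred):
        loss += sum(abs(a - b) for a, b in zip(s, s_hat))   # size term
        loss += sum(abs(a - b) for a, b in zip(o, o_hat))   # offset term
    return loss


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def identity_loss(class_logits, labels):
    """Cross-entropy between the class distribution P(k) and one-hot labels.

    class_logits: one list of per-identity scores for each GT box center.
    labels: index of the true identity class for each GT box.
    """
    loss = 0.0
    for logits, label in zip(class_logits, labels):
        p = softmax(logits)
        loss -= math.log(p[label])  # only the true class contributes
    return loss
```

With one box of GT size (10, 20) predicted as (9, 22) and GT offset (0.5, 0.25) predicted as (0.25, 0.25), `l1_box_loss` sums the absolute differences 1, 2 and 0.25.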

Discrimination Learning Model
For multi-object tracking, occlusion has always been a difficult problem, and many researchers have tried to address it. For example, Naiyan Wang et al. [17] treat occlusion as a trajectory association problem, analogous to the data association of detections. The tracklet is fed into an optical flow network for model optimization, so that failed detections are ignored and tracking continues. However, this method did not achieve a good anti-occlusion effect because it did not pay attention to the importance of the sample itself. In this article, to meet the real-time requirements of autonomous vehicles and to handle the frequent occlusions that occur while the vehicle is moving, we introduce a discrimination model to solve the occlusion of moving objects.
For two known competing trajectories, as shown in Figure 3, suppose there are M previous historical trajectories and the feature map $Z_1$. To reduce the influence of ambient noise, we use spatial Gaussian weights to denoise each channel. Through a 1 × 1 convolution and global max pooling, we obtain the abstract invariant features $S \in \mathbb{R}^{N \times C}$. Multiplying $S$ by its transpose and applying the softmax operation gives the spatial correlation map $X = \mathrm{softmax}(S S^{\top}) \in \mathbb{R}^{N \times N}$, whose element $X_{ij}$ represents the spatial correlation between the $i$-th and the $j$-th trajectories.
Next, the correlation map $X$ is reshaped and fed into two fully connected layers and a softmax layer, which yields the attention score $y \in \mathbb{R}^N$ of each position.
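The correlation step can be illustrated with a small sketch. It assumes that the softmax is applied row by row to the product of S and its transpose, which the text does not state explicitly; the function name `correlation_matrix` and the nested-list representation of S are hypothetical.

```python
import math


def correlation_matrix(S):
    """Row-wise softmax of S times its transpose.

    S is a list of N feature vectors of length C (one per trajectory).
    Returns an N x N list of lists whose rows each sum to 1.
    """
    n = len(S)
    # Gram matrix: inner product between every pair of trajectory features.
    gram = [[sum(a * b for a, b in zip(S[i], S[j])) for j in range(n)]
            for i in range(n)]
    X = []
    for row in gram:
        m = max(row)                       # subtract the max for stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        X.append([e / total for e in exps])
    return X
```

For two orthogonal unit features, each trajectory correlates most strongly with itself, so the diagonal entries of the result are the largest in their rows.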
Finally, the final output result is obtained by:

Trajectory Extension Strategy
In the tracking phase, trajectory extension in MOT is one of the most challenging tasks. In order to effectively overcome the problems caused by trajectory extension, we propose a position discrimination model, which can effectively separate the object from the background and its surrounding adjacent or overlapping areas. Since the trajectories in the tracking process can be divided into isolated trajectories and competitive trajectories, we have designed different tracking strategies for them, and still adopt the classic two-stage tracking strategy.
First, for each currently active trajectory, we extract its region of interest as a candidate region and use instance segmentation to refine its bounding box. If the trajectory is isolated and its confidence is greater than the threshold $\sigma_t$ (Equation (7)), it is stored as a new trajectory.
Here, $t_p$ is the continuous tracking time in the first stage and $Z_i$ is the refinement confidence in the $i$-th growth step. $\vartheta \approx \log(2)/\sqrt{T_{max}}$ is a balance parameter determined by the maximum number $T_{max}$ of consecutive failed matches. In our experiments, $\vartheta$ is set to 0.1.
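As a quick numerical check of the balance parameter, the sketch below evaluates ϑ ≈ log(2)/√T_max and inverts it. It assumes that log denotes the natural logarithm; the function names `balance_parameter` and `t_max_for` are hypothetical.

```python
import math


def balance_parameter(t_max):
    """Balance parameter for a maximum of t_max consecutive failed matches."""
    return math.log(2) / math.sqrt(t_max)


def t_max_for(theta):
    """Inverse relation: the T_max value implied by a given balance parameter."""
    return (math.log(2) / theta) ** 2
```

Under this assumption, the setting ϑ = 0.1 corresponds to a T_max of roughly 48 consecutive failed matches.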
Second, for trajectories in a competitive relationship (a detection example is shown in Figure 4), the region where the ROI candidate area overlaps the bounding box refined by instance segmentation is taken as the candidate object. The discrimination model computes the similarity between each competing trajectory and the candidate regions, and the deep Hungarian algorithm performs the association on the similarity matrix so that each trajectory is extended correctly. In the final stage, the trajectories that remain untracked are assigned to the remaining detections according to their IoU, using the threshold $\tau_{iou}$. After data association, each trajectory that is still unmatched is considered lost in the current frame, and a new trajectory is initialized for each unmatched detection with a high response confidence. To reduce the influence of false detections, a new trajectory is deleted if it is lost within its first $\tau_i$ frames. A trajectory is terminated if it remains lost for more than $\tau_t$ consecutive frames or leaves the field of view.
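The IoU-based assignment of the final stage can be sketched as follows. This is an illustration under stated assumptions, not the association used in this work: an exhaustive search over one-to-one assignments stands in for the Hungarian algorithm (so it is only practical for a handful of boxes), boxes are assumed to be (x1, y1, x2, y2) tuples, and the default threshold `tau_iou=0.5` and the names `iou` and `associate` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, tau_iou=0.5):
    """Return (matches, lost_tracks, new_detections) as index lists.

    Pairs whose IoU is below tau_iou are never matched. The assignment
    maximizes the total IoU of the kept pairs.
    """
    n_t, n_d = len(tracks), len(detections)
    sim = [[iou(t, d) for d in detections] for t in tracks]
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p))
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d)))
                      for p in permutations(range(n_t), n_d))
    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        kept = [(i, j) for i, j in pairs if sim[i][j] >= tau_iou]
        score = sum(sim[i][j] for i, j in kept)
        if score > best_score:
            best_pairs, best_score = kept, score
    matched_t = {i for i, _ in best_pairs}
    matched_d = {j for _, j in best_pairs}
    lost = [i for i in range(n_t) if i not in matched_t]   # lost in this frame
    new = [j for j in range(n_d) if j not in matched_d]    # start new tracks
    return sorted(best_pairs), lost, new
```

Unmatched track indices correspond to trajectories considered lost in the current frame, and unmatched detection indices are the candidates for new trajectories.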

Proposed Online MOT Tracking Network
Multi-object tracking based on detection can be divided into online tracking and offline tracking. Online multi-object tracking is a frame-by-frame progressive tracking method, which is similar to the real-time tracking process of human eyes. Firstly, each moving object should be identified and confirmed (object detection), and then its next action should be predicted (trajectory prediction). Finally, the motion direction (motion model), appearance shape (appearance model) and other features of the object are associated with the previous trajectory (data correlation matching).
In this section, we introduce the main tracking process of our algorithm. Because anchor-based detectors have many hyperparameters and feature shortcomings that are not easy to compensate for, we employ an anchor-free detector in the detection process. As shown in Figure 5, after the current frame t passes through the backbone network, the regions of interest are extracted and the detection results of frame t − 1 are used to correct the positions, which gives the trajectories of the objects in the current frame. If the trajectory of an object is isolated, it is stored and extended directly, and the tracking succeeds. If the trajectory of an object in frame t is in a competitive relationship, that is, there is occlusion, it is fed into the discrimination learning model, which resolves the occlusion through position correlation so that the trajectory can be stored and extended. If the trajectory in frame t belongs to a new object, the trajectory is initialized. If the object does not appear in consecutive frames, the tracking is stopped and the trajectory ends. To balance speed and accuracy, we use, like the FairMOT detection method, a ResNet-34 backbone network with strong feature extraction capabilities. As shown in Figure 6, to better integrate the semantic and location information of different layers, we use a Deep Layer Aggregation [18] backbone to extract image features. At the same time, to adjust the receptive field dynamically when the scale and pose of the object change, we use deformable convolution [19] for up-sampling. The input image has size $H_{image} \times W_{image}$, and the output feature map has shape $C \times H \times W$, where $H = H_{image}/4$ and $W = W_{image}/4$.
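As a quick check of the output resolution, the sketch below applies the stride of 4 to the 1088 × 608 training input reported in the implementation details. The function name `feature_map_shape` and the divisibility check are illustrative assumptions; the channel count C is not specified in the text and is left out.

```python
def feature_map_shape(h_image, w_image, stride=4):
    """Spatial size (H, W) of the output feature map for a given input size."""
    if h_image % stride or w_image % stride:
        raise ValueError("input size must be divisible by the stride")
    return h_image // stride, w_image // stride


# Height 608 and width 1088 give a 152 x 272 (H x W) feature map.
h, w = feature_map_shape(608, 1088)
```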
The proposed tracking flow is summarized in Algorithm 1.

Algorithm 1: The Proposed Method
Input: The pre-trained network model, the first frame, and the initial object location bounding box b_1
Output: The object locations b_2, b_3, ..., b_n in the subsequent frames
1. Input the initial frame and initial bounding box
2. for i = 2 : n do
       Get the ROI feature
3.     Calculate the correlation matrix using Equation (6)
4.     Calculate the maximum response using Equations (4) and (5)
5.     Calculate the bounding box
6. end for

Experiments and Evaluation
In this section, we describe the experimental details of the proposed algorithm and evaluate it on the most representative public benchmarks of the MOT Challenge, MOT16 [20] and MOT17 [20], and on the autonomous driving dataset KITTI [21,22].

Experiment Implementation Details
Our algorithm is implemented in PyTorch on an Ubuntu 16.04 desktop computer with an Intel i7-9700K CPU, 16 GB of RAM and two Nvidia GTX 1080 Ti GPUs. In this experiment, we use DLA-34 with multi-layer feature fusion, pre-trained on the COCO dataset [23], as the backbone network. The Adam optimizer is used to train for 30 epochs on ETH [24], CityPersons [25] and CrowdHuman [26]. The input size of all training images is 1088 × 608, and the feature map resolution is 272 × 152.

Results on MOT16
MOT16 mainly covers moving pedestrians and vehicles. It builds on MOT15 [27] with more detailed annotations and more bounding boxes. MOT16 contains richer scenes, with different viewing angles, camera motions and weather conditions. It was annotated by a group of qualified researchers in strict compliance with the annotation guidelines, and the annotations were double-checked to ensure high accuracy. The trajectories annotated in MOT16 are 2D. The MOT16 dataset contains 14 video sequences, of which 7 form the training set with annotations and the other 7 form the test set.
The detector used for the MOT16 dataset is DPM [28], which performs well on the pedestrian category. The main properties of these videos include the FPS, resolution, duration, number of tracks, number of objects, density, static or moving camera, low, medium or high viewing angle, and weather conditions. Table 1 compares our method with state-of-the-art algorithms on the MOT16 public benchmark. The results show that, compared with both offline and online trackers, the proposed algorithm obtains the best results on several important indicators such as MOTA, MOTP and IDF1. In Table 1, FP is the number of false positives during tracking; the lower the value, the better. The number of false positives of our algorithm is 79,634, which ranks in the middle. FN is the number of false negatives and ML is the number of mostly lost trajectories; for both, smaller is better. On these indicators, our algorithm performs well among the eight competing algorithms from 2016. MT is the number of mostly tracked trajectories, and IDF1 is the F value of the pedestrian ID in each pedestrian frame; for both, larger is better. MOTA and MOTP are the other most important indicators, measuring tracking accuracy and position error in multi-object tracking, and are given by Formulas (9) and (10), where t is the frame index, GT is the ground truth label, $c_t$ denotes the number of matches in frame t, and $d_{t,i}$ is the bounding box overlap of target i with its assigned ground truth object. As shown in Table 1, the proposed algorithm ranks first on all three of the most important indicators of multi-object tracking performance: MOTA, MOTP and IDF1.
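For reference, MOTA and MOTP in their standard CLEAR MOT form, as used by the MOT Challenge, are MOTA = 1 − Σ_t (FN_t + FP_t + IDSW_t) / Σ_t GT_t and MOTP = Σ_{t,i} d_{t,i} / Σ_t c_t. The sketch below evaluates them from per-frame counts. It assumes that the formulas referenced in the text take this standard form; IDSW_t (the number of identity switches in frame t) and the function and argument names are illustrative.

```python
def mota(fn, fp, idsw, gt):
    """Multi-object tracking accuracy from per-frame counts.

    fn, fp, idsw, gt: lists with one entry per frame (false negatives,
    false positives, identity switches, ground truth objects).
    """
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("no ground truth objects")
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt


def motp(overlaps, matches):
    """Multi-object tracking precision.

    overlaps: per frame, the list of bounding box overlaps d_{t,i} of the
    matched targets. matches: per frame, the number of matches c_t.
    """
    total_matches = sum(matches)
    if total_matches == 0:
        raise ValueError("no matches")
    return sum(sum(frame) for frame in overlaps) / total_matches
```

For two frames with five ground truth objects each, one false negative in the first frame and one false positive in the second, `mota` returns about 0.8; MOTA can become negative when the errors outnumber the ground truth objects.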

Results on MOT17
Quantitative Analysis
MOT17 is a dataset based on MOT15 with more detailed annotations and more bounding boxes, mainly for pedestrians and vehicles. It contains richer scenes, with different viewing angles, camera motions and weather conditions. It was annotated by a group of qualified researchers in strict compliance with the annotation guidelines, and the annotations were double-checked to ensure high accuracy. The motion trajectories annotated in MOT17 are 2D, and it is a brand new dataset. Its pedestrian density is higher than that of MOT15, which makes it more difficult. Therefore, in this experiment, we use MOT17 as the validation dataset to verify the performance of our algorithm.
As shown in Table 2, the best results are shown in bold. Compared with both online and offline trackers, our algorithm has significant advantages. Because an offline tracker can use global information, its overall performance is usually better than that of an online tracker. However, owing to the wide application of deep learning in detection and its obvious advantages, the gap between the two is getting smaller, and online trackers can even surpass some offline trackers. In Table 2, on the two most important indicators for multi-object tracking, MOTA and MOTP, our algorithm exceeds the tracking performance of the offline algorithms and ranks first.
In order to show the performance of our tracker more intuitively, we further compare the performance with different detectors on the test set in Table 3. Overall, the performance with the SDP [15] detector is the best among the three detectors. DPM is a traditional algorithm based on sliding windows, while FRCNN and SDP are both detection methods based on convolutional neural networks. Table 3 shows the results for each indicator on the different sequences and detectors of the MOT17 videos. The proposed algorithm achieves good results.

Qualitative Analysis
In order to show the performance of our algorithm more intuitively, we conducted a qualitative analysis of the proposed algorithm, as shown in Figure 7. In the first sequence of the MOT17 test dataset, a lady wearing a black skirt on a street corner is still tracked accurately with the same ID after crossing and overlapping with a pedestrian next to her. Sequence 3 is a crowded night scene, and the proposed tracker still shows good tracking performance. Sequence 6 is shot with a moving camera in a busy commercial block, and the tracker still performs well after large deformations and occlusions. For MOT, in addition to difficulties such as occlusion, illumination changes and deformation, the tracking of small objects is also extremely challenging. Since our algorithm uses a feature pyramid network with multi-feature fusion in the feature extraction stage, it tracks the small objects in Sequence 7 well. False detections, missed detections and occlusion have always been major challenges for MOT. To overcome these difficulties, we adopted in the detection branch an anchor-free detector that does not rely on empirically chosen settings, which effectively avoids false and missed detections. In Sequence 7, the man in the white shirt is tracked accurately even after severe occlusion, and Sequence 6 shows that, in a complex indoor shopping mall, the men in black shirts who appear midway are also tracked. In a real urban autonomous driving environment, tracking pedestrians on both sides of the road and pedestrians crossing the road is particularly important.
Sequence 7 is taken by an in-car dash cam. The tracker follows not only the pedestrians on both sides of the station but also the small, distant pedestrians crossing the road on the zebra crossing, which plays an important role in allowing vehicles to take subsequent avoidance measures and avoid traffic accidents.

Results on the Autonomous Driving Dataset KITTI
The KITTI dataset is a computer vision evaluation dataset for autonomous driving scenarios. It was co-founded by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTIC). The scenes mainly include urban areas, rural areas and highways. The part of the dataset used for multi-object tracking consists of 21 training sequences and 29 test sequences. Here, we have selected KITTI-16 and KITTI-19 for qualitative analysis, as shown in Figure 8. Since pedestrians are non-rigid objects and are the most difficult to track in MOT, we only show the tracking results for pedestrians.
KITTI-16 is a high-traffic intersection shot by a static camera, in which crossings, overlaps and occlusions occur frequently. Because we use the DM module, the ID-switch problem caused by occlusion is effectively resolved. KITTI-19 is a bustling urban road scene captured by a moving in-car camera. Our algorithm can still accurately track the pedestrians on the road and on both sides of it.

Ablation Experiment
The most important stages in multi-object tracking are the initial detection and the subsequent trajectory extension, and the detection accuracy directly affects the later tracking results. The innovation of our algorithm lies in the detection and trajectory extension parts. In order to show the performance of our algorithm more intuitively, we conducted an ablation analysis of each part of the proposed algorithm on the MOT16 dataset, as shown in Table 4. In the experiment, we list the three indicators that best reflect the performance of multi-object tracking.

Conclusions
While self-driving cars bring us a lot of convenience, there are still many difficulties and challenges in real life. To this end, we combine a multi-feature fusion pyramid feature extractor and an anchor-free detector with the DM module to propose a multi-object tracking algorithm that takes both accuracy and speed into account. In particular, the DM module effectively solves the problem of frequent ID switches when an object overlaps with or is occluded by the background and surrounding objects, and it extends competing trajectories well. Compared with the most advanced trackers on the MOT16 and MOT17 benchmarks, the proposed tracker is more competitive. In the future, we will continue to study the problems of two-stage tracking, work toward end-to-end multi-object tracking, and further improve the accuracy and speed of the tracker.

Conflicts of Interest:
The authors declare no conflict of interest.