SORT-YM: An Algorithm of Multi-Object Tracking with YOLOv4-Tiny and Motion Prediction

Multi-object tracking (MOT) is a significant and widespread research field in image processing and computer vision. The goal of the MOT task is to predict the complete tracklets of multiple objects in a video sequence. Many challenges, such as occlusion and similar-looking objects, degrade tracking performance. Existing MOT algorithms based on the tracking-by-detection paradigm struggle to accurately predict the location of objects that they fail to track in complex scenes, leading to tracking performance decay such as an increased number of ID switches and tracking drift. To tackle these difficulties, in this study we design a motion prediction strategy for predicting the location of occluded objects. Since the occluded objects may be clearly visible in earlier frames, we utilize the speed and location of the objects in past frames to predict the possible location of the occluded objects. In addition, to improve the tracking speed and further enhance the tracking robustness, we utilize the efficient YOLOv4-tiny to produce the detections in the proposed algorithm, which improves the tracking speed significantly. The experimental results on two widely used public datasets show that our proposed approach has obvious advantages in tracking accuracy and speed over the compared algorithms, and achieves a significant improvement in tracking performance over the Deep SORT baseline.


Introduction
Multi-object tracking (MOT), which aims to assign and maintain a unique ID to each object of interest in a video sequence while predicting the location of all objects, is an essential branch of computer vision tasks. MOT has a vital theoretical research significance and application value. An MOT system with well-behaved performance plays a critical part in visual security monitoring systems, vehicle visual navigation systems, human-computer interaction, etc. [1]. However, as shown in Figure 1, there are many challenges in actual tracking scenarios that lead to tracking performance decay, including the interaction between objects, occlusions, the high similarity between different objects, interference of the background, etc. Under these challenges, undesirable errors such as bounding box drift and ID switches are prone to occur. Therefore, this paper aims to propose a robust MOT algorithm for complex scenes.
In recent years, many MOT methods have appeared to perform the MOT task. Among them, owing to the rapid development of object detection, the tracking-by-detection (TBD) paradigm has shown excellent performance and is the most commonly used framework [2]. As shown in Figure 2, the TBD paradigm consists of a detector and a data association procedure. First, the detector is used to locate all objects of interest in the video sequence. Then, the feature information of each object is extracted in the data association process, and the same objects are associated according to metrics (e.g., appearance features and motion features) defined on those features. Finally, by associating the same objects across video frames, a continuously updated tracklet set is formed. Obviously, in the TBD paradigm, the performance of the detector and the data association algorithm jointly determines the tracking accuracy and robustness. Undesirable detection results may lead to bounding box drift and low tracking precision. Meanwhile, the performance of the data association has a great impact on vital metrics such as the number of ID switches, tracklet segmentation, etc. In addition, both the detector and the data association greatly influence the inference speed of the MOT system.
Therefore, we utilize YOLOv4-tiny as the detector to improve the detection accuracy and speed. Moreover, we design a motion prediction strategy to predict the location of lost objects. Specifically, it utilizes the location and velocity information of the lost objects in past frames to estimate the location and velocity of the objects in the current frame. By adding this model, our method enhances the ability to retrieve the original ID when lost objects reappear in subsequent video frames. Through the motion prediction approach, our algorithm effectively reduces the number of ID switches and tracklet segments.
The main contributions of this work are as follows:
1. We utilize YOLOv4-tiny in the TBD paradigm to improve the tracking accuracy and speed of our model.
2. We design a motion prediction strategy to predict the location of lost objects, effectively reducing the number of ID switches and tracklet segments.
3. We compare our approach with state-of-the-art methods and analyze the effects of introducing YOLOv4-tiny and the motion prediction strategy on the MOT-15 and MOT-16 datasets.

The remainder of this paper is structured as follows. Section 2 introduces the related works and highlights of previous studies. In Section 3, we describe the main components of the proposed method. Section 4 evaluates the proposed method and compares it with state-of-the-art methods on two public datasets. Section 5 concludes this paper.

Related Works
In recent years, the tracking-by-detection paradigm has been the most extensively used approach for the MOT task. Its main components are object detection and data association. This section reviews past works and achievements in object detection and data association.

Object Detection Approaches
The deep learning-based object detection approaches can be divided into two-stage methods and one-stage methods. The two-stage methods first generate a series of candidate regions that may contain objects and then classify each region and perform the bounding box regression according to the features of each candidate region. Meanwhile, the one-stage methods skip the step of candidate regions generation and utilize a convolutional neural network (CNN) directly to regress the location and classification of all objects of the whole image.
In 2014, Girshick et al. [3] proposed the R-CNN, which outperformed the classic DPM [4] by a clear margin on the PASCAL VOC dataset [5]. In addition, it was the first deep learning-based object detection approach. However, it had the disadvantages of low detection speed and high computational cost. To reduce the computational overhead, He et al. [6] proposed the SPP-Net. Different from the R-CNN, which sends candidate regions into the CNN in turn, the SPP-Net directly produced the feature map of the entire image and then divided it into the features of each candidate region. Compared with the R-CNN, the SPP-Net's biggest contribution was the significant improvement in training and inference speed. However, its detection accuracy did not show an obvious advantage over the R-CNN. Based on the SPP-Net, Girshick et al. [7] proposed Fast R-CNN, which used a multi-task loss function and directly trained the CNN for classification and regression on two branches. Although Fast R-CNN reached a higher detection accuracy, it took two seconds to detect an image on a CPU. Therefore, Ren et al. [8] improved Fast R-CNN and proposed Faster R-CNN, which designed a region proposal network (RPN) to share full-image convolutional features with the detection network. The design of shared features not only improved the region proposal quality but also decreased the computing cost. As a result, Faster R-CNN achieved 5 frames per second (FPS) on a K40 GPU and ranked first on the PASCAL VOC dataset [5]. In particular, it was the first detection method to realize end-to-end training. Since then, most two-stage methods have been based on Faster R-CNN. Although the two-stage approach has made great progress, it is difficult for it to achieve real-time detection speed.
In 2015, Redmon et al. [9] proposed the efficient YOLO, which realized real-time object detection. Different from the two-stage methods, YOLO did not include an initial stage of generating candidate regions but completed the regression and classification of all objects at once. The detection speed of YOLO reached 45 FPS on a Titan X GPU. However, its detection accuracy was worse than that of Fast R-CNN. Based on YOLO, Liu et al. [10] proposed the SSD, which trained the network to predict objects at different scales on feature layers at different depths. The detection speed of the SSD is comparable to YOLO, and its accuracy can match Faster R-CNN. Although the SSD predicted with multi-layer feature maps, its ability to detect small objects was not significantly improved. In 2017, Redmon et al. [11] upgraded the original YOLO and proposed YOLOv2, which utilized Darknet-19 as the backbone network to extract object features. Meanwhile, k-means clustering was used to calculate the best anchor sizes. Compared with YOLO, YOLOv2 improved both detection accuracy and speed. Lin et al. [12] designed a new loss function, Focal Loss, and proposed RetinaNet, which utilized ResNet as the backbone network to improve the detection accuracy of one-stage methods. In addition, RetinaNet applied the feature pyramid network structure, achieving better detection accuracy than Faster R-CNN on the MS COCO dataset [13]. After that, Redmon et al. [14] proposed YOLOv3, which utilized Darknet-53 as the backbone network. Additionally, it replaced the softmax classifier with multiple logistic regression classifiers so that the model could be applied to classification tasks with intersections between classes. Further, YOLOv3 set different anchors on three feature maps of different sizes to predict objects at different scales.
The YOLOv3 achieved an excellent balance between detection accuracy and speed and played an essential role in the industry.
In summary, the one-stage and two-stage methods have their own advantages. The two-stage approach is relatively more accurate. In contrast, the one-stage methods are generally faster and can more easily meet the real-time requirements of practical applications.

Data Association Approaches
In early studies, multi-hypothesis tracking (MHT) [15] utilized the deep feature extracted by the AlexNet [16]. The MHT retained multiple association assumptions and constructed a hypothesis tree to select the best assumption as the tracking results by calculating confidence. Based on the MHT, Kim et al. [17] proposed MHT-DAM, which used a multi-output regularized least square method to reduce the dimension of the 4093-dimensional deep feature. Compared with MHT, the tracking accuracy of MHT-DAM was significantly improved. However, the tracking speed of the MHT-DAM is only 0.7 FPS. In addition, to deal with the uncertainty in association conditions, Reid et al. proposed the joint probabilistic data association (JPDA) [18], which considered all possible candidate detection results. The tracklets were updated by a weighted combination of all feasible candidate detections. In 2015, Rezatofighi et al. [19] proposed a novel solution to find the m-best solutions to an integer linear program based on JPDA. The experimental results showed that the JPDA 100 achieved high tracking accuracy in the application of MOT with noise interference and occlusion. In addition, the tracking speed of the JPDA 100 is ten times faster than that of the JPDA. However, the classic multi-object tracking algorithms extracted less information about objects, so handling various challenges in complex tracking scenes remains difficult.
Due to the powerful feature extraction capability of CNN, the deep learning-based MOT methods can extract appearance features, motion, and interaction information from large amounts of data. Compared with the classic algorithms, deep learning-based methods usually achieve a higher tracking accuracy and robustness. A robust data association method needs an accurate representation of the object state. In 2019, Han et al. [20] designed a scale estimation strategy for multi-channel feature fusion to characterize the appearance features of objects and proposed DSCF. Meanwhile, the DSCF fused the color names (CNs), HOG, and gray features to improve the tracking robustness. In addition, it estimated the scale of objects based on the correlation filter and then utilized the appearance feature to perform the data association. However, the DSCF cannot deal with multiple similar objects, such as vehicles and pedestrians. Therefore, most algorithms combine appearance features with motion information for data association to distinguish multiple objects with similar appearances. By using an ensemble learning algorithm to learn tracklet features online, Bae et al. [21] proposed confidence multi-object tracking (CMOT), which utilized the incremental linear discriminant analysis learning model to learn the appearance of objects and combine the similarity between the tracklet and the detection. As a result, the CMOT achieved a high tracking accuracy with a speed of 5 FPS on a 3.07 GHz CPU. Xiang et al. [22] regarded the generation and termination of tracklets as state transitions in the Markov decision process (MDP). They utilized the reinforcement learning algorithm to learn the correlation of data. The experimental results showed that the MDP outperformed the state-of-the-art methods on the MOT-15 dataset [23]. Wojke et al. 
[24] extracted the deep feature that described the appearance differences between different objects more accurately through a CNN network and proposed the Deep SORT.
As a result, Deep SORT was able to track under a large number of occlusions. After that, Chen et al. [25] proposed the MOTDT, which introduced a tracklet scoring mechanism to prevent long-term tracking drift. In addition, a deeply learned appearance representation was applied in the MOTDT to enhance the identification capability. The experimental results showed that the MOTDT achieved state-of-the-art tracking performance on the MOT-16 dataset [26]. Based on YOLOv3 and the MOTDT, Wang et al. [27] proposed joint detection and embedding (JDE), which performed object detection and feature extraction in a single network. Therefore, the JDE significantly reduced the computational overhead. The experimental results on the MOT-16 dataset [26] showed that the JDE became the first real-time MOT method, with a tracking speed of 30.3 FPS.
However, realizing a trade-off between the tracking accuracy and speed in the MOT task is still challenging. On the one hand, since the previous methods did not fully use the information of objects in the past video frames, the tracking performance was seriously affected by occlusions. On the other hand, the tracking robustness and speed can be heavily affected by the detector. Thus, we design a motion prediction strategy to predict the location of the occluded objects. To further improve the tracking performance of our method, we utilize YOLOv4-tiny [28] to produce the detections.

The Proposed Method
In this section, we propose SORT-YM, a simple online and real-time tracking algorithm with YOLOv4-tiny and a motion prediction strategy. The detailed description of SORT-YM is as follows.

Overall Framework of SORT-YM
As shown in Figure 3, the overall framework of SORT-YM consists of detection generation, feature extraction, matching cascade, intersection-over-union (IOU) matching, motion state updating, and tracklet output. The workflow of SORT-YM is as follows: Step 1. The detection generation: We apply the efficient YOLOv4-tiny to produce the detections for each video frame.
Step 2. The appearance feature extraction: The appearance features of each detection are extracted through a convolutional neural network.
Step 3. The tracklet location prediction: We can obtain the predicted location of every tracklet in the next frame by utilizing the Kalman filter.
Step 4. The matching cascade: We calculate the appearance feature similarity and location distance between the confirmed tracklets and detections. After that, the association results of the confirmed tracklets and detections are obtained through the Hungarian algorithm.
Step 5. The IOU matching: We compute the intersection-over-union (IOU) between the detection boxes and predicted bounding boxes of candidate tracklets. After that, the association results of the candidate tracklets and detections are obtained through the Hungarian algorithm.
Step 6. The motion state updating: We update the motion state of the tracklets by the Kalman filter and the motion prediction model. Then, we initialize new tracklets for unassociated detections.
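The six steps above can be sketched as a per-frame update loop. The following is a minimal illustrative sketch, not the actual SORT-YM implementation: it replaces the Kalman prediction, appearance matching, and the Hungarian algorithm with a simple greedy IOU association, but follows the same associate-then-update-then-initialize structure.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def track_frame(tracks, detections, iou_threshold=0.3):
    """One per-frame iteration of steps 3-6: associate existing tracklets
    with detections, update the matched ones, and start new tracklets for
    the rest. `tracks` maps track ID -> last known box."""
    unmatched = list(range(len(detections)))
    next_id = max(tracks, default=-1) + 1
    for tid, box in list(tracks.items()):
        # Greedy stand-in for the matching cascade + Hungarian algorithm.
        best, best_iou = None, iou_threshold
        for d in unmatched:
            overlap = iou(box, detections[d])
            if overlap > best_iou:
                best, best_iou = d, overlap
        if best is not None:
            tracks[tid] = detections[best]  # step 6: update motion state
            unmatched.remove(best)
    for d in unmatched:                     # initialize new tracklets
        tracks[next_id] = detections[d]
        next_id += 1
    return tracks
```

A real tracker would keep a Kalman state and appearance features per tracklet instead of a raw box, but the control flow per frame is the same.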

YOLOv4-Tiny Model
The performance of the tracking-by-detection paradigm can be heavily affected by the detections. An MOT system with a two-stage detector is limited in tracking speed. Moreover, the runtime increases significantly as the number of objects increases. Therefore, comprehensively considering the tracking performance and speed, we apply the efficient YOLOv4-tiny to realize a trade-off between detection accuracy and speed. In contrast to Faster R-CNN [8], YOLOv4-tiny is a lightweight convolutional network, which achieves 371 FPS on a GTX 1080Ti. The network structure of YOLOv4-tiny [28] is shown in Figure 4.
Different from Faster R-CNN, YOLOv4-tiny employs the CSPDarknet53-tiny network as a backbone. The CSPDarknet53-tiny network applies the CBLblock and CSPBlock for feature extraction. The CBLblock contains the convolution operation, batch normalization, and activation function. To further reduce the computational overhead, YOLOv4-tiny utilizes the LeakyReLU function as the activation function, which is defined by:

f(x_i) = x_i,        if x_i ≥ 0
f(x_i) = x_i / a_i,  if x_i < 0

where a_i is a constant parameter larger than 1. By adopting a cross-stage partial connections structure, the CSPBlock divides the input feature map into two parts and concatenates the two parts in the cross-stage residual edge. Meanwhile, the CSPBlock can significantly reduce the computational complexity by 10-20% while ensuring the detection accuracy of the network. In the multi-feature fusion stage, YOLOv4-tiny constructs a feature pyramid network to extract feature maps. Through the feature pyramid network, we can obtain two effective feature maps of different sizes. To estimate the detections, YOLOv4-tiny uses the fused feature maps to predict the classification and location of the targets.
In the process of prediction, YOLOv4-tiny divides the input image into grids of size S × S. For each grid, the network utilizes three anchors to predict objects. As a result, S × S × 3 bounding boxes will be generated for each input image. The anchors in the grids that contain the centers of objects will be used to regress the detection boxes. Subsequently, to reduce redundant bounding boxes, we calculate the confidence score of each detection box. Detections with a confidence score lower than a preset threshold are removed. The confidence score of each detection is defined as:

C = Pr(object) × IoU^truth_pred

where Pr(object) denotes the possibility that the detection box contains an object, and IoU^truth_pred represents the IOU between the predicted bounding box R_pred and the ground-truth box R_truth, which can be denoted as:

IoU^truth_pred = area(R_pred ∩ R_truth) / area(R_pred ∪ R_truth)

Then, YOLOv4-tiny applies the classification loss function to measure the category error between the predicted box and the ground-truth box. The classification loss function is:

L_cls = − Σ_{i=0}^{S×S} Σ_{j=0}^{2} I_ij^obj Σ_{c ∈ classes} [ p̂_i(c) log(p_i(c)) + (1 − p̂_i(c)) log(1 − p_i(c)) ]

Among them, if the j-th anchor in the i-th grid contains an object, I_ij^obj = 1; otherwise, I_ij^obj = 0. p̂_i(c) and p_i(c) are the real and predicted possibility that the target in the anchor belongs to class c. After that, YOLOv4-tiny employs the CIoU loss function for bounding box regression. The CIoU loss function is defined as:

L_CIoU = 1 − IoU + ρ²(b_pred, b_truth) / c² + αν
ν = (4 / π²) (arctan(w_truth / h_truth) − arctan(w_pred / h_pred))²,  α = ν / ((1 − IoU) + ν)

where ρ(·) denotes the Euclidean distance, b_pred and b_truth represent the central points of R_pred and R_truth, respectively, c indicates the diagonal length of the smallest enclosing rectangle covering R_pred and R_truth, and w and h signify the width and height of the bounding box. In Figure 5, we demonstrate the detection results of YOLOv4-tiny. After that, we send the detection results of each video frame to the data association model to receive the association results of the targets.
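As an illustration, the CIoU loss described above can be computed for a pair of axis-aligned boxes as follows. This is a sketch assuming boxes in [x1, y1, x2, y2] format with positive width and height; it is not the training code of YOLOv4-tiny.

```python
import math

def ciou_loss(pred, truth):
    """CIoU loss: 1 - IoU + rho^2/c^2 + alpha*v (sketch; assumes
    well-formed boxes with x2 > x1 and y2 > y1)."""
    # plain IOU of the two boxes
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (truth[2] - truth[0]) * (truth[3] - truth[1])
    iou = inter / (area_p + area_t - inter)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    cp = ((pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2)
    ct = ((truth[0] + truth[2]) / 2, (truth[1] + truth[3]) / 2)
    rho2 = (cp[0] - ct[0]) ** 2 + (cp[1] - ct[1]) ** 2
    ex1, ey1 = min(pred[0], truth[0]), min(pred[1], truth[1])
    ex2, ey2 = max(pred[2], truth[2]), max(pred[3], truth[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency term v and its weight alpha
    v = (4 / math.pi ** 2) * (
        math.atan((truth[2] - truth[0]) / (truth[3] - truth[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

For two identical boxes the loss is zero, and it grows as the predicted box drifts away from the ground truth, which is what makes it a useful regression target.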

Feature Extraction
In order to improve the tracking accuracy and robustness in complex scenes, such as interactions between targets and occlusion, we carry out the data association by extracting the target appearance information. By using a convolutional neural network (CNN), we can extract the appearance features of all detections in the current image. After counting the aspect ratios of the ground-truth bounding boxes of the MOT-16 training set [26], we found that approximately 70% of pedestrians have an aspect ratio between 0.3 and 0.7. Thus, the input detections are reshaped to 128 × 64 and presented to the CNN in RGB color space. As shown in Table 1, the CNN structure applies two convolution layers, a max-pooling layer, and six residual blocks to squeeze the size of the feature map to 16 × 8. As a result, we obtain a global feature vector of dimensionality 128 from a dense layer. Finally, the feature vector is normalized. We use the cosine-margin-triplet loss function [29] to train the feature extractor. The cosine-margin-triplet loss is defined as:

L = (1/N) Σ_{i=1}^{N} max(0, m + f_i · f_i⁻ − f_i · f_i⁺)

where N denotes the number of detections over a batch of video frames, f indicates the anchor of the triplet, f⁺ and f⁻ are the positive and negative samples with respect to f, respectively, and m is the margin. The function aims to minimize the distance between the positive pair and enlarge the distance between the negative pair. The dot product between a pair of feature vectors is defined as:

f_1 · f_2 = ||f_1|| ||f_2|| cos θ

where ||·|| denotes the two-norm of the feature vector and θ indicates the angle between the two vectors. This network contains a total of 2,800,864 parameters. Therefore, it realizes a fast calculation speed that satisfies the real-time requirement. We train the feature extraction network on the popular person re-identification dataset [30], which contains 1261 pedestrians and more than 1,100,000 video frames. The CNN is tested on an Nvidia RTX3070 GPU (NVIDIA, Santa Clara, CA, USA) and an Nvidia GTX1050 GPU (NVIDIA, Santa Clara, CA, USA), and the results show that it takes only 0.85 ms and 0.93 ms on average, respectively, to extract the feature of an object. Thus, this CNN is suitable for real-time tracking on different hardware devices. We send all detections to the trained CNN network, and the feature vectors of each target can be obtained as shown in Figure 6.
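Because the feature vectors are L2-normalized, the dot product between two of them directly equals the cosine of their angle, so the triplet loss operates on cosine similarity. A minimal sketch of scoring a single triplet in this way (the margin value here is an assumed illustration, not a parameter taken from [29]):

```python
import numpy as np

def normalize(f):
    """L2-normalize so that a dot product equals cos(theta)."""
    return f / np.linalg.norm(f)

def cosine_margin_triplet(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on cosine similarity for one triplet
    (sketch; the margin=0.2 default is an assumption)."""
    a, p, n = map(normalize, (anchor, positive, negative))
    # loss is zero once the positive similarity beats the negative
    # similarity by at least the margin
    return max(0.0, margin + float(a @ n) - float(a @ p))
```

Averaging this quantity over the N triplets in a batch gives the training objective described above.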

Motion State Estimation
In complex tracking scenarios, associating detections and tracklets based on the appearance feature alone produces a large number of ID switches. To further improve the tracking accuracy, it is necessary to introduce the motion information of the targets. As the velocity of a target in adjacent video frames is relatively stable, we apply the Kalman filter [31] to predict the motion state of the targets in the next frame, as shown in Figure 7. Firstly, we initialize a new motion state model based on the detection results for targets that first appear in the video. The motion state model is defined as

x = [c_x, c_y, r, h, v_x, v_y, v_r, v_h]^T,

where [c_x, c_y, r, h] is the location state of the target. Among them, (c_x, c_y) denotes the center coordinate of the target, and r and h denote the aspect ratio and height of the bounding box, respectively.
[v_x, v_y, v_r, v_h] is the velocity state of the target, which indicates the target's speed in the four directions. Then, for each new target, we initialize the covariance matrix P_0 empirically based on the height of the bounding box. According to the target motion state at frame k − 1, we estimate the motion state of the target in frame k by the Kalman filter as

x̂_k^− = F x̂_{k−1},

where x̂_{k−1} denotes the motion state of the target in frame k − 1, and x̂_k^− indicates the predicted motion state in frame k. F is the state-transition matrix of the constant-velocity model; with a time step of one frame, it is defined as

F = [ I_4  I_4 ]
    [ 0    I_4 ],

where I_4 denotes the 4 × 4 identity matrix. After that, from the covariance matrix P_{k−1} in the previous frame, we can compute the covariance matrix at the current frame as

P_k^− = F P_{k−1} F^T + Q,

where Q and P_k^− are the process noise matrix and the predicted covariance matrix, respectively. As shown in Figure 3, we divide the tracklets into confirmed tracklets and unconfirmed tracklets. The confirmed tracklets are formed by the associated detections of more than three frames; the other tracklets are denoted as unconfirmed tracklets. For the confirmed tracklets, we associate them with subsequent detections by the subsequent matching cascade. For the unconfirmed tracklets and the remaining unassociated tracklets, we associate them with detections by the subsequent IOU matching.
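The prediction step can be sketched as follows, assuming a unit time step and the 8-dimensional constant-velocity state described above:

```python
import numpy as np

def make_F(dt=1.0, dim=4):
    # 8x8 constant-velocity transition: location components += velocity * dt.
    F = np.eye(2 * dim)
    F[:dim, dim:] = dt * np.eye(dim)
    return F

def kf_predict(x, P, F, Q):
    # x: state [c_x, c_y, r, h, v_x, v_y, v_r, v_h]; P: covariance; Q: process noise.
    x_pred = F @ x                 # x_k^- = F x_{k-1}
    P_pred = F @ P @ F.T + Q       # P_k^- = F P_{k-1} F^T + Q
    return x_pred, P_pred
```

For example, a target at the origin with velocity (1, −1) is predicted one frame ahead at (1, −1) with an inflated covariance.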

Matching Cascade
To improve the tracking accuracy and reduce the identity switches effectively, we comprehensively consider the appearance feature and motion state of objects. The matching cascade stage is performed as follows.
Firstly, we compute the cosine distance between each confirmed tracklet feature and each detection feature. We denote the i-th tracklet feature and the j-th detection feature as A = (a_1, a_2, . . . , a_128) and B = (b_1, b_2, . . . , b_128), respectively. The cosine distance between the tracklet feature and the detection feature can be formulated as

c_{i,j} = 1 − (A · B) / (||A||_2 ||B||_2).

The smaller c_{i,j} is, the higher the similarity between the confirmed tracklet and the detection. We then collect all c_{i,j} into the cosine distance matrix C. Considering that the location of a pedestrian changes little between two adjacent frames, we discard the matching pairs whose Mahalanobis distance is too large. For the i-th confirmed tracklet, we calculate the squared Mahalanobis distance to the j-th detection as

b_{i,j} = (l_j − x_i)^T Σ_i^{−1} (l_j − x_i),

where x_i indicates the predicted location of the i-th confirmed tracklet, and l_j denotes the location of the j-th detection. Σ_i is the first four rows and four columns of the predicted covariance matrix P_k^−. The smaller b_{i,j} is, the closer the confirmed tracklet and the detection are. If b_{i,j} is larger than the preset threshold ζ, we set the corresponding c_{i,j} of C to 10. Next, we use C as the input of the Hungarian algorithm [32] to obtain the association results of the matching cascade. The Hungarian algorithm is a commonly used approach for solving assignment problems; with it, we obtain association pairs that are similar in both appearance feature and location.
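A rough sketch of these two steps follows. The gate value ζ is assumed here to be the 0.95 chi-square quantile with 4 degrees of freedom (as in Deep SORT), and a brute-force search over assignments stands in for the Hungarian algorithm on these toy sizes:

```python
import numpy as np
from itertools import permutations

CHI2_GATE = 9.4877  # assumed gate value for ζ: chi-square 0.95 quantile, 4 DOF

def matching_cascade(track_feats, det_feats, track_locs, track_covs, det_locs):
    # Cosine-distance cost c_ij between every confirmed tracklet and detection.
    T = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    D = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    C = 1.0 - T @ D.T
    # Mahalanobis gating: pairs that are too far apart get the sentinel cost 10.
    for i, (x_i, S_i) in enumerate(zip(track_locs, track_covs)):
        S_inv = np.linalg.inv(S_i)
        for j, l_j in enumerate(det_locs):
            d = l_j - x_i
            if d @ S_inv @ d > CHI2_GATE:
                C[i, j] = 10.0
    # Exhaustive minimum-cost assignment (stand-in for the Hungarian algorithm).
    n = min(C.shape)
    best, best_cost = [], np.inf
    for perm in permutations(range(C.shape[1]), n):
        cost = sum(C[i, j] for i, j in zip(range(n), perm))
        if cost < best_cost:
            best, best_cost = list(zip(range(n), perm)), cost
    return [(i, j) for i, j in best if C[i, j] < 10.0]
```

With two tracklets whose features match the two detections crosswise, the cascade associates tracklet 0 with detection 1 and tracklet 1 with detection 0.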

IOU Matching
There are still some unassociated tracklets and detections after the above matching method. We use the unassociated confirmed tracklets and unconfirmed tracklets as candidate tracklets. Then, we send all candidate tracklets and unassociated detections to participate in IOU matching.
In the process of IOU matching, we first compute the IOU between the predicted boxes of the candidate tracklets and the unassociated detection boxes. Then, we construct the matrix U whose element u_{i,j} = 1 − IOU between the i-th candidate tracklet and the j-th detection. The smaller u_{i,j} is, the larger the overlap ratio of the two boxes and the greater the possibility that they belong to the same target. After that, we employ U as the input of the Hungarian algorithm to obtain the association results of the IOU matching.
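A small sketch of the IOU cost used here, assuming boxes in corner format:

```python
def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2) corner coordinates.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iou_cost(box_a, box_b):
    # Matrix entry u_ij = 1 - IOU, so a smaller cost means a larger overlap.
    return 1.0 - iou(box_a, box_b)
```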

Motion State Updating
This stage combines the association results of the matching cascade and the IOU matching, dividing all tracklets into associated tracklets and unassociated tracklets. In addition, all detections are divided into matched detections and unmatched detections. Then, we update the estimated motion states of the associated tracklets and initialize new tracklets for the unmatched detections.
For associated tracklets, the estimated motion states are updated by means of the Kalman filter. Firstly, the tracker computes the Kalman gain. Determining the Kalman gain is a critical step in establishing the Kalman filter model, and it significantly impacts the efficiency and accuracy of the filtering. The Kalman gain can be calculated as

K_k = P_k^− H^T (H P_k^− H^T + R)^{−1},

where H denotes the transition matrix from the state quantity to the observation, and R indicates the observation noise covariance matrix. Then, the estimated motion state of a tracklet is updated by its matched detection as

x̂_k = x̂_k^− + K_k (z_k − H x̂_k^−),

where x̂_k denotes the modified motion state of the tracklet, and z_k indicates the location vector of the matched detection. Finally, the covariance matrix of the tracklet is updated as

P_k = (I − K_k H) P_k^−,

where P_k denotes the modified covariance matrix of the tracklet, and I is an identity matrix. As for the unmatched detections, we regard them as new objects in the video and initialize new motion states for them using Equation (8).
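The three update equations can be sketched as follows, assuming an observation matrix H that picks out the four location components of the 8-dimensional state:

```python
import numpy as np

def kf_update(x_pred, P_pred, z, H, R):
    # Kalman gain K_k, then correct the predicted state with the matched detection z.
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # K_k = P_k^- H^T (H P_k^- H^T + R)^-1
    x = x_pred + K @ (z - H @ x_pred)            # x_k = x_k^- + K_k (z_k - H x_k^-)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred   # P_k = (I - K_k H) P_k^-
    return x, P
```

With a near-zero observation noise R, the corrected location is pulled almost exactly onto the matched detection.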

Motion Prediction Strategy
If the motion states of the unassociated tracklets are not updated, the difficulty of re-associating an object with subsequent detections increases significantly after a period of disappearance, and identity switches are very likely when the object appears in the video again. Therefore, we design a motion prediction strategy to estimate the moving speed of the occluded objects in the four directions. Since the occluded objects may be visible in the previous frames, we utilize the locations of the objects in the earlier frames to estimate their speed, thereby predicting the possible location of the occluded objects.
Firstly, the motion prediction model collects the velocity states [v_x, v_y, v_r, v_h] of the unassociated tracklet in the previous frames. As different video frames are of different importance for predicting the current location, we assign each frame's velocity state a weight based on its time interval to the present: a velocity state closer to the current time has a higher weight. The weights are computed from n and m, where n represents the frame index at which the object entered the video for the first time, and m indicates the current frame index. Subsequently, we subtract the speeds of the object in each pair of past adjacent frames to obtain its acceleration in each past frame, and we sum the weighted accelerations over the earlier frames to obtain the predicted acceleration. Then, we compute the weighted sum of the object's past speeds and add the predicted acceleration to it to obtain the predicted velocity in the current frame. Finally, we add the predicted velocity to the location locat_{k−1} in the previous frame to obtain the predicted location locat_pred of the object in the current frame. The main steps of SORT-YM are shown in Algorithm 1.
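The steps above can be sketched as follows. This is an illustrative implementation, not the paper's exact formulation: the recency weights are assumed to be linear in the frame index, since the excerpt does not reproduce the weight formula.

```python
import numpy as np

def predict_location(locations, dt=1.0):
    # locations: per-frame [c_x, c_y, r, h] of the tracklet before it was lost
    # (at least two frames). Linear recency weights are an assumption.
    locs = np.asarray(locations, dtype=float)
    vels = np.diff(locs, axis=0) / dt                  # per-frame velocities
    w_v = np.arange(1, len(vels) + 1, dtype=float)     # later frames weigh more
    v_pred = (w_v / w_v.sum()) @ vels                  # weighted mean velocity
    accs = np.diff(vels, axis=0) / dt                  # per-frame accelerations
    if len(accs):
        w_a = np.arange(1, len(accs) + 1, dtype=float)
        v_pred = v_pred + ((w_a / w_a.sum()) @ accs) * dt  # add predicted acceleration
    return locs[-1] + v_pred * dt                      # predicted location
```

For a tracklet moving at constant velocity, the predicted location simply continues the motion one frame ahead.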
standard stochastic gradient descent for 30 epochs. The learning rate of the network is initially set to 10^−2, and it decays to 10^−3 at the 25th epoch.

Dataset Protocol
We evaluate the performance of our algorithm on the testing sets of two extensively used MOT benchmarks: MOT-15 and MOT-16. The MOT-15 dataset contains 22 video sequences, including 11 training sets and 11 testing sets, with over 11,000 video frames in total. The main difficulties of MOT-15 are as follows:
• A large number of objects: the number of annotated bounding boxes over all testing video sequences is 61,440. Therefore, it is difficult for an algorithm to achieve high tracking accuracy at a fast speed.
• Static or moving camera: among the 11 testing video sequences, six videos are taken by a static camera, and the remaining videos are taken by a moving camera. These two modes of video increase the requirement for the algorithm to predict the location of tracklets.
MOT-16 contains seven testing video sequences in total. Compared to MOT-15, MOT-16 is a more challenging dataset: the scenarios of its videos are more crowded and more complex. In addition to the challenges of MOT-15, the MOT-16 video sequences pose the following challenges:
• Different viewpoints: each video sequence has a different viewpoint owing to the different heights of the camera. Videos from multiple perspectives increase the difficulty of object detection and feature extraction.
• Varying weather conditions: a sunny-weather video may contain some shadows, while videos with dark or cloudy weather have lower visibility, making pedestrian detection and tracking more difficult.

Evaluation Metrics
To evaluate the tracking performance, we utilize the evaluation metrics provided by the MOT Challenge benchmark [38]. First, we adopt the multi-object tracking accuracy (MOTA) to evaluate the robustness of our algorithm. MOTA comprehensively considers three types of tracking errors, and it is defined as

MOTA = 1 − Σ_t (FN_t + FP_t + IDs_t) / Σ_t GT_t,

where t is the index of the video frame, FN denotes the number of false negatives (ground-truth objects that are not detected by the algorithm), FP denotes the number of false positives, GT is the number of ground-truth objects in all video sequences, and IDs indicates the number of ID switches over all objects. Then, the tracking precision of the method is evaluated through the multiple object tracking precision (MOTP), which is defined as

MOTP = Σ_{t,i} d_{t,i} / Σ_t c_t,

where c_t represents the number of objects tracked correctly in frame t, and d_{t,i} indicates the bounding box overlap of the i-th successfully tracked object with the ground-truth object in frame t. Mostly tracked objects (MT) is the ratio of ground-truth tracklets that are tracked for more than 80% of their length, and mostly lost objects (ML) is the ratio of tracklets that are tracked for less than 20%. Fragmentations (FM) indicates the number of interruptions over all ground-truth tracklets. Finally, we evaluate the speed of the algorithm through the frames per second (FPS).
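Both metrics can be computed directly from per-frame counts; a minimal sketch:

```python
def mota(fn, fp, ids, gt):
    # fn/fp/ids/gt: per-frame counts of false negatives, false positives,
    # ID switches, and ground-truth objects.
    return 1.0 - (sum(fn) + sum(fp) + sum(ids)) / sum(gt)

def motp(overlaps, num_correct):
    # overlaps: per-frame summed IoU d_{t,i} over correctly tracked objects;
    # num_correct: per-frame counts c_t of correctly tracked objects.
    return sum(overlaps) / sum(num_correct)
```

For example, 1 false negative and 1 false positive over 20 ground-truth objects give MOTA = 0.9; three matches with a mean overlap of 0.9 give MOTP = 0.9.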

Experiment on MOT-15 Dataset
We test the performance of SORT-YM on the MOT-15 dataset and demonstrate the comparative quantitative results in Table 2. We compare SORT-YM with 3 offline approaches and 13 online approaches. SORT-YM achieves state-of-the-art performance in terms of MOTA (58.2%), MOTP (79.3%), and ML (12.2%). Meanwhile, SORT-YM ranks second on MT (44.4%) and FN (20,753). In addition, the tracking speed of our proposed method is faster than that of most online approaches. The tracking speeds of SiameseCNN, RNN_LSTM, and SORT are faster than that of SORT-YM; however, SORT-YM has an obvious advantage in tracking accuracy. The visual tracking results of SORT-YM on the MOT-15 dataset are shown in Figure 8. As shown in Figure 8b-d,g, SORT-YM obtains accurate tracklets in videos taken by a moving camera. Among them, our proposed method maintains robust tracking performance in Figure 8c despite large shadows and small objects. As shown in Figure 8a,e,f, SORT-YM achieves higher tracking accuracy in videos taken by a static camera. Tracking bounding box drift does not occur in complex scenes with interference from many similar objects, as shown in Figure 8e. In addition, our algorithm keeps strong robustness in crowded tracking scenarios with many occlusions and interactions between objects, as shown in Figure 8f.

Experiment on MOT-16 Dataset
We first perform ablation analysis on the MOT-16 dataset to verify the necessity of each component in SORT-YM. We apply the Deep SORT [24] as our baseline, and we remove proposed components to investigate the contribution to our method. The comparison results are shown in Table 3.
Effect of YOLOv4-tiny. To evaluate the effectiveness of using YOLOv4-tiny, we perform an experiment to measure its benefit. As shown in Table 3, compared to the Deep SORT baseline, which utilizes Faster R-CNN [8] as the detector, the tracking speed of the algorithm with YOLOv4-tiny improves significantly. Meanwhile, the MOTA and MOTP increase by 1.2% and 0.9%, respectively.
Effect of motion prediction. We also experiment to verify the necessity of motion prediction. Adding the motion prediction module to the Deep SORT baseline reduces the number of FN and FM by 27,900 and 85, respectively. In addition, compared with the case where no motion prediction is employed, SORT-YM improves the MOTA by 0.8%. Meanwhile, its number of FN and FM is reduced by 9,332 and 107, respectively, owing to the strong ability of the motion prediction to retrieve lost objects.

As shown in Table 4, SORT-YM achieves state-of-the-art performance in terms of MOTP (81.7%), ML (16.7%), and FN (21,439). Meanwhile, our proposed method reaches the highest MOTA (63.4%) among online methods. Compared to KDNT and LMP_P, the speed of SORT-YM is much faster; in fact, SORT-YM achieves the second-fastest tracking speed. Compared with the Deep SORT baseline, SORT-YM improves on all indexes. Among them, MOTA and MOTP increase by 2% and 2.6%, respectively, which shows that our algorithm improves both tracking accuracy and precision. In addition, SORT-YM reduces the ID switches by 74, which indicates that our algorithm can better deal with the interference of occlusion and interaction. Meanwhile, the tracking speed of SORT-YM increases significantly. In general, SORT-YM achieves high performance in both tracking robustness and speed on the MOT-16 dataset. Table 4. The quantitative results of our proposed method and the state-of-the-art methods on the MOT-16 dataset. ↑ denotes that higher is better and ↓ represents the opposite. The best and sub-optimal results are highlighted in bold and italics.

Conclusions
Focusing on the problem that existing multi-object tracking methods struggle to predict the location of occluded objects, this study designs a motion prediction strategy for predicting the location of lost objects. Since the occluded objects may be visible in the earlier frames, we use the information from multiple past frames to predict the location of the occluded objects. To further improve the tracking performance of the algorithm, we utilize YOLOv4-tiny as our detector. Experimental results on the MOT-15 and MOT-16 datasets show that SORT-YM achieves state-of-the-art results on multiple indicators and improves the tracking speed. Comprehensively considering tracking robustness and speed, SORT-YM is a competitive algorithm. Compared with Deep SORT, our algorithm achieves a significant improvement in tracking performance. Future research will focus on two aspects: one is improving the efficiency of the object detection algorithm and the feature extraction network to reach real-time tracking; the other is improving the Kalman filter to enhance the tracking robustness in complex scenarios.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.


Conflicts of Interest:
The authors declare no conflict of interest.