TLtrack: Combining Transformers and a Linear Model for Robust Multi-Object Tracking

Abstract: Multi-object tracking (MOT) aims at estimating the locations and identities of objects in videos. Many modern multi-object tracking systems follow the tracking-by-detection paradigm, consisting of a detector followed by a method for associating detections into tracks. Associating detections through motion-based similarity heuristics is the most basic approach. Motion models aim to exploit motion information to estimate future locations and play an important role in enhancing association performance. Recently, DanceTrack, a large-scale dataset in which objects have uniform appearance and diverse motion patterns, was proposed. Existing hand-crafted motion models struggle to achieve decent results on DanceTrack because they lack prior knowledge of such motion. In this work, we present a motion-based algorithm named TLtrack, which adopts a hybrid strategy to make motion estimates based on confidence scores. For high confidence score detections, TLtrack employs transformers to predict their locations. For low confidence score detections, a simple linear model that estimates locations from the historical information of the trajectory is used. TLtrack can not only consider the historical information of the trajectory but also analyze the latest movements. Our experimental results on the DanceTrack dataset show that our method achieves the best performance compared with other motion models.


Introduction
Multi-object tracking (MOT) has been a long-standing problem in computer vision, the aim being to predict the trajectories of objects in a video. It is one of the fundamental yet challenging tasks in computer vision [1] and forms the basis for important applications ranging from video surveillance [2,3] to autonomous driving [4,5].
Many modern multi-object tracking systems follow the tracking-by-detection paradigm, consisting of a detector followed by a method for associating detections into tracks. The displacement of objects of interest provides important cues for object association, and many works track objects through motion estimation. SORT [6] uses the Kalman filter [7] as its motion model, a recursive Bayes filter that follows a typical predict-update cycle. The Kalman filter's simplicity and effectiveness have made it widely used in tracking algorithms [6,8,9]. However, as a hand-crafted motion model, the Kalman filter struggles to deal with the diverse motion in DanceTrack [10]. OC-SORT [11] pointed out the limitations that SORT [6] inherits from the Kalman filter [7] and improved robustness against occlusion and non-linear motion. CenterTrack [12] built on the CenterNet [13] detector to learn a 2D offset between two adjacent frames and associates objects based on center distance, but its association performance is weak. Recently, MOTR [14], which extended DETR [15] and introduced track queries to model the tracked instances across an entire video, has shown the potential of the transformer for data association.
However, MOTR [14] uses the same queries for both detection and tracking, resulting in poor detection performance.
DanceTrack [10] is a large-scale multi-object tracking dataset in which objects have uniform appearance and diverse motion patterns. It focuses on situations where multiple objects move over a relatively large range, the occluded areas change dynamically, and objects even cross over one another. Such cases are common in the real world, but naive motion models cannot handle them effectively. It can be concluded that the ability to analyze complex motion patterns is necessary for building a more comprehensive and intelligent tracker.
We aimed to develop a strong motion model capable of handling complex movements. Inspired by MOTR [14], we utilize transformers to analyze cross-frame motion patterns. Specifically, an object detector generates detection results and track queries. A transformer architecture then takes the track queries and the image feature as input to predict the current locations of the detections. In our method, we obtain the track queries directly from the detections of each frame; consequently, the accuracy of motion prediction is highly influenced by the quality of the detections. While the detector is trained to locate object positions, its performance may fall short in certain scenes. In MOT tasks, occlusion or blurring can result in less accurate detection bounding boxes than expected, as illustrated in Figure 1. This, in turn, renders the track queries less representative and leads to erroneous predictions. We point out that the confidence score can help address this issue. We have therefore designed a hybrid strategy that makes motion estimates based on the confidence score. For objects with a high confidence score, we adopt a transformer to predict their future locations. Conversely, for objects with a low detection score, we employ a simple linear model to estimate the position. Although the world does not move with constant velocity, many short-term movements, such as those between two consecutive frames, can be well approximated by a linear model under a constant-velocity assumption. Additionally, a linear model predicts position from the historical velocity of the trajectory, reducing the impact of the current state. To push forward the development of motion-based MOT algorithms, we propose a novel motion model named TLtrack. TLtrack adopts a novel hybrid strategy for motion estimation, utilizing transformers to predict the locations of high confidence score detections and a linear model for low confidence score detections; in this way, it considers both the historical information of the trajectory and the latest movements of each object. Our experimental results on the DanceTrack dataset show that our method achieves the best performance compared with other motion models.

Tracking by Detection
The tracking-by-detection paradigm consists of a detector followed by a method for associating detections into tracks. The performance of the object detector plays a pivotal role in tracking. As the field of object detection undergoes rapid development, an increasing number of methods are embracing more powerful detectors to enhance tracking performance. Notably, the two-stage detector Faster R-CNN [16] is employed by SORT [6] due to its exceptional detection accuracy. Additionally, many approaches have turned to the YOLO series detectors [17,18] for their commendable balance between accuracy and speed. Among the array of choices, CenterNet [13] has emerged as a preeminent selection due to its simplicity and efficiency, garnering extensive adoption by numerous methods [19-22]. In a similar vein, transformer-based detectors, such as DETR [15] and Deformable DETR [23], have been incorporated by TransTrack [24] and Trackformer [25]. Introducing a different perspective, P3AFormer [26] employs pixel-wise techniques to achieve precise object predictions at the pixel level. In practice, these methods commonly leverage detection boxes from each input image directly for tracking purposes. Thus, judicious employment of these detection boxes becomes crucial for effective data association.

Motion Model
Motion models are widely used in data association to achieve robust ID assignment. Specifically, a motion model utilizes the motion information of trajectories to predict future positions. Many methods [6,8,9] use the Kalman filter [7] as the motion model, a recursive Bayes filter that follows a typical predict-update cycle. The Kalman filter is the predominant motion model owing to its simplicity and effectiveness, but as a hand-crafted model it struggles with non-linear or fast object motion. An LSTM is adopted by DEFT [27] as a motion model, which leads to decent association metrics. Optical flow [28] is adopted by CenterTrack [12] as an alternative motion model, but it cannot achieve excellent performance. Recently, transformer-based models have shown great performance in motion prediction. TransTrack [24] adopts two transformer decoders to achieve detection and motion estimation simultaneously in a single stage. MOTR [14] learns strong temporal motion and achieves advanced performance on the DanceTrack [10] dataset, which demonstrates the potential of the transformer for diverse motion prediction.

Transformers in MOT
More recently, with the new focus on applying transformers [29] to vision tasks, many methods have attempted to leverage the attention mechanism for tracking objects in videos. TransTrack [24] builds a novel joint-detection-and-tracking paradigm by accomplishing object detection and object association in a single shot. Trackformer [25] extends DETR [15] by incorporating additional object queries sourced from existing tracks and propagates track IDs similarly to Tracktor [30]. MOTR [14] follows the structure of DETR [15] and continually propagates and updates track queries to associate object identities. MO3TR [31] introduces a temporal attention module to update each track's status over a temporal window, employing the updated track features as queries in DETR [15]. GTR [32] designs a global tracking transformer that takes object features from all frames within a temporal window and groups objects into trajectories. In these works, the transformer architecture is carefully designed to suit the needs of tracking tasks. Our method instead utilizes a transformer to build a strong motion model.

Methodology
In this section, we present the proposed tracking framework, as illustrated in Figure 2. The overall structure of the encoder and decoder can be seen in Figure 3.

Architecture
Following the tracking-by-detection paradigm, our model is built upon an object detector. An extra transformer architecture is employed to leverage motion cues. Given a frame I_t, it is initially fed into the detector to generate the detection results D_t ∈ R^{N×5} (N represents the number of detected objects; the 5 channels comprise the bounding box and the confidence score) and track queries Q_t ∈ R^{N×C}, which are the features corresponding to each detected object. The backbone of the transformer takes two consecutive frames, I_{t−1} and I_t, as input and produces the stacked feature map F_s ∈ R^{H×W×C}. The transformer encoder consists of a self-attention block and a feed-forward block, taking F_s as the query to generate the enhanced feature F_e ∈ R^{H×W×C} for the decoder. The transformer decoder, comprising a cross-attention block and a feed-forward block, utilizes the track queries Q_{t−1} and the enhanced feature F_e as the query and key, respectively. An MLP is used after the decoder to obtain the prediction results P_t ∈ R^{N×4} (4 represents the bounding box coordinates). For each object detected in frame t − 1, represented by the track query Q_{t−1}, the prediction results P_t represent its predicted position in frame t. The Hungarian algorithm is employed to achieve bipartite matching; the assignment is determined by a cost matrix that compares new detections with the tracks obtained in previous frames. We will discuss later how the prediction results P_t are selectively used to populate the cost matrix.
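As a concrete illustration of the matching step, the sketch below (a hypothetical helper, not the authors' code) builds a cost matrix from the overlap between predicted track boxes and new detections and solves the assignment with the Hungarian algorithm via SciPy:

```python
# Minimal sketch of Hungarian bipartite matching between predicted
# track boxes and new detections; cost = 1 - IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def associate(predictions, detections):
    """Return (track index, detection index) pairs minimizing total cost."""
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predictions])
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(associate(preds, dets))  # [(0, 1), (1, 0)]
```

In the full method, the cost entries come from the selectively used predictions P_t rather than plain IoU, but the assignment mechanics are the same.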

Transformers and Linear Track
We have designed a hybrid strategy based on confidence scores to make motion estimates. Assuming P_{t−1} to be the locations of the detections in frame t − 1, our goal is to predict their locations in frame t.
For high confidence score detections, we first turn their feature maps into track queries Q_{t−1}. Then, Q_{t−1} goes through a self-attention block, which can be expressed as

$$Q^{\mathrm{self}}_{t-1} = \mathrm{softmax}\left(\frac{Q_{t-1} Q_{t-1}^{\top}}{\sqrt{d_k}}\right) Q_{t-1},$$

where d_k is the dimension of the key vector. Q^{self}_{t−1} is then fed into the cross-attention block, which can be expressed as

$$Q^{\mathrm{cross}}_{t-1} = \mathrm{softmax}\left(\frac{Q^{\mathrm{self}}_{t-1} F_e^{\top}}{\sqrt{d_k}}\right) F_e,$$

where Q^{cross}_{t−1} is the output of the cross-attention block and F_e represents the enhanced feature generated by the encoder. Finally, a feed-forward network and an MLP generate the final predictions:

$$P_t = \mathrm{MLP}\left(\mathrm{FFN}\left(Q^{\mathrm{cross}}_{t-1}\right)\right),$$

where P_t ∈ R^{N×4} (4 represents the bounding box coordinates) are the predicted locations in frame t.
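The two attention blocks can be sketched in a few lines of numpy (a minimal illustration with random features; the learned projection matrices are omitted, and the shapes follow the definitions above):

```python
# Scaled dot-product attention applied as in the decoder: self-attention
# among the N track queries, then cross-attention into the flattened
# enhanced image feature F_e.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

rng = np.random.default_rng(0)
N, C, HW = 4, 8, 64
Q_prev = rng.standard_normal((N, C))   # track queries from frame t-1
F_e = rng.standard_normal((HW, C))     # enhanced feature from the encoder

Q_self = attention(Q_prev, Q_prev, Q_prev)  # self-attention among tracks
Q_cross = attention(Q_self, F_e, F_e)       # cross-attention into the image
print(Q_cross.shape)  # (4, 8)
```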
For low confidence score detections, we estimate locations with a simple linear model. Assuming P^{low}_{t−1} to be the location of one low confidence score detection in frame t − 1, its location in frame t can be represented by

$$P^{\mathrm{low}}_{t} = P^{\mathrm{low}}_{t-1} + v\,\Delta t,$$

where v is the mean velocity of this object over the last M frames. Later experiments determine how many frames it is appropriate to use when computing the mean velocity. We set Δt to 1. The whole hybrid strategy can be represented by

$$\hat{p}^{\,i}_{t} = \begin{cases} A\left(p^{i}_{t-1}\right), & s^{i}_{t-1} \geq \tau, \\ p^{i}_{t-1} + v\,\Delta t, & s^{i}_{t-1} < \tau, \end{cases}$$

where p^i_{t−1} represents the location of the i-th detection in frame t − 1 and s^i_{t−1} represents its confidence score. A(p^i_{t−1}) represents the transformer-based processing for high score detections discussed above, and τ is the threshold for the confidence score. We set τ to 0.9 based on further experiments.
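A minimal sketch of this hybrid dispatch, with the transformer branch stubbed out by a placeholder callable and boxes in a hypothetical (cx, cy, w, h) format (τ = 0.9 follows the setting above):

```python
# Hybrid motion estimate: transformer prediction for high-score
# detections, linear constant-velocity extrapolation otherwise.
import numpy as np

TAU = 0.9  # confidence threshold

def linear_predict(history, dt=1.0):
    """Extrapolate the next box from the mean velocity over the history."""
    boxes = np.asarray(history, dtype=float)  # (M+1, 4)
    v = np.diff(boxes, axis=0).mean(axis=0)   # mean per-frame displacement
    return boxes[-1] + v * dt

def predict(history, score, transformer_predict):
    if score >= TAU:
        return transformer_predict(history[-1])  # high score: transformer
    return linear_predict(history)               # low score: linear model

history = [np.array([10.0, 10.0, 5.0, 5.0]),
           np.array([12.0, 10.0, 5.0, 5.0]),
           np.array([14.0, 10.0, 5.0, 5.0])]
pred = predict(history, score=0.5, transformer_predict=lambda b: b)
print(pred)  # [16. 10.  5.  5.]
```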

Training
Following the same settings as TransTrack [24], we use static images as training data; the adjacent frame is simulated by randomly scaling and translating the static image. First, a trained detector generates detections and track queries from the original frame. Second, the track queries and the adjacent frame are fed into the transformer to obtain the prediction results. We apply a set prediction loss to supervise the prediction results. The set-based loss produces an optimal bipartite matching between the predictions and the ground-truth objects. The matching cost is defined as

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls} + \lambda_{L1}\,\mathcal{L}_{L1} + \lambda_{giou}\,\mathcal{L}_{giou},$$

where L_cls is the focal loss, L_L1 denotes the L1 loss, L_giou is the generalized IoU loss, and λ_cls, λ_L1, and λ_giou are the corresponding weight coefficients. The training loss is the same as the matching cost except that it is only computed on matched pairs.
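A simplified sketch of the per-pair matching cost: the classification term is reduced to the negative class score for brevity (standing in for the focal loss), and the λ weights are illustrative placeholders, not the paper's values.

```python
# Per-pair matching cost combining a classification term, an L1 box
# term, and a GIoU term, as in set-prediction matching.
import numpy as np

def giou(a, b):
    """Generalized IoU of boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest box enclosing both
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

def match_cost(pred_box, gt_box, cls_score,
               lam_cls=2.0, lam_l1=5.0, lam_giou=2.0):
    l_cls = -cls_score                                   # stand-in for focal loss
    l_l1 = np.abs(np.array(pred_box) - np.array(gt_box)).sum()
    l_giou = 1.0 - giou(pred_box, gt_box)
    return lam_cls * l_cls + lam_l1 * l_l1 + lam_giou * l_giou

print(match_cost((0, 0, 10, 10), (0, 0, 10, 10), cls_score=1.0))  # -2.0
```

The Hungarian algorithm is then run on the matrix of these costs over all prediction-ground-truth pairs.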

Settings
Datasets. We evaluated our method on the multi-object tracking dataset DanceTrack [10] under the "private detection" protocol. DanceTrack [10] is a recently proposed dataset designed for human tracking, which focuses on promoting multi-object tracking studies with a stronger emphasis on association rather than mere detection. In this dataset, object localization is straightforward, but object motion exhibits highly non-linear behavior. Additionally, the objects share a close appearance, leading to significant occlusion and frequent crossovers. These aspects pose substantial challenges for both motion-based and appearance-matching-based tracking algorithms. As our main objective was to enhance tracking robustness in the presence of fast movements and non-linear object motion, we placed special emphasis on comparing TLtrack with previous methods on the DanceTrack [10] dataset in the following experiments.
Metrics. We employed the CLEAR metrics [33], encompassing MOTA [34], FP, FN, and IDs, in addition to IDF1 and HOTA [35], to comprehensively assess various facets of tracking performance. MOTA is calculated from FP, FN, and IDs, with a greater emphasis on detection performance due to the relatively larger presence of FP and FN. Conversely, IDF1 appraises the capability of preserving identities, with a stronger focus on association performance. Higher-order tracking accuracy (HOTA) explicitly balances the effects of accurate detection, association, and localization in a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics that evaluate each of the five basic error types separately, which enables a clear analysis of tracking performance. The detection accuracy, DetA, is simply the percentage of aligning detections. The association accuracy, AssA, is simply the average alignment between matched trajectories, averaged over all detections.
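For reference, at each localization threshold α, HOTA is the geometric mean of detection and association accuracy, and the final score averages over the set of thresholds A:

```latex
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha} \cdot \mathrm{AssA}_{\alpha}},
\qquad
\mathrm{HOTA} = \frac{1}{|A|} \sum_{\alpha \in A} \mathrm{HOTA}_{\alpha}
```

This is why a tracker can lead on HOTA while trailing specialized methods on DetA or AssA individually.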
Implementation details. TLtrack uses a default detection score threshold of 0.6 unless specified otherwise. For the benchmark evaluation on DanceTrack [10], we solely adopted GIoU as the similarity metric. During the linear assignment step, a match between a detection box and a tracklet box was rejected if their GIoU was smaller than −0.2. To address lost tracklets, we retained them for 30 frames in case they reappeared.
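The GIoU gate described above can be sketched as follows (a hypothetical helper; GIoU ranges over (−1, 1], and a candidate pair is discarded when it falls below the −0.2 threshold):

```python
# GIoU-based gating for linear assignment: pairs with GIoU < -0.2
# are rejected before matching.
GIOU_MIN = -0.2

def giou(a, b):
    """Generalized IoU of boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

def accept(det_box, trk_box):
    """Keep a candidate detection-tracklet pair only if GIoU >= -0.2."""
    return giou(det_box, trk_box) >= GIOU_MIN

print(accept((0, 0, 10, 10), (1, 1, 11, 11)))        # True: heavy overlap
print(accept((0, 0, 5, 5), (100, 100, 110, 110)))    # False: far apart
```

Unlike plain IoU, GIoU still penalizes by distance when boxes do not overlap, so the gate can discriminate between near misses and implausible matches.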
The detector was YOLOX [17] with YOLOX-X as the backbone and a COCO-pretrained model as the initialized weights. With the trained detector, we trained the transformer with an input shape of 1440 × 800. The model was trained on 4 NVIDIA Titan Xp GPUs with a batch size of 1. We used SGD as the optimizer with a weight decay of 10^−4 and a momentum of 0.9. The initial learning rate was 2 × 10^−4 for the transformer and 2 × 10^−5 for the backbone. All transformer weights were initialized with Xavier initialization, and the backbone was pretrained on ImageNet with frozen batch-norm layers. We used data augmentation including random horizontal flipping, random cropping, and scale augmentation. We trained the model for 20 epochs, and the learning rate dropped by a factor of 10 at the 10th epoch. The total training time was approximately 7 days on the DanceTrack train set.

DanceTrack.
In order to assess TLtrack's performance under complex motion patterns, we present the results for the DanceTrack test set in Table 1. As evident from the results, TLtrack achieved advanced results when handling complex object motions. TLtrack achieved the highest HOTA, indicating the best overall performance, although its DetA was lower than those of CenterTrack and TransTrack. This is because these two algorithms were carefully designed for detection; at the same time, they do not have good association performance. The AssA of TLtrack was slightly lower than that of GTR, because that algorithm adopts a global association strategy. Regarding IDF1, ByteTrack achieved the highest performance because lower confidence detections are recovered, but this strategy also affects the detection performance of the algorithm.
Comparison to other motion models. Our method, TLtrack, serves as a motion model aimed at dealing with complex motions. In this study, we compared TLtrack with other motion-based methods on the DanceTrack test set: SORT [6], which applies a Kalman filter for motion prediction (the Kalman filter has been the predominant motion model for MOT); CenterTrack [12], which converts CenterNet into an MOT architecture, predicts the center offset between two frames, and uses it to conduct greedy matching; and TransTrack [24], which consists of two transformer decoders, one for detection and another for predicting the position of each track. The results in Table 1 demonstrate that TLtrack outperforms all these methods in terms of HOTA, IDF1, and AssA, showcasing its superior capability in handling complex motion patterns.

Ablation Studies
Component Ablation. To demonstrate the effectiveness of our hybrid strategy, we conducted experiments on the validation set of DanceTrack [10]. We compared the performance of a linear model for all detections, a transformer for all detections, and our hybrid method. The results in Table 2 show that our hybrid strategy enables both models to reach their full potential on the DanceTrack dataset. When a transformer predicts the locations of all detections, the association performance is poor; when it is applied only to the high confidence score detections, the association performance improves considerably. This phenomenon confirms our analysis that the accuracy of motion prediction is highly influenced by the quality of the detections.
Velocity in the linear model. In order to perform motion estimation using the linear model, the velocity of each track must be computed. In this section, we discuss how many of the last detections are appropriate for calculating the velocity. For the DanceTrack dataset, the motion is extreme. The results in Table 3 indicate that relying on only the last four frames yields a reasonable estimation of future motion.
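A toy illustration (with hypothetical numbers) of the trade-off behind the choice of M: a short window tracks the latest motion, while a long window smooths it out, which matters when motion changes quickly as on DanceTrack.

```python
# Mean velocity over the last m steps of a track's center history.
import numpy as np

def mean_velocity(centers, m):
    """Mean per-frame displacement over the last m steps."""
    c = np.asarray(centers, dtype=float)
    return np.diff(c[-(m + 1):], axis=0).mean(axis=0)

# a track that accelerates in its last frames
centers = [[0, 0], [1, 0], [2, 0], [3, 0], [6, 0], [9, 0]]
print(mean_velocity(centers, 2))  # [3. 0.] -- reflects the recent speed-up
print(mean_velocity(centers, 5))  # [1.8 0.] -- smoothed over the whole history
```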
TLtrack: τ. Based on the idea of enhancing the representativeness of track queries, we set τ to distinguish detections. The results in Table 4 show that 0.9 is the appropriate threshold value. This phenomenon indicates that the confidence score plays a crucial role in handling challenging situations in MOT. For example, a decrease in the confidence score often signifies the occurrence of occlusion, which can easily lead to ID switches or target loss.

Effect of input image size. Table 5 shows the effect of the input image size. As the input image size gradually increases, the HOTA performance saturates when the short side of the input image reaches 800 pixels. Therefore, we set this as the default setting in TLtrack.
Effect of the video frame rate. Table 6 shows the effect of the video frame rate. We compared the model's performance on the original video and on videos sampled at one frame in every two, three, and four frames. As can be seen from the results, the performance of the model gradually decreased as the sampling interval increased.

Conclusions
This paper presented TLtrack, a novel hybrid strategy for making motion estimates based on confidence scores. For detections with a high confidence score, TLtrack employs transformers to predict locations. Conversely, for detections with a low confidence score, it resorts to a straightforward linear model. In this way, not only the past direction of the trajectory but also the latest movements can be taken into account. TLtrack's strength lies in its simplicity, real-time processing capability, and effectiveness. An empirical evaluation on the DanceTrack dataset shows that our method achieves the best performance compared with other motion models.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the institutional review board (IRB) of Shanghai University with a waiver of the requirement for informed consent.
Informed Consent Statement: Patient consent was waived because the research used a public dataset that does not contain any identifiable information, and there is no way to link the information back to identifiable individuals.

Figure 1 .
Figure 1. Illustration of detections with different confidence scores, wherein the red box represents detections with a low score and the green box represents detections with a high score. As the detection confidence score decreases, the detection box cannot accurately represent the location of the object.

Figure 2 .
Figure 2. Diagram of the proposed model: (Left) The detector takes the current frame as input and generates the detection results. The features corresponding to detected objects are used as track queries for the next frame. The encoder takes two consecutive frames as input and outputs the enhanced image feature. The decoder takes the enhanced image feature and track queries as input. An MLP, which is omitted in this figure for simplicity, takes the output of the decoder to generate the final prediction results. Finally, Hungarian matching is used between the detections and predictions. (Right) The detailed structure of the decoder, consisting of a self-attention layer, a cross-attention layer, and an FFN (feed-forward network) layer.

Figure 3 .
Figure 3. The overall structure of the encoder and decoder.

Table 1 .
Results for DanceTrack test set.The methods in the bottom block use the same detections.

Table 3 .
Influence of choice of M last detections to compute velocity on DanceTrack-val set.
↑ indicates that higher is better.

Table 4 .
Ablation study on τ in the TLtrack on DanceTrack-val set.
↑ indicates that higher is better.

Table 5 .
Ablation study on the input image size.

Table 6 .
The effect of the video frame rate.