PSMOT: Online Occlusion-Aware Multi-Object Tracking Exploiting Position Sensitivity

Models based on joint detection and re-identification (ReID), which significantly increase the efficiency of online multi-object tracking (MOT) systems, evolved from the separate detection and ReID models of the tracking-by-detection (TBD) paradigm. These joint models are typically one-stage, while two-stage models have fallen out of favor because of their slow speed and low efficiency. However, two-stage models have native advantages over one-stage anchor-based and anchor-free models in handling feature misalignment and occlusion, which suggests that, with meticulous design, they could be on par with state-of-the-art one-stage models. Following this intuition, we propose a robust and efficient two-stage joint model based on R-FCN, whose backbone and neck are fully convolutional and whose RoI-wise process involves only simple calculations. In the first stage, an adaptive sparse anchoring scheme is utilized to produce adequate, high-quality proposals to improve efficiency. To boost both detection and ReID, two key elements, feature aggregation and feature disentanglement, are taken into account. To improve robustness against occlusion, position sensitivity is exploited, first to estimate occlusion and then to direct the post-processing for anti-occlusion. Finally, we link the model to a hierarchical association algorithm to form a complete MOT system called PSMOT. Compared to other cutting-edge systems, PSMOT achieves competitive performance while maintaining time efficiency.


Introduction
As one of the most critical tasks in computer vision, online multiple object tracking (MOT) aims to accurately identify and track objects of interest from real-time video sequences, capturing their continuous motion trajectories. This task plays a pivotal role in applications related to advanced environmental perception and autonomous control. For instance, in autonomous driving [1], the surrounding environments are captured in real time by onboard cameras, radars and lidars, then MOT is applied to perceive objects, including vehicles, pedestrians and bicycles, providing precise tracking information for decision-level applications, such as path planning, collision avoidance and safe interaction with other traffic participants; in intelligent video surveillance [2], MOT is widely used to identify pedestrians within surveillance areas, offering a reliable means for flow estimation and swift response to potential threats or unusual activities.
In the field of MOT, tracking-by-detection (TBD) [3] stands out as the predominant paradigm, comprising three key sub-tasks. First, it detects objects in the current video frame. Second, it extracts the objects' ReID features based on their bounding boxes. Third, it associates the detected objects with those from the previous frame, relying on cues such as the similarity of the ReID features and intersection over union (IoU). Within the TBD framework, the architecture of separate detection and embedding (SDE) [3] directly links these three sub-tasks sequentially. However, a notable drawback arises, as these tasks cannot share any computation. This limitation results in a disproportionately long processing time for the entire system, which prompts the exploration of the joint detection and embedding (JDE) [3] architecture. The JDE methods integrate detection and ReID feature extraction into a unified model, mitigating the need for re-computation. However, this integration is neither a straightforward addition of a ReID branch to a detector [4] nor an expansion of the dimensions of the output coefficient map [5]. It introduces the new challenge of learning multiple tasks that may contradict each other. Different tasks are sensitive to different types of information derived from distinct Convolutional Neural Network (CNN) layers, so it is crucial to ensure that the shared feature maps encompass synthetic information and can be decomposed into task-specific features at the entry of the task branches. Therefore, feature aggregation and disentanglement emerge as the key elements in effectively solving the multi-task issue. Notably, many related works [6][7][8][9][10][11][12] have significantly enhanced the performance of JDE MOT systems by adhering to these key elements. To maintain timeliness, these methods mainly focus on the design of one-stage joint models. In contrast, two-stage joint models have fallen out of favor due to their slow speed and low efficiency.
In the JDE framework, the one-stage anchor-based models employ an end-to-end mechanism where the object's bounding box regression occurs simultaneously with classification and ReID feature extraction. However, the latter two results stem from the regions of hypothetical anchors, rather than regressed regions, which leads to the problem of feature misalignment. In contrast, the two-stage models take a different approach by performing bounding box regression in the first stage. Consequently, classification and feature extraction in the second stage can operate on regions closer to the actual objects. Additionally, the one-stage anchor-free models leverage a point-based mechanism to achieve accurate feature alignment. However, this approach introduces a vulnerability: features extracted from a specific point of the object become highly susceptible to occlusion. Strategies for anti-occlusion predominantly rely on region-based methodologies and are difficult to apply to point-based models. In contrast, the two-stage models are all based on regions, offering proper conditions for various effective countermeasures against occlusion. Following this point of view, we start from a fundamental two-stage model based on R-FCN [13]. Despite the model's light-weight RoI-wise process, inefficiency persists due to the dense and manual anchoring scheme in the first stage, leading to sub-optimal proposals and performance degradation in detection and ReID. To tackle this issue, we replace the original region proposal network (RPN) with a light-weight network [14], which generates high-quality proposals with sparse and adaptive anchors. To simultaneously enhance detection performance and ReID features, we incorporate the key elements, feature aggregation and feature disentanglement, into our basic model. In particular, multi-layer feature fusion is employed for feature aggregation in the backbone network [15] and feature disentanglement is achieved by embedding a neat set of convolutional layers on each task branch.
Originally employed in R-FCN to maintain translation-variance in deep features and in [16] to segment instances of the same class, position sensitivity is exploited for anti-occlusion in our work for the first time. Capitalizing on the fact that different locations on a position-sensitive feature map are exclusively sensitive to corresponding parts of the objects, we leverage this inherent property to locate and exclude the occluded sub-regions. Specifically, for a given object proposal, we transform its position-sensitive classification map into a binary map. This binary representation, produced by an adaptive mean-std threshold, indicates the visibility of each part of the object and guides the aggregation of the maps for classification, bounding box regression and ReID feature extraction, while effectively excluding interference caused by occlusion.
Finally, the proposed two-stage model is integrated with the hierarchical association algorithm in MOTDT [17], resulting in the complete system, named PSMOT. Experimental results demonstrate that PSMOT achieves outstanding performance and robustness while maintaining time efficiency. To sum up, the main contributions of our work are as follows:

• Reuses the two-stage model in multi-object tracking and leverages its inherent advantages in RoI-wise and region-based mechanisms to handle feature misalignment and provide conditions for anti-occlusion;
• The two-stage JDE model, extended from R-FCN, adopts a fully convolutional network structure that significantly reduces the computational burden in RoI-wise processing;
• Replaces the original RPN, which relies on dense and predefined anchors, with a network based on adaptive sparse anchors, enabling the production of more high-quality proposals with fewer anchors. This replacement further improves the model's efficiency;
• An efficient encoder-decoder network with multi-layer feature fusion is employed as the model's backbone and additional convolutional layers are added at the entry of each task branch. Therefore, the highly informative shared features are first generated and then disentangled into effective task-specific features. This feature processing effectively mitigates the conflicts between tasks and significantly improves the overall performance;
• Extends the application of position sensitivity to determine whether a specific part of an object is occluded. Leveraging this cue helps in active anti-occlusion by excluding the corresponding interference.

Early Tracking-by-Detection MOT Methods
The TBD paradigm is characterized by two steps: detecting objects in each frame and associating objects between frames to establish their trajectories. Early research primarily focuses on constructing motion models that utilize motion features for tracking. For instance, ref. [18] models the position and velocity of targets and applies Kalman filtering [19] to predict the bounding boxes of targets from the previous frames to the current frame. The IoU between each predicted box and each detection box obtained through Faster R-CNN [20] is then computed to create the corresponding cost matrix. The Hungarian matching algorithm is subsequently employed to find the optimal match between tracked targets and detected objects. Although the motion models perform well in handling short-term occlusions, MOT methods relying solely on them still exhibit limitations, especially in complex scenarios.
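The IoU-based association step described above can be sketched as follows. This is a minimal illustration, not the implementation of ref. [18]: the box format `(x1, y1, x2, y2)`, the function names and the `min_iou` gate are assumptions, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same minimum-cost assignment but is only practical for a handful of boxes).

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detected, min_iou=0.3):
    """Match motion-predicted track boxes to detection boxes.

    Minimises the total (1 - IoU) cost by exhaustive search, then drops
    pairs whose IoU is below min_iou (a hypothetical gating threshold).
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    cost = [[1.0 - iou(p, d) for d in detected] for p in predicted]
    n_t, n_d = len(predicted), len(detected)
    if n_t <= n_d:
        # Every track picks a distinct detection.
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        # Every detection picks a distinct track.
        candidates = ([(t, d) for d, t in enumerate(perm)]
                      for perm in permutations(range(n_t), n_d))
    best = min(candidates, key=lambda pairs: sum(cost[t][d] for t, d in pairs))
    return sorted((t, d) for t, d in best if 1.0 - cost[t][d] >= min_iou)
```

In a real tracker the exhaustive search would be replaced by a polynomial-time assignment solver; the cost matrix and the gating logic stay the same.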
Recent research has benefited from the powerful feature representation of CNNs and has focused on extracting and matching discriminative appearance features. For example, ref. [21] designs a feature extraction network based on GoogLeNet [22] specifically for extracting the appearance features of objects, whereas ref. [23] proposes a feature extraction network based on the feature pyramid, enhancing discriminative power through feature fusion. Motion features are often combined with appearance features; for example, ref. [24] constructs a unified cost matrix based on both IoU of bounding boxes and similarity of appearance features, and ref. [17] designs a scoring mechanism to eliminate unreliable detection results and motion predictions, then employs a hierarchical strategy to associate detected objects with tracked targets. In addition, ref. [25] utilizes recurrent neural networks (RNNs) to assess motion and appearance similarity between targets.

Joint Detection and Embedding (JDE) MOT Methods
The JDE MOT methods are introduced to streamline the redundant pipeline observed in the SDE approaches. Since the joint models are capable of handling detection and ReID feature extraction simultaneously, a significant reduction in inference time can be achieved. However, the performance of these models suffers from issues concerning multi-task learning [26,27]. It has been observed that feature aggregation and disentanglement are pivotal elements for enhancing the performance of multiple tasks concurrently. FairMOT [6] generates synthetic feature maps using the variant DLA-34 from [9]. Based on FairMOT's model, RelationTrack [8] introduces a self-motivated module called GCD to separate shared features into detection-specific and ReID-specific representations and incorporates a transformer encoder with deformable attention, known as GTE, to enhance the ReID task. CSTrack [7] integrates a reciprocal network into the model from [5] to achieve feature disentanglement and embeds the scale-aware attention network into the ReID branch for feature enhancement. Swin-JDE [10] proposes an anchor-free JDE model based on the Transformer architecture. In this model, the Patch-Expanding module is employed to improve the spatial information of feature maps and Einops Notation-based rearrangement is utilized to enhance the detection and tracking performance. To achieve real-time multi-object tracking, LMOT [11] introduces a simplified DLA-34 to extract detection features for the current image and generates efficient tracking features using a linear Transformer. RetinaMOT [12] extends the object detection model YOLOv5 [28] into a JDE model. To enhance the representative power of features, a series of retina-related convolutional modules are introduced in the backbone network.
Different from the aforementioned MOT methods, which use one-stage JDE models, PSMOT adopts an efficient two-stage model to accomplish detection and ReID feature extraction simultaneously.

Anti-Occlusion in MOT
Occlusion poses a significant challenge in MOT systems, manifesting in two primary issues. First, it can lead to missed detection, resulting in numerous interrupted trajectories of objects. The point-based models, such as FairMOT, are particularly vulnerable compared to the anchor-based models when faced with occlusion. Second, occlusion can corrupt the ReID features of tracked targets, which ultimately results in tracking drift. Many works [29][30][31][32][33] deal with occlusion based on regions. Typically, they partition the objects' bounding boxes into blocks and process occlusion within each block. MOTS [4] tackles the problem by simultaneously addressing segmentation and extracting global attributes from appearance information, along with graph information. In [34], the representative power of the ReID feature is enhanced for each target through spatial and temporal attention. RelationTrack [8] adopts a deformable attention mechanism to avoid aggregating interference caused by occlusion. OUTrack [9] employs an occlusion estimation module to recognize and track occluded objects, which are missed by detection.
There are two types of occlusion in natural scenes: inter-class occlusion, where objects are obstructed by objects belonging to different classes, and intra-class occlusion, which involves overlaps between two objects from the same class. The latter is more challenging, as it requires instance-level cues for distinction. In this paper, we leverage position sensitivity [13,16] and transform it into an effective tool for addressing both inter-class and intra-class occlusion.

Our Approach
In this section, we present the technical details of PSMOT. We start by delving into the proposed two-stage JDE model, outlining its key modules and training process. Subsequently, we explore the entire operational flow of PSMOT, encompassing the model's anti-occlusion inference and the cooperative online association algorithm.

The Two-Stage JDE Model
The overview of our proposed model is shown in Figure 1. The backbone network follows the encoder-decoder structure with a scheme of multi-layer feature fusion (see Section 3.1.1). The bottleneck network adopts hard parameter sharing and employs FD modules to generate task-specific feature maps (see Section 3.1.2). The anchors are automatically generated by means of adaptive anchor generation (see Section 3.1.3). Additionally, details of the task branches are described in Section 3.1.4 and the joint loss function of the model is described in Section 3.1.5.

Multi-Layer Feature Fusion
The proposed model has four tasks to complete: generation of proposals, classification, bounding box regression and the extraction of ReID features. Different tasks are sensitive to different features, which are derived from different layers of CNNs. Thus, to generate features which contain adequate information for all tasks, we employ the variant DLA-34 in FairMOT [6], which is a fully convolutional encoder-decoder network with multi-layer feature fusion, as our model's backbone.
As shown in Figure 1, the input frame $I$ with shape $H \times W \times 3$ is fed into the backbone, which outputs the feature map $F_0$ with shape $H_0 \times W_0 \times D$, where $H_0 = H/S$, $W_0 = W/S$, $S$ is the output stride and $D$ is the number of output channels. We set $S$ to 4 to generate a feature map with relatively high resolution.

Task-Specific Disentanglement
Through the fusion in Section 3.1.1, we obtain a shared feature map with strong representation, including semantic information at various levels. However, if we directly use this feature map in multi-task prediction, competition between tasks would result in problematic or compromised convergence and cause a significant decrease in the performance of each task branch. To address the issue, the shared feature map should be disentangled before being fed into each task branch.
As shown in Figure 1, we follow the idea of "Decouple Head" from YOLOv8 and utilize two CBL modules to implement feature disentanglement on each branch. The essence of this process is to enhance the spatial and dimensional features exclusive to the specific task and suppress the others.

Adaptive Sparse Anchors
The dense anchoring scheme associates every pixel in a feature map with a set of anchors with predefined scales and aspect ratios. To achieve sufficiently high recall, the total number of anchors should be large enough, which would inevitably increase the computational cost. In our model, this burden would be heavier in that the backbone outputs a high-resolution feature map. To improve efficiency, we instead resort to the adaptive anchoring scheme [14], which generates adaptive sparse anchors via a small fully convolutional network. This new module manages to achieve a higher recall rate with fewer anchors than the original RPN, which adopts the dense anchoring scheme.
As shown in Figure 1, the task-specific feature map $F_{rpn}$ is fed into two branches, the location prediction branch and the shape regression branch. The former produces a heatmap Hm indicating the probability of objects at each pixel location, and the latter generates a map of shape coefficients denoted Sm. Finally, the anchors are generated, first by selecting the locations where the probabilities of Hm exceed a certain threshold and then by choosing the most probable shape at each of the selected locations. Additionally, to keep the receptive field and semantic scope consistent with the shapes of anchors at different locations, a convolution layer is applied on the shape coefficient map to produce the offset map Om, which is used in the subsequent deformable convolution layer, employed to transform the task-specific features on the task branches.
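The anchor selection described above (threshold the heatmap, then take one shape per surviving location) might be sketched as below. This is an illustrative sketch only: the function name, the `loc_thresh` and `scale` parameters and the exponential decoding of the shape coefficients are assumptions in the spirit of guided anchoring, not values or formulas taken from this paper or confirmed for [14].

```python
import math


def generate_sparse_anchors(hm, sm, stride=4, loc_thresh=0.5, scale=8.0):
    """Turn a location heatmap and a shape map into sparse anchors.

    hm: 2D list, hm[y][x] is the object probability at a feature-map cell.
    sm: 2D list, sm[y][x] is a (dw, dh) pair of shape coefficients.
    Returns a list of (cx, cy, w, h) anchors in input-image pixels.

    Assumed decoding (hypothetical): w = scale * stride * exp(dw), and
    likewise for h, so that zero coefficients give a square anchor.
    """
    anchors = []
    for y, row in enumerate(hm):
        for x, prob in enumerate(row):
            if prob <= loc_thresh:
                continue  # Skip locations unlikely to contain an object.
            dw, dh = sm[y][x]
            w = scale * stride * math.exp(dw)
            h = scale * stride * math.exp(dh)
            # Centre of the feature-map cell mapped back to the image.
            anchors.append(((x + 0.5) * stride, (y + 0.5) * stride, w, h))
    return anchors
```

Because only cells above the threshold emit an anchor, the number of anchors scales with the number of likely objects rather than with the feature-map resolution, which is the efficiency argument made in the text.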

Branches of Tasks

(1) Classification
The classification branch is designed to classify the region proposals into object categories. As shown in Figure 1, the final 1 × 1 convolutional layer is applied on $F'_{cls}$ to produce $K^2$ groups of position-sensitive maps $M_{cls}$. Each group contains $C$ channels, where $C$ is the number of categories. Given the bounding box of a proposal parameterized as $(x, y, w, h)$, the position-sensitive RoI pooling/align is employed to produce a score map with the shape of $K \times K \times C$.
In detail, as shown in Figure 2, the bounding box is first divided into $K \times K$ bins, and the values in each bin are aggregated only from the corresponding group among the $K^2$ groups.

For instance, the $(i, j)$-th bin $a(i, j)$, which spans $X_i \le x < X_{i+1}$ and $Y_j \le y < Y_{j+1}$ of the bounding box, pools only over the corresponding region of the $(i \times K + j)$-th group score map $g(i \times K + j)$, where $X_i$, $X_{i+1}$, $Y_j$ and $Y_{j+1}$ are defined in Equation (1). Thereafter, the pixel-wise softmax function is applied on the score map to output the classification probability map with the same shape, as shown in Figure 1.

(2) Bounding Box Regression
The bounding box regression branch aims to align the bounding box to the corresponding object more precisely. Similar to the classification branch, the feature map $F'_{reg}$ is fed into the final convolutional layer to generate the position-sensitive maps $M_{reg}$, which have $4K^2$ channels, where the number 4 corresponds to the coefficients of a bounding box $(b_x, b_y, b_w, b_h)$, following the parameterization in [20]. Then, for each proposal, the position-sensitive RoI pooling/align is performed and produces the bounding box regression map with a shape of $K \times K \times 4$.

(3) ReID Feature Extraction
The objective of the ReID branch is to generate features that can distinguish different instances of the same class. Based on the embedding vectors used in human ReID, the branch is trained to establish an embedding space in which the vectors belonging to the same instance are close together, while those belonging to different instances are far apart, according to a proper measurement. The final convolutional layer with $D_t$ kernels is employed to transform the feature map, and an ordinary RoI pooling/align layer is appended to produce the ReID feature map with a shape of $K \times K \times D_t$ for each proposal. We follow the conclusion from FairMOT that, for MOT, learning lower-dimensional ReID features is more efficient, and thus set $D_t$ to 64. Note that position sensitivity is not introduced to the ReID branch, because a position-sensitive $M_{reid}$ would have too many channels to maintain balance with the other tasks.
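The position-sensitive RoI pooling used by the classification and regression branches can be illustrated with a small pure-Python sketch. It assumes average pooling, a box given by its top-left corner `(x, y, w, h)` in feature-map coordinates, and bin boundaries of the form $X_i = x + i \cdot w / K$ rounded outward; these choices and the function name are assumptions made for illustration, since Equation (1) is not reproduced in this text.

```python
import math


def ps_roi_pool(groups, box, K):
    """Position-sensitive RoI average pooling.

    groups: list of K*K groups; groups[g][c] is a 2D map (list of rows)
            for channel c of group g.
    box:    (x, y, w, h), top-left corner and size in feature-map cells.
    Returns score[i][j], a list of C pooled values, where bin (i, j)
    reads only from group i * K + j (i indexes x, j indexes y).
    """
    x0, y0, w, h = box
    C = len(groups[0])
    H, W = len(groups[0][0]), len(groups[0][0][0])
    score = [[[0.0] * C for _ in range(K)] for _ in range(K)]
    for i in range(K):
        xs = max(0, math.floor(x0 + i * w / K))
        xe = min(W, math.ceil(x0 + (i + 1) * w / K))
        for j in range(K):
            ys = max(0, math.floor(y0 + j * h / K))
            ye = min(H, math.ceil(y0 + (j + 1) * h / K))
            if xe <= xs or ye <= ys:
                continue  # Bin falls outside the feature map.
            group = groups[i * K + j]
            n = (xe - xs) * (ye - ys)
            for c in range(C):
                total = sum(group[c][yy][xx]
                            for yy in range(ys, ye)
                            for xx in range(xs, xe))
                score[i][j][c] = total / n
    return score
```

Each bin therefore sees a different group of channels, which is what makes a bin respond only to "its" part of the object; the per-RoI cost is a handful of averages, with no learned layers after pooling.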

Loss Function
The proposed network is optimized in an end-to-end fashion, employing task-independent uncertainty loss [27] to balance tasks automatically. The joint loss function is formed in Equation (2), where $w_i$ is the uncertainty weight for the loss of each task and can be learned as a parameter. Different from a linear summation of losses, this automatic balancing method removes the strong restriction on loss weights, so the result can be closer to the optimal value.
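As a rough illustration of uncertainty-based balancing, the sketch below uses a form commonly seen in JDE-style trackers, $L = \frac{1}{2} \sum_i \left( e^{-w_i} L_i + w_i \right)$. Whether Equation (2) has exactly this form is an assumption, since the equation is not reproduced in this text; the function name is hypothetical.

```python
import math


def uncertainty_weighted_loss(losses, weights):
    """Combine per-task losses with learnable uncertainty weights.

    losses:  list of scalar task losses L_i.
    weights: list of scalar uncertainty parameters w_i (same length).

    Assumed form: 0.5 * sum(exp(-w_i) * L_i + w_i). The exp(-w_i) factor
    down-weights a task as w_i grows, while the additive w_i term
    penalises growing w_i without bound.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return 0.5 * sum(math.exp(-w) * loss + w
                     for loss, w in zip(losses, weights))
```

In training, the `weights` would be trainable parameters updated by the optimizer together with the network; with all weights at zero the expression reduces to half the plain sum of the task losses.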
$L_{GA}$ denotes the total loss of the module in Section 3.1.3, which is a linear combination of the location prediction loss $L_{loc}$ and the shape regression loss $L_{shape}$, as shown in Equation (3), where $L_{loc}$ is a focal loss and $L_{shape}$ is a variant of the bounded IoU loss given in Equation (4), in which $w$ and $h$ represent the predicted width and height of anchors and the superscript $*$ marks the ground truth.
During training, the classification probability map for each proposal is additionally averaged to yield a vector with $C$ channels; the loss for the classification branch $L_\alpha$ is then formulated as Equation (5), where $N$ represents the number of objects in a frame, $i$ refers to the $i$-th object and $p_i^*$ is the ground truth label of the object's category ($p_i^* = 0$ signifies the background).
The bounding box regression map for each proposal is also averaged to yield a 4-d vector, and the loss for the bounding box regression branch $L_\beta$ is calculated as in Equation (6), where $b_i$ is the $i$-th estimated bounding box, $b_i^*$ is the corresponding ground truth vector and $[p_i^* > 0]$ is an indicator which equals 1 if the argument is true and 0 otherwise.
As for the ReID branch, we train it as a classification task. As shown in Figure 1, during training, the ReID feature map of each proposal undergoes an additional series of functions and is converted to a vector over a large number of categories; the objects of the same identity in the training set are treated as one class. Thus, after training, the ReID features learn to discriminate different instances. We denote the class distribution vector of a proposal as $c_i$ and the corresponding one-hot representation of its ground truth label as $c_i^*$, and we compute the ReID loss $L_\gamma$ as in Equation (7), where $N$ represents the number of objects and $M$ is the number of identity classes. Moreover, to take full advantage of the high-quality proposals generated by the module in Section 3.1.3, we set a relatively high positive/negative threshold and use fewer samples during training.

Online Tracking
The overview of our online tracking is shown in Figure 3. The whole process can be divided into network inference and online association.

Anti-Occlusion Inference
The anchors from Section 3.1.3 indicate the regions where objects of interest are likely to be present. To eliminate duplicate anchors, NMS is performed on the anchors to produce a number of region proposals. We denote the classification probability map, bounding box regression map and ReID feature map of a proposal $u$ in frame $t$ as $P_u^t$, $B_u^t$ and $E_u^t$, respectively. We first average the vectors in the $K \times K$ bins of $P_u^t$ to obtain $Avg_u^t$, as shown in Equation (8). Along the $C$ channels of $Avg_u^t$, the channel which contains the maximum value is determined as the proposal's category $cls_u^t$, as shown in Equation (9), and the maximum value is formulated as Equation (10). Then, we extract the $cls_u^t$-th map from $P_u^t$ and transform it into the visibility map $V_u^t$ by binarization based on $p_u^t$ and the RMS $\sigma_u^t$, as shown in Equation (11). $V_u^t(i, j) = 1$ indicates that an object of class $cls_u^t$ appears at position $(i, j)$; otherwise there could be either occlusion or background. To avoid taking in irrelevant cues, we only average the values at the positions where $V_u^t(i, j) = 1$ for the $cls_u^t$-th channel of $P_u^t$, $B_u^t$ and $E_u^t$. Thus, the probability of the category $cls_u^t$ is calculated as in Equation (12), the bounding box regression vector of the proposal as in Equation (13) and the ReID feature vector of the proposal as in Equation (14), where $n$ is the total number of positions where $V_u^t(i, j) = 1$. With $b_u^t$, the bounding box of the corresponding proposal $u$ is rectified. Then NMS is performed on all of the proposals to generate a certain number of detection candidates, which are passed to the online association stage.
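The visibility-guided aggregation can be sketched as follows. The exact binarization rule of Equation (11) is not reproduced in this text, so the threshold used here (mean minus one standard deviation of the bin responses, in the spirit of the "adaptive mean-std threshold" mentioned in the Introduction) is an assumption, as are the function names.

```python
import math


def visibility_map(cls_map):
    """Binarise a K x K map of class probabilities into a visibility map.

    Assumed rule (hypothetical): a bin is visible if its response is at
    least mean - std of all bin responses for the predicted class.
    """
    values = [v for row in cls_map for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    threshold = mean - std
    return [[1 if v >= threshold else 0 for v in row] for row in cls_map]


def masked_mean(vec_map, vis):
    """Average the K x K x D vectors of vec_map over visible bins only.

    Returns a list of D values, or None if no bin is visible.
    """
    picked = [vec_map[i][j]
              for i in range(len(vis))
              for j in range(len(vis[i]))
              if vis[i][j] == 1]
    n = len(picked)
    if n == 0:
        return None
    dim = len(picked[0])
    return [sum(vec[d] for vec in picked) / n for d in range(dim)]
```

The same `masked_mean` call would be applied to the regression map and the ReID feature map, so that a bin judged to be occluded contributes to none of the three outputs.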

Online Association
To further improve the stability of our MOT system, we follow the online association strategy from MOTDT [17], which utilizes reliable detection results to prevent tracking drift in the long term, and predictions of previous tracks to avoid missed or false detection caused by occlusions. The strategy originally adopts the Euclidean distance function to evaluate the similarity of ReID feature vectors between the detection candidates and the targets; instead, we utilize the cosine distance in our work. In addition, we use a linear blending function to update the ReID feature vectors of targets when they have been successfully associated with the detection candidates; for a matched pair $\langle Target_v^{t-1}, Detection_u^t \rangle$, the ReID feature vector $\varepsilon_v^t$ of $Target_v^t$ is updated as in Equation (15).
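A minimal sketch of the two operations mentioned above, cosine distance between ReID vectors and a linear blending update, is given below. The blending coefficient `eta` and its default value are hypothetical, since Equation (15) is not reproduced in this text; the function names are also illustrative.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b) between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # Treat a zero vector as maximally dissimilar.
    return 1.0 - dot / (norm_a * norm_b)


def blend_feature(target_feat, det_feat, eta=0.9):
    """Linear blending of a target's feature with a matched detection.

    eta (hypothetical) is the weight kept on the target's old feature;
    the matched detection's feature receives 1 - eta.
    """
    return [eta * t + (1.0 - eta) * d for t, d in zip(target_feat, det_feat)]
```

A large `eta` makes the target's appearance model change slowly, which limits the damage from a single contaminated detection but also slows adaptation to genuine appearance changes.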

Anti-Occlusion Tracking
Here we explain how tracking drift occurs. At first, an object is partially occluded but can still be detected. At this moment, however, the object's bounding box contains not only the object itself but also the occluder. The ReID feature vectors in the bounding box are then extracted and aggregated, bringing in interference from the occlusion. Being only partially occluded, the object is still successfully matched with the correctly tracked target. The ReID feature of the object is then blended into that of the tracked target as in Equation (15), so the ReID feature of the tracked target is contaminated by the occlusion. Thereafter, the object's ReID feature is always matched against a contaminated feature of the tracked target, which finally leads to tracking drift.
In our work, the proposed MOT system is armed with the capability of anti-occlusion by position sensitivity, which encodes information on position into the $K \times K$ bins in $P_u^t[cls_u^t]$ and $B_u^t$. For $P_u^t[cls_u^t]$, each bin responds with high confidence only to the corresponding part of an object. Therefore, we can directly use the strength of the response to determine whether the corresponding area is obstructed by occlusions, which might be either inter-class or intra-class. To exclude the irrelevant information from the occluded parts, we aggregate only the vectors in the bins that are not occluded. Therefore, tracking drift is effectively resolved by ensuring that the ReID features of the object and of its matched target remain uncontaminated.

Experiments
In this section, we apply PSMOT to online multi-pedestrian tracking and evaluate it on the corresponding public datasets.

Datasets
In order to train the proposed unified model, we combine eight public datasets, ETH [35], CityPerson [36], WiderPerson-traffic [37], CalTech [38], MOT16 [39], CUHK-SYSU [40], PRW [41] and TAO-person [42], to create a large-scale training set for pedestrian detection and ReID. The ETH, CityPerson and WiderPerson-traffic datasets are utilized to train only the classification and bounding box regression branches, because they only offer bounding box annotations. The remaining datasets, which offer both identity and box annotations, are used to train all the task branches. After training, we assess our method using the MOT17 [39] and MOT20 [43] testing sets.

Metrics
We evaluate the performance of the proposed unified model in three areas. First, the detection accuracy [44] (DetA) and localization accuracy [44] (LocA) are used to assess the detection performance. Second, the association accuracy [44] (AssA) evaluates the discriminability of the ReID features. Third, the number of identity switches of the targets [45] (IDs) and the number of fragments of the targets' trajectories [45] (Frag) are used to assess the quality of the predicted trajectories. Additionally, two comprehensive metrics are utilized to evaluate overall performance: the MOTA [46] and IDF1 [47]. The FPS is employed to measure the processing speed; its reciprocal is referred to as the inference time in seconds.
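For readers less familiar with the two comprehensive metrics, the sketch below writes out their commonly used definitions, MOTA = 1 − (FN + FP + IDs) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), together with the FPS-to-inference-time conversion mentioned above. The function names and the counts in the example are hypothetical and are not taken from the experiments in this paper.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA: one minus the ratio of all errors to ground-truth boxes."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1: harmonic mean of identity precision and identity recall."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def inference_time(fps: float) -> float:
    """Inference time in seconds per frame, the reciprocal of the FPS."""
    return 1.0 / fps


# Hypothetical counts, for illustration only.
example_mota = mota(false_negatives=100, false_positives=50,
                    id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=80, idfp=20, idfn=20)
example_time = inference_time(20.0)
```

With these made-up counts, the MOTA is 0.84, the IDF1 is 0.8, and 20 FPS corresponds to 0.05 s per frame.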

Implementation
The modified version of the DLA-34, whose parameters are pre-trained on the COCO dataset [48], is employed as the backbone of our unified model. For online pedestrian tracking, the number of categories is set to 1. By default, the dimension of the ReID features is 64 and the size of the spatial grid K is 5. The hyper-parameters of the module in Section 3.1.3 follow the parameterization in [14], with σ1 = 0.2, σ2 = 0.2, ω1 = 1 and ω2 = 0.1, and the number of proposals is manually limited to 500. The standard Adam optimizer [49] is employed to fit the model. Specifically, the number of training epochs is 30, and the learning rate is initialized as 0.02 and decreased by 10% at the 15th and 25th epochs. The batch size is set to 10. To further improve the performance of our trackers, standard training schemes, such as online hard example mining (OHEM) [50] and data augmentation techniques [51], are employed during training. The training takes about 43 h on two RTX 2080Ti GPUs.
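The learning-rate schedule stated above can be written as a small step function. This is a sketch under two assumptions that the text does not settle: epochs are counted from 1 and a drop takes effect from the named epoch onward, and "decreased by 10%" is read literally as multiplying by 0.9 (if a tenfold reduction is meant, `decay` would be 0.1 instead). The function name and its parameters are hypothetical.

```python
def learning_rate(epoch: int,
                  base_lr: float = 0.02,
                  milestones: tuple = (15, 25),
                  decay: float = 0.9) -> float:
    """Step schedule: multiply the rate by `decay` at each milestone.

    epoch is 1-indexed (assumption); decay=0.9 is the literal reading of
    "decreased by 10%" (assumption).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** drops


# Learning rate for every epoch of a 30-epoch run.
schedule = [learning_rate(e) for e in range(1, 31)]
```

Under these assumptions the rate is 0.02 for epochs 1 to 14, 0.018 for epochs 15 to 24 and 0.0162 for epochs 25 to 30.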

Ablation Studies

Multi-Layer Feature Fusion
In this section, we examine the efficacy of multi-layer feature fusion. In addition to the variant DLA-34 used in the proposed model, two models with other backbones are built as the control group. These are the ResNet-34 [52] and the FPN-34 [53], which is the ResNet-34 with the feature pyramid structure. All models have a stride of 4, and the ResNet-34 requires three extra up-sampling operations in order to maintain this stride.
The results are shown in Table 1. Comparing the FPN-34 with the ResNet-34, it is evident that the AssA, DetA and LocA improve significantly. We credit these improvements to the use of multi-layer feature fusion in the FPN-34. Furthermore, the DLA-34 achieves even better results with its encoder-decoder layout and additional levels of feature fusion. In particular, AssA, DetA and LocA increase by 4.4%, 6.3% and 5.1%, respectively, compared with the ResNet-34. High-precision detection and discriminative ReID features provide a strong foundation for tracking and lead to better tracking performance: the table shows a significant increase in MOTA and IDF1 and a decrease in IDs and Frag. Consequently, the results imply that our feature aggregation scheme effectively mitigates the conflicts between tasks and significantly improves the overall performance.

Feature Disentanglement
To evaluate the proposed feature disentanglement module, we compare it with the general method, which simply transforms features by a combination of a 3 × 3 convolution and a 1 × 1 convolution on each task branch.
The results are shown in Table 2. Our solution achieves a noticeable improvement in performance by substituting the dual CBL layers at the entrance of each task branch for the general module. The AssA, DetA and LocA increase by 2.6%, 2.4% and 2.5%, respectively, showing that our feature disentanglement method helps to better resolve the conflicts among tasks.

Generation of Adaptive Anchors
In this section, we compare two versions of PSMOT: PSMOT with an RPN based on adaptive anchor generation and PSMOT with the vanilla RPN. We vary the maximum number of proposals from 300 to 1000 and fix the IoU threshold at 0.6 in order to define the positive and negative samples.
The results are shown in Table 3. For the vanilla RPN, the MOTA of the MOT system improves by 1.4% when the maximum number of proposals increases from 300 to 1000, while the FPS noticeably decreases from 14.3 Hz to 5.1 Hz. In contrast, when adaptive anchor generation is applied to the RPN, the proposed PSMOT obtains substantially better performance in a shorter operating time: the FPS increases from 14.3 Hz to 22.4 Hz, the MOTA increases from 72.1% to 73.2%, and the IDF1 increases from 72.5% to 74.4%. We attribute the gains in overall performance to the higher yield of high-quality proposals produced by adaptive anchor generation, and the gains in FPS to the sparse anchoring scheme. Furthermore, it is evident that the system performance will continue to improve as we loosen the limit on the number of proposals in order to introduce more high-quality proposals, but the computation time will also increase.

Position Sensitivity
The position sensitivity in PSMOT is essential for handling occlusions, so we assess its effectiveness here. The results are shown in Table 4. It should be noted that, when K = 1, the position-sensitive pooling/align degenerates to global pooling/align and the position-sensitive feature maps reduce to ordinary feature maps; as a result, the anti-occlusion capability is removed from PSMOT. With the help of position sensitivity, the performance improves significantly, with only a small increase in running time. Specifically, the quality of the tracked trajectories improves markedly: the IDs reduce from 535 to 252 and the Frag from 878 to 504, indicating a successful suppression of the tracking drift issue.
We also look into the impact of the grid size K. As K increases, the proposals are divided at a finer granularity, which makes it easier to detect and associate the obstructed objects. However, the processing time also increases. When K = 9, PSMOT has almost lost its timeliness, while the performance gains have slowed noticeably. As the grid size rises, the position-sensitive maps become thicker, making convergence more challenging.
Table 4. Effects of the position sensitivity adopted to handle occlusion on the validation set of the MOT17. The optimal results are shown in bold. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.

Association Scheme
In this section, we evaluate and analyze the impact of the adopted association scheme on the performance of PSMOT.
ReID: the similarity between the detected objects and the tracking targets is based on ReID features.The Hungarian algorithm is adopted to finally decide which target a certain object is assigned to.
ReID + IoU and Kalman: for each tracking target, the Kalman filter is adopted to predict its bounding box in the current frame.The similarity between the detected objects and the tracking targets is, additionally, based on the IoU of bounding boxes.
ReID + IoU and Kalman + Hierarchy: the predicted targets and the detected objects in the current frame are all considered as candidates. The hierarchy step includes the selection of the candidates with a high confidence score, the calculation of the similarity between the candidates and the tracking targets based on ReID features and the IoU of bounding boxes, and the final assignment using the Hungarian algorithm. This combination constitutes the association algorithm adopted in PSMOT.
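The matching step shared by the schemes above can be sketched as follows. This is an illustration only: the equal weighting of appearance and IoU costs, the gating threshold `max_cost`, and all function names are assumptions rather than values from the paper; the track boxes are assumed to be already predicted by the Kalman filter; the hierarchical candidate selection is not modelled; and, to stay within the standard library, the optimal assignment is found by exhaustive search over permutations, which gives the same result as the Hungarian algorithm on small inputs but does not scale.

```python
import math
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def cosine_distance(u, v):
    """1 - cosine similarity of two ReID vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def associate(tracks, detections, weight=0.5, max_cost=0.7):
    """Match tracks to detections; each item is a (feature, box) pair.

    Returns (track_index, detection_index) pairs of the minimum-cost
    one-to-one assignment, dropping pairs whose cost exceeds max_cost.
    weight and max_cost are hypothetical values.
    """
    cost = [[weight * cosine_distance(tf, df)
             + (1.0 - weight) * (1.0 - iou(tb, db))
             for df, db in detections] for tf, tb in tracks]
    n_t, n_d = len(tracks), len(detections)
    # Exhaustive stand-in for the Hungarian algorithm (small inputs only).
    if n_t <= n_d:
        candidates = ([(t, d) for t, d in enumerate(p)]
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n_t), n_d))
    best = min(candidates,
               key=lambda pairs: sum(cost[t][d] for t, d in pairs))
    return [(t, d) for t, d in best if cost[t][d] <= max_cost]


tracks = [([1.0, 0.0], (0, 0, 10, 10)), ([0.0, 1.0], (20, 20, 30, 30))]
detections = [([0.0, 1.0], (21, 20, 31, 30)), ([1.0, 0.0], (1, 0, 11, 10))]
matches = associate(tracks, detections)
```

In the toy example, each track is paired with the detection that agrees with it in both appearance and position, so track 0 is matched to detection 1 and track 1 to detection 0. A detection that is far from every track in both respects stays unmatched and would be treated as a new target.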
The results are shown in Table 5. Even if we only use ReID features for association, our method still exhibits good performance. The addition of motion prediction and IoU matching together contributes gains of 1.3% and 1.0% to MOTA and IDF1, respectively, and reduces the IDs and Frag at the same time. Furthermore, utilizing a hierarchical matching and assignment scheme further boosts IDF1 by 2.2% and reduces IDs and Frag by 50 and 57, respectively, making the tracking process more sustainable. On the other hand, due to the growing number of candidates for matching and assignment, the operating speed drops from 19.0 FPS to 16.6 FPS.
Table 5. Evaluation of the impact of the association scheme on PSMOT. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance. The symbol ✓ means that the corresponding component is employed.

Comparisons with State-of-the-Art MOT Methods
In this part, we compare the performance of PSMOT with preceding SOTA online MOT trackers on the test sets of MOT17 and MOT20. In order to evaluate our approach more thoroughly, we prepare three versions of PSMOT: PSMOT-Fast, which focuses on timeliness; PSMOT-Balance, which focuses on the balance between performance and timeliness; and PSMOT-Pro, which focuses on performance. The detailed configurations are shown in Table 6.

Comparisons with Typical Methods
First, we select two representative methods that employ technical principles similar to those of PSMOT for comparative analysis: FairMOT and RelationTrack. The results are shown in Table 7.
PSMOT vs. FairMOT: Both PSMOT and FairMOT achieve feature aggregation through the DLA-34 network. However, FairMOT applies linear convolutional operations to the shared feature map when it is passed to each task branch, while PSMOT employs non-linear convolutional operations, as mentioned in Sections 3.1.2 and 4.4.2. Besides, FairMOT performs classification, bounding box regression and ReID feature extraction based on points, while PSMOT adopts a two-stage approach: in the first stage, it generates region proposals based on points, and in the second stage, it performs classification, bounding box regression and ReID feature extraction within the proposals' areas and utilizes position sensitivity to exclude the occluded parts. As shown in the first row and the fourth row of the table, with a slight increase in parameter size (24.4 M vs. 24.8 M) and inference time (25.9 FPS vs. 20.0 FPS), PSMOT demonstrates a significant advantage in performance when compared with FairMOT.
PSMOT vs. RelationTrack: RelationTrack also employs the DLA-34 network to achieve feature aggregation and utilizes the Global Context Disentangling (GCD) module to decouple the shared feature map into detection-specific and ReID-specific feature maps. However, it does not consider the conflicts between classification and localization within the detection task and only processes the detection-specific feature map with linear convolutions on the two sub-task branches. In contrast, PSMOT directly employs non-linear convolutions on the task branches to disentangle the shared feature map into proposal-specific, classification-specific, localization-specific and ReID-specific feature maps, which further alleviates the conflicts among all tasks. As shown in the table, the PSMOT series outperforms RelationTrack in terms of MOTA, LocA and DetA, at the cost of more parameters (24.8 M, 24.9 M and 25.0 M vs. 22.7 M). Additionally, RelationTrack employs the Guided Transformer Encoder (GTE) module to enhance the ReID feature map with a global self-attention mechanism, while PSMOT generates a visibility map for each proposal by position sensitivity and uses it to exclude the occluded parts. From the table, we can see that PSMOT-Fast performs slightly worse than RelationTrack in terms of tracking-related metrics, such as IDF1, AssA, IDs and Frag. As the region of each proposal in PSMOT becomes more finely divided, the tracking performance gradually approaches that of RelationTrack. As shown in the last row of the table, PSMOT-Pro, which divides each proposal into a 7 × 7 grid, exhibits a tracking performance superior to that of RelationTrack.
Table 8 demonstrates that, in spite of its sluggish operating speed, PSMOT-Pro outperforms its counterparts by significant margins in terms of the performance-related metrics. Meanwhile, PSMOT-Balance and PSMOT-Fast, when compared with the MOT methods based on one-stage JDE models, provide exceptional performance while maintaining timeliness. In particular, PSMOT-Balance performs better than FairMOT, CSTrack and RelationTrack by 3.5%, 3.2% and 1.1% in the IDF1 metric at a running speed of 16.6 FPS and produces low IDs and Frag on MOT17. Even on MOT20, where the scenes are more crowded and intricate, PSMOT-Balance still surpasses them by a large margin. Furthermore, PSMOT-Fast performs better than FairMOT on both MOT17 and MOT20, matching FairMOT's speed on MOT20 and reaching nearly real-time speed on MOT17.
Table 8. Comparison of PSMOT with other methods on the test sets of MOT17 and MOT20. The optimal results are shown in red bold and the sub-optimal are in bold and underlined. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.

The generation of visibility maps based on position sensitivity is shown visually in Figure 4. Note that each of the 3 × 3 bins is sensitive to a different part of the human body. For example, the top-left bin is sensitive to the left shoulder, while the top-middle bin is sensitive to the head and neck.
Because of the overlap of different parts, the value in each bin of the probability map can be directly used to determine whether the corresponding part is obscured by another instance of the same class. Next, we use a mean-deviation threshold to convert the probability map into a binary map, which explicitly indicates the visibility of distinct parts.
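The conversion from a probability map to a binary visibility map can be sketched as follows. The exact form of the mean-deviation threshold is not spelled out in this section, so the sketch assumes the threshold is the mean of the bin probabilities minus `scale` times their standard deviation; the function name `visibility_map`, the `scale` parameter and the example values are all hypothetical.

```python
import statistics


def visibility_map(prob_map, scale=1.0):
    """Binarize a K x K classification probability map.

    A bin is marked visible (1) if its probability is at least
    mean - scale * std over all bins, and occluded (0) otherwise.
    The threshold form is an assumption, not the paper's definition.
    """
    flat = [p for row in prob_map for p in row]
    threshold = statistics.mean(flat) - scale * statistics.pstdev(flat)
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# 3 x 3 example: the bottom-left part responds weakly because it is
# covered by another person.
probabilities = [[0.9, 0.9, 0.9],
                 [0.9, 0.9, 0.9],
                 [0.1, 0.9, 0.9]]
visible = visibility_map(probabilities)
```

In this example only the bottom-left bin falls below the threshold, so it is the only bin excluded from the subsequent aggregation.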
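We can see in the figure that the man in the purple box blocks the bottom-left and bottom-right parts of the men in the yellow and red boxes, respectively. Their visibility maps clearly illustrate how they are obscured. Consequently, by filtering out the occluded parts, the aggregated results are more dependable than those aggregated via global averaging.

Visualization of Detection and Tracking of Occluded Targets
Since the region proposals are broken down into bins, as Figure 5 illustrates, our approach has an advantage when it comes to identifying the obstructed parts. As shown in the figure, even though the lady is partially blocked by the man in front, her exposed parts can still yield a clean ReID feature, thus avoiding contamination of the recorded ReID feature in the tracking pool. Figure 6 displays the variation in similarity between the lady's ReID feature in each frame and the corresponding feature in the tracking pool. The lady vanishes at frame 80 and her track is interrupted for several frames. When she reappears at frame 150, her current ReID feature still maintains high similarity with her feature in the tracking pool.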

Visualization of Online Tracking
The overall visual results are shown in Figure 7. From the results of MOT17-08 and MOT20-08, we observe that PSMOT manages to detect objects and maintain their identities in challenging scenes with frequent occlusions, which is mainly attributed to its ability to coordinate multiple tasks and exclude interference caused by occlusion. From the results of MOT17-07 and MOT17-08, we also see PSMOT's robustness against large variations in object scale, which is mainly due to the fact that the backbone network aggregates features from different resolutions.
Additionally, we present visual comparisons of PSMOT, FairMOT and RelationTrack in some typical cases from MOT17-03. Figure 8 shows the results of the three trackers in handling false objects, from which we can see that, during the tracking process, FairMOT generates duplicate bounding boxes with the same ID number and RelationTrack assigns two different ID numbers to the same object, while PSMOT maintains the unique ID number of the object. Both FairMOT and RelationTrack detect objects and extract their ReID features in a one-shot manner and subsequently utilize NMS to filter out duplicates and objects with low confidence. Because of the fixed thresholds, NMS is unable to filter out the invalid objects completely and accurately, which leads to the results in the first and second rows of the figure. In contrast, PSMOT achieves object detection and feature extraction in a two-stage approach: in the first stage, multiple region proposals are generated and, in the second stage, the information within each region is utilized to determine whether the proposal is background or an object. Thus, the two-stage approach, together with the subsequent NMS, eliminates false proposals more efficiently, resulting in the last row of the figure.
Figure 9 illustrates the results of the above trackers in handling interference caused by occlusion. At frame 736, since the object is almost completely occluded, none of the methods can detect it, and instead they rely on the motion model to predict the object's location. FairMOT and PSMOT successfully predict the location of the occluded object using the Kalman filter, while the trajectory-filling strategy employed by RelationTrack falsely filters out this prediction. Furthermore, at frame 760, both FairMOT and RelationTrack assign an incorrect ID number to the reappearing object, which is attributed to the contamination of the template feature of the corresponding target in the tracking pool. As the object enters the occluded area, its ReID feature is interfered with by the occlusion and is directly used to update the template feature of the target with that ID number in the tracking pool; after the object leaves the occluded area, its newly extracted ReID feature fails to match the contaminated template feature of the original target, and thus the object is recognized as a newcomer or as another target. In contrast, through the object's position-sensitive classification map, PSMOT determines the occluded parts of the object and excludes their interference during the aggregation of the ReID feature, so as to prevent subsequent contamination. As shown in the last row of the figure, PSMOT maintains the correct ID number for the object when it passes through the occluded area.

Conclusions
In this paper, we unleash the potential of two-stage JDE models for handling feature misalignment and occlusion in MOT. To achieve an ideal two-stage JDE model, efforts are made as follows. To maintain timeliness, the proposed model is fully convolutional, and its RoI-wise process only involves simple statistical operations. Furthermore, the dense and predefined anchoring scheme is replaced with a sparse and adaptive anchoring scheme in the first-stage RPN, which is able to produce more high-quality proposals with fewer anchors. To reach high performance by addressing the multi-task learning problem, feature aggregation is accomplished by the model's encoder-decoder backbone with a deep level of multi-layer feature fusion, and feature disentanglement by hard parameter sharing. To improve robustness, position sensitivity is further applied to evaluate the visibility of each proposal's parts and guide the aggregation of the tasks' results, while excluding interference. To make a sustainable MOT system, the hierarchical association algorithm in MOTDT is employed. The experimental results demonstrate the high performance of the proposed method.
While this study provides valuable insights into the design of JDE models, there are still some limitations. First, PSMOT is currently implemented only for tracking pedestrians, so the experimental results are not comprehensive enough. In our future work, we aim to extend PSMOT to handle scenarios with multiple objects of multiple categories, such as mixed traffic scenarios involving pedestrians, vehicles and bicycles. Second, in this paper we disentangle the shared feature map using multiple non-linear convolutions, which are independent of each other. In principle, this arrangement is hard parameter sharing, which comes at the cost of increased parameter size and computational burden. In the future, we plan to explore networks based on soft parameter sharing to further improve the efficiency of JDE models.

Figure 1. Overview of the proposed model.

Sensors 2024, 24, x FOR PEER REVIEW
(3) ReID Feature Extraction
The objective of the ReID branch is to generate features that can distinguish different instances of the same class. Based on the embedding vectors used in human ReID, the branch is trained to establish an embedding space in which the vectors belonging to the same instance are close by, while those belonging to different instances are far away, according to a proper measurement. The final convolutional layer with Dt kernels is employed to transform the feature map, and the general RoI pooling/align layer is applied to produce the ReID feature map with a shape of […] for each proposal. We follow the conclusion from FairMOT that, for MOT, learning lower-dimensional ReID features is more efficient, and thus set Dt to 64. Note that position sensitivity is not introduced to the ReID branch, in that the Mreid with position sensitivity could be too […] to maintain balance with other tasks.

Figure 2. An example of the position-sensitive RoI pooling operation.

Figure 3. Overview of the online tracking process.

Figure 4. Example of the generation of targets' visibility maps. Columns 2 to 4: the 3 × 3 position-sensitive maps of classification probability fused with the raw frame and the process of the position-sensitive pooling/align. Columns 6 to 7: the assembled classification probability maps and the visibility maps for the three people in this case.

Figure 5. Example of detecting and tracking a frequently occluded target. The visibility map shows the visible parts of the lady behind the man in white.

Figure 6. The last row quantifies the similarity between the lady's ReID feature in each frame and that stored in the tracking pool.

Figure 9. Visual comparison of PSMOT, FairMOT and RelationTrack for handling occlusion on MOT17-03. The lady in the pink blouse is reaching the foreground and becomes partially occluded at frames 690 and 720, is fully in the blind spot at frame 736 and reappears at frame 760.

Table 1. Comparison of different backbones based on ResNet-34 on the validation set of the MOT17. The optimal results are shown in bold. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.

Table 2. Comparison of different strategies for feature disentanglement on the validation set of the MOT17. The optimal results are shown in bold. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.

Table 3. Comparison of the different strategies for the RPN on the validation set of the MOT17. The optimal results are shown in bold. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.

Table 6. Configurations of different versions of PSMOT.

Table 7. Comparisons with typical online MOT methods on the test sets of MOT17. The 'Params' in the last column of the table show the parameter size of each model. The symbol ↑ (↓) indicates that the higher (lower) the value of the metric, the better the performance.