Applied Sciences
  • Article
  • Open Access

10 December 2025

OcclusionTrack: Multi-Object Tracking in Dense Scenes

1
School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
2
Intelligent Perception and Control Key Laboratory of Sichuan Province, Yibin 644000, China
*
Author to whom correspondence should be addressed.

Abstract

This paper presents OcclusionTrack (OCCTrack), a robust multi-object tracker designed to address occlusion challenges in dense scenes. Occlusion remains a critical issue in multi-object tracking; despite significant advancements in current tracking methods, dense scenes and frequent occlusions continue to pose formidable challenges for existing tracking-by-detection trackers. Therefore, four key improvements are integrated into a tracking-by-detection paradigm: (1) a confidence-based Kalman filter (CBKF) that dynamically adapts measurement noise to handle partial occlusions; (2) camera motion compensation (CMC) for inter-frame alignment to stabilize predictions; (3) a depth–cascade-matching (DCM) algorithm that uses relative depth to resolve association ambiguities among overlapping objects; and (4) a CMC-detection-based trajectory Re-activate method to recover and correct tracks after complete occlusion. Despite relying solely on IoU matching, OCCTrack achieves highly competitive performance on MOT17 (HOTA 64.9, MOTA 80.9, IDF1 79.7), MOT20 (HOTA 63.2, MOTA 76.9, IDF1 77.5), and DanceTrack (HOTA 57.5, MOTA 91.4, IDF1 58.4). The primary contribution of this work lies in the cohesive integration of these modules into a unified, real-time pipeline that systematically mitigates both partial and complete occlusion effects, offering a practical and reproducible framework for complex real-world tracking scenarios.

1. Introduction

The multi-object tracking (MOT) task aims to obtain the spatial–temporal trajectories of multiple objects in videos. The tracking targets in multi-object tracking encompass various types: pedestrians, vehicles, animals, and more. Currently, there are three mainstream paradigms for multi-object tracking: (1) tracking-by-detection (TBD); (2) joint detection and tracking (JDT); and (3) transformer-based tracking. Through experiments, it has been demonstrated that TBD is currently the strongest paradigm in MOT [1,2,3].
TBD includes two main stages: detection and tracking. The working process of TBD is to input video frames into the detector to generate detection bounding boxes, then run the tracking algorithm to obtain the final tracking results. The tracking stage has two main parts. (1) The motion model, which predicts the states of trajectories at the current frame to generate bounding boxes. Most TBD methods employ the Kalman filter [4] as the motion model. (2) Associating new detections with the current set of tracks. The association task mainly computes pairwise costs between trajectories and detections, then solves the cost matrix with the Hungarian algorithm to obtain the final matching results. There are two mainstream methods for cost calculation: (a) localizing the objects through the intersection-over-union (IoU) between the predicted bounding boxes and the detection bounding boxes [1,3]; and (b) the appearance model [2,5,6], which computes the similarity between appearance features. Both measures are converted into distances and used to solve the association task as a global assignment problem.
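To make the association step concrete, the following minimal sketch (not the authors' code) builds a 1 − IoU cost matrix between predicted track boxes and detections and solves it globally with the Hungarian algorithm via SciPy's linear_sum_assignment; the helper names and the 0.3 gating threshold are illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    # Boxes given as [x1, y1, x2, y2].
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_thresh=0.3):
    """Solve IoU-based association as a global assignment problem (illustrative sketch)."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments whose IoU exceeds the gating threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - iou_thresh]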
While simple and effective, mainstream TBD methods like SORT [1], DeepSORT [2], and ByteTrack [3] exhibit significant limitations in dense and occluded scenes, primarily stemming from their core components:
(1)
Vulnerability to Partial Occlusions: Occlusions pose a major challenge. Partial occlusions can degrade detection quality, but the KF’s update stage typically uses a uniform measurement noise regardless of detection quality, failing to account for this variability.
(2)
Sensitivity to Camera Motion: The performance of IoU-based tracking is highly dependent on the accuracy of the Kalman filter’s predicted boxes. Camera motion can cause coordinate shifts between frames, leading to prediction offsets and subsequent association failures.
(3)
Ineffective Association under Occlusion: During the association stage, occlusion poses significant challenges to matching methods. Because occlusion arises from overlapping targets, the positions and sizes of the bounding boxes for these overlapping targets become extremely close, and conventional matching algorithms fail to deliver satisfactory performance.
(4)
Vulnerability to Complete Occlusions: Complete occlusions cause tracks to be lost, meaning that the detector cannot detect that trajectory in the image, during which the KF continues to predict without correction, accumulating significant error and hindering the target’s tracking performance after rematching.
To address these interconnected challenges, we propose a comprehensive framework that enhances the TBD pipeline at multiple stages:
  • We introduce a confidence-based Kalman filter (CBKF) that dynamically adjusts the measurement noise based on detection confidence, improving robustness to partial occlusions.
  • We incorporate camera motion compensation (CMC) [5] to align frames, thereby rectifying the prediction offsets in the KF and boosting the reliability of IoU-based association.
  • We integrate depth–cascade-matching (DCM) [7] into the association phase. By leveraging relative depth information to partition targets and perform matching within the same depth level, DCM effectively resolves ambiguities caused by inter-object occlusions.
  • For tracks recovering from complete occlusion, we employ the CMC-detection-based Re-activate method to correct accumulated errors in the track state and KF parameters, which ensures that the target can be successfully associated in subsequent frames.

3. Methods

We present four improvements in this paper, and we use ByteTrack [3] as the baseline model. Our tracker only uses IoU matching for data association. The framework of our tracker is shown in Figure 1.
Figure 1. The overall framework of our tracker. CMC is included in the CBKF prediction in this figure.

3.1. Confidence-Based Kalman Filter

In recent years, there have been two mainstream motion models in tracking-by-detection: (1) the Kalman filter [4]; and (2) deep learning motion models. The core task of the Kalman filter is state estimation. Given the past state information of a target, it predicts the target’s potential location at the next frame based on a constant-velocity kinematic model. This predicted state is then refined using the current detection bounding box to produce the final result. Deep learning models are primarily data-driven, relying on large training datasets and appropriate training methods, and they often demonstrate strong performance in scenarios with highly complex motion patterns. However, in multi-pedestrian tracking scenarios, target motion patterns tend to be relatively simple and frame intervals are short, so pedestrian movements over short durations generally align well with the Kalman filter’s prior assumptions about target motion. Consequently, the Kalman filter can deliver sufficiently robust performance in such scenarios. Deep learning models, due to their data-driven nature, require large amounts of high-quality training samples for pedestrian tracking scenarios and carry the risk of overfitting. Furthermore, compared to the Kalman filter, deep learning models have a vast number of parameters, increasing computational costs. Therefore, we select the Kalman filter as the motion model.
The Kalman filter [4] is employed in multi-object tracking for motion state prediction and for information fusion after matching. In earlier trackers, the Kalman filter state vector has commonly been defined using the bounding-box scale and aspect ratio, e.g., $x = [x_c, y_c, s, a, \dot{x}_c, \dot{y}_c, \dot{s}]^\top$. However, the experiments in BoT-SORT [5] demonstrated that directly predicting the bounding-box width and height yields better results than predicting the aspect ratio. Therefore, we adopt the eight-dimensional state vector from BoT-SORT, as detailed in Equation (1). The measurement vector is defined in Equation (2).
$x_k = [\,x_c(k),\; y_c(k),\; w(k),\; h(k),\; \dot{x}_c(k),\; \dot{y}_c(k),\; \dot{w}(k),\; \dot{h}(k)\,]^\top$  (1)
$z_k = [\,z_{x_c}(k),\; z_{y_c}(k),\; z_w(k),\; z_h(k)\,]^\top$  (2)
Here, $x_c(k)$ and $y_c(k)$ denote the horizontal and vertical coordinates of the center of the bounding box, while $w(k)$ and $h(k)$ represent its width and height.
The Kalman filter performs two functions: prediction in Equations (3) and (4) and update in Equations (5)–(7). F and H are in Equations (8) and (9). In multi-object tracking, prediction generates the predicted state of trajectories, while update combines the detection boxes and predicted boxes to produce the final tracking results.
$\hat{x}_{k|k-1} = F_k\, \hat{x}_{k-1|k-1} + B_k u_k$  (3)
$P_{k|k-1} = F_k\, P_{k-1|k-1}\, F_k^\top + Q_k$  (4)
$K_k = P_{k|k-1} H_k^\top \left( H_k P_{k|k-1} H_k^\top + R_k \right)^{-1}$  (5)
$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H_k\, \hat{x}_{k|k-1} \right)$  (6)
$P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1}$  (7)
$F = \begin{bmatrix} I_4 & I_4 \\ 0_4 & I_4 \end{bmatrix} \in \mathbb{R}^{8 \times 8}$  (8)
$H = \begin{bmatrix} I_4 & 0_4 \end{bmatrix} \in \mathbb{R}^{4 \times 8}$  (9)
where $I_4$ and $0_4$ denote the $4 \times 4$ identity and zero matrices, and the velocity block of $F$ corresponds to a unit frame interval.
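As an illustration of Equations (1), (2), (8), and (9), the following sketch assembles the constant-velocity transition matrix F and the measurement matrix H with NumPy, assuming a unit frame interval; it is a minimal reading of the equations above, not the authors' implementation.

import numpy as np

# State x = [xc, yc, w, h, vxc, vyc, vw, vh] as in Eq. (1); unit frame interval assumed.
dt = 1.0
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)   # each position/size component is advanced by its velocity

# Measurement z = [xc, yc, w, h] as in Eq. (2); H selects the first four state components.
H = np.zeros((4, 8))
H[:4, :4] = np.eye(4)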
However, through experiments and prior studies, we found a pressing issue with the traditional Kalman filter. In dense scenes, occlusions occur frequently and pose significant challenges to detector performance: severe occlusions degrade detection results. Nevertheless, the traditional Kalman filter computes the measurement noise in the same way regardless of detection quality, so trackers suffer substantial performance degradation when confronting occlusion. Therefore, based on detection confidence, an intuitive metric for quantifying the quality of detection results, we design a non-linearly varying coefficient to scale the measurement noise $R(conf)$ in Equation (10).
$R(conf) = R\, e^{-\beta (conf - c_0)}$  (10)
Here, $\beta$ is an adjustable hyperparameter that controls the rate of change of the scaling factor, and $c_0$ is the confidence threshold that determines whether a detection result is considered reliable. According to the properties of the exponential function, within the interval $(c_0, 1]$ the curve flattens as the confidence increases and the scaling factor approaches zero, whereas within $[0, c_0]$ the curve becomes steeper as the confidence decreases and the scaling factor grows very large. The curve is shown in Figure 2. This indicates that when confidence is high, the tracker fully trusts the detection result, while when confidence is low, it places greater trust in the predicted state. The improved Kalman filter update equations are given in Equations (11)–(13).
Figure 2. The upper plot shows the confidence-based scaling factor $e^{-\beta (conf - c_0)}$ for confidence values from 0 to 1; the lower plot shows the same curve for confidence values from 0.7 to 1. The horizontal axis represents the confidence score.
Furthermore, to evaluate the performance of different types of scaling factors, we conduct a comparative experiment on the MOT17 validation set. The results are shown in Table 2.
$K_k = P_{k|k-1} H_k^\top \left( H_k P_{k|k-1} H_k^\top + R(conf) \right)^{-1}$  (11)
$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H_k\, \hat{x}_{k|k-1} \right)$  (12)
$P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1}$  (13)
Table 2. The impact of different scale factors on MOT17.
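A minimal sketch of the CBKF update of Equations (10)–(13) is given below; it assumes a precomputed base measurement noise R_base and NumPy matrices shaped as in the standard Kalman filter, with the default β = 8 and $c_0$ = 0.7 taken from Section 4.3. It is illustrative rather than the exact implementation.

import numpy as np

def cbkf_update(x_pred, P_pred, z, conf, H, R_base, beta=8.0, c0=0.7):
    """Kalman update with confidence-scaled measurement noise (Eqs. (10)-(13))."""
    R_conf = R_base * np.exp(-beta * (conf - c0))        # Eq. (10): inflate R for low-confidence boxes
    S = H @ P_pred @ H.T + R_conf                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                  # Eq. (11)
    x_new = x_pred + K @ (z - H @ x_pred)                # Eq. (12)
    P_new = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred   # Eq. (13)
    return x_new, P_new

With β = 8 and $c_0$ = 0.7, a detection with confidence 0.95 shrinks the measurement noise by roughly $e^{-2} \approx 0.14$, while a confidence of 0.45 inflates it by about $e^{2} \approx 7.4$, so low-quality boxes barely pull the state away from the prediction.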

3.2. Camera Motion Compensation

Tracking-by-detection trackers that rely solely on the IoU distance place significant emphasis on the quality of the prediction boxes. However, in complex scenarios where the camera is not fixed, camera motion causes shifts in the reference coordinate system between consecutive frames. In such cases, the Kalman filter [4] continues to output predicted states based on the previous frame’s coordinate system, resulting in prediction boxes with substantial bias. Therefore, in recent years, some trackers have begun to apply camera motion compensation to register the coordinate systems of consecutive frames [5,6]. We introduce the global motion compensation (GMC) technique [5], implemented with OpenCV [12]. This registration method captures the background motion between adjacent frames. First, image key-points are extracted [13], followed by feature-point tracking using sparse optical flow. Then, a system of equations is constructed from the corresponding feature points, and the RANSAC algorithm [14] is used to solve it and obtain the affine matrix in Equation (14).
$A_{2 \times 3} = \left[\, s_{2 \times 2} \mid z_{2 \times 1} \,\right] = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$  (14)
Here, $s \in \mathbb{R}^{2 \times 2}$ represents the rotation and scaling components of the affine matrix, while $z \in \mathbb{R}^{2 \times 1}$ represents the translation component. Since the predicted state $x$ is an eight-dimensional vector, each dimension requires transformation. Therefore, the $S \in \mathbb{R}^{8 \times 8}$ and $Z \in \mathbb{R}^{8}$ that satisfy the tracking requirements are given in Equation (15). $S$ and $Z$ are then applied to transform the predicted state and the predicted covariance matrix in Equations (16) and (17).
$S = \begin{bmatrix} s & 0 & 0 & 0 \\ 0 & s & 0 & 0 \\ 0 & 0 & s & 0 \\ 0 & 0 & 0 & s \end{bmatrix}, \qquad Z = [\, a_{13},\; a_{23},\; 0,\; 0,\; 0,\; 0,\; 0,\; 0 \,]^\top$  (15)
$\hat{x}'_k = S\, \hat{x}_{k|k-1} + Z$  (16)
$P'_k = S\, P_{k|k-1}\, S^\top$  (17)
where $\hat{x}_{k|k-1}$ denotes the predicted state before transformation, $P_{k|k-1}$ denotes the predicted covariance matrix before transformation, $\hat{x}'_k$ denotes the predicted state after transformation, and $P'_k$ denotes the predicted covariance matrix after transformation. After CMC processing, the predicted state is expressed in the current frame’s coordinate system. The results are shown in Figure 3.
Figure 3. The left image is the predicted state before CMC. The right image is the predicted state after CMC.
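The sketch below illustrates one plausible realization of this GMC step and of Equations (15)–(17): Shi–Tomasi corners, sparse Lucas–Kanade optical flow, and a RANSAC-fitted partial affine transform, all via OpenCV, followed by the block-diagonal transform of the eight-dimensional state. The OpenCV calls are real; the overall wiring, the downscale handling, and the helper names are illustrative assumptions rather than the exact implementation.

import cv2
import numpy as np

def estimate_affine(prev_gray, curr_gray, downscale=4):
    """Estimate the 2x3 affine A = [s | z] between consecutive grayscale frames (Eq. (14))."""
    h, w = prev_gray.shape
    prev_s = cv2.resize(prev_gray, (w // downscale, h // downscale))
    curr_s = cv2.resize(curr_gray, (w // downscale, h // downscale))
    pts = cv2.goodFeaturesToTrack(prev_s, maxCorners=500, qualityLevel=0.01, minDistance=3)
    if pts is None:
        return np.eye(2, 3)                                # no key-points: assume no camera motion
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_s, curr_s, pts, None)
    good_prev = pts[status.flatten() == 1]
    good_next = nxt[status.flatten() == 1]
    A, _ = cv2.estimateAffinePartial2D(good_prev, good_next, method=cv2.RANSAC)
    if A is None:
        return np.eye(2, 3)
    A[:, 2] *= downscale                                   # scale translation back to full resolution
    return A

def apply_cmc(A, x_pred, P_pred):
    """Transform the 8-dim predicted state and covariance into the current frame (Eqs. (15)-(17))."""
    s = A[:, :2]
    S8 = np.kron(np.eye(4), s)                             # block-diagonal diag(s, s, s, s), Eq. (15)
    Z8 = np.zeros(8)
    Z8[0], Z8[1] = A[0, 2], A[1, 2]                        # translation applied to (xc, yc) only
    return S8 @ x_pred + Z8, S8 @ P_pred @ S8.T            # Eqs. (16) and (17)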

3.3. Depth–Cascade-Matching

Occlusion refers to the overlapping of multiple objects. In dense scenes, frequent occlusions make multiple targets highly prone to overlapping. In such cases, both their prediction boxes and detection boxes become extremely close, making conventional matching methods susceptible to ID switches. Therefore, we introduce the depth–cascade-matching (DCM) algorithm [7], shown in Figure 4, to address occlusion issues during the matching phase. The DCM algorithm consists of two main stages: calculating the pseudo-depth values and performing cascade matching based on the obtained pseudo-depth values.
Figure 4. The framework of DCM is shown in this figure.
For calculating the pseudo-depth value, we first obtain the connection point A between the target and the ground in three-dimensional space. This connection point A   is then projected onto the camera plane, with the projection point denoted as B .
Secondly, a perpendicular line is drawn from point B to the bottom edge of the image, with the intersection point denoted as C . The Euclidean distance between B and C is calculated to serve as the pseudo-depth value. The pseudo-depth value, in a certain sense, represents the occlusion relationships between objects. The computational process is shown in Figure 5.
Figure 5. The illustration of depth value.
After obtaining the pseudo-depth values, the tracker performs cascade matching. For the detection set, we divide the depth values of the detections into k intervals between the maximum depth value $S_{max}$ and the minimum depth value $S_{min}$ using linear partitioning. Based on their depth values, the detections are assigned to the corresponding intervals, forming k subsets $\{D_0, D_1, \ldots, D_{k-1}\}$, where k denotes the highest depth level. The same operation is then used to split the track set into k subsets $\{T_0, T_1, \ldots, T_{k-1}\}$.
Trajectories and detections at the same depth level are matched, with the matching process proceeding from lower to higher depth levels. Our tracker uses only the IoU distance for matching, solving the cost matrix with the Hungarian algorithm. After the data association at each depth level, trajectories and detections that remain unmatched are passed to the next level, as sketched below.
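The following sketch illustrates the pseudo-depth computation and the level-wise cascade described above; it reuses the associate (IoU + Hungarian) helper from the Introduction sketch, and the linear binning and boundary handling are illustrative assumptions rather than the exact implementation.

import numpy as np

def pseudo_depth(box, img_h):
    """Distance from the box's bottom edge to the bottom of the image ([x1, y1, x2, y2] format)."""
    return img_h - box[3]

def split_by_depth(indices, boxes, img_h, k):
    """Partition item indices into k subsets by linearly binned pseudo-depth."""
    if len(indices) == 0:
        return [[] for _ in range(k)]
    depths = np.array([pseudo_depth(boxes[i], img_h) for i in indices])
    edges = np.linspace(depths.min(), depths.max(), k + 1)
    levels = np.clip(np.searchsorted(edges, depths, side="right") - 1, 0, k - 1)
    return [[indices[j] for j in np.where(levels == lvl)[0]] for lvl in range(k)]

def depth_cascade_match(track_boxes, det_boxes, img_h, k=3):
    """Match level by level, nearest first; unmatched items fall through to the next level."""
    t_levels = split_by_depth(list(range(len(track_boxes))), track_boxes, img_h, k)
    d_levels = split_by_depth(list(range(len(det_boxes))), det_boxes, img_h, k)
    matches, carry_t, carry_d = [], [], []
    for lvl in range(k):
        cand_t = carry_t + t_levels[lvl]
        cand_d = carry_d + d_levels[lvl]
        lvl_matches = associate([track_boxes[i] for i in cand_t],
                                [det_boxes[j] for j in cand_d])   # IoU + Hungarian helper (see Introduction sketch)
        matches += [(cand_t[r], cand_d[c]) for r, c in lvl_matches]
        matched_t = {cand_t[r] for r, _ in lvl_matches}
        matched_d = {cand_d[c] for _, c in lvl_matches}
        carry_t = [i for i in cand_t if i not in matched_t]
        carry_d = [j for j in cand_d if j not in matched_d]
    return matches, carry_t, carry_d   # matches plus still-unmatched track/detection indices

Matching the nearest level first mirrors the observation that closer targets occlude farther ones, so their boxes are resolved before the more ambiguous, partially hidden ones.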

3.4. CMC-Detection-Based Re-Activate Method

Severe occlusions can cause long-term target loss. During this period, both the Kalman filter and deep learning models predict the motion state based only on the target’s historical trajectory. However, according to the analysis in OC-SORT [9], this leads to substantial error accumulation in the trajectory state and the Kalman filter’s parameters during the lost period, meaning that even if the target is rematched, it remains at risk of being lost again in subsequent frames. The error accumulation mainly stems from the absence of detections during the lost period. Therefore, we propose a Re-activate method based on CMC and detections that generates virtual detections to address this error accumulation. The Re-activate method is applied after the trajectory has been rematched.
Firstly, our tracker records the information of the last matched detection before the track becomes lost, including its bounding box $z_L$, detection confidence $c_L$, and frame index $T_L$. When a trajectory is marked as lost, the tracker also records its trajectory state $x_L$. Meanwhile, the tracker maintains a CMC list, whose length matches the maximum age of a track, to store the CMC affine matrices of the most recent frames. When the trajectory is rematched, the newly matched detection bounding box, confidence score, and frame index are denoted as $z_{new}$, $c_{new}$, and $T_{new}$, respectively.
Secondly, the tracker performs the Re-activate operation by applying the accumulated CMC from $T_L$ to $T_{new}$ to $z_L$ in Equation (18). Because $x_L$ and $z_L$ are recorded at different frames, the tracker applies the accumulated CMC from $T_L+1$ to $T_{new}$ to $x_L$ in Equation (19). To generate a series of virtual detection bounding boxes $z_k$, the tracker linearly interpolates between the newly matched detection bounding box $z_{new}$ and the CMC-corrected last matched detection bounding box $z'$ in Equation (20), as shown in Figure 6. The tracker then generates a series of virtual detection confidence scores $c_k$ by linearly interpolating between the newly matched confidence $c_{new}$ and the last matched confidence $c_L$ in Equation (21). After completing this data preparation, the tracker rolls the trajectory state back to the CMC-corrected recorded state $x'$, reruns the CBKF over the lost period from $T_L+1$ to $T_{new}$, and obtains the final trajectory state for the current frame. The operational process of Re-activate is shown in Figure 7.
$z' = \mathrm{CMC}_{T_L \to T_{new}}(z_L)$  (18)
$x' = \mathrm{CMC}_{T_L+1 \to T_{new}}(x_L)$  (19)
$z_k = z' + (T_k - T_L)\, \dfrac{z_{new} - z'}{T_{new} - T_L}$  (20)
$c_k = c_L + (T_k - T_L)\, \dfrac{c_{new} - c_L}{T_{new} - T_L}$  (21)
where $x'$ represents the trajectory state after the cumulative CMC transformation, $z'$ denotes the last detection box before the track became lost after the cumulative CMC transformation, and $T_k$ indicates the frame index corresponding to a generated virtual detection.
Figure 6. The left image shows the position of the pedestrian in the last frame before becoming lost. The right image shows the position of the pedestrian in the re-associated frame. The dashed rectangles represent the generated virtual detections.
Figure 7. The operational process of Re-activate.
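The sketch below illustrates Equations (18)–(21): the accumulated per-frame CMC affines are composed, the last detection box is warped into the current coordinate system, and virtual boxes and confidences are linearly interpolated over the lost gap. Function names, the corner-wise box warp, and the oldest-first composition order are illustrative assumptions, not the exact implementation.

import numpy as np

def compose_affines(affines):
    """Accumulate per-frame 2x3 affine matrices (oldest first) into a single 2x3 affine."""
    M = np.eye(3)
    for A in affines:
        M = np.vstack([A, [0.0, 0.0, 1.0]]) @ M
    return M[:2, :]

def warp_box(A, box):
    """Warp an [x1, y1, x2, y2] box with a 2x3 affine by moving its two corners (an approximation)."""
    pts = np.array([[box[0], box[1]], [box[2], box[3]]], dtype=np.float64)
    warped = pts @ A[:, :2].T + A[:, 2]
    return warped.reshape(-1)  # [x1', y1', x2', y2']

def virtual_detections(z_last_cmc, z_new, c_last, c_new, T_L, T_new):
    """Linearly interpolate virtual boxes (Eq. (20)) and confidences (Eq. (21)) over the lost gap."""
    z_last_cmc, z_new = np.asarray(z_last_cmc, float), np.asarray(z_new, float)
    dt = T_new - T_L
    out = []
    for T_k in range(T_L + 1, T_new + 1):
        alpha = (T_k - T_L) / dt
        out.append((T_k, z_last_cmc + alpha * (z_new - z_last_cmc), c_last + alpha * (c_new - c_last)))
    return out

Each resulting $(T_k, z_k, c_k)$ triple would then be fed to the CBKF update in turn, so that the corrected state at $T_{new}$ no longer carries the error accumulated while the track was lost.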
It is worth noting that we initially intended to use only the detection bounding boxes, with the same confidence score, to generate virtual detections. However, experiments revealed that this approach performs poorly in certain scenarios. Therefore, we introduce CMC to help the tracker generate more robust virtual detections, and linear interpolation is also applied to the confidence scores to produce more reliable confidence levels. Additionally, we evaluate both approaches on the MOT17-14 video sequence. The results are shown in Table 3 and Figure 8.
Table 3. The comparison between the two approaches on MOT17-14.
Figure 8. The left image was obtained using CMC-detection-based Re-activate. The right image was obtained using the method of detection bounding boxes and the same confidence score.
From Table 3 and Figure 8, we can conclude that the improved Re-activate method indeed demonstrates superior performance in certain scenarios. In the MOT17-14 sequence, the camera is in constant motion. Under such conditions, if the tracker does not use CMC to align the coordinate systems of the recorded $z_L$ and $x_L$ with the current frame, additional errors are introduced into the target trajectory. Moreover, when we linearly interpolate between two detection results with different confidence levels to generate virtual detection bounding boxes, accurately evaluating these generated boxes becomes challenging. Therefore, we opt to linearly interpolate the confidence scores to generate virtual confidence scores. These virtual confidence scores gradually decrease from the higher-confidence side toward the lower-confidence side, indicating that the tracker places greater trust in virtual detection boxes closer to the higher-confidence side.

3.5. Tracking Pipeline

To more clearly demonstrate the pipeline of OcclusionTrack and enhance reproducibility, we present Algorithm 1, which integrates ByteTrack with our four improvements.
Step 1: Use the YOLOX detector to detect bounding boxes and confidence scores from the current frame image. Classify these detections into high-score and low-score detections based on confidence scores.
Step 2: Calculate the CMC parameter using the current frame image and the previous frame image. Apply Kalman filtering to predict the trajectory state for the current frame, then correct it using CMC.
Step 3: Perform the first association: match the predicted trajectory state with high-scoring detections using DCM to obtain association results.
Step 4: Perform the second association: match the unmatched trajectory from the first association with low-score detections using DCM to obtain association results.
Step 5: Merge the association results from the first two steps.
Step 6: Update the matched results: if the trajectory state is activated, perform CBKF update; if the trajectory state is lost, perform Re-activate.
Step 7: Process unmatched tracks by incrementing their lost frame count by one. If the lost frame count exceeds the max age, remove the track. If within the max age, set the track state to lost.
Step 8: Initialize unmatched detections as new tracks.
Step 9: Output the current frame tracking results.
Algorithm 1: Proposed OcclusionTrack Pipeline for Frame t.
Input: Current frame I_t, Previous tracks T_{t-1}, Previous image I_{t-1}
Output: Updated tracks T_t

# Step 1: Object Detection and Classification
1: D_all = detector(I_t) // Get all detections {bbox, score}
2: D_high = {d ∈ D_all | d.score ≥ τ_high} // High-score detections
3: D_low = {d ∈ D_all | d.score < τ_high ∧ d.score ≥ τ_low} // Low-score detections

# Step 2: Kalman Filter Prediction with CMC
4: H = estimate(I_{t-1}, I_t) // Camera motion compensation
5: for each track τ in T_{t-1} do
6:   if τ.state ∈ {“active”, “lost”} then
7:     τ.pred_box = KalmanPredict(τ) // Standard prediction
8:     τ.pred_box = compensate(H, τ.pred_box) // Apply CMC
9:   end if
10: end for

# Step 3: First Association (High-score detections)
11: T_candidates = {τ ∈ T_{t-1} | τ.state ∈ {“active”, “lost”}} // All tracks
12: matches_high, unmatched_tracks_1, unmatched_dets_high =
DCM_Association(D_high, T_candidates)

# Step 4: Second Association (Low-score detections)
13: matches_low, unmatched_tracks_2, unmatched_dets_low =
DCM_Association(D_low, unmatched_tracks_1)

# Step 5: Merge Matching Results
14: all_matches = matches_high ∪ matches_low
15: unmatched_tracks = unmatched_tracks_2
16: unmatched_dets = unmatched_dets_high ∪ unmatched_dets_low

# Step 6: Track Update and Management
17: for each match (τ, d) in all_matches do
18:   if τ.state == “lost” then
19:     τ = Re_activate(τ, d, CMC) // Re-activate for lost tracks
20:   end if
21:   
22:   # CBKF update with non-linear measurement noise
23:   R = compute_measurement_noise(d.score, R) // Measurement noise based on confidence
24:   KalmanUpdate(τ, d.bbox, R)
25:   τ.state = “active”
26:   τ.lost_count = 0 // Reset lost counter
27: end for

# Step 7: Handle Unmatched Tracks
28: for each track τ in unmatched_tracks do
29:   τ.lost_count += 1
30:   if τ.lost_count > max_age then
31:     τ.state = “removed” // Delete track
32:   else
33:     τ.state = “lost”
34:   end if
35: end for

# Step 8: Initialize New Tracks
36: for each detection d in unmatched_dets do
37:   if d.score ≥ τ_init then
38:     create_new_track(d)
39:   end if
40: end for

# Step 9: Output Results
41: T_t = {τ ∈ T_{t-1} ∪ new_tracks | τ.state ≠ “removed”}
42: return T_t

4. Experiments

4.1. Datasets

We evaluate the performance of our tracker on the MOT17 [15], MOT20 [16], and DanceTrack [17] datasets. Tracking is performed on the test sets of MOT17 and MOT20, and tracking results are submitted to MOTchallenge to obtain tracking metrics. Based on ByteTrack [3], we split the MOT17 and MOT20 training sets in half to form validation sets and evaluate the performance of each component on the validation sets.
MOT17. The MOT17 [15] dataset is one of the most influential and widely used benchmark datasets in the field of multi-object tracking (MOT). It aims to provide a unified, rigorous, and challenging platform for evaluating and comparing multi-object tracking algorithms. The MOT17 dataset comprises 14 video sequences primarily featuring pedestrian tracking scenarios across both indoor and outdoor environments. It encompasses multiple key challenges: (1) occlusion; (2) lighting variations; (3) camera motion; (4) scale changes; and (5) dense crowds.
MOT20. The MOT20 [16] dataset is a benchmark dataset proposed in the field of multi-object tracking (MOT) to address scenarios involving extremely dense crowds. The core objective of MOT20 is to evaluate the robustness and accuracy of tracking algorithms in environments characterized by exceptionally high crowd density and severe occlusions. MOT20 comprises eight video sequences primarily captured at large public gatherings, train stations, and plazas. The key challenges of MOT20 are extremely high target density and severe and persistent occlusions.
DanceTrack. The DanceTrack [17] dataset exhibits the following characteristics: tracking targets share similar visual features, their motion patterns are highly complex, and occlusion occurs frequently. DanceTrack comprises a total of 100 video sequences, including 40 training sequences, 20 validation sequences, and 40 test sequences. Furthermore, similar to the MOT series datasets, test-set metrics are obtained by uploading results to the official evaluation server.

4.2. Metrics

We employ the general CLEAR metrics [18], HOTA [19], MOTA, IDF1 [20], ID Switch, FP, and FN to evaluate the performance of our tracker. MOTA emphasizes detector performance. ID Switch represents the number of target ID switches, reflecting the identity retention capability of a tracker. Additionally, IDF1 emphasizes association performance. HOTA serves as a comprehensive metric to evaluate the overall effectiveness of detection and association.

4.3. Implementation Details

During the confidence-based Kalman filter update and Re-activate phases, we set different $\beta$ values for the different datasets. For MOT17, the default value of $\beta$ is 8; for MOT20, it is set to 12; and for DanceTrack, it is also set to 12. Additionally, the confidence threshold $c_0$ is set to 0.7 on MOT17, MOT20, and DanceTrack. During the DCM association stage, the default depth levels are set to 3 for MOT17, 8 for MOT20, and 12 for DanceTrack. The high-score and low-score detection thresholds are set to 0.6 and 0.1, respectively, by default. The downscale factor of CMC is set to 4 by default, and the CMC method is sparse optical flow. All our tracking results rely solely on the IoU distance.
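For reproducibility, the per-dataset settings above can be grouped into a single configuration; the dictionary layout and key names below are illustrative assumptions, while the values come from this subsection.

# Illustrative grouping of the hyperparameters reported above (key names are not from the paper).
OCCTRACK_CONFIGS = {
    "MOT17":      {"beta": 8,  "c0": 0.7, "depth_levels": 3},
    "MOT20":      {"beta": 12, "c0": 0.7, "depth_levels": 8},
    "DanceTrack": {"beta": 12, "c0": 0.7, "depth_levels": 12},
}
COMMON = {
    "high_score_thresh": 0.6,   # high-score detection threshold
    "low_score_thresh": 0.1,    # low-score detection threshold
    "cmc_downscale": 4,         # downscale factor for CMC
    "cmc_method": "sparse_optical_flow",
}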
In the comparison experiments, to ensure fair comparison, we employed the pre-trained YOLOX detector from ByteTrack [3], utilizing the same weights and NMS threshold as ByteTrack [3]. Additionally, all experiments are performed on the NVIDIA GeForce RTX3060 GPU (NVIDIA, Santa Clara, CA, USA).

4.4. Comparison on MOT17, MOT20, and DanceTrack

We evaluate our tracker on the MOT17, MOT20, and DanceTrack test sets in comparison with other state-of-the-art trackers. The best results are shown in Table 4, Table 5 and Table 6.
Table 4. The comparison of our tracker with other state-of-the-art methods under the “private detector” protocol on the MOT17 test set.
Table 5. The comparison of our tracker with other state-of-the-art methods under the “private detector” protocol on the MOT20 test set.
Table 6. The comparison of our tracker with other state-of-the-art methods under the “private detector” protocol on the DanceTrack test set.
MOT17. Through the comparison with other state-of-the-art methods, our tracker demonstrates strong performance on the MOT17 test set. Compared to the baseline ByteTrack [3] with the same pre-trained YOLOX detector, our tracker achieves gains of +1.8 HOTA, +0.6 MOTA, and +2.4 IDF1, and the number of ID switches is reduced to almost half of ByteTrack’s.
MOT20. Through the comparison with other state-of-the-art methods, our tracker demonstrates that it maintains robust tracking performance in scenes with severe occlusion on the MOT20 test set. Compared to the baseline (ByteTrack) with the same pre-trained YOLOX detector, our tracker achieves gains of +1.9 HOTA and +2.3 IDF1, and the number of ID switches is reduced by twenty-five percent.
DanceTrack. Through the comparison with other state-of-the-art methods, our tracker has demonstrated strong performance on the DanceTrack test set. Compared to the baseline with the pre-trained YOLOX detector from OC-SORT, our tracker achieves gains of +9.8 HOTA, +1.8 MOTA, and +4.5 IDF1.

4.5. Ablation Studies of Different Components

We evaluate our tracker on the validation sets of MOT17 and MOT20 to observe the impact of different components on tracking performance. The results are shown in Table 7 and Table 8.
Table 7. Ablation studies on the MOT17 validation set.
Table 8. Ablation studies on the MOT20 validation set.
MOT17. On the MOT17 validation set, we set β to 8 and c 0 to 0.7. The depth level is set to 3. We use sparse optical flow as the CMC method. We use the pre-trained detector YOLOX from ByteTrack [3] as the detector. Through the experiments, CMC achieves primary gains of +0.9 IDF1, CBKF achieves gains of +0.3 IDF1, and Re-activate achieves gains of +0.1 MOTA.
MOT20. On the MOT20 validation set, we set β to 15 and c 0 to 0.7. The depth level is set to 8. We use sparse optical flow as the CMC method. We use the pre-trained detector YOLOX from SparseTrack [7] as the detector. Through the experiments, CBKF and DCM both achieve gains of +0.3 IDF1 and Re-activate achieves gains of +0.1 HOTA.

4.6. The Impact of Different β Values

We evaluate our tracker on the MOT17 validation set to observe the impact of different β values on tracking performance. The results are shown in Table 9.
Table 9. The performance of the OcclusionTrack when using different β values.
We set β to 2, 4, 6, 8, and 10. As β increases, the tracking performance gradually improves, reaching its best scores at β values of 6 and 8. When we continue to increase β to 10, tracking performance begins to decline. This indicates that a larger β value facilitates the Kalman filter update; however, when it becomes excessively large, performance actually deteriorates. We believe that an overly large β value may cause the measurement noise to fluctuate too rapidly, thereby interfering with detection results whose confidence lies between 0.4 and 0.6.

4.7. The Impact of the Max Age

Severe occlusions can cause the long-term loss of trajectories. The max age represents the maximum number of frames a trajectory may remain lost before being removed. Both excessively low and excessively high max age values can degrade tracking performance. When the max age is set too low, even if a trajectory is rematched in subsequent frames, it cannot recover its original ID because it has already been removed by the tracker; instead, it is assigned a new ID. When the max age is set too high, the target may have already left the frame, yet the tracker continues to predict its state, which not only increases the computational load but also risks erroneous matches. Therefore, we evaluate the tracking performance with different max age settings on the MOT17 and MOT20 validation sets. The results are shown in Figure 9 and Figure 10.
Figure 9. The impact of different max age settings on the MOT17 validation set.
Figure 10. The impact of different max age settings on the MOT20 validation set.

4.8. The Impact of Different Detectors

Tracking-by-detection (TBD) trackers rely heavily on the quality of their detections; indeed, TBD’s current strong performance is inseparable from the continuous advancement of detectors. Therefore, we evaluate our tracker with different detectors to observe the impact of the detector on tracking performance. The results are shown in Table 10.
Table 10. The performance of the OcclusionTrack when using different detectors.
We can observe that detectors with enhanced performance significantly improve the tracking capabilities of trackers. This indicates that we can enhance tracking performance by designing detectors with superior capabilities.
To observe the generalization capabilities of each module when using different detectors, we replaced the detector with a more powerful one and conducted ablation experiments on each module. The results are shown in Table 11. Through the above two experiments, we observed that OcclusionTrack is highly sensitive to detector performance; poor-quality detectors significantly limit the tracker’s capabilities. This situation may be due to the fact that each module within OcclusionTrack relies heavily on the data provided by the detector.
Table 11. Performance analysis of each module when using a better detector on the MOT17 validation set.

4.9. Visualization

To better evaluate our tracker and observe its behavior in specific scenes, we conduct visualization experiments on the MOT17 and MOT20 test sets. The visualization results are shown in Figure 11 and Figure 12.
Figure 11. These two images are taken from two consecutive frames within the same MOT20 video sequence.
Figure 12. These two images are taken from two consecutive frames within the same MOT17 video sequence.
This MOT20 video sequence represents a scene with severe occlusion. The scene depicted in the MOT17 video sequence is a street between two buildings. Certain areas within the sequence feature relatively dense targets and moderate occlusion.

5. Conclusions

A robust tracking-by-detection framework named OcclusionTrack has been proposed to address key challenges in dense multi-object tracking, particularly those arising from occlusions and camera motion. The framework integrates four core improvements into a cohesive pipeline: (1) a confidence-based Kalman filter (CBKF) for adaptive state update under partial occlusion; (2) camera motion compensation (CMC) for inter-frame alignment; (3) a depth–cascade-matching (DCM) algorithm for association in overlapping scenarios; and (4) a CMC-detection-based trajectory Re-activate method for recovering from complete occlusion.
The effectiveness of the proposed method has been validated through comprehensive experiments on the MOT17, MOT20, and DanceTrack benchmarks. OcclusionTrack achieves highly competitive performance, notably improving IDF1 by 2.4 and reducing ID switches by nearly 50% over the ByteTrack baseline on the MOT17 test set, improving IDF1 by 2.3 and reducing ID switches by nearly 25% over the ByteTrack baseline on the MOT20 test set, and improving IDF1 and HOTA by 4.5 and 9.8, respectively, over the ByteTrack baseline on the DanceTrack test set.
The experimental results demonstrate that all the predefined research objectives have been met to a certain extent, with quantitative evidence supporting each contribution:
(1)
Objective 1 (Handling Partial Occlusion Challenges during Kalman Filter Update): To mitigate the impact of detection quality degradation during partial occlusion, the CBKF module was designed. Its efficacy is confirmed through experiments, where its introduction leads to the +0.3 IDF1 improvement on both the MOT17 and MOT20 validation sets. This gain is attributed to the module’s ability to dynamically change the weight of detections during the Kalman update, preventing trajectory drift.
(2)
Objective 2 (Compensation for Camera Motion): To stabilize the prediction base for IoU matching, a CMC module was incorporated. Results indicate a +0.9 IDF1 increase on the MOT17 validation set. It is the component with the greatest improvement among the four. More importantly, its application is a prerequisite for the effective functioning of the reactivation mechanism, showcasing its systemic value beyond a direct performance bump.
(3)
Objective 3 (Robust Association under Occlusion): To resolve ambiguities in dense, overlapping scenes, the DCM algorithm was integrated. DCM contributes significant gains, boosting IDF1 by +0.4 on the MOT20 validation set. This confirms that depth-level partitioning effectively addresses complex inter-object interactions.
(4)
Objective 4 (Correction for Complete Occlusion): To address error accumulation in lost tracks, a Re-activate method was proposed. While its isolated impact is smaller (+0.1 HOTA) on the MOT20 validation set, its role in maintaining long-term identity consistency is critical.
While demonstrating strong performance, the portability and effectiveness of OcclusionTrack are subject to certain conditions which define its scope of applicability:
(1)
Dependence on Detector: Experiments demonstrate that TBD trackers are highly dependent on detector performance, with higher-quality detectors significantly improving tracking accuracy. Additionally, the CBKF module relies on the confidence scores from the detector being reasonably calibrated to detection quality. Performance may degrade if confidence scores are not predictive of localization accuracy.
(2)
Assumption for Camera Motion Scenario: The CMC module employs a global 2D model, which assumes dominant, low-frequency camera motion. Its effectiveness may diminish in scenarios with highly dynamic, non-planar backgrounds or very high-frequency jitter. Moreover, it cannot meet the requirements of 3D scenes.
(3)
Scene-Type Specificity: The current motion model retains a linear constant-velocity assumption. Consequently, tracking performance can decline in sequences with extremely low frame rates or highly non-linear, agile object motion. Furthermore, the framework is primarily designed and tested on pedestrians, and its extension to objects with highly deformable shapes may require adjustments.
Future research will focus on enhancing generalizability by (1) exploring the joint optimization of the detector and tracker components to reduce dependency on the performance of the detections; (2) investigating adaptive, scene-aware CMC strategies that can handle more diverse camera dynamics; and (3) developing deep, data-driven motion models to replace the linear Kalman predictor for handling complex motions.

Author Contributions

Conceptualization, Y.C. and F.M.; methodology, Y.C. and F.M.; software, Y.C.; validation, Y.C.; formal analysis, Y.C.; investigation, Y.C., F.M. and Z.C.; resources, F.M., Y.C. and Z.C.; data curation, Y.C. and Z.C.; writing—original draft preparation, Y.C.; writing—review and editing, F.M.; visualization, Z.C.; supervision, F.M.; project administration, F.M.; funding acquisition, F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the New Generation Information Technology Innovation Project under Grant 2021ITA10002, in part by the Sichuan Provincial Key Research and Development Program under Grant 2025YFHZ0006, and in part by the Sichuan University of Science and Engineering Talent Introduction Program under Grant 2025RCZ064.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets are presented in https://motchallenge.net/ and https://github.com/DanceTrack/DanceTrack (accessed on 7 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AbbreviationsFull Term
OCCTrackOcclusionTrack
MOTMulti-Object Tracking
CBKFConfidence-Based Kalman Filter
CMCCamera Motion Compensation
DCMDepth–Cascade-Matching
KFKalman Filter
TBDTracking-By-Detection

References

  1. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple Online and Realtime Tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  2. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 3645–3649. Available online: https://ieeexplore.ieee.org/document/8296962 (accessed on 11 October 2023).
  3. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
  4. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems; Wiley Press: New York, NY, USA, 2001; Volume 82D, pp. 35–45. Available online: https://ieeexplore.ieee.org/document/5311910 (accessed on 12 October 2023).
  5. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  6. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 3025–3029. Available online: https://ieeexplore.ieee.org/document/10222576 (accessed on 8 January 2025).
  7. Liu, Z.; Wang, X.; Wang, C.; Liu, W.; Bai, X. Sparsetrack: Multi-object tracking by performing scene decomposition based on pseudo-depth. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 4870–4882. Available online: https://ieeexplore.ieee.org/document/10819455 (accessed on 4 September 2025). [CrossRef]
  8. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. Available online: https://ieeexplore.ieee.org/document/10032656 (accessed on 22 November 2024). [CrossRef]
  9. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 9686–9696. [Google Scholar]
  10. Lv, W.; Huang, Y.; Zhang, N.; Lin, R.S.; Han, M.; Zeng, D. Diffmot: A real-time diffusion-based multiple object tracker with non-linear prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19321–19330. [Google Scholar]
  11. Hong, J.; Li, Y.; Yan, J.; Wei, X.; Xian, W.; Qin, Y. KalmanFormer: Integrating a Deep Motion Model into SORT for Video Multi-Object Tracking. Appl. Sci. 2025, 15, 9727. Available online: https://www.mdpi.com/2076-3417/15/17/9727 (accessed on 25 October 2025).
  12. Bradski, G. The opencv library. Dr. Dobb’s J. Softw. Tools Prof. Program. 2000, 25, 122–125. Available online: https://www.researchgate.net/publication/233950935_The_Opencv_Library (accessed on 17 August 2025).
  13. Shi, J. Good features to track. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. Available online: https://ieeexplore.ieee.org/abstract/document/323794 (accessed on 17 August 2025).
  14. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. Available online: https://dl.acm.org/doi/10.1145/358669.358692 (accessed on 17 August 2025).
  15. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. Mot16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  16. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
  17. Sun, Y.; Zhang, W.; Zhao, B.; Li, L.; Wang, J. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; Available online: https://ieeexplore.ieee.org/document/9879192 (accessed on 3 September 2024).
  18. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. Available online: https://jivp-eurasipjournals.springeropen.com/articles/10.1155/2008/246309 (accessed on 5 September 2025).
  19. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
  20. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 17–35. Available online: https://link.springer.com/chapter/10.1007/978-3-319-48881-3_2 (accessed on 5 September 2025).
  21. Pang, B.; Li, Y.; Zhang, Y.; Li, M.; Lu, C. Tubetk: Adopting tubes to track multi-object in a one-step training model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6308–6318. [Google Scholar]
  22. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490. [Google Scholar]
  23. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with Dense Representations for Multiple-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7820–7835. [Google Scholar] [CrossRef] [PubMed]
  24. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  25. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv 2021, arXiv:2104.00194. Available online: https://ieeexplore.ieee.org/abstract/document/10030267 (accessed on 5 October 2025).
  26. Stadler, D.; Beyerer, J. Modelling ambiguous assignments for multi-person tracking in crowds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 133–142. Available online: https://ieeexplore.ieee.org/document/9707576 (accessed on 7 December 2025).
  27. Yang, M.; Han, G.; Yan, B.; Zhang, W.; Qi, J.; Lu, H.; Wang, D. Hybrid-sort: Weak cues matter for online multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 6504–6512. [Google Scholar]
  28. Yi, K.; Luo, K.; Luo, X.; Huang, J.; Wu, H.; Hu, R.; Hao, W. Ucmctrack: Multi-object tracking with uniform camera motion compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 6702–6710. [Google Scholar]
  29. Zhang, Y.; Sheng, H.; Wu, Y.; Wang, S.; Ke, W.; Xiong, Z. Multiplex labeling graph for near-online tracking in crowded scenes. IEEE Internet Things J. 2020, 7, 7892–7902. Available online: https://ieeexplore.ieee.org/document/9098857 (accessed on 11 October 2025).
  30. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple-object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  31. Li, W.; Xiong, Y.; Yang, S.; Xu, M.; Wang, Y.; Xia, W. Semi-TCL: Semi-supervised track contrastive representation learning. arXiv 2021, arXiv:2107.02396. [Google Scholar]
  32. Yu, E.; Li, Z.; Han, S.; Wang, H. RelationTrack: Relation aware multiple object tracking with decoupled representation. arXiv 2021, arXiv:2105.04322. Available online: https://ieeexplore.ieee.org/document/9709649 (accessed on 11 October 2025).
  33. Qin, Z.; Zhou, S.; Wang, L.; Duan, J.; Hua, G.; Tang, W. MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking. arXiv 2023, arXiv:2303.10404. [Google Scholar]
  34. Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12352–12361. Available online: https://ieeexplore.ieee.org/document/9578864 (accessed on 11 October 2025).
  35. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-end multiple-object tracking with transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675. [Google Scholar]
