4.3.1. Comparative Analysis with State-of-the-Art Methods
We conducted a comprehensive comparison between our proposed method and contemporary state-of-the-art tracking algorithms. The experimental results on the MOT17 test set are presented in Table 5.
As illustrated in Figure 5 and Table 5, the proposed model demonstrates clear advantages over ByteTrack, FairMOT, and other models on the MOT17 dataset. While ByteTrack achieves competitive detection accuracy (MOTA 80.3), it shows limitations in identity preservation and association in complex scenarios, evidenced by its higher number of ID switches (2196) compared to our approach (1134). Similarly, FairMOT, despite its integrated detection-tracking framework, struggles to maintain identity consistency (3303 ID switches) and achieves lower comprehensive tracking metrics (HOTA 59.3, IDF1 72.3). These baseline methods face challenges in scenarios involving occlusion, crowded scenes, and non-linear motion. In terms of accuracy-related metrics, our model achieves a HOTA score of 63.7 (versus ByteTrack's 63.1 and FairMOT's 59.3), a MOTA score of 79.8 (versus ByteTrack's 80.3 and FairMOT's 73.7), and an IDF1 score of 79.5 (surpassing ByteTrack's 77.3 and FairMOT's 72.3), indicating strong performance in comprehensive tracking, object detection, and identity recognition. Regarding identity stability, our model records only 1134 ID switches, substantially fewer than ByteTrack's 2196 and FairMOT's 3303, demonstrating robust continuous tracking in complex scenarios while reducing identity switches and enhancing tracking reliability. In terms of real-time performance, our model runs at 61.4 FPS, significantly outperforming ByteTrack (29.1 FPS), FairMOT (25.2 FPS), and StrongSORT (7.0 FPS), showcasing superior efficiency in processing video frames for real-time applications.
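For reference, the MOTA figures quoted above follow the standard CLEAR-MOT definition, which penalizes missed detections, false positives, and identity switches against the total number of ground-truth annotations. The sketch below uses illustrative counts, not the actual Table 5 statistics:

```python
def mota(num_gt, fn, fp, idsw):
    """CLEAR-MOT accuracy: MOTA = 1 - (FN + FP + IDSW) / GT.

    num_gt : total ground-truth object annotations over the sequence
    fn     : missed detections (false negatives)
    fp     : false positives
    idsw   : identity switches
    """
    return 1.0 - (fn + fp + idsw) / num_gt


# Illustrative counts only; a low IDSW term is why fewer identity
# switches also lift MOTA slightly, on top of improving IDF1.
print(mota(100, 5, 5, 2))  # -> 0.88
```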
As illustrated in Figure 6 and Table 6, the proposed model also performs strongly on the more challenging MOT20 dataset. In terms of accuracy metrics, it achieves a HOTA of 61.5, a MOTA of 76.1, and an IDF1 of 76.5, remaining competitive with StrongSORT (HOTA 62.6, MOTA 73.8, IDF1 77.0) while clearly outperforming FairMOT (HOTA 54.6, MOTA 61.8, IDF1 67.3). Regarding identity stability, our model records only 1562 ID switches, far fewer than FairMOT's 5243, demonstrating superior continuous tracking in cluttered backgrounds with frequent target interactions. For real-time performance, our model runs at 23.6 FPS, outperforming both FairMOT (13.2 FPS) and StrongSORT (1.4 FPS), confirming its ability to balance detection accuracy and real-time processing while delivering efficient, high-precision video stream tracking in complex scenarios.
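The identity-oriented IDF1 scores cited for MOT17 and MOT20 are the F1 score over identity-matched detections. A minimal sketch of the definition, using made-up counts rather than benchmark data:

```python
def idf1(idtp, idfp, idfn):
    """Identity F1 score: IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN).

    idtp : detections assigned to the correct ground-truth identity
    idfp : detections assigned to a wrong or spurious identity
    idfn : ground-truth detections left without a correct identity
    """
    return 2 * idtp / (2 * idtp + idfp + idfn)


# Illustrative counts only: 8 correctly identified detections,
# 2 identity false positives, 2 identity false negatives.
print(idf1(8, 2, 2))  # -> 0.8
```

Because IDTP requires a globally consistent identity assignment, a tracker that fragments trajectories (many ID switches) loses IDF1 even when its raw detection quality is high, which is why our lower switch counts translate into the IDF1 gap over FairMOT.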
Based on the experimental results from both the MOT17 and MOT20 datasets, our proposed tracking model achieves a strong balance between speed and detection performance through its algorithm design and optimization. This balance matters for practical applications, particularly in intelligent surveillance, where extensive video data must be processed in real time while multiple targets are tracked accurately. In autonomous driving, vehicles must make split-second decisions while maintaining precise real-time tracking of surrounding objects. The performance of our proposed model provides robust technical support for these application scenarios, demonstrating considerable potential for practical deployment and promising to advance multi-object tracking technology in real-world applications.
Moreover, the compared multi-object tracking (MOT) methods differ markedly in performance between the MOT17 and MOT20 datasets. On MOT17, our proposed method achieves HOTA, MOTA, and IDF1 scores of 63.7, 79.8, and 79.5, respectively, ranking just behind StrongSORT++, and it significantly outperforms the other methods with only 1134 identity switches, demonstrating its advantage in maintaining object identity consistency. At 61.4 FPS, it far surpasses StrongSORT++ in processing speed, balancing accuracy and efficiency. On MOT20, our method achieves HOTA, MOTA, and IDF1 scores of 61.5, 76.1, and 76.5, with somewhat lower overall accuracy than on MOT17; there, BoostTrack++ leads with a HOTA of 66.4 and an IDF1 of 82.0. However, our method still maintains a low number of identity switches (1562) and runs at 23.6 FPS, significantly faster than BoostTrack++. Overall, our method demonstrates robust performance on both datasets, excelling in reducing identity switches and improving processing speed, and it remains competitive in practical applications despite the slightly lower accuracy on MOT20.
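Since HOTA is the headline metric in both comparisons, it is worth recalling how it combines detection and association quality: at each localization threshold it is the geometric mean of detection accuracy (DetA) and association accuracy (AssA), and the reported score averages over thresholds. The values below are illustrative placeholders, not our benchmark numbers:

```python
import math

def hota_alpha(det_a, ass_a):
    """HOTA at a single localization threshold alpha:
    the geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)

def hota(pairs):
    """Final HOTA: average of hota_alpha over (DetA, AssA) pairs,
    one pair per IoU threshold alpha (illustrative input)."""
    return sum(hota_alpha(d, a) for d, a in pairs) / len(pairs)


# Illustrative values: the geometric mean rewards methods that are
# balanced across detection and association rather than strong in one.
print(hota_alpha(0.64, 0.64))               # -> 0.64
print(hota([(0.64, 0.64), (0.49, 0.49)]))   # -> 0.565
```

This structure explains why a tracker can trail slightly in MOTA (a detection-dominated metric) yet lead in HOTA when its association quality, reflected in low ID-switch counts, is stronger.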
4.3.2. Small Object Tracking Performance
Given this study's particular focus on small object tracking, we conducted a dedicated evaluation using the VisDrone-MOT dataset, with the results presented in Table 7.
The experimental results demonstrate that our proposed method achieves the best performance in both MOTA (39.3) and FPS (14.6), significantly outperforming the other approaches. It also performs strongly in IDF1 (48.2) and ID switches (215), surpassing most comparative methods in identity consistency. Compared to CMOT, our method maintains a high IDF1 while substantially reducing the number of ID switches, and it demonstrates superior computational efficiency over DeepSORT and Flow-Tracker. Overall, our approach achieves a better balance among tracking accuracy, identity stability, and computational efficiency, providing an efficient and robust solution for multi-object tracking tasks.
As shown in Figure 7, the image sequence depicts a cyclist temporarily occluded by a bus before reappearing. Although UAV-SIFT detects the cyclist throughout the sequence, its assigned ID changes before and after the occlusion, whereas our proposed method not only maintains continuous detection but also preserves a consistent ID for the cyclist across the occlusion.
When evaluating small object tracking performance on the VisDrone-MOT dataset, our method clearly outperforms the others, with high MOTA and IDF1 scores indicating its effectiveness in detecting small objects and maintaining identity consistency. Compared to DeepSORT, the Multi-Scale Feature Adaptive Enhancement (MS-FAE) module is a major advantage: it combines spatial details and semantic information through multi-scale feature pyramid fusion, which is crucial for small objects whose single-scale features are indistinct, and its small object adaptive attention mechanism further highlights small object features. In terms of identity stability, our method has far fewer ID switches, mainly due to the Cross-Frame Association Module (CFAM). The CFAM's grouped cross-attention captures global semantic associations, and its memory recall mechanism maintains object identity during occlusions, as demonstrated by the successful tracking of the cyclist through occlusion where UAV-SIFT failed. Regarding computational efficiency, our method runs at 14.6 FPS, faster than counterparts such as Flow-Tracker and DeepSORT. This is achieved through the optimized design of modules like CFAM, whose grouped cross-attention reduces computational complexity, improving both feature representation and calculation speed. Overall, the combination of the MS-FAE and CFAM modules, along with well-tuned hyperparameters, gives our method an edge in detection accuracy, identity stability, and computational efficiency for small object tracking.
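The occlusion-recovery behavior described above can be sketched as a small memory bank that stores appearance embeddings of lost tracks and re-matches reappearing detections to remembered identities. This is a toy illustration of the general idea, loosely modeled on the memory recall mechanism; the class name, cosine matching rule, and thresholds are our assumptions, not the CFAM implementation:

```python
import math

def _cosine(a, b):
    # Cosine similarity between two embedding vectors (illustrative).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

class MemoryRecall:
    """Toy occlusion-recovery memory: hypothetical sketch, not the paper's code."""

    def __init__(self, sim_threshold=0.6, max_age=30):
        self.lost = {}                 # track_id -> [embedding, frames_since_seen]
        self.sim_threshold = sim_threshold
        self.max_age = max_age         # frames a lost identity is remembered

    def store(self, track_id, embedding):
        # Called when a track becomes occluded: remember its appearance.
        self.lost[track_id] = [list(embedding), 0]

    def step(self):
        # Age remembered identities; forget those occluded for too long.
        for tid in list(self.lost):
            self.lost[tid][1] += 1
            if self.lost[tid][1] > self.max_age:
                del self.lost[tid]

    def recall(self, embedding):
        # Match a reappearing detection to the most similar remembered identity.
        best_id, best_sim = None, self.sim_threshold
        for tid, (mem, _) in self.lost.items():
            sim = _cosine(mem, embedding)
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is not None:
            del self.lost[best_id]     # identity recovered, drop from memory
        return best_id
```

In the cyclist example, a tracker with such a memory can re-assign the pre-occlusion ID when the cyclist reappears with a similar appearance embedding, instead of minting a new identity as UAV-SIFT does:

```python
mem = MemoryRecall()
mem.store(7, [1.0, 0.0, 0.0])          # cyclist (ID 7) disappears behind the bus
print(mem.recall([0.9, 0.1, 0.0]))     # similar appearance reappears -> 7
print(mem.recall([0.0, 1.0, 0.0]))     # unrelated detection -> None
```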