Enhancing Multi Object Tracking with CLIP: A Comparative Study on DeepSORT and StrongSORT
Abstract
1. Introduction
1.1. Research Questions and Motivation
1. How does replacing conventional ReID appearance embeddings with CLIP-derived representations affect identity consistency in multi-object tracking pipelines?
2. To what extent do semantically rich, high-capacity appearance features contribute to reducing identity switches (IDSW), particularly in crowded and occlusion-heavy scenes?
3. Can improvements in tracking performance be directly attributed to enhanced appearance feature quality when other components of the tracking-by-detection framework are held constant?
1.2. List of Main Contributions
1. An enhanced tracking-by-detection architecture is proposed in which CLIP-derived visual embeddings are integrated into the DeepSORT and StrongSORT frameworks, replacing conventional CNN-based ReID models with semantically rich appearance representations.
2. A controlled and systematic analysis of appearance feature quality in multi-object tracking is conducted by isolating the impact of CLIP-based embeddings on identity association stability, with particular emphasis on identity switch (IDSW) reduction in crowded and occluded environments.
3. A high-capacity object detector, YOLOv8x fine-tuned on the MOT20 dataset, is incorporated to examine how improvements in detection accuracy propagate through the tracking pipeline and interact with enhanced appearance modeling.
4. Extensive experimental evaluations are performed on established multi-object tracking benchmarks, including MOT15 and MOT16, assessing tracking performance under challenging conditions characterized by dense crowds, severe occlusions, and complex scene dynamics.
5. The selection of CLIP over alternative vision–language models, such as ALIGN [21], is justified based on the availability of publicly released pretrained weights, large-scale training data, strong cross-domain generalization capability, and reproducibility, establishing its suitability for practical deployment in modern multi-object tracking systems.
2. Background
2.1. Object Detection Models (YOLOv8)
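YOLOv8x supplies the detections consumed by both trackers. The snippet below is a minimal sketch of per-frame pedestrian detection through the standard Ultralytics API; the weight file `yolov8x_mot20.pt` is a hypothetical placeholder for a checkpoint fine-tuned on MOT20, not a file released with this work.

```python
# Minimal sketch: per-frame pedestrian detection with YOLOv8x (Ultralytics API).
# "yolov8x_mot20.pt" is a hypothetical placeholder for MOT20 fine-tuned weights.
from ultralytics import YOLO

model = YOLO("yolov8x_mot20.pt")  # or "yolov8x.pt" for the off-the-shelf checkpoint

def detect_pedestrians(frame, conf_thresh=0.25):
    """Return (N, 4) xyxy boxes and (N,) confidences for the 'person' class."""
    result = model.predict(frame, conf=conf_thresh, verbose=False)[0]
    boxes = result.boxes
    keep = boxes.cls == 0  # class 0 = person in the COCO label set
    return boxes.xyxy[keep].cpu().numpy(), boxes.conf[keep].cpu().numpy()
```

In a tracking-by-detection pipeline, these boxes and scores are passed to the tracker together with the appearance features described in Section 2.2.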
2.2. Feature Extraction Models for Re-Identification
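The sketch below shows how CLIP image embeddings can stand in for a CNN-based ReID descriptor: each detection crop is preprocessed, encoded with the CLIP image encoder, and L2-normalised so that cosine distance can be used during association. It assumes the open-source `clip` package (github.com/openai/CLIP); the ViT-B/32 backbone is illustrative rather than the specific variant evaluated here.

```python
# Hedged sketch: CLIP image embeddings as drop-in appearance features for ReID.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # illustrative backbone choice

@torch.no_grad()
def clip_appearance_features(frame_rgb, boxes_xyxy):
    """Encode each detection crop with CLIP and L2-normalise the embeddings.

    frame_rgb: HxWx3 uint8 RGB array; boxes_xyxy: iterable of (x1, y1, x2, y2).
    Assumes at least one box; returns an (N, D) tensor of unit-norm features.
    """
    crops = [preprocess(Image.fromarray(frame_rgb[int(y1):int(y2), int(x1):int(x2)]))
             for x1, y1, x2, y2 in boxes_xyxy]
    batch = torch.stack(crops).to(device)
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit norm for cosine-distance matching
```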
2.3. Tracking Models: DeepSORT and StrongSORT
2.3.1. DeepSORT
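To make the role of appearance features explicit, the following simplified sketch (not the authors' implementation) shows a DeepSORT-style appearance association step [14]: a cosine-distance cost matrix between per-track feature galleries and detection embeddings, gated by a maximum distance and solved with the Hungarian algorithm. The Kalman-filter motion gate and matching cascade of the full tracker are omitted for brevity.

```python
# Simplified DeepSORT-style appearance association (motion gating omitted).
import numpy as np
from scipy.optimize import linear_sum_assignment

def appearance_match(track_galleries, det_features, max_cosine_distance=0.2):
    """track_galleries: list of (n_i, D) arrays; det_features: (M, D); all unit-norm."""
    cost = np.zeros((len(track_galleries), len(det_features)))
    for i, gallery in enumerate(track_galleries):
        # nearest-neighbour cosine distance between the track's gallery and each detection
        cost[i] = (1.0 - gallery @ det_features.T).min(axis=0)
    cost[cost > max_cosine_distance] = max_cosine_distance + 1e-5  # gate infeasible pairs
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cosine_distance]
```

Swapping CLIP embeddings into this step only changes how `det_features` (and the track galleries) are computed; the association logic is untouched, which is what allows the experiments to isolate the effect of appearance feature quality.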
2.3.2. StrongSORT
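A key StrongSORT refinement over DeepSORT is that the per-track appearance gallery is replaced by an exponential moving average (EMA) of the matched detection embeddings [15], which smooths appearance noise while keeping per-track memory constant:

```latex
e_i^{t} = \alpha \, e_i^{t-1} + (1 - \alpha) \, f_i^{t}
```

Here $e_i^{t}$ is the appearance state of track $i$ at frame $t$, $f_i^{t}$ is the embedding of the detection matched to that track, and $\alpha$ is a momentum term (0.9 in [15]); the same update applies unchanged when $f_i^{t}$ is a CLIP embedding rather than a CNN ReID feature.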
3. Benchmark Datasets
3.1. MOT15 Dataset
3.2. MOT16 Dataset
3.3. MOT20 Dataset
4. Evaluation Metrics
4.1. Multiple Object Tracking Accuracy (MOTA)
4.2. Multiple Object Tracking Precision (MOTP)
4.3. Identity Switches (IDSW)
4.4. Precision
4.5. Recall
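For reference, the metrics follow the standard CLEAR MOT definitions [34]:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},
\qquad
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
\qquad
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
```

where $\mathrm{FN}_t$, $\mathrm{FP}_t$, and $\mathrm{IDSW}_t$ are the false negatives, false positives, and identity switches in frame $t$, $\mathrm{GT}_t$ is the number of ground-truth objects in frame $t$, $d_{t,i}$ is the localization error of the $i$-th matched pair in frame $t$, and $c_t$ is the number of matches in frame $t$. Note that, depending on the evaluation script, MOTP is reported either as an average overlap or as an average distance, which is why its scale differs between the result tables later in the paper.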
5. Experiments and Results
5.1. Proposed Architectures
5.2. Experimental Setups
5.3. Quantitative Results
5.3.1. DeepSORT-Experiment: Impact of CLIP
- On PETS09-S2L1, integrating CLIP reduced identity switches (IDSW) from 19 to 17, indicating improved identity consistency.
- On ETH-Sunnyday, using fine-tuned YOLOv8x weights improved both MOTP and Precision, indicating better localization and detection quality.
5.3.2. StrongSORT-Experiment 1: Impact of CLIP
5.3.3. StrongSORT-Experiment 2: Cross-Dataset Generalization of YOLO
1. CLIP integration on PETS09-S2L1 reduced IDSW by approximately 60%, demonstrating significant improvements in identity consistency.
2. Replacing the default detector with YOLOv8x on TUD-Stadtmitte reduced IDSW by 44%, confirming the benefits of high-capacity detection for tracking performance.
3. Fine-tuned YOLOv8x generalized well to MOT15, improving MOTA (78.70 → 82.79) and MOTP (22.40 → 26.70). Adding CLIP further reduced IDSW from 15 to 10 (a 33% improvement), illustrating the complementary effect of robust embeddings combined with accurate detection.
5.4. Qualitative Results
5.4.1. Graphical Analysis of Results
5.4.2. Visual Results on MOT15 and MOT16
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Yeung, S.; Downing, N.L.; Fei-Fei, L.; Milstein, A. Bedside computer vision—moving artificial intelligence from driver assistance to patient safety. N. Engl. J. Med. 2018, 378, 1271–1273.
3. Li, J.; Wang, B.; Ma, H.; Gao, L.; Fu, H. Visual feature extraction and tracking method based on corner flow detection. ICCK Trans. Intell. Syst. 2024, 1, 3–9.
4. Ravindran, R.; Santora, M.; Jamali, M. Multi object detection and tracking, based on DNN, for autonomous vehicles: A review. IEEE Sens. J. 2020, 21, 5668–5677.
5. Coifman, B.; Beymer, D.; McLauchlan, P.; Malik, J. A real-time computer vision system for vehicle tracking and traffic surveillance. Transp. Res. Part C Emerg. Technol. 1998, 6, 271–288.
6. Sophokleous, A.; Christodoulou, P.; Doitsidis, L.; Chatzichristofis, S.A. Computer vision meets educational robotics. Electronics 2021, 10, 730.
7. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
8. Alkandary, K.; Yildiz, A.S.; Meng, H. A Comparative Study of YOLO Series (v3–v10) with DeepSORT and StrongSORT: A Real-Time Tracking Performance Study; Technical Report; Department of Electronic and Electrical Engineering, Brunel University: London, UK, 2025.
9. Danilowicz, M.; Kryjak, T. Real-Time Multi-object Tracking Using YOLOv8 and SORT on a SoC FPGA. In Proceedings of the International Symposium on Applied Reconfigurable Computing, Seville, Spain, 9–11 April 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 214–230.
10. Yu, X.; Liu, X.; Liang, G. YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association. arXiv 2025, arXiv:2507.12087.
11. Liu, Y.; Shen, S. Vehicle Detection and Tracking Based on Improved YOLOv8. IEEE Access 2025, 13, 24793–24803.
12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
13. Zhu, H.; Lu, Q.; Xue, L.; Zhang, P.; Yuan, G. Vision-language tracking with CLIP and interactive prompt learning. IEEE Trans. Intell. Transp. Syst. 2024, 26, 3659–3670.
14. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: New York, NY, USA, 2017; pp. 3645–3649.
15. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737.
16. Du, J.; Xing, W.; Li, M.; Yu, F.R. VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement. arXiv 2025, arXiv:2509.14060.
17. Asperti, A.; Naldi, L.; Fiorilla, S. An Investigation of the Domain Gap in CLIP-Based Person Re-Identification. Sensors 2025, 25, 363.
18. Yang, X.; Gao, X.; Niu, S.; Zhu, F.; Feng, G.; Qu, X.; Camacho, D. CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification. arXiv 2025, arXiv:2511.10309.
19. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv 2015, arXiv:1504.01942.
20. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi object tracking. arXiv 2016, arXiv:1603.00831.
21. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4904–4916.
22. Li, S.; Sun, L.; Li, Q. CLIP-ReID: Exploiting Vision–Language Model for Image Re-Identification Without Concrete Text Labels. Proc. AAAI Conf. Artif. Intell. 2023, 37, 1405–1413.
23. Cheng, T.; Luo, X.; Zhao, H.; Zhou, Y.; Yan, S. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv 2024, arXiv:2401.17270.
24. Bayraktar, E. ReTrackVLM: Transformer-Enhanced Multi-Object Tracking with Cross-Modal Embeddings and Zero-Shot Re-Identification Integration. Appl. Sci. 2025, 15, 1907.
25. Wu, Y.; Li, Y.; Sheng, H.; Zhang, Z. OVTrack: Open-Vocabulary Multiple Object Tracking. arXiv 2023, arXiv:2304.08963.
26. Huang, L.; Wu, Y.; Li, Y.; Sheng, H.; Zhang, Z. Z-GMOT: Zero-Shot Generic Multiple Object Tracking. arXiv 2023, arXiv:2305.17648.
27. Chen, Y.; Sheng, H.; Li, Y.; Zhang, J.; Zhang, Z. ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking. arXiv 2025, arXiv:2504.09195.
28. Li, H.; Zhao, F.; Xue, F.; Wang, J.; Liu, Y.; Chen, Y.; Wu, Q.; Tao, J.; Zhang, G.; Xi, D.; et al. Succulent-YOLO: Smart UAV-Assisted Succulent Farmland Monitoring with CLIP-Based YOLOv10 and Mamba Computer Vision. Remote Sens. 2025, 17, 2219.
29. Lin, J.; Gong, S. GridCLIP: One-stage object detection by grid-level CLIP representation learning. arXiv 2023, arXiv:2303.09252.
30. Vidit, V.; Engilberge, M.; Salzmann, M. CLIP the gap: A single domain generalization approach for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3219–3229.
31. Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. MARS: A video benchmark for large-scale person re-identification. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 868–884.
32. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712.
33. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. MOT20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003.
34. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
35. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 17–35.
36. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.H.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578.
37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.

| Year | Method/Paper | Uses CLIP | Uses Object Detection | Tracking Paradigm | Main Purpose | Tracking Method Used (Instead of DeepSORT/StrongSORT) |
|---|---|---|---|---|---|---|
| 2023 | OVTrack: Open-Vocabulary Multiple Object Tracking | Yes (CLIP feature distillation) | Yes | Detection + tracking head | Track seen and unseen categories | Open-vocabulary MOT framework: learned tracking head (no Kalman filter or Hungarian matching) |
| 2023 | CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification | Yes (CLIP visual features) | No | Feature learning (Re-ID only) | Improve appearance embeddings | No tracking: Re-ID model only (can be plugged into DeepSORT/StrongSORT) |
| 2024 | YOLO-World: Real-Time Open-Vocabulary Object Detection | Yes (vision-language detector) | No | Object detection | Open-vocabulary detection | No tracking: detector only (can feed DeepSORT/StrongSORT) |
| 2024 | Z-GMOT: Zero-shot Generic Multiple Object Tracking | Yes (leverages a vision-language detector iGLIP/GLIP) | Yes | Tracking-by-Detection (zero-shot generic MOT) | Track multiple generic unseen objects without predefined categories | MA-SORT—motion & appearance association specifically designed for generic object association (replaces typical SORT/DeepSORT) |
| 2025 | ReTrackVLM: Transformer-Enhanced MOT with Cross-Modal Embeddings and Zero-Shot Re-ID Integration | Yes (VLM/CLIP-style embeddings) | Yes | Transformer-based MOT | Enhance identity association | Transformer encoder–decoder tracking: end-to-end learned association |
| 2025 | VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement | Yes (CLIP image encoder) | Yes | Tracking-by-detection | Improve robustness in low-quality videos | Transformer-based MOT, query-based tracking with learned association |
| 2025 | ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking | Yes—uses CLIP-based semantic encoding for matching generated captions with queries | Yes | Tracking-by-Detection (with Kalman filter association) | Track objects specified by natural language queries in a zero-shot manner | Kalman filter + fuzzy query matching within a tracking-by-detection pipeline (not DeepSORT/StrongSORT) |

DeepSORT with and without CLIP appearance features (cf. Section 5.3.1):

| Seq. ID | Exp. | Recall ↑ | Prec. ↑ | IDSW ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|
| PETS09-S2L1 | No CLIP | 94.00 | 92.50 | 19.00 | 86.00 | 23.30 |
| PETS09-S2L1 | With CLIP | 93.70 | 92.30 | 17.00 | 85.60 | 23.80 |
| ETH-Sunnyday | No CLIP | 94.10 | 74.30 | 2.00 | 61.40 | 17.40 |
| ETH-Sunnyday | With CLIP | 93.90 | 74.30 | 2.00 | 61.30 | 17.80 |
| ETH-Sunnyday ★ | No CLIP | 93.20 | 86.80 | 1.00 | 79.00 | 15.20 |
| ETH-Sunnyday ★ | With CLIP | 93.10 | 86.90 | 1.00 | 79.00 | 15.60 |

★ Detections from YOLOv8x fine-tuned on MOT20.

StrongSORT with and without CLIP appearance features (cf. Section 5.3.2):

| Seq. ID | Exp. | Recall ↑ | Prec. ↑ | IDSW ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|
| 02 | No CLIP | 97.30 | 85.30 | 479.00 | 77.90 | 91.30 |
| 02 | With CLIP | 97.50 | 85.60 | 318.00 | 79.30 | 91.80 |
| 09 | No CLIP | 92.30 | 80.10 | 45.00 | 68.50 | 96.40 |
| 09 | With CLIP | 93.50 | 80.00 | 42.00 | 69.30 | 96.30 |
| 11 | No CLIP | 91.70 | 92.20 | 50.00 | 83.30 | 96.40 |
| 11 | With CLIP | 92.00 | 92.00 | 49.00 | 83.50 | 96.30 |

StrongSORT results for the cross-dataset generalization experiment (cf. Section 5.3.3):

| Seq. ID | Exp. | Recall ↑ | Prec. ↑ | IDSW ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|
| PETS09-S2L1 | No CLIP | 94.60 | 89.30 | 100.00 | 81.10 | 24.30 |
| PETS09-S2L1 | With CLIP | 92.70 | 92.20 | 42.00 | 84.00 | 24.20 |
| TUD-Stadtmitte | No CLIP | 82.40 | 92.60 | 9.00 | 78.70 | 22.40 |
| TUD-Stadtmitte | With CLIP | 82.80 | 96.50 | 5.00 | 79.30 | 22.90 |
| TUD-Stadtmitte ★ | No CLIP | 86.20 | 97.60 | 15.00 | 82.70 | 26.70 |
| TUD-Stadtmitte ★ | With CLIP | 88.20 | 96.20 | 10.00 | 83.90 | 26.80 |

★ Detections from YOLOv8x fine-tuned on MOT20.

