TLtrack: Combining Transformers and a Linear Model for Robust Multi-Object Tracking
Abstract
1. Introduction
2. Related Work
2.1. Tracking by Detection
2.2. Motion Model
2.3. Transformers in MOT
3. Methodology
3.1. Architecture
3.2. Transformers and Linear Track
3.3. Training
4. Experiments
4.1. Settings
4.2. Benchmark Results
4.3. Ablation Studies
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.K. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448.
- Zhu, J.; Lao, Y.; Zheng, Y.F. Object tracking in structured environments for video surveillance applications. IEEE Trans. Circuits Syst. Video Technol. 2009, 20, 223–235.
- Xing, W.; Yang, Y.; Zhang, S.; Yu, Q.; Wang, L. NoisyOTNet: A robust real-time vehicle tracking model for traffic surveillance. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2107–2119.
- Lee, Y.G.; Tang, Z.; Hwang, J.N. Online-learning-based human tracking across non-overlapping cameras. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2870–2883.
- Zhang, K.; Li, Y.; Wang, J.; Cambria, E.; Li, X. Real-time video emotion recognition based on reinforcement learning and domain knowledge. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1034–1047.
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
- Kalman, R.E. Contributions to the theory of optimal control. Bol. Soc. Mat. Mex. 1960, 5, 102–119.
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-object tracking by associating every detection box. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXII. Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21.
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
- Sun, P.; Cao, J.; Jiang, Y.; Yuan, Z.; Bai, S.; Kitani, K.; Luo, P. DanceTrack: Multi-object tracking in uniform appearance and diverse motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 20993–21002.
- Cao, J.; Weng, X.; Khirodkar, R.; Pang, J.; Kitani, K. Observation-centric SORT: Rethinking SORT for robust multi-object tracking. arXiv 2022, arXiv:2203.14360.
- Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490.
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
- Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-end multiple-object tracking with transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Tokmakov, P.; Li, J.; Burgard, W.; Gaidon, A. Learning to track with object permanence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10860–10869.
- Wang, Q.; Zheng, Y.; Pan, P.; Xu, Y. Multiple object tracking with correlation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3876–3886.
- Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13708–13715.
- Wu, J.; Cao, J.; Song, L.; Wang, Y.; Yang, M.; Yuan, J. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12352–12361.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460.
- Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. TrackFormer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854.
- Zhao, Z.; Wu, Z.; Zhuang, Y.; Li, B.; Jia, J. Tracking objects as pixel-wise distributions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 76–94.
- Chaabane, M.; Zhang, P.; Beveridge, J.R.; O’Hara, S. DEFT: Detection embeddings for tracking. arXiv 2021, arXiv:2102.02267.
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2462–2470.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951.
- Zhu, T.; Hiller, M.; Ehsanpour, M.; Ma, R.; Drummond, T.; Reid, I.; Rezatofighi, H. Looking beyond two frames: End-to-end multi-object tracking using spatial and temporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12783–12797.
- Zhou, X.; Yin, T.; Koltun, V.; Krähenbühl, P. Global tracking transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8771–8780.
- Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
- Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
- Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578.
- Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. FairMOT: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087.
- Fischer, T.; Huang, T.E.; Pang, J.; Qiu, L.; Chen, H.; Darrell, T.; Yu, F. QDTrack: Quasi-dense similarity learning for appearance-only multiple object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15380–15393.
Tracker | HOTA↑ | DetA↑ | AssA↑ | MOTA↑ | IDF1↑ |
---|---|---|---|---|---|
CenterTrack [12] | 41.8 | 78.1 | 22.6 | 86.8 | 35.7 |
TransTrack [24] | 45.5 | 75.9 | 27.5 | 88.4 | 45.2 |
FairMOT [36] | 39.7 | 66.7 | 23.8 | 82.2 | 40.8 |
TraDes [22] | 43.3 | 74.5 | 25.4 | 86.2 | 41.2 |
QDTrack [37] | 45.7 | 72.1 | 29.2 | 83.0 | 44.8 |
MOTR [14] | 48.4 | 71.8 | 32.7 | 79.2 | 46.1 |
GTR [32] | 48.0 | 72.5 | 31.9 | 84.7 | 50.3 |
SORT [6] | 47.9 | 72.0 | 31.2 | 91.8 | 50.8 |
DeepSORT [9] | 45.6 | 71.0 | 29.7 | 87.8 | 47.9 |
ByteTrack [8] | 47.3 | 71.6 | 31.4 | 89.5 | 52.5 |
TLtrack (ours) | 49.1 | 73.0 | 31.5 | 89.0 | 51.8 |
DanceTrack-Val

Linear | Transformer | Hybrid | HOTA↑ | AssA↑ | DetA↑ | IDF1↑ |
---|---|---|---|---|---|---|
 | | | 44.9 | 28.5 | 71.3 | 46.3 |
√ | | | 45.9 | 29.5 | 71.5 | 48.1 |
 | √ | | 45.3 | 29.0 | 71.3 | 46.6 |
 | | √ | 46.6 | 31.0 | 71.4 | 49.1 |
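As a rough illustration of the linear half of the hybrid (this is a sketch, not the paper's exact formulation): a linear motion model in tracking-by-detection commonly fits a constant-velocity line to a track's last M observed boxes and extrapolates one frame ahead; under that reading, M in the ablation below would be the history length, with M = 4 performing best. The function name and box parameterization (center x, center y, width, height) are illustrative assumptions.

```python
import numpy as np

def linear_predict(boxes, m=4):
    """Sketch of a linear (constant-velocity) motion model.

    boxes: sequence of past boxes [cx, cy, w, h], oldest first.
    m: number of most recent boxes to fit (history length).
    Returns the extrapolated box for the next frame.
    """
    hist = np.asarray(boxes[-m:], dtype=float)      # (T, 4) with T <= m
    t = np.arange(len(hist))
    # Least-squares fit of each coordinate as intercept + slope * t;
    # polyfit with 2-D y fits all four coordinates at once.
    coeffs = np.polyfit(t, hist, deg=1)             # (2, 4): [slopes; intercepts]
    t_next = len(hist)
    return coeffs[0] * t_next + coeffs[1]           # extrapolate one step
```

Compared with a Kalman filter as in SORT, a least-squares fit over a short window reacts faster to velocity changes at the cost of smoothing, which is one plausible reason a hybrid with a learned (Transformer) predictor helps on the irregular motion of DanceTrack.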
DanceTrack-Val

M | HOTA↑ | AssA↑ | DetA↑ | IDF1↑ |
---|---|---|---|---|
2 | 44.5 | 27.9 | 71.3 | 45.7 |
3 | 45.2 | 28.9 | 71.5 | 47.6 |
4 | 46.6 | 31.0 | 71.4 | 49.1 |
5 | 46.2 | 30.7 | 71.4 | 48.3 |
DanceTrack-Val

 | HOTA↑ | AssA↑ | DetA↑ | IDF1↑ |
---|---|---|---|---|
0.7 | 45.2 | 29.1 | 71.5 | 47.4 |
0.8 | 45.8 | 29.5 | 71.5 | 47.6 |
0.9 | 46.6 | 31.0 | 71.4 | 49.1 |
DanceTrack-Val

Short-Side | HOTA↑ | AssA↑ | DetA↑ | IDF1↑ |
---|---|---|---|---|
540 pix | 45.8 | 29.9 | 71.4 | 48.4 |
800 pix | 46.6 | 31.0 | 71.4 | 49.1 |
1080 pix | 46.0 | 30.6 | 71.4 | 48.5 |
DanceTrack-Val

Sample Rate | HOTA↑ | AssA↑ | DetA↑ | IDF1↑ |
---|---|---|---|---|
original | 46.6 | 31.0 | 71.4 | 49.1 |
two frames | 46.0 | 30.2 | 71.3 | 48.6 |
three frames | 45.8 | 30.0 | 71.3 | 48.0 |
four frames | 45.4 | 29.5 | 71.3 | 47.6 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
He, Z.; Zhao, K.; Zeng, D. TLtrack: Combining Transformers and a Linear Model for Robust Multi-Object Tracking. AI 2024, 5, 938-947. https://doi.org/10.3390/ai5030047