An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
Abstract
1. Introduction
2. Transformer-Based End-to-End MTMCT Algorithm Architecture
- (a) For the multi-view camera input, the corresponding detection frames and texture features were first obtained using a multi-dimensional feature extraction CNN and fed into the encoder;
- (b) The encoder received the raster semantic map, which was constructed from the target scene. Using the projection of the multi-dimensional feature detection frames from the multi-view detection results and the raster semantic map, the final detection frame result (Frame Pos Result) in the object space was obtained and sent to the decoder;
- (c) The decoder received the frame detection result from the encoder and the a priori query of the previous frame. The decoder consisted of three parts: the spatial clustering and semantic filtering algorithm, which generated the spatial clustering result; the multi-dimensional feature dynamic matching algorithm combined with the raster semantic map filter, which produced the feature result; and the space–time logic-based multi-visual target tracking algorithm, which created the logic result. These results were subjected to focal loss to obtain the continuous tracking ReID result of the current frame, which was input to the STCN;
- (d) The STCN produced an a priori query for the next frame and cascaded it with the historical query. The cascaded query was then fed into the decoder, and steps (a) to (c) were repeated for the next frame;
- (e) When the overall tracking had been completed, the ReID was optimized by reviewing the overall results using inverse-order processing and compensating for the confidence scores in the historical results.
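The control flow of steps (a) to (e) can be sketched as a per-frame loop followed by one pass over the stored results. This is only an illustrative skeleton, not the authors' implementation: the function name `track_sequence`, the five callables, and the choice to model the cascaded query as a Python list are all assumptions made for this example, and the toy stand-ins in `toy_demo` merely make the data flow visible.

```python
from typing import Any, Callable, List, Sequence


def track_sequence(
    frames: Sequence[Sequence[Any]],
    backbone: Callable[[Any], Any],
    encoder: Callable[[List[Any]], Any],
    decoder: Callable[[Any, List[Any]], Any],
    stcn: Callable[[Any], Any],
    retrospect: Callable[[List[Any]], List[Any]],
) -> List[Any]:
    """Run steps (a)-(e) over a sequence of multi-view frames.

    All five stage callables are hypothetical placeholders for the
    components named in the text.
    """
    query: List[Any] = []    # cascaded a priori queries; empty before the first frame
    results: List[Any] = []  # per-frame ReID results kept for the final review
    for views in frames:
        features = [backbone(view) for view in views]  # (a) per-camera detections and features
        frame_pos = encoder(features)                  # (b) detection result in object space
        reid = decoder(frame_pos, query)               # (c) ReID result of the current frame
        results.append(reid)
        query = query + [stcn(reid)]                   # (d) cascade the new query with the history
    return retrospect(results)                         # (e) review of the overall results


def toy_demo() -> List[Any]:
    """Exercise the loop with trivial stand-ins (two frames, two cameras)."""
    frames = [["cam0_f0", "cam1_f0"], ["cam0_f1", "cam1_f1"]]
    return track_sequence(
        frames,
        backbone=lambda view: view.upper(),              # stand-in for the CNN
        encoder=lambda feats: "+".join(feats),           # stand-in for map projection
        decoder=lambda pos, query: (pos, len(query)),    # records how many queries it saw
        stcn=lambda reid: reid[0],                       # stand-in for query generation
        retrospect=lambda results: results,              # identity placeholder for step (e)
    )


if __name__ == "__main__":
    for item in toy_demo():
        print(item)
```

In the toy run, the second element of each result shows that the decoder sees no query for the first frame and one cascaded query for the second, which is the dependency described in steps (c) and (d).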
2.1. Construction of Backbone Network and Encoder
2.2. Construction of a Transformer-Based Decoder
2.3. Construction of a Retrospective Mechanism Based on Inverse Order Processing
2.4. Collective Average Loss
3. Optimization of the Decoder Based on the Raster Semantic Map
3.1. Multidimensional Feature Matching on the Raster Semantic Maps
3.2. Space-Time Logic Matching Based on the Raster Semantic Maps
4. Experiments
4.1. Materials
4.1.1. Image Data
4.1.2. Raster Semantic Map Data Construction
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Validation of Single Camera Accuracy Results Based on the Publicly Available Dataset (MOT17)
4.5. Continuous Tracking Accuracy Based on the Self-Built Dataset OVIT-MOT01
4.6. Ablation Experiments Based on OVIT-MOT01
5. Discussion
5.1. Optimization Results Based on Multi-Dimensional Dynamic Feature Matching Method
5.2. Optimization Results Based on Temporal Logic Matching Method
5.3. Optimization Results Based on Retrospective Mechanism
6. Conclusions
7. Recommendations and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Sequence | IDF1 | MOTA | IDP | IDR | Recall | Precision |
---|---|---|---|---|---|---|
MOT17-02-SDP | 0.577433 | 0.666111 | 0.682192 | 0.500565 | 0.707013 | 0.963547 |
MOT17-04-SDP | 0.907841 | 0.945097 | 0.919049 | 0.896903 | 0.961099 | 0.984831 |
MOT17-05-SDP | 0.735971 | 0.788926 | 0.809045 | 0.675004 | 0.816684 | 0.97886 |
MOT17-09-SDP | 0.643427 | 0.782535 | 0.696413 | 0.597934 | 0.824601 | 0.960411 |
MOT17-10-SDP | 0.648384 | 0.730119 | 0.716602 | 0.592024 | 0.784952 | 0.950127 |
MOT17-11-SDP | 0.835397 | 0.873145 | 0.860902 | 0.811361 | 0.910237 | 0.965816 |
MOT17-13-SDP | 0.727051 | 0.801151 | 0.767242 | 0.690861 | 0.855437 | 0.950014 |
OVERALL | 0.781575 | 0.83606 | 0.828008 | 0.740073 | 0.868322 | 0.971496 |
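
As a consistency check on the table above, IDF1 is by definition the harmonic mean of identification precision (IDP) and identification recall (IDR), so the reported IDF1 column should be recoverable from the other two. The short sketch below recomputes it for each row; the helper names `idf1_from` and `max_deviation` are introduced here for illustration only, and the numbers are copied from the table.

```python
# (sequence, IDF1, IDP, IDR) copied from the results table
ROWS = [
    ("MOT17-02-SDP", 0.577433, 0.682192, 0.500565),
    ("MOT17-04-SDP", 0.907841, 0.919049, 0.896903),
    ("MOT17-05-SDP", 0.735971, 0.809045, 0.675004),
    ("MOT17-09-SDP", 0.643427, 0.696413, 0.597934),
    ("MOT17-10-SDP", 0.648384, 0.716602, 0.592024),
    ("MOT17-11-SDP", 0.835397, 0.860902, 0.811361),
    ("MOT17-13-SDP", 0.727051, 0.767242, 0.690861),
    ("OVERALL", 0.781575, 0.828008, 0.740073),
]


def idf1_from(idp: float, idr: float) -> float:
    """Harmonic mean of identification precision and recall."""
    return 2.0 * idp * idr / (idp + idr)


def max_deviation(rows) -> float:
    """Largest absolute gap between reported and recomputed IDF1."""
    return max(abs(idf1_from(idp, idr) - idf1) for _, idf1, idp, idr in rows)


if __name__ == "__main__":
    for name, idf1, idp, idr in ROWS:
        print(f"{name}: reported {idf1:.6f}, recomputed {idf1_from(idp, idr):.6f}")
    print(f"max deviation: {max_deviation(ROWS):.2e}")
```

The recomputed values agree with the reported column to within rounding, which supports reading the first numeric column as IDF1 and the third and fourth as IDP and IDR.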
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hong, Y.; Li, D.; Luo, S.; Chen, X.; Yang, Y.; Wang, M. An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention. Remote Sens. 2022, 14, 6354. https://doi.org/10.3390/rs14246354