AMTT: An End-to-End Anchor-Based Multi-Scale Transformer Tracking Method
Abstract
1. Introduction
- We design a multi-scale feature encoder. This encoder fuses the multi-layer features of the backbone network, which enhances the tracker's ability to perceive targets of different sizes (see the encoder sketch after this list).
- We design a feature focusing block inserted between the encoder and the decoder. This module aggregates the fused multi-layer features, enhancing them while reducing the number of feature tokens (see the second sketch after this list).
- We introduce an anchor into the traditional decoder to design an anchor-based decoder. The query of the traditional decoder is split into a content query and a location query, where the content query is a search feature and the location query is a predefined anchor. Combining the two parts yields a feature anchor that guides the decoder and enables the tracker to predict the location and size of the target more accurately (see the third sketch after this list).
- We evaluate our method on multiple datasets and compare it with state-of-the-art object tracking methods. The results validate the effectiveness of our method. In addition, we perform ablation experiments to verify the individual contribution of each module.
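The following is a minimal PyTorch sketch of the kind of multi-scale fusion the first contribution describes: each backbone stage is projected to a shared width, flattened to tokens, and fused jointly by self-attention. This is our illustration, not the authors' released code; the module names, the ResNet-style stage channels, and the layer counts are assumptions.

```python
# Hedged sketch of a multi-scale feature encoder (our illustration):
# backbone stage maps are projected to a shared channel width, flattened
# to tokens, concatenated, and fused by transformer self-attention.
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), dim=256,
                 num_layers=4, num_heads=8):
        super().__init__()
        # 1x1 convs align each stage (e.g., ResNet C3-C5) to `dim` channels.
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, kernel_size=1)
                                  for c in in_channels)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from shallow to deep stages.
        tokens = [p(f).flatten(2).transpose(1, 2)       # (B, H_i*W_i, dim)
                  for p, f in zip(self.proj, feats)]
        return self.encoder(torch.cat(tokens, dim=1))   # (B, sum of tokens, dim)
```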
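A hedged sketch of one way to realize the feature focusing block as the second contribution describes it: a small set of learned queries cross-attends to the fused tokens, so the output is enhanced yet contains fewer tokens. The pooling-by-cross-attention design and all names here are our assumptions, not the paper's definition.

```python
# Hedged sketch of a feature focusing block (our reading of the description):
# learned queries cross-attend to the fused multi-layer tokens, aggregating
# them into a shorter, enhanced token sequence.
import torch
import torch.nn as nn

class FeatureFocusingBlock(nn.Module):
    def __init__(self, dim=256, num_out_tokens=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_out_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):
        # tokens: (B, N, dim) fused features -> (B, num_out_tokens, dim).
        q = self.queries.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # cross-attention pooling
        x = self.norm1(q + pooled)
        return self.norm2(x + self.ffn(x))
```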
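Finally, a sketch of the anchor-based query construction in the spirit of the cited Conditional DETR line of work [22]: a predefined anchor box is sine-encoded into a location query and combined with a content query drawn from search features. The sinusoidal encoding and the one-anchor-per-token pairing are our assumptions; the paper specifies only the content/location split.

```python
# Hedged sketch of anchor-based decoder queries (our assumptions noted above).
import math
import torch

def sine_embed(anchors, dim=256):
    # anchors: (B, Q, 4) normalized boxes (cx, cy, w, h) -> (B, Q, dim);
    # each of the four coordinates receives dim // 4 sinusoidal features.
    n = dim // 8
    freq = 10000 ** (torch.arange(n, dtype=anchors.dtype,
                                  device=anchors.device) / n)
    pos = anchors.unsqueeze(-1) * (2 * math.pi) / freq   # (B, Q, 4, n)
    pos = torch.cat((pos.sin(), pos.cos()), dim=-1)      # (B, Q, 4, 2n)
    return pos.flatten(2)                                # (B, Q, 8n) = (B, Q, dim)

def build_queries(search_tokens, anchors):
    # Content query: search features; location query: sine-encoded anchors.
    # Assumes one predefined anchor per search-feature token.
    loc_query = sine_embed(anchors, dim=search_tokens.size(-1))
    return search_tokens + loc_query   # combined feature anchor fed to decoder
```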
2. Related Work
2.1. Transformer in Vision
2.2. Visual Object Tracking
3. Method
3.1. Feature Extraction Network
3.2. Multi-Scale Feature Fusion Networks
3.2.1. Multi-Scale Feature Encoder
3.2.2. Feature Focusing Block
3.2.3. Anchor-Based Decoder
3.3. Prediction Head and Loss Function
4. Experiments and Results
4.1. Experimental Details
4.2. Comparison of Experimental Results
4.3. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Jha, S.; Seo, C.; Yang, E.; Joshi, G.P. Real time object detection and tracking system for video surveillance system. Multimed. Tools Appl. 2021, 80, 3981–3996. [Google Scholar] [CrossRef]
- Pereira, R.; Carvalho, G.; Garrote, L.; Nunes, U.J. Sort and deep-SORT based multi-object tracking for mobile robotics: Evaluation with new data association metrics. Appl. Sci. 2022, 12, 1319. [Google Scholar] [CrossRef]
- Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552. [Google Scholar]
- Voigtlaender, P.; Luiten, J.; Torr, P.H.S.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6577–6587. [Google Scholar]
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 October 2016; Lecture Notes in Computer Science. Volume 9914, pp. 850–865. [Google Scholar]
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980. [Google Scholar]
- Fan, H.; Ling, H. Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7952–7961. [Google Scholar]
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117. [Google Scholar]
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 6268–6276. [Google Scholar]
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8122–8131. [Google Scholar]
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4277–4286. [Google Scholar]
- Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
- Wu, Y.; Lim, J.; Yang, M.H. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [PubMed]
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Hao, S.; Gao, S.; Ma, X.; An, B.; He, T. Anchor-free infrared pedestrian detection based on cross-scale feature fusion and hierarchical attention mechanism. Infrared Phys. Technol. 2023, 131, 104660. [Google Scholar] [CrossRef]
- Hao, S.; An, B.; Ma, X.; Sun, X.; He, T.; Sun, S. PKAMNet: A transmission line insulator parallel-gap fault detection network based on prior knowledge transfer and attention mechanism. IEEE Trans. Power Deliv. 2023, 38, 3387–3397. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596. [Google Scholar] [CrossRef]
- Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science. Volume 9909, pp. 472–488. [Google Scholar]
- Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191. [Google Scholar]
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15457–15466. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 658–666. [Google Scholar]
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Tang, F.; Ling, Q. Ranking-based Siamese visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750. [Google Scholar]
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 771–787. [Google Scholar]
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate Tracking by Overlap Maximization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4655–4664. [Google Scholar]
- Liang, Y.; Li, Q.; Long, F. Global dilated attention and target focusing network for robust tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1549–1557. [Google Scholar]
- Pi, Z.; Wan, W.; Sun, C.; Gao, C.; Sang, N.; Li, C. Hierarchical feature embedding for visual tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 428–445. [Google Scholar]
- Zhang, Z.; Liu, Y.; Wang, X.; Li, B.; Hu, W. Learn to match: Automatic matching network design for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13339–13348. [Google Scholar]
- Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1571–1580. [Google Scholar]
- Ma, F.; Shou, M.Z.; Zhu, L.; Fan, H.; Xu, Y.; Yang, Y.; Yan, Z. Unified transformer tracker for object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8781–8790. [Google Scholar]
- Peng, J.; Jiang, Z.; Gu, Y.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Lin, W. Siamrcr: Reciprocal classification and regression for visual object tracking. arXiv 2021, arXiv:2105.11237. [Google Scholar]
- Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6728–6737. [Google Scholar]
- Fu, Z.; Liu, Q.; Fu, Z.; Wang, Y. Stmtrack: Template-free visual tracking with space-time memory networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13774–13783. [Google Scholar]
| Parameter | Value |
|---|---|
| Template image size | 128 |
| Search image size | 256 |
| Number of epochs | 1000 |
| Batch size | 19 |
| Iterations per epoch | 1000 |
| Total training image pairs | 19,000,000 |
| Initial learning rate | 0.0001 |
| Final learning rate | 0.00001 |
| Output feature map size | 32 |
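The training budget in the table is internally consistent, as the short check below verifies. The decay law between the two listed learning rates is not stated in the table, so the exponential schedule in this sketch is our assumption.

```python
# Sanity check of the training budget implied by the table (simple arithmetic):
batch_size, iters_per_epoch, epochs = 19, 1000, 1000
assert batch_size * iters_per_epoch * epochs == 19_000_000  # total image pairs

# Hedged sketch of the learning-rate decay from 1e-4 to 1e-5; the table gives
# only the endpoints, so the exponential form here is our assumption.
start_lr, end_lr = 1e-4, 1e-5
def lr_at(epoch, total=epochs):
    return start_lr * (end_lr / start_lr) ** (epoch / (total - 1))

print(lr_at(0), lr_at(epochs - 1))  # 0.0001 ... 1e-05
```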
| Tracker | Success (%) | Precision (%) |
|---|---|---|
| AMTT (Ours) | 69.6 | 89.9 |
| RBO [31] | 69.7 | 90.7 |
| TransT [10] | 68.1 | 88.3 |
| Ocean [32] | 66.6 | 89.1 |
| ATOM [33] | 66.4 | 87.3 |
| DaSiamRPN [8] | 65.4 | 87.3 |
| SiamRPN [6] | 62.6 | 84.2 |
| SiamFC [5] | 58.3 | 76.5 |
| Metric | ATOM [33] | AutoMatch [36] | SiamGAT [3] | TrDiMP [37] | RBO [31] | UTT [38] | CIA [35] | GadTFT [34] | AMTT (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| AO (%) | 55.6 | 65.2 | 62.7 | 67.1 | 64.4 | 67.2 | 67.9 | 65.0 | 69.8 |
| SR0.5 (%) | 63.4 | 76.6 | 74.3 | 77.7 | 76.7 | 76.3 | 79.0 | 77.8 | 79.5 |
| SR0.75 (%) | 40.2 | 54.3 | 48.8 | 58.3 | 50.9 | 60.5 | 60.3 | 53.7 | 63.4 |

AO: average overlap; SR0.5 and SR0.75: success rates at overlap thresholds of 0.5 and 0.75.
| Metric | ATOM [33] | SiamRPN++ [11] | AutoMatch [36] | SiamRCR [39] | TransT [10] | UTT [38] | CIA [35] | GadTFT [34] | AMTT (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Precision (%) | 64.8 | 69.4 | 72.6 | 71.6 | 80.3 | 77.0 | 75.1 | 75.4 | 77.4 |
| Norm. Precision (%) | 77.1 | 80.0 | - | 81.8 | 86.7 | - | 84.5 | - | 84.8 |
| Success (%) | 70.3 | 73.3 | 76.0 | 76.4 | 81.4 | 79.7 | 79.2 | 77.8 | 80.0 |
| Tracker | Success (%) | Precision (%) |
|---|---|---|
| AMTT (Ours) | 69.2 | 85.9 |
| TrDiMP [37] | 67.5 | 85.6 |
| TransT [10] | 67.0 | 85.2 |
| SiamAttn [40] | 65.9 | 83.9 |
| STMTrack [41] | 65.7 | 83.2 |
| SiamPW-RBO [31] | 65.6 | 82.6 |
| AutoMatch [36] | 65.4 | 81.3 |
| Ocean [32] | 63.1 | 81.1 |
| SiamBAN [12] | 61.2 | 78.0 |
| HiFT [26] | 59.7 | 77.1 |
| Variant | MFE | FFB | AD | Success (%) | Precision (%) |
|---|---|---|---|---|---|
| I | | | | 50.2 | 70.3 |
| II | ✓ | | | | |
| III | ✓ | ✓ | | | |
| IV | ✓ | | ✓ | | |
| V | ✓ | ✓ | ✓ | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).