Visual Tracking with FPN Based on Transformer and Response Map Enhancement
Abstract
1. Introduction
- We propose an object tracking algorithm based on a Transformer feature pyramid and add a response map enhancement module. Compared with other trackers, ours achieves substantially better accuracy and robustness when handling the challenges of similar objects and occlusion.
- The Transformer-based feature pyramid proposed in this paper establishes global information interaction and efficiently integrates low-level and high-level feature information, which effectively improves tracking performance. The proposed response map enhancement module makes full use of the contextual information in the response map, improving the model's ability to distinguish the object from the background and to cope with interference from similar targets and other complex environments.
- Extensive comparison and ablation experiments are conducted on several challenging benchmarks, demonstrating that the proposed method achieves competitive performance in both accuracy and speed. Depending on the backbone network selected, the algorithm runs at 45–130 FPS, meeting real-time requirements.
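The cross-layer fusion described above can be illustrated with scaled dot-product attention (Vaswani et al.) between flattened feature maps of two pyramid levels. This is a minimal NumPy sketch under assumed shapes and a residual-fusion design; it is not the paper's actual implementation, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_level_attention(low, high, d=64):
    """Fuse a low-level feature map with a high-level one via scaled
    dot-product attention: every low-level position (query) attends
    over all high-level positions (keys/values), so each output mixes
    global high-level context into the local low-level feature.
    low:  (N_low, d)  flattened low-level features
    high: (N_high, d) flattened high-level features
    """
    scores = low @ high.T / np.sqrt(d)   # (N_low, N_high) similarity
    weights = softmax(scores, axis=-1)   # rows sum to 1 over key positions
    return low + weights @ high          # residual fusion, (N_low, d)

# Illustrative shapes: a 31x31 low-level map, a 16x16 high-level map, 64 channels
rng = np.random.default_rng(0)
low = rng.standard_normal((31 * 31, 64))
high = rng.standard_normal((16 * 16, 64))
fused = cross_level_attention(low, high)
print(fused.shape)  # (961, 64)
```

The residual connection keeps the low-level spatial detail while the attention term injects high-level semantic context, which is the general idea behind Transformer-based feature pyramids.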
2. Related Work
3. Proposed Method
3.1. Overall Overview
3.2. Transformer Feature Pyramid
3.3. Similarity Match Network
4. Results
4.1. Experimental Details
4.2. Algorithm Comparison
4.2.1. UAV123 Benchmark
- Attribute-Based Evaluation
4.2.2. LaSOT Benchmark
4.2.3. GOT-10K Benchmark
4.3. Visual Comparison of Tracking Results
4.4. Ablation Experiment
4.4.1. Heat Map Experiment
4.4.2. Data Experiment
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-day object tracking for unmanned aerial vehicle. IEEE Trans. Mob. Comput. 2022.
- Chang, C.Y.; Lie, H.W. Real-time visual tracking and measurement to control fast dynamics of overhead cranes. IEEE Trans. Ind. Electron. 2011, 59, 1640–1649.
- Lee, K.H.; Hwang, J.N. On-road pedestrian tracking across multiple driving recorders. IEEE Trans. Multimed. 2015, 17, 1429–1438.
- Tao, R.; Gavves, E.; Smeulders, A.W.M. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 October 2016; Springer: Cham, Switzerland, 2016; pp. 850–865.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
- Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–7 September 2014; Springer: Cham, Switzerland, 2014; pp. 254–265.
- Danelljan, M.; Hager, G.; Shahbaz Khan, F.; Felsberg, M. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 58–66.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6638–6646.
- Li, Y.; Zhang, X. SiamVGG: Visual tracking using deeper siamese networks. arXiv 2019, arXiv:1902.02804.
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
- Xi, J.; Yang, J.; Chen, X.; Wang, Y.; Cai, H. Double Branch Attention Block for Discriminative Representation of Siamese Trackers. Appl. Sci. 2022, 12, 2897.
- Guo, D.; Wang, J.; Cui, Y.; Wang, Z.; Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6269–6277.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Shan, Y.; Zhou, X.; Liu, S.; Zhang, Y.; Huang, K. SiamFPN: A deep learning method for accurate and real-time maritime ship tracking. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 315–325.
- Qi, Y.; Song, Y.Z.; Zhang, H.; Liu, J. Sketch-based image retrieval via siamese convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2460–2464.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical Feature Transformer for Aerial Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15457–15466.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Zhang, D.; Zhang, H.; Tang, J.; Wang, M.; Hua, X.; Sun, Q. Feature pyramid transformer. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 323–339.
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 445–461.
- Huang, L.; Zhao, X.; Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
- Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4660–4669.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
- Nam, H.; Han, B. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4293–4302.
- Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1328–1338.
- Sauer, A.; Aljalbout, E.; Haddadin, S. Tracking holistic object representations. arXiv 2019, arXiv:1907.12920.
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556.
| Tracker | Precision ↑ | Normalized Precision ↑ | Success ↑ |
|---|---|---|---|
| SINT [4] | 0.295 | 0.354 | 0.314 |
| ECO [14] | 0.310 | 0.338 | 0.324 |
| SiamFC [5] | 0.339 | 0.420 | 0.336 |
| MDNet [33] | 0.373 | 0.460 | 0.397 |
| SiamMask [34] | 0.469 | 0.552 | 0.467 |
| SiamRPN++ [7] | 0.491 | 0.569 | 0.495 |
| ATOM [31] | 0.497 | 0.605 | 0.499 |
| SiamCAR [18] | 0.507 | 0.600 | 0.507 |
| Ours | 0.527 | 0.606 | 0.528 |
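The precision and success numbers above follow the standard one-pass evaluation protocol: precision is the fraction of frames whose predicted center lies within a pixel threshold (conventionally 20 px) of the ground truth, and success is the area under the overlap-threshold curve. A minimal NumPy sketch of these two metrics (variable names and sample data are illustrative):

```python
import numpy as np

def precision_at(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose predicted center is within
    `threshold` pixels of the ground-truth center."""
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float((err <= threshold).mean())

def success_score(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    """Area under the success curve: the mean, over overlap
    thresholds in [0, 1], of the fraction of frames whose IoU
    exceeds that threshold."""
    return float(np.mean([(ious > t).mean() for t in thresholds]))

# Two-frame toy sequence: the first prediction is ~5.8 px off, the second 30 px off
gt = np.array([[100.0, 100.0], [200.0, 150.0]])
pred = np.array([[105.0, 103.0], [230.0, 150.0]])
print(precision_at(pred, gt))  # 0.5: only the first frame is within 20 px

ious = np.array([0.8, 0.1])
print(success_score(ious))
```

The success score rewards trackers across the whole range of overlap thresholds rather than at a single cutoff, which is why it is the primary ranking metric on LaSOT-style benchmarks.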
| Tracker | AO | SR0.5 | SR0.75 | FPS |
|---|---|---|---|---|
| KCF [12] | 0.203 | 0.177 | 0.065 | 94.66 |
| ECO [14] | 0.316 | 0.309 | 0.111 | 2.62 |
| SiamFC [5] | 0.374 | 0.404 | 0.144 | 25.81 |
| THOR [35] | 0.447 | 0.538 | 0.204 | 1.00 |
| SiamRPN [6] | 0.483 | 0.581 | 0.270 | 97.55 |
| SiamFC++ [35] | 0.493 | 0.577 | 0.323 | 160 |
| SiamRPN++ [7] | 0.518 | 0.618 | 0.325 | 35 |
| ATOM [31] | 0.556 | 0.634 | 0.402 | 30 |
| SiamCAR [18] | 0.569 | 0.670 | 0.415 | 50 |
| Ours | 0.582 | 0.683 | 0.457 | 45 |
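The AO and SR columns above are the GOT-10k metrics (Huang et al.): average overlap (AO) is the mean IoU between predicted and ground-truth boxes over all frames, and SR0.5 / SR0.75 are the fractions of frames whose IoU exceeds 0.5 / 0.75. A minimal NumPy sketch of both, assuming boxes in (x, y, w, h) format (sample data is illustrative):

```python
import numpy as np

def iou(a, b):
    """IoU between corresponding (x, y, w, h) boxes; a, b: (N, 4)."""
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter
    return inter / union

def ao_sr(pred, gt, thr=0.5):
    """Average overlap (AO) and success rate SR_thr: mean IoU over
    all frames, and the fraction of frames with IoU above `thr`."""
    o = iou(pred, gt)
    return float(o.mean()), float((o > thr).mean())

# Toy sequence: one perfect prediction, one shifted halfway (IoU = 1/3)
pred = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
gt = np.array([[0.0, 0.0, 10.0, 10.0], [5.0, 0.0, 10.0, 10.0]])
ao, sr = ao_sr(pred, gt)
print(ao, sr)
```

SR0.75 uses the same computation with `thr=0.75` and so penalizes loose boxes more heavily, which is why it separates trackers with accurate bounding-box regression from those without.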
| No. | Backbone | TFN | RmE | Success | Precision | AO | FPS |
|---|---|---|---|---|---|---|---|
| 1 | AlexNet | √ | √ | 0.485 | 0.779 | 0.550 | 130 |
| 2 | ResNet-50 | × | × | 0.507 | 0.797 | 0.569 | 50 |
| 3 | ResNet-50 | √ | × | 0.507 | 0.806 | 0.572 | 47 |
| 4 | ResNet-50 | × | √ | 0.519 | 0.820 | 0.575 | 48 |
| 5 | ResNet-50 | √ | √ | 0.527 | 0.832 | 0.582 | 45 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Deng, A.; Liu, J.; Chen, Q.; Wang, X.; Zuo, Y. Visual Tracking with FPN Based on Transformer and Response Map Enhancement. Appl. Sci. 2022, 12, 6551. https://doi.org/10.3390/app12136551