Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking
Abstract
1. Introduction
- We propose a novel feature extraction network that adaptively aggregates the local information extracted by convolution with the global information extracted by the transformer, encoding both to strengthen the tracker's feature representation.
- We design the Multiple Matching Network, which decouples the classification and regression branches by incorporating the target's focus information and improving the cross-correlation operation; it effectively distinguishes the target foreground from the background and strengthens the tracker's robustness in complex scenarios.
- Extensive experiments and ablation studies on three challenging UAV benchmark datasets demonstrate the superior performance of SiamSTM in complex scenarios. SiamSTM runs at 193 FPS on a PC, and its deployment on an NVIDIA Jetson AGX Xavier sustains more than 35 FPS.
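The first two contributions above can be illustrated with a minimal NumPy sketch. This is not the paper's actual SiamSTM implementation: the function names, the single-head attention, the fixed fusion weight `alpha`, and the per-channel 3×3 convolution are all illustrative assumptions. The sketch shows the general pattern of (a) fusing a convolutional local branch with a self-attention global branch, and (b) matching a template against a search region with depthwise cross-correlation.

```python
import numpy as np

def local_branch(x, kernel):
    """Per-channel 3x3 convolution ("same" padding) capturing local detail.
    x: (C, H, W) feature map; kernel: (C, 3, 3) per-channel filters."""
    C, H, W = x.shape
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(pad[c, i:i + 3, j:j + 3] * kernel[c])
    return out

def global_branch(x):
    """Single-head self-attention over flattened spatial tokens (global context)."""
    C, H, W = x.shape
    tokens = x.reshape(C, H * W).T                 # (N, C) spatial tokens
    scores = tokens @ tokens.T / np.sqrt(C)        # (N, N) pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ tokens                            # attention-weighted mix of tokens
    return out.T.reshape(C, H, W)

def aggregate(x, kernel, alpha=0.5):
    """Fuse local and global branches. The paper describes an adaptive
    aggregation; here it is reduced to a fixed blend for illustration."""
    return alpha * local_branch(x, kernel) + (1 - alpha) * global_branch(x)

def depthwise_xcorr(search, template):
    """Depthwise cross-correlation: slide the template over the search
    features channel by channel, producing a per-channel response map."""
    C, Hs, Ws = search.shape
    _, Ht, Wt = template.shape
    Ho, Wo = Hs - Ht + 1, Ws - Wt + 1
    resp = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                resp[c, i, j] = np.sum(search[c, i:i + Ht, j:j + Wt] * template[c])
    return resp
```

In a Siamese tracker, the aggregated features of the template and search images would each pass through `aggregate`, and `depthwise_xcorr` would then produce the response maps consumed by the classification and regression heads; a real implementation would use batched, vectorized convolutions rather than the explicit loops shown here.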
2. Related Work
3. Proposed Method
3.1. Overall Overview
3.2. Slight Aware Enhancement Transformer
3.3. Multiple Matching Network
4. Results
4.1. Implementation Details
4.2. Experiments on the UAV123 Benchmark
4.3. Experiments on the UAV20L Benchmark
4.4. Experiments on the UAVDT Benchmark
4.5. Ablation Study
4.6. Qualitative Evaluation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Li, B.; Fu, C.; Ding, F.; Ye, J.; Lin, F. All-day object tracking for unmanned aerial vehicle. IEEE Trans. Mob. Comput. 2022.
- Zhang, Z. Object tracking based on satellite videos: A literature review. Remote Sens. 2022, 14, 3674.
- Fu, C.; Lu, K.; Zheng, G.; Ye, J.; Cao, Z.; Li, B. Siamese object tracking for unmanned aerial vehicle: A review and comprehensive analysis. arXiv 2022, arXiv:2205.04281.
- Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2544–2550.
- Tao, R.; Gavves, E.; Smeulders, A.W.M. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1420–1429.
- Marvasti-Zadeh, S.M.; Cheng, L.; Ghanei-Yakhdan, H. Deep learning for visual tracking: A comprehensive survey. IEEE Trans. Intell. Transp. Syst. 2021.
- Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596.
- Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning aberrance repressed correlation filters for real-time UAV tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
- Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P. Fully-convolutional siamese networks for object tracking. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8–10 and 15–16, 2016, Proceedings, Part II 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 850–865.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980.
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4282–4291.
- Zhang, Z.; Zhang, L. Domain adaptive SiamRPN++ for object tracking in the wild. arXiv 2021, arXiv:2106.07862.
- Peng, J.; Jiang, Z.; Gu, Y.; Wu, Y.; Wang, Y.; Tai, Y. SiamRCR: Reciprocal classification and regression for visual object tracking. arXiv 2021, arXiv:2105.11237.
- Voigtlaender, P.; Luiten, J.; Torr, P.; Leibe, B. Siam R-CNN: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6578–6588.
- Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Thangavel, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in single object tracking: An experimental survey. arXiv 2023, arXiv:2302.11867.
- Deng, A.; Liu, J.; Chen, Q.; Wang, X.; Zuo, Y. Visual tracking with FPN based on transformer and response map enhancement. Appl. Sci. 2022, 12, 6551.
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical feature transformer for aerial tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15457–15466.
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3086–3092.
- Yao, L.; Fu, C.; Li, S. SGDViT: Saliency-guided dynamic vision transformer for UAV tracking. arXiv 2023, arXiv:2303.04378.
- Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808.
- Blatter, P.; Kanakis, M.; Danelljan, M.; Gool, L.V. Efficient visual tracking with exemplar transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 June 2023.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135.
- Xu, Y.; Zhang, Q.; Zhang, J.; Tao, D. ViTAE: Vision transformer advanced by exploring intrinsic inductive bias. Adv. Neural Inf. Process. Syst. 2021, 34, 28522–28535.
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
- Javed, S.; Danelljan, M.; Khan, F.S.; Khan, M.H.; Felsberg, M.; Matas, J. Visual object tracking with discriminative filters and siamese networks: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2022.
- Yan, B.; Zhang, X.; Wang, D.; Lu, H.; Yang, X. Alpha-Refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5289–5298.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 445–461.
- Isaac-Medina, B.; Poyser, M.; Organisciak, D.; Willcocks, C.G.; Breckon, T.P.; Shum, H. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Lawrence Zitnick, C.; Dollár, P. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4591–4600.
- Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H.S. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409.
- Fu, C.; Peng, W.; Li, S.; Ye, J.; Cao, Z. Local perception-aware transformer for aerial tracking. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022.
- Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556.
- Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 101–117.
- Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 695–712.
- Zhang, L.; Gonzalez-Garcia, A.; Weijer, J.V.D.; Danelljan, M.; Khan, F.S. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4010–4019.
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 771–787.
| Trackers | Backbone | Pre. | Suc. | Model Size (MB) | FPS (GPU) | FPS (Xavier) |
|---|---|---|---|---|---|---|
| SiamAPN | AlexNet | 0.765 | 0.579 | 118.7 | 180.4 | 34.5 |
| SiamAPN++ | AlexNet | 0.764 | 0.579 | 187.1 | 175.2 | 34.9 |
| SGDViT | AlexNet | 0.766 | 0.585 | 183 | 115.3 | 23 |
| LPAT [41] | AlexNet | 0.790 | 0.593 | 74.6 | 120.4 | 21 |
| TCTrack [24] | AlexNet | 0.800 | 0.604 | 87.1 | 125.6 | 27 |
| Ours | Slight-ViT | 0.809 | 0.618 | 31.1 | 193 | 36 |
| Trackers | Overall Pre. | Overall Suc. | OB Pre. | OB Suc. | LO Pre. | LO Suc. | SV Pre. | SV Suc. | CM Pre. | CM Suc. |
|---|---|---|---|---|---|---|---|---|---|---|
| SiamFC | 0.681 | 0.447 | 0.738 | 0.453 | 0.479 | 0.359 | 0.604 | 0.424 | 0.616 | 0.409 |
| ECO [44] | 0.702 | 0.451 | 0.710 | 0.437 | 0.508 | 0.360 | 0.632 | 0.431 | 0.644 | 0.422 |
| UpdateNet [45] | 0.790 | 0.487 | 0.800 | 0.467 | 0.626 | 0.420 | 0.790 | 0.498 | 0.745 | 0.452 |
| SiamAPN | 0.710 | 0.516 | 0.777 | 0.533 | 0.546 | 0.422 | 0.698 | 0.507 | 0.619 | 0.465 |
| Ocean [46] | 0.725 | 0.523 | 0.762 | 0.555 | 0.566 | 0.430 | 0.684 | 0.528 | 0.678 | 0.506 |
| SiamAPN++ | 0.758 | 0.549 | 0.800 | 0.560 | 0.588 | 0.451 | 0.745 | 0.536 | 0.719 | 0.519 |
| Baseline | 0.780 | 0.571 | 0.821 | 0.593 | 0.605 | 0.463 | 0.712 | 0.559 | 0.739 | 0.548 |
| SiamSTM | 0.802 | 0.595 | 0.866 | 0.626 | 0.647 | 0.541 | 0.771 | 0.582 | 0.763 | 0.560 |
| No. | Slight-ViT | MMH | Overall | VC | OV | FO | CM |
|---|---|---|---|---|---|---|---|
| 1 | × | × | 0.711 | 0.666 | 0.705 | 0.515 | 0.693 |
| 2 | √ | × | 0.732 | 0.715 | 0.753 | 0.557 | 0.705 |
| 3 | × | √ | 0.725 | 0.690 | 0.742 | 0.533 | 0.712 |
| 4 | √ | √ | 0.744 | 0.741 | 0.784 | 0.581 | 0.728 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight Aware Enhancement Transformer and Multiple Matching Network for Real-Time UAV Tracking. Remote Sens. 2023, 15, 2857. https://doi.org/10.3390/rs15112857