SAMViTrack: A Search-Region Adaptive Mamba-ViT Tracker for Real-Time UAV Tracking
Abstract
1. Introduction
- A lightweight hybrid backbone integrating Mamba and ViT. We design a compact Mamba–ViT hybrid backbone that couples Mamba's efficient sequence modeling with ViT's strong spatial representation. Unlike pure-Mamba or pure-ViT variants, the hybrid design achieves a balanced trade-off between accuracy and computational overhead, making it suitable for resource-constrained UAV platforms.
- A plug-and-play adaptive search-region mechanism. We introduce a search-region adaptive (SA) module that resizes the search area based solely on the target's relative velocity and is activated only at inference time. Unlike existing adaptive search mechanisms, such as that of Aba-ViTrack, which rely on auxiliary appearance cues and online regression, our method requires no additional network branches, online optimization, or extra model parameters. This makes the SA module architecture-agnostic, easily attachable to various trackers, and free of training-time cost.
- Extensive validation demonstrating strong generalization. Although designed for UAV tracking, the proposed SA module provides consistent improvements when integrated into different state-of-the-art trackers (e.g., OSTrack, AQATrack, HIPTrack) and evaluated on both UAV and general tracking benchmarks. These results confirm that the mechanism is not restricted to UAV viewpoints but serves as a broadly applicable dynamic search strategy.
2. Related Works
3. Method
3.1. Overview
3.2. Hybrid Mamba-ViT for Feature Representation
3.3. Search-Region Adaptive Module
3.4. Tracking Head and Loss Function
3.5. Parameter Sensitivity Observation
3.6. Summary and Motivation for Adaptive Search
4. Experiments
4.1. Implementation Details
4.2. Comparison with Lightweight Trackers
4.3. Comparison with Deep Trackers
4.4. Attribute-Based Evaluation
4.5. Ablation Study
4.6. Computational Complexity
4.7. Qualitative Results
5. Discussion
5.1. Limitations and Design Trade-Offs
5.2. Future Directions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fang, Z.; Savkin, A.V. Strategies for Optimized UAV Surveillance in Various Tasks and Scenarios: A Review. Drones 2024, 8, 193.
- Zhang, C.; Zhou, W.; Qin, W.; Tang, W. A novel UAV path planning approach: Heuristic crossing search and rescue optimization algorithm. Expert Syst. Appl. 2023, 215, 119243.
- Yang, Z.; Yu, X.; Dedman, S.; Rosso, M.; Zhu, J.; Yang, J.; Xia, Y.; Tian, Y.; Zhang, G.; Wang, J. UAV remote sensing applications in marine monitoring: Knowledge visualization and review. Sci. Total Environ. 2022, 838, 155939.
- Saponi, M.; Borboni, A.; Adamini, R.; Faglia, R.; Amici, C. Embedded payload solutions in UAVs for medium and small package delivery. Machines 2022, 10, 737.
- Cao, Z.; Huang, Z.; Pan, L.; Zhang, S.; Liu, Z.; Fu, C. TCTrack: Temporal contexts for aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14798–14808.
- Wang, X.; Zeng, D.; Zhao, Q.; Li, S. Rank-based filter pruning for real-time UAV tracking. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
- Li, S.; Yang, Y.; Zeng, D.; Wang, X. Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 13943–13954.
- Liu, M.; Wang, Y.; Sun, Q.; Li, S. Global Filter Pruning with Self-Attention for Real-Time UAV Tracking. In Proceedings of the British Machine Vision Conference, London, UK, 21–24 November 2022.
- Howard, W.W.; Martone, A.F.; Buehrer, R.M. Timely target tracking: Distributed updating in cognitive radar networks. IEEE Trans. Radar Syst. 2024, 2, 318–332.
- Brenner, E.; de la Malla, C.; Smeets, J.B. Tapping on a target: Dealing with uncertainty about its position and motion. Exp. Brain Res. 2023, 241, 81–104.
- Zeng, K.; You, Y.; Shen, T.; Wang, Q.; Tao, Z.; Wang, Z.; Liu, Q. NCT: Noise-control multi-object tracking. Complex Intell. Syst. 2023, 9, 4331–4347.
- Sun, L.; Chang, J.; Zhang, J.; Fan, B.; He, Z. Adaptive image dehazing and object tracking in UAV videos based on the template updating Siamese network. IEEE Sens. J. 2023, 23, 12320–12333.
- Ma, Y.; He, J.; Yang, D.; Zhang, T.; Wu, F. Adaptive part mining for robust visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11443–11457.
- Gao, X.; Lin, X.; Lin, F.; Huang, H. Segmentation Point Simultaneous Localization and Mapping: A Stereo Vision Simultaneous Localization and Mapping Method for Unmanned Surface Vehicles in Nearshore Environments. Electronics 2024, 13, 3106.
- Chen, K.; Wang, L.; Wu, H.; Wu, C.; Liao, Y.; Chen, Y.; Wang, H.; Yan, J.; Lin, J.; He, J. Background-Aware Correlation Filter for Object Tracking with Deep CNN Features. Eng. Lett. 2024, 32, 1353–1363.
- Jiang, S.; Cui, R.; Wei, R.; Fu, Z.; Hong, Z.; Feng, G. Tracking by segmentation with future motion estimation applied to person-following robots. Front. Neurorobot. 2023, 17, 1255085.
- Kumie, G.A.; Habtie, M.A.; Ayall, T.A.; Zhou, C.; Liu, H.; Seid, A.M.; Erbad, A. Dual-attention network for view-invariant action recognition. Complex Intell. Syst. 2024, 10, 305–321.
- Gao, L.; Ji, Y.; Gedamu, K.; Zhu, X.; Xu, X.; Shen, H.T. View-invariant human action recognition via view transformation network (VTN). IEEE Trans. Multimed. 2021, 24, 4493–4503.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
- Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in vision: A comprehensive survey of techniques and applications. arXiv 2024, arXiv:2410.03105.
- Zhang, H.; Zhu, Y.; Wang, D.; Zhang, L.; Chen, T.; Wang, Z.; Ye, Z. A survey on visual mamba. Appl. Sci. 2024, 14, 5683.
- Heidari, M.; Kolahi, S.G.; Karimijafarbigloo, S.; Azad, B.; Bozorgpour, A.; Hatami, S.; Azad, R.; Diba, A.; Bagci, U.; Merhof, D.; et al. Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis. arXiv 2024, arXiv:2406.03430.
- Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. PlainMamba: Improving non-hierarchical Mamba in visual recognition. arXiv 2024, arXiv:2403.17695.
- Amir, S.; Gandelsman, Y.; Bagon, S.; Dekel, T. Deep ViT features as dense visual descriptors. arXiv 2021, arXiv:2112.05814.
- Amir, S.; Gandelsman, Y.; Bagon, S.; Dekel, T. On the effectiveness of ViT features as local semantic descriptors. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 39–55.
- Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards High-Performance Visual Tracking for UAV With Automatic Spatio-Temporal Regularization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11920–11929.
- Li, S.; Liu, Y.; Zhao, Q.; Feng, Z. Learning residue-aware correlation filters and refining scale for real-time UAV tracking. Pattern Recognit. 2022, 127, 108614.
- Huang, Z.; Fu, C.; Li, Y.; Lin, F.; Lu, P. Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2891–2900.
- Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. HiFT: Hierarchical Feature Transformer for Aerial Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15457–15466.
- Wu, W.; Zhong, P.; Li, S. Fisher Pruning for Real-Time UAV Tracking. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–7.
- Xie, F.; Wang, C.; Wang, G.; Yang, W.; Zeng, W. Learning Tracking Representations via Dual-Branch Fully Transformer Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 2688–2697.
- Cui, Y.; Song, T.; Wu, G.; Wang, L. MixFormerV2: Efficient fully transformer tracking. Adv. Neural Inf. Process. Syst. 2024, 36, 58736–58751.
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXII; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357.
- Xie, F.; Wang, C.; Wang, G.; Cao, Y.; Yang, W.; Zeng, W. Correlation-Aware Deep Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8741–8750.
- Wu, Y.; Li, Y.; Liu, M.; Wang, X.; Yang, X.; Ye, H.; Zeng, D.; Zhao, Q.; Li, S. Learning Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking. In Proceedings of the Forty-first International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
- Wu, Y.; Wang, X.; Zeng, D.; Ye, H.; Xie, X.; Zhao, Q.; Li, S. Learning motion blur robust vision transformers with dynamic early exit for real-time UAV tracking. arXiv 2024, arXiv:2407.05383.
- Yang, X.; Zeng, D.; Wang, X.; Wu, Y.; Ye, H.; Zhao, Q.; Li, S. Adaptively bypassing vision transformer blocks for efficient visual tracking. Pattern Recognit. 2025, 161, 111278.
- Aminifar, F.; Rahmatian, F. Unmanned aerial vehicles in modern power systems: Technologies, use cases, outlooks, and challenges. IEEE Electrif. Mag. 2020, 8, 107–116.
- Chai, R.; Guo, Y.; Zuo, Z.; Chen, K.; Shin, H.S.; Tsourdos, A. Cooperative motion planning and control for aerial-ground autonomous systems: Methods and applications. Prog. Aerosp. Sci. 2024, 146, 101005.
- Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147.
- Zhu, J.; Chen, X.; Diao, H.; Li, S.; He, J.Y.; Li, C.; Luo, B.; Wang, D.; Lu, H. Exploring Dynamic Transformer for Efficient Object Tracking. arXiv 2024, arXiv:2403.17651.
- Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359.
- Yeom, S.K.; Kim, T.H. UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices. arXiv 2024, arXiv:2412.02344.
- Wu, Y.; Wang, Z.; Lu, W.D. PIM-GPT: A hybrid process-in-memory accelerator for autoregressive transformers. npj Unconv. Comput. 2024, 1, 4.
- Ruan, J.; Xiang, S. VM-UNet: Vision Mamba UNet for medical image segmentation. arXiv 2024, arXiv:2402.02491.
- Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 7–11 October 2024; Springer: Cham, Switzerland, 2024; pp. 615–625.
- Crosato, L.; Tian, K.; Shum, H.P.; Ho, E.S.; Wang, Y.; Wei, C. Social Interaction-Aware Dynamical Models and Decision-Making for Autonomous Vehicles. Adv. Intell. Syst. 2024, 6, 2300575.
- Hu, W.; Deng, Z.; Yang, Y.; Zhang, P.; Cao, K.; Chu, D.; Zhang, B.; Cao, D. Socially Game-Theoretic Lane-Change for Autonomous Heavy Vehicle based on Asymmetric Driving Aggressiveness. IEEE Trans. Veh. Technol. 2025, 74, 17005–17018.
- Deng, Z.; Hu, W.; Sun, C.; Chu, D.; Huang, T.; Li, W.; Yu, C.; Pirani, M.; Cao, D.; Khajepour, A. Eliminating uncertainty of driver’s social preferences for lane change decision-making in realistic simulation environment. IEEE Trans. Intell. Transp. Syst. 2024, 26, 1583–1597.
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417.
- Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11.
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
- Li, S.; Yeung, D.Y. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016.
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Wu, H.; Nie, Q.; Cheng, H.; Liu, C.; et al. VisDrone-VDT2018: The Vision Meets Drone Video Detection and Tracking Challenge Results. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018.
- Du, D.; Qi, Y.; Yu, H.; Yang, Y.F.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 375–391.
- Zhang, C.; Huang, G.; Liu, L.; Huang, S.; Yang, Y.; Wan, X.; Ge, S.; Tao, D. WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 9186–9205.
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577.
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014.
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596.
- Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1561–1575.
- Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Wang, N.; Zhou, W.; Tian, Q.; Hong, R.; Wang, M.; Li, H. Multi-cue correlation filters for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4844–4853.
- Ye, J.; Fu, C.; Zheng, G.; Paudel, D.P.; Chen, G. Unsupervised domain adaptation for nighttime aerial tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8896–8905.
- Zeng, D.; Zou, M.; Wang, X.; Li, S. Towards Discriminative Representations with Contrastive Instances for Real-Time UAV Tracking. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1349–1354.
- Yao, L.; Fu, C.; Li, S.; Zheng, G.; Ye, J. SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking. arXiv 2023, arXiv:2303.04378.
- Chen, T.; Ding, S.; Xie, J.; Yuan, Y.; Chen, W.; Yang, Y.; Ren, Z.; Wang, Z. ABD-Net: Attentive but diverse person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8351–8361.
- Wei, Q.; Zeng, B.; Liu, J.; He, L.; Zeng, G. LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking. arXiv 2023, arXiv:2309.09249.
- Gopal, G.Y.; Amer, M.A. Separable self and mixed attention transformers for efficient object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 6708–6717.
- Zhao, H.; Wang, D.; Lu, H. Representation Learning for Visual Object Tracking by Masked Appearance Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18696–18705.
- Zhu, J.; Tang, H.; Cheng, Z.Q.; He, J.Y.; Luo, B.; Qiu, S.; Li, S.; Lu, H. DCPT: Darkness clue-prompted tracking in nighttime UAVs. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 7381–7388.
- Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706.
- Song, Z.; Yu, J.; Chen, Y.P.P.; Yang, W. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8791–8800.
- Cai, W.; Liu, Q.; Wang, Y. HIPTrack: Visual Tracking with Historical Prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.
- Wu, Q.; Yang, T.; Liu, Z.; Wu, B.; Shan, Y.; Chan, A.B. DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14561–14571.
- Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
- Shi, L.; Zhong, B.; Liang, Q.; Li, N.; Zhang, S.; Li, X. Explicit Visual Prompts for Visual Object Tracking. In Proceedings of the Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
- Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10428–10437.
- Kou, Y.; Gao, J.; Li, B.; Wang, G.; Hu, W.; Wang, Y.; Li, L. ZoomTrack: Target-aware Non-uniform Resizing for Efficient Visual Tracking. Adv. Neural Inf. Process. Syst. 2023, 36, 50959–50977.
- Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust object modeling for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023.
- Guo, D.; Shao, Y.; Cui, Y.; Wang, Z.; Zhang, L.; Shen, C. Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9543–9552.
- Liu, Y.; Qin, P.; Fu, L. Research on the Application of Entropy Method and Efficiency Coefficient Method in Financial Risk Early Warning of Enterprises: Taking Shandong Longda Meishi Co., Ltd. as an Example. Front. Bus. Econ. Manag. 2024, 14, 254–259.
- Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. arXiv 2024, arXiv:2403.10574.
- Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
- Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 683–700.
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Li, F.-F.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971.
| Category | Tracker | Source | DTB70 Prec. | DTB70 Succ. | UAVDT Prec. | UAVDT Succ. | VisDrone2018 Prec. | VisDrone2018 Succ. | UAV123 Prec. | UAV123 Succ. | WebUAV-3M Prec. | WebUAV-3M Succ. | Avg. Prec. | Avg. Succ. | GPU FPS | CPU FPS | FLOPs (GMac) | Param (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DCF-based | KCF [65] | TPAMI 15 | 46.8 | 28.0 | 57.1 | 29.0 | 68.5 | 41.3 | 52.3 | 33.1 | 39.8 | 21.6 | 52.9 | 30.6 | - | 468.5 | - | - |
| DCF-based | fDSST [66] | TPAMI 17 | 53.4 | 35.7 | 66.6 | 38.3 | 69.8 | 51.0 | 58.3 | 40.5 | 43.5 | 28.5 | 58.3 | 38.8 | - | 132.4 | - | - |
| DCF-based | ECO_HC [67] | CVPR 17 | 63.5 | 44.8 | 69.4 | 41.6 | 80.8 | 58.1 | 71.0 | 49.6 | 61.3 | 40.2 | 69.2 | 46.9 | - | 66.9 | - | - |
| DCF-based | MCCT_H [68] | CVPR 18 | 60.4 | 40.5 | 66.8 | 40.2 | 80.3 | 56.7 | 65.9 | 45.7 | 52.1 | 34.3 | 65.1 | 43.5 | - | 55.3 | - | - |
| DCF-based | AutoTrack [27] | CVPR 20 | 71.6 | 47.8 | 71.8 | 45.0 | 78.8 | 57.3 | 68.9 | 47.2 | 56.5 | 35.6 | 69.5 | 46.6 | - | 50.7 | - | - |
| DCF-based | RACF [28] | PR 20 | 72.6 | 50.5 | 77.3 | 49.4 | 83.4 | 60.0 | 70.2 | 47.7 | 58.0 | 38.4 | 72.3 | 49.2 | - | 33.2 | - | - |
| CNN-based | HiFT [30] | ICCV 21 | 80.2 | 59.4 | 65.2 | 47.5 | 71.9 | 52.6 | 78.7 | 59.0 | 60.9 | 45.8 | 71.4 | 52.9 | 157.2 | - | 7.2 | 9.9 |
| CNN-based | UDAT [69] | CVPR 22 | 80.6 | 61.8 | 80.1 | 59.2 | 81.6 | 61.9 | 76.1 | 59.0 | 64.8 | 48.7 | 76.6 | 58.1 | 31.2 | - | - | - |
| CNN-based | TCTrack [5] | CVPR 22 | 81.2 | 62.2 | 72.5 | 53.0 | 79.9 | 59.4 | 80.0 | 60.5 | 61.9 | 45.7 | 75.1 | 56.2 | 132.7 | - | 8.8 | 9.7 |
| CNN-based | DRCI [70] | ICME 23 | 81.4 | 61.8 | 84.0 | 59.0 | 83.4 | 60.0 | 76.7 | 59.7 | 62.0 | 47.6 | 77.5 | 57.6 | 268.5 | 61.0 | 3.6 | 8.8 |
| CNN-based | SGDViT [71] | ICRA 23 | 78.5 | 60.4 | 65.7 | 48.0 | 72.1 | 52.1 | 75.4 | 57.5 | 61.3 | 45.7 | 70.6 | 52.7 | 107.3 | - | 11.3 | 23.3 |
| CNN-based | ABDNet [72] | RAL 23 | 76.8 | 59.6 | 75.5 | 55.3 | 75.0 | 57.2 | 79.3 | 60.7 | 63.9 | 48.7 | 74.1 | 56.3 | 119.2 | - | - | - |
| ViT-based | Aba-ViTrack [7] | ICCV 23 | 85.9 | 66.4 | 82.9 | 59.1 | 86.1 | 65.3 | 86.4 | 66.4 | 67.4 | 53.7 | 81.7 | 62.2 | 172.1 | 46.3 | 2.4 | 8.0 |
| ViT-based | AVTrack [36] | ICML 24 | 84.3 | 65.0 | 82.1 | 58.7 | 86.0 | 65.3 | 84.8 | 66.8 | 69.0 | 54.0 | 81.2 | 61.9 | 252.8 | 53.8 | 0.97–1.9 | 3.5–7.9 |
| ViT-based | LiteTrack [73] | ICRA 24 | 82.5 | 63.9 | 81.6 | 59.3 | 79.7 | 61.4 | 84.2 | 65.9 | 69.4 | 54.1 | 79.5 | 60.9 | 132.6 | - | 14.1 | 54.92 |
| ViT-based | SMAT [74] | WACV 24 | 81.9 | 63.8 | 80.8 | 58.7 | 82.5 | 63.4 | 81.8 | 64.6 | 68.9 | 53.9 | 79.2 | 60.9 | 118.5 | - | 3.2 | 8.6 |
| ViT-based | Aba-ViTrack-SA | | 86.1 | 65.2 | 83.1 | 59.4 | 87.7 | 65.7 | 86.0 | 67.9 | 70.7 | 55.5 | 82.7 | 62.8 | 188.4 | 47.9 | 2.4 | 8.0 |
| ViT-based | AVTrack-SA | | 85.5 | 64.8 | 82.8 | 58.3 | 86.6 | 65.4 | 85.8 | 67.2 | 71.2 | 55.3 | 82.4 | 62.2 | 268.0 | 55.3 | 0.97–1.9 | 3.5–7.9 |
| ViT-based | SAMViTrack | Ours | 83.3 | 63.8 | 82.1 | 58.1 | 84.1 | 63.6 | 81.9 | 64.1 | 70.2 | 54.6 | 80.3 | 60.8 | 314.1 | 61.6 | 1.1 | 4.0 |
| Tracker | Prec. | Succ. | FPS | Tracker | Prec. | Succ. | FPS | Tracker | Prec. | Succ. | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SAMViTrack (Ours) | 82.1 | 58.1 | 346.2 | MixFormerV2 [33] | 57.8 | 42.1 | 248.0 | MAT [75] | 72.9 | 54.6 | 71.2 |
| DCPT [76] | 76.8 | 56.8 | 41.2 | ARTrack [77] | 74.7 | 52.8 | 84.9 | CSWinTT [78] | 67.3 | 51.2 | 32.6 |
| HIPTrack [79] | 79.6 | 60.9 | 33.0 | DropTrack [80] | 78.8 | 57.0 | 142.8 | SimTrack [81] | 76.5 | 57.2 | 76.0 |
| EVPTrack [82] | 80.1 | 60.3 | 68.3 | SeqTrack [83] | 79.0 | 60.0 | 12.1 | STARK [84] | 72.0 | 53.6 | 53.3 |
| ZoomTrack [85] | 77.1 | 58.0 | 62.3 | ROMTrack [86] | 81.9 | 61.6 | 55.8 | SiamGAT [87] | 76.4 | 58.9 | 86.5 |
| Mamba Layers | ViT Layers | SA | Prec. | Succ. | FPS | Ib |
|---|---|---|---|---|---|---|
| 6 | 0 | | 75.28 | 54.66 | 366.72 | 0.49 |
| 6 | 0 | ✓ | | | | 0.87 |
| 5 | 1 | | 72.9 | 52.7 | 357.9 | 0.30 |
| 5 | 1 | ✓ | | | | 0.80 |
| 4 | 2 | | 78.8 | 56.1 | 341.0 | 0.64 |
| 4 | 2 | ✓ | | | | 0.79 |
| 3 | 3 | | 79.3 | 56.8 | 337.3 | 0.66 |
| 3 | 3 | ✓ | | | | 0.89 |
| 2 | 4 | | 78.2 | 56.3 | 319.9 | 0.53 |
| 2 | 4 | ✓ | | | | 0.63 |
| 1 | 5 | | 76.8 | 56.1 | 315.1 | 0.41 |
| 1 | 5 | ✓ | | | | 0.69 |
| 0 | 6 | | 75.68 | 54.57 | 302.7 | 0.30 |
| 0 | 6 | ✓ | | | | 0.67 |
| Mamba Layers | ViT Layers | SA | Prec. | Succ. | FPS | Ib |
|---|---|---|---|---|---|---|
| 1, 3, 5 | 2, 4, 6 | | 65.4 | 46.5 | 285.1 | 0.14 |
| 1, 3, 5 | 2, 4, 6 | ✓ | 71.9 | 48.7 | 292.9 | 0.41 |
| 2, 4, 6 | 1, 3, 5 | | 67.0 | 48.4 | 273.9 | 0.16 |
| 2, 4, 6 | 1, 3, 5 | ✓ | 74.2 | 51.4 | 293.0 | 0.50 |
| 4, 5, 6 | 1, 2, 3 | | 79.6 | 55.0 | 335.9 | 0.87 |
| 4, 5, 6 | 1, 2, 3 | ✓ | 80.7 | 57.4 | 344.7 | 0.94 |
| 1, 2, 3 | 4, 5, 6 | | 79.3 | 56.8 | 337.3 | 0.86 |
| 1, 2, 3 | 4, 5, 6 | ✓ | 82.1 | 58.1 | 346.2 | 1.00 |
| Tracker | SA | GOT-10k AO | GOT-10k SR0.75 | OTB Succ. | OTB Prec. | Avg. FPS |
|---|---|---|---|---|---|---|
| AQATrack [89] | | 76.0 | 74.9 | 68.8 | 89.8 | 22.6 |
| AQATrack [89] | ✓ | 76.7 | 75.7 | 69.9 | 91.7 | 24.3 |
| OSTrack [34] | | 73.7 | 70.8 | 68.2 | 88.7 | 33.5 |
| OSTrack [34] | ✓ | 74.7 | 72.6 | 69.4 | 90.5 | 35.2 |
| HIPTrack [79] | | 77.4 | 74.5 | 70.9 | 93.1 | 28.7 |
| HIPTrack [79] | ✓ | 78.4 | 75.8 | 71.7 | 93.9 | 30.0 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Guo, X.; Li, Y.; Zhang, H.; Wang, X.; Zeng, D.; He, F.; Li, S. SAMViTrack: A Search-Region Adaptive Mamba-ViT Tracker for Real-Time UAV Tracking. Sensors 2025, 25, 7454. https://doi.org/10.3390/s25247454