A Benchmark for UAV-View Natural Language-Guided Tracking
Abstract
1. Introduction
- We propose the UAVNLT data set, which provides bounding-box and natural language annotations for 2000 video sequences and is intended to support the development and testing of natural language-based UAV tracking methods.
- We propose a benchmark method that uses a global–local switcher module to alternate flexibly between the visual grounding module and the object tracker. The method serves as a reference point for future research.
- We evaluate state-of-the-art trackers on the proposed UAVNLT data set to provide a comprehensive benchmark for future research.
2. Related Work
2.1. Tracking with Bounding Box
2.2. Tracking with Natural Language Description
2.3. UAV Video Data Sets
3. Proposed Benchmark
3.1. Data Collection and Annotation
3.2. Comparison with Existing Data Sets
3.3. Data Set Analysis
- Object aspect ratio: Figure 3a shows the distribution of object aspect ratios. Most objects have an aspect ratio between 0.5 and 1.5, indicating that smaller vehicles, such as cars and SUVs, are common in the UAVNLT data set. A smaller portion of objects have an aspect ratio of around 2, reflecting the presence of larger vehicles, such as buses and trucks.
- Object scale: Figure 3b shows the distribution of object scale, defined as the proportion of the video frame occupied by the object. Most objects are small, with scales concentrated below 0.05. This is mainly because our videos were shot from heights greater than 70 m, so the data set contains many small targets, which poses new challenges for UAV tracking.
- Object position: The heatmap in Figure 3c shows the spatial distribution of target bounding boxes within video frames. Bright areas mark locations where targets appear frequently. Targets move primarily up, down, left, or right across the frame, which indicates that the data set was mainly captured at intersections and other traffic hubs and reflects the typical movement patterns in these regions.
- Video length: Figure 3d shows the distribution of tracking sequence lengths. Most sequences contain between 400 and 800 frames, while a few are shorter than 200 or longer than 1000 frames. The data set therefore covers short to medium sequence lengths, which is useful for evaluating tracking algorithms over different temporal spans.
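The per-box statistics described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the `(x, y, w, h)` box layout, the width-over-height aspect ratio convention, the 3840 × 2160 frame size, and the helper names `aspect_ratio`, `relative_scale`, and `center_heatmap` are all assumptions introduced here.

```python
from typing import List, Tuple

# Hypothetical box layout (not specified in the text): (x, y, w, h) in pixels,
# where (x, y) is the top-left corner of the box.
Box = Tuple[float, float, float, float]


def aspect_ratio(box: Box) -> float:
    """Width divided by height (assumed convention)."""
    _, _, w, h = box
    return w / h


def relative_scale(box: Box, frame_w: int, frame_h: int) -> float:
    """Fraction of the frame area occupied by the box."""
    _, _, w, h = box
    return (w * h) / (frame_w * frame_h)


def center_heatmap(boxes: List[Box], frame_w: int, frame_h: int,
                   bins: int = 4) -> List[List[int]]:
    """Count box centers in each cell of a bins x bins grid over the frame."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y, w, h in boxes:
        cx, cy = x + w / 2, y + h / 2
        # Clamp so that centers on the right/bottom edge fall in the last cell.
        col = min(int(cx / frame_w * bins), bins - 1)
        row = min(int(cy / frame_h * bins), bins - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    frame_w, frame_h = 3840, 2160  # assumed frame size for 2160p video
    boxes = [(1900.0, 1000.0, 60.0, 40.0), (200.0, 300.0, 120.0, 60.0)]
    for box in boxes:
        print(aspect_ratio(box), relative_scale(box, frame_w, frame_h))
    print(center_heatmap(boxes, frame_w, frame_h))
```

Histograms of the first two quantities over all annotated frames would correspond to panels such as Figure 3a and 3b, and the grid counts to a coarse version of the heatmap in Figure 3c.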
3.4. Evaluation Protocols
4. Method
4.1. Visual Grounding Module
4.2. Object Tracker
4.3. Global–Local Switcher Module
4.4. Implementation Details
5. Experiments
5.1. Performance on UAVNLT
5.2. Ablation Study
6. Conclusions and Future Works
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Shao, Y.; Yang, Z.; Li, Z.; Li, J. Aero-YOLO: An Efficient Vehicle and Pedestrian Detection Algorithm Based on Unmanned Aerial Imagery. Electronics 2024, 13, 1190.
2. Hu, Q.; Li, L.; Duan, J.; Gao, M.; Liu, G.; Wang, Z.; Huang, D. Object Detection Algorithm of UAV Aerial Photography Image Based on Anchor-Free Algorithms. Electronics 2023, 12, 1339.
3. Yamani, A.; Alyami, A.; Luqman, H.; Ghanem, B.; Giancola, S. Active Learning for Single-Stage Object Detection in UAV Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 1860–1869.
4. Rizzoli, G.; Barbato, F.; Caligiuri, M.; Zanuttigh, P. SynDrone-Multi-Modal UAV Dataset for Urban Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 2210–2220.
5. Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-Art and Future Research Challenges in UAV Swarms. IEEE Internet Things J. 2024.
6. Ren, H.; Zhao, Y.; Xiao, W.; Hu, Z. A review of UAV monitoring in mining areas: Current status and future perspectives. Int. J. Coal Sci. Technol. 2019, 6, 320–333.
7. Moore, J.; Tadinada, H.; Kirsche, K.; Perry, J.; Remen, F.; Tse, Z.T.H. Facility inspection using UAVs: A case study in the University of Georgia campus. Int. J. Remote Sens. 2018, 39, 7189–7200.
8. Li, X.; Yang, L. Design and Implementation of UAV Intelligent Aerial Photography System. In Proceedings of the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics, Nanchang, China, 26–27 August 2012; Volume 2, pp. 200–203.
9. Zhao, H.; Zhang, H.; Zhao, Y. Yolov7-sea: Object detection of maritime uav images based on improved yolov7. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 233–238.
10. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
11. Paul, H.; Martinez, R.R.; Ladig, R.; Shimonomura, K. Lightweight multipurpose three-arm aerial manipulator systems for uav adaptive leveling after landing and overhead docking. Drones 2022, 6, 380.
12. Lieret, M.; Lukas, J.; Nikol, M.; Franke, J. A lightweight, low-cost and self-diagnosing mechatronic jaw gripper for the aerial picking with unmanned aerial vehicles. Procedia Manuf. 2020, 51, 424–430.
13. Nguyen, V.S.; Jung, J.; Jung, S.; Joe, S.; Kim, B. Deployable hook retrieval system for UAV rescue and delivery. IEEE Access 2021, 9, 74632–74645.
14. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
15. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for uav tracking. In Proceedings of the European Conference on Computer Vision (ECCV) 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 445–461.
16. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.
17. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 2411–2418.
18. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
19. Wang, X.; Shu, X.; Zhang, Z.; Jiang, B.; Wang, Y.; Tian, Y.; Wu, F. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13763–13773.
20. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV) 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 850–865.
21. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 4282–4291.
22. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8971–8980.
23. Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12549–12556.
24. Zheng, Y.; Zhong, B.; Liang, Q.; Tang, Z.; Ji, R.; Li, X. Leveraging Local and Global Cues for Visual Tracking via Parallel Interaction Network. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1671–1683.
25. Ma, J.; Lan, X.; Zhong, B.; Li, G.; Tang, Z.; Li, X.; Ji, R. Robust Tracking via Uncertainty-Aware Semantic Consistency. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 1740–1751.
26. Ge, D.; Liu, R.; Li, Y.; Miao, Q. Reliable Memory Model for Visual Tracking. Electronics 2021, 10, 2488.
27. Zhao, M.; Okada, K.; Inaba, M. Trtr: Visual tracking with transformer. arXiv 2021, arXiv:2105.03817.
28. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer tracking. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021; pp. 8126–8135.
29. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457.
30. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. Mixformer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 13608–13618.
31. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357.
32. Li, Z.; Tao, R.; Gavves, E.; Snoek, C.G.; Smeulders, A.W. Tracking by natural language specification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6495–6503.
33. Feng, Q.; Ablavsky, V.; Bai, Q.; Li, G.; Sclaroff, S. Real-time visual object tracking with natural language description. In Proceedings of the Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, 1–5 March 2020; pp. 700–709.
34. Yang, Z.; Kumar, T.; Chen, T.; Su, J.; Luo, J. Grounding-Tracking-Integration. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3433–3443.
35. Li, Y.; Yu, J.; Cai, Z.; Pan, Y. Cross-modal Target Retrieval for Tracking by Natural Language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2022, New Orleans, LA, USA, 19–20 June 2022; pp. 4931–4940.
36. Feng, Q.; Ablavsky, V.; Bai, Q.; Sclaroff, S. Siamese natural language tracker: Tracking by natural language descriptions with siamese trackers. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021; pp. 5851–5860.
37. Wang, X.; Li, C.; Yang, R.; Zhang, T.; Tang, J.; Luo, B. Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking. arXiv 2018, arXiv:1811.10014.
38. Guo, M.; Zhang, Z.; Fan, H.; Jing, L. Divert more attention to vision-language tracking. NeurIPS 2022, 35, 4446–4460.
39. Zhang, H.; Wang, J.; Zhang, J.; Zhang, T.; Zhong, B. One-stream Vision-Language Memory Network for Object Tracking. IEEE Trans. Multimed. 2023, 26, 1720–1730.
40. Zheng, Y.; Zhong, B.; Liang, Q.; Li, G.; Ji, R.; Li, X. Towards Unified Token Learning for Vision-Language Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2125–2135.
41. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. VisDrone-DET2018: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
42. Li, S.; Yeung, D.Y. Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 445–461.
43. Fu, C.; Cao, Z.; Li, Y.; Ye, J.; Feng, C. Onboard Real-Time Aerial Tracking with Efficient Siamese Anchor Proposal Network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5606913.
44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
45. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
47. Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A language modeling framework for object detection. arXiv 2022, arXiv:2109.10852.
48. Girshick, R. Fast r-cnn. In Proceedings of the International Conference on Computer Vision, ICCV 2015, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
49. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; pp. 8748–8763.
50. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 4660–4669.
51. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea, 27 October–2 November 2019; pp. 6182–6191.
52. Mayer, C.; Danelljan, M.; Paudel, D.P.; Van Gool, L. Learning target candidate association to keep track of what not to track. In Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 13444–13454.
53. Zheng, Y.; Zhong, B.; Liang, Q.; Mo, Z.; Zhang, S.; Li, X. ODTrack: Online Dense Temporal Token Learning for Visual Tracking. arXiv 2024, arXiv:2401.01686.
54. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 9697–9706.







| Data Sets | Resolution | # Videos | Min. Frames | Max. Frames | Total Frames | NL |
|---|---|---|---|---|---|---|
| UAV123 | 720p | 123 | 109 | 3085 | 113 K | ✗ | 
| UAV20L | 720p | 20 | 1717 | 5527 | 59 K | ✗ | 
| UAV123@10FPS | 720p | 123 | 36 | 1362 | 38 K | ✗ | 
| UAVDT | 540p | 100 | 82 | 2969 | 80 K | ✗ | 
| VisDrone2019-SOT | 756p | 132 | 329 | 2789 | 109 K | ✗ | 
| DTB | 720p | 70 | 68 | 699 | 15 K | ✗ | 
| UAVTrack112 | 400p | 112 | - | 1000 | - | ✗ | 
| OTB-99 | 432p | 99 | 71 | 3872 | 59 K | ✓ | 
| LaSOT | 360p | 1400 | 1000 | 11,397 | 3.52 M | ✓ | 
| TNL2K | 720p | 2000 | 21 | 18,488 | 1.24 M | ✓ | 
| UAVNLT | 2160p | 2000 | 117 | 1461 | 900 K | ✓ | 

| Methods | Source | Initialization | UAVNLT AUC | UAVNLT PRE | LaSOT AUC | LaSOT PRE | TNL2K AUC | TNL2K PRE |
|---|---|---|---|---|---|---|---|---|
| ATOM [50] | CVPR2019 | BB | 0.429 | 0.587 | 0.510 | 0.510 | 0.390 | 0.400 |
| DIMP [51] | CVPR2019 | BB | 0.462 | 0.611 | 0.569 | - | - | - |
| KeepTrack [52] | ICCV2021 | BB | 0.483 | 0.643 | 0.671 | 0.702 | - | - |
| STARK [29] | ICCV2021 | BB | 0.623 | 0.811 | 0.671 | 0.712 | - | - |
| MixFormer [30] | CVPR2022 | BB | 0.666 | 0.894 | 0.692 | 0.747 | 0.552 | 0.558 |
| ODtrack [53] | AAAI2024 | BB | 0.657 | 0.919 | 0.731 | 0.757 | 0.609 | 0.723 |
| ARtrack [54] | CVPR2023 | BB | 0.689 | 0.731 | 0.803 | 0.747 | 0.603 | 0.766 |
| TNL2K-1 [19] | CVPR2021 | NL | - | - | 0.510 | 0.490 | 0.110 | 0.060 |
| CTRNLT [35] | CVPR2022 | NL | - | - | 0.520 | 0.510 | 0.140 | 0.090 |
| STARK * [29] | ICCV2021 | NL | 0.062 | 0.071 | - | - | - | - |
| MixFormer * [30] | CVPR2022 | NL | 0.077 | 0.094 | - | - | - | - |
| ARtrack * [54] | CVPR2023 | NL | 0.097 | 0.102 | - | - | - | - |
| ODtrack * [53] | AAAI2024 | NL | 0.103 | 0.113 | - | - | - | - |
| Ours | - | NL | 0.105 | 0.105 | 0.522 | 0.513 | 0.461 | 0.422 |
| SNLT [36] | CVPR2021 | NL + BB | 0.234 | 0.372 | 0.540 | 0.576 | 0.276 | 0.419 |
| VLTTT [38] | NeurIPS2022 | NL + BB | 0.452 | 0.418 | 0.673 | 0.721 | 0.531 | 0.533 |
| Ours | - | NL + BB | 0.467 | 0.421 | 0.613 | 0.688 | 0.547 | 0.529 |
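This excerpt does not define the AUC and PRE columns of the results table. The sketch below assumes they follow the widely used one-pass evaluation convention: AUC is the area under the success plot (fraction of frames whose IoU exceeds each of 21 evenly spaced thresholds), and PRE is the fraction of frames whose center location error is at most 20 pixels. The function names, the `(x, y, w, h)` box layout, and both threshold choices are assumptions for illustration and may differ from the paper's protocol.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, w, h), top-left origin


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between box centers, in pixels."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def success_auc(preds: List[Box], gts: List[Box], steps: int = 21) -> float:
    """Mean success rate over `steps` IoU thresholds evenly spaced in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


def precision(preds: List[Box], gts: List[Box], pixel_threshold: float = 20.0) -> float:
    """Fraction of frames whose center error is within `pixel_threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= pixel_threshold for e in errors) / len(errors)


if __name__ == "__main__":
    gts = [(100.0, 100.0, 50.0, 30.0), (110.0, 100.0, 50.0, 30.0)]
    preds = [(102.0, 101.0, 50.0, 30.0), (160.0, 140.0, 50.0, 30.0)]
    print(f"AUC={success_auc(preds, gts):.3f}  PRE={precision(preds, gts):.3f}")
```

With small aerial targets, a fixed pixel threshold is lenient relative to object size, so the two metrics can diverge noticeably. This may be one reason precision and AUC rank methods differently in some rows.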

| Method | M-Adapter | LAM | GAM | AUC | Precision |
|---|---|---|---|---|---|
| ① | | | | 0.443 | 0.396 |
| ② | ✓ | | | 0.452 | 0.412 |
| ③ | ✓ | ✓ | | 0.456 | 0.420 |
| ④ | ✓ | ✓ | ✓ | 0.461 | 0.422 |

| Data Sets | Num Frames | Num Switches | Cost Rate |
|---|---|---|---|
| TNL2K | 512,783 | 61,576 | 12.0% | 
| LaSOT | 684,688 | 69,933 | 10.2% | 
| UAVNLT | 295,607 | 38,731 | 13.1% | 
| Total | 1,493,078 | 170,240 | 11.4% | 
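The cost rates in the table above are consistent with the ratio of switch count to frame count. A short check, with the counts copied from the table and the helper name `cost_rate` introduced here only for illustration:

```python
# Frame and switch counts copied from the table above: (num_frames, num_switches).
switch_stats = {
    "TNL2K": (512_783, 61_576),
    "LaSOT": (684_688, 69_933),
    "UAVNLT": (295_607, 38_731),
}


def cost_rate(num_frames: int, num_switches: int) -> float:
    """Fraction of frames on which a switch occurs (assumed definition)."""
    return num_switches / num_frames


if __name__ == "__main__":
    total_frames = sum(frames for frames, _ in switch_stats.values())
    total_switches = sum(switches for _, switches in switch_stats.values())
    for name, (frames, switches) in switch_stats.items():
        print(f"{name}: {100 * cost_rate(frames, switches):.1f}%")
    print(f"Total: {100 * cost_rate(total_frames, total_switches):.1f}%")
```

The total is the ratio of the summed counts rather than the mean of the three per-data-set rates, which is why it is weighted toward the larger data sets.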
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, H.; Liu, X.; Li, G. A Benchmark for UAV-View Natural Language-Guided Tracking. Electronics 2024, 13, 1706. https://doi.org/10.3390/electronics13091706