ATA: A Benchmark for Vision–Language Tracking in Air-to-Air Counter-UAV of Tiny Drones

Kang, Wenchao; Zhang, Xuekai; Peng, Yueping; Tang, Wei; Li, Qilong; Hao, Hexiang; Liu, Kang; Chen, Qinghe

doi:10.3390/drones10060429

by Wenchao Kang, Xuekai Zhang, Yueping Peng *, Wei Tang, Qilong Li, Hexiang Hao, Kang Liu and Qinghe Chen

College of Information Engineering, Engineering University of the Chinese People’s Armed Police Force, Weiyang District, Xi’an 710086, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(6), 429; https://doi.org/10.3390/drones10060429

Submission received: 20 April 2026 / Revised: 24 May 2026 / Accepted: 31 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Detection, Identification and Tracking of UAVs and Drones: 2nd Edition)

Highlights

What are the main findings?

We present ATA, the first benchmark for vision–language tracking in air-to-air counter-UAV scenarios of tiny drones, containing 50 real-flight video sequences with 38,094 annotated frames and supporting both BBox-only and Language-assisted settings.
Benchmark results on representative vision-only and vision–language trackers show that current methods remain limited under this setting, and that incorporating adjacent-frame temporal information can improve tracking performance across multiple baselines.

What are the implications of the main findings?

ATA provides a dedicated benchmark for evaluating tracking methods under key challenges in air-to-air counter-UAV scenarios, including dual-dynamic motion, tiny-target perception, complex background variation, and interference from similar drones.
The results suggest that temporal modeling and more effective use of language cues are important directions for improving robust tracking in practical air-to-air counter-UAV applications.

Abstract

In air-to-air counter-UAV scenarios, vision–language tracking for tiny drones still lacks a dedicated benchmark. Unlike traditional UAV tracking or ground-based Anti-UAV settings, air-to-air counter-UAV tracking involves simultaneous motion of both the tracking platform and the target platform. In addition, the target typically appears as a tiny object and is subject to rapid viewpoint changes, fast background transitions, and interference from similar drones, making it difficult to systematically assess the capability boundaries of existing methods. To address this gap, we present the ATA dataset. To the best of our knowledge, ATA is the first vision–language tracking dataset specifically designed for real air-to-air tiny-object UAV countermeasure scenarios. ATA contains 50 real-flight video sequences with 38,094 frames in total, and provides frame-wise bounding box annotations together with video-level English language descriptions. It supports two unified task settings, namely BBox-only and Language-assisted tracking. The dataset covers diverse real-world low-altitude scenarios with complex backgrounds. Notably, the average target area accounts for only 0.03% of the full image, exhibiting pronounced tiny-object characteristics. ATA also captures several key challenges in this setting, including dual-dynamic disturbances, complex background changes, and multi-drone interference. Based on ATA, we establish a benchmark covering both vision-only and vision–language tracking methods, and conduct a systematic evaluation of eight representative recent trackers. Experimental results show that current mainstream methods still perform unsatisfactorily in this scenario, with evident limitations in tiny-object representation, cross-frame association, robustness to complex backgrounds, and interference suppression. Furthermore, we validate a lightweight temporal enhancement module, AFTE, and show that explicitly leveraging adjacent-frame information consistently improves the performance of multiple baseline models. Overall, ATA provides a unified benchmark for vision–language tracking in air-to-air counter-UAV scenarios of tiny drones and highlights temporal modeling as a promising direction for improving tracking performance in this challenging setting.

Keywords:

air-to-air counter-UAV; tiny drone tracking; vision–language tracking; benchmark; temporal modeling

1. Introduction

In recent years, unmanned aerial vehicles (UAVs), benefiting from their high maneuverability, ease of deployment, and relatively low operational cost, have been extensively employed in a wide range of applications, including environmental monitoring, agricultural inspection, and emergency rescue. Meanwhile, the rapid proliferation of UAVs has also given rise to increasingly prominent low-altitude security risks. Unauthorized or maliciously manipulated UAVs may intrude into airports, sensitive zones, and critical infrastructure, thereby posing tangible threats to flight safety, public security, and privacy protection. Consequently, UAV countermeasure technologies for low-altitude security have gradually emerged as an important research topic in the fields of computer vision and autonomous systems [1].

In recent years, anti-UAV research has gradually evolved from single-sensor perception toward a comprehensive direction involving multi-source sensing, intelligent recognition, and system-level coordination. Current UAV detection and countermeasure systems usually involve multiple technical routes, including radar, radio-frequency, acoustic, electro-optical/vision-based sensing, and multi-sensor fusion. Different sensors have their own advantages and limitations in terms of detection range, environmental adaptability, target recognition capability, and anti-interference ability. For example, radar and radio-frequency methods are suitable for relatively long-range detection, but they are easily affected by environmental clutter, electromagnetic interference, or changes in target communication modes. Acoustic methods have relatively low cost, but their stability is limited in urban noise environments. Electro-optical and vision-based methods can provide more intuitive information about target appearance, spatial position, and motion state, but they are also more susceptible to illumination, weather conditions, background complexity, and target-scale variations. With the development of artificial intelligence and deep learning technologies, vision-based UAV detection and tracking methods have attracted increasing attention in target recognition, persistent localization, and complex-scene adaptation, and they have gradually become an important component of the anti-UAV perception pipeline. In this context, visual tracking is not only a specific task within vision-based perception, but also a key component for maintaining target identity, estimating motion state, and supporting subsequent countermeasure decision-making after the target has been detected [2].

Within the UAV countermeasure pipeline, target tracking bridges target discovery, state estimation, and subsequent response decision-making. As illustrated in Figure 1, compared with traditional radar sensing, radio-frequency detection, and ground-based electro-optical devices, vision-based tracking deployed on an airborne platform offers more flexible observation viewpoints and sustained proximity to the target, which makes it particularly suitable for tasks such as dynamic pursuit, companion-flight perception, and autonomous intervention. Owing to these advantages, vision-based air-to-air tracking has distinct potential in UAV countermeasure scenarios [3].

However, air-to-air tiny-object UAV tracking differs fundamentally from conventional generic object tracking, and also from existing UAV tracking and ground-based anti-UAV tracking scenarios. As shown in Figure 2, the task is characterized by two tightly coupled properties: dual-dynamic motion in the air-to-air setting and long-range tiny-object imaging. The former entangles the target state with the platform’s attitude, velocity, and viewpoint variations, whereas the latter results in weak appearance cues and rapid cross-frame changes. Therefore, air-to-air tiny-object UAV tracking should not be regarded as a straightforward extension of existing UAV tracking or anti-UAV tasks; it is a substantially more challenging and distinct problem [4].

Although substantial progress has recently been achieved in object tracking, UAV scene perception, and vision–language tracking, existing studies still remain insufficient to adequately support this task. Most current UAV tracking datasets primarily focus on the setting of airborne platforms tracking ground targets [5], whereas existing anti-UAV datasets are predominantly established under the configuration of ground-based devices observing aerial UAVs. These two research paradigms correspond to air-to-ground and ground-to-air scenarios, respectively, and neither is able to faithfully capture the core difficulties encountered in realistic air-to-air countermeasure processes [6]. Meanwhile, although existing vision–language tracking datasets have introduced natural language descriptions, the majority of them are still designed for generic object tracking scenarios, and their semantic annotations generally lack fine-grained descriptions specifically tailored to UAV targets, such as category-level distinctions, structural attributes, relative positional relations, and background-aware contextual cues, thereby making them inadequate for effectively supporting air-to-air tiny-object UAV tracking [7]. In complex aerial environments, merely relying on first-frame bounding-box initialization is often insufficient to fully characterize the target identity and its scene relations, whereas language descriptions are able to provide complementary high-level semantic constraints from multiple aspects, including target category, appearance characteristics, structural properties, and contextual associations [8]. Furthermore, most prevailing trackers are developed primarily for generic scenarios, and their performance evaluation is largely based on existing public benchmarks. 
However, strong performance on such benchmarks does not necessarily imply direct adaptability to air-to-air tiny-object scenarios, since the latter are characterized by weak target representations and highly complex temporal dynamics, while the explicit exploitation of temporal information remains insufficient in most current tracking architectures. As a consequence, clear deficiencies still exist at the levels of data, task setting, and evaluation protocol, and a unified as well as dedicated benchmark is still lacking for systematically revealing the core challenges, capability boundaries, and potential directions for improvement in air-to-air tiny-object UAV tracking [9,10].

To address the aforementioned limitations, this paper presents ATA, a vision–language tracking benchmark specifically designed for air-to-air tiny-object UAV countermeasure scenarios. Given a continuous video sequence captured by a pursuer UAV, the objective is to continuously localize the tracked UAV in subsequent frames under complex and dynamically changing backgrounds [10]. In order to simultaneously accommodate the conventional visual tracking paradigm and the need for collaborative vision–language modeling, ATA contains two task settings. The first is BBox-only, in which the model is provided with the initial bounding box of the target UAV in the first frame and is required to accomplish persistent tracking solely based on this initialization and the subsequent video frames. The second is Language-assisted, in which, in addition to the first-frame bounding box, a natural-language prompt describing the target UAV is further provided to facilitate subsequent tracking [11]. The former mainly evaluates the pure visual modeling capability of a tracker in air-to-air tiny-object scenarios, whereas the latter further investigates whether linguistic information can effectively complement visual cues, thereby alleviating the problems caused by insufficient tiny-object representation, severe interference from similar targets, and pronounced ambiguity in first-frame target specification [12].
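
The difference between the two settings comes down to what the tracker receives on the first frame. The following is a minimal sketch of that input, not the authors' data loader: the names `TrackerInit` and `make_init`, the setting strings, and the `(x, y, width, height)` box format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x, y, width, height) in pixels; this box format is an assumption.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TrackerInit:
    """Hypothetical container for what a tracker receives on the first frame."""
    first_frame_bbox: BBox
    language_prompt: Optional[str] = None  # stays None in the BBox-only setting


def make_init(bbox: BBox, description: Optional[str], setting: str) -> TrackerInit:
    """Build the first-frame initialization for one of the two settings."""
    if setting == "bbox_only":
        # Visual initialization only; any available description is withheld.
        return TrackerInit(first_frame_bbox=bbox)
    if setting == "language_assisted":
        if not description:
            raise ValueError("the Language-assisted setting needs a description")
        # Same box as above, plus the natural-language prompt.
        return TrackerInit(first_frame_bbox=bbox, language_prompt=description)
    raise ValueError(f"unknown setting: {setting!r}")
```

Because both settings share the same first-frame box and differ only in the optional prompt, a tracker evaluated under both receives identical visual initialization, which is what allows the contribution of language to be isolated.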

In summary, the main contributions of this paper are as follows:

We propose ATA. To the best of our knowledge, ATA is the first vision–language tracking benchmark specifically designed for real air-to-air tiny-object UAV countermeasure scenarios. ATA is not intended as a general replacement for existing UAV tracking or anti-UAV tracking datasets. Instead, it serves as a further task-specialized extension and complement to existing benchmarks under the specific combination of real air-to-air observation, tiny UAV targets, language-assisted target specification, and persistent tracking.
We define two task settings for ATA, namely BBox-only and Language-assisted. The former is designed to assess the modeling capability of purely visual trackers in air-to-air tiny-object scenarios, while the latter is intended to investigate the auxiliary role of language information in target specification and persistent tracking, thereby establishing a unified benchmark for training and evaluation in this research direction.
Based on ATA, we establish a benchmark covering both vision-only and vision–language methods and conduct systematic evaluations on multiple recent mainstream trackers. The experimental results reveal significant performance bottlenecks of current methods in this scenario, thereby highlighting the substantial gap between existing tracking techniques and the practical demands of air-to-air UAV countermeasures.
Considering that ATA is constructed from continuous video sequences and exhibits pronounced temporal correlation, we further introduce the AFTE temporal enhancement module for validation. By extending the original single-frame input of a tracker into paired adjacent-frame input, AFTE explicitly models the temporal relations between neighboring frames, and its effectiveness is verified on multiple baseline models. These results further demonstrate the pressing need for stronger temporal modeling capability in air-to-air tiny-object tracking scenarios.
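
The last contribution describes extending a tracker's single-frame input into a paired adjacent-frame input. The sketch below illustrates only that pairing step. It is not the AFTE module, whose internals are not described in this section, and the helper name `adjacent_frame_pairs` and the handling of the first frame (paired with itself) are assumptions.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def adjacent_frame_pairs(frames: Sequence[Frame]) -> List[Tuple[Frame, Frame]]:
    """Return one (previous, current) pair for every frame in a sequence.

    Assumption: the first frame has no predecessor, so it is paired with itself.
    """
    pairs: List[Tuple[Frame, Frame]] = []
    for t, current in enumerate(frames):
        previous = frames[t - 1] if t > 0 else current
        pairs.append((previous, current))
    return pairs
```

A tracker that normally consumes `frames[t]` would instead consume `adjacent_frame_pairs(frames)[t]`, so the number of tracking steps is unchanged while each step also sees the preceding frame.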

2. Related Work

2.1. Recent Progress in Anti-UAV Perception and Counter-UAV Systems

Against the background of multi-source perception and intelligent countermeasures, vision-based UAV detection and tracking has gradually become an important direction in anti-UAV research. Wang et al. systematically reviewed recent vision-based anti-UAV detection and tracking methods and summarized the main technical routes, including Siamese-based, Transformer-based, and YOLO-based methods [13]. Compared with non-visual sensing methods such as radar, radio-frequency, and acoustic sensing, vision-based methods can provide fine-grained information from images or video sequences, including target shape, color, texture, spatial position, and motion state. Therefore, they play an important role in target confirmation, persistent localization, and countermeasure decision-making. However, vision-based anti-UAV tasks still face many challenges, such as long-range tiny-object imaging, low resolution, drastic scale variation, short-term target disappearance, camera-motion disturbance, complex background occlusion, and confusion with similar flying objects. In addition, existing public anti-UAV datasets still have limitations in terms of scene diversity, target types, multi-target interference, image quality, and unified evaluation protocols.

Existing studies have extensively discussed general UAV detection, ground-to-air observation, multi-sensor fusion perception, and conventional anti-UAV tracking. Nevertheless, for the air-to-air countermeasure scenario in which one UAV platform continuously observes and tracks another target UAV in the air, there is still a lack of a unified benchmark specifically designed for the vision–language tracking task. This scenario differs from ground-based anti-UAV observation and general UAV-view ground-target tracking. It simultaneously involves dual-dynamic motion of the tracking platform and the target platform, long-range tiny-object imaging, rapid background switching, interference from similar UAVs, and ambiguity in target specification. Therefore, ATA is positioned as a task-specialized benchmark within the anti-UAV perception system, focusing on vision–language tracking of tiny UAVs under real air-to-air conditions, and providing a unified data foundation and evaluation platform for studying tiny-object representation, cross-frame association, language-assisted target specification, and temporal modeling.

2.2. Related Tracking Datasets and Task Positioning of ATA

The development of tracking benchmarks has laid an important foundation for studying air-to-air anti-UAV scenarios with tiny targets. Broadly, existing datasets can be grouped into three categories: generic visual object tracking benchmarks, UAV tracking and anti-UAV benchmarks, and vision–language tracking benchmarks. The first category mainly establishes the general evaluation paradigm for single-object tracking. The second focuses on target tracking in UAV-mounted views, low-altitude environments, and anti-UAV scenarios. The third introduces natural language descriptions to improve target specification and continuous tracking. Although these three lines of benchmarks have each advanced their respective research directions, their task formulations, observation settings, and target characteristics are still largely designed for more general tracking problems. As such, they remain insufficient to simultaneously cover the requirements of air-to-air tracking, tiny targets, counter-UAV scenarios, and language-assisted target specification. In the following, we briefly review related work from these three perspectives and clarify the motivation and necessity of ATA.

2.2.1. Generic Visual Object Tracking Benchmarks

Generic visual object tracking benchmarks have played a central role in shaping the evaluation protocol of single-object tracking. Early benchmarks such as OTB [14] and VOT [15] established widely adopted evaluation practices, including success, precision, and attribute-based analysis. NfS [16] further extended benchmark evaluation to high-frame-rate scenarios, providing a dedicated testbed for fast-motion tracking. TrackingNet [17] enlarged the scale and diversity of training and evaluation data by introducing a large-scale in-the-wild benchmark. GOT-10k [18] highlighted category generalization through its one-shot setting, while LaSOT [12] advanced long-term single-object tracking with long sequences, dense annotations, and standardized protocols. Collectively, these benchmarks have driven generic tracking from small-scale algorithm comparison toward standardized large-scale training and evaluation.
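
For readers unfamiliar with the success and precision measures mentioned above, the following sketch shows how they are commonly computed from per-frame boxes. It reflects the conventional definitions (overlap-based success, center-distance-based precision) rather than the exact protocol used for ATA, which is not specified in this section. The `(x, y, width, height)` box format and the default thresholds of 0.5 IoU and 20 pixels are assumptions.

```python
import math
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x, y, width, height); assumed format


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a: BBox, b: BBox) -> float:
    """Euclidean distance in pixels between the two box centers."""
    acx, acy = a[0] + a[2] / 2, a[1] + a[3] / 2
    bcx, bcy = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(acx - bcx, acy - bcy)


def _check(preds: Sequence[BBox], gts: Sequence[BBox]) -> None:
    if len(preds) != len(gts) or not gts:
        raise ValueError("need equally sized, non-empty box lists")


def success_rate(preds: Sequence[BBox], gts: Sequence[BBox],
                 iou_threshold: float = 0.5) -> float:
    """Fraction of frames whose IoU with the ground truth exceeds the threshold."""
    _check(preds, gts)
    hits = sum(iou(p, g) > iou_threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def precision_rate(preds: Sequence[BBox], gts: Sequence[BBox],
                   pixel_threshold: float = 20.0) -> float:
    """Fraction of frames whose center error is within the pixel threshold."""
    _check(preds, gts)
    hits = sum(center_error(p, g) <= pixel_threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Sweeping `iou_threshold` from 0 to 1 and averaging the resulting success rates gives the area-under-curve score that success plots usually report.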

As shown in Table 1, the target categories in generic tracking benchmarks are still mainly limited to pedestrians, vehicles, animals, and other common objects, and most videos are captured from ground views or conventional imaging settings. Consequently, while these benchmarks provide valuable references for benchmark construction and evaluation design, they do not capture the key challenges of air-to-air anti-UAV tracking with tiny targets, including dual-dynamic disturbances, long-range imaging of weak-texture tiny objects, rapid background changes, and interference from multiple similar UAVs. Therefore, such datasets mainly serve as methodological references for ATA, rather than directly satisfying the data requirements of our task.

2.2.2. UAV Tracking and Anti-UAV Benchmarks

Compared with generic object tracking benchmarks, UAV-related datasets place greater emphasis on target tracking under UAV-mounted views, fast motion, scale variation, and complex low-altitude environments. From the perspective of observation settings, existing benchmarks can be broadly divided into three categories: air-to-ground UAV tracking benchmarks, ground-to-air anti-UAV tracking benchmarks, and air-to-air UAV confrontation tracking benchmarks.

The first category is air-to-ground UAV tracking benchmarks, where UAV platforms are used to capture and track ground targets. Representative datasets include UAV123 [19], VisDrone [20], UAVTrack112 [21], and WebUAV-3M [3]. Among them, WebUAV-3M is particularly notable for its large scale, broad category coverage, and extended annotations such as language and audio, making it one of the most representative recent benchmarks in UAV tracking. These datasets have substantially advanced tracking research from airborne viewpoints. However, their targets are still primarily ground objects, and thus they remain fundamentally within the air-to-ground setting.

The second category is ground-to-air anti-UAV tracking benchmarks, which are mainly designed for low-altitude security and anti-UAV applications. In these datasets, flying UAVs are typically observed by ground-based devices, fixed cameras, or weakly moving platforms. Representative examples include Anti-UAV318 [1], MM-UAV [22], and Anti-UAV600 [23]. Compared with air-to-ground benchmarks, this line of work places greater emphasis on aerial UAV targets, small-scale imaging, and tracking under cluttered low-altitude backgrounds. Nevertheless, the observation platform in these datasets is usually relatively stable, and most image variations are induced by the target motion itself. Consequently, they do not fully capture the severe dual-dynamic disturbances caused by the simultaneous high-speed motion of both the tracker platform and the target platform.

The third category is air-to-air UAV confrontation tracking benchmarks. Recent efforts have started to move beyond conventional observation paradigms and establish benchmarks that better reflect realistic low-altitude confrontation scenarios, such as the recently proposed UAV-Anti-UAV [24]. This line of work explicitly formulates a new task paradigm in which one UAV continuously tracks another UAV, extending the benchmark setting from air-to-ground and ground-to-air to the more realistic and adversarial air-to-air case. As such, it provides a more direct foundation for the problem studied in this work.

As shown in Table 2, UAV-related benchmark datasets have evolved from air-to-ground scenarios to ground-to-air scenarios, and more recently toward air-to-air scenarios. However, existing datasets still mainly focus on broad scenario coverage or general UAV tracking performance evaluation. To further clarify the differences between ATA and existing UAV-related tracking datasets, we provide a horizontal comparison of representative datasets from the perspectives of task setting and challenge attributes, as shown in Table 3. Unlike a simple comparison of dataset scale, this comparison focuses on differences in target type, observation platform, whether the dataset involves an air-to-air scenario, whether it contains tiny UAV targets, whether language annotations are provided, whether similar UAV distractors exist, and whether dual motion is involved. These factors are closely related to the practical difficulty of air-to-air counter-UAV tracking.

As shown in Table 3, existing UAV tracking datasets such as UAV123, VisDrone, UAVTrack112, and WebUAV-3M are mainly collected by UAV platforms and can reflect challenges such as fast motion, scale variation, and complex low-altitude backgrounds from an airborne viewpoint. However, the tracking targets in these datasets are mostly general ground objects rather than aerial UAV targets, and therefore they are not directly designed for air-to-air counter-UAV tracking. Anti-UAV datasets such as Anti-UAV318, Anti-UAV600, MM-UAV, and UAV-Anti-UAV mainly take UAVs as the tracking targets, and can reflect challenges such as aerial small targets, target disappearance, and complex background interference. Nevertheless, their observation platforms are usually ground-based devices, fixed sensors, or specific airborne platforms, making it difficult to simultaneously cover the drastic viewpoint changes, scale variations, background disturbances, and target-specification ambiguity caused by the joint motion of the pursuer UAV and the target UAV.

In contrast, ATA is designed for tiny-UAV tracking in real-world air-to-air counter-UAV scenarios and has a more focused task setting. Specifically, ATA simultaneously involves an airborne observation platform, distant tiny UAV targets, similar UAV distractors, complex dynamic backgrounds, target-specification ambiguity, and dual-dynamic motion. In addition, ATA provides language annotations, enabling unified support for both BBox-only and Language-assisted tracking settings. Therefore, ATA is not a simple repetition of existing UAV tracking or anti-UAV tracking datasets, but rather a further task-specialized and complementary benchmark built upon existing UAV-scenario tracking benchmarks for air-to-air tiny-UAV countermeasure tracking.

Overall, UAV-related tracking datasets have gradually extended from air-to-ground scenarios to ground-to-air scenarios and further toward air-to-air scenarios. However, existing datasets are still generally oriented toward large-scale scenario coverage or general UAV tracking evaluation. For the real air-to-air tiny-UAV countermeasure scenario considered in this work, there is still a lack of a benchmark that simultaneously focuses on tiny UAV targets, language-assisted target specification, similar UAV interference, and persistent tracking, while also uniformly supporting both BBox-only and Language-assisted settings. Therefore, ATA is complementary to existing datasets in terms of observation relationship, target scale, task setting, and annotation form, and provides a more targeted evaluation platform for real air-to-air counter-UAV tracking research.

2.2.3. Vision–Language Tracking Benchmarks

Vision–language tracking augments conventional single-object tracking with natural language descriptions, providing high-level semantic cues beyond the first-frame bounding box for target specification and continuous tracking. It has therefore become an important direction in recent tracking research.

Early efforts such as OTB99-Lang [25] demonstrated the potential of introducing natural language into tracking benchmarks, showing that language can serve as a more natural form of human–machine interaction and can help alleviate target initialization ambiguity and model drift. TNL2K [11] further advanced tracking by language by establishing a larger-scale and more native benchmark for vision–language tracking. In parallel, several benchmarks originally developed for generic single-object tracking have gradually incorporated language annotations. For instance, LaSOT [12] augments its high-quality long-term benchmark with sequence-level language descriptions, VastTrack [26] explores unified support for both vision-only and vision–language tracking at a much larger scale, and MGIT [27] further expands the expressive capacity of multimodal tracking by introducing richer hierarchical semantic annotations. Overall, these benchmarks have driven language from being an auxiliary annotation in conventional tracking datasets to becoming an increasingly important source of information for target specification and continuous tracking.

Compared with conventional methods that rely solely on first-frame bounding box initialization, language descriptions provide complementary high-level semantics, including target category, appearance attributes, structural characteristics, relative position, and background relations. Such information can impose more stable constraints on target specification and continuous tracking, especially in scenarios with severe target ambiguity, significant appearance variation, or strong interference from similar objects.

Nevertheless, as shown in Table 4, most existing vision–language tracking benchmarks are still built upon generic video scenarios, where target categories and language annotations are primarily designed for general target localization rather than the unique challenges of air-to-air anti-UAV tracking with tiny targets [7]. In this setting, the target is typically smaller, exhibits weaker texture cues, and is more easily confused with similar distractors. Consequently, language becomes even more critical, as it must provide stable and discriminative semantic constraints from multiple aspects, including category, color, structural characteristics, relative position, and background relations [3]. Therefore, while existing vision–language tracking benchmarks provide important inspiration for ATA, they still lack a dedicated data foundation tailored to air-to-air anti-UAV scenarios with tiny targets and supporting both BBox-only and Language-assisted settings.

3. Construction of the ATA Dataset

3.1. Design Objectives

To advance vision–language tracking research for air-to-air counter-UAV scenarios involving tiny drones, we construct ATA (Air-to-Air Tracking with Language Assistance), a dedicated dataset designed to provide a unified data foundation and evaluation benchmark for this task.

The construction of ATA is primarily motivated by two critical gaps in existing studies. First, current UAV-related datasets remain insufficient to faithfully characterize the dual-dynamic disturbances arising from the simultaneous motion of both the tracking platform and the target platform. Second, although language annotations have been introduced in several existing vision–language tracking datasets, such annotations are largely designed for general-purpose scenarios and therefore are of limited utility for target specification and persistent tracking in air-to-air tiny-UAV settings, where complex backgrounds and similar distractors are frequently encountered [8]. To address these limitations, ATA is deliberately designed with an emphasis on real air-to-air flight acquisition, preservation of tiny-object characteristics, dual annotations of bounding boxes and language descriptions, and standardized data organization for benchmark evaluation. As a result, it provides unified support for both the BBox-only and Language-assisted settings. Overall, ATA is neither a general replacement for existing UAV tracking or anti-UAV tracking datasets nor a simple extension of existing UAV datasets. Instead, it is a task-driven dataset and benchmark foundation constructed around several specific requirements, including real air-to-air observation, persistent tracking of tiny UAVs, counter-UAV scenarios, and language-assisted target specification.

3.2. Data Acquisition Platform and Scenario Configuration

ATA adopts real-flight data acquisition. The tracking platform is fixed as a DJI Mavic 3T (SZ DJI Technology Co., Ltd., Shenzhen, China), with the camera mounted below the front part of the fuselage and equipped with a stabilized gimbal and zoom capability. The target UAVs mainly include DJI Mini 4 Pro (SZ DJI Technology Co., Ltd., Shenzhen, China), DJI Mini 3 (SZ DJI Technology Co., Ltd., Shenzhen, China), and DJI Avata 2 (SZ DJI Technology Co., Ltd., Shenzhen, China). Among them, the Mini series is generally dominated by a white appearance, whereas Avata 2 is overall darker in color, and all these UAVs have relatively small body sizes, which are more consistent with the characteristics of the air-to-air tiny-target scenario considered in this paper. The selection of the above consumer-grade UAVs as target platforms is mainly based on two considerations. On the one hand, consumer-grade UAVs are widely deployed in low-altitude scenarios and represent common target types in real low-altitude security and counter-UAV tasks. On the other hand, such UAVs have small physical sizes and usually exhibit weak texture, low resolution, and small-area imaging characteristics under long-range air-to-air observation conditions, which can better match the problem of tiny-UAV tracking investigated in this paper. It should be noted that the target types in the current version of ATA are still mainly compact consumer-grade UAVs, and have not yet further covered more platform types such as industrial UAVs, fixed-wing UAVs, and irregular rotor aircraft. Future versions may further expand the target categories under the premise of ensuring flight safety and feasible data acquisition conditions, so as to improve the target diversity and evaluation difficulty of the dataset. ATA retains only visible-spectrum videos as the data source, and all sequences are uniformly captured at a resolution of 1920 × 1080 and a frame rate of 30 FPS.
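
To make the scale of the data concrete, the short calculation below combines the stated resolution and frame rate with the figures given in the abstract (50 sequences, 38,094 frames, mean target area of 0.03% of the image). It is a back-of-the-envelope sketch: the function name `tiny_target_stats` is hypothetical, the 0.03% figure is assumed to be measured against the full 1920 × 1080 frame, and the "equivalent square side" assumes a square box purely for intuition.

```python
import math


def tiny_target_stats(width: int = 1920, height: int = 1080, fps: int = 30,
                      total_frames: int = 38_094, num_sequences: int = 50,
                      mean_area_fraction: float = 0.0003) -> dict:
    """Derive rough per-target and per-sequence figures from the stated numbers."""
    image_area = width * height
    mean_target_area = image_area * mean_area_fraction
    mean_frames = total_frames / num_sequences
    return {
        "image_area_px": image_area,
        "mean_target_area_px": mean_target_area,
        # Side length of a square with the same area (illustrative only).
        "equivalent_square_side_px": math.sqrt(mean_target_area),
        "total_minutes": total_frames / fps / 60,
        "mean_frames_per_sequence": mean_frames,
        "mean_seconds_per_sequence": mean_frames / fps,
    }


if __name__ == "__main__":
    for name, value in tiny_target_stats().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the average target covers about 622 pixels, comparable to a box of roughly 25 × 25 pixels in a frame of about 2.07 million pixels. The dataset amounts to about 21 minutes of video, with an average sequence of about 762 frames, or roughly 25 seconds.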

From the perspective of acquisition conditions, the acquisition area of ATA lies within a low-altitude airspace with a radius of about 300 m, and the flight altitudes of both the acquisition UAV and the target UAV are approximately between 20 m and 80 m. The collected data cover different observation viewpoints, including same-altitude observation, downward-looking observation, upward-looking observation, and side-view observation, so no fixed altitude relationship is imposed between the target UAV and the tracking platform. In terms of relative motion patterns, ATA covers typical air-to-air maneuvers such as pursuit, companion flight, and circling flight, together with dynamic variations including straight flight, turning, sharp turning, ascending, descending, acceleration, and deceleration. From the perspective of the motion relationship between the platform and the target, the dataset includes not only cases where the platform UAV hovers while the target UAV moves, but also cases where the platform UAV and the target UAV move simultaneously, where the target approaches or moves away from the platform, and where the two UAVs fly in the same or opposite directions. Therefore, the target appearance variations in ATA are not only caused by the motion of the target UAV itself, but are also jointly affected by the motion of the acquisition platform, altitude differences, observation-angle changes, and background transitions, thereby reflecting more fully the practical difficulties of air-to-air tiny-object tracking under dual-dynamic conditions.

In terms of imaging settings, each video sequence maintains a fixed zoom setting throughout the sequence, so as to avoid additional target scale variation caused by manual zoom changes within the sequence. The gimbal is mainly used for image stabilization and does not perform automatic target locking or automatic target tracking; therefore, the position changes of the target within the field of view are not eliminated by an automatic gimbal-tracking mechanism. The acquisition scenarios of ATA are mainly under daytime visible-light conditions, covering common weather conditions such as sunny, cloudy, and overcast scenes. The overall illumination conditions are mainly front-lit and normal-illumination cases, while some backlit, cloud-occluded, and low-contrast overcast scenarios are also included. It should be noted that the current version has not yet systematically covered low-light, night-time, and extreme weather conditions. Therefore, ATA mainly reflects the challenges of air-to-air tiny-object tracking under daytime visible-light conditions. Regarding video processing, ATA does not apply additional transcoding or secondary compression to the original videos, nor does it perform image enhancement, stabilization processing, super-resolution reconstruction, artificial synthesis, or target-region enlargement, thereby preserving, as much as possible, the real visual characteristics of the flight acquisition process, including tiny target scale, viewpoint changes, platform-motion disturbance, background interference, and illumination variations.

In terms of task organization, all sequences in ATA follow the single-object tracking protocol, where only one target UAV is annotated for persistent tracking in each video. To further increase the difficulty of the benchmark, however, some sequences additionally introduce one or two extra UAVs as distractors, thereby creating more challenging conditions with similar-object interference. Specifically, 28 sequences, denoted as uav-m1 to uav-m28, contain multiple UAVs, whereas the remaining 22 sequences, denoted as uav-s1 to uav-s22, include only a single target UAV. This design preserves the standard formulation of single-object tracking while substantially increasing the difficulty of target specification and long-term target association in ATA.

As shown in Figure 3, from the perspective of background complexity, ATA contains 12 sequences with pure sky backgrounds, 16 sequences with mixed sky-ground backgrounds, and 22 sequences with complex ground backgrounds. More specifically, the dataset covers 11 woodland sequences, 8 urban-edge sequences, 3 football-field sequences, 5 runway sequences, 6 open-sky sequences, 9 side-by-side building sequences, and 8 sequences with standalone red-building backgrounds. Overall, ATA deliberately retains a relatively high proportion of sequences with complex backgrounds so as to better reflect the visual interference encountered in real-world counter-UAV scenarios.

3.3. Dataset Annotation

Before formal annotation, we first manually reviewed the originally collected videos and selected valid video segments. For video segments in which the target location could not be reliably determined due to severe motion blur, partial occlusion, extremely small target scale, or only a few remaining visible pixels, ATA follows the principle of “not forcing unreliable annotations rather than introducing incorrect annotations” [28]. Instead of assigning inaccurate bounding boxes to such segments, we directly remove these unreliable segments during the dataset construction stage. If an unreliable segment appears in the middle of an original video, this segment is removed, and the reliable parts before and after it, in which the target can be reliably localized, are treated as independent valid video sequences. After the above screening and splitting process, all frames retained in ATA have manually confirmed valid target locations. Therefore, in the final released ATA annotation files, there are no blank frames that are retained but left unlabeled. All retained valid frames can be normally used for model training and performance evaluation. The above screening, splitting, and annotation strategy is consistently applied to all video sequences, so as to ensure the consistency of the dataset construction and evaluation protocol.
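The screening-and-splitting rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the per-frame reliability flags, the function name `split_reliable_segments`, and the `min_len` parameter are all hypothetical and are introduced here purely for illustration.

```python
from typing import List, Tuple


def split_reliable_segments(reliable: List[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive reliable frames.

    `reliable[i]` is a hypothetical flag saying whether the target in frame i
    can be localized reliably. Unreliable frames are dropped, and the reliable
    runs before and after them become independent sequences.
    """
    segments = []
    start = None
    for idx, ok in enumerate(reliable):
        if ok and start is None:
            start = idx  # a new reliable run begins here
        elif not ok and start is not None:
            if idx - start >= min_len:
                segments.append((start, idx - 1))  # close the current run
            start = None
    # Close a run that extends to the last frame of the video.
    if start is not None and len(reliable) - start >= min_len:
        segments.append((start, len(reliable) - 1))
    return segments


# Frames 3 and 4 are unreliable, so the video yields two valid sequences.
flags = [True, True, True, False, False, True, True]
print(split_reliable_segments(flags))
```

In this toy example the unreliable middle segment is removed and the two surrounding runs are kept as separate sequences, mirroring the rule stated in the text.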

During bounding box annotation, ATA adopts a frame-wise manual annotation strategy to annotate the target UAV in the final retained video sequences. The annotation tool is DarkLabel (https://github.com/darkpgmr/DarkLabel.git, accessed on 16 April 2026), and the annotation format is an axis-aligned bounding box. Each bounding box is annotated to tightly cover the visible region of the target UAV as much as possible, so as to ensure the continuity of the target trajectory and the accuracy of the localization results. For difficult frames in which the target remains basically identifiable, adjacent-frame motion continuity and trajectory consistency are further used as auxiliary cues to maintain the temporal consistency of the annotation results as much as possible.

For language annotation, we first select representative frames from each video sequence and construct a target-centered region whose area is approximately four times that of the original target bounding box, so as to simulate the search region. To improve the clarity and reproducibility of the language annotation process, representative frames are preferentially selected when the target is visible, the background relationship is clear, and the frame can reflect both the appearance characteristics of the target and its scene context. The search region is then expanded around the target bounding box to highlight the target and its neighboring background, while reducing the interference caused by irrelevant regions in the full image. Subsequently, we use ChatGPT-5.3 (OpenAI, San Francisco, CA, USA) to assist in generating an English sentence of no more than 40 words to describe the relationship between the target and its surrounding background. The description mainly covers the target category, color, structural characteristics, relative position, and background context, while also moderately reflecting the motion state of the target, thereby supplementing the high-level semantic information that cannot be fully conveyed by bounding boxes alone. Specifically, the input prompt requires the model to describe the target UAV inside the red box, and the basic form of the prompt is as follows: “Please describe the target inside the red box in one English sentence within 40 words, including its category, color, structural features, relative position, surrounding environment, and possible motion state.” Through this prompt design, the generated language prompt can provide additional semantic constraints on target appearance, local background, and scene relations beyond the first-frame bounding box.
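As a rough illustration of the region construction, a target-centered region with about four times the box area can be obtained by scaling each side of the box by a factor of two. The sketch below assumes that the aspect ratio of the box is preserved, that boxes are given as (x, y, w, h) with a top-left origin, and that the region is clipped to the 1920 × 1080 frame; none of these details is stated in the text, and the function name `expand_region` is hypothetical.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h): top-left corner plus size


def expand_region(box: Box, area_factor: float = 4.0,
                  img_w: int = 1920, img_h: int = 1080) -> Box:
    """Return a target-centered region whose area is `area_factor` times the box area.

    Assumption: the aspect ratio is kept, so each side grows by sqrt(area_factor).
    The region is clipped to the image, so near the border it can be smaller.
    """
    x, y, w, h = box
    scale = math.sqrt(area_factor)          # area factor 4 -> side factor 2
    cx, cy = x + w / 2.0, y + h / 2.0       # center of the target box
    new_w, new_h = w * scale, h * scale
    x0 = max(0.0, cx - new_w / 2.0)
    y0 = max(0.0, cy - new_h / 2.0)
    x1 = min(float(img_w), cx + new_w / 2.0)
    y1 = min(float(img_h), cy + new_h / 2.0)
    return (x0, y0, x1 - x0, y1 - y0)


# A 30 x 16 pixel target, away from the image border.
print(expand_region((100.0, 100.0, 30.0, 16.0)))
```

For a box away from the border the returned region has exactly four times the original area; near the border the clipping makes the factor only approximate, which is consistent with the "approximately four times" wording.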

After the initial descriptions are generated by the large language model, we further obtain the final language prompts through manual screening and revision. The manual revision process is not limited to simple grammar checking. Instead, each description is checked according to several criteria, including category correctness, color consistency, visibility of structural characteristics, rationality of relative position, clarity of background relations, semantic ambiguity, and target discriminativeness. Descriptions that are inconsistent with the image content, contain subjective speculation, are overly general, may cause misleading interpretation, or cannot stably distinguish the target are removed or rewritten. For sequences containing similar UAV distractors, semantic information that can distinguish the target UAV from distractors is preferentially retained, such as differences in color, body structure, relative position, background region, or motion state. Finally, each video sequence is assigned one manually verified video-level English language description, which is used for the Language-assisted tracking task in ATA.

In terms of language annotation granularity, ATA adopts a video-level single-sentence English description as the language prompt for each sequence. This design mainly follows the common setting of existing vision–language tracking tasks, where a stable natural language description is used to provide high-level semantic information, such as target category, color, structural attributes, relative position, and background relations, thereby assisting the model in target specification and persistent tracking. Compared with frame-level or key-frame-level language descriptions, video-level descriptions can maintain consistent semantic constraints on the target throughout the entire sequence. They also allow existing mainstream vision–language trackers to be directly trained and evaluated on ATA without modifying their input protocols. Nevertheless, video-level single-sentence descriptions still have certain limitations. Since the target scale, viewpoint, background, and occlusion state may continuously change during air-to-air tracking, a single video-level description is difficult to fully characterize fine-grained target-state changes across different key frames. Future versions of ATA may further explore key-frame-level language descriptions, temporally dynamic language prompts, or finer-grained target semantic annotations to support more detailed language-guided tracking research.

The annotation work of ATA is completed by multiple annotators following a unified annotation protocol. After the initial annotation is completed, all annotations are subjected to cross-checking and manual verification to reduce subjective errors and inconsistencies. For bounding box annotations, the quality-control process mainly focuses on the accuracy of target localization, the consistency of target identity, the temporal continuity of trajectory changes, the rationality of video splitting boundaries, and whether confusion occurs between the target and similar distractors in challenging scenarios. For language descriptions, the quality-control process mainly focuses on the consistency between the description and the image content, whether the semantic expression is ambiguous, and whether the description can effectively distinguish the target UAV from the surrounding background or similar UAV distractors. Dedicated quality-control procedures are established for both bounding box annotations and language descriptions, thereby ensuring the overall quality of the dataset.

Overall, the annotation system of ATA is characterized by both fine-grained frame-level annotation and dual visual–language annotation, providing a unified and reliable data foundation for subsequent studies on both vision-only tracking and vision–language tracking.

3.4. Data Organization and Split

ATA comprises 50 video sequences in total and is divided into a training set and a test set. The training set contains 40 sequences, including 22 multi-UAV interference sequences from uav-m1 to uav-m22 and 18 single-target sequences from uav-s1 to uav-s18. The test set consists of the remaining 10 sequences, including 6 multi-UAV interference sequences from uav-m23 to uav-m28 and 4 single-target sequences from uav-s19 to uav-s22. Such a split preserves both multi-UAV interference and single-target scenarios in both the training and evaluation stages, thereby enabling a more comprehensive assessment of tracker performance across different levels of task difficulty.
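The split can be written out explicitly from the sequence naming scheme. The following sketch simply enumerates the names given above; the helper `sequence_names` is hypothetical, and whether the released folder names match these strings exactly is an assumption.

```python
from typing import List


def sequence_names(prefix: str, first: int, last: int) -> List[str]:
    """Build names such as 'uav-m1' ... 'uav-m22' for an inclusive index range."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


# Training set: uav-m1..uav-m22 (multi-UAV) and uav-s1..uav-s18 (single target).
train = sequence_names("uav-m", 1, 22) + sequence_names("uav-s", 1, 18)
# Test set: uav-m23..uav-m28 (multi-UAV) and uav-s19..uav-s22 (single target).
test = sequence_names("uav-m", 23, 28) + sequence_names("uav-s", 19, 22)

print(len(train), len(test), len(set(train) & set(test)))
```

The two lists are disjoint and together cover all 50 sequences, with 40 in the training set and 10 in the test set.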

In terms of data organization, each sequence folder contains an img directory and two annotation files, namely groundtruth.txt and language.txt. Specifically, the img directory stores image frames in chronological order, groundtruth.txt records the frame-wise bounding box annotations, and language.txt provides the corresponding video-level English description. This standardized data structure allows ATA to be readily integrated into mainstream tracking frameworks for training, testing, and subsequent benchmark evaluation.
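A minimal loader for one sequence folder might look as follows. The folder layout (img, groundtruth.txt, language.txt) follows the description above, but the file contents are assumptions: the sketch supposes one box per line in groundtruth.txt with four numbers separated by commas or whitespace, and frame file names that sort chronologically (for example, zero-padded indices). The function name `load_sequence` is hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple


def load_sequence(seq_dir: str) -> Tuple[List[Path], List[Tuple[float, ...]], str]:
    """Load frame paths, per-frame boxes, and the video-level prompt of one sequence."""
    root = Path(seq_dir)
    # Assumption: lexicographic order of file names equals chronological order.
    frames = sorted(p for p in (root / "img").iterdir() if p.is_file())

    boxes = []
    for line in (root / "groundtruth.txt").read_text().splitlines():
        line = line.strip()
        if line:
            # Assumption: four numbers per line, comma- or whitespace-separated.
            boxes.append(tuple(float(v) for v in re.split(r"[,\s]+", line)))

    prompt = (root / "language.txt").read_text(encoding="utf-8").strip()

    # Every retained frame is stated to be annotated, so the counts should match.
    if len(frames) != len(boxes):
        raise ValueError(f"{len(frames)} frames but {len(boxes)} boxes in {seq_dir}")
    return frames, boxes, prompt
```

A call such as `frames, boxes, prompt = load_sequence("path/to/uav-m1")` would then return the three parts of a sequence; the path is a placeholder, and the box format should be checked against the released files before use.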

The split strategy of ATA is designed by jointly considering scenario diversity and difficulty coverage. On the one hand, both the training set and the test set include multi-UAV interference sequences as well as single-target sequences, which prevents the evaluation from being overly biased toward relatively simple cases. On the other hand, maintaining both sequence types in the two subsets enables the benchmark to cover a broader range of scene configurations and difficulty levels. Overall, ATA adopts a data split that preserves a clear organizational structure and complete annotations, while also providing a direct and reliable foundation for unified benchmark evaluation.

3.5. Dataset Statistics and Characteristic Analysis

ATA comprises 50 video sequences with a total of 38,094 frames and an overall duration of 1269.8 s (approximately 21 min and 9.8 s). On average, each sequence contains 761.88 frames, with the shortest and longest sequences consisting of 310 and 945 frames, respectively. All videos are collected at a unified resolution of 1920 × 1080. Although ATA is relatively compact in scale, it exhibits a reasonably balanced distribution of sequence lengths, which is sufficient to satisfy the fundamental requirements of an air-to-air tiny-object tracking benchmark.

One of the most distinctive characteristics of ATA lies in its target-scale distribution. Statistical analysis shows that the targets have an average width of 31.175 pixels, an average height of 16.845 pixels, and an average area of 623.850 pixels². The minimum target area is only 100 pixels², whereas the maximum reaches 6624 pixels². Correspondingly, the average equivalent side length is approximately 24.977 pixels. More importantly, the average target area accounts for merely 0.03% of the full image area, indicating that the overwhelming majority of targets in ATA fall into the regime of extremely small-scale objects.

As further illustrated in Figure 4, target instances in ATA are predominantly concentrated within the range of 64–1024 pixels². In particular, 95.084% of the targets occupy less than 0.1% of the full image area, while all targets remain below 0.5% of the image area. These statistics provide compelling evidence that ATA exhibits a highly representative and pronounced air-to-air tiny-object characteristic.
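The derived figures reported in this subsection follow directly from the raw statistics, as the short arithmetic check below shows. It uses only numbers stated in the text (frame count, 30 FPS, 1920 × 1080 resolution, mean and maximum target area); the variable names are chosen here for illustration.

```python
import math

IMG_W, IMG_H = 1920, 1080
image_area = IMG_W * IMG_H                  # 2,073,600 pixels^2

total_frames, fps, num_sequences = 38094, 30, 50
duration_s = total_frames / fps             # overall duration in seconds
minutes, seconds = divmod(duration_s, 60)
mean_frames = total_frames / num_sequences  # average frames per sequence

mean_area = 623.850                         # mean target area in pixels^2
equiv_side = math.sqrt(mean_area)           # side of a square with the same area
area_ratio_pct = 100.0 * mean_area / image_area

area_at_0_1_pct = 0.001 * image_area        # area matching 0.1% of the image
area_at_0_5_pct = 0.005 * image_area        # area matching 0.5% of the image
max_area = 6624                             # largest annotated target area

print(f"duration: {duration_s:.1f} s = {int(minutes)} min {seconds:.1f} s")
print(f"mean frames per sequence: {mean_frames:.2f}")
print(f"equivalent side length: {equiv_side:.3f} px")
print(f"mean area ratio: {area_ratio_pct:.2f}% of the image")
print(f"0.1% threshold: {area_at_0_1_pct:.1f} px^2, 0.5% threshold: {area_at_0_5_pct:.1f} px^2")
print(f"max area below 0.5% threshold: {max_area < area_at_0_5_pct}")
```

In particular, the 0.1% threshold corresponds to roughly 2074 pixels², and the maximum target area of 6624 pixels² stays below the 0.5% threshold of 10,368 pixels², which agrees with the statement that all targets remain below 0.5% of the image area.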

In addition to visual annotations, ATA provides one English language prompt for each video, resulting in a total of 50 prompts. The average prompt length is 13.46 words, with the shortest and longest prompts containing 7 and 25 words, respectively. Overall, the language descriptions are deliberately maintained at a moderate length, such that they can effectively convey key cues—including target category, color, background relationship, and relative position—while avoiding the redundant semantic noise that may be introduced by overly lengthy text.

Taken together, the above statistics highlight several salient properties of ATA in terms of dataset scale, tiny-object characteristics, and language annotation design. First, ATA is collected through real air-to-air flight operations, allowing it to faithfully reflect the observation conditions of targets in dynamic low-altitude environments. Second, the targets in ATA are overwhelmingly distributed at an extremely small scale, making the tiny-object property particularly prominent. Third, by simultaneously providing bounding box annotations and language prompts, ATA supports both vision-only and vision–language tracking settings within a unified framework. Therefore, ATA is not merely a dataset tailored to air-to-air counter-UAV scenarios involving tiny targets, but also an effective and reliable foundation for systematic benchmark evaluation.

3.6. Summary of Dataset Challenges

3.6.1. Dual-Dynamic Disturbances

One of the most distinctive characteristics of ATA lies in the fact that both the tracking platform and the target platform are simultaneously in motion, such that the target displacement observed in the image plane is determined not only by the motion of the target itself, but also by the combined effects of platform attitude variation, flight velocity, and viewpoint changes of the observing UAV. As illustrated in Figure 5, compared with conventional scenarios involving static cameras or ground-to-air observation, this dual-dynamic coupling substantially increases the uncertainty of inter-frame target displacement, scale variation, and viewpoint change, thereby rendering cross-frame association and persistent tracking considerably more challenging. For existing trackers that rely heavily on template matching or local search mechanisms, such disturbances further exacerbate the risks of target drift and eventual tracking failure.

3.6.2. Difficulty in Tiny-Object Representation

The target UAVs in ATA exhibit a highly pronounced tiny-object characteristic. Statistical analysis shows that the average target area occupies only a negligible fraction of the full image, with the overwhelming majority of targets falling within the typical regime of tiny objects. As illustrated in Figure 6, owing to their extremely limited scale, the targets can provide only sparse texture, edge, and local structural cues in the image. Under such circumstances, even slight motion blur, compression artifacts, or scale fluctuations may cause target features to deteriorate rapidly. As a consequence, conventional visual representations often struggle to remain both stable and sufficiently discriminative, while the model itself becomes more susceptible to background clutter and spurious responses. Therefore, tiny-object perception and robust representation learning constitute one of the central challenges that ATA poses to existing tracking methods.

3.6.3. Rapid Background Switching and Complex Interference

Because the observing platform remains in continuous flight, the background in ATA is by no means stationary; rather, it undergoes rapid transitions among pure-sky scenes, mixed sky–ground backgrounds, building-dominated environments, woodland scenes, and sports-field settings. As illustrated in Figure 7, compared with relatively simple sky backgrounds, complex ground backgrounds often contain a large number of regions with salient textures and structural patterns. When the target is extremely small, such regions are more likely to compress the effective discriminative space, thereby increasing the difficulty of separating the target from the background. Meanwhile, rapid background variation also weakens the model’s ability to exploit short-term motion continuity, making it more likely for the search region to contain spurious responses that resemble the target in shape, scale, or motion pattern. Therefore, ATA not only evaluates a tracker’s ability to adapt to complex low-altitude environments, but also more stringently examines its robustness under rapid background transitions.

3.6.4. Multi-UAV Interference and Target Specification Ambiguity

In a subset of ATA sequences, besides the target being tracked, one or two additional UAVs are deliberately introduced as distractors, which further intensifies the problems of target confusion and identity drift. As illustrated in Figure 8, since these UAVs are often highly similar in category, appearance, scale, and flight state, conventional tracking paradigms that rely solely on first-frame bounding box initialization are more prone to mismatching during persistent tracking. This issue becomes particularly severe in the tiny-object regime, where the visual evidence itself is already weak, and the coexistence of multiple similar flying objects substantially increases the difficulty of target specification. For this reason, the language descriptions introduced in ATA should not be viewed merely as an extension at the annotation level; rather, they serve as a necessary semantic complement for addressing such target ambiguity, thereby providing a more direct experimental foundation for subsequent research on vision–language collaborative tracking.

4. Experiments

To evaluate the discriminative capacity of ATA across different methodological paradigms and to investigate the adaptability of current mainstream tracking models to air-to-air counter-UAV scenarios involving tiny targets, we select eight representative recent trackers as baselines, including four vision-only trackers and four vision–language trackers. Specifically, the vision-only baselines include OSTrack [29], SeqTrack [30], AQATrack [31], and MCITrack [32], while the vision–language baselines comprise All-in-One [33], MMTrack [34], SUTrack [35], and MambaTrack [36]. Such a benchmark configuration not only covers the two mainstream paradigms of vision-only tracking and vision–language tracking, but also spans a diverse spectrum of technical designs, including unified modeling, sequential modeling, contextual interaction, and cross-modal fusion. This diversity enables a more comprehensive and informative analysis of benchmark performance on ATA.

4.1. Experimental Settings

All experiments are conducted on a platform running Ubuntu 20.04, with Python 3.8 and Miniconda 3 adopted for environment management. The hardware configuration consists of an Intel Xeon Platinum 8358P CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory (NVIDIA Corporation, Santa Clara, CA, USA). All methods are evaluated under the unified train/test split of ATA, where the vision-only trackers are tested under the BBox-only setting, whereas the vision–language trackers are evaluated under the Language-assisted setting.

To ensure reproducibility and a fair comparison among methods, all experiments use the same data split and the same training and evaluation protocol on the ATA dataset. All baseline models and their AFTE-enhanced variants are initialized from the official pretrained weights of the corresponding methods and are then trained only on the ATA training set. After training, the last-epoch model weights are uniformly used for final performance evaluation on the test set. The test set is not involved in model training, hyperparameter adjustment, or checkpoint selection. The number of training epochs for all methods is uniformly set to 50. The input resolution is also uniformly configured, with the template region set to 192 × 192 and the search region set to 384 × 384. For vision-only tracking methods, only the image sequence and the first-frame target bounding box are used as inputs. For vision–language tracking methods, each video sequence uses one manually verified English description provided by ATA as a fixed video-level language prompt. This language prompt remains unchanged throughout the tracking process, and no additional manual prompt engineering is introduced. The description is processed by the original text encoding module or tokenizer of each method. To adapt to the input formats of different trackers, only the necessary conversions of the ATA data organization are performed, covering the sequence list, the first-frame bounding box, and the annotation files. Except for the structural modifications corresponding to the AFTE module, the main network architecture, loss function, and inference procedure of each baseline method are left unchanged. All methods are finally compared on the same test set and under the same evaluation metrics.

Following standard practice in visual tracking, we adopt five widely used evaluation metrics to comprehensively assess different methods, namely AUC, OP50, Precision, $P_{Norm}$, and FPS. Among these metrics, AUC, which measures the area under the success curve, is used as the primary criterion for ranking and comparing the overall tracking performance of different models in this work.
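For readers unfamiliar with these metrics, the sketch below computes AUC, OP50, and Precision for one sequence in the way that common single-object tracking toolkits do. The text does not give the exact settings, so the following are assumptions: IoU thresholds from 0 to 1 in steps of 0.05, a strict "greater than" comparison for success, and a 20-pixel center-error threshold for Precision. $P_{Norm}$ is left out because the text does not state its normalization and threshold conventions. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Box = Sequence[float]  # (x, y, w, h): top-left corner plus size


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def evaluate(preds: List[Box], gts: List[Box]) -> Dict[str, float]:
    """Return AUC, OP50, and Precision (all in percent) for one sequence."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    errs = [center_error(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    thresholds = [i / 20 for i in range(21)]  # assumed grid: 0, 0.05, ..., 1
    # Success curve: fraction of frames whose IoU exceeds each threshold.
    success = [sum(v > t for v in ious) / n for t in thresholds]
    return {
        "AUC": 100.0 * sum(success) / len(success),          # mean of the success curve
        "OP50": 100.0 * sum(v > 0.5 for v in ious) / n,      # success rate at IoU 0.5
        "Precision": 100.0 * sum(e <= 20 for e in errs) / n,  # assumed 20-pixel threshold
    }


gts = [(100, 100, 30, 16), (104, 101, 30, 16)]
preds = [(100, 100, 30, 16), (130, 120, 30, 16)]
print(evaluate(preds, gts))
```

Because ATA targets are only a few tens of pixels wide, a fixed 20-pixel center-error threshold is lenient relative to the target size, whereas the IoU-based AUC penalizes even small offsets. This gives one plausible reason for using AUC as the primary ranking metric, although the text does not state this motivation explicitly.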

4.2. Baseline Results and Analysis

Under the unified experimental protocol described above, we first compare the overall performance of the eight representative trackers on the ATA dataset, with the quantitative results summarized in Table 5. Ranking the methods by AUC, which serves as the primary evaluation metric in this work, SeqTrack achieves the best overall performance, attaining an AUC of 44.55. It further reaches 48.43 in OP50, 79.74 in Precision, and 36.18 in $P_{Norm}$, thereby demonstrating the strongest comprehensive capability among all baselines. MCITrack ranks second with an AUC of 41.09, and likewise exhibits strong competitiveness among the vision-only trackers. Among the vision–language methods, MMTrack delivers the best overall performance, obtaining an AUC of 38.07, together with 45.76 in OP50 and 35.02 in $P_{Norm}$, which indicates that it possesses comparatively strong cross-modal modeling capability under the ATA setting. Taken as a whole, even the strongest baseline still leaves considerable room for improvement in the air-to-air tiny-object counter-UAV scenario characterized by ATA, suggesting that the benchmark poses a substantial challenge to existing trackers across different technical paradigms.

Beyond absolute tracking accuracy, the results also reveal a non-negligible trade-off between effectiveness and efficiency. For instance, All-in-One achieves 96.67 FPS, making it the fastest method among all baselines; however, its AUC is only 31.19, indicating that higher inference speed does not necessarily translate into superior tracking performance. By contrast, SeqTrack not only delivers the best accuracy, but also maintains a runtime speed of 50.00 FPS, thereby exhibiting a relatively favorable balance between tracking performance and computational efficiency. Although MCITrack ranks second in terms of AUC, its runtime is only 32.95 FPS, which is comparatively modest. These observations suggest that, under the ATA setting, existing trackers still face a pronounced robustness–real-time trade-off, especially when accurate localization and real-time deployment are expected to be achieved simultaneously.

It should be noted that the results in Table 5 also show that vision–language trackers do not always outperform purely visual trackers after introducing language prompts. For example, purely visual methods such as SeqTrack and MCITrack surpass some vision–language methods in terms of AUC, indicating that although language information can provide additional semantic cues for target specification, its performance gain cannot be stably reflected in all models and all air-to-air tiny-target scenarios. This phenomenon can be mainly attributed to the following four aspects.

First, the target UAVs in ATA are extremely small, and the available visual evidence in a single frame is very limited. Dataset statistics show that the average target area in ATA accounts for only 0.03% of the entire image, and most targets fall within the typical tiny-object range. In this case, language prompts can describe the target category, color, structure, or background relationship, but they can hardly provide precise pixel-level or box-level spatial localization information. When the target appears as only a few pixels in the image, the model still relies heavily on the visual branch to capture weak textures, weak edges, and local responses. Therefore, language prompts cannot fully compensate for the localization difficulty caused by insufficient visual representation of tiny targets.

Second, the language descriptions in ATA are video-level prompts and remain unchanged throughout the entire sequence tracking process, lacking frame-wise updated information about target position, scale, and motion state. In other words, language descriptions can provide global semantic constraints for the target, but they cannot dynamically adapt to rapid displacement, scale variation, viewpoint changes, and background switching in consecutive frames. Therefore, when the target is affected by dual-dynamic motion, rapid background changes, or short-term occlusion, the fixed language prompt has limited constraint capability for precise localization in the current frame and cannot directly replace cross-frame motion modeling and local search mechanisms.

Third, existing vision–language trackers still face obvious cross-modal alignment difficulties in air-to-air tiny-target scenarios. Most vision–language tracking methods usually encode language prompts as global semantic features and fuse them with visual features at a relatively coarse level. However, in ATA scenarios, the target UAV often occupies only a very small number of pixels, and its local visual responses are weak and easily affected by complex backgrounds, motion blur, and similar UAV distractors. This makes it difficult for semantic information in the language branch, such as category, color, structure, or relative position, to establish stable correspondence with fine-grained local responses in the image. In other words, the existing cross-modal fusion process may fail to sufficiently convert high-level language semantics into effective frame-level spatial localization constraints, thereby causing language information to bring unstable performance gains in some models.

Finally, although some purely visual trackers do not use language input, they have stronger visual representation, template matching, sequence modeling, or localization regression capabilities, and therefore may still perform more stably in ATA scenarios where tiny-object localization and cross-frame association are the core difficulties. For air-to-air tiny-UAV tracking, coarse-grained semantic information is not always more critical than fine-grained visual localization capability. When language prompts cannot provide sufficiently fine-grained spatial and temporal constraints, purely visual methods with stronger visual feature extraction and localization capabilities may still achieve higher performance.

In summary, the results in Table 5 do not indicate that language information is ineffective in ATA. Instead, they suggest that existing vision–language trackers still have limitations in exploiting language information under air-to-air tiny-UAV tracking conditions. For such scenarios, language prompts should not be simply introduced as global semantic descriptions. Instead, they need to be combined with finer-grained cross-modal alignment mechanisms, frame-level spatial grounding mechanisms, local search-region guidance mechanisms, and temporal modeling mechanisms. By enabling language features to participate more directly in target specification, candidate-region selection, similar-UAV discrimination, and cross-frame association, the auxiliary role of language information in robust tracking under complex dynamic backgrounds can be more fully exploited.

To further analyze the task difficulty of ATA from an experimental perspective, we compare the AUC performance of representative VOT and VLT trackers on ATA, LaSOT, TNL2K, and UAV-Anti-UAV benchmarks, as shown in Table 6. By comparing performance differences across different types of benchmarks, we can more intuitively observe the adaptability and limitations of existing tracking methods in the air-to-air tiny-UAV tracking scenario characterized by ATA.

As can be seen from Table 6, existing tracking methods generally obtain lower AUC scores on ATA than on general tracking or vision–language tracking benchmarks such as LaSOT and TNL2K. For example, OSTrack achieves an AUC of 71.1 on LaSOT, whereas its AUC on ATA is only 36.11. MCITrack obtains an AUC of 75.3 on LaSOT, but decreases to 41.09 on ATA. For vision–language tracking methods, SUTrack achieves an AUC of 67.9 on TNL2K, while its AUC on ATA is only 35.24.
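The gaps quoted above can be recomputed directly from the Table 6 entries. The following minimal sketch does this for the three cited examples; the dictionary layout and the helper name `auc_drop` are illustrative assumptions and are not part of any ATA toolkit.

```python
# AUC values copied from Table 6 (reference benchmark vs. ATA).
TABLE6_EXAMPLES = {
    # tracker: (reference benchmark, reference AUC, ATA AUC)
    "OSTrack": ("LaSOT", 71.1, 36.11),
    "MCITrack": ("LaSOT", 75.3, 41.09),
    "SUTrack": ("TNL2K", 67.9, 35.24),
}


def auc_drop(reference_auc, ata_auc):
    """Return the absolute drop (AUC points) and the relative drop (percent)."""
    absolute = reference_auc - ata_auc
    relative = 100.0 * absolute / reference_auc
    return round(absolute, 2), round(relative, 1)


if __name__ == "__main__":
    for tracker, (benchmark, ref_auc, ata_auc) in TABLE6_EXAMPLES.items():
        absolute, relative = auc_drop(ref_auc, ata_auc)
        print(f"{tracker}: {benchmark} {ref_auc} -> ATA {ata_auc} "
              f"(drop of {absolute} points, {relative}% of the reference score)")
```

For these three trackers the drop exceeds 30 AUC points in every case, which corresponds to between about 45% and 49% of the score on the reference benchmark.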

This performance gap indicates that, compared with conventional object tracking scenarios, air-to-air tiny-UAV tracking is indeed more challenging. Existing methods can usually achieve relatively stable performance in ordinary scenarios, but their performance drops significantly on ATA. This suggests that the task is affected not only by tiny target scale and limited appearance information, but also by the combined influence of dual-dynamic motion between the tracking platform and the target platform, rapid viewpoint changes, complex background interference, and similar UAV distractors. ATA therefore reflects the key challenges of air-to-air tiny-UAV tracking in a concentrated form, which further supports the necessity of constructing a dedicated benchmark.

4.3. Failure Cases and Bottleneck Analysis

Compared with generic single-object tracking scenarios, the air-to-air tiny-object counter-UAV task characterized by ATA is considerably more challenging. Figure 9 presents several representative failure cases of the baseline trackers on ATA, where the green boxes denote the ground-truth (GT) annotations and the red boxes correspond to the tracker predictions.

First, because the targets are extremely small and provide only highly limited visual evidence, existing methods often struggle to establish stable and discriminative target representations, which consequently gives rise to response attenuation, localization jitter, and even complete target loss, as illustrated in Figure 9a. Moreover, since both the tracking platform and the target platform are simultaneously in motion, the target may undergo rapid variations in position, scale, and pose across consecutive frames, as shown in Figure 9b. Such dual-dynamic disturbances substantially increase the difficulty of inter-frame correspondence and target association, thereby rendering current trackers more vulnerable to mismatching, drift accumulation, and eventual tracking failure. Meanwhile, the target may appear against a wide range of complex backgrounds, including sky, buildings, ground scenes, woodland areas, and sports fields, as illustrated in Figure 9c, which further aggravates the challenges of stable localization and robust target–background separation.

A closer inspection further suggests that, although these failure cases manifest themselves in different forms, they in fact reflect a common underlying bottleneck, namely that existing methods still make insufficient use of continuous temporal information. When the target appearance in a particular frame becomes ambiguous, or is severely degraded by dynamic variation and background interference, relying solely on the visual evidence from the current frame is often insufficient to support a stable prediction, as illustrated in Figure 9d. Therefore, under the ATA setting, single-frame appearance cues alone are inadequate for stable and persistent tracking, which in turn indicates that more effective exploitation of temporal continuity and sequence-level information may be essential for improving tracking robustness in this highly challenging scenario.

4.4. Validation of AFTE-Based Temporal Enhancement

4.4.1. Core Idea and Adaptation Strategy of AFTE

In persistent air-to-air tiny-UAV tracking, the information provided by a single frame is usually limited. In particular, under the tiny-object, strong-dynamic, and complex-background conditions characterized by ATA, the target often occupies only a very small number of pixels in a single frame, with weak appearance texture and structural information. It is also easily affected by motion blur, background false responses, and similar UAV distractors. Therefore, relying only on the appearance features of the current frame is insufficient to support stable persistent tracking. Since ATA is collected from continuous air-to-air video sequences, adjacent frames naturally contain important cues such as target motion states, background variation trends, and cross-frame association relationships. Based on this property, this paper further introduces the AFTE temporal enhancement module to verify whether short-term temporal information can improve tracking performance in ATA scenarios.

The use of temporal information is one of the key focuses of the dataset analysis in this paper. This section therefore examines how temporal cues in videos can be used efficiently to alleviate the insufficient exploitation of motion features in air-to-air tiny-UAV tracking, using AFTE as a lightweight adjacent-frame temporal enhancement module.

The basic idea of AFTE [37] is inherited from existing adjacent-frame feature enhancement and efficient adjacent-frame fusion methods. Such methods usually establish feature correspondences between the current frame and neighboring frames by measuring local similarity between adjacent frames, and they enhance the target response in the current frame through feature alignment, similarity-weighted fusion, or background-difference modeling. For ATA scenarios, when the appearance cues of the target in the current frame are weak, the target position, local motion tendency, and background variation information in the previous frame can provide supplementary constraints for current-frame target localization. When false responses with shapes or scales similar to the target exist in the background, the temporal consistency between adjacent frames also helps suppress unstable background interference.

In terms of the specific adaptation strategy, this paper adopts a unified integration strategy. For a given video sequence, when tracking the target in the $t$-th frame, AFTE simultaneously uses the visual information of the current frame $I_{t}$ and the previous adjacent frame $I_{t-1}$. The two frames are first processed by the visual encoder of the corresponding tracker to extract features, obtaining the current-frame feature $F_{t}$ and the adjacent-frame feature $F_{t-1}$. Then, AFTE estimates the local correspondence between the adjacent-frame feature and the current-frame feature through local similarity calculation and aligns and fuses the adjacent-frame feature to obtain the enhanced current-frame feature $\hat{F}_{t}$. The enhanced feature is subsequently fed into the following prediction module of the original tracker to output the target bounding box of the current frame [38].
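The alignment-and-fusion step described above can be illustrated with a minimal sketch. This section does not give the exact AFTE equations, so the code below is only one plausible instantiation under stated assumptions: cosine similarity inside a local window, softmax-normalized weights, and a residual fusion. The function name `adjacent_frame_fuse` and the parameters `radius`, `temperature`, and `alpha` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def softmax(scores, temperature):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adjacent_frame_fuse(feat_t, feat_prev, radius=1, temperature=0.1, alpha=0.5):
    """Enhance F_t with F_{t-1}; both are H x W grids of C-dim vectors (nested lists)."""
    height, width = len(feat_t), len(feat_t[0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            query = feat_t[y][x]
            # Candidate previous-frame features inside the local window around (y, x).
            neighbours = [
                feat_prev[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Local similarity -> correspondence weights.
            weights = softmax([cosine(query, n) for n in neighbours], temperature)
            # Similarity-weighted average: previous-frame feature aligned to (y, x).
            aligned = [
                sum(w * n[c] for w, n in zip(weights, neighbours))
                for c in range(len(query))
            ]
            # Residual fusion (an assumption): current feature plus scaled aligned feature.
            row.append([q + alpha * a for q, a in zip(query, aligned)])
        fused.append(row)
    return fused


if __name__ == "__main__":
    background, target = [1.0, 0.0], [0.0, 1.0]
    prev = [[list(background) for _ in range(3)] for _ in range(3)]
    prev[1][1] = list(target)            # clear target response in frame t-1
    cur = [[list(background) for _ in range(3)] for _ in range(3)]
    cur[1][2] = [0.0, 0.3]               # weak, shifted target response in frame t
    enhanced = adjacent_frame_fuse(cur, prev)
    print("target channel before:", cur[1][2][1], "after:", enhanced[1][2][1])
```

In this toy example the weak target response in the current frame is matched to the stronger response one cell away in the previous frame, so its target channel is amplified, while background locations receive almost no target-channel contribution.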

For visual-only trackers, AFTE extends the original single-frame visual input to a two-frame input consisting of the current frame and the previous adjacent frame, so as to enhance the current-frame feature in the visual branch. For vision–language trackers, the language branch remains unchanged. The language prompt is still the video-level English description provided by ATA, and the text encoder and language-fusion strategy follow the original settings of each method. AFTE is only applied to the visual feature branch. In this way, this paper can compare the influence of AFTE on different types of baseline trackers without changing the original language input, main network architecture, training protocol, or evaluation metrics.

4.4.2. Experimental Results and Analysis of AFTE

As shown in Table 7, after introducing AFTE, the AUC of every baseline model improves, although by different margins. Among visual-only methods, the AUC of MCITrack increases from 41.09 to 41.46, and that of AQATrack increases from 36.10 to 38.07. In addition, the OP50, Precision, and $P_{Norm}$ of AQATrack improve to 40.09, 72.78, and 29.53, respectively. Not every metric improves for every tracker: for example, the Precision of MCITrack decreases from 76.97 to 74.72. Overall, these results indicate that adjacent-frame information can, to some extent, supplement the insufficient single-frame representation of visual-only trackers in air-to-air tiny-object scenarios.
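For readers less familiar with the metrics reported in Table 7, the sketch below shows how AUC, OP50, and Precision are commonly computed in single-object tracking. It assumes the usual conventions (boxes given as x, y, w, h; 21 overlap thresholds from 0 to 1; a 20-pixel center-error threshold for Precision); the exact protocol used for ATA may differ. $P_{Norm}$, which normalizes the center error by the ground-truth box size, is omitted because its conventions vary between toolkits.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes in pixels."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_metrics(pred_boxes, gt_boxes, num_thresholds=21):
    """Return (AUC, OP50) in percent from per-frame overlaps."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at each threshold: fraction of frames with IoU above it.
    success = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    auc = 100.0 * sum(success) / len(success)
    op50 = 100.0 * sum(o > 0.5 for o in overlaps) / len(overlaps)
    return auc, op50


def precision(pred_boxes, gt_boxes, pixel_threshold=20.0):
    """Percent of frames whose center error is within the pixel threshold."""
    hits = 0
    for (px, py, pw, ph), (gx, gy, gw, gh) in zip(pred_boxes, gt_boxes):
        dx = (px + pw / 2) - (gx + gw / 2)
        dy = (py + ph / 2) - (gy + gh / 2)
        if math.hypot(dx, dy) <= pixel_threshold:
            hits += 1
    return 100.0 * hits / len(gt_boxes)


if __name__ == "__main__":
    gt = [(100.0, 100.0, 10.0, 10.0)] * 4          # a 10 x 10 pixel target
    pred = [
        (100.0, 100.0, 10.0, 10.0),                # perfect
        (102.0, 100.0, 10.0, 10.0),                # 2 px off
        (115.0, 100.0, 10.0, 10.0),                # 15 px off, no overlap
        (200.0, 200.0, 10.0, 10.0),                # lost
    ]
    print(success_metrics(pred, gt), precision(pred, gt))
```

The third prediction in the example shows why overlap-based and center-error-based metrics can diverge for tiny targets: a box 15 pixels away from a 10 × 10 target still counts as a Precision hit under a 20-pixel threshold, although its overlap with the ground truth is zero.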

For vision–language methods, the gains brought by AFTE are more obvious. The AUC of All-in-One increases from 31.19 to 35.91, that of MambaTrack increases from 31.98 to 33.53, and that of SUTrack increases from 35.24 to 40.74, with an improvement of 5.50. Meanwhile, the Precision of SUTrack increases from 64.98 to 75.38, showing the most significant performance improvement. Overall, vision–language methods benefit more from AFTE than visual-only methods, indicating that for models with relatively weak single-frame discrimination ability or insufficient visual evidence, short-term temporal information can complement language prompts and further improve target localization stability.
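The per-tracker and per-group AUC gains can be checked against Table 7 with a few lines of code. This is a minimal sketch; the data layout and the helper name `auc_gains` are illustrative assumptions.

```python
# (AUC without AFTE, AUC with AFTE), copied from Table 7.
TABLE7_AUC = {
    "VOT": {"MCITrack": (41.09, 41.46), "AQATrack": (36.10, 38.07)},
    "VLT": {
        "Allinone": (31.19, 35.91),
        "SUTrack": (35.24, 40.74),
        "MambaTrack": (31.98, 33.53),
    },
}


def auc_gains(table):
    """Return {group: (per-tracker gains, mean gain)} in AUC points."""
    gains = {}
    for group, trackers in table.items():
        per_tracker = {
            name: round(after - before, 2)
            for name, (before, after) in trackers.items()
        }
        mean_gain = round(sum(per_tracker.values()) / len(per_tracker), 2)
        gains[group] = (per_tracker, mean_gain)
    return gains


if __name__ == "__main__":
    for group, (per_tracker, mean_gain) in auc_gains(TABLE7_AUC).items():
        print(group, per_tracker, "mean gain:", mean_gain)
```

The mean AUC gain is about 1.17 points for the two visual-only trackers and about 3.92 points for the three vision–language trackers. The trend holds on average rather than for every pair: the gain of MambaTrack (1.55) is smaller than that of AQATrack (1.97).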

From the perspective of model complexity, AFTE does not introduce significant additional overhead. The parameters and FLOPs of each model increase only slightly and remain within an acceptable range. In terms of speed, the FPS of some models decreases (for example, MambaTrack drops from 31.07 to 18.80 and All-in-One from 92.90 to 77.60), whereas the FPS of AQATrack and SUTrack increases to 63.83 and 85.24, respectively, after introducing AFTE. This indicates that AFTE can achieve relatively stable accuracy gains at a small cost in parameters and FLOPs, although its effect on runtime speed varies across trackers.

Overall, AFTE improves multiple baseline methods to different degrees, indicating that adjacent-frame information is effective to some extent and has application potential for air-to-air tiny-object tracking. By explicitly integrating adjacent-frame information, AFTE can enhance target representation, improve cross-frame association in air-to-air tiny-object scenarios, and increase tracking robustness under complex backgrounds and fast-motion conditions. Meanwhile, because the current test set is relatively compact in scale, small numerical changes should be interpreted in light of the difficulty of specific sequences and scene variations. Therefore, this paper focuses on the overall performance trend of AFTE across different baseline methods rather than over-interpreting any single small numerical gain. In the future, as the number of test sequences, flight conditions, and target types is expanded, the role of temporal modeling methods in persistent air-to-air tiny-UAV tracking can be further verified through more extensive experiments and statistical analysis.

5. Conclusions

In this paper, we address the problem of vision–language tracking in real air-to-air tiny-object UAV countermeasure scenarios by constructing the ATA dataset and establishing a corresponding benchmark. Compared with existing generic object tracking datasets, air-to-ground UAV tracking datasets, and ground-to-air anti-UAV datasets, ATA is more specifically focused on the particular combination of real air-to-air observation, tiny UAV targets, language-assisted target specification, and persistent tracking. It also covers key challenges inherent in this setting, including dual-dynamic disturbances, weak target representation caused by extremely small object scale, rapid background variation, and interference from visually similar targets. Meanwhile, ATA provides both frame-wise bounding box annotations and video-level English language descriptions for each sequence, thereby supporting the BBox-only and Language-assisted settings within a unified framework and offering a common data foundation and evaluation protocol for both vision-only tracking and vision–language tracking.

On top of this, we establish a benchmark over ATA that covers both vision-only and vision–language methods and conduct systematic evaluations of several representative mainstream trackers proposed in recent years. The experimental results indicate that, although current methods are capable of achieving a certain degree of persistent tracking under the ATA setting, their overall performance remains limited, and no clearly dominant solution has yet emerged that can robustly cope with the principal challenges posed by this scenario. Further analysis reveals that existing methods on ATA generally suffer from several shared bottlenecks, including insufficient tiny-object representation, difficulty in inter-frame matching under dual-dynamic conditions, limited robustness in the presence of complex backgrounds and similar-target interference, as well as inadequate exploitation of temporal information. These findings suggest that the air-to-air tiny-object counter-UAV task characterized by ATA indeed imposes substantially higher demands on current tracking models than generic scenarios, thereby further justifying the necessity of constructing such a dataset and benchmark.

Given that ATA is naturally derived from continuous video streams and exhibits pronounced temporal correlations, we further introduce the AFTE temporal enhancement module for additional validation. The experimental results demonstrate that explicitly exploiting adjacent frames for short-term temporal enhancement can yield relatively consistent performance gains across multiple baseline trackers. This observation suggests that temporal modeling constitutes one of the key factors governing the performance upper bound under the ATA setting, while also indicating that explicit and lightweight adjacent-frame enhancement represents a practically meaningful avenue for improvement. Meanwhile, vision–language methods have not yet exhibited a stable advantage over vision-only methods on ATA, which further implies that future research should investigate more effective ways of utilizing language information for target specification, ambiguity suppression, and identity preservation.

Although ATA provides a dedicated benchmark for vision–language tracking research in air-to-air tiny-UAV countermeasure scenarios, the current study still has certain limitations. First, the current test set size of ATA is still relatively limited. Although it can support preliminary evaluation of different types of tracking methods, there remains room for further improvement in terms of statistical stability. Second, the acquisition conditions of ATA are still mainly based on daytime visible-light scenarios. Although the dataset contains a small number of samples under backlit, overcast, and low-contrast conditions, low-light, night-time, and more extreme weather conditions have not yet been systematically covered. Therefore, there is still room for further expansion in terms of evaluating the generalization ability of the dataset under complex illumination and all-weather scenarios. Third, the current target types in ATA are mainly DJI consumer-grade UAVs. These UAVs have small imaging scales in the air and can match the air-to-air tiny-UAV countermeasure task considered in this paper well. However, there is still room for further expansion in terms of UAV category diversity. Fourth, ATA currently adopts video-level single-sentence language descriptions, which can provide basic semantic information such as target category, appearance attributes, and background relations. However, the granularity of language annotation is still relatively limited and does not yet cover fine-grained semantic changes of the target across different time periods. Fifth, the main objective of this paper is to construct and analyze a vision–language tracking benchmark for air-to-air tiny-UAV countermeasure scenarios. At the methodological level, this paper only further validates the role of the lightweight AFTE temporal enhancement module, and has not yet designed a complete tracking architecture specifically oriented to tiny-object imaging and dual-dynamic-motion coupling characteristics. Therefore, the design of dedicated tracking models for this scenario remains a problem worthy of further investigation in future work.

Future work will mainly be carried out from the following three aspects. First, we will further expand the data scale and scenario coverage of ATA by introducing richer environmental conditions, flight relations, and UAV types, so as to continuously improve the representativeness and challenge of the dataset. Second, we will further expand the annotation system by exploring key-frame-level language descriptions, temporally dynamic language prompts, and finer-grained target semantic annotations, so as to enhance the support of language information for tiny-UAV identity discrimination and persistent tracking. Third, based on the core problems revealed by ATA, we will further explore temporal modeling methods, robust representation mechanisms, and vision–language collaborative modeling frameworks that are more suitable for air-to-air tiny-object scenarios, so as to continuously promote the development of object tracking research for real low-altitude counter-UAV tasks.

Author Contributions

Conceptualization, W.K.; methodology, W.K.; software, W.K.; validation, W.K., Y.P. and H.H.; formal analysis, W.K.; investigation, W.K., W.T. and Q.L.; resources, W.T., Q.L. and K.L.; data curation, X.Z. and W.K.; writing—original draft preparation, W.K.; writing—review and editing, X.Z., Y.P., Q.C. and K.L.; visualization, W.K. and H.H.; supervision, Q.C. and K.L.; project administration, Q.C. and K.L.; funding acquisition, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Equipment Comprehensive Research Project in Engineering University of PAP, grant number WJ2025C0401013; the Second Batch of Scientific Research Innovation Team Project in Engineering University of PAP, no grant number; and the Basic Frontier Innovation Project in Engineering University of PAP, grant number WJY202509. The APC was funded by the Equipment Comprehensive Research Project, grant number WJ2025C0401013.

Data Availability Statement

The data used in this analysis are publicly available at the following GitHub repository: https://github.com/kkbushi/ATA.git (accessed on 16 April 2026). Access information is also provided in the main text.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jiang, N.; Wang, K.; Peng, X.K.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Guo, G.; Ye, Q.; Jiao, J.; et al. Anti-UAV: A large-scale benchmark for vision-based UAV tracking. IEEE Trans. Multimed. 2023, 25, 486–500.
2. Vladislav, S.; Ildar, K.; Alberto, L.; Dmitriy, A.; Liliya, K.; Alessandro, C. Advances in UAV Detection: Integrating Multi-Sensor Systems and AI for Enhanced Accuracy and Efficiency. Int. J. Crit. Infrastruct. Prot. 2025, 49, 100744.
3. Zhang, C.H.; Huang, G.J.; Liu, L.; Huang, S.; Yang, Y.; Wan, X.; Ge, S.; Tao, D. WebUAV-3M: A benchmark for unveiling the power of million-scale deep UAV tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9186–9205.
4. Xie, B.; Zhang, C.X.; Wang, F.G.; Liu, P.; Lu, F.; Chen, Z.; Hu, W. CST Anti-UAV: A thermal infrared benchmark for tiny UAV tracking in complex scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Honolulu, HI, USA, 19–20 October 2025; pp. 6216–6225.
5. Zhou, L.; Zhou, Z.K.; Mao, K.G.; He, Z. Joint visual grounding and tracking with natural language specification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6495–6503.
6. Li, H.; Liu, X.; Li, G. A benchmark for UAV-view natural language-guided tracking. Electronics 2024, 13, 1706.
7. Wang, Y.; Huang, Z.; Laganière, R.; Zhang, H.; Ding, L. A UAV to UAV tracking benchmark. Knowl.-Based Syst. 2023, 261, 110197.
8. Huang, B.; Li, J.; Chen, J.; Wang, G.; Zhao, J.; Xu, T. Anti-UAV410: A thermal infrared benchmark and customized scheme for tracking drones in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2852–2865.
9. Guo, M.; Zhang, Z.; Fan, H.; Jing, L. Divert more attention to vision-language tracking. Adv. Neural Inf. Process. Syst. 2022, 35, 4446–4460.
10. Li, Y.; Fu, C.; Ding, F.; Huang, Z.; Lu, G. AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11923–11932.
11. Wang, X.; Shu, X.; Zhang, Z.; Jiang, B.; Wang, Y.; Tian, Y.; Wu, F. Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737.
12. Fan, H.; Bai, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Harshit; Huang, M.; Liu, J.; et al. LaSOT: A high-quality large-scale single object tracking benchmark. Int. J. Comput. Vis. 2021, 129, 439–461.
13. Wang, B.; Li, Q.; Mao, Q.; Wang, J.; Chen, C.L.P.; Shangguan, A.; Zhang, H. A Survey on Vision-Based Anti Unmanned Aerial Vehicles Methods. Drones 2024, 8, 518.
14. Wu, Y.; Lim, J.; Yang, M.-H. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848.
15. Kristan, M.; Matas, J.; Danelljan, M.; Felsberg, M.; Chang, H.J.; Zajc, L.Č.; Lukežič, A.; Drbohlav, O.; Zhang, Z.; Tran, K.-T.; et al. The first visual object tracking segmentation VOTS2023 challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Paris, France, 2–6 October 2023; pp. 1788–1810.
16. Galoogahi, C.K.; Huang, H.; Fagg, A.; Lucey, S. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134.
17. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317.
18. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577.
19. Mueller, M.; Smith, N.; Ghanem, B. A benchmark and simulator for UAV tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 445–461.
20. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399.
21. Fu, C.; Chen, Z.; Li, Y.; Ye, J.; Feng, C. Siamese anchor proposal network for high-speed aerial tracking. In Proceedings of the IEEE International Conference on Robotics and Automation, Xi’an, China, 30 May–5 June 2021; pp. 510–516.
22. Xu, T.; Gu, J.; Zhu, X.; Wu, X.; Kittler, J. A tri-modal dataset and a baseline system for tracking unmanned aerial vehicles. arXiv 2025, arXiv:2511.18344.
23. Zhang, X.-F.; Xu, T.; Zhang, J.; Liu, J.-W.; Wang, K.; Wang, G.; Liu, J.; Wang, Q.; Jiang, L.; Zheng, Z.; et al. Evidential detection and tracking collaboration: New problem, benchmark and algorithm for robust Anti-UAV system. arXiv 2023, arXiv:2306.15767.
24. Zhang, C.; Li, L.; Zhang, Z.; Wang, Y.; Wen, H.; Zhou, X.; Ge, S.; Wang, Y. How far are modern trackers from UAV-Anti-UAV? A million-scale benchmark and new baseline. arXiv 2025, arXiv:2512.07385.
25. Li, Z.; Tao, R.; Gavves, E.; Snoek, C.G.M.; Smeulders, A.W.M. Tracking by natural language specification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7350–7358.
26. Peng, L.; Gu, J.; Li, X.; Liu, W.; Dong, S.; Zhang, Z.; Fan, H.; Zhang, L.; Zhang, Z. VastTrack: Vast category visual object tracking. Adv. Neural Inf. Process. Syst. 2024, 37, 130797–130818.
27. Hu, S.; Zhang, D.; Wu, M.; Feng, X.; Li, X.; Zhao, X.; Huang, K. A multi-modal global instance tracking benchmark (MGIT): Better locating target in complex spatio-temporal and causal relationship. Adv. Neural Inf. Process. Syst. 2023, 36, 25007–25030.
28. Hao, H.; Peng, Y.; Ye, Z.; Han, B.; Tang, W.; Kang, W.; Zhang, X.; Li, Q.; Liu, W. TMRGBT-D2D: A temporal misaligned RGB-thermal dataset for drone-to-drone target detection. Drones 2025, 9, 694.
29. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint feature learning and relation modeling for tracking: A one-stream framework. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 341–357.
30. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14572–14581.
31. Xie, J.; Zhong, B.; Mo, Z.; Zhang, S.; Shi, L.; Song, S.; Ji, R. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19300–19309.
32. Kang, B.; Chen, X.; Lai, S.; Liu, Y.; Liu, Y.; Wang, D. Exploring enhanced contextual information for video-level object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4194–4202.
33. Zhang, C.; Sun, X.; Yang, Y.; Liu, L.; Liu, Q.; Zhou, X.; Wang, Y. All in one: Exploring unified vision-language tracking with multi-modal alignment. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5552–5561.
34. Zheng, Y.; Zhong, B.; Liang, Q.; Li, G.; Ji, R.; Li, X. Towards unified token learning for vision-language tracking. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 2125–2135.
35. Chen, X.; Kang, B.; Geng, W.; Zhu, J.; Liu, Y.; Wang, D.; Lu, H. SUTrack: Towards simple and unified single object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 2239–2247.
36. Zhang, C.; Liu, L.; Wen, H.; Zhou, X.; Wang, Y. MambaTrack: Exploiting dual-enhancement for night UAV tracking. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, 6–11 April 2025; pp. 1–5.
37. Ye, Z.; Peng, Y.; Liu, W.; Yin, W.; Hao, H.; Han, B.; Zhu, Y.; Xiao, D. An efficient adjacent frame fusion mechanism for airborne visual object detection. Drones 2024, 8, 144.
38. Lyu, Y.; Liu, Z.; Li, H.; Guo, D.; Fu, Y. A real-time and lightweight method for tiny airborne object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Vancouver, BC, Canada, 17–24 June 2023; pp. 3016–3025.

Figure 1. Comparison of counter-UAV methods.

Figure 2. Air-to-air small-UAV-tracking task illustration.

Figure 3. Scenario statistics of the ATA dataset.

Figure 4. Target size distribution and tiny-object characteristics of the ATA dataset.

Figure 5. Illustration of dual-dynamic disturbances caused by simultaneous motion of the tracking and target UAVs, resulting in complex inter-frame displacement, scale variation, and viewpoint changes.

Figure 6. Illustration of the difficulty in representing tiny UAV targets due to extremely limited visual cues, sparse texture, and minimal scale.

Figure 7. Illustration of rapid background switching and complex interference caused by low-altitude scenes with varied sky, ground, and structural environments.

Figure 8. Illustration of multi-UAV interference and target specification ambiguity arising from visually similar UAVs in the same scene.

Figure 9. Representative failure cases.

Table 1. General visual object tracking datasets.

Dataset	Year	Videos	Attributes	Characteristics
OTB-2015 [14]	2015	100	11	Classic traditional tracking benchmark
NfS [16]	2017	100	9	240 FPS high-frame-rate for fast motion
TrackingNet [17]	2018	30,643	15	Large-scale in-the-wild dataset
GOT-10k [18]	2019	10,000+	6	Train/test categories non-overlapping
VOTS-2025 [15]	2025	144	12	Latest official VOT benchmark

Table 2. UAV tracking and anti-UAV datasets.

Dataset	Year	Sequences	Frames	Modalities	Annotation Types	Scenario
UAV123 [19]	2016	123	113K	RGB	Bbox	Classic low-altitude UAV-view single-object tracking dataset
VisDrone [20]	2021	288	139K	RGB	Bbox	High-density annotated UAV aerial dataset
UAVTrack112 [21]	2021	112	100K	RGB	Bbox	Captured via real-world flight tests
WebUAV-3M [3]	2022	4485	3.3M	RGB + Audio + Language	Bbox + NL + Audio	Large-scale multi-modal UAV tracking dataset
MM-UAV [22]	2025	1321	2.8M	RGB + IR + Event	Bbox + Multi-modal	The first large-scale tri-modal anti-UAV tracking benchmark
UAV-Anti-UAV [24]	2025	1810	1.05M	RGB + Language	Bbox + NL	Thermal infrared UAV tracking dataset
Anti-UAV600 [23]	2025	300	723K	IR	Bbox	Extended multi-UAV tracking dataset for competition
Anti-UAV318 [1]	2021	318	296K	RGB + IR	Bbox + Presence	Early large-scale multi-modal anti-UAV tracking benchmark
ATA	2026	50	38K	RGB + Language	Bbox + NL	Oriented to real-world air-to-air small-target anti-UAV scenarios

Table 3. Task settings and challenge attribute comparison of related UAV tracking and anti-UAV tracking datasets.

Dataset	Target Type	Platform Motion	Language Annotation	Air-to-Air Scenario	Tiny UAV Target	Similar UAV Distractor
UAV123 [19]	General objects	Airborne platform	×	×	×	×
VisDrone [20]	General objects	Airborne platform	×	×	×	×
UAVTrack112 [21]	General objects	Airborne platform	×	×	×	×
WebUAV-3M [3]	General objects	Airborne platform	✓	×	×	×
MM-UAV [22]	UAV	Ground platform	×	Partial	✓	Partial
UAV-Anti-UAV [24]	UAV	Both UAVs	✓	Partial	✓	Partial
Anti-UAV600 [23]	UAV	Ground platform	×	×	✓	Partial
Anti-UAV318 [1]	UAV	Ground platform	×	×	✓	Partial
ATA	Tiny UAV	Both UAVs	✓	✓	✓	✓

Note: ✓ indicates that the dataset contains the corresponding attribute, × indicates that the dataset does not contain the corresponding attribute, and “Partial” indicates that the attribute is partially covered.

Table 4. Vision–language tracking datasets.

Dataset	Year	Videos	Attributes	Characteristics
OTB99-Lang [25]	2017	99	11	Early language-guided tracking dataset
TNL2K [11]	2021	2000	17	Native large-scale VLT dataset
LaSOT [12]	2019	1400	14	Large-scale long-term tracking dataset
LaSOT-Ext [12]	2021	1550	14	High-difficulty extension of LaSOT
VastTrack [26]	2024	50,610	10	Large-scale general tracking benchmark
MGIT [27]	2023	150	8	Multi-modal dataset for ultra-long videos

Table 5. Performance of baseline models on the ATA dataset.

Type	Method	Source	AUC	OP50	P	$P_{Norm}$	Params (M)	FLOPs (G)	FPS
VOT	OSTrack	ECCV 2022	36.11	37.84	65.55	24.80	92.12	48.36	61.32
	SeqTrack	CVPR 2023	44.55	48.43	79.74	36.18	87.15	67.13	50.00
	AQATrack	CVPR 2024	36.10	36.31	70.36	25.53	71.92	25.73	55.67
	MCITrack	AAAI 2025	41.09	42.69	76.97	30.69	88.03	38.46	32.95
VLT	Allinone	ACMMM 2023	31.19	33.13	56.82	23.30	155.83	21.58	96.67
	MMTrack	TCSVT 2024	38.07	45.76	65.30	35.02	177.07	90.99	54.58
	SUTrack	AAAI 2025	35.24	36.28	64.98	26.02	22.26	5.77	74.27
	MambaTrack	ICASSP 2025	31.98	35.36	59.71	25.08	30.00	1.18	31.07
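For readers unfamiliar with the column headers, the sketch below shows how the four accuracy metrics are commonly computed in one-pass evaluation. It is a minimal illustration, not the authors' evaluation code. The conventions are assumptions borrowed from widely used tracking toolkits: 21 overlap thresholds for the success curve, a 20-pixel threshold for P, and a 0.2 threshold on size-normalized center error for $P_{Norm}$. The function names `iou_xywh` and `tracking_metrics` are hypothetical.

```python
import numpy as np


def iou_xywh(pred, gt):
    """Per-frame IoU for boxes given as (N, 4) arrays of [x, y, w, h]."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)


def tracking_metrics(pred, gt, px_thresh=20.0, norm_thresh=0.2):
    """Return AUC, OP50, P and P_Norm as fractions in [0, 1].

    Thresholds are assumed defaults, not values taken from the paper.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    iou = iou_xywh(pred, gt)

    # Success curve: fraction of frames whose IoU exceeds each threshold.
    thresholds = np.linspace(0.0, 1.0, 21)
    success = np.array([(iou > t).mean() for t in thresholds])

    # Center location error in pixels, and normalized by ground-truth size.
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    err_px = np.linalg.norm(pred_c - gt_c, axis=1)
    err_norm = np.linalg.norm((pred_c - gt_c) / np.maximum(gt[:, 2:], 1e-12), axis=1)

    return {
        "AUC": float(success.mean()),
        "OP50": float((iou > 0.5).mean()),
        "P": float((err_px <= px_thresh).mean()),
        "P_Norm": float((err_norm <= norm_thresh).mean()),
    }
```

Under these assumed conventions, a fixed pixel threshold is lenient for very small boxes while a size-normalized threshold is strict, which would be consistent with P being far higher than $P_{Norm}$ for every tracker in Table 5.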

Table 6. AUC of models on the ATA, LaSOT, TNL2K, and UAV-Anti-UAV datasets.

Type	Method	ATA	LaSOT	TNL2K	UAV-Anti-UAV
VOT	OSTrack	36.11	71.1	-	27.8
	SeqTrack	44.55	72.5	-	30.4
	AQATrack	36.10	71.4	-	35.6
	MCITrack	41.09	75.3	-	33.7
VLT	Allinone	31.19	71.7	55.3	28.4
	MMTrack	38.07	70.0	58.6	-
	SUTrack	35.24	74.4	67.9	37.3
	MambaTrack	31.98	-	-	26.0

Note: All values are AUC scores; "-" indicates that no result is listed for that dataset.

Table 7. Performance of baseline models on the ATA dataset with and without the AFTE module ("+ A" denotes the AFTE-enhanced variant).

Type	Method	AUC	OP50	P	$P_{Norm}$	Params (M)	FLOPs (G)	FPS
VOT	MCITrack	41.09	42.69	76.97	30.69	88.03	38.46	31.96
	MCITrack + A	41.46	47.40	74.72	35.32	88.29	38.51	29.40
	AQATrack	36.10	36.31	70.36	25.53	71.92	25.73	55.67
	AQATrack + A	38.07	40.09	72.78	29.53	71.92	25.73	63.83
VLT	Allinone	31.19	33.12	56.82	23.30	155.83	43.16	92.90
	Allinone + A	35.91	38.07	67.09	26.76	156.41	43.77	77.60
	SUTrack	35.24	36.28	64.98	26.02	22.26	5.70	74.27
	SUTrack + A	40.74	43.37	75.38	32.58	22.26	5.77	85.24
	MambaTrack	31.98	35.36	59.71	25.08	30.00	1.18	31.07
	MambaTrack + A	33.53	33.54	64.04	26.65	30.15	1.22	18.80

Note: Values highlighted in blue indicate performance improvements after applying the AFTE module.
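Because the blue highlighting does not survive in plain text, the per-tracker changes can be recomputed from the rows of Table 7. The sketch below copies the AUC, OP50, P, $P_{Norm}$, and FPS values from the table and subtracts each baseline from its "+ A" variant. The names `TABLE7`, `METRICS`, and `afte_deltas` are hypothetical helpers for illustration only.

```python
# (AUC, OP50, P, P_Norm, FPS) copied from Table 7: (baseline, "+ A" variant).
TABLE7 = {
    "MCITrack":   ((41.09, 42.69, 76.97, 30.69, 31.96), (41.46, 47.40, 74.72, 35.32, 29.40)),
    "AQATrack":   ((36.10, 36.31, 70.36, 25.53, 55.67), (38.07, 40.09, 72.78, 29.53, 63.83)),
    "Allinone":   ((31.19, 33.12, 56.82, 23.30, 92.90), (35.91, 38.07, 67.09, 26.76, 77.60)),
    "SUTrack":    ((35.24, 36.28, 64.98, 26.02, 74.27), (40.74, 43.37, 75.38, 32.58, 85.24)),
    "MambaTrack": ((31.98, 35.36, 59.71, 25.08, 31.07), (33.53, 33.54, 64.04, 26.65, 18.80)),
}
METRICS = ("AUC", "OP50", "P", "P_Norm", "FPS")


def afte_deltas(table):
    """Return {tracker: {metric: enhanced - baseline}}, rounded to 2 decimals."""
    return {
        name: {m: round(plus - base, 2) for m, base, plus in zip(METRICS, base_row, plus_row)}
        for name, (base_row, plus_row) in table.items()
    }


if __name__ == "__main__":
    for name, delta in afte_deltas(TABLE7).items():
        print(name, delta)
```

Computed this way, AUC rises for all five trackers, from +0.37 (MCITrack) to +5.50 (SUTrack). Not every metric improves: P drops for MCITrack, OP50 drops for MambaTrack, and FPS decreases for MCITrack, Allinone, and MambaTrack.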

Share and Cite

MDPI and ACS Style

Kang, W.; Zhang, X.; Peng, Y.; Tang, W.; Li, Q.; Hao, H.; Liu, K.; Chen, Q. ATA: A Benchmark for Vision–Language Tracking in Air-to-Air Counter-UAV of Tiny Drones. Drones 2026, 10, 429. https://doi.org/10.3390/drones10060429

