1. Introduction
In recent years, unmanned aerial vehicles (UAVs), benefiting from their high maneuverability, ease of deployment, and relatively low operational cost, have been extensively employed in a wide range of applications, including environmental monitoring, agricultural inspection, and emergency rescue. Meanwhile, the rapid proliferation of UAVs has also given rise to increasingly prominent low-altitude security risks. Unauthorized or maliciously manipulated UAVs may intrude into airports, sensitive zones, and critical infrastructure, thereby posing tangible threats to flight safety, public security, and privacy protection. Consequently, UAV countermeasure technologies for low-altitude security have gradually emerged as an important research topic in the fields of computer vision and autonomous systems [1].
Anti-UAV research has gradually evolved from single-sensor perception toward an integrated approach that combines multi-source sensing, intelligent recognition, and system-level coordination. Current UAV detection and countermeasure systems usually involve multiple technical routes, including radar, radio-frequency, acoustic, electro-optical/vision-based sensing, and multi-sensor fusion. Different sensors have their own advantages and limitations in terms of detection range, environmental adaptability, target recognition capability, and anti-interference ability. For example, radar and radio-frequency methods are suitable for relatively long-range detection, but they are easily affected by environmental clutter, electromagnetic interference, or changes in target communication modes. Acoustic methods have relatively low cost, but their stability is limited in urban noise environments. Electro-optical and vision-based methods can provide more intuitive information about target appearance, spatial position, and motion state, but they are also more susceptible to illumination, weather conditions, background complexity, and target-scale variations. With the development of artificial intelligence and deep learning technologies, vision-based UAV detection and tracking methods have attracted increasing attention in target recognition, persistent localization, and complex-scene adaptation, and they have gradually become an important component of the anti-UAV perception pipeline. In this context, visual tracking is not only a specific task within vision-based perception, but also a key component for maintaining target identity, estimating motion state, and supporting subsequent countermeasure decision-making after the target has been detected [2].
Specifically, within the UAV countermeasure pipeline, target tracking serves to bridge target discovery, state estimation, and subsequent response decision-making. As illustrated in Figure 1, compared with traditional radar sensing, radio-frequency detection, and ground-based electro-optical devices, vision-based tracking deployed on an airborne platform enables more flexible observation viewpoints and sustained proximity to the target, thereby rendering it particularly suitable for tasks such as dynamic pursuit, companion-flight perception, and autonomous intervention. Owing to these advantages, vision-based air-to-air tracking exhibits unique potential in UAV countermeasure scenarios [3].
However, air-to-air tiny-object UAV tracking differs fundamentally not only from conventional generic object tracking, but also from existing UAV tracking and ground-based anti-UAV tracking scenarios. As shown in Figure 2, this task is characterized by the coexistence of two tightly coupled properties, namely dual-dynamic motion in the air-to-air setting and long-range tiny-object imaging. The former causes the target state to be intricately entangled with the platform’s attitude, velocity, and viewpoint variations, whereas the latter results in weak appearance cues and rapid cross-frame changes. Therefore, air-to-air tiny-object UAV tracking should by no means be regarded as a straightforward extension of existing UAV tracking or anti-UAV tasks; rather, it constitutes a substantially more challenging and independent problem [4].
Although substantial progress has recently been achieved in object tracking, UAV scene perception, and vision–language tracking, existing studies remain insufficient to support this task. Most current UAV tracking datasets primarily focus on the setting of airborne platforms tracking ground targets [5], whereas existing anti-UAV datasets are predominantly established under the configuration of ground-based devices observing aerial UAVs. These two research paradigms correspond to air-to-ground and ground-to-air scenarios, respectively, and neither is able to faithfully capture the core difficulties encountered in realistic air-to-air countermeasure processes [6]. Meanwhile, although existing vision–language tracking datasets have introduced natural language descriptions, the majority of them are still designed for generic object tracking scenarios, and their semantic annotations generally lack fine-grained descriptions specifically tailored to UAV targets, such as category-level distinctions, structural attributes, relative positional relations, and background-aware contextual cues, thereby making them inadequate for effectively supporting air-to-air tiny-object UAV tracking [7]. In complex aerial environments, merely relying on first-frame bounding-box initialization is often insufficient to fully characterize the target identity and its scene relations, whereas language descriptions are able to provide complementary high-level semantic constraints from multiple aspects, including target category, appearance characteristics, structural properties, and contextual associations [8]. Furthermore, most prevailing trackers are developed primarily for generic scenarios, and their performance evaluation is largely based on existing public benchmarks. However, strong performance on such benchmarks does not necessarily imply direct adaptability to air-to-air tiny-object scenarios, since the latter are characterized by weak target representations and highly complex temporal dynamics, while the explicit exploitation of temporal information remains insufficient in most current tracking architectures. As a consequence, clear deficiencies still exist at the levels of data, task setting, and evaluation protocol, and a unified, dedicated benchmark is still lacking for systematically revealing the core challenges, capability boundaries, and potential directions for improvement in air-to-air tiny-object UAV tracking [9,10].
To address the aforementioned limitations, this paper presents ATA, a vision–language tracking benchmark specifically designed for air-to-air tiny-object UAV countermeasure scenarios. Given a continuous video sequence captured by a pursuer UAV, the objective is to continuously localize the tracked UAV in subsequent frames under complex and dynamically changing backgrounds [10]. In order to simultaneously accommodate the conventional visual tracking paradigm and the need for collaborative vision–language modeling, ATA contains two task settings. The first is BBox-only, in which the model is provided with the initial bounding box of the target UAV in the first frame and is required to accomplish persistent tracking solely based on this initialization and the subsequent video frames. The second is Language-assisted, in which, in addition to the first-frame bounding box, a natural-language prompt describing the target UAV is further provided to facilitate subsequent tracking [11]. The former mainly evaluates the pure visual modeling capability of a tracker in air-to-air tiny-object scenarios, whereas the latter further investigates whether linguistic information can effectively complement visual cues, thereby alleviating the problems caused by insufficient tiny-object representation, severe interference from similar targets, and pronounced ambiguity in first-frame target specification [12].
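The difference between the two settings lies only in what the tracker receives at initialization. The following is a minimal sketch of that distinction, not part of the benchmark code: the `TrackInit` container, the `make_init` helper, and the (x, y, w, h) box layout are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed box layout: (x, y, w, h) in pixels, top-left origin.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TrackInit:
    """First-frame information handed to a tracker (hypothetical container)."""

    first_frame_bbox: BBox
    language_prompt: Optional[str] = None  # None in the BBox-only setting

    @property
    def setting(self) -> str:
        return "BBox-only" if self.language_prompt is None else "Language-assisted"


def make_init(bbox: BBox, prompt: Optional[str], use_language: bool) -> TrackInit:
    """Build the initialization for one of the two task settings."""
    if use_language and not prompt:
        raise ValueError("The Language-assisted setting needs a non-empty prompt")
    # In the BBox-only setting the prompt is dropped even if one is available.
    return TrackInit(bbox, prompt if use_language else None)
```

Under this sketch, the same sequence can be evaluated in both settings by toggling `use_language`, which keeps the first-frame box identical across the two protocols.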
In summary, the main contributions of this paper are as follows:
We propose ATA. To the best of our knowledge, ATA is the first vision–language tracking benchmark specifically designed for real air-to-air tiny-object UAV countermeasure scenarios. ATA is not intended as a general replacement for existing UAV tracking or anti-UAV tracking datasets. Instead, it serves as a further task-specialized extension and complement to existing benchmarks under the specific combination of real air-to-air observation, tiny UAV targets, language-assisted target specification, and persistent tracking.
We define two task settings for ATA, namely BBox-only and Language-assisted. The former is designed to assess the modeling capability of purely visual trackers in air-to-air tiny-object scenarios, while the latter is intended to investigate the auxiliary role of language information in target specification and persistent tracking, thereby establishing a unified benchmark for training and evaluation in this research direction.
Based on ATA, we establish a benchmark covering both vision-only and vision–language methods and conduct systematic evaluations on multiple recent mainstream trackers. The experimental results reveal significant performance bottlenecks of current methods in this scenario, thereby highlighting the substantial gap between existing tracking techniques and the practical demands of air-to-air UAV countermeasures.
Considering that ATA is constructed from continuous video sequences and exhibits pronounced temporal correlation, we further introduce the AFTE temporal enhancement module for validation. By extending the original single-frame input of a tracker into paired adjacent-frame input, AFTE explicitly models the temporal relations between neighboring frames, and its effectiveness is verified on multiple baseline models. These results further demonstrate the pressing need for stronger temporal modeling capability in air-to-air tiny-object tracking scenarios.
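The paired adjacent-frame input mentioned for AFTE can be illustrated with a minimal sketch. It covers only how a frame stream might be turned into (previous, current) pairs; the function name `adjacent_pairs` and the choice to pair the first frame with itself are assumptions, and the internal design of the AFTE module is not shown.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def adjacent_pairs(frames: Sequence[T]) -> Iterator[Tuple[T, T]]:
    """Yield (previous, current) for every frame of a sequence.

    Assumption: the first frame has no predecessor, so it is paired with itself.
    """
    for t, current in enumerate(frames):
        previous = frames[t - 1] if t > 0 else current
        yield previous, current
```

Each yielded pair would replace the single frame a baseline tracker normally consumes at time step t, so the number of tracker calls per sequence stays unchanged.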
3. Construction of the ATA Dataset
3.1. Design Objectives
To advance vision–language tracking research for air-to-air counter-UAV scenarios involving tiny drones, we construct ATA (Air-to-Air Tracking with Language Assistance), a dedicated dataset designed to provide a unified data foundation and evaluation benchmark for this task.
The construction of ATA is primarily motivated by two critical gaps in existing studies. First, current UAV-related datasets remain insufficient to faithfully characterize the dual-dynamic disturbances arising from the simultaneous motion of both the tracking platform and the target platform. Second, although language annotations have been introduced in several existing vision–language tracking datasets, such annotations are largely designed for general-purpose scenarios and therefore are of limited utility for target specification and persistent tracking in air-to-air tiny-UAV settings, where complex backgrounds and similar distractors are frequently encountered [
8]. To address these limitations, ATA is deliberately designed with an emphasis on real air-to-air flight acquisition, preservation of tiny-object characteristics, dual annotations of bounding boxes and language descriptions, and standardized data organization for benchmark evaluation. As a result, it provides unified support for both the BBox-only and Language-assisted settings. Overall, ATA is neither a general replacement for existing UAV tracking or anti-UAV tracking datasets nor a simple extension of existing UAV datasets. Instead, it is a task-driven dataset and benchmark foundation constructed around several specific requirements, including real air-to-air observation, persistent tracking of tiny UAVs, counter-UAV scenarios, and language-assisted target specification.
3.2. Data Acquisition Platform and Scenario Configuration
ATA adopts real-flight data acquisition. The tracking platform is fixed as a DJI Mavic 3T (SZ DJI Technology Co., Ltd., Shenzhen, China), with the camera mounted below the front part of the fuselage and equipped with a stabilized gimbal and zoom capability. The target UAVs mainly include DJI Mini 4 Pro (SZ DJI Technology Co., Ltd., Shenzhen, China), DJI Mini 3 (SZ DJI Technology Co., Ltd., Shenzhen, China), and DJI Avata 2 (SZ DJI Technology Co., Ltd., Shenzhen, China). Among them, the Mini series is generally dominated by a white appearance, whereas Avata 2 is overall darker in color, and all these UAVs have relatively small body sizes, which are more consistent with the characteristics of the air-to-air tiny-target scenario considered in this paper. The selection of the above consumer-grade UAVs as target platforms is mainly based on two considerations. On the one hand, consumer-grade UAVs are widely deployed in low-altitude scenarios and represent common target types in real low-altitude security and counter-UAV tasks. On the other hand, such UAVs have small physical sizes and usually exhibit weak texture, low resolution, and small-area imaging characteristics under long-range air-to-air observation conditions, which can better match the problem of tiny-UAV tracking investigated in this paper. It should be noted that the target types in the current version of ATA are still mainly compact consumer-grade UAVs, and have not yet further covered more platform types such as industrial UAVs, fixed-wing UAVs, and irregular rotor aircraft. Future versions may further expand the target categories under the premise of ensuring flight safety and feasible data acquisition conditions, so as to improve the target diversity and evaluation difficulty of the dataset. ATA retains only visible-spectrum videos as the data source, and all sequences are uniformly captured at a resolution of 1920 × 1080 and a frame rate of 30 FPS.
From the perspective of acquisition conditions, the acquisition area of ATA is approximately located within a low-altitude airspace with a radius of about 300 m, and the flight altitudes of both the acquisition UAV and the target UAV are approximately between 20 m and 80 m. Since the collected data simultaneously cover different observation viewpoints, including same-altitude observation, downward-looking observation, upward-looking observation, and side-view observation, there is no fixed altitude constraint between the target UAV and the tracking platform. In terms of relative motion patterns, ATA covers typical air-to-air maneuvers such as pursuit, companion flight, and circling flight, together with dynamic variations including straight flight, turning, sharp turning, ascending, descending, acceleration, and deceleration. From the perspective of the motion relationship between the platform and the target, the dataset includes not only cases where the platform UAV hovers while the target UAV moves, but also cases where the platform UAV and the target UAV move simultaneously, where the target approaches or moves away from the platform, and where the two UAVs fly in the same or opposite directions. Therefore, the target appearance variations in ATA are not only caused by the motion of the target UAV itself, but are also jointly affected by the motion of the acquisition platform, altitude differences, observation-angle changes, and background transitions, thereby more intensively reflecting the practical difficulties of air-to-air tiny-object tracking under dual-dynamic conditions. In terms of imaging settings, each video sequence maintains a fixed zoom setting throughout the sequence, so as to avoid additional interference in target scale variation caused by artificial zoom changes within the sequence. 
The gimbal is mainly used for image stabilization and does not perform automatic target locking or automatic target tracking; therefore, the position changes of the target within the field of view are not eliminated by an automatic gimbal-tracking mechanism. The acquisition scenarios of ATA are mainly under daytime visible-light conditions, covering common weather conditions such as sunny, cloudy, and overcast scenes. The overall illumination conditions are mainly front-lit and normal-illumination cases, while some backlit, cloud-occluded, and low-contrast overcast scenarios are also included. It should be noted that the current version has not yet systematically covered low-light, night-time, and extreme weather conditions. Therefore, ATA mainly reflects the challenges of air-to-air tiny-object tracking under daytime visible-light conditions. Regarding video processing, ATA does not apply additional transcoding or secondary compression to the original videos, nor does it perform image enhancement, stabilization processing, super-resolution reconstruction, artificial synthesis, or target-region enlargement, thereby preserving, as much as possible, the real visual characteristics of the flight acquisition process, including tiny target scale, viewpoint changes, platform-motion disturbance, background interference, and illumination variations.
In terms of task organization, all sequences in ATA follow the single-object tracking protocol, where only one target UAV is annotated for persistent tracking in each video. To further increase the difficulty of the benchmark, however, some sequences additionally introduce one or two extra UAVs as distractors, thereby creating more challenging conditions with similar-object interference. Specifically, 28 sequences, denoted as uav-m1 to uav-m28, contain multiple UAVs, whereas the remaining 22 sequences, denoted as uav-s1 to uav-s22, include only a single target UAV. This design preserves the standard formulation of single-object tracking while substantially increasing the difficulty of target specification and long-term target association in ATA.
As shown in Figure 3, from the perspective of background complexity, ATA contains 12 sequences with pure sky backgrounds, 16 sequences with mixed sky-ground backgrounds, and 22 sequences with complex ground backgrounds. More specifically, the dataset covers 11 woodland sequences, 8 urban-edge sequences, 3 football-field sequences, 5 runway sequences, 6 open-sky sequences, 9 side-by-side building sequences, and 8 sequences with standalone red-building backgrounds. Overall, ATA deliberately retains a relatively high proportion of sequences with complex backgrounds so as to better reflect the visual interference encountered in real-world counter-UAV scenarios.
3.3. Dataset Annotation
Before formal annotation, we first manually reviewed the originally collected videos and selected valid video segments. For video segments in which the target location could not be reliably determined due to severe motion blur, partial occlusion, extremely small target scale, or only a few remaining visible pixels, ATA follows the principle of “omitting unreliable annotations rather than introducing incorrect ones” [28]. Instead of assigning inaccurate bounding boxes to such segments, we directly remove these unreliable segments during the dataset construction stage. If an unreliable segment appears in the middle of an original video, this segment is removed, and the parts before and after it, in which the target can be reliably localized, are treated as independent valid video sequences. After the above screening and splitting process, all frames retained in ATA have manually confirmed valid target locations. Therefore, the final released ATA annotation files contain no retained frames that are left unlabeled, and all retained frames can be used directly for model training and performance evaluation. The above screening, splitting, and annotation strategy is consistently applied to all video sequences, so as to ensure the consistency of the dataset construction and evaluation protocol.
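The splitting rule can be sketched as follows, assuming that manual review yields one reliability flag per frame of an original video. The function name `reliable_segments` and the `min_len` parameter are hypothetical; the text does not state a minimum segment length.

```python
from typing import List, Sequence, Tuple


def reliable_segments(reliable: Sequence[bool], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive reliable frames.

    Unreliable runs are dropped, and the reliable runs on either side become
    independent sequences. min_len is an assumed filter for very short runs.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, ok in enumerate(reliable):
        if ok and start is None:
            start = i                      # a reliable run begins
        elif not ok and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None                   # the run ended at frame i - 1
    if start is not None and len(reliable) - start >= min_len:
        segments.append((start, len(reliable) - 1))
    return segments
```

For example, a seven-frame video whose third and fourth frames are unreliable would yield two independent sequences covering frames 0 to 1 and frames 4 to 6.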
During bounding box annotation, ATA adopts a frame-wise manual annotation strategy to annotate the target UAV in the final retained video sequences. The annotation tool is DarkLabel (https://github.com/darkpgmr/DarkLabel.git, accessed on 16 April 2026), and the annotation format is an axis-aligned bounding box. Each bounding box is annotated to tightly cover the visible region of the target UAV as much as possible, so as to ensure the continuity of the target trajectory and the accuracy of the localization results. For difficult frames in which the target remains basically identifiable, adjacent-frame motion continuity and trajectory consistency are further used as auxiliary cues to maintain the temporal consistency of the annotation results as much as possible.
For language annotation, we first select representative frames from each video sequence and construct a target-centered region whose area is approximately four times that of the original target bounding box, so as to simulate the search region. To improve the clarity and reproducibility of the language annotation process, representative frames are preferentially selected when the target is visible, the background relationship is clear, and the frame can reflect both the appearance characteristics of the target and its scene context. The search region is then expanded around the target bounding box to highlight the target and its neighboring background, while reducing the interference caused by irrelevant regions in the full image. Subsequently, we use ChatGPT-5.3 (OpenAI, San Francisco, CA, USA) to assist in generating an English sentence of no more than 40 words to describe the relationship between the target and its surrounding background. The description mainly covers the target category, color, structural characteristics, relative position, and background context, while also moderately reflecting the motion state of the target, thereby supplementing the high-level semantic information that cannot be fully conveyed by bounding boxes alone. Specifically, the input prompt requires the model to describe the target UAV inside the red box, and the basic form of the prompt is as follows: “Please describe the target inside the red box in one English sentence within 40 words, including its category, color, structural features, relative position, surrounding environment, and possible motion state.” Through this prompt design, the generated language prompt can provide additional semantic constraints on target appearance, local background, and scene relations beyond the first-frame bounding box.
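The construction of the target-centered region can be sketched as below. This is an illustrative approximation only: it assumes (x, y, w, h) boxes, assumes that the region keeps the aspect ratio of the box (so a fourfold area corresponds to doubling both width and height), and clips the region to the 1920 × 1080 frame. The function name `search_region` is hypothetical.

```python
import math
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x, y, w, h) in pixels


def search_region(bbox: BBox, area_factor: float = 4.0,
                  image_size: Tuple[int, int] = (1920, 1080)) -> BBox:
    """Return a target-centered region with about area_factor times the box area."""
    x, y, w, h = bbox
    img_w, img_h = image_size
    scale = math.sqrt(area_factor)         # area x4 means each side x2
    cx, cy = x + w / 2.0, y + h / 2.0      # the region stays centered on the target
    rw, rh = w * scale, h * scale
    # Clip to the image; near the border the region is therefore smaller.
    x0, y0 = max(0.0, cx - rw / 2.0), max(0.0, cy - rh / 2.0)
    x1, y1 = min(float(img_w), cx + rw / 2.0), min(float(img_h), cy + rh / 2.0)
    return (x0, y0, x1 - x0, y1 - y0)
```

With the average target size reported for ATA (roughly 31 × 17 pixels), such a region is only about 62 × 34 pixels, which is why the description has to rely on the immediate neighborhood of the target rather than on the full frame.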
After the initial descriptions are generated by the large language model, we further obtain the final language prompts through manual screening and revision. The manual revision process is not limited to simple grammar checking. Instead, each description is checked according to several criteria, including category correctness, color consistency, visibility of structural characteristics, rationality of relative position, clarity of background relations, semantic ambiguity, and target discriminativeness. Descriptions that are inconsistent with the image content, contain subjective speculation, are overly general, may cause misleading interpretation, or cannot stably distinguish the target are removed or rewritten. For sequences containing similar UAV distractors, semantic information that can distinguish the target UAV from distractors is preferentially retained, such as differences in color, body structure, relative position, background region, or motion state. Finally, each video sequence is assigned one manually verified video-level English language description, which is used for the Language-assisted tracking task in ATA.
In terms of language annotation granularity, ATA adopts a video-level single-sentence English description as the language prompt for each sequence. This design mainly follows the common setting of existing vision–language tracking tasks, where a stable natural language description is used to provide high-level semantic information, such as target category, color, structural attributes, relative position, and background relations, thereby assisting the model in target specification and persistent tracking. Compared with frame-level or key-frame-level language descriptions, video-level descriptions can maintain consistent semantic constraints on the target throughout the entire sequence. They also allow existing mainstream vision–language trackers to be directly trained and evaluated on ATA without modifying their input protocols. Nevertheless, video-level single-sentence descriptions still have certain limitations. Since the target scale, viewpoint, background, and occlusion state may continuously change during air-to-air tracking, a single video-level description is difficult to fully characterize fine-grained target-state changes across different key frames. Future versions of ATA may further explore key-frame-level language descriptions, temporally dynamic language prompts, or finer-grained target semantic annotations to support more detailed language-guided tracking research.
The annotation work of ATA is completed by multiple annotators following a unified annotation protocol. After the initial annotation is completed, all annotations are subjected to cross-checking and manual verification to reduce subjective errors and inconsistencies. For bounding box annotations, the quality-control process mainly focuses on the accuracy of target localization, the consistency of target identity, the temporal continuity of trajectory changes, the rationality of video splitting boundaries, and whether confusion occurs between the target and similar distractors in challenging scenarios. For language descriptions, the quality-control process mainly focuses on the consistency between the description and the image content, whether the semantic expression is ambiguous, and whether the description can effectively distinguish the target UAV from the surrounding background or similar UAV distractors. Dedicated quality-control procedures are established for both bounding box annotations and language descriptions, thereby ensuring the overall quality of the dataset.
Overall, the annotation system of ATA is characterized by both fine-grained frame-level annotation and dual visual–language annotation, providing a unified and reliable data foundation for subsequent studies on both vision-only tracking and vision–language tracking.
3.4. Data Organization and Split
ATA comprises 50 video sequences in total and is divided into a training set and a test set. The training set contains 40 sequences, including 22 multi-UAV interference sequences from uav-m1 to uav-m22 and 18 single-target sequences from uav-s1 to uav-s18. The test set consists of the remaining 10 sequences, including 6 multi-UAV interference sequences from uav-m23 to uav-m28 and 4 single-target sequences from uav-s19 to uav-s22. Such a split preserves both multi-UAV interference and single-target scenarios in both the training and evaluation stages, thereby enabling a more comprehensive assessment of tracker performance across different levels of task difficulty.
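The split can be written down explicitly from the sequence names given above. The sketch below assumes that sequence folders are named exactly `uav-m1`, ..., `uav-s22`; the helper `seq_names` is hypothetical.

```python
from typing import List


def seq_names(prefix: str, first: int, last: int) -> List[str]:
    """Return names such as 'uav-m1' ... 'uav-m22' for an inclusive index range."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]


# Training set: 22 multi-UAV sequences and 18 single-target sequences.
TRAIN = seq_names("uav-m", 1, 22) + seq_names("uav-s", 1, 18)
# Test set: the remaining 6 multi-UAV and 4 single-target sequences.
TEST = seq_names("uav-m", 23, 28) + seq_names("uav-s", 19, 22)

if __name__ == "__main__":
    print(len(TRAIN), "training sequences;", len(TEST), "test sequences")
```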
In terms of data organization, each sequence folder contains an img directory and two annotation files, namely groundtruth.txt and language.txt. Specifically, the img directory stores image frames in chronological order, groundtruth.txt records the frame-wise bounding box annotations, and language.txt provides the corresponding video-level English description. This standardized data structure allows ATA to be readily integrated into mainstream tracking frameworks for training, testing, and subsequent benchmark evaluation.
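A loader for this folder layout might look like the following sketch. Several details are assumptions rather than facts stated in the text: that `groundtruth.txt` holds one `x,y,w,h` line per frame (comma- or whitespace-separated), that image file names sort chronologically, and that `language.txt` contains a single description. The names `AtaSequence` and `load_sequence` are hypothetical.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Tuple


class AtaSequence(NamedTuple):
    frames: List[Path]                               # image paths in time order
    boxes: List[Tuple[float, float, float, float]]   # assumed (x, y, w, h) per frame
    prompt: str                                      # video-level English description


def load_sequence(seq_dir) -> AtaSequence:
    """Load one sequence folder containing img/, groundtruth.txt, language.txt."""
    seq_dir = Path(seq_dir)
    # Assumption: zero-padded file names, so lexical order equals time order.
    frames = sorted(p for p in (seq_dir / "img").iterdir()
                    if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
    boxes = []
    for line in (seq_dir / "groundtruth.txt").read_text().splitlines():
        if not line.strip():
            continue
        values = [float(v) for v in re.split(r"[,\s]+", line.strip())]
        if len(values) != 4:
            raise ValueError(f"Expected 4 values per line, got: {line!r}")
        boxes.append((values[0], values[1], values[2], values[3]))
    prompt = (seq_dir / "language.txt").read_text(encoding="utf-8").strip()
    # Every retained frame is annotated, so the two counts should match.
    if len(frames) != len(boxes):
        raise ValueError(f"{len(frames)} frames but {len(boxes)} boxes in {seq_dir}")
    return AtaSequence(frames, boxes, prompt)
```

The frame-count check reflects the annotation policy described in Section 3.3, under which no retained frame is left unlabeled.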
The split strategy of ATA is designed by jointly considering scenario diversity and difficulty coverage. On the one hand, both the training set and the test set include multi-UAV interference sequences as well as single-target sequences, which prevents the evaluation from being overly biased toward relatively simple cases. On the other hand, maintaining both sequence types in the two subsets enables the benchmark to cover a broader range of scene configurations and difficulty levels. Overall, ATA adopts a data split that preserves a clear organizational structure and complete annotations, while also providing a direct and reliable foundation for unified benchmark evaluation.
3.5. Dataset Statistics and Characteristic Analysis
ATA comprises 50 video sequences with a total of 38,094 frames and an overall duration of 1269.8 s (approximately 21 min and 9.8 s). On average, each sequence contains 761.88 frames, with the shortest and longest sequences consisting of 310 and 945 frames, respectively. All videos are collected at a unified resolution of 1920 × 1080. Although ATA is relatively compact in scale, it exhibits a reasonably balanced distribution of sequence lengths, which is sufficient to satisfy the fundamental requirements of an air-to-air tiny-object tracking benchmark.
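The duration and mean-length figures follow directly from the frame count and the 30 FPS frame rate stated in Section 3.2, as the short check below shows. Only the constant names are introduced here; the numbers are those reported in the text.

```python
TOTAL_FRAMES = 38_094   # total frames reported for ATA
NUM_SEQUENCES = 50      # number of video sequences
FPS = 30                # capture frame rate stated in Section 3.2

duration_s = TOTAL_FRAMES / FPS               # overall duration in seconds
minutes, seconds = divmod(duration_s, 60)     # split into minutes and seconds
mean_frames = TOTAL_FRAMES / NUM_SEQUENCES    # average frames per sequence

if __name__ == "__main__":
    print(f"duration: {duration_s:.1f} s = {int(minutes)} min {seconds:.1f} s")
    print(f"mean sequence length: {mean_frames:.2f} frames")
```

At 30 FPS, the shortest and longest sequences (310 and 945 frames) correspond to roughly 10.3 s and 31.5 s, respectively.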
One of the most distinctive characteristics of ATA lies in its target-scale distribution. Statistical analysis shows that the targets have an average width of 31.175 pixels, an average height of 16.845 pixels, and an average area of 623.850 pixels². The minimum target area is only 100 pixels², whereas the maximum reaches 6624 pixels². Correspondingly, the average equivalent side length is approximately 24.977 pixels. More importantly, the average target area accounts for merely 0.03% of the full image area, indicating that the overwhelming majority of targets in ATA fall into the regime of extremely small-scale objects.
As further illustrated in Figure 4, target instances in ATA are predominantly concentrated within the range of 64–1024 pixels². In particular, 95.084% of the targets occupy less than 0.1% of the full image area, while all targets remain below 0.5% of the image area. These statistics provide compelling evidence that ATA exhibits a highly representative and pronounced air-to-air tiny-object characteristic.
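The relative-area figures can be reproduced from the reported box areas and the 1920 × 1080 frame size. The sketch below is illustrative: the helper names are hypothetical, and the reported equivalent side length is assumed to be the square root of the mean area, which matches the stated value.

```python
import math
from typing import Iterable

IMAGE_W, IMAGE_H = 1920, 1080      # resolution stated in the text
IMAGE_AREA = IMAGE_W * IMAGE_H     # 2,073,600 pixels


def area_fraction(area_px2: float) -> float:
    """Fraction of the full image covered by a box of the given area."""
    return area_px2 / IMAGE_AREA


def equivalent_side(area_px2: float) -> float:
    """Side length of a square with the same area as the box."""
    return math.sqrt(area_px2)


def tiny_share(areas: Iterable[float], threshold: float = 0.001) -> float:
    """Share of boxes whose area fraction is below threshold (0.1% by default)."""
    areas = list(areas)
    return sum(area_fraction(a) < threshold for a in areas) / len(areas)


if __name__ == "__main__":
    print(f"mean area fraction: {100 * area_fraction(623.850):.3f}%")
    print(f"equivalent side:    {equivalent_side(623.850):.3f} px")
    print(f"max area fraction:  {100 * area_fraction(6624):.3f}%")
```

In absolute terms, the 0.1% threshold corresponds to about 2074 pixels² and the 0.5% bound to 10,368 pixels², so the largest annotated target (6624 pixels²) lies between the two.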
In addition to visual annotations, ATA provides one English language prompt for each video, resulting in a total of 50 prompts. The average prompt length is 13.46 words, with the shortest and longest prompts containing 7 and 25 words, respectively. Overall, the language descriptions are deliberately maintained at a moderate length, such that they can effectively convey key cues—including target category, color, background relationship, and relative position—while avoiding the redundant semantic noise that may be introduced by overly lengthy text.
Taken together, the above statistics highlight several salient properties of ATA in terms of dataset scale, tiny-object characteristics, and language annotation design. First, ATA is collected through real air-to-air flight operations, allowing it to faithfully reflect the observation conditions of targets in dynamic low-altitude environments. Second, the targets in ATA are overwhelmingly distributed at an extremely small scale, making the tiny-object property particularly prominent. Third, by simultaneously providing bounding box annotations and language prompts, ATA supports both vision-only and vision–language tracking settings within a unified framework. Therefore, ATA is not merely a dataset tailored to air-to-air counter-UAV scenarios involving tiny targets, but also an effective and reliable foundation for systematic benchmark evaluation.
3.6. Summary of Dataset Challenges
3.6.1. Dual-Dynamic Disturbances
One of the most distinctive characteristics of ATA lies in the fact that both the tracking platform and the target platform are simultaneously in motion, such that the target displacement observed in the image plane is determined not only by the motion of the target itself, but also by the combined effects of platform attitude variation, flight velocity, and viewpoint changes of the observing UAV. As illustrated in Figure 5, compared with conventional scenarios involving static cameras or ground-to-air observation, this dual-dynamic coupling substantially increases the uncertainty of inter-frame target displacement, scale variation, and viewpoint change, thereby rendering cross-frame association and persistent tracking considerably more challenging. For existing trackers that rely heavily on template matching or local search mechanisms, such disturbances further exacerbate the risks of target drift and eventual tracking failure.
3.6.2. Difficulty in Tiny-Object Representation
The target UAVs in ATA exhibit a highly pronounced tiny-object characteristic. Statistical analysis shows that the average target area occupies only a negligible fraction of the full image, with the overwhelming majority of targets falling within the typical regime of tiny objects. As illustrated in Figure 6, owing to their extremely limited scale, the targets can provide only sparse texture, edge, and local structural cues in the image. Under such circumstances, even slight motion blur, compression artifacts, or scale fluctuations may cause target features to deteriorate rapidly. As a consequence, conventional visual representations often struggle to remain both stable and sufficiently discriminative, while the model itself becomes more susceptible to background clutter and spurious responses. Therefore, tiny-object perception and robust representation learning constitute one of the central challenges that ATA poses to existing tracking methods.
3.6.3. Rapid Background Switching and Complex Interference
Because the observing platform remains in continuous flight, the background in ATA is by no means stationary; rather, it undergoes rapid transitions among pure-sky scenes, mixed sky–ground backgrounds, building-dominated environments, woodland scenes, and sports-field settings. As illustrated in Figure 7, compared with relatively simple sky backgrounds, complex ground backgrounds often contain a large number of regions with salient textures and structural patterns. When the target is extremely small, such regions are more likely to compress the effective discriminative space, thereby increasing the difficulty of separating the target from the background. Meanwhile, rapid background variation also weakens the model’s ability to exploit short-term motion continuity, making it more likely for the search region to contain spurious responses that resemble the target in shape, scale, or motion pattern. Therefore, ATA not only evaluates a tracker’s ability to adapt to complex low-altitude environments, but also more stringently examines its robustness under rapid background transitions.
3.6.4. Multi-UAV Interference and Target Specification Ambiguity
In a subset of ATA sequences, besides the target being tracked, one or two additional UAVs are deliberately introduced as distractors, which further intensifies the problems of target confusion and identity drift. As illustrated in Figure 8, since these UAVs are often highly similar in category, appearance, scale, and flight state, conventional tracking paradigms that rely solely on first-frame bounding box initialization are more prone to mismatching during persistent tracking. This issue becomes particularly severe in the tiny-object regime, where the visual evidence itself is already weak, and the coexistence of multiple similar flying objects substantially increases the difficulty of target specification. For this reason, the language descriptions introduced in ATA should not be viewed merely as an extension at the annotation level; rather, they serve as a necessary semantic complement for addressing such target ambiguity, thereby providing a more direct experimental foundation for subsequent research on vision–language collaborative tracking.
4. Experiments
To evaluate the discriminative capacity of ATA across different methodological paradigms and to investigate the adaptability of current mainstream tracking models to air-to-air counter-UAV scenarios involving tiny targets, we select eight representative recent trackers as baselines, including four vision-only trackers and four vision–language trackers. Specifically, the vision-only baselines include OSTrack [29], SeqTrack [30], AQATrack [31], and MCITrack [32], while the vision–language baselines comprise All-in-One [33], MMTrack [34], SUTrack [35], and MambaTrack [36]. Such a benchmark configuration not only covers the two mainstream paradigms of vision-only tracking and vision–language tracking, but also spans a diverse spectrum of technical designs, including unified modeling, sequential modeling, contextual interaction, and cross-modal fusion. This diversity enables a more comprehensive and informative analysis of benchmark performance on ATA.
4.1. Experimental Settings
All experiments are conducted on a platform running Ubuntu 20.04, with Python 3.8 and Miniconda 3 adopted for environment management. The hardware configuration consists of an Intel Xeon Platinum 8358P CPU (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory (NVIDIA Corporation, Santa Clara, CA, USA). All methods are evaluated under the unified train/test split of ATA, where the vision-only trackers are tested under the BBox-only setting, whereas the vision–language trackers are evaluated under the Language-assisted setting.
To ensure reproducibility and a fair comparison among methods, we adopt the same data split and the same training/evaluation protocol on the ATA dataset throughout. All baseline models and their AFTE-enhanced variants are initialized from the official pretrained weights of the corresponding methods and are then retrained only on the ATA training set. After training, the last-epoch model weights are uniformly used for final performance evaluation on the testing set. The testing set is not involved in model training, hyperparameter adjustment, or checkpoint selection. The number of training epochs for all methods is uniformly set to 50. The input resolution is also uniformly configured, with the template region set to and the search region set to . For vision-only tracking methods, only the image sequence and the first-frame target bounding box are used as inputs. For vision–language tracking methods, each video sequence uses one manually verified English description provided by ATA as a fixed video-level language prompt. This language prompt remains unchanged throughout the tracking process, and no additional manual prompt engineering is introduced. The description is processed by the original text encoding module or tokenizer of each method. To adapt to the input formats of different trackers, we perform only the necessary conversions of the ATA data organization, including the sequence list, the first-frame bounding box, and the annotation files. Apart from the structural modifications required by the AFTE module, the main network architecture, loss function, and inference procedure of each baseline method are left unchanged. All methods are finally compared on the same testing set and under the same evaluation metrics.
Following standard practice in visual tracking, we adopt five widely used evaluation metrics to comprehensively assess different methods, namely AUC, OP50, Precision, , and FPS. Among these metrics, AUC, which measures the area under the success curve, is used as the primary criterion for ranking and comparing the overall tracking performance of different models in this work.
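The overlap- and distance-based metrics named above can be illustrated with a minimal sketch. The text does not spell out the formulas, so the sketch assumes the conventions commonly used in single-object tracking benchmarks: boxes in (x, y, w, h) format, a success curve sampled at 21 IoU thresholds from 0 to 1, OP50 as the share of frames with IoU above 0.5, and Precision as the share of frames whose center error is at most 20 pixels. The function names and the toy boxes are hypothetical and are not taken from the ATA evaluation toolkit.

```python
import numpy as np

def iou_xywh(pred, gt):
    """Per-frame IoU for boxes given as (x, y, w, h), shape (N, 4)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return np.where(union > 0, inter / np.maximum(union, 1e-12), 0.0)

def success_auc(ious, thresholds=None):
    """Area under the success curve, in percent (assumed 21 thresholds)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    # Success rate at each threshold: share of frames with IoU above it.
    curve = np.array([(ious > t).mean() for t in thresholds])
    return 100.0 * float(curve.mean())

def op50(ious):
    """Share of frames with IoU above 0.5, in percent."""
    return 100.0 * float((ious > 0.5).mean())

def precision(pred, gt, pixel_threshold=20.0):
    """Share of frames with center error within the pixel threshold, in percent."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_center = pred[:, :2] + pred[:, 2:] / 2.0
    gt_center = gt[:, :2] + gt[:, 2:] / 2.0
    error = np.linalg.norm(pred_center - gt_center, axis=1)
    return 100.0 * float((error <= pixel_threshold).mean())

# Toy two-frame sequence: one perfect prediction, one complete miss.
gt = [[10, 10, 4, 4], [20, 20, 4, 4]]
pred = [[10, 10, 4, 4], [40, 40, 4, 4]]
ious = iou_xywh(pred, gt)
print(success_auc(ious), op50(ious), precision(pred, gt))
```

For tiny targets, a shift of only a few pixels can push the IoU below 0.5 while the center error stays well within 20 pixels, which is one plausible reason why Precision values on ATA are much higher than the corresponding OP50 values.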
4.2. Baseline Results and Analysis
Under the unified experimental protocol described above, we first compare the overall performance of the eight representative trackers on the ATA dataset, with the quantitative results summarized in Table 5. Ranking the methods by AUC, which serves as the primary evaluation metric in this work, SeqTrack achieves the best overall performance, attaining an AUC of 44.55. It further reaches 48.43 in OP50, 79.74 in Precision, and 36.18 in
, thereby demonstrating the strongest comprehensive capability among all baselines. MCITrack ranks second with an AUC of 41.09, and likewise exhibits strong competitiveness among the vision-only trackers. Among the vision–language methods, MMTrack delivers the best overall performance, obtaining an AUC of 38.07, together with 45.76 in OP50 and 35.02 in
, which indicates that it possesses comparatively strong cross-modal modeling capability under the ATA setting. Taken as a whole, even the strongest baseline still leaves considerable room for improvement in the air-to-air tiny-object counter-UAV scenario characterized by ATA, suggesting that the benchmark poses a substantial challenge to existing trackers across different technical paradigms.
Beyond absolute tracking accuracy, the results also reveal a non-negligible trade-off between effectiveness and efficiency. For instance, All-in-One achieves 96.67 FPS, making it the fastest method among all baselines; however, its AUC is only 31.19, indicating that higher inference speed does not necessarily translate into superior tracking performance. By contrast, SeqTrack not only delivers the best accuracy, but also maintains a runtime speed of 50.00 FPS, thereby exhibiting a relatively favorable balance between tracking performance and computational efficiency. Although MCITrack ranks second in terms of AUC, its runtime is only 32.95 FPS, which is comparatively modest. These observations suggest that, under the ATA setting, existing trackers still face a pronounced robustness–real-time trade-off, especially when accurate localization and real-time deployment are expected to be achieved simultaneously.
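The accuracy–speed trade-off described above can be made concrete with a small Pareto check over the (AUC, FPS) pairs quoted in the text. This is only an illustrative sketch: it covers the three methods whose AUC and FPS are both stated in this paragraph and the preceding one, not the full Table 5, and the helper name `pareto_front` is hypothetical.

```python
def pareto_front(results):
    """Return the methods that no other method beats or ties on both AUC and FPS."""
    front = []
    for name, (auc, fps) in results.items():
        dominated = any(
            other_auc >= auc and other_fps >= fps
            and (other_auc > auc or other_fps > fps)
            for other, (other_auc, other_fps) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

# (AUC, FPS) values as quoted in the text for three of the eight baselines.
results = {
    "SeqTrack": (44.55, 50.00),
    "MCITrack": (41.09, 32.95),
    "All-in-One": (31.19, 96.67),
}
print(pareto_front(results))  # ['SeqTrack', 'All-in-One']
```

Among these three methods, SeqTrack and All-in-One are not dominated, whereas MCITrack is dominated by SeqTrack, which is both more accurate and faster.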
It should be noted that the results in Table 5 also show that vision–language trackers do not always outperform vision-only trackers, even though they receive additional language prompts. For example, the vision-only methods SeqTrack and MCITrack surpass every vision–language baseline in terms of AUC, including the strongest one, MMTrack. This indicates that although language information can provide additional semantic cues for target specification, it does not translate into a stable performance gain across all models and all air-to-air tiny-target scenarios. This phenomenon can be mainly attributed to the following four aspects.
First, the target UAVs in ATA are extremely small, and the visual evidence available in a single frame is very limited. Dataset statistics show that the average target area occupies only a negligible fraction of the entire image, and most targets fall within the typical tiny-object range. In this case, language prompts can describe the target category, color, structure, or background relationship, but they can hardly provide precise pixel-level or box-level spatial localization information. When the target appears as only a few pixels in the image, the model still relies heavily on the visual branch to capture weak textures, weak edges, and local responses. Therefore, language prompts cannot fully compensate for the localization difficulty caused by the insufficient visual representation of tiny targets.
Second, the language descriptions in ATA are video-level prompts and remain unchanged throughout the entire sequence tracking process, lacking frame-wise updated information about target position, scale, and motion state. In other words, language descriptions can provide global semantic constraints for the target, but they cannot dynamically adapt to rapid displacement, scale variation, viewpoint changes, and background switching in consecutive frames. Therefore, when the target is affected by dual-dynamic motion, rapid background changes, or short-term occlusion, the fixed language prompt has limited constraint capability for precise localization in the current frame and cannot directly replace cross-frame motion modeling and local search mechanisms.
Third, existing vision–language trackers still face obvious cross-modal alignment difficulties in air-to-air tiny-target scenarios. Most vision–language tracking methods usually encode language prompts as global semantic features and fuse them with visual features at a relatively coarse level. However, in ATA scenarios, the target UAV often occupies only a very small number of pixels, and its local visual responses are weak and easily affected by complex backgrounds, motion blur, and similar UAV distractors. This makes it difficult for semantic information in the language branch, such as category, color, structure, or relative position, to establish stable correspondence with fine-grained local responses in the image. In other words, the existing cross-modal fusion process may fail to sufficiently convert high-level language semantics into effective frame-level spatial localization constraints, thereby causing language information to bring unstable performance gains in some models.
Finally, although some purely visual trackers do not use language input, they have stronger visual representation, template matching, sequence modeling, or localization regression capabilities, and therefore may still perform more stably in ATA scenarios where tiny-object localization and cross-frame association are the core difficulties. For air-to-air tiny-UAV tracking, coarse-grained semantic information is not always more critical than fine-grained visual localization capability. When language prompts cannot provide sufficiently fine-grained spatial and temporal constraints, purely visual methods with stronger visual feature extraction and localization capabilities may still achieve higher performance.
In summary, the results in Table 5 do not indicate that language information is ineffective in ATA. Instead, they suggest that existing vision–language trackers still have limitations in exploiting language information under air-to-air tiny-UAV tracking conditions. For such scenarios, language prompts should not be simply introduced as global semantic descriptions. Instead, they need to be combined with finer-grained cross-modal alignment mechanisms, frame-level spatial grounding mechanisms, local search-region guidance mechanisms, and temporal modeling mechanisms. By enabling language features to participate more directly in target specification, candidate-region selection, similar-UAV discrimination, and cross-frame association, the auxiliary role of language information in robust tracking under complex dynamic backgrounds can be more fully exploited.
To further analyze the task difficulty of ATA from an experimental perspective, we compare the AUC performance of representative VOT and VLT trackers on ATA, LaSOT, TNL2K, and UAV-Anti-UAV benchmarks, as shown in Table 6. By comparing performance differences across different types of benchmarks, we can more intuitively observe the adaptability and limitations of existing tracking methods in the air-to-air tiny-UAV tracking scenario characterized by ATA.
As can be seen from Table 6, existing tracking methods generally obtain lower AUC scores on ATA than on general tracking or vision–language tracking benchmarks such as LaSOT and TNL2K. For example, OSTrack achieves an AUC of 71.1 on LaSOT, whereas its AUC on ATA is only 36.11. MCITrack obtains an AUC of 75.3 on LaSOT, but decreases to 41.09 on ATA. For vision–language tracking methods, SUTrack achieves an AUC of 67.9 on TNL2K, while its AUC on ATA is only 35.24.
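The size of these gaps can be quantified directly from the three pairs of AUC values quoted above. The following sketch computes the absolute and relative AUC drop for each pair; the helper name `auc_drop` is hypothetical, and only the values stated in the text are used.

```python
def auc_drop(reference_auc, ata_auc):
    """Absolute drop (AUC points) and relative drop (percent of the reference AUC)."""
    absolute = reference_auc - ata_auc
    relative = 100.0 * absolute / reference_auc
    return absolute, relative

# (reference benchmark, AUC on that benchmark, AUC on ATA), as quoted in the text.
pairs = {
    "OSTrack": ("LaSOT", 71.1, 36.11),
    "MCITrack": ("LaSOT", 75.3, 41.09),
    "SUTrack": ("TNL2K", 67.9, 35.24),
}

drops = {}
for tracker, (benchmark, ref_auc, ata_auc) in pairs.items():
    drops[tracker] = auc_drop(ref_auc, ata_auc)
    absolute, relative = drops[tracker]
    print(f"{tracker}: {benchmark} -> ATA drop of {absolute:.2f} points ({relative:.1f}%)")
```

For all three trackers the absolute drop exceeds 32 AUC points, and the relative drop lies between roughly 45% and 49% of the score obtained on the reference benchmark.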
This performance gap indicates that air-to-air tiny-UAV tracking is indeed more challenging than conventional object tracking scenarios. Existing methods usually achieve relatively stable performance in ordinary scenarios, but their performance drops substantially on ATA. This suggests that the task is affected not only by tiny target scale and limited appearance information, but also by the combined influence of dual-dynamic motion between the tracking platform and the target platform, rapid viewpoint changes, complex background interference, and similar UAV distractors. ATA therefore reflects the key challenges of air-to-air tiny-UAV tracking in a concentrated form, which further supports the necessity of constructing a dedicated benchmark.
4.3. Failure Cases and Bottleneck Analysis
Compared with generic single-object tracking scenarios, the air-to-air tiny-object counter-UAV task characterized by ATA is considerably more challenging. Figure 9 presents several representative failure cases of the baseline trackers on ATA, where the green boxes denote the ground-truth (GT) annotations and the red boxes correspond to the tracker predictions.
First, because the targets are extremely small and provide only highly limited visual evidence, existing methods often struggle to establish stable and discriminative target representations, which consequently gives rise to response attenuation, localization jitter, and even complete target loss, as illustrated in Figure 9a. Moreover, since both the tracking platform and the target platform are simultaneously in motion, the target may undergo rapid variations in position, scale, and pose across consecutive frames, as shown in Figure 9b. Such dual-dynamic disturbances substantially increase the difficulty of inter-frame correspondence and target association, thereby rendering current trackers more vulnerable to mismatching, drift accumulation, and eventual tracking failure. Meanwhile, the target may appear against a wide range of complex backgrounds, including sky, buildings, ground scenes, woodland areas, and sports fields, as illustrated in Figure 9c, which further aggravates the challenges of stable localization and robust target–background separation.
A closer inspection further suggests that, although these failure cases manifest themselves in different forms, they in fact reflect a common underlying bottleneck, namely that existing methods still make insufficient use of continuous temporal information. When the target appearance in a particular frame becomes ambiguous, or is severely degraded by dynamic variation and background interference, relying solely on the visual evidence from the current frame is often insufficient to support a stable prediction, as illustrated in Figure 9d. Therefore, under the ATA setting, single-frame appearance cues alone are inadequate for stable and persistent tracking, which in turn indicates that more effective exploitation of temporal continuity and sequence-level information may be essential for improving tracking robustness in this highly challenging scenario.
4.4. Validation of AFTE-Based Temporal Enhancement
4.4.1. Core Idea and Adaptation Strategy of AFTE
In persistent air-to-air tiny-UAV tracking, the information provided by a single frame is usually limited. In particular, under the tiny-object, strong-dynamic, and complex-background conditions characterized by ATA, the target often occupies only a very small number of pixels in a single frame, with weak appearance texture and structural information. It is also easily affected by motion blur, background false responses, and similar UAV distractors. Therefore, relying only on the appearance features of the current frame is insufficient to support stable persistent tracking. Since ATA is collected from continuous air-to-air video sequences, adjacent frames naturally contain important cues such as target motion states, background variation trends, and cross-frame association relationships. Based on this property, this paper further introduces the AFTE temporal enhancement module to verify whether short-term temporal information can improve tracking performance in ATA scenarios.
Because the exploitation of temporal information is one of the key focuses of the dataset analysis in this paper, AFTE is designed as a lightweight adjacent-frame temporal enhancement module that makes efficient use of temporal cues in videos, with the aim of alleviating the insufficient exploitation of motion features in air-to-air tiny-UAV tracking.
The basic idea of AFTE [37] is inherited from existing adjacent-frame feature enhancement and efficient adjacent-frame fusion methods. Such methods usually establish feature correspondences between the current frame and neighboring frames by measuring local similarity between adjacent frames, and they enhance the target response in the current frame through feature alignment, similarity-weighted fusion, or background-difference modeling. For ATA scenarios, when the appearance cues of the target in the current frame are weak, the target position, local motion tendency, and background variation information in the previous frame can provide supplementary constraints for current-frame target localization. When false responses with shapes or scales similar to the target exist in the background, the temporal consistency between adjacent frames also helps suppress unstable background interference.
In terms of the specific adaptation strategy, this paper adopts a unified integration strategy. For a given video sequence, when tracking the target in the t-th frame, AFTE simultaneously uses the visual information of the current frame and of the previous adjacent frame. The two frames are first processed by the visual encoder of the corresponding tracker, yielding a current-frame feature and an adjacent-frame feature. AFTE then estimates the local correspondence between the adjacent-frame feature and the current-frame feature through local similarity calculation, and aligns and fuses the adjacent-frame feature to obtain the enhanced current-frame feature. The enhanced feature is subsequently fed into the prediction module of the original tracker to output the target bounding box of the current frame [38].
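The adjacent-frame enhancement step described above can be sketched in a few lines of NumPy. This is a minimal illustration of local-similarity alignment followed by fusion, not the AFTE implementation itself: the cosine similarity over a small spatial window, the softmax temperature, the residual fusion with weight `alpha`, and all function and parameter names are assumptions introduced here for illustration.

```python
import numpy as np

def adjacent_frame_fusion(feat_cur, feat_prev, radius=1, temperature=0.1, alpha=0.5):
    """Enhance current-frame features (C, H, W) with locally aligned previous-frame features.

    Hypothetical sketch: each current-frame location attends to a
    (2 * radius + 1)^2 window of the previous frame via cosine similarity.
    """
    channels, height, width = feat_cur.shape
    eps = 1e-8
    cur_unit = feat_cur / (np.linalg.norm(feat_cur, axis=0, keepdims=True) + eps)
    prev_unit = feat_prev / (np.linalg.norm(feat_prev, axis=0, keepdims=True) + eps)

    # Zero-pad so that every location has a full window; padded cells have
    # zero similarity and contribute zero features.
    pad = ((0, 0), (radius, radius), (radius, radius))
    prev_pad = np.pad(feat_prev, pad)
    prev_unit_pad = np.pad(prev_unit, pad)

    sims, shifted = [], []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            window_unit = prev_unit_pad[:, dy:dy + height, dx:dx + width]
            sims.append((cur_unit * window_unit).sum(axis=0))                # (H, W)
            shifted.append(prev_pad[:, dy:dy + height, dx:dx + width])       # (C, H, W)
    sims = np.stack(sims)        # (K, H, W), K = window size
    shifted = np.stack(shifted)  # (K, C, H, W)

    # Softmax over the window turns similarities into alignment weights.
    logits = sims / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)

    aligned_prev = (weights[:, None] * shifted).sum(axis=0)  # (C, H, W)
    # Residual fusion (an assumption): the current frame stays dominant.
    return feat_cur + alpha * aligned_prev

rng = np.random.default_rng(0)
feat_prev = rng.standard_normal((8, 6, 6))
feat_cur = np.roll(feat_prev, shift=1, axis=2)  # simulate a one-pixel shift between frames
enhanced = adjacent_frame_fusion(feat_cur, feat_prev)
print(enhanced.shape)  # (8, 6, 6)
```

Because the output has the same shape as the current-frame feature, a module of this kind can be placed between the visual encoder and the prediction head without changing either of them, which is consistent with the integration strategy described in the text.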
For visual-only trackers, AFTE extends the original single-frame visual input to a two-frame input consisting of the current frame and the previous adjacent frame, so as to enhance the current-frame feature in the visual branch. For vision–language trackers, the language branch remains unchanged. The language prompt is still the video-level English description provided by ATA, and the text encoder and language-fusion strategy follow the original settings of each method. AFTE is only applied to the visual feature branch. In this way, this paper can compare the influence of AFTE on different types of baseline trackers without changing the original language input, main network architecture, training protocol, or evaluation metrics.
4.4.2. Experimental Results and Analysis of AFTE
As shown in Table 7, after introducing AFTE, the overall performance of each baseline model improves to a different degree. Among vision-only methods, the AUC of MCITrack increases from 41.09 to 41.46, and that of AQATrack increases from 36.10 to 38.07. In addition, the OP50, Precision, and
of AQATrack are improved to 40.09, 72.78, and 29.53, respectively. These results indicate that adjacent-frame information can, to some extent, supplement the insufficient single-frame representation of visual-only trackers in air-to-air tiny-object scenarios.
For vision–language methods, the gains brought by AFTE are more obvious. The AUC of All-in-One increases from 31.19 to 35.91, that of MambaTrack increases from 31.98 to 33.53, and that of SUTrack increases from 35.24 to 40.74, with an improvement of 5.50. Meanwhile, the Precision of SUTrack increases from 64.98 to 75.38, showing the most significant performance improvement. Overall, vision–language methods benefit more from AFTE than visual-only methods, indicating that for models with relatively weak single-frame discrimination ability or insufficient visual evidence, short-term temporal information can complement language prompts and further improve target localization stability.
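The comparison between the two paradigms can be checked with simple arithmetic on the AUC values quoted in the text. The sketch below uses only the five trackers whose before/after AUC values are stated in this section, so the resulting averages describe that subset rather than the full Table 7; the helper name `mean_gain` is hypothetical.

```python
# (paradigm, AUC without AFTE, AUC with AFTE), as quoted in the text.
auc_before_after = {
    "MCITrack": ("vision-only", 41.09, 41.46),
    "AQATrack": ("vision-only", 36.10, 38.07),
    "All-in-One": ("vision-language", 31.19, 35.91),
    "MambaTrack": ("vision-language", 31.98, 33.53),
    "SUTrack": ("vision-language", 35.24, 40.74),
}

def mean_gain(paradigm):
    """Average AUC gain over the quoted trackers of one paradigm."""
    gains = [after - before
             for kind, before, after in auc_before_after.values()
             if kind == paradigm]
    return sum(gains) / len(gains)

for name, (_, before, after) in auc_before_after.items():
    print(f"{name}: +{after - before:.2f} AUC")
print(f"vision-only mean gain: {mean_gain('vision-only'):.2f}")
print(f"vision-language mean gain: {mean_gain('vision-language'):.2f}")
```

On this subset, the average gain is about 1.17 AUC points for the vision-only trackers and about 3.92 points for the vision–language trackers.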
From the perspective of model complexity, AFTE does not introduce significant additional overhead. The parameters and FLOPs of each model increase only slightly and remain within an acceptable range. In terms of speed, the FPS of some models decreases slightly, but AQATrack and SUTrack instead improve to 63.83 and 85.24 FPS after introducing AFTE, respectively. This indicates that AFTE can achieve relatively stable performance gains at a small computational cost.
Overall, AFTE improves multiple baseline methods to varying degrees, indicating that adjacent-frame information is useful and has application potential for air-to-air tiny-object tracking. By explicitly integrating adjacent-frame information, AFTE can enhance target representation, improve cross-frame association in air-to-air tiny-object scenarios, and increase tracking robustness under complex backgrounds and fast-motion conditions. Meanwhile, because the current test set is relatively compact in scale, small numerical changes need to be interpreted in conjunction with the difficulty of specific sequences and scene variations. Therefore, this paper focuses on the overall performance trend of AFTE across different baseline methods, rather than over-interpreting any single small numerical gain. In the future, with further expansion of the number of test sequences, flight conditions, and target types, the role of temporal modeling methods in persistent air-to-air tiny-UAV tracking can be further verified through more extensive experiments and statistical analysis.
5. Conclusions
In this paper, we address the problem of vision–language tracking in real air-to-air tiny-object UAV countermeasure scenarios by constructing the ATA dataset and establishing a corresponding benchmark. Compared with existing generic object tracking datasets, air-to-ground UAV tracking datasets, and ground-to-air anti-UAV datasets, ATA is more specifically focused on the particular combination of real air-to-air observation, tiny UAV targets, language-assisted target specification, and persistent tracking. It also covers key challenges inherent in this setting, including dual-dynamic disturbances, weak target representation caused by extremely small object scale, rapid background variation, and interference from visually similar targets. Meanwhile, ATA provides both frame-wise bounding box annotations and video-level English language descriptions for each sequence, thereby supporting the BBox-only and Language-assisted settings within a unified framework and offering a common data foundation and evaluation protocol for both vision-only tracking and vision–language tracking.
On top of this, we establish a benchmark over ATA that covers both vision-only and vision–language methods and conduct systematic evaluations of several representative mainstream trackers proposed in recent years. The experimental results indicate that, although current methods are capable of achieving a certain degree of persistent tracking under the ATA setting, their overall performance remains limited, and no clearly dominant solution has yet emerged that can robustly cope with the principal challenges posed by this scenario. Further analysis reveals that existing methods on ATA generally suffer from several shared bottlenecks, including insufficient tiny-object representation, difficulty in inter-frame matching under dual-dynamic conditions, limited robustness in the presence of complex backgrounds and similar-target interference, as well as inadequate exploitation of temporal information. These findings suggest that the air-to-air tiny-object counter-UAV task characterized by ATA indeed imposes substantially higher demands on current tracking models than generic scenarios, thereby further justifying the necessity of constructing such a dataset and benchmark.
Given that ATA is naturally derived from continuous video streams and exhibits pronounced temporal correlations, we further introduce the AFTE temporal enhancement module for additional validation. The experimental results demonstrate that explicitly exploiting adjacent frames for short-term temporal enhancement can yield relatively consistent performance gains across multiple baseline trackers. This observation suggests that temporal modeling constitutes one of the key factors governing the performance upper bound under the ATA setting, while also indicating that explicit and lightweight adjacent-frame enhancement represents a practically meaningful avenue for improvement. Meanwhile, vision–language methods have not yet exhibited a stable advantage over vision-only methods on ATA, which further implies that future research should investigate more effective ways of utilizing language information for target specification, ambiguity suppression, and identity preservation.
Although ATA provides a dedicated benchmark for vision–language tracking research in air-to-air tiny-UAV countermeasure scenarios, the current study still has certain limitations. First, the current test set of ATA is still relatively small. Although it can support preliminary evaluation of different types of tracking methods, there remains room for improvement in terms of statistical stability. Second, the acquisition conditions of ATA are still mainly based on daytime visible-light scenarios. Although the dataset contains a small number of samples under backlit, overcast, and low-contrast conditions, low-light, night-time, and more extreme weather conditions have not yet been systematically covered. Therefore, the dataset can be further expanded to support the evaluation of generalization ability under complex illumination and all-weather scenarios. Third, the current target types in ATA are mainly DJI consumer-grade UAVs. These UAVs have small imaging scales in the air and match the air-to-air tiny-UAV countermeasure task considered in this paper well. However, UAV category diversity can still be further expanded. Fourth, ATA currently adopts video-level single-sentence language descriptions, which can provide basic semantic information such as target category, appearance attributes, and background relations. However, the granularity of language annotation is still relatively limited and does not yet cover fine-grained semantic changes of the target across different time periods. Fifth, the main objective of this paper is to construct and analyze a vision–language tracking benchmark for air-to-air tiny-UAV countermeasure scenarios. At the methodological level, this paper only validates the role of the lightweight AFTE temporal enhancement module, and has not yet designed a complete tracking architecture specifically oriented to tiny-object imaging and dual-dynamic-motion coupling characteristics. Therefore, the design of dedicated tracking models for this scenario remains a problem worthy of further investigation in future work.
Future work will mainly be carried out from the following three aspects. First, we will further expand the data scale and scenario coverage of ATA by introducing richer environmental conditions, flight relations, and UAV types, so as to continuously improve the representativeness and challenge of the dataset. Second, we will further expand the annotation system by exploring key-frame-level language descriptions, temporally dynamic language prompts, and finer-grained target semantic annotations, so as to enhance the support of language information for tiny-UAV identity discrimination and persistent tracking. Third, based on the core problems revealed by ATA, we will further explore temporal modeling methods, robust representation mechanisms, and vision–language collaborative modeling frameworks that are more suitable for air-to-air tiny-object scenarios, so as to continuously promote the development of object tracking research for real low-altitude counter-UAV tasks.