1. Introduction
With the rapid development of unmanned aerial vehicle (UAV) technology, air-to-air UAV object detection has emerged as a critical task for applications such as autonomous swarm coordination [1], aerial surveillance [2], obstacle avoidance [3], and anti-UAV systems [4].
Unlike ground-based or ground-to-air scenarios, air-to-air UAV object detection is confronted with three interconnected challenges that collectively define its unique difficulty. First, at the scene level, targets often appear at extremely small scales, exhibit dynamic and unpredictable motion patterns, and are set against complex backgrounds such as urban skylines or natural terrain. These factors severely degrade the performance of conventional detection methods that rely on handcrafted features. Second, at the performance level, the application scenario demands high real-time processing capabilities. Third, at the hardware level, the computational capacity of onboard edge devices in drones is severely constrained, imposing rigorous lightweight requirements on the model design. These challenges, particularly the prevalence of motion blur and significant illumination variations during flight, severely degrade the performance of conventional object detection models.
Motion blur arises inherently from the high relative velocities between UAVs. When a target UAV moves rapidly across the field of view of the detecting UAV, or when the detecting UAV itself undergoes agile maneuvers, the integration time of the camera sensor results in a smeared, low-contrast representation of the target [5,6]. This blurring effect significantly reduces the discriminative power of local texture and edge features [7], which are crucial for tiny UAV object detection.
Illumination variations are equally problematic. UAVs operate under diverse lighting conditions, ranging from strong sunlight to deep shadows, and from uniform overcast skies to complex, high-contrast urban skylines. These variations can cause drastic changes in the target’s apparent brightness and color, making it difficult for models to learn robust appearance invariance [8]. Situations may even arise in which it becomes difficult to distinguish between the target and the background [9]. Empirical evidence from datasets like Det-Fly [10] confirms that detection accuracy drops substantially in sequences exhibiting motion blur or extreme lighting (e.g., strong backlighting), highlighting these as critical bottlenecks for reliable air-to-air UAV object detection.
While motion blur and illumination variations present prominent challenges in air-to-air UAV detection, the effective detection of tiny targets and the management of computational costs must also be critically considered. On one hand, during air-to-air UAV detection, the significant detection distance often results in the target UAV occupying only a minimal pixel area within the image [11]. This leads to extremely weak appearance and textural features, making it difficult to distinguish the target from background noise [12]. On the other hand, severe computational constraints arise from the limited processing power, memory, and power consumption of the UAV’s onboard edge devices. This necessitates that models maintain high accuracy while simultaneously meeting stringent requirements for lightweight design and real-time performance, thereby imposing significant demands on the model’s computational efficiency and practical deployment feasibility.
While recent deep learning models, particularly the YOLO series [13,14,15,16,17,18,19,20], have shown promise due to their real-time capabilities, they often struggle with these specific aerial challenges. Standard convolutional operations and conventional attention mechanisms (e.g., CBAM [21]) are not explicitly designed to counteract the signal degradation caused by blur or illumination shifts. Although some works incorporate multi-scale fusion or lightweight backbones to handle small targets and computational constraints, a dedicated architectural solution for mitigating motion blur and illumination effects in the feature learning stage remains largely unexplored. There is a critical need for a model that can simultaneously enhance feature robustness against these degradations while maintaining a lightweight footprint for onboard deployment.
To address these challenges, we developed a specialized air-to-air UAV detection model based on the advanced YOLO11 architecture. The proposed framework integrates LECA-Conv modules to enhance both channel-wise relationships and local feature representation, effectively addressing motion blur and illumination variations. GhostModulev2 is employed to achieve efficient feature extraction while maintaining model compactness, and Tiny Detection Heads are incorporated to specifically improve small object detection capabilities. This comprehensive design enables the model to achieve high detection accuracy while maintaining superior real-time performance.
To be specific, the main contributions of this paper are summarized as follows:
- (1) A Novel Lightweight Network for UAV Detection. We propose A2A-YOLO, a dedicated architecture for air-to-air UAV detection that achieves an exceptional balance between accuracy and efficiency. It is engineered to effectively tackle critical challenges in aerial imagery, including motion blur, illumination variations, and the precise detection of tiny targets, all while ensuring real-time performance under resource-constrained conditions.
- (2) An Efficient Attention Mechanism (LECA-Conv). We design the LECA-Conv module, based on the insight that attention parameters need not be directly applied to the original feature maps. This design enables local feature enhancement and channel-wise highlighting at remarkably low computational cost, leading to significantly improved feature extraction capability for objects under abnormal illumination and motion blur.
- (3) Architectural Optimizations for Lightweight and Tiny Targets. We strategically incorporate GhostModulev2 and the Tiny Detection Head, which maintain the model’s lightweight characteristics while significantly enhancing its capability for tiny target detection.
- (4) Comprehensive Validation and Practical Deployment. We conducted comprehensive evaluations of A2A-YOLO in RGB and IR air-to-air UAV object detection scenarios, encompassing diverse backgrounds and varying complexity conditions. The model’s reliability was further verified through inference testing on the RK3588 edge computing platform.
2. Related Works
Traditional machine learning methods offer advantages in air-to-air UAV detection due to rapid inference. However, their performance is unsatisfactory on datasets with complex backgrounds, and they rely heavily on handcrafted features. Kassab et al. [22] compared the detection performance of Support Vector Machines (SVMs) [23] and Random Forest (RF) [24] combined with the Histogram of Oriented Gradients (HOG) on a drone dataset. Although their proposed modified Non-Maximum Suppression (NMS) algorithm improved model precision, the detection accuracy of both methods remained at a relatively low level.
Recent advances in deep learning have shown promise for air-to-air UAV object detection. Two-stage detectors like Faster R-CNN [25] and single-stage models like the YOLO series [13,14,15,16,17,18,19,20] have been adapted to aerial imagery, with modifications like multi-scale feature fusion and lightweight backbones to address scale variation and computational constraints. Zheng et al. [10] compared eight models including Cascade R-CNN [26], Faster R-CNN [25] and YOLOv3 [17] for air-to-air UAV detection, demonstrating that Cascade R-CNN achieves the highest accuracy but poorest real-time performance, while YOLOv3 offers lower accuracy but superior real-time capability.
A significant branch of research has focused on pushing the boundaries of detection accuracy. Cai et al. [27] proposed an EA-DINO network with EA-FPN optimization that achieves significant accuracy improvements, but its computational complexity prevents real-time operation on resource-limited edge devices. The pursuit of high model accuracy often compromises real-time performance, severely limiting its practical utility in air-to-air scenarios. Conversely, another research direction prioritizes computational efficiency and model lightweighting. Cheng et al. [28] developed a lightweight YOLOv5s-NGN architecture incorporating a CF2-MC module for streamlined feature extraction and an MG module for complexity reduction, achieving real-time performance on edge devices while exhibiting compromised accuracy that fails to meet operational requirements.
The unique challenges of air-to-air detection, namely motion blur and illumination variations, have been individually studied in the broader context of computer vision. For motion blur, Gong et al. [29] transformed the problem of blur removal into motion flow estimation, achieving image restoration via an end-to-end pixel-level motion flow estimation neural network. Novel deep learning architectures such as MPRNet [30] and MIRNet [31] effectively balance contextual information and spatial details through multi-stage progressive design or feature fusion mechanisms, demonstrating significant performance improvements in tasks like image deblurring and denoising. However, while these general methods enhance image restoration quality, they fail to account for the morphological characteristics of UAV targets and real-time requirements. Moreover, these approaches are often employed as preprocessing steps and lack end-to-end collaborative optimization with detection networks.
For illumination variations, Jiang et al. [32] proposed a Self-Regularized Attention Mechanism that leverages the intrinsic brightness information of the input image to dynamically adjust the enhancement intensity, thereby achieving differentiated processing for distinct regions and addressing spatially varying illumination issues. Huang et al. [33] developed a bottom-up attention network that effectively compensates for weak illumination, leading to high-quality image enhancement while avoiding over-enhancement.
While these methods have enhanced the model’s robustness to illumination variations to some extent, they often introduce significant computational overhead [32,33]. Furthermore, methods focusing on motion deblurring are typically designed as heavy pre-processing steps, which not only lack end-to-end synergy with detection networks but also fail to account for the specific morphological characteristics of tiny UAV targets in air-to-air scenarios [30,31]. Critically, few studies have successfully integrated solutions for the domain-specific challenges of motion blur and illumination variation into a single, lightweight, and efficient architecture. The lack of a unified framework that can simultaneously mitigate signal degradation and respect strict computational constraints remains a critical bottleneck. Therefore, how to simultaneously address the problem of illumination variation and motion blur, and integrate such solutions into air-to-air drone detection networks, remains a critical challenge that requires urgent resolution.
In summary, the current research landscape presents a dichotomy between the pursuit of accuracy at the cost of efficiency and the pursuit of lightness at the cost of performance. Few studies have successfully integrated solutions for the domain-specific challenges of motion blur and illumination variation into a lightweight, efficient architecture without compromising on accuracy. This gap underscores the necessity for a dedicated model like A2A-YOLO, which is designed from the ground up to navigate these trade-offs and directly address the core challenges of air-to-air UAV detection.
3. Methods
In this section, we propose a high-performance and real-time A2A-YOLO model specifically designed for air-to-air UAV object detection, with its architecture illustrated in Figure 1. The A2A-YOLO model is based on the YOLO11 framework and primarily consists of three key components: Local Enhanced Channel Attention Convolution (LECA-Conv), GhostModulev2, and Tiny Detection Head.
As illustrated in Figure 1, the A2A-YOLO model integrates several modules from YOLO11. Pyramid Split Attention (PSA) enhances feature representation by capturing multi-scale contextual information through a pyramid splitting strategy, while C2PSA (Cross-stage Partial connections with PSA) combines the partial connections of CSPNet with PSA to facilitate gradient flow and reduce computational cost. C3k2 serves as a variant of the standard C3 block utilizing specific kernel configurations for efficient feature transformation. Spatial Pyramid Pooling–Fast (SPPF) replaces the conventional SPP with a serial pooling structure to rapidly aggregate multi-scale spatial features.
The LECA-Conv module is designed to emphasize channel-wise and local feature importance within a lightweight structure, addressing challenges such as motion blur and abnormal lighting conditions. The GhostModulev2 serves as a lightweight architecture that enhances both computational efficiency and feature representation capabilities to meet high real-time requirements. The Tiny Detection Head is specifically optimized to handle significant variations in target scales, thereby enabling effective detection of tiny objects. Based on the above architecture, A2A-YOLO effectively addresses the key challenges in air-to-air UAV object detection.
3.1. Local Enhanced Channel Attention Module
For current attention mechanism methods, given an input feature map $X$, the general representation form of the output is:
$$Y = A(X) \odot X$$
where $A(X)$ represents the attention weights obtained from $X$, and ⊙ denotes element-wise multiplication. It can be observed that existing attention mechanisms essentially assign weights to the original feature map to generate attention-enhanced results.
However, we argue that the attention parameters need not be directly applied to the original feature maps. Attention mechanisms represented by the CBAM network [21], which sequentially connect Channel Attention Modules and Spatial Attention Modules, introduce significant redundant computations during model operation. In the context of high real-time air-to-air UAV object detection scenarios, where attention mechanisms should focus on local information and channel contributions, we propose a Local Enhanced Channel Attention Convolution (LECA-Conv) as shown in Figure 2.
After the convolution and batch normalization operations for downsampling, the model proceeds to the local enhanced branch and channel attention gate. In the local enhanced branch, a 1 × 1 convolution is introduced to adjust channel dimensions and reduce parameter count, followed by grouped convolutions applied along the channel dimension to perform local enhancement with small receptive fields.
Both convolution layers in this branch have their own learnable weights, and the branch combines the convolution operation ⊛ with batch normalization (BN) and an activation function.
The channel attention gate generates channel weights through average pooling, adjusts these weights via two 1 × 1 convolutional layers with activation functions to introduce nonlinearity, and thereby reallocates channel attention.
Here the two convolution layers have their own learnable weights, and the average pooling operation condenses each channel into a single descriptor.
Notably, the intermediate channels in both the channel attention gate and local enhanced branch maintain consistent design dimensions. This deliberate architectural choice preserves coherence in channel transformations while synergistically reinforcing both channel attention mechanisms and local feature enhancement, ultimately facilitating more stable model training.
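The adaptive shortcut is optional, and we refer to the LECA-Conv with an adaptive shortcut as PLECA-Conv. The input feature map $X$ first passes through the downsampling convolution and the BN layer. The LECA-Conv output is then obtained by multiplying the output of the local enhanced branch element-wise by the channel weights from the channel attention gate, with the adaptive shortcut added as a residual connection in the PLECA-Conv variant.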
To better illustrate the operational workflow of LECA-Conv, we present the pseudocode of the Local Enhanced Channel Attention Algorithm in Algorithm 1.
To further compare our model with the original attention mechanism, we present a comparative diagram of the ISE-Conv (an architectural improvement based on SEBlock [34] adapted to our design framework), along with our proposed LECA-Conv and PLECA-Conv, as shown in Figure 3.
The key distinction between our designed LECA-Conv and conventional attention mechanisms lies in applying attention weights to the local enhanced branch rather than the original feature maps. This innovative approach simultaneously captures channel-wise attention while enhancing local importance, maintaining computational efficiency while significantly boosting attention effectiveness. Consequently, it achieves superior feature representation for air-to-air UAV targets.
| Algorithm 1: Local Enhanced Channel Attention Algorithm |
| Input: Feature Map |
| Output: Feature Map |
| STEP I: Standard Convolution (lines 1 to 4) |
| STEP II-A: Local Enhanced Branch (lines 5 to 11) |
| STEP II-B: Channel Attention Branch (lines 12 to 17) |
| STEP III: Feature Fusion (lines 18 to 20): element-wise multiplication, then residual connection |
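The fusion idea behind Algorithm 1 can be illustrated with a small pure-Python sketch: channel weights are computed from the convolved feature map, but they scale the output of the local branch rather than the original features. This is only a sketch under stated assumptions, not the authors' implementation. The function names (`channel_gate`, `leca_fuse`), the ReLU and sigmoid activations, and the toy weights are hypothetical, and the local-branch output is supplied directly instead of being produced by grouped convolutions.

```python
import math

def global_avg_pool(fmap):
    """Average each channel of a C x H x W nested list down to one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def matvec(weights, vec):
    """A 1x1 convolution applied to a pooled C-vector is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def channel_gate(fmap, w_reduce, w_expand):
    """Channel weights: avg pool -> 1x1 conv -> ReLU -> 1x1 conv -> sigmoid.
    The activation choices are assumptions made for illustration."""
    pooled = global_avg_pool(fmap)
    hidden = [max(0.0, h) for h in matvec(w_reduce, pooled)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w_expand, hidden)]

def leca_fuse(local_out, gate, shortcut=None):
    """Scale the local-branch output per channel; optionally add a shortcut (PLECA-style)."""
    fused = [[[g * v for v in row] for row in ch] for ch, g in zip(local_out, gate)]
    if shortcut is not None:
        fused = [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
                 for c1, c2 in zip(fused, shortcut)]
    return fused

# Toy example: 2 channels with a 2 x 2 spatial size (all values hypothetical).
x = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]      # features after conv + BN
local = [[[2.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]]]  # local-branch output
w_reduce = [[0.5, 0.5]]    # 2 channels -> 1 intermediate channel
w_expand = [[0.0], [0.0]]  # zero weights yield a neutral gate of 0.5 per channel

gate = channel_gate(x, w_reduce, w_expand)
y = leca_fuse(local, gate)                    # LECA-style: gate applied to the local branch
y_short = leca_fuse(local, gate, shortcut=x)  # PLECA-style: same, plus a shortcut
```

With the neutral gate, every local-branch value is halved, which makes it easy to see that the weights act on `local` and never on `x` unless the shortcut is enabled.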
The architectural consistency in middle channels between both branches facilitates more stable training convergence, resulting in superior model fitting performance specifically optimized for air-to-air UAV object detection scenarios. This approach demonstrates enhanced robustness against common challenges such as motion blur while maintaining reliable performance across diverse lighting conditions.
3.2. Lightweight Feature Extraction Convolutional Module
To further enhance the real-time performance of A2A-YOLO, we incorporate a lightweight feature extraction module called GhostModulev2 [35]. This module effectively maintains the model’s inference accuracy and target feature extraction capability while significantly reducing model complexity. The implementation substantially improves the model’s inference efficiency for deployment on airborne embedded edge devices, ensuring stable real-time operation even under constrained computational resources.
The structure of GhostModulev2 is illustrated in Figure 4, which consists of two GhostConv modules and a Decoupled Fully Connected (DFCConv) module.
The GhostConv is an efficient lightweight convolutional operation, whose core idea is to replace the intensive computation in traditional convolution with inexpensive linear transformations to generate redundant features. For an input feature map $X \in \mathbb{R}^{C_{in} \times H \times W}$, with the output feature map $Y \in \mathbb{R}^{C_{out} \times H' \times W'}$, the computational cost of a standard convolution with kernel size $k$ is:
$$\mathrm{FLOPs}_{conv} = C_{out} \cdot H' \cdot W' \cdot C_{in} \cdot k^2$$
For the GhostConv, which has a Ghost path with a depthwise convolution kernel size of $d$ and a number of transformations $s$, the computational cost is:
$$\mathrm{FLOPs}_{ghost} = \frac{C_{out}}{s} \cdot H' \cdot W' \cdot C_{in} \cdot k^2 + (s-1) \cdot \frac{C_{out}}{s} \cdot H' \cdot W' \cdot d^2$$
It can be easily calculated that the FLOPs of GhostConv are approximately $1/s$ of those of a standard convolution when $d \approx k$ and $s \ll C_{in}$. This demonstrates the computational efficiency of GhostConv, which benefits feature extraction in resource-constrained air-to-air UAV object detection scenarios.
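The cost comparison can be checked with a short worked calculation. The sketch below assumes the standard GhostNet cost model (a primary convolution producing `c_out / s` intrinsic channels, followed by `s - 1` depthwise transformations with kernel size `d`). The layer dimensions are hypothetical values chosen for illustration and are not taken from A2A-YOLO.

```python
def standard_conv_flops(c_in, c_out, h_out, w_out, k):
    """Multiply-accumulate count of a standard k x k convolution."""
    return c_out * h_out * w_out * c_in * k * k

def ghost_conv_flops(c_in, c_out, h_out, w_out, k, d, s):
    """Cost of a GhostConv: primary convolution plus cheap depthwise transformations."""
    intrinsic = c_out // s  # channels produced by the primary convolution
    primary = intrinsic * h_out * w_out * c_in * k * k
    cheap = (s - 1) * intrinsic * h_out * w_out * d * d
    return primary + cheap

# Hypothetical layer: 64 -> 128 channels on an 80 x 80 map, 3 x 3 kernels, s = 2.
standard = standard_conv_flops(64, 128, 80, 80, k=3)
ghost = ghost_conv_flops(64, 128, 80, 80, k=3, d=3, s=2)
ratio = standard / ghost

print(standard, ghost, round(ratio, 3))
```

For this layer the ratio comes out slightly below `s = 2`, because the cheap depthwise path is inexpensive but not free.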
The DFCConv further enhances the feature extraction capability of the module by leveraging downsampling and upsampling to capture information from a larger receptive field. During this process, a convolution with a vertical kernel is employed to perform Vertical FC, followed by a convolution with a horizontal kernel for Horizontal FC, thereby strengthening feature extraction through both vertical and horizontal operations. Under lightweight computational constraints, the DFCConv module enhances feature extraction across an expanded receptive field, enabling the capture and fusion of multi-directional features, which is particularly beneficial for addressing issues such as motion blur.
GhostModulev2, which integrates GhostConv and DFCConv, further reduces computational costs while maintaining effective feature extraction, thereby offering significant advantages for edge-device deployment in UAV object detection.
3.3. Tiny Detection Head Module
The YOLO algorithm typically employs three detection heads, with the largest feature map undergoing an 8× downsampling relative to the input image. However, in air-to-air UAV object detection tasks, UAVs exhibit significant scale variations within images, and some occupy only a small proportion of the frame. For tiny UAV targets, repeated downsampling during feature extraction may lead to information loss [36], resulting in higher false detection and missed detection rates. To address this issue, we modify the original YOLO detection paradigm by introducing a Tiny Detection Head, which is highlighted in green in Figure 1, to enhance the model’s capability in detecting small objects.
The Tiny Detection structure first upsamples the input of the head that originally processes the largest feature map to match the size of the output feature map from the topmost GhostModulev2 in the backbone network. Subsequently, these two feature maps undergo concatenation to generate a higher-channel feature representation. This combined feature map is then processed through a C3k2 module for channel adjustment. Finally, the refined feature map is fed into the Tiny Detection Head to produce detections based on a 4× downsampled feature map relative to the input image.
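The effect of the extra head on tiny targets can be made concrete with simple arithmetic. The sketch below assumes the usual YOLO strides of 8, 16, and 32 for the three original heads and a stride of 4 for the added head. The 640-pixel input matches the experimental setting, while the 12-pixel target width is a hypothetical value.

```python
def feature_grid(input_size, stride):
    """Side length of the square feature map produced at a given stride."""
    return input_size // stride

def cells_spanned(target_px, stride):
    """Number of grid cells a target of target_px pixels spans along one axis."""
    return target_px / stride

INPUT_SIZE = 640          # input resolution used in the experiments
STRIDES = [4, 8, 16, 32]  # 4 is the assumed stride of the added tiny head

grids = {s: feature_grid(INPUT_SIZE, s) for s in STRIDES}
span = {s: cells_spanned(12, s) for s in STRIDES}  # hypothetical 12-pixel-wide UAV

for s in STRIDES:
    print(f"stride {s:2d}: {grids[s]} x {grids[s]} grid, target spans {span[s]:.2f} cells")
```

At stride 32 the hypothetical target covers less than half a cell, whereas at stride 4 it still spans three cells, which is why the finer grid retains more of its spatial detail.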
The incorporation of the Tiny Detection Head further enhances A2A-YOLO’s capability in detecting extremely tiny targets, significantly mitigating information loss caused by downsampling operations. This improvement substantially boosts the model’s robustness in air-to-air UAV object detection.
4. Experiments
4.1. Implementation Details
4.1.1. Datasets
We conduct an extensive evaluation on the Det-Fly dataset [10], which is the most representative dataset for air-to-air UAV object detection. Det-Fly contains 13,271 images of a flying target UAV acquired by another flying UAV. Featuring diverse backgrounds spanning fields, urban zones, skies, and mountains, the dataset presents complex imaging scenarios including variable lighting conditions, motion blur effects, and partial truncations. The UAV data exhibits multiple viewing angles, along with significant variations in both size and spatial distribution. Notably, we maintained strict compliance with the originally published official splits for all dataset partitions.
Furthermore, to explore the generalization of A2A-YOLO in infrared air-to-air small target detection, we conducted further studies on the RealScene-ISTD [37] and IRSTD-1K [38] datasets. The IRSTD-1K dataset consists of 1000 real UAV images characterized by diverse object morphologies, varying target scales, and complex background clutter. The RealScene-ISTD dataset contains 739 real UAV images featuring substantial target size variations in highly noisy and cluttered environments. Since the two infrared small target detection datasets mentioned above use annotations in mask format, we converted the annotations into the bounding box format.
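One plausible way to perform such a mask-to-box conversion is sketched below. The paper does not describe the conversion procedure, so the 8-connected component grouping, the normalized (center, width, height) output format, and the function names are all assumptions made for illustration.

```python
def mask_to_boxes(mask):
    """Return one (x_min, y_min, x_max, y_max) box per 8-connected foreground blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill the blob while tracking its extreme coordinates.
            stack = [(y, x)]
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while stack:
                cy, cx = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes

def to_normalized_xywh(box, img_w, img_h):
    """Convert inclusive pixel corners to normalized (center_x, center_y, width, height)."""
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0 + 1, y1 - y0 + 1
    return ((x0 + bw / 2) / img_w, (y0 + bh / 2) / img_h, bw / img_w, bh / img_h)

# Toy 4 x 6 mask containing two separate targets.
mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
boxes = mask_to_boxes(mask)
labels = [to_normalized_xywh(b, img_w=6, img_h=4) for b in boxes]
```

Grouping by connected component keeps multiple targets in one image as separate boxes, which matters for infrared scenes that contain several small targets.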
4.1.2. Settings
Throughout all experiments, the small-scale variants of the comparative models were employed. All of the comparative models and our proposed network were trained on a platform equipped with an Intel Xeon Silver 4410 CPU and an NVIDIA GeForce RTX 4090 GPU. The computational framework employed Python 3.8 with PyTorch 2.4.1, with each model trained for 300 epochs using input images resized to 640 × 640 pixels. The batch size was set to 64 for the RGB dataset and 16 for the IR datasets.
For inference speed evaluation, the models were further tested on a platform equipped with an Intel Xeon E5-2760 v3 CPU and an NVIDIA Tesla V100 SXM2 GPU and on the RK3588 platform, in addition to the original training platform.
4.2. Evaluation Metrics
4.2.1. Precision
Precision ($P$) measures how accurate the model’s predictions are.
$$P = \frac{TP}{TP + FP}$$
where $TP$ is the number of correct detections, and $FP$ is the number of wrong or redundant detections. It is worth noting that in this experiment, a correct prediction is defined as one with an IoU greater than 0.5. Any prediction failing to meet this threshold is regarded as a false detection.
4.2.2. Recall
Recall ($R$) measures how well the model finds all ground-truth objects.
$$R = \frac{TP}{TP + FN}$$
where $FN$ is the number of ground-truth objects that were not matched to any detection.
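The two definitions can be illustrated with a small matching routine. This is a minimal sketch rather than the evaluation code used in the experiments: the greedy one-to-one matching strategy, the `(x_min, y_min, x_max, y_max)` box format, and the example boxes are assumptions. The IoU threshold of 0.5 follows the text.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, gts, thr=0.5):
    """Greedy matching; preds are assumed sorted by descending confidence."""
    matched = set()
    tp = fp = 0
    for p in preds:
        best_j, best_iou = -1, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            v = iou(p, g)
            if v > best_iou:  # strictly greater than the threshold
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1  # wrong or redundant detection
    fn = len(gts) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, precision, recall

# Hypothetical example: one good detection, one spurious one, one missed target.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]
result = precision_recall(preds, gts)
```

In this example the first prediction overlaps its target with an IoU of 0.81 and counts as a true positive, the second matches nothing, and the second ground-truth box is missed, so both precision and recall equal 0.5.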
4.2.3. Average Precision
Average Precision ($AP$) summarizes model performance across all confidence thresholds. It is the area under the Precision-Recall curve:
$$AP = \int_0^1 P(R)\,dR$$
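The area under the Precision-Recall curve is usually approximated from a ranked list of detections. The sketch below uses all-point interpolation with a monotone precision envelope. This interpolation scheme is an assumption, since the text does not state which variant was used, and the true-positive flags are hypothetical.

```python
def average_precision(is_tp, num_gt):
    """AP from true-positive flags of detections sorted by descending confidence."""
    if num_gt == 0 or not is_tp:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for flag in is_tp:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Hypothetical ranking: hit, miss, hit, with two ground-truth objects in total.
ap = average_precision([True, False, True], num_gt=2)
```

Sweeping the ranked list from the most to the least confident detection corresponds to lowering the confidence threshold, which is how a single number summarizes performance across all thresholds.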
4.2.4. Computation Cost Metrics
In addition to the model’s performance metrics, we also investigated computation cost metrics, including parameters (Params), floating point operations (FLOPs), model size, and frames per second (FPS).
4.3. Comparison to the Advanced Methods
To comprehensively evaluate the performance of A2A-YOLO, we conducted comparative experiments on the Det-Fly dataset with advanced general real-time object detection methods, including YOLOv5 [13], YOLOv8 [14], YOLOv9 [20], YOLOv10 [19], YOLO11 [15], YOLOv12 [18], YOLOv13 [16], RT-DETR [39], and YOLOv8-DETR [39]. We also present the results of Faster R-CNN [25], a two-stage detection network, on the Det-Fly dataset, following the setup in [10].
Table 1 demonstrates that our method, while maintaining a competitive number of parameters, model size, and inference speed, achieves superior precision, recall, and average precision in air-to-air UAV object detection. Outperforming the strongest baseline on the Det-Fly dataset, the A2A-YOLO model achieves a remarkable speed-accuracy trade-off, evidenced by its compact size and satisfactory inference efficiency on diverse platforms. The A2A-YOLO architecture achieves real-time inference at 15 FPS when deployed on the RK3588 edge computing platform with a 6 TOPS NPU.
To assess the inference capabilities of our method across diverse scenarios, we manually categorized the test set of Det-Fly into field, urban, sky, and mountain scenes based on the examples provided by the dataset authors. Additionally, we screened the test set for strong/weak light, motion blur, and partial truncation conditions to ensure a comprehensive evaluation.
As shown in Table 2, our A2A-YOLO achieves remarkable performance across all tested backgrounds and under a wide range of testing conditions. Specifically, under strong/weak light and motion blur conditions, our model significantly outperforms competing methods. Compared to the best-performing baseline, our method also improves on urban scenes with cluttered backgrounds, sky scenes containing tinier targets, and mountain scenes where targets and backgrounds have similar colors.
Addressing the key concerns in our model design, notably illumination variations and motion blur, our model surpasses the best-performing baseline by significant margins under both the strong/weak light and motion blur tests (Table 2). This performance further validates the effectiveness of the LECA-Conv module in handling illumination variations and motion blur, as well as the rationality of the overall A2A-YOLO architecture. Based on the evaluation results, our method demonstrates above-average performance in handling partial truncation scenarios, indicating its capability to effectively address this specific challenge.
The comparative inference results for RGB UAV detection are presented in Figure 5. The comparative analysis in the figure demonstrates the superior performance of A2A-YOLO over the YOLO11 baseline in handling tiny objects, motion blur, strong/weak light, and complex background conditions. A2A-YOLO shows robust alignment with the ground truth in all cases. Notably, under strong/weak light conditions, YOLO11 exhibits false detections, whereas A2A-YOLO accurately identifies similar targets, further validating the rationality of its model design and the precision of its inference.
4.4. Ablation Study
We first conduct a comprehensive evaluation of the proposed LECA-Conv in A2A-YOLO to validate its efficacy. Table 3 demonstrates that our proposed LECA-Conv module outperforms the SEBlock-based [34] ISE-Conv module across all configurations of the YOLO11 network, regardless of whether the tiny detection head (TDH) is employed. These results provide strong empirical support for our hypothesis that channel attention mechanisms should process features beyond just the original feature maps. The superior performance of LECA-Conv confirms its effectiveness in simultaneously capturing channel attention while enhancing local spatial information, thereby ensuring more precise feature representation for subsequent network layers.
Based on the performance of PLECA-Conv, we conducted a comprehensive re-examination of this module. Our analysis reveals that while the adaptive shortcut aims to enhance intrinsic image information, it inadvertently increases computational overhead without yielding proportional accuracy gains. Furthermore, the parallel convolutional layers introduce additional uncertainty during model training, likely due to optimization conflicts between the shortcut and the attention branches. Consequently, the performance of PLECA-Conv slightly lagged behind the standard LECA-Conv in certain configurations. This finding underscores the need for careful calibration of residual connections in specialized UAV detection networks, paving the way for our future research on stabilizing lightweight module training.
Subsequently, we performed comprehensive ablation studies, recorded in Table 4, to evaluate the individual contributions of A2A-YOLO’s modules: LECA-Conv (LECA), tiny detection head (TDH), and GhostModulev2 (GMv2). Relative to the baseline, our model improves precision, recall, and average precision while also reducing model complexity and remaining competitive on the other computation cost metrics.
Experimental results demonstrate that both LECA-Conv and the Tiny Detection Head contribute significantly to model performance improvement. Notably, the Tiny Detection Head yields the most substantial individual gain. This is primarily because, in air-to-air scenarios, target UAVs are extremely small and occupy only a minimal portion of the image pixels. By operating on 4× downsampled feature maps, the Tiny Detection Head effectively preserves the spatial details lost in deeper layers, making it indispensable for detecting such tiny objects. In contrast, LECA-Conv focuses on enhancing feature robustness, specifically addressing the challenges of motion blur and illumination variations by recalibrating channel-wise and local features without imposing heavy computational burdens. The introduced GhostModulev2 further reduces model complexity while maintaining accuracy. Collectively, these three modules achieve substantial improvements in both computational efficiency and model accuracy.
4.5. Generalization Experiments on Infrared Datasets
To further evaluate our model’s generalization capability, we conducted experiments on two infrared small target datasets, RealScene-ISTD [37] and IRSTD-1K [38], which primarily consist of air-to-air scenarios with significant noise interference. Notably, the A2A-YOLO model requires no additional configuration for training and inference on infrared images. The experimental results are presented in Table 5.
The proposed A2A-YOLO model demonstrates superior performance on two infrared benchmark datasets, outperforming all compared models, including the baseline YOLO11. Table 5 reports its results on both the RealScene-ISTD and IRSTD-1K datasets. By effectively mitigating challenges such as motion blur and enabling the detection of tiny objects in infrared imagery, the model demonstrates robust performance. Thus, these results substantiate the soundness of its overall design, which in turn confirms the efficacy of its specialized components.
Notably, despite utilizing identical training configurations, RT-DETR [39] exhibits a marked degradation in generalization on the infrared benchmarks. We attribute this failure to the limited scale of the IR datasets, which impedes the convergence of data-hungry Transformer architectures. Specifically, the global attention mechanism struggles to localize tiny targets lacking distinctive texture, leading to ineffective representation learning under sparse supervision. In contrast, CNN-based models like A2A-YOLO possess inherent inductive biases that facilitate robust feature extraction even with scarce training data, thereby maintaining high detection accuracy.
Figure 6 presents a comparative analysis of infrared small target detection results under challenging conditions. The evaluation includes two representative scenarios: one image from the RealScene-ISTD dataset, where the target suffers from motion blur and closely resembles background clouds, and another from the IRSTD-1K dataset, which contains multiple minuscule targets. In these demanding cases, our proposed A2A-YOLO model consistently generates inference results that align closely with the ground truth, demonstrating superior performance over the baseline YOLO11. Specifically, the model exhibits robustness in handling motion blur, accurately detecting extremely tiny targets, and reliably identifying multiple targets within a single image. These capabilities collectively validate the effectiveness of our novel architectural design.
Our results demonstrate that the proposed model A2A-YOLO not only achieves remarkable performance in RGB-based air-to-air UAV detection, but also maintains high detection capability and accuracy in infrared small target detection scenarios, which are exceptionally challenging due to increased noise, reduced spectral channels, and limited texture information. This provides more possibilities for air-to-air UAV object detection.
5. Conclusions
In this paper, we propose A2A-YOLO, a lightweight and highly accurate object detection network designed for air-to-air UAV scenarios. The model effectively enhances detection performance for challenges such as motion blur, illumination variations, and tiny targets, while maintaining real-time inference speed under resource-constrained conditions. Specifically, our main contributions include: the novel LECA-Conv module, which enhances local features and channel significance with minimal computational overhead without directly applying attention parameters to the original feature maps; the incorporation of GhostModulev2 and a dedicated Tiny Detection Head to strengthen small target detection capability while preserving model efficiency; and comprehensive evaluations in both RGB and infrared domains, validated on the RK3588 edge computing platform.
According to the extensive evaluations on the Det-Fly dataset, A2A-YOLO outperforms YOLO11 in precision, recall, and average precision. The proposed dedicated network for precise air-to-air UAV object detection demonstrates outstanding performance across diverse backgrounds and challenging conditions including motion blur and illumination variations. The model achieves real-time detection at 15 FPS on the RK3588 platform while delivering remarkable performance in infrared small target detection. The viewpoint that the attention parameters need not be directly applied to the original feature maps has been experimentally validated, which provides new insights for subsequent research.