A Dedicated Lightweight Network with Synergistic Attention for Precise Air-to-Air UAV Detection

Han, Xinheng; Zhang, Haoyuan; Wang, Jiacheng; Xu, Jielei; Lv, Yingjie; Zhou, Zunning; Feng, Xiaoxue; Pan, Feng

doi:10.3390/rs18111804

Open Access Article


by Xinheng Han ¹, Haoyuan Zhang ¹, Jiacheng Wang ¹, Jielei Xu ¹, Yingjie Lv ¹, Zunning Zhou ², Xiaoxue Feng ¹ and Feng Pan ¹,*

¹ School of Automation, Beijing Institute of Technology, Beijing 100081, China
² School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1804; https://doi.org/10.3390/rs18111804

Submission received: 23 April 2026 / Revised: 24 May 2026 / Accepted: 28 May 2026 / Published: 2 June 2026


Highlights

What are the main findings?

The proposed A2A-YOLO model integrates the novel LECA-Conv module with GhostModulev2 and a Tiny Detection Head, achieving a superior balance between high detection accuracy on challenging air-to-air UAV benchmarks and real-time performance on the RK3588 edge platform.
Based on the insight that attention parameters need not be directly applied to original features, we design the LECA-Conv module to enable efficient local enhancement and channel-wise highlighting, thus significantly improving feature extraction for illumination-variant and motion-blurred objects at a low computational cost.

What is the implication of the main finding?

It provides a practical and efficient architectural solution for precise unmanned aerial vehicle (UAV) detection in real-world air-to-air scenarios, both in the visible and infrared spectra, directly addressing critical challenges like motion blur, illumination variations, and tiny targets under strict computational constraints for onboard deployment.
The design principle of LECA-Conv offers a new, efficient paradigm for integrating attention and local feature enhancement, which could inspire the development of more lightweight and robust vision models for other edge-based applications.

Abstract

The rapid advancement of unmanned aerial vehicle (UAV) technology has made air-to-air UAV object detection increasingly essential. However, this task faces additional challenges, including small target sizes, motion blur, illumination variations, and stringent real-time performance requirements under constrained computational resources. To address these challenges, this paper proposes A2A-YOLO, a specialized detection model that introduces LECA-Conv for local and channel feature enhancement to mitigate motion blur and illumination variations, while incorporating GhostModulev2 for efficient feature extraction and Tiny Detection Heads for improved small target recognition. The proposed LECA-Conv module operates on the principle that attention parameters need not directly modify original feature maps, a key insight validated through extensive experiments. Evaluations on the Det-Fly dataset demonstrate A2A-YOLO’s superior performance with 85.0% precision ($P_{P}$), 80.7% recall ($P_{R}$), and 81.9% average precision ($AP$), outperforming YOLO11 by 0.9%, 8.4%, and 6.5%, respectively. The proposed method performs well across diverse backgrounds and challenging conditions including motion blur and illumination variations. The model achieves real-time detection at 15 FPS on the RK3588 platform while also delivering strong performance in infrared small target detection.

Keywords:

object detection; air-to-air; UAV; attention mechanism; motion blur; illumination variations

1. Introduction

With the rapid development of unmanned aerial vehicle (UAV) technology, air-to-air UAV object detection has emerged as a critical task for applications such as autonomous swarm coordination [1], aerial surveillance [2], obstacle avoidance [3], and anti-UAV systems [4].

Unlike ground-based or ground-to-air scenarios, air-to-air UAV object detection is confronted with three interconnected challenges that collectively define its unique difficulty. First, at the scene level, targets often appear at extremely small scales, exhibit dynamic and unpredictable motion patterns, and are set against complex backgrounds such as urban skylines or natural terrain. These factors severely degrade the performance of conventional detection methods that rely on handcrafted features. Second, at the performance level, the application scenario demands high real-time processing capabilities. Third, at the hardware level, the computational capacity of onboard edge devices in drones is severely constrained, imposing rigorous lightweight requirements on the model design. These challenges, particularly the prevalence of motion blur and significant illumination variations during flight, severely degrade the performance of conventional object detection models.

Motion blur arises inherently from the high relative velocities between UAVs. When a target UAV moves rapidly across the field of view of the detecting UAV, or when the detecting UAV itself undergoes agile maneuvers, the integration time of the camera sensor results in a smeared, low-contrast representation of the target [5,6]. This blurring effect significantly reduces the discriminative power of local texture and edge features [7], which are crucial for tiny UAV object detection.

Illumination variations are equally problematic. UAVs operate under diverse lighting conditions, ranging from strong sunlight to deep shadows, and from uniform overcast skies to complex, high-contrast urban skylines. These variations can cause drastic changes in the target’s apparent brightness and color, making it difficult for models to learn robust appearance invariance [8]. Situations may even arise in which it becomes difficult to distinguish between the target and the background [9]. Empirical evidence from datasets like Det-Fly [10] confirms that detection accuracy drops substantially in sequences exhibiting motion blur or extreme lighting (e.g., strong backlighting), highlighting these as critical bottlenecks for reliable air-to-air UAV object detection.

While motion blur and illumination variations present prominent challenges in air-to-air UAV detection, the effective detection of tiny targets and the management of computational costs must also be critically considered. On one hand, during air-to-air UAV detection, the significant detection distance often results in the target UAV occupying only a minimal pixel area within the image [11]. This leads to extremely weak appearance and textural features, making it difficult to distinguish the target from background noise [12]. On the other hand, severe computational constraints arise from the limited processing power, memory, and power consumption of the UAV’s onboard edge devices. This necessitates that models maintain high accuracy while simultaneously meeting stringent requirements for lightweight design and real-time performance, thereby imposing significant demands on the model’s computational efficiency and practical deployment feasibility.

While recent deep learning models, particularly the YOLO series [13,14,15,16,17,18,19,20], have shown promise due to their real-time capabilities, they often struggle with these specific aerial challenges. Standard convolutional operations and conventional attention mechanisms (e.g., CBAM [21]) are not explicitly designed to counteract the signal degradation caused by blur or illumination shifts. Although some works incorporate multi-scale fusion or lightweight backbones to handle small targets and computational constraints, a dedicated architectural solution for mitigating motion blur and illumination effects in the feature learning stage remains largely unexplored. There is a critical need for a model that can simultaneously enhance feature robustness against these degradations while maintaining a lightweight footprint for onboard deployment.

To address these challenges, we developed a specialized air-to-air UAV detection model based on the advanced YOLO11 architecture. The proposed framework integrates LECA-Conv modules to enhance both channel-wise relationships and local feature representation, effectively addressing motion blur and illumination variations. GhostModulev2 is employed to achieve efficient feature extraction while maintaining model compactness, and Tiny Detection Heads are incorporated to specifically improve small object detection capabilities. This comprehensive design enables the model to achieve high detection accuracy while maintaining superior real-time performance.

To be specific, the main contributions of this paper are summarized as follows:

(1): A Novel Lightweight Network for UAV Detection. We propose A2A-YOLO, a dedicated architecture for air-to-air UAV detection that achieves an exceptional balance between accuracy and efficiency. It is engineered to effectively tackle critical challenges in aerial imagery, including motion blur, illumination variations, and the precise detection of tiny targets, all while ensuring real-time performance under resource-constrained conditions.
(2): An Efficient Attention Mechanism (LECA-Conv). We design the LECA-Conv module, based on the insight that attention parameters need not be directly applied to the original feature maps. This design enables local feature enhancement and channel-wise highlighting at a low computational cost, leading to significantly improved feature extraction capability for abnormally illuminated and motion-blurred objects.
(3): Architectural Optimizations for Lightweight and Tiny Targets. We strategically incorporate GhostModulev2 and a Tiny Detection Head, which effectively maintain the model’s lightweight characteristics while significantly enhancing its capability for tiny target detection.
(4): Comprehensive Validation and Practical Deployment. We conducted comprehensive evaluations of A2A-YOLO in RGB and IR air-to-air UAV object detection scenarios, encompassing diverse backgrounds and varying complexity conditions. The model’s reliability was further verified through inference testing on the RK3588 edge computing platform.

2. Related Works

Traditional machine learning methods offer advantages in air-to-air UAV detection due to rapid inference. However, their performance is unsatisfactory on datasets with complex backgrounds, and they rely heavily on handcrafted features. Kassab et al. [22] compared the detection performance of Support Vector Machines (SVMs) [23] and Random Forest (RF) [24] combined with the Histogram of Oriented Gradients (HOG) on a drone dataset. Although their proposed modified Non-Maximum Suppression (NMS) algorithm led to a 25% improvement in model precision, the detection accuracy of both methods remained at a relatively low level.

Recent advances in deep learning have shown promise for air-to-air UAV object detection. Two-stage detectors like Faster R-CNN [25] and single-stage models like the YOLO series [13,14,15,16,17,18,19,20] have been adapted to aerial imagery, with modifications like multi-scale feature fusion and lightweight backbones to address scale variation and computational constraints. Zheng et al. [10] compared eight models including Cascade R-CNN [26], Faster R-CNN [25] and YOLOv3 [17] for air-to-air UAV detection, demonstrating that Cascade R-CNN achieves the highest accuracy but the poorest real-time performance, while YOLOv3 offers lower accuracy but superior real-time capability.

A significant branch of research has focused on pushing the boundaries of detection accuracy. Cai et al. [27] proposed an EA-DINO network with EA-FPN optimization that achieves significant accuracy improvements, but its computational complexity prevents real-time operation on resource-limited edge devices. The pursuit of high model accuracy often compromises real-time performance, severely limiting its practical utility in air-to-air scenarios. Conversely, another research direction prioritizes computational efficiency and model lightweighting. Cheng et al. [28] developed a lightweight YOLOv5s-NGN architecture incorporating a CF2-MC module for streamlined feature extraction and an MG module for complexity reduction, achieving real-time performance on edge devices while exhibiting compromised accuracy that fails to meet operational requirements.

The unique challenges of air-to-air detection, namely motion blur and illumination variations, have been individually studied in the broader context of computer vision. For motion blur, Gong et al. [29] transformed the problem of blur removal into motion flow estimation, achieving image restoration via an end-to-end pixel-level motion flow estimation neural network. Novel deep learning architectures such as MPRNet [30] and MIRNet [31] effectively balance contextual information and spatial details through multi-stage progressive design or feature fusion mechanisms, demonstrating significant performance improvements in tasks like image deblurring and denoising. However, while these general methods enhance image restoration quality, they fail to account for the morphological characteristics of UAV targets and real-time requirements. Moreover, these approaches are often employed as preprocessing steps and lack end-to-end collaborative optimization with detection networks.

For illumination variations, Jiang et al. [32] proposed a Self-Regularized Attention Mechanism that leverages the intrinsic brightness information of the input image to dynamically adjust the enhancement intensity, thereby achieving differentiated processing for distinct regions and addressing spatially varying illumination issues. Huang et al. [33] developed a bottom-up attention network that effectively compensates for weak illumination, leading to high-quality image enhancement while avoiding over-enhancement.

While these methods have enhanced the model’s robustness to illumination variations to some extent, they often introduce significant computational overhead [32,33]. Furthermore, methods focusing on motion deblurring are typically designed as heavy pre-processing steps, which not only lack end-to-end synergy with detection networks but also fail to account for the specific morphological characteristics of tiny UAV targets in air-to-air scenarios [30,31]. Critically, few studies have successfully integrated solutions for the domain-specific challenges of motion blur and illumination variation into a single, lightweight, and efficient architecture. The lack of a unified framework that can simultaneously mitigate signal degradation and respect strict computational constraints remains a critical bottleneck. Therefore, how to simultaneously address the problem of illumination variation and motion blur, and integrate such solutions into air-to-air drone detection networks, remains a critical challenge that requires urgent resolution.

In summary, the current research landscape presents a dichotomy between the pursuit of accuracy at the cost of efficiency and the pursuit of lightness at the cost of performance. Few studies have successfully integrated solutions for the domain-specific challenges of motion blur and illumination variation into a lightweight, efficient architecture without compromising on accuracy. This gap underscores the necessity for a dedicated model like A2A-YOLO, which is designed from the ground up to navigate these trade-offs and directly address the core challenges of air-to-air UAV detection.

3. Methods

In this section, we propose a high-performance and real-time A2A-YOLO model specifically designed for air-to-air UAV object detection, with its architecture illustrated in Figure 1. The A2A-YOLO model is based on the YOLO11 framework and primarily consists of three key components: Local Enhanced Channel Attention Convolution (LECA-Conv), GhostModulev2, and Tiny Detection Head.

As illustrated in Figure 1, the A2A-YOLO model integrates several modules from YOLO11. Pyramid Split Attention (PSA) enhances feature representation by capturing multi-scale contextual information through a pyramid splitting strategy, while C2PSA (Cross-stage Partial connections with PSA) combines the partial connections of CSPNet with PSA to facilitate gradient flow and reduce computational cost. C3k2 serves as a variant of the standard C3 block utilizing specific kernel configurations for efficient feature transformation. Spatial Pyramid Pooling–Fast (SPPF) replaces the conventional SPP with a serial pooling structure to rapidly aggregate multi-scale spatial features.

The LECA-Conv module is designed to emphasize channel-wise and local feature importance within a lightweight structure, addressing challenges such as motion blur and abnormal lighting conditions. The GhostModulev2 serves as a lightweight architecture that enhances both computational efficiency and feature representation capabilities to meet high real-time requirements. The Tiny Detection Head is specifically optimized to handle significant variations in target scales, thereby enabling effective detection of tiny objects. Based on the above architecture, A2A-YOLO effectively addresses the key challenges in air-to-air UAV object detection.

3.1. Local Enhanced Channel Attention Module

For current attention mechanism methods, given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, the general representation form of the output is:

$$Output(X) = F_{Attention}(X) \odot X, \tag{1}$$

where $F_{Attention}(X)$ represents the attention weights obtained from $X$, and $\odot$ denotes element-wise multiplication. It can be observed that existing attention mechanisms essentially assign weights to the original feature map to generate attention-enhanced results.

However, we argue that the attention parameters need not be directly applied to the original feature maps. Attention mechanisms represented by the CBAM network [21], which sequentially connect Channel Attention Modules and Spatial Attention Modules, introduce significant redundant computations during model operation. In the context of high real-time air-to-air UAV object detection scenarios, where attention mechanisms should focus on local information and channel contributions, we propose a Local Enhanced Channel Attention Convolution (LECA-Conv) as shown in Figure 2.

After the convolution and batch normalization operations for downsampling, the model proceeds to the local enhanced branch and channel attention gate. In the local enhanced branch, a 1 × 1 convolution is introduced to adjust channel dimensions and reduce parameter count, followed by grouped convolutions applied along the channel dimension to perform local enhancement with small receptive fields.

$$F_{LocalEnhanced}(X) = \sigma(\gamma(W_{2} \circledast \sigma(\gamma(W_{1} \circledast X)))), \tag{2}$$

where $W_{1} \in \mathbb{R}^{c_{mid} \times c_{2} \times 1 \times 1}$ and $W_{2} \in \mathbb{R}^{c_{2} \times c_{mid} \times 3 \times 3}$ are the weights of the two convolution layers, $\circledast$ denotes the convolution operation, $\gamma$ represents the BN operation, and $\sigma$ signifies the activation function.

The channel attention gate generates channel weights through average pooling, adjusts these weights via two 1 × 1 convolutional layers with activation functions to introduce nonlinearity, and thereby reallocates channel attention.

$$F_{ChannelAttention}(X) = \sigma(W_{2}' \circledast \sigma(W_{1}' \circledast G_{avg}(X))), \tag{3}$$

where $W_{1}' \in \mathbb{R}^{c_{mid} \times c_{2} \times 1 \times 1}$ and $W_{2}' \in \mathbb{R}^{c_{2} \times c_{mid} \times 1 \times 1}$ are the weights of the two convolution layers, and $G_{avg}(\cdot)$ signifies the average pooling operation.

Notably, the intermediate channels ($c_{mid}$) in both the channel attention gate and local enhanced branch maintain consistent design dimensions. This deliberate architectural choice preserves coherence in channel transformations while synergistically reinforcing both channel attention mechanisms and local feature enhancement, ultimately facilitating more stable model training.
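To give a rough sense of how the shared $c_{mid}$ controls the overhead of the two branches, the sketch below counts only convolution weights (no biases, no BN parameters). The channel sizes and the `groups` value are hypothetical, since the text does not state them. With grouping, each 3 × 3 filter is assumed to see only $c_{mid}/groups$ input channels.

```python
def leca_branch_params(c2: int, c_mid: int, groups: int = 1) -> dict:
    """Count conv weights in the local enhanced branch and the channel gate.

    `groups` is an assumed hyperparameter of the grouped 3x3 convolution.
    """
    if c_mid % groups or c2 % groups:
        raise ValueError("groups must divide both c_mid and c2")
    local_1x1 = c_mid * c2 * 1 * 1                # W1: c2 -> c_mid
    local_3x3 = c2 * (c_mid // groups) * 3 * 3    # W2: c_mid -> c2 (grouped)
    gate = c_mid * c2 + c2 * c_mid                # W1', W2': two 1x1 convs
    return {"local": local_1x1 + local_3x3, "gate": gate}


def plain_conv3x3_params(c2: int) -> int:
    """Weights of one ordinary c2 -> c2 3x3 convolution, for comparison."""
    return c2 * c2 * 3 * 3


# Hypothetical sizes chosen only for illustration.
c2, c_mid = 128, 32
counts = leca_branch_params(c2, c_mid, groups=c_mid)
overhead = (counts["local"] + counts["gate"]) / plain_conv3x3_params(c2)
print(counts, round(overhead, 3))
```

Under these assumed sizes, both branches together cost a small fraction of one additional full 3 × 3 convolution. The exact figure depends on the real $c_{mid}$ and grouping used by the authors.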

The adaptive shortcut is optional, and we refer to the LECA-Conv with an adaptive shortcut as PLECA-Conv. For the input feature map $X$, after passing through the convolution with weights $W_{0}$, the BN layer, and the activation function, it becomes $X' = \sigma(\gamma(W_{0} \circledast X))$. The $Output'$ after LECA-Conv processing is:

$$Output'(X) = F_{ChannelAttention}(X') \odot F_{LocalEnhanced}(X') + X'. \tag{4}$$

To better illustrate the operational workflow of LECA-Conv, we present the pseudocode of the Local Enhanced Channel Attention Algorithm in Algorithm 1.

To further compare our model with the original attention mechanism, we present a comparative diagram of the ISE-Conv (an architectural improvement based on SEBlock [34] adapted to our design framework), along with our proposed LECA-Conv and PLECA-Conv, as shown in Figure 3.

The key distinction between our designed LECA-Conv and conventional attention mechanisms lies in applying attention weights to the local enhanced branch rather than the original feature maps. This innovative approach simultaneously captures channel-wise attention while enhancing local importance, maintaining computational efficiency while significantly boosting attention effectiveness. Consequently, it achieves superior feature representation for air-to-air UAV targets.

Algorithm 1: Local Enhanced Channel Attention Algorithm
Input: Feature Map $X_{input} \in \mathbb{R}^{B \times C_{1} \times H_{1} \times W_{1}}$
Output: Feature Map $X_{output} \in \mathbb{R}^{B \times C_{2} \times H_{2} \times W_{2}}$
STEP I: Standard Convolution
$X_{conv} \leftarrow \mathrm{Conv}_{3 \times 3}(X_{input}, C_{1} \to C_{2})$
$X_{bn} \leftarrow \mathrm{BatchNorm}(X_{conv})$
$X_{main} \leftarrow \mathrm{SiLU}(X_{bn})$
STEP II-A: Local Enhanced Branch
$M_{leb} \leftarrow \mathrm{Conv}_{1 \times 1}(X_{main}, C_{2} \to C_{mid})$
$M_{leb} \leftarrow \mathrm{BatchNorm}(M_{leb})$
$M_{leb} \leftarrow \mathrm{SiLU}(M_{leb})$
$M_{leb} \leftarrow \mathrm{GroupConv}_{3 \times 3}(M_{leb}, C_{mid} \to C_{2})$
$M_{leb} \leftarrow \mathrm{BatchNorm}(M_{leb})$
$M_{leb} \leftarrow \mathrm{SiLU}(M_{leb})$
STEP II-B: Channel Attention Branch
$G_{cab} \leftarrow \mathrm{GlobalAvgPool}(X_{main})$
$G_{cab} \leftarrow \mathrm{Conv}_{1 \times 1}(G_{cab}, C_{2} \to C_{mid})$
$G_{cab} \leftarrow \mathrm{SiLU}(G_{cab})$
$G_{cab} \leftarrow \mathrm{Conv}_{1 \times 1}(G_{cab}, C_{mid} \to C_{2})$
$G_{cab} \leftarrow \mathrm{Sigmoid}(G_{cab})$
STEP III: Feature Fusion
$A \leftarrow M_{leb} \otimes G_{cab}$ // Element-wise Multiplication
$X_{output} \leftarrow X_{main} + A$ // Residual Connection
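The channel attention branch and the fusion step of Algorithm 1 can be traced on a toy tensor with plain Python lists. This is a minimal sketch, not the authors' implementation. The local enhanced features `m_leb` are taken as given (the 1 × 1 and grouped 3 × 3 convolutions are omitted), biases are dropped, and all weights and values are made up for illustration.

```python
import math


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def channel_gate(x_main, w1, w2):
    """STEP II-B: x_main is [C][H][W]; w1 is [c_mid][C]; w2 is [C][c_mid]."""
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x_main]
    hidden = [silu(sum(w * p for w, p in zip(row, pooled))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]


def leca_fuse(x_main, m_leb, gate):
    """STEP III: the gate scales the local branch; x_main is only added."""
    return [
        [[g * m + x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(m_ch, x_ch)]
        for g, m_ch, x_ch in zip(gate, m_leb, x_main)
    ]


def conventional_attention(x_main, gate):
    """Equation (1) for contrast: the gate rescales the original features."""
    return [[[g * x for x in row] for row in ch] for g, ch in zip(gate, x_main)]


# Toy example: C = 2 channels, 2 x 2 spatial size, c_mid = 1 (all values invented).
x_main = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
m_leb = [[[0.1, 0.0], [0.0, 0.2]], [[1.0, 1.0], [1.0, 1.0]]]
w1 = [[0.5, -0.5]]
w2 = [[1.0], [-1.0]]

gate = channel_gate(x_main, w1, w2)
out = leca_fuse(x_main, m_leb, gate)
```

When the local branch outputs zeros, `leca_fuse` returns `x_main` unchanged. This is the practical meaning of not applying attention weights to the original feature map. `conventional_attention` would instead attenuate every channel by its gate value.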

The architectural consistency in middle channels between both branches facilitates more stable training convergence, resulting in superior model fitting performance specifically optimized for air-to-air UAV object detection scenarios. This approach demonstrates enhanced robustness against common challenges such as motion blur while maintaining reliable performance across diverse lighting conditions.

3.2. Lightweight Feature Extraction Convolutional Module

To further enhance the real-time performance of A2A-YOLO, we incorporate a lightweight feature extraction module called GhostModulev2 [35]. This module effectively maintains the model’s inference accuracy and target feature extraction capability while significantly reducing model complexity. The implementation substantially improves the model’s inference efficiency for deployment on airborne embedded edge devices, ensuring stable real-time operation even under constrained computational resources.

The structure of GhostModulev2 is illustrated in Figure 4, which consists of two GhostConv modules and a Decoupled Fully Connected (DFCConv) module.

The GhostConv is an efficient lightweight convolutional operation, whose core idea is to replace the intensive computation in traditional convolution with inexpensive linear transformations to generate redundant features. For an input feature map $Y \in \mathbb{R}^{B \times C \times H \times W}$, with the output feature map $Y' \in \mathbb{R}^{B' \times C' \times H' \times W'}$, the computational cost of a standard convolution with kernel size $k$ is:

$$FLOPs_{standard} = 2 \cdot B \cdot C \cdot C' \cdot H' \cdot W' \cdot k^{2}. \tag{5}$$

For GhostConv, whose Ghost path uses a depthwise convolution with kernel size $d$ and whose number of transformations is $s$, the computational cost is:

$$FLOPs_{ghost} = \frac{2}{s} \cdot B \cdot C \cdot C' \cdot H' \cdot W' \cdot k^{2} + \frac{(s - 1)}{s} \cdot B \cdot C' \cdot H' \cdot W' \cdot d^{2}. \tag{6}$$

It can be easily calculated that the FLOPs of GhostConv are approximately $\frac{1}{s}$ of those of a standard convolution, because the depthwise term in Equation (6) is negligible relative to the first term whenever $2 C k^{2} \gg d^{2}$. This efficiency makes GhostConv well suited to feature extraction in resource-constrained air-to-air UAV object detection scenarios.
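As a quick numerical check of the $\frac{1}{s}$ claim, the sketch below evaluates Equations (5) and (6) exactly as written above. The layer shape, kernel sizes, and $s = 2$ are hypothetical values chosen for illustration.

```python
def flops_standard(B, C, C_out, H_out, W_out, k):
    """Equation (5): cost of a standard k x k convolution."""
    return 2 * B * C * C_out * H_out * W_out * k ** 2


def flops_ghost(B, C, C_out, H_out, W_out, k, d, s):
    """Equation (6): primary convolution plus cheap depthwise transformations."""
    primary = (2 / s) * B * C * C_out * H_out * W_out * k ** 2
    cheap = ((s - 1) / s) * B * C_out * H_out * W_out * d ** 2
    return primary + cheap


# Hypothetical layer: 64 -> 128 channels on an 80 x 80 output map.
shape = dict(B=1, C=64, C_out=128, H_out=80, W_out=80)
std = flops_standard(**shape, k=3)
ghost = flops_ghost(**shape, k=3, d=3, s=2)
ratio = ghost / std
print(std, ghost, ratio)
```

For this shape the ratio is $\frac{1}{2} + \frac{1}{256}$, which is within one percent of $\frac{1}{s}$. In general the ratio from these two equations is $\frac{1}{s} + \frac{s-1}{s} \cdot \frac{d^{2}}{2 C k^{2}}$.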

The DFCConv further enhances the feature extraction capability of the module by leveraging downsampling and upsampling to capture information from a larger receptive field. During this process, a $1 \times 5$ convolution is employed to perform Vertical FC, followed by a $5 \times 1$ convolution for Horizontal FC, thereby strengthening feature extraction through both vertical and horizontal operations. Under lightweight computational constraints, the DFCConv module enhances feature extraction across an expanded receptive field, enabling the capture and fusion of multi-directional features, which is particularly beneficial for addressing issues such as motion blur.

GhostModulev2, which integrates GhostConv and DFCConv, further reduces computational costs while maintaining effective feature extraction, thereby offering significant advantages for edge-device deployment in UAV object detection.

3.3. Tiny Detection Head Module

The YOLO algorithm typically employs three detection heads, with the largest feature map undergoing an $8\times$ downsampling relative to the input image. However, in air-to-air UAV object detection tasks, UAVs exhibit significant scale variations within images, and some occupy only a small proportion of the frame. For tiny UAV targets, repeated downsampling during feature extraction may lead to information loss [36], resulting in higher false detection and missed detection rates. To address this issue, we modify the original YOLO detection paradigm by introducing a Tiny Detection Head, which is highlighted in green in Figure 1, to enhance the model’s capability in detecting small objects.

The Tiny Detection structure first upsamples the input of the head that originally processes the largest feature map to match the size of the output feature map from the topmost GhostModulev2 in the backbone network. Subsequently, these two feature maps undergo concatenation to generate a higher-channel feature representation. This combined feature map is then processed through a C3k2 module for channel adjustment. Finally, the refined feature map is fed into the Tiny Detection Head to produce detections based on a $4\times$ downsampled feature map relative to the input image.

The incorporation of the Tiny Detection Head further enhances A2A-YOLO’s capability in detecting extremely tiny targets, significantly mitigating information loss caused by downsampling operations. This improvement substantially boosts the model’s robustness in air-to-air UAV object detection.
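The effect of the extra $4\times$ head can be illustrated with simple arithmetic on grid sizes. The sketch assumes the 640 × 640 input used in the experiments. The strides 16 and 32 for the two coarser heads follow the usual YOLO layout and are an assumption, as is the 12-pixel target width.

```python
def grid_size(input_size: int, stride: int) -> int:
    """Side length of the feature map seen by a head with the given stride."""
    return input_size // stride


def cells_spanned(target_px: float, stride: int) -> float:
    """How many grid cells a target of `target_px` pixels covers along one axis."""
    return target_px / stride


INPUT_SIZE = 640   # input resolution stated in the experimental settings
TARGET_PX = 12     # hypothetical width of a distant UAV, in pixels

for stride in (4, 8, 16, 32):   # 4 = Tiny Detection Head; 8/16/32 assumed defaults
    print(stride, grid_size(INPUT_SIZE, stride), cells_spanned(TARGET_PX, stride))
```

At stride 4 the hypothetical target spans three cells per axis on a 160 × 160 grid. At stride 8 it spans only one and a half cells, which is one way to see why the finer head retains more spatial evidence for tiny targets.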

4. Experiments

4.1. Implementation Details

4.1.1. Datasets

We conduct an extensive evaluation on the Det-Fly dataset [10], which is the most representative dataset for air-to-air UAV object detection. Det-Fly presents a dataset of 13,271 images of a flying target UAV acquired by another flying UAV. Featuring diverse backgrounds spanning fields, urban zones, skies, and mountains, the dataset presents complex imaging scenarios including variable lighting conditions, motion blur effects, and partial truncations. The UAV data exhibits multiple viewing angles, along with significant variations in both size and spatial distribution. Notably, we maintained strict compliance with the originally published official splits for all dataset partitions.

Furthermore, to explore the generalization of A2A-YOLO in infrared air-to-air small target detection, we conducted further studies on the RealScene-ISTD [37] and the IRSTD-1K [38] datasets. The IRSTD-1K dataset consists of 1000 real UAV images characterized by diverse object morphologies, varying target scales, and complex background clutter. The RealScene-ISTD dataset contains 739 real UAV images featuring substantial target size variations in highly noisy and cluttered environments. Since the two infrared small target detection datasets mentioned above use annotations in mask format, we converted the annotations into the bounding box format.
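One plausible way to perform the mask-to-box conversion is sketched below. The text does not describe the actual procedure. The 8-connected component grouping and the normalized `(cx, cy, w, h)` output format are therefore assumptions, and the function names are hypothetical.

```python
from collections import deque


def mask_to_boxes(mask):
    """Return one (x_min, y_min, x_max, y_max) box per 8-connected blob.

    `mask` is a list of rows of 0/1 values; coordinates are inclusive pixels.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:  # breadth-first flood fill over the blob
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


def to_normalized_cxcywh(box, width, height):
    """Convert an inclusive pixel box to normalized (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    bw, bh = x_max - x_min + 1, y_max - y_min + 1
    return ((x_min + bw / 2) / width, (y_min + bh / 2) / height, bw / width, bh / height)


# Toy 4 x 6 mask with two separate targets.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
boxes = mask_to_boxes(mask)
labels = [to_normalized_cxcywh(b, width=6, height=4) for b in boxes]
```

Grouping by connected component keeps multiple small targets in one image as separate boxes, which matters for infrared scenes that contain more than one target.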

4.1.2. Settings

Throughout all experiments, the small-scale variants of the comparative models were employed. All of the comparative models and our proposed network were trained on a platform equipped with an Intel Xeon Silver 4410 CPU and an Nvidia GeForce RTX 4090 GPU. The computational framework employed Python 3.8 with PyTorch 2.4.1, with each model trained for 300 epochs using input images resized to 640 × 640 pixels. The batch size was set to 64 for the RGB dataset and 16 for the IR datasets.

For inference speed evaluation, the models were further tested on a platform equipped with an Intel Xeon E5-2760 v3 CPU and an NVIDIA Tesla V100 SXM2 GPU and on the RK3588 platform, in addition to the original training platform.

4.2. Evaluation Metrics

4.2.1. Precision $(P_{P})$

Precision $P_{P}$ measures how accurate the model’s predictions are.

$$P_{P} = \frac{TP}{TP + FP}, \tag{7}$$

where $TP$ is the number of correct detections, and $FP$ is the number of wrong or redundant detections. It is worth noting that in this experiment, a correct prediction is defined as one with an IoU greater than 0.5. Any prediction failing to meet this threshold is regarded as a false detection.

4.2.2. Recall $(P_{R})$

Recall $P_{R}$ measures how well the model finds all ground-truth objects.

$$P_{R} = \frac{TP}{TP + FN}, \tag{8}$$

where $FN$ is the number of ground-truth objects that were not matched to any detection.
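The two definitions can be made concrete with a small single-image sketch. The IoU threshold of 0.5 comes from the text above. The greedy matching in descending confidence order is a common convention assumed here, and the boxes and scores are invented.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truths, iou_thr=0.5):
    """detections: list of (score, box). Each ground truth is matched at most once."""
    matched = set()
    tp = fp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            value = iou(box, gt)
            if value > best_iou:
                best_iou, best_idx = value, idx
        if best_idx is not None and best_iou > iou_thr:
            matched.add(best_idx)   # correct detection
            tp += 1
        else:
            fp += 1                 # wrong or redundant detection
    fn = len(ground_truths) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [
    (0.9, (1, 1, 10, 10)),     # overlaps the first ground truth (IoU 0.81)
    (0.8, (50, 50, 60, 60)),   # background false alarm
    (0.6, (0, 0, 9, 9)),       # redundant hit on an already matched target
]
p, r = precision_recall(dets, gts)
```

Here one of three detections is correct and one of two targets is found, so `p` is 1/3 and `r` is 0.5.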

4.2.3. Average Precision $(A P)$

Average Precision $AP$ summarizes model performance across all confidence thresholds. It is the area under the Precision-Recall Curve.

$$AP = \int_{0}^{1} P_{P}(P_{R}) \, dP_{R}. \tag{9}$$
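In practice the integral in Equation (9) is approximated from a ranked list of detections. The sketch below uses the monotone precision envelope with all-point summation, a widely used convention that is assumed here rather than taken from the paper. The input flags are invented.

```python
def average_precision(flags, num_gt):
    """Area under the precision-recall curve.

    flags: True for a correct detection and False for a false one,
    ordered by descending confidence. num_gt: number of ground-truth objects.
    """
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the "envelope").
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum(
        (recalls[i] - recalls[i - 1]) * precisions[i] for i in range(1, len(recalls))
    )


ap = average_precision([True, False, True], num_gt=2)
```

For this toy ranking the curve reaches recall 0.5 at precision 1 and recall 1.0 at precision 2/3, giving an area of 5/6.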

4.2.4. Computation Cost Metrics

In addition to the model’s performance metrics, we also investigated computation cost metrics, including parameters ($Params$), floating point operations ($FLOPs$), model size, and frames per second ($FPS$).
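A generic way to obtain an FPS figure is to time repeated forward passes after a short warm-up. The sketch below shows this pattern with a stand-in callable. It is not the measurement protocol used in the paper, and the warm-up and iteration counts are arbitrary.

```python
import time


def measure_fps(infer, n_warmup=5, n_iters=50):
    """Average frames per second of `infer`, a zero-argument callable."""
    for _ in range(n_warmup):      # warm-up runs are excluded from timing
        infer()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


def frame_budget_ms(fps: float) -> float:
    """Per-frame time budget implied by a target frame rate."""
    return 1000.0 / fps


# Stand-in workload; a real test would call the deployed model on one image.
fps = measure_fps(lambda: sum(range(1000)), n_warmup=2, n_iters=10)
budget = frame_budget_ms(15)       # 15 FPS corresponds to about 66.7 ms per frame
```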

4.3. Comparison to the Advanced Methods

To comprehensively evaluate the performance of A2A-YOLO, we conducted comparative experiments on the Det-Fly dataset with advanced general real-time object detection methods, including YOLOv5 [13], YOLOv8 [14], YOLOv9 [20], YOLOv10 [19], YOLO11 [15], YOLOv12 [18], YOLOv13 [16], RT-DETR [39], and YOLOv8-DETR [39]. We also present the results of Faster R-CNN [25], a two-stage detector, on the Det-Fly dataset, following the setup in [10]. Table 1 demonstrates that our method, while maintaining a competitive number of parameters, model size, and inference speed, achieves superior performance in air-to-air UAV object detection with $P_{P}$ of 85.0%, $P_{R}$ of 80.7%, and $AP$ of 81.9%. With a 5.7% $AP$ advantage over the strongest baseline on the Det-Fly dataset, the A2A-YOLO model achieves a favorable speed-accuracy trade-off, evidenced by its compact size and satisfactory inference efficiency on diverse platforms. The A2A-YOLO architecture achieves real-time inference at 15 FPS when deployed on the RK3588 edge computing platform with a 6 TOPS NPU.

To assess the inference capabilities of our method across diverse scenarios, we manually categorized the test set of Det-Fly into field (737), urban (1462), sky (1535), and mountain (1150) scenes based on the examples provided by the dataset authors. Additionally, we conducted targeted screening for conditions such as strong/weak light (535), motion blur (412), and partial truncation (62) to ensure a comprehensive evaluation.

As shown in Table 2, our A2A-YOLO achieves remarkable performance across all tested backgrounds and under a wide range of testing conditions. Specifically, under strong/weak light and motion blur conditions, our model significantly outperforms competing methods, attaining $AP$ of 76.4% and 84.9%, respectively. Compared to the best-performing baseline, our method achieves $AP$ improvements of 12.2%, 3.8%, and 2.1% on urban scenes with cluttered backgrounds, sky scenes containing tinier targets, and mountain scenes where targets and backgrounds have similar colors, respectively.

Addressing the key concerns in our model design, notably illumination variations and motion blur, our model achieves $AP$ scores of 76.4% and 84.9% under the strong/weak light and motion blur tests, respectively. These results surpass the best-performing baseline by significant margins of 13.0% and 5.5%. This performance further validates the effectiveness of the LECA-Conv module in handling illumination variations and motion blur, as well as the rationality of the overall A2A-YOLO architecture. In partial truncation scenarios, our method attains an $AP$ of 75.6%, which is above the average of the compared methods but below the best baseline (84.9%), indicating that it can address this challenge although it is not the strongest method in this setting.
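The margins over the best-performing baseline quoted for Table 2 can be reproduced with a few lines of arithmetic. The sketch below copies the $AP$ values from Table 2 and reports, per column, the difference in percentage points between A2A-YOLO and the strongest competing method; the variable and function names are illustrative only.

```python
columns = ["field", "urban", "sky", "mountain", "S", "M", "P"]

# AP (%) per column, copied from Table 2.
baselines = {
    "YOLOv5":      [90.2, 63.6, 75.8, 87.2, 58.9, 77.4, 69.4],
    "YOLOv8":      [91.9, 64.2, 75.4, 89.5, 58.4, 79.4, 84.9],
    "YOLOv9":      [90.1, 64.3, 74.5, 89.0, 59.6, 78.2, 84.7],
    "YOLOv10":     [92.5, 63.2, 74.8, 87.1, 57.6, 76.7, 76.7],
    "YOLO11":      [91.2, 63.5, 75.1, 87.7, 58.4, 78.6, 81.4],
    "YOLOv12":     [89.9, 64.4, 74.0, 88.1, 60.3, 77.5, 80.9],
    "YOLOv13":     [89.9, 62.9, 73.4, 87.4, 58.2, 77.8, 79.3],
    "RT-DETR":     [86.0, 63.7, 68.1, 91.1, 63.4, 74.5, 51.6],
    "YOLOv8-DETR": [86.9, 53.2, 51.2, 79.6, 50.4, 67.5, 67.9],
}
ours = [92.0, 76.6, 79.6, 93.2, 76.4, 84.9, 75.6]


def margins_over_best(ours_row, baseline_rows):
    """Difference (percentage points) between ours and the best baseline per column."""
    result = {}
    for i, col in enumerate(columns):
        best = max(row[i] for row in baseline_rows.values())
        result[col] = round(ours_row[i] - best, 1)
    return result


print(margins_over_best(ours, baselines))
```

A positive value means A2A-YOLO leads in that column. The field and partial truncation columns come out negative, which matches Table 2, where a baseline holds the best score for those two subsets.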

The comparative inference results for RGB UAV detection are presented in Figure 5. The figure shows that A2A-YOLO outperforms the YOLO11 baseline under tiny-object, motion blur, strong/weak light, and complex background conditions, with predictions that align closely with the ground truth in all cases. Notably, under strong/weak light conditions, YOLO11 exhibits false detections, whereas A2A-YOLO accurately identifies similar targets, further validating the rationality of its model design and the precision of its inference.

4.4. Ablation Study

We first conduct a comprehensive evaluation of the proposed LECA-Conv in A2A-YOLO to validate its efficacy. Table 3 demonstrates that our proposed LECA-Conv module outperforms the SEBlock-based [34] ISE-Conv module in terms of $AP$ across all configurations of the YOLO11 network, regardless of whether the tiny detection head (TDH) is employed. These results provide strong empirical support for our hypothesis that channel attention mechanisms should process features beyond just the original feature maps. The superior performance of LECA-Conv confirms its effectiveness in simultaneously capturing channel attention while enhancing local spatial information, thereby ensuring more precise feature representation for subsequent network layers.
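For context, a standard squeeze-and-excitation block [34] applies its channel gate directly to the input feature map, which is the behaviour the hypothesis above argues against. The following pure-Python sketch shows that standard block only; it is not an implementation of ISE-Conv or LECA-Conv, and the nested-list layout, the omission of biases, and the function name are assumptions made to keep the example self-contained.

```python
import math


def se_channel_attention(x, w1, w2):
    """Standard SE block on a feature map x[c][h][w] stored as nested lists.

    w1: reduction weights of shape [c_mid][c];
    w2: expansion weights of shape [c][c_mid]. Biases are omitted.
    """
    # Squeeze: global average pooling per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(w_row, z))) for w_row in w1]
    gate = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w2
    ]
    # Scale: each channel of the ORIGINAL feature map is multiplied by its gate.
    return [[[gate[k] * v for v in row] for row in x[k]] for k in range(len(x))]


# Hypothetical 2-channel, 2x2 feature map with all-zero weights,
# so every gate equals sigmoid(0) = 0.5.
x = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
out = se_channel_attention(x, w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
print(out)
```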

Based on the performance of PLECA-Conv, we conducted a comprehensive re-examination of this module. Our analysis reveals that while the adaptive shortcut aims to enhance intrinsic image information, it increases computational overhead without yielding proportional accuracy gains. Furthermore, the parallel convolutional layers introduce additional uncertainty during model training, likely due to optimization conflicts between the shortcut and the attention branches. Consequently, PLECA-Conv lagged behind the standard LECA-Conv in both configurations reported in Table 3 (78.0% vs. 78.2% $AP$ without TDH and 81.2% vs. 82.9% $AP$ with TDH). This finding underscores the need for careful calibration of residual connections in specialized UAV detection networks, paving the way for our future research on stabilizing lightweight module training.

Subsequently, we performed comprehensive ablation studies, recorded in Table 4, to evaluate the individual contributions of A2A-YOLO’s modules: LECA-Conv (LECA), tiny detection head (TDH), and GhostModulev2 (GMv2). Relative to the baseline, our full model improves $P_{P}$, $P_{R}$, and $AP$ by 0.9, 8.4, and 6.5 percentage points, respectively, alongside a 5.7% relative reduction in parameters (9.41 M to 8.87 M), while maintaining competitive $FLOPs$ and $FPS$.
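The figures in this comparison mix two kinds of quantity: the accuracy gains are absolute differences between percentages, whereas the parameter reduction is relative to the baseline. A short sketch using the first and last rows of Table 4 makes the distinction explicit; the dictionary keys are illustrative names.

```python
# First row (YOLO11 baseline) and last row (full A2A-YOLO) of Table 4.
baseline = {"P_P": 84.1, "P_R": 72.3, "AP": 75.4, "params_M": 9.41}
full = {"P_P": 85.0, "P_R": 80.7, "AP": 81.9, "params_M": 8.87}

# Accuracy gains: absolute differences in percentage points.
abs_gain = {k: round(full[k] - baseline[k], 1) for k in ("P_P", "P_R", "AP")}

# Parameter reduction: relative to the baseline parameter count, in percent.
rel_param_reduction = round(
    100 * (baseline["params_M"] - full["params_M"]) / baseline["params_M"], 1
)

print(abs_gain, rel_param_reduction)
```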

Experimental results demonstrate that both LECA-Conv and the Tiny Detection Head contribute significantly to model performance improvement. Notably, the Tiny Detection Head yields the most substantial individual gain (75.4% to 81.8% $AP$). This is primarily because, in air-to-air scenarios, target UAVs are extremely small and occupy only a minimal portion of the image pixels. By operating on $4\times$ downsampled feature maps, the Tiny Detection Head effectively preserves the spatial details lost in deeper layers, making it indispensable for detecting such tiny objects. In contrast, LECA-Conv focuses on enhancing feature robustness, specifically addressing the challenges of motion blur and illumination variations by recalibrating channel-wise and local features without imposing heavy computational burdens. The introduced GhostModulev2 further reduces model complexity while largely maintaining accuracy: added to the LECA+TDH configuration, it lowers the parameter count from 10.05 M to 8.87 M and the FLOPs from 29.2 G to 22.8 G, with $AP$ changing from 82.9% to 81.9%. Collectively, these three modules achieve substantial improvements in both computational efficiency and model accuracy.
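The benefit of a $4\times$ downsampled feature map for tiny targets can be seen from simple stride arithmetic. In the sketch below, the 640-pixel input size and the 12-pixel target size are hypothetical values chosen for illustration, and strides 8, 16, and 32 are included only as the strides commonly used by YOLO-style detection heads.

```python
def grid_cells(input_size, stride):
    """Feature-map side length for a square input at a given downsampling stride."""
    return input_size // stride


def target_extent_in_cells(target_px, stride):
    """Approximate side length of a target, measured in feature-map cells."""
    return target_px / stride


input_size = 640   # hypothetical input resolution, not taken from the paper
target_px = 12     # hypothetical side length of a tiny UAV, in pixels

for stride in (4, 8, 16, 32):
    print(stride, grid_cells(input_size, stride),
          target_extent_in_cells(target_px, stride))
```

Under these assumptions the target spans three cells per side at stride 4 but less than one cell at strides 16 and 32, which is consistent with the explanation that deeper layers lose the spatial detail needed for tiny objects.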

4.5. Generalization Experiments on Infrared Datasets

To further evaluate our model’s generalization capability, we conducted experiments on two infrared small target datasets, RealScene-ISTD [37] and IRSTD-1K [38], which primarily consist of air-to-air scenarios with significant noise interference. Notably, the A2A-YOLO model requires no additional configuration for training and inference on infrared images. The experimental results are presented in Table 5.

The proposed A2A-YOLO model achieves the highest $AP$ on both infrared benchmark datasets among all compared models, including the baseline YOLO11. As shown in Table 5, it attains $P_{P}$ of 94.2%, $P_{R}$ of 92.6%, and $AP$ of 95.0% on the RealScene-ISTD dataset, and $P_{P}$ of 87.8%, $P_{R}$ of 80.6%, and $AP$ of 86.4% on the IRSTD-1K dataset. By effectively mitigating challenges such as motion blur and enabling the detection of tiny objects in infrared imagery, the model demonstrates robust performance. These results support the soundness of its overall design and the efficacy of its specialized components.

Notably, despite utilizing identical training configurations, RT-DETR [39] exhibits a marked degradation in generalization on the infrared benchmarks. We attribute this failure to the limited scale of the IR datasets, which impedes the convergence of data-hungry Transformer architectures. Specifically, the global attention mechanism struggles to localize tiny targets lacking distinctive texture, leading to ineffective representation learning under sparse supervision. In contrast, CNN-based models like A2A-YOLO possess inherent inductive biases that facilitate robust feature extraction even with scarce training data, thereby maintaining high detection accuracy.

Figure 6 presents a comparative analysis of infrared small target detection results under challenging conditions. The evaluation includes two representative scenarios: one image from the RealScene-ISTD dataset, where the target suffers from motion blur and closely resembles background clouds, and another from the IRSTD-1K dataset, which contains multiple minuscule targets. In these demanding cases, our proposed A2A-YOLO model consistently generates inference results that align closely with the ground truth, demonstrating superior performance over the baseline YOLO11. Specifically, the model exhibits robustness in handling motion blur, accurately detecting extremely tiny targets, and reliably identifying multiple targets within a single image. These capabilities collectively validate the effectiveness of our novel architectural design.

Our results demonstrate that the proposed model A2A-YOLO not only achieves remarkable performance in RGB-based air-to-air UAV detection, but also maintains high detection capability and accuracy in infrared small target detection scenarios, which are exceptionally challenging due to increased noise, reduced spectral channels, and limited texture information. This provides more possibilities for air-to-air UAV object detection.

5. Conclusions

In this paper, we propose A2A-YOLO, a lightweight and highly accurate object detection network designed for air-to-air UAV scenarios. The model effectively enhances detection performance for challenges such as motion blur, illumination variations, and tiny targets, while maintaining real-time inference speed under resource-constrained conditions. Specifically, our main contributions include: the novel LECA-Conv module, which enhances local features and channel significance with minimal computational overhead without directly applying attention parameters to the original feature maps; the incorporation of GhostModulev2 and a dedicated Tiny Detection Head to strengthen small target detection capability while preserving model efficiency; and comprehensive evaluations in both RGB and infrared domains, validated on the RK3588 edge computing platform.

According to the extensive evaluations on the Det-Fly dataset, A2A-YOLO achieves a precision ($P_{P}$) of 85.0%, a recall ($P_{R}$) of 80.7%, and an average precision ($AP$) of 81.9%, outperforming YOLO11 by 0.9%, 8.4%, and 6.5%, respectively. The proposed dedicated network for precise air-to-air UAV object detection demonstrates outstanding performance across diverse backgrounds and challenging conditions, including motion blur and illumination variations. The model achieves real-time detection at 15 FPS on the RK3588 platform while delivering remarkable performance in infrared small target detection. The viewpoint that the attention parameters need not be directly applied to the original feature maps has been experimentally validated, which provides new insights for subsequent research.

Author Contributions

Conceptualization, X.H. and F.P.; methodology, X.H.; software, X.H., H.Z., J.W.; validation, X.H., H.Z. and J.X.; formal analysis, X.H. and Y.L.; investigation, J.W. and Y.L.; data curation, J.W. and J.X.; writing—original draft preparation, X.H.; writing—review and editing, X.H., Z.Z., X.F. and F.P.; visualization, H.Z.; supervision, Z.Z., X.F. and F.P.; project administration, F.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are cited within the article. The models are available at: https://anonymous.4open.science/r/A2A-YOLO-4D42 (accessed on 27 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
CPU	Central Processing Unit
GPU	Graphics Processing Unit

References

1. Wang, Z.; Cheng, P.; Chen, M.; Tian, P.; Wang, Z.; Li, X.; Yang, X.; Sun, X. Drones help drones: A collaborative framework for multi-drone object trajectory prediction and beyond. Adv. Neural Inf. Process. Syst. 2024, 37, 64604–64628.
2. Chen, Y.; Ye, Z.; Sun, H.; Gong, T.; Xiong, S.; Lu, X. Global–Local Fusion With Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
3. Karampinis, V.; Arsenos, A.; Filippopoulos, O.; Petrongonas, E.; Skliros, C.; Kollias, D.; Kollias, S.; Voulodimos, A. Ensuring uav safety: A vision-only and real-time framework for collision avoidance through object detection, tracking, and distance estimation. In Proceedings of the 2024 International Conference on Unmanned Aircraft Systems (ICUAS), Chania, Greece, 4–7 June 2024; pp. 1072–1079.
4. Yaacoub, J.P.; Noura, H.; Salman, O.; Chehab, A. Security analysis of drones systems: Attacks, limitations, and recommendations. Internet Things 2020, 11, 100218.
5. Huang, T.; Zhu, J.; Liu, Y.; Tan, Y. UAV aerial image target detection based on BLUR-YOLO. Remote Sens. Lett. 2023, 14, 186–196.
6. Li, Q.; Zhang, Y.; Fang, L.; Kang, Y.; Li, S.; Xiang Zhu, X. DREB-Net: Dual-Stream Restoration Embedding Blur-Feature Fusion Network for High-Mobility UAV Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–18.
7. Munir, A.; Siddiqui, A.J.; Anwar, S.; El-Maleh, A.; Khan, A.H.; Rehman, A. Impact of Adverse Weather and Image Distortions on Vision-Based UAV Detection: A Performance Evaluation of Deep Learning Models. Drones 2024, 8, 638.
8. Noor, A.; Li, K.; Tovar, E.; Zhang, P.; Wei, B. Fusion flow-enhanced graph pooling residual networks for Unmanned Aerial Vehicles surveillance in day and night dual visions. Eng. Appl. Artif. Intell. 2024, 136, 108959.
9. Chiang, S.Y.; Lin, T.Y. Low-Brightness Object Recognition Based on Deep Learning. Comput. Mater. Contin. 2024, 79, 1757–1773.
10. Zheng, Y.; Chen, Z.; Lv, D.; Li, Z.; Lan, Z.; Zhao, S. Air-to-Air Visual Detection of Micro-UAVs: An Experimental Evaluation of Deep Learning. IEEE Robot. Autom. Lett. 2021, 6, 1020–1027.
11. Guo, H.; Zheng, Y.; Zhang, Y.; Gao, Z.; Zhao, S. Global-Local MAV Detection Under Challenging Conditions Based on Appearance and Motion. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12005–12017.
12. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A DeNoising FPN With Transformer R-CNN for Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
13. Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 May 2025).
14. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2020. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 May 2025).
15. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 May 2025).
16. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733.
17. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
18. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
19. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
20. Wang, C.Y.; Yeh, I.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Milan, Italy, 29 September–4 October 2024; pp. 1–21.
21. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
22. Kassab, M.; Zitar, R.A.; Barbaresco, F.; Seghrouchni, A.E.F. Drone Detection With Improved Precision in Traditional Machine Learning and Less Complexity in Single-Shot Detectors. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 3847–3859.
23. Han, F.; Shan, Y.; Cekander, R.; Sawhney, H.S.; Kumar, R. A two-stage approach to people and vehicle detection with hog-based svm. In Proceedings of the Performance Metrics for Intelligent Systems 2006 Workshop, Gaithersburg, MD, USA, 21–23 August 2006; pp. 133–140.
24. Sedai, S.; Roy, P.K.; Garnavi, R. Right ventricle landmark detection using multiscale HOG and random forest classifier. In Proceedings of the 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), Brooklyn, NY, USA, 16–19 April 2015; pp. 814–818.
25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
26. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
27. Cai, H.; Zhang, J.; Xu, J. EA-DINO: Improved method for unmanned aerial vehicle detection in airspace based on DINO. Eng. Res. Express 2024, 6, 035207.
28. Cheng, Q.; Wang, Y.; He, W.; Bai, Y. Lightweight air-to-air unmanned aerial vehicle target detection model. Sci. Rep. 2024, 14, 2609.
29. Gong, D.; Yang, J.; Liu, L.; Zhang, Y.; Reid, I.; Shen, C.; van den Hengel, A.; Shi, Q. From Motion Blur to Motion Flow: A Deep Learning Solution for Removing Heterogeneous Motion Blur. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
30. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14821–14831.
31. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Learning enriched features for real image restoration and enhancement. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 492–511.
32. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep Light Enhancement Without Paired Supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349.
33. Huang, Y.; Zha, Z.J.; Fu, X.; Zhang, W. Illumination-Invariant Person Re-Identification. In Proceedings of the 27th ACM International Conference on Multimedia; MM ’19; ACM: New York, NY, USA, 2019; pp. 365–373.
34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
35. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982.
36. Xu, J.; Han, X.; Wang, J.; Feng, X.; Li, Z.; Pan, F. PCLC-Net: Parallel Connected Lateral Chain Networks for Infrared Small Target Detection. Remote Sens. 2025, 17, 2072.
37. Lu, Y.; Li, Y.; Guo, X.; Yuan, S.; Shi, Y.; Lin, L. Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning. arXiv 2025, arXiv:2504.16487.
38. Zhang, M.; Zhang, R.; Yang, Y.; Bai, H.; Zhang, J.; Guo, J. ISNet: Shape matters for infrared small target detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 877–886.
39. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.

Figure 1. The structure of the proposed A2A-YOLO for air-to-air UAV object detection. The modules in blue denote our proposed LECA-Conv, the modules in pink represent the lightweight feature extraction component GhostModulev2, and the modules in green indicate the Tiny Detection Head.

Figure 2. The structure of the proposed LECA-Conv. In the figure, the symbol ⊗ denotes the element-wise multiplication, and ⊕ denotes the addition operation. The local enhanced branch enhances the model’s focus on local details, the channel attention gate improves attention to channel-wise importance, and shortcut along with optional adaptive shortcut preserve the original feature map information.

Figure 3. A comparison between ISE-Conv, LECA-Conv, and PLECA-Conv.

Figure 4. The structure of the GhostModulev2.

Figure 5. The comparison between A2A-YOLO and YOLO11 in RGB UAV detection.

Figure 6. The comparison between A2A-YOLO and YOLO11 in IR UAV detection.

Table 1. The RGB air-to-air UAV object detection performance of each advanced model on the Det-Fly dataset. The model size is measured for the ONNX model with a unified standard and the RKNN model suitable for deployment on edge chips, respectively. The FPS is measured on the Nvidia GeForce RTX 4090 GPU, Nvidia Tesla V100 GPU and Rockchip RK3588 NPU, respectively. None of the test environments had TensorRT installed and the batch was set to 1. The optimal and suboptimal results are underlined.

Models	$P_{P}$ (%) ↑	$P_{R}$ (%) ↑	$AP$ (%) ↑	Params (M) ↓	FLOPs (G) ↓	Size (MB, ONNX / RKNN) ↓	FPS (RTX 4090 / Tesla V100 / RK3588) ↑
Faster R-CNN [25]	-	-	70.5	41.29	133.9	165.7 / -	15 / 7 / -
YOLOv5 [13]	82.5	72.1	75.0	9.11	23.8	34.9 / 12.3	291 / 108 / 22
YOLOv8 [14]	85.8	73.5	76.2	11.13	28.4	42.6 / 14.9	312 / 119 / 22
YOLOv9 [20]	84.4	73.2	75.5	7.17	26.7	27.6 / 14.3	124 / 46 / 13
YOLOv10 [19]	83.6	71.8	75.9	7.22	21.4	27.7 / 10.7	238 / 87 / 19
YOLO11 [15]	84.1	72.3	75.4	9.41	21.3	36.1 / 13.1	232 / 88 / 19
YOLOv12 [18]	83.9	72.2	75.4	9.07	19.3	34.9 / 22.5	106 / 37 / 2
YOLOv13 [16]	85.0	71.3	74.6	9.00	20.7	34.7 / 24.8	79 / 28 / <1
RT-DETR [39]	83.1	72.2	73.0	31.99	103.4	122.5 / -	46 / 21 / -
YOLOv8-DETR [39]	72.8	60.7	63.2	16.35	32.2	62.7 / -	54 / 25 / -
A2A-YOLO (ours)	85.0	80.7	81.9	8.87	22.8	34.2 / 16.1	169 / 62 / 15

Table 2. The $AP$ for different environmental background scenes and different challenging conditions on the Det-Fly dataset. ${AP}_{field}$, ${AP}_{urban}$, ${AP}_{sky}$, and ${AP}_{mountain}$ denote the $AP$ under distinct environmental backgrounds: field, urban, sky, and mountain scenes, respectively. ${AP}_{S}$, ${AP}_{M}$, and ${AP}_{P}$ denote the $AP$ under challenging conditions: strong/weak light, motion blur, and partial truncation, respectively. The optimal and suboptimal results are underlined.

Models	${AP}_{field}$ (%) ↑	${AP}_{urban}$ (%) ↑	${AP}_{sky}$ (%) ↑	${AP}_{mountain}$ (%) ↑	${AP}_{S}$ (%) ↑	${AP}_{M}$ (%) ↑	${AP}_{P}$ (%) ↑
YOLOv5 [13]	90.2	63.6	75.8	87.2	58.9	77.4	69.4
YOLOv8 [14]	91.9	64.2	75.4	89.5	58.4	79.4	84.9
YOLOv9 [20]	90.1	64.3	74.5	89.0	59.6	78.2	84.7
YOLOv10 [19]	92.5	63.2	74.8	87.1	57.6	76.7	76.7
YOLO11 [15]	91.2	63.5	75.1	87.7	58.4	78.6	81.4
YOLOv12 [18]	89.9	64.4	74.0	88.1	60.3	77.5	80.9
YOLOv13 [16]	89.9	62.9	73.4	87.4	58.2	77.8	79.3
RT-DETR [39]	86.0	63.7	68.1	91.1	63.4	74.5	51.6
YOLOv8-DETR [39]	86.9	53.2	51.2	79.6	50.4	67.5	67.9
A2A-YOLO (ours)	92.0	76.6	79.6	93.2	76.4	84.9	75.6

Table 3. Ablation study of the LECA-Conv module on the Det-Fly dataset.

Module	Without TDH			With TDH
Module	$AP$ (%)↑	Params(M) ↓	FLOPs(G) ↓	$AP$ (%)↑	Params(M) ↓	FLOPs(G) ↓
ISE-Conv	77.5	9.80	21.5	81.8	9.95	28.8
LECA-Conv	78.2	9.91	21.9	82.9	10.05	29.2
PLECA-Conv	78.0	9.91	22.6	81.2	10.05	29.9

Table 4. Ablation study of the proposed modules in A2A-YOLO on the Det-Fly dataset. The baseline model for the experiments is YOLO11 [15].

LECA	TDH	GMv2	$P_{P}$ (%) ↑	$P_{R}$ (%) ↑	$AP$ (%) ↑	Params (M) ↓	FLOPs (G) ↓	FPS (RTX 4090) ↑
×	×	×	84.1	72.3	75.4	9.41	21.3	232
✔	×	×	85.0	73.7	78.2	9.91	21.9	190
×	✔	×	85.5	80.9	81.8	9.56	28.6	201
×	×	✔	82.7	70.4	73.6	8.36	15.3	229
✔	✔	×	86.5	81.6	82.9	10.05	29.2	166
✔	×	✔	83.2	71.5	75.4	8.73	15.8	191
×	✔	✔	85.4	80.4	80.2	8.50	22.3	199
✔	✔	✔	85.0	80.7	81.9	8.87	22.8	169

Table 5. The IR air-to-air object detection performance of each advanced model on the RealScene-ISTD and IRSTD-1K datasets. Since the structure of each model remains unchanged, the metrics including Params, FLOPs, Size, and FPS are consistent with those in Table 1. The optimal and suboptimal results are underlined.

Models	RealScene-ISTD [37]			IRSTD-1K [38]
Models	$P_{P}$ (%) ↑	$P_{R}$ (%) ↑	$AP$ (%) ↑	$P_{P}$ (%) ↑	$P_{R}$ (%) ↑	$AP$ (%) ↑
YOLOv5 [13]	89.8	89.4	93.9	83.3	75.7	81.9
YOLOv8 [14]	93.1	86.1	93.8	86.9	72.6	82.0
YOLOv9 [20]	92.6	87.1	93.4	89.3	71.6	81.9
YOLOv10 [19]	89.8	85.2	92.5	86.2	69.7	80.9
YOLO11 [15]	93.4	87.5	94.0	84.9	72.5	81.0
YOLOv12 [18]	93.2	87.0	93.0	85.0	72.1	80.7
YOLOv13 [16]	92.6	86.7	92.2	83.8	71.4	80.0
RT-DETR [39]	8.08	4.44	1.48	0.005	0.220	0.009
YOLOv8-DETR [39]	91.0	93.0	93.7	85.2	80.5	82.1
A2A-YOLO (ours)	94.2	92.6	95.0	87.8	80.6	86.4

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, X.; Zhang, H.; Wang, J.; Xu, J.; Lv, Y.; Zhou, Z.; Feng, X.; Pan, F. A Dedicated Lightweight Network with Synergistic Attention for Precise Air-to-Air UAV Detection. Remote Sens. 2026, 18, 1804. https://doi.org/10.3390/rs18111804

