A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems

Chen, Zhiling; Fan, Kuangang; Ye, Jingzhen; Xu, Zhitao; Wei, Yupeng

doi:10.3390/drones10010020

Open AccessArticle

A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems

by

Zhiling Chen

¹

,

Kuangang Fan

^1,2,*

,

Jingzhen Ye

¹

,

Zhitao Xu

¹

and

Yupeng Wei

²

¹

School of Electrical Engineering and Automation, Jiangxi University of Science and Technology, Ganzhou 341000, China

²

School of Electrical Engineering, Shanghai Dianji University, Shanghai 201306, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(1), 20; https://doi.org/10.3390/drones10010020

Submission received: 25 November 2025 / Revised: 18 December 2025 / Accepted: 25 December 2025 / Published: 31 December 2025

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Proposition of YOLO-CoOp, a lightweight multi-module collaborative framework with HRFPN, C3k2-WT, SCSA, and DyATF modules specifically optimized for small UAV detection.
Achievement of 94.3% precision and 96.2% mAP⁵⁰ on UAV-SOD dataset with only 1.97 M parameters (24% fewer than baseline), demonstrating superior detection performance with reduced computational requirements.

What are the implications of the main findings?

Contribution to vision-based detection capabilities in an anti-UAV system by offering a lightweight yet effective framework for small UAV detection.
Cross-dataset validation experiments demonstrate that the proposed method consistently improves small object detection performance, offering a transferable solution with potential applicability to other domains involving small target detection.

Abstract

In response to the safety threats posed by unauthorized unmanned aerial vehicles (UAVs), the importance of anti-UAV systems is becoming increasingly apparent. In tasks involving UAV detection, small UAVs are particularly difficult to detect due to their low resolution. Therefore, this study proposed YOLO-CoOp, a lightweight multi-module collaborative optimization framework for detecting small UAVs. First, a high-resolution feature pyramid network (HRFPN) was proposed to retain more spatial information of small UAVs. Second, a C3k2-WT module integrated with wavelet transform convolution was proposed to enhance feature extraction capability and expand the model’s receptive field. Then, a spatial-channel synergistic attention (SCSA) mechanism was introduced to integrate spatial and channel information and enhance feature fusion. Finally, the DyATF method replaced the upsampling with Dysample and the confidence loss with adaptive threshold focal loss (ATFL), aiming to restore UAV details and balance positive–negative sample weights. The ablation experiments show that YOLO-CoOp achieves 94.3% precision, 93.1% recall, 96.2% mAP⁵⁰, and 57.6% mAP⁵⁰⁻⁹⁵ on the UAV-SOD dataset, with improvements of 3.6%, 10%, 5.9%, and 5% over the baseline model, respectively. The comparison experiments demonstrate that YOLO-CoOp has fewer parameters while maintaining superior detection performance. Cross-dataset validation experiments also demonstrate that YOLO-CoOp exhibits significant performance improvements in small object detection tasks.

Keywords:

small object detection; lightweight framework; visual detection; collaborative optimization; anti-UAV system

1. Introduction

In recent years, unmanned aerial vehicles (UAVs) have been widely used in civil, military, and scientific research fields due to their small size, high maneuverability, and flexible control. Research related to UAV has attracted increasing attention, exemplified by work on UAV path planning [1], UAV acoustic source localization [2], UAV interception [3], and UAV detection. In addition, UAVs can be equipped with various sensors and payloads to perform a wide range of tasks, such as power line inspection [4], unmanned aerial sprayer of agricultural chemicals [5], logistics transportation [6], and ecological monitoring [7]. While the rapid development and widespread adoption of UAVs have delivered substantial societal benefits, they have also raised serious safety and security concerns [8,9]. Therefore, the development of an efficient and reliable anti-UAV system is essential for mitigating emerging security threats. Generally, the anti-UAV system is divided into three stages: detection, tracking, and interception, and research has been conducted. Ref. [10] proposed dual-flow semantic consistency (DFSC) for UAV tracking, which offered significant performance improvements. Ref. [11] significantly improved tracking performance by integrating object detection. The work in [12] demonstrated that RF jamming can be used to effectively disable unauthorized UAV, leading to successful interception in controlled scenarios. Existing detection methods include acoustic detection [13], radio frequency detection [14], and radar-based detection [15]. Acoustic detection is highly sensitive to environmental noise, while radio frequency detection and radar-based detection have high operating costs and are highly dependent on equipment [16]. In contrast, vision-based detection technology has attracted widespread attention due to its low cost and strong environmental adaptability.

With the breakthrough of AlexNet [17] in image classification, the field of artificial intelligence shifted toward deep learning, which has also driven the development of other fields. Ref. [18] used an on-chip photonic deep neural network for image classification. Ref. [19] used ProteinMPNN to robust deep learning-based protein sequence design. Deep learning-based object detection has been widely researched and has made breakthrough progress [20,21]. It is primarily categorized into single-stage detector and two-stage detector. The two-stage detector decompose the detection task into two sequential stages. First, candidate regions are generated, and then these regions are classified and their bounding boxes are optimized. Its advantage is that it has a high accuracy rate, and its representative frameworks including R-CNN [22], Fast R-CNN [23], and Faster R-CNN [24], where CNN denotes convolutional neural network. However, two-stage processing increases computation and real-time detection difficult to achieve. In contrast, a single-stage detector integrates localization and classification into a single neural network, enabling efficient end-to-end detection. Its advantages lie in faster processing speed and a more lightweight structure, with representative frameworks including You Only Look Once (YOLO) [25] and Single Shot MultiBox Detector (SSD) [26]. Due to YOLO’s highly efficient detection capabilities, it is often used in practical engineering applications and has a wide range of application scenarios. For example, much research uses the improved YOLO to detect fruits [27], defects, [28] pedestrians, and vehicles [29].

YOLO is also widely used in the field of general UAV detection field. For example, work [30] used YOLO to detect small targets in UAV aerial images. Ref. [31] designed a new cross-layer feature aggregation (CFA) module and proposed a layered associative spatial pyramid pooling (LASPP) module to capture contextual information, ensuring significant performance gains while maintaining real-time detection. Ref. [32] proposed a structured pruning framework, targeting channel and layer redundancies in UAV detection. Ref. [33] introduced an attention mechanism to distinguish UAV from birds. Ref. [34] achieved improved UAV detection performance in adverse conditions such as heavy fog. Ref. [35] combined normalized wasserstein distance (NWD) loss with CIoU loss to deal with the question of high false negatives. Ref. [36] improved the accuracy of single-stage inference by learning scale-invariant features and enhanced disentanglement using an adversarial feature learning (AFL) scheme, effectively improving the accuracy of the model. Although these studies have achieved good results, challenges persist in detecting small UAVs. Therefore, this study improves the detection performance of small UAVs by proposing the YOLO-CoOp framework. It mainly improves detection performance through the collaborative effects between different modules. The main contributions of this study include:

A UAV-SOD dataset is established specifically for solving small UAV detection.
In the feature extraction stage, feature extraction is enhanced through a C3k2-WT module that integrates wavelet transform convolution (WTConv). It can extract features of different frequencies through small kernel filters. Furthermore, it increases the receptive field of the model without excessively increasing parameters.
In the feature fusion stage, high-detail and high-resolution feature maps are obtained through the high-resolution feature pyramid network (HRFPN) and Dysample upsampling. They have more detailed information about small UAVs, which help improve UAV detection performance after information fusion.
In the detection stage, the spatial-channel synergistic attention (SCSA) mechanism integrates spatial and channel information from high-resolution feature maps, allowing the framework to focus more on effective information, and adaptive threshold focal loss (ATFL) assigns higher weights to small UAVs.

2. Materials

2.1. General UAV Datasets

With the development of the low-altitude economy and the advancement of UAV technology, general UAV datasets have become the cornerstone for supporting intelligent perception and navigation. However, building a large-scale UAV dataset is costly, which makes high-quality dataset increasingly important. As an increasing number of researchers focus on UAV-related research, relevant datasets are constantly being improved. Some representative general UAV datasets are listed in Table 1. However, most of the general datasets are designed for comprehensive research in the UAV field. Within existing general UAV datasets, these can broadly be categorized into two types: UAV-aerial and UAV-target. The former, such as UAV123 [37], Visdrone [38], UAVDT [39], AU-AIR [40], DroneCrowd [41], and AnimalDrone [42], are all captured by cameras mounted on UAVs. However, their detection targets are non-UAV objects on the ground (such as pedestrians, vehicles, animals, etc.). The latter category, including MOT-Fly [43], Anti-UAV [10], DUT Anti-UAV [11], and DroneSwarms [44], focuses on detecting the UAV itself. These datasets are predominantly collected from ground-based or air-to-air perspectives and are specifically engineered for security applications such as counter-UAV measures or swarm tracking.

Detecting small UAVs (typically smaller than 30 × 30 pixels) presents challenges. To address this issue, we introduce UAV-SOD, which is a dataset specifically designed for small UAV detection. This dataset was constructed by curating and integrating small UAV target images from existing datasets (including DUT Anti-UAV [11] and TIB-Net [45]), alongside selecting small UAV target images from several commercial UAV datasets. The detailed data characteristics of UAV-SOD are presented in the following subsection.

Table 1. Introduction to some of general UAV datasets. In the task list, D denotes detection, T denotes tracking, and C denotes classification.

Datasets	Target Type	Task	Release Year	Data Quantity	Quantity	Label Scale
UAV123 [37]	UAV-Aerial	D,T	2015	112,578	28	112,578
FL-Drones [46]	UAV-target	D,T	2016	8000	1	8000
DOTA [47]	Aerial images	D	2018	11,268	18	188,282
Visdrone [38]	UAV-Aerial	D,T	2019	$2 \times 10^{7}$	10	over $2 \times 10^{7}$
UAVDT [39]	UAV-Aerial	D,T	2018	80,000	3	841,500
AU-AIR [40]	UAV-Aerial	D	2020	32,800	8	over $1 \times 10^{6}$
DroneCrowd [41]	UAV-Aerial	D,T	2021	33,600	1	over $4.8 \times 10^{6}$
MOT-Fly [43]	UAV-target	C,D	2021	11,186	3	23,544
AnimalDrone [42]	UAV-Aerial	D	2021	53,600	1	over $4 \times 10^{6}$
Anti-UAV [10]	UAV-target	D,T	2021	580,000	1	580,000
DUT Anti-UAV [11]	UAV-target	D	2022	10,000	1	over $1 \times 10^{5}$
HIT-UAV [48]	UAV-Aerial	D	2023	2898	5	24,899
DroneSwarms [44]	UAV-target	D	2024	9100	1	242,200

2.2. Analysis of UAV-SOD

The UAV-SOD dataset comprises 5185 images containing UAVs. The images are annotated using LabelImg to generate label files in YOLO format, which store target class information and normalized bounding box coordinates information (x-center, y-center, width, height). A total of 5675 UAV targets are labeled in the UAV-SOD dataset. Most of these UAV fall under the category of small targets defined as object size with a resolution lower than 32 × 32 pixels (as proposed by the COCO dataset [49]). A high-quality dataset often drives the development of a field. For example, in computer vision, the introduction of the COCO dataset has significantly advanced object detection and image segmentation. As shown in Table 2, the data indicate that the dataset is predominantly composed of small UAVs, based on calculations from the label file annotations. These data show that the average width and height of all UAV are 28.57 and 22.56 pixels, respectively, falling within the small target size range. To improve the model’s generalization ability, UAV-SOD also includes images containing larger UAV targets, such as those indicated by the maximum values in Table 2, which shows that the largest target measures 537 × 462 pixels. At the same time, in order to enhance the robustness of the framework, the relevant images contain complex backgrounds under various lighting conditions in multiple scenarios, as shown in Figure 1.

In addition, Figure 2 shows the core data characteristics of UAV-SOD. The dense scatter plot in the lower left corner (small target area) of Figure 2a indicates that the dataset mainly consists of small UAVs. Additionally, Figure 2b shows the distribution of target center points, with brighter colors indicating denser data density. Overall, the UAV-SOD dataset, which primarily consists of small UAVs, can be used to address the challenge. Finally, these 5185 images are divided into training, validation, and test sets in a ratio of 7:1.5:1.5. The training set consists of 3630 images and is used to adjust model parameters (such as neural network weights) so that the model can learn patterns in the data. The validation set consists of 778 images and is used to adjust hyperparameters and select the optimal model. The test set consists of 778 images and is used to impartially evaluate the final performance.

2.3. Basic Framework Selection

To select the most suitable framework for UAV-SOD, this study compares the performance of various YOLO frameworks. Official benchmark results on the COCO dataset indicate that newer YOLO versions generally achieve better performance. However, given the significant differences in scale and target characteristics between COCO and UAV-SOD, performance trends observed on COCO may not directly transfer to UAV-specific scenarios. To address this, we conduct a comprehensive evaluation of multiple YOLO frameworks on the UAV-SOD dataset. The results are presented in Figure 3. The relevant data are sourced from Section 4.5. Given the relatively small size of the UAV-SOD dataset, the smallest model from each framework is selected for the comparative experiments to mitigate the risk of overfitting. The results show that YOLO11n achieves the best performance on UAV-SOD. This is primarily reflected in its bubble being positioned closer to the top-left corner of the plot—indicating lower computational cost (GFLOPs) and higher mAP⁵⁰—along with a smaller bubble area, denoting fewer model parameters. Accordingly, YOLO11n is adopted as the baseline model for subsequent targeted improvements in small UAV detection.

3. Methods

3.1. YOLO-CoOp

Based on the YOLO11n framework, this study proposes some targeted optimizations for small UAV detection. The C3k2-WT module is employed in both the backbone network and neck network to enhance the model’s feature extraction capability while expanding the model’s receptive field through WTConv without excessively increasing parameters. Additionally, high-resolution feature maps (160 × 160) are added to the FPN structure to form the HRFPN structure, thereby preserving more detailed spatial information about small UAVs. Furthermore, the SCSA is incorporated into the connection region between the head and neck network to improve feature information fusion effectiveness. Finally, by adjusting sampling to Dysample and confidence loss to ATFL, the retention of detailed feature information for small UAVs is further enhanced, and smaller UAV targets are assigned higher weight. Unlike generic enhancements in existing YOLO-based methods, all proposed modules—HRFPN, C3k2-WT, SCSA, and DyATF—are specifically designed and synergistically optimized for the unique challenges of small UAV detection, such as limited pixel occupancy, weak spatial features, and high false negative rates. While individual components may draw inspiration from prior work, their integration, adaptation, and collaborative configuration in YOLO-CoOp constitute a targeted strategy for small-target anti-UAV scenarios. The complete modified framework is shown in Figure 4, with the specific details of the mentioned method to be elaborated in subsequent sections.

3.2. Structure and Principle of HRFPN

The backbone network in object detection models extracts multi-scale features from input images. These hierarchical representations are critical for detecting objects of varying sizes, especially small targets. There are two types of features: spatial information and semantic information. The former refers to geometric details such as position, shape, and texture. The latter refers to semantic attributes and the relationship between contextual features. These features exhibit scale dependency, with shallow network exhibiting more spatial information and deep network exhibiting more semantic information. For small UAV targets with limited pixel coverage, spatial information extracted from shallow layers often undergoes information degradation after multiple convolutions and sampling. When this spatial information is passed to a deep neural network, feature information may be lost. In contrast, for larger UAV targets, deep network can better detect and localize them due to their larger receptive fields. The relevant results can be seen in Figure 5.

For the two small UAV targets images a and b, the information in the shallow network (160 × 160 and 80 × 80) is relatively accurate, but in the deep network (40 × 40 and 20 × 20), it is relatively scattered and sparse. The information is represented as highlighted areas in the figure map, and the closer the highlighted areas to their original positions in the image, the more accurate the information. However, for the two large UAV target images c and d, the deep network still conveys effective information. Therefore, when detecting small UAVs, it is necessary to retain more high-resolution feature maps to preserve more effective information. Additionally, UAV-SOD is a single-category dataset that does not require complex category differentiation, further reducing the importance of semantic information.

In summary, extending the feature map to 160 × 160 is a key improvement strategy. The architecture of HRFPN is shown in Figure 6. It mainly expands the input features of the detection head from the original (80 × 80, 40 × 40, 20 × 20) to (160 × 160, 80 × 80, 40 × 40). The purpose of this is to retain more spatial detail information about small UAVs.

3.3. C3k2-WT Module

In a CNN, the concept of receptive field [50] refers to the range of influence of the input region on the activation of neurons in the feature map. Formally, for a neuron at layer l, its receptive field

R_{l}

is defined as in Formula (1):

R_{l} = R_{l - 1} + \sum_{i = 1}^{l - 1} ((k_{i} - 1) \prod_{j = 1}^{i} s_{j})

(1)

where

k_{i}

denotes the size of the convolution kernel and

s_{j}

denotes the stride of the convolution. From the formula, it can be seen that the receptive field monotonically increases with the number of layers in the neural network, which explains why deeper CNNs have larger receptive field ranges. In the improvement made in the previous subsection, the HRFPN introduces high-resolution feature maps by discarding the original 20 × 20 feature maps. It reduces the receptive field range of the model, thereby limiting its ability to capture global contextual information. Therefore, this study adopts the WTConv proposed in [51] and introduces it into the C3k2 module to form the C3k2-WT module shown in Figure 7. The C3k2-WT module has two structures: Bottleneck-WT and C3k-WT. Both are formed by introducing WTConv into the original structure. The Bottleneck-WT is a single-layer suitable for shallow network, while the C3k-WT is a recurrent structure suitable for deep network. In this paper, the C3k2-WT parameters are set to false for 160 × 160 and 80 × 80 feature map and to true for 40 × 40 and 20 × 20 feature maps. Figure 4 also illustrates the relevant parameter settings.

The core of WTConv (Wavelet Transform Convolution) is to leverage the multi-frequency decomposition property of the 2D Haar Discrete Wavelet Transform (DWT) to achieve a large effective receptive field while keeping the number of parameters low. Specifically, WTConv first applies a single-level 2D Haar DWT to the input feature map

X \in R^{C \times H \times W}

, which decomposes it into four sub-band components using four fixed

2 \times 2

filters with Stride 2. Stride of 2 in the 2D Haar wavelet transform is not arbitrary; it is a fundamental requirement of the critically sampled DWT. Specifically, after convolving the input feature map with the Haar analysis filters (

X_{L L}, X_{L H}, X_{H L}, X_{H H}

), a downsampling operation with Stride 2 is applied along both spatial dimensions. This reduces the spatial resolution of each sub-band by a factor of 2, ensuring that the total number of coefficients across all four sub-bands equals the number of pixels in the original input. This critical sampling property prevents information redundancy and enables a compact, multi-resolution representation, essential for efficient feature decomposition in WTConv.

f_{L L} = \frac{1}{2} [\begin{matrix} 1 & 1 \\ 1 & 1 \end{matrix}], f_{L H} = \frac{1}{2} [\begin{matrix} 1 & - 1 \\ 1 & - 1 \end{matrix}], f_{H L} = \frac{1}{2} [\begin{matrix} 1 & 1 \\ - 1 & - 1 \end{matrix}], f_{H H} = \frac{1}{2} [\begin{matrix} 1 & - 1 \\ - 1 & 1 \end{matrix}]

(2)

Here,

f_{L L}

captures the low-frequency (coarse) content, while

f_{L H}

,

f_{H L}

, and

f_{H H}

capture horizontal, vertical, and diagonal high-frequency (detail) components, respectively. Applying these filters yields four downsampled feature maps via depth-wise convolution with Stride 2:

\begin{matrix} X_{L L} & = (X * f_{L L}) {↓_{2}, X_{L H} = (X * f_{L H}) ↓}_{2} \\ X_{H L} & = (X * f_{H L}) {↓_{2}, X_{H H} = (X * f_{H H}) ↓}_{2} \end{matrix}

(3)

where ∗ denotes convolution and

↓_{2}

denotes downsampling by a factor of 2 in both spatial dimensions. The resulting sub-bands have spatial size

\frac{H}{2} \times \frac{W}{2}

.

Next, a lightweight depth-wise convolution

{Conv}_{dw}^{(k)}

with kernel size

k \in {3, 5}

is applied independently to each sub-band:

\begin{matrix} Y_{L L} & = {Conv}_{dw}^{(k)} (X_{L L}), Y_{L H} = {Conv}_{dw}^{(k)} (X_{L H}) \\ Y_{H L} & = {Conv}_{dw}^{(k)} (X_{H L}), Y_{H H} = {Conv}_{dw}^{(k)} (X_{H H}) \end{matrix}

(4)

Finally, the Inverse Wavelet Transform (IWT) reconstructs the feature map by transposed convolution (Stride 2 deconvolution) using the same Haar filters:

IWT ([Y_{L L}, Y_{L H}, Y_{H L}, Y_{H H}]) = \sum_{α \in {L L, L H, H L, H H}} (Y_{α} ↑_{2} * f_{α})

(5)

where

↑_{2}

denotes bilinear or nearest upsample by 2 (equivalent to transposed convolution under Haar basis).

Receptive Field and Parameter Analysis

Let the convolution kernel size be k. Since the LL band is downsampled once, a

k \times k

kernel on

X_{L L}

corresponds to a

(2 k) \times (2 k)

receptive field in the original input space. Therefore, the effective receptive field (ERF) of WTConv is approximately:

{ERF}_{WTConv} \approx 2 k .

(6)

In contrast, a standard depth-wise convolution with kernel size

K = 2 k

would require

K^{2} C = 4 k^{2} C

parameters. WTConv, however, uses four

k \times k

depth-wise convolutions on C input channels, totaling only

{Params}_{WTConv} = 4 \cdot (k^{2} C) = 4 k^{2} C,

(7)

but crucially, each operates on a

\frac{H}{2} \times \frac{W}{2}

feature map, leading to lower FLOPs and memory access. Moreover, the filters

f_{α}

are fixed and non-learnable, adding zero trainable parameters.

Why Single-Level Decomposition: Multi-level WT decomposition (e.g., 2 or 3 levels) can further enlarge the receptive field exponentially:

ERF \approx k \cdot 2^{ℓ}

for ℓ levels. However, as analyzed in [52], deeper decomposition introduces significant computational overhead:

Each level requires sequential DWT → Conv → IWT, increasing latency due to multiple memory reads/writes;
In modern CNN backbones (e.g., ResNet, ConvNeXt), feature maps in deep stages have low spatial resolution (e.g., $7 \times 7$ ), making $ℓ \geq 2$ infeasible;
For small UAV detection, where targets often span only a few pixels, a moderate ERF (e.g., $6 \times 6$ to $10 \times 10$ ) is sufficient to capture local context without over-smoothing fine details.

Therefore, we adopt single-level WTConv (

ℓ = 1

), which provides an optimal trade-off between receptive field expansion, parameter efficiency, and real-time inference, aligning with our design goal of minimal computational overhead while enhancing contextual awareness for small targets. As can be seen from Figure 8, compared with general convolution results, the feature map obtained by WTConv contains richer information.

3.4. SCSA Module

Existing detection frameworks often suffer from high false negative and false positive rates in small UAV detection, mainly because of lost spatial details and poor channel-wise semantic discrimination. To address these issues, this study adopts SCSA module proposed in [52] to enhance feature fusion, whose structure is shown in Figure 9.

The core of SCSA lies in its dual-module collaborative architecture. It enhances information fusion for small UAVs by jointly improving spatial semantic representation and context awareness. Specifically, a shared Multi-Semantic Spatial Attention (SMSA) module processes features along the height and width dimensions using multi-scale 1D convolutions (kernel sizes: 3, 5, 7, 9) to capture both local structures and global context, thereby strengthening spatial representation. Concurrently, a Progressive Channel-wise Self-Attention (PCSA) mechanism adaptively recalibrates channel weights to suppress redundancy and highlight salient regions, effectively resolving inter-channel semantic ambiguity.

To further elaborate the theoretical foundation of SCSA, let the input feature be

X \in R^{B \times C \times H \times W}

, where B, C, H, and W denote batch size, channel number, height, and width, respectively.

The SMSA first decomposes X along spatial dimensions via global average pooling:

\begin{matrix} X_{H} & = {GAP}_{W} (X) \in R^{B \times C \times H} \end{matrix}

(8)

\begin{matrix} X_{W} & = {GAP}_{H} (X) \in R^{B \times C \times W} \end{matrix}

(9)

Each is split into

K = 4

sub-features

{X_{H}^{i}}_{i = 1}^{K}

and

{X_{W}^{i}}_{i = 1}^{K}

, each with

C / K

channels. Multi-scale depth-wise 1D convolutions with kernel sizes

{3, 5, 7, 9}

extract diverse spatial semantics:

\begin{matrix} {\tilde{X}}_{H}^{i} & = {DWConv 1 d}_{k_{i}} (X_{H}^{i}) \end{matrix}

(10)

\begin{matrix} {\tilde{X}}_{W}^{i} & = {DWConv 1 d}_{k_{i}} (X_{W}^{i}), k_{i} \in {3, 5, 7, 9} \end{matrix}

(11)

Group Normalization (GN) with K groups and Sigmoid activation yield spatial attention maps:

\begin{matrix} {Attn}_{H} & = σ ({GN}_{K} (Concat ({\tilde{X}}_{H}^{1}, \dots, {\tilde{X}}_{H}^{K}))) \end{matrix}

(12)

\begin{matrix} {Attn}_{W} & = σ ({GN}_{K} (Concat ({\tilde{X}}_{W}^{1}, \dots, {\tilde{X}}_{W}^{K}))) \end{matrix}

(13)

The refined feature is

X_{s} = {Attn}_{H} \cdot {Attn}_{W} \cdot X

(14)

The PCSA then applies average pooling to compress

X_{s}

to

X_{p} \in R^{B \times C \times H^{'} \times W^{'}}

, followed by linear projections:

Q = F^{Q} (X_{p}), K = F^{K} (X_{p}), V = F^{V} (X_{p})

(15)

where

F^{{\cdot}}

are

1 \times 1

convolutions. Channel-wise single-head self-attention computes

X_{attn} = Softmax (\frac{Q K^{⊤}}{\sqrt{C}}) V

(16)

Finally, channel recalibration gives

X_{c} = X_{s} \cdot σ ({GAP}_{H^{'} \times W^{'}} (X_{attn}))

(17)

Thus, the full SCSA module is

SCSA (X) = PCSA (SMSA (X))

(18)

This design enhances spatial detail preservation for small UAV while mitigating inter-channel semantic conflicts through guided self-attention. The effectiveness of SCSA is demonstrated in the heatmap visualization comparison in Figure 10, which shows the original image, the heatmap without the SCSA module, and the heatmap with the SCSA module. Compared to the baseline, the model equipped with SCSA exhibits significantly sharper and more localized attention on small UAV, as evidenced by the concentrated activation in the attention maps.

3.5. DyATF Collaborative Optimization

3.5.1. Loss Function Optimization

In YOLO-based object detectors, the confidence loss evaluates whether a predicted bounding box contains an object. The standard binary cross-entropy (BCE) loss is commonly employed to balance positive and negative samples. However, BCE treats all samples equally, assigning uniform gradients regardless of sample difficulty. This leads to insufficient learning signal for challenging cases—such as small UAV detection—where targets exhibit weak features or low localization accuracy. To address this limitation, we adopt the Adaptive Threshold Focal Loss (ATFL) proposed in [53] to replace the standard BCE loss. ATFL extends BCE by introducing a dynamic threshold mechanism that adaptively amplifies gradients for hard samples, thereby enhancing sensitivity to difficult targets while suppressing gradient dominance from easy background regions. The standard BCE loss is defined as in Formula (19):

{Loss}_{BCE} = {- (y log p + (1 - y) log (1 - p))}

(19)

where p denotes the predicted probability and y denotes the true labeled value of the data, which is succinctly represented as in Formulas (20) and (21).

\begin{matrix} {Loss}_{BCE} = - log p^{t} \end{matrix}

(20)

\begin{matrix} p^{t} = \{\begin{matrix} p & y = 1 \\ 1 - p & y \neq 1 \end{matrix} \end{matrix}

(21)

The ATFL function is derived from the BCE loss function, expressed as in Formula (22).

{Loss}_{ATFL} = \{\begin{matrix} - {(λ - p^{t})}^{- ln (p^{t})} log (p^{t}) & p^{t} \leq 0.5, \\ - {(1 - p^{t})}^{- ln (p_{c})} log (p^{t}) & p^{t} > 0.5 . \end{matrix}

(22)

Among them,

λ

is an adjustable parameter.

p_{c}

can be expressed as in Formula (23).

p_{c} = 0.05 \times \frac{1}{t - 1} \sum_{i = 0}^{t - 1} {\bar{p}}^{i} + 0.95 p^{t}

(23)

where

p_{c}

denotes the predicted value of the next epoch,

p^{t}

denotes the current average predicted probability value, and

{\bar{p}}^{i}

denotes the average predicted probability value of each epoch.

3.5.2. Upsampling Optimization

YOLO-based object detectors typically employ nearest-neighbor interpolation for upsampling, which replicates pixel values from the nearest neighbors to fill in the upsampled feature map. While this approach is computationally simple, it introduces significant artifacts when applied to regions containing small UAVs. Due to the limited spatial coverage of such targets, nearest-neighbor upsampling often results in blurred textures, stair-step (aliasing) artifacts along object boundaries, and critically the loss or complete disappearance of fine contour details essential for detection. Therefore, we replace nearest-neighbor interpolation with Dysample proposed in [54], whose structure is shown in Figure 11.

The core principle of Dynamic upsample is to generate feature maps using the grid sample function to achieve a feature map enlargement effect. It is a built-in function in the PyTorch framework, whose basic principle is to convert the input tensor into an output of the specified size using bilinear interpolation. Dynamic upsample accepts a feature map of size W×H×C as input, uses a sampling point generator to generate the corresponding sampling set S, and then passes S and the input feature map X as input to the grid sample function, outputting the sampled feature map.

It has two types of sampling point generators: static and dynamic. Dynamic sampling point generators require more computation, so for simplicity, this study uses static sampling point generators. The static generator constrains the sampling offset range using a predefined scaling factor of 0.25, thereby avoiding dynamic weight calculations and significantly reducing computational costs. The core process is divided into two parallel branches: branch1 generates a set of

s^{2}

offsets for the channel dimension through a linear layer, which is then scaled and pixel shuffle [55] to obtain the dynamic sampling set S1; branch2 generates a conventional grid sampling set S2, and finally outputs the fused sampling set S through element-wise overlay of S1 and S2.

Finally, we obtained the DyATF method by replacing the confidence loss function with ATFL and replacing the upsampling method with Dysample.

4. Experiments and Results

4.1. Experimental Platform

The experimental setup comprises both hardware specifications and the deep learning software environment. The GPU used is an NVIDIA GeForce RTX 4060 Ti with 16 GB of VRAM. The operating system is Windows 11 Professional. The deep learning environment uses PyTorch version 2.1.0, CUDA version 12.1, and Python version 3.8. In order to accelerate the model training speed and reduce the computational cost, the input image size is 640 × 640 and the training epochs are 250. A batch size of 16 is used, and the Adam optimizer is employed with an initial learning rate of 0.005, linearly decayed to 0.01 over the training period. Data augmentation techniques are also applied during training. Automatic mixing precision (AMP) is turned on for training, which accelerates the computation and saves the memory. The important parameters for model training are shown in Table 3.

4.2. Evaluation Indicators

The performance of YOLO-based detectors is quantitatively assessed using multiple evaluation metrics. Its core indicators include the mean average precision (mAP) [49], which comprehensively evaluates the model’s detection capability across different IoU thresholds. Among them, mAP⁵⁰ uses an IoU threshold of 0.5 as the baseline, while mAP⁵⁰⁻⁹⁵ is computed by averaging the mean average precision over IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. This latter metric imposes stricter requirements on the model’s bounding box localization accuracy. Specifically, mAP is the average of the per-class average precision (AP) scores, providing a holistic measure of detection performance in multi-class object detection. Higher mAP values indicate better combined localization and classification accuracy. The calculation method for mAP are shown in Formulas (24) and (25).

\begin{matrix} A P = \int_{0}^{1} (p \cdot r) d r \end{matrix}

(24)

\begin{matrix} m A P = \frac{1}{N} \sum_{i = 1}^{N} A P_{i} \end{matrix}

(25)

The evaluation indicators also include precision and recall [49], which reflect, respectively, the model’s ability to identify false positives and false negatives. These indicators are calculated using four values: true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs). The higher the precision, the fewer the number of false positives; the higher the recall, the fewer the number of false negatives. The formulas for precision and recall are as in Formulas (26) and (27).

\begin{matrix} P r e c i s i o n = \frac{TP}{TP + FP} \end{matrix}

(26)

\begin{matrix} R e c a l l = \frac{TP}{TP + FN} \end{matrix}

(27)

There are also metric parameters specifically used to evaluate the size of the model, such as parameters, which determines the size of the model, and GFLOPs determines the arithmetic power required for model inference.

4.3. Ablation Experiments

This study conducted a progressive ablation study based on the unmodified YOLO11n baseline (denoted as M1). Starting from M1, we incrementally integrated proposed components to construct successive variants:

M2: Replaced the original FPN with the HRFPN structure.
M3: Replaced the standard C3k2 module with the proposed C3k2-WT in both the backbone and neck.
M4: Integrated the SCSA module at the interface between the neck and the detection head.
M5: Further replaced the default upsampling operator in M4 with Dysample.
M6: Replaced the standard confidence loss in M4 with the proposed ATFL.
M7: The final YOLO-CoOp framework, combining all the above modules.

All models were trained under identical hyperparameters and experimental settings, with quantitative results summarized in Table 4.

The ablation results (Table 4 and Figure 12) demonstrate that incorporating the HRFPN structure (M2) improves the model’s recall by 7.9%, mAP⁵⁰ by 4.2%, mAP⁵⁰⁻⁹⁵ by 3.7%, and precision by 1.0%, with consistent gains across all metrics. Further integrating the C3k2-WT module (M3) yields an additional 1.5% precision improvement, which exceeds the 1.0% precision gain provided by HRFPN alone; other metrics also show incremental improvements.

The SCSA module (M4) boosts mAP⁵⁰ to 95.9% and increases precision by 1.0%. Although the individual contributions of M5 (Dysample) and M6 (ATFL) are modest, their combined effect within the full YOLO-CoOp framework (M7) proves effective. In the final model M7, all evaluation metrics reach their peak values.

Compared to the baseline M1, YOLO-CoOp achieves a 3.6% improvement in precision, a 10.0% improvement in recall, a 5.9% improvement in mAP⁵⁰, and a 5.0% improvement in mAP⁵⁰⁻⁹⁵. Additionally, it reduces the number of parameters by approximately 0.62 M and model size by 1.1 MB, resulting in a more compact architecture.

In the high-reliability context of anti-UAV system, the accuracy of the detection module is the primary prerequisite for the successful execution of subsequent tracking and countermeasure operations. False positives or missed detections not only degrade system performance but may also cause the entire defense system to stall or even collapse. Therefore, in our model design, we prioritize detection accuracy over raw inference speed or computational efficiency. As shown in Table 4, by integrating the HRFPN module and replacing the original

20 \times 20

feature map with a higher-resolution

160 \times 160

feature map, we observe a significant increase in computational complexity (GFLOPs rising from 6.4 to 10.9). However, this modification yields substantial performance gains: Recall improves by 7.9% and mAP⁵⁰ increases by 4.2%, effectively reducing the miss rate and significantly enhancing the model’s ability to detect small UAV targets. Notably, the inference speed of 56.2 FPS remains well above the real-time threshold (>30 FPS), fully satisfying practical deployment requirements.

The collaborative effectiveness of the proposed modules is further validated in subsequent sections. As shown in Figure 13, YOLO-CoOp demonstrates superior detection performance, particularly in reducing false positives and false negatives and exhibits enhanced capability in accurately localizing small UAVs. It also identifies small UAVs more accurately and effectively.

4.4. Collaborative Relationship Validation Experiments

To investigate the collaborative relationships among the four proposed modules—HRFPN, C3k2-WT, SCSA, and DyATF (which integrates Dysample and ATFL)—we conducted a series of combinatorial experiments. The performance of different module combinations varies significantly. For instance, the integration of HRFPN and DyATF achieves the highest mAP⁵⁰⁻⁹⁵ among all tested configurations.

The experimental protocol involves systematically assembling subsets of these modules and applying them to the YOLO11n baseline for training. All experiments were conducted on the same platform and with identical hyperparameters as described in Section 4.1, ensuring a fair and rigorous comparison.

We evaluated all valid combinations of two, three, and four modules, resulting in 11 distinct configurations (excluding single-module variants, which were already covered in the ablation study). The corresponding results are summarized in Table 5.

From the results, combinations incorporating HRFPN consistently exhibit reduced model parameters and improved performance, demonstrating that the HRFPN structure effectively enhances detection capability while promoting model compactness.

Specifically:

The HRFPN + DyATF combination achieves the highest mAP⁵⁰⁻⁹⁵ of 58.1%.
The HRFPN + SCSA combination minimizes the number of parameters to 1.94 M, contributing significantly to model compression.
The HRFPN + C3k2-WT combination attains the second highest precision of 93.3%, highlighting its effectiveness in improving classification confidence.
The HRFPN + C3k2-WT + SCSA combination delivers competitive overall performance, ranking among the top configurations.
The full integration of HRFPN + C3k2-WT + SCSA + DyATF yields the highest precision, recall, and mAP⁵⁰, while achieving the second highest mAP⁵⁰⁻⁹⁵.

Except for a few combinations exhibiting exceptional performance or parameter efficiency, the performance gap among configurations is clearly evident. For most other combinations, the collaborative effects are less distinguishable, as illustrated in Figure 14.

To quantitatively evaluate the collaborative relationships between modules, we introduce the Collaborative Efficiency Density (CED), defined as

C E D = \frac{{(\sum_{i = 1}^{4} \frac{4}{x_{i}})}^{- 1}}{p} \times 10^{6}

(28)

where

x_{i} \in \{precision, recall, {mAP}^{50}, {mAP}^{50 - 95}\}

and p denotes the number of model parameters (in millions). The harmonic mean is employed to equally weight all four metrics, which span different ranges (precision/recall/mAP⁵⁰: 90–100%; mAP⁵⁰⁻⁹⁵: 50–60%).

CED quantifies the average performance gain per million parameters, serving as a metric for evaluating module synergy. Higher CED values indicate stronger collaborative effectiveness. Thus, the relative quality of synergy among module combinations can be directly compared via their CED scores.

As shown in Figure 15, the YOLO-CoOp framework—integrating all four modules—achieves the highest CED value, indicating the strongest collaborative synergy among all configurations. The bar chart ranks combinations by CED score, from highest (top) to lowest (bottom).

4.5. Comparison Experiments

To further validate the performance of YOLO-CoOp, we conducted comparative experiments against a range of mainstream object detection models, including multiple variants within the YOLO family and two recent transformer-based detectors: RT-DETR-R50 and RT-DETR-l. All selected models are comparable to YOLO-CoOp in terms of parameter count and model size, enabling a fair evaluation under resource-constrained scenarios. The experimental platform and training protocol remain consistent with those described in Section 4.1, with all models trained on the UAV-SOD dataset to ensure methodological rigor. Quantitative results are summarized in Table 6, and visual comparisons are presented in Figure 16.

From the results, it is evident that YOLO-CoOp achieves the highest precision (94.3%), recall (93.1%), mAP⁵⁰ (96.2%), and mAP⁵⁰⁻⁹⁵ (57.6%) among all compared models. Notably, despite having only 1.97 M parameters and a compact 4.4 MB model size, YOLO-CoOp outperforms both lightweight YOLO variants (e.g., YOLOv5n, YOLOv8n) and larger-scale models such as RT-DETR-R50 (41.9 M parameters, 86.0 MB) and RT-DETR-l (32.8 M parameters, 66.2 MB).

As illustrated in the bubble diagram (Figure 16), YOLO-CoOp occupies the top-left region, indicating superior accuracy with minimal computational footprint. In contrast, RT-DETR models, while achieving competitive mAP⁵⁰ scores (92.0–93.4%), exhibit significantly larger model sizes and higher parameter counts, placing them far from the efficiency–accuracy Pareto frontier for small UAV detection.

Within the YOLO family, smaller models (e.g., YOLOv5n, YOLOv11n) show similar parameter budgets but vary in performance; for instance, YOLOv8s (11.1 M parameters, 22.5 MB) offers only marginal gains over YOLOv5n (2.5 M, 5.3 MB), suggesting diminishing returns at larger scales on UAV-SOD.

4.6. Cross-Dataset Validation Experiments

To validate the generalization capability of YOLO-CoOp for small object and UAV detection across different domains, we conducted cross-dataset validation experiments on three publicly available benchmarks: Visdrone [38], DUT Anti-UAV [11], and TIB-Net-Drone dataset [45].

Visdrone contains 10,209 images captured from a UAV’s perspective, covering diverse small object categories such as pedestrians, vehicles, and bicycles. As shown in Figure 17a, the majority of targets exhibit normalized bounding box dimensions below 0.3 in both width and height, confirming their classification as small objects.

DUT Anti-UAV comprises 10,000 images (5200 training, 2600 validation, 2200 testing), with a high proportion of small UAVs as primary targets. The scatter plot in Figure 17b further validates that most UAV instances are spatially compact.

TIB-Net-Drone is a dedicated small UAV detection dataset consisting of 2850 images, where nearly all targets are sub-meter-scale UAVs. As illustrated in Figure 17c, target sizes are even more constrained, with most occupying less than 0.1 in normalized width and height.

Due to hardware constraints, the batch size was set to 8 for Visdrone; for DUT Anti-UAV and TIB-Net-Drone, the default batch size of 16 was retained. All other experimental settings (optimizer, learning rate, epochs, image size) remained identical to those described in Section 4.1. Results are summarized in Table 7.

On Visdrone, YOLO-CoOp achieves a 4.1 percentage point improvement in mAP⁵⁰ over the baseline YOLO11n (36.0% vs. 31.9%), with consistent gains in precision (+1.7%) and recall (+3.6%). On DUT Anti-UAV, it improves mAP⁵⁰ by 4.9% (91.4% vs. 86.5%) and recall by 6.3% (84.9% vs. 78.6%), demonstrating strong transferability to UAV-specific scenarios. Most notably, on TIB-Net-Drone, a highly challenging small UAV benchmark, YOLO-CoOp achieves a remarkable 9.7% gain in mAP⁵⁰ (92.4% vs. 82.7%) and a 12.4% increase in recall (92.3% vs. 79.9%).

Visual inspection of detection results (Figure 18 and Figure 19) confirms YOLO-CoOp’s superior ability to localize small objects under varying scales, occlusions, and backgrounds. Overall, these results demonstrate that YOLO-CoOp not only excels on its native UAV-SOD dataset but also exhibits robust generalization performance across diverse small-object and UAV detection tasks.

5. Visual Detection in Anti-UAV System

5.1. Experimental Setup

To evaluate the practical feasibility of our proposed YOLO-CoOp framework in real-world anti-UAV applications, we implemented a fully functional detection and tracking system, as illustrated in Figure 20.

The system comprises three core hardware components as shown in Table 8: a computing platform, an electro-optical sensing unit, and a tracking gimbal. For computation, we used a standard laptop equipped with an NVIDIA GeForce GTX 1650 GPU (4 GB VRAM) running Windows 11 Professional. The vision sensor is a Hikvision DS-2ZMN2507C camera module, offering 2-megapixel resolution (

1920 \times 1080

) and

25 \times

optical zoom (focal length range: 4.8–120 mm)—capabilities essential for detecting small UAVs across a range of operational distances. The camera was configured to operate at 30 fps to ensure smooth video capture under local lighting conditions. A HY-MZ17-01A tracking gimbal was integrated to provide precise pan-tilt control, with horizontal rotation speeds of 9–

45^{\circ}

/s, vertical speeds of 2.6–

13^{\circ}

/s, and a positioning accuracy of

\pm 0 . 1^{\circ}

.

The software environment was kept consistent with the training setup described in Section 4.1 (Table 3) to ensure a fair and comparable evaluation of model performance under real-world conditions.

Field experiments were conducted in an urban park between 9:00 and 12:00 on a clear, sunny day. This time window provided stable illumination while preserving realistic operational complexity. The test environment featured diverse visual backgrounds—including trees, buildings, and occasional pedestrian activity—introducing natural challenges that closely mimic practical anti-UAV scenarios. Two widely used consumer UAVs were selected as test targets: the DJI Phantom 4 Pro V2.0 (white, larger airframe) and the DJI Air 2S (gray, more compact), enabling performance evaluation across different target scales and visual appearances.

5.2. Quantitative Metrics and Experimental Protocol

To comprehensively evaluate system performance under realistic operational conditions, we designed a series of controlled field experiments. First, distance variation tests were conducted at 20 m, 40 m, 60 m, 80 m, and 120 m—with the camera fixed at its native wide-angle setting (4.8 mm, no optical zoom)—to assess detection robustness as the target subtends progressively fewer pixels. Second, optical zoom tests were performed at a fixed distance of 130 m (the empirical maximum detection range without zoom), evaluating system performance across zoom levels of 5×, 10×, 15×, 20×, and 25× (corresponding to focal lengths from 4.8 mm to 120 mm). Third, extended-range tests were conducted under maximum optical zoom (25×) at distances of 150 m, 180 m, 210 m, 240 m, 270 m, and 300 m to evaluate the practical detection limit of the integrated vision system. Finally, comprehensive robustness trials were performed under four representative adversarial conditions: (1) complex background, where UAVs flew against dense tree canopies and building facades; (2) multi-target, involving simultaneous operation of the DJI Phantom 4 Pro V2.0 and DJI Air 2S in close proximity; (3) high-speed maneuvers, including rapid directional changes, aggressive accelerations, and abrupt altitude variations mimicking evasive flight; and (4) backlighting, with the sun positioned directly behind the UAV to induce severe contrast loss and silhouette effects.

To objectively evaluate the practical viability of our system in real-world anti-UAV scenarios, we adopted four key quantitative metrics: average frames per second (FPS), average end-to-end latency (ms), average GPU utilization (%), and average GPU power consumption (W).

FPS reflects the system’s real-time processing capability; in visual surveillance applications, a sustained rate above 25 FPS is generally considered sufficient for smooth target tracking and timely operator response. End-to-end latency—measured from image capture to bounding box output—directly affects the responsiveness of downstream tracking or control modules, with values below 50 ms being desirable for engaging fast-moving UAV. GPU utilization and power consumption jointly characterize the system’s computational footprint, which is crucial for deployment on mobile or battery-constrained platforms where thermal and energy budgets are limited. Together, these four metrics provide a comprehensive assessment of both performance and deployability under realistic operating conditions.

The experimental results, summarized in Table 9, Table 10, Table 11 and Table 12, demonstrate that the YOLO-CoOp framework maintains robust real-time performance across a wide range of operational conditions.

Under baseline conditions (no optical zoom), the system achieves stable performance up to 120 m, with FPS consistently above 26.4 and latency below 38 ms (Table 9). This confirms reliable operation at typical surveillance distances. At the critical 130 m range—the empirical detection limit without zoom—activating optical zoom proves highly effective; even at 5×, performance remains acceptable, and at 25×, the system sustains 27.0 FPS with only marginal increases in GPU load (Table 10). When pushed to its maximum zoom (25×), the system successfully detects UAV at extended ranges up to 300 m, maintaining an average FPS of 28.1 and latency under 38 ms (Table 11), showcasing its capability for long-range surveillance.

Under challenging real-world conditions, performance naturally degrades but remains functional (Table 12). The multi-target scenario exhibits the highest latency (49.68 ms) and lowest FPS (20.48), primarily due to the computational burden of managing multiple tracks. Complex background and backlighting tests show similar degradation, indicating that visual clutter and adverse lighting are significant challenges. However, even under these stressors, the system maintains an average FPS above 20, demonstrating sufficient responsiveness for many operational use cases. The relatively low power consumption (19.33–21.40 W) further reinforces its suitability for prolonged deployment on battery-powered platforms.

The qualitative visualization in Figure 21 demonstrates the system’s robustness across a wide range of operational conditions. At close to medium ranges (20–120 m) without optical zoom, the system consistently detects the UAVs with stable bounding boxes and real-time frame rates (26–29 FPS). When operating at the critical distance of 130 m—the empirical limit without zoom—activating optical zoom significantly improves target visibility, with performance remaining stable across all tested zoom levels (5×–25×). Under maximum zoom (25×), the system successfully tracks the UAVs at extended distances up to 300 m, confirming its capability for long-range surveillance.

In challenging real-world conditions, the framework maintains reliable detection despite environmental interference. In multi-target scenarios, both UAVs are tracked simultaneously, though with slightly reduced FPS. Complex backgrounds and strong backlighting introduce some visual clutter, yet the detector remains functional, demonstrating resilience to adverse illumination and occlusion. High-speed maneuvers result in temporary tracking jitter but no complete loss of detection, indicating sufficient temporal stability for practical deployment.

In summary, our experiments confirm that the YOLO-CoOp framework delivers consistent, real-time performance across a wide spectrum of realistic anti-UAV scenarios, from long-range detection to complex environmental interference, while maintaining efficient resource utilization on standard computing hardware.

6. Conclusions

This study proposed a multi-module collaborative optimization framework YOLO-CoOp for small UAV detection. By combining HRFPN, C3k2-WT, SCSA, and DyATF method, the YOLO-CoOp model achieves an precision of 94.3%, a recall of 93.1%, an mAP⁵⁰ of 96.2%, and an mAP⁵⁰⁻⁹⁵ of 57.6%. Multiple ablation experiments validated the incremental contributions of each module. Collaboration relationship validation experiments revealed the collaborative relationships between modules and demonstrated the collaborative effects of module interactions. Additionally, cross-dataset evaluations on the Visdrone and Anti-UAV datasets demonstrated the model’s effectiveness in small object detection and UAV detection domains, showcasing its robust generalization capabilities and adaptability. Furthermore, practical validation experiments were conducted on object detection and tracking system in an anti-UAV system, with results indicating the ability to accurately detect small UAVs. These results establish YOLO-CoOp as a lightweight solution for small UAV detection.

In the current UAV detection work, the system faces two critical limitations that need to be addressed. First, in complex environmental backgrounds—such as when a UAV flies near trees or buildings—the target can easily blend into the background due to lighting variations and viewing angles, leading to significant miss-detections. This issue is particularly pronounced in models like YOLO, which are based on a CNN; their limited receptive fields and weak global modeling capacity struggle to effectively distinguish object boundaries from cluttered surroundings. Such misdetections are catastrophic for counter-UAV systems, potentially resulting in serious security risks. To address this, future work could explore integrating Transformer architectures, whose self-attention mechanisms enable stronger global context modeling and long-range dependency capture, thereby improving object localization accuracy in cluttered scenes. Second, purely vision-based detection methods perform poorly at long ranges: when a UAV is sufficiently far away, it appears in the image as merely a tiny dot or may not be visible at all, making feature extraction extremely challenging for deep learning models and often causing detection failure. To overcome this limitation, future research should focus on multimodal fusion techniques that combine heterogeneous inputs such as acoustic sensors, optical imagery, and radar data. By aligning cross-modal features and fusing decisions at multiple levels, a more robust and adaptive UAV detection framework can be developed.

Author Contributions

Conceptualization, K.F. and Z.C.; methodology, K.F.; software, Z.X.; validation, Z.C., J.Y., Z.X. and Y.W.; formal analysis, J.Y.; investigation, Z.C.; resources, K.F.; data curation, Z.X.; writing—original draft preparation, Z.C.; writing—review and editing, K.F. and J.Y.; visualization, J.Y.; supervision, K.F. and Y.W.; project administration, K.F.; funding acquisition, K.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [No. 62363014, No. 61763018]; the Key Plan Project of Science and Technology of Ganzhou [GZ2024ZDZ008].

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, M.; Zhang, D.; Wang, B.; Li, L. Dynamic Trajectory Planning for Multi-UAV Multi-Mission Operations Using a Hybrid Strategy. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 7369–7386. [Google Scholar] [CrossRef]
Salvati, D.; Drioli, C.; Ferrin, G.; Foresti, G.L. Acoustic Source Localization from Multirotor UAVs. IEEE Trans. Ind. Electron. 2020, 67, 8618–8628. [Google Scholar] [CrossRef]
Wang, H.; Liu, X.; Zhou, X. Autonomous UAV Interception via Augmented Adversarial Inverse Reinforcement Learning. In Proceedings of the 2021 International Conference on Autonomous Unmanned Systems (ICAUS 2021); Springer: Singapore, 2022; pp. 2073–2084. [Google Scholar]
Souza, B.J.; Stefenon, S.F.; Singh, G.; Freire, R.Z. Hybrid-YOLO for Classification of Insulators Defects in Transmission Lines Based on UAV. Int. J. Electr. Power Energy Syst. 2023, 148, 108982. [Google Scholar] [CrossRef]
Ahmad, F.; Qiu, B.; Dong, X.; Ma, J.; Huang, X.; Ahmed, S.; Chandio, F.A. Effect of Operational Parameters of UAV Sprayer on Spray Deposition Pattern in Target and Off-Target Zones During Outer Field Weed Control Application. Comput. Electron. Agric. 2020, 172, 105350. [Google Scholar] [CrossRef]
Rodrigues, T.A.; Patrikar, J.; Oliveira, N.L.; Matthews, H.S.; Scherer, S.; Samaras, C. Drone Flight Data Reveal Energy and Greenhouse Gas Emissions Savings for Very Small Package Delivery. Patterns 2022, 3, 100569. [Google Scholar] [CrossRef]
Buchelt, A.; Adrowitzer, A.; Kieseberg, P.; Gollob, C.; Nothdurft, A.; Eresheim, S.; Tschiatschek, S.; Stampfer, K.; Holzinger, A. Exploring Artificial Intelligence for Applications of Drones in Forest Ecology and Management. For. Ecol. Manag. 2024, 551, 121530. [Google Scholar] [CrossRef]
Mekdad, Y.; Aris, A.; Babun, L.; Fergougui, A.E.; Conti, M.; Lazzeretti, R.; Uluagac, A.S. A Survey on Security and Privacy Issues of UAVs. Comput. Netw. 2023, 224, 109626. [Google Scholar] [CrossRef]
Vattapparamban, E.; Güvenç, I.; Yurekli, A.I.; Akkaya, K.; Uluağaç, S. Drones for Smart Cities: Issues in Cybersecurity, Privacy, and Public Safety. In Proceedings of the 2016 International Wireless Communications and Mobile Computing Conference (IWCMC), Paphos, Cyprus, 5–9 September 2016; pp. 216–221. [Google Scholar]
Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Guo, G.; Ye, Q.; Jiao, J.; et al. Anti-UAV: A Large-Scale Benchmark for Vision-Based UAV Tracking. IEEE Trans. Multimed. 2023, 25, 486–500. [Google Scholar] [CrossRef]
Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-Based Anti-UAV Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
Souli, N.; Makrigiorgis, R.; Anastasiou, A.; Zacharia, A.; Petrides, P.; Lazanas, A.; Valianti, P.; Kolios, P.; Ellinas, G. Horizonblock: Implementation of an Autonomous Counter-Drone System. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 1–4 September 2020; pp. 398–404. [Google Scholar]
Wang, W.; Fan, K.; Ouyang, Q.; Yuan, Y. Acoustic UAV Detection Method Based on Blind Source Separation Framework. Appl. Acoust. 2022, 200, 109057. [Google Scholar] [CrossRef]
Xie, W.; Wan, Y.; Wu, G.; Li, Y.; Zhou, F.; Wu, Q. An RF-Visual Directional Fusion Framework for Precise UAV Positioning. IEEE Internet Things J. 2024, 11, 36736–36747. [Google Scholar] [CrossRef]
Wang, C.; Tian, J.; Cao, J.; Wang, X. Deep Learning-Based UAV Detection in Pulse-Doppler Radar. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
Chen, Z.; Yang, J.; Chen, L.; Li, F.; Feng, Z.; Jia, L.; Li, P. RailVoxelDet: A Lightweight 3-D Object Detection Method for Railway Transportation Driven by Onboard LiDAR Data. IEEE Internet Things 2025, 12, 37175–37189. [Google Scholar] [CrossRef]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
Ashtiani, F.; Geers, A.J.; Aflatouni, F. An On-Chip Photonic Deep Neural Network for Image Classification. Nature 2022, 606, 501–506. [Google Scholar] [CrossRef]
Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R.J.; Milles, L.F.; Wicky, B.I.; Courbet, A.; de Haas, R.J.; Bethel, N.; et al. Robust Deep Learning–Based Protein Sequence Design Using ProteinMPNN. Science 2022, 378, 49–56. [Google Scholar] [CrossRef] [PubMed]
Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
Wu, X.; Sahoo, D.; Hoi, S.C. Recent Advances in Deep Learning for Object Detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
Wang, J.; Liu, M.; Du, Y.; Zhao, M.; Jia, H.; Guo, Z.; Su, Y.; Lu, D.; Liu, Y. PG-YOLO: An Efficient Detection Algorithm for Pomegranate Before Fruit Thinning. Eng. Appl. Artif. Intell. 2024, 134, 108700. [Google Scholar] [CrossRef]
Li, J.; Kang, X. Mobile-YOLO: An Accurate and Efficient Three-Stage Cascaded Network for Online Fiberglass Fabric Defect Detection. Eng. Appl. Artif. Intell. 2024, 134, 108690. [Google Scholar] [CrossRef]
Chen, Y.; Wang, Y.; Zou, Z.; Dan, W. GMS-YOLO: A Lightweight Real-Time Object Detection Algorithm for Pedestrians and Vehicles Under Foggy Conditions. IEEE Internet Things J. 2025, 12, 23879–23890. [Google Scholar] [CrossRef]
Hou, T.; Leng, C.; Wang, J.; Pei, Z.; Peng, J.; Cheng, I.; Basu, A. MFEL-YOLO for Small Object Detection in UAV Aerial Images. Expert Syst. Appl. 2025, 291, 128459. [Google Scholar] [CrossRef]
Zhang, Y.; Wu, C.; Guo, W.; Zhang, T.; Li, W. CFANet: Efficient Detection of UAV Image Based on Cross-Layer Feature Aggregation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
Zhang, X.; Fan, K.; Hou, H.; Liu, C. Real-Time Detection of Drones Using Channel and Layer Pruning, Based on the YOLOv3-SPP3 Deep Learning Algorithm. Micromachines 2022, 13, 2199. [Google Scholar] [CrossRef]
Singha, S.; Aydin, B. Automated Drone Detection Using YOLOv4. Drones 2021, 5, 95. [Google Scholar] [CrossRef]
He, X.; Fan, K.; Xu, Z. UAV Identification Based on Improved YOLOv7 Under Foggy Condition. Signal Image Video Process. 2024, 18, 6173–6183. [Google Scholar] [CrossRef]
Ma, J.; Huang, S.; Jin, D.; Wang, X.; Li, L.; Guo, Y. LA-YOLO: An Effective Detection Model for Multi-UAV Under Low Altitude Background. Meas. Sci. Technol. 2024, 35, 055401. [Google Scholar] [CrossRef]
Liu, F.; Yao, L.; Zhang, C.; Wu, T.; Zhang, X.; Jiang, X.; Zhou, J. Boost UAV-Based Object Detection via Scale-Invariant Feature Disentanglement and Adversarial Learning. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–13. [Google Scholar] [CrossRef]
Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 445–461. [Google Scholar]
Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar]
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
Bozcan, I.; Kayacan, E. Au-Air: A Multi-Modal Unmanned Aerial Vehicle Dataset for Low Altitude Traffic Surveillance. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 8504–8510. [Google Scholar]
Wen, L.; Du, D.; Zhu, P.; Hu, Q.; Wang, Q.; Bo, L.; Lyu, S. Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7808–7817. [Google Scholar]
Zhu, P.; Peng, T.; Du, D.; Yu, H.; Zhang, L.; Hu, Q. Graph Regularized Flow Attention Network for Video Animal Counting from Drones. IEEE Trans. Image Process. 2021, 30, 5339–5351. [Google Scholar] [CrossRef]
Chu, Z.; Song, T.; Jin, R.; Jiang, T. An Experimental Evaluation Based on New Air-to-Air Multi-UAV Tracking Dataset. In Proceedings of the 2023 IEEE International Conference on Unmanned Systems (ICUS), Hefei, China, 13–15 October 2023; pp. 671–676. [Google Scholar]
Cao, B.; Yao, H.; Zhu, P.; Hu, Q. Visible and Clear: Finding Tiny Objects in Difference Map. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 1–18. [Google Scholar]
Sun, H.; Yang, J.; Shen, J.; Liang, D.; Ning-Zhong, L.; Zhou, H. TIB-Net: Drone Detection Network with Tiny Iterative Backbone. IEEE Access 2020, 8, 130697–130707. [Google Scholar] [CrossRef]
Rozantsev, A.; Lepetit, V.; Fua, P. Detecting Flying Objects Using a Single Moving Camera. In Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence, Venice, Italy, 22–29 October 2017; Volume 39, pp. 879–892. [Google Scholar]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A High-Altitude Infrared Thermal Dataset for Unmanned Aerial Vehicle-Based Object Detection. Sci. Data 2023, 10, 227. [Google Scholar] [CrossRef] [PubMed]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. arXiv 2016, arXiv:1701.04128. [Google Scholar]
Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; pp. 363–380. [Google Scholar]
Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the Synergistic Effects Between Spatial and Channel Attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing Feature Learning Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 6004–6014. [Google Scholar]
Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]

Figure 1. Sample images of UAV-SOD.

Figure 2. Data characteristics of UAV-SOD; (a) scatter plot of UAV size distribution, which shows the size distribution of all targets in the dataset; (b) center position distribution density, which shows the distribution of all targets center coordinates in the dataset.

Figure 3. Performance–efficiency trade-off of various YOLO variants on the UAV-SOD dataset. Bubble size and color indicate model parameter count (in millions), with higher values corresponding to larger bubbles and warmer colors. YOLO11n achieves the best balance between accuracy (mAP⁵⁰) and computational cost (GFLOPs).

Figure 4. The structure of YOLO-CoOp.

Figure 5. Visualization of feature maps. (a–d): original images. Subfigures a1–a4, b1–b4, c1–c4, d1–d4 show the corresponding multi-scale feature maps: a1/b1/c1/d1 = 160 × 160, a2/b2/c2/d2 = 80 × 80, a3/b3/c3/d3 = 40 × 40, a4/b4/c4/d4 = 20 × 20. Brighter colors indicate stronger feature responses.

Figure 6. The structure of HRFPN, where (a) denotes the backbone network as the feature extraction component, (b) denotes the neck network as the feature fusion component, and (c) denotes the head network as the detection output component.

Figure 7. (a) Structure of C3k2-WT; (b) structure of Bottleneck-WT: when the parameter is false, it is selected; (c) structure of C3k-WT: when the parameter is true, it is selected.

Figure 8. Illustration of the Wavelet Transform Convolution (WTConv) Process on a Single Channel Image. (a) Input single-channel image; (b) Four sub-bands(

X_{H H}, X_{H L}, X_{L H}, X_{L L}

) generated by single-level 2D Discrete Wavelet Transform (DWT); (c) Feature maps (

Y_{H H}, Y_{H L}, Y_{L H}, Y_{L L}

) extracted from each sub-band via convolution; (d) Reconstructed feature map obtained by applying Inverse DWT to the feature maps in (c); (e) Feature map obtained by directly applying convolution to the input image (a); (f) Final WTConv output, computed as the element-wise sum of (d,e).

Figure 8. Illustration of the Wavelet Transform Convolution (WTConv) Process on a Single Channel Image. (a) Input single-channel image; (b) Four sub-bands(

X_{H H}, X_{H L}, X_{L H}, X_{L L}

) generated by single-level 2D Discrete Wavelet Transform (DWT); (c) Feature maps (

Y_{H H}, Y_{H L}, Y_{L H}, Y_{L L}

) extracted from each sub-band via convolution; (d) Reconstructed feature map obtained by applying Inverse DWT to the feature maps in (c); (e) Feature map obtained by directly applying convolution to the input image (a); (f) Final WTConv output, computed as the element-wise sum of (d,e).

Figure 9. Structure of SCSA.

Figure 10. Visual comparison of attention maps with and without the SCSA module across challenging UAV detection scenarios: large targets, small targets, multiple targets, complex backgrounds, and motion blur. Brighter colors indicate stronger feature responses.

Figure 11. Structure of Dysample; (a) the principle of Dynamic upsample; (b) the principle of static scope generator.

Figure 12. Impact of incremental module integration on training dynamics: precision, recall, and mAP metrics for M1–M7 from baseline to final YOLO-CoOp framework.

Figure 13. Detection results of UAV images in different model.

Figure 14. Experimental results of different combinations.

Figure 15. Bar chart of CED for different combinations.

Figure 16. Comparison experiments bubble diagram.

Figure 17. Target size distribution across datasets: (a) Visdrone, (b) DUT Anti-UAV, (c) TIB-Net-Drone.

Figure 18. Detection results on Visdrone.

Figure 19. Detection results on DUT Anti-UAV and TIB-Net-Drone.

Figure 20. Real-time UAV detection in outdoor environment: system architecture and field test result.

Figure 21. Qualitative detection results under diverse anti-UAV scenarios, including varying distances, zoom levels, multi-target, complex backgrounds, high-speed maneuvers, and backlighting.

Table 2. Statistical results of UAV-SOD.

Indicators	Width	Height	Area	Ratio
Min	4	4	16	1
Median	16	14	224	1.19
Max	537	462	248,094	1.16
Mean	28.57	22.56	2548.3	1.29
Std	51.09	40.24	14,512	0.43

Table 3. Training parameter settings.

No.	Training Parameters	Value
1	Epochs	250
2	Batch Size	16
3	Optimizer	Adam
4	Images Size	$640 \times 640$
5	Initial Learning Rate	$5 \times 10^{- 3}$
6	Final Learning Rate	$5 \times 10^{- 5}$
7	Mosaic	1
8	Close Mosaic	10
9	Momentum	0.937

Table 4. Results of the ablation experiments; the best results are in blue. Where ✓ indicates that the corresponding module is included in the model.

	M1	M2	M3	M4	M5	M6	M7 (Ours)
HRFPN		✓	✓	✓	✓	✓	✓
C3k2-WT			✓	✓	✓	✓	✓
SCSA				✓	✓	✓	✓
Dysample					✓		✓
ATFL						✓	✓
Precision	90.7%	91.7%	93.3%	94.3%	93.0%	92.3%	94.3%
Recall	83.1%	91.0%	92.0%	92.3%	91.9%	93.1%	93.1%
mAP⁵⁰	90.3%	94.5%	94.8%	95.9%	95.6%	95.4%	96.2%
mAP⁵⁰⁻⁹⁵	52.6%	56.3%	57.0%	56.9%	57.2%	56.5%	57.6%
Model size	5.5 MB	4.3 MB	4.4 MB	4.4 MB	4.4 MB	4.4 MB	4.4 MB
Parameters	2.59 M	1.94 M	1.96 M	1.96 M	1.97 M	1.96 M	1.97 M
FPS	83	86	75.57	61.35	57.79	65	56.2
GFLOPs	6.4	9.8	10.9	10.9	10.9	10.9	10.9

Table 5. Experimental results of collaborative relationship validation experiments; the best results are in blue; the second best are in red. Where ✓ indicates that the corresponding module is included in the model.

Comb.	DyATF	HRFPN	C3k2-WT	SCSA	Precision	Recall	mAP⁵⁰	mAP⁵⁰⁻⁹⁵	Parameters
Two	✓	✓			91.7%	91.6%	94.9%	58.1%	1.96 M
	✓		✓		90.8%	84.3%	90.9%	54.0%	2.81 M
	✓			✓	88.5%	82.8%	90.1%	52.6%	2.77 M
		✓	✓		93.3%	92.0%	94.8%	57.0%	1.96 M
		✓		✓	90.9%	92.2%	95.1%	56.6%	1.94 M
			✓	✓	92.4%	81.0%	90.4%	52.5%	2.80 M
Three	✓	✓	✓		91.8%	92.2%	95.3%	57.6%	1.97 M
	✓	✓		✓	91.9%	91.6%	95.6%	56.7%	2.02 M
	✓		✓	✓	91.4%	83.8%	90.8%	53.4%	3.04 M
		✓	✓	✓	94.3%	92.3%	95.9%	56.9%	1.96 M
Four (ours)	✓	✓	✓	✓	94.3%	93.1%	96.2%	57.6%	1.97 M

Table 6. Comparison experiments results; the best results are in blue. RT-DETR-R50 and RT-DETR-l are Vision Transformer-based real-time detectors, using ResNet-50 and a larger backbone, respectively, within an NMS-free, end-to-end detection framework.

Model	Precision	Recall	mAP⁵⁰	mAP⁵⁰⁻⁹⁵	Parameters	Size	GFLOPs	FPS
YOLOv5n	92.5%	81.7%	90.4%	52.7%	2.50 M	5.3 MB	7.2	91.12
YOLOv7-tiny	84.9%	69.0%	78.7%	40.7%	6.01 M	12.3 MB	13.2	78.5
YOLOv8n	90.5%	83.1%	90.3%	53.2%	3.01 M	6.3 MB	8.2	97.76
YOLOv8s	91.9%	79.6%	89.7%	53.1%	11.1 M	22.5 MB	28.6	96
YOLOv9t	87.4%	81.6%	88.6%	51.4%	2.00 M	4.7 MB	7.8	55.63
YOLOv10n	91.2%	79.3%	89.3%	53.1%	2.70 M	5.8 MB	8.4	69
YOLO11n	90.7%	83.1%	90.3%	52.6%	2.59 M	5.5 MB	6.4	83
YOLO11s	87.9%	78.9%	87.4%	50.9%	9.43 M	19.2 MB	21.5	85.9
YOLOv12n	89.7%	82.9%	89.6%	52.0%	2.57 M	5.5 MB	6.5	66.46
YOLOv12s	90.5%	78.5%	87.9%	51.0%	9.25 M	18.9 MB	21.5	63.35
RT-DETR-R50	93.4%	88.6%	92.0%	41.7%	41.9 M	86.0 MB	130.5	39.25
RT-DETR-l	93.2%	86.8%	90.5%	41.4%	32.8 M	66.2 MB	108	42.99
YOLO-CoOp	94.3%	93.1%	96.2%	57.6%	1.97 M	4.4 MB	10.9	56.2

Table 7. Cross-dataset validation experiments results.

Dataset	Model	Precision	Recall	mAP⁵⁰	mAP⁵⁰⁻⁹⁵	Parameters	Size
Visdrone	YOLO11n	43.3%	32.2%	31.9%	18.7%	2.59 M	5.5 MB
Visdrone	YOLO-CoOp	45.0%	35.8%	36.0%	21.4%	1.97 M	4.4 MB
DUT Anti-UAV	YOLO11n	89.8%	78.6%	86.5%	56.6%	2.59 M	5.5 MB
DUT Anti-UAV	YOLO-CoOp	93.6%	84.9%	91.4%	60.4%	1.97 M	4.4 MB
TIB-Net-Drone	YOLO11n	83.3%	79.9%	82.7%	35.6%	2.59 M	5.5 MB
TIB-Net-Drone	YOLO-CoOp	89.8%	92.3%	92.4%	38.5%	1.97 M	4.4 MB

Table 8. Hardware specifications of the anti-UAV detection and tracking system.

Component	Model/Specification	Key Parameters
Computing Platform	Laptop Computer	NVIDIA GeForce GTX 1650 (4 GB VRAM), Windows 11 Professional
Camera Module	Hikvision DS-2ZMN2507C	2-megapixel (1920 × 1080) 25× optical zoom (4.8–120 mm) 30 fps (60 Hz synchronization) H.265 encoding
Tracking Gimbal	HY-MZ17-01A	Horizontal speed: 9–45^∘/s Vertical speed: 2.6–13^∘/s Positioning accuracy: ±0.1^∘ Rotation range: 0–360∘ (horizontal), −60–+60^∘ (vertical) Load capacity: ≤20 kg Protection rating: IP66

Table 9. System performance at varying distances (20–120 m).

Distance (m)	Latency (ms)	FPS	GPU Util (%)	GPU Power (W)
20	36.13	26.0	36.52	23.13
40	37.22	27.2	36.42	22.67
60	35.39	28.2	36.72	23.25
80	37.47	28.1	37.46	28.07
120	34.49	27.7	36.00	23.59

Table 10. System performance under different optical zoom levels (5×–25×) at a fixed distance of 130 m—the empirical detection limit without zoom.

Zoom (×)	Latency (ms)	FPS	GPU Util (%)	GPU Power (W)
5	37.35	26.4	35.83	22.92
10	38.09	26.4	36.90	27.55
15	38.76	26.1	36.57	27.20
20	37.57	26.6	36.81	27.35
25	38.04	26.7	37.27	27.53

Table 11. System performance at extended distances (150–300 m) under maximum optical zoom (25×).

Distance (m)	Latency (ms)	FPS	GPU Util (%)	GPU Power (W)
150	36.55	27.1	35.87	23.19
180	37.90	26.7	35.04	28.06
210	35.80	28.4	38.07	28.65
240	37.23	27.1	37.84	28.42
270	35.92	27.5	36.13	24.31
300	36.17	27.5	35.80	23.76

Table 12. System robustness under challenging operational conditions: complex background, multi-target, high-speed maneuver, and backlighting.

Test Scenario	Latency (ms)	FPS	GPU Util (%)	GPU Power (W)
Multi-Target	49.68	20.2	37.20	20.59
Complex Background	47.78	20.9	36.98	19.55
High-Speed Maneuver	47.23	20.7	36.89	21.40
Backlighting	47.71	21.0	37.28	19.33

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, Z.; Fan, K.; Ye, J.; Xu, Z.; Wei, Y. A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems. Drones 2026, 10, 20. https://doi.org/10.3390/drones10010020

AMA Style

Chen Z, Fan K, Ye J, Xu Z, Wei Y. A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems. Drones. 2026; 10(1):20. https://doi.org/10.3390/drones10010020

Chicago/Turabian Style

Chen, Zhiling, Kuangang Fan, Jingzhen Ye, Zhitao Xu, and Yupeng Wei. 2026. "A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems" Drones 10, no. 1: 20. https://doi.org/10.3390/drones10010020

APA Style

Chen, Z., Fan, K., Ye, J., Xu, Z., & Wei, Y. (2026). A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems. Drones, 10(1), 20. https://doi.org/10.3390/drones10010020

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

A Lightweight Multi-Module Collaborative Optimization Framework for Detecting Small Unmanned Aerial Vehicles in Anti-Unmanned Aerial Vehicle Systems

Highlights

Abstract

1. Introduction

2. Materials

2.1. General UAV Datasets

2.2. Analysis of UAV-SOD

2.3. Basic Framework Selection

3. Methods

3.1. YOLO-CoOp

3.2. Structure and Principle of HRFPN

3.3. C3k2-WT Module

Receptive Field and Parameter Analysis

3.4. SCSA Module

3.5. DyATF Collaborative Optimization

3.5.1. Loss Function Optimization

3.5.2. Upsampling Optimization

4. Experiments and Results

4.1. Experimental Platform

4.2. Evaluation Indicators

4.3. Ablation Experiments

4.4. Collaborative Relationship Validation Experiments

4.5. Comparison Experiments

4.6. Cross-Dataset Validation Experiments

5. Visual Detection in Anti-UAV System

5.1. Experimental Setup

5.2. Quantitative Metrics and Experimental Protocol

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI