1. Introduction
Unmanned aerial vehicles (UAVs) are widely used in logistics and transportation, environmental monitoring, rescue missions, and film and television production due to their small size and low maintenance cost. However, large-scale UAV deployment has also brought security risks, such as unlawful aerial surveillance and unauthorized communications interception carried out by unidentified UAVs [1]. Consequently, the design of a robust and effective UAV intrusion recognition framework has emerged as a pressing requirement for safeguarding public security.
Current object detection strategies include radar, acoustic sensing, radio-frequency, and optical imaging [2]. Although these modalities provide complementary perspectives, they remain vulnerable to illumination variation, noise interference, and cloud occlusion, which often lead to biased or erroneous predictions. With the rapid progress of deep learning techniques, object recognition has achieved notable advancements, particularly in feature representation and semantic interpretation. Compared with conventional approaches, deep learning models demonstrate stronger robustness and adaptability, making them highly appropriate for precise and real-time UAV recognition. In recent years, region-based frameworks such as Faster R-CNN [3] have achieved high accuracy but are hindered by slow inference; meanwhile, single-shot architectures like SSD (Single-Shot MultiBox Detector) [4] improve efficiency while maintaining comparable accuracy. Tracking-based schemes, including the Siamese Fully Convolutional Network (SiamFC) [5] and discriminative model prediction (DiMP) [6], have been developed for stable UAV monitoring. Additionally, the YOLO (You Only Look Once) [7] family offers a balanced trade-off between accuracy and inference speed. More recently, transformer-driven detectors such as DETR (Detection Transformer) [8] exploit the transformer backbone [9] to model global contextual dependencies and capture long-range relationships.
Despite these advancements, identifying small-scale UAVs remains highly challenging. Such UAVs typically occupy merely a few pixels and are easily influenced by motion blur, often leading to missed detections or false alarms. Frequent variations in altitude, viewing angle, and flight attitude further demand robust multi-scale modeling. Moreover, cluttered scenes containing clouds, sky, or building edges may resemble UAVs, thereby introducing semantic confusion and resulting in false positives.
To address these challenges, we developed an accurate UAV detection model designed to enhance feature representation and multi-scale feature fusion. Specifically, we introduced a dilated-wise residual (DWR) module to strengthen contextual representation and enrich spatial details, and incorporated an asymptotic feature hierarchy network (AFPN) to optimize multi-level feature integration and semantic alignment. The resulting model, termed DRF-YOLO, significantly improves the precision and resilience of UAV identification under complex aerial scenarios, demonstrating its potential for reliable intrusion monitoring and public security protection.
2. Related Works
Limited pixels, insufficient semantic information, and strong background interference and noise make small-object detection (SOD) highly challenging. Therefore, extensive research has been conducted to enhance the performance of SOD models [10].
2.1. Model Architecture
Researchers have explored various architectural modifications to improve SOD performance. Lou et al. [11] introduced a depthwise separable convolution, max-pooling, and 3 × 3 convolution (MDC) module into the DC-YOLOv8 model for camera-based small-object detection to compensate for information loss from downsampling and to optimize feature fusion. However, their model introduced irrelevant features and increased computational overhead. Zamri et al. [12] incorporated multiple attention mechanisms and a high-resolution detection head in their P2-YOLOv8n-ResCBAM model. While their model effectively distinguished UAVs from birds, the high-resolution head increased computational costs. Keles et al. [13] used slicing-based fine-tuning and inference with YOLOv5 to reduce computational load and improve adaptability. However, this approach split small objects across slices, which compromised object integrity and detection accuracy. Cheng et al. [14] combined a lightweight MobileViT backbone with a coordinate attention mechanism (CA-PANet) for feature fusion. This method improved accuracy and efficiency but presented limited ability to extract shallow features and lacked flexibility in multi-scale fusion. To avoid building multi-scale pyramids, Singh et al. [15] cropped fixed-size image patches, which still required a time-consuming multi-scale testing process. Li et al. [16] developed DN-DETR, which incorporates innovative denoising training, significantly improving the model’s training speed and detection performance. Hoanh et al. [17] proposed a small-object detection framework that integrates an object-focus module and a dual-head mechanism within a feature pyramid network, leveraging a sparse computation strategy to improve efficiency. The framework first performs a coarse localization stage for small objects, followed by high-resolution feature refinement, thereby reducing computational overhead from background regions. However, it remains sensitive to the choice of thresholds and loss weight settings and incurs additional computational cost. Xu et al. [18] developed a YOLOX-based detector incorporating a Spatio-Temporal Attention Module (STAM) and a lightweight Group SimSPPFCSP module in the backbone, and further designed an NRPP neck to enhance multi-level feature propagation. Although their method improved the robustness and efficiency of micro-UAV detection, it performed less effectively in low-contrast scenes. Wang et al. [19] proposed Dist-Tracker, which integrates a Scale-Shape-Quality (SSQ) detector with a Fusion of L2-IoU Tracker (FLIT) for infrared multi-UAV tracking. Their method significantly improves detection sensitivity and motion robustness; however, the lack of appearance features leads to frequent identity switches under severe occlusion.
2.2. Evaluation and Data Augmentation
Xu et al. [20] introduced the Dot Distance (DotD) metric, formulated as the standardized Euclidean distance between the centroids of predicted boxes and ground-truth annotations, serving as an alternative to the intersection over union (IoU) for assessing localization similarity in tiny object detection. However, because DotD depends only on center distance, it does not encode box width, height, aspect ratio, or orientation, which leads to misjudgments for tilted targets such as UAVs.
For data augmentation, Zhang et al. [21] developed a scale-compensated anchor allocation strategy to expand the quantity of positive anchors for small targets, thereby enhancing recall. However, this approach also markedly increases the proportion of negative samples during training, which leads to a higher false detection rate. Kisantal et al. [22] augmented small instances by copying and pasting them within the same image under random transformations. While effective at increasing the representation of small objects, their method distorts semantic context, creates redundant objects, and causes overfitting.
2.3. FPN
The Feature Pyramid Network (FPN) [23] is widely employed since it leverages a top-down pathway to integrate multi-scale representations, thereby enhancing object detection capability across different resolutions. Nevertheless, FPN faces inherent limitations in small-target recognition for UAV surveillance or aerial image analysis. It often introduces noise during feature aggregation, lacks precise cross-scale semantic alignment, and ignores high-frequency textural cues, which can compromise the detection of very small objects.
To solve these problems, various methods to improve the FPN architecture have been developed. Liu et al. [24] developed the denoising feature pyramid network (DN-FPN), adopting a contrastive learning mechanism to enhance feature extraction while suppressing noise. DN-FPN improves small-object feature extraction but is highly sensitive to hyperparameters, requiring precise control of the ratio of positive to negative samples. Liu et al. [25] developed a feature pyramid network (Dual SIEFPN) that integrates semantic and spatial information to minimize information loss during multi-scale feature transfer. While the network improves small-object detection, it relies on a complex attention mechanism, which increases inference time. Zhao et al. [26] modeled interactions across pyramid levels using graph neural networks to facilitate inter-layer communication. However, the computational overhead of graph construction and message passing cannot be fully absorbed by lightweight detectors, limiting deployment in resource-constrained environments. Shi et al. [27] developed the high-frequency and spatial perception FPN (HS-FPN), which enhances feature extraction and spatial awareness by combining high-frequency perception with spatial dependency modeling. Despite its advantages, HS-FPN lacks global semantic guidance and struggles to recognize textureless objects, showing limited adaptability to complex scale variations.
3. Methodology
To address the identified problems, a new model is needed for efficient feature extraction and fusion and robust handling of multi-scale information without increasing computational cost. Therefore, we developed the DRF-YOLO model in this study. We designed a DWR module based on the DWR segmentation network (DWRSeg) to decouple regional feature generation from multi-dilation semantic refinement. The module enables lightweight receptive-field expansion and effectively enlarges the receptive field for small objects to capture textural details with minimal computational overhead. We refined the YOLOv8 neck using an asymptotic fusion framework, extending it to a four-layer feature pyramid with a dedicated auxiliary head for detecting small objects. The developed DRF-YOLO obtains detailed and semantic information through iterative upsampling and downsampling, which significantly improves small-UAV detection accuracy.
3.1. Overall Structure
We employed YOLOv8n [28] as the reference model owing to its fast inference, high training efficiency, and reliable accuracy in generic object recognition. Nevertheless, YOLOv8 demonstrates limitations when handling tiny targets in complex aerial contexts, as it relies heavily on shallow representations and lacks sufficient cross-scale perception. To address this issue, we incorporated a DWR block and an AFPN architecture. The DWR unit was embedded at multiple stages of the backbone to broaden the receptive field for small-target recognition and to strengthen scene interpretation in challenging conditions. Moreover, an additional detection head was introduced at the P2 layer, extending the three original detection branches of YOLOv8. Through the synergistic integration of progressive feature refinement and adaptive spatial fusion offered by AFPN, this extension supports cooperative modeling and fine-grained multi-scale perception, thereby enhancing the detection of small UAV instances in complex backgrounds. The complete network design is illustrated in Figure 1, where the proposed improvements are highlighted in red boxes to provide a clear structural overview of how DWR and AFPN interact within the full architecture.
3.2. DWR
In UAV small-object detection, conventional convolutional neural networks (CNNs) employ fixed-size convolutional filters with restricted receptive fields, limiting their capacity to capture anything beyond local features. When the target occupies only a few pixels or appears against a complex background, local features alone are insufficient for accurate recognition, frequently resulting in missed or incorrect detections. To address this, we introduced the DWR module as a substitute for the four C2f modules within the initial backbone architecture [29]. The module captures fine-grained textures by leveraging regional residual components, then utilizes semantic residual learning to enlarge the receptive field, thereby improving the capacity for contextual understanding. The DWR module employs region residualization (RR) to generate compact multi-scale regional features and semantic residualization (SR) to apply depthwise separable dilated convolutions with customized dilation factors, perform morphological operations, and adaptively enlarge the receptive range. Concatenation and fusion of the branch outputs through convolutional layers yield feature maps that contain both detailed textures and extensive receptive fields (Figure 2). Such a design enables the DWR module to effectively detect small UAVs within complex environments by simultaneously preserving fine-grained details and expanding the receptive field.
In the module, a 3 × 3 convolution is first applied to the input feature map for local channel compression. The compressed feature map R is obtained by applying batch normalization (BN) and the sigmoid linear unit (SiLU) activation, which prepares the input for the following multi-branch computation (Equation (1)).
R is then fed in parallel into three depthwise separable 3 × 3 dilated convolutions with increasing dilation rates to obtain short-, medium-, and long-range receptive-field features for multi-scale feature capture (Equation (2)). The chosen dilation rates balance local detail preservation and global semantic context while avoiding the redundancy of excessively large dilation rates.
After the three branch outputs are obtained, they are concatenated along the channel dimension to preserve the independent expression of each receptive field, forming a multi-scale contextual feature U (Equation (3)).
For cross-scale information interaction, a 3 × 3 convolution compresses the concatenated feature back to the original channel dimension. At the same time, linear combinations are performed along the channel dimension to generate the fused feature used in the subsequent residual operation (Equation (4)).
Finally, the original input is added element-wise as a residual connection to preserve the original feature pathway and help mitigate gradient vanishing (Equation (5)).
Convolutional branches with varying dilation rates are fused with the initial features to enrich multi-scale contextual representations. This integration alleviates the information bottleneck that is commonly encountered in small-UAV detection and that causes a lack of global semantic cues, and it also enhances recognition and localization accuracy.
The DWR module broadens the effective receptive field while maintaining computational efficiency via a three-stage lightweight residualization pipeline: local compression, multi-dilation convolution, and adaptive fusion. Its design facilitates the extraction of both fine-grained edge details and semantically rich representations from small-scale targets, thereby mitigating the constraints imposed by the fixed receptive fields of standard convolutions. Moreover, it effectively reconciles the need for local texture preservation with the demand for global contextual awareness. Experiments confirm that, when integrated into YOLOv8 for UAV-based small-object detection, the DWR module surpasses the original C2f block in both accuracy and efficiency, providing a high-performance yet lightweight alternative.
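To make the pipeline above concrete, the following is a minimal PyTorch sketch of a DWR-style block rather than the exact implementation used in this work: the dilation rates (1, 3, 5), the channel widths, and the class name are illustrative assumptions, and only the overall structure (local compression with BN and SiLU, three depthwise dilated branches, channel concatenation, 3 × 3 fusion, and a residual connection) follows the description given here.

```python
import torch
import torch.nn as nn


class DWRBlock(nn.Module):
    """Sketch of a dilation-wise residual block following Equations (1)-(5):
    region residualization (3x3 conv + BN + SiLU), three depthwise dilated
    branches, channel concatenation, 3x3 fusion, and a residual addition.
    The dilation rates (1, 3, 5) are assumptions, not the paper's values."""

    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        # Eq. (1): local channel compression with BN + SiLU
        self.rr = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )
        # Eq. (2): depthwise 3x3 dilated convolutions (short/medium/long range)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                      groups=channels, bias=False)
            for d in dilations
        ])
        # Eq. (4): 3x3 convolution compressing the concatenated feature
        # back to the original channel dimension
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 3,
                              padding=1, bias=False)

    def forward(self, x):
        r = self.rr(x)                                    # Eq. (1)
        u = torch.cat([b(r) for b in self.branches], 1)   # Eqs. (2)-(3)
        return x + self.fuse(u)                           # Eqs. (4)-(5)


# Quick shape check: the block preserves spatial size and channel count.
if __name__ == "__main__":
    y = DWRBlock(64)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 64, 80, 80])
```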
3.3. AFPN
In complex aerial imagery, UAV targets appear as tiny objects and exhibit significant scale variations. The default FPN in YOLOv8 is ill-suited for this scenario, as it tends to weaken high-level semantic content across multiple levels and compromise low-level spatial details through repeated sampling. To mitigate these limitations, we substituted the original FPN with an AFPN [30]. We further enhanced the baseline three-head YOLOv8 architecture by incorporating a fourth detection head, thereby forming a four-layer pyramid structure. This extension improves the model’s capability to extract fine-grained textures and multi-scale features, specifically optimized for small UAV detection.
The AFPN architecture employs a progressive asymptotic fusion framework (Figure 3). This design enables AFPN to preserve fine-grained texture information in the highest-resolution layer before incorporating deeper semantic information from neighboring and distant layers. After each fusion stage, AFPN applies the pixel-level adaptive weighting of the adaptive spatial feature fusion (ASFF) module [31]. This process optimizes the contribution of each scale to every spatial location, emphasizing the most informative levels while reducing potential conflicts between overlapping objects. The four feature layers undergo lightweight convolution to unify their channel dimensions before being fed into the corresponding detection heads (Figure 4). The AFPN outputs integrate high-resolution spatial details with rich multi-scale semantic context, facilitating precise identification of small UAVs within intricate aerial environments. A detailed breakdown of the AFPN workflow is presented in the following paragraphs.
AFPN aligns the spatial resolutions of high-level and low-level feature maps via upsampling or downsampling, facilitating effective fusion of features across adjacent scales (Figure 4). Subsequently, an adaptive spatial feature fusion mechanism dynamically assigns fusion weights according to the significance of distinct spatial regions within each feature layer. This approach strengthens the representation of salient features and improves cross-layer information exchange. The integration of the four feature levels is formulated in Equation (6),
where the four spatial weight maps denote the contributions of the four feature levels at the first fused level and satisfy the constraint that they sum to one at every spatial location, and the feature vectors from the first to the final level serve as the inputs to the fusion.
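Under this description, Equation (6) takes the general ASFF weighted-sum form; the symbols below are generic placeholders rather than the original notation of the paper:

$$
y_{ij} \;=\; \alpha_{ij}\, x^{1}_{ij} \;+\; \beta_{ij}\, x^{2}_{ij} \;+\; \gamma_{ij}\, x^{3}_{ij} \;+\; \delta_{ij}\, x^{4}_{ij},
\qquad
\alpha_{ij} + \beta_{ij} + \gamma_{ij} + \delta_{ij} = 1,
$$

where $x^{k}_{ij}$ is the feature vector of the $k$-th level at spatial position $(i, j)$ after resolution alignment, and $\alpha_{ij}$, $\beta_{ij}$, $\gamma_{ij}$, $\delta_{ij}$ are the learned spatial weights.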
The enhanced AFPN neck produces a series of multi-scale feature sets derived from backbone layers 2, 4, 6, and 9 (Equation (7)). This configuration provides effective scale diversity in the feature representation. The initial features are then generated by downscaling the channels of these maps with 1 × 1 convolutions (Equation (8)), where the 1 × 1 convolution operator performs a pixel-wise channel transformation on the input feature map.
The two lowest-level features are first fed into ASFF2 and fused; the result is then combined with the third-level feature in ASFF3 for secondary fusion, and the output of ASFF3 is finally fused with the highest-level feature in ASFF4 (Equation (9)).
In Equation (9), a 3 × 3 convolutional filter is applied to capture local contextual information, the summation runs over the total number of feature layers participating in the fusion stage, and each term denotes the feature value at a given spatial coordinate of the k-th feature map after scale normalization.
Following the fusion stage, the channel dimensions of the four-level feature outputs are restored by C2f blocks. This yields a multi-scale feature set that is subsequently forwarded to the detection heads for small-UAV recognition.
Throughout this pipeline, AFPN captures rich semantic content while retaining fine-grained spatial details, leveraging progressive feature fusion and ASFF’s adaptive spatial weighting. The developed multi-scale cooperative and dynamic fusion strategy markedly enhances detection precision for small UAVs in complex scenarios.
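The adaptive spatial weighting step can be sketched in PyTorch as follows. This is a simplified illustration under two assumptions: the participating levels have already been resampled to a common resolution and channel width, and the per-pixel weights are produced by 1 × 1 convolutions followed by a softmax, as in the common ASFF formulation rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ASFFFuse(nn.Module):
    """Sketch of adaptive spatial feature fusion over N feature maps that have
    already been resized to a common resolution and channel width. Each level
    contributes a per-pixel weight map; the weights are softmax-normalized so
    they sum to one at every location, matching the constraint stated for
    Equation (6)."""

    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        # One 1x1 conv per level produces a single-channel weight logit map.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )
        # A 3x3 conv refines the fused map with local context (cf. Equation (9)).
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        # feats: list of tensors, each (B, C, H, W) at the same resolution.
        logits = torch.cat(
            [conv(f) for f, conv in zip(feats, self.weight_convs)], dim=1
        )
        weights = F.softmax(logits, dim=1)  # (B, N, H, W), sums to 1 per pixel
        fused = sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))
        return self.refine(fused)


# Example: fuse four aligned pyramid levels for the high-resolution branch.
if __name__ == "__main__":
    levels = [torch.randn(1, 64, 160, 160) for _ in range(4)]
    out = ASFFFuse(64, 4)(levels)
    print(out.shape)  # torch.Size([1, 64, 160, 160])
```

Because the weights are location-dependent, each pyramid level contributes differently at each pixel, which is what suppresses conflicting responses from overlapping objects during fusion.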
3.4. Experimental Setup
We evaluated the DRF-YOLO model on the DUT-Anti-UAV and Det-Fly datasets [32,33]. These datasets are designed for UAV detection and feature thousands of images that include small-scale targets, complex backgrounds, and varied lighting conditions, which are significant challenges in this task. The model was trained for 300 epochs using stochastic gradient descent (SGD) on a workstation equipped with an NVIDIA RTX 4090 GPU. Key hyperparameters are detailed in Table 1.
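For reference, a baseline YOLOv8n run matching the reported setup can be launched with the Ultralytics API roughly as follows; the dataset YAML name is a placeholder, values not listed in Table 1 fall back to framework defaults, and reproducing DRF-YOLO itself would additionally require registering the DWR and AFPN modules in a custom model configuration.

```python
from ultralytics import YOLO

# Baseline YOLOv8n training sketch mirroring the reported setup: 300 epochs,
# SGD, 640x640 input, single RTX 4090. 'dut_anti_uav.yaml' is a hypothetical
# dataset config (train/val/test paths, one 'UAV' class); unspecified
# hyperparameters use Ultralytics defaults.
model = YOLO("yolov8n.pt")
model.train(
    data="dut_anti_uav.yaml",
    epochs=300,
    imgsz=640,
    optimizer="SGD",
    device=0,
)
metrics = model.val()  # reports precision, recall, mAP@50, and mAP@50:95
```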
Model performance was quantified using four standard metrics: precision, recall, mean average precision at an IoU of 0.5 (mAP@50), and mean average precision across IoU thresholds from 0.5 to 0.95 (mAP@50:95). These metrics are used for the assessment of classification accuracy and localization precision.
Precision is used to measure the ratio of correctly identified positive predictions to the total positive outputs produced by the model. It represents the model’s capability to make accurate positive classifications, as formulated in Equation (10).
where TP refers to the number of true positives (objects correctly identified), while FP indicates the number of false positives (objects incorrectly detected).
Recall is defined as the proportion of actual positive samples that the model successfully recognizes. It reflects the model’s capacity to detect all relevant instances, as specified in Equation (11).
Here, FN represents the number of false negatives, referring to objects that were not detected during the process.
The metric mAP@50 is defined as the mean average precision calculated across all object categories with the IoU threshold set at 0.5. For each class, the average precision (AP) is obtained by computing the area under the corresponding precision–recall curve, as expressed in Equation (12). The final mAP@50 value is derived by averaging these AP scores over all categories, as formulated in Equation (13), where r denotes a recall level that is mapped to its associated precision value.
mAP@50:95 serves as a robust and comprehensive evaluation of model performance. It is computed by averaging precision across IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. By aggregating results over multiple threshold levels, this metric offers a holistic assessment of detection accuracy and localization precision, as expressed in Equation (14).
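The metrics above can be summarized in a short NumPy sketch; the all-point interpolation of the precision–recall curve and the function names are illustrative choices, not the exact evaluation code used in the experiments.

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Equations (10)-(11): precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Equation (12): area under the precision-recall curve for one class,
    with recalls sorted in ascending order (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # enforce non-increasing precision
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_ap(ap_per_class):
    """Equation (13): mAP@50 averages the per-class AP values at IoU = 0.5."""
    return float(np.mean(ap_per_class))


def map_50_95(map_per_threshold):
    """Equation (14): mAP@50:95 additionally averages mAP over the ten IoU
    thresholds 0.50, 0.55, ..., 0.95."""
    return float(np.mean(map_per_threshold))
```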
3.5. Datasets
The DUT-Anti-UAV dataset, created by a research group at Dalian University of Technology, China, serves as a visible-light-based benchmark for anti-UAV detection and tracking. It contains 10,000 images, with 5200 designated for training, 2600 for validation, and 2200 reserved for testing. The dataset includes 10,109 detectable objects, many of which are small-scale UAVs. The Det-Fly dataset represents a compact air-to-air UAV detection resource, comprising more than 13,000 images of in-flight UAVs captured from another flying platform. These images exhibit diverse conditions, including varied background environments, camera angles, relative distances, flight altitudes, and illumination levels. Approximately half of the detected objects cover less than 5% of their respective image areas.
4. Results
We evaluated the developed DRF-YOLO model on two widely used single-class UAV datasets, DUT-Anti-UAV and Det-Fly, both selected for their challenging small-object instances and varied imaging conditions. DUT-Anti-UAV contains 10,109 annotated UAV instances, with its 10,000 images split into 5200 for training, 2600 for validation, and 2200 for testing. Det-Fly comprises more than 13,000 air-to-air images, and approximately half of its annotated targets occupy less than 5% of the image area. All experiments were conducted at an input resolution of 640 × 640 with standard normalization, and the models were trained for 300 epochs (Table 1). This section reports the performance of the proposed DRF-YOLO on these datasets, including ablation studies to assess the contributions of the DWR module and AFPN architecture, comparisons with representative detectors to demonstrate performance advantages in small-UAV detection, and both training-curve analyses and qualitative visualizations to further illustrate the robustness and practical effectiveness of the model.
4.1. Ablation Experiment
Ablation experiments were conducted to determine the contributions of the DWR module and AFPN.
On DUT-Anti-UAV, the YOLOv8 baseline achieved mAP@50 = 85.4% and mAP@50:95 = 53.7% with a precision of 90.5% and a recall of 78.5%. Adding the DWR module improved contextual representation and raised mAP@50 to 85.7% and mAP@50:95 to 54.3% while increasing precision to 92.6% and recall to 79.1%. Introducing AFPN alone increased mAP@50 to 86.2% and recall to 79.7%. Combining DWR and AFPN (DRF-YOLO) yielded the largest consolidated gain, with mAP@50 = 86.9%, mAP@50:95 = 54.8%, precision = 93.9%, and recall = 80.3%, which confirms that DWR and AFPN provide complementary benefits in contextual modeling and cross-level feature fusion (Table 2).
On Det-Fly, similar trends were observed (Table 3). The YOLOv8 baseline showed mAP@50 = 87.8% and mAP@50:95 = 53.1% (precision 92.4%, recall 82.7%); DWR alone increased mAP@50 to 88.5% and mAP@50:95 to 53.8%; AFPN alone raised mAP@50 to 89.9% and recall to 85.8%; DRF-YOLO reached mAP@50 = 91.1%, mAP@50:95 = 55.4%, precision = 95.0%, and recall = 86.5%.
The ablation results on DUT-Anti-UAV and Det-Fly demonstrate that the DWR module and AFPN architecture improve different aspects of detection. DWR enhances precision and mAP@50:95 by strengthening contextual representation, while AFPN increases recall and mAP@50 through progressive multi-scale fusion. On both datasets, combining the two modules yields the highest overall performance, indicating that their complementary strengths enable DRF-YOLO to achieve balanced improvements in localization accuracy and detection completeness.
4.2. Comparison with Other Models
We compared DRF-YOLO with representative detectors: Cascade R-CNN [34], Faster R-CNN [3], RetinaNet [35], FCOS [36], YOLOv5 [37], YOLOv8 [28], YOLOv10 [38], YOLOv11 [39], and RT-DETR [40]. On DUT-Anti-UAV, DRF-YOLO achieves mAP@50 = 86.9%, mAP@50:95 = 54.8%, and recall = 80.3%, outperforming most baselines and improving on the YOLOv8 backbone across both mAP metrics (Table 4).
Two-stage detectors (Cascade R-CNN and Faster R-CNN) showed significantly lower mAP and recall on these small-object datasets, whereas anchor-free methods (FCOS and RetinaNet) improved over the two-stage baselines but remain behind modern single-stage and transformer-based methods in combined accuracy and recall. On Det-Fly, DRF-YOLO achieved mAP@50 = 91.1%, mAP@50:95 = 55.4%, and recall = 86.5%, surpassing the YOLOv8 baseline and matching or exceeding other recent detectors in the balance between localization quality and detection completeness (Table 5). RT-DETR attained a high mAP on Det-Fly but generally requires greater computational resources to achieve comparable recall.
The comparative experiments further highlight DRF-YOLO’s advantages in small-UAV detection. Traditional two-stage frameworks struggle with tiny targets, while anchor-free detectors offer partial improvements but still exhibit limited recall. Recent YOLO variants benefit from architectural scaling yet lack mechanisms specifically designed for fine-grained small-object recognition. By contrast, DRF-YOLO leverages enhanced receptive-field modeling and progressive multi-scale fusion, yielding consistently stronger detection quality across both datasets. Compared with transformer-based detectors such as RT-DETR, DRF-YOLO achieves a more favorable balance between accuracy and computational efficiency, making it better suited for UAV surveillance.
4.3. Training Stability and Qualitative Performance
Training curves (Figure 5 and Figure 6) show that DRF-YOLO converges steadily to higher mAP and maintains consistently high recall compared with competing methods; each subfigure plots mAP alongside recall over the course of training. The consistently increasing mAP and sustained high recall of DRF-YOLO indicate stable convergence, effective feature representation, and enhanced capability in detecting small-scale targets. In contrast, RT-DETR shows a slower improvement in mAP and comparatively lower recall, suggesting greater computational complexity and challenges in reaching optimal accuracy. Models such as YOLOv5, YOLOv8, YOLOv10, and YOLOv11 tend to plateau in mAP during later training epochs and yield only moderate recall, highlighting their limitations in small-object recognition. Although other methods such as FCOS demonstrate continuous mAP growth, their recall remains inferior to that of DRF-YOLO, reflecting suboptimal feature utilization and reduced effectiveness in small-object detection.
The qualitative prediction analysis confirms that DRF-YOLO generates tighter and more accurate bounding boxes for distant and occluded UAVs, reduces false positives in cluttered backgrounds, and detects a greater number of small targets compared with other models (Figure 7 and Figure 8).
The first row of Figure 7 and Figure 8 displays the original input images, followed by ground-truth annotations in the second row. Subsequent rows illustrate the prediction outputs of the various models. DRF-YOLO consistently produced more accurate and stable predictions across diverse and complex scenarios. It demonstrated superior bounding-box regression and contour-fitting performance, particularly under conditions involving occlusion, uneven illumination, and background-texture interference, where other models frequently exhibited missed or false detections.
In contrast, DRF-YOLO accurately localized small UAVs, substantially reducing detection errors and demonstrating strong capability in small-object modeling and discrimination. This improvement is closely tied to its architectural design. The DWR module expands the receptive field and enhances contextual representation through multi-scale dilated convolutions and a two-stage residual structure, while AFPN enables progressive feature fusion and spatial weighting to extract discriminative features across multiple semantic levels. The synergy of these components allows DRF-YOLO to maintain robust performance across varying object scales, occlusion levels, and lighting conditions, thereby improving the model’s stability, accuracy, and interpretability in small-object detection tasks.
5. Discussion
UAVs operate in complex environments where image acquisition is significantly influenced by factors such as illumination variability, occlusion, and background-texture interference. Consequently, achieving robustness and strong generalization capability is essential for effective detection models. To address these challenges, we developed and evaluated the DRF-YOLO model through comparative experiments using the DUT-Anti-UAV and Det-Fly datasets, which offer diverse imaging conditions including varying backgrounds, flight altitudes, and viewing angles. Model performance was benchmarked against several existing approaches.
On the two datasets, DRF-YOLO achieved mAP@50 scores of 86.9% and 91.1%, respectively, outperforming the baseline YOLOv8. Furthermore, DRF-YOLO demonstrated superior results in mAP@50:95, precision, and recall, indicating its ability to maintain a favorable balance between detection accuracy and overall object recognition performance.
The results highlight the complementary strengths of the DWR module and AFPN. DWR effectively expands the receptive field and enhances contextual feature representation, while AFPN improves multi-scale detection through progressive feature fusion and spatial weighting, enabling stronger semantic integration across layers. Together, these modules deliver consistent gains—particularly in mAP@50 and recall—demonstrating clear synergy without introducing optimization instability. DRF-YOLO maintains accurate bounding-box regression and stable contour fitting even under challenging imaging conditions, confirming its robustness in detecting small objects within complex backgrounds. Quantitative and qualitative evaluations further reveal several patterns: improvements in mAP@50 generally exceed those in mAP@50:95, suggesting enhanced coarse-level localization for tiny UAVs; AFPN-based configurations reliably achieve higher recall than the baseline; and smooth training curves indicate that the added modules integrate well into the overall architecture. These findings collectively verify the model’s effectiveness across diverse UAV detection scenarios.
While conventional methods rely on scale-space enhancement or attention mechanisms, we integrated the DWR module and AFPN into the YOLO framework, resulting in the DRF-YOLO model. The DWR module extends the receptive field and enriches contextual representations, and AFPN facilitates multi-scale feature integration through progressive fusion and adaptive spatial weighting. The synergistic combination significantly enhances the detection ability of small UAV targets in complex environments, improving the robustness of aerial object detection systems.
However, the developed model is specifically designed for UAV detection and validated only on DUT-Anti-UAV and Det-Fly. Although these datasets encompass diverse backgrounds, altitudes, and viewpoints, their single-class nature might limit the evaluation of the model’s performance in multi-class or cross-domain scenarios. Additionally, only detection accuracy and robustness were evaluated without an analysis of inference speed or computational complexity. Therefore, it is necessary to evaluate the model with a broader range of UAV datasets with varied sensing modalities and complex aerial detection tasks, such as distinguishing UAVs from birds or other airborne objects. We plan to assess efficiency and resource utilization and validate the generalization and practical applicability of the DRF-YOLO model to address such limitations.
6. Conclusions
Detecting small UAVs in complex environments is a critical task. To address it, we developed and evaluated the DRF-YOLO model, which combines a DWR module and AFPN. In the model, the DWR module improves contextual awareness and feature discrimination, while AFPN enhances multi-scale semantic fusion. The experimental results in this study validated that DRF-YOLO addresses the key challenges in small-object detection. Evaluation on the DUT-Anti-UAV and Det-Fly datasets confirmed the model’s superior performance, with significant gains over the YOLOv8 baseline and other widely used models. The ablation experiments also validated the complementary contributions of the DWR module and AFPN. On the DUT-Anti-UAV and Det-Fly datasets, the model reached mAP@50 values of 86.9% and 91.1%, outperforming YOLOv8 by 1.5% and 3.3%, respectively. The model’s recalls of 80.3% and 86.5% demonstrated its ability to correctly identify small objects. DRF-YOLO also outperformed two-stage detectors (e.g., Faster R-CNN), anchor-free detectors (e.g., FCOS and RetinaNet), the YOLO series, and RT-DETR. It made more accurate and stable predictions than the other models, particularly in complex scenarios with occlusion, uneven illumination, and background-texture interference, and consistently demonstrated superior bounding-box regression and contour-fitting capabilities. While other models showed unstable or lower learning curves during training, DRF-YOLO maintained high recall and a smooth, continuous learning curve, indicating strong stability and promising generalization within UAV detection tasks.
The DRF-YOLO developed in this study provides a reliable and high-performance solution for small UAV detection, particularly in scenarios involving large scale variations, dense targets, and complex backgrounds. Its consistent improvements over multiple baselines on two challenging UAV benchmarks demonstrate that the proposed design is not only effective but also robust across diverse imaging conditions and levels of background clutter. It further establishes a solid foundation for developing broader applications in small-object detection.