1. Introduction
Recent advances in artificial intelligence and sensor technologies have significantly propelled the development of unmanned aerial vehicles (UAVs), enabling their widespread applications in aerial monitoring, environmental inspection, traffic management, and disaster rescue [1]. However, the misuse of low-altitude small UAVs, such as unauthorized flights and illegal reconnaissance, poses serious challenges to airspace management and public safety. Particularly in high-risk areas, UAVs’ remote controllability, high maneuverability, and low observability present dual-use concerns in border security, critical infrastructure protection, and tactical reconnaissance [2]. Thus, developing efficient and reliable counter-UAV detection technologies has become a priority worldwide.
Current counter-UAV detection methods include radar detection, radio frequency (RF) signal analysis, electro-optical/infrared (EO/IR) imaging, and acoustic recognition [3,4]. Radar systems offer long detection range and wide coverage unaffected by lighting conditions but are susceptible to electromagnetic interference and have limited capabilities for detecting low-altitude, slow-moving small UAVs, lacking precise target classification [5]. RF analysis detects communication and video transmission signals, effectively locating active transmitters but failing against silent or frequency-hopping UAVs and suffering from environmental noise [6]. EO/IR imaging provides intuitive visual information suitable for target identification and tracking but is highly dependent on lighting and weather conditions and struggles with small distant targets [7]. Acoustic methods leverage UAV-specific noise for detection with good stealth but have limited range and are vulnerable to ambient noise interference [8]. These approaches face challenges such as insufficient detection accuracy, high false alarm rates, poor environmental adaptability, limited multi-target handling, and weak performance on low-observability and micro-UAVs, limiting their applicability in complex scenarios.
Deep learning-based object detection algorithms have emerged as dominant methods in counter-UAV vision detection, leveraging their ability to hierarchically extract abstract features from complex images and videos for accurate recognition and localization [9]. These methods primarily fall into two categories: single-stage detectors (e.g., the YOLO series [10,11] and SSD [12,13]) and two-stage detectors (e.g., Faster R-CNN [14]). Significant efforts have been devoted to enhancing small object detection and background interference suppression. For instance, Wang et al. added a high-resolution detection head to the P2 layer of YOLOv8 to preserve details [15]. To reduce noise, IRWT-YOLO integrates image segmentation techniques [16]. Yang et al. proposed IASL-YOLO, optimizing YOLOv8s for a balance between accuracy and lightweight design [17], while Huang et al. developed EDGS-YOLOv8, incorporating ghost convolutions, efficient multi-scale attention (EMA) modules, and DCNv2 for real-time lightweight detection [18].
Following Table 1, we explicitly outline three critical challenges that persist in counter-UAV detection research: (1) Target confusion in complex backgrounds—low-altitude scenes often exhibit high visual similarity between UAVs and surrounding elements such as sky, vegetation, or buildings, leading to attention distraction, false positives, and missed detections. (2) Inefficient multi-scale feature fusion—drastic scale variation and small-object scenarios hinder accurate localization and boundary modeling [19,20]. (3) Limited accuracy and robustness for small targets—small UAVs typically occupy very few pixels, making feature preservation difficult, especially under motion blur or long-range conditions [21,22].
To tackle these challenges, we propose a lightweight and effective YOLO11-based detection framework incorporating three dedicated components: C3K2-FB, designed to suppress background interference and enhance key region features; MS_FPN, which improves cross-scale fusion and edge-aware localization; and RFCBAM, an attention-enhanced detection head that preserves fine-grained features for small-object detection. These modules are collaboratively designed to address the identified limitations and form a unified framework optimized for real-time UAV detection in complex environments.
The main contributions are summarized as follows:
The C3K2 with Fusion Block (C3K2-FB) module employs a gating mechanism to dynamically suppress complex background interference and enhance features within key target regions, effectively improving robustness and feature discrimination under challenging environmental conditions.
A Multi-Scale Edge-Enhanced Feature Pyramid (MS_FPN) is designed to integrate edge information with an improved cross-scale bidirectional fusion strategy, enabling efficient fusion of deep semantic and shallow detailed features to enhance detection robustness and localization accuracy across multiple scales.
An attention-based small object detection head is constructed by adding a high-resolution detection layer and integrating receptive field and convolutional attention modules, which effectively preserves and highlights small object features, significantly improving detection accuracy and stability for small targets.
2. YOLO-Based Multi-Scale and Small Object Detection Framework
2.1. Overview of YOLO11
YOLO11 [23], developed by the Ultralytics team and released in September 2024, represents a next-generation object detection framework that preserves the conventional four-component YOLO architecture while introducing several architectural innovations. The structure is shown in Figure 1. The input stage employs a grid-based direct prediction approach to enhance multi-scale adaptability. The backbone network incorporates an optimized C3K2 module (demonstrating 30% faster computational speed and 40% fewer parameters than YOLOv8’s C2f module), integrated with both an SPPF (Spatial Pyramid Pooling-Fast) module for multi-scale feature extraction and a C2PSA (Cross-Partial Self-Attention) dual-branch attention mechanism, significantly improving small object detection performance and critical region awareness. The neck architecture utilizes upsampling operations combined with channel-wise concatenation to achieve efficient feature fusion. For final detection, an end-to-end Detect module enables parallel prediction of multiple targets. This optimized architecture achieves substantial improvements in computational efficiency while maintaining high detection accuracy.
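For readers who wish to reproduce the baseline, the stock YOLO11 model can be instantiated through the public Ultralytics Python API. The snippet below is only a minimal usage sketch: it does not include the modules proposed in this paper, and the weight file name and image path are placeholders.

```python
# Minimal sketch: loading and querying the stock YOLO11 baseline with the
# public Ultralytics API (assumes `pip install ultralytics`); the proposed
# C3K2-FB, MS_FPN, and RFCBAM modules are NOT part of this snippet.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")      # nano variant used as the baseline in this work
model.info()                    # prints layer count, parameter count, and GFLOPs

results = model.predict("uav_sample.jpg", imgsz=640, conf=0.25)  # hypothetical image path
for r in results:
    print(r.boxes.xyxy, r.boxes.conf)   # predicted boxes and their confidence scores
```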
2.2. Overall Framework
To address the challenge of small object detection for low-altitude UAVs in cluttered environments, this paper proposes a multi-scale and small object detection framework based on the YOLO11 architecture. The framework is designed to systematically tackle three key issues: background interference, low multi-scale feature fusion efficiency, and the loss of small object information during feature extraction. The overall architecture of the proposed framework is illustrated in Figure 2.
In the Backbone, a novel feature extraction module, C3K2-FB, is introduced. This module employs a gated mechanism to dynamically suppress background noise and enhance the feature representation of key target regions. It improves the model’s robustness and feature discriminability in complex environments and enhances the early-stage response to small targets.
In the Neck, a network named MS_FPN is proposed. This network incorporates edge information to support feature expression and adopts an improved cross-scale bidirectional fusion mechanism to effectively integrate deep semantic and shallow detail features. This design enhances detection robustness and localization accuracy across targets of various scales.
In the Head, a high-resolution detection layer dedicated to small objects is added (the Small-Detect in Figure 2). Additionally, the standard convolution is replaced by a newly designed Receptive Field and Convolutional Block Attention Module (RFCBAM), which integrates Receptive Field Attention (RFA) and the Convolutional Block Attention Module (CBAM). By jointly modeling the importance of channel and spatial dimensions, this module guides the network to focus more precisely on key regions and feature channels of small UAV targets, significantly improving both the accuracy and stability of small object detection.
2.3. C3K2-FB Module Design
Although the C3K2 module in the backbone of YOLO11 incorporates an efficient residual structure and multi-scale convolutional combinations to enhance feature extraction efficiency, its core operations still rely on standard convolutions. Standard convolutions have significant limitations in complex scenarios because their fixed kernel parameters prevent them from adapting to spatial structural variations. When target and background share similar textures or colors, this often leads to distracted attention, blurred object boundaries, and difficulty in suppressing irrelevant interference while focusing on critical discriminative regions. To address these issues, a novel feature extraction and fusion module—C3K2-FB—is proposed to replace the original C3K2 structure. As shown in Figure 3, the C3K2-FB module integrates the Slim Fusion Module (SFM) [24], which introduces a gating mechanism and wide-receptive-field convolution to dynamically enhance target features and suppress background interference in complex environments. This module facilitates the efficient integration of multi-scale semantic information while keeping computational overhead low, significantly enhancing the model’s ability to detect small, occluded, and scale-varying targets in cluttered scenes.
C3K2-FB draws on the feature stream optimization concept of C2f and adopts a restructured fusion strategy to improve the learning efficiency of deep features. Specifically, the input feature maps are first processed by deep convolution to extract basic spatial and semantic representations of multi-scale targets. These features are then reshaped and evenly grouped along the channel dimension, ensuring balanced distribution and diversity of semantic information such as edges and textures. Each group is passed through the SFM module (as shown in Figure 4), where a residual enhancement mechanism is used to emphasize key regions and suppress background noise. The outputs from SFM are then fused with the input features (via concatenation or weighted summation), followed by a convolutional integration step to generate the optimized features, thereby forming a closed-loop enhancement of the feature stream.
As the core structural component of the C3K2-FB module, SFM combines an efficient element-wise multiplication mechanism (referred to as the “star operation”) with a Gated Linear Unit (GLU)-like structure, achieving dynamic feature modulation and spatial enhancement with minimal computational cost. Initially, SFM uses depthwise separable convolution to split the input into two sub-feature groups, which are processed by two parallel branches: the main branch applies standard convolution to extract fundamental features, while the gating branch generates dynamic weight maps that are element-wise multiplied with the main branch output. This “star operation” replaces traditional summation to enable high-dimensional nonlinear feature modeling and adaptive enhancement of key information. The outputs of both branches are then fused via concatenation or element-wise operations followed by convolution, enhancing the representation of fine-grained spatial information. The fused features are subsequently reconstructed using depthwise separable convolution, and a GLU-like gating mechanism is applied to adjust channel importance dynamically. Specifically, the reconstructed features are split into two linear projections: one undergoes a 3 × 3 depthwise convolution followed by SiLU activation [25] and is then element-wise multiplied with the other, allowing for dynamic control of channel responses. Finally, a residual connection is used to merge the module output with the original input, which effectively alleviates gradient vanishing while preserving information integrity and enhancing feature representation.
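For clarity, the following PyTorch sketch illustrates the SFM data flow described above (split into two sub-feature groups, “star” element-wise modulation, GLU-like channel gating, and a residual connection). It is an interpretation of the text rather than the authors’ implementation; the channel widths, kernel sizes, and absence of normalization layers are assumptions.

```python
# Hedged sketch of the SFM described above; exact hyperparameters are assumptions.
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pw = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pw(self.dw(x))

class SFM(nn.Module):
    """Slim Fusion Module sketch: 'star' (element-wise) gating plus a GLU-like
    channel gate, closed by a residual connection to the input."""
    def __init__(self, channels):
        super().__init__()
        # Split the input into two sub-feature groups via a depthwise separable projection.
        self.split_proj = DWSeparableConv(channels, 2 * channels)
        # Main branch: standard convolution extracting fundamental features.
        self.main = nn.Conv2d(channels, channels, 3, padding=1)
        # Gating branch: dynamic weight maps for the star operation.
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        # Fuse the modulated features, then reconstruct with a depthwise separable conv.
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.reconstruct = DWSeparableConv(channels, 2 * channels)
        # GLU-like gate: one projection passes through a 3x3 depthwise conv + SiLU,
        # then multiplies the other projection element-wise.
        self.glu_dw = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels), nn.SiLU()
        )
        self.out_proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        a, b = self.split_proj(x).chunk(2, dim=1)   # two sub-feature groups
        star = self.main(a) * self.gate(b)          # "star operation": element-wise modulation
        y = self.fuse(star)
        u, v = self.reconstruct(y).chunk(2, dim=1)  # two linear projections
        y = self.out_proj(self.glu_dw(u) * v)       # GLU-like dynamic channel gating
        return x + y                                # residual connection

if __name__ == "__main__":
    print(SFM(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```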
2.4. Multi-Scale Edge-Enhanced Feature Pyramid Design
Although traditional pyramid structures in YOLO-based object detection models possess certain capabilities for multi-scale feature processing, they still exhibit significant limitations in feature extraction and fusion. Typically, these structures rely on progressive downsampling of images and extract features independently at each scale. This approach struggles to simultaneously preserve fine-grained details present in high-resolution feature maps and global semantic context available in low-resolution maps. This limitation is particularly detrimental for small object detection, where high-level semantic features tend to lose edge and texture details, while low-level features lack sufficient contextual information, ultimately reducing detection accuracy. Moreover, conventional pyramid designs employ fixed-scale processing, which is inflexible to the diverse object scale variations encountered in complex scenes. This often results in feature discontinuities and inconsistencies across scales, thereby constraining the model’s robustness and generalization ability for multi-scale object detection.
To overcome these limitations, we propose a novel multi-scale edge-enhanced feature pyramid network, termed MS_FPN, inspired by the design philosophy of SlimNeck. The MS_FPN integrates a newly designed MS_Fusion_Block (Multi-scale Selective Fusion Block), which incorporates the FrequencyEnhancer edge enhancement module. This design enables precise modeling of object contours and effective fusion of multi-scale semantic information. The MS_FPN substantially improves the model’s ability to preserve fine details and its robustness in detecting small, distant UAV targets, while maintaining computational efficiency.
Structurally, the MS_Fusion_Block, shown in Figure 5, serves as the core unit of MS_FPN and consists of four primary processing stages: (1) Multi-scale Context Extraction: Adaptive average pooling generates feature maps at multiple resolutions to capture cross-scale contextual information. (2) Efficient Feature Modeling: Each scale’s features undergo parallel 1 × 1 convolutions for channel compression and grouped 3 × 3 convolutions for spatial detail extraction, balancing feature richness and computational cost. (3) Edge Saliency Enhancement: The embedded FrequencyEnhancer module first extracts low-frequency background via average pooling, and then isolates high-frequency edge information through residual calculation. Convolution followed by Sigmoid activation enhances edge saliency, and residual fusion generates the edge-enhanced feature maps. (4) Cross-scale Feature Fusion: The enhanced multi-scale features are upsampled, aligned in channel dimension, concatenated, and fused via convolution to produce the final refined feature map. This module effectively highlights object contours and improves semantic consistency and boundary localization precision under complex background conditions, providing a stable and robust feature foundation for subsequent detection heads.
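The following PyTorch sketch mirrors the four stages described above, including a minimal FrequencyEnhancer. It is illustrative only; the pooling sizes, compression ratio, and group counts are assumptions rather than the authors’ settings.

```python
# Hedged sketch of the MS_Fusion_Block and embedded FrequencyEnhancer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyEnhancer(nn.Module):
    """Edge saliency enhancement: low-frequency content via average pooling,
    high-frequency residual as edges, Conv+Sigmoid gating, residual fusion."""
    def __init__(self, channels, pool_k=3):
        super().__init__()
        self.low_pass = nn.AvgPool2d(pool_k, stride=1, padding=pool_k // 2)
        self.edge_gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        low = self.low_pass(x)                   # low-frequency background
        high = x - low                           # high-frequency edge residual
        return x + high * self.edge_gate(high)   # edge-enhanced feature map

class MSFusionBlock(nn.Module):
    """Multi-scale Selective Fusion Block sketch: multi-scale context pooling,
    lightweight per-scale modeling, edge enhancement, and cross-scale fusion."""
    def __init__(self, channels, pool_sizes=(1, 2, 4), reduction=4, groups=4):
        super().__init__()
        mid = max(channels // reduction, groups)
        self.pool_sizes = pool_sizes
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, mid, 1),                       # 1x1 channel compression
                nn.Conv2d(mid, mid, 3, padding=1, groups=groups),  # grouped 3x3 spatial modeling
                FrequencyEnhancer(mid),                            # edge saliency enhancement
            )
            for _ in pool_sizes
        )
        self.fuse = nn.Conv2d(mid * len(pool_sizes) + channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [x]
        for size, branch in zip(self.pool_sizes, self.branches):
            ctx = F.adaptive_avg_pool2d(x, size)                   # multi-scale context extraction
            outs.append(F.interpolate(branch(ctx), size=(h, w),
                                      mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(outs, dim=1))                   # cross-scale feature fusion

if __name__ == "__main__":
    print(MSFusionBlock(128)(torch.randn(1, 128, 40, 40)).shape)   # torch.Size([1, 128, 40, 40])
```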
2.5. Small Detection Head Based on Hybrid Attention Convolution
Low-altitude UAV detection typically falls into the category of small object detection. The original YOLO11 architecture employs three detection heads; however, due to the progressive downsampling operations in deep networks, detailed features of small objects contained in shallow layers are largely lost during transmission. This compromises the model’s ability to accurately detect small-scale UAV targets. To address this issue and better preserve fine-grained visual information, an additional high-resolution detection head specifically designed for small object detection is introduced into the framework. Moreover, standard convolution modules commonly used in traditional detection heads treat all spatial positions and channel features equally, lacking the ability to adaptively focus on key spatial regions (e.g., object contours) and discriminative channels (e.g., texture or shape-related features). This limits the effectiveness of feature extraction and fusion for small object detection.
To overcome these limitations, a hybrid attention convolution module, named RFCBAM, is proposed to replace the standard convolution in the newly added detection head. RFCBAM integrates the Receptive Field Attention (RFA) mechanism [26] with the Convolutional Block Attention Module (CBAM) [27] to enhance the model’s capability in both spatial and channel dimensions. RFA dynamically adjusts the receptive field to better capture small targets with diverse spatial distributions; however, it lacks the ability to suppress irrelevant channel-wise information and is susceptible to background clutter. In contrast, CBAM excels at channel attention and enhances target-related features, but its spatial attention mechanism is constrained by fixed receptive fields, limiting its effectiveness for objects with large scale variations.
RFCBAM overcomes these limitations through a synergistic combination of the two mechanisms. Specifically, RFA provides adaptive spatial anchoring that guides CBAM’s attention focus, effectively mitigating background interference. Meanwhile, CBAM’s channel attention further refines the features within the dynamically localized regions identified by RFA. This dual enhancement strategy enables more accurate spatial localization and more discriminative feature representation for small targets in complex scenes. As a result, RFCBAM significantly improves the model’s ability to extract and utilize relevant features while maintaining strong robustness against background noise.
The structure of the hybrid attention-based small object detection head is shown in Figure 6, where the input features are directly taken from the first shallow layer of the backbone to maximally retain raw spatial detail and positional cues. The architecture of the proposed RFCBAM module is illustrated in Figure 7.
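As a reference for the mechanism described above, the sketch below combines a simplified receptive-field attention (in the spirit of RFAConv) with a standard CBAM. It is not the authors’ RFCBAM implementation; the kernel size, reduction ratio, and composition order are assumptions.

```python
# Hedged sketch of an RFCBAM-style block: simplified receptive-field attention
# followed by CBAM channel and spatial attention.
import torch
import torch.nn as nn

class ReceptiveFieldAttention(nn.Module):
    """Weights each position inside the k x k receptive field of every pixel."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.unfold = nn.Unfold(kernel_size=k, padding=k // 2)
        self.weight = nn.Conv2d(channels, k * k, 1)   # per-pixel attention over k*k positions
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        patches = self.unfold(x).view(b, c, self.k * self.k, h, w)   # receptive-field features
        attn = torch.softmax(self.weight(x), dim=1).unsqueeze(1)      # (b, 1, k*k, h, w)
        return self.proj((patches * attn).sum(dim=2))                 # attention-weighted aggregation

class CBAM(nn.Module):
    """Standard CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_k=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_k, padding=spatial_k // 2)

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))    # channel attention
        x = x * ca
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(1, keepdim=True),
                                                   x.amax(1, keepdim=True)], dim=1)))  # spatial attention
        return x * sa

class RFCBAM(nn.Module):
    """Receptive-field attention anchors the spatial focus; CBAM then refines
    channel and spatial responses inside those regions."""
    def __init__(self, channels):
        super().__init__()
        self.rfa = ReceptiveFieldAttention(channels)
        self.cbam = CBAM(channels)

    def forward(self, x):
        return self.cbam(self.rfa(x))

if __name__ == "__main__":
    print(RFCBAM(64)(torch.randn(1, 64, 80, 80)).shape)   # torch.Size([1, 64, 80, 80])
```

In the proposed head, a block of this kind would take the place of the standard convolution preceding the small-object prediction layer, as described in the text.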
3. Experiments and Results
3.1. Dataset Details
This study adopts the DeTFly dataset as the primary benchmark for training and evaluating the proposed model. DeTFly [28] is specifically designed for low-altitude UAV detection in cluttered environments, covering a variety of challenging scenes such as urban, sky, and mountain. These scenarios introduce significant background interference and target occlusion. Representative sample images are shown in Figure 8. While the dataset primarily focuses on UAVs, it presents significant intra-class diversity with respect to object scale, background complexity, and visibility. Images vary in lighting (from morning to evening), background types (e.g., sky, urban, mountain, field), relative distances, and UAV perspectives (top, front, bottom). Annotations were manually created by trained annotators and verified via double-checking to ensure bounding box accuracy and consistency.
The dataset contains over 10,000 high-resolution images (3840 × 2160 pixels) with corresponding bounding box annotations. All labeled objects are small low-altitude UAVs, with most targets occupying less than 5% of the image area, reflecting the typical challenges of small object detection. The dataset is split into training (8000 images), validation (1000 images), and test (1000 images) sets in an 8:1:1 ratio to support model training, hyperparameter tuning, and final evaluation.
3.2. Experimental Configuration
To ensure training stability and reproducibility, all experiments were conducted under the hardware and software configurations shown in Table 2. The following data augmentation strategies were applied: rotational symmetry (probability 0.2), symmetric MixUp (probability 0.1), and horizontal flip (probability 0.5). The training hyperparameter settings are shown in Table 3.
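A representative training command using the public Ultralytics API is sketched below. Only the horizontal flip and MixUp probabilities come from the text above; the dataset YAML name, epochs, image size, and batch size are placeholders rather than the values in Table 3, and the rotation augmentation is omitted because the public API exposes a rotation range (degrees) rather than a probability.

```python
# Hedged training sketch with the public Ultralytics API; placeholder values
# do NOT reproduce the exact settings of Table 3.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")        # baseline architecture; the proposed modules would need a custom model YAML
model.train(
    data="DeTFly.yaml",             # hypothetical dataset config (train/val/test paths, class names)
    epochs=300,                     # placeholder
    imgsz=640,                      # placeholder
    batch=16,                       # placeholder
    fliplr=0.5,                     # horizontal flip probability (as stated in the text)
    mixup=0.1,                      # MixUp probability (as stated in the text)
)
metrics = model.val(split="test")   # evaluate on the held-out test split
print(metrics.box.map50, metrics.box.map)   # mAP@0.5 and mAP@0.5:0.95
```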
3.3. Evaluation Metrics
To comprehensively evaluate the detection performance of the proposed framework in cluttered low-altitude scenarios, the following widely adopted object detection metrics are used:
Precision (P): The proportion of predicted positive samples that are actually true positives, reflecting the accuracy of the detection results.
Recall (R): The proportion of actual positive instances that are correctly detected, indicating the model’s ability to cover all targets.
AP (Average Precision): The average of precision values at different recall levels, used to measure the detection capability for each category. In this study, AP is computed for the single-class UAV target.
mAP (mean Average Precision): The mean of AP across all classes. Since only one class is considered, mAP is equivalent to AP. The mAP@0.5:0.95 metric, which averages performance over Intersection over Union (IoU) thresholds from 0.5 to 0.95, provides a more rigorous assessment of both localization accuracy and detection robustness. This is particularly critical in small UAV detection, where even minor localization errors significantly impact the IoU due to the small size of the target regions.
F1 Score (F1): The harmonic mean of Precision and Recall, providing a balanced measure of performance. It is defined below.
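With TP, FP, and FN denoting true positives, false positives, and false negatives, the standard formulations of these metrics are:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}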
3.4. Ablation Experiments
To evaluate the contribution of each proposed module to the overall model performance, a series of ablation experiments were conducted by progressively integrating specific structural components, followed by comparative evaluation on the DeTFly dataset. The main evaluation metrics include Precision, Recall, F1 Score, mAP@0.5, mAP@0.5:0.95, and Parameters. The evaluated configurations are as follows:
Baseline: The original YOLO11n model, serving as the comparison reference.
Baseline + C3K2-FB: Replaces the standard C3K2 module in the backbone with the proposed C3K2-FB, to assess its impact on feature extraction and background suppression.
Baseline + MS_FPN: Introduces the Multi-Scale Edge-Enhanced Feature Pyramid (MS_FPN) in the neck to evaluate its effectiveness in multi-scale fusion and small object boundary preservation.
Baseline + RFCBAM-based Small Head: Adds an additional high-resolution detection head tailored for small objects and replaces the standard convolution with the RFCBAM hybrid attention module, to assess its influence on small object detection precision.
Ours: Integrates all the proposed modules into a unified Multi-Scale and Small Object Detection Framework, serving as the final model for evaluating the overall architecture’s performance.
The ablation experiment results are shown in Table 4. For ease of visual comparison among the different module combinations, a bar graph of the ablation results is shown in Figure 9.
Ablation results demonstrate that, compared with the baseline model, introducing the C3K2-FB module leads to an increase in mAP@0.5 from 87.6% to 88.9% and in mAP@0.5:0.95 from 54.0% to 58.0%, with the F1 score improving to 88.9%. These results indicate that C3K2-FB effectively enhances feature extraction while suppressing background interference in complex scenes. With the addition of the MS_FPN structure, mAP@0.5:0.95 improves to 56.1%, and the Grad-CAM heatmaps (as shown in Figure 10) demonstrate the module’s effectiveness in multi-scale feature fusion and fine-grained boundary preservation, particularly in small object detection. The introduction of the RFCBAM-based small object detection head yields the most significant performance gains, increasing Recall to 90.2%, F1 Score to 92.8%, and mAP@0.5 to 93.4%. These results highlight the module’s strong capability in enhancing the model’s sensitivity to small targets. The final integrated model (Ours) achieves the best overall performance, with mAP@0.5 and mAP@0.5:0.95 reaching 94.4% and 60.9%, respectively, and F1 Score rising to 93.4%. Precision and Recall also improve to 97.1% and 89.9%. Notably, these improvements are achieved with only 2.39 million parameters, even fewer than the baseline model, demonstrating the lightweight and efficient design of the proposed framework.
In summary, all three modules contribute positively to the model’s performance. Among them, the RFCBAM head shows the most significant improvement in small object detection, while the C3K2-FB module enhances overall feature representation under a lightweight design. The final integrated framework achieves a well-balanced trade-off among accuracy, robustness, and computational efficiency.
3.5. Comparison Experiment with Mainstream Models
To further validate the overall performance of the proposed model, comparative experiments were conducted against several mainstream object detection models, including the two-stage detector Faster R-CNN, the transformer-based detector RT-DETR-L, and multiple versions of lightweight YOLO series models (YOLOv8n, YOLOv9n, YOLOv10n, and YOLO11n). All baseline models were implemented using their official or widely adopted open-source versions, and training pipelines were aligned in terms of augmentation, learning rates, batch sizes, and epochs to ensure fair and reproducible comparisons.
The experimental results compared with mainstream models are shown in Table 5.
The experimental results demonstrate that the proposed method achieves a Precision of 97.1% and a Recall of 89.9%, representing improvements of 2.0 and 8.3 percentage points over YOLO11n, respectively. This indicates a significant enhancement in the model’s detection robustness under complex background conditions. The mAP@0.5:0.95 reaches 60.9%, surpassing YOLO11n’s 54.0% and approaching the performance of RT-DETR-L, while requiring only about one-twelfth of its parameters and significantly reducing computational complexity. Compared with earlier models such as YOLOv10n, YOLOv9n, and YOLOv8n, the proposed framework achieves mAP@0.5 improvements of 8.0%, 14.5%, and 13.2%, respectively, while maintaining similar or lower parameter counts. Similarly, the mAP@0.5:0.95 improves by 11.0%, 20.3%, and 17.6%, respectively. Compared to Faster R-CNN, the model improves mAP@0.5 by 11.6% and mAP@0.5:0.95 by 12.5%, with over 95% reduction in parameters. In summary, the proposed Multi-Scale and Small Object Detection Framework demonstrates superior detection accuracy, enhanced sensitivity to small objects, and excellent model efficiency in low-altitude UAV detection tasks. It significantly outperforms the selected baseline models, highlighting its broad potential for practical deployment in complex real-world scenarios.
3.6. Visualization Results
To further validate the effectiveness of the proposed model in detecting small UAV targets under complex background conditions, this study presents a series of visual analyses from multiple perspectives. The visualization includes three main aspects: Precision–Recall (P-R) curves, detection result visualizations, and heatmaps.
Figure 11 shows the PR curves of different models evaluated on the DeTFly test set. Compared with YOLOv8n, YOLOv9n, YOLOv10n, and YOLO11n, the proposed method demonstrates a significantly better PR curve, with the curve lying closer to the ideal top-right corner. This indicates that the proposed model achieves superior detection performance and robust discriminative capability across various confidence thresholds, particularly suitable for complex application scenarios where high recall must be maintained under low-confidence settings.
Figure 12 visualizes the detection results on representative test images, including bounding boxes with their corresponding confidence scores. It can be observed that the proposed model is able to accurately locate small UAV targets even under severe background clutter and occlusion, and assigns higher confidence scores compared to other models. In contrast, competing models often exhibit unstable confidence distributions or miss detections in similar scenes. As shown in Figure 12, in urban scenarios, Faster R-CNN, YOLOv8, YOLOv9, and the baseline model failed to detect the UAV target, while our method successfully identified it. Similar failure cases were also observed in other complex environments. These results demonstrate that the integration of the C3K2-FB and RFCBAM modules effectively enhances the model’s ability to focus on discriminative regions while suppressing background noise, thereby reducing false positives and false negatives in complex scenes.
Figure 13 presents heatmaps generated using the Grad-CAM method to visualize the model’s attention distribution across feature maps at different levels. The proposed model exhibits clear attention focus on the UAV’s contour and body region in the high-level semantic features, rather than being misled by irrelevant background textures such as trees, buildings, or sky. This result further validates the contribution of the MS_FPN module in enhancing edge sensitivity and spatial awareness, thereby improving the model’s perception accuracy for small or indistinct targets.
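For reference, the Grad-CAM procedure behind such heatmaps can be sketched with plain PyTorch hooks as follows. The choice of target layer and of the scalar score to back-propagate are assumptions and must be adapted to the detector (e.g., a neck output feeding the detection head and the top objectness/class confidence).

```python
# Hedged Grad-CAM sketch using plain PyTorch hooks (assumes batch size 1).
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, score_fn):
    """Returns an (H, W) heatmap for input x, normalized to [0, 1].

    score_fn maps the model output to a scalar (e.g., the highest class
    confidence) whose gradients drive the attention map.
    """
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model.eval()
        score = score_fn(model(x))                     # scalar score to explain
        model.zero_grad()
        score.backward()
        a, g = feats[0], grads[0]                      # activations and their gradients
        weights = g.mean(dim=(2, 3), keepdim=True)     # global average pooling of gradients
        cam = F.relu((weights * a).sum(dim=1))         # weighted channel sum + ReLU
        cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                            mode="bilinear", align_corners=False).squeeze()
        cam -= cam.min()
        return cam / (cam.max() + 1e-8)                # normalize to [0, 1]
    finally:
        h1.remove()
        h2.remove()
```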
4. Conclusions
This paper proposes a multi-scale and small object detection framework tailored for low-altitude UAV detection in cluttered environments, where small object size, background complexity, and scale variation present significant challenges. Built upon the YOLO11 architecture, three key enhancements are introduced: the C3K2-FB module for improving feature extraction and suppressing irrelevant background noise; the MS_FPN module for effective multi-scale fusion and enhanced edge representation of small objects; and the RFCBAM detection head that integrates spatial and channel attention to improve focus on discriminative features. Experimental results demonstrate that our model achieves superior performance, with an mAP@0.5 of 94.4% and an mAP@0.5:0.95 of 60.9%, outperforming the baseline YOLO11n by 6.9 percentage points in mAP@0.5:0.95. Compared to other state-of-the-art models such as RT-DETR-L and YOLOv10n, our framework offers a better trade-off between detection accuracy and model complexity, with only 2.7 M parameters and 11.2 GFLOPs. These characteristics make the framework a promising candidate for deployment in real-time UAV monitoring and security defense applications, such as airport perimeter protection, sensitive facility surveillance, and public event safety.
Despite these encouraging results, several limitations remain. The proposed model has not been extensively validated under dynamic tracking conditions or in extreme weather environments, where motion blur, occlusion, and lighting variance may degrade performance. In future work, we plan to perform comprehensive cross-dataset evaluation using benchmarks such as UAVSwarm and UAVFly, and to extend the model to real-time multi-target tracking, UAV swarm detection, and multi-sensor fusion scenarios to further enhance its robustness and application potential in real-world deployments.