I-YOLOv11n: A Lightweight and Efficient Small Target Detection Framework for UAV Aerial Images
Abstract
1. Introduction
- We propose a multiscale feature enhancement module, RFCBAMConv (Receptive Field and Channel Block Attention Module Convolution). It integrates deformable convolution with a channel–spatial co-attention mechanism to build an adaptive receptive field, significantly enhancing the perception of the edge and texture characteristics of small targets (a minimal sketch follows this list).
- We design a fusion structure that combines dilated feature pyramid convolution (DFPC) with STCMSP features. Together with the CGBD module and the OmniKernel strategy, it performs multi-level contextual feature compensation to strengthen semantic modeling in complex backgrounds (see the DFPC sketch after this list).
- We redesign the Dynamic Detection Head (DyHead) with a hybrid Transformer mechanism. Building on DyHead's dynamic weighting framework, we integrate a cross-scale self-attention mechanism to strengthen joint spatial–channel–scale feature modeling, which significantly improves localization consistency and classification accuracy for small targets in complex backgrounds while boosting detection robustness and positioning stability (a triple-attention sketch appears after this list).
- We propose a three-stage lightweight strategy, "Distillation–Anchor–Pruning". It transfers semantic capability from a teacher model via hybrid knowledge distillation, refines the anchor distribution via K-means++ clustering, and compresses channel redundancy via LAMP–Taylor pruning, achieving an effective balance between accuracy and computational efficiency (a distillation-loss sketch appears after this list).
- Experiments are carried out on three aerial datasets: VisDrone, AI-TOD, and SODA-A. The results show that the proposed algorithm improves mAP@0.5 and mAP@0.5:0.95 over the YOLOv11n baseline by 7.1 and 4.9 percentage points, respectively, reduces computation to 14.7 GFLOPs, and keeps the parameter count within 3.87 M, while maintaining real-time processing (24 FPS) on embedded platforms such as the Jetson TX2, meeting the deployment requirements of embedded terminals.
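As a concrete illustration of the RFCBAMConv idea, the following minimal PyTorch sketch wraps a deformable convolution with CBAM-style channel and spatial attention. The class name, the offset-prediction layer, and the reduction ratio are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RFCBAMConvSketch(nn.Module):
    """Deformable convolution + channel/spatial attention (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=3, reduction=16):
        super().__init__()
        # Offsets are predicted from the input itself, giving each position
        # an adaptive receptive field (2 offsets per kernel tap).
        self.offset = nn.Conv2d(c_in, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(c_in, c_out, k, padding=k // 2)
        # Channel attention: shared 1x1-conv MLP over avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(c_out, c_out // reduction, 1), nn.ReLU(),
            nn.Conv2d(c_out // reduction, c_out, 1))
        # Spatial attention: 7x7 conv over stacked channel-wise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        y = self.deform(x, self.offset(x))
        ca = torch.sigmoid(self.mlp(y.mean((2, 3), keepdim=True)) +
                           self.mlp(y.amax(2, keepdim=True).amax(3, keepdim=True)))
        y = y * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [y.mean(1, keepdim=True), y.amax(1, keepdim=True)], dim=1)))
        return y * sa
```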
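The DFPC branch structure and its dynamic weight fusion (Section 3.3.2) can be sketched as parallel dilated 3×3 branches blended by softmax-normalized learnable weights; the dilation rates here are assumed values for illustration.

```python
import torch
import torch.nn as nn

class DFPCSketch(nn.Module):
    """Parallel dilated branches fused with learned softmax weights (illustrative)."""
    def __init__(self, c, rates=(1, 2, 4, 8)):
        super().__init__()
        # padding == dilation keeps the spatial size constant for 3x3 kernels.
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=r, dilation=r) for r in rates)
        self.logits = nn.Parameter(torch.zeros(len(rates)))  # one weight per branch

    def forward(self, x):
        w = torch.softmax(self.logits, dim=0)  # dynamic fusion weights
        return sum(wi * b(x) for wi, b in zip(w, self.branches))
```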
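The spatial–channel–scale co-modeling in the DyHead redesign can be illustrated by a simplified triple-attention pass over a level-stacked tensor; the sigmoid gates below are simplified stand-ins for the paper's attention functions.

```python
import torch
import torch.nn as nn

class TripleAttentionSketch(nn.Module):
    """Scale, spatial, and channel gating applied in sequence to features of
    shape (N, L, C, H, W), echoing DyHead's pi_L * pi_S * pi_C factorization."""
    def __init__(self, c):
        super().__init__()
        self.scale = nn.Linear(c, 1)                  # one weight per pyramid level
        self.spatial = nn.Conv2d(c, 1, 3, padding=1)  # per-position gate
        self.channel = nn.Linear(c, c)                # per-channel gate

    def forward(self, x):                             # x: (N, L, C, H, W)
        n, l, c, h, w = x.shape
        s = torch.sigmoid(self.scale(x.mean((3, 4))))        # (N, L, 1)
        x = x * s[..., None, None]                           # scale attention
        g = torch.sigmoid(self.spatial(x.flatten(0, 1)))     # (N*L, 1, H, W)
        x = x * g.view(n, l, 1, h, w)                        # spatial attention
        ch = torch.sigmoid(self.channel(x.mean((1, 3, 4))))  # (N, C)
        return x * ch[:, None, :, None, None]                # channel attention
```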
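The hybrid knowledge distillation step combines response-level soft labels with feature-level imitation; a hedged sketch follows, where the temperature and mixing weight are placeholder values rather than the paper's settings.

```python
import torch.nn.functional as F

def hybrid_kd_loss(s_logits, t_logits, s_feat, t_feat, T=4.0, alpha=0.5):
    """Temperature-scaled KL on logits plus MSE on intermediate features."""
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # standard T^2 rescaling
    feat = F.mse_loss(s_feat, t_feat)                 # feature imitation term
    return alpha * soft + (1.0 - alpha) * feat
```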
2. Related Work
2.1. Research on UAV Small Target Detection Algorithms
2.2. Lightweight Target Detection Model Design
3. Methods
3.1. Network Structure
3.1.1. RFCBAMConv Feature Enhancement Module
3.1.2. Dilated Feature Pyramid Convolution (DFPC) and STCMSP Multiscale Fusion Module
3.1.3. DyHead Detection Head and Hybrid-Transformer Fusion Mechanism
3.1.4. Structure Compression and Anchor Box Optimization Strategy
3.2. Improved RFCBAMConv Module
3.2.1. Dynamic Receptive Field Modeling: Introducing Deformable Convolution
3.2.2. Hybrid Attention Mechanism (HAM)
3.2.3. Efficient Feature Refinement Strategy: Dynamic Convolution and Depthwise Separable Convolution
3.2.4. Dynamic Sampling and Attention Weight Coupling Mechanism
3.3. Improved Dilated Feature Pyramid Convolution (DFPC) Module
3.3.1. Hierarchical Multiscale Dilated Convolution Group Design
3.3.2. Dynamic Weight Fusion Mechanism
3.3.3. Lightweight Context-Guided Sampling as an Alternative to Pooling
3.3.4. Cross-Layer Multiscale Feature Pyramid Path Integration
3.4. DyHead Dynamic Detection Head Module
3.4.1. Three-Dimensional Attention Mechanism Framework
3.4.2. Hybrid Transformer Module
3.4.3. Channel Recalibration Mechanism
3.5. Knowledge Distillation, Network Pruning, and Anchor Optimization Strategy
3.5.1. Hybrid Knowledge Distillation Mechanism
3.5.2. Layer-Adaptive Pruning
3.5.3. Anchor Box Optimization Strategy
4. Experimental Results and Analysis
4.1. Datasets
4.2. Experimental Environment
4.2.1. Training and Testing Platform Configuration
4.2.2. Edge Deployment Testbed
4.2.3. Training Parameter Configuration and Optimization Strategies
4.3. Evaluation Metrics
4.3.1. Detection Accuracy Metrics
4.3.2. Computational Efficiency Metrics
4.3.3. Performance Metrics
4.3.4. Lightweighting Efficiency Metrics
4.4. Experiments and Analysis of Results
4.4.1. Ablation Experiments
4.4.2. Comprehensive Comparative Experiment
4.4.3. Visualization and Analysis of Results
Scenario 1: High-Density Small Target Detection
Scenario 2: Low Light Adaptability
Scenario 3: Complex Background Suppression Ability
Scenario 4: Occluded Small Target Recognition Ability
Scenario 5: Adaptability to Extreme Fog Conditions
4.4.4. Experiment on Model Generalization Capability
- Multiscale semantic enhancement structure: the RFCBAMConv and DFPC modules strengthen receptive-field adaptability and context modeling in the backbone, making texture details and edge contours more salient across environments and aiding the extraction of generalizable features.
- Dynamic detection head and STCMSP structure: the DyHead detection head, combined with the STCMSP pyramid, improves the selection of regions of interest for small targets during detection, strengthens the synergy among multiscale features, and effectively reduces false and missed detections caused by complex backgrounds.
- Structure-aware lightweight strategy: a pruning criterion combining the LAMP and Taylor scoring mechanisms, together with a structure-preserving hybrid knowledge distillation method, compresses the model parameters while preserving the teacher network's cross-scene discrimination ability, thereby improving the generalization of the student network (a scoring sketch follows this list).
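For illustration, here is a minimal sketch of one way LAMP and first-order Taylor scores could be combined into a single per-weight pruning criterion; the linear mix and max-normalization are assumptions, not the paper's exact formulation.

```python
import torch

def lamp_scores(weight):
    """LAMP: each squared magnitude normalized by the sum of all weights in the
    layer that are at least as large (computed via a suffix sum after sorting)."""
    w2 = weight.detach().flatten().pow(2)
    sorted_w2, idx = torch.sort(w2)  # ascending
    suffix = torch.flip(torch.cumsum(torch.flip(sorted_w2, [0]), 0), [0])
    scores = torch.empty_like(w2)
    scores[idx] = sorted_w2 / suffix
    return scores

def taylor_scores(weight):
    """First-order Taylor importance |w * dL/dw|; requires a prior backward pass."""
    return (weight.detach() * weight.grad.detach()).abs().flatten()

def combined_scores(weight, beta=0.5):
    l, t = lamp_scores(weight), taylor_scores(weight)
    # Hypothetical mix of the two criteria after max-normalization.
    return beta * l / (l.max() + 1e-12) + (1 - beta) * t / (t.max() + 1e-12)
```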
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Semsch, E.; Jakob, M.; Pavlicek, D.; Pechoucek, M. Autonomous UAV Surveillance in Complex Urban Environments. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Milan, Italy, 15–18 September 2009; Volume 2, pp. 82–85.
- Fascista, A. Toward Integrated Large-Scale Environmental Monitoring Using WSN/UAV/Crowdsensing: A Review of Applications, Signal Processing, and Future Perspectives. Sensors 2022, 22, 1824.
- Yucesoy, E.; Balcik, B.; Coban, E. The Role of Drones in Disaster Response: A Literature Review of Operations Research Applications. Int. Trans. Oper. Res. 2025, 32, 545–589.
- Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2024, 16, 149.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
- Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
- Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO11, Version 11.0.0; Ultralytics: Frederick, MD, USA, 2024.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Li, B.; Li, S. Improved UAV Small Target Detection Algorithm based on YOLOv11n. J. Comput. Eng. Appl. 2025, 61, 96–104.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv 2015, arXiv:1510.00149.
- Lee, J.; Park, S.; Mo, S.; Ahn, S.; Shin, J. Layer-adaptive sparsity for the magnitude-based pruning. arXiv 2020, arXiv:2010.07611.
- Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440.
- Sun, J.; Gao, H.; Yan, Z.; Qi, X.; Yu, J.; Ju, Z. Lightweight UAV object-detection method based on efficient multidimensional global feature adaptive fusion and knowledge distillation. Electronics 2024, 13, 1558.
- Xu, C.; Yang, W.; Yu, H.; Datcu, M.; Xia, G.-S. Density-aware Object Detection in Aerial Images. In Proceedings of the 15th International Conference on Digital Image Processing, Nanjing, China, 19–22 May 2023; pp. 1–9.
- Han, S.; Liu, W.; Wang, S.; Zhang, X.; Zheng, S. Improving Small Object Detection in Tobacco Strands Using Optimized Anchor Boxes. IEEE Access 2025.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179.
- Cheng, G.; Chen, X.; Wang, C.; Li, X.; Xian, B.; Yu, H. Visual fire detection using deep learning: A survey. Neurocomputing 2024, 596, 127975.
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7373–7382.
- Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039.
- Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
- Wang, L.; Fang, S.; Zhang, C.; Li, R.; Duan, C. Efficient hybrid transformer: Learning global-local context for urban scene segmentation. arXiv 2021, arXiv:2109.08937.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 5156–5165.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Chen, D.; Mei, J.-P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; Chen, C. Cross-layer distillation with semantic calibration. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7028–7036.
- Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3967–3976.
- Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2016, arXiv:1608.08710.
- Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.-S. Tiny object detection in aerial images. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3791–3798.
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488.
| Detection Layer | Resolution | Anchor Box Size (Width × Height) | Target Scale Range |
|---|---|---|---|
| P2 | 160 × 160 | (3, 4), (4, 8), (7, 6) | <16 × 16 pixels |
| P3 | 80 × 80 | (6, 12), (9, 15), (13, 8) | 16–32 pixels |
| P4 | 40 × 40 | (15, 23), (18, 12), (32, 17) | 32–64 pixels |
| P5 | 20 × 20 | (25, 36), (52, 33), (67, 83) | >64 pixels |
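Anchors like those in the table above can be reproduced in spirit by clustering ground-truth box sizes with k-means++ initialization. Note that YOLO-style anchor clustering often uses a 1 − IoU distance; this sketch uses plain Euclidean k-means for brevity, and `wh` is an assumed (N, 2) array of box widths and heights from the training set.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(wh, n_anchors=12, seed=0):
    """Cluster (width, height) pairs; 12 centroids map to P2-P5, 3 per level."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10,
                random_state=seed).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by box area
```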
| Dataset | Number of Images | Resolution | Number of Categories | Proportion of Small Objects | Characteristics |
|---|---|---|---|---|---|
| VisDrone2019 | 10,209 | 2000 × 1500 | 10 | 63.8% (<32 × 32) | Dense small objects in urban environments with imbalanced category distribution |
| AI-TOD | 28,036 | 4000 × 3000 | 8 | 89.3% (<16 × 16) | Predominantly extremely small targets; high resolution; dense object distribution |
| SODA-A | 15,654 | 6000 × 4000 | 10 | 62.5% | Multi-seasonal, multi-sensor, occlusion-prone remote sensing scenes |
| Training Parameter | Value |
|---|---|
| Optimizer | AdamW; initial learning rate = 0.001; weight decay = 0.05 |
| Learning rate schedule | Cosine annealing decay |
| Training duration | 500 epochs |
| Batch strategy | Batch size = 16; gradient accumulation steps = 2 |
| Loss weighting | 1.5× weight on the P2/P3 layers, guided by the Programmable Gradient Information (PGI) mechanism, to strengthen small-object detection |
| GPU memory usage | 82.3% occupancy during training, with stable utilization |
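A minimal training-loop sketch matching the configuration above (AdamW at lr 0.001 with weight decay 0.05, cosine annealing over 500 epochs, batch size 16 with 2-step gradient accumulation); `model` and `loader` are assumed placeholders, and `model(batch)` is assumed to return the training loss.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=500)
accum_steps = 2  # effective batch = 16 * 2

for epoch in range(500):
    for step, batch in enumerate(loader):
        loss = model(batch) / accum_steps  # scale loss for accumulation
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()  # cosine annealing per epoch
```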
| Model Configuration | Params (M) | GFLOPs | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS |
|---|---|---|---|---|---|
| Model A (baseline) | 2.58 | 6.3 | 33.0 | 19.2 | 142 |
| Model B (A + RFCBAMConv) | 2.63 (+1.9%) | 6.5 | 33.7 (+2.1%) | 19.7 (+2.6%) | 138 |
| Model C (B + DFPC) | 2.78 (+5.6%) | 6.5 | 34.4 (+4.2%) | 20.2 (+5.2%) | 135 |
| Model D (C + STCMSP + DyHead) | 3.39 (+31.4%) | 13.5 | 38.4 (+16.4%) | 22.9 (+19.3%) | 98 |
| Model E (D + KD + pruning + anchor optimization) | 3.87 (+49.6%) | 14.7 | 40.1 (+21.5%) | 24.1 (+25.5%) | 92 |
| Method | Params (M) | GFLOPs | mAP@0.5 (%) | mAP@0.5:0.95 (%) | FPS |
|---|---|---|---|---|---|
| Faster R-CNN | 63.2 | 370.0 | 30.9 | 13.1 | 7.2 |
| SSD | 12.3 | 63.2 | 24.0 | 11.9 | 25.6 |
| YOLOv5s | 7.2 | 16.5 | 38.8 | 23.2 | 98.4 |
| YOLOv8s | 11.2 | 28.5 | 39.0 | 19.2 | 83.7 |
| YOLOv10n | 2.26 | 6.5 | 34.2 | 19.8 | 132 |
| YOLOv11n (baseline) | 2.58 | 6.3 | 33.0 | 19.2 | 142 |
| Deformable DETR | 41.9 | 173.6 | 36.7 | 24.5 | 12.3 |
| Swin-T | 28.3 | 41.2 | 37.5 | 22.8 | 34.5 |
| YOLOv7-tiny | 9.2 | 21.4 | 28.1 | 15.6 | 157 |
| Ours (I-YOLOv11n) | 3.87 | 14.7 | 40.1 | 24.1 | 92 |
| Model | AI-TOD mAP@0.5 (%) | AI-TOD AP (%) | SODA-A mAP@0.5 (%) | SODA-A AP (%) |
|---|---|---|---|---|
| YOLOv5s | 32.6 | 54.3 | 31.2 | 50.9 |
| YOLOv7-tiny | 33.4 | 55.8 | 30.8 | 49.6 |
| YOLOv5s-UAV-RFKD | 35.1 | 57.4 | 33.0 | 52.7 |
| YOLOv11n | 31.0 | 52.1 | 29.7 | 48.2 |
| Ours (I-YOLOv11n) | 36.5 | 59.0 | 34.9 | 54.6 |