SRTSOD-YOLO: Stronger Real-Time Small Object Detection Algorithm Based on Improved YOLO11 for UAV Imageries
Highlights
- The proposed SRTSOD-YOLO series models significantly improve the detection accuracy of small targets in UAV aerial images while maintaining real-time performance. Compared with YOLO11l on the VisDrone2019 dataset, SRTSOD-YOLO-l improves mAP50 by 7.9 percentage points and reduces the missed-target error Emissed by 1.08 percentage points.
- We propose the MFCAM module (Multi-scale Feature Complementary Aggregation Module) and the GAC-FPN architecture (Gated Activation Convolutional Fusion Pyramid Network), which effectively address the issue of small target feature loss in deep networks and suppress complex background interference through a dynamic gating mechanism.
- The SRTSOD-YOLO model series (n/s/m/l) offers flexible deployment options, meeting the real-time detection requirements of both airborne edge devices and ground workstations.
- The models provide high-precision small-object detection benchmarks for low-altitude economy scenarios such as smart-city traffic monitoring and power-line inspection.
Abstract
1. Introduction
- (1) Airborne Sensing-Ground Computing: In this setup, raw image data are transmitted via a low-latency link to a ground-based GPU cluster for processing. This architecture imposes no strict constraints on model size or computational cost, allowing the use of deep networks that leverage spatial details through strategies such as multi-scale feature fusion and attention mechanisms to enhance small object detection accuracy.
- (2) Onboard Edge Computing: Embedded systems mounted on UAVs require local real-time processing, imposing stringent limits on model complexity and power consumption. These systems demand lightweight models, often designed via neural architecture search, that balance representation capacity and computational efficiency under tight memory and power budgets. The emphasis here shifts from pure accuracy gains to achieving real-time performance within resource-constrained environments, necessitating co-optimization of low-power hardware and compact model design.
- To address the progressive loss of small target information with increasing network depth, we introduce the Multi-scale Feature Complementary Aggregation Module (MFCAM) into the backbone network. The MFCAM is designed to enhance feature extraction by strategically combining multi-scale convolutional features with channel and spatial attention mechanisms. This design enables the module to effectively locate small objects in the image by emphasizing critical feature channels and spatial positions.
- We design a novel neck architecture, termed the Gated Activation Convolutional Fusion Pyramid Network (GAC-FPN), to enhance multi-scale feature fusion by emphasizing semantically important features and suppressing irrelevant background noise. The GAC-FPN incorporates three key strategies to improve small target detection: (1) introducing a detection head with a smaller receptive field while removing the original one with the largest receptive field; (2) leveraging large-scale feature maps more effectively; and (3) integrating a gated activation convolutional module for adaptive feature refinement.
- To address the class imbalance between positive and negative samples, we replace the original binary cross-entropy loss with an adaptive threshold focal loss in the detection head. This modification accelerates network convergence and enhances detection accuracy for small targets.
- To accommodate diverse practical task requirements, we develop multiple versions of the SRTSOD-YOLO object detection model. These include high-capacity models tailored for ground-based workstations, which emphasize multi-scale feature fusion and contextual modeling to leverage the parallel computing capabilities of GPU clusters, as well as lightweight models designed for airborne platforms. The latter enable real-time inference at the edge while maintaining a high recall rate for critical targets. This hierarchical design paradigm enhances the flexibility of algorithm deployment across different operational scenarios.
2. Materials and Methods
2.1. Target Detection Methods of UAV Aerial Images
2.2. The YOLO Series Algorithms
2.3. The YOLO11 Architecture
2.4. The SRTSOD-YOLO Network Structure
2.5. The Multi-Scale Feature Complementary Aggregation Module
- (1) Multi-Scale Convolution with Dual Attention Fusion: Unlike conventional approaches that rely on single-branch convolution or basic attention mechanisms, MFCAM extracts multi-scale contextual features in parallel and fuses them strategically using combined channel and spatial attention. This enables the simultaneous capture of fine-grained details and broader contextual information, while emphasizing both “what” and “where” to focus. As a result, the module demonstrates exceptional capability in preserving and enhancing weak features of small objects.
- (2) Split-Aggregation Strategy: MFCAM introduces a unique split-aggregation workflow, where input features are divided, processed by specialized branches (e.g., attention and multi-scale convolution), and then reaggregated. This design efficiently maximizes the representational power of features.
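The exact MFCAM layer configuration is given in the paper's figures rather than in this outline. The following PyTorch sketch only illustrates the split-aggregation workflow described above, under explicit assumptions: a half-and-half channel split, parallel 3×3/5×5 convolutions for multi-scale context, an SE-style channel gate and a CBAM-style spatial gate, then re-aggregation by concatenation. Layer sizes and gating choices are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MFCAM(nn.Module):
    """Sketch of a multi-scale feature complementary aggregation block.

    Assumptions (not from the paper): channels are split in half; one half
    passes through unchanged, the other receives multi-scale convolution
    plus channel and spatial attention before the two are re-aggregated.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0, "sketch assumes an even channel count"
        c = channels // 2
        # Multi-scale branch: parallel convolutions with different receptive fields.
        self.conv3 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.conv5 = nn.Conv2d(c, c, 5, padding=2, bias=False)
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()
        # Channel attention ("what"): squeeze-and-excitation style gating.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, max(c // 4, 4), 1), nn.SiLU(),
            nn.Conv2d(max(c // 4, 4), c, 1), nn.Sigmoid(),
        )
        # Spatial attention ("where"): conv over pooled channel statistics.
        self.sa = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # Re-aggregation of the identity split and the attended split.
        self.fuse = nn.Conv2d(2 * c, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.chunk(x, 2, dim=1)                        # split
        m = self.act(self.bn(self.conv3(b) + self.conv5(b)))   # multi-scale fusion
        m = m * self.ca(m)                                     # channel attention
        pooled = torch.cat([m.mean(1, keepdim=True),
                            m.amax(1, keepdim=True)], dim=1)
        m = m * self.sa(pooled)                                # spatial attention
        return self.fuse(torch.cat([a, m], dim=1))             # aggregate

# Shape check: MFCAM(64)(torch.randn(1, 64, 80, 80)) -> (1, 64, 80, 80)
```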
2.6. The Gated Activation Convolutional Fusion Pyramid Network
- (1) Introducing an additional detection head with a smaller receptive field while removing the original largest receptive field head to better capture small objects.
- (2) Making full use of large-scale shallow features to preserve spatial detail.
- (3) Incorporating a gated activation convolution module to dynamically control feature flow.
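A minimal sketch of the gated activation idea, in the spirit of the gated convolutions of Dauphin et al. [62] and Yu et al. [63]: a sigmoid gate branch learns a per-pixel, per-channel weight in [0, 1] that scales the feature branch, letting the network suppress background responses dynamically. Layer sizes and normalization here are illustrative assumptions, not the paper's exact GAC module.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """GLU-style gated activation convolution (illustrative sketch)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The learned gate dynamically controls how much of each fused
        # feature flows downstream, damping background interference.
        return self.norm(self.act(self.feature(x)) * torch.sigmoid(self.gate(x)))
```

Consistent with this role, the neck table below places a GAC module (layers 16, 20, 24) immediately after each Concat (layers 15, 19, 23), so fused shallow and deep features are filtered before entering the C3K2 stages.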
2.7. Categorization of Feature Fusion Methods
2.8. The Adaptive Threshold Focal Loss Function
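The published ATFL equations are not reproduced in this outline. As a placeholder, the following PyTorch sketch shows one way an adaptive-threshold focal loss can behave: it keeps the focal-style down-weighting of easy samples on top of binary cross-entropy, but tracks the easy/hard boundary from batch statistics instead of fixing it. The momentum update and the extra damping power on easy samples are illustrative assumptions, not the authors' formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ATFLSketch(nn.Module):
    """Illustrative adaptive-threshold focal loss (not the exact ATFL)."""

    def __init__(self, gamma: float = 2.0, momentum: float = 0.9):
        super().__init__()
        self.gamma = gamma
        self.momentum = momentum
        self.register_buffer("thr", torch.tensor(0.5))  # running easy/hard threshold

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # targets are float {0, 1} maps, as in standard BCE-based detection heads.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)      # prob of the true label
        with torch.no_grad():                            # adapt threshold to batch stats
            self.thr = self.momentum * self.thr + (1 - self.momentum) * p_t.mean()
        hard = (p_t < self.thr).float()
        # Hard samples keep the usual focal weight; easy ones are damped one power more.
        w = hard * (1 - p_t) ** self.gamma + (1 - hard) * (1 - p_t) ** (self.gamma + 1)
        return (w * bce).mean()
```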
3. Results
3.1. Image Datasets for Small Object Detection
3.2. Experimental Setup
3.3. Experimental Evaluation Index
3.4. Assessment of Error Types
- Classification Error: IoUmax ≥ tf for GT of the incorrect class (i.e., localized correctly but classified incorrectly).
- Localization Error: tb ≤ IoUmax ≤ tf for GT of the correct class (i.e., classified correctly but localized incorrectly).
- Both Cls and Loc Error: tb ≤ IoUmax ≤ tf for GT of the incorrect class (i.e., classified incorrectly and localized incorrectly).
- Duplicate Detection Error: IoUmax ≥ tf for GT of the correct class but another higher-scoring detection already matched that GT (i.e., would be correct if not for a higher scoring detection).
- Background Error: IoUmax ≤ tb for all GT (i.e., detected background as foreground).
- Missed GT Error: All undetected ground truth (false negatives) not already covered by classification or localization error.
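Under TIDE's default thresholds (foreground tf = 0.5, background tb = 0.1), the definitions above bin each false-positive detection as follows. This is a simplified Python sketch of the logic; the TIDE toolbox [72] additionally handles the greedy score-ordered matching that determines which detections claim which GT.

```python
def classify_detection_error(ious_same_cls, ious_other_cls, gt_already_matched,
                             tf=0.5, tb=0.1):
    """Bin one detection into a TIDE error type.

    ious_same_cls / ious_other_cls: IoUs of this detection with every GT of
    the same / a different class. gt_already_matched: True if the best
    same-class GT above tf was already claimed by a higher-scoring detection.
    """
    best_same = max(ious_same_cls, default=0.0)
    best_other = max(ious_other_cls, default=0.0)
    if best_same >= tf:
        return "duplicate" if gt_already_matched else "true_positive"
    if best_other >= tf:
        return "classification"          # localized correctly, wrong class
    if tb <= best_same <= tf:
        return "localization"            # right class, poor overlap
    if tb <= best_other <= tf:
        return "both"                    # wrong class and poor overlap
    return "background"                  # no meaningful overlap with any GT
```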
3.5. Comparative Analysis with YOLO11
3.6. Ablation Experiment
- (1) A: Use the Multi-Scale Feature Complementary Aggregation Module (MFCAM) in the backbone network.
- (2) B: Add a detection head with a smaller receptive field and remove the original head with the largest receptive field.
- (3) C: Reconstruct the multi-scale, multi-level feature fusion pathway in the neck to fully integrate the multi-level representations of large-scale feature maps.
- (4) D: Use gated activation convolutional (GAC) modules in the neck.
- (5) E: Replace the original binary cross-entropy loss with the adaptive threshold focal loss (ATFL).
3.7. Visual Comparison
3.8. Comparison with YOLO Series Algorithms
3.8.1. Comparison with YOLO Series Lightweight Models
3.8.2. Comparison with YOLO Series Large-Scale Models
3.9. Comparison with Other Object Detection Models
3.10. Comparison on the UAVDT Dataset
4. Discussion
4.1. Multi-Scale Object Coexistence and Difficult Feature Extraction Problem
4.2. Complex Background Interference and Positive and Negative Sample Imbalance Problem
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Byun, S.; Shin, I.-K.; Moon, J.; Kang, J.; Choi, S.-I. Road Traffic Monitoring from UAV Images Using Deep Learning Networks. Remote Sens. 2021, 13, 4027. [Google Scholar] [CrossRef]
- Sun, W.; Dai, L.; Zhang, X.; Chang, P.; He, X. RSOD: Real-Time Small Object Detection Algorithm in UAV-Based Traffic Monitoring. Appl. Intell. 2022, 52, 8448–8463. [Google Scholar] [CrossRef]
- Muhmad Kamarulzaman, A.M.; Wan Mohd Jaafar, W.S.; Mohd Said, M.N.; Saad, S.N.M.; Mohan, M. UAV Implementations in Urban Planning and Related Sectors of Rapidly Developing Nations: A Review and Future Perspectives for Malaysia. Remote Sens. 2023, 15, 2845. [Google Scholar] [CrossRef]
- Yu, Y.; Gu, T.; Guan, H.; Li, D.; Jin, S. Vehicle Detection From High-Resolution Remote Sensing Imagery Using Convolutional Capsule Networks. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1894–1898. [Google Scholar] [CrossRef]
- Li, Y.; Huang, Y.; Tao, Q. Improving Real-Time Object Detection in Internet-of-Things Smart City Traffic with YOLOv8-DSAF Method. Sci. Rep. 2024, 14, 17235. [Google Scholar] [CrossRef]
- An, R.; Zhang, X.; Sun, M.; Wang, G. GC-YOLOv9: Innovative Smart City Traffic Monitoring Solution. Alex. Eng. J. 2024, 106, 277–287. [Google Scholar] [CrossRef]
- Li, Z.; Zhang, Y.; Wu, H.; Suzuki, S.; Namiki, A.; Wang, W. Design and Application of a UAV Autonomous Inspection System for High-Voltage Power Transmission Lines. Remote Sens. 2023, 15, 865. [Google Scholar] [CrossRef]
- Vedanth, S.; Udit Narayana, K.B.; Harshavardhan, S.; Rao, T.; Kodipalli, A. Drone-Based Artificial Intelligence for Efficient Disaster Management: The Significance of Accurate Object Detection and Recognition. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 5–7 April 2024; pp. 1–5. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar] [CrossRef]
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
- Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020. [Google Scholar] [CrossRef]
- Li, W.; Wei, W.; Zhang, L. GSDet: Object Detection in Aerial Images Based on Scale Reasoning. IEEE Trans. Image Process. 2021, 30, 4599–4609. [Google Scholar] [CrossRef] [PubMed]
- Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient Small Object Detection on High-Resolution Images. IEEE Trans. Image Process. 2025, 34, 183–195. [Google Scholar] [CrossRef]
- Adaimi, G.; Kreiss, S.; Alahi, A. Perceiving Traffic from Aerial Images. arXiv 2020, arXiv:2009.07611. [Google Scholar] [CrossRef]
- Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle Detection From UAV Imagery with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6047–6067. [Google Scholar] [CrossRef] [PubMed]
- Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision Meets Drones: A Challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar] [CrossRef]
- Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 13435–13444. [Google Scholar] [CrossRef]
- Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021. [Google Scholar] [CrossRef]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for Small Object Detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
- Chen, C.; Zhang, Y.; Lv, Q.; Wei, S.; Wang, X.; Sun, X.; Dong, J. RRNet: A Hybrid Detector for Object Detection in Drone-Captured Images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
- Zhang, X.; Izquierdo, E.; Chandramouli, K. Dense and Small Object Detection in UAV Vision Based on Cascade Network. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 118–126. [Google Scholar] [CrossRef]
- Wang, X.; Zhu, D.; Yan, Y. Towards Efficient Detection for Small Objects via Attention-Guided Detection Network and Data Augmentation. Sensors 2022, 22, 7663. [Google Scholar] [CrossRef]
- Bosquet, B.; Cores, D.; Seidenari, L.; Brea, V.M.; Mucientes, M.; Bimbo, A.D. A Full Data Augmentation Pipeline for Small Object Detection Based on Generative Adversarial Networks. Pattern Recognit. 2023, 133, 108998. [Google Scholar] [CrossRef]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
- Liu, Z.; Gao, G.; Sun, L.; Fang, L. IPG-Net: Image Pyramid Guidance Network for Small Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 4422–4430. [Google Scholar] [CrossRef]
- Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective Fusion Factor in FPN for Tiny Object Detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 1159–1167. [Google Scholar] [CrossRef]
- Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
- Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8231–8240. [Google Scholar] [CrossRef]
- Fu, J.; Sun, X.; Wang, Z.; Fu, K. An Anchor-Free Method Based on Feature Balancing and Refinement Network for Multiscale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1331–1344. [Google Scholar] [CrossRef]
- Lu, X.; Ji, J.; Xing, Z.; Miao, Q. Attention and Feature Fusion SSD for Remote Sensing Object Detection. IEEE Trans. Instrum. Meas. 2021, 70, 1–9. [Google Scholar] [CrossRef]
- Ran, Q.; Wang, Q.; Zhao, B.; Wu, Y.; Pu, S.; Li, Z. Lightweight Oriented Object Detection Using Multiscale Context and Enhanced Channel Attention in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5786–5795. [Google Scholar] [CrossRef]
- Wu, X.; Hong, D.; Chanussot, J. UIU-Net: U-Net in U-Net for Infrared Small Object Detection. IEEE Trans. Image Process. 2023, 32, 364–376. [Google Scholar] [CrossRef]
- Du, D.; Qi, Y.; Yang, Y.; Duan, K.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; pp. 375–391. [Google Scholar] [CrossRef]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
- Yue, M.; Zhang, L.; Huang, J.; Zhang, H. Lightweight and Efficient Tiny-Object Detection Based on Improved YOLOv8n for UAV Aerial Images. Drones 2024, 8, 276. [Google Scholar] [CrossRef]
- Bai, C.; Zhang, K.; Jin, H.; Qian, P.; Zhai, R.; Lu, K. SFFEF-YOLO: Small Object Detection Network Based on Fine-Grained Feature Extraction and Fusion for Unmanned Aerial Images. Image Vision. Comput. 2025, 156, 105469. [Google Scholar] [CrossRef]
- Wang, H.; Shen, Q.; Deng, Z. A Diverse Knowledge Perception and Fusion Network for Detecting Targets and Key Parts in UAV Images. Neurocomputing 2025, 612, 128748. [Google Scholar] [CrossRef]
- Liu, J.; Wen, B.; Xiao, J.; Sun, M. Design of UAV Target Detection Network Based on Deep Feature Fusion and Optimization with Small Targets in Complex Contexts. Neurocomputing 2025, 639, 130207. [Google Scholar] [CrossRef]
- Wang, J.; Li, X.; Chen, J.; Zhou, L.; Guo, L.; He, Z.; Zhou, H.; Zhang, Z. DPH-YOLOv8: Improved YOLOv8 Based on Double Prediction Heads for the UAV Image Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
- Chen, Y.; Ye, Z.; Sun, H.; Gong, T.; Xiong, S.; Lu, X. Global–Local Fusion with Semantic Information Guidance for Accurate Small Object Detection in UAV Aerial Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
- Ying, Z.; Zhou, J.; Zhai, Y.; Quan, H.; Li, W.; Genovese, A.; Piuri, V.; Scotti, F. Large-Scale High-Altitude UAV-Based Vehicle Detection via Pyramid Dual Pooling Attention Path Aggregation Network. IEEE Trans. Intell. Transport. Syst. 2024, 25, 14426–14444. [Google Scholar] [CrossRef]
- Ding, X.; Zhang, R.; Liu, Q.; Yang, Y. Real-Time Small Object Detection Using Adaptive Weighted Fusion of Efficient Positional Features. Pattern Recognit. 2025, 167, 111717. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar] [CrossRef]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
- Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8673–8681. [Google Scholar]
- Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
- Liu, H.; Jia, C.; Shi, F.; Cheng, X.; Chen, S. SCSegamba: Lightweight Structure-Aware Vision Mamba for Crack Segmentation in Structures. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 29406–29416. [Google Scholar] [CrossRef]
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
- Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-Form Image Inpainting With Gated Convolution. In Proceedings of the 2019 IEEE/CVF International Conference On Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4470–4479. [Google Scholar] [CrossRef]
- Li, J.; Nie, Q.; Fu, W.; Lin, Y.; Tao, G.; Liu, Y.; Wang, C. LORS: Low-Rank Residual Structure for Parameter-Efficient Network Stacking. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 15866–15876. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze and Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar] [CrossRef]
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–9 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
- Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 1–4 October 2023; pp. 2184–2189. [CrossRef]
- Yang, B.; Zhang, X.; Zhang, J.; Luo, J.; Zhou, M.; Pi, Y. EFLNet: Enhancing Feature Learning Network for Infrared Small Target Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
- Bolya, D.; Foley, S.; Hays, J.; Hoffman, J. TIDE: A General Toolbox for Identifying Object Detection Errors. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; pp. 558–573. [Google Scholar] [CrossRef]
- Xu, H.; Zheng, W.; Liu, F.; Li, P.; Wang, R. Unmanned Aerial Vehicle Perspective Small Target Recognition Algorithm Based on Improved YOLOv5. Remote Sens. 2023, 15, 3583. [Google Scholar] [CrossRef]
- Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A Modified YOLOv8 Detection Network for UAV Aerial Image Recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
- Tahir, N.U.A.; Long, Z.; Zhang, Z.; Asim, M.; ELAffendi, M. PVswin-YOLOv8s: UAV-Based Pedestrian and Vehicle Detection for Traffic Management in Smart Cities Using Improved YOLOv8. Drones 2024, 8, 84. [Google Scholar] [CrossRef]
- Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
- Zhang, Z. Drone-YOLO: An Efficient Neural Network Method for Target Detection in Drone Images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, J.; Liu, S.; Xu, L.; Wang, Y. Aams-Yolo: A Small Object Detection Method for UAV Capture Scenes Based on YOLOv7. Clust. Comput. 2025, 28, 1–14. [Google Scholar] [CrossRef]
- Chen, Z.; Zhang, Y.; Xing, S. YOLO-LE: A Lightweight and Efficient UAV Aerial Image Target Detection Model. Comput. Mater. Contin. 2025, 84, 1787–1803. [Google Scholar] [CrossRef]
- Lu, Y.; Sun, M. Lightweight Multidimensional Feature Enhancement Algorithm LPS-YOLO for UAV Remote Sensing Target Detection. Sci. Rep. 2025, 15, 1340. [Google Scholar] [CrossRef]
- Wang, H.; Liu, J.; Zhao, J.; Zhang, J.; Zhao, D. Precision and Speed: LSOD-YOLO for Lightweight Small Object Detection. Expert. Syst. Appl. 2025, 269, 126440. [Google Scholar] [CrossRef]
- Zhou, L.; Zhao, S.; Liu, Z.; Zhang, W.; Qiao, B.; Liu, Y. A Lightweight Aerial Image Object Detector Based on Mask Information Enhancement. IEEE Trans. Instrum. Meas. 2025, 74, 1–17. [Google Scholar] [CrossRef]
- Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar] [CrossRef]
- Yan, H.; Kong, X.; Wang, J.; Tomiyama, H. ST-YOLO: An Enhanced Detector of Small Objects in Unmanned Aerial Vehicle Imagery. Drones 2025, 9, 338. [Google Scholar] [CrossRef]
Model | Depth | Width | Maximum Number of Channels |
---|---|---|---|
yolo11x | 1.00 | 1.50 | 512 |
yolo11l | 1.00 | 1.00 | 512 |
yolo11m | 0.50 | 1.00 | 512 |
yolo11s | 0.50 | 0.50 | 1024 |
yolo11n | 0.50 | 0.25 | 1024 |
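In this scaling table, the depth multiplier scales how many times each block is repeated, the width multiplier scales channel counts, and the maximum-channel value caps the nominal count before widening. A short sketch assuming the Ultralytics parsing convention (channels rounded up to a multiple of 8); function names here are illustrative:

```python
import math

def make_divisible(x: float, divisor: int = 8) -> int:
    """Round up to the nearest multiple of `divisor` (Ultralytics convention)."""
    return int(math.ceil(x / divisor) * divisor)

def scaled_channels(nominal: int, width: float, max_channels: int) -> int:
    """Actual channel count for a scale: cap at max_channels, then widen."""
    return make_divisible(min(nominal, max_channels) * width)

def scaled_repeats(n: int, depth: float) -> int:
    """Actual block repeats for a scale (at least one)."""
    return max(round(n * depth), 1)

# e.g., a nominal 1024-channel layer:
# yolo11n (width 0.25, cap 1024) -> scaled_channels(1024, 0.25, 1024) == 256
# yolo11l (width 1.00, cap 512)  -> scaled_channels(1024, 1.00, 512)  == 512
```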
Layer | Module | SRTSOD-YOLO-n | SRTSOD-YOLO-s | SRTSOD-YOLO-m | SRTSOD-YOLO-l |
---|---|---|---|---|---|
0 | CBS | 8 | 16 | 32 | 32 |
1 | CBS | 16 | 32 | 64 | 64 |
2 | MFCAM | 16 | 32 | 64 | 64 |
3 | CBS | 32 | 64 | 128 | 128 |
4 | MFCAM | 32 | 64 | 128 | 128 |
5 | CBS | 64 | 128 | 256 | 256 |
6 | MFCAM | 64 | 128 | 256 | 256 |
7 | CBS | 128 | 256 | 512 | 512 |
8 | MFCAM | 128 | 256 | 512 | 512 |
9 | SPPF | 128 | 256 | 512 | 512 |
10 | C2PSA | 128 | 256 | 512 | 512 |
Layer | Module | SRTSOD-YOLO-n | SRTSOD-YOLO-s | SRTSOD-YOLO-m | SRTSOD-YOLO-l |
---|---|---|---|---|---|
11 | CBS | 16 | 32 | 64 | 64 |
12 | CBS | 16 | 32 | 64 | 64 |
13 | CBS | 16 | 16 | 32 | 32 |
14 | Upsample | 128 | 256 | 512 | 512 |
15 | Concat | 208 | 416 | 832 | 832 |
16 | GAC | 208 | 416 | 832 | 832 |
17 | C3K2 | 32/n = 1 | 64/n = 1 | 128/n = 2 | 128/n = 4 |
18 | Upsample | 32 | 64 | 128 | 128 |
19 | Concat | 80 | 160 | 320 | 320 |
20 | GAC | 80 | 160 | 320 | 320 |
21 | C3K2 | 32/n = 1 | 64/n = 1 | 128/n = 2 | 128/n = 4 |
22 | Upsample | 32 | 64 | 128 | 128 |
23 | Concat | 64 | 112 | 224 | 224 |
24 | GAC | 64 | 112 | 224 | 224 |
25 | C3K2 | 16/n = 1 | 32/n = 1 | 64/n = 2 | 64/n = 4 |
26 | CBS | 16 | 32 | 64 | 64 |
27 | Concat | 48 | 96 | 192 | 192 |
28 | C3K2 | 32/n = 1 | 64/n = 1 | 128/n = 2 | 128/n = 4 |
29 | CBS | 32 | 64 | 128 | 128 |
30 | Concat | 64 | 128 | 256 | 256 |
31 | C3K2 | 64/n = 1 | 128/n = 1 | 256/n = 2 | 256/n = 4 |
Parameters | Setup |
---|---|
Epochs | 300 |
Batch size | 16 |
Initial learning rate | 0.01 |
Final learning rate | 0.0001 |
Optimizer | SGD |
Momentum | 0.9 |
Random seed | 42 |
Input image size | 640 × 640 |
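These settings map one-to-one onto an Ultralytics-style training call. A minimal sketch assuming the ultralytics Python API, with hypothetical model and dataset YAML names; note that lrf in this API is the final learning rate expressed as a fraction of lr0 (0.01 × 0.01 = 1 × 10⁻⁴, matching the table):

```python
from ultralytics import YOLO

# Hypothetical YAML names; hyperparameters mirror the experimental setup table.
model = YOLO("srtsod-yolo-s.yaml")
model.train(
    data="VisDrone.yaml",
    epochs=300,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,       # initial learning rate
    lrf=0.01,       # final LR factor: 0.01 * 0.01 = 1e-4
    momentum=0.9,
    seed=42,
)
```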
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
---|---|---|---|---|---|
YOLO11n | 33.2 ± 0.2 | 20.6 ± 0.1 | 2.6 | 6.5 | 164 |
SRTSOD-YOLO-n | 36.3 ± 0.1 | 21.8 ± 0.1 | 3.5 | 7.4 | 147 |
YOLO11s | 40.6 ± 0.2 | 24.5 ± 0.2 | 9.4 | 21.6 | 153 |
SRTSOD-YOLO-s | 44.4 ± 0.2 | 27.0 ± 0.2 | 11.1 | 24.2 | 138 |
YOLO11m | 43.5 ± 0.2 | 26.3 ± 0.2 | 20.1 | 68.2 | 135 |
SRTSOD-YOLO-m | 49.6 ± 0.3 | 30.4 ± 0.2 | 22.2 | 72.7 | 124 |
YOLO11l | 45.9 ± 0.3 | 28.2 ± 0.2 | 25.3 | 87.3 | 111 |
SRTSOD-YOLO-l | 53.8 ± 0.3 | 33.8 ± 0.2 | 27.6 | 94.7 | 99 |
Model | Ecls | Eloc | Eboth | Edup | Ebkg | Emissed |
---|---|---|---|---|---|---|
YOLO11s | 15.30 | 4.32 | 0.52 | 0.18 | 2.35 | 14.46 |
SRTSOD-YOLO-s | 15.06 | 4.11 | 0.50 | 0.15 | 2.26 | 14.27 |
YOLO11l | 14.59 | 4.19 | 0.53 | 0.12 | 2.55 | 15.04 |
SRTSOD-YOLO-l | 14.09 | 3.91 | 0.49 | 0.03 | 2.13 | 13.96 |
Network | mAP50 (%) | mAP50-95 (%) |
---|---|---|
YOLO11n | 32.3 | 20.2 |
SRTSOD-YOLO-n | 33.5 | 20.8 |
YOLO11s | 34.6 | 21.4 |
SRTSOD-YOLO-s | 38.4 | 23.6 |
YOLO11m | 39.8 | 24.2 |
SRTSOD-YOLO-m | 44.7 | 27.3 |
YOLO11l | 43.9 | 26.5 |
SRTSOD-YOLO-l | 47.2 | 28.7 |
Network | A | B | C | D | E | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|
YOLO11n | | | | | | 33.2 | 20.6 | 2.6 | 6.5 | 164 |
SRTSOD-YOLO-n | √ * | | | | | 33.9 | 20.7 | 2.7 | 6.7 | 161 |
 | | √ | | | | 34.4 | 21.1 | 2.8 | 6.7 | 160 |
 | √ | √ | | | | 35.1 | 21.2 | 2.9 | 6.9 | 158 |
 | √ | √ | √ | | | 35.6 | 21.5 | 3.3 | 7.3 | 150 |
 | √ | √ | √ | √ | | 36.0 | 21.7 | 3.5 | 7.4 | 147 |
 | √ | √ | √ | √ | √ | 36.3 | 21.8 | 3.5 | 7.4 | 147 |
β | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
---|---|---|---|---|
0.1 | 35.7 | 21.4 | 3.5 | 7.4 |
0.25 | 36.3 | 21.8 | 3.5 | 7.4 |
0.5 | 35.9 | 21.6 | 3.5 | 7.4 |
0.75 | 35.4 | 21.3 | 3.5 | 7.4 |
0.9 | 35.1 | 21.1 | 3.5 | 7.4 |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
---|---|---|---|---|
YOLO11n | 33.2 | 20.6 | 2.6 | 6.5 |
YOLO11n with CBAM [67] | 33.3 | 20.6 | 2.6 | 6.5 |
YOLO11n with EMA [68] | 33.4 | 20.6 | 2.6 | 6.5 |
YOLO11n with MFCAM | 33.9 | 20.7 | 2.7 | 6.7 |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
---|---|---|---|---|
YOLO11n | 33.2 | 20.6 | 2.6 | 6.5 |
YOLO11n with BiFPN [69] | 34.1 | 21.0 | 3.1 | 6.8 |
YOLO11n with AFPN [70] | 33.7 | 20.8 | 3.5 | 7.4 |
YOLO11n with GAC-FPN | 35.3 | 21.6 | 3.4 | 7.2 |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
---|---|---|---|---|---|
YOLOv3-tiny | 23.4 | 13.0 | 12.1 | 18.9 | 141 |
YOLOv5s | 37.7 | 22.3 | 9.1 | 23.8 | 146 |
YOLOv6s | 36.3 | 21.4 | 16.3 | 44.0 | 118 |
YOLOv7-tiny | 32.9 | 16.8 | 6.0 | 13.3 | 161 |
YOLOv8s | 39.0 | 23.3 | 11.6 | 28.7 | 135 |
YOLOv10s | 38.6 | 23.1 | 7.4 | 21.4 | 145 |
SRTSOD-YOLO-s | 44.4 * | 27.0 | 11.1 | 24.2 | 138 |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
---|---|---|---|---|---|
YOLOv3 | 44.0 | 26.9 | 103.7 | 282.3 | 71 |
YOLOv5l | 43.0 | 26.2 | 53.2 | 134.7 | 96 |
YOLOv6l | 40.7 | 24.8 | 110.9 | 391.2 | 68 |
YOLOv7 | 46.2 | 25.9 | 37.2 | 105.3 | 103 |
YOLOv8l | 43.8 | 26.9 | 43.6 | 164.9 | 92 |
YOLOv9e | 46.6 | 28.9 | 57.4 | 189.2 | 86 |
YOLOv10l | 43.5 | 26.8 | 24.9 | 120.0 | 111 |
SRTSOD-YOLO-l | 53.8 * | 33.8 | 27.6 | 94.7 | 99 |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS | Modules Used |
---|---|---|---|---|---|---|
LE-YOLO [43] | 39.3 | 22.7 | 2.1 | 13.1 | - | LHGNet backbone + LGS bottleneck + LGSCSP fusion module |
YOLOv5-pp [73] | 41.7 | - | 10.5 | - | - | CA attention module + Meta-ACON activation function + SPD Conv module |
Modified YOLOv8 [74] | 42.2 | - | 9.66 | - | 167 | Ghostblock structure used by backbone + Bi-PAN-FPN |
PVswin-YOLOv8 [75] | 43.3 | - | 21.6 | - | - | Improved backbone + CBAM |
UAV-YOLOv8 [76] | 47.0 | 29.2 | 10.3 | - | 51 | FFNB + BiFormer |
Drone-YOLO [77] | 51.3 | 31.9 | 76.2 | - | - | sandwich-fusion module + RepVGG module |
Aams-yolo [78] | 47.2 | 29.1 | 59.2 | 171.7 | 20 | feature fusion + Dy-head + label assignment strategy |
SFFEF-YOLO [44] | 50.1 | 31.0 | - | - | - | FIEM + MFFM |
YOLO-LE [79] | 39.9 | 22.5 | 4.0 | 8.5 | - | C2f-Dy + LDown + AMFF + LEHead |
LPS-YOLO(large) [80] | 53.2 | 34.3 | 44.1 | - | 44 | SKAPP + LSKA + OFTP + E-BIFPN |
LSOD-YOLO [81] | 37.0 | - | 3.8 | 33.9 | 93 | LCOR + SPPFL + C2f-N + Dysample |
BFDet [82] | 51.4 | 29.5 | 5.6 | 25.6 | 33 | BFDet + BCA Layer + EFEM + DM + PSPP + MIEM |
Faster RCNN | 37.2 | 21.9 | 41.2 | 292.8 | - | - |
Cascade RCNN | 39.1 | 24.3 | 68.9 | 320.7 | - | - |
RetinaNet | 19.1 | 10.6 | 35.7 | 299.5 | - | - |
CenterNet | 33.7 | 18.8 | 70.8 | 137.2 | - | - |
MFFSODNet [83] | 45.5 | - | 4.5 | - | 70 | MFFSODNet + MSFEM + BDFPN |
SRTSOD-YOLO-n | 36.3 | 21.8 | 3.5 | 7.4 | 147 | MFCAM + GACFPN + ATFL |
SRTSOD-YOLO-s | 44.4 | 27.0 | 11.1 | 24.2 | 138 | MFCAM + GACFPN + ATFL |
SRTSOD-YOLO-m | 49.6 | 30.4 | 22.2 | 72.7 | 124 | MFCAM + GACFPN + ATFL |
SRTSOD-YOLO-l | 53.8 * | 33.8 | 27.6 | 94.7 | 99 | MFCAM + GACFPN + ATFL |
Network | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
---|---|---|---|---|
Aams-yolo [78] | 43.1 | 29.9 | 59.2 | 171.7 |
SFFEF-YOLO [44] | 44.1 | 29.1 | - | - |
ST-YOLO [84] | 33.4 | - | 9.0 | 20.1 |
LSOD-YOLO [81] | 37.1 | 22.1 | - | - |
BFDet [82] | 46.0 | 26.3 | - | - |
Faster RCNN | 36.5 | 21.4 | 41.1 | 292.3 |
Cascade RCNN | 38.7 | 23.9 | 68.8 | 320.5 |
RetinaNet | 18.8 | 10.4 | 35.7 | 299.5 |
CenterNet | 32.9 | 18.2 | 70.8 | 137.2 |
YOLOv7 | 41.9 | 25.4 | 36.5 | 105.3 |
SRTSOD-YOLO-n | 33.5 | 20.8 | 3.5 | 7.4 |
SRTSOD-YOLO-s | 38.4 | 23.6 | 11.1 | 24.2 |
SRTSOD-YOLO-m | 44.7 | 27.3 | 22.2 | 72.7 |
SRTSOD-YOLO-l | 47.2 * | 28.7 | 27.6 | 94.7 |