DRPU-YOLO11: A Multi-Scale Model for Detecting Rice Panicles in UAV Images with Complex Infield Background
Abstract
1. Introduction
- Enhanced multi-scale feature extraction and occlusion suppression: A task-oriented multi-scale feature extraction module, CSP-PGMA, is introduced to address the scale variation and partial occlusion commonly observed in UAV-based rice panicle imagery. By extracting features progressively under increasingly large receptive fields, the module strengthens the representation of targets across scales and substantially improves the detection of multi-scale and partially occluded panicles (a minimal sketch of this idea follows the list).
- Suppression of background interference and enhanced small-object detection: The Small Object and Environment Context Feature Pyramid Network (SOCFPN) redesigns the neck of the original architecture by integrating dynamic upsampling (DySample), information-guided downsampling (CGDown), and the cross-scale feature fusion module CSP-ONMK (a simplified upsampling sketch also follows the list). This task-driven design enables precise small-object detection through cross-scale feature interaction, avoids the computational redundancy of adding extra detection layers, and preserves small-object detail in the P2 layer while suppressing background interference.
- Optimization of prediction box quality and loss weighting: The PowerTAL strategy adapts quality-aware label assignment to rice panicle detection by differentiating the contributions of predictions with varying localization quality. Through a power-based transformation, higher-quality prediction boxes receive greater weight during training, allowing the model to focus on reliable predictions in occluded and cluttered field environments (sketched below).
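The CSP-PGMA module itself is specified in Section 2.2.2; the PyTorch sketch below only illustrates the general pattern of progressive extraction under growing receptive fields, with each intermediate scale retained and fused CSP-style. The class name, kernel sizes, and fusion scheme are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProgressiveMultiScaleBlock(nn.Module):
    """Illustrative only: progressive extraction under growing receptive
    fields, every intermediate scale kept and fused. Names and kernel
    sizes are assumptions, not the paper's CSP-PGMA implementation."""

    def __init__(self, channels: int, kernels=(3, 5, 7)):
        super().__init__()
        # depthwise convs with increasing kernel size -> growing receptive field
        self.stages = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        )
        # 1x1 conv fuses the input plus all intermediate scales
        self.fuse = nn.Conv2d(channels * (len(kernels) + 1), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        y = x
        for stage in self.stages:
            y = stage(y)          # each stage sees a larger context
            feats.append(y)
        return self.fuse(torch.cat(feats, dim=1))


# quick shape check
out = ProgressiveMultiScaleBlock(64)(torch.randn(1, 64, 80, 80))
assert out.shape == (1, 64, 80, 80)
```

Because each stage consumes the previous stage's output, the effective receptive field compounds, yet every intermediate scale still reaches the fusion layer, which is what lets a single block cover both small and large panicles.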
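SOCFPN's upsampler, DySample (Liu et al., 2023), learns where to sample rather than interpolating at fixed positions. The following is a simplified stand-in, not the official implementation: it predicts per-pixel offsets, pixel-shuffles them to the target resolution, and resamples the feature map with grid_sample. The offset damping factor and the normalization are approximations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Simplified stand-in for DySample: predict per-pixel sampling
    offsets and resample with grid_sample. Not the official code."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # 2 offset values (x, y) for each of the scale*scale sub-positions
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)   # zero init -> starts as plain
        nn.init.zeros_(self.offset.bias)     # bilinear-style upsampling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        off = self.offset(x) * 0.25                  # (b, 2*s*s, h, w), damped
        off = F.pixel_shuffle(off, s)                # (b, 2, h*s, w*s)
        # base grid: each output pixel maps back to its source location
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), 0).unsqueeze(0).expand(b, -1, -1, -1)
        # convert pixel offsets to normalized [-1, 1] units (approximate)
        norm = torch.tensor([w, h], device=x.device, dtype=x.dtype).view(1, 2, 1, 1)
        grid = (grid + 2.0 * off / norm).permute(0, 2, 3, 1)
        return F.grid_sample(x, grid, align_corners=True)
```

With zero-initialized offsets, the module reduces to ordinary bilinear-style upsampling and only deviates from fixed sampling as training demands it.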
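For PowerTAL, the paper's exact formulation appears in Section 2.2.4. Assuming it builds on the standard task-aligned metric t = s^α · u^β from TAL, one plausible reading of the bullet above is a power transform applied to normalized quality scores, as sketched here; the function name, the α/β defaults, and the normalization step are guesses, and power = 0.3 mirrors the best setting found in Section 3.1.

```python
import torch

def power_tal_weights(cls_scores: torch.Tensor,
                      ious: torch.Tensor,
                      alpha: float = 1.0,
                      beta: float = 6.0,
                      power: float = 0.3) -> torch.Tensor:
    """Sketch of power-transformed task-aligned weighting for the
    candidate anchors of one ground-truth box. Assumed, not the
    paper's exact formulation."""
    t = cls_scores.pow(alpha) * ious.pow(beta)   # task-aligned quality
    t = t / (t.max() + 1e-9)                     # normalize to [0, 1]
    return t.pow(power)                          # power-based reweighting
```

In a full assigner, these weights would scale each positive anchor's loss contribution in proportion to its quality, so reliable predictions shape the gradient more than poorly localized ones.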
2. Materials and Methods
2.1. Dataset
2.1.1. Data Acquisition
2.1.2. Data Processing
2.2. DRPU-YOLO11 Model
2.2.1. Architecture
2.2.2. CSP-PGMA
2.2.3. SOCFPN
2.2.4. PowerTAL
2.3. Experimental Settings
2.3.1. Training Settings
2.3.2. Evaluation Metrics
3. Experimental Results and Analysis
3.1. Hyperparameter Sensitivity Analysis of PowerTAL
3.2. Ablation Study
3.3. Comparison of Different Backbone Networks
3.4. Comparison of Different Detection Models
4. Discussion
4.1. Advantages
4.2. Challenges and Limitations
4.3. Future Perspectives
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Table: Training hyperparameters.

| Hyperparameter | Value |
|---|---|
| Epochs | 200 |
| Batch size | 8 |
| Initial learning rate | 0.001 |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Input image size | 640 × 640 |
| Optimizer | SGD |
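For reference, these settings map directly onto the Ultralytics training API; the snippet below is a plausible reproduction of the configuration, with the dataset YAML and weights file as placeholders (the DRPU-YOLO11 model definition itself is not reproduced here).

```python
from ultralytics import YOLO

# Hyperparameters mirror the table above; file names are placeholders.
model = YOLO("yolo11n.pt")
model.train(
    data="rice_panicle.yaml",   # hypothetical dataset config
    epochs=200,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=0.001,
    momentum=0.937,
    weight_decay=0.0005,
)
```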
Table: Sensitivity of detection performance to the PowerTAL power exponent.

| Power Exponent | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) |
|---|---|---|---|---|---|---|
| 0.2 | 81.0 | 78.4 | 79.7 | 84.6 | 50.2 | 3.21 |
| 0.3 | 82.5 | 78.4 | 80.4 | 85.7 | 51.4 | 3.21 |
| 0.4 | 80.9 | 78.6 | 79.7 | 84.5 | 50.7 | 3.21 |
| 0.5 | 82.0 | 78.3 | 80.1 | 85.3 | 51.1 | 3.21 |
| 0.6 | 81.4 | 78.1 | 79.7 | 84.7 | 50.9 | 3.21 |
Table: Ablation study of CSP-PGMA, SOCFPN, and PowerTAL on the YOLO11 baseline (√ = module enabled).

| CSP-PGMA | SOCFPN | PowerTAL | P (%) | R (%) | F1-Score (%) | mAP50 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|---|---|
| | | | 80.6 | 76.4 | 78.4 | 83.3 | 2.58 | 6.3 |
| √ | | | 81.2 | 76.7 | 78.9 | 84.6 | 2.63 | 7.6 |
| | √ | | 81.4 | 77.2 | 79.2 | 84.2 | 3.17 | 13.3 |
| | | √ | 81.0 | 77.2 | 79.1 | 83.8 | 2.58 | 6.3 |
| √ | √ | | 81.2 | 77.6 | 79.4 | 84.7 | 3.22 | 14.6 |
| √ | | √ | 82.2 | 78.1 | 80.1 | 85.0 | 2.63 | 7.6 |
| | √ | √ | 81.8 | 76.6 | 79.2 | 84.1 | 3.17 | 13.3 |
| √ | √ | √ | 82.5 | 78.4 | 80.4 | 85.7 | 3.22 | 14.6 |
Table: Comparison of different backbone networks.

| Methods | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11 | 80.6 | 76.4 | 78.4 | 83.3 | 50.0 | 2.58 | 6.3 |
| EfficientViT | 78.6 | 76.6 | 77.6 | 83.9 | 51.2 | 3.74 | 7.9 |
| FasterNet | 79.5 | 77.1 | 78.3 | 83.5 | 50.4 | 3.90 | 9.2 |
| RepViT | 81.0 | 75.9 | 78.4 | 83.5 | 50.7 | 6.43 | 17.0 |
| ConvNeXtV2 | 80.4 | 75.2 | 77.7 | 83.3 | 49.8 | 5.39 | 12.5 |
| Ours | 81.2 | 76.7 | 78.9 | 84.6 | 51.3 | 2.63 | 7.6 |
Table: Comparison of different detection models.

| Methods | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|---|
| RT-DETR | 79.9 | 72.6 | 76.1 | 80.7 | 46.9 | 19.87 | 56.9 |
| YOLOv8 | 80.1 | 74.9 | 77.4 | 82.5 | 49.8 | 3.01 | 8.1 |
| YOLOv10 | 79.1 | 73.4 | 76.1 | 81.1 | 49.5 | 2.27 | 8.2 |
| YOLO11 | 80.6 | 76.4 | 78.4 | 83.3 | 50.0 | 2.58 | 6.3 |
| Panicle-AI | 81.6 | 77.3 | 79.4 | 83.9 | 47.9 | 8.54 | 28.5 |
| Ours | 82.5 | 78.4 | 80.4 | 85.7 | 52.2 | 3.21 | 14.6 |
Table: Qualitative comparison of the evaluated detection models.

| Models | Backbone Characteristics | Multi-Scale Feature Strategies | Label Assignments | Key Limitations |
|---|---|---|---|---|
| RT-DETR | Transformer-based | Global attention | Hungarian matching | High computation cost |
| YOLOv8 | CSP-based CNN | PAN/FPN | SimOTA | Sensitive to background clutter |
| YOLOv10 | Efficiency-oriented CNN | Optimized YOLO neck | End-to-end aligned | Lacks explicit P2-guided fusion |
| YOLO11 | C3k2-based CNN | PAN/FPN | TAL | Limited robustness under occlusion |
| Panicle-AI | Panicle-C3 | Task-specific FPN | IoU-based | Larger model size |
| Ours | CSP-PGMA | SOCFPN | PowerTAL | Edge false positives |

