MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds
Abstract
1. Introduction
- Enhanced multi-scale feature extraction: A lightweight Faster-RPE Block is formed by integrating the FasterNet block with the Efficient Multi-scale Attention (EMA) mechanism to reconstruct the BasicBlock, and by replacing the 2D convolutions in the FasterNet block with re-parameterized convolutions (RepConv). Through the synergy of global dependency modeling and local detail capture, the block extracts multi-scale tassel features more effectively while remaining lightweight and preserving detection accuracy (a minimal sketch of the underlying partial convolution follows this list).
- Background interference suppression: The Dynamic Cross-Scale Feature Fusion Module (Dy-CCFM) was designed by integrating a Dynamic Scale Sequence Fusion framework (DyScalSeq) with a Dynamic Upsampling (DySample) component. Cross-scale feature interaction lets the module detect multi-scale objects accurately while suppressing background interference; in addition, a multi-scale adaptive kernel selection strategy enables the DySample component to retain global context during upsampling (see the second sketch after this list).
- Enhanced cross-channel information extraction: Based on a multi-branch topology and re-parameterization principles, the MPCC3 module was developed to replace the RepC3 module in the hybrid encoder. This strengthens cross-channel information extraction and network stability without increasing FLOPs, so the model copes better with intra-class occlusion (the re-parameterization trick is illustrated in the third sketch after this list).
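To make the first contribution concrete, here is a minimal PyTorch sketch of the partial convolution (PConv) idea from FasterNet [26] that the Faster-RPE Block builds on: only a fraction of the channels is convolved and the rest pass through untouched, which is what keeps the block lightweight. The class and parameter names are illustrative, and a plain 3×3 convolution stands in for the RepConv that MSMT-RTDETR actually uses.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """FasterNet-style partial convolution: convolve the first 1/n_div of
    the channels and pass the remaining channels through unchanged."""
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div          # channels that get convolved
        self.dim_keep = dim - self.dim_conv   # channels passed through as-is
        # MSMT-RTDETR replaces this 3x3 conv with a re-parameterized RepConv.
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

x = torch.randn(1, 64, 80, 80)
print(PartialConv(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```

With n_div = 4, the 3×3 convolution touches only a quarter of the channels, cutting its FLOPs roughly 16-fold relative to a full convolution over the same tensor, which is consistent with the parameter and GFLOPs reductions the ablation study attributes to the Faster-RPE Block.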
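The DySample component replaces fixed bilinear upsampling with learned point sampling. The simplified sketch below conveys the idea: a 1×1 convolution predicts per-pixel sampling offsets that perturb a regular grid, and the upsampled map is gathered with grid_sample. The class name is ours, and the grouping and offset-scaling details of the published DySample [29] are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    """Content-aware 2x upsampling in the spirit of DySample: predicted
    offsets shift a regular sampling grid, and the output is gathered
    from the input feature map with grid_sample."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # one (x, y) offset pair per output sub-pixel position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)  # start as plain bilinear upsampling
        nn.init.zeros_(self.offset.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        # rearrange offsets into one (x, y) pair per output pixel
        off = F.pixel_shuffle(self.offset(x), s)          # (b, 2, h*s, w*s)
        # regular sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # convert pixel-unit offsets to normalized units and shift the grid
        off = off.permute(0, 2, 3, 1) * torch.tensor([2.0 / w, 2.0 / h], device=x.device)
        return F.grid_sample(x, grid + off, mode="bilinear", align_corners=True)

up = DySampleLite(64)
print(up(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 80, 80])
```

Because the offsets are predicted from the feature content, the sampling positions can bend toward tassel structures instead of blending them with their surroundings, which is the behavior Dy-CCFM exploits to suppress background interference.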
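Finally, the structural re-parameterization behind RepConv and MPCC3 rests on a simple identity: parallel convolution branches applied to the same input can be merged algebraically after training. The toy sketch below fuses a 3×3 + 1×1 pair into a single 3×3 convolution, in the style of RepVGG [28]; it illustrates the principle only — it is not the authors' MPCC3 code, and it omits the BatchNorm folding a full implementation would include.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBranchPair(nn.Module):
    """Train-time 3x3 + 1x1 conv branches that fuse into one 3x3 conv for
    inference: extra capacity during training, zero extra inference FLOPs."""
    def __init__(self, c: int):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)
        self.conv1 = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv3(x) + self.conv1(x)

    @torch.no_grad()
    def fuse(self) -> nn.Conv2d:
        c = self.conv3.in_channels
        fused = nn.Conv2d(c, c, 3, padding=1)
        # a 1x1 kernel zero-padded to 3x3 behaves identically at padding=1,
        # so the two kernels (and biases) simply add
        fused.weight.copy_(self.conv3.weight + F.pad(self.conv1.weight, [1, 1, 1, 1]))
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        return fused

block = RepBranchPair(16).eval()
x = torch.randn(1, 16, 32, 32)
print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))  # True
```

This is why the ablation results can show MPCC3 barely moving parameters and GFLOPs: the extra branches exist only at training time.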
2. Materials and Methods
2.1. Dataset
2.2. Data Augmentation
2.3. MSMT-RTDETR Model
2.3.1. Architecture
2.3.2. Faster-RPE Block
2.3.3. Dy-CCFM
2.3.4. Design of MPCC3
2.4. Experimental Settings
2.4.1. Evaluation Metrics
2.4.2. Training Settings
3. Experimental Results and Discussion
3.1. Ablation Experiment
3.1.1. The Number of MPConv
3.1.2. Ablation Study
3.2. Comparison Experiments
3.2.1. Comparison of Convolutional Modules
3.2.2. Comparison of Backbone Network
3.2.3. Comparison of Different Detection Models
3.3. Visualization
4. Discussion
4.1. Advantages
4.2. Challenges and Limitations
4.3. Future Perspectives
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, Y.; Bao, J.; Wei, X.; Wu, S.; Fang, C.; Li, Z.; Qi, Y.; Gao, Y.; Dong, Z.; Wan, X. Genetic structure and molecular mechanisms underlying the formation of tassel, anther, and pollen in the male inflorescence of maize (Zea mays L.). Cells 2022, 11, 1753.
- Andorf, C.; Beavis, W.D.; Hufford, M.; Smith, S.; Suza, W.P.; Wang, K.; Woodhouse, M.; Yu, J.; Lübberstedt, T. Technological advances in maize breeding: Past, present and future. Theor. Appl. Genet. 2019, 132, 817–849.
- Guo, Y.; Xiao, Y.; Hao, F.; Zhang, X.; Chen, J.; De Beurs, K.; He, Y.; Fu, Y.H. Comparison of different machine learning algorithms for predicting maize grain yield using UAV-based hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103528.
- Huang, Y.; Qian, Y.; Wei, H.; Lu, Y.; Ling, B.; Qin, Y. A survey of deep learning-based object detection methods in crop counting. Comput. Electron. Agric. 2023, 215, 108425.
- Wu, W.; Zhang, J.; Zhou, G.; Zhang, Y.; Wang, J.; Hu, L. ESG-YOLO: A method for detecting male tassels and assessing density of maize in the field. Agronomy 2024, 14, 241.
- Sanaeifar, A.; Guindo, M.L.; Bakhshipour, A.; Fazayeli, H.; Li, X.; Yang, C. Advancing precision agriculture: The potential of deep learning for cereal plant head detection. Comput. Electron. Agric. 2023, 209, 107875.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. In Proceedings of the 38th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024.
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024.
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
- Liu, Y.; Cen, C.; Che, Y.; Ke, R.; Ma, Y.; Ma, Y. Detection of maize tassels from UAV RGB imagery with Faster R-CNN. Remote Sens. 2020, 12, 338.
- Ferro, M.V.; Sørensen, C.G.; Catania, P. Comparison of different computer vision methods for vineyard canopy detection using UAV multispectral images. Comput. Electron. Agric. 2024, 225, 109277.
- Jia, Y.; Fu, K.; Lan, H.; Wang, X.; Su, Z. Maize tassel detection with CA-YOLO for UAV images in complex field environments. Comput. Electron. Agric. 2024, 217, 108562.
- Niu, S.; Nie, Z.; Li, G.; Zhu, W. Multi-altitude corn tassel detection and counting based on UAV RGB imagery and deep learning. Drones 2024, 8, 198.
- Du, J.; Li, J.; Fan, J.; Gu, S.; Guo, X.; Zhao, C. Detection and identification of tassel states at different maize tasseling stages using UAV imagery and deep learning. Plant Phenomics 2024, 6, 0188.
- Zeng, F.; Ding, Z.; Song, Q.; Qiu, G.; Liu, Y.; Yue, X. MT-Det: A novel fast object detector of maize tassel from high-resolution imagery using single level feature. Comput. Electron. Agric. 2023, 214, 108305.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using shifted windows. arXiv 2021, arXiv:2103.14030v2.
- Zhou, Q.; Huang, Z.; Zheng, S.; Jiao, L.; Wang, L.; Wang, R. A wheat spike detection method based on Transformer. Front. Plant Sci. 2022, 13, 1023924.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.
- Zhang, X.; Zhu, D.; Wen, R. SwinT-YOLO: Detection of densely distributed maize tassels in remote sensing images. Comput. Electron. Agric. 2023, 210, 107905.
- Ye, J.; Yu, Z. Fusing global and local information network for tassel detection in UAV imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4100–4108.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014.
- Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. arXiv 2023, arXiv:2303.03667v3.
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the 48th IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–9 June 2023.
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 20–25 June 2021.
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2016, arXiv:1611.07450.
- Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 18th IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018.
- Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.
- Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198.
- Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
- Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet size and speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.
- Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. arXiv 2023, arXiv:2307.09283.
- Yu, Z.; Ye, J.; Li, C.; Zhou, H.; Li, X. TasselLFANet: A novel lightweight multi-branch feature aggregation neural network for high-throughput image-based maize tassels detection and counting. Front. Plant Sci. 2023, 14, 1158940.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021.
- Han, D.; Zhao, N.; Shi, P. A new fault diagnosis method based on deep belief network and support vector machine with Teager–Kaiser energy operator for bearings. Adv. Mech. Eng. 2017, 9, 121–131.
- Lu, D.; Wang, Y. MAR-YOLOv9: A multi-dataset object detection method for agricultural fields based on YOLOv9. PLoS ONE 2024, 19, e0307643.
- Shuvo, M.M.H.; Islam, S.K.; Cheng, J.; Morshed, B.I. Efficient acceleration of deep learning inference on resource-constrained edge devices: A review. Proc. IEEE 2023, 111, 42–91.
Dataset | Subset | No. Images a | No. Bounding Boxes a | Max. w/h b | Min. w/h b | Avg. w/h b |
---|---|---|---|---|---|---|
MTDC-UAV | Train & val | 500 | 28,531 | 257 | 7 | 70.13 |
MTDC-UAV | Test | 300 | 21,460 | 272 | 2 | 68.66 |
MTDC-UAV | Total | 800 | 49,991 | 272 | 2 | 69.50 |
After data augmentation | Train | 1820 | 110,666 | 828 | 2 | 73.75 |
After data augmentation | Validation | 300 | 8199 | 807 | 1 | 79.57 |
After data augmentation | Test | 300 | 21,460 | 272 | 2 | 68.66 |
After data augmentation | Total | 2420 | 140,325 | 828 | 1 | 73.31 |
Parameter | Setting |
---|---|
Epochs | 100 |
Batch size | 8 |
Learning rate | 1 × 10⁻⁴ |
Image size | 640 × 640 |
Optimizer | AdamW |
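For orientation, the table above maps onto a training call roughly as follows, assuming an Ultralytics-style RT-DETR interface; the weights file and dataset YAML names are placeholders, and the authors' actual training script may differ.

```python
from ultralytics import RTDETR

# Placeholder names: the MSMT-RTDETR architecture and the MTDC-UAV dataset
# config are not shipped with Ultralytics.
model = RTDETR("rtdetr-l.pt")
model.train(
    data="mtdc-uav.yaml",   # hypothetical dataset config
    epochs=100,
    batch=8,
    imgsz=640,
    lr0=1e-4,
    optimizer="AdamW",
)
```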
Number of MPConv | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|
1 | 83.8 | 82.4 | 83.1 | 85.6 | 43.1 | 18.69 | 51.0 |
2 | 84.0 | 82.6 | 83.3 | 86.0 | 43.4 | 19.28 | 54.0 |
3 | 84.1 | 83.8 | 83.9 | 86.2 | 43.9 | 19.87 | 56.9 |
4 | 83.6 | 82.8 | 83.2 | 85.8 | 43.6 | 20.46 | 59.9 |
5 | 84.0 | 83.6 | 83.8 | 86.2 | 43.8 | 21.05 | 62.8 |
6 | 83.9 | 83.8 | 83.8 | 86.1 | 43.6 | 21.64 | 65.7 |
Methods | Faster-RPE Block | Dy-CCFM | MPCC3 | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|---|---|---|
Baseline | | | | 82.9 | 83.8 | 83.4 | 85.8 | 43.4 | 20.00 | 60.0 |
1 | ✓ | | | 83.2 | 84.7 | 83.9 | 86.6 | 44.2 | 16.89 | 51.4 |
2 | | ✓ | | 83.7 | 84.7 | 84.2 | 86.8 | 43.9 | 20.21 | 61.5 |
3 | | | ✓ | 83.7 | 83.6 | 83.7 | 86.2 | 43.9 | 19.87 | 56.9 |
4 | ✓ | ✓ | | 83.2 | 84.7 | 83.9 | 86.8 | 44.9 | 17.23 | 56.0 |
5 | ✓ | | ✓ | 82.8 | 84.4 | 83.6 | 86.2 | 44.0 | 16.89 | 51.4 |
6 | | ✓ | ✓ | 83.5 | 83.7 | 83.6 | 86.2 | 43.6 | 20.21 | 61.5 |
Ours | ✓ | ✓ | ✓ | 84.2 | 84.7 | 84.4 | 87.2 | 45.2 | 17.23 | 56.0 |
Methods | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|
Baseline | 83.0 | 83.8 | 83.4 | 85.8 | 43.4 | 20.00 | 60.0 |
iRMB-Block [33] | 83.3 | 83.2 | 83.2 | 86.1 | 43.4 | 16.41 | 49.1 |
PConv-Block [26] | 83.0 | 84.3 | 83.6 | 86.0 | 43.6 | 14.00 | 42.8 |
RFAConv-Block [34] | 83.3 | 82.8 | 83.1 | 85.6 | 42.6 | 20.24 | 59.9 |
Faster-RPE Block | 83.2 | 84.7 | 83.9 | 86.6 | 44.2 | 16.89 | 51.4 |
Methods | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|
Baseline | 83.0 | 83.8 | 83.4 | 85.8 | 43.4 | 20.00 | 60.0 |
FasterNet [26] | 82.0 | 82.4 | 82.7 | 84.9 | 42.0 | 10.91 | 28.8 |
ConvNeXt V2 [35] | 79.7 | 75.8 | 77.7 | 80.4 | 35.7 | 12.40 | 32.3 |
EfficientFormerV2 [36] | 82.4 | 82.5 | 82.4 | 84.8 | 41.2 | 11.90 | 29.8 |
RepViT [37] | 81.7 | 78.6 | 80.2 | 82.6 | 39.2 | 13.40 | 36.7 |
Ours | 83.2 | 84.7 | 83.9 | 86.6 | 44.2 | 16.89 | 51.4 |
Methods | P (%) | R (%) | F1-Score (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|---|---|---|
RT-DETR-R18 [19] | 83.0 | 83.8 | 83.4 | 85.8 | 43.4 | 20.00 | 60.0 |
Deformable DETR [41] | 81.4 | 71.2 | 76.0 | 76.5 | 35.2 | 40.10 | 188.4 |
RTMDet-m [40] | 82.5 | 76.2 | 79.2 | 82.8 | 41.8 | 24.71 | 39.27 |
YOLOX-s [39] | 80.6 | 70.6 | 75.3 | 74.2 | 29.8 | 9.00 | 26.8 |
YOLOv8m | 83.3 | 82.0 | 83.0 | 86.0 | 41.2 | 25.86 | 79.1 |
YOLOv10m [10] | 83.1 | 80.1 | 82.9 | 85.2 | 41.3 | 15.36 | 59.1 |
TasselLFANet [38] | 82.9 | 82.6 | 82.7 | 84.7 | 39.6 | 3.04 | 20.1 |
Ours | 84.2 | 84.7 | 84.4 | 87.2 | 45.2 | 17.23 | 56.0 |