MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds

Zhu, Zhenbin; Gao, Zhankai; Zhuang, Jiajun; Huang, Dongchen; Huang, Guogang; Wang, Hansheng; Pei, Jiawei; Zheng, Jingjing; Liu, Changyu

doi:10.3390/agriculture15151653

Open AccessArticle

MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds

by

Zhenbin Zhu

^1,†

,

Zhankai Gao

^1,†,

Jiajun Zhuang

^2,*

,

Dongchen Huang

¹,

Guogang Huang

³

,

Hansheng Wang

¹,

Jiawei Pei

^1,4,

Jingjing Zheng

¹ and

Changyu Liu

^1,*

¹

College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China

²

Academy of Contemporary Agriculture Engineering Innovations, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China

³

School of Oceanography, Shanwei Institute of Technology, Shanwei 516600, China

⁴

School of Economics and Trade, Changzhou Technical Institute of Tourism & Commerce, Changzhou 213032, China

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Agriculture 2025, 15(15), 1653; https://doi.org/10.3390/agriculture15151653

Submission received: 25 June 2025 / Revised: 17 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Download

Browse Figures

Versions Notes

Abstract

Accurate detection of maize tassels plays a crucial role in yield estimation of maize in precision agriculture. Recently, UAV and deep learning technologies have been widely introduced in various applications of field monitoring. However, complex field backgrounds pose multiple challenges against the precision detection of maize tassels, including maize tassel multi-scale variations caused by varietal differences and growth stage variations, intra-class occlusion, and background interference. To achieve accurate maize tassel detection in UAV images under complex field backgrounds, this study proposes an MSMT-RTDETR detection model. The Faster-RPE Block is first designed to enhance multi-scale feature extraction while reducing model Params and FLOPs. To improve detection performance for multi-scale targets in complex field backgrounds, a Dynamic Cross-Scale Feature Fusion Module (Dy-CCFM) is constructed by upgrading the CCFM through dynamic sampling strategies and multi-branch architecture. Furthermore, the MPCC3 module is built via re-parameterization methods, and further strengthens cross-channel information extraction capability and model stability to deal with intra-class occlusion. Experimental results on the MTDC-UAV dataset demonstrate that the MSMT-RTDETR significantly outperforms the baseline in detecting maize tassels under complex field backgrounds, where a precision of 84.2% was achieved. Compared with Deformable DETR and YOLOv10m, improvements of 2.8% and 2.0% were achieved, respectively, in the mAP₅₀ for UAV images. This study proposes an innovative solution for accurate maize tassel detection, establishing a reliable technical foundation for maize yield estimation.

Keywords:

maize tassel; multi-scale target; deep learning; object detection; detection transformer; phenotype

1. Introduction

Maize, one of the three major cereal crops, is a staple food source for populations worldwide. The male reproductive structure of the maize plant, commonly known as the tassel, plays a crucial role in the fertilization of the female reproductive structure [1]. The tasseling stage is a critical period in maize growth and development, and the developmental status and quantity of tassels are crucial evaluation indicators for the overall growth status of maize plants. Traditional detection methods primarily rely on manual operations, which exhibit notable drawbacks, including being time-consuming, costly, inefficient, and prone to significant errors [2]. Unmanned aerial vehicles (UAVs) have emerged as vital tools for acquiring field crop growth images, owing to their cost-effectiveness, operational efficiency, and extensive spatial coverage advantages [3]. Their widespread application addresses the inefficiency of traditional detection methods while generating massive, high-resolution image data. However, the detection results could be affected by the irrelevant interferences within the complex field background, crop mutual occlusion, and the multi-scale variations of maize tassels caused by differences of varieties and growth stages. Current mainstream maize tassel detection methods predominantly employ computer vision techniques or convolutional neural network CNN-based core inference engines, supplemented by pre-/post-processing operations for object detection [4]. Nevertheless, these methods demonstrate limited learning capacity in adapting to large-scale UAV image datasets collected in complex field background conditions, and it is still a challenge to precisely detect maize tassels in complex field backgrounds. Consequently, developing a high-precision maize tassel detection method holds substantial significance for improving yield assessment and growth status monitoring [5].

In recent years, the continuous advancement of powerful image processing units and the drive of large-scale datasets have propelled significant progress in deep learning within the field of computer vision. In agricultural phenotyping detection, particularly for crop head recognition tasks exemplified by maize tassel detection, deep learning models have demonstrated substantial advantages over traditional image processing methods in terms of detection efficiency and accuracy metrics [6]. Current mainstream object detection algorithms primarily adhere to two technical paradigms: CNN-based architectures and transformer-based detection frameworks.

In the domain of CNN-based detection algorithms, Faster R-CNN [7] and YOLO-series models [8,9,10,11,12] represent classical technical pathways. These algorithms construct local feature extraction mechanisms through convolutional operations, inherently exhibiting translation invariance and spatial inductive bias. Liu et al. [13] constructed a dataset by collecting maize tassel images using UAVs and mobile devices and achieved a detection accuracy of 89.96% on UAV images by improving the feature extraction network of Faster R-CNN and selecting appropriate anchor sizes. However, due to the excessively large number of parameters, the model struggles to meet the requirements of real-time detection. Ferro et al. evaluated the performance of different methods in identifying pure canopy pixels in agricultural scenarios using UAV images [14], including cutting-edge deep learning methods such as Mask R-CNN and U-Net, supervised machine learning methods such as GMM and RF, and the unsupervised K-Means method. They clarified the practical significance of using neural networks and OBIA methods for detailed canopy evaluation. Jia et al. [15] utilized UAVs to collect maize tassel datasets under different weather conditions and flight altitudes and achieved an mAP of 96% on the datasets using CA-YOLOv5, which significantly improved the performance and robustness of maize tassel detection. Similarly, Niu et al. [16] utilized RGB images captured by UAVs at different flight altitudes to construct a maize tassel dataset. By integrating the PConv, CTAM, and ACmix modules into YOLOv8 and enhancing channel semantic interaction, the model maintained stable accuracies of 90.36%, 88.34%, and 84.32% at heights of 10 m, 15 m, and 20 m, respectively. Du et al. [17] proposed a new method for accurately identifying and evaluating tassel status before and after manual detasseling in maize hybrid fields using UAVs and deep learning. By optimizing the RTMDet model through annotation strategies and data augmentation, they achieved an mAP of 94.5% in UAV-based maize tassel detection, demonstrating the synergistic effect between model architecture improvements and data engineering. Additionally, Zeng et al. [18] proposed a fast anchor-free detector MT-Det based on a single stage for high-resolution images. Combined with SAHI slicing inference technology, the mAP on proximal and UAV high-resolution images increased by 13% and 38%, respectively, effectively solving the processing challenge of high-resolution maize tassel images. Although these methods mitigate the generalization limitations of traditional CNN models through attention mechanisms and feature fusion strategies, two bottlenecks persist due to the inherent properties of convolutional operations: First, during multi-scale object detection tasks, a representational disconnection exists between the high-level semantic features of deep networks and the shallow-layer detail features in the feature pyramid fusion process, leading to difficulties in ensuring consistent feature responses for multi-scale targets. Second, convolutional kernels with fixed geometric structures are sensitive to scale variations. When UAV images exhibit significant scale fluctuations in maize tassels, the model’s local receptive fields struggle to adjust adaptively, thereby resulting in detection performance degradation.

Transformer-based detection algorithms establish global feature dependencies through self-attention mechanisms, with Vision Transformer [19] and Swin Transformer [20] serving as representative works in this field. Zhou et al. [21] argue that traditional CNNs have inductive biases of locality and scale invariance, making it difficult to extract global and long-range dependencies. They propose the MWSwin-Transformer, which incorporates the ability of a feature pyramid network to extract multi-scale features, achieving an mAP and an AP50 of 0.918 on the self-constructed wheat ear dataset WSD-2022. Hybrid CNN-transformer architectures have also emerged, such as the real-time end-to-end detector RT-DETR [22], which optimizes multi-scale feature processing through intra-scale interaction and cross-scale fusion mechanisms, outperforming YOLO-series detectors of comparable scale. Zhang et al. [23] proposed the Swin T-YOLO feature extraction and detection model based on the multi-head attention mechanism, which effectively addresses the limitations of CNNs in effectively extracting global information and accurately detecting dense distributions in remote sensing images. The model’s detection accuracy on the VOC dataset increased by 2%. It should be noted that the aforementioned transformer-based detection algorithms rely on the self-attention mechanism’s capability to model global dependencies. Under complex field background conditions, intra-class occlusion and environmental background interference may disrupt the continuity of local image features. Meanwhile, the self-attention mechanism’s excessive focus on long-range dependencies could lead to the loss of shallow-layer detail information, ultimately inducing detection deviations in maize tassel targets.

To further overcome the limitations of state-of-the-art approaches for detecting maize tassels from UAV images based on the Real-Time Detection Transformer (RT-DETR) network, a novel multi-scale maize tassel detection model termed Multi-Scale Maize Tassel-RTDETR (MSMT-RTDETR) was proposed. This model achieved superior detection performance through optimized network architecture and feature fusion mechanisms within an improved deep learning framework. The main contributions include the following:

Enhanced multi-scale feature extraction: By integrating the FasterNet block and the Efficient Multi-scale Attention (EMA) mechanism, reconstructing BasicBlock, and replacing 2D convolutions in FasterNet Block with re-parameterized partial convolutions (RepConv), a lightweight Faster-RPE Block is formed. This block could achieve superior multi-scale target feature extraction performance through the synergistic optimization of global dependency modeling and local detail capture, while maintaining acceptable multi-scale object detection accuracy.
Background interference suppression: The Dynamic Cross-Scale Feature Fusion Module (Dy-CCFM) was designed by integrating a Dynamic Scale Sequence Fusion framework (DyScalseq) with a Dynamic Upsampling (DySample) component. This module facilitates the accurate detection of multi-scale objects through cross-scale feature interaction and reduces background interference. Additionally, by adopting a multi-scale adaptive kernel selection strategy, the DySample component ensures global context feature extraction.
Enhance cross-channel information extraction: Based on multi-branch topology and re-parameterization principles, the MPCC3 module was developed to replace the RepC3 module in the hybrid encoder. This improvement strengthens cross-channel information extraction and network stability without increasing FLOPs, enabling the model to exhibit superior performance when handling intra-class occlusion.

The remaining structure of this study is as follows: Section 2 introduces the materials and methods, as well as the datasets and data enhancement, and presents the proposed MSMT-RTDETR framework, followed by detailed descriptions of its architectural improvements. Section 3 analyzes and discusses the experimental results, including the ablation study of MSMT-RTDETR, the comparison with mainstream object detection models, and the visualization of maize tassel detection. Section 4 conducts a discussion on the feasibility and limitations of the MSMT-RTDETR model. Finally, Section 5 summarizes the experimental results of MSMT-RTDETR and introduces future research directions.

2. Materials and Methods

2.1. Dataset

The MTDC-UAV dataset [24] is a publicly available dataset for maize tassel detection and counting in UAV images, where detailed bounding box annotations of the maize tassels are provided. These images were acquired using a DJI Mavic 2 Pro drone (DJI, Shenzhen, China) during the spring sowing season of 2019, with the flight altitude maintained at 12.5 m. The dataset covers over 400 maize varieties, with each variety randomly sown in plots measuring 5 m in length and 0.6 m in width, and 306 original images with a resolution of 5472 × 3648. To address the loss of spatial information caused by neural network downsampling while avoiding increased computational costs, this study adopted an image segmentation technique to evenly divide each original image into 4 regions, generating 800 sub-images with a resolution of 2736 × 1824 pixels. This segmentation method effectively preserves more spatial information and helps improve the accuracy of model detail detection. The corresponding annotation files are simultaneously split into four parts to ensure that each segmented image accurately corresponds to the original annotation information; bounding boxes exceeding the segmented regions are positionally adjusted and corrected to the post-segmentation boundary values. This processing method not only improves dataset utilization efficiency but also reduces computational resource requirements.

Table 1 details the annotated bounding box statistical data of MTDC-UAV, with a total of 49,991 maize tassel instances annotated using the bounding box labeling method. Additionally, Figure 1 presents corresponding statistical data and scatter plots of the number of targets per image and the target scale distribution to investigate the quality of the split dataset, revealing the bounding box regions of the dataset. Scale classification is benchmarked against the COCO evaluation protocol [25]: targets with a bounding box area less than 1024 are defined as small targets; those greater than 1024 and less than 9216 are medium targets; and those greater than 9216 are large targets. The MTDC-UAV dataset allocates 500 images for training and validation, and 300 images for testing. Small, medium, and large targets account for 6.4%, 78.3%, and 15.3% of the entire dataset, respectively. In the training set, small, medium, and large targets account for 4.5%, 80.7%, and 14.8%, respectively, while in the test set, they account for 8.7%, 75.1%, and 16.2%, respectively. This indicates that the MTDC-UAV dataset is challenging as a multi-scale object detection dataset. Meanwhile, the presence of field maize leaves, flower organs, canopy structures, and other elements poses a certain interference to object detection results.

2.2. Data Augmentation

To enhance network generalization capabilities while preventing test set contamination of training outcomes, this study applies offline data augmentation exclusively to training set images. The offline augmentation operations applied to the MTDC-UAV dataset include horizontal flipping, vertical flipping, rotation, brightness enhancement, contrast enhancement, Gaussian noise, affine transformation, and color jittering. After data augmentation, the training set was expanded to 2120 images. Since the original MTDC-UAV dataset lacks specific splitting information for training and validation, based on the pre-experimental results and considering the impact of data augmentation on sample distribution as well as the preservation of the original dataset’s characteristics, this study divided the dataset into a 6:1:1 ratio, with 1820 images used for model training, 300 for hyperparameter tuning, and 300 images as the test set for final performance evaluation. The test set remains consistent with the test set of the original dataset, avoiding the potential similarity between the training set and the test set from affecting the model’s generalization ability. Table 1 shows the statistical data of maize tassel bounding box in the MTDC-UAV dataset before and after data augmentation.

Furthermore, to strengthen network detection capabilities for multi-scale targets, online data augmentation is employed via the Ultralytics framework. Mosaic and Mixup multi-image fusion methods are applied with 0.1 implementation probability. The Mixup technique operates as follows: First, four sample images are randomly selected and enhanced through flipping, scaling, and color space transformations. The processed images are then composited into four quadrants of a large canvas according to predefined layouts, with annotation coordinates adjusted synchronously via spatial transformation parameters. Finally, the stitched composite images are used for model training. Mosaic augmentation improves model robustness by increasing training sample diversity and effectively suppressing overfitting phenomena, thereby optimizing detection performance and generalization capabilities.

2.3. MSMT-RTDETR Model

2.3.1. Architecture

RT-DETR is a real-time end-to-end object detection model based on the transformer architecture, achieving real-time detection performance while maintaining high accuracy through multi-scale feature processing mechanisms. By eliminating the Non-Maximum Suppression (NMS) operation, the model effectively addresses the constraints imposed by NMS on inference speed and accuracy in traditional methods.

To address the challenges of identifying multi-scale maize tassels by UAV in complex field backgrounds, a multi-scale maize tassel detection model MSMT-RTDETR is proposed based on the RT-DETR network. MSMT-RTDETR consists of three primary modules: the Backbone Network, Hybrid Encoder, and Transformer Decoder with auxiliary pre-detection heads. While maintaining real-time detection capabilities, the model prioritizes optimizing multi-scale object feature extraction capability under complex field backgrounds. In the model design, an innovative Faster-RPE Block module is designed. It integrates FasterNet Block [26], EMA [27], and RepConv [28], optimizing the feature extraction performance of the backbone network while enabling it to have stronger feature representation capability. Furthermore, the model incorporates an innovative Dy-CCFM integrating Dynamic Upsampling (DySample) [29] and Dynamic Scale Sequence Fusion (DyScalSeq) to strengthen the representational capability of multi-scale features. Finally, a Multi-path Cross-channel Fusion Convolution (MPCC3) module is proposed to replace the original RepC3 module, enhancing cross-channel information extraction capability and network stability to address the challenge of intra-class occlusion. Figure 2A demonstrates the fundamental architecture of the MSMT-RTDETR model.

2.3.2. Faster-RPE Block

To enhance maize tassel detection performance in multi-scale target scenarios, an innovative Faster-RPE Block is designed. The BasicBlock in the original residual structure is replaced with the Faster-RPE Block. While maintaining the integrity of the basic convolutional architecture and feature extraction, adaptive selection and optimization of multi-scale features are achieved through feature mixing optimization and attention guidance mechanisms, significantly enhancing fine-grained feature capture capability and enabling the network to adaptively focus on key features.

Figure 2B illustrates the structure of the proposed Faster-RPE Block, which integrates three components: FasterNet Block, EMA, and RPConv. Taking the shallow feature map of the backbone network as the input, it extracts features through a 3 × 3 convolution layer and RPConv. Afterward, the SiLU activation function and EMA further enhance the ability to extract and fuse multi-scale features. Finally, it fuses with branch features to output the feature map.

The Faster-RPE Block is implemented through the following technical pathways: First, the BasicBlock and the FasterNet Block are fused to create a more parameter-efficient block. Its spatial mixing operations effectively enhance local feature learning capability while maintaining input–output channel consistency to ensure architectural compatibility. Replacing PConv in the FasterNet Block with RPConv further improves feature-learning capability. RPConv is a re-parameterized partial convolution integrating PConv and RepConv. It utilizes redundancy in feature maps and systematically applies regular convolution only to a selected subset of input channels to extract spatial features of the selected subset, where the proportion of selected channels is set to the default value of r = 1/4. This setting was verified as the optimal parameter by Chen et al. [26], while preserving the integrity of the remaining channels to avoid feature loss. Compared to standard convolution, this method achieves lower FLOPs without compromising feature extraction. Finally, the Efficient Multi-scale Attention (EMA) mechanism is incorporated to construct the Faster-RPE Block. EMA is a non-dimensionality-reduction, efficient multi-scale attention module based on cross-space learning whose core innovation lies in achieving efficient retention and enhancement of channel information through multi-scale feature interactions. By strengthening feature representation via a channel-wise information prioritization mechanism, the block is endowed with adaptive granularity selection capability.

2.3.3. Dy-CCFM

To address the limitations of the original RT-DETR’s Cross-Scale Feature Fusion Module (CCFM) in processing multi-scale sequential information, the Dynamic Cross-Scale Feature Fusion Module (Dy-CCFM) is proposed in this study. The conventional CCFM primarily exhibits two limitations: insufficient consideration of representational differences among multi-scale targets, leading to inefficient multi-scale feature interactions; and a fixed feature fusion strategy that struggles to adapt to background interference in complex field backgrounds. To overcome these limitations, the Dy-CCFM is seamlessly integrated into the baseline. A dynamic fusion methodology is employed by the Dy-CCFM. Fusion weights and strategies across different scale features are adaptively adjusted based on the input data characteristics, enabling refined and efficient fusion. High-dimensional semantic information from deep feature maps is effectively integrated with detailed information from shallow feature maps through this dynamic fusion mechanism, thus enhancing the model’s perception capability of multi-scale targets under background interference.

An adaptive feature fusion architecture is constructed through the integration of Dynamic Upsampling (DySample) and the Dynamic Scale Sequence Fusion method (DyScaleSeq), thereby enhancing detection performance in complex field backgrounds. DySample is a lightweight and efficient dynamic upsampler that achieves resource-efficient utilization by bypassing dynamic convolution and adopting point-based sampling strategies. DySample is adopted in this study to replace traditional upsampling methods. Its core idea lies in the dynamic adjustment of sampling positions, through which the global context information of feature maps is enhanced, while the high computational complexity introduced by traditional dynamic convolution is avoided. Simultaneously, image resolution and model recognition capability are significantly improved.

DyScalseq is a dynamic sequence fusion method designed for multi-scale feature integration, aiming to effectively combine high-dimensional information from deep feature maps with detailed information from shallow feature maps, thereby improving model performance in multi-scale object detection tasks. Its core design resolves the limitation of traditional feature pyramid networks (FPNs) [30] in underutilizing inter-scale feature correlations during fusion through DySample and feature stacking strategies.

As shown in Figure 2C, in the DyScalseq design architecture, DyScalseq takes multi-level features from the backbone network as inputs. First, channel dimension alignment is performed via 1 × 1 convolution, and deep-layer feature maps are transformed into feature maps with identical dimensions to shallow-layer feature maps through DySample. This process not only preserves high-dimensional semantic information from deep features but also mitigates the weakening of small target features caused by traditional upsampling methods. Subsequently, DyScalseq stacks feature maps to form a composite feature architecture, achieving cross-scale feature fusion through 3D convolution, 3D batch normalization, and SiLU activation functions. This design fully exploits correlations between multi-scale feature maps, enhancing the model’s perception of target details and high-dimensional semantic information.

2.3.4. Design of MPCC3

To address the feature extraction capability under intra-class occlusion and adapt to computational resource constraints, this study proposes the Multi-path Cross-channel Fusion Convolution (MPCC3) module to replace the traditional RepC3 module in the hybrid encoder. Although the RepC3 module simplifies the trade-off between training and inference through re-parameterization, its single-branch and short-cut topological structure results in limitations in channel feature interaction. The MPCC3 module innovatively integrates multi-branch topology with re-parameterization techniques, significantly enhancing the network’s cross-channel feature extraction performance and stability while maintaining unchanged computational costs during inference, thereby enabling the model to exhibit better performance when facing intra-class occlusion. The architecture of the MPCC3 module is illustrated in Figure 2D.

Features input to the MPCC3 module are processed through two independent branches before subsequent feature fusion. This multi-branch topological design expands the dimensionality of the feature space, enhancing the network’s capacity to capture rich semantic information while significantly improving its ability to represent diverse features. In the main branch, the designed MPConv (architecture shown in Figure 2E) employs a convolutional kernel combination of 1 × 1, 3 × 3, and 1 × 1 layers. The 1 × 1 convolutions reduce parameter count while promoting inter-channel information interaction and fusion, enabling the network to better utilize channel correlations and thereby optimize feature representation under intra-class occlusion. The additionally introduced 1 × 1 convolutional sub-branch not only improves information flow paths but also optimizes gradient propagation trajectories, effectively alleviating the gradient vanishing problem in deep networks and ensuring training stability of deep networks. Through iterative stacking of MPConv modules in the model, progressively deeper, high-quality features can be gradually extracted.

2.4. Experimental Settings

2.4.1. Evaluation Metrics

This study evaluates algorithm performance by comparing differences in image detection between the original and improved models under identical experimental settings. Precision (P), recall (R), F1-score, and mean average precision (mAP) are adopted as evaluation metrics. Precision represents the proportion of true positives among the positive predictions made by the model. Recall measures the proportion of true positive samples correctly identified as positive. The formulas for precision and recall are defined as follows:

P = \frac{T P}{T P + F P}

(1)

R = \frac{T P}{T P + F N}

(2)

where TP denotes true positive instances correctly identified, FP indicates false positive instances incorrectly identified, and FN represents false negative instances incorrectly rejected.

The F1-score is introduced to balance the trade-off between the precision and recall to provide a comprehensive evaluation of the model’s performance, and mAP is also adopted to compute the average AP across all categories, providing a comprehensive evaluation of model effectiveness. The definitions of the F1-score and mAP are as follows:

F 1 = 2 \times \frac{P \times R}{P + R}

(3)

m A P = \frac{\sum_{j = 1}^{S} A P (j)}{S}

(4)

where S represents the total category count, and the denominator sums AP values across all categories.

Additionally, the model parameter count (Params) is introduced as a metric for structural complexity in this study, which directly reflects the model’s storage requirements and overfitting risk. Giga floating point operations (GFLOPs) are employed to quantify the model’s computational complexity, with their values being used to indicate demand for computational resources. These two metrics, GFLOPs and Params, were established as the core evaluation dimensions for model complexity assessment.

2.4.2. Training Settings

All experiments were conducted on a PC running Ubuntu 20.04.4 LTS as the operating system. The software environment consisted of PyTorch 1.13.1 with Cuda 11.7 and Python 3.8.19. Hardware configuration included an Intel^® Core™ i5-10400F CPU @ 2.90 GHz × 12, 40.0 GB RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB memory. The models converged at 100 epochs, as observed in our pre-experiments, so the number of training epochs was set to 100. The subsequent training procedures adopted the default hyperparameter configurations as follows: batch size was set to 8, learning rate to 1 × 10⁻⁴, and image size to 640 × 640 pixels. The main parameter configurations used for training are shown in Table 2.

3. Experimental Results and Discussion

3.1. Ablation Experiment

3.1.1. The Number of MPConv

Comparative experiments were conducted on models with five different MPConv quantities using the MTDC-UAV dataset to validate the effectiveness of MPConv in enhancing multi-scale feature capture and improving model representation capabilities. The results are shown in Table 3.

As observed in Table 3, significant accuracy variations exist across different numbers of MPConv configurations. When the number of MPConv is 2, it brings a substantial improvement in accuracy. When the number of MPConv is 3, the accuracy reaches the experimental highest score with an mAP₅₀ of 86.2%. And when the number comes to 4, the detection accuracy instead decreases.

Selvaraju et al. argue that feature maps can represent the feature extraction capability of the model [31]. According to the fused feature maps of each experiment, as shown in Figure 3, it can be observed that while the number of MPConv increases, the activated regions of the feature maps gradually increase and their distribution tends to be uniform, and the model’s adaptability and detection capability for targets of different scales are gradually improved [32]. However, using excessive convolutional layers will lead to potential feature redundancy, which is manifested by an increase in irrelevant background activation regions and the accumulation of weakly activated features. Specifically, the feature maps under the configurations of 3, 5, and 6 MPConv all exhibit a certain degree of brightness improvement compared to those under other configurations. Among them, the fused feature map with number of MPConv = 3 is relatively the brightest in the experiments, indicating that this model possesses a superior ability to capture effective features, which can enhance the detection accuracy (mAP50) of maize tassels. Meanwhile, the parameter counts of the models with number of MPConv = 5 and 6 are 21.1 M and 21.6 M, respectively, representing an increase of 5.3% and 8.2% compared to the baseline model’s parameter count (20M), thus introducing additional computational burden. Based on the above analysis, the number of MPConv = 3 is selected as the optimal experimental hyperparameter for this study, and thus the module is named MPCC3.

3.1.2. Ablation Study

To validate the effectiveness of the proposed improvement modules, seven ablation experiment groups were constructed based on the baseline network, with the following enhancements being sequentially implemented: BasicBlocks in the feature extraction network were replaced with lightweight Faster-RPE Blocks, the Dy-CCFM was introduced, and MPCC3 was adopted to optimize the feature fusion network. Experimental verification through progressive module integration yielded the results shown in Table 4.

Experimental results demonstrate that the Faster-RPE Block enhances baseline model performance. While retaining the main architecture of the FasterNet Block, this module integrates the EMA mechanism and RepConv. Through cross-space learning and feature grouping mechanisms, it accentuates the critical features of multi-scale targets while suppressing background interference. Comparing the baseline model with Method 1 reveals that Faster-RPE Block successfully improves maize tassel detection accuracy by 0.8%. Removing this module in Method 6 reduces effective receptive field ranges, diminishing multi-scale processing capabilities and detection precision.

The Dy-CCFM plays a pivotal role in feature aggregation and upsampling. In the experiments, Method 3 with Dy-CCFM integration achieves improvements of 1.1% in detection accuracy and 0.9% in mAP. Its dynamic multi-scale feature processing mechanism significantly enhances multi-scale target perception in complex field backgrounds through adaptive fusion weight adjustments. Eliminating the Dy-CCFM in Method 5 causes notable accuracy degradation because the original feature fusion module struggles to aggregate complex multi-scale features and exhibits detection errors when processing complex field background images.

Table 4 results confirm MPCC3’s effectiveness in model optimization. The MPCC3-enhanced model substantially improves network accuracy and stability without increasing computational overhead. Specifically, MPCC3 employs 1 × 1 convolutions to reduce parameters while enhancing inter-channel information interaction and fusion, effectively mitigating complex field background interference. Additionally, the introduced auxiliary convolutional branch optimizes gradient propagation paths and enables the effective utilization of preceding feature maps. MPCC3 implementation ensures stable deep network training, significantly enhancing practical applicability.

The proposed method integrates all enhancement modules on the baseline framework, yielding a more lightweight model adaptable to multi-scale features with reduced false positives and missed detections. These data demonstrate that each module in our method achieves superior performance compared to the baseline. The proposed method not only realizes an overall performance improvement of 1.4% and an improvement of F1-score by 1%, but also reduces the parameter count by 13.85%.

3.2. Comparison Experiments

3.2.1. Comparison of Convolutional Modules

To meet the model accuracy requirements of the multi-scale maize tassel detection task, this study enhances the backbone network of the baseline model. The proposed lightweight Faster-RPE Block module is used to replace the original BasicBlock residual module to improve the detection performance of multi-scale objects.

To verify the superiority of the Faster-RPE Block, current advanced convolutional modules are selected for comparative experiments. The iRMB-Block [33] features an inverted residual design architecture, which combines depth-wise separable convolution with a self-attention mechanism and is suitable for dense prediction tasks on mobile devices. The PConv-Block [26] uses partial convolution kernels to model feature correlations, achieving module lightweighting. The RFAConv-Block [34] optimizes the problem of convolutional kernel parameter sharing through a dynamic receptive field attention mechanism, adaptively adjusting feature weights to highlight object regions, making it suitable for feature extraction in complex filed backgrounds. These three modules represent the three mainstream optimization directions of enhanced feature extraction capability, lightweight design, and attention enhancement. They form a benchmark with the proposed Faster-RPE Block in terms of characteristics and detection performance, enabling comprehensive verification of the superiority of the proposed module.

In addition, the experimental conditions are kept constant to ensure fair comparisons. All convolutional modules are trained on the MTDC-UAV dataset with the settings of epoch = 100 and batch size = 8. The specific performance metrics of the experiments are presented in Table 5.

Experimental data show that the Faster-RPE Block exhibits significant comprehensive advantages in maize tassel detection tasks: compared with the baseline BasicBlock, its parameters and GFLOPs are reduced by 15.6% and 14.3%, respectively, and mAP50 is increased by 0.8%, mAP_50–95 by 0.8%, and F1-score by 0.5%, indicating better competitiveness in multi-scale target localization accuracy and detection integrity. Compared with other comparative modules, although the inverted residual structure of iRMB-Block increases mAP50 by 0.3% compared with BasicBlock, its recall is lower than that of Faster-RPE Block, and the reduction in parameters and GFLOPs is not significantly better than the latter. PConv-Block achieves a lightweight design through channel splitting strategy, with parameters and GFLOPs reduced by 30% and 28.67%, respectively, but its mAP50 is only increased by 0.2%, and mAP_50–95 is still lower than Faster-RPE Block, with limited accuracy gain. RFAConv-Block performs poorly, as its mechanism of dynamically adjusting convolution kernel weights is not suitable for multi-scale maize detection tasks, with mAP50 decreased by 0.2% compared with the baseline, and parameters and GFLOPs slightly increased.

Compared with the inverted residual structure of the iRMB-Block, the Faster-RPE Block achieves more efficient preservation and enhancement of channel information through multi-scale feature interaction. Compared with PConv-Block, this module realizes adaptive selection and optimization of multi-scale features by additionally introducing the EMA attention guidance mechanism, which not only significantly improves detection accuracy, but also shows significant advantages in recall rate and multi-scale localization ability for small target detection of maize tassels, verifying its advancement in multi-scale detection tasks.

3.2.2. Comparison of Backbone Network

To validate the superiority of our model with the improved backbone network, we conducted comparative experiments with various advanced backbone networks, such as FasterNet [26], ConvNeXt V2 [35], EfficientFormerV2 [36], and RepViT [37]. FasterNet achieves efficient feature extraction through a hierarchical architecture of PConv and MLP blocks, reducing computational load while improving hardware parallel efficiency. ConvNeXt V2 employs GRN normalization and depthwise separable convolution optimization to enhance feature channel correlation. EfficientFormerV2 combines early convolutional local feature capture with late MHSA global modeling, balancing accuracy and latency through dynamic attention downsampling. RepViT, based on structural re-parameterization to separate channel mixers, achieves high-precision balance on mobile devices. These four networks represent the four mainstream directions, including lightweight convolution, pure convolution optimization, CNN-transformer hybrid architecture, and re-parameterization.

The experimental conditions are kept constant, and the experimental results are presented in Table 6.

In the proposed method, the BasicBlock in the original residual structure is replaced by the Faster-RPE Block. Its unique design, while maintaining the basic convolutional architecture and the integrity of feature extraction, achieves adaptive selection and optimization of multi-scale features. It can better focus on the key features of targets at different scales, especially significantly improving the detection integrity and localization accuracy of multi-scale targets, resulting in an increase of 0.8% in mAP₅₀ and mAP_50–95, respectively. Meanwhile, the enhanced fine-grained feature capture capability of the module helps detect small targets and occluded targets, thereby increasing recall by 0.9%.

Compared to current mainstream backbone networks, although FasterNet has the lowest computational load, its mAP₅₀ and mAP_50–95 are 1.7% and 2.2% lower than ours, respectively, and its recall is significantly insufficient; the detection accuracy of ConvNext V2 and RepViT lags far behind ours, with mAP₅₀ gaps reaching 6.2% and 4.0%, respectively; while EfficientFormerV2 is close to FasterNet, it still underperforms ours in core metrics, with mAP₅₀ 1.8% lower and recall 2.2% lower.

In summary, through multi-scale feature fusion design, our model achieves the highest detection accuracy while maintaining a low computational load, particularly excelling in the recall and localization capabilities of multi-scale targets, which verifies its advancement in multi-scale detection tasks.

3.2.3. Comparison of Different Detection Models

During the experimental phase on the MTDC-UAV dataset, the maize tassel detection capabilities of the MSMT-RTDETR model were comprehensively evaluated. This comparative analysis aims to elucidate performance differences among models while highlighting MSMT-RTDETR’s superior accuracy and speed. In this study, the MSMT-RTDETR model is evaluated for maize tassel detection and compared against seven leading object detection algorithms, as detailed in Table 7. The compared models include the classical tassel recognition network TasselLFANet [38], single-stage detectors YOLOX [39], RTMDet [40], YOLOv8, YOLOv10 [10], transformer-based detector Deformable DETR [41], and the baseline model RT-DETR-R18. The validation results of all models are illustrated in Figure 4, while the experimental outcomes are summarized in Table 7.

During the training process on the MTDC-UAV dataset, by observing the precision–epoch curve (Figure 4A), most models rapidly converged to precision exceeding 80%, with the exception of Deformable DETR and YOLOX-s. These two models exhibited slower progression and greater fluctuations, stabilizing at a lower precision of around 73%. Some possible explanations are that these models have issues such as error propagation and information loss during the training process, making their detection performance for multi-scale targets possibly insufficient. Due to the excessively high complexity of Deformable DETR, which has numerous redundant features [42], this leads to overemphasizing certain relevant features while neglecting other important ones. MSMT-RTDETR and YOLOv8m demonstrated strong post-convergence performance, showing greater stability compared to other models, with MSMT-RTDETR achieving the highest precision, of 84.2%. YOLOv8m also delivered competitive results, attaining a precision of 83.3%. Notably, the model size of MSMT-RTDETR is 17.23 M, which is significantly smaller than YOLOv8m’s 25.86 M. Comprehensively, MSMT-RTDETR outperforms other object detection models.

Additionally, as observed in the precision–recall curve (Figure 4B), within the recall range of 0.0–0.9, MSMT-RTDETR’s precision was significantly higher than that of models such as YOLOv8m, YOLOX-s, TasselLFANet, and RTMDet-m. Moreover, in the high-recall (≥0.6) region, it still outperformed RT-DETR-R18, reflecting its advantage in positive instance recognition accuracy. Meanwhile, its precision–recall curve was smooth, with no drastic fluctuations, and its stability was significantly superior to models like Deformable DETR, which exhibited pronounced curve fluctuations, thus avoiding sharp declines in precision caused by changes in recall.

MSMT-RTDETR achieved a score of 87.2% on the standard mAP₅₀ metric, which is an increase of 5.3% and 17.5% compared to the single-stage detectors RTMDet-m and YOLOX-s, respectively. Meanwhile, MSMT-RTDETR obtained 84.4 points in F1-score, leading other advanced detection algorithms. The processing of features at different scales in Dy-CCFM and the multi-stage feature convolution fusion operations in MPCC3 can synergize with the high-quality multi-scale features output by the Faster-RPE Block, further optimizing the detection effect for multi-scale targets, which directly improves the precision, recall, and F1-score of MSMT-RTDETR. Notably, although Deformable DETR has more parameters, MSMT-RTDETR still maintains performance advantages, validating the superiority of the model architecture.

Figure 5 presents a confusion matrix comparison of RT-DETR-R18, Deformable DETR, RTMDet-m, YOLOX-s, YOLOv8m, YOLOv10m, TasselLFANet, and MSMT-RTDETR under optimal configurations. The results showed that the MSMT-RTDETR model was significantly better than the Deformable DETR and Yolo-series models in the ability to distinguish tassel from background, achieving a 1.1% accuracy improvement over YOLOv10m. The high precision in tassel detection and minimal misclassification underscore the reliability of the MSMT-RTDETR model. These advancements highlight MSMT-RTDETR’s application potential for multi-scale crop detection in complex agricultural scenarios.

3.3. Visualization

To validate the capability of MSMT-RTDETR in identifying multi-scale targets under complex field backgrounds, we conducted multiple rounds of comparative visual detection experiments on the MTDC-UAV dataset. We systematically assessed three typical scenarios—multi-scale distribution, dense distribution, and background interference—using UAV maize tassel images with a resolution of 2736 × 1824 pixels. The multi-scale multi-class distribution scenario is characterized by diverse tassel size features within a single UAV image caused by maize variety differences or growth stage variations, significantly increasing detection difficulty. The dense distribution scenario features a high quantity of maize tassels within the same imaging range, where frequent occlusions pose detection challenges. In background interference, influence from foliage and field conditions intensifies. With the UAV positioned at higher altitudes, maize tassels appear relatively smaller and more widely spaced, leading to an increased susceptibility to missed detections.

As shown in Figure 6, within the same UAV image, maize tassels from different varieties and growth stages exhibit a multi-scale distribution. Comparative detection results reveal that most models, when performing multi-scale detection, are prone to the influence of scale variations, suffer from insufficient feature extraction for small targets, and exhibit poor detection performance for small targets, leading to missed detections and false positives. For example, the YOLOv8m model performs poorly when detecting multi-scale maize tassels, misidentifying dead leaves on the ground as maize tassels. Meanwhile, it fails to correctly identify small-sized maize tassels in the middle area and exhibits lower confidence scores for maize tassels compared to other models. However, the MSMT-RTDETR model significantly enhances multi-scale detection capability by strengthening multi-scale feature extraction and intra-scale feature interaction capabilities. As shown in the experimental images, MSMT-RTDETR achieves accurate detection of multi-scale targets, avoiding the false positives and missed detections observed in other models, and demonstrates excellent detection performance for multi-target scenarios.

In complex field backgrounds, maize tassel detection is susceptible to interference from varietal differences and object intra-class occlusions, leading to false positives and missed detections. The improved model effectively resolves these deficiencies, as demonstrated in Figure 7. Detection results show that in UAV images, the original RT-DETR-R18 model misclassifies withered leaves as maize tassels while missing small targets; in contrast, the MSMT-RTDETR model accurately identifies targets in complex field backgrounds, significantly reducing false positive and missed detection rates. The yellow dashed-line areas in the figure visually highlight these improvements: MSMT-RTDETR successfully avoids misdetections and omissions, enhances maize tassel detection accuracy, and achieves greater stability compared to the original model.

Figure 8 displays the comparative detection results of the RT-DETR-R18, YOLOv8m, YOLOv10m, TasselLFANet, and MSMT-RTDETR models on UAV images across typical complex field backgrounds, including multi-scale, dense distribution, and background interference conditions. The visualizations clearly demonstrate MSMT-RTDETR’s accuracy in identifying multi-scale maize tassels across varying sizes and complex field backgrounds. During multi-scale detection tasks, the baseline RT-DETR-R18 model exhibits false positives where withered panicles are mistaken for maize tassels, whereas MSMT-RTDETR demonstrates more stable detection performance. When handling dense distribution scenarios, TasselLFANet and YOLOv8m models experience missed detections, failing to detect individual large maize tassel targets. Furthermore, under background interference conditions, we observe that the YOLOv8, YOLOv10m, and TasselLFANet models all miss some maize tassels, with TasselLFANet performing particularly poorly by missing a large number of maize tassels. It should be noted that MSMT-RTDETR also has some limitations. For example, similar to the baseline model, MSMT-RTDETR might also produce redundant detections for maize tassels cropped at the top of the image in multi-scale scenarios. The possible reason is that the MSMT-RTDETR model has insufficient feature extraction for edge cropping and cannot distinguish between the background and maize tassels. Meanwhile, it is still prone to false detections for some high-brightness maize leaves, and the model has difficulty in handling interference caused by brightness.

Based on the above analysis, MSMT-RTDETR still exhibits superior detection accuracy compared to other models, with a relatively reduced number of false and missed detections per image.

4. Discussion

4.1. Advantage

Maize tassel detection is the foundation of scientific yield estimation and precision agriculture. However, the multi-scale characteristics of tassels caused by differences in varieties and growth stages, intra-class occlusion, and complex field background interference pose significant challenges for accurate detection. The MSMT-RTDETR model proposed in this paper significantly improves the detection performance of UAVs in complex field backgrounds by innovatively integrating dynamic scale sequence fusion architecture and feature processing technologies, providing an efficient solution for precise maize tassel detection and similar agricultural monitoring tasks. The model constructs a backbone feature extraction architecture based on the lightweight Faster-RPE Block module, which improves the multi-scale target feature extraction capability while reducing computational complexity compared to the baseline model ResNet18. Secondly, by integrating DySample and intra-scale feature interaction modules, the DyCCFM structure is innovatively designed and embedded into the hybrid encoder, effectively enhancing multi-scale target attention and reducing the missed detection rate. Finally, the original RepC3 module is replaced with the newly designed MPCC3, which integrates classical convolution architecture, multi-branch topology, and re-parameterization technology, improving the network’s cross-channel feature extraction performance and thereby enhancing the model’s judgment accuracy for multi-scale targets. Compared with other state-of-the-art object detection methods, the proposed model exhibits more prominent stability, i.e., with stable performance for both precision and recall indices.

This paper conducts comprehensive experiments to verify the performance of MSMT-RTDETR. The results show that the proposed model improves by 1.2%, 1.4%, and 1.8% in precision, F1-score, and mAP₅₀ and mAP_50:95, respectively, and the number of parameters is reduced by 14.1%. When applied to UAV maize tassel detection tasks in the future, this method is expected to further improve the performance of applications such as maize growth status monitoring and yield prediction by improving the detection accuracy of multi-scale objects in complex field backgrounds.

4.2. Challenges and Limitations

Although the MSMT-RTDETR model has achieved significant progress in this study, there are still some limitations that require further exploration. In dense small-target detection scenarios, not only do adjacent maize tassels have intra-class occlusion, but some leaves also cause brightness differences in adjacent or low-positioned tassels. The model exhibits poor detection performance in such cases, especially struggling to accurately identify small tassel targets with significant brightness variations. Lu et al. also showed a clear trend of missing small objects, occluded objects, or objects with similar background colors in their training of YOLOv9 on the MTDC-UAV dataset [43]. One possible reason is that the model may fail to fully capture the detailed information of these severely disturbed small objects during feature extraction, resulting in their neglect in subsequent detection stages. Therefore, how to solve the accurate detection of small objects under varying lighting conditions is a key direction in future work.

Furthermore, this study deploys the model on high-performance GPU devices. Although the model achieves high detection accuracy, it may face device resource constraints due to the lack of certain lightweight and network pruning techniques [44]. The performance evaluation of MSMT-RTDETR is based on a specific low-altitude remote sensing dataset of maize tassels. Although the detection effect is acceptable, its practical application still requires further verification and optimization, such as whether the model’s adaptability to multispectral fused images can meet the requirements, the robustness under scene changes, lighting conditions, and shadow or reflection phenomena. Meanwhile, the MTDC-UAV dataset used is limited to 400 varieties, and more morphological variations of tassels still need to be supplemented.

4.3. Future Perspectives

To address the challenge of detecting illumination variations caused by intra-class occlusion of small objects, consideration may be given to introducing an anti-illumination interference module. The differentiated feature extraction capability brought about by the new module may effectively mitigate interference from illumination fluctuations, and a dedicated dataset for illumination variations can also be constructed to optimize model robustness. In addition, it would be necessary to further consider temporal dimension information in the maize tassel dataset, e.g., the dynamic collection covering the entire growth period, which may be more beneficial for more accurate maize yield estimation.

5. Conclusions

To achieve accurate detection of maize tassels in UAV images under complex field conditions, this study proposes a detection network model based on the Multi-Scale Maize Tassel Real-Time Detection Transformer (MSMT-RTDETR). Experimental results demonstrate that the proposed MSMT-RTDETR achieves significant improvements in various accuracy metrics for multi-scale object detection of maize tassels in UAV images. Ablation studies confirm the effectiveness of the improved modules, including Faster-RPE Block, Dy-CCFM, and MPCC3, and the optimized number of MPConv. Extended comparative visualization experiments confirm the performance of MSMT-RTDETR on detection challenges such as multi-scale variations, dense distribution, and background interference, demonstrating its effectiveness in accurately detecting maize tassels. This model provides an innovative solution for the accurate detection of maize tassels and lays a reliable technical foundation for maize yield estimation.

However, the detection challenges caused by illumination variations for clustered small tassel objects in the dataset still need optimization. Future research will extend validation to multiple crop tassel datasets or construct time-series datasets to evaluate the model’s generalization ability and robustness. Meanwhile, multi-scenario testing should be conducted to clarify the model’s advantages and areas for improvement, thereby promoting the transformation of UAV-based agricultural monitoring technology into practical applications.

Author Contributions

Conceptualization, Z.Z. and J.Z. (Jiajun Zhuang); methodology, Z.Z.; software, Z.Z. and D.H.; validation, Z.G. and D.H.; formal analysis, Z.Z.; investigation, Z.Z.; resources, Z.Z. and C.L.; data curation, Z.Z.; writing—original draft preparation, Z.Z., H.W. and J.Z. (Jingjing Zheng); writing—review and editing, Z.Z., Z.G., J.P. and C.L.; visualization, Z.Z., J.Z. (Jiajun Zhuang) and C.L.; supervision, Z.G., G.H. and C.L.; project administration, Z.Z., Z.G. and C.L.; funding acquisition, Z.G., J.Z. (Jiajun Zhuang) and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge support from the Key Science and Technology Research and Development Program of Guangzhou City, China (Grant No. 2024B03J1302), the Natural Science Foundation of Guangdong Province, China (Grant No. 2022A1515010885), the Philosophy and Social Sciences Planning Project of Guangdong Province of China (Grant No. GD23XGL099), and the Guangdong General Universities Young Innovative Talents Project (Grant No. 2023KQNCX247).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this study are from the MTDC-UAV maize tassel detection and counting UAV dataset. The MTDC-UAV dataset is publicly available and can be accessed at https://github.com/Ye-Sk/MTDC-UAV (accessed on 7 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wang, Y.; Bao, J.; Wei, X.; Wu, S.; Fang, C.; Li, Z.; Qi, Y.; Gao, Y.; Dong, Z.; Wan, X. Genetic structure and molecular mechanisms underlying the formation of tassel, anther, and pollen in the male inflorescence of maize (Zea mays L.). Cells 2022, 11, 1753. [Google Scholar] [CrossRef]
Andorf, C.; Beavis, W.D.; Hufford, M.; Smith, S.; Suza, W.P.; Wang, K.; Woodhouse, M.; Yu, J.; Lübberstedt, T. Technological advances in maize breeding: Past, present and future. Theor. Appl. Genet. 2019, 132, 817–849. [Google Scholar] [CrossRef]
Guo, Y.; Xiao, Y.; Hao, F.; Zhang, X.; Chen, J.; De Beurs, K.; He, Y.; Fu, Y.H. Comparison of different machine learning algorithms for predicting maize grain yield using UAV-based hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103528. [Google Scholar] [CrossRef]
Huang, Y.; Qian, Y.; Wei, H.; Lu, Y.; Ling, B.; Qin, Y. A survey of deep learning-based object detection methods in crop counting. Comput. Electron. Agric. 2023, 215, 108425. [Google Scholar] [CrossRef]
Wu, W.; Zhang, J.; Zhou, G.; Zhang, Y.; Wang, J.; Hu, L. ESG-YOLO: A method for detecting male tassels and assessing density of maize in the field. Agronomy 2024, 14, 241. [Google Scholar] [CrossRef]
Sanaeifar, A.; Guindo, M.L.; Bakhshipour, A.; Fazayeli, H.; Li, X.; Yang, C. Advancing precision agriculture: The potential of deep learning for cereal plant head detection. Comput. Electron. Agric. 2023, 209, 107875. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. In Proceedings of the 38th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar] [CrossRef]
Liu, Y.; Cen, C.; Che, Y.; Ke, R.; Ma, Y.; Ma, Y. Detection of maize tassels from UAV RGB imagery with Faster R-CNN. Remote Sens. 2020, 12, 338. [Google Scholar] [CrossRef]
Ferro, M.V.; Sørensen, C.G.; Catania, P. Comparison of different computer vision methods for vineyard canopy detection using UAV multispectral images. Comput. Electron. Agric. 2024, 225, 109277. [Google Scholar] [CrossRef]
Jia, Y.; Fu, K.; Lan, H.; Wang, X.; Su, Z. Maize tassel detection with CA-YOLO for UAV images in complex field environments. Comput. Electron. Agric. 2024, 217, 108562. [Google Scholar] [CrossRef]
Niu, S.; Nie, Z.; Li, G.; Zhu, W. Multi-altitude corn tassel detection and counting based on UAV RGB imagery and deep learning. Drones 2024, 8, 198. [Google Scholar] [CrossRef]
Du, J.; Li, J.; Fan, J.; Gu, S.; Guo, X.; Zhao, C. Detection and identification of tassel states at different maize tasseling stages using UAV imagery and deep learning. Plant Phenomics 2024, 6, 0188. [Google Scholar] [CrossRef]
Zeng, F.; Ding, Z.; Song, Q.; Qiu, G.; Liu, Y.; Yue, X. MT-Det: A novel fast object detector of maize tassel from high-resolution imagery using single level feature. Comput. Electron. Agric. 2023, 214, 108305. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2010, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using shifted windows. arXiv 2021, arXiv:2103.14030v2. [Google Scholar] [CrossRef]
Zhou, Q.; Huang, Z.; Zheng, S.; Jiao, L.; Wang, L.; Wang, R. A wheat spike detection method based on Transformer. Front. Plant Sci. 2022, 13, 1023924. [Google Scholar] [CrossRef] [PubMed]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar] [CrossRef]
Zhang, X.; Zhu, D.; Wen, R. SwinT-YOLO: Detection of densely distributed maize tassels in remote sensing images. Comput. Electron. Agric. 2023, 210, 107905. [Google Scholar] [CrossRef]
Ye, J.; Yu, Z. Fusing global and local information network for tassel detection in UAV imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4100–4108. [Google Scholar] [CrossRef]
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar] [CrossRef]
Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. arXiv 2023, arXiv:2303.03667v3. [Google Scholar]
Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the 48th IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–9 June 2023. [Google Scholar] [CrossRef]
Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Online, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2016, arXiv:1611.07450. [Google Scholar]
Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 18th IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar] [CrossRef]
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198. [Google Scholar]
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar] [CrossRef]
Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet size and speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. arXiv 2023, arXiv:2307.09283. [Google Scholar]
Yu, Z.; Ye, J.; Li, C.; Zhou, H.; Li, X. TasselLFANet: A novel lightweight multi-branch feature aggregation neural network for high-throughput image-based maize tassels detection and counting. Front. Plant Sci. 2023, 14, 1158940. [Google Scholar] [CrossRef]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for end-to-end object detection. In Proceedings of the 9th International Conference on Learning Representations, Virtual Conference, Online, 3–7 May 2021. [Google Scholar]
Han, D.; Zhao, N.; Shi, P. A new fault diagnosis method based on deep belief network and support vector machine with Teager–Kaiser energy operator for bearings. Adv. Mech. Eng. 2017, 9, 121–131. [Google Scholar] [CrossRef]
Lu, D.; Wang, Y. MAR-YOLOv9: A multi-dataset object detection method for agricultural fields based on YOLOv9. PLoS ONE 2024, 19, e0307643. [Google Scholar] [CrossRef]
Shuvo, M.M.H.; Islam, S.K.; Cheng, J.; Morshed, B.I. Efficient acceleration of deep learning inference on resource-constrained edge devices: A review. Proc. IEEE 2023, 111, 42–91. [Google Scholar] [CrossRef]

Figure 1. The MTDC-UAV exploration analysis of object distribution across scales. (A) Object distribution across scales on the MTDC-UAV training set. (B) Objects per image on the MTDC-UAV training set.

Figure 2. Architecture of the proposed MSMT-RTDETR. (A) Overall architecture. (B) Architecture of Faster-RPE Block. (C) Architecture of MPCC3. (D) Architecture of DyScalSeq. (E) Architecture of MPConv.

Figure 3. Fused feature maps of experiments with different numbers of MPConv models.

Figure 4. Comparison of the validation results of each model on MTDC-UAV. (A) Precision–epoch curves. (B) Precision–recall curves.

Figure 5. Comparison of confusion matrices of different models evaluated on MTDC-UAV. (A) RT-DETR-R18. (B) Deformable DETR. (C) RTMDet-m. (D) YOLOX-s. (E) YOLOv8m. (F) YOLOv10m. (G) TasselLFANet. (H) MSMT-RTDETR.

Figure 6. Comparative detection results of various models under multi-scale conditions. The blue circle indicates occurrences of false positives or missed detections.

Figure 7. Examples of annotated images for RT-DETR-R18 and MSMT-RTDETR. (A) Multi-scale; (B) dense distribution. Red dashed lines indicate occurrences of false positives or missed detections, and yellow dashed lines represent optimization outcomes.

Figure 8. Comparative detection results of various models under different complex field background conditions. Yellow circles indicate occurrences of false positives and missed detections.

Table 1. The MTDC-UAV dataset was compared with the After Data Augmentation dataset in terms of object statistics for annotated bounding boxes.

Dataset	Subset	No. Images ^a	No. Bounding Box ^a	Max. w/h ^b	Min. w/h ^b	Avg. w/h ^b
MTDC-UAV	Train & val	500	28,531	257	7	70.13
	Test	300	21,460	272	2	68.66
	Total	800	49,991	272	2	69.5
After Data Augmentation	Train	1820	110,666	828	2	73.75
	Validation	300	8199	807	1	79.57
	Test	300	21,460	272	2	68.66
	Total	2420	140,325	828	1	73.31

^a Number of images, number of bounding boxes. ^b Maximum, minimum, and average width or height of bounding box.

Table 2. Key training parameters settings.

Parameters	Setup
Epochs	100
Batch size	8
Learning rate	1 × 10⁻⁴
Image size	640 × 640
Optimizer	AdamW

Table 3. Comparative experiment results for different numbers of MPConv.

Number of MPConv	P (%)	R (%)	F1-Score (%)	mAP₅₀ (%)	mAP_50–95 (%)	Params (M)	GFLOPs (G)
1	83.8	82.4	83.1	85.6	43.1	18.69	51
2	84.0	82.6	83.3	86.0	43.4	19.28	54
3	84.1	83.8	83.9	86.2	43.9	19.87	56.9
4	83.6	82.8	83.2	85.8	43.6	20.46	59.9
5	84.0	83.6	83.8	86.2	43.8	21.05	62.8
6	83.9	83.8	83.8	86.1	43.6	21.64	65.7

Table 4. Results of ablation study experiments.

Methods	Faster-RPE Block	Dy-CCFM	MPCC3	P (%)	R (%)	F1-Score (%)	mAP₅₀ (%)	mAP_50–95 (%)	Params (M)	GFLOPs (G)
Baseline				82.9	83.8	83.4	85.8	43.4	20	60
1	√			83.2	84.7	83.9	86.6	44.2	16.89	51.4
2		√		83.7	84.7	84.2	86.8	43.9	20.21	61.5
3			√	83.7	83.6	83.7	86.2	43.9	19.87	56.9
4	√	√		83.2	84.7	83.9	86.8	44.9	17.23	56
5	√		√	82.8	84.4	83.6	86.2	44	16.89	51.4
6		√	√	83.5	83.7	83.6	86.2	43.6	20.21	61.5
Ours	√	√	√	84.2	84.7	84.4	87.2	45.2	17.23	56

Table 5. Comparative experiment results for different advanced convolutional modules.

Methods	P (%)	R (%)	F1-Score (%)	mAP₅₀ (%)	mAP_50–95 (%)	Params (M)	GFLOPs (G)
Baseline	83	83.8	83.4	85.8	43.4	20	60
iRMB-Block [33]	83.3	83.2	83.2	86.1	43.4	16.41	49.1
PConv-Block [26]	83	84.3	83.6	86	43.6	14	42.8
RFAConv-Block [34]	83.3	82.8	83.1	85.6	42.6	20.24	59.9
Faster-RPE Block	83.2	84.7	83.9	86.6	44.2	16.89	51.4

Table 6. Comparative experiment results for different backbone networks.

Methods	P (%)	R (%)	F1-Score (%)	mAP₅₀ (%)	mAP_50–95 (%)	Params (M)	GFLOPs (G)
Baseline	83	83.8	83.4	85.8	43.4	20	60
FasterNet [26]	82	82.4	82.7	84.9	42	10.91	28.8
ConvNeXt V2 [35]	79.7	75.8	77.7	80.4	35.7	12.4	32.3
EfficientFormerV2 [36]	82.4	82.5	82.4	84.8	41.2	11.9	29.8
RepViT [37]	81.7	78.6	80.2	82.6	39.2	13.4	36.7
Ours	83.2	84.7	83.9	86.6	44.2	16.89	51.4

Table 7. Comparison of the performance of different models evaluated on MTDC-UAV.

Methods	P (%)	R (%)	F1-Score (%)	mAP₅₀ (%)	mAP_50–95 (%)	Params (M)	GFLOPs (G)
RT-DETR-R18 [19]	83	83.8	83.4	85.8	43.4	20	60
Deformable DETR [41]	81.4	71.2	76.0	76.5	35.2	40.1	188.4
RTMDet-m [40]	82.5	76.2	79.2	82.8	41.8	24.71	39.27
YOLOX-s [39]	80.6	70.6	75.3	74.2	29.8	9	26.8
YOLOv8m	83.3	82.0	83.0	86.0	41.2	25.86	79.1
YOLOv10m [10]	83.1	80.1	82.9	85.2	41.3	15.36	59.1
TasselLFANet [38]	82.9	82.6	82.7	84.7	39.6	3.04	20.1
Ours	84.2	84.7	84.4	87.2	45.2	17.23	56

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, Z.; Gao, Z.; Zhuang, J.; Huang, D.; Huang, G.; Wang, H.; Pei, J.; Zheng, J.; Liu, C. MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds. Agriculture 2025, 15, 1653. https://doi.org/10.3390/agriculture15151653

AMA Style

Zhu Z, Gao Z, Zhuang J, Huang D, Huang G, Wang H, Pei J, Zheng J, Liu C. MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds. Agriculture. 2025; 15(15):1653. https://doi.org/10.3390/agriculture15151653

Chicago/Turabian Style

Zhu, Zhenbin, Zhankai Gao, Jiajun Zhuang, Dongchen Huang, Guogang Huang, Hansheng Wang, Jiawei Pei, Jingjing Zheng, and Changyu Liu. 2025. "MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds" Agriculture 15, no. 15: 1653. https://doi.org/10.3390/agriculture15151653

APA Style

Zhu, Z., Gao, Z., Zhuang, J., Huang, D., Huang, G., Wang, H., Pei, J., Zheng, J., & Liu, C. (2025). MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds. Agriculture, 15(15), 1653. https://doi.org/10.3390/agriculture15151653

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MSMT-RTDETR: A Multi-Scale Model for Detecting Maize Tassels in UAV Images with Complex Field Backgrounds

Abstract

1. Introduction

2. Materials and Methods

2.1. Dataset

2.2. Data Augmentation

2.3. MSMT-RTDETR Model

2.3.1. Architecture

2.3.2. Faster-RPE Block

2.3.3. Dy-CCFM

2.3.4. Design of MPCC3

2.4. Experimental Settings

2.4.1. Evaluation Metrics

2.4.2. Training Settings

3. Experimental Results and Discussion

3.1. Ablation Experiment

3.1.1. The Number of MPConv

3.1.2. Ablation Study

3.2. Comparison Experiments

3.2.1. Comparison of Convolutional Modules

3.2.2. Comparison of Backbone Network

3.2.3. Comparison of Different Detection Models

3.3. Visualization

4. Discussion

4.1. Advantage

4.2. Challenges and Limitations

4.3. Future Perspectives

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI