1. Introduction
Maize is one of the three major cereal crops globally and the most widely cultivated and highest-yielding grain crop [1]. It is also an important source of feed for livestock and a raw material for processing industries, contributing to food security and supporting agricultural production [2]. The maize growth cycle is divided into vegetative and reproductive stages. The vegetative stage includes emergence and stem elongation, while the reproductive stage involves tasseling, flowering, grain filling, and maturity. During the reproductive stage, tassel development affects pollination and kernel formation, influencing yield. As a monoecious plant, maize develops tassels at the top, and their size, branching, and uniformity can affect pollination [3,4]. Therefore, accurate identification and monitoring of tassels are important for studying maize growth, guiding management practices, and estimating yield. Tassel emergence marks the beginning of the reproductive stage, where nutrient uptake impacts kernel formation. Identifying tassels at this stage supports growth monitoring, fertilization management, pest control, and yield estimation [5,6,7].
Traditional tassel monitoring relies on manual methods such as field observations and phenological surveys, which are labor-intensive, time-consuming, and prone to human error [8,9]. Remote sensing techniques such as satellite imagery have been applied to maize phenology monitoring [10]. For example, Niu et al. compiled a 30 m resolution maize phenology dataset for 1985–2020 from Landsat imagery [11], and Sakamoto et al. developed a two-step filtering (TSF) method to identify maize phenological stages from 250 m MODIS data [12]. However, the limited spatial resolution and revisit frequency of satellite sensors make precise tassel monitoring difficult, and continuous high-resolution imagery remains costly and impractical for farmers [13]. UAVs offer a practical alternative [14,15]: they are easier to deploy, allow flexible timing of data acquisition [16], and provide adjustable spatial resolution with centimeter-level quality [17,18]. Nevertheless, automated tassel detection in UAV imagery still faces challenges such as occlusion, overlap, and small tassel size, underscoring the need for more efficient and robust detection methods in field environments. Recent UAV-based studies combined with deep learning have demonstrated accurate maize tassel detection. For instance, RESAM-YOLOv8n achieved an mAP50 of 95.74% and reliable tassel counting (R² = 0.976, RMSE = 1.56) [19], while CA-YOLO reached 96% average precision and effectively detected early-stage, leaf-obscured, and overlapping tassels [20]. These results indicate that UAV imaging combined with deep learning enables continuous, high-resolution monitoring of maize phenology and tassel development. Such advances have promoted UAV-based remote sensing applications in agriculture, including crop yield estimation [21], pest and disease detection [22], and phenology identification [23].
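Counting accuracy in such studies is typically reported as the coefficient of determination (R²) and root-mean-square error (RMSE) between per-image predicted and manually counted tassels. A minimal NumPy sketch of both metrics (function and variable names are illustrative, not taken from any cited work):

```python
import numpy as np

def count_metrics(pred_counts, true_counts):
    """R^2 and RMSE between predicted and manual per-image tassel counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    ss_res = np.sum((true - pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((true - true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return r2, rmse

# Example: five plots, detector off by at most one tassel per plot
r2, rmse = count_metrics([12, 8, 15, 10, 9], [12, 9, 15, 10, 8])
# r2 ≈ 0.935, rmse ≈ 0.632
```

R² close to 1 and a small RMSE together indicate that automated counts track manual counts closely across plots.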
Domestic and foreign scholars have conducted considerable research on maize tassel detection. Researchers have employed mainstream object detection frameworks such as Faster R-CNN [24], SSD [25], and YOLO [26] to improve detection precision and efficiency. For example, ResNet [27] and VGGNet [28] have been evaluated as backbone feature extractors in Faster R-CNN for maize tassel detection, with ResNet achieving higher detection precision. In addition, anchor box sizes were adjusted based on the morphological characteristics of maize tassels to improve the matching between the model and the targets. However, these anchor box settings rely on manual experience, introducing subjectivity that may limit model generalization across datasets [29]. A U-Net-based semantic segmentation method has also been used to identify tassel regions; this approach relies on UAV imagery from a single acquisition and requires post-processing to obtain tassel counts, as segmentation alone does not directly produce them [30]. A lightweight SEYOLOX-tiny model was applied to detect maize tassels in UAV field images. While the model demonstrated strong detection performance with an mAP50 of 95.0%, some tassels were missed, and the limited dataset of eight time points restricts its application for continuous tasseling monitoring [31].
YOLO uses a modular architecture, typically consisting of a backbone for feature extraction, a neck for feature aggregation, and a head for classification and regression tasks. The backbone is composed of Conv, C2f, and SPPF modules, which extract multi-scale feature information. The C2f module enhances feature representation using convolutional and Bottleneck blocks, while the SPPF module captures features at different scales through repeated max-pooling. The neck combines Path Aggregation Network (PAN) and Feature Pyramid Network (FPN) structures to fuse features at multiple levels. The head employs a decoupled, anchor-free design with three detection heads for classification and localization, producing target categories and bounding boxes. This structure allows flexible modification and lightweight adjustment, and it is widely used in agricultural computer vision applications. YOLO-based detection algorithms perform effectively in complex natural scenarios but rely on high computational power and large storage resources, which limits the deployment of mature detection models on small mobile terminals [32]. To promote the practical application of detection algorithms, research on lightweight object detection networks has become a hotspot [33].
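The SPPF pooling pattern described above can be sketched without a deep learning framework: three chained stride-1 max-pools of the same kernel size, whose outputs are concatenated with the input along the channel axis. The sketch below omits the 1×1 convolutions that the real SPPF module places before and after the pooling stage, and all names are illustrative:

```python
import numpy as np

def max_pool2d(x, k=5):
    """Stride-1, same-padded max pooling over a (C, H, W) array."""
    pad = k // 2
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = xp[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sppf(x, k=5):
    """SPPF pooling stage: three chained max-pools, concatenated with the input."""
    p1 = max_pool2d(x, k)
    p2 = max_pool2d(p1, k)
    p3 = max_pool2d(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=0)  # channels: C -> 4C
```

Because chaining two stride-1 kernel-5 max-pools equals a single kernel-9 pool (and three equal a kernel-13 pool), SPPF reproduces the multi-scale pooling of the older SPP module while reusing intermediate results, which is why it is cheaper.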
An improved YOLOv5x network was applied to blueberry ripeness detection, where the introduction of a lightweight Little-CBAM module enhanced feature extraction; the model achieved an average mAP of 78.3%, but its real-time inference speed was only 47 FPS [34]. A lightweight THYOLO model for tomato fruit detection and ripeness classification incorporated MobileNetV3 into the backbone and applied channel pruning in the neck, reducing parameters by 78% to a final size of 6.04 MB at a detection speed of 26.5 FPS, although performance for dense small-object detection remained limited [35]. A multi-scale feature adaptive fusion model (MFAF-YOLO) based on YOLOv5s was developed for complex agricultural environments; it employed K-means++ clustering of citrus fruit morphological features to optimize anchor boxes and reduce redundant detections, decreasing the model size by 26.2% to 10.4 MB [36]. These cases indicate that improving detection accuracy may increase computational demands or reduce inference speed. While lightweight designs and feature fusion can lower model size and parameter counts, practical use in resource-limited settings still requires balancing detection performance with computational cost.
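Anchor optimization of the kind used in MFAF-YOLO can be illustrated with a plain k-means pass over the (width, height) pairs of labeled boxes. This is a simplified sketch: the K-means++ seeding and the 1 − IoU distance that YOLO pipelines often prefer are replaced by random seeding and Euclidean distance, and all names are illustrative:

```python
import numpy as np

def kmeans_anchors(box_wh, k=3, iters=50, seed=0):
    """Cluster labeled-box (width, height) pairs into k anchor sizes."""
    wh = np.asarray(box_wh, dtype=float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # distance of every box to every current center, shape (n_boxes, k)
        dist = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for c in range(k):
            members = wh[assign == c]
            if len(members):                  # keep old center if cluster empties
                centers[c] = members.mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sort anchors by area
```

Replacing the Euclidean distance with 1 − IoU makes the clustering scale-aware, which is why YOLO-style pipelines favor it when box sizes span a wide range.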
YOLO-based algorithms have been applied in fruit and vegetable detection, showing relatively stable performance for specific crops and scenarios. However, due to the diversity and complexity of agricultural environments, existing models face difficulties in achieving effective cross-crop transfer. Transfer learning can provide favorable initial parameters and accelerate model convergence, but in practice, detection models often require task-specific design or fine-tuning to adapt to different crop types and field conditions [37]. In maize tassel detection, the use of YOLO remains limited. Current models face three main challenges: (1) limited generalization across different crop datasets, which constrains cross-crop transferability; (2) reduced robustness under field conditions, where leaf occlusion and tassel overlap frequently lead to false or missed detections; and (3) insufficient representation of small or partially occluded tassels due to texture similarity and dense distribution. These issues affect the accuracy and reliability of tassel detection in practical applications. Moreover, while YOLOv8 provides higher detection precision and faster inference than earlier versions, it still has limitations in detecting small and densely distributed tassels, maintaining robustness under complex field conditions, and balancing model size and efficiency for potential deployment in low-compute environments.
To address these challenges, this study proposes the RALSD-YOLO model, an improved detection model based on the YOLOv8 framework. The model integrates several structural improvements: (1) the C2f_RVB_EMA module enhances the extraction of small-object features under field conditions; (2) the Adown and SPPF_LSKA modules enable multi-scale feature fusion and reduce information loss; and (3) the SlimNeck and LiSCDetect modules balance model complexity and detection accuracy, improving the handling of occluded and overlapping tassels. Compared with existing YOLO models, these improvements are expected to increase detection accuracy for small and overlapping tassels, enhance generalization across field conditions, and accelerate inference, enabling more reliable maize tassel detection. “RALSD” represents the integration of C2f_RVB_EMA (R), Adown (A), SPPF_LSKA (L), SlimNeck (S), and LiSCDetect (D) into the YOLOv8 framework.
5. Conclusions
The main objective of this study was to develop a lightweight model for maize tassel detection. RALSD-YOLO was constructed based on YOLOv8n, with structural modifications to the backbone, neck, and detection head aimed at improving feature extraction while reducing computational complexity. In the backbone, a lightweight feature extraction module, an improved downsampling operation, and a multi-scale attention mechanism were introduced to enhance the representation of tassels at different scales and capture relevant features more efficiently. In the neck, a streamlined structure facilitated the aggregation of fine-scale features, and the detection head was redesigned to allow feature sharing across multiple layers, improving the model’s response to small and partially occluded tassels in complex field environments. These modifications collectively contributed to a balance between computational efficiency and detection accuracy.
The model was trained and evaluated on the MrMT dataset, showing higher precision and recall than conventional detection models, including YOLOv3, YOLOv5n, YOLOv7-tiny, Faster R-CNN, and SSD. Its generalization capability was further assessed on the MTDC dataset, demonstrating stable performance across different growth stages, lighting conditions, and field environments. Analysis indicated that detection performance was affected by tassel size, overlap, and background complexity, suggesting potential limitations when applied to diverse field conditions. These findings provide insights into the factors influencing tassel detection and can guide future improvements in model design and data acquisition strategies.
Although the model achieved satisfactory performance, there remains room for improvement. Future work will focus on incorporating datasets covering a wider range of maize varieties to enhance adaptability under diverse conditions. In addition, weakly supervised learning techniques will be explored to reduce reliance on manual annotation and decrease the associated labeling workload. These developments aim to further support automated tassel monitoring and quantitative yield estimation in agricultural production.