Article

A Multi-Granularity Gated Image-Level Supervised Network (MGG-ISCNet) for Spike Counting in Agropyron cristatum (L.) Gaertn

1
Key Laboratory of Grassland Resources (IMAU), Ministry of Education, College of Grassland Science, Inner Mongolia Agricultural University, Hohhot 010021, China
2
Shaanxi Key Laboratory of Agricultural Information Sensing and Intelligent Service, College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling 712100, China
3
National Center of Pratacultural Technology Innovation (Under Preparation), Hohhot 010000, China
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Agronomy 2025, 15(12), 2805; https://doi.org/10.3390/agronomy15122805
Submission received: 6 November 2025 / Revised: 26 November 2025 / Accepted: 2 December 2025 / Published: 5 December 2025
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Accurate counting of spikes in crested wheatgrass, an important forage resource, is essential for breeding and yield evaluation. However, traditional manual counting is inefficient, and instance-level supervised methods face challenges such as high annotation costs and counting errors caused by overlapping targets in complex field scenes. To address these issues, this study proposes the Multi-Granularity Gated Image-Level Supervised Counting Network (MGG-ISCNet), a lightweight image-level supervised counting network. The network integrates multi-granularity features adaptively and employs a lightweight regression head with two 1D convolution layers and global average pooling for efficient feature compression, greatly reducing parameter complexity. Requiring only image-level count labels without positional annotations, the proposed approach substantially lowers labeling costs. On a self-constructed crested wheatgrass dataset, the MGG-ISCNet achieved an MAE of 2.73, an RMSE of 3.86, and an R2 of 0.81. Furthermore, transfer experiments on the wheat spike dataset GWHD2020 demonstrated strong generalization. The proposed method achieved the best accuracy among both instance-level and image-level supervised approaches, with MAE = 3.63, RMSE = 4.73, and R2 = 0.95, while featuring significantly fewer parameters (61.08 M) than the existing image-level method. Overall, this work provides an efficient and lightweight solution for spike counting in crested wheatgrass and other cereal crops, offering valuable support for breeding and forage production.

1. Introduction

Crested wheatgrass, Agropyron cristatum (L.) Gaertn., a key forage grass species with both high nutritional value and significant ecological importance, plays a strategic role in livestock production and ecological restoration. The species is rich in high-quality protein and dietary fiber, making it an ideal feed source for ruminants [1]. However, the combined effects of global warming, increasing drought, and the growing demand for high-quality forage have placed severe pressure on its supply [2]. Statistics indicate that the annual market demand for crested wheatgrass has been increasing at an average rate of 4–5% [3], while the yield improvement achieved through genetic enhancement remains limited due to long breeding cycles and low efficiency [4]. Consequently, the gap between supply and demand continues to widen. To accelerate breeding progress, efficient and accurate field phenotyping of key agronomic traits is urgently required. Among these traits, spike number is a critical indicator for assessing the yield potential of crested wheatgrass. Traditional manual counting methods, however, are time-consuming, labor-intensive, and prone to human error, making them unsuitable for large-scale, high-throughput phenotyping. Therefore, developing rapid and precise automated spike-counting techniques is of great significance for accelerating cultivar selection and alleviating forage shortages.
Automated and high-throughput spike counting plays a crucial role in screening superior crested wheatgrass varieties, evaluating forage yield, and monitoring ecological restoration. Yet, current spike counting methods still rely mainly on manual field surveys, requiring laborious plant-by-plant inspection, which is inefficient and error-prone [5]. In contrast, semi-automated approaches have been explored for other Poaceae crops such as wheat. For example, a combination of color thresholding, UAV multispectral imagery, and feature fusion has been used to detect and count Fusarium-infected wheat spikes [6]. Building on this, later studies improved spike recognition accuracy in Fusarium head blight assessment by integrating both image and spectral information [7]. Texture descriptors have also been explored to distinguish spikes from surrounding foliage [8,9]. However, these traditional methods rely heavily on handcrafted features and fixed thresholds, resulting in limited robustness under complex field conditions. They often struggle to handle illumination variations, occlusion, and density differences. The problem is even more pronounced in crested wheatgrass fields, where dynamic lighting, weed occlusion, and the high visual similarity between spikes and leaves make spike segmentation difficult. Consequently, conventional feature-based methods exhibit poor generalization and unstable performance across diverse field scenarios, substantially reducing counting accuracy and practical applicability.
Recent progress in agricultural deep learning has introduced instance-level supervision into object counting research [10,11,12]. One category is detection-based counting under bounding-box supervision, where bounding boxes are manually annotated around individual spikes to train object detection models for localization and counting [13]. A two-stage Faster R-CNN framework has been employed to achieve accurate spike detection and counting, while single-stage CNN models such as YOLO have been explored to improve real-time counting performance in large-scale applications [14]. Bao et al. compared single- and two-stage approaches and proposed modifications tailored for dense spike conditions, achieving reliable counting performance across different scenes [15]. Another category is point-supervised density map counting, which only requires a single point annotation per spike to generate the supervisory signal. During training, these annotations are converted into density maps using Gaussian kernels or similar functions, transforming counting into a density regression problem. A convolutional neural network (CNN) is then trained to regress the density distribution using pixel-wise losses such as MSE or MAE. During inference, summing the predicted density map yields the total spike count without explicit localization. Refining Gaussian kernel parameters has proven effective in improving the alignment between predicted density maps and real spike distributions [16]. Li et al. proposed a Poisson-based loss function to mitigate bias in dense scenes [17]. Enhancements to CNN architectures have been shown to improve robustness against background noise [18].
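For readers unfamiliar with this paradigm, the conversion from point annotations to a density map can be sketched as follows; this is an illustrative reimplementation, not code from any cited work, and the kernel size and σ are arbitrary example values:

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """2D Gaussian normalized to sum to 1, so each point adds exactly 1 to the map."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def points_to_density_map(points, h, w, size=15, sigma=4.0):
    """Place one normalized Gaussian at each annotated spike center (x, y)."""
    density = np.zeros((h, w), dtype=np.float64)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for x, y in points:
        # Clip the kernel at the image borders.
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        density[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return density

dm = points_to_density_map([(50, 60), (120, 80), (200, 200)], 256, 256)
print(round(dm.sum(), 3))  # ≈ 3.0: summing the map recovers the count
```

Because each kernel integrates to 1, summing the predicted map at inference time directly yields the estimated count, which is the property the regression losses exploit.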
Despite their effectiveness, instance-level supervised methods suffer from high annotation costs. Bounding boxes must be drawn for each spike, while point annotations require precise center localization [17]. In dense, fine-structured crested wheatgrass canopies, such annotation is prohibitively expensive. Moreover, bounding-box-based approaches are prone to overcounting or missing spikes in overlapping regions, and non-maximum suppression cannot fully resolve these issues [19]. Meanwhile, point-supervised methods rely on fixed Gaussian kernels that struggle to adapt to large morphological variations among genotypes, often assigning density values incorrectly to leaves or weeds and introducing noise.
Recognizing these challenges, recent studies have shifted toward image-level supervised paradigms, which regress the total object count per image without requiring any spatial annotation. Originating in crowd counting, this paradigm has been shown to reduce annotation cost significantly. For example, Yang et al. proposed a soft-label sorting network for count-based crowd density estimation [20], and Liang et al. introduced TransCrowd, a transformer-based model leveraging self-attention to capture global contextual dependencies [21]. In agriculture, a recent study introduced CSNet, applying image-level learning to wheat spike counting and effectively alleviating the annotation burden in dense agricultural images [22]. However, due to substantial differences in spike morphology, density, and growth environment, directly transferring these methods to crested wheatgrass yields suboptimal results. To address these limitations, this study proposes MGG-ISCNet (Multi-Granularity Gated Image-Level Supervised Counting Network), an image-level supervised model designed specifically for crested wheatgrass spike counting under complex field conditions. Compared with recent related methods such as CSNet, MGG-ISCNet introduces improvements in two key respects. First, for feature fusion, the model adopts a multi-granularity gated dynamic fusion mechanism that adaptively weighs the importance of spike features at different scales. Second, in terms of architecture, we design a regression head based on lightweight 1D convolutions and global average pooling, which replaces the parameter-heavy fully connected layer in CSNet and significantly reduces the parameter count. To comprehensively evaluate the model's generalization ability, this study used the Global Wheat Head Detection 2020 (GWHD2020) dataset [23] for verification.
This choice rests on two considerations. First, no publicly available crested wheatgrass spike-counting dataset exists. Second, wheat and crested wheatgrass share significant morphological similarities in spike structure, occlusion patterns, and other characteristics, and GWHD2020 is a widely recognized benchmark for cereal spike counting, ensuring the comparability and reproducibility of cross-species evaluation. The proposed MGG-ISCNet is a proof-of-concept automated counting method for breeding scenarios, intended to significantly reduce labeling cost and improve counting efficiency and accuracy, thereby accelerating the screening of superior lines. The main contributions of this study are summarized as follows:
(1) This study is the first to introduce image-level weakly supervised counting to crested wheatgrass spike estimation, achieving high-precision counting without explicit localization and offering a practical solution for forage phenotyping.
(2) A multi-granularity gated fusion mechanism combined with a lightweight head design is proposed to adaptively integrate multi-scale features, effectively improving scale sensitivity and feature discrimination while maintaining a low parameter count.
(3) Comprehensive experiments on a custom crested wheatgrass dataset and cross-crop evaluation on public wheat datasets demonstrate the proposed model’s strong generalization, robustness, and potential for broad application in cereal crop phenotyping.

2. Materials and Methods

The overall workflow of this study, including the study area overview, data processing methods, model architecture, and evaluation strategies, is illustrated in Figure 1. The study area is located in Hohhot, Inner Mongolia, China, with the topography and major geographic features shown in Figure 1A. The data preprocessing procedure, shown in Figure 1B, begins by dividing the collected images into multiple patches (patch segmentation). Each patch was manually annotated by breeding experts to record the number of crested wheatgrass spikes, producing the ground-truth labels used for model training. The proposed deep learning model, MGG-ISCNet, is illustrated in Figure 1C. RGB images are first processed through a backbone network to extract multi-level features, followed by the Multi-Scale Patch Mixer (MPM) and Multi-Granularity Gating (MGG) modules for adaptive feature enhancement and information filtering. The final spike count is predicted through an end-to-end regression head optimized using the L1 loss function. Model performance evaluation and generalization verification are shown in Figure 1D—the left panel presents results on the crested wheatgrass (CW) dataset, while the right panel demonstrates transfer learning performance on the GWHD2020 dataset, allowing comparison between image-level and instance-level supervision methods.

2.1. Data Acquisition

Data were collected from the crested wheatgrass experimental field at Inner Mongolia Agricultural University (42.3° N, 119.5° E), as shown in Figure 1A. The cultivar used was Agropyron cristatum cv. Mengnong No.1, a drought- and cold-tolerant variety bred by the university that is widely used in northern grassland restoration and forage production and is representative of the regional vegetation. The field was flat, with chestnut soil, well-managed plots, and minimal weed interference. The area features a typical temperate grassland climate with naturally varying light conditions, capturing realistic plant structures, canopy density, and spatial distribution patterns of crested wheatgrass under field conditions.
An Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) was used for data collection. The camera employs active stereo and structured light fusion, offering high-accuracy depth sensing and robustness under variable illumination—suitable for complex agricultural environments. The camera was connected to a ThinkPad T480 laptop (Lenovo Group Limited, Beijing, China) via USB 3.0 for real-time data reception and buffering. The detailed acquisition parameters are summarized in Table 1.
Data were collected from May to August 2025, covering the key growth stages of crested wheatgrass from late heading to maturity, under various weather conditions (sunny and cloudy) to capture natural variation in light, shadow, and canopy structure. Extreme disturbances (such as strong winds or violent shaking) were not deliberately introduced, although natural light fluctuations, shadow changes, and slight plant displacement still occurred during acquisition. Images were mainly collected at the milk stage (Zadoks scale Z71–Z79); in addition, a small number of plants from the late heading to flowering stages (Z41–Z69) were included, further enriching the diversity of spike morphology, density, and color contrast. RGB and depth data were acquired synchronously via a custom Python script based on the Intel RealSense SDK 2.0 (https://github.com/IntelRealSense/librealsense (accessed on 1 December 2025)) and stored in ROS bag (.bag) format with strictly aligned timestamps; RealSense Viewer was then used to extract, rename, and organize the images in batches, providing a high-quality, multi-temporal, multi-scene data foundation for annotation and model training. The process is shown in Figure 2A.

2.2. Dataset Preparation

After acquisition, all image sequences were quality-checked to remove overexposed, motion-blurred, or depth-missing frames. In total, 214 RGB images were retained, representing diverse illumination and canopy density conditions. Depth information was not used in this study. The RGB images (1280 × 720 resolution) intentionally represent ordinary field camera quality to simulate real-world deployment and enhance robustness under low-quality imaging conditions.
To adapt the data to the input requirements of the deep learning model and improve training efficiency, each 1280 × 720 image was padded with a black region to 1280 × 768 pixels so that it could be divided exactly into non-overlapping 256 × 256 blocks (Figure 2B), using a non-overlapping sliding window with a fixed stride of 256 [24,25]. In theory, each image yields 15 patches; after removing invalid patches located in blurred edge regions, a total of 2891 valid patches were extracted from the 214 original images for subsequent experiments. Each patch was labeled at the image level by two researchers with expertise in grassland science, who recorded the total number of visible spikes (a non-negative integer). During annotation, the experts moved the mouse cursor across the spike regions of each image as a visual guide, focusing on key morphological structures (such as awns and spikelet patterns) to mentally tally the visible spikes before recording the total [26]. The whole process produced no physical location annotation files, such as points or boxes (Figure 2C). Inconsistent cases were reviewed by a senior expert to finalize the labels. This simple image-level annotation approach significantly reduced manual cost, cutting the average labeling time per image from 5–8 min to under 1 min, without requiring specialized tools (e.g., Labelme, CVAT); basic tools such as Excel or plain text files were sufficient. This annotation method fully meets the requirements of density regression and count estimation tasks.
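The padding and patching arithmetic described above can be sketched as follows (an illustrative reimplementation, not the authors' script):

```python
import numpy as np

H, W, PATCH = 720, 1280, 256

def pad_and_split(img, patch=PATCH):
    """Pad the image bottom/right with black to the next multiple of `patch`,
    then cut non-overlapping patch x patch blocks (stride = patch)."""
    h, w = img.shape[:2]
    pad_h = (-h) % patch          # 720 -> 48 extra rows -> 768
    pad_w = (-w) % patch          # 1280 is already a multiple of 256
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    ph, pw = padded.shape[0] // patch, padded.shape[1] // patch
    return [padded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            for i in range(ph) for j in range(pw)]

patches = pad_and_split(np.zeros((H, W, 3), dtype=np.uint8))
print(len(patches))  # 3 rows x 5 cols = 15 patches per image
```

The 768 / 256 = 3 rows and 1280 / 256 = 5 columns account for the 15 theoretical patches per image stated above.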
To ensure fair model evaluation and consistent data distribution, the dataset was split into training, validation, and test sets at an 8:1:1 ratio while maintaining density balance across subsets (Table 2). The final dataset contained 2891 patches with a total of 47,758 annotated spikes, averaging 16.52 spikes per patch and ranging from 0 to 43 spikes, thereby capturing the substantial complexity and diversity of field conditions.
The overall histogram shows a slightly right-skewed distribution, with a skewness of +0.133 (Figure 3A). To evaluate model robustness under varying spike densities, the 33rd and 66th percentiles (13 and 20 spikes, respectively) were used to divide samples into three density levels: low (Low), medium (Mid), and high (High). Boxplots for the overall and subset distributions (Figure 3B) confirm consistent statistics across splits. The three density groups contained 1074, 879, and 938 samples, respectively. Example patches (Figure 3C–E) illustrate that in low-density scenes, spikes are well separated with clear boundaries, facilitating counting; in medium-density scenes, partial occlusion and overlap occur, increasing difficulty; and in high-density scenes, heavy overlap and canopy stacking blur boundaries, posing significant challenges for accurate detection and counting. This density-level division helps assess model performance under varying scene complexity.
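The percentile-based split into density levels can be sketched as follows; the synthetic counts here are for illustration only (on the actual dataset, the 33rd and 66th percentiles were 13 and 20 spikes):

```python
import numpy as np

def density_level(count, counts):
    """Assign a Low/Mid/High label using the 33rd and 66th percentiles
    of all patch-level spike counts."""
    p33, p66 = np.percentile(counts, [33, 66])
    if count <= p33:
        return "Low"
    if count <= p66:
        return "Mid"
    return "High"

counts = np.arange(100)  # synthetic counts, purely to illustrate the split
labels = [density_level(c, counts) for c in (10, 50, 90)]
print(labels)  # ['Low', 'Mid', 'High']
```

Deriving the thresholds from the pooled distribution (rather than fixing them a priori) keeps the three groups roughly balanced, as in the 1074/879/938 split reported above.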
To evaluate the model’s cross-crop generalization ability, in addition to the self-built CW dataset, this study also used the publicly available wheat ear dataset GWHD2020. The dataset includes 3422 images (1024 × 1024 pixels), split 8:2 into 2737 training and 685 test images. The number of wheat heads per image varied widely (0–116 in training, 0–97 in testing, average of about 43), posing challenges for stable regression (Table 3, Figure 4).

2.3. Model Architecture

The proposed MGG-ISCNet follows an end-to-end regression architecture, as shown in Figure 5. The name MGG-ISCNet reflects its core components: a Multi-Granularity Gating mechanism for adaptive multi-scale feature fusion and an image-level supervised counting strategy characteristic of weakly supervised approaches. The model takes a 3 × 512 × 512 RGB image as input. A pretrained backbone network first extracts deep semantic features, producing a feature map of size 512 × 64 × 64. These features are then passed to a feature aggregation layer inspired by CSNet, incorporating a Multi-Granularity Gating (MGG) mechanism. The backbone output is processed by parallel multi-scale MLP-Mixer branches [27], each handling coarse-, medium-, and fine-grained local contexts. Their token outputs are concatenated into a unified 1D representation, followed by learnable gating weights that dynamically evaluate the importance of each scale and generate adaptively fused features. This process enhances key region responses while suppressing redundancy. Finally, a lightweight 1D convolutional regression head compresses the gated features via two Conv1D layers, followed by adaptive global average pooling and a fully connected layer to output the final scalar count prediction.

2.3.1. Backbone Network

The first ten layers of VGG16 [28] were used as the backbone, consisting of four convolutional blocks (Layer 1–Layer 4). Each block uses 3 × 3 kernels, stride 1, and padding 1, followed by ReLU activations. Layer 1 and Layer 2 each include two convolutional layers (64 and 128 channels, respectively), each followed by 2 × 2 max pooling; Layer 3 contains three 256-channel convolutions with pooling; Layer 4 has three 512-channel convolutions without further pooling, producing the final [512, 64, 64] feature map (denoted as B4). ImageNet-pretrained weights were used to improve generalization.

2.3.2. Feature Aggregation

To enable effective multi-scale feature fusion and contextual modeling, we introduce a Multi-Scale Patch Mixer (MPM) module on top of the high-level features (denoted as B4, with shape [512, 64, 64]) extracted from the backbone network. Inspired by the MLP-Mixer architecture [27], the MPM is adapted with targeted modifications in patch partitioning and channel modeling to jointly capture spatial dependencies at multiple granularities.
Specifically, the input feature map is divided into three sets of non-overlapping patches at different scales: 4 × 4, 8 × 8, and 16 × 16, corresponding to spatial granularities of 16 × 16, 8 × 8, and 4 × 4, respectively. This yields coarse-, medium-, and fine-grained local contextual representations. Each scale branch then undergoes independent channel compression via a convolutional layer, reducing the channel dimension from 512 to 256 to lower computational cost and improve feature compactness. The resulting patch sequences are fed into dedicated mixer layers, where alternating token-mixing and channel-mixing operations model global dependencies across the sequence dimension and apply nonlinear transformations along the channel dimension, producing more discriminative multi-scale token representations.
Finally, the output sequences from the three branches—of lengths 16, 64, and 256, each with 256 channels—are concatenated along the token dimension to form a unified sequence of shape [336, 256]. Compared to conventional strategies like average pooling or naive concatenation, the proposed MPM explicitly enables cross-scale interaction and implicit compensation among different spatial granularities, significantly enhancing the model’s cross-scale consistency and contextual understanding.
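A quick check of the token counts stated above, for a 64 × 64 feature map divided into 16 × 16, 8 × 8, and 4 × 4 patches:

```python
# Token counts for the three MPM branches on a 64x64 backbone feature map.
FEAT = 64
for patch in (16, 8, 4):               # patch side lengths
    tokens = (FEAT // patch) ** 2      # grid cells = tokens per branch
    print(f"{patch}x{patch} patches -> {tokens} tokens")

total = sum((FEAT // p) ** 2 for p in (16, 8, 4))
print(total)  # 16 + 64 + 256 = 336 tokens after concatenation
```

The concatenated sequence length of 336 is exactly the L that the gating module and regression head operate on.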
In natural-field crested wheatgrass spike counting, traditional multi-scale fusion methods, including simple concatenation and handcrafted weighted averaging, often lack adaptability. They fail to dynamically adjust the contribution of each scale according to scene complexity, leading to degraded performance under challenging conditions like illumination variation, density heterogeneity, and occlusion. To address this, we propose a Multi-Granularity Gating (MGG) mechanism that learns adaptive fusion weights and enables dynamic cross-scale integration, allowing the model to selectively emphasize coarse contextual cues or fine structural details as needed.
The MGG module takes the concatenated feature sequence from the MPM as input and processes it through three parallel 1D convolutional branches with different kernel sizes to capture sequence patterns at different receptive fields. For the i-th branch, the feature mapping is computed as follows:
$$H_i = \sigma\left(\mathrm{BN}\left(\mathrm{Conv1D}_{k_i}\left(\delta\left(\mathrm{BN}\left(\mathrm{Conv1D}_{k_i}(X)\right)\right)\right)\right)\right) \tag{1}$$
In Equation (1), $k_i$ denotes the kernel size of the i-th branch, $X = F_1$ is the input, $\delta(\cdot)$ denotes ReLU activation, $\sigma(\cdot)$ is the sigmoid function, and BN stands for batch normalization. This yields a set of importance-aware response maps $H_i$, each reflecting the relevance of features at a specific scale.
To fuse these branches adaptively, we introduce learnable parameters $\alpha = [\alpha_1, \alpha_2, \alpha_3]$ and compute the dynamic fusion weight $\omega_i$ via softmax normalization, as in Equation (2):
$$\omega_i = \frac{e^{\alpha_i}}{\sum_{j=1}^{3} e^{\alpha_j}}, \quad i = 1, 2, 3 \tag{2}$$
The final global scale-aware response map is obtained by weighted summation:
$$R = \sum_{i=1}^{3} \omega_i H_i \tag{3}$$
In Equation (3), $R$ represents the final global scale-aware response map, and $H_i$ represents the feature map generated by the i-th branch.
This response map is then element-wise multiplied with the original input to reweight features based on multi-granularity importance:
$$F_2 = X \odot R \tag{4}$$
In Equation (4), $\odot$ denotes element-wise multiplication, and $X$ is the input. This design allows the model to automatically enhance fine-scale responses in dense regions and strengthen global contour cues in sparse areas, achieving dynamic scale selection and synergistic feature enhancement.
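The gating computation in Equations (2)–(4) can be sketched in NumPy; this is an illustrative reimplementation rather than the authors' code, and the branch responses H_i below are random placeholders standing in for the sigmoid outputs of the convolutional branches in Equation (1):

```python
import numpy as np

def mgg_fuse(X, H, alpha):
    """Eqs. (2)-(4): softmax over learnable alphas, weighted sum of the
    branch response maps H_i, then element-wise reweighting of the input.
    X: input feature array; H: list of arrays shaped like X; alpha: (3,) weights."""
    w = np.exp(alpha) / np.exp(alpha).sum()   # Eq. (2): dynamic fusion weights
    R = sum(wi * Hi for wi, Hi in zip(w, H))  # Eq. (3): scale-aware response map
    return X * R                              # Eq. (4): element-wise reweighting

rng = np.random.default_rng(0)
X = rng.random((256, 336))                      # [C, L] fused token sequence
H = [rng.random((256, 336)) for _ in range(3)]  # placeholder branch responses
F2 = mgg_fuse(X, H, np.array([0.5, 1.0, 1.5]))
print(F2.shape)  # (256, 336)
```

Because the softmax weights always sum to 1, the fused response R stays in the same value range as the individual branch maps, so the reweighting in Equation (4) acts as a soft attention mask over the input features.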

2.3.3. Lightweight Counting Head

To further improve computational efficiency while maintaining strong representational capacity, we design a lightweight regression head (LightHead) based on 1D convolutions, replacing the conventional parameter-heavy fully connected layers (Figure 6). This head compresses features efficiently while preserving local correlations along the sequence dimension and enables end-to-end density regression through global adaptive pooling.
Given the fused feature sequence $X \in \mathbb{R}^{B \times C \times L}$ (where $B$ is the batch size, $C = 256$, and $L = 336$), the LightHead first applies two successive 1D convolutional layers for progressive channel reduction: the first reduces channels from 256 to 128, and the second from 128 to 64. Each uses a 3 × 1 kernel with padding = 1 to maintain the sequence length, followed by batch normalization and ReLU activation $\delta$. The calculation process is shown in Equations (5) and (6):
$$X' = \delta\left(\mathrm{BN}\left(\mathrm{Conv1D}^{128}_{3\times 1}(X)\right)\right) \tag{5}$$
$$X'' = \delta\left(\mathrm{BN}\left(\mathrm{Conv1D}^{64}_{3\times 1}(X')\right)\right) \tag{6}$$
Here, $X'$ is the output of the first convolutional layer (128 channels) and $X''$ is the output of the second (64 channels). Next, global adaptive average pooling (AdaptiveAvgPool1d) collapses the sequence length to 1, and the result is flattened into a vector $\hat{X}$:
$$\hat{X} = \mathrm{Flatten}\left(\mathrm{AdaptiveAvgPool1d}(X'')\right) \tag{7}$$
Finally, a linear layer maps this global representation to a single scalar prediction, the final estimated count $\hat{y}$:
$$\hat{y} = \mathrm{Linear}(\hat{X}) \tag{8}$$
The linear layer has an input dimension of 64 and outputs a single value. By combining channel compression with global aggregation, this lightweight head drastically reduces model parameters while retaining sufficient expressive power, enabling accurate and efficient end-to-end counting.
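To trace the LightHead's tensor shapes end to end, here is a minimal NumPy sketch; for simplicity the two Conv1D(k = 3) layers are replaced by plain channel projections, which preserves the shape bookkeeping (256 → 128 → 64 channels, length 336 kept) but not the exact convolution operation:

```python
import numpy as np

rng = np.random.default_rng(0)
C, L = 256, 336
X = rng.random((C, L))                  # gated feature sequence [C, L]

# Stand-ins for the two Conv1D layers: channel projections + ReLU.
X1 = np.maximum(rng.random((128, C)) @ X, 0)    # "Conv1D" 256->128 -> [128, L]
X2 = np.maximum(rng.random((64, 128)) @ X1, 0)  # "Conv1D" 128->64  -> [64, L]

pooled = X2.mean(axis=1)                # AdaptiveAvgPool1d(1) + Flatten -> [64]
w, b = rng.random(64), 0.0
y_hat = float(pooled @ w + b)           # Linear(64 -> 1): final scalar count
print(pooled.shape)                     # (64,)
```

The parameter savings come from this design: the head's largest weight matrices are the small convolution kernels and a 64-dimensional linear layer, rather than a fully connected layer over the entire flattened feature map.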

2.4. Model Training and Evaluation

2.4.1. Experimental Setup

Training was conducted on a high-performance workstation equipped with an NVIDIA GeForce RTX 5090D GPU and an AMD Ryzen Threadripper 7960X 24-core CPU, running Windows 11 and Python 3.10. Input images were resized to 512 × 512 pixels. The batch size was 32, and training lasted 50 epochs. The optimizer was Stochastic Gradient Descent (SGD) with an initial learning rate of 1 × 10−4. The model was trained using the L1 loss (Mean Absolute Error, MAE) to minimize prediction deviation.

2.4.2. Model Evaluation Indicators

Model performance was quantitatively assessed using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Coefficient of Determination (R2). MAE measures average prediction deviation, RMSE reflects prediction stability and dispersion (penalizing larger errors), and R2 evaluates the goodness of fit. The calculation process for these indicators is shown in Equations (9)–(13):
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|P_i - G_i\right| \tag{9}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - G_i\right)^2} \tag{10}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(P_i - G_i\right)^2}{\sum_{i=1}^{N}\left(G_i - \bar{G}\right)^2} \tag{11}$$
$$\mathrm{NMAE} = \frac{\mathrm{MAE}}{\bar{G}} \tag{12}$$
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{G}} \tag{13}$$
  • $P_i$ denotes the number of crested wheatgrass spikes in the i-th image predicted by the model;
  • $G_i$ denotes the ground-truth number of crested wheatgrass spikes in the i-th image;
  • $\bar{G}$ represents the mean of the ground-truth spike counts across all test images;
  • $N$ is the total number of images in the test set.
This evaluation metric system jointly accounts for the absolute error magnitude, error variability, and the model’s overall fitting capability, thereby providing a comprehensive and objective assessment of the counting model’s performance.
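The five metrics can be computed with a few lines of plain Python; the predicted and ground-truth counts in the example are hypothetical, purely to illustrate the formulas:

```python
import math

def count_metrics(preds, gts):
    """MAE, RMSE, R^2, and their mean-normalized variants (Eqs. 9-13)."""
    n = len(gts)
    mean_g = sum(gts) / n
    mae = sum(abs(p - g) for p, g in zip(preds, gts)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(preds, gts)) / n)
    ss_res = sum((p - g) ** 2 for p, g in zip(preds, gts))
    ss_tot = sum((g - mean_g) ** 2 for g in gts)
    return {"MAE": mae, "RMSE": rmse, "R2": 1 - ss_res / ss_tot,
            "NMAE": mae / mean_g, "NRMSE": rmse / mean_g}

m = count_metrics([14, 18, 21], [15, 17, 22])  # hypothetical counts
print({k: round(v, 3) for k, v in m.items()})
```

Note that RMSE ≥ MAE always holds, with the gap widening as large errors dominate, which is why the pair is reported together to separate average deviation from error dispersion.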

3. Results and Analysis

3.1. Ablation Study on Different Backbones

To systematically evaluate the impact of different backbone–head combinations on the spike count accuracy of crested wheatgrass, the neck module was fixed as the Multi-Scale Patch Mixer (MPM), and four representative backbone networks—VGG16, MobileNetV3 [29], ResNet18 [30], and DarkNet53 [31]—were compared under two regression heads: FCHead and LightHead. The quantitative results on the test set are summarized in Table 4. Among all combinations, VGG16 + LightHead achieved the best overall performance, with an MAE of 2.85, RMSE of 4.05, and R2 of 0.79, while maintaining only 60.95 M parameters. In contrast, VGG16 + FCHead required 104.88 M parameters but yielded higher errors (MAE = 3.43, RMSE = 4.72), indicating that LightHead not only reduced the parameter count substantially but also improved prediction accuracy.
Further analysis revealed that although MobileNetV3 and DarkNet53 are lightweight architectures, their feature representation capacity is limited: MobileNetV3 + LightHead achieved an MAE of 3.30, while DarkNet53 + LightHead reached 3.72, suggesting difficulty in modeling fine-grained details in complex field environments. ResNet18 presented a more balanced performance; however, when paired with LightHead, its accuracy decreased (MAE increased from 3.32 to 4.01), likely due to the incompatibility between its deeper structure and the lightweight head.
In summary, VGG16 combined with LightHead achieved the best balance between accuracy and efficiency and was thus adopted as the default configuration in subsequent experiments. The results verify the importance of the joint design between the backbone and regression head. LightHead employs two-stage 1D convolutions and global pooling for efficient dimensionality reduction, reducing parameters by approximately 41.7% while improving accuracy, demonstrating an excellent trade-off between computational cost and precision.

3.2. Parameter Sensitivity Analysis

To optimize the Multi-Granularity Gating (MGG) mechanism, the effects of the channel reduction ratio (r) and receptive field configuration (kernel sizes) were examined across low-, medium-, and high-density subsets as well as the overall test set (Table 5). Three kernel size combinations were tested: {1, 3, 5}, {3, 5, 7}, and {3, 7, 11}, under a fixed VGG16 backbone with LightHead and reduction ratios of 16 and 32. This design follows a simple rationale: smaller kernels emphasize fine detail, larger kernels capture broader semantics, and the three combinations span a progressive range of receptive fields from local to wider context, allowing a systematic evaluation of the MGG's ability to integrate multi-scale information.
When r = 16 and the kernel size combination was {3, 5, 7}, the model achieved the best overall performance (overall MAE = 2.73), particularly excelling in low- and medium-density scenes (MAE = 2.26 and 2.31, respectively). This indicates that moderate receptive fields effectively integrate local details and contextual cues, making them suitable for sparse target localization. Smaller receptive fields failed to capture spatial relationships between spikes, while overly large ones introduced noise and instability. In high-density scenarios, all configurations showed increased error, reflecting the intrinsic difficulty of severe occlusion. Increasing r to 32 led to a consistent decline in performance, suggesting that retaining sufficient channel dimensions benefits multi-scale feature fusion.
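The interplay of the reduction ratio r and the kernel set can be sketched as follows. This is a hypothetical PyTorch illustration of the multi-granularity gating idea, not the authors' implementation: three parallel depthwise branches with kernels {3, 5, 7} capture local-to-contextual cues, and an SE-style gate with reduction ratio r produces per-branch weights for adaptive fusion. The depthwise convolutions and the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class MultiGranularityGate(nn.Module):
    """Sketch of a multi-granularity gating block: parallel branches with
    increasing kernel sizes, fused by softmax weights from a squeeze-and-
    excite style gate (channel reduction ratio r). Layout is an assumption."""
    def __init__(self, channels=512, kernels=(3, 5, 7), r=16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels                    # depthwise: cheap per-branch filters
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # squeeze spatial context
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, len(kernels), 1),
            nn.Softmax(dim=1),                  # weights over the granularities
        )

    def forward(self, x):                       # x: (B, C, H, W)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, 3, C, H, W)
        w = self.gate(x).view(x.size(0), -1, 1, 1, 1)              # (B, 3, 1, 1, 1)
        return (w * feats).sum(dim=1) + x       # gated fusion with residual

mgg = MultiGranularityGate(channels=64, r=16)
y = mgg(torch.randn(2, 64, 16, 16))
print(y.shape)  # torch.Size([2, 64, 16, 16])
```

The sketch also makes the sensitivity result plausible: a larger r shrinks the gate's bottleneck (channels // r), limiting the channel information available for computing the fusion weights.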

3.3. Ablation Study on the MGG-ISCNet

To verify the contribution of each component in the MGG-ISCNet, four ablation settings were designed using VGG16 + FCHead as the baseline: (1) Baseline; (2) Baseline + LightHead; (3) Baseline + MGG; (4) Full model (LightHead + MGG). Results are summarized in Table 6 and visualized in Figure 7.
Results show that introducing LightHead alone substantially improved performance—MAE decreased from 3.43 to 2.85, R2 rose from 0.71 to 0.79, and parameters were reduced by 41.7% (from 104.88 M to 60.95 M), validating its efficiency–accuracy advantage. Adding MGG to the FCHead baseline increased parameters by only 0.12 M but reduced MAE to 3.07 and improved R2 to 0.76, indicating that MGG enhances spatial dependencies and feature robustness.
Combining both modules yielded the best results (MAE = 2.73, RMSE = 3.86, R2 = 0.81, parameters = 61.08 M). The visualization in Figure 8 further demonstrates that MGG strengthens multi-scale contextual modeling, while LightHead enables efficient feature compression, together achieving superior generalization under complex conditions.

3.4. Comparative Experiments with Different Networks

To further validate the superiority of the MGG-ISCNet, it was compared with the image-level supervised method CSNet under identical settings (Table 7). The MGG-ISCNet achieved an overall MAE of 2.73 and RMSE of 3.86, demonstrating stronger feature modeling and regression capability. In low-density scenes, CSNet suffered from an insufficient receptive field (MAE = 3.75), whereas the MGG-ISCNet reduced the error to 2.26. In medium-density regions, the MGG-ISCNet achieved comparable accuracy at lower complexity (MAE 2.31 vs. 2.50). In the high-density subset, MAE and RMSE decreased by 8.2% and 9.9%, respectively.
Overall, the MGG-ISCNet consistently outperformed the baseline across all density conditions, particularly in challenging low- and high-density scenarios, confirming its effectiveness and generalization capability for weakly supervised agricultural counting tasks.

3.5. Comparative Experiments on Public Dataset

The comparative study evaluated our method against several benchmarks: instance-level detectors, including Faster R-CNN [32], YOLOv5n [33], YOLOv8n [34], and YOLOv11n [35], as well as image-level regression methods such as CSNet. As shown in Table 8, the comparison results on GWHD2020 indicate that the MGG-ISCNet achieves the best performance (MAE = 3.63, RMSE = 4.73, R2 = 0.95), surpassing all instance-level and image-level competitors. Note that NMAE and NRMSE are normalized by the test-set average count (avg = 43.24; Table 3). The regression plots in Figure 9 show that the instance-level detectors systematically undercount (points fall below the 1:1 line) owing to occlusion, spike overlap, and high visual similarity between targets. In contrast, the image-level supervision of the MGG-ISCNet avoids dependence on localization and remains robust in dense scenes.
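For reference, the metrics reported in Table 8 can be computed as below. This is a minimal sketch assuming the standard definitions of MAE, RMSE, and R2, with NMAE and NRMSE obtained by dividing by the mean ground-truth count (avg = 43.24 for the GWHD2020 test set); the function name and example counts are illustrative.

```python
import numpy as np

def count_metrics(y_true, y_pred):
    """Standard counting metrics; NMAE/NRMSE normalize by the mean count."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    avg = y_true.mean()                        # e.g., 43.24 on GWHD2020 test
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - avg) ** 2).sum()
    return {"MAE": mae, "RMSE": rmse,
            "NMAE": mae / avg, "NRMSE": rmse / avg,
            "R2": 1.0 - ss_res / ss_tot}

# Illustrative per-image spike counts (ground truth vs. prediction).
m = count_metrics([40, 50, 30], [42, 47, 31])
print(round(m["MAE"], 2))  # 2.0
```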
To compare the counting performance of the different methods more intuitively, Figure 10 visualizes the original images alongside the results of the four instance-level detection methods and the two image-level methods. In the legend, ground-truth counts, instance-level supervised counting results, and image-level supervised counting results are shown in black, red, and green, respectively. The panels are arranged as follows: ground-truth labels, followed by the results of Faster R-CNN, YOLOv5n, YOLOv8n, YOLOv11n, CSNet, and the MGG-ISCNet (the method proposed in this study).

4. Discussion

4.1. Effect of Density on Counting Performance

The ultimate goal of this study is to provide a high-throughput, low-cost field phenotyping tool for crested wheatgrass breeding and forage production that can accurately estimate spike numbers at the plot scale. Spike density remains one of the most important factors affecting counting performance in vision-based plant phenotyping. Although the proposed MGG-ISCNet performed well on the crested wheatgrass spike counting task, its accuracy was still significantly affected by the density distribution of the images. Figure 7 shows that low spike counts are slightly overestimated and high spike counts are slightly underestimated, consistent with the regression-to-the-mean behavior of regression models: to minimize the loss, the model tends to predict outputs close to the training-set mean. In sparse scenes, background noise, leaf texture, and other structures may be misidentified as spikes, inflating the predictions. In dense scenes, occlusion between spikes is more severe and local textures become aliased, making it difficult for the model to recognize every instance; to reduce the loss, the model adopts a more conservative output near the mean, leading to underestimation. Similar problems have been reported for spike counting in cereal crops such as wheat and rice: high-density scenes typically produce blurred boundaries, overlapping structures, and feature entanglement, which increase prediction errors [15,25]. Consistent with these findings, the results of this study indicate that the dense A. cristatum canopy poses a serious challenge to the identification of fine spikes. Although the MGG module alleviates feature confusion in dense regions to some extent through Multi-Granularity Gating, missed or double counting remains unavoidable once the degree of spike-layer overlap exceeds the perceptual limit of the model.

4.2. Cross-Crop Transferability

To further verify the applicability and generalization of the proposed method across crops, the MGG-ISCNet was transferred to GWHD2020, a public wheat spike dataset from another cereal crop. Although GWHD2021 improved on data size and diversity, GWHD2020 remains the most widely used benchmark in wheat ear detection and counting, particularly for weakly supervised learning and density-estimation methods, which facilitates fair comparison with existing mainstream approaches. The dataset statistics are given in Table 3 and Figure 4, and the comparison results in Table 8, Figure 9, and Figure 10. Image-level supervision outperforms instance-level methods on the count prediction task [22]. The instance-level methods exhibit systematic missed detections (prediction points concentrated below the 1:1 line), likely because widespread inter-spike occlusion, target stacking, and high appearance similarity in dense spike scenes make instance-level detectors prone to missed and false detections during target localization and non-maximum suppression (NMS), degrading overall counting accuracy [36]. By virtue of its image-level weak-supervision paradigm, the MGG-ISCNet avoids dependence on target localization and shows stronger robustness in densely occluded scenes. This result not only verifies the model's adaptability to different spike structures but also lays a foundation for future high-throughput phenotyping of other gramineous crops, such as barley and oats.

4.3. Limitations and Future Research Directions

Although the MGG-ISCNet shows good generalization ability and lightweight advantages in crested wheatgrass spike counting, the method still has limitations. First, the model relies on single-view two-dimensional RGB images, so feature confusion under severe occlusion, spike overlap, or complex backgrounds cannot be completely avoided; this is a common problem of the weak-supervision paradigm. Second, the data in this study were acquired mainly from near-ground viewpoints under relatively stable imaging conditions, whereas practical agricultural scenarios are expanding to high-throughput platforms such as UAVs, whose imagery is often affected by complex disturbances including multi-angle distortion, significant illumination changes [25], and severe plant motion caused by strong winds. Future work should therefore validate the model under more varieties, densities, and shooting heights to support high-throughput deployment on ground phenotyping vehicles and UAV platforms, with the ultimate goal of replacing manual surveys. Third, this study is still a proof of concept; systematic, replicated evaluations in multi-platform and multi-sensor scenarios are needed, such as integration into vehicle-mounted phenotyping platforms and UAV systems at different flight altitudes. Future research will also explore multi-modal fusion and cross-platform large-scale training data to promote scalable deployment of the model in real farmland environments and, ultimately, the application of high-throughput, automated phenotyping in precision agriculture.

5. Conclusions

In this study, the image-level supervision paradigm was applied to spike counting in crested wheatgrass for the first time, and a lightweight, efficient MGG-ISCNet was proposed to address the missed and double counting caused by dense spikes while substantially reducing labeling costs. The Multi-Granularity Gating mechanism adaptively fuses multi-scale features and, combined with the lightweight regression head, achieved MAE = 2.73 and RMSE = 3.86 on the self-built crested wheatgrass dataset. The model also transferred well to the wheat dataset GWHD2020 (MAE = 3.63) with only 61.08 M parameters, significantly outperforming the existing instance-level supervised methods. These results provide a new path for high-throughput phenotyping of non-model forages such as Agropyron cristatum, support breeding screening and yield estimation, and are of practical value for advancing the digitization and automation of smart grassland husbandry.

Author Contributions

Conceptualization, X.H.; methodology, D.Z.; software, X.L.; validation, Q.W. and Z.G.; formal analysis, X.L.; investigation, Z.G.; resources, M.S.; data curation, R.Z.; writing—original draft preparation, L.G.; writing—review and editing, L.G. and Z.D.; visualization, Z.D.; supervision, D.Z.; project administration, D.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Major Innovation Platform Construction Program of the National Center of Pratacultural Technology Innovation (under preparation) [Grant Nos. CCPTZX2023W01, CCPTZX2023N03, and CCPTZX2024N02]; the “Unveiling and Hanging” Project of the Inner Mongolia Autonomous Region in 2023 [Grant No. 2023JBGS0008]; the Basic and Applied Basic Research Program of Hohhot [Grant No. 2024-Regulation-Foundation-34]; and the Inner Mongolia Seed Industry Science and Technology Innovation Major Demonstration Project [Grant No. 2022JBGS0014].

Data Availability Statement

The data presented in this study have been publicly available on the anonymous open science platform (4OpenScience) at https://anonymous.4open.science/r/MGG-ISCNet-58A0/readme.md (accessed on 25 November 2025).

Acknowledgments

During the preparation of this work, the authors used ChatGPT 5.0 in order to improve language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication. The authors would like to thank the anonymous reviewers for their critical comments and suggestions for improving the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tandoh, S.; Coulman, B.; Biligetu, B. Assessment of crested wheatgrass (Agropyron cristatum L.) accessions with different geographical origins for agronomic and phenotypic traits and nutritive value. Euphytica 2019, 215, 161. [Google Scholar] [CrossRef]
  2. Baral, K.; Coulman, B.; Biligetu, B.; Fu, Y. Advancing crested wheatgrass [Agropyron cristatum (L.) Gaertn.] breeding through genotyping-by-sequencing and genomic selection. PLoS ONE 2020, 15, e0239609. [Google Scholar] [CrossRef] [PubMed]
  3. Caradus, J.R.; Chapman, D.F. Evaluating pasture forage plant breeding achievements: A review. N. Z. J. Agric. Res. 2025, 68, 1146–1220. [Google Scholar] [CrossRef]
  4. Robins, J.G.; Jensen, K.B. Breeding of the crested wheatgrass complex (Agropyron spp.) for North American temperate rangeland agriculture and conservation. Agronomy 2020, 10, 1134. [Google Scholar] [CrossRef]
  5. Cheng, T.; Zhang, D.; Zhang, G.; Wang, T.; Ren, W.; Yuan, F.; Liu, Y.; Wang, Z.; Zhao, C. High-throughput phenotyping techniques for forage: Status, bottleneck, and challenges. Artif. Intell. Agric. 2025, 15, 98–115. [Google Scholar] [CrossRef]
  6. Zhang, H.S.; Huang, L.S.; Huang, W.J.; Dong, Y.Y.; Weng, S.Z.; Zhao, J.L.; Ma, H.Q.; Liu, L.Y. Detection of wheat Fusarium head blight using UAV-based spectral and image feature fusion. Front. Plant Sci. 2022, 13, 4427. [Google Scholar] [CrossRef]
  7. Huang, L.S.; Li, T.K.; Ding, C.L.; Zhao, J.L.; Zhang, D.Y.; Yang, G.J. Diagnosis of the Severity of Fusarium Head Blight of Wheat Ears on the Basis of Image and Spectral Feature Fusion. Sensors 2020, 20, 2887. [Google Scholar] [CrossRef]
  8. Narisetti, N.; Neumann, K.; Röder, M.S.; Gladilin, E. Automated spike detection in diverse european wheat plants using textural features and the frangi filter in 2d greenhouse images. Front. Plant Sci. 2020, 11, 666. [Google Scholar] [CrossRef]
  9. Zou, M.; Liu, Y.; Fu, M.; Li, C.; Zhou, Z.; Meng, H.; Xing, E.; Ren, Y. Combining spectral and texture feature of UAV image with plant height to improve LAI estimation of winter wheat at jointing stage. Front. Plant Sci. 2024, 14, 1272049. [Google Scholar] [CrossRef]
  10. Adke, S.; Li, C.; Rasheed, K.M.; Maier, F.W. Supervised and weakly supervised deep learning for segmentation and counting of cotton bolls using proximal imagery. Sensors 2022, 22, 3688. [Google Scholar] [CrossRef]
  11. Cheng, T.; Zhang, D.; Gu, C.; Zhou, X.; Qiao, H.; Guo, W.; Niu, Z.; Xie, J.; Yang, X. YOLO-CG-HS: A lightweight spore detection method for wheat airborne fungal pathogens. Comput. Electron. Agric. 2024, 227, 109544. [Google Scholar] [CrossRef]
  12. Katari, S.; Venkatesh, S.; Stewart, C.; Khanal, S. integrating automated labeling framework for enhancing deep learning models to count corn plants using UAS imagery. Sensors 2024, 24, 6467. [Google Scholar] [CrossRef]
  13. Fernandez-Gallego, J.A.; Kefauver, S.C.; Gutiérrez, N.A.; Nieto-Taladriz, M.T.; Araus, J.L. Wheat ear counting in-field conditions: High throughput and low-cost approach using RGB images. Plant Methods 2018, 14, 22. [Google Scholar] [CrossRef]
  14. Xiong, H.; Cao, Z.; Lu, H.; Madec, S.; Liu, L.; Shen, C. TasselNetv2: In-field counting of wheat spikes with context-augmented local regression networks. Plant Methods 2019, 15, 150. [Google Scholar] [CrossRef] [PubMed]
  15. Bao, W.X.; Xie, W.J.; Hu, G.S.; Yang, X.J.; Su, B.B. Wheat ear counting method in UAV images based on TPH-YOLO. Trans. Chin. Soc. Agric. Eng. 2023, 1, 185–191. [Google Scholar]
  16. Guo, H. Wheat Head Counting by Estimating a Density Map with Convolutional Neural Networks. arXiv 2023, arXiv:2303.10542. [Google Scholar] [CrossRef]
  17. Li, Y.; Zhang, X.; Chen, D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA; pp. 1091–1100. [Google Scholar]
  18. Zhang, G.; Wang, Z.; Liu, B.; Gu, L.; Zhen, W.; Yao, W. A density map-based method for counting wheat ears. Front. Plant Sci. 2024, 15, 1354428. [Google Scholar] [CrossRef] [PubMed]
  19. Zhou, Q.; Huang, Z.; Liu, L.; Wang, F.; Teng, Y.; Liu, H.; Zhang, Y.; Wang, R. High-throughput spike detection and refined segmentation for wheat Fusarium Head Blight in complex field environments. Comput. Electron. Agric. 2024, 227, 109552. [Google Scholar] [CrossRef]
  20. Yang, Y.; Li, G.; Wu, Z.; Su, L.; Huang, Q.; Sebe, N. Weakly-Supervised Crowd Counting Learns from Sorting Rather than Locations. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–17. [Google Scholar]
  21. Liang, D.; Chen, X.; Xu, W.; Zhou, Y.; Bai, X. Transcrowd: Weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 2022, 65, 160104. [Google Scholar] [CrossRef]
  22. Li, Y.; Wu, X.; Wang, Q.; Pei, Z.; Zhao, K.; Chen, P.; Hao, G. CSNet: A Count-Supervised Network via Multiscale MLP-Mixer for Wheat Ear Counting. Plant Phenomics 2024, 6, 0236. [Google Scholar] [CrossRef]
  23. David, E.; Madec, S.; Sadeghi-Tehran, P.; Aasen, H.; Zheng, B.; Liu, S.; Kirchgessner, N.; Ishikawa, G.; Nagasawa, K.; Badhon, M.A. Global wheat head detection (GWHD) dataset: A large and diverse dataset of high-resolution RGB-labelled images to develop and benchmark wheat head detection methods. Plant Phenomics 2020, 2020, 3521852. [Google Scholar] [CrossRef] [PubMed]
  24. Ding, Z.; Zeng, F.; Li, H.; Zheng, J.; Chen, J.; Chen, B.; Zhong, W.; Li, X.; Wang, Z.; Huang, L. Identification of sweetpotato virus disease-infected leaves from field images using deep learning. Front. Plant Sci. 2024, 15, 1456713. [Google Scholar] [CrossRef]
  25. Zhang, D.; Chen, Z.; Luo, H.; Hu, G.; Zhou, X.; Gu, C.; Li, L.; Guo, W. Predicting wheat scab levels based on rotation detector and Swin classifier. Biosyst. Eng. 2024, 248, 15–31. [Google Scholar] [CrossRef]
  26. Cholakkal, H.; Sun, G.; Shahbaz Khan, F.; Shao, L. Object Counting and Instance Segmentation with Image-Level Supervision. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: New York, NY, USA; pp. 12389–12397. [Google Scholar]
  27. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  29. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  32. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  33. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R. Ultralytics/yolov5: V3.0. 2020. Available online: https://zenodo.org/records/3983579 (accessed on 25 November 2025).
  34. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on yolov8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 529–545. [Google Scholar]
  35. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  36. Zhang, D.; Luo, H.; Wang, D.; Zhou, X.; Li, W.; Gu, C.; Zhang, G.; He, F. Assessment of the levels of damage caused by Fusarium head blight in wheat using an improved YoloV5 method. Comput. Electron. Agric. 2022, 198, 107086. [Google Scholar] [CrossRef]
Figure 1. (A) Location of study area; (B) data processing workflow; (C) model architecture; (D) model evaluation and generalization verification.
Figure 2. Data acquisition and labeling workflow. (A) Data collection setup; (B) image patch segmentation; (C) expert labeling process.
Figure 3. Density distribution and sample visualization. (A) Spike count histogram; (B) density boxplots for overall and subsets; (CE) examples of low-, medium-, and high-density patches.
Figure 4. Distribution of spike counts in the GWHD2020 dataset.
Figure 5. Overall architecture of the MGG-ISCNet.
Figure 6. Schematic illustration of the MGG-ISCNet pipeline.
Figure 7. Regression fitting results on CW dataset: (A) VGG16 + FCHead; (B) VGG16 + LightHead; (C) VGG16 + FCHead + MGG; (D) VGG16 + LightHead + MGG (Proposed). Blue dots represent predicted–ground truth pairs; the dashed line denotes the 1:1 reference; the red solid line shows the fitted regression.
Figure 8. Visualization of MGG-ISCNet predictions on the CW dataset. The top images show the original CW dataset sample, while the bottom images present the model’s focus area on the target region in the form of a heatmap. Green numbers represent the model’s predictions, while black numbers represent the actual true labels.
Figure 9. Regression plots of different methods on GWHD2020: (A) Faster R−CNN; (B) YOLOv5n; (C) YOLOv8n; (D) YOLOv11n; (E) CSNet; (F) MGG−ISCNet (proposed). Blue dots denote predictions; dashed lines indicate 1:1 references; red lines show fitted regressions.
Figure 10. Visual comparison of counting results across methods on GWHD2020: (A) sample 1; (B) sample 2; (C) sample 3. Black: ground truth; red: instance-level detection; green: image-level counting.
Table 1. The data acquisition parameters.
Camera Angle | Mounting Height | Distance to Plants | Lighting Conditions | ISO Setting | Frame Rate | Resolution
Top-down | 80–150 cm | 60–130 cm | Outdoor sunny/cloudy | Auto exposure | 30 FPS | 1280 × 720
Table 2. Summary of the crested wheatgrass (CW) dataset.
Dataset | Size/Pixels | Num Images | Min/Counts | Max/Counts | Avg/Counts | Total/Counts
Overall | 256 × 256 | 2891 | 0 | 43 | 16.52 | 47,758
Train | 256 × 256 | 2312 | 0 | 43 | 16.42 | 37,952
Val | 256 × 256 | 289 | 0 | 41 | 17.17 | 4,963
Test | 256 × 256 | 290 | 0 | 38 | 16.70 | 4,843
Table 3. Statistics of the GWHD2020 dataset.
Dataset | Size | Num Images | Min | Max | Avg | Total
Train | 1024 × 1024 | 2737 | 0 | 116 | 42.86 | 117,318
Test | 1024 × 1024 | 685 | 0 | 97 | 43.24 | 29,621
Table 4. Ablation study of different backbone and head combinations on the CW dataset *.
Backbone | Head | MAE | RMSE | NMAE | NRMSE | R2 | Params/M
VGG16 | FCHead | 3.43 | 4.72 | 0.21 | 0.28 | 0.71 | 104.88
VGG16 | LightHead | 2.85 | 4.05 | 0.17 | 0.24 | 0.79 | 60.95
MobileNetV3 | FCHead | 4.01 | 5.52 | 0.24 | 0.33 | 0.60 | 97.30
MobileNetV3 | LightHead | 3.30 | 4.57 | 0.20 | 0.27 | 0.72 | 53.38
ResNet18 | FCHead | 3.32 | 4.58 | 0.20 | 0.27 | 0.72 | 108.93
ResNet18 | LightHead | 4.01 | 5.14 | 0.24 | 0.31 | 0.65 | 65.01
DarkNet53 | FCHead | 3.54 | 4.98 | 0.21 | 0.30 | 0.67 | 97.31
DarkNet53 | LightHead | 3.72 | 5.07 | 0.22 | 0.30 | 0.66 | 53.39
* Numbers in bold indicate the best performance.
Table 5. Performance comparison under different MGG hyperparameter settings on the CW dataset *.
Reduction | Kernel Sizes | Low MAE | Low RMSE | Mid MAE | Mid RMSE | High MAE | High RMSE | Overall MAE | Overall RMSE | NMAE | NRMSE
16 | {1, 3, 5} | 3.06 | 4.28 | 2.77 | 4.02 | 3.40 | 4.23 | 3.09 | 4.19 | 0.19 | 0.25
16 | {3, 5, 7} | 2.26 | 3.31 | 2.31 | 3.20 | 3.60 | 4.80 | 2.73 | 3.86 | 0.16 | 0.23
16 | {3, 7, 11} | 2.32 | 3.40 | 2.37 | 3.42 | 3.57 | 4.71 | 2.76 | 3.91 | 0.17 | 0.23
32 | {1, 3, 5} | 2.27 | 3.41 | 2.25 | 3.20 | 3.80 | 5.15 | 2.79 | 4.04 | 0.17 | 0.24
32 | {3, 5, 7} | 2.37 | 3.52 | 2.43 | 3.44 | 3.59 | 4.65 | 2.81 | 3.92 | 0.17 | 0.23
32 | {3, 7, 11} | 2.18 | 3.31 | 2.41 | 3.43 | 3.72 | 4.85 | 2.77 | 3.94 | 0.17 | 0.24
* Numbers in bold indicate the best performance.
Table 6. Component-wise ablation results of the MGG-ISCNet on the CW dataset *.
VGG16 | LightHead | MGG | MAE | RMSE | NMAE | NRMSE | R2 | Params/M
√ | - | - | 3.43 | 4.72 | 0.21 | 0.28 | 0.71 | 104.88
√ | √ | - | 2.85 | 4.05 | 0.17 | 0.24 | 0.79 | 60.95
√ | - | √ | 3.07 | 4.29 | 0.18 | 0.26 | 0.76 | 105.00
√ | √ | √ | 2.73 | 3.86 | 0.16 | 0.23 | 0.81 | 61.08
* ‘√’ indicates the strategy was used; ‘-’ indicates it was not. Numbers in bold indicate the best performance.
Table 7. Performance of image-level supervised methods on spike counting on the CW dataset *.
Method | Low MAE | Low RMSE | Mid MAE | Mid RMSE | High MAE | High RMSE | Overall MAE | Overall RMSE | NMAE | NRMSE
CSNet | 3.75 | 5.08 | 2.50 | 3.35 | 3.92 | 5.33 | 3.43 | 4.72 | 0.21 | 0.28
MGG-ISCNet (Ours) | 2.26 | 3.31 | 2.31 | 3.20 | 3.60 | 4.80 | 2.73 | 3.86 | 0.16 | 0.23
* Numbers in bold indicate the best performance.
Table 8. Counting performance of different methods on GWHD2020 *.
Method (Year) | Supervision | MAE | RMSE | NMAE | NRMSE | R2 | Params/M
Faster R-CNN (2015) | Instance-level | 11.01 | 14.90 | 0.25 | 0.34 | 0.50 | 136.71
YOLOv5n (2020) | Instance-level | 6.30 | 8.04 | 0.15 | 0.19 | 0.85 | 1.76
YOLOv8n (2023) | Instance-level | 10.07 | 12.35 | 0.23 | 0.29 | 0.66 | 3.01
YOLOv11n (2024) | Instance-level | 10.29 | 12.48 | 0.24 | 0.29 | 0.65 | 2.59
CSNet (2024) | Image-level | 5.85 | 7.45 | 0.14 | 0.17 | 0.88 | 104.88
MGG-ISCNet (Ours) | Image-level | 3.63 | 4.73 | 0.08 | 0.11 | 0.95 | 61.08
* Numbers in bold indicate the best performance.

Share and Cite

MDPI and ACS Style

Guan, L.; Ding, Z.; Song, M.; He, X.; Wang, Q.; Zan, R.; Gao, Z.; Li, X.; Zhao, Y.; Zhang, D. A Multi-Granularity Gated Image-Level Supervised Network (MGG-ISCNet) for Spike Counting in Agropyron cristatum (L.) Gaertn. Agronomy 2025, 15, 2805. https://doi.org/10.3390/agronomy15122805