1. Introduction
Crested wheatgrass, Agropyron cristatum (L.) Gaertn., a key forage grass species with both high nutritional value and significant ecological importance, plays a strategic role in livestock production and ecological restoration. The species is rich in high-quality protein and dietary fiber, making it an ideal feed source for ruminants [1]. However, the combined effects of global warming, increasing drought, and the growing demand for high-quality forage have placed severe pressure on its supply [2]. Statistics indicate that the annual market demand for crested wheatgrass has been increasing at an average rate of 4–5% [3], while the yield improvement achieved through genetic enhancement remains limited due to long breeding cycles and low efficiency [4]. Consequently, the gap between supply and demand continues to widen. To accelerate breeding progress, efficient and accurate field phenotyping of key agronomic traits is urgently required. Among these traits, spike number is a critical indicator for assessing the yield potential of crested wheatgrass. Traditional manual counting methods, however, are time-consuming, labor-intensive, and prone to human error, making them unsuitable for large-scale, high-throughput phenotyping. Therefore, developing rapid and precise automated spike-counting techniques is of great significance for accelerating cultivar selection and alleviating forage shortages.
Automated and high-throughput spike counting plays a crucial role in screening superior crested wheatgrass varieties, evaluating forage yield, and monitoring ecological restoration. Yet, current spike counting methods still rely mainly on manual field surveys, requiring laborious plant-by-plant inspection, which is inefficient and error-prone [5]. In contrast, semi-automated approaches have been explored for other Poaceae crops such as wheat. For example, a combination of color thresholding, UAV multispectral imagery, and feature fusion has been used to detect and count Fusarium-infected wheat spikes [6]. Building on this, later studies improved spike recognition accuracy in Fusarium head blight assessment by integrating both image and spectral information [7]. Texture descriptors have also been explored to distinguish spikes from surrounding foliage [8,9]. However, these traditional methods rely heavily on handcrafted features and fixed thresholds, resulting in limited robustness under complex field conditions. They often struggle to handle illumination variations, occlusion, and density differences. The problem is even more pronounced in crested wheatgrass fields, where dynamic lighting, weed occlusion, and the high visual similarity between spikes and leaves make spike segmentation difficult. Consequently, conventional feature-based methods exhibit poor generalization and unstable performance across diverse field scenarios, substantially reducing counting accuracy and practical applicability.
Recent progress in agricultural deep learning has introduced instance-level supervision into object counting research [10,11,12]. One category is detection-based counting under bounding-box supervision, where bounding boxes are manually annotated around individual spikes to train object detection models for localization and counting [13]. A two-stage Faster R-CNN framework has been employed to achieve accurate spike detection and counting, while single-stage CNN models, such as YOLO, have been explored to improve real-time counting performance in large-scale applications [14]. Bao et al. compared single- and two-stage approaches and proposed modifications tailored for dense spike conditions, achieving reliable counting performance across different scenes [15]. Another category is point-supervised density map counting, which requires only a single point annotation per spike to generate the supervisory signal. During training, these annotations are converted into density maps using Gaussian kernels or similar functions, transforming counting into a density regression problem. A convolutional neural network (CNN) is then trained to regress the density distribution using pixel-wise losses such as MSE or MAE. During inference, summing the predicted density map yields the total spike count without explicit localization. Refining Gaussian kernel parameters has proven effective in improving the alignment between predicted density maps and real spike distributions [16]. Li et al. proposed a Poisson-based loss function to mitigate bias in dense scenes [17]. Enhancements to CNN architectures have been shown to improve robustness against background noise [18].
Despite their effectiveness, instance-level supervised methods suffer from high annotation costs. Bounding boxes must be drawn for each spike, while point annotations require precise center localization [17]. In dense, fine-structured crested wheatgrass canopies, such annotation is prohibitively expensive. Moreover, bounding-box-based approaches are prone to overcounting or missing spikes in overlapping regions, and non-maximum suppression cannot fully resolve these issues [19]. Meanwhile, point-supervised methods rely on fixed Gaussian kernels that struggle to adapt to large morphological variations among genotypes, often assigning density values incorrectly to leaves or weeds and introducing noise.
Recognizing these challenges, recent studies have shifted toward image-level supervised paradigms, which regress the total object count per image without requiring any spatial annotation. Originating in crowd counting, this paradigm has been shown to reduce annotation cost significantly. For example, Yang et al. proposed a soft-label sorting network for count-based crowd density estimation [20], and Liang et al. introduced TransCrowd, a transformer-based model leveraging self-attention to capture global contextual dependencies [21]. In agriculture, a recent study introduced CSNet, applying image-level learning to wheat spike counting and effectively alleviating the annotation burden in dense agricultural images [22]. However, owing to substantial differences in spike morphology, density, and growth environment, directly transferring these methods to crested wheatgrass yields suboptimal results. To address these limitations, this study proposes MGG-ISCNet (Multi-Granularity Gated Image-level Supervised Counting Network), an image-level supervised model designed specifically for crested wheatgrass spike counting under complex field conditions. Compared with recent methods such as CSNet, the MGG-ISCNet improves on several key aspects. First, for feature fusion, the model adopts a multi-granularity gated dynamic fusion mechanism that adaptively weighs the importance of spike features at different scales. Second, for model structure, we design a regression head based on lightweight 1D convolution and global average pooling, which replaces the parameter-heavy fully connected layer in CSNet and significantly reduces the parameter count. To comprehensively evaluate the model's generalization ability, this study selected the Global Wheat Head Detection 2020 (GWHD2020) dataset [23] for verification. This choice rests on two considerations: first, no publicly available wheatgrass spike-counting dataset exists; second, wheat and wheatgrass share significant morphological similarities in spike structure, occlusion patterns, and other characteristics. As a widely recognized benchmark for cereal spike counting, GWHD2020 ensures the comparability and repeatability of cross-species evaluation results. The MGG-ISCNet proposed in this paper is a proof-of-concept automated counting method for breeding scenarios, aiming to significantly reduce annotation cost and improve counting efficiency and accuracy, thereby accelerating the screening of elite lines. The main contributions of this study are summarized as follows:
(1) This study is the first to introduce image-level weakly supervised counting to crested wheatgrass spike estimation, achieving high-precision counting without explicit localization and offering a practical solution for forage phenotyping.
(2) A multi-granularity gated fusion mechanism combined with a lightweight head design is proposed to adaptively integrate multi-scale features, effectively improving scale sensitivity and feature discrimination while maintaining a low parameter count.
(3) Comprehensive experiments on a custom crested wheatgrass dataset and cross-crop evaluation on public wheat datasets demonstrate the proposed model’s strong generalization, robustness, and potential for broad application in cereal crop phenotyping.
2. Materials and Methods
The overall workflow of this study, including the study area overview, data processing methods, model architecture, and evaluation strategies, is illustrated in Figure 1. The study area is located in Hohhot, Inner Mongolia, China, with the topography and major geographic features shown in Figure 1A. The data preprocessing procedure, shown in Figure 1B, begins by dividing the collected images into multiple patches (patch segmentation). Each patch was manually annotated by breeding experts to record the number of crested wheatgrass spikes, producing the ground-truth labels used for model training. The proposed deep learning model, MGG-ISCNet, is illustrated in Figure 1C. RGB images are first processed through a backbone network to extract multi-level features, followed by the Multi-Scale Patch Mixer (MPM) and Multi-Granularity Gating (MGG) modules for adaptive feature enhancement and information filtering. The final spike count is predicted through an end-to-end regression head optimized using the L1 loss function. Model performance evaluation and generalization verification are shown in Figure 1D: the left panel presents results on the crested wheatgrass (CW) dataset, while the right panel demonstrates transfer learning performance on the GWHD2020 dataset, allowing comparison between image-level and instance-level supervision methods.
2.1. Data Acquisition
Data were collected from the crested wheatgrass experimental field at Inner Mongolia Agricultural University (42.3° N, 119.5° E), as shown in Figure 1A. The cultivar used was Agropyron cristatum cv. Mengnong No.1, a drought- and cold-tolerant variety bred by the university that is widely used in northern grassland restoration and forage production and is representative of the regional vegetation. The field was flat, with chestnut soil, well-managed plots, and minimal weed interference. The area features a typical temperate grassland climate with naturally varying light conditions, capturing realistic plant structures, canopy density, and spatial distribution patterns of crested wheatgrass under field conditions.
An Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) was used for data collection. The camera employs active stereo and structured light fusion, offering high-accuracy depth sensing and robustness under variable illumination, making it suitable for complex agricultural environments. The camera was connected to a ThinkPad T480 laptop (Lenovo Group Limited, Beijing, China) via USB 3.0 for real-time data reception and buffering. The detailed acquisition parameters are summarized in Table 1.
Data were collected from May to August 2025, covering the key growth stages of crested wheatgrass from late heading to maturity, and acquisition was carried out under various weather conditions (sunny and cloudy) to capture natural variation in light, shadow, and canopy structure. Extreme disturbances (such as strong winds or violent shaking) were not actively introduced, but natural light fluctuations, shadow changes, and slight plant displacement still occurred during collection. The images were mainly collected at the milk stage (Zadoks scale Z71–Z79). In addition, a small number of plants from the late heading to flowering stages (Z41–Z69) were included, further enriching the diversity of spike morphology, density, and color contrast. RGB and depth data were acquired synchronously through a custom Python script based on Intel RealSense SDK 2.0 (https://github.com/IntelRealSense/librealsense (accessed on 1 December 2025)) and stored in ROS bag (.bag) format with strictly aligned timestamps; the RealSense Viewer was then used to extract, rename, and organize images in batches, providing a high-quality, multi-temporal, and multi-scene data foundation for annotation and model training. The process is shown in Figure 2A.
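The authors' acquisition script is not published; the following minimal sketch, assuming the pyrealsense2 bindings of Intel RealSense SDK 2.0, illustrates how synchronized RGB and depth streams can be recorded to a timestamp-aligned .bag file. The stream resolution, frame rate, and file name here are illustrative assumptions (the actual acquisition parameters are listed in Table 1).

```python
# Minimal capture sketch using pyrealsense2 (Intel RealSense SDK 2.0 Python bindings).
# Resolution/frame rate and the output filename are illustrative assumptions.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_record_to_file("cw_field_recording.bag")          # timestamp-aligned ROS bag
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)

pipeline.start(config)
try:
    for _ in range(300):                         # ~10 s at 30 fps
        frames = pipeline.wait_for_frames()      # blocks until a synced frameset arrives
        color, depth = frames.get_color_frame(), frames.get_depth_frame()
        if not color or not depth:
            continue                             # skip incomplete framesets
finally:
    pipeline.stop()
```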
2.2. Dataset Preparation
After acquisition, all image sequences were quality-checked to remove overexposed, motion-blurred, or depth-missing frames. In total, 214 RGB images were retained, representing diverse illumination and canopy density conditions. Depth information was not used in this study. The RGB images (1280 × 720 resolution) intentionally represent ordinary field camera quality to simulate real-world deployment and enhance robustness under low-quality imaging conditions.
To adapt the data to the input requirements of the deep learning model and improve training efficiency, each 1280 × 720 image was padded with black pixels to 1280 × 768 so that it could be divided exactly into non-overlapping 256 × 256 pixel patches (Figure 2B). This method uses a non-overlapping sliding window with a fixed step size of 256 [24,25]. In theory, each image yields 15 patches; after eliminating invalid patches located in blurred edge regions, a total of 2891 valid patches were extracted from the 214 original images for subsequent experiments. Each patch was labeled at the image level by two researchers with expertise in grassland science, who recorded the total number of visible spikes (a non-negative integer). During annotation, the experts moved the mouse cursor across the spike regions of each image; this visual guidance helped them focus on and identify the key morphological structures of spikes (such as awns and spikelet patterns) while mentally tallying the visible spikes before recording the total [26]. The whole process produced no physical location annotation files, such as points or boxes (Figure 2C). Inconsistent cases were reviewed by a senior expert to finalize the labels. This simple image-level annotation approach significantly reduced manual cost, cutting the average labeling time from 5–8 min to under 1 min, without requiring specialized tools (e.g., Labelme, CVAT); basic tools such as Excel or plain text files were sufficient. This annotation method fully satisfies the requirements of density regression and count estimation tasks.
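The padding-and-tiling step can be expressed compactly, as in the sketch below. This is our illustration of the procedure described above, not the authors' code; the file path handling is an assumption, and the paper's blurred-edge filtering criterion is not specified, so it is omitted.

```python
import cv2
import numpy as np

PATCH = 256

def tile_image(path: str) -> list[np.ndarray]:
    """Pad a 1280x720 frame to 1280x768 with black rows, then tile into 256x256 patches."""
    img = cv2.imread(path)                          # (720, 1280, 3)
    h, w = img.shape[:2]
    pad_h = (PATCH - h % PATCH) % PATCH             # 48 rows: 720 -> 768
    pad_w = (PATCH - w % PATCH) % PATCH             # 0 columns: 1280 is already divisible
    img = cv2.copyMakeBorder(img, 0, pad_h, 0, pad_w,
                             cv2.BORDER_CONSTANT, value=(0, 0, 0))
    patches = []
    for y in range(0, img.shape[0], PATCH):         # non-overlapping window, stride 256
        for x in range(0, img.shape[1], PATCH):
            patches.append(img[y:y + PATCH, x:x + PATCH])
    return patches                                  # 3 x 5 = 15 patches per image
```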
To ensure fair model evaluation and consistent data distribution, the dataset was split into training, validation, and test sets at an 8:1:1 ratio while maintaining density balance across subsets (Table 2). The final dataset contained 2891 patches with a total of 47,758 annotated spikes, averaging 16.52 spikes per patch and ranging from 0 to 43 spikes, thereby capturing the substantial complexity and diversity of field conditions.
The overall histogram shows a slightly right-skewed distribution, with a skewness of +0.133 (Figure 3A). To evaluate model robustness under varying spike densities, the 33rd and 66th percentiles (13 and 20 spikes, respectively) were used to divide samples into three density levels: low (Low), medium (Mid), and high (High). Boxplots for the overall and subset distributions (Figure 3B) confirm consistent statistics across splits. The three density groups contained 1074, 879, and 938 samples, respectively. Example patches (Figure 3C–E) illustrate that in low-density scenes, spikes are well separated with clear boundaries, facilitating counting; in medium-density scenes, partial occlusion and overlap increase difficulty; and in high-density scenes, heavy overlap and canopy stacking blur boundaries, posing significant challenges for accurate detection and counting. This density-level division helps assess model performance under varying scene complexity.
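The percentile-based binning is straightforward to reproduce. In the sketch below, the Poisson-sampled `counts` array is a synthetic stand-in for the real patch labels (which are not public), so the printed group sizes will differ from the 1074/879/938 reported above.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(16.5, size=2891)           # synthetic stand-in for the patch labels

lo, hi = np.percentile(counts, [33, 66])        # ~13 and ~20 on the real CW dataset
levels = np.digitize(counts, bins=[lo, hi])     # 0 = Low, 1 = Mid, 2 = High
for name, k in zip(["Low", "Mid", "High"], range(3)):
    print(name, int((levels == k).sum()))
```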
To evaluate the model's cross-crop generalization ability, in addition to the self-built CW dataset, this study also used the publicly available wheat head dataset GWHD2020. The dataset includes 3422 images (1024 × 1024 pixels), split 8:2 into 2737 training and 685 test images. The number of wheat heads per image varied widely (0–116 in training, 0–97 in testing, with an average of about 43), posing challenges for stable regression (Table 3, Figure 4).
2.3. Model Architecture
The proposed MGG-ISCNet follows an end-to-end regression architecture, as shown in Figure 5. The name MGG-ISCNet reflects its core components: a Multi-Granularity Gating mechanism for adaptive multi-scale feature fusion and an image-level supervised counting strategy characteristic of weakly supervised approaches. The model takes a 3 × 512 × 512 RGB image as input. A pretrained backbone network first extracts deep semantic features, producing a feature map of size 512 × 64 × 64. These features are then passed to a feature aggregation layer inspired by CSNet, incorporating a Multi-Granularity Gating (MGG) mechanism. The backbone output is processed by parallel multi-scale MLP-Mixer branches [27], each handling coarse-, medium-, and fine-grained local contexts. Their token outputs are concatenated into a unified 1D representation, followed by learnable gating weights that dynamically evaluate the importance of each scale and generate adaptively fused features. This process enhances key region responses while suppressing redundancy. Finally, a lightweight 1D convolutional regression head compresses the gated features via two Conv1D layers, followed by adaptive global average pooling and a fully connected layer to output the final scalar count prediction.
2.3.1. Backbone Network
The first ten convolutional layers of VGG16 [28] were used as the backbone, consisting of four convolutional blocks (Layer 1–Layer 4). Each block uses 3 × 3 kernels, stride 1, and padding 1, followed by ReLU activations. Layer 1 and Layer 2 each include two convolutional layers (64 and 128 channels, respectively), each followed by 2 × 2 max pooling; Layer 3 contains three 256-channel convolutions with pooling; Layer 4 has three 512-channel convolutions without further pooling, producing the final [512, 64, 64] feature map (denoted as B4). ImageNet-pretrained weights were used to improve generalization.
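In torchvision, this truncation corresponds to slicing the pretrained VGG16 feature extractor just before its fourth max-pooling layer. The slice index below is our reading of the description above, not code published by the authors; note the pretrained weights are downloaded on first use.

```python
import torch
from torch import nn
from torchvision.models import vgg16, VGG16_Weights

# features[:23] keeps exactly ten 3x3 convolutions and the first three 2x2
# max-pool layers (indices 4, 9, 16), matching the block description above.
backbone = nn.Sequential(*list(vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features)[:23])

x = torch.randn(1, 3, 512, 512)
b4 = backbone(x)
print(b4.shape)  # torch.Size([1, 512, 64, 64])
```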
2.3.2. Feature Aggregation
To enable effective multi-scale feature fusion and contextual modeling, we introduce a Multi-Scale Patch Mixer (MPM) module on top of the high-level features (denoted as B4, with shape [512, 64, 64]) extracted from the backbone network. Inspired by the MLP-Mixer architecture [27], the MPM is adapted with targeted modifications in patch partitioning and channel modeling to jointly capture spatial dependencies at multiple granularities.
Specifically, the input feature map is divided into three sets of non-overlapping patches at different scales: 4 × 4, 8 × 8, and 16 × 16, corresponding to spatial granularities of 16 × 16, 8 × 8, and 4 × 4, respectively. This yields coarse-, medium-, and fine-grained local contextual representations. Each scale branch then undergoes independent channel compression via a convolutional layer, reducing the channel dimension from 512 to 256 to lower computational cost and improve feature compactness. The resulting patch sequences are fed into dedicated mixer layers, where alternating token-mixing and channel-mixing operations model global dependencies across the sequence dimension and apply nonlinear transformations along the channel dimension, producing more discriminative multi-scale token representations.
Finally, the output sequences from the three branches—of lengths 16, 64, and 256, each with 256 channels—are concatenated along the token dimension to form a unified sequence of shape [336, 256]. Compared to conventional strategies like average pooling or naive concatenation, the proposed MPM explicitly enables cross-scale interaction and implicit compensation among different spatial granularities, significantly enhancing the model’s cross-scale consistency and contextual understanding.
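A condensed sketch of one possible MPM realization follows. The paper specifies the patch sizes, the 512-to-256 channel compression, and the concatenated output shape [336, 256]; patch embedding via strided convolution and the mixer hidden widths here are our assumptions.

```python
import torch
from torch import nn

class MixerLayer(nn.Module):
    """One MLP-Mixer layer: token mixing across the sequence, then channel mixing."""
    def __init__(self, tokens: int, dim: int, hidden: int = 512):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, hidden), nn.GELU(), nn.Linear(hidden, tokens))
        self.chan_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                                    # x: [B, T, C]
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

class MPM(nn.Module):
    """Three parallel branches over patch sizes 4/8/16, concatenated along the token axis."""
    def __init__(self, in_ch: int = 512, dim: int = 256, patch_sizes=(4, 8, 16), grid: int = 64):
        super().__init__()
        self.embeds = nn.ModuleList(                         # compress channels + patchify
            nn.Conv2d(in_ch, dim, kernel_size=p, stride=p) for p in patch_sizes)
        self.mixers = nn.ModuleList(
            MixerLayer((grid // p) ** 2, dim) for p in patch_sizes)

    def forward(self, b4):                                   # b4: [B, 512, 64, 64]
        outs = []
        for embed, mixer in zip(self.embeds, self.mixers):
            tok = embed(b4).flatten(2).transpose(1, 2)       # [B, (64/p)^2, 256]
            outs.append(mixer(tok))
        return torch.cat(outs, dim=1)                        # [B, 256+64+16=336, 256]

print(MPM()(torch.randn(1, 512, 64, 64)).shape)              # torch.Size([1, 336, 256])
```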
In natural-field crested wheatgrass spike counting, traditional multi-scale fusion methods, including simple concatenation and handcrafted weighted averaging, often lack adaptability. They fail to dynamically adjust the contribution of each scale according to scene complexity, leading to degraded performance under challenging conditions like illumination variation, density heterogeneity, and occlusion. To address this, we propose a Multi-Granularity Gating (MGG) mechanism that learns adaptive fusion weights and enables dynamic cross-scale integration, allowing the model to selectively emphasize coarse contextual cues or fine structural details as needed.
The MGG module takes the concatenated feature sequence from the MPM as input and processes it through three parallel 1D convolutional branches with different kernel sizes $k_i$ to capture sequence patterns at different receptive fields. For the $i$-th branch, the feature mapping is computed as follows:

$$M_i = \sigma\Big(\mathrm{BN}\big(\mathrm{Conv1D}_{k_i}\big(\delta\big(\mathrm{BN}\big(\mathrm{Conv1D}_{k_i}(X)\big)\big)\big)\big)\Big), \quad i = 1, 2, 3 \tag{1}$$

In Equation (1), $k_i$ represents the kernel size of the $i$-th branch, $X$ is the input, $\delta$ denotes ReLU activation, $\sigma$ is the sigmoid function, and BN stands for batch normalization; the first convolution compresses the channel dimension by a reduction ratio $r$ and the second restores it (the effects of $r$ and the kernel sizes are analyzed in Section 3.2). This yields a set of importance-aware response maps $M_i$, each reflecting the relevance of features at a specific scale.

To fuse these branches adaptively, we introduce learnable parameters $\alpha_i$ and compute the dynamic fusion weights $w_i$ via softmax normalization, as in Equation (2):

$$w_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{3}\exp(\alpha_j)} \tag{2}$$

The final global scale-aware response map is obtained by weighted summation:

$$M = \sum_{i=1}^{3} w_i\, M_i \tag{3}$$

In Equation (3), $M$ represents the final global scale-aware response map, and $M_i$ represents the response map generated by the $i$-th branch.

This response map is then element-wise multiplied with the original input to reweight features based on multi-granularity importance:

$$\tilde{X} = M \odot X \tag{4}$$

In Equation (4), ⊙ denotes element-wise multiplication, and $X$ is the input. This design allows the model to automatically enhance fine-scale responses in dense regions and strengthen global contour cues in sparse areas, achieving dynamic scale selection and synergistic feature enhancement.
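The sketch below is our reading of Equations (1)–(4), not the authors' code. The bottleneck form of each branch and the default kernel sizes {3, 5, 7} with r = 16 are taken from the parameter sensitivity analysis in Section 3.2.

```python
import torch
from torch import nn

class MGG(nn.Module):
    """Multi-Granularity Gating over a token sequence, following Eqs. (1)-(4)."""
    def __init__(self, dim: int = 256, kernels=(3, 5, 7), r: int = 16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(                                    # Eq. (1): bottleneck gate per branch
                nn.Conv1d(dim, dim // r, k, padding=k // 2),
                nn.BatchNorm1d(dim // r), nn.ReLU(inplace=True),
                nn.Conv1d(dim // r, dim, k, padding=k // 2),
                nn.BatchNorm1d(dim), nn.Sigmoid(),
            ) for k in kernels)
        self.alpha = nn.Parameter(torch.zeros(len(kernels)))  # learnable fusion logits

    def forward(self, x):                                     # x: [B, T, C] from the MPM
        x = x.transpose(1, 2)                                 # Conv1d expects [B, C, T]
        w = torch.softmax(self.alpha, dim=0)                  # Eq. (2): dynamic weights
        m = sum(wi * branch(x) for wi, branch in zip(w, self.branches))  # Eq. (3)
        return (m * x).transpose(1, 2)                        # Eq. (4): reweight, back to [B, T, C]

print(MGG()(torch.randn(2, 336, 256)).shape)                  # torch.Size([2, 336, 256])
```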
2.3.3. Lightweight Counting Head
To further improve computational efficiency while maintaining strong representational capacity, we design a lightweight regression head (LightHead) based on 1D convolutions, replacing the conventional parameter-heavy fully connected layers (Figure 6). This head compresses features efficiently while preserving local correlations along the sequence dimension and enables end-to-end density regression through global adaptive pooling.
Given the fused feature sequence $\tilde{X} \in \mathbb{R}^{B \times C \times T}$ (where $B$ is the batch size, $C = 256$, and $T = 336$), the LightHead first applies two successive 1D convolutional layers for progressive channel reduction: the first reduces channels from 256 to 128, and the second from 128 to 64. Each uses a 3 × 1 kernel with padding = 1 to maintain the sequence length, followed by batch normalization and ReLU activation $\delta$. The calculation process is shown in Equations (5) and (6):

$$X' = \delta\big(\mathrm{BN}\big(\mathrm{Conv1D}_{3}(\tilde{X})\big)\big) \tag{5}$$

$$X'' = \delta\big(\mathrm{BN}\big(\mathrm{Conv1D}_{3}(X')\big)\big) \tag{6}$$

$X'$ denotes the features after the first convolutional layer (outputting 128 channels), and $X''$ the features after the second (outputting 64 channels). Next, global adaptive average pooling (AdaptiveAvgPool1d) collapses the sequence length to 1, and the result is flattened into a vector $z \in \mathbb{R}^{B \times 64}$:

$$z = \mathrm{Flatten}\big(\mathrm{AvgPool}(X'')\big) \tag{7}$$

Finally, a linear layer maps this global representation to a single scalar prediction, the final estimated count $\hat{y}$:

$$\hat{y} = W z + b \tag{8}$$
The linear layer has an input dimension of 64 and outputs a single value. By combining channel compression with global aggregation, this lightweight head drastically reduces model parameters while retaining sufficient expressive power, enabling accurate and efficient end-to-end counting.
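A minimal PyTorch sketch of Equations (5)–(8) follows; the [B, T, C] input layout (transposed internally for Conv1d) is our assumption, and the exact layer composition may differ from the authors' implementation.

```python
import torch
from torch import nn

class LightHead(nn.Module):
    """Lightweight regression head: two Conv1d stages, global pooling, linear output."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(dim, 128, kernel_size=3, padding=1),    # Eq. (5): 256 -> 128
            nn.BatchNorm1d(128), nn.ReLU(inplace=True),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),     # Eq. (6): 128 -> 64
            nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),                          # Eq. (7): collapse T to 1
        )
        self.fc = nn.Linear(64, 1)                            # Eq. (8): scalar count

    def forward(self, x):                                     # x: [B, T=336, C=256]
        x = self.body(x.transpose(1, 2)).flatten(1)           # [B, 64]
        return self.fc(x).squeeze(1)                          # [B]

print(LightHead()(torch.randn(2, 336, 256)).shape)            # torch.Size([2])
```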
2.4. Model Training and Evaluation
2.4.1. Experimental Setup
Training was conducted on a high-performance workstation equipped with an NVIDIA GeForce RTX 5090D GPU and an AMD Ryzen Threadripper 7960X 24-core CPU, running Windows 11 and Python 3.10. Input images were resized to 512 × 512 pixels. The batch size was 32, and training lasted 50 epochs. The optimizer was Stochastic Gradient Descent (SGD) with an initial learning rate of 1 × 10−4. The model was trained using the L1 loss (Mean Absolute Error, MAE) to minimize prediction deviation.
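This optimization setup maps directly to a few lines of PyTorch. In the sketch below, the model and dataloader are placeholders (the real model composes the backbone, MPM, MGG, and LightHead sketched above, and the loader yields batches of 32 image/count pairs from the CW training split); the stated hyperparameters are taken from the text.

```python
import torch
from torch import nn

# Stand-ins: replace with the composed MGG-ISCNet and a real DataLoader.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 1))
train_loader: list = []

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)      # SGD, initial lr 1e-4
criterion = nn.L1Loss()                                       # MAE between counts

for epoch in range(50):                                       # 50 epochs
    model.train()
    for images, counts in train_loader:
        images, counts = images.to(device), counts.float().to(device)
        preds = model(images).squeeze(1)                      # scalar count per image
        loss = criterion(preds, counts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```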
2.4.2. Model Evaluation Indicators
Model performance was quantitatively assessed using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Coefficient of Determination (R²). MAE measures the average prediction deviation, RMSE reflects prediction stability and dispersion (penalizing larger errors), and R² evaluates the goodness of fit. These indicators are computed as shown in Equations (9)–(13):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2} \tag{10}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{11}$$

$$\mathrm{NMAE} = \frac{\mathrm{MAE}}{\bar{y}} \tag{12}$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \tag{13}$$

Here, $\hat{y}_i$ denotes the predicted number of target crested wheatgrass spikes in the $i$-th image; $y_i$ denotes the ground-truth number of target spikes in the $i$-th image; $\bar{y}$ represents the mean of the ground-truth spike counts across all test images; and $N$ is the total number of images in the test set.
This evaluation metric system jointly accounts for the absolute error magnitude, error variability, and the model’s overall fitting capability, thereby providing a comprehensive and objective assessment of the counting model’s performance.
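For reference, all five metrics can be computed with a few lines of NumPy. This is a standalone sketch with synthetic predictions, not the authors' evaluation code.

```python
import numpy as np

def count_metrics(y_pred: np.ndarray, y_true: np.ndarray) -> dict:
    """MAE, RMSE, R^2, and their count-normalized variants (Eqs. 9-13)."""
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    mean_count = y_true.mean()
    return {"MAE": mae, "RMSE": rmse, "R2": 1 - ss_res / ss_tot,
            "NMAE": mae / mean_count, "NRMSE": rmse / mean_count}

y_true = np.array([12.0, 18.0, 25.0, 9.0, 31.0])   # synthetic ground-truth counts
y_pred = np.array([13.5, 16.0, 27.0, 10.0, 29.5])  # synthetic predictions
print(count_metrics(y_pred, y_true))
```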
3. Results and Analysis
3.1. Ablation Study on Different Backbones
To systematically evaluate the impact of different backbone–head combinations on crested wheatgrass spike counting accuracy, the neck module was fixed as the Multi-Scale Patch Mixer (MPM), and four representative backbone networks, VGG16, MobileNetV3 [29], ResNet18 [30], and DarkNet53 [31], were compared under two regression heads: FCHead and LightHead. The quantitative results on the test set are summarized in Table 4. Among all combinations, VGG16 + LightHead achieved the best overall performance, with an MAE of 2.85, RMSE of 4.05, and R² of 0.79, while maintaining only 60.95 M parameters. In contrast, VGG16 + FCHead required 104.88 M parameters but yielded higher errors (MAE = 3.43, RMSE = 4.72), indicating that LightHead not only reduced the parameter count substantially but also improved prediction accuracy.
Further analysis revealed that although MobileNetV3 and DarkNet53 are lightweight architectures, their feature representation capacity is limited: MobileNetV3 + LightHead achieved an MAE of 3.30, while DarkNet53 + LightHead reached 3.72, suggesting difficulty in modeling fine-grained details in complex field environments. ResNet18 presented a more balanced performance; however, when paired with LightHead, its accuracy decreased (MAE increased from 3.32 to 4.01), likely due to the incompatibility between its deeper structure and the lightweight head.
In summary, VGG16 combined with LightHead achieved the best balance between accuracy and efficiency and was thus adopted as the default configuration in subsequent experiments. The results verify the importance of the joint design between the backbone and regression head. LightHead employs two-stage 1D convolutions and global pooling for efficient dimensionality reduction, reducing parameters by approximately 41.7% while improving accuracy, demonstrating an excellent trade-off between computational cost and precision.
3.2. Parameter Sensitivity Analysis
To optimize the Multi-Granularity Gating (MGG) mechanism, the effects of the channel reduction ratio (r) and receptive field configuration (kernel sizes) were examined across the low-, medium-, and high-density subsets as well as the overall test set (Table 5). Three kernel size combinations were tested: {1, 3, 5}, {3, 5, 7}, and {3, 7, 11}, under a fixed VGG16 backbone with LightHead and reduction ratios of 16 and 32. The rationale for this design is that smaller kernels emphasize detail while larger kernels capture semantics, and the three combinations cover a progressive receptive field range from local to wider context, enabling a systematic evaluation of the MGG's ability to integrate multi-scale information.
When r = 16 and the kernel size combination was {3, 5, 7}, the model achieved the best overall performance (overall MAE = 2.73), particularly excelling in low- and medium-density scenes (MAE = 2.26 and 2.31, respectively). This indicates that moderate receptive fields effectively integrate local details and contextual cues, making them suitable for sparse target localization. Smaller receptive fields failed to capture spatial relationships between spikes, while overly large ones introduced noise and instability. In high-density scenarios, all configurations showed increased error, reflecting the intrinsic difficulty of severe occlusion. Increasing r to 32 led to a consistent decline in performance, suggesting that retaining sufficient channel dimensions benefits multi-scale feature fusion.
3.3. Ablation Study on the MGG-ISCNet
To verify the contribution of each component in the MGG-ISCNet, four ablation settings were designed using VGG16 + FCHead as the baseline: (1) Baseline; (2) Baseline + LightHead; (3) Baseline + MGG; (4) Full model (LightHead + MGG). Results are summarized in Table 6 and visualized in Figure 7.
Results show that introducing LightHead alone substantially improved performance: MAE decreased from 3.43 to 2.85, R² rose from 0.71 to 0.79, and parameters were reduced by 41.7% (from 104.88 M to 60.95 M), validating its efficiency–accuracy advantage. Adding MGG to the FCHead baseline increased parameters by only 0.12 M but reduced MAE to 3.07 and improved R² to 0.76, indicating that MGG enhances spatial dependencies and feature robustness.
Combining both modules yielded the best results (MAE = 2.73, RMSE = 3.86, R² = 0.81, parameters = 61.08 M). The visualization in Figure 8 further demonstrates that MGG strengthens multi-scale contextual modeling, while LightHead enables efficient feature compression, together achieving superior generalization under complex conditions.
3.4. Comparative Experiments with Different Networks
To further validate the superiority of the MGG-ISCNet, it was compared with the image-level supervised method CSNet under identical settings (Table 7). The MGG-ISCNet achieved an overall MAE of 2.73 and RMSE of 3.86, demonstrating stronger feature modeling and regression capabilities. In low-density scenes, CSNet suffered from an insufficient receptive field (MAE = 3.75), while the MGG-ISCNet reduced the error to 2.26. In medium-density regions, the MGG-ISCNet achieved comparable accuracy with lower complexity (2.31 vs. 2.50). In the high-density subset, MAE and RMSE decreased by 8.2% and 9.9%, respectively.
Overall, the MGG-ISCNet consistently outperformed the baseline across all density conditions, particularly in challenging low- and high-density scenarios, confirming its effectiveness and generalization capability for weakly supervised agricultural counting tasks.
3.5. Comparative Experiments on Public Dataset
The comparative study evaluated our method against several benchmarks: instance-level detectors, including Faster R-CNN [32], YOLOv5n [33], YOLOv8n [34], and YOLOv11n [35], as well as image-level regression methods such as CSNet. The comparison results on GWHD2020 show that the MGG-ISCNet achieves the best performance (MAE = 3.63, RMSE = 4.73, R² = 0.95), surpassing all instance-level and image-level competitors, as shown in Table 8. Note that NMAE and NRMSE are normalized by the average count avg = 43.24 (Table 3). The regression plot in Figure 9 shows that the instance-level detectors systematically undercount (points below the 1:1 line) due to occlusion, spike overlap, and high visual similarity between targets. In contrast, the image-level supervision of the MGG-ISCNet avoids dependence on localization and exhibits robust performance in dense scenes.
To compare the counting performance of different methods more intuitively, visualization results for the original images and the compared counting methods are shown in Figure 10. In the legend, the true value, instance-level supervised counting results, and image-level supervised counting results are shown in black, red, and green, respectively. The panels are arranged as follows: ground-truth annotation, followed by the results of Faster R-CNN, YOLOv5n, YOLOv8n, YOLOv11n, CSNet, and the MGG-ISCNet (the method proposed in this study).