GG-YOLO: A Lightweight Dual-Path Attention Detector with Dynamic Sampling for Dense Wheat Spike Detection

Gao, Guohong; Zhou, Fucheng; Xu, Lijun; Zhang, Jiaxin; Li, Xueyong

doi:10.3390/agronomy16121156


by Guohong Gao ¹, Fucheng Zhou ¹, Lijun Xu ²,*, Jiaxin Zhang ¹ and Xueyong Li ¹

¹ School of Computer Science and Technology, Henan Institute of Science and Technology, Xinxiang 453003, China

² School of Information Engineering, Xinxiang University, Xinxiang 453003, China

* Author to whom correspondence should be addressed.

Agronomy 2026, 16(12), 1156; https://doi.org/10.3390/agronomy16121156 (registering DOI)

Submission received: 24 April 2026 / Revised: 5 June 2026 / Accepted: 10 June 2026 / Published: 12 June 2026

(This article belongs to the Section Precision and Digital Agriculture)


Abstract

Accurate wheat spike detection is essential for crop phenotyping and yield estimation, but real-world field conditions, such as dense spike overlap, environmental domain shifts, and degradation-induced failures like motion blur, pose significant challenges. Achieving robust perception under these circumstances while maintaining a strict accuracy-efficiency trade-off for edge devices remains a pressing research problem. To overcome these limitations, we propose GG-YOLO, a unified lightweight detection framework tailored for complex agricultural environments. Rather than simply recombining existing lightweight modules, GG-YOLO integrates three original structural adaptations. First, a Dual-path Attentive Ghost Mechanism (DAGM) introduces gradient-guided attention modulation to enhance feature discrimination and explicitly resolve feature confusion in dense, overlapping regions. Second, a C3Ghost module combines multi-branch aggregation with linear feature generation, reducing parameter redundancy in the prediction head by approximately 31% compared to the standard YOLOv8s without sacrificing semantic capacity. Third, DSample, a dynamic upsampling operator featuring an original dual-mode adaptive mechanism, recovers fine-grained spatial details during multi-scale feature pyramid fusion. Extensive cross-dataset experiments on the GlobalWheat2020 and HNKJXYwheat datasets validate the model's resilience to domain shifts and varying growth stages. GG-YOLO achieves a precision of 94.35%, a recall of 91.93%, and a state-of-the-art mAP@50 of 96.47%. Furthermore, the model contains only 7.89 M parameters and requires 20.4 GFLOPs, reaching an inference speed of 165 FPS on a desktop GPU and a validated real-time speed of 64 FPS on an NVIDIA Jetson edge computing platform. These results demonstrate that GG-YOLO establishes a superior accuracy-efficiency frontier, making it highly reliable for real-time field deployment in precision agriculture.

Keywords: wheat ear detection; target detection; YOLOv8; lightweighting

1. Introduction

Wheat spike detection is critical for large-scale crop phenotyping and yield estimation. As precision agriculture and intelligent breeding systems advance, the automated and accurate detection of wheat spikes in field imagery has become indispensable. Nevertheless, real-world agricultural environments present substantial challenges for object detection algorithms, primarily due to high target density, severe occlusion, background clutter, and significant scale variations [1,2,3].

Wheat spike detection differs significantly from general object detection tasks. First, the small size and dense distribution of wheat spikes frequently cause feature confusion and severe instance overlapping [4,5,6]. Second, complex field conditions—such as wind-induced motion blur and fluctuating illumination—substantially degrade the discriminative capability of conventional convolutional features. Finally, practical agronomic applications often require deployment on resource-constrained edge devices, complicating the balance between high detection accuracy and real-time inference [7,8].

While deep learning has opened new avenues for automated plant phenotyping, current mainstream object detectors largely follow two paradigms: two-stage methods (the R-CNN family) that employ a region proposal network (RPN) for high localization accuracy [9], and single-stage methods (YOLO [10,11] and SSD [12,13]) that emphasize fast, end-to-end inference. Nevertheless, when deployed in domain-specific scenarios such as near-earth observation (NEO) imagery of wheat canopies, these general-purpose detectors tend to degrade substantially. This degradation is primarily attributed to (i) affine morphological deformations and dense multi-scale targets, (ii) severe texture attenuation and occlusion, and (iii) the limited decoupling capacity of hierarchical features, which often leads to the loss of fine morphological details in clustered spike heads [14,15,16,17]. Furthermore, standard convolutional operators struggle to capture the curvature and anisotropic structures characteristic of wheat ears [18,19].

Although recent lightweight YOLO-based architectures show promise for mobile and edge computing, they still struggle to fully resolve these issues [20,21,22]. They primarily reduce parameters via simplified convolutions or model pruning, often at the expense of feature representation capacity in dense scenarios. Additionally, traditional fixed interpolation operations fail to adaptively reconstruct spatial details for small, overlapping targets, further restricting their practical applicability in the field [23,24].

To overcome these limitations, we propose GG-YOLO, a lightweight detection framework specifically tailored for dense wheat spike detection. Rather than relying on isolated lightweight modules, GG-YOLO employs a coordinated design strategy that jointly optimizes feature representation, computational efficiency, and spatial reconstruction. The framework integrates three complementary components: a dual-path attentive feature extraction mechanism to enhance geometric representation, a lightweight feature generation module to minimize computational redundancy, and a dynamic sampling operator to refine spatial reconstruction during multi-scale feature fusion.

The main contributions of this work are summarized as follows:

(i) A Dual-path Attentive Ghost Mechanism (DAGM) is developed to enhance feature discrimination in densely overlapping wheat spike regions. By integrating gradient-guided attention with adaptive feature fusion, DAGM strengthens the representation of spike boundaries and suppresses feature ambiguity caused by occlusion and background interference. This design enables more accurate localization and recognition of closely adjacent wheat spikes under complex field conditions.
(ii) A dual-mode adaptive sampling module (DSample) is introduced to improve multi-scale feature fusion and spatial detail preservation. Building upon the dynamic sampling paradigm, DSample incorporates a Low-Power (LP) mode and a Pixel-Blending (PB) mode to balance computational efficiency and boundary reconstruction capability. This design enhances localization precision while maintaining deployment flexibility across different hardware platforms.
(iii) A lightweight detection head based on C3Ghost is constructed to reduce computational redundancy without sacrificing detection performance. By replacing the original C2f structure in the YOLOv8 detection head, the proposed design decreases model complexity and computational cost while preserving the spatial semantics required for dense-object detection tasks.
(iv) System-level contribution: the integration of these modules into GG-YOLO creates a unified framework tailored for agricultural applications, achieving a state-of-the-art mAP@50 of 96.47% at 165 FPS on the GlobalWheat2020 dataset.

2. Related Work

Although significant progress has been made in wheat spike detection, several formidable challenges remain. First, the dense distribution of wheat spikes often leads to severe occlusion and overlapping, complicating accurate localization. Second, complex field environments introduce background interference, such as leaves, stems, and varying illumination. Third, many conventional detection methods rely on heavy convolutional backbones, incurring high computational costs that prohibit deployment on edge devices. While recent studies have explored attention mechanisms, multi-scale feature fusion, and lightweight convolutional designs, most approaches fail to strike an optimal balance, either prioritizing accuracy at the expense of efficiency or adopting lightweight structures that severely compromise feature representation.

2.1. Convolutional Neural Networks in Wheat Ear Detection

Convolutional Neural Networks (CNNs) have fundamentally transformed object detection by demonstrating superior feature extraction capabilities. For instance, EfficientNet replaces the traditional one-dimensional scaling strategy with a compound scaling method, establishing an efficient baseline for high-density detection tasks [25]. The advent of Region-based CNNs (R-CNN) further improved detection accuracy through unified feature extraction and ROI pooling mechanisms [26]. In the single-stage domain, the YOLO (You Only Look Once) series has evolved rapidly due to its dominant performance in real-time detection [27]. To adapt these architectures for specific agricultural needs, researchers have introduced micro-scale detection layers and multi-scale fusion strategies tailored for UAV imagery [28], as well as Convolutional Block Attention Modules (CBAM) to enhance target focus while suppressing background noise [29]. Despite these advancements, the underlying feature interaction mechanisms and multi-scale designs require deeper exploration to address the persistent challenges of dense, occluded field environments.

2.2. Complex Background Dense Target Detection

Detecting dense targets against complex backgrounds is complicated by feature confusion, which arises from dense spatial distributions and cluttered background information. Foundational work introduced structures like the Feature Pyramid Network (FPN) to facilitate multi-scale detection, significantly improving performance for small objects in cluttered scenes [30]. Subsequent advancements integrated comprehensive training techniques (e.g., Mosaic augmentation, CSPDarknet, CIoU Loss) to enhance accuracy in high-density scenarios [31]. Recent studies have balanced lightweight design and accuracy through dynamic label assignment and E-ELAN architectures [32]. Furthermore, researchers have incorporated DySample upsampling to refine feature extraction [33] and proposed reparameterized ELAN structures (RepNCSPELAN4) to optimize parameter utilization [34]. However, these reparameterized backbones and lightweight modules often lack sufficient representational capacity for efficient edge deployment, highlighting the need for strategies that minimize computational overhead without losing feature discriminability.

2.3. Lightweight Object Detection Models

To address the limitations of large model parameters and high computational complexity, various lightweight strategies have been proposed. For example, Xie et al. [35] introduced CSPPartial-YOLO, which utilizes a partial hybrid dilated convolution (PHDC) module to enlarge the receptive field at a lower computational cost. Qin et al. [36] developed the MobileNetV4 architecture by refining the Neural Architecture Search (NAS) recipe and introducing the Universal Inverted Bottleneck (UIB) block. Han et al. [37] proposed the Ghost module, which generates additional feature maps through inexpensive linear operations, exploiting feature map redundancy to substantially reduce computation. Similarly, Cai et al. [38] introduced FalconNet to mitigate architectural redundancy, while Hu et al. [39] designed EL-YOLO for robust small target detection on low-end GPUs. Although these approaches reduce model complexity, they often do not provide the robustness required for dense, small-scale targets in complex field backgrounds. This underscores the need for a dynamic, multi-scale lightweight framework that preserves strong feature discrimination.

Furthermore, recent comprehensive benchmarking studies in agricultural target detection, such as the work by Rana et al. [40], emphasize that evaluating models solely on mean Average Precision (mAP) is insufficient for real-world field deployment. Practical detector selection must thoroughly consider robustness against spectral and domain shifts, resilience to degradation-induced failures (such as motion blur or extreme occlusion), and the critical tradeoff between detection accuracy and computational efficiency. These multifaceted challenges align directly with the deployment bottlenecks in dense wheat spike detection, necessitating a framework like GG-YOLO that explicitly balances high-fidelity spatial reconstruction with ultra-lightweight architecture.

3. Proposed Method

We introduce a lightweight, multi-scale object detection framework optimized for densely distributed targets in complex background scenarios. Built upon the YOLOv8 architecture, the proposed network incorporates several novel components to enhance detection accuracy while maintaining computational efficiency. First, a Dual-path Attentive Ghost Mechanism (DAGM) is integrated into the backbone to address scale variations and dense target distributions through a heterogeneous dual-path structure and attention-guided fusion. Second, the C3Ghost module is embedded into the detection head to minimize model complexity and computational overhead without compromising precision. Third, we propose DSample, a dynamic content-aware upsampling module that leverages a bimodal sparse optimization framework to adaptively recover spatial resolution. The overall architecture of the proposed framework is illustrated in Figure 1.

Comparative results against state-of-the-art detectors are presented in Figure 2, demonstrating the superior performance of the proposed approach in dense object detection tasks.

3.1. DSample

Existing dynamic convolution methods frequently encounter two computational bottlenecks: substantial parameter overhead from kernel-generation subnetworks and elevated time complexity from position-wise dynamic convolutions [41,42,43]. To overcome these limitations, we design DSample, an ultra-lightweight dynamic upsampling module built upon the DySample framework. By eliminating the conventional kernel-generation paradigm, DSample employs a compact pipeline consisting of offset prediction, coordinate transformation, and differentiable grid sampling. This reconfiguration preserves end-to-end differentiability while markedly reducing FLOPs and memory footprint. Formally, the dynamic upsampling operation can be expressed as:

$$Y = \phi(X, \Delta) \tag{1}$$

where $X$ is the input feature map, $Y$ denotes the upsampled output feature map at the desired high resolution, and $\phi$ is the content-adaptive upsampling operator parameterized by the learned offsets $\Delta$. To promote spatially coherent and low-redundancy offsets, a mixed-norm group-sparse regularization is applied, which effectively reduces the number of active sampling groups during training. Unlike fixed bilinear interpolation, DSample predicts location-wise offsets to adapt the sampling grid to local content, enabling high-fidelity reconstruction under strict computational budgets. Furthermore, the module incorporates a dual-mode adaptive mechanism: a low-power (LP) mode prioritizing minimal computation, and a pixel-blending (PB) mode favoring boundary fidelity and fine-grained details. Both modes share identical parameters, facilitating flexible deployment across diverse edge devices. A schematic of the module is provided in Figure 3.
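The offset-based sampling in Equation (1) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: in a real module the offsets would be predicted from the input features by a small learned layer, whereas here they are passed in directly, and the LP/PB modes and the group-sparse regularization are not modelled. The function names `bilinear_sample` and `dynamic_upsample` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2D list at fractional (y, x), clamping to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
    bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def dynamic_upsample(feat, offsets, scale=2):
    """Upsample by `scale`, shifting each sampling point by a per-pixel offset.

    offsets[i][j] = (dy, dx), in input-pixel units, for output pixel (i, j).
    Supplying the offsets by hand is an assumption made for illustration.
    """
    h, w = len(feat), len(feat[0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            # Base grid: centre of the output pixel mapped back to input coordinates.
            base_y = (i + 0.5) / scale - 0.5
            base_x = (j + 0.5) / scale - 0.5
            dy, dx = offsets[i][j]
            row.append(bilinear_sample(feat, base_y + dy, base_x + dx))
        out.append(row)
    return out


feat = [[0.0, 4.0],
        [8.0, 12.0]]
zero_offsets = [[(0.0, 0.0)] * 4 for _ in range(4)]
plain = dynamic_upsample(feat, zero_offsets)    # reduces to bilinear upsampling
right_shift = [[(0.0, 0.25)] * 4 for _ in range(4)]
moved = dynamic_upsample(feat, right_shift)     # every sample nudged 0.25 px to the right
```

With all offsets set to zero the operator reduces to ordinary bilinear interpolation, which is the fixed baseline that content-adaptive offsets are meant to improve upon.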

3.2. Dual-Path Attentive Ghost Mechanism

Unlike conventional attention mechanisms that primarily focus on channel-wise or spatial feature recalibration, the proposed Dual-path Attentive Ghost Mechanism (DAGM) introduces a gradient-guided attention strategy to explicitly enhance boundary-sensitive feature representation in densely overlapped wheat spike regions. By incorporating structural edge information into the feature extraction process, DAGM alleviates feature ambiguity caused by severe occlusion and complex agricultural backgrounds, thereby improving localization accuracy for closely adjacent wheat spikes. To achieve this objective, the main path stacks RepNCSPELAN blocks and incorporates a multi-branch, dilation-expanding deformable convolution group to achieve a deformable receptive-field mixture. This adapts the effective receptive field to target geometry using weights generated from the bounding box aspect ratio. To guide this process, the main path correlates deformable-offset learning with gradient responses. Let $G_x$ and $G_y$ represent the horizontal and vertical Sobel kernels, respectively; the gradient-magnitude map $M$, which provides structural cues for offset learning, is calculated as:

$$M = \sqrt{(X * G_x)^2 + (X * G_y)^2} \tag{2}$$

where $X$ is the input feature map. The auxiliary path utilizes GhostConv-based residual units to generate cost-effective yet informative "ghost" features, which suppress spurious high-frequency noise and compensate for localized detail loss. Channel-wise attention coefficients $A$ are generated from the gradient evidence as follows:

$$A = \sigma(\mathrm{MLP}(\mathrm{GAP}(M))) \tag{3}$$

where $\mathrm{GAP}$ denotes global average pooling, $\mathrm{MLP}$ is a lightweight two-layer perceptron, and $\sigma$ is the sigmoid activation function. The two paths are subsequently fused in a channel-adaptive manner:

$$Y_{final} = (Y_{main} \odot A) + Y_{aux} \tag{4}$$

with $\odot$ denoting the Hadamard product, $Y_{main}$ the main path output, and $Y_{aux}$ the auxiliary path output. This geometry-aware adaptation and gradient-guided fusion effectively enhance feature discrimination in dense spike regions. The structure of DAGM is illustrated in Figure 4.
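Equations (2) to (4) can be traced on a toy two-channel input with the following minimal Python sketch. It is an illustration under stated assumptions rather than the authors' code: the MLP weights `w1` and `w2`, the bias-free ReLU hidden layer, the "valid" border handling, and the use of the input channels as a stand-in for the main path output are all hypothetical choices.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(img, kernel):
    """'Valid' 3x3 cross-correlation of a 2D list with a kernel."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)] for i in range(h - 2)]


def gradient_magnitude(img):
    """Eq. (2): per-pixel magnitude of the horizontal and vertical Sobel responses."""
    gx, gy = filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


def gap(fmap):
    """Global average pooling over one 2D map."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(channels, w1, w2):
    """Eq. (3): A = sigmoid(MLP(GAP(M))), one coefficient per channel."""
    pooled = [gap(gradient_magnitude(c)) for c in channels]
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]


def fuse(y_main, y_aux, attn):
    """Eq. (4): scale each main-path channel by its coefficient, then add the auxiliary path."""
    return [[[m * a + x for m, x in zip(rm, rx)] for rm, rx in zip(cm, cx)]
            for cm, cx, a in zip(y_main, y_aux, attn)]


step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]   # channel with a vertical edge
flat = [[0.5] * 4 for _ in range(4)]              # channel with no structure
channels = [step, flat]
w1 = [[0.5, 0.5]]        # hypothetical 2 -> 1 hidden layer
w2 = [[1.0], [-1.0]]     # hypothetical 1 -> 2 output layer
attn = channel_attention(channels, w1, w2)
y_aux = [[[0.1] * 4 for _ in range(4)] for _ in range(2)]
fused = fuse(channels, y_aux, attn)
```

In this toy setting only the edge channel contributes gradient evidence to the pooled descriptor, so the gating depends entirely on where boundaries are present, which is the intuition behind using gradient magnitude as the attention signal.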

3.3. C3Ghost

To reduce the model size and computational overhead of the standard YOLOv8 architecture, we replace the original C2f module with the C3Ghost structure. The C3Ghost module comprises two GhostConv layers and a GhostBottleneck-based residual branch. The initial GhostConv expands the channel dimension to facilitate information integration, while the subsequent GhostConv performs dimensionality reduction to match the desired output channels. GhostConv generates additional feature maps through inexpensive linear operations, significantly reducing computational redundancy. For a standard convolution layer with kernel size $k$, input channels $C_{in}$, output channels $C_{out}$, and spatial resolution $H \times W$, the computational complexity is:

$$O_{std} = k^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W \tag{5}$$

In contrast, GhostConv decomposes this process, producing intrinsic feature maps ($C_{intrinsic}$) and generating the remainder via a ghost ratio $s$. This significantly mitigates redundant feature generation. The C3Ghost module leverages this by splitting feature generation across parallel paths for efficient reuse. Let $F_{shortcut}$ denote the shortcut branch output and $F_{main}$ the main branch feature produced by the GhostBottleneck. The forward propagation is expressed as:

$$Y = \mathrm{Conv}(\mathrm{Concat}(\mathrm{DWConv}(F_{main}), F_{shortcut})) \tag{6}$$

where $\mathrm{DWConv}$ refines the bottleneck output, $\mathrm{Concat}$ merges the features, and $\mathrm{Conv}$ performs the final channel projection. The internal transformation of the residual GhostBottleneck is defined as:

$$F_{main} = X \oplus \mathrm{GhostConv}(\mathrm{GhostConv}(X)) \tag{7}$$

where $\oplus$ denotes the element-wise addition of the residual connection, which ensures stable gradient propagation.

By combining Ghost-based feature generation with the multi-branch aggregation inherited from the C3 architecture, the C3Ghost module extracts enriched spatial features while maintaining high computational efficiency. It thereby achieves a better balance between feature representation capability and computational cost, making it particularly suitable for lightweight object detection tasks in dense agricultural scenes. Its structural diagram is shown in Figure 5.
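To make the saving concrete, the sketch below compares Equation (5) with the cost of a Ghost convolution as formulated by Han et al. [37], where a primary convolution produces $C_{out}/s$ intrinsic maps and each intrinsic map yields $s - 1$ ghost maps through a cheap $d \times d$ depthwise operation. The layer size, the ghost ratio `s = 2`, and the cheap-kernel size `d = 3` are assumed values for illustration; the paper does not state the settings used in GG-YOLO.

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Eq. (5): multiply-accumulate count of a k x k standard convolution."""
    return k * k * c_in * c_out * h * w


def ghost_conv_cost(k, c_in, c_out, h, w, s=2, d=3):
    """Cost of a Ghost convolution: primary convolution plus cheap depthwise ops.

    s (ghost ratio) and d (cheap kernel size) are assumed values.
    """
    intrinsic = c_out // s                            # maps from the primary convolution
    primary = k * k * c_in * intrinsic * h * w
    cheap = d * d * intrinsic * (s - 1) * h * w       # one d x d filter per ghost map
    return primary + cheap


std = standard_conv_cost(3, 128, 256, 40, 40)
ghost = ghost_conv_cost(3, 128, 256, 40, 40, s=2, d=3)
ratio = std / ghost
print(f"standard: {std:,}  ghost: {ghost:,}  speed-up: {ratio:.2f}x")
```

For this hypothetical layer the theoretical speed-up is close to the ghost ratio $s$, because the depthwise term is small compared with the primary convolution whenever $C_{in}$ is large.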

3.4. Design Analysis of GG-YOLO

The GG-YOLO framework is anchored in three complementary optimization principles: efficient feature generation, geometry-aware representation learning, and adaptive spatial reconstruction. DAGM enhances feature discrimination via heterogeneous dual-path attention. C3Ghost curtails convolutional redundancy while sustaining effective representation. Simultaneously, DSample heightens spatial reconstruction accuracy in multi-scale fusion through adaptive offsets. The coordinated integration of these modules enables GG-YOLO to achieve an optimal balance between accuracy and efficiency. The comparison table is shown in Table 1.

4. Environment and Parameterization

4.1. Model Training Environment

The experiments were run on an NVIDIA GeForce RTX 4070 Ti Super graphics card under Windows 11, using Python 3.11.7, CUDA 11.3, and the Ultralytics library (version 8.0.157) as the YOLOv8s dependency. Training used stochastic gradient descent (SGD) as the optimizer with an initial learning rate of 0.01, a batch size of 16, and 300 epochs.

To ensure strict reproducibility, the hyperparameter configurations were set as follows. Input images were resized to $640 \times 640$ pixels. The datasets were randomly divided into training, validation, and testing sets with a ratio of 8:1:1. A fixed random seed of 42 was applied across all experiments to guarantee consistent data splitting and initialization. Model optimization was performed using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. The batch size was set to 16, and the training spanned 300 epochs.
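A seeded 8:1:1 split of this kind can be sketched as follows. The paper does not describe the splitting code, so the function `split_dataset`, the file names, and the choice to sort before shuffling are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=42):
    """Shuffle with a fixed seed and cut into train/val/test by integer ratio."""
    items = sorted(items)           # canonical order, so the seed fully determines the split
    rng = random.Random(seed)       # local generator, leaves global random state untouched
    rng.shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


names = [f"img_{i:04d}.jpg" for i in range(600)]   # hypothetical file names
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))
```

Using integer ratios avoids floating-point rounding when computing the cut points, and sorting first makes the split independent of the order in which files happen to be listed on disk.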

For the loss function, the weights for box loss, classification loss, and DFL (Distribution Focal Loss) were set to 7.5, 0.5, and 1.5 respectively. During the inference phase, the confidence threshold was set to 0.25, and the Non-Maximum Suppression (NMS) IoU threshold was set to 0.45. Inference was executed using FP16 (Half-precision) to simulate real-world edge deployment scenarios.
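The two inference thresholds interact as shown in the greedy NMS sketch below. This is the textbook class-agnostic variant, written in plain Python for illustration; the paper does not specify which NMS implementation was used, and the example boxes are invented.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(detections, conf_thres=0.25, iou_thres=0.45):
    """Greedy NMS over (box, score) pairs.

    Detections below conf_thres are dropped first; the rest are visited in
    descending score order and kept only if they overlap no kept box by more
    than iou_thres.
    """
    candidates = sorted((d for d in detections if d[1] >= conf_thres),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, score))
    return kept


detections = [                        # invented boxes for illustration
    ((0, 0, 10, 10), 0.90),
    ((1, 0, 11, 10), 0.80),           # near-duplicate of the first box (IoU about 0.82)
    ((8, 0, 18, 10), 0.70),           # adjacent spike, small overlap (IoU about 0.11)
    ((50, 50, 60, 60), 0.20),         # below the confidence threshold
]
kept = nms(detections)
```

In dense scenes the IoU threshold is a genuine trade-off: two real, heavily overlapping spikes whose boxes exceed it would be merged into a single detection by this procedure.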

4.2. Experimental Dataset Setup

To evaluate the effectiveness of the proposed detection network, two wheat spike datasets with high target density—the HNKJXYwheat dataset and the GlobalWheat2020 dataset [44]—were selected for algorithm validation.

The GlobalWheat2020 dataset focuses on oriented wheat spike detection from UAV (Unmanned Aerial Vehicle) perspectives. To capture a wide variety of wheat genotypes and complex field conditions, the data were collected between 2017 and 2019 across multiple diverse geographical regions, including Europe, North America, Asia, and Australia. It comprises 3424 high-resolution aerial images, with 263,904 wheat spike instances meticulously annotated by a professional labeling team. On average, each image contains approximately 25.8 target objects, making it highly suitable for benchmarking dense object detection performance. To help readers intuitively understand the high density and severe occlusion of the wheat spikes in this dataset, as well as how our proposed methodology detects targets in such complex environments, visual detection examples are provided in Figure 6. Specifically, Figure 6b explicitly demonstrates the robust detection results of our model in scenes with densely overlapping wheat spikes.

The HNKJXYwheat dataset (Henan Institute of Science and Technology wheat) was collected from experimental wheat fields at the Henan Institute of Science and Technology. It includes 600 images with corresponding annotation files that cover multiple growth stages of wheat, offering a valuable benchmark for evaluating cross-stage detection robustness and generalization capability. To compensate for the limited sample size and enhance intra-class diversity, a targeted augmentation strategy was applied during training. This strategy integrates geometric transformations (e.g., horizontal flip with a probability of $p = 0.5$, affine and perspective warping with $p = 0.1$, scaling, aspect-ratio jitter, and IoU-guided cropping) and photometric adjustments (e.g., HSV variations with $h = 0.015$, $s = 0.7$, $v = 0.4$, alongside brightness, contrast, and gamma adjustments). Through this augmentation pipeline, the original 600 images were expanded into an effective training set of roughly 1500 samples, significantly increasing feature diversity and improving the detection of small, occluded, and morphologically complex wheat spikes across developmental stages.

Although the HNKJXYwheat dataset is relatively limited in size, the extensive data augmentation strategies employed effectively enhance data diversity and model robustness. Moreover, cross-dataset generalization experiments were conducted, where the model trained on the GlobalWheat2020 dataset was directly evaluated on the HNKJXYwheat dataset without any fine-tuning. This rigorous setting effectively validates the generalization capability and robustness of the proposed model under severe domain shifts.

4.3. Public Evaluation Indicators

The experimental results were evaluated using standard object detection metrics, including precision (P), recall (R), and the number of parameters. Computational cost was measured in floating-point operations (FLOPs, reported in GFLOPs). Detection accuracy was summarized by the mean average precision: mAP@50 is computed at an intersection over union (IoU) threshold of 0.5, and mAP@50-95 is averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Frames per second (FPS) were also measured to assess the inference speed of the model.
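As a minimal illustration of these metrics, the sketch below computes precision and recall from detection counts and average precision from a ranked list of detections. It assumes the detections have already been matched to ground truth at a fixed IoU threshold, and it uses all-point interpolation of the precision-recall curve; evaluation toolkits differ in the interpolation scheme, so values may not match a given library exactly. The example flags are invented.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve with all-point interpolation.

    is_tp: True/False flags for detections sorted by descending confidence,
           already matched to ground truth at one IoU threshold (an assumption).
    num_gt: number of ground-truth objects.
    """
    tp = fp = 0
    points = []                                   # (recall, precision) after each detection
    for flag in is_tp:
        tp += flag
        fp += not flag
        points.append((tp / num_gt, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: the best precision at this recall or higher.
        p_interp = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * p_interp
        prev_recall = recall
    return ap


flags = [True, True, False, True]                 # invented ranking: 3 hits, 1 false alarm
p, r = precision_recall(tp=3, fp=1, fn=1)
ap50 = average_precision(flags, num_gt=4)
```

For a single-class task such as wheat spike detection, mAP@50 equals the AP at IoU 0.5, and mAP@50-95 would repeat the matching and AP computation at each of the ten IoU thresholds and average the results.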

4.4. Results and Analysis

To ensure statistical reliability, all experiments were conducted three times under identical settings, and the average results were reported. The standard deviation was also calculated to evaluate the stability of the proposed method. No pre-trained weights were used in the cross-dataset evaluation (Table 2) to ensure a fair comparison across all models.

Table 3 presents a comprehensive comparison with several representative one-stage and two-stage detection methods. GG-YOLO achieves a precision of 94.35%, recall of 91.93%, and mAP@50 of 96.47%, surpassing all compared methods on these three metrics. While RT-DETR-R18 achieves slightly higher mAP@50-95 (61.70%), it does so at the cost of significantly increased parameters (20.18 M) and computation (58.3 GFLOPs). In contrast, GG-YOLO attains competitive mAP@50-95 (60.78%) with only 7.89 M parameters and 20.4 GFLOPs, resulting in the highest inference speed of 165 FPS.

In addition to mainstream detectors such as YOLOv8 and RT-DETR, we further include lightweight baselines, including YOLOv7-tiny and PP-YOLOE-s, to provide a more comprehensive evaluation of the proposed method under resource-constrained scenarios.

Compared to the baseline YOLOv8s, GG-YOLO improves mAP@50 by 2.6 percentage points and mAP@50-95 by 5.24 points, while simultaneously reducing the parameter count by 31% and FLOPs by 28.9%. Additionally, the inference speed increases from 145 to 165 FPS, indicating substantial gains in both accuracy and computational efficiency. Other lightweight detectors, such as FFCA-YOLO and YOLO-World, demonstrate reasonable accuracy but require two to three times more parameters and FLOPs to achieve lower or comparable results. Keras-RetinaNet exhibits significant performance degradation in complex dense scenes and offers limited speed and efficiency.
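The relative figures quoted above can be cross-checked with simple arithmetic. In the sketch below, the speed gain is computed directly from the two reported FPS values, while the baseline parameter count and FLOPs are back-calculated from the reported percentage reductions; those two values are implied by the stated percentages and are not read from Table 3.

```python
def relative_change(old, new):
    """Percentage change from old to new (positive means an increase)."""
    return (new - old) / old * 100.0


def implied_baseline(new, reduction_pct):
    """Baseline value implied by a reported percentage reduction."""
    return new / (1.0 - reduction_pct / 100.0)


fps_gain = relative_change(145, 165)               # from the reported FPS values
params_baseline = implied_baseline(7.89, 31.0)     # implied, in millions of parameters
flops_baseline = implied_baseline(20.4, 28.9)      # implied, in GFLOPs

print(f"FPS gain: {fps_gain:.1f}%")
print(f"Implied baseline parameters: {params_baseline:.2f} M")
print(f"Implied baseline cost: {flops_baseline:.1f} GFLOPs")
```

The speed gain works out to roughly 14%, noticeably smaller than the reductions in parameters and FLOPs, which is common because measured throughput also depends on memory access and operator efficiency rather than on operation counts alone.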

Crucially, recent agricultural benchmarking studies emphasize that detector selection for field deployment must evaluate not only absolute mAP but also robustness under domain shift, degradation-induced failures, false detection behavior, and strict accuracy-efficiency tradeoffs. The architectural design of GG-YOLO addresses these criteria directly. The Dual-path Attentive Ghost Mechanism (DAGM) enhances the representation of dense and small-scale targets by adapting receptive fields, mitigating false detections caused by severe occlusion or field degradation. The C3Ghost module reduces redundancy in convolutional operations, improving the accuracy-efficiency tradeoff through lightweight yet expressive feature extraction. Furthermore, the dynamic DSample upsampler recovers fine-grained spatial details with minimal computational overhead through a simplified content-adaptive sampling strategy, supporting consistent perception reliability across complex agricultural environments.

Overall, the results verify that GG-YOLO achieves high-precision detection in dense wheat spike scenarios with significantly lower resource requirements, making it highly suitable for real-time field deployment on edge or mobile devices. Finally, to visualize the improved detection capabilities of GG-YOLO, Figure 6 shows the heat map results of YOLOv8s and GG-YOLO on images from the HNKJXYwheat dataset.

Furthermore, we evaluated the robustness of GG-YOLO against common field degradation and domain shifts. Quantitatively, the cross-dataset evaluation (Table 2) demonstrates that the model maintains a strong mAP@50 of 89.25% on the HNKJXYwheat dataset despite being trained on GlobalWheat2020. This confirms the network’s resilience to cross-dataset domain shifts characterized by different sensor types, geographical locations, and background clutter. Qualitatively, visual comparisons using the image information in Figure 6 demonstrate the clear superiority of the proposed methodology in detecting dense wheat and maintaining perception reliability under extreme visual degradation. In cases of severe target density and occlusion (Figure 6I), baseline models such as YOLOv8s (Figure 6c) and YOLOv11 (Figure 6d) frequently struggle with feature confusion, resulting in merged bounding boxes across adjacent spikes or entirely missed detections (false negatives) within deep clusters. In contrast, GG-YOLO (Figure 6b) generates highly distinct and accurate bounding boxes for each individual spike, even when they are heavily overlapped. This visual superiority is directly attributed to the proposed architecture: the DAGM module effectively resolves feature confusion among overlapping spikes by prioritizing edge responses, while the DSample module adaptively recovers fine-grained boundaries for these dense targets during feature fusion. Moreover, the model demonstrates high stability against wind-induced motion blur (Figure 6III) and severe illumination variations across varying growth stages (Figure 6II), accurately localizing targets where baseline models fail.

The performance improvements observed in Table 4 can be attributed to the complementary effects of the proposed modules. Specifically, the DAGM module enhances feature discrimination by emphasizing edge-aware and geometry-sensitive responses, which is particularly beneficial in densely overlapping wheat spike regions.

The DSample module contributes by reconstructing fine-grained spatial details during feature pyramid fusion, effectively mitigating the loss of localization precision caused by traditional interpolation methods. Meanwhile, the C3Ghost module significantly reduces computational redundancy by leveraging efficient ghost feature generation, enabling a lightweight design without sacrificing representational capacity.

The combination of these modules results in a synergistic improvement, validating the effectiveness of the proposed joint optimization framework.

4.5. Edge Deployment Performance Evaluation

To validate the practical applicability of GG-YOLO for real-time deployment in precision agriculture, we extended our evaluation from the RTX 4070 Ti Super environment to an actual edge computing platform. The proposed GG-YOLO and the baseline YOLOv8s were deployed on an NVIDIA Jetson Orin Nano (8 GB). Both models were exported using TensorRT with FP16 precision to maximize inference efficiency. Table 5 details the performance metrics under strict edge hardware constraints. On this platform, GG-YOLO runs at 64 FPS with a latency of 15.6 ms, compared with 43 FPS and 23.2 ms for YOLOv8s, while using a smaller model file (15.8 MB vs. 22.9 MB) and less peak memory (190 MB vs. 265 MB).
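As a quick consistency check on the edge results, the script below derives the throughput implied by each latency in Table 5 and the relative change of GG-YOLO against YOLOv8s. It assumes batch size 1 with no pipelining, so that FPS is simply the reciprocal of latency; the helper names are introduced here for illustration.

```python
# Values copied from Table 5:
# (model size MB, peak memory MB, latency ms, reported FPS)
edge = {
    "YOLOv8s": (22.9, 265, 23.2, 43),
    "GG-YOLO": (15.8, 190, 15.6, 64),
}


def fps_from_latency(latency_ms):
    """Throughput implied by a per-image latency (batch size 1, no pipelining)."""
    return 1000.0 / latency_ms


def relative_change(new, old):
    """Signed percentage change of new relative to old."""
    return 100.0 * (new - old) / old


for name, (_size, _mem, latency, fps) in edge.items():
    print(f"{name}: implied {fps_from_latency(latency):.1f} FPS, reported {fps}")

base, ours = edge["YOLOv8s"], edge["GG-YOLO"]
for label, i in [("model size", 0), ("peak memory", 1), ("latency", 2), ("FPS", 3)]:
    print(f"{label}: {relative_change(ours[i], base[i]):+.1f}%")
```

The implied and reported frame rates agree after rounding, which suggests the FPS column was derived from the measured latency.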

5. Conclusions

In this work, we propose GG-YOLO, a unified lightweight detection framework that jointly optimizes feature representation, computational efficiency, and spatial reconstruction for dense wheat spike detection. Unlike conventional approaches that address these aspects independently, the proposed method integrates a geometry-aware dual-path attention mechanism, an efficient ghost-based feature generation module, and a dynamic content-adaptive upsampling strategy into a cohesive architecture.

Experiments on GlobalWheat2020 show that GG-YOLO reaches an mAP@50 of 96.47% with 7.89 M parameters and 20.4 GFLOPs, a favorable trade-off between accuracy and efficiency that makes it suitable for real-time deployment in precision agriculture scenarios. Furthermore, cross-dataset evaluation on HNKJXYwheat (mAP@50 of 89.25%) supports its generalization capability under varying environmental conditions.

Despite these advantages, certain limitations remain. The model may still exhibit performance degradation under extreme occlusion or severe illumination variations. Future work will focus on improving robustness through advanced context modeling, multi-task learning, and temporal consistency in video-based agricultural monitoring systems.

Author Contributions

Conceptualization, G.G. and X.L.; methodology, G.G. and F.Z.; software, F.Z. and G.G.; validation, L.X., J.Z. and F.Z.; formal analysis, L.X. and J.Z.; investigation, G.G. and J.Z.; resources, X.L.; data curation, F.Z. and L.X.; writing—original draft preparation, G.G. and F.Z.; writing—review and editing, X.L. and L.X.; visualization, J.Z.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Henan Provincial Key R&D Program, project "Intelligent vision-driven key technology and application of whole-life assisted breeding for high-quality wheat" (grant number 251111210800), November 2024–October 2027.

Data Availability Statement

Publicly available datasets were analyzed in this study. The GlobalWheat2020 dataset can be found in the referenced literature. The HNKJXYwheat dataset presented in this study, which was collected from experimental wheat fields at the Henan Institute of Science and Technology, is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Yao, Y.; Guo, W.; Gou, J.; Hu, Z.; Liu, J.; Ma, J.; Zong, Y.; Xin, M.; Chen, W.; Li, Q.; et al. Wheat2035: Integrating pan-omics and advanced biotechnology for future wheat design. Mol. Plant 2025, 18, 272–297.
Ma, Y. Does “zero growth policy” affect environmental productivity of wheat production in China. Agriculture 2025, 15, 378.
Harfouche, A.L.; Jacobson, D.A.; Kainer, D.; Romero, J.C.; Harfouche, A.H.; Mugnozza, G.S.; Moshelion, M.; Tuskan, G.A.; Keurentjes, J.J.B.; Altman, A. Accelerating climate resilient plant breeding by applying next-generation artificial intelligence. Trends Biotechnol. 2019, 37, 1217–1235.
Zhao, J.; Dong, H.; Han, J.; Ou, J.; Chen, T.; Wang, Y.; Liu, S.; Yu, R.; Zheng, W.; Li, C.; et al. Lwrr: Landscape of wheat rust resistance towards practical breeding design. Stress Biol. 2025, 5, 25.
Arif, M.; Haroon, M.; Nawaz, A.F.; Abbas, H.; Xu, R.; Li, L. Enhancing wheat resilience: Biotechnological advances in combating heat stress and environmental challenges. Plant Mol. Biol. 2025, 115, 41.
Han, G.; Yan, H.; Li, L.; An, D. Advancing wheat breeding using rye: A key contribution to wheat breeding history. Trends Biotechnol. 2025, 43, 2170–2183.
Li, S.; Lin, D.; Zhang, Y.; Deng, M.; Chen, Y.; Lv, B.; Li, B.; Lei, Y.; Wang, Y.; Zhao, L.; et al. Genome-edited powdery mildew resistance in wheat without growth penalties. Nature 2022, 602, 455–460.
Liu, Q.; Zuo, S.-M.; Peng, S.; Zhang, H.; Peng, Y.; Li, W.; Xiong, Y.; Lin, R.; Feng, Z.; Li, H.; et al. Development of machine learning methods for accurate prediction of plant disease resistance. Engineering 2024, 40, 100–110.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Ma, N.; Su, Y.; Yang, L.; Li, Z.; Yan, H. Wheat seed detection and counting method based on improved yolov8 model. Sensors 2024, 24, 1654.
Ali, M.L.; Zhang, Z. The yolo framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 2024, 13, 336.
Shobaki, W.A.; Milanova, M. A comparative study of yolo, ssd, faster r-cnn, and more for optimized eye-gaze writing. Sci 2025, 7, 47.
Hu, Y.; Jiang, H.; Feng, D.; Tian, L.; Luo, H.; Zhang, S. Performance impact and interplay of ssd parallelism through advanced commands, allocation strategy and data granularity. In ICS '11: International Conference on Supercomputing, Tucson, AZ, USA, 31 May–4 June 2011; Association for Computing Machinery: New York, NY, USA, 2011; pp. 96–107.
Eissa, H.F.; Hassanien, S.E.; Ramadan, A.M.; El-Shamy, M.M.; Saleh, O.M.; Shokry, A.M.; Abdelsattar, M.; Morsy, Y.B.; El-Maghraby, M.A.; Alameldin, H.F.; et al. Developing transgenic wheat to encounter rusts and powdery mildew by overexpressing barley chi26 gene for fungal resistance. Plant Methods 2017, 13, 41.
Madec, S.; Jin, X.; Lu, H.; De Solan, B.; Liu, S.; Duyme, F.; Heritier, E.; Baret, F. Ear density estimation from high resolution rgb imagery using deep learning technique. Agric. For. Meteorol. 2019, 264, 225–234.
Peng, J.; Wang, D.; Zhu, W.; Yang, T.; Liu, Z.; Rezaei, E.E.; Li, J.; Sun, Z.; Xin, X. Combination of uav and deep learning to estimate wheat yield at ripening stage: The potential of phenotypic features. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103494.
Carlier, A.; Dandrifosse, S.; Dumont, B.; Mercatoris, B. Wheat ear segmentation based on a multisensor system and superpixel classification. Plant Phenomics 2022, 2022, 9841985.
Shen, X.; Li, S.; Qiu, F.; Yao, L. A lightweight real-time unified detection model for rice and wheat ears in complex agricultural environments. Smart Agric. Technol. 2025, 11, 101055.
Wang, D.; Zhang, D.; Yang, G.; Xu, B.; Luo, Y.; Yang, X. Ssrnet: In-field counting wheat ears using multi-stage convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4403311.
Yao, Z.; Liu, T.; Yang, T.; Ju, C.; Sun, C. Rapid detection of wheat ears in orthophotos from unmanned aerial vehicles in fields based on yolox. Front. Plant Sci. 2022, 13, 851245.
Guan, Y.; Pan, J.; Fan, Q.; Yang, L.; Yin, X.; Jia, W. Ctwheatnet: Accurate detection model of wheat ears in field. Comput. Electron. Agric. 2024, 225, 109272.
Wang, H.; Shi, M.; Tian, S.; Xie, Y.; Fang, Y. Research on wheat ears detection method based on improved yolov5. In Artificial Intelligence in China, Proceedings of the 4th International Conference on Artificial Intelligence in China, Changbaishan, China, 23–24 July 2022; Springer: Singapore, 2022; pp. 119–129.
Dandrifosse, S.; Ennadifi, E.; Carlier, A.; Gosselin, B.; Dumont, B.; Mercatoris, B. Deep learning for wheat ear segmentation and ear density measurement: From heading to maturity. Comput. Electron. Agric. 2022, 199, 107161.
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2019; pp. 6105–6114.
Girshick, R.; Iandola, F.; Darrell, T.; Malik, J. Deformable part models are convolutional neural networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 437–446.
Redmon, J.; Farhadi, A. Yolo9000: Better, faster, stronger. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 7263–7271.
Zhao, J.; Zhang, X.; Yan, J.; Qiu, X.; Yao, X.; Tian, Y.; Zhu, Y.; Cao, W. A wheat spike detection method in uav images based on improved yolov5. Remote Sens. 2021, 13, 3095.
Meng, X.; Li, C.; Li, J.; Li, X.; Guo, F.; Xiao, Z. Yolov7-ma: Improved yolov7-based wheat head detection and counting. Remote Sens. 2023, 15, 3770.
Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7464–7475.
Zhang, H.; Li, G.; Wan, D.; Wang, Z.; Dong, J.; Lin, S.; Deng, L.; Liu, H. Ds-yolo: A dense small object detection algorithm based on inverted bottleneck and multi-scale fusion network. Biomim. Intell. Robot. 2024, 4, 100190.
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. In Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XXXI; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21.
Xie, S.; Zhou, M.; Wang, C.; Huang, S. Csspartial-yolo: A lightweight yolo-based method for typical objects detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 388–399.
Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. Mobilenetv4: Universal models for the mobile ecosystem. In Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XL; Springer: Cham, Switzerland, 2024; pp. 78–96.
Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Wu, E.; Tian, Q. Ghostnets on heterogeneous devices via cheap operations. Int. J. Comput. Vis. 2022, 130, 1050–1069.
Cai, Z.; Shen, Q. Falconnet: Factorization for the light-weight convnets. In Neural Information Processing: 30th International Conference, ICONIP 2023, Changsha, China, 20–23 November 2023, Proceedings, Part II; Springer: Singapore, 2023; pp. 368–380.
Hu, M.; Li, Z.; Yu, J.; Wan, X.; Tan, H.; Lin, Z. Efficient-lightweight yolo: Improving small object detection in yolo for aerial images. Sensors 2023, 23, 6423.
Rana, S.; Hensel, O.; Nasirahmadi, A. From vineyard to vision: Multi-domain analysis and mitigation of grape cluster detection failures in complex viticultural environments. Results Eng. 2025, 29, 108833.
Lou, M.; Zhang, S.; Zhou, H.-Y.; Yang, S.; Wu, C.; Yu, Y. Transxnet: Learning both global and local dynamics with a dual dynamic token mixer for visual recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 11534–11547.
Fang, C.; Yang, X. Lightweight yolov8 for wheat head detection. IEEE Access 2024, 12, 66214–66222.
Yang, B.; Zhu, Y.; Zhou, S. Accurate wheat lodging extraction from multi-channel UAV images using a lightweight network model. Sensors 2021, 21, 6826.
David, E.; Madec, S.; Sadeghi-Tehran, P.; Aasen, H.; Zheng, B.; Liu, S.; Kirchgessner, N.; Ishikawa, G.; Nagasawa, K.; Badhon, M.A.; et al. Global wheat head detection (gwhd) dataset: A large and diverse dataset of high-resolution rgb-labelled images to develop and benchmark wheat head detection methods. Plant Phenomics 2020, 2020, 3521852.

Figure 1. Overview of GG-YOLO, improved from YOLOv8.

Figure 2. Performance on GlobalWheat2020 Dataset.

Figure 3. DSample Structure Overview.

Figure 4. DAGM adopts a dual-path heterogeneous interaction design, in which the primary and secondary paths use different structures with different functions.

Figure 5. Schematic structure of the proposed C3Ghost module.

Figure 6. Selected results from the GlobalWheat2020 dataset. (a) Original RGB image. (b) GG-YOLO. (c) YOLOv8s. (d) YOLOv11. (e) FFCA-YOLO. (f) Faster R-CNN. (g) Rtdetr-r18. (h) BorderDet. (I) Overlapping wheat spikes; (II) growth stage; (III) blurred image due to wind; (IV) measurement change.

Table 1. Comprehensive structural comparison between GG-YOLO and established baselines.

Method	Ghost Feature Generation	Dynamic Sampling	Attention Mechanism	Lightweight Head	Agricultural-Oriented Design	Key Difference
GhostNet	✓	×	×	✓	×	Cheap feature generation
DySample	×	✓	×	×	×	Dynamic upsampling
YOLOv8s	×	×	×	×	×	Baseline detector
YOLOv9	×	×	✓	×	×	Programmable gradient learning
YOLOv11	×	×	✓	✓	×	Lightweight optimization
GG-YOLO	✓	✓	✓	✓	✓	Gradient-guided attention + Dual-mode sampling + C3Ghost integration

‘✓’ indicates that the respective option/component is included, whereas ‘×’ indicates that it is not included.

Table 2. Detection results of GG-YOLO on the GlobalWheat2020 and HNKJXYwheat datasets.

Metric	GlobalWheat2020	HNKJXYwheat
P%	94.35 ± 0.11	90.85 ± 0.16
R%	91.93 ± 0.14	87.23 ± 0.22
mAP@50%	96.47 ± 0.09	89.25 ± 0.17
mAP@50-90%	60.78 ± 0.18	57.28 ± 0.24
Parameters (M)	7.89	7.89

Table 3. Quantitative results of different methods.

Model	P% ↑	R% ↑	mAP@50% ↑	mAP@50-90% ↑	Parameters (M) ↓	GFLOPs ↓	FPS ↑
YOLOv8s	92.75 ± 0.12	87.54 ± 0.18	93.87 ± 0.15	55.54 ± 0.20	11.44	28.7	145
YOLOv9	92.83 ± 0.14	88.72 ± 0.21	94.19 ± 0.13	56.37 ± 0.22	16.87	32.0	138
YOLOv11	92.61 ± 0.16	89.38 ± 0.19	94.76 ± 0.11	56.80 ± 0.23	25.8	26.4	142
Rtdetr-r18	94.08 ± 0.10	91.54 ± 0.14	96.32 ± 0.09	61.70 ± 0.17	20.18	58.3	110
Rtdetr-l	92.90 ± 0.13	91.67 ± 0.18	95.13 ± 0.12	58.07 ± 0.21	32.0	63.0	98
Faster R-CNN	92.43 ± 0.15	90.63 ± 0.20	93.27 ± 0.14	60.75 ± 0.24	38.7	68.0	85
FFCA-YOLO	91.82 ± 0.18	89.02 ± 0.17	94.50 ± 0.16	56.64 ± 0.22	18.4	24.8	152
YOLO-World	93.20 ± 0.14	90.10 ± 0.15	95.20 ± 0.11	52.30 ± 0.19	28.5	35.2	132
Keras-retinanet	91.50 ± 0.21	88.30 ± 0.24	92.80 ± 0.18	48.60 ± 0.25	36.2	72.1	76
BorderDet	93.80 ± 0.12	91.20 ± 0.16	96.00 ± 0.13	54.20 ± 0.20	22.7	45.6	120
GG-YOLO	94.35 ± 0.09	91.93 ± 0.12	96.47 ± 0.08	60.78 ± 0.15	7.89	20.4	165

‘↑’ indicates that a higher value represents better performance, while ‘↓’ indicates that a lower value is better.

Table 4. Ablation experiments showing the effect of different module combinations.

YOLOv8s	DSample	DAGM	C3Ghost	R% ↑	mAP@50-90% ↑	mAP@50% ↑	Parameters (M) ↓	GFLOPs ↓	FPS ↑
✓				87.54 ± 0.18	55.54 ± 0.22	93.87 ± 0.14	11.14	28.7	145
✓	✓			89.38 ± 0.21	57.31 ± 0.19	95.47 ± 0.12	11.16	30.7	115
✓	✓	✓		90.24 ± 0.17	57.76 ± 0.24	95.96 ± 0.15	14.3	30.1	95
✓	✓	✓	✓	91.93 ± 0.13	60.78 ± 0.16	96.47 ± 0.09	7.89	20.4	165

‘✓’ indicates that the respective option/component is included. ‘↑’ indicates that a higher value represents better performance, while ‘↓’ indicates that a lower value is better.

Table 5. Performance evaluation on edge hardware.

Model	Model Size (MB)	Peak Memory (MB)	Latency (ms)	FPS
YOLOv8s	22.9	265	23.2	43
GG-YOLO	15.8	190	15.6	64


