Article

YOLOv8n-DSP: A High-Precision Model for Oat Ear Detection and Counting in Complex Fields

1 School of Intelligent Equipment Engineering, Wuxi Taihu University, Wuxi 214064, China
2 Wuxi Key Laboratory “Intelligent Robot and Special Equipment Technology Key Laboratory”, Wuxi Taihu University, Wuxi 214064, China
3 College of Agricultural Engineering, Shanxi Agricultural University, Jinzhong 030801, China
4 Dryland Farm Machinery Key Technology and Equipment Key Laboratory of Shanxi Province, Jinzhong 030801, China
* Author to whom correspondence should be addressed.
Agronomy 2026, 16(1), 133; https://doi.org/10.3390/agronomy16010133
Submission received: 22 October 2025 / Revised: 3 December 2025 / Accepted: 26 December 2025 / Published: 5 January 2026
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Accurate detection and counting of oat ears are essential for yield estimation but remain challenging in complex field environments due to occlusion, significant scale variation, and fluctuating lighting. The aim of this study is to develop a high-precision detection and counting model to address these challenges. The adopted methodology was an improved YOLOv8n model, named YOLOv8n-DSP. To address significant scale variation, a Diverse Branch Block (DBB) was introduced into the backbone to enhance multi-scale feature representation. For improved detection of small, dense oat ears, the neck was augmented with a Spatial and Channel Synergistic Attention (SCSA) mechanism to strengthen discriminative feature extraction. Furthermore, to refine localization among overlapping oat ears, the PIoUv2 loss function was employed for bounding box regression. The main results revealed that the proposed model achieved a mean average precision (mAP) of 94.0% and an F1-score of 87.1% on the oat ear detection task, representing gains of 3.2 and 1.8 percentage points over the baseline YOLOv8n, respectively. For counting, it reached an accuracy of 82.5%, a 9.2-point improvement. In conclusion, the YOLOv8n-DSP model provides an effective and practical approach for in-field oat ear detection and counting, suggesting considerable potential as a reliable tool for future yield prediction systems and advanced intelligent agricultural equipment.

1. Introduction

Accurate and efficient estimation of cereal crop yield is a fundamental objective in precision agriculture, enabling optimized resource allocation and strategic planning. As a widely cultivated cereal, oat holds significant importance for livestock feed, human nutrition, and industrial processing [1]. The oat ear, being a primary yield component, serves as a key trait for characterizing productivity [2]. Therefore, the reliable detection and counting of oat ears are fundamental for constructing robust yield estimation models and enabling data-driven farm management.
With the rapid advancement of intelligent agricultural technologies, deep learning-based object detection has become a primary method for agricultural visual inspection, offering significant advantages over traditional image processing approaches [3,4]. Among the various frameworks, the YOLO series has emerged as one of the most prominent in crop recognition, owing to its exceptional balance of real-time performance and high accuracy [5]. This has spurred numerous researchers to conduct multi-faceted optimizations of deep learning algorithms for practical agricultural scenarios [6], with remarkable progress being made in the study of key cereal crops like wheat.
A prominent optimization direction involves the integration of attention mechanisms to enhance feature extraction. For instance, Li et al. [7] incorporated a four-fold downsampling layer and the CBAM module into YOLOv5 for wheat ear detection. Similarly, Qing et al. [8] and Yu et al. [9] employed dual attention mechanisms (ECA & EMA) and an Enhanced E2CBAM module, respectively, demonstrating significant accuracy gains in complex fields. However, a limitation of such classical attention mechanisms is their tendency to process channel and spatial attention sequentially or independently, which may fail to capture the intricate interdependencies crucial for distinguishing densely packed targets. Concurrently, substantial efforts have been directed toward lightweight design and loss function improvements to enhance deployment efficiency. Lu et al. [10] and Tong et al. [11] achieved notable parameter reduction and maintained real-time performance through backbone replacement and architectural tweaks in YOLOv7-tiny and YOLOv5n, respectively. Jing et al. [12] further simplified the YOLOv8 detection head, integrating CBAM with Wise-IoU loss. While these strategies improve efficiency, lightweight architectures often come at the cost of reduced model capacity, which can impair the extraction of fine-grained features. Moreover, the effectiveness of advanced loss functions like Inner-CIoU and Wise-IoU may be limited when processing suboptimal feature maps from constrained backbones. Beyond algorithmic refinements, research has progressed to integrated system design, demonstrating practical potential. Li et al. [13] developed a winter wheat growth monitoring system by integrating Faster R-CNN with optimized NMS, while Xu et al. [14] and Huang et al. [15] successfully deployed detection systems for quality sorting and in-field monitoring.
The development of object detection models for oat ears has not kept pace with that for other major cereals, such as wheat, which have benefited from larger public datasets and more extensive research. Current studies on oat ears are limited, and few image datasets are available. A Faster R-CNN model implemented in MATLAB (2024a) [16] demonstrated initial feasibility for oat ear detection. However, the limited dataset size restricted its coverage of diverse field environments. In addition, this method faces implementation limitations and cannot effectively add newer modules, such as complex attention modules. The inherent growth characteristics of oat plants—including a dense canopy that leads to occlusion, substantial variation in ear size, and variable in-field lighting—present additional difficulties for detection. These challenges are compounded by general limitations observed in existing detection models for other crops, such as the restricted feature representation capacity of some lightweight architectures and the insufficient modeling of spatial-channel dependencies in certain attention designs. Therefore, advancing oat ear detection necessitates a dual approach: the construction of a comprehensive, challenging dataset, coupled with the development of a detection model specifically designed to overcome the aforementioned challenges of occlusion, scale variation, and complex lighting.
To address these challenges, this study constructed an oat ear image dataset containing a variety of field scenarios and developed a high-precision model, YOLOv8n-DSP, for in-field oat ear detection and counting. The model is built upon the YOLOv8n baseline, which was selected for its balanced performance and efficient architecture. Our proposed model incorporates three key improvements to enhance robustness in complex field conditions: (1) a Diverse Branch Block (DBB) in the backbone to strengthen multi-scale feature representation; (2) a Spatial and Channel Synergistic Attention (SCSA) mechanism in the neck to improve feature discrimination for occluded and dense ears; and (3) the PIoUv2 loss function for more accurate bounding box regression. Consequently, the YOLOv8n-DSP model provides an effective solution for in-field oat ear detection and counting, establishing a foundation for future yield prediction systems and intelligent agricultural equipment.

2. Materials and Methods

2.1. Construction of the Oat Ear Dataset

The oat ear image dataset used in this study was collected from the Shenfeng Experimental Base of Shanxi Agricultural University (Shanxi, China) during the local oat maturity period, from June to August 2025. The cultivar under investigation was “Pinyan No. 4”, which was bred and cultivated by the Agricultural Genetic Resources Research Center of Shanxi Agricultural University. This variety was developed through sexual hybridization combined with pedigree selection from naked oats. Its morphological characteristics at maturity are consistent with those of conventional oat varieties, ensuring that varietal differences do not influence the experimental results. All images were captured using a vivo S1 smartphone (vivo Mobile Communication Co., Ltd., Dongguan, China) in auto-exposure mode during daylight hours (08:00 to 18:00 local time). The device’s default primary camera, which has an equivalent focal length of approximately 26 mm, was employed throughout the acquisition process.
The construction of the oat ear dataset followed a systematic workflow, as illustrated in Figure 1. Image acquisition adhered to a multi-scale control principle, with three shooting distance gradients: close-range (<50 cm), mid-range (50–100 cm), and long-range (100–150 cm). Additionally, three standardized shooting angles were adopted: 90° top-view, 45° oblique-view, and 30° side-view. After a rigorous screening process, a final dataset comprising 6012 valid images was constructed. The screening adhered to strict criteria to ensure data quality and relevance. Specifically, images were excluded for reasons of severe defocus that rendered oat ears indistinguishable, improper exposure such as over-saturation or under-exposure which impaired the discernibility of oat ear details, or because they contained an insufficient number of oat ears to be representative of a typical field scene.
The dataset covers oat ear images under four typical field conditions: frontlight, backlight, leaf occlusion, and oat ear overlap. All images were annotated using the LabelImg (1.8.6) tool. All images were annotated by an experienced oat cultivation agronomist. The annotations were then verified by a second expert to ensure quality and consistency, resulting in bounding boxes that adhered closely to the visible contour of each oat ear. Using stratified sampling, the dataset was partitioned into training set (4810 images), validation set (601 images), and test set (601 images) in an 8:1:1 ratio to ensure balanced data distribution. The details of the dataset are shown in Table 1, and representative image samples from each category are illustrated in Figure 2.
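For readers wishing to reproduce the partitioning step, the sketch below illustrates one way to perform an 8:1:1 stratified split; the condition labels, data organization, and use of scikit-learn are assumptions made for illustration and are not taken from the original pipeline.

```python
# Illustrative sketch (not the authors' code): an 8:1:1 stratified split, assuming
# each image path is paired with a scene-condition label such as "frontlight",
# "backlight", "occlusion", or "overlap".
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, conditions, seed=0):
    # Hold out 20% of the images while preserving the condition proportions.
    train, rest, _, rest_cond = train_test_split(
        image_paths, conditions, test_size=0.2, stratify=conditions, random_state=seed
    )
    # Split the held-out portion evenly into validation and test sets (10% each overall).
    val, test = train_test_split(rest, test_size=0.5, stratify=rest_cond, random_state=seed)
    return train, val, test
```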

2.2. Construction of Oat Ear Detection Model

As a new generation of target detection framework launched by Ultralytics, YOLOv8 achieves performance breakthroughs through multi-dimensional architecture innovation [17,18,19]. Its core technical features include a backbone network based on DarkNet53 that is optimized with depthwise separable convolutions and constructs a multi-scale feature pyramid through a five-stage downsampling process. For feature fusion, an innovative neck module replaces the traditional PAN structure with a Bi-directional Feature Pyramid Network (BiFPN), which employs learnable weights to enable adaptive feature enhancement across layers. Furthermore, the architecture incorporates a decoupled detection head that separates classification and regression tasks. This design is combined with a dynamic label assignment strategy to more effectively define positive and negative samples, significantly boosting detection capability for small objects. The model generates predictions at three feature map scales (80 × 80, 40 × 40, and 20 × 20). Each detection head outputs bounding box coordinates (x, y, w, h), object confidence, and class probabilities. Finally, the predictions are integrated using a weighted non-maximum suppression (Weighted NMS) algorithm to produce high-precision results [20,21,22]. The complete model architecture is illustrated in Figure 3.
YOLOv8n, a lightweight model in the YOLOv8 series, is widely used in real-time object detection tasks. Building upon YOLOv8n, this study proposes the YOLOv8n-DSP model, specifically designed for in-field oat ear detection. The model incorporates the DBB module into the backbone network to enhance multi-scale feature extraction for oat ears in complex field environments. The SCSA attention mechanism is integrated into the neck network to improve the perception and recognition of oat ears of varying sizes. Furthermore, the original CIoU loss function is replaced with the PIoUv2 loss, which improves bounding box localization accuracy and mitigates missed detections caused by overlapping oat ears. The main architecture of the improved model is shown in Figure 4.

2.2.1. Construction of DBB Module

The replacement of the standard Bottleneck module with the Diverse Branch Block (DBB) is motivated by architectural limitations of the original design and the specific requirements of oat ear detection. Theoretically, the Bottleneck module relies on a sequential and structurally fixed convolution pattern, which may restrict its ability to adaptively capture features across multiple scales. In comparison, the DBB incorporates a parallel multi-branch topology that integrates convolutional layers with varying kernel sizes, enabling more flexible receptive field adjustment and enriched feature representation. From a task-oriented perspective, this structural property is particularly relevant for detecting oat ears under field conditions, where targets often exhibit substantial scale variation and are frequently partially occluded. The multi-branch design of the DBB allows the model to learn more discriminative representations under such complex scenarios.
The DBB is a structured module designed to enhance the feature representation capability of convolutional neural networks. Its core concept lies in a multi-branch parallel architecture that improves feature diversity [23]. The module integrates multiple heterogeneous branch structures within a single convolutional layer, including standard convolutions of varying kernel sizes, depthwise separable convolutions, and non-linear transformation units. By leveraging complementary feature extraction across these branches, the DBB achieves flexible expansion of the receptive field [24,25].
During training, the DBB module constructs a multi-branch feature extraction system through six types of structural transformations. These include (1) branch addition, (2) depthwise concatenation, (3) multi-scale convolutional operations, (4) average pooling layers, (5) convolutional sequence stacking, and (6) cross-branch feature fusion [26]. The implementation of these transformations involves a series of core operations, such as convolution-batch normalization merging, multi-scale kernel transformation, equivalent convolution replacement for average pooling, and depth concatenation-based feature re-organization. The 1 × 1 convolution rapidly aggregates channel information, while the K × K convolution expands the receptive field, where the value of K is determined by the kernel size of the replaced convolutional block. The average pooling branch, combined with convolutional operations, serves as a downsampling strategy that helps retain critical information while reducing redundant features. The complete transformation pipeline is illustrated in Figure 5. Crucially, through structural re-parameterization, the entire multi-branch DBB module can be equivalently converted into a single standard convolution during inference, as shown in Figure 6. This design preserves the benefits of feature diversity during training while maintaining high computational efficiency in deployment.
In this study, we introduce a structural innovation into the backbone network of YOLOv8n by replacing the Bottleneck module in the original C2f structure with the DBB module. As illustrated in Figure 7, the enhanced backbone leverages the parallel multi-branch topology of the DBB to achieve diversified representations in the convolutional kernel space. This improvement effectively enhances the model’s capability to extract and fuse multi-scale features of oat ears in complex field environments.
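To make the re-parameterization idea concrete, the following minimal PyTorch sketch shows how a convolution-batch normalization pair can be fused into a single convolution and how parallel branches (for example, a 1 × 1 branch embedded in a K × K kernel) can be merged by summing their equivalent kernels. It is an illustrative simplification under these assumptions, not the reference DBB implementation.

```python
# Minimal sketch of the re-parameterization idea behind the DBB (an assumption-laden
# simplification, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return the (weight, bias) of a single conv equivalent to conv followed by bn."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                   # per-output-channel scale
    weight = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    bias = (bias - bn.running_mean) * scale + bn.bias
    return weight, bias

def embed_1x1_in_kxk(weight_1x1: torch.Tensor, k: int) -> torch.Tensor:
    """Zero-pad a 1 x 1 kernel so it acts as an equivalent k x k kernel."""
    pad = (k - 1) // 2
    return F.pad(weight_1x1, [pad, pad, pad, pad])

def merge_parallel_branches(branches, k: int):
    """Sum the (weight, bias) pairs of parallel branches into one k x k conv."""
    weight = sum(embed_1x1_in_kxk(w, k) if w.shape[-1] == 1 else w for w, _ in branches)
    bias = sum(b for _, b in branches)
    return weight, bias
```

In practice, each training-time branch is first converted into an equivalent (weight, bias) pair with fuse_conv_bn, and merge_parallel_branches then collapses the branches into the single inference-time convolution described in Figure 6.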

2.2.2. SCSA Attention Mechanism

The Spatial and Channel Synergistic Attention (SCSA) mechanism is an innovative module that enhances feature representation by collaboratively modeling interdependencies between spatial and channel dimensions. Unlike conventional attention methods that process spatial and channel information separately, SCSA introduces a unique cross-dimensional interaction strategy to establish dynamic spatial-channel correlations. This enables the network to more accurately capture critical features in complex scenes [27]. The core concept involves constructing a bidirectional information flow, where channel information guides spatial attention computation and spatial cues are integrated into channel attention, resulting in mutually reinforced attention weights [28]. The architectural details of the SCSA mechanism are presented in Figure 8.
In the spatial attention path, channel statistics serve as prior knowledge for calculating spatial weights. A channel-aware spatial pooling operation generates a discriminative spatial attention map, as formulated in Equation (1). Conversely, the channel attention path employs a spatially guided weighting strategy to enhance the modeling of inter-channel dependencies through spatial relationship encoding, with its computation detailed in Equation (2). The outputs of both paths are then dynamically integrated via an adaptive gating mechanism (Equation (3)), which automatically adjusts the contribution ratio of spatial and channel attention based on the input features [29]. This design enables SCSA to maintain low computational complexity while delivering strong performance on various visual tasks.
SA(F) = \sigma\big(f_{conv}([F_{avg};\, F_{max}])\big)    (1)
CA(F) = \sigma\big(W_{1}\,\mathrm{GAP}(F) + W_{2}\,\mathrm{GMP}(F)\big)    (2)
SCSA(F) = F \odot \sigma\big(\mathrm{Conv}_{1\times 1}(CA(F) \otimes SA(F))\big)    (3)
In these formulas, F_avg and F_max ∈ R^(1×H×W) are the channel-wise mean and maximum maps, [·;·] denotes channel concatenation, f_conv is a 3 × 3 convolution, GAP(·) and GMP(·) denote global average pooling and global max pooling, W_1 and W_2 ∈ R^(C×C) are learnable weight matrices, σ is the Sigmoid activation function, ⊙ is element-wise multiplication, ⊗ denotes attention weighting, and Conv_1×1 is a 1 × 1 convolution used to adjust the attention distribution.
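A simplified PyTorch rendering of Equations (1)-(3) is given below; it reflects our reading of the formulas (the module structure, layer shapes, and the name SimpleSCSA are illustrative) rather than the reference SCSA implementation.

```python
# Simplified sketch of the attention computation in Equations (1)-(3).
import torch
import torch.nn as nn

class SimpleSCSA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial path: 3x3 conv over the concatenated channel-mean/max maps, Eq. (1).
        self.f_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # Channel path: learnable weights W1, W2 applied to GAP/GMP statistics, Eq. (2).
        self.w1 = nn.Linear(channels, channels, bias=False)
        self.w2 = nn.Linear(channels, channels, bias=False)
        # 1x1 conv that adjusts the fused attention distribution, Eq. (3).
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Eq. (1): spatial attention from channel-wise mean and max maps.
        f_avg = x.mean(dim=1, keepdim=True)
        f_max = x.amax(dim=1, keepdim=True)
        sa = torch.sigmoid(self.f_conv(torch.cat([f_avg, f_max], dim=1)))   # (b,1,h,w)
        # Eq. (2): channel attention from global average and max pooling.
        gap = x.mean(dim=(2, 3))
        gmp = x.amax(dim=(2, 3))
        ca = torch.sigmoid(self.w1(gap) + self.w2(gmp)).view(b, c, 1, 1)    # (b,c,1,1)
        # Eq. (3): fuse the two paths, adjust with a 1x1 conv, and reweight the input.
        gate = torch.sigmoid(self.conv1x1(ca * sa))
        return x * gate
```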
In this study, the SCSA module was integrated into the feature extraction network. By simultaneously optimizing spatial localization and channel-wise feature filtering, SCSA precisely focuses on target regions and effectively suppresses background interference, significantly reducing both missed detections and false detections in oat ear detection. To concretely embed this mechanism, we constructed the SCSABottleneck block, as illustrated in Figure 9. Built upon the standard bottleneck structure, it inserts the SCSA module between convolutional layers. As the diagram shows, the input feature map first passes through a convolutional layer for channel adjustment, and is then fed into the SCSA module to compute synergistic spatial and channel attention weights. The refined features subsequently undergo a second convolution. Following the residual learning paradigm, the original input can be routed through a final convolutional layer and added to the output, which facilitates gradient flow and stabilizes training.

2.2.3. PIoUv2 Loss Function

To effectively mitigate the issue of anchor box redundancy and address the inherent limitation of traditional IoU-based loss functions being sensitive to target size, the PIoU series of loss functions introduces a dynamically adaptive penalty factor P [30,31], defined by Equation (4).
P = \frac{1}{4}\left(\frac{d_{w1}}{w_{gt}} + \frac{d_{w2}}{w_{gt}} + \frac{d_{h1}}{h_{gt}} + \frac{d_{h2}}{h_{gt}}\right)    (4)
In this equation, P represents the self-penalty factor; d_w1, d_w2, d_h1, and d_h2 denote the absolute distances between the corresponding edges of the predicted box and the ground-truth box; and w_gt and h_gt indicate the width and height of the ground-truth box, respectively, establishing the geometric parameters essential for bounding box regression analysis.
The definition of the PIoU loss function is given by Equations (5)–(8).
f(P) = 1 - e^{-P^{2}}    (5)
\mathrm{PIoU} = \mathrm{IoU} - f(P), \quad -1 \le \mathrm{PIoU} \le 1    (6)
L_{PIoU} = 1 - \mathrm{PIoU}    (7)
L_{PIoU} = L_{IoU} + f(P), \quad 0 \le L_{PIoU} \le 2    (8)
In these equations, f(P) denotes the penalty function adaptive to anchor box quality, PIoU represents the penalized IoU used by the PIoU loss function, IoU signifies the actual IoU, and L_PIoU is the PIoU loss function.
Building upon this foundation, researchers innovatively integrated a Focusing Strategy with a Non-linear Attention Mechanism to develop the enhanced PIoUv2 loss function. The core advantage of this improvement lies in its significantly strengthened optimization capability for Moderate-Quality Predictions, effectively addressing the suboptimal optimization issue inherent in traditional loss functions within this quality range. This architectural design enables the model to demonstrate enhanced robustness and stability when confronting complex detection scenarios—including Boundary Ambiguity, Partial Occlusion, and drastic scale variations—thereby improving detection performance under challenging conditions. The computational methodology and mathematical formulation of the PIoUv2 loss function are formally defined in Equations (9)–(11).
q = e^{-P}, \quad q \in (0, 1]    (9)
L_{PIoUv2} = u(\lambda q) \cdot L_{PIoU}    (10)
L_{PIoUv2} = 3\lambda q \cdot e^{-(\lambda q)^{2}} \cdot L_{PIoU}    (11)
In these equations, q quantifies the quality of anchor boxes; L_PIoUv2 represents the PIoUv2 loss function; u(λq) denotes the attention function; and λ serves as the hyperparameter governing the behavior of the attention function.
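The following sketch transcribes Equations (4)-(11) into PyTorch for clarity; the box format (x1, y1, x2, y2), the externally supplied IoU values, and the choice of the hyperparameter λ are assumptions made for illustration.

```python
# Hedged sketch of the PIoU / PIoUv2 terms from Equations (4)-(11); lambda is assumed.
import torch

def piou_v2_loss(pred, gt, iou, lam: float = 1.3):
    """pred, gt: (..., 4) boxes as (x1, y1, x2, y2); iou: (...,) plain IoU values."""
    w_gt = (gt[..., 2] - gt[..., 0]).clamp(min=1e-6)
    h_gt = (gt[..., 3] - gt[..., 1]).clamp(min=1e-6)
    # Eq. (4): self-penalty P from edge distances normalised by the ground-truth size.
    dw1 = (pred[..., 0] - gt[..., 0]).abs()
    dw2 = (pred[..., 2] - gt[..., 2]).abs()
    dh1 = (pred[..., 1] - gt[..., 1]).abs()
    dh2 = (pred[..., 3] - gt[..., 3]).abs()
    p = (dw1 / w_gt + dw2 / w_gt + dh1 / h_gt + dh2 / h_gt) / 4.0
    # Eqs. (5)-(8): PIoU loss = (1 - IoU) + f(P) with f(P) = 1 - exp(-P^2).
    f_p = 1.0 - torch.exp(-p ** 2)
    l_piou = (1.0 - iou) + f_p
    # Eqs. (9)-(11): non-linear attention u(lam*q) with q = exp(-P).
    q = torch.exp(-p)
    u = 3.0 * (lam * q) * torch.exp(-((lam * q) ** 2))
    return u * l_piou
```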

2.3. Experimental Setup and Configuration Parameters

The experimental platform configuration for this study is as follows: the software environment consisted of Windows 10 operating system, Python 3.9 development environment, PyTorch 2.6.0 deep learning framework, and PyCharm 2024 development tool; the hardware setup included an Intel Core i7-13700H CPU and an NVIDIA GeForce RTX 4070 GPU. To ensure a fair and unbiased comparison, all models in this study, including the baseline YOLOv8n and other mainstream counterparts, were trained from scratch under an identical protocol using the parameters specified below. Detailed training parameters are provided in Table 2.
Beyond the primary parameters summarized in Table 2, the following advanced configurations were utilized. A linear learning rate scheduler was applied, decaying the rate from its initial value to a final value of 0.0001. The training commenced with a 3.0-epoch warmup phase. Exponential Moving Average (EMA) was enabled to smooth model weights and improve generalization. The batch size of 4 was determined as the optimal balance between performance and the memory constraints of our GPU, and the epoch number of 100 was selected as we observed both training and validation metrics had stabilized and converged by this point. The data augmentation pipeline incorporated Mosaic (probability: 1.0, disabled for the last 10 epochs), HSV modifications (hue, saturation, and value gains of 0.015, 0.7, and 0.4), and Random Erasing (probability: 0.4).
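For illustration, the training configuration described above can be approximated with the Ultralytics training interface as sketched below; the model and dataset YAML names are hypothetical, and an initial learning rate of 0.01 is an assumption chosen only so that the linear schedule ends at the stated final value of 0.0001.

```python
# Illustrative reproduction of the training setup with the Ultralytics API
# (a sketch under stated assumptions, not the authors' exact script).
from ultralytics import YOLO

model = YOLO("yolov8n-dsp.yaml")        # hypothetical config containing the DBB/SCSA modules
model.train(
    data="oat_ears.yaml",               # hypothetical dataset description file
    epochs=100,
    batch=4,
    lr0=0.01, lrf=0.01,                 # linear decay ending at 0.0001 (lr0 is assumed)
    warmup_epochs=3.0,
    mosaic=1.0, close_mosaic=10,        # Mosaic on, disabled for the last 10 epochs
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV gains listed in Section 2.3
    erasing=0.4,                        # Random Erasing probability
)
```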

2.4. Evaluation Indicators

This study employs a multi-dimensional evaluation system to comprehensively assess model performance. In terms of detection accuracy, the F1-score, precision, recall, and mean average precision (mAP) are evaluated. For computational efficiency, theoretical computational complexity is measured via model GFLOPs, storage requirements are assessed by parameter count, and practical deployment performance is reflected through inference time. All tests were conducted in a unified environment to ensure comparability of results. The detailed calculation formulas are provided in Equations (12)–(16).
F_{1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (12)
\mathrm{Precision} = \frac{TP}{TP + FP}    (13)
\mathrm{Recall} = \frac{TP}{TP + FN}    (14)
AP = \int_{0}^{1} P(r)\, dr    (15)
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}    (16)
In these formulas, TP denotes true positives (correctly detected oat ears), FP denotes false positives (incorrect detections), FN denotes false negatives (missed detections), N is the total number of categories, and AP_i denotes the AP value of the i-th category.
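As a small worked illustration of Equations (12)-(16), the snippet below computes precision, recall, and F1 from hypothetical counts and numerically integrates a precision-recall curve to obtain AP; the counts and the interpolation scheme are illustrative assumptions, not values from this study.

```python
# Toy illustration of Equations (12)-(16) for a single class.
import numpy as np

tp, fp, fn = 86, 12, 13                                 # hypothetical counts
precision = tp / (tp + fp)                              # Eq. (13)
recall = tp / (tp + fn)                                 # Eq. (14)
f1 = 2 * precision * recall / (precision + recall)      # Eq. (12)

def average_precision(recall_pts, precision_pts):
    """Numerically integrate the precision-recall curve, Eq. (15)."""
    r = np.concatenate(([0.0], recall_pts, [1.0]))
    p = np.concatenate(([0.0], precision_pts, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]            # monotone precision envelope
    return float(np.sum(np.diff(r) * p[1:]))

# mAP, Eq. (16), is then simply the mean of the per-class AP values.
```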

3. Results and Discussion

3.1. Ablation Test and Performance Analysis of the Improved Model

A comparative analysis between the baseline YOLOv8n and the proposed YOLOv8n-DSP model was first conducted under multiple random seeds to evaluate the overall efficacy of the proposed modifications. The results are summarized in Table 3.
As summarized in Table 3, the YOLOv8n-DSP model achieves a precision of 87.4%, representing an increase of 2.7 percentage points over the YOLOv8n baseline (84.7%). The mean average precision (mAP) reaches 94.0%, an improvement of 3.2 points from the baseline (90.8%), while recall shows a modest gain from 85.9% to 86.8%. These results indicate that the proposed modifications enhance detection performance effectively, maintaining a balance across key metrics. The observed precision improvement, particularly under varying illumination, suggests an enhanced ability to distinguish oat ears from complex backgrounds, a benefit attributed to the feature-selection capability of the SCSA mechanism. Similarly, the higher mAP reflects more accurate bounding box localization, which stems from the multi-scale feature extraction enabled by the DBB module and the optimized regression behavior of the PIoUv2 loss. These traits are especially valuable for detecting occluded and overlapping oat ears. It is also notable that the model preserves baseline recall while reducing false positives, indicating retained sensitivity to potential targets. These performance gains can be contextualized within broader methodological trends in agricultural vision. The role of attention mechanisms in improving feature discrimination is well established. For example, Li et al. [7] integrated the CBAM module into YOLOv5 for wheat ear detection to suppress background interference. Our SCSA module follows a similar rationale but implements a more integrated spatial-channel weighting strategy. Likewise, the mAP improvement under occlusion and scale variation aligns with reported benefits of multi-scale architectures. Tong et al. [11] also achieved mAP gains by modifying YOLOv5 for multi-scale wheat ear detection. Our approach, using the DBB module, extends this principle through a parallel multi-branch design that enriches feature representation across receptive fields—a capability particularly relevant for oat ears given their significant size variations in field settings, as noted in studies emphasizing scale-invariant feature extraction for crops [25,32].
To further examine the contribution of each proposed module, a series of systematic ablation studies was carried out, with the results detailed in Table 4.
The introduction of the DBB module increased the F1-score by 2.0 percentage points to 87.30% and precision by 3.0 percentage points to 87.7%. By dynamically adjusting channel dimensions within the bottleneck layer, it enabled adaptive multi-scale feature fusion, effectively suppressing interference from complex field backgrounds. However, this came at a considerable computational cost, with GFLOPs increasing by 63% and the model size growing by 121%. The SCSA module further elevated the mAP to 92.1% through its dual spatial-channel attention mechanism, which enhanced feature responses in key oat ear regions via spatial attention and optimized weight allocation across channels. Notably, these significant gains were achieved with computational efficiency comparable to the baseline. The PIoUv2 module boosted recall by 1.6 percentage points to 87.5%, owing to its refined bounding box regression strategy that reduced missed detections among adjacent ears. It also increased mAP by 1.9 percentage points, indicating that its enhanced localization criteria improved overall detection quality. Crucially, as PIoUv2 operates solely through loss function modification, it provides these advantages without adding any computational overhead during inference.
In terms of module combination effects, the integration of DBB and SCSA increased mAP to 93.2%, demonstrating a synergistic optimization effect between feature enhancement and attention mechanisms. A slight decrease in F1-score observed when combining DBB with SCSA points to a nuanced interaction between these modules. While the DBB module generates a rich, yet potentially redundant, set of multi-scale features through its parallel branches, the subsequent SCSA mechanism performs rigorous selection to prioritize the most discriminative spatial and channel information. Although this filtering is beneficial for focus, it may inadvertently suppress less salient contextual features that remain valuable for distinguishing densely packed and morphologically similar oat ears. This observed trade-off between feature diversity (from DBB) and feature selectivity (from SCSA) parallels the “feature suppression” effect noted in general object detection by Wang et al. [29]. By extending this observation to dense crop detection, our findings suggest that balancing these two aspects requires careful calibration, particularly for small and adhesive targets.
Combining DBB with PIoUv2 resulted in a performance drop, suggesting a suboptimal interaction between their optimization mechanisms. This may be due to a mismatch in their primary operational objectives. The DBB module excels at generating diverse feature representations across scales, yet this very diversity can introduce variability into the feature maps supplied for regression. Conversely, the PIoUv2 loss function relies on a stable and consistent feature representation to effectively execute its dynamic focusing mechanism, which assesses anchor box quality to guide regression [30]. Consequently, variability in the input features can perturb this quality assessment, leading to suboptimal gradient updates. This interpretation is supported by Liu et al. [30], who observed that the efficacy of advanced IoU losses is contingent upon the stability of the preceding feature extraction process. Thus, our empirical result suggests that integrating a highly variable feature extractor with a sensitive, quality-aware regression loss may necessitate additional stabilization techniques within the training pipeline.
In contrast, the combination of SCSA and PIoUv2 exhibited the best overall performance: all metrics showed steady improvement while maintaining high computational efficiency. This indicates that the spatial-channel attention mechanism and the enhanced localization evaluation criteria complement each other effectively, achieving a more favorable balance between detection accuracy and computational cost. The complete model, which integrates all three proposed modules, achieves significant improvements across multiple metrics. It reaches a mAP of 94.0% and an F1-score of 87.10%, demonstrating a strong balance between recall (86.8%) and precision (87.4%). Technically, the DBB module enhances multi-scale feature representation, the SCSA mechanism optimizes the spatial and channel-wise distribution of these features, and the PIoUv2 loss further improves localization quality through refined bounding box regression. These components collectively form a coherent enhancement pipeline: feature extraction → feature refinement → detection optimization.
Building on the ablation results in Table 4, this study further examines how different module combinations affect model performance, in particular the synergistic or inhibitory interactions between modules, by analyzing the models numbered 5 to 8 in greater depth. These models cover the pairwise module combinations and the full three-module combination. By plotting and comparing the precision-recall (PR) curves and F1 score-confidence threshold (F1) curves of these models, the overall performance of each combination in the detection task and its stability under different confidence thresholds can be revealed more intuitively.
The comparative analysis of the ablation models from dual perspectives in Figure 10 reveals distinct performance characteristics and operational robustness. As shown in Figure 10a, the Precision-Recall (PR) curves depict the inherent detection capability, where a clear performance hierarchy is established: the complete model (Model 8) achieves the optimal curve, followed by Model 5 (DBB + SCSA), Model 6 (DBB + PIoUv2), and finally Model 7 (SCSA + PIoUv2). This gradient underscores that the inclusion of the DBB module, which enhances multi-scale feature representation, is a fundamental contributor to high performance in scenarios with significant scale variation and dense distribution. Within this framework, the SCSA mechanism and PIoUv2 loss function serve as crucial refinements for discriminating targets from complex backgrounds and improving localization in occluded areas, respectively.
Complementing this, the F1-Confidence curves in Figure 10b evaluate the operational stability under varying decision thresholds. Model 8 again demonstrates superior practicality, maintaining a high F1-score across the broadest threshold range, which reduces the dependency on precise threshold calibration—a key advantage for deployment in variable field conditions. While Models 5 and 6 show competitive potential in the PR space, their steeper F1 declines indicate higher sensitivity to threshold selection. Conversely, Model 7, despite its lower performance ceiling, exhibits a stability comparable to the full model, highlighting that the SCSA and PIoUv2 combination alone can yield a robust, though less accurate, detection baseline.
Although ablation experiments demonstrate the performance improvement brought by the proposed module, a detailed analysis of the computational complexity is crucial for understanding the relevant model cost. As shown in Table 3, the integration of DBB modules in the backbone network brings an increase in computational load and parameters. For further quantitative analysis, Table 5 compares the computational overhead of traditional convolution and DBB modules. The results show that the multi-branch structure of the DBB module requires additional parameters to support its diverse convolution kernel combinations, resulting in an increase in computational complexity.
The integration of the DBB module introduces a quantifiable computational overhead, as detailed in Table 5. The replacement of standard C2f modules with their C2f_DiverseBranchBlock counterparts results in a consistent increase in computational complexity across all levels of the backbone network. The Multi-Adds (MACs) increment ranges from approximately 72.66 MMac in the deep layers to 148.71 MMac in the mid-level layers, representing a relative cost increase between 1.39× and 1.47× compared to the baseline.
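The MACs figures discussed above follow the standard convolution cost formula; the helper below shows the calculation for a hypothetical layer configuration and is not the profiler used to produce Table 5.

```python
# Back-of-the-envelope helper for the MACs figures discussed above (generic formula
# for a standard convolution; the example layer sizes are hypothetical).
def conv_macs(c_in, c_out, k, h_out, w_out, groups=1):
    """Multiply-accumulate operations of a k x k convolution on an h_out x w_out output."""
    return (c_in // groups) * c_out * k * k * h_out * w_out

# Example: a 3x3 convolution mapping 128 -> 128 channels on an 80 x 80 feature map.
print(conv_macs(128, 128, 3, 80, 80) / 1e6, "MMac")  # approx. 943.7 MMac
```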
This analysis clarifies the inherent trade-off in the model improvement process. The performance gains observed in the ablation study, particularly in handling scale variation, are achieved at the cost of a moderate increase in model complexity and computational demand. The multi-branch structure of the DBB module, while effective in enriching feature representation, inherently requires additional parameters and computations to support its parallel convolutional pathways.
To isolate and evaluate the contribution of the proposed PIoUv2 loss function, a head-to-head comparison was conducted on the standard YOLOv8n architecture by replacing only the bounding box regression loss while keeping all other training settings and model components identical. This controlled experiment aimed to objectively assess the intrinsic performance of various loss functions for the oat ear detection task. The results are summarized in Table 6.
The ablation results indicate a clear performance hierarchy, with PIoUv2 achieving the best overall metrics. Its superior performance, particularly in the localization-sensitive mAP@50:95, can be attributed to its dynamic penalty mechanism that likely provides more refined optimization for dense and occluded oat ears. However, the performance gains over the simpler PIoU are consistent yet incremental across all metrics. This suggests that while the design principles of PIoUv2 are beneficial, the absolute advantage conferred by its enhanced architecture for this specific task is measured. Therefore, PIoUv2 is selected as the most effective option, though its marginal lead over PIoU indicates diminishing returns on complexity.
Figure 11 provides a comprehensive comparison of the training dynamics between the proposed YOLOv8n-DSP model and the original YOLOv8n model in the task of oat ear detection.
During the training phase, the proposed YOLOv8n-DSP model demonstrated a rapid decline in all loss components: the bounding box regression loss (train/box_loss) decreased from an initial value of approximately 3.5 to around 1.05 by epoch 100, the classification loss (train/cls_loss) dropped from 3.5 to about 0.65, and the distribution focal loss (train/dfl_loss) reduced from 3.5 to approximately 1.15. The consistent decline and convergence of these three loss values indicate a continuous improvement in the model’s capabilities for oat ear localization, classification, and handling of challenging samples. During validation, the val/box_loss, val/cls_loss, and val/dfl_loss stabilized around 1.35, 1.3, and 1.5, respectively. The small gap between training and validation losses, along with their convergent behavior, reflects the model’s strong generalization ability and resistance to overfitting. In contrast, although the original YOLOv8n model also exhibited a decreasing trend in training loss, its convergence was slower: the final train/cls_loss remained around 0.75 and train/dfl_loss around 1.25. More notably, its validation losses were significantly higher, with val/box_loss around 1.6, val/cls_loss near 1.7, and val/dfl_loss approximately 1.75, indicating noticeable overfitting.
Overall, by incorporating the DBB module, SCSA attention mechanism, and PIoUv2 loss function, the YOLOv8n-DSP model exhibits stronger adaptability, robustness, and generalization capability in oat ear detection tasks, with particularly notable performance improvements in complex natural field environments. The systematic ablation study quantifies the contribution of each component and reveals their interactive effects. The performance gain from the DBB module highlights the importance of multi-scale feature representation for oat ears, which exhibit substantial size variations in the field. This observation aligns with the principle of multi-scale feature extraction, which is central to many advanced detectors in agricultural vision tasks [32,33]. The effectiveness of the SCSA mechanism in improving mAP with minimal computational overhead demonstrates the benefit of synergistic spatial-channel attention. This approach more effectively suppresses background interference in complex scenes, a common challenge in crop detection [8,9].
A slight performance decrease was observed when combining DBB and SCSA (Model 5), indicating a trade-off between feature diversity and feature selectivity. This can be attributed to the DBB module enriching the feature space, while the SCSA mechanism refines it through selective attention weighting. This phenomenon is consistent with the ‘feature suppression’ effect reported in other studies that integrate multi-branch feature extractors with attention mechanisms [29]. This suggests that for dense small object detection, a more adaptive feature fusion strategy may be required beyond simply stacking advanced modules.

3.2. Comparison Test of Different Detection Models

To comprehensively evaluate the performance of the YOLOv8n-DSP model in oat ear detection tasks, this study conducted a comparative analysis with current mainstream models in the YOLO series, with detailed results provided in Table 7.
In terms of comprehensive evaluation metrics, the YOLOv8n-DSP model demonstrates outstanding overall performance, achieving a mean average precision (mAP) of 94.0%, surpassing all compared models. It exceeds the traditional lightweight model YOLOv3-Tiny (86.5%) by 7.5 percentage points, outperforms YOLOv5n (90.1%), which incorporates a Focus module, by 3.9 percentage points, and maintains a 2.3 percentage point advantage over the newly released YOLOv12n (91.7%). This performance improvement is primarily attributed to dual architectural innovations: the DBB module enables richer multi-scale feature representation through its parallel multi-branch convolutional structure, while the SCSA attention mechanism significantly enhances the extraction of discriminative features of oat ears. Furthermore, YOLOv8n-DSP achieves an F1-score of 87.10%, which not only considerably exceeds those of mainstream lightweight models such as YOLOv5n (84.44%) and YOLOv10s (84.75%) but also remains within a narrow margin of only 0.66 percentage points compared to the current state-of-the-art model, YOLOv11n (87.76%). The mAP advantage of YOLOv8n-DSP over models like YOLOv11n and YOLOv12n likely stems from its task-specific architectural inductive bias. General-purpose models like YOLOv11n/12n are optimized for a broad spectrum of objects on standard benchmarks, often prioritizing a balanced precision-recall trade-off for generic shapes and sizes. In contrast, our model incorporates the DBB and SCSA modules specifically to address the persistent challenges in field-based phenotyping: large scale variation (addressed by DBB’s multi-branch receptive fields) and complex background/occlusion (addressed by SCSA’s synergistic filtering). This targeted design echoes the strategy of Qing et al. [8] and Yu et al. [9], who introduced specialized attention or detection heads for wheat ears, yielding significant gains in complex fields. Conversely, the marginally lower F1-score may reflect YOLOv11n’s more refined optimization for general object detection, potentially through advanced label assignment or classification head design aimed at maximizing the harmonic mean on diverse data. This divergence highlights a key consideration for applied agricultural AI: models optimized solely for general benchmarks may not fully address domain-specific bottlenecks, such as extreme occlusion or subtle inter-target distinctions, which are critically assessed by mAP due to its heavy penalty on poor localization. The multi-branch structure of our DBB module, while enhancing multi-scale representation, may introduce a degree of feature complexity that slightly impacts classification confidence in the most challenging cases, contributing to this F1 gap. This observation suggests that future work could focus on optimizing the balance between feature richness and representational efficiency.
Regarding the balance between precision and recall, YOLOv8n-DSP also exhibits excellent performance. Experimental results show that the model achieves a precision of 87.4% while maintaining a high recall rate of 86.8%. Specifically, although its precision is 3.0 percentage points lower than that of YOLOv12n (90.4%), which features a reinforced classification head design, it is 6.2 percentage points higher than that of the traditional lightweight model YOLOv3-Tiny (81.2%). In terms of recall, it remains comparable to YOLOv3-Tiny (86.9%), which adopts a dense prediction strategy, and shows a substantial increase of 4.9 percentage points over YOLOv10s (81.9%). This excellent balance is mainly due to the multi-level dynamic feature weighting mechanism of the SCSA attention module: the spatial attention component effectively focuses on key regions of oat ears while suppressing background noise, and the channel attention mechanism intelligently prioritizes informative feature channels to optimize classification confidence calibration. The synergy between these components significantly enhances detection stability in complex field environments.
In terms of computational efficiency, YOLOv8n-DSP achieves an effective balance between performance and efficiency through innovative architectural design. Experimental results indicate that the model has a computational complexity of only 13.2 GFLOPs, which is 50.6% lower than that of YOLOv9s (26.7 GFLOPs) and 38.3% lower than YOLOv10s (21.4 GFLOPs). This advantage is largely attributable to the intelligent computational resource allocation strategy of the multi-branch structure within the DBB module. It is noteworthy that despite the significant reduction in computational cost, the model still maintains a high inference speed of 3.7 instances/ms, effectively meeting the real-time requirements for field-based crop detection.
The comparative results indicate that YOLOv8n-DSP performs competitively among contemporary models, demonstrating its applicability to agricultural detection tasks. The higher mAP of YOLOv8n-DSP compared to models like YOLOv11n and YOLOv12n is likely due to its design, which incorporates components targeting agricultural challenges. While general-purpose models achieve strong performance on broad benchmarks, they may be less specialized for challenges specific to field-based phenotyping, such as dense occlusion and subtle target-background distinctions. The integration of DBB and SCSA in our model is intended to address these issues, a strategy that has also been employed effectively in other agricultural detection studies [7,9].
The observed result—a higher mAP but a marginally lower F1-score compared to YOLOv11n—warrants further analysis. One possible explanation is that YOLOv11n employs a classification or label assignment strategy that achieves a better balance between precision and recall on general objects. Conversely, the higher mAP of our model suggests an advantage in localization accuracy, which is an important factor for tasks like counting and size estimation in agricultural applications.

3.3. Visualization of Detection Results

3.3.1. The Actual Detection Effect of Different Detection Models

This study further evaluated the practical performance of various mainstream detection models in oat ear detection tasks. The oat ear images used in the experiments were all collected from real-field environments, covering multiple challenging scenarios such as frontlight, backlight, leaf occlusion, and oat ear overlap. Detailed detection results are shown in Figure 12. In the figure, blue bounding boxes indicate correctly detected oat ears, while red circles and yellow circles mark missed detections and false detections, respectively, after manual verification.
As an early lightweight model, YOLOv3-Tiny was prone to false detections under frontlight or leaf occlusion conditions, particularly for oat ears with low contrast or incomplete morphology, while also exhibiting missed detections in poor lighting environments. These limitations reflect the insufficient feature extraction and target discrimination capabilities of its shallow network architecture. Subsequent improved models such as YOLOv5n and YOLOv6s enhanced detection performance through deeper network structures, but still showed noticeable shortcomings in leaf occlusion and oat ear overlap scenarios, with both false and missed detections persisting. Their adaptability to challenging conditions such as backlighting remained limited. Models like YOLOv9s and YOLOv10s achieved significant improvements in detection accuracy through dynamic network architectures and enhanced feature fusion mechanisms. However, their performance declined noticeably in areas with leaf occlusion or dense oat ears, where missed detections increased in occluded regions and false detections became more frequent among closely spaced targets. Although YOLOv11n demonstrated high bounding box localization accuracy, its ability to distinguish targets within dense clusters still requires further improvement. YOLOv12n showed a relatively low missed detection rate overall but exhibited a slight increase in false positives in regions with complex leaf textures. It is hypothesized that the integration of attention mechanisms and multi-scale prediction modules enhanced its robustness to occlusion and overlap, but may also have led to oversensitivity to local textures.
The proposed YOLOv8n-DSP model in this study achieves a breakthrough in performance through multiple architectural innovations: the DBB module enhances multi-scale feature representation via its multi-branch structure, improving adaptability to varying lighting conditions; the SCSA attention mechanism strengthens the capture of key features and suppresses background interference through spatial-channel synergy; and the PIoUv2 loss function addresses dense target detection by dynamically adjusting weights to improve recognition of overlapping objects. Compared to existing mainstream models, YOLOv8n-DSP significantly reduces both false and missed detections across different lighting conditions and demonstrates improved capability in identifying partially occluded oat ears in leaf occlusion scenarios. Nevertheless, a small number of missed detections persist in areas of severe oat ear overlap. This limitation reflects the inherent constraint of using horizontal bounding boxes for deeply interlocked targets, a challenge not unique to our model but rather a well-documented limitation of the standard detection paradigm in agricultural vision [9]. Our visual results provide empirical support for this observation and align with the findings of Yu et al. [9], who demonstrated that shifting from horizontal to oriented bounding boxes with a specialized detector significantly improved performance for overlapping wheat ears, as OBBs minimize background inclusion and offer a superior spatial fit. Although the SCSA mechanism enhances the model’s attention to salient regions through feature weighting, it still struggles to fully separate individual ears that share highly similar features and closely overlapping positions. Several technical paths could improve detection performance in such scenarios: introducing rotated bounding box annotations may reduce boundary ambiguity through a more accurate description of ear orientation; incorporating lightweight instance segmentation supervision into the existing framework may help separate adhering targets through contour information; and designing an attention mechanism that explicitly models the spatial relationships among ears and grains is also expected to enhance the model’s ability to identify overlapping targets.
The qualitative results in Figure 12 are consistent with the quantitative findings and reflect challenges commonly reported in the literature. The reduction in false detections under conditions such as leaf occlusion and backlighting indicates that the SCSA mechanism contributes to suppressing background interference and improving feature discrimination, which aligns with issues noted in previous studies [7,15]. The improved detection of overlapping ears may be associated with the more accurate localization provided by the PIoUv2 loss function. Nevertheless, the persistent missed detections in highly dense clusters highlight a limitation inherent to the use of horizontal bounding boxes in deeply occluded scenarios. This constraint has also been recognized in other studies on cereal ear detection [9,16]. These observations support the consideration of alternative detection frameworks in future work, such as oriented bounding boxes or instance segmentation, which have been applied to similar overlapping objects in agricultural contexts.

3.3.2. Heatmap Comparison of Oat Ear Detection Models Before and After Improvement

To visually interpret the model’s focus areas, we generated feature activation heatmaps using Grad-CAM, a method commonly used to visualize spatial attention in convolutional networks. Heatmaps serve as an intuitive visualization tool for highlighting regions of interest in the model’s feature representation, clearly reflecting the intensity of the model’s attention and its ability to capture critical features, and thus provide an essential basis for evaluating detection performance. By analyzing the distribution, intensity variation, and correspondence with target regions in heatmaps, it is possible to determine whether the model accurately identifies targets, suffers from background interference, or experiences feature loss [34,35].
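For reference, a generic Grad-CAM computation using forward and backward hooks is sketched below; the choice of target layer and the scalar scoring function for a detector are assumptions, and this is not necessarily the exact visualization pipeline used to produce Figure 13.

```python
# Minimal Grad-CAM sketch with forward/backward hooks (generic illustration only;
# the target layer and the score_fn that reduces detector output to a scalar are assumed).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """image: (1, 3, H, W) tensor; returns a (1, 1, H, W) heatmap in [0, 1]."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    score = score_fn(model(image))          # scalar to explain (e.g., summed ear confidence)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```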
In this study, targeting the task of oat ear recognition, we integrated the DBB module, SCSA attention mechanism, and PIoUv2 loss function into the YOLOv8n architecture to construct an improved model named YOLOv8n-DSP. The performance advantages of the proposed method were validated through a comparative analysis of feature attention heatmaps under four typical scenarios: frontlight, backlight, leaf occlusion, and oat ear overlap. The results are presented in Figure 13.
Under frontlight conditions, YOLOv8n locates oat ears but its heatmap attention is easily distracted by background elements, resulting in imprecise focus. In comparison, YOLOv8n-DSP leverages the DBB module’s hierarchical feature extraction and the SCSA mechanism’s spatial-channel weighting to concentrate heatmap attention precisely on the core regions of oat ears, effectively suppressing background interference. In backlight scenarios, where oat ear details are often obscured, YOLOv8n frequently produces heatmaps with shifted attention or blurred edges. Conversely, YOLOv8n-DSP uses the spatial attention branch of SCSA to filter out noise from bright backgrounds, while the DBB module preserves low-level texture information through cross-layer fusion. Consequently, its heatmaps clearly outline complete oat ear contours. When leaf occlusion is present, YOLOv8n often exhibits missing or fragmented attention in occluded areas due to insufficient feature extraction. YOLOv8n-DSP, however, enhances contextual semantic connections in these areas via dynamic fusion of shallow and deep features by the DBB module. Combined with channel-wise weighting from SCSA, the heatmaps maintain strong, continuous attention on key parts of partially occluded ears. In cases of oat ear overlap, YOLOv8n struggles to distinguish multiple targets, often generating a single blurred attention region. YOLOv8n-DSP addresses this with a bounding box regression strategy optimized by the PIoUv2 loss, improving localization accuracy. Simultaneously, the spatial attention branch of SCSA discriminates between the spatial positions of different ears, producing independent high-confidence attention points for multiple overlapping targets in the heatmap. This demonstrates the model’s superior ability to handle complex overlapping scenarios.
The Grad-CAM visualizations provide mechanistic insights into the model’s decision-making process, corroborating the hypothesized roles of its core contributions. These visualizations reveal two key improvements through concentrated and coherent activations on oat ears under challenging lighting and occlusion. First, the enhanced activation coherence aligns with the DBB module’s capacity for building robust multi-scale representations—a principle central to many advanced detectors in agricultural vision tasks [7,25]. Second, the effective suppression of background and leaf-occlusion noise demonstrates the spatial-channel filtering effect of the SCSA module, directly addressing the common challenge of background interference in crop detection [8,9]. This filtering effect is analogous to the improved focus on target regions reported in studies employing attention mechanisms for crop disease detection [34] or fruit recognition [35]. A particularly telling observation is the model’s ability to maintain distinct activation foci for overlapping ears, in contrast to the blurred or merged response of the baseline. This capability can be attributed to a productive interaction between SCSA’s spatial discrimination and the localization-prioritizing gradient from the PIoUv2 loss: the SCSA mechanism sharpens spatial attention on individual ear regions, which in turn provides clearer features for PIoUv2 to refine bounding box regression. This observed improvement in handling occlusion aligns with the goals of recent studies aimed at enhancing feature discrimination in dense canopies [9,16].

3.4. Counting Experiment Based on Oat Ear Detection Results

The number of oat ears is a key agronomic trait indicative of oat yield, and their accurate detection and counting are crucial for constructing reliable yield prediction models. To this end, panicle counting experiments per unit area were conducted with both the baseline YOLOv8n and the improved YOLOv8n-DSP models. As shown in Figure 14, considering the oat planting row spacing and actual growth conditions, the counting area was set to 0.5 m × 0.5 m to avoid affecting adjacent plants. During counting, to eliminate interference caused by wind-induced panicle movement, only plants whose root stalks were located within the counting area were treated as valid statistical objects. The panicle counting results of the YOLOv8n and YOLOv8n-DSP models are summarized in Table 8, with the manual count serving as the ground truth for evaluating counting accuracy.
Evaluation of the oat ear counting task showed a clear difference between the models: YOLOv8n-DSP achieved a mean accuracy of 82.4% with a standard error of 0.9%, whereas the original YOLOv8n achieved 73.2% with a standard error of 3.0%. Minor fluctuations did occur in individual samples with simple backgrounds (e.g., Group 6, where YOLOv8n-DSP was less accurate than the baseline). This behavior may be attributed to the adaptive nature of the SCSA attention mechanism: in simple scenarios with minimal background interference, the model’s heightened sensitivity to subtle features can lead to over-analysis of insignificant texture variations or lighting changes, occasionally misclassifying non-target areas or slightly shifting bounding boxes. Furthermore, the DBB module’s multi-scale feature extraction, while advantageous in complex scenes, may introduce unnecessary feature redundancy in straightforward cases. Future work could dynamically adjust the model’s processing intensity according to scene complexity, for example by incorporating a difficulty-perception mechanism or a context-gating module.
The counting experiment extends the evaluation from detection performance to an application-oriented metric. With an achieved counting accuracy of 82.4%, the model establishes a reliable foundation for automated oat panicle density estimation—a crucial proximal phenotypic trait for yield modeling. This aligns with the workflow demonstrated by Li et al. [13], who used a deep learning-based detection system to estimate winter wheat growth parameters as input for yield prediction. It is important to recognize, however—as highlighted by Li et al. [13] and others [15]—that panicle count constitutes only one component of yield. A robust yield estimation model must further incorporate per-panicle yield components (e.g., grain number and weight), which are influenced by genotype, environment, and management practices. Therefore, while not a complete yield prediction solution, our detection and counting model serves as a critical first module in a larger analytical pipeline. It automates the traditionally labor-intensive counting step and provides the essential spatial density data upon which more complex agronomic yield models can be developed.
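For clarity, the per-plot accuracies in Table 8 are consistent with defining counting accuracy as one minus the relative counting error. The short sketch below (our own illustrative code, with the Table 8 counts hard-coded) reproduces the reported means and standard errors under that assumption.

```python
# Minimal sketch: counting accuracy, mean, and standard error from paired counts.
# Assumes accuracy_i = (1 - |manual_i - predicted_i| / manual_i) * 100, which
# reproduces the per-group values reported in Table 8.
import math

manual      = [24, 20, 15, 15, 13, 15, 11, 17, 17, 19]
yolov8n     = [18, 14, 10, 12, 10, 14,  8, 11, 10, 14]
yolov8n_dsp = [21, 16, 13, 12, 11, 12,  9, 14, 14, 15]

def accuracy(gt, pred):
    return [(1 - abs(g - p) / g) * 100 for g, p in zip(gt, pred)]

def mean_and_se(values):
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample standard deviation
    return mean, sd / math.sqrt(n)                                   # standard error of the mean

for name, pred in [("YOLOv8n", yolov8n), ("YOLOv8n-DSP", yolov8n_dsp)]:
    m, se = mean_and_se(accuracy(manual, pred))
    print(f"{name}: mean accuracy {m:.1f}%, standard error {se:.1f}%")
# Expected output: 73.2% ± 3.0% and 82.4% ± 0.9%, matching Table 8.
```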

3.5. Limitations and Future Work

While promising results have been achieved, this study has limitations that should be considered when assessing its contributions and scope. The dataset is derived from a specific geographic and genotypic context, which may affect the model’s performance when directly applied to other regions or cultivars, a common consideration in agricultural AI research [6,16]. Additionally, the evaluation was conducted under controlled offline conditions, which differ from real-field deployment involving factors such as motion blur and strict real-time processing constraints on edge devices. Finally, while the proposed improvements mitigate some issues with occlusion, the fundamental constraint of using horizontal bounding boxes for deeply overlapping and tilted oat ears persists, as visualized in our results.
These limitations delineate clear directions for future research. Efforts will be directed towards constructing a more comprehensive, multi-region and multi-phenology dataset, potentially leveraging UAV and ground-platform synergy. To address the challenge of dense occlusion, exploring alternative detection paradigms such as oriented bounding boxes or instance segmentation is warranted. Finally, transitioning from laboratory validation to practical application will require focused work on model lightweighting and optimization for stable, real-time inference on embedded computing platforms.

4. Conclusions

The accurate detection and counting of oat ears in field environments is essential for non-destructive yield estimation, yet remains challenging due to occlusion, scale variation, and complex lighting conditions. This study aimed to address these challenges by developing a specialized deep learning model, YOLOv8n-DSP, building upon the efficient YOLOv8n architecture.
The primary contribution of this work lies in a coherent set of architectural improvements designed to tackle specific field-based complexities. To enhance multi-scale perception for ears of varying sizes, the Diverse Branch Block (DBB) was incorporated into the backbone, enriching feature representation through a parallel multi-branch structure. To improve discrimination of densely packed and partially occluded ears, a Spatial and Channel Synergistic Attention (SCSA) mechanism was integrated into the neck, enabling more focused feature extraction by modeling cross-dimensional dependencies. Furthermore, the PIoUv2 loss function was employed to refine bounding box regression, specifically aiming to improve localization accuracy for overlapping targets. Ablation studies validated the efficacy of each component and revealed their interactive effects, with the full model integrating all three enhancements.
The potential applications of this research can be explored in two primary directions. First, the detection and counting pipeline provides a core technical module for constructing non-destructive oat yield estimation frameworks. By automating panicle counting, it supplies a fundamental input variable (panicle density) for yield models, supporting more objective agronomic decision-making. Second, the model’s balanced computational profile—3.5 M parameters, 8.9 GFLOPs, and an inference speed of 3.7 ms per instance—indicates its suitability for deployment on embedded systems (e.g., NVIDIA Jetson series). This enables its integration into intelligent agricultural equipment, such as UAVs or ground robots, for real-time, in-field oat panicle monitoring and phenotyping.
Methodologically, this study contributes an efficient and effective design strategy for crop detection in complex environments. It demonstrates that integrating a re-parameterizable multi-scale backbone (DBB) with a synergistic spatial-channel attention mechanism (SCSA) within a single-stage detector can address key field challenges without excessive computational cost. This approach differs from previous methods that were computationally intensive or platform-limited [16], thereby contributing to the methodological toolkit for agricultural vision systems that need to balance accuracy with operational constraints.
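A brief technical note on why the training-time branches need not inflate inference cost: structural re-parameterization folds linear branches into a single convolution after training. The sketch below shows only the elementary conv–BatchNorm fusion step on which DBB’s branch-merging rules build; it is a minimal illustration, not the DBB implementation itself.

```python
# Minimal sketch of conv + BatchNorm fusion, the elementary step behind
# structural re-parameterization (DBB merges its branches with rules built on this).
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single Conv2d equivalent to bn(conv(x)) at inference time."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

# Quick equivalence check: give BN non-trivial running statistics first.
conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.train()
with torch.no_grad():
    for _ in range(5):
        bn(conv(torch.randn(4, 8, 32, 32)))   # update BN running statistics
conv.eval(); bn.eval()
x = torch.randn(1, 8, 32, 32)
with torch.no_grad():
    assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5)
```

Because the fused convolution is mathematically identical to the original conv–BN pair at inference time, the extra representational capacity of multi-branch training does not carry a proportional deployment cost.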
Observations from the results indicate that the model’s performance can be limited in densely occluded scenarios, particularly where oat ears severely overlap. Several research directions warrant further investigation:
(1)
Constructing a multi-perspective and multi-phenology oat ear dataset. Oats are widely cultivated in many countries, and some regions harvest two seasons per year. Systematically incorporating image data from different varieties and growing regions would improve the representativeness and generalization ability of the dataset. A viable approach involves employing Unmanned Aerial Vehicles (UAVs) for nadir and oblique aerial imaging, complemented by ground-based mobile platforms, to capture images across key growth stages from heading to maturity. Additionally, establishing a dedicated sub-dataset captured under varying wind conditions would address the challenge of plant movement in field environments. This multi-faceted acquisition strategy would overcome the limitations of a single ground-based perspective and enable the development of large-scale, high-quality oat ear image datasets.
(2)
Introducing oriented (rotated) bounding box annotation. Dataset quality depends not only on the number of images but also on the accuracy of annotation. In field environments, natural factors such as wind frequently tilt oat plants, and traditional horizontal bounding boxes then enclose substantial background. If the detection model extracts features from these irrelevant regions, its recognition of the target itself may be disturbed, degrading detection performance; the false detections of partially overlapped and tilted samples in the above experiments reflect this problem (see the geometric sketch after this list). Establishing a dedicated oriented bounding box (OBB) annotation standard for oat ears, combined with adapting advanced oriented object detectors such as R3Det or S2A-Net to agricultural contexts, would minimize irrelevant background in feature extraction and directly address these false detections.
(3)
Enhanced feature extraction for overlapping oat ears remains an important research challenge. In the above experiments, the proposed YOLOv8n-DSP model still exhibits a small number of missed detections in areas with overlapping ears. Future enhancements could focus on extending the SCSA attention mechanism through local-global attention interaction modules to improve perception of subtle features in overlapping regions. Furthermore, RGB-D image-based solutions utilizing depth information could provide complementary spatial features for target recognition under occlusion, potentially improving detection robustness in complex scenarios.
(4)
Advancing model lightweighting and deployment optimization. Future work should also focus on model compression and efficiency optimization to achieve a better power–performance balance, so that the model does not remain at the experimental stage but is ready for practical application scenarios. Hardware-aware compression techniques, including channel pruning guided by Neural Architecture Search (NAS) and optimization with inference acceleration engines such as TensorRT, are promising directions. Achieving stable inference speeds above 20 FPS on embedded devices such as the NVIDIA Jetson series would enable deployment on intelligent agricultural machinery and mobile platforms, supporting the transition of oat detection technology from research to field application.
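To make the background-inflation argument of direction (2) concrete, the following sketch compares the area of the axis-aligned box that encloses a tilted, elongated target with the area of the corresponding oriented box. The 8:1 length-to-width ratio of the simulated ear is an illustrative assumption, not a measurement from our dataset.

```python
# Minimal sketch: extra background enclosed by a horizontal bounding box around a
# tilted, elongated target, relative to an oriented box fitted to the same target.
import numpy as np

def horizontal_overhead(length, width, angle_deg):
    """Ratio of axis-aligned enclosing-box area to oriented-box area."""
    t = np.radians(angle_deg)
    w_aabb = length * abs(np.cos(t)) + width * abs(np.sin(t))
    h_aabb = length * abs(np.sin(t)) + width * abs(np.cos(t))
    return (w_aabb * h_aabb) / (length * width)

for angle in (0, 15, 30, 45):
    print(f"tilt {angle:2d} deg: horizontal box is {horizontal_overhead(8, 1, angle):.2f}x "
          "the oriented box area")
# At 0 deg the two boxes coincide; the overhead grows quickly as the ear tilts,
# which is exactly the background that an OBB annotation would exclude.
```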

Author Contributions

Conceptualization, J.L. and C.T.; data curation, C.T.; formal analysis, Y.W. and C.T.; investigation, Y.W. and C.T.; methodology, J.L. and Y.W.; project administration, J.L. and Y.W.; software, J.L.; supervision, C.T.; validation, Y.W.; visualization, C.T.; writing—original draft, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the General Project of Natural Science Research in Universities of Jiangsu Province (No. 20KJB210012), the Young and Middle-aged Academic Leaders of the Qinglan Project in Jiangsu Province, and the Wuxi Association for Science and Technology Soft Science Research Project (KX-25-C360).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Paudel, D.; Dhungana, B.; Caffe, M.; Krishnan, P. A Review of Health-Beneficial Properties of Oats. Foods 2021, 10, 2591.
2. Shi, S.L. Oats market has potential—Coarse cereals series five. Chin. Grain Econ. 2024, 9, 74–76.
3. Cao, Z.; Sun, S.; Bao, X. A Review of Computer Vision and Deep Learning Applications in Crop Growth Management. Appl. Sci. 2025, 15, 8438.
4. Lipiński, S.; Sadkowski, S.; Chwietczuk, P. Application of AI in Date Fruit Detection—Performance Analysis of YOLO and Faster R-CNN Models. Computation 2025, 13, 149.
5. Yin, L.L.; Zainudin, M.N.S.; Saad, W.H.M.; Sulaiman, N.A.; Idris, M.I.; Kamarudin, M.R.; Mohamed, R.; Razak, M.S.J.A. Analysis Recognition of Ghost Pepper and Cili-Padi using Mask RCNN and YOLO. Przegląd Elektrotechniczny 2023, 99, 92–97.
6. Khan, Z.; Shen, Y.; Liu, H. Object Detection in Agriculture: A Comprehensive Review of Methods, Applications, Challenges, and Future Directions. Agriculture 2025, 15, 1351.
7. Li, R.; Wang, Y.P. Improved YOLO v5 Wheat Ear Detection Algorithm Based on Attention Mechanism. Electronics 2022, 11, 1673.
8. Qing, S.H.; Qiu, Z.M.; Wang, W.L.; Wang, F.; Jin, X.; Ji, J.T.; Zhao, L.; Shi, Y. Improved YOLO-FastestV2 wheat ear detection model based on a multi-stage attention mechanism with a LightFPN detection head. Front. Plant Sci. 2024, 15, 1411510.
9. Yu, J.W.; Chen, W.W.; Guo, Y.S.; Mu, Y.S.; Fan, C. Improved Oriented R-CNN-based model for oriented wheat ears detection and counting. Trans. Chin. Soc. Agric. Eng. 2024, 40, 248–257.
10. Lu, Z.A.; Zhang, J.J.; Han, B.; Li, Y.F. Wheat spike detection method based on improved YOLO v7-tiny. Jiangsu J. Agric. Sci. 2024, 52, 147–156.
11. Tong, Z.M.; Chen, X.H.; Wang, B.F.; Ma, Z.Y.; Yang, G.Y. A wheat ear detection and counting method based on improved YOLOv5s. J. Nanjing Agric. Univ. 2024, 47, 1202–1211.
12. Jing, F.R.; Wang, C.; Li, J.H.; Yang, C.; Liu, H.; Chen, Y. A Dual Detection Head YOLO Model with Its Application in Wheat Ear Recognition. Int. J. Cogn. Inform. Nat. Intell. 2024, 18, 1–17.
13. Li, Y.X.; Ma, J.C.; Liu, H.J.; Zhang, L.X. Field growth parameter estimation system of winter wheat using RGB digital images and deep learning. Trans. Chin. Soc. Agric. Eng. 2021, 37, 189–198.
14. Xu, J.P.; Zhang, Z.H.; Li, Z.; Zuo, Z.Y.; Lai, X.L.; Zhao, X.Y.; Zhang, T.Y.; Yin, Y.G. Research on Wheat Appearance Classification Algorithm Based on YOLO Model. Process Autom. Instrum. 2023, 44, 83–87.
15. Huang, S.; Zhou, Y.N.; Wang, Q.F.; Zhang, H.; Qiu, Z.Y.; Kang, K.; Luo, B. Measuring the number of wheat spikes per unit area in fields using an improved YOLOv5. Trans. Chin. Soc. Agric. Eng. 2022, 38, 235–242.
16. Tian, C.; Wang, J.W.; Zheng, D.C.; Li, Y.G.; Zhang, X.C. Oat Ears Detection and Counting Model in Natural Environment Based on Improved Faster R-CNN. Agronomy 2025, 15, 536.
17. Xie, J.X.; Liu, J.B.; Chen, S.N.; Gao, Q.P.; Chen, Y.Z.; Wu, J.T.; Gao, P.; Sun, D.Z.; Wang, W.X.; Shen, J.Y.; et al. Research on inferior litchi fruit detection in orchards based on YOLOv8n-BLS. Comput. Electron. Agric. 2025, 237, 110736.
18. Cen, X.; Lu, S.; Qian, T. YOLO-LCE: A Lightweight YOLOv8 Model for Agricultural Pest Detection. Agronomy 2025, 15, 2022.
19. Yan, H.; Guo, H.B.; Wei, L.S.; Xu, X.Q.; Liang, Y.; Li, Y.; Chen, S.T.; Yu, P. A global feature fusion and adaptive optimization method to enhance detection accuracy and computational efficiency based on YOLOv8. Alex. Eng. J. 2025, 129, 538–552.
20. Tang, J.; Yu, Z.; Shao, C. Hybrid attention transformer integrated YOLOV8 for fruit ripeness detection. Sci. Rep. 2025, 15, 22652.
21. Guo, Q.; Ma, C.; Hu, H. Apple Detection Algorithm under Complex Background Based on Improved YOLOv8. IAENG Int. J. Comput. Sci. 2025, 52, 2202–2209.
22. Zhu, H.; Wang, D.; Wei, Y.; Wang, P.; Su, M. YOLOV8-CMS: A high-accuracy deep learning model for automated citrus leaf disease classification and grading. Plant Methods 2025, 21, 88.
23. An, H.C.; Fan, Y.; Jiao, Z.B.; Liu, M.Q. Research on Improved Bridge Surface Disease Detection Algorithm Based on YOLOv7-Tiny-DBB. Appl. Sci. 2025, 15, 3626.
24. Ren, H.G.; Fan, A.N.; Zhao, J.; Song, H.R.; Wen, Z.G.; Lu, S. A dynamic weighted feature fusion lightweight algorithm for safety helmet detection based on YOLOv8. Measurement 2025, 253, 117572.
25. Zhao, P.F.; Qian, M.B.; Zhou, K.Q.; Shan, Y.J.; Wu, H.Y. Improvement of Sweet Pepper Fruit Detection in YOLOv7-Tiny Farming Environment. Comput. Eng. Appl. 2023, 59, 329–340.
26. Shao, J.F.; Cai, S.J.; Liu, J. YOLO-LDD: Light weight UAV Detection Algorithm. J. Jilin Univ. (Sci. Ed.) 2025, 63, 867–877.
27. Si, Y.Z.; Xu, H.Y.; Zhu, X.Z.; Zhang, W.H.; Dong, Y.; Chen, Y.X.; Li, H.B. SCSA: Exploring the synergistic effects between spatial and channel attention. Neurocomputing 2025, 634, 129866.
28. Tu, Y.Z.; Wang, F.X.; Wu, C.L. Light weight UAV Aerial Small Object Detection Model Integrating Multi-Attention Mechanisms. Comput. Eng. Appl. 2025, 61, 93–104.
29. Wang, Q.; Wang, X.; Hou, J.; Liu, X.; Wen, H.; Ji, Z. MF-YOLOv10: Research on the Improved YOLOv10 Intelligent Identification Algorithm for Goods. Sensors 2025, 25, 2975.
30. Liu, C.; Wang, K.G.; Li, Q.; Zhao, F.Z.; Zhao, K.; Ma, H.T. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2023, 170, 276–284.
31. Ling, L.Y.; Xu, S.; Wei, L.Y.; Wei, G.; Jia, J.X.; Hong, B.L. Fsd-detr: Casting surface defect detection based on improved RT-DETR. J. Real-Time Image Process. 2025, 22, 135.
32. Sun, H.; Fu, R.; Kang, D.K. LigTomDet: Knowledge distillation in a new lightweight tomato disease detection model in planting fields. Pattern Recognit. 2025, 172, 112628.
33. Liu, Q.; Ouyang, J.W.; Zhou, Z.B.; Zhang, W.F.; Zhang, X.H. Segmentation and localization method for eggplants and stems based on YOLO-CRC. Trans. Chin. Soc. Agric. Eng. 2025, 41, 196–205.
34. Lin, Y.S.; Huang, Z.F.; Liang, Y.; Liu, Y.F.; Jiang, W.P. AG-YOLO: A Rapid Citrus Fruit Detection Algorithm with Global Context Fusion. Agriculture 2024, 14, 144.
35. Luo, R.X.; Zhao, R.R.; Ding, X.; Peng, S.Y.; Cai, F.P. High-Precision Complex Orchard Passion Fruit Detection Using the PHD-YOLO Model Improved from YOLOv11n. Horticulturae 2025, 11, 785.
Figure 1. Workflow of the oat ear dataset construction.
Figure 2. Partial image data of oat ears: (a) Frontlight; (b) Backlight; (c) Leaf occlusion; (d) Oat ear overlap.
Figure 3. Structure diagram of YOLOv8: Input represents image input. Conv represents a standard convolutional layer. BatchNorm represents batch normalization. SiLU represents the activation function. Bottleneck represents a lightweight residual structure. C2f represents the cross-stage partial module with two convolutions, a key feature-extraction module. SPPF represents the improved spatial pyramid pooling module. Split represents tensor segmentation. Concat represents tensor concatenation. Upsample represents the up-sampling process. Detect represents the detection head. Maxpool represents maximum pooling. Bbox loss represents the bounding box loss function. ⊕ represents element-wise addition.
Figure 4. Schematic diagram of the YOLOv8n-DSP model architecture.
Figure 5. Six forms of DBB module conversion: Conv represents a standard convolutional layer. Batch norm represents batch normalization. AVG represents average pooling. K denotes the size of the convolution kernel. 1 × 1 denotes a convolution kernel of size 1 × 1. Concat represents tensor concatenation. ⊕ represents element-wise addition.
Figure 6. DBB module structure diagram: Input represents image input. K represents the size of the convolution kernel. 1 × 1 represents a convolution kernel of size 1 × 1. AVG represents average pooling. Batch represents batch normalization. Nonlinearity represents a non-linear transformation introduced by an activation function. Output represents the final prediction part of the model. DBB represents the Diverse Branch Block. ⊕ represents element-wise addition.
Figure 7. Improved backbone network: Conv represents a standard convolutional layer. Split represents tensor segmentation. Bottleneck represents a lightweight residual structure. Concat represents tensor concatenation. DBB represents the Diverse Branch Block.
Figure 8. SCSA attention mechanism: B represents the number of samples input to the model at one time. C represents the number of feature channels per sample. H represents the vertical size of the image in pixels. W represents the horizontal size of the image in pixels. Avg Pool represents average pooling. Split represents tensor segmentation. MS-DWConv1d represents multi-receptive-field shared depthwise separable 1D convolutions with kernels of 3, 5, 7, and 9. Concat represents tensor concatenation. Group Norm ~ N represents group normalization with N groups. V represents the information that actually needs to be extracted or aggregated. Q represents the current goal that needs attention. K represents the identifier used to match the Query. CA-MHSA represents channel-wise attention based on multi-head self-attention. ⊗ represents element-wise multiplication.
Figure 9. Architecture of the C2f_SCSA module. ⊕ represents element-wise addition.
Figure 10. Performance curves for ablation models with different module combinations (models 5–8).
Figure 11. Training Process and Performance Analysis of the YOLOv8n-DSP Model for Oat ear Detection.
Figure 12. Actual Detection Results Comparison Across Different Models.
Figure 13. Comparison of Heatmaps for Feature Attention in Oat Ear Detection Models Before and After Improvement.
Figure 14. Schematic diagram of oat ear counting experiment in a unit area (0.5 m × 0.5 m).
Table 1. Summary of the Oat Ear Dataset.
Specification | Details
Collection Period | June–August 2025
Location | Shenfeng Experimental Base, Shanxi Agricultural University
Cultivar | Pinyan No. 4
Total Images | 6012
Annotation Tool | LabelImg
Training Set | 4810 images
Validation Set | 601 images
Test Set | 601 images
Split Ratio | 8:1:1
Environmental Conditions | Frontlight, Backlight, Leaf Occlusion, Oat Ear Overlap
Table 2. Training parameters.
Training Parameter | Value or Type
Input image size | 640 × 640 × 3
Optimizer | SGD
Optimizer weight decay rate | 0.0005
Initial learning rate | 0.01
Momentum | 0.937
Batch size | 4
Epochs | 100
Table 3. Performance comparison between the baseline YOLOv8n and the proposed YOLOv8n-DSP model.
Model | Precision/% | Recall/% | mAP/% | Params/M | GFLOPs | Time per Instance/ms
YOLOv8n | 84.7 ± 0.6 | 85.9 ± 0.5 | 90.8 ± 0.3 | 3.1 | 8.2 | 3.3 ± 0.2
YOLOv8n-DSP | 87.4 ± 0.5 | 86.8 ± 0.4 | 94.0 ± 0.4 | 3.5 | 8.9 | 3.7 ± 0.2
Table 4. Ablation test of oat ears in field environment.
Number | DBB | SCSA | PIoUv2 | F1-Score/% | Precision/% | Recall/% | mAP/% | Params/M | GFLOPs | Time per Instance/ms
1 | – | – | – | 85.30 | 84.7 | 85.9 | 90.8 | 3.1 | 8.1 | 3.3
2 | √ | – | – | 87.30 | 87.7 | 86.9 | 91.2 | 3.3 | 8.9 | 3.8
3 | – | √ | – | 86.50 | 87.1 | 85.9 | 92.1 | 3.1 | 8.1 | 3.7
4 | – | – | √ | 86.23 | 85.0 | 87.5 | 92.7 | 3.1 | 8.1 | 3.4
5 | √ | √ | – | 86.04 | 85.0 | 87.1 | 93.2 | 3.3 | 8.9 | 3.8
6 | √ | – | √ | 84.23 | 83.1 | 85.4 | 93.0 | 3.3 | 8.9 | 3.5
7 | – | √ | √ | 86.30 | 86.1 | 86.5 | 92.5 | 3.1 | 8.1 | 3.5
8 | √ | √ | √ | 87.10 | 87.4 | 86.8 | 94.0 | 3.5 | 8.9 | 3.7
Note: √ indicates that the module is enabled.
Table 5. Comparison of computational overhead between DBB module and traditional Bottleneck module.
Network Layer Position | Module Type | Total Module MACs | Relative Cost | MACs Increment
Mid-level Backbone | C2f | 317.85 MMac | 1.00× (Baseline) | –
Mid-level Backbone | C2f_DiverseBranchBlock | 466.56 MMac | 1.47× | +148.71 MMac
Mid-deep Backbone | C2f | 316.21 MMac | 1.00× (Baseline) | –
Mid-deep Backbone | C2f_DiverseBranchBlock | 462.66 MMac | 1.46× | +146.45 MMac
Deep Backbone | C2f | 184.12 MMac | 1.00× (Baseline) | –
Deep Backbone | C2f_DiverseBranchBlock | 256.78 MMac | 1.39× | +72.66 MMac
Table 7. Comparative evaluation of different detection Models.
Model | F1-Score/% | Precision/% | Recall/% | mAP/% | Params/M | GFLOPs | Time per Instance/ms
YOLOv3-Tiny | 83.95 | 81.2 | 86.9 | 86.5 | 12.1 | 18.9 | 3.8
YOLOv5n | 84.44 | 85.1 | 83.8 | 90.1 | 2.5 | 7.1 | 3.3
YOLOv6s | 87.03 | 88.4 | 85.7 | 89.0 | 4.2 | 11.8 | 3.4
YOLOv9s | 86.64 | 87.5 | 85.8 | 87.9 | 7.3 | 26.7 | 4.5
YOLOv10s | 84.75 | 87.8 | 81.9 | 88.3 | 8.1 | 21.4 | 4.4
YOLOv11n | 87.76 | 89.6 | 86.0 | 91.0 | 2.6 | 6.3 | 3.4
YOLOv12n | 87.67 | 90.4 | 85.1 | 91.7 | 2.6 | 6.3 | 3.8
YOLOv8n-DSP | 87.10 | 87.4 | 86.8 | 94.0 | 3.5 | 8.9 | 3.7
Table 8. Comparison of Oat ear Counting Results Before and After Model Improvement.
Number | Manual Count | YOLOv8n Count | YOLOv8n Accuracy/% | YOLOv8n-DSP Count | YOLOv8n-DSP Accuracy/%
1 | 24 | 18 | 75.0 | 21 | 87.5
2 | 20 | 14 | 70.0 | 16 | 80.0
3 | 15 | 10 | 66.7 | 13 | 86.7
4 | 15 | 12 | 80.0 | 12 | 80.0
5 | 13 | 10 | 76.9 | 11 | 84.6
6 | 15 | 14 | 93.3 | 12 | 80.0
7 | 11 | 8 | 72.7 | 9 | 81.8
8 | 17 | 11 | 64.7 | 14 | 82.4
9 | 17 | 10 | 58.8 | 14 | 82.4
10 | 19 | 14 | 73.7 | 15 | 78.9
Mean Accuracy | – | – | 73.2 | – | 82.4
Standard Error | – | – | 3.0 | – | 0.9
Table 6. Performance comparison of different bounding box regression loss functions on the YOLOv8n architecture.
Loss Function | mAP@50 | mAP@50:95 | Recall
CIoU | 0.908 | 0.554 | 0.859
GIoU | 0.904 | 0.552 | 0.855
DIoU | 0.913 | 0.553 | 0.857
PIoU | 0.920 | 0.561 | 0.863
PIoUv2 | 0.927 | 0.568 | 0.875