3.1. Ablation Test and Performance Analysis of the Improved Model
A comparative analysis between the baseline YOLOv8n and the proposed YOLOv8n-DSP model was first conducted under multiple random seeds to evaluate the overall efficacy of the proposed modifications. The results are summarized in Table 3.
As summarized in Table 3, the YOLOv8n-DSP model achieves a precision of 87.4%, representing an increase of 2.7 percentage points over the YOLOv8n baseline (84.7%). The mean average precision (mAP) reaches 94.0%, an improvement of 3.2 percentage points from the baseline (90.8%), while recall shows a modest gain from 85.9% to 86.8%. These results indicate that the proposed modifications enhance detection performance effectively, maintaining a balance across key metrics. The observed precision improvement, particularly under varying illumination, suggests an enhanced ability to distinguish oat ears from complex backgrounds, a benefit attributed to the feature-selection capability of the SCSA mechanism. Similarly, the higher mAP reflects more accurate bounding box localization, which stems from the multi-scale feature extraction enabled by the DBB module and the optimized regression behavior of the PIoUv2 loss. These traits are especially valuable for detecting occluded and overlapping oat ears. It is also notable that the model preserves baseline recall while reducing false positives, indicating retained sensitivity to potential targets. These performance gains can be contextualized within broader methodological trends in agricultural vision. The role of attention mechanisms in improving feature discrimination is well established. For example, Li et al. [7] integrated the CBAM module into YOLOv5 for wheat ear detection to suppress background interference. Our SCSA module follows a similar rationale but implements a more integrated spatial-channel weighting strategy. Likewise, the mAP improvement under occlusion and scale variation aligns with reported benefits of multi-scale architectures. Tong et al. [11] also achieved mAP gains by modifying YOLOv5 for multi-scale wheat ear detection. Our approach, using the DBB module, extends this principle through a parallel multi-branch design that enriches feature representation across receptive fields, a capability particularly relevant for oat ears given their significant size variations in field settings, as noted in studies emphasizing scale-invariant feature extraction for crops [25,32].
To further assess the contribution of each proposed module, a series of systematic ablation studies was carried out, with the results detailed in Table 4.
The introduction of the DBB module increased the F1-score by 2.0 percentage points to 87.30% and precision by 3.0 percentage points to 87.7%. By dynamically adjusting channel dimensions within the bottleneck layer, it enabled adaptive multi-scale feature fusion, effectively suppressing interference from complex field backgrounds. However, this came at a considerable computational cost, with GFLOPs increasing by 63% and the model size growing by 121%. The SCSA module further elevated the mAP to 92.1% through its dual spatial-channel attention mechanism, which enhanced feature responses in key oat ear regions via spatial attention and optimized weight allocation across channels. Notably, these significant gains were achieved with computational efficiency comparable to the baseline. The PIoUv2 module boosted recall by 1.6 percentage points to 87.5%, owing to its refined bounding box regression strategy that reduced missed detections among adjacent ears. It also increased mAP by 1.9 percentage points, indicating that its enhanced localization criteria improved overall detection quality. Crucially, as PIoUv2 operates solely through loss function modification, it provides these advantages without adding any computational overhead during inference.
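To make the multi-branch idea concrete, the following is a minimal sketch of a DBB-style block, assuming a simplified three-branch design (3×3 convolution, 1×1 convolution, and 1×1 convolution followed by average pooling); the published Diverse Branch Block additionally re-parameterizes the branches into a single convolution at inference time, which is omitted here.

```python
# A minimal sketch of a DBB-style multi-branch convolution; the branch set is
# a simplification of the published Diverse Branch Block, and the inference-
# time re-parameterization into a single conv is intentionally omitted.
import torch
import torch.nn as nn

class MultiBranchBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Branch 1: standard 3x3 convolution (the main receptive field).
        self.branch_3x3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Branch 2: 1x1 convolution (pointwise channel mixing).
        self.branch_1x1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Branch 3: 1x1 conv followed by average pooling (smoothed context).
        self.branch_pool = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.AvgPool2d(3, stride=1, padding=1),
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branch outputs are summed, so each branch contributes a different
        # effective receptive field to the same output tensor.
        return self.act(self.branch_3x3(x) + self.branch_1x1(x) + self.branch_pool(x))

# Example: a 64-channel feature map at 80x80 resolution.
x = torch.randn(1, 64, 80, 80)
print(MultiBranchBlock(64, 64)(x).shape)  # torch.Size([1, 64, 80, 80])
```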
In terms of module combination effects, the integration of DBB and SCSA increased mAP to 93.2%, demonstrating a synergistic optimization effect between feature enhancement and attention mechanisms. A slight decrease in F1-score observed when combining DBB with SCSA points to a nuanced interaction between these modules. While the DBB module generates a rich, yet potentially redundant, set of multi-scale features through its parallel branches, the subsequent SCSA mechanism performs rigorous selection to prioritize the most discriminative spatial and channel information. Although this filtering is beneficial for focus, it may inadvertently suppress less salient contextual features that remain valuable for distinguishing densely packed and morphologically similar oat ears. This observed trade-off between feature diversity (from DBB) and feature selectivity (from SCSA) parallels the “feature suppression” effect noted in general object detection by Wang et al. [29]. By extending this observation to dense crop detection, our findings suggest that balancing these two aspects requires careful calibration, particularly for small, densely adhering targets.
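To illustrate the selection behavior discussed above, the following is a minimal sketch of a joint spatial-channel attention module in the spirit of SCSA, assuming a simplified two-stage design (channel gating followed by a spatial mask); it does not reproduce the multi-semantic spatial attention and channel self-attention of the published SCSA module.

```python
# A simplified spatial-channel attention sketch; the two sequential gates
# stand in for SCSA's synergistic weighting and are an assumption, not the
# published architecture.
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight each channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: compress channels into a single HxW mask.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # prioritize informative channels
        x = x * self.spatial_gate(x)   # emphasize target-like spatial regions
        return x

feat = torch.randn(1, 128, 40, 40)
print(SpatialChannelAttention(128)(feat).shape)  # torch.Size([1, 128, 40, 40])
```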
Combining DBB with PIoUv2 resulted in a performance drop, suggesting a suboptimal interaction between their optimization mechanisms. This may be due to a mismatch in their primary operational objectives. The DBB module excels at generating diverse feature representations across scales, yet this very diversity can introduce variability into the feature maps supplied for regression. Conversely, the PIoUv2 loss function relies on a stable and consistent feature representation to effectively execute its dynamic focusing mechanism, which assesses anchor box quality to guide regression [30]. Consequently, variability in the input features can perturb this quality assessment, leading to suboptimal gradient updates. This interpretation is supported by Liu et al. [30], who observed that the efficacy of advanced IoU losses is contingent upon the stability of the preceding feature extraction process. Thus, our empirical result suggests that integrating a highly variable feature extractor with a sensitive, quality-aware regression loss may necessitate additional stabilization techniques within the training pipeline.
In contrast, the combination of SCSA and PIoUv2 exhibited the best overall performance: all metrics showed steady improvement while maintaining high computational efficiency. This indicates that the spatial-channel attention mechanism and the enhanced localization evaluation criteria complement each other effectively, achieving a more favorable balance between detection accuracy and computational cost. The complete model, which integrates all three proposed modules, achieves significant improvements across multiple metrics. It reaches a mAP of 94.0% and an F1-score of 87.10%, demonstrating a strong balance between recall (86.8%) and precision (87.4%). Technically, the DBB module enhances multi-scale feature representation, the SCSA mechanism optimizes the spatial and channel-wise distribution of these features, and the PIoUv2 loss further improves localization quality through refined bounding box regression. These components collectively form a coherent enhancement pipeline: feature extraction → feature refinement → detection optimization.
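As a consistency check, the reported F1-score follows directly from the reported precision and recall as their harmonic mean:

```python
# Reproducing the full model's F1-score from its precision and recall
# (values taken from the results reported above).
precision, recall = 0.874, 0.868
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # 0.8710, matching the reported 87.10%
```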
Building on the ablation results in Table 4, and to further explore the influence of different module combinations on model performance, particularly synergistic or inhibitory interactions between modules, this study analyzes the models numbered 5 to 8 in depth. These models cover the three two-module combinations and the combination of all three modules. By plotting and comparing the precision-recall (PR) curves and F1 score-confidence threshold (F1) curves of these models, the overall performance of each module combination in the detection task and its stability under different confidence thresholds can be revealed more intuitively.
The comparative analysis of the ablation models from dual perspectives in Figure 10 reveals distinct performance characteristics and operational robustness. As shown in Figure 10a, the Precision-Recall (PR) curves depict the inherent detection capability, where a clear performance hierarchy is established: the complete model (Model 8) achieves the optimal curve, followed by Model 5 (DBB + SCSA), Model 6 (DBB + PIoUv2), and finally Model 7 (SCSA + PIoUv2). This gradient underscores that the inclusion of the DBB module, which enhances multi-scale feature representation, is a fundamental contributor to high performance in scenarios with significant scale variation and dense distribution. Within this framework, the SCSA mechanism and PIoUv2 loss function serve as crucial refinements for discriminating targets from complex backgrounds and improving localization in occluded areas, respectively.
Complementing this, the F1-Confidence curves in Figure 10b evaluate the operational stability under varying decision thresholds. Model 8 again demonstrates superior practicality, maintaining a high F1-score across the broadest threshold range, which reduces the dependency on precise threshold calibration, a key advantage for deployment in variable field conditions. While Models 5 and 6 show competitive potential in the PR space, their steeper F1 declines indicate higher sensitivity to threshold selection. Conversely, Model 7, despite its lower performance ceiling, exhibits a stability comparable to the full model, highlighting that the SCSA and PIoUv2 combination alone can yield a robust, though less accurate, detection baseline.
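For readers reproducing curves like those in Figure 10, the sketch below shows how PR and F1-confidence points can be derived by sweeping a confidence threshold over scored detections; the detection scores, match flags, and ground-truth count are illustrative placeholders, not values from our dataset.

```python
# Sweeping a confidence threshold over scored detections to obtain the points
# of a PR curve and an F1-confidence curve. "matched" flags true positives at
# a fixed IoU criterion (e.g., IoU >= 0.5); all values here are illustrative.
import numpy as np

scores = np.array([0.95, 0.90, 0.85, 0.60, 0.55, 0.30])  # detection confidences
matched = np.array([1, 1, 0, 1, 0, 1])                   # 1 = TP, 0 = FP
n_gt = 5                                                  # ground-truth ears

for t in np.linspace(0.05, 0.95, 19):
    keep = scores >= t
    tp = matched[keep].sum()
    fp = keep.sum() - tp
    precision = tp / max(tp + fp, 1)
    recall = tp / n_gt
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    print(f"conf>={t:.2f}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```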
Although the ablation experiments demonstrate the performance improvements brought by the proposed modules, a detailed analysis of computational complexity is crucial for understanding the associated model cost. As shown in Table 3, integrating DBB modules into the backbone network increases the computational load and parameter count. For further quantitative analysis, Table 5 compares the computational overhead of standard convolution and the DBB module. The results show that the multi-branch structure of the DBB module requires additional parameters to support its diverse convolution kernel combinations, resulting in increased computational complexity.
The integration of the DBB module introduces a quantifiable computational overhead, as detailed in Table 5. Replacing standard C2f modules with their C2f_DiverseBranchBlock counterparts results in a consistent increase in computational complexity across all levels of the backbone network. The multiply-accumulate (MAC) increment ranges from approximately 72.66 MMac in the deep layers to 148.71 MMac in the mid-level layers, representing a relative cost increase between 1.39× and 1.47× compared to the baseline.
This analysis clarifies the inherent trade-off in the model improvement process. The performance gains observed in the ablation study, particularly in handling scale variation, are achieved at the cost of a moderate increase in model complexity and computational demand. The multi-branch structure of the DBB module, while effective in enriching feature representation, inherently requires additional parameters and computations to support its parallel convolutional pathways.
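The MAC accounting behind Table 5 can be illustrated with the standard convolution cost k·k·C_in·C_out·H·W; the layer shapes below are illustrative rather than the actual C2f configurations, and the exact cost ratio depends on the branch composition and channel widths.

```python
# Illustrative MAC accounting for a single conv versus a multi-branch block.
# Shapes are placeholders, not the network's actual C2f configurations.
def conv_macs(c_in: int, c_out: int, k: int, h: int, w: int) -> float:
    """Multiply-accumulate operations of one k x k convolution, in MMac."""
    return c_in * c_out * k * k * h * w / 1e6

# A lone 3x3 conv versus the same conv plus two parallel 1x1 branches.
baseline = conv_macs(128, 128, 3, 40, 40)
multibranch = baseline + 2 * conv_macs(128, 128, 1, 40, 40)
print(f"baseline: {baseline:.1f} MMac, multi-branch: {multibranch:.1f} MMac "
      f"({multibranch / baseline:.2f}x)")
```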
To isolate and evaluate the contribution of the proposed PIoUv2 loss function, a head-to-head comparison was conducted on the standard YOLOv8n architecture by replacing only the bounding box regression loss while keeping all other training settings and model components identical. This controlled experiment aimed to objectively assess the intrinsic performance of various loss functions for the oat ear detection task. The results are summarized in Table 6.
The ablation results indicate a clear performance hierarchy, with PIoUv2 achieving the best overall metrics. Its superior performance, particularly in the localization-sensitive mAP@50:95, can be attributed to its dynamic penalty mechanism, which likely provides more refined optimization for dense and occluded oat ears. However, the performance gains over the simpler PIoU are consistent yet incremental across all metrics. This suggests that while the design principles of PIoUv2 are beneficial, the absolute advantage conferred by its enhanced formulation for this specific task is modest. PIoUv2 is therefore selected as the most effective option, though its marginal lead over PIoU indicates diminishing returns on complexity.
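For orientation, the sketch below outlines a PIoU-style regression loss following the structure described by Liu et al. [30]: an IoU term, an edge-distance penalty normalized by target size, and a non-monotonic focusing term that emphasizes medium-quality anchors. The constants and exact formulas here are illustrative; the published definition should be taken from the original paper.

```python
# A minimal sketch of a PIoU-style loss with a non-monotonic focusing term.
# Treat constants and exact formulas as illustrative, not the published
# PIoUv2 definition.
import torch

def piou_v2_loss(pred, target, lam: float = 1.3):
    """Boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection-over-union.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-9)

    # Edge-distance penalty normalized by target width/height, making the
    # penalty scale-adaptive rather than dependent on absolute box size.
    w_t = (target[:, 2] - target[:, 0]).clamp(min=1e-9)
    h_t = (target[:, 3] - target[:, 1]).clamp(min=1e-9)
    p = ((pred[:, 0] - target[:, 0]).abs() / w_t
         + (pred[:, 2] - target[:, 2]).abs() / w_t
         + (pred[:, 1] - target[:, 1]).abs() / h_t
         + (pred[:, 3] - target[:, 3]).abs() / h_t) / 4

    base = 1 - iou + (1 - torch.exp(-p ** 2))  # PIoU-style base loss
    q = torch.exp(-p)                           # anchor quality in (0, 1]
    focus = 3 * (lam * q) * torch.exp(-((lam * q) ** 2))  # non-monotonic term
    return (focus * base).mean()

pred = torch.tensor([[10., 10., 50., 60.]])
target = torch.tensor([[12., 11., 52., 58.]])
print(piou_v2_loss(pred, target))
```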
Figure 11 provides a comprehensive comparison of the training dynamics between the proposed YOLOv8n-DSP model and the original YOLOv8n model in the task of oat ear detection.
During the training phase, the proposed YOLOv8n-DSP model demonstrated a rapid decline in all loss components: the bounding box regression loss (train/box_loss) decreased from an initial value of approximately 3.5 to around 1.05 by epoch 100, the classification loss (train/cls_loss) dropped from 3.5 to about 0.65, and the distribution focal loss (train/dfl_loss) reduced from 3.5 to approximately 1.15. The consistent decline and convergence of these three loss values indicate a continuous improvement in the model’s capabilities for oat ear localization, classification, and handling of challenging samples. During validation, the val/box_loss, val/cls_loss, and val/dfl_loss stabilized around 1.35, 1.3, and 1.5, respectively. The small gap between training and validation losses, along with their convergent behavior, reflects the model’s strong generalization ability and resistance to overfitting. In contrast, although the original YOLOv8n model also exhibited a decreasing trend in training loss, its convergence was slower: the final train/cls_loss remained around 0.75 and train/dfl_loss around 1.25. More notably, its validation losses were significantly higher, with val/box_loss around 1.6, val/cls_loss near 1.7, and val/dfl_loss approximately 1.75, indicating noticeable overfitting.
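Curves like those in Figure 11 can be regenerated from the results.csv file that the Ultralytics trainer writes during each run; the run directory below is a placeholder, and the column names assume the YOLOv8 logging format.

```python
# Plotting train/val loss curves from an Ultralytics run; the path is a
# placeholder and column names assume the YOLOv8 results.csv format.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("runs/detect/train/results.csv")
df.columns = df.columns.str.strip()  # Ultralytics pads column names with spaces

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, name in zip(axes, ["box_loss", "cls_loss", "dfl_loss"]):
    ax.plot(df["epoch"], df[f"train/{name}"], label="train")
    ax.plot(df["epoch"], df[f"val/{name}"], label="val")
    ax.set_title(name)
    ax.set_xlabel("epoch")
    ax.legend()
plt.tight_layout()
plt.show()
```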
Overall, by incorporating the DBB module, SCSA attention mechanism, and PIoUv2 loss function, the YOLOv8n-DSP model exhibits stronger adaptability, robustness, and generalization capability in oat ear detection tasks, with particularly notable performance improvements in complex natural field environments. The systematic ablation study quantifies the contribution of each component and reveals their interactive effects. The performance gain from the DBB module highlights the importance of multi-scale feature representation for oat ears, which exhibit substantial size variations in the field. This observation aligns with the principle of multi-scale feature extraction, which is central to many advanced detectors in agricultural vision tasks [32,33]. The effectiveness of the SCSA mechanism in improving mAP with minimal computational overhead demonstrates the benefit of synergistic spatial-channel attention. This approach more effectively suppresses background interference in complex scenes, a common challenge in crop detection [8,9].
A slight decrease in F1-score was observed when combining DBB and SCSA (Model 5), indicating a trade-off between feature diversity and feature selectivity. This can be attributed to the DBB module enriching the feature space, while the SCSA mechanism refines it through selective attention weighting. This phenomenon is consistent with the ‘feature suppression’ effect reported in other studies that integrate multi-branch feature extractors with attention mechanisms [29]. This suggests that for dense small object detection, a more adaptive feature fusion strategy may be required beyond simply stacking advanced modules.
3.2. Comparison Test of Different Detection Models
To comprehensively evaluate the performance of the YOLOv8n-DSP model in oat ear detection tasks, this study conducted a comparative analysis with current mainstream models in the YOLO series, with detailed results provided in Table 7.
In terms of comprehensive evaluation metrics, the YOLOv8n-DSP model demonstrates outstanding overall performance, achieving a mean average precision (mAP) of 94.0%, surpassing all compared models. It exceeds the traditional lightweight model YOLOv3-Tiny (86.5%) by 7.5 percentage points, outperforms YOLOv5n (90.1%), which incorporates a Focus module, by 3.9 percentage points, and maintains a 2.3 percentage point advantage over the newly released YOLOv12n (91.7%). This performance improvement is primarily attributed to dual architectural innovations: the DBB module enables richer multi-scale feature representation through its parallel multi-branch convolutional structure, while the SCSA attention mechanism significantly enhances the extraction of discriminative features of oat ears. Furthermore, YOLOv8n-DSP achieves an F1-score of 87.10%, which not only considerably exceeds those of mainstream lightweight models such as YOLOv5n (84.44%) and YOLOv10s (84.75%) but also remains within a narrow margin of only 0.66 percentage points of the current state-of-the-art model, YOLOv11n (87.76%). The mAP advantage of YOLOv8n-DSP over models such as YOLOv11n and YOLOv12n likely stems from its task-specific architectural inductive biases. General-purpose models like YOLOv11n/12n are optimized for a broad spectrum of objects on standard benchmarks, often prioritizing a balanced precision-recall trade-off for generic shapes and sizes. In contrast, our model incorporates the DBB and SCSA modules specifically to address the persistent challenges in field-based phenotyping: large scale variation (addressed by DBB’s multi-branch receptive fields) and complex background/occlusion (addressed by SCSA’s synergistic filtering). This targeted design echoes the strategy of Qing et al. [8] and Yu et al. [9], who introduced specialized attention or detection heads for wheat ears, yielding significant gains in complex fields. Conversely, the marginally lower F1-score may reflect YOLOv11n’s more refined optimization for general object detection, potentially through advanced label assignment or classification head design aimed at maximizing the harmonic mean on diverse data. This divergence highlights a key consideration for applied agricultural AI: models optimized solely for general benchmarks may not fully address domain-specific bottlenecks, such as extreme occlusion or subtle inter-target distinctions, which are critically assessed by mAP due to its heavy penalty on poor localization. The multi-branch structure of our DBB module, while enhancing multi-scale representation, may introduce a degree of feature complexity that slightly impacts classification confidence in the most challenging cases, contributing to this F1 gap. This observation suggests that future work could focus on optimizing the balance between feature richness and representational efficiency.
Regarding the balance between precision and recall, YOLOv8n-DSP also exhibits excellent performance. Experimental results show that the model achieves a precision of 87.4% while maintaining a high recall rate of 86.8%. Specifically, although its precision is 3.0 percentage points lower than that of YOLOv12n (90.4%), which features a reinforced classification head design, it is 6.2 percentage points higher than that of the traditional lightweight model YOLOv3-Tiny (81.2%). In terms of recall, it remains comparable to YOLOv3-Tiny (86.9%), which adopts a dense prediction strategy, and shows a substantial increase of 4.9 percentage points over YOLOv10s (81.9%). This excellent balance is mainly due to the multi-level dynamic feature weighting mechanism of the SCSA attention module: the spatial attention component effectively focuses on key regions of oat ears while suppressing background noise, and the channel attention mechanism intelligently prioritizes informative feature channels to optimize classification confidence calibration. The synergy between these components significantly enhances detection stability in complex field environments.
In terms of computational efficiency, YOLOv8n-DSP achieves an effective balance between performance and cost through innovative architectural design. Experimental results indicate that the model has a computational complexity of only 13.2 GFLOPs, which is 50.6% lower than that of YOLOv9s (26.7 GFLOPs) and 38.3% lower than that of YOLOv10s (21.4 GFLOPs). This advantage is largely attributable to the efficient computational resource allocation of the multi-branch structure within the DBB module. It is noteworthy that despite the significant reduction in computational cost relative to these larger models, the model still maintains a high inference speed of 3.7 instances/ms, effectively meeting the real-time requirements for field-based crop detection.
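GFLOPs and parameter counts like those reported here can be measured with a profiler such as thop; the stand-in module below is a placeholder for the loaded YOLOv8n-DSP network, the 640×640 input is the YOLO default, and conventions differ on whether reported FLOPs equal MACs or 2× MACs.

```python
# A minimal sketch of measuring model complexity with thop; the model here is
# a stand-in placeholder for the loaded YOLOv8n-DSP nn.Module.
import torch
import torch.nn as nn
from thop import profile

model = nn.Sequential(  # substitute the actual detector module here
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1),
)
macs, params = profile(model, inputs=(torch.randn(1, 3, 640, 640),))
# thop counts multiply-accumulates; some papers report FLOPs as 2 x MACs.
print(f"{macs / 1e9:.2f} GMac, {params / 1e6:.3f} M parameters")
```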
The comparative results indicate that YOLOv8n-DSP performs competitively among contemporary models, demonstrating its applicability to agricultural detection tasks. The higher mAP of YOLOv8n-DSP compared to models like YOLOv11n and YOLOv12n is likely due to its design, which incorporates components targeting agricultural challenges. While general-purpose models achieve strong performance on broad benchmarks, they may be less specialized for challenges specific to field-based phenotyping, such as dense occlusion and subtle target-background distinctions. The integration of DBB and SCSA in our model is intended to address these issues, a strategy that has also been employed effectively in other agricultural detection studies [7,9].
The observed result—a higher mAP but a marginally lower F1-score compared to YOLOv11n—warrants further analysis. One possible explanation is that YOLOv11n employs a classification or label assignment strategy that achieves a better balance between precision and recall on general objects. Conversely, the higher mAP of our model suggests an advantage in localization accuracy, which is an important factor for tasks like counting and size estimation in agricultural applications.