1. Introduction
Precision agriculture represents a core direction of modern agricultural development, and seeding, as a critical stage in crop production, directly influences yield outcomes and resource utilization efficiency [
1,
2]. Rice and wheat, as the most important staple cereals in China [
3], are particularly sensitive to seeding quality—including seeding rate, uniformity, and the absence of missed or blocked rows—which substantially influences both yield and growth quality [
4,
5]. In recent years, pneumatic high-speed seeders have emerged as the mainstream equipment for improving seeding efficiency [
6]. However, under high-speed operating conditions, the elevated seed flow density, rapid movement velocity, and frequent mutual occlusion pose severe challenges for real-time seed detection—the foundational step upon which all downstream seeding quality assessment tasks depend.
Reliable seeding quality monitoring encompasses multiple complementary functions: seed counting for seeding rate estimation, spatial distribution analysis for uniformity assessment, and anomaly detection for identifying missed or blocked delivery tubes [
4]. Among these, accurate real-time detection of individual seeds within the high-speed seed flow constitutes the most fundamental and technically demanding prerequisite. Only when seeds can be reliably detected and localized in each captured frame do subsequent counting, rate computation, and quality evaluation become feasible. Therefore, developing robust seed detection methods capable of operating under high-speed pneumatic seeding conditions is of considerable practical importance for advancing vision-based seeding quality monitoring systems.
Conventional approaches to seed flow monitoring in pneumatic seeders primarily rely on one-dimensional physical signals acquired from photoelectric, capacitive, and piezoelectric sensors. Photoelectric sensors have been applied to monitor seed passage in corn metering devices [
7] and belt-type high-speed seeders [
8]; however, as seed flow density increases, overlapping seeds and mechanical interference severely degrade detection accuracy. Capacitive sensors detect seeds through variations in equivalent dielectric constants [
9], yet their performance is highly sensitive to environmental fluctuations and cannot reliably differentiate overlapping seeds. Piezoelectric approaches convert mechanical impact energy into voltage pulses and have demonstrated promising results at moderate speeds [
10,
11], but their contact-based nature disrupts seed falling trajectories, and system sensitivity declines markedly at elevated rotational speeds. Overall, these one-dimensional methods lack spatial morphological perception and are consequently compromised by seed posture variations, overlapping, and occlusion. These limitations motivate the exploration of non-contact, high-dimensional visual detection methods for dense, high-speed seed flows.
To overcome these limitations, recent studies have increasingly explored deep learning-based object detection for crop recognition, counting, and quality assessment [
12]. Representative applications include fruit detection and counting in orchards [
13,
14], rice panicle density estimation [
15], weed detection with enhanced attention and feature fusion strategies [
16,
17,
18], and lightweight small-target detection in complex orchard environments [
19]. Taken together, these works highlight the extensive applicability of object detection frameworks across diverse agricultural vision applications [
20]. In the specific domain of seed detection, Xing et al. [
21] developed a real-time monitoring platform for rice pneumatic seed metering devices by enhancing the YOLOv5n architecture, achieving detection accuracies of 88.8–98.65% across different rotational speeds. However, their system was tailored for single-row devices with relatively low seed flow speeds, and detection accuracy declined at higher rotational speeds due to motion blur and vibration, indicating that further algorithmic refinements are required for high-speed, high-density seeding scenarios.
Despite these advances, applying object detection to seed detection in high-speed seeder scenarios presents several domain-specific challenges. First, at elevated operating speeds, image acquisition inevitably produces motion blur, causing seed edge contours to become indistinct and high-frequency detail information to degrade substantially. Second, dense small-target detection remains inherently difficult: a single frame may contain dozens of seeds with frequent mutual occlusion and overlap, leading to increased rates of missed and false detections. Third, the morphological differences between rice and wheat seeds are substantial—rice seeds exhibit an elongated elliptical shape whereas wheat seeds present an ellipsoidal geometry—requiring detection algorithms with robust cross-crop generalization capability [
22,
23].
To address these challenges, we present HSSD-YOLO (High-Speed Seed Detection YOLO), a refined detection framework built on YOLOv11. This paper makes three contributions:
- (1)
A Motion Blur Enhanced Stem module (MBE-Stem) is designed to explicitly recover seed contour features under motion blur through learnable directional gradient operators combined with adaptive channel attention fusion.
- (2)
An Attention-enhanced Deformable Convolutional Network (ADCN) incorporating a novel Residual Spatial-Channel Attention (RSCA) mechanism is proposed to improve adaptive sampling accuracy for irregularly shaped seeds.
- (3)
An Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN) is constructed to inject edge prior information into multi-scale feature fusion, enhancing boundary discrimination for densely overlapping targets.
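The directional gradient operators named in contribution (1) can be illustrated with classic Sobel kernels. The following is a minimal pure-Python sketch, not the MBE-Stem implementation: the helper names (`conv2d_valid`, `gradient_magnitude`) and the toy image are hypothetical, and treating the fixed Sobel values as the starting point of learnable kernels is an assumption made only for illustration.

```python
import math

# Classic 3x3 Sobel kernels. In a learnable variant these entries would be
# trainable weights, here assumed to start from the fixed Sobel values.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to left-right intensity change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to top-bottom intensity change

def conv2d_valid(image, kernel):
    """Cross-correlate a 2D list with a 3x3 kernel (no padding)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h - 2):
        row = []
        for c in range(w - 2):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

def gradient_magnitude(image):
    """Combine both directional responses into an edge-magnitude map."""
    gx = conv2d_valid(image, SOBEL_X)
    gy = conv2d_valid(image, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]

# Hypothetical 5x5 patch: dark left, bright right, with a soft (blurred) transition column.
toy = [[0, 0, 0.5, 1, 1] for _ in range(5)]
mag = gradient_magnitude(toy)
```

Even with the softened transition, the response peaks at the center column of the edge, which is the kind of contour cue a gradient-based stem is meant to expose to later layers.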
3. Results
To systematically evaluate the HSSD-YOLO model’s effectiveness, lightweight properties, and practical applicability in high-speed seed detection scenarios, we adopt a layered experimental design for this study. The study progresses from comprehensive performance assessment to single-factor validation, consisting of four components:
- (1)
Mainstream model detection performance comparison to evaluate the HSSD-YOLO model’s competitiveness against existing detection algorithms, including multiple lightweight versions of the YOLO series (YOLOv5 [
37], YOLOv7 [
38], YOLOv8 [
39], YOLOv9 [
40], YOLOv10 [
41]), a Transformer-based detector (RT-DETR [
42]), and a conventional two-stage detector (Faster R-CNN [
43]), evaluated across detection accuracy, model complexity, and qualitative visual analysis.
- (2)
Ablation studies to examine the independent and joint contributions of the three designed modules—MBE-Stem, ADCN (RSCA), and EGAR-FPN—through progressive module addition and pairwise combination experiments, supplemented by Grad-CAM [
44] visualization analysis.
- (3)
Attention mechanism performance comparison to validate RSCA’s superiority over mainstream alternatives (including SE [
45], CBAM, CA, and ECA [
46]) within the ADCN offset-mask prediction branch, justifying the proposed attention design.
- (4)
Comparison of multiple feature pyramid network structures to confirm the effectiveness of EGAR-FPN against mainstream FPN variants (including PAN-FPN (YOLOv11 default), BiFPN [
47], ASFF [
48], AFPN [
49], and Gold-YOLO Neck [
50]), demonstrating the advantages of edge-guided feature fusion.
All comparative experiments were conducted on the unified experimental platform described in
Section 2.4, using the same dataset split, number of training epochs, and learning-rate schedule to ensure that the results are objective and comparable. YOLOv11n serves as the baseline network for all experiments in this work. All compared models (YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, RT-DETR-l, Faster R-CNN) were trained under unified conditions using the augmented partitions of
Section 2.2.2 (5338 training/1525 validation images) and evaluated on the 476 original, unaugmented test images, with the same number of epochs (300), the same optimizer (SGD, momentum 0.937, weight decay 5 × 10⁻⁴), the same initial learning rate (0.01) with cosine annealing, the same input resolution (640 × 640), the same batch size (32), the same hardware and software environment (RTX 4090 D, PyTorch 2.2.2, CUDA 12.1), and the same evaluation-time preprocessing. Architecture-specific components that cannot be unified without breaking the architecture itself (RT-DETR’s default internal augmentations, Faster R-CNN’s region-proposal-network configuration, and backbone-specific pretraining) were kept at each model’s originally published settings, because forcing these to match would disadvantage those baselines rather than improve fairness. Under this protocol, accuracy differences reported in
Section 3.1 primarily reflect architectural choices rather than training-recipe disparity.
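The shared training recipe can be summarized as a small configuration with a cosine-annealed learning rate. This is a minimal sketch under stated assumptions: the dictionary keys and the `cosine_lr` helper are hypothetical names, and the final learning rate (`final_lr`) is an assumed value because the text gives only the initial rate.

```python
import math

# Hyperparameters quoted in the text; key names are illustrative only.
TRAIN_CFG = {
    "epochs": 300,
    "optimizer": "SGD",
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "initial_lr": 0.01,
    "image_size": 640,
    "batch_size": 32,
}

def cosine_lr(epoch, total_epochs=TRAIN_CFG["epochs"],
              initial_lr=TRAIN_CFG["initial_lr"], final_lr=1e-4):
    """Standard cosine annealing from initial_lr (epoch 0) down to final_lr (last epoch).

    final_lr is an assumption; the text does not state the end value.
    """
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule_samples = [cosine_lr(e) for e in (0, 150, 300)]
```

Holding this recipe fixed across models is what allows the later accuracy gaps to be read mainly as architectural differences.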
3.1. Comparison with Existing Models
To fully assess the detection capability of HSSD-YOLO, we conducted systematic comparisons with multiple mainstream object detection networks on the same rice/wheat seed dataset. The compared models cover multiple lightweight versions of the YOLO series (YOLOv5, YOLOv7, YOLOv8, YOLOv9, YOLOv10), a Transformer-based detector (RT-DETR), and a conventional two-stage detector (Faster R-CNN). YOLOv11n serves as the baseline.
3.1.1. Detection Accuracy Comparison
Table 2 reports metrics evaluated on the 476 original, unaugmented test images. As a supplementary robustness check, HSSD-YOLO was also evaluated on the augmented test set (762 images), yielding mAP@0.5 of 96.8% and mAP@0.5–0.95 of 77.5%, deviating by only 0.2 and 0.1 percentage points from the results reported in
Table 2. Among all compared methods, HSSD-YOLO attains the top ranking on every metric—94.4% Precision, 93.7% Recall, 96.6% mAP@0.5, alongside a mAP@0.5–0.95 score of 77.4%.
Compared with the baseline YOLOv11n, HSSD-YOLO improves Precision by 3.4 percentage points, Recall by 3.5 percentage points, mAP@0.5 by 2.5 percentage points, and mAP@0.5–0.95 by 5.4 percentage points. The substantial improvement in mAP@0.5–0.95 indicates that HSSD-YOLO maintains high localization accuracy across a range of IoU thresholds, which is essential for precise seed localization and, in turn, for accurate seed counting in seeding rate monitoring applications.
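To see why mAP@0.5–0.95 rewards tight localization more than mAP@0.5 does, recall that the COCO-style metric averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below is illustrative only: the function names and the two example IoU values are hypothetical and are not taken from the experiments.

```python
# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou):
    """Count at how many of the ten thresholds a prediction with this IoU counts as a match."""
    return sum(iou >= t for t in IOU_THRESHOLDS)

def mean_ap(ap_per_threshold):
    """mAP@0.5-0.95 is the mean of the AP values computed at each threshold."""
    assert len(ap_per_threshold) == len(IOU_THRESHOLDS)
    return sum(ap_per_threshold) / len(ap_per_threshold)

loose_box = thresholds_passed(0.62)   # hypothetical loosely localized prediction
tight_box = thresholds_passed(0.92)   # hypothetical tightly localized prediction
```

Both boxes count as correct at IoU 0.5, but the loosely localized one stops contributing at the stricter thresholds, so gains in box tightness show up mainly in the averaged metric.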
Across the YOLO series evolution, all metrics exhibit steady improvement, reflecting progressive architectural optimization. However, even the latest YOLOv11n and the high-performing RT-DETR-l only attain mAP@0.5–0.95 values of 72.0% and 71.8%, respectively, whereas HSSD-YOLO elevates this metric to 77.4%, demonstrating the practical effect of the improvement mechanisms designed in this work.
Notably, RT-DETR-l achieves Precision (90.8%) and mAP@0.5 (93.4%) comparable to those of YOLOv11n; however, as a Transformer-based detector, its parameter count is substantially higher than that of the lightweight YOLO models, making it poorly suited to deployment on resource-constrained agricultural embedded devices. Faster R-CNN, a classical two-stage detector, yields the lowest Precision (84.7%), Recall (83.4%), and mAP@0.5 (87.3%) among all evaluated models. This outcome is likely attributable to the limited capacity of its region proposal mechanism to resolve densely distributed small seeds under high-speed motion.
Figure 10 displays a three-dimensional bar chart comparing the normalized scores of each model on four key performance indicators: Precision, Recall, mAP@0.5, and mAP@0.5–0.95. HSSD-YOLO (highlighted in red) outperforms all competing models on every indicator, with the most pronounced advantage in mAP@0.5–0.95. These results are consistent with the combined benefit of the proposed MBE-Stem, ADCN, and EGAR-FPN modules for detection accuracy in high-speed seeding scenarios.
3.1.2. Model Complexity and Efficiency Analysis
In actual agricultural use cases, high detection accuracy must be coupled with efficient real-time inference on resource-limited embedded platforms. Therefore, this study further compares the parameter counts (Params) and floating-point operation counts (GFLOPs) of each model.
Table 3 presents representative models for comparison, including lightweight YOLO models (YOLOv5s, YOLOv11n), a Transformer detector (RT-DETR-l), a two-stage model (Faster R-CNN), together with our proposed model (HSSD-YOLO).
As shown in
Table 3, HSSD-YOLO has 5.2 M parameters and 11.8 GFLOPs. Compared with the baseline model YOLOv11n (2.6 M parameters, 6.5 GFLOPs), the parameter count approximately doubles and GFLOPs increase by approximately 82%. This increase primarily stems from the multi-branch parallel structure introduced in the MBE-Stem module, the integration of deformable convolutions and RSCA attention mechanisms in the ADCN module, and the additional parameters from edge-guided feature injection and bidirectional spatial recalibration modules in EGAR-FPN.
Despite the increased complexity, HSSD-YOLO remains substantially lighter than the non-lightweight comparison models. Specifically, compared with RT-DETR-l, HSSD-YOLO requires only 16.3% of the parameters (5.2 M vs. 32.0 M) and 12.8% of the computational cost (11.8 vs. 92.2 GFLOPs), while exceeding it by 3.2 percentage points in mAP@0.5 and 5.6 percentage points in mAP@0.5–0.95. Compared with Faster R-CNN, the advantages in parameters and GFLOPs are even more substantial—only 12.5% and 8.8%, respectively—while detection accuracy leads by a considerable margin. Even compared with the larger YOLOv5s (7.2 M parameters, 16.5 GFLOPs), HSSD-YOLO achieves a substantial improvement in mAP@0.5–0.95 from 57.5% to 77.4% with fewer parameters.
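The relative-cost figures above follow directly from the parameter and GFLOPs values quoted in the text. The short sketch below reproduces that arithmetic; the dictionary layout and the `relative_cost` helper are hypothetical, and Faster R-CNN is omitted because its absolute values are not quoted here.

```python
# (parameters in millions, GFLOPs) as quoted in the text for Table 3.
MODELS = {
    "HSSD-YOLO": (5.2, 11.8),
    "YOLOv11n": (2.6, 6.5),
    "YOLOv5s": (7.2, 16.5),
    "RT-DETR-l": (32.0, 92.2),
}

def relative_cost(model, reference):
    """Return (parameter %, GFLOPs %) of `model` relative to `reference`."""
    params, gflops = MODELS[model]
    ref_params, ref_gflops = MODELS[reference]
    return 100.0 * params / ref_params, 100.0 * gflops / ref_gflops

vs_rtdetr = relative_cost("HSSD-YOLO", "RT-DETR-l")     # roughly 16% of params, 13% of GFLOPs
vs_baseline = relative_cost("HSSD-YOLO", "YOLOv11n")    # roughly 2x params, 1.8x GFLOPs
vs_yolov5s = relative_cost("HSSD-YOLO", "YOLOv5s")      # below 100% on both axes
```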
In summary, HSSD-YOLO secures accuracy improvements of 2.5 points in mAP@0.5 and 5.4 points in mAP@0.5–0.95 at a moderate parameter overhead (2.6 M → 5.2 M), while sustaining an inference throughput of 85.1 FPS on the RTX 4090 D reference platform. This exceeds the conventional 30 FPS threshold for real-time detection on that platform; however, it remains below the maximum camera acquisition rate of 526.5 FPS. In a practical deployment, not every acquired frame needs to be processed: frame-skipping strategies can be adopted without compromising detection coverage, because consecutive frames at 526.5 FPS exhibit substantial temporal redundancy. The optimal acquisition-to-inference frame-rate ratio and its effect on downstream counting accuracy remain to be determined through dedicated experiments. Embedded-platform throughput has not been evaluated and warrants dedicated investigation.
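As a rough illustration of the frame-skipping idea, the smallest integer stride that keeps the processed frame rate within the measured inference throughput can be computed from the two rates quoted above. This is only a lower bound on the stride under the assumption of uniform skipping; the function name is hypothetical, and the optimal ratio remains an open question as noted in the text.

```python
import math

ACQUISITION_FPS = 526.5   # maximum camera acquisition rate (from the text)
INFERENCE_FPS = 85.1      # throughput on the RTX 4090 D reference platform (from the text)

def min_frame_stride(acq_fps, infer_fps):
    """Smallest stride N such that processing every N-th frame stays within inference throughput."""
    return math.ceil(acq_fps / infer_fps)

stride = min_frame_stride(ACQUISITION_FPS, INFERENCE_FPS)
processed_fps = ACQUISITION_FPS / stride   # frame rate actually sent to the detector
```

With these numbers the detector would need to process about one frame in seven; whether that stride preserves counting accuracy is exactly what the dedicated experiments mentioned above would have to establish.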
3.1.3. Qualitative Detection Results Analysis
To further visually demonstrate the detection effectiveness of HSSD-YOLO, this study selected three typical scenarios (high-speed motion images of indica rice, japonica rice, and wheat seeds) and performed visual comparisons of detection results from four models: HSSD-YOLO, YOLOv11n (baseline), YOLOv5s, and Faster R-CNN, as displayed in
Figure 11. In the figure, green bounding boxes indicate true positives (correct detections), blue bounding boxes indicate false positives, and red bounding boxes indicate false negatives (targets annotated as ground truth but not detected). To demonstrate detection performance under high-density operating conditions, the representative images were selected at per-tube supply rates approximating or marginally exceeding the maximum per-tube supply rates derived in
Section 2.1: indica rice at 22.1 g/s, japonica rice at 34.3 g/s, and wheat at 26.6 g/s. The minor deviations from the steady-state maxima reflect instantaneous temporal fluctuations within individual tubes, representing near-peak seed flow density under high-speed operating conditions.
From
Figure 11, the following key observations can be made: (1) In the japonica rice scenario, where seeds are short-elliptical, relatively dense, and overlapping, the adaptive sampling of the ADCN module in HSSD-YOLO precisely locates the boundaries of partially occluded seeds, reducing missed detections in dense regions. In contrast, the YOLOv11n baseline, while outperforming YOLOv5s, still exhibits some missed and false detections where seeds overlap. (2) In the indica rice scenario, where seeds are elongated and prone to significant motion blur during high-speed movement, HSSD-YOLO benefits from the explicit modeling of edge gradient information in MBE-Stem, recovering seed contour features under motion blur and producing no missed or false detections in the example shown. YOLOv5s and Faster R-CNN show considerably more missed detections (red boxes) in this scenario, particularly for seeds with severe edge blur. (3) In the wheat scenario, where seed density is highest and multiple seeds are closely packed, HSSD-YOLO leverages the edge-guided feature fusion of EGAR-FPN to enhance boundary discrimination between dense small targets, achieving accurate detection in regions where the other models all exhibit varying degrees of missed detection. Faster R-CNN shows the most severe false detections in this scenario, further illustrating the limitations of conventional two-stage detectors in dense small-target detection tasks.
3.2. Failure-Case Analysis
Examination of the residual errors on the test set reveals three axes that govern error incidence (seed morphology, per-tube loading, and motion-blur severity) and three corresponding failure patterns. Wheat contributes the most missed detections under dense flow; indica rice dominates longitudinal-blur merges; japonica is the least error-prone. Errors rise appreciably only when per-tube loading approaches or exceeds the maximum per-tube supply rate derived in
Section 2.1, and only under the heaviest blur does MBE-Stem leave visible residuals. Three recurring patterns mark the operating limits of each module.
The first pattern is cluster-induced missed detection (FN-dominant). When per-tube loading exceeds the maximum per-tube supply rate (Section 2.1), wheat kernels in particular travel as tight triples or quadruples whose silhouettes cannot be separated within the effective receptive field of ADCN. One or two kernels in such a cluster are missed as false negatives while surrounding detections remain correct. This represents the resolution ceiling of deformable sampling guided by RSCA attention.
The second pattern is elongated-target merging (FN-dominant, occasional FP). Under strong longitudinal blur, the gap between two adjacent indica rice kernels is filled by blur energy, causing one elongated box to bracket both. Typically, one kernel registers as a true positive while the other becomes a false negative; occasionally the merged box aligns with neither kernel, producing two false negatives and one false positive. This marks the limit of MBE-Stem: its learnable Sobel operators recover contours under moderate blur, but cannot restore separating boundaries when blur trails bridge inter-kernel spacing.
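The two outcomes of the merging pattern can be reproduced with a simple IoU-based matching routine. The sketch below is illustrative: the box coordinates are hypothetical, the 0.5 matching threshold is an assumption, and greedy one-to-one matching is a simplification of standard evaluation code.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def count_errors(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground truth; returns (TP, FP, FN)."""
    unmatched_gt = list(range(len(gt_boxes)))
    tp = 0
    for pred in pred_boxes:
        best, best_iou = None, iou_threshold
        for g in unmatched_gt:
            value = iou(pred, gt_boxes[g])
            if value >= best_iou:
                best, best_iou = g, value
        if best is not None:
            unmatched_gt.remove(best)
            tp += 1
    return tp, len(pred_boxes) - tp, len(unmatched_gt)

# Two hypothetical indica kernels stacked along the motion direction.
kernels = [(10, 10, 20, 30), (10, 32, 20, 52)]
typical = count_errors(kernels, [(10, 10, 20, 40)])     # merged box still anchored on kernel 1
occasional = count_errors(kernels, [(10, 10, 20, 52)])  # merged box spans both, fits neither
```

In the first case the merged box overlaps one kernel well enough to count as a true positive and leaves the other unmatched; in the second it falls below the threshold for both, giving two false negatives and one false positive.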
The third pattern is single-seed splitting (FP-dominant). Under the heaviest blur, intensity variations along the motion direction cause the detector to produce two overlapping boxes on one seed whose IoU falls just below the NMS threshold. One box matches the ground-truth kernel; the other is logged as a false positive. This limit arises because EGAR-FPN’s edge priors at adjacent pyramid levels drift apart under extreme trailing, and BASR cannot realign them sufficiently to suppress the duplicate.
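The duplicate-box mechanism can be sketched with a greedy non-maximum suppression routine. This is a minimal illustration, not the detector's post-processing code: the box coordinates and scores are hypothetical, and the 0.45 threshold is an assumed value because the text does not state the NMS setting.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold):
    """Visit boxes by descending score; keep a box unless it overlaps a kept box above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

# Two hypothetical boxes on one heavily blurred seed, offset along the motion direction.
boxes = [(10, 10, 20, 40), (10, 22, 20, 52)]
scores = [0.90, 0.60]
overlap = iou(boxes[0], boxes[1])
kept = greedy_nms(boxes, scores, iou_threshold=0.45)   # 0.45 is an assumed threshold
```

Here the overlap of about 0.43 sits just under the assumed threshold, so both boxes survive and the lower-scoring one is later logged as a false positive; a slightly lower threshold would remove it, at the risk of suppressing genuine neighbors in dense flow.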
In summary of the error-type mapping: pattern 1 contributes only to the missed-detection count; pattern 2 contributes chiefly to missed detections, with the occasional false positive when the merged box aligns with neither kernel; pattern 3 contributes only to false positives, in the form of duplicate boxes on a single kernel. Representative examples of the three cases are given in
Figure 12.
3.3. Ablation Study
To assess the individual and joint contributions of each enhancement module within HSSD-YOLO, we devised a series of ablation experiments starting from the YOLOv11n baseline and incrementally activating MBE-Stem, ADCN (RSCA), and EGAR-FPN, both individually and in combination. Every experiment shared identical data partitions and training configurations; the outcomes are tabulated in
Table 4, where bracketed values denote gains over the baseline.
From the single-module ablation results, the three improvement modules each contribute differently to overall model performance. Introducing the ADCN (RSCA) module alone yields the largest performance gain (1.4 percentage points in mAP@0.5, 1.8 percentage points in mAP@0.5–0.95), indicating that the adaptive sampling strategy based on deformable convolution and RSCA attention plays a key role in improving target localization accuracy. The EGAR-FPN module yields a 1.3-point improvement in mAP@0.5–0.95, second only to ADCN, suggesting that the edge-guided adaptive feature fusion strategy enhances detection robustness across varying IoU thresholds. The MBE-Stem module improves mAP@0.5–0.95 by 0.8 percentage points, supporting its capability to recover seed contour information under motion blur through multi-branch edge gradient modeling.
From the pairwise combination results, the modules exhibit clear complementary effects. The mAP@0.5–0.95 gains of the three dual-module combinations (+3.1, +2.9, +3.7 percentage points) all exceed the arithmetic sum of the corresponding single-module gains, with the combination of ADCN and EGAR-FPN performing best (mAP@0.5–0.95 of 75.7%, +3.7 percentage points), reflecting the strong synergy between adaptive sampling and edge-guided feature fusion in multi-scale detection. When all three modules are integrated (HSSD-YOLO), mAP@0.5 reaches 96.6% (+2.5) and mAP@0.5–0.95 reaches 77.4% (+5.4), exceeding the gains of any single- or dual-module configuration. The 5.4-point gain in mAP@0.5–0.95 is especially meaningful because this metric averages accuracy over IoU thresholds ranging from 0.5 to 0.95, indicating that HSSD-YOLO enhances both detection and localization performance in high-speed seeding scenarios. This synergy can be attributed to the shallow-level edge gradient features extracted by MBE-Stem providing more precise sampling references for ADCN, while EGAR-FPN further integrates edge information into multi-scale feature maps, forming a complete enhancement chain from feature extraction to feature fusion.
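The claim that combined gains exceed the sum of single-module gains can be checked with simple arithmetic on the mAP@0.5–0.95 figures quoted above. In the sketch below, only the ADCN + EGAR-FPN value (+3.7) is tied to a specific pair in the text; assigning +3.1 and +2.9 to the other two pairs follows the listing order and is an assumption, although the conclusion holds under either assignment.

```python
# Gains in mAP@0.5-0.95 (percentage points over YOLOv11n) quoted in the text.
SINGLE = {"MBE-Stem": 0.8, "ADCN": 1.8, "EGAR-FPN": 1.3}

# Pair assignment for +3.1 and +2.9 is assumed from the listing order.
COMBINED = {
    ("MBE-Stem", "ADCN"): 3.1,
    ("MBE-Stem", "EGAR-FPN"): 2.9,
    ("ADCN", "EGAR-FPN"): 3.7,
    ("MBE-Stem", "ADCN", "EGAR-FPN"): 5.4,
}

def synergy_surplus(modules, gain):
    """Observed combined gain minus the sum of the corresponding single-module gains."""
    return round(gain - sum(SINGLE[m] for m in modules), 1)

surplus = {mods: synergy_surplus(mods, gain) for mods, gain in COMBINED.items()}
```

A positive surplus for every combination is what the text describes as synergy; the full three-module model shows the largest surplus.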
To illustrate more intuitively how each module influences the model’s attention distribution, we carried out visualization experiments under high-density conditions. Specifically, we selected images with relatively dense seed distributions from those with per-tube supply rates slightly above the maximum per-tube supply rate derived in
Section 2.1 (indica rice: 23.1 g/s, japonica rice: 34.9 g/s, wheat: 27.6 g/s), representing frames in which the temporal fluctuations of individual tubes further elevated the supply rate beyond the steady-state inter-tube maxima. This setup serves as a deliberate stress-test for the feature extraction capability of each module under extreme density conditions.
Grad-CAM heatmaps are presented in
Figure 13.
The progressive changes in the heatmaps illustrate the contribution of each module. In the baseline, attention activation is incomplete across all three seed scenarios, with weak warm-color responses for some seeds. After introducing MBE-Stem, activation coverage expands, with new warm-color responses appearing at previously unactivated seed edge regions. Further adding ADCN makes the warm activation regions more concentrated, with notably improved attention separation between adjacent seeds in dense arrangements. When all three modules are integrated (HSSD-YOLO), the heatmaps display the most complete attention distribution: virtually all seed targets are covered by warm-color regions with concentrated, sharp activation responses, and background activation is largely suppressed. This progressive refinement of attention aligns closely with the quantitative metric improvements in
Table 4, further confirming the effectiveness of the three-module synergistic enhancement from the perspective of feature visualization.
3.4. Attention Mechanism Comparison Experiment
To demonstrate the advantage of RSCA over competing attention designs, we evaluated five alternative configurations within the HSSD-YOLO framework—keeping MBE-Stem and EGAR-FPN fixed and varying only the attention sub-module inside ADCN: ADCN (None), SE, CBAM, CA, and ECA. The results are reported in
Table 5.
As shown in
Table 5, every attention mechanism except ECA outperforms ADCN (None) on the mAP metrics, indicating that attention guidance generally benefits deformable convolution offset prediction.
Figure 14 presents a radar chart comparing each attention mechanism on three metrics: Precision, Recall, and mAP@0.5. RSCA outperforms all other mechanisms on every axis and covers the largest radar area. Among the alternatives, CA and CBAM perform similarly, with mAP@0.5–0.95 values of 76.3% and 75.8%, respectively. Both exceed SE (75.4%) and ECA (75.0%), which model only channel-wise features, suggesting that attention to the spatial dimension is important for precise offset generation in the presence of motion blur.
The proposed RSCA mechanism attains the best performance on all evaluation indicators, with mAP@0.5–0.95 reaching 77.4%, exceeding the next-best CA by 1.1 percentage points. The advantage of RSCA primarily derives from its decoupled three-path architecture: unlike SE, which models only channel dependencies, CBAM, which applies channel and spatial attention sequentially, or CA, which encodes positional information only along coordinate directions, RSCA simultaneously models dependencies across three dimensions through height, width, and channel paths, and further introduces cross-dimensional interaction, enabling the generated offsets and modulation masks to adapt more precisely to seed targets of different morphologies and degrees of motion blur. Additionally, RSCA adds only 0.6 M parameters compared to ADCN (None), achieving a favorable balance between efficiency and accuracy.
3.5. Feature Pyramid Structure Comparison Experiment
The Feature Pyramid Network (FPN) critically governs multi-scale detection performance, as its fusion strategy directly determines how well seeds of varying sizes are recognized. To benchmark EGAR-FPN against established pyramid architectures, we conducted experiments within the HSSD-YOLO framework—holding MBE-Stem and ADCN constant while substituting only the neck structure with five alternatives: PAN-FPN, BiFPN, ASFF, AFPN, and Gold-YOLO Neck. The corresponding results appear in
Table 6.
As presented in
Table 6, every alternative FPN structure improves on the default PAN-FPN in at least one evaluation metric, with most showing gains across all indicators.
Figure 15 presents a radar chart comparing each FPN architecture on three metrics: Precision, Recall, and mAP@0.5. EGAR-FPN achieves the largest coverage area across all three dimensions, demonstrating well-rounded and consistently superior performance. Gold-YOLO Neck performs closest to EGAR-FPN (mAP@0.5–0.95 of 76.4%), with AFPN and BiFPN following at 75.9% and 75.6%, respectively. Although ASFF achieves a relatively high Precision of 93.0%, its mAP@0.5–0.95 (75.2%) is marginally lower than those of BiFPN and AFPN, suggesting that its learning-based spatial filtering strategy may generalize less well under complex motion blur conditions.
EGAR-FPN substantially outperforms all compared structures with an mAP@0.5–0.95 of 77.4%, exceeding the next-best Gold-YOLO Neck by 1.0 percentage points. This advantage primarily derives from two core designs of EGAR-FPN: First, the MSEG module provides explicit edge prior information for the feature fusion process, a capability that is absent in other FPN structures. In high-speed seeding scenarios, motion blur causes severe degradation of seed edge information, and conventional FPN relying solely on the integration of top-down and bottom-up semantic features has difficulty effectively recovering boundary information, whereas EGAR-FPN explicitly injects edge gradient features into feature maps at each scale, providing critical references for precise target boundary reconstruction. Second, the BASR module realizes adaptive integration of deep semantic representations and shallow spatial detail through a bidirectional feature recalibration mechanism, which more effectively preserves fine-grained spatial information in dense seed scenarios compared with the unidirectional or simple weighted fusion strategies of BiFPN and ASFF.
4. Discussion
Achieving an optimal balance between recognition precision and processing overhead is the central challenge in developing per-frame seed detection algorithms for high-speed pneumatic seeders, where reliable seed detection constitutes the foundational step toward downstream seeding-rate monitoring and precision variable-rate application. On the self-constructed high-speed motion image dataset covering three seed varieties (indica rice, japonica rice, and wheat), HSSD-YOLO raised mAP@0.5 by 2.5 points (94.1% → 96.6%) and mAP@0.5–0.95 by 5.4 points (72.0% → 77.4%) over the YOLOv11n reference, while the parameter count grew from 2.6 M to 5.2 M. The model sustains an inference speed of 85.1 FPS on the RTX 4090 D reference platform, which exceeds the conventional 30 FPS real-time threshold but remains below the camera’s maximum acquisition rate of 526.5 FPS; in deployment, frame-skipping would reconcile this gap. Embedded-platform latency, throughput, and energy use have not yet been measured. The accuracy improvement is thus obtained with a moderate parameter increase. The per-parameter accuracy contribution, visible both in the ablation results (
Table 4) and in the YOLO-series trend (
Table 2), is consistent with targeted architectural choices (motion-blur compensation, deformation-aware sampling, and edge-guided fusion) aimed at the specific degradations of this task. A rigorous separation of feature-allocation effects from parameter-scaling effects would require controlled capacity-matched experiments that are beyond the scope of the present study.
Regarding individual module contributions, the MBE-Stem module enhances seed contour perception under motion-blur conditions by explicitly modeling directional gradient information through learnable Sobel operators. Under the elevated seed supply rates examined in this study, standard convolutions struggle to extract sufficient edge features from blurred images due to their implicit feature-learning paradigm. The standalone introduction of MBE-Stem improved mAP@0.5–0.95 by 0.8 percentage points, a finding that echoes the observation of Chen et al. [
16] regarding the importance of explicit edge modeling for detection performance in complex agricultural scenes. However, unlike prior edge-enhancement strategies that rely on fixed gradient operators, MBE-Stem employs a multi-branch parallel architecture that simultaneously captures edge magnitude, vertical gradient responses, and spatial context, then fuses these complementary cues via an adaptive channel-attention mechanism. This design yields a richer feature representation, particularly for elongated indica rice seeds whose longitudinal blur patterns differ markedly from those of rounder wheat kernels.
The ADCN module produced the largest single-module performance gain in ablation experiments (1.8 percentage points in mAP@0.5–0.95), highlighting the critical role of attention-guided adaptive sampling in improving target localization accuracy for morphologically diverse seeds. Comparative experiments on attention mechanisms further clarified the importance of joint spatial-channel modeling for offset prediction: mechanisms operating exclusively in the channel dimension, namely SE and ECA, achieved mAP@0.5–0.95 scores of 75.4% and 75.0%, respectively, both lower than mechanisms that additionally incorporate spatial attention, namely CA (76.3%) and CBAM (75.8%). The proposed RSCA mechanism achieved 77.4% mAP@0.5–0.95 by simultaneously modeling dependencies across height, width, and channel dimensions through a decoupled three-path architecture and by introducing explicit cross-dimensional interaction via a globally aggregated spatial statistic. RSCA surpassed the next-best CA mechanism by 1.1 percentage points while adding only 0.6 M parameters relative to ADCN (None), thus attaining a favorable accuracy–efficiency trade-off. These findings are consistent with recent conclusions from precision agricultural vision tasks that multi-dimensional joint attention yields superior performance compared with serial or independent attention formulations.
Comparative experiments on neck architectures confirmed the advantage of EGAR-FPN in edge-guided multi-scale feature fusion. Among all evaluated structures (PAN-FPN, BiFPN, ASFF, AFPN, and Gold-YOLO Neck), EGAR-FPN achieved the highest mAP@0.5–0.95 of 77.4%, exceeding the next-best Gold-YOLO Neck by 1.0 percentage points. This advantage derives from two complementary design decisions. First, the MSEG module provides explicit edge prior information at each FPN level through learnable directional gradient kernels, a capability absent in all compared structures. Under high-speed seeding conditions, motion blur severely degrades seed boundary information, and conventional FPN architectures that rely solely on top-down and bottom-up propagation of semantic features cannot recover this information effectively. Second, the BASR module enables adaptive fusion of upper-level semantic context and shallow spatial details through a bidirectional, spatially gated recalibration mechanism, which preserves fine-grained spatial information in dense seed arrangements more effectively than the unidirectional propagation of BiFPN or the learning-based spatial filtering of ASFF. These results align with the conclusion of Zhou et al. [
19] in loquat detection that task-adaptive feature fusion strategies outperform generic lightweight designs for small-object detection.
In the ablation, each of the three dual-module configurations produced mAP@0.5–0.95 gains (3.1, 2.9, and 3.7 percentage points, respectively) that exceeded the arithmetic sum of the corresponding single-module gains, and the full three-module integration (+5.4 percentage points) surpassed any pairwise combination. This pattern is consistent with complementary functional roles among the modules; a conclusive demonstration of synergy would still require a controlled, matched-capacity analysis. One plausible account of this complementarity is that shallow-level edge gradient features extracted by MBE-Stem provide more discriminative sampling references for ADCN’s deformable convolutions, while EGAR-FPN subsequently integrates these edge cues across all feature pyramid levels, forming a coherent enhancement chain from initial feature extraction through multi-scale fusion. Grad-CAM offers a qualitative interpretation at the level of feature-attention saliency rather than a mechanistic claim: as modules are added in turn, the warm-color activations evolve from incomplete and diffuse coverage in the YOLOv11n baseline to concentrated, target-aligned activation in the full HSSD-YOLO, with concomitant suppression of background responses. We read this as qualitative evidence compatible with the intended roles of the three modules; a rigorous mechanistic attribution would require complementary tools such as integrated gradients or layer-wise relevance propagation applied to matched image sets.
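The superadditivity statement for the full model can be checked from the figures quoted in this section alone. The sketch below is a bound rather than an exact decomposition: the standalone EGAR-FPN gain is not restated here, so it is assumed to be at most 1.8 percentage points, which follows from ADCN being described as the largest single-module gain. Variable names are illustrative.

```python
# mAP@0.5-0.95 gains over the YOLOv11n baseline, in percentage points.
GAIN_MBE_STEM = 0.8   # standalone MBE-Stem, as quoted in the text
GAIN_ADCN = 1.8       # standalone ADCN, quoted as the largest single-module gain
GAIN_FULL = 5.4       # full three-module HSSD-YOLO

# Assumption: the standalone EGAR-FPN gain is bounded above by the ADCN gain.
egar_fpn_upper_bound = GAIN_ADCN

# Upper bound on the purely additive (no-interaction) prediction.
additive_upper_bound = round(GAIN_MBE_STEM + GAIN_ADCN + egar_fpn_upper_bound, 1)

# Lower bound on the part of the full gain not explained by simple addition.
interaction_lower_bound = round(GAIN_FULL - additive_upper_bound, 1)

print(additive_upper_bound, interaction_lower_bound)
```

Even under the most generous assumption for EGAR-FPN, the additive prediction is at most 4.4 points, leaving at least 1.0 point of the 5.4-point gain attributable to module interaction or to the capacity effects that the matched-capacity caveat above refers to.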
Comparison with representative detection algorithms further corroborates the practical utility of HSSD-YOLO. The Transformer-based RT-DETR-l, despite competitive detection accuracy (mAP@0.5 of 93.4%), carries 32.0 M parameters and 92.2 GFLOPs, approximately six and eight times those of HSSD-YOLO (5.2 M parameters, 11.8 GFLOPs), respectively, rendering it poorly suited to resource-constrained agricultural embedded platforms. Faster R-CNN was among the weakest performers (mAP@0.5–0.95 of 58.0%), likely because its Region Proposal Network does not effectively handle densely distributed small-target seeds under high-speed motion conditions. Even compared with the larger YOLOv5s (7.2 M parameters), HSSD-YOLO achieves a smaller model footprint (5.2 M parameters) alongside a 19.9 percentage point improvement in mAP@0.5–0.95 (from 57.5% to 77.4%). Taken together, these results indicate that domain-specific architectural optimization addressing the characteristic degradations of high-speed seeding (motion blur, dense small targets, and cross-variety morphological diversity) yields substantially greater gains than simply scaling model capacity.
The design principles embodied in HSSD-YOLO—explicit edge modeling, attention-guided adaptive sampling, and edge-injected cross-scale information aggregation—offer methodological insights that may transfer to other agricultural vision tasks sharing similar degradation characteristics (motion-induced edge loss, dense target arrangement, morphological variability). Empirical generalizability, however, is established only for the seed-detection task on the present dataset. Transfer to rice-panicle density estimation in unmanned harvesters, occluded fruit detection in orchards, or weed identification in cereal fields is a plausible hypothesis that must be tested directly, rather than asserted on the strength of the present experiments. We also note that HSSD-YOLO shows strong within-setting robustness to the simulated perturbations introduced during augmentation (motion blur, noise, illumination shift); robustness outside the reported acquisition setting has not been tested empirically and is not asserted.
It is worth stating the scope of this contribution plainly. The validated task here is single-frame seed detection. Extending this to seed counting—per-frame detections aggregated via a temporal association or tracking stage—and further to seeding-rate estimation by integration over time has not been empirically tested in this study. Counting accuracy, inter-frame consistency, and any form of closed-loop seeding-rate control should therefore be read as downstream capabilities enabled by the present detector rather than demonstrated by it.
From detection to counting. The intended downstream use of HSSD-YOLO is to aggregate per-frame detections over a continuous video stream through a lightweight multi-object tracker such as SORT or ByteTrack. Each detected seed then carries a temporal identity, so that one physical seed traversing several frames contributes exactly once to the cumulative count; per-unit-time seed flux follows by differencing the cumulative count over time. The frame-level precision and recall reported here characterize the input to that counting stage. This pipeline remains entirely conceptual at present: identity-switch rate and end-to-end counting error, in addition to the quantities listed above, have not been experimentally validated in this work. The end-to-end counting error depends on tracker choice, association thresholds, and the interaction between detection confidence and track initialization, all of which lie beyond the present study and constitute the immediate next step.
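The counting idea described above can be sketched in a few lines. This is a deliberately simplified stand-in, not SORT or ByteTrack: it uses greedy nearest-centroid association with no motion model and no occlusion handling, and the function names, the `max_dist` threshold, and the example coordinates are all hypothetical.

```python
import math

def count_seeds(frames, max_dist=20.0):
    """Return the cumulative number of unique seeds after each frame.

    frames: list of frames, each a list of (x, y) detection centroids.
    A detection within max_dist pixels of an existing track continues it;
    otherwise it starts a new track and increments the count.
    """
    tracks = {}        # track id -> last known centroid
    next_id = 0        # also equals the number of seeds seen so far
    cumulative = []
    for detections in frames:
        # All track/detection pairs, closest first.
        pairs = sorted((math.dist(pos, det), tid, j)
                       for tid, pos in tracks.items()
                       for j, det in enumerate(detections))
        used_tracks, used_dets, updated = set(), set(), {}
        for dist, tid, j in pairs:
            if dist > max_dist:
                break                      # remaining pairs are even farther
            if tid in used_tracks or j in used_dets:
                continue
            used_tracks.add(tid)
            used_dets.add(j)
            updated[tid] = detections[j]
        for j, det in enumerate(detections):
            if j not in used_dets:         # unmatched detection -> new seed
                updated[next_id] = det
                next_id += 1
        tracks = updated                   # unmatched tracks are dropped at once
        cumulative.append(next_id)
    return cumulative

def seed_flux(cumulative, fps, window):
    """Seeds per second, by differencing the cumulative count over `window` frames."""
    return [(cumulative[i] - cumulative[i - window]) * fps / window
            for i in range(window, len(cumulative))]

# Two hypothetical seeds falling 15 px per frame in separate columns.
frames = [[(10, 0)], [(10, 15), (40, 0)], [(10, 30), (40, 15)], [(40, 30)]]
counts = count_seeds(frames)
print(counts, seed_flux(counts, fps=75.0, window=2))
```

In this toy sequence six detections collapse to two unique seeds. A practical pipeline would add motion prediction and a track-retention policy, and the `max_dist` threshold would have to be tuned to the seed displacement between processed frames.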
Prospects for integration with pneumatic-transport physics. The present study stays within the scope of computer vision and does not attempt aerodynamic or particle-flow modeling. As a natural extension, the per-tube seed-flux time series produced by the counting pipeline could in future work be cross-referenced with the inter-tube coefficient of variation defined in
Section 2.1, and, when available, with blower static pressure and airflow–velocity telemetry, allowing deviations linked to tube blockage or two-phase-flow instability to be flagged at the vision output without modifying the detector.
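As a rough illustration of how per-tube counts could feed such a check, the sketch below computes an inter-tube coefficient of variation and flags tubes with unusually low counts. It assumes the conventional definition (standard deviation divided by mean, in percent, using the population standard deviation), which may differ in detail from the definition in Section 2.1. The counts, the 0.8 ratio threshold, and the function names are hypothetical.

```python
import statistics

def inter_tube_cv(counts):
    """Coefficient of variation across tubes, in percent (population std / mean)."""
    mean = statistics.fmean(counts)
    return 100.0 * statistics.pstdev(counts) / mean

def flag_low_tubes(counts, ratio=0.8):
    """Indices of tubes whose count falls below `ratio` times the median count."""
    reference = statistics.median(counts)
    return [i for i, c in enumerate(counts) if c < ratio * reference]

# Hypothetical seed counts for five delivery tubes over the same time window.
tube_counts = [100, 98, 102, 60, 101]

cv_percent = inter_tube_cv(tube_counts)
suspect_tubes = flag_low_tubes(tube_counts)
print(f"CV = {cv_percent:.1f}%, suspect tubes: {suspect_tubes}")
```

For these made-up counts the coefficient of variation is about 17.5% and tube index 3 is flagged. The median is used as the reference so that a single blocked tube does not drag down the threshold it is compared against.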
Several limitations of the current study merit acknowledgment. First, all images were acquired on the single testbed described in
Section 2.2 using one camera (Hikvision MV-CS004-10UM, USB 3.0), one backlight, one lens configuration, and laboratory-controlled lighting, covering only three cultivars: japonica rice (‘Wuyoudao No. 4’), indica rice (‘Fengyouxiangzhan’), and wheat (‘Jimai 22’). Crops with markedly different morphology (e.g., maize, soybean, rapeseed) were not evaluated, and cross-crop generalization should not be assumed without retraining. Robustness to different sensors, optics, illumination, and field dust has likewise not been tested; any claim of practical applicability is bounded accordingly, and multi-sensor, multi-site, and multi-crop validation remain priorities for subsequent studies. Second, although the computational complexity is modest (5.2 M parameters, 11.8 GFLOPs, 85.1 FPS on the RTX 4090 D reference platform), the reported throughput was obtained on a high-end desktop GPU rather than on a field-side embedded platform. Systematic benchmarking on the NVIDIA Jetson Orin NX/AGX family commonly used in agricultural edge devices has not been conducted in this study. Post-quantization (INT8 or FP16) accuracy and throughput, inference latency under continuous operation, thermal throttling behavior, and power consumption therefore remain to be established. These factors are critical for field-ready deployment; the deployability claims in this paper are limited by this gap, and closing it is an immediate next step. Third, environmental extremes were not covered. Our robustness tests used real high-speed footage supplemented by synthetic degradation, yet the harshest field conditions (heavy chassis vibration during turns, sub-zero temperatures affecting camera response) were not represented and will need dedicated on-farm trials. Future research should address these limitations by conducting cross-dataset validation across diverse geographic regions and operational environments, applying parameter-reduction strategies such as weight sparsification and teacher–student learning (knowledge distillation) to enable edge-side deployment, and integrating the seed detection algorithm with seed counting and seeder actuator control systems to realize closed-loop seeding-rate feedback, thereby bridging the gap between algorithmic benchmarking and full-scale field application.
5. Conclusions
This study tackled the fundamental difficulties of real-time seed detection in rice–wheat dual-purpose high-speed pneumatic seeders—specifically, rapid seed motion, severe dense occlusion, and edge feature degradation induced by motion blur—by proposing HSSD-YOLO, a target recognition architecture derived from the YOLOv11 backbone and incorporating three targeted improvement modules: MBE-Stem, ADCN, and EGAR-FPN. MBE-Stem recovers seed contour features degraded by motion blur through learnable directional gradient operators. ADCN, equipped with the proposed RSCA mechanism, enhances adaptive sampling for morphologically diverse seeds via joint spatial-channel attention. EGAR-FPN injects explicit edge prior information into multi-scale feature fusion, improving boundary discrimination for densely overlapping targets.
On the self-constructed dataset of high-speed motion images comprising indica rice, japonica rice, and wheat seeds, HSSD-YOLO achieved 96.6% mAP@0.5 and 77.4% mAP@0.5–0.95, improvements of 2.5 and 5.4 percentage points over the YOLOv11n baseline. Benchmarked against a set of mainstream detection algorithms (YOLOv5s, YOLOv8n, YOLOv9t, YOLOv10n, RT-DETR-l, and Faster R-CNN), HSSD-YOLO ranked first on every evaluation criterion while keeping the model size at 5.2 M parameters and delivering an inference throughput of 85.1 FPS on the RTX 4090 D reference platform. This throughput exceeds the conventional 30 FPS real-time threshold but remains below the camera’s 526.5 FPS maximum acquisition rate; in practice, frame-skipping can bridge this gap owing to the substantial temporal redundancy at full acquisition speed, whereas embedded-platform throughput, including post-quantization performance, remains to be measured. Ablation studies showed that each of the three proposed modules improves detection performance on its own and that their joint integration produces gains exceeding linear superposition, a pattern consistent with complementary roles: the maximum single-module improvement in mAP@0.5–0.95 was 1.8 percentage points, whereas full model integration achieved a 5.4 percentage point gain. Comparative experiments on attention mechanisms and feature pyramid architectures further showed that RSCA and EGAR-FPN outperform mainstream alternatives, including SE, CBAM, CA, BiFPN, and ASFF. Overall, HSSD-YOLO supplies a computationally feasible detection algorithm for accurate per-frame localization of seeds under high-speed seeding conditions, establishing a prerequisite for vision-based seeding-quality monitoring.
Full implementation of seeding-rate estimation with counting-error characterization, temporal consistency across continuous video, closed-loop actuator feedback, and systematic benchmarking on agricultural embedded platforms (including INT8/FP16 post-quantization accuracy, latency, and power consumption) lies outside the scope of this paper and constitutes a necessary direction for subsequent research.