Article

HSSD-YOLO: A Motion-Blur-Robust Object Detection Framework for Real-Time Seed Detection in High-Speed Pneumatic Seeders

1 College of Engineering, South China Agricultural University, Guangzhou 510642, China
2 Key Laboratory of Key Technology on Agricultural Machine and Equipment (South China Agricultural University), Ministry of Education, Guangzhou 510642, China
3 State Key Laboratory of Agricultural Equipment Technology, Beijing 100083, China
4 Guangdong Engineering Research Center of Intelligent Planting Equipment for Field Crops, Guangzhou 510642, China
5 Guangdong Provincial Key Laboratory of Agricultural Artificial Intelligence (GDKL-AAI), Guangzhou 510642, China
6 Huangpu Innovation Research Institute, South China Agricultural University, Guangzhou 510715, China
* Author to whom correspondence should be addressed.
Agriculture 2026, 16(11), 1160; https://doi.org/10.3390/agriculture16111160
Submission received: 30 March 2026 / Revised: 2 May 2026 / Accepted: 22 May 2026 / Published: 25 May 2026
(This article belongs to the Special Issue Intelligent Agricultural Seeding Equipment)

Abstract

For high-speed pneumatic seeders, accurate real-time seed detection underpins downstream quality assessments including seed counting, seeding-rate estimation, and uniformity evaluation. Under high-speed operating conditions, seeds exhibit rapid motion, dense distribution, frequent occlusion, and severe motion-blur-induced edge degradation, posing substantial challenges for vision-based detection. This study proposes HSSD-YOLO, an improved detection algorithm built upon YOLOv11, incorporating three modules: a Motion Blur Enhanced Stem module (MBE-Stem) employing learnable Sobel gradient operators for edge feature extraction under motion blur; an Attention-enhanced Deformable Convolutional Network (ADCN) with a Residual Spatial-Channel Attention (RSCA) mechanism for adaptive sampling of irregularly shaped seeds; and an Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN) injecting edge prior information into multi-scale feature fusion. On a self-constructed dataset of indica rice, japonica rice, and wheat seeds, HSSD-YOLO achieves 96.6% mAP@0.5 and 77.4% mAP@0.5–0.95, surpassing YOLOv11n by 2.5 and 5.4 percentage points, respectively, with only 5.2 M parameters. Ablation studies confirm synergistic gains exceeding linear superposition. Under the conditions evaluated, HSSD-YOLO outperformed all compared algorithms, providing the per-frame detection foundation for downstream seeding-quality tasks; empirical validation of those tasks on continuous video and embedded hardware remains outside the present scope.

1. Introduction

Precision agriculture represents a core direction of modern agricultural development, and seeding, as a critical stage in crop production, directly influences yield outcomes and resource utilization efficiency [1,2]. Rice and wheat, as the most important staple cereals in China [3], are particularly sensitive to seeding quality—including seeding rate, uniformity, and the absence of missed or blocked rows—which substantially influences both yield and growth quality [4,5]. In recent years, pneumatic high-speed seeders have emerged as the mainstream equipment for improving seeding efficiency [6]. However, under high-speed operating conditions, the elevated seed flow density, rapid movement velocity, and frequent mutual occlusion pose severe challenges for real-time seed detection—the foundational step upon which all downstream seeding quality assessment tasks depend.
Reliable seeding quality monitoring encompasses multiple complementary functions: seed counting for seeding rate estimation, spatial distribution analysis for uniformity assessment, and anomaly detection for identifying missed or blocked delivery tubes [4]. Among these, accurate real-time detection of individual seeds within the high-speed seed flow constitutes the most fundamental and technically demanding prerequisite. Only when seeds can be reliably detected and localized in each captured frame do subsequent counting, rate computation, and quality evaluation become feasible. Therefore, developing robust seed detection methods capable of operating under high-speed pneumatic seeding conditions is of considerable practical importance for advancing vision-based seeding quality monitoring systems.
Conventional approaches to seed flow monitoring in pneumatic seeders primarily rely on one-dimensional physical signals acquired from photoelectric, capacitive, and piezoelectric sensors. Photoelectric sensors have been applied to monitor seed passage in corn metering devices [7] and belt-type high-speed seeders [8]; however, as seed flow density increases, overlapping seeds and mechanical interference severely degrade detection accuracy. Capacitive sensors detect seeds through variations in equivalent dielectric constants [9], yet their performance is highly sensitive to environmental fluctuations and cannot reliably differentiate overlapping seeds. Piezoelectric approaches convert mechanical impact energy into voltage pulses and have demonstrated promising results at moderate speeds [10,11], but their contact-based nature disrupts seed falling trajectories, and system sensitivity declines markedly at elevated rotational speeds. Overall, these one-dimensional methods lack spatial morphological perception and are consequently compromised by seed posture variations, overlapping, and occlusion. These limitations motivate the exploration of non-contact, high-dimensional visual detection methods for dense, high-speed seed flows.
To overcome these limitations, recent studies have increasingly explored deep learning-based object detection for crop recognition, counting, and quality assessment [12]. Representative applications include fruit detection and counting in orchards [13,14], rice panicle density estimation [15], weed detection with enhanced attention and feature fusion strategies [16,17,18], and lightweight small-target detection in complex orchard environments [19]. Taken together, these works highlight the extensive applicability of object detection frameworks across diverse agricultural vision applications [20]. In the specific domain of seed detection, Xing et al. [21] developed a real-time monitoring platform for rice pneumatic seed metering devices by enhancing the YOLOv5n architecture, achieving detection accuracies of 88.8–98.65% across different rotational speeds. However, their system was tailored for single-row devices with relatively low seed flow speeds, and detection accuracy declined at higher rotational speeds due to motion blur and vibration, indicating that further algorithmic refinements are required for high-speed, high-density seeding scenarios.
Despite these advances, applying object detection to seed detection in high-speed seeder scenarios presents several domain-specific challenges. First, at elevated operating speeds, image acquisition inevitably produces motion blur, causing seed edge contours to become indistinct and high-frequency detail information to degrade substantially. Second, dense small-target detection remains inherently difficult: a single frame may contain dozens of seeds with frequent mutual occlusion and overlap, leading to increased rates of missed and false detections. Third, the morphological differences between rice and wheat seeds are substantial—rice seeds exhibit an elongated elliptical shape whereas wheat seeds present an ellipsoidal geometry—requiring detection algorithms with robust cross-crop generalization capability [22,23].
To tackle the aforementioned issues, we present HSSD-YOLO (High-Speed Seed Detection YOLO), a refined detection framework constructed on top of the YOLOv11 backbone. Three specific advances are reported in this paper:
(1)
A Motion Blur Enhanced Stem module (MBE-Stem) is designed to explicitly recover seed contour features under motion blur through learnable directional gradient operators combined with adaptive channel attention fusion.
(2)
An Attention-enhanced Deformable Convolutional Network (ADCN) incorporating a novel Residual Spatial-Channel Attention (RSCA) mechanism is proposed to improve adaptive sampling accuracy for irregularly shaped seeds.
(3)
An Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN) is constructed to inject edge prior information into multi-scale feature fusion, enhancing boundary discrimination for densely overlapping targets.

2. Materials and Methods

2.1. High-Speed Air-Assisted Centralized Metering System for Rice/Wheat

This study is based on a high-speed air-assisted centralized metering system for rice and wheat. The system mainly consists of a blower, a seed hopper, a seed supply device, a high-speed mixing device, a corrugated booster tube, a distributor, and 12 seed delivery tubes, as illustrated in Figure 1. During operation, seeds from the hopper are metered by a mechanical seed supply device to form a stable seed flow entering the mixing section. Through the convergent intake section, the airflow generated by the blower is accelerated into a high-speed jet that mixes with the seed flow in the mixing chamber. Under the impact of the high-speed jet, the low-speed seed flow is dispersed within the delivery tube, forming a uniform high-speed gas–solid two-phase flow. The mixture is conveyed through horizontal tubes, redirected vertically via an elbow, homogenized by the corrugated booster tube, and finally distributed through the distributor into n = 12 individual delivery tubes, completing the entire process of supply, acceleration, delivery, homogenization, distribution, and guidance to achieve high-speed, uniform, and stable seed metering.
The high-speed mixing device, which serves as the seed feeder and core component of the conveyor subsystem, is designed on the basis of the Venturi principle with an equal-width inclined seed inlet and consists of four serially arranged sections, namely a gradually shrinking section, an inclined seed feeding section, a mixing section, and a diffusion section. In operation, the airflow supplied by the blower is accelerated into a high-speed jet within the gradually shrinking section; meanwhile, the metered seed flow enters through the inclined seed feeding section and merges with this jet inside the mixing section, where the seeds are redirected from a nearly vertical falling trajectory to the horizontal conveying direction. The diffusion section downstream gradually enlarges the cross-section so as to recover static pressure and homogenize the gas-solid two-phase flow before it enters the horizontal delivery tube [24]. The key structural parameters of these four sections—including the throat geometry of the shrinking section, the outlet height of the mixing section, and the angle and height of the inclined seed feeding section—were optimized through DEM-CFD coupled simulation and verified by bench tests in a previous study from our group, so that stable conveying of indica rice, japonica rice and wheat is maintained across the full supply range considered in the present work without seed backflow or countercurrent.
To establish the high-speed operating context of this study, the total seed supply rate under field conditions is first derived from agronomic seeding parameters. Given the working width $b$ (m), the forward speed $v$ (m/s), and the target seeding rate $Q_s$ (g/m²), the total seed supply rate $Q_T$ (g/s) across all 12 delivery tubes is as follows:
$$Q_T = b \times v \times Q_s \tag{1}$$
where $b$ = 2.4 m is the working width of the seeder. Based on relevant literature and Chinese agronomic requirements for mechanized broadcast seeding, the recommended seeding rates are 90–225 kg/hm² (9–22.5 g/m²) for indica rice, 240–375 kg/hm² (24–37.5 g/m²) for japonica rice, and 100–300 kg/hm² (10–30 g/m²) for wheat [25,26,27]. Combined with typical high-speed operating speeds of 8–14 km/h (2.22–3.89 m/s) [24], the theoretical total supply rate ranges calculated by Equation (1) are 48.0–210.0 g/s for indica rice, 128.0–350.0 g/s for japonica rice, and 53.3–280.0 g/s for wheat. To further ensure that the experimental conditions correspond to high-speed seeding, the maximum total supply rates $Q_T$ are set above the respective theoretical upper bounds: 250 g/s for indica rice, 400 g/s for japonica rice, and 300 g/s for wheat.
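As a quick check of the ranges quoted above, the following minimal sketch evaluates Equation (1) at the two ends of each range. The function and variable names are illustrative only and are not taken from the authors' code; the numeric inputs are the values stated in the text.

```python
# Equation (1): Q_T = b * v * Q_s, evaluated at the ends of each range.
WORKING_WIDTH_M = 2.4                      # b, working width from the text
SPEED_RANGE_KMH = (8.0, 14.0)              # typical high-speed operation
SEEDING_RATE_G_M2 = {                      # Q_s ranges quoted in the text
    "indica rice": (9.0, 22.5),
    "japonica rice": (24.0, 37.5),
    "wheat": (10.0, 30.0),
}


def total_supply_rate(width_m, speed_m_s, rate_g_m2):
    """Total seed supply rate Q_T in g/s."""
    return width_m * speed_m_s * rate_g_m2


def supply_range(crop):
    """(min, max) of Q_T in g/s for one crop, rounded to 0.1 g/s."""
    v_lo, v_hi = (kmh / 3.6 for kmh in SPEED_RANGE_KMH)   # km/h -> m/s
    q_lo, q_hi = SEEDING_RATE_G_M2[crop]
    return (round(total_supply_rate(WORKING_WIDTH_M, v_lo, q_lo), 1),
            round(total_supply_rate(WORKING_WIDTH_M, v_hi, q_hi), 1))


for crop in SEEDING_RATE_G_M2:
    print(crop, supply_range(crop))
```

The lower bound pairs the lowest speed with the lowest seeding rate, and the upper bound pairs the highest speed with the highest seeding rate, which reproduces the 48.0–210.0, 128.0–350.0, and 53.3–280.0 g/s ranges.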
Since the vision-based detection module is deployed on a single delivery tube, the per-tube seed supply rate directly determines the detection difficulty. However, the air-assisted distribution inherently introduces non-uniformity among the $n$ tubes. To characterize the per-tube supply rate under the most demanding detection scenario, the inter-tube coefficient of variation of discharge uniformity $V_n$ is introduced, which is defined as follows:
$$V_n = \frac{S}{\bar{q}} \tag{2}$$
where $S$ is the standard deviation of the individual tube discharge rates and $\bar{q}$ is their mean value. Under the assumption of an approximately normal distribution of discharge rates among the 12 tubes, the per-tube supply rate at one standard deviation above the mean represents the most demanding single-tube supply condition under steady-state inter-tube distribution. The maximum per-tube supply rate $q_{max}$ is therefore estimated as follows:
$$q_{max} = \bar{q} + S = \bar{q}\,(1 + V_n) = \frac{Q_T}{n}\,(1 + V_n) \tag{3}$$
For $n$ = 12 tubes under a normal distribution, approximately 1–2 tubes are expected to have supply rates at or above this level at any given time, making $q_{max}$ a representative benchmark for the most demanding per-tube detection scenario based on inter-tube distribution. It should be noted that this value characterizes the steady-state spatial allocation among the 12 tubes; the instantaneous supply rate of a given tube may further fluctuate over time due to the inherent stochasticity of the pneumatic delivery process, occasionally exceeding the steady-state maximum in individual frames.
According to prior studies from our group, the $V_n$ values obtained were 5.82% for indica rice, 2.16% for japonica rice, and 5.11% for wheat. Substituting these values into Equation (3), the maximum per-tube supply rates $q_{max}$ are obtained as 22.0 g/s for indica rice, 34.1 g/s for japonica rice, and 26.3 g/s for wheat. These steady-state inter-tube maxima serve as the baseline detection benchmarks adopted throughout the subsequent experiments.
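The per-tube figures can be reproduced with a short calculation. The sketch below applies Equation (3) to the stated totals and coefficients of variation, and also estimates how many of the 12 tubes would sit at or above one standard deviation over the mean under the normal-distribution assumption. Names are illustrative, not the authors'.

```python
from statistics import NormalDist

N_TUBES = 12
MAX_TOTAL_RATE_G_S = {"indica rice": 250.0, "japonica rice": 400.0, "wheat": 300.0}
INTER_TUBE_CV = {"indica rice": 0.0582, "japonica rice": 0.0216, "wheat": 0.0511}


def max_per_tube_rate(total_rate, cv, n_tubes=N_TUBES):
    """Equation (3): q_max = (Q_T / n) * (1 + V_n)."""
    return total_rate / n_tubes * (1.0 + cv)


q_max = {
    crop: round(max_per_tube_rate(MAX_TOTAL_RATE_G_S[crop], INTER_TUBE_CV[crop]), 1)
    for crop in MAX_TOTAL_RATE_G_S
}

# Expected number of tubes at or above mean + 1 SD under a normal model:
# n * P(Z >= 1).
expected_tubes_above = N_TUBES * (1.0 - NormalDist().cdf(1.0))

print(q_max)
print(round(expected_tubes_above, 2))
```

The expected count comes out at roughly 1.9 tubes, which agrees with the "approximately 1–2 tubes" statement above.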

2.2. Data Acquisition and Processing

2.2.1. High-Speed Seeding Image Dataset Collection

Data collection was conducted in September 2025 on an experimental testbed constructed based on the high-speed air-assisted centralized metering system described in Section 2.1. The testbed retains the complete seed metering and conveying pathway—including the blower, seed hopper, seed supply device, high-speed mixing device, corrugated booster tube, distributor, and 12 seed delivery tubes—and is additionally equipped with a monitoring conduit, a flat-panel LED backlight source (KCS KP-150-150-W; KCS, Dongguan, China), a light shield, and a high-speed industrial camera (MV-CS004-10UM; Hikvision, Hangzhou, China) for image acquisition, as shown in Figure 2.
Based on the supply rate analysis in Section 2.1, the total supply rates (12 tubes) during data collection were set to the maximum total supply rates: 250 g/s for indica rice, 400 g/s for japonica rice, and 300 g/s for wheat. The conveying airflow velocity for all three seed types was set to 35 m/s to ensure operational stability of the pneumatic metering system. The number of metering wheel groups was adjusted according to each seed type and target supply rate based on prior calibration.
To capture high-speed seed flow images, a monitoring conduit and supplementary backlight source were installed on one of the 12 delivery tubes, as shown in Figure 2. The flat-panel LED backlight provides uniform illumination that highlights seed contours and enhances edge visibility under high-speed motion conditions. The working distance between the light-emitting surface and the imaging plane was kept approximately constant across all acquisitions, and the light shield blocks ambient light, so that each frame was captured under consistent illumination. The camera was positioned directly facing the monitoring conduit. A Hikvision USB 3.0 interface area-scan monochrome camera was selected, featuring a maximum frame rate of 526.5 fps and a resolution of 720 × 540 pixels, satisfying the high-speed image acquisition requirements of this study. Each acquisition lasted 30 s, after which the corresponding seed volume was measured using a graduated cylinder to verify the actual supply rate.
The dataset covers three seed types: japonica rice (‘Wuyoudao No. 4’), indica rice (‘Fengyouxiangzhan’), and wheat (‘Jimai 22’). Sample images are shown in Figure 3. For each seed type, multiple acquisitions were performed at the maximum supply rates to capture a sufficient number of images, as shown in Figure 4.

2.2.2. Image Preprocessing

The raw high-speed footage yielded 20,436 frames across the three crops. Temporally adjacent near-duplicate frames were removed with a perceptual-hash (pHash) similarity screen: a 64-bit pHash signature was computed for each candidate frame, and any frame within a Hamming distance of 10 of an already-retained frame was discarded. This threshold was chosen empirically to eliminate near-identical neighbors in time while preserving visibly distinct seed configurations. After screening, 4763 unique frames remained, distributed as 1542 indica rice (32.4%), 1658 japonica rice (34.8%), and 1563 wheat (32.8%). The 4763 originals were then randomly partitioned at the image level into 70% training, 20% validation, and 10% test—3334, 953, and 476 images respectively—stratified by variety so that per-crop proportions held across all three subsets. Splitting preceded any augmentation; consequently, no original images and none of their augmented variants appear in more than one subset, ruling out cross-split leakage. Offline augmentation was then applied independently within each subset with Albumentations v1.3.1 [28], writing augmented variants to disk rather than generating them on-the-fly at training time. Five transformation families were used—geometric (affine, translation–scale–rotation, perspective), blur (motion, Gaussian, defocus, zoom), noise injection, illumination and color shift, and quality degradation—with per-image probabilities set so that each original produced, on average, 0.6 additional stored variants within its own subset. Figure 5 shows the effects of augmentation.
Training accordingly expanded from 3334 to 5338, validation from 953 to 1525, and test from 476 to 762; the total number of images in the dataset reached 7625. Because augmentation operated strictly within an already-partitioned subset, every augmented image inherits its parent’s subset membership. The combined augmentation set was designed to enhance robustness against field-side perturbations including equipment vibration, motion blur, illumination variation, and sensor noise.
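The split-then-augment order described above can be illustrated with a small sketch. It is not the authors' pipeline: the per-variety rounding rule, the random seed, and the placeholder augmentation (which only names a variant after its parent instead of transforming an image) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(items, fractions=(0.7, 0.2), seed=0):
    """Split (image_id, variety) pairs into train/val/test within each variety."""
    by_variety = defaultdict(list)
    for image_id, variety in items:
        by_variety[variety].append(image_id)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for variety in sorted(by_variety):
        ids = by_variety[variety]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]          # remainder, about 10%
    return train, val, test


def augment_within_subset(image_ids, mean_variants=0.6, seed=0):
    """Placeholder augmentation: names a stored variant after its parent image."""
    rng = random.Random(seed)
    return [f"{image_id}__aug" for image_id in image_ids
            if rng.random() < mean_variants]


counts = {"indica": 1542, "japonica": 1658, "wheat": 1563}   # from the text
items = [(f"{variety}_{i:04d}", variety)
         for variety, n in counts.items() for i in range(n)]

train, val, test = stratified_split(items)                    # split first
subsets = {name: ids + augment_within_subset(ids)             # then augment
           for name, ids in (("train", train), ("val", val), ("test", test))}
print(len(train), len(val), len(test))
```

With this particular rounding rule the subset sizes come out as 3334, 953, and 476, matching the reported totals, although the per-crop counts actually used by the authors are not stated. Because every augmented name is derived from a parent that already belongs to one subset, the three subsets stay disjoint by construction.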
To separate the effect of evaluating on augmented test images from intrinsic model capability, all primary evaluation metrics reported in Sections 3.1–3.5 are computed on the 476 original, unaugmented test images. A brief supplementary comparison with the augmented test subset (762 images) is provided in Section 3.1.1 for robustness verification.
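The near-duplicate screen described at the start of this subsection can be sketched as a greedy filter over precomputed 64-bit hashes. The sketch assumes the pHash signatures are already available as integers (how they are computed is outside its scope) and compares each candidate against every retained frame; the authors may have restricted the comparison to temporally adjacent frames.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two 64-bit hash values."""
    return bin(hash_a ^ hash_b).count("1")


def deduplicate(frame_hashes, max_distance=10):
    """Return indices of frames kept by a greedy near-duplicate screen.

    A frame is discarded if its hash lies within `max_distance` bits of any
    frame that has already been retained.
    """
    kept = []
    for idx, h in enumerate(frame_hashes):
        if all(hamming_distance(h, frame_hashes[k]) > max_distance for k in kept):
            kept.append(idx)
    return kept


# Illustrative hashes, not real pHash values.
hashes = [
    0x0000000000000000,
    0x0000000000000007,   # 3 bits from frame 0: near-duplicate
    0x0000000000FFFFFF,   # 24 bits from frame 0: kept
    0x0000000000FFF0FF,   # 4 bits from frame 2: near-duplicate
]
kept_indices = deduplicate(hashes)
print(kept_indices)
```

In this toy input, frames 0 and 2 are retained and frames 1 and 3 are dropped. Comparing against all retained frames costs quadratic time in the worst case, so a windowed comparison would be a natural choice for long recordings.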

2.3. HSSD-YOLO Seed Detection Model

For object detection tasks, YOLOv11 has gained widespread adoption in various real-world tasks thanks to its effective trade-off between recognition precision and computational efficiency [29]. Given that the detection targets in this study are high-speed moving rice and wheat seeds—characterized by rapid motion, small dimensions, and frequent mutual occlusion—YOLOv11 was selected as the baseline framework. With prospective deployment on agricultural embedded platforms in mind, the lightweight YOLOv11n variant serves as the basis for this study; however, all experiments reported herein were conducted on the RTX 4090 D desktop GPU, and embedded-platform performance (including latency, thermal behavior, power draw, and INT8/FP16 post-quantization accuracy) has not yet been evaluated. This study develops HSSD-YOLO (High-Speed Seed Detection YOLO) built upon the YOLOv11n framework. Figure 6 depicts the overall architecture of HSSD-YOLO. The principal modifications comprise three elements:
(1)
Replacing the initial two standard convolutional layers within the network’s backbone with a Motion Blur Enhanced Stem module (MBE-Stem) that explicitly models image gradient information through learnable Sobel directional gradient operators.
(2)
Reconstructing the backbone network with an Attention-enhanced Deformable Convolutional Network (ADCN) that employs the Residual Spatial-Channel Attention mechanism (RSCA) to enhance adaptive sampling precision for irregularly shaped seeds.
(3)
Constructing an Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN) that explicitly injects edge prior information into multi-scale feature fusion.

2.3.1. Motion Blur Enhanced Stem Module (MBE-Stem)

Under high-speed operating conditions with total seed supply rates up to 400 g/s, motion blur inevitably causes seed edge contours to become indistinct. The standard convolutional layers in the YOLOv11 backbone, owing to their fixed kernel geometry, have difficulty capturing these degraded edge features.
To mitigate this problem, we introduce the Motion Blur Enhanced Stem module (MBE-Stem) as a replacement for the first two standard convolutional layers in the YOLOv11 backbone. MBE-Stem explicitly captures image gradient cues via a multi-branch parallel topology that concurrently extracts directional edge representations and spatial contextual features, thereby reinforcing the network’s perceptual capacity for seed targets degraded by motion blur.
Figure 7 illustrates the detailed structure of MBE-Stem. Let the input tensor be $F_{in} \in \mathbb{R}^{3 \times H \times W}$, where 3 is the channel count and $H$ and $W$ are the spatial dimensions. An initial 3 × 3 convolution with stride 2 performs both channel expansion and spatial down-sampling, producing intermediate features $F_1$ with $C$ channels. These features are subsequently routed into three concurrent branches: an edge magnitude branch, a vertical gradient branch, and a spatial preservation branch.
The edge magnitude path extracts boundary information via learnable Sobel operators. Horizontal and vertical convolutional kernels $G_x^{s}$ and $G_y^{s}$, initialized as standard Sobel kernels, perform grouped convolutions on $F_1$ to obtain directional gradients $D_x$ and $D_y$, respectively, which are fused into an edge magnitude map:
$$F_{mag} = \mathrm{BN}\left(\sqrt{D_x^{2} + D_y^{2} + \epsilon}\right) \tag{4}$$
where $\mathrm{BN}$ denotes batch normalization and $\epsilon = 10^{-6}$ is a numerical stability constant. Unlike traditional fixed Sobel operators or Canny operators [30], $G_x^{s}$ and $G_y^{s}$ are trainable parameters that can adaptively learn the blur-specific gradient patterns of high-speed seeding scenarios.
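To make the edge magnitude computation concrete, the sketch below evaluates the gradient magnitude on a tiny single-channel image using the standard Sobel kernels, which the text gives as the initial values of the learnable kernels. It is a didactic, pure-Python illustration under stated assumptions: batch normalization and training are omitted, zero padding is assumed at the borders, and the stability constant 1e-6 is an assumed value passed as a parameter.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical change


def correlate3x3(image, kernel):
    """3x3 cross-correlation with zero padding (output keeps the input size)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc
    return out


def edge_magnitude(image, kx=SOBEL_X, ky=SOBEL_Y, eps=1e-6):
    """sqrt(Dx^2 + Dy^2 + eps) per pixel, without batch normalization."""
    dx = correlate3x3(image, kx)
    dy = correlate3x3(image, ky)
    return [[math.sqrt(gx * gx + gy * gy + eps) for gx, gy in zip(rx, ry)]
            for rx, ry in zip(dx, dy)]


# Vertical step edge: dark left half, bright right half.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
mag = edge_magnitude(image)
```

In this toy image the response is close to 4 next to the step and close to zero in the flat region on the left. The small constant inside the square root keeps the gradient of the magnitude finite where both directional responses vanish, which matters once the kernels are trained.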
The vertical gradient path independently preserves the vertical directional gradient response $F_{vert}$ by applying batch normalization to the vertical gradient component $D_y$, which exhibits stronger sensitivity to the longitudinal blur produced by seeds during rapid motion. The spatial preservation path retains spatial geometric structure $F_{spatial}$ through zero-padding and max pooling (kernel size $k$ = 2, stride $s$ = 1), suppressing noise while preserving salient spatial distribution information of seed targets.
To integrate the three-path features, MBE-Stem employs an adaptive channel attention fusion mechanism. The three-path outputs are first concatenated along the channel dimension to obtain the combined feature map $F_{cat}$. Then, Global Average Pooling (GAP) generates channel-level statistics, and a two-layer fully connected network with a bottleneck reduction ratio of 8 learns inter-channel dependencies:
$$A_{se} = \sigma\left(\Theta_2 \cdot \mathrm{ReLU}\left(\Theta_1 \cdot \mathrm{GAP}(F_{cat})\right)\right) \tag{5}$$
where $\Theta_1: \mathbb{R}^{3C} \to \mathbb{R}^{3C/8}$ and $\Theta_2: \mathbb{R}^{3C/8} \to \mathbb{R}^{3C}$ are weight matrices, and $\sigma$ is the Sigmoid function. The generated attention weights $A_{se}$ are applied element-wise to the concatenated $F_{cat}$ to obtain the recalibrated features $F_{weighted}$, dynamically adjusting channel importance. The final output $F_{stem}$ is obtained through two convolutional layers for dimensionality reduction and spatial downsampling, completing 4× total downsampling from input to output.
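The channel attention step can be sketched in a few lines. The example below is a pure-Python illustration with a toy configuration (8 concatenated channels, so the reduction ratio of 8 leaves one hidden unit) and hand-picked weights; it shows the data flow of the attention formula rather than the trained module, and all names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def channel_attention(channels, theta1, theta2):
    """One weight per channel: sigmoid(theta2 @ relu(theta1 @ GAP(channels)))."""
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in channels]
    hidden = [max(0.0, z) for z in matvec(theta1, gap)]       # ReLU
    return [sigmoid(z) for z in matvec(theta2, hidden)]


def recalibrate(channels, weights):
    """Scale every value of channel k by weights[k]."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(channels, weights)]


# Toy case: 3C = 8 channels of size 2 x 2, channel k filled with the value k.
channels = [[[float(k)] * 2 for _ in range(2)] for k in range(8)]
theta1 = [[0.1] * 8]                       # 1 x 8, illustrative values
theta2 = [[0.5]] * 4 + [[-0.5]] * 4        # 8 x 1, illustrative values
weights = channel_attention(channels, theta1, theta2)
f_weighted = recalibrate(channels, weights)
```

Each weight lies strictly between 0 and 1, so the recalibration can only attenuate a channel relative to its input, with the relative attenuation learned from the pooled statistics.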
This design offers three principal advantages:
(1)
Learnable directional gradient detectors explicitly preserve seed contour features under motion blur, overcoming the insufficient implicit learning in standard convolutions.
(2)
The multi-path parallel architecture captures three complementary feature types for comprehensive representation.
(3)
The channel attention mechanism dynamically adjusts feature weights according to input characteristics.

2.3.2. Attention-Enhanced Deformable Convolutional Network (ADCN)

Standard convolution kernels in the YOLOv11 backbone are constrained to fixed rectangular sampling grids, limiting their ability to adaptively capture the irregular morphology of seeds. Building upon the deformable convolution framework [31], DCNv2 [32] advances adaptive sampling through jointly learned offset fields and modulation masks; however, the prediction of these offsets and masks still relies on plain convolutional layers that lack global context awareness, leading to suboptimal sampling point localization.
The above limitation motivates our redesign of the backbone around an Attention-Enhanced Deformable Convolutional Network (ADCN). The core innovation lies in a Residual Spatial-Channel Attention (RSCA) mechanism that enhances the offset-mask prediction process of deformable convolutions through a decoupled three-path architecture, simultaneously modeling dependencies across height, width, and channel dimensions. The structures of ADCN and RSCA are illustrated in Figure 8.
For an input feature map $X_{in} \in \mathbb{R}^{C_{in} \times H \times W}$, a conventional convolution first produces preliminary offset-mask features $X_{offset} \in \mathbb{R}^{C \times H \times W}$, with $C = 3GK^{2}$, where $G$ denotes the deformable group count and $K$ the kernel size. RSCA extracts complementary multi-dimensional features from $X_{offset}$ through three adaptive pooling operations:
$$T_h = \mathrm{AvgPool}_h(X_{offset}) \tag{6}$$
$$T_w = \mathrm{Permute}\left(\mathrm{AvgPool}_w(X_{offset})\right) \tag{7}$$
$$T_c = \mathrm{Conv}_{1 \times 1}\left(\mathrm{GAP}(X_{offset})\right) \tag{8}$$
Here, $\mathrm{AvgPool}_h$ and $\mathrm{AvgPool}_w$ perform height- and width-directional adaptive mean pooling separately, preserving the corresponding spatial range while compressing the orthogonal dimension. GAP reduces the feature map to a channel-level global descriptor, which is then refined by a 1 × 1 convolution; the Permute operation reorders $T_w$ so that its spatial layout matches that of $T_h$.
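The three pooling branches can be illustrated on a small tensor stored as nested lists in channel, height, width order. This is a minimal sketch under stated assumptions: the Permute step is represented implicitly by storing the width descriptor as a C × W list, and the 1 × 1 convolution on the pooled vector is modeled as a plain channel-mixing matrix with identity weights chosen for illustration.

```python
def pool_height(x):
    """C x H x W -> C x H: mean over the width axis (keeps the height range)."""
    return [[sum(row) / len(row) for row in ch] for ch in x]


def pool_width(x):
    """C x H x W -> C x W: mean over the height axis (keeps the width range)."""
    return [[sum(col) / len(col) for col in zip(*ch)] for ch in x]


def global_avg_pool(x):
    """C x H x W -> C: one global descriptor per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def conv1x1(vector, weight):
    """1x1 convolution on a pooled C-vector, i.e. a channel-mixing matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


# Toy tensor with C = 2, H = 2, W = 3.
x_offset = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]        # illustrative 1x1 conv weights

t_h = pool_height(x_offset)                # [[2.0, 5.0], [0.0, 6.0]]
t_w = pool_width(x_offset)                 # [[2.5, 3.5, 4.5], [3.0, 3.0, 3.0]]
t_c = conv1x1(global_avg_pool(x_offset), identity)   # [3.5, 3.0]
```

The second channel shows why the branches are complementary: its width descriptor is flat, while its height descriptor separates the empty top row from the bright bottom row.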
To establish cross-dimension spatial correlations, we concatenate $T_h$ and $T_w$ and integrate them via a 3 × 1 convolutional kernel, yielding the combined feature $T_{hw}$. The fused features $T_{hw}$ are then separated back into height and width components $T_h$ and $T_w$ that contain mutual spatial information. Spatial attention weights $\Gamma_{hw}$ are generated from $T_{hw}$ via Sigmoid-activated 1 × 1 convolution and correspondingly split into $\Gamma_h$ and $\Gamma_w$ for subsequent feature modulation. A key distinction of RSCA from approaches such as CBAM [33] and CA [34] is the explicit spatial-channel interaction: a global spatial statistic $\bar{\Gamma}_{hw}$ is derived by averaging the spatial attention weights $\Gamma_{hw}$ along all spatial positions, which subsequently modulates channel features so that channel attention incorporates both GAP results and the spatial attention distribution:
$$\bar{\Gamma}_{hw} = \frac{1}{H + W} \sum_{i=1}^{H+W} \Gamma_{hw}(i) \tag{9}$$
where $\Gamma_{hw}(i)$ denotes the slice of $\Gamma_{hw}$ at spatial position $i$, and $\bar{\Gamma}_{hw}$ aggregates attention information across the entire spatial domain. The spatial attention weights and global spatial statistics are then leveraged to adjust the three-path features through element-wise products: $\Gamma_h$ and $\Gamma_w$ weight the height component $T_h$ and width component $T_w$ respectively to obtain $\hat{T}_h$ and $\hat{T}_w$, while $\bar{\Gamma}_{hw}$ modulates the channel feature $T_c$ to obtain $\hat{T}_c$. This design ensures that channel attention depends not only on GAP but also incorporates the global statistics of spatial attention weights, establishing explicit interaction between spatial and channel dimensions.
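The spatial-to-channel coupling amounts to one average and one product per channel. The sketch below assumes that the statistic is computed separately for each channel over the H + W concatenated positions, which is one plausible reading of the description; the source does not state whether the channel axis is also averaged. Values and names are illustrative.

```python
def global_spatial_statistic(gamma_hw):
    """C x (H + W) spatial attention weights -> one mean value per channel."""
    return [sum(row) / len(row) for row in gamma_hw]


def modulate_channel(t_c, gamma_bar):
    """Scale each channel descriptor by its global spatial statistic."""
    return [t * g for t, g in zip(t_c, gamma_bar)]


# Toy case: C = 2 channels, H = W = 2, so each row holds H + W = 4 weights.
gamma_hw = [
    [0.2, 0.4, 0.6, 0.8],
    [1.0, 1.0, 0.0, 0.0],
]
t_c = [2.0, -4.0]

gamma_bar = global_spatial_statistic(gamma_hw)
t_c_hat = modulate_channel(t_c, gamma_bar)
```

Both channels end up with a statistic of about 0.5 here even though their spatial patterns differ, which shows that this term carries only the overall strength of the spatial attention into the channel branch, not its layout.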
To exploit the complementary nature of these three-branch features, RSCA employs a learnable fusion weight vector $\omega = [\omega_h, \omega_w, \omega_c]^{T} \in \mathbb{R}^{3}$, with one scalar weight assigned to each of the three branches (height, width, channel). This vector is Softmax-normalized into $\tilde{\omega} = [\tilde{\omega}_h, \tilde{\omega}_w, \tilde{\omega}_c]^{T}$ with $\sum_i \tilde{\omega}_i = 1$. Each branch feature is then activated by $\sigma(\cdot)$ and scaled by the corresponding component of $\tilde{\omega}$, producing three branch attention maps, $A_h = \sigma(\hat{T}_h) \cdot \tilde{\omega}_h$, $A_w = \sigma(\mathrm{Permute}(\hat{T}_w)) \cdot \tilde{\omega}_w$, and $A_c = \sigma(\hat{T}_c) \cdot \tilde{\omega}_c$. The three maps are then aggregated via broadcasting element-wise multiplication to form the final attention map:
$$A_{rsca} = A_h \odot A_w \odot A_c \tag{10}$$
where $A_h \in \mathbb{R}^{C \times H \times 1}$, $A_w \in \mathbb{R}^{C \times 1 \times W}$, and $A_c \in \mathbb{R}^{C \times 1 \times 1}$. The enhanced offset-mask features are output through a residual connection with a learnable scaling factor $\alpha_{rsca}$:
$$X_{offset}^{enh} = X_{offset} + \alpha_{rsca} \left(X_{offset} \odot A_{rsca}\right) \tag{11}$$
where $\alpha_{rsca}$ is initialized to 0.2 and adaptively learned through backpropagation, balancing original and attention-enhanced features while preventing excessive modulation.
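The fusion and residual steps can be traced on a minimal example. The sketch below takes already-modulated branch features as nested lists (height branch C × H, width branch C × W, channel branch C), applies the Softmax-weighted sigmoid activations, multiplies them with broadcasting, and adds the scaled residual. It follows the description literally and is illustrative only; the function names are not from the authors' code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    m = max(values)                                   # subtract max for stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rsca_attention(t_h, t_w, t_c, omega):
    """Broadcast product of the three weighted branch maps, shape C x H x W."""
    w_h, w_w, w_c = softmax(omega)
    return [[[sigmoid(t_h[c][i]) * w_h * sigmoid(t_w[c][j]) * w_w
              * sigmoid(t_c[c]) * w_c
              for j in range(len(t_w[c]))]
             for i in range(len(t_h[c]))]
            for c in range(len(t_c))]


def rsca_residual(x_offset, attention, alpha=0.2):
    """x + alpha * (x * A), element-wise, with alpha at its initial value."""
    return [[[x + alpha * x * a for x, a in zip(x_row, a_row)]
             for x_row, a_row in zip(x_ch, a_ch)]
            for x_ch, a_ch in zip(x_offset, attention)]


# Toy case: C = 1, H = W = 2, zero branch features, equal fusion weights.
attention = rsca_attention([[0.0, 0.0]], [[0.0, 0.0]], [0.0], [0.0, 0.0, 0.0])
x_offset = [[[1.0, 2.0], [3.0, 4.0]]]
x_enh = rsca_residual(x_offset, attention)
```

With zero features every sigmoid equals 0.5 and every fusion weight equals 1/3, so each attention value is (0.5/3)³ = 1/216. Read literally, the product of the three Softmax weights can never exceed 1/27, so the attention term acts as a small correction on top of the identity path, consistent with the stated aim of preventing excessive modulation.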
The RSCA-enhanced features $X_{offset}^{enh}$ are decomposed along the channel dimension into three components: horizontal offset $\Delta x$, vertical offset $\Delta y$, and a raw modulation mask $M_{raw}$. The offsets are concatenated to form the complete 2D offset field $\Delta$, and the mask is normalized to the [0, 1] interval via Sigmoid activation to obtain the modulation mask $\tilde{M}$. Based on the enhanced offset field and modulation mask, deformable convolution is performed on the original input features $X_{in}$:
$$F_{dcn}(p_0) = \sum_{p_n \in \mathcal{R}} W(p_n) \cdot X\left(p_0 + p_n + \Delta(p_n)\right) \cdot \tilde{M}(p_n) \tag{12}$$
where $W \in \mathbb{R}^{C_{in} \times C_{out} \times K^{2}}$ are the learnable convolution weights, $X$ denotes the input features $X_{in}$, $\mathcal{R}$ is the standard sampling grid, and bilinear interpolation handles sub-pixel sampling at non-integer coordinates. The deformable convolution output is subsequently processed through batch normalization and SiLU activation to obtain the final ADCN output $F_{adcn}$.
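The sampling rule of modulated deformable convolution can be shown for one output position on a single-channel image. This is a didactic sketch, assuming a 3 × 3 kernel, zero values outside the image, and hand-set offsets and masks in place of the predicted ones; it does not include the offset prediction, batch normalization, or activation.

```python
import math


def bilinear_sample(image, y, x):
    """Bilinearly interpolate image at real-valued (y, x); zero outside."""
    h, w = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(y - yy)) * (1.0 - abs(x - xx))
                value += weight * image[yy][xx]
    return value


def deformable_conv_at(image, p0, weights, offsets, mask):
    """Modulated deformable 3x3 convolution at one output position p0 = (row, col).

    weights and mask are 3x3 lists; offsets is a 3x3 list of (dy, dx) pairs.
    """
    total = 0.0
    for i, grid_dy in enumerate((-1, 0, 1)):
        for j, grid_dx in enumerate((-1, 0, 1)):
            dy, dx = offsets[i][j]
            y = p0[0] + grid_dy + dy
            x = p0[1] + grid_dx + dx
            total += weights[i][j] * bilinear_sample(image, y, x) * mask[i][j]
    return total


# Horizontal ramp: pixel value equals its column index.
image = [[0.0, 1.0, 2.0, 3.0] for _ in range(4)]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
ones = [[1.0] * 3 for _ in range(3)]
no_shift = [[(0.0, 0.0)] * 3 for _ in range(3)]
half_right = [[(0.0, 0.5)] * 3 for _ in range(3)]

plain = deformable_conv_at(image, (1, 1), mean_kernel, no_shift, ones)
shifted = deformable_conv_at(image, (1, 1), mean_kernel, half_right, ones)
damped = deformable_conv_at(image, (1, 1), mean_kernel, half_right,
                            [[0.5] * 3 for _ in range(3)])
```

With zero offsets the result is the ordinary 3 × 3 mean (1.0 on this ramp); shifting every sampling point half a pixel to the right raises it to 1.5, and halving the mask scales that to 0.75. The offsets therefore move where the kernel looks, while the mask controls how much each sampled point contributes.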
Compared with standard DCNv2, ADCN injects spatial-channel joint attention and residual learning into the offset generation process, yielding more precise sampling positions that substantially improve adaptability to the irregular morphology of seeds under complex conditions such as deformation and blur.

2.3.3. Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN)

Conventional FPN relies on top-down semantic feature propagation for multi-scale detection [35]. However, in high-speed seeding scenarios, seed edge contours are significantly degraded by motion blur, and standard FPN lacks explicit mechanisms for preserving shallow-layer edge features, reducing detection accuracy for small and overlapping targets. This study proposes the Edge-Guided Adaptive Recalibration Feature Pyramid Network (EGAR-FPN), whose structure is depicted in Figure 6. As detailed in Figure 9, EGAR-FPN includes three core modules: Multi-Scale Edge Guidance (MSEG), Edge-Aware Feature Injection (EAFI), and Bidirectional Adaptive Spatial Recalibration (BASR).
MSEG generates explicit multi-scale edge guidance for feature fusion. Similar to the learnable Sobel operators in MBE-Stem, MSEG employs learnable directional gradient kernels for edge detection, but serves a distinct purpose: while MBE-Stem targets initial feature extraction in the backbone, MSEG specifically generates edge guidance maps for each FPN level. Given the MBE-Stem output feature map $F_{stem} \in \mathbb{R}^{C \times H \times W}$, the base edge response $E_{base}$ is computed as follows:
$$E_{base} = \sqrt{(F_{stem} * K_x)^2 + (F_{stem} * K_y)^2 + \varepsilon}$$
where $K_x$ and $K_y$ are the horizontal and vertical learnable gradient convolution kernels, $*$ denotes convolution, and $\varepsilon$ is a numerical stability constant. Note that $K_x$ and $K_y$ are independent trainable parameters optimized separately from $G_x^{s}$ and $G_y^{s}$ in MBE-Stem. To preserve edge details across spatial scales, MSEG employs progressive downsampling combining max pooling (for preserving salient edge responses) with average pooling (for spatial smoothing):
$$E^{(i)} = \mathrm{AvgPool}_{3 \times 3}\left(\mathrm{MaxPool}_{2 \times 2}\left(E^{(i-1)}\right)\right), \quad i \in \{1, 2, 3\}$$
where $E^{(0)} = E_{base}$; successive downsampling produces $E^{(1)}$, $E^{(2)}$, and $E^{(3)}$, which correspond to the P3, P4, and P5 scales, respectively. Each scale’s edge features then undergo adaptive channel projection via 1 × 1 convolution and depthwise separable convolution to produce the edge guidance feature set $\{\hat{E}_{P3}, \hat{E}_{P4}, \hat{E}_{P5}\}$ aligned with the channel dimensions of each FPN layer.
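The edge response and the pooling cascade can be sketched on a single-channel map as follows. Several details are assumptions, since the text does not fix them: the kernels are set to fixed Sobel values (in MSEG they are learnable), the gradient magnitude is taken under a square root, the 3 × 3 average pool uses stride 1 and averages only over in-bounds neighbours, and all function names are illustrative.

```python
import math

KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal gradient kernel (Sobel values, assumed init)
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical gradient kernel (Sobel values, assumed init)

def conv3x3(img, k):
    """Same-size 3x3 cross-correlation with zero padding, as deep learning frameworks compute it."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += k[ky][kx] * img[yy][xx]
    return out

def edge_base(img, eps=1e-6):
    """Gradient magnitude with a stability constant eps under the square root."""
    gx, gy = conv3x3(img, KX), conv3x3(img, KY)
    return [[math.sqrt(a * a + b * b + eps) for a, b in zip(ra, rb)]
            for ra, rb in zip(gx, gy)]

def max_pool2(img):
    """2x2 max pooling with stride 2: halves each spatial dimension."""
    return [[max(img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1])
             for x in range(0, len(img[0]) - 1, 2)]
            for y in range(0, len(img) - 1, 2)]

def avg_pool3(img):
    """3x3 average pooling with stride 1 over in-bounds neighbours: size is unchanged."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

def edge_pyramid(img, levels=3):
    """Return [E1, E2, E3], each obtained as AvgPool3x3(MaxPool2x2(previous level))."""
    e = edge_base(img)
    maps = []
    for _ in range(levels):
        e = avg_pool3(max_pool2(e))
        maps.append(e)
    return maps

# Toy input: a vertical step edge in a 16x16 map.
step = [[0.0] * 8 + [1.0] * 8 for _ in range(16)]
e1, e2, e3 = edge_pyramid(step)
```

Max pooling first keeps the strongest edge response in each 2 × 2 cell before the average pool smooths it, which is the ordering the equation above specifies.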
EAFI adaptively injects edge features into FPN features through a dual-path architecture. Given the FPN feature $F_{fpn} \in \mathbb{R}^{C_{fpn} \times H \times W}$ and the corresponding edge guidance feature $\hat{E}_{Pi} \in \mathbb{R}^{C_{edge} \times H \times W}$, the edge path employs depthwise separable convolution for spatial edge pattern extraction, while the content path employs standard convolution for semantic representation. Both paths project their inputs to $C/2$ channels. An adaptive fusion weight generation module concatenates $P_{edge}$ and $P_{content}$ along the channel dimension, then determines the fusion balance through GAP and a multi-layer perceptron (MLP) with Softmax normalization:
$$[w_e, w_c] = \mathrm{Softmax}\left(\mathrm{MLP}\left(\mathrm{GAP}\left(\mathrm{Concat}(P_{edge}, P_{content})\right)\right)\right)$$
where the Softmax normalization ensures $w_e + w_c = 1$, establishing a competitive constraint between edge preservation and semantic representation. The two path features are weighted by $w_e$ and $w_c$, respectively, to obtain the blended features $F_{blend}$, which are then concatenated to restore the complete channel dimension and passed through multi-scale convolutions combining a 3 × 3 convolution and an atrous convolution (dilation rate $d = 2$) to obtain the multi-scale fused features $F_{msf}$. The final EAFI output is obtained through a residual connection with a learnable scaling factor $\alpha_{eafi}$:
$$F_{EAFI} = F_{fpn} + \alpha_{eafi} \cdot \mathrm{Conv}_{1 \times 1}(F_{msf})$$
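The competitive weighting and the scaled residual can be made concrete with a small sketch on per-channel vectors. It assumes the two MLP logits are already available and that the 1 × 1 projection of the fused features has been applied; the function names and the numeric values are hypothetical and serve only to show the arithmetic.

```python
import math

def softmax2(logit_e, logit_c):
    """Two-way Softmax, shifted by the maximum for numerical stability."""
    m = max(logit_e, logit_c)
    e, c = math.exp(logit_e - m), math.exp(logit_c - m)
    return e / (e + c), c / (e + c)

def eafi_blend(p_edge, p_content, logit_e, logit_c):
    """Weight each path by its Softmax weight, then concatenate along channels."""
    w_e, w_c = softmax2(logit_e, logit_c)
    return [w_e * v for v in p_edge] + [w_c * v for v in p_content]

def eafi_output(f_fpn, f_msf_projected, alpha):
    """Residual connection with a learnable scale: F_fpn + alpha * Conv1x1(F_msf)."""
    return [f + alpha * g for f, g in zip(f_fpn, f_msf_projected)]

# Hypothetical values: two channels per path, logits favouring the edge path.
w_e, w_c = softmax2(1.0, 0.0)
blend = eafi_blend([1.0, 2.0], [3.0, 4.0], 1.0, 0.0)
out = eafi_output([0.5, 0.5, 0.5, 0.5], blend, alpha=0.1)
```

Because the two weights sum to one, raising the edge weight necessarily lowers the content weight, and a scale factor near zero leaves the original FPN feature almost unchanged.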
BASR addresses the limitation of unidirectional top-down propagation in conventional FPN through a spatial-aware gating mechanism for bidirectional information flow. Given the low-level feature $F_{lo}$ and the high-level feature $F_{hi}$ at different resolutions, both are initially compressed to $C/2$ channels using 1 × 1 convolutional layers, yielding the projected features $\tilde{F}_{lo}$ and $\tilde{F}_{hi}$. Gate signals $G_{lo}$ and $G_{hi}$ are then generated from the projected features through Sigmoid activation, controlling the bidirectional information flow. In parallel, spatial attention maps are derived by applying channel-wise max and mean pooling, followed by a 7 × 7 convolution that encodes long-range spatial context:
$$S_{lo} = \sigma\left(\mathrm{Conv}_{7 \times 7}\left(\left[\mathrm{MaxPool}_c(\tilde{F}_{lo}),\ \mathrm{AvgPool}_c(\tilde{F}_{lo})\right]\right)\right)$$
where $\mathrm{MaxPool}_c$ and $\mathrm{AvgPool}_c$ denote max and average pooling computed along the channel axis, respectively; the spatial attention map $S_{hi}$ for the high-level features is computed analogously. The gate signals and spatial attention maps jointly guide bidirectional feature recalibration. After processing through feature enhancement layers, the low-level features simultaneously receive self-enhanced information weighted by $G_{lo}$ and cross-scale information from high-level features weighted by $1 - G_{lo}$:
$$\hat{F}_{lo} = \tilde{F}_{lo} + G_{lo} \odot \phi_1\left(\tilde{F}_{lo} \odot S_{lo}\right) + (1 - G_{lo}) \odot \mathrm{Up}\left(G_{hi} \odot \phi_2\left(\tilde{F}_{hi} \odot S_{hi}\right)\right)$$
where $\mathrm{Up}$ represents bilinear interpolation-based upsampling that matches the high-level feature maps to the low-level spatial dimensions, and $\odot$ denotes element-wise multiplication. The gate signal $G_{lo}$ balances self-enhancement (second term) against cross-scale information transfer (third term), enabling the network to dynamically adjust the information flow allocation according to feature content. $\hat{F}_{lo}$ is then concatenated with the upsampled high-level features, processed through multi-scale convolution fusion, and refined through a channel attention mechanism based on GAP and a Sigmoid-activated MLP, which recalibrates channel-level responses by emphasizing informative channels while suppressing redundant ones.
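One reading of the gated recalibration rule can be sketched on one-dimensional features. The simplifications are deliberate and are assumptions of this sketch: the enhancement layers default to identity functions, upsampling is nearest-neighbour repetition standing in for the bilinear interpolation used by BASR, and the gates and attention maps are supplied as plain lists rather than computed from the features.

```python
def upsample_nearest(values, factor=2):
    """Repeat each element `factor` times (stand-in for bilinear upsampling)."""
    return [v for v in values for _ in range(factor)]

def basr_low_update(f_lo, s_lo, g_lo, f_hi, s_hi, g_hi,
                    phi1=lambda v: v, phi2=lambda v: v):
    """Low-level update: residual + gated self term + complementary cross-scale term."""
    self_term = phi1([f * s for f, s in zip(f_lo, s_lo)])
    hi_term = phi2([f * s for f, s in zip(f_hi, s_hi)])
    cross_term = upsample_nearest([g * t for g, t in zip(g_hi, hi_term)])
    return [f + g * a + (1.0 - g) * c
            for f, g, a, c in zip(f_lo, g_lo, self_term, cross_term)]

# Hypothetical toy features: four low-level positions, two high-level positions.
f_lo, s_lo = [1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]
f_hi, s_hi, g_hi = [10.0, 20.0], [1.0, 1.0], [0.5, 0.5]

only_self = basr_low_update(f_lo, s_lo, [1.0] * 4, f_hi, s_hi, g_hi)
only_cross = basr_low_update(f_lo, s_lo, [0.0] * 4, f_hi, s_hi, g_hi)
```

The two extreme gate settings show the trade-off directly: a gate of one passes only the self-enhanced term, and a gate of zero passes only the upsampled high-level term.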
EGAR-FPN integrates these three modules into the standard PAN-FPN [36] architecture. In the top-down path, BASR modules replace conventional element-wise addition for adaptive feature alignment between adjacent scales (P5 → P4, P4 → P3), with EAFI injecting edge guidance at each fusion node. In the bottom-up path, BASR similarly performs bidirectional recalibration with edge injection at the P5 level. Through this design, EGAR-FPN explicitly models edge information, effectively mitigating edge feature loss under motion blur and enhancing detection accuracy for small and densely overlapping seed targets.

2.4. Experimental Environment and Hyperparameter Configuration

All experiments were executed on a single deep learning platform with fixed hardware and software specifications. The hardware comprises 90.0 GB RAM, an AMD EPYC 7K62 48-Core Processor CPU, and an NVIDIA GeForce RTX 4090 D GPU. The software environment is Linux with Python 3.10.16, PyTorch 2.2.2, and CUDA 12.1 as the parallel computing platform. The training hyperparameters adopted in this study are listed in Table 1. To ensure an impartial and consistent assessment, all comparative and ablation experiments were carried out under unified training settings, including the same dataset partitioning, total number of training epochs, and learning rate schedule. The optimizer type and all core hyperparameters were likewise identical across experiments, so that measured performance differences can be attributed to the HSSD-YOLO design rather than to differences in training configuration.
The model configuration files, dataset split lists, sample annotations, and output scripts are publicly available at https://github.com/victor-agriculture/HSSD-YOLO (accessed on 1 May 2026).

2.5. Evaluation Metrics

For the comprehensive performance evaluation of the network, this study employs the following measures: Precision (P), Recall (R), mean Average Precision (mAP), Parameters, GFLOPs (Giga Floating-Point Operations), and inference speed, measured in frames per second (FPS). All metrics are computed on the 476 original, unaugmented test images to ensure an unbiased assessment grounded exclusively on real, unprocessed data. Precision and Recall are defined as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where TP (True Positives) represents the number of actual seed samples correctly detected as seeds by the model; FP (False Positives) represents the number of non-seed samples incorrectly detected as seeds; and FN (False Negatives) represents the number of actual seed samples that the model failed to detect.
Average Precision (AP) and mean Average Precision (mAP) are the core evaluation metrics for object detection tasks:
$$AP = \int_0^1 P(R)\, dR$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where $N$ denotes the number of detection categories. Although images were collected from three crop types (indica rice, japonica rice, and wheat), the detection task is formulated with a single category (“seed”, $N = 1$), because the seed type is always pre-specified by the operator when configuring the seeder and species-level classification is therefore unnecessary for seeding quality monitoring. The category labels applied in Figure 4 serve solely for dataset management and traceability. With $N = 1$, mAP is equivalent to AP; the notation mAP is retained for consistency with standard evaluation conventions in the object detection field. mAP@0.5 is the mean average precision at a single IoU threshold of 0.5; mAP@0.5–0.95 is the mean of the mAP values computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05, giving a fuller assessment of detection quality under increasingly strict localization requirements.
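The definitions above translate into a few lines of code. The sketch below computes Precision and Recall from counts, approximates the AP integral with the common all-point precision envelope, and averages AP over the ten IoU thresholds. The interpolation scheme is an assumption, since the text does not state which one was used, and the counts and curve points are made-up illustrative values.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN); empty denominators yield 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the precision-recall curve using the monotone precision envelope.

    `recalls` must be sorted in increasing order, with matching `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):      # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@0.5-0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap_per_threshold):
    """Mean of the AP values obtained at each of the ten IoU thresholds."""
    return sum(ap_per_threshold) / len(ap_per_threshold)

p, r = precision_recall(tp=90, fp=10, fn=30)              # illustrative counts
ap = average_precision([0.5, 1.0], [1.0, 0.5])            # illustrative curve
```

For a single category, the value returned by `average_precision` at IoU 0.5 is what the paper reports as mAP@0.5.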
Model complexity is measured by the number of parameters (in millions, M), which determines model storage requirements and memory usage, and by GFLOPs, the number of floating-point operations (in billions) required for one forward inference pass.

3. Results

To systematically evaluate the HSSD-YOLO model’s effectiveness, lightweight properties, and practical applicability in high-speed seed detection scenarios, we adopt a layered experimental design for this study. The study progresses from comprehensive performance assessment to single-factor validation, consisting of four components:
(1) Mainstream model detection performance comparison to evaluate the HSSD-YOLO model’s competitiveness against existing detection algorithms, including multiple lightweight versions of the YOLO series (YOLOv5 [37], YOLOv7 [38], YOLOv8 [39], YOLOv9 [40], YOLOv10 [41]), a Transformer-based detector (RT-DETR [42]), and a conventional two-stage detector (Faster R-CNN [43]), evaluated across detection accuracy, model complexity, and qualitative visual analysis.
(2) Ablation studies to examine the independent and joint contributions of the three designed modules (MBE-Stem, ADCN (RSCA), and EGAR-FPN) through progressive module addition and pairwise combination experiments, supplemented by Grad-CAM [44] visualization analysis.
(3) Attention mechanism performance comparison to validate RSCA’s superiority over mainstream alternatives (including SE [45], CBAM, CA, and ECA [46]) within the ADCN offset-mask prediction branch, justifying the proposed attention design.
(4) Comparison of multiple feature pyramid network structures to confirm the effectiveness of EGAR-FPN against mainstream FPN variants (including PAN-FPN (YOLOv11 default), BiFPN [47], ASFF [48], AFPN [49], and Gold-YOLO Neck [50]), demonstrating the advantages of edge-guided feature fusion.
All comparative experiments were conducted on the unified experimental platform described in Section 2.4, with the same dataset division, number of training epochs, and learning rate schedule, to ensure objectivity and reliability of results. YOLOv11n serves as the baseline network for all experiments in this work. All compared models (YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, RT-DETR-l, Faster R-CNN) were trained under unified conditions using the augmented partitions of Section 2.2.2 (5338 training/1525 validation images) and evaluated on the 476 original, unaugmented test images, with the same number of epochs (300), the same optimizer (SGD, momentum 0.937, weight decay 5 × 10⁻⁴), the same initial learning rate (0.01) with cosine annealing, the same input resolution (640 × 640), the same batch size (32), the same hardware and software environment (RTX 4090 D, PyTorch 2.2.2, CUDA 12.1), and the same evaluation-time preprocessing. Architecture-specific components that cannot be unified without breaking the architecture itself (RT-DETR’s default internal augmentations, Faster R-CNN’s region-proposal-network configuration, and backbone-specific pretraining) were kept at each model’s originally published settings, because forcing these to match would disadvantage the baselines rather than improve fairness. Under this protocol, accuracy differences reported in Section 3.1 primarily reflect architectural choices rather than training-recipe disparity.
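The learning rate schedule named above can be written down as a short function. The initial rate (0.01) and the 300 epochs come from the protocol; the final learning rate `lr_min` and the absence of a warm-up phase are assumptions of this sketch, because the text does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=300, lr0=0.01, lr_min=1e-4):
    """Cosine annealing from lr0 at epoch 0 down to lr_min at total_epochs.

    lr_min is a hypothetical floor chosen for illustration only.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cos_term

schedule = [cosine_lr(e) for e in range(301)]
```

The rate starts at 0.01, passes through the midpoint between the two bounds at epoch 150, and ends at the assumed floor, decaying slowly at both ends and fastest in the middle of training.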

3.1. Comparison with Existing Models

To fully assess the detection capability of HSSD-YOLO, this research conducted systematic comparisons with multiple mainstream object detection networks on the same rice/wheat seed dataset. The compared models cover multiple lightweight versions of the YOLO series (YOLOv5, YOLOv7, YOLOv8, YOLOv9, YOLOv10), a Transformer-based detector (RT-DETR), and a conventional two-stage detector (Faster R-CNN) for comprehensive cross-comparison. YOLOv11n is used as the baseline framework in our experimental setup.

3.1.1. Detection Accuracy Comparison

Table 2 reports metrics evaluated on the 476 original, unaugmented test images. As a supplementary robustness check, HSSD-YOLO was also evaluated on the augmented test set (762 images), yielding mAP@0.5 of 96.8% and mAP@0.5–0.95 of 77.5%, deviating by only 0.2 and 0.1 percentage points from the results reported in Table 2. Among all compared methods, HSSD-YOLO attains the top ranking on every metric—94.4% Precision, 93.7% Recall, 96.6% mAP@0.5, alongside a mAP@0.5–0.95 score of 77.4%.
In comparison with the baseline YOLOv11n, our model leads to improvements of 3.4 percentage points in Precision, 3.5 percentage points in Recall, 2.5 percentage points in mAP@0.5, and 5.4 percentage points in mAP@0.5–0.95. The substantial improvement in mAP@0.5–0.95 indicates that HSSD-YOLO maintains high detection and localization accuracy over a range of IoU thresholds, which plays a key role in precise seed localization—a prerequisite for accurate seed counting in seeding rate monitoring applications.
Across the YOLO series evolution, all metrics exhibit steady improvement, reflecting progressive architectural optimization. However, even the latest YOLOv11n and the high-performing RT-DETR-l only attain mAP@0.5–0.95 values of 72.0% and 71.8%, respectively, whereas HSSD-YOLO elevates this metric to 77.4%, demonstrating the practical effect of the improvement mechanisms designed in this work.
Notably, although RT-DETR-l achieves competitive detection accuracy comparable to that of YOLOv11n in Precision (90.8%) and mAP@0.5 (93.4%), as a Transformer-based detector, its parameter count is substantially higher than that of lightweight YOLO models, rendering it unsuitable for deployment on resource-constrained agricultural embedded devices. Faster R-CNN, a classical two-stage detector, yields the lowest Precision (84.7%), Recall (83.4%), and mAP@0.5 (87.3%) among all evaluated models. This outcome is primarily attributable to the limited capacity of its region proposal mechanism to resolve densely distributed small seeds under high-speed motion.
Figure 10 displays a three-dimensional bar chart of the normalized scores of each model on four key performance indicators: Precision, Recall, mAP@0.5, and mAP@0.5–0.95. HSSD-YOLO (highlighted in red) outperforms all competing models on every indicator, with the most pronounced advantage in mAP@0.5–0.95. These results are consistent with a combined contribution of the proposed MBE-Stem, ADCN, and EGAR-FPN modules to detection accuracy in high-speed seeding scenarios; the individual and joint effects are isolated in the ablation study of Section 3.3.

3.1.2. Model Complexity and Efficiency Analysis

In actual agricultural use cases, high detection accuracy must be coupled with efficient real-time inference on resource-limited embedded platforms. Therefore, this study further compares the parameter counts (Params) and floating-point operation counts (GFLOPs) of each model. Table 3 presents representative models for comparison, including lightweight YOLO models (YOLOv5s, YOLOv11n), a Transformer detector (RT-DETR-l), a two-stage model (Faster R-CNN), together with our proposed model (HSSD-YOLO).
As shown in Table 3, HSSD-YOLO has 5.2 M parameters and 11.8 GFLOPs. Compared with the baseline model YOLOv11n (2.6 M parameters, 6.5 GFLOPs), the parameter count approximately doubles and GFLOPs increase by approximately 82%. This increase primarily stems from the multi-branch parallel structure introduced in the MBE-Stem module, the integration of deformable convolutions and RSCA attention mechanisms in the ADCN module, and the additional parameters from edge-guided feature injection and bidirectional spatial recalibration modules in EGAR-FPN.
Despite the increased complexity, HSSD-YOLO remains substantially lighter than the non-lightweight comparison models. Specifically, compared with RT-DETR-l, HSSD-YOLO requires only 16.3% of the parameters (5.2 M vs. 32.0 M) and 12.8% of the computational cost (11.8 vs. 92.2 GFLOPs), while exceeding it by 3.2 percentage points in mAP@0.5 and 5.6 percentage points in mAP@0.5–0.95. Compared with Faster R-CNN, the advantages in parameters and GFLOPs are even more substantial—only 12.5% and 8.8%, respectively—while detection accuracy leads by a considerable margin. Even compared with the larger YOLOv5s (7.2 M parameters, 16.5 GFLOPs), HSSD-YOLO achieves a substantial improvement in mAP@0.5–0.95 from 57.5% to 77.4% with fewer parameters.
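The complexity ratios quoted in this subsection follow from the reported parameter counts and GFLOPs, as the short check below shows. Only figures stated in the text are used; Faster R-CNN is omitted because its absolute values appear only in Table 3. The helper names are illustrative.

```python
def percent_of(value, reference):
    """Express `value` as a percentage of `reference`."""
    return 100.0 * value / reference

def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# (parameters in millions, GFLOPs) as reported in the text.
models = {
    "YOLOv11n": (2.6, 6.5),
    "HSSD-YOLO": (5.2, 11.8),
    "RT-DETR-l": (32.0, 92.2),
}

ours, base, detr = models["HSSD-YOLO"], models["YOLOv11n"], models["RT-DETR-l"]
param_growth = percent_increase(ours[0], base[0])     # growth over the baseline
gflops_growth = percent_increase(ours[1], base[1])
param_share = percent_of(ours[0], detr[0])            # share of RT-DETR-l
gflops_share = percent_of(ours[1], detr[1])
```

The computed values are a 100% increase in parameters and roughly 81.5% in GFLOPs over YOLOv11n, and about 16.25% of the parameters and 12.8% of the GFLOPs of RT-DETR-l, in line with the rounded figures in the text.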
In summary, HSSD-YOLO secures accuracy improvements of 2.5 points in mAP@0.5 and 5.4 points in mAP@0.5–0.95 at a moderate parameter overhead (2.6 M → 5.2 M), while sustaining an inference throughput of 85.1 FPS on the RTX 4090 D reference platform. This exceeds the conventional 30 FPS threshold for real-time detection on that platform; however, it remains below the maximum camera acquisition rate of 526.5 FPS. In a practical deployment, not every acquired frame needs to be processed: frame-skipping strategies can be adopted without compromising detection coverage, because consecutive frames at 526.5 FPS exhibit substantial temporal redundancy. The optimal acquisition-to-inference frame-rate ratio and its effect on downstream counting accuracy remain to be determined through dedicated experiments. Embedded-platform throughput has not been evaluated and warrants dedicated investigation.
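The gap between acquisition rate and inference throughput fixes a lower bound on the frame-skipping stride. The sketch below computes the smallest integer stride at which the detector keeps up with the camera, using the two rates reported above. It addresses throughput only; whether a given stride preserves counting accuracy is exactly the open question noted in the text. The function names are illustrative.

```python
import math

def min_frame_stride(acquisition_fps, inference_fps):
    """Smallest integer k such that processing every k-th frame fits the inference budget."""
    return math.ceil(acquisition_fps / inference_fps)

def processed_rate(acquisition_fps, stride):
    """Frames per second actually sent to the detector when every stride-th frame is kept."""
    return acquisition_fps / stride

stride = min_frame_stride(526.5, 85.1)
rate = processed_rate(526.5, stride)
```

With the reported figures the stride is 7, giving about 75 processed frames per second on the reference platform, slightly under the measured 85.1 FPS capacity.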

3.1.3. Qualitative Detection Results Analysis

To further visually demonstrate the detection effectiveness of HSSD-YOLO, this study selected three typical scenarios (high-speed motion images of indica rice, japonica rice, and wheat seeds) and performed visual comparisons of detection results from four models: HSSD-YOLO, YOLOv11n (baseline), YOLOv5s, and Faster R-CNN, as displayed in Figure 11. In the figure, green bounding boxes indicate correct detections (True Positive), blue bounding boxes indicate false positives (False Positive), and red bounding boxes indicate missed detections (False Negative, i.e., targets annotated as Ground Truth but not detected). To demonstrate detection performance under high-density operating conditions, the representative images were selected at per-tube supply rates approximating or marginally exceeding the maximum per-tube supply rates derived in Section 2.1: indica rice at 22.1 g/s, japonica rice at 34.3 g/s, and wheat at 26.6 g/s. The minor deviations from the steady-state maxima reflect instantaneous temporal fluctuations within individual tubes, representing near-peak seed flow density under high-speed operating conditions.
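The colour coding described above presupposes a rule for pairing predictions with ground-truth boxes. A common choice, assumed here because the text does not describe the matching procedure, is greedy one-to-one matching in descending confidence order at an IoU threshold of 0.5. Boxes are `(x1, y1, x2, y2)` tuples, and the example coordinates are invented.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def classify_detections(preds, gts, thr=0.5):
    """Split results into TP boxes, FP boxes, and FN ground-truth boxes.

    preds: list of (score, box); each ground-truth box can be matched at most once.
    """
    unmatched = list(range(len(gts)))
    tp, fp = [], []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best = max(unmatched, key=lambda j: iou(box, gts[j]), default=None)
        if best is not None and iou(box, gts[best]) >= thr:
            tp.append(box)            # would be drawn in green
            unmatched.remove(best)
        else:
            fp.append(box)            # would be drawn in blue
    fn = [gts[j] for j in unmatched]  # would be drawn in red
    return tp, fp, fn

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60))]
tp, fp, fn = classify_detections(preds, gts)
```

In this toy case the first prediction overlaps a seed with IoU 0.81 and counts as a true positive, the second hits background and is a false positive, and the second seed is left unmatched as a false negative.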
From Figure 11, the following key observations can be made: (1) In the japonica rice seed scenario, where seeds are short-elliptical and relatively dense with overlap, HSSD-YOLO’s ADCN module’s adaptive sampling mechanism precisely locates the boundaries of partially occluded seeds, effectively reducing missed detection rates in dense scenarios. In contrast, the YOLOv11n baseline model, while outperforming YOLOv5s, still exhibits some missed and false detections in seed overlap regions. (2) In the indica rice seed scenario, where seeds are elongated and prone to significant motion blur during high-speed movement, HSSD-YOLO benefits from MBE-Stem’s explicit modeling of edge gradient information, effectively recovering seed contour features under motion blur conditions and achieving zero missed and false detections. YOLOv5s and Faster R-CNN show considerably more missed detections (red boxes) in this scenario, particularly for seeds with severe edge blur. (3) In the wheat seed scenario, where seed density is highest with multiple seeds closely packed, HSSD-YOLO leverages EGAR-FPN’s edge-guided feature fusion strategy to effectively enhance boundary discrimination between dense small targets, achieving accurate detection in regions where other models all exhibit varying degrees of missed detections. Faster R-CNN shows the most severe false detections in this scenario, further confirming the limitations of conventional two-stage detectors in dense small-target detection tasks.

3.2. Failure-Case Analysis

Examination of the residual errors on the test set reveals three axes that govern error incidence (seed morphology, per-tube loading, and motion-blur severity) and three corresponding failure patterns. Wheat contributes the most missed detections under dense flow; indica rice dominates longitudinal-blur merges; japonica is the least error-prone. Errors rise appreciably only when per-tube loading approaches or exceeds $q_{max}$ (Section 2.1), and only under the heaviest blur does MBE-Stem leave visible residuals. Three recurring patterns mark the operating limits of each module.
The first pattern is cluster-induced missed detection (FN-dominant). When per-tube loading exceeds $q_{max}$, wheat kernels in particular travel as tight triples or quadruples whose silhouettes cannot be separated within the effective receptive field of ADCN. One or two kernels in such a cluster are missed as false negatives while surrounding detections remain correct. This represents the resolution ceiling of deformable sampling guided by RSCA attention.
The second pattern is elongated-target merging (FN-dominant, occasional FP). Under strong longitudinal blur, the gap between two adjacent indica rice kernels is filled by blur energy, causing one elongated box to bracket both. Typically, one kernel registers as a true positive while the other becomes a false negative; occasionally the merged box aligns with neither kernel, producing two false negatives and one false positive. This marks the limit of MBE-Stem: its learnable Sobel operators recover contours under moderate blur, but cannot restore separating boundaries when blur trails bridge inter-kernel spacing.
The third pattern is single-seed splitting (FP-dominant). Under the heaviest blur, intensity variations along the motion direction cause the detector to produce two overlapping boxes on one seed whose IoU falls just below the NMS threshold. One box matches the ground-truth kernel; the other is logged as a false positive. This limit arises because EGAR-FPN’s edge priors at adjacent pyramid levels drift apart under extreme trailing, and BASR cannot realign them sufficiently to suppress the duplicate.
In summary of the error-type mapping: pattern 1 contributes only to the missed-detection count; pattern 2 contributes chiefly to missed detections, with the occasional false positive when the merged box aligns with neither kernel; pattern 3 contributes only to false positives, in the form of duplicate boxes on a single kernel. Representative examples of the three cases are given in Figure 12.
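The third pattern is easy to reproduce with a plain greedy non-maximum suppression routine: two boxes on one elongated seed survive when their mutual IoU stays below the suppression threshold. The threshold of 0.7 and the box coordinates below are hypothetical values chosen for illustration; the paper does not state the NMS setting used.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(dets, thr):
    """Greedy NMS: keep a box unless it overlaps an already kept box by more than thr."""
    keep = []
    for score, box in sorted(dets, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= thr for _, kept_box in keep):
            keep.append((score, box))
    return keep

NMS_THR = 0.7  # hypothetical threshold

# Two boxes along the motion direction of one blurred seed, displaced by 8 px and by 4 px.
split_case = [(0.9, (0, 0, 10, 30)), (0.6, (0, 8, 10, 38))]
merged_case = [(0.9, (0, 0, 10, 30)), (0.6, (0, 4, 10, 34))]

kept_split = nms(split_case, NMS_THR)
kept_merged = nms(merged_case, NMS_THR)
```

With an 8-pixel displacement the IoU is about 0.58, so both boxes survive and one becomes a false positive; with a 4-pixel displacement the IoU rises to about 0.76 and the duplicate is suppressed.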

3.3. Ablation Study

To assess the individual and joint contributions of each enhancement module within HSSD-YOLO, we devised a series of ablation experiments starting from the YOLOv11n baseline and incrementally activating MBE-Stem, ADCN (RSCA), and EGAR-FPN, both individually and in combination. Every experiment shared identical data partitions and training configurations; the outcomes are tabulated in Table 4, where bracketed values denote gains over the baseline.
From the single-module ablation results, the three improvement modules each contribute differently to overall model performance. Introducing the ADCN (RSCA) module alone yields the largest gain (1.4 percentage points in mAP@0.5 and 1.8 percentage points in mAP@0.5–0.95), indicating that the adaptive sampling strategy based on deformable convolution and RSCA attention plays a key role in improving target localization accuracy. The EGAR-FPN module yields a 1.3-point improvement in mAP@0.5–0.95, second only to ADCN, suggesting that the edge-guided adaptive feature fusion strategy enhances detection robustness across varying IoU thresholds. The MBE-Stem module improves mAP@0.5–0.95 by 0.8 percentage points, supporting its capability to recover seed contour information under motion blur through multi-branch edge gradient modeling.
From the pairwise combination results, the modules exhibit clear complementary effects. The mAP@0.5–0.95 gains of the three dual-module combinations (+3.1, +2.9, +3.7 percentage points) all exceed the arithmetic sum of the corresponding single-module gains, with the combination of ADCN and EGAR-FPN performing best (mAP@0.5–0.95 of 75.7%, +3.7 percentage points), reflecting the strong synergy between adaptive sampling and edge-guided feature fusion in multi-scale detection. When all three modules are integrated (HSSD-YOLO), mAP@0.5 reaches 96.6% (+2.5) and mAP@0.5–0.95 reaches 77.4% (+5.4), exceeding the gains of every single- or dual-module configuration. The 5.4-point gain in mAP@0.5–0.95 is especially notable because this metric averages over IoU thresholds from 0.5 to 0.95, indicating that HSSD-YOLO improves both detection and localization performance in high-speed seeding scenarios. A plausible explanation for this synergy is that the shallow-level edge gradient features extracted by MBE-Stem provide more precise sampling references for ADCN, while EGAR-FPN further integrates edge information into multi-scale feature maps, forming a complete enhancement chain from feature extraction to feature fusion.
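The super-additivity claim can be checked directly from the gains quoted in the text. Only the ADCN and EGAR-FPN pairing is explicitly tied to its value (+3.7); the assignment of +3.1 and +2.9 to the other two pairs is not stated here, so the check below compares the smaller of those two gains against the larger of the two remaining additive sums, which holds under either assignment.

```python
# Single-module mAP@0.5-0.95 gains in percentage points, as quoted in the text.
single = {"MBE-Stem": 0.8, "ADCN": 1.8, "EGAR-FPN": 1.3}

def additive(*modules):
    """Gain expected if module contributions simply added up."""
    return sum(single[m] for m in modules)

# The pairing explicitly identified in the text.
adcn_egar_surplus = 3.7 - additive("ADCN", "EGAR-FPN")

# The two pairs that include MBE-Stem; the +3.1 / +2.9 assignment is left open.
other_reported = [3.1, 2.9]
other_additive = [additive("MBE-Stem", "ADCN"), additive("MBE-Stem", "EGAR-FPN")]
others_super_additive = min(other_reported) > max(other_additive)

# Full model versus the sum of all three single-module gains.
full_surplus = 5.4 - additive("MBE-Stem", "ADCN", "EGAR-FPN")
```

The surplus is about 0.6 points for the ADCN and EGAR-FPN pair and about 1.5 points for the full model, which is the quantitative content of the statement that the gains exceed linear superposition.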
To more intuitively elucidate the influence of each module on model attention distribution, we carried out experiments under high-density conditions. Specifically, we selected images with relatively dense seed distributions from those with per-tube supply rates slightly above the $q_{max}$ benchmark derived in Section 2.1 (indica rice: 23.1 g/s, japonica rice: 34.9 g/s, wheat: 27.6 g/s), representing frames in which the temporal fluctuations of individual tubes further elevated the supply rate beyond the steady-state inter-tube maxima. This setup serves as a deliberate stress-test for the feature extraction capability of each module under extreme density conditions.
Grad-CAM heatmaps are presented in Figure 13.
The progressive changes in heatmaps clearly illustrate the gradual contribution of each module: In the Baseline, attention activation shows incompleteness across all three seed scenarios with weak warm-color responses for some seeds. After introducing MBE-Stem, activation coverage expands with new warm-color response points appearing at previously unactivated seed edge regions. Further stacking of ADCN shows warm activation regions becoming more concentrated and focused, with notably improved attention separation between adjacent seeds in dense arrangements. When all three modules are fully integrated (HSSD-YOLO), the heatmaps display optimal attention distribution—virtually all seed targets are accurately covered by warm-color regions with concentrated, sharp activation responses, and background noise activation is effectively suppressed. This progressive attention optimization aligns closely with the quantitative metric improvements in Table 4, further confirming the effectiveness of the three-module synergistic enhancement from the perspective of feature visualization.
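For readers unfamiliar with how such heatmaps are produced, the core Grad-CAM computation is sketched below on plain nested lists: each channel of a chosen feature map is weighted by the spatial mean of its gradient, the weighted channels are summed, and negative values are clipped. Which layer and which scalar detection score were used for Figure 13 is not stated, so the inputs here are hypothetical toy tensors.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from activations and gradients shaped [channel][row][col]."""
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradient over spatial positions.
    alphas = [sum(map(sum, g)) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, act in zip(alphas, activations):
        for y in range(h):
            for x in range(w):
                cam[y][x] += alpha * act[y][x]
    # ReLU keeps only regions that raise the target score.
    return [[max(0.0, v) for v in row] for row in cam]

# Toy example: channel 0 supports the target, channel 1 works against it.
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
heatmap = grad_cam(acts, grads)
```

The warm regions in a rendered heatmap correspond to the largest values of this map after it is upsampled to the image size and normalized.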

3.4. Attention Mechanism Comparison Experiment

To demonstrate the advantage of RSCA over competing attention designs, we evaluated five alternative configurations within the HSSD-YOLO framework—keeping MBE-Stem and EGAR-FPN fixed and varying only the attention sub-module inside ADCN: ADCN (None), SE, CBAM, CA, and ECA. The results are reported in Table 5.
As shown in Table 5, every attention mechanism except ECA outperforms ADCN (None) on the mAP metrics, indicating that attention guidance generally benefits deformable convolution offset prediction. Figure 14 presents a radar chart comparing each attention mechanism on three metrics: Precision, Recall, and mAP@0.5. RSCA outperforms all other mechanisms on every axis and covers the largest radar area. Among the alternatives, CA and CBAM perform similarly, with mAP@0.5–0.95 values of 76.3% and 75.8%, respectively. Both exceed SE (75.4%) and ECA (75.0%), which model only channel-wise features, suggesting that attention to the spatial dimension is important for precise offset generation in the presence of motion blur.
The proposed RSCA mechanism attains the best performance on all evaluation indicators, with an mAP@0.5–0.95 of 77.4%, exceeding the next-best CA by 1.1 percentage points. The advantage of RSCA primarily derives from its decoupled three-path architecture: unlike SE, which only models channel dependencies, CBAM, which applies channel and spatial attention sequentially, or CA, which only encodes positional information along coordinate directions, RSCA simultaneously models dependencies across three dimensions through height, width, and channel paths, and further introduces cross-dimensional interaction mechanisms, enabling the generated offsets and modulation masks to adapt more precisely to seed targets of different morphologies and degrees of motion blur. Additionally, RSCA adds only 0.6 M parameters compared to ADCN (None), achieving a favorable balance between efficiency and accuracy.

3.5. Feature Pyramid Structure Comparison Experiment

The Feature Pyramid Network (FPN) critically governs multi-scale detection performance, as its fusion strategy directly determines how well seeds of varying sizes are recognized. To benchmark EGAR-FPN against established pyramid architectures, we conducted experiments within the HSSD-YOLO framework—holding MBE-Stem and ADCN constant while substituting only the neck structure with five alternatives: PAN-FPN, BiFPN, ASFF, AFPN, and Gold-YOLO Neck. The corresponding results appear in Table 6.
As presented in Table 6, all compared FPN structures bring improvements in at least one evaluation metric, with most showing gains across all indicators. Figure 15 presents a radar chart comparing each FPN architecture on three metrics: Precision, Recall, and mAP@0.5. EGAR-FPN covers the largest area on all three axes. Gold-YOLO Neck performs closest to EGAR-FPN (mAP@0.5–0.95 of 76.4%), followed by AFPN and BiFPN at 75.9% and 75.6%, respectively. Although ASFF achieves a relatively high Precision of 93.0%, its mAP@0.5–0.95 (75.2%) is marginally lower than those of BiFPN and AFPN, suggesting that its learned spatial filtering strategy may generalize less well under complex motion blur conditions.
EGAR-FPN substantially outperforms all compared structures with an mAP@0.5–0.95 of 77.4%, exceeding the next-best Gold-YOLO Neck by 1.0 percentage points. This advantage primarily derives from two core designs of EGAR-FPN: First, the MSEG module provides explicit edge prior information for the feature fusion process, a capability that is absent in other FPN structures. In high-speed seeding scenarios, motion blur causes severe degradation of seed edge information, and conventional FPN relying solely on the integration of top-down and bottom-up semantic features has difficulty effectively recovering boundary information, whereas EGAR-FPN explicitly injects edge gradient features into feature maps at each scale, providing critical references for precise target boundary reconstruction. Second, the BASR module realizes adaptive integration of deep semantic representations and shallow spatial detail through a bidirectional feature recalibration mechanism, which more effectively preserves fine-grained spatial information in dense seed scenarios compared with the unidirectional or simple weighted fusion strategies of BiFPN and ASFF.

4. Discussion

Balancing recognition precision against processing overhead is the central challenge in developing per-frame seed detection algorithms for high-speed pneumatic seeders, where reliable seed detection is the foundational step toward downstream seeding-rate monitoring and precision variable-rate application. On the self-constructed high-speed motion image dataset covering three seed varieties (indica rice, japonica rice, and wheat), HSSD-YOLO raised mAP@0.5 by 2.5 percentage points (94.1% → 96.6%) and mAP@0.5–0.95 by 5.4 percentage points (72.0% → 77.4%) over the YOLOv11n reference, while the parameter count grew from 2.6 M to 5.2 M. The model sustains an inference speed of 85.1 FPS on the RTX 4090 D reference platform, which exceeds the conventional 30 FPS real-time threshold but remains below the camera’s maximum acquisition rate of 526.5 FPS; in deployment, frame-skipping would reconcile this gap. Embedded-platform latency, throughput, and energy use have not yet been measured. The accuracy improvement is therefore obtained with a moderate absolute parameter increase. The per-parameter accuracy contribution, visible both in the ablation results (Table 4) and in the YOLO-series trend (Table 2), is consistent with targeted architectural choices (motion-blur-aware edge extraction, deformation-aware sampling, and edge-guided fusion) aimed at the specific degradations of this task. A rigorous separation of feature-allocation effects from parameter-scaling effects would require controlled capacity-matched experiments that are beyond the scope of the present study.
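The gap between the acquisition rate and the inference rate can be made concrete with a short calculation. The sketch below assumes a fixed integer stride (process every k-th frame), a policy the study does not specify; the function name `frame_skip_stride` is illustrative only.

```python
import math

def frame_skip_stride(camera_fps: float, detector_fps: float) -> int:
    """Smallest integer stride k such that camera_fps / k <= detector_fps."""
    if camera_fps <= 0 or detector_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, math.ceil(camera_fps / detector_fps))

# Figures reported in the text: 526.5 FPS acquisition, 85.1 FPS inference.
stride = frame_skip_stride(526.5, 85.1)
effective_fps = 526.5 / stride  # rate of frames actually passed to the detector
print(stride, round(effective_fps, 1))
```

With the reported figures this gives a stride of 7, that is, roughly 75.2 processed frames per second. Whether a seed remains visible for at least one processed frame at that stride depends on seed velocity and field of view, which this calculation does not address.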
Regarding individual module contributions, the MBE-Stem module enhances seed contour perception under motion-blur conditions by explicitly modeling directional gradient information through learnable Sobel operators. Under the elevated seed supply rates examined in this study, standard convolutions struggle to extract sufficient edge features from blurred images due to their implicit feature-learning paradigm. The standalone introduction of MBE-Stem improved mAP@0.5–0.95 by 0.8 percentage points, a finding that echoes the observation of Chen et al. [16] regarding the importance of explicit edge modeling for detection performance in complex agricultural scenes. However, unlike prior edge-enhancement strategies that rely on fixed gradient operators, MBE-Stem employs a multi-branch parallel architecture that simultaneously captures edge magnitude, vertical gradient responses, and spatial context, then fuses these complementary cues via an adaptive channel-attention mechanism. This design yields a richer feature representation, particularly for elongated indica rice seeds whose longitudinal blur patterns differ markedly from those of rounder wheat kernels.
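To illustrate what directional gradient operators respond to, the following sketch applies the classical fixed Sobel kernels to a small synthetic image in pure Python. It is not the MBE-Stem implementation: in MBE-Stem the kernels are learnable and are combined with a spatial-context branch and channel attention, none of which is reproduced here. The helper names and the toy image are assumptions made for illustration.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # responds to horizontal edges

def correlate3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation over a 2D list of numbers."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * img[r + i - 1][c + j - 1]
            row.append(acc)
        out.append(row)
    return out

def edge_magnitude(img):
    """Per-pixel gradient magnitude from the two directional responses."""
    gx = correlate3x3(img, SOBEL_X)
    gy = correlate3x3(img, SOBEL_Y)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]

# Toy input: a soft intensity ramp, like a seed boundary smeared horizontally.
blurred = [[0.0, 0.25, 0.5, 0.75, 1.0] for _ in range(4)]
magnitude = edge_magnitude(blurred)
print(magnitude)
```

On this ramp the horizontal operator returns a constant response while the vertical operator returns zero, so a smeared boundary still yields a clear directional signal. Making the kernel weights trainable lets a network adapt that response to the blur patterns present in the data.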
The ADCN module produced the largest single-module performance gain in ablation experiments (1.8 percentage points in mAP@0.5–0.95), highlighting the critical role of attention-guided adaptive sampling in improving target localization accuracy for morphologically diverse seeds. Comparative experiments on attention mechanisms further clarified the importance of joint spatial-channel modeling for offset prediction: mechanisms operating exclusively in the channel dimension, namely SE and ECA, achieved mAP@0.5–0.95 scores of 75.4% and 75.0%, respectively, both lower than mechanisms that additionally incorporate spatial attention, namely CA (76.3%) and CBAM (75.8%). The proposed RSCA mechanism achieved 77.4% mAP@0.5–0.95 by simultaneously modeling dependencies across height, width, and channel dimensions through a decoupled three-path architecture and by introducing explicit cross-dimensional interaction via a globally aggregated spatial statistic. RSCA surpassed the next-best CA mechanism by 1.1 percentage points while adding only 0.6 M parameters relative to ADCN (None), thus attaining a favorable accuracy–efficiency trade-off. These findings are consistent with recent conclusions from precision agricultural vision tasks that multi-dimensional joint attention yields superior performance compared with serial or independent attention formulations.
Comparative experiments on neck architectures confirmed the advantage of EGAR-FPN in edge-guided multi-scale feature fusion. Compared with PAN-FPN, BiFPN, ASFF, AFPN, and Gold-YOLO Neck, EGAR-FPN achieved the highest mAP@0.5–0.95 of 77.4%, exceeding the next-best Gold-YOLO Neck by 1.0 percentage points. This advantage derives from two complementary design decisions. First, the MSEG module provides explicit edge prior information at each FPN level through learnable directional gradient kernels, a capability absent in all compared structures. Under high-speed seeding conditions, motion blur severely degrades seed boundary information, and conventional FPN architectures that rely solely on top-down and bottom-up propagation of semantic features are unable to recover this information effectively. Second, the BASR module enables adaptive fusion of upper-level semantic context and shallow spatial details through a bidirectional, spatially gated recalibration mechanism, which preserves fine-grained spatial information in dense seed arrangements more effectively than the unidirectional propagation of BiFPN or the learning-based spatial filtering of ASFF. These results align with the conclusion of Zhou et al. [19] in loquat detection that task-adaptive feature fusion strategies outperform generic lightweight designs for small-object detection.
In the ablation, the three dual-module configurations produced mAP@0.5–0.95 gains of 3.1, 2.9, and 3.7 percentage points, each exceeding the arithmetic sum of the corresponding single-module gains, and the full three-module integration (+5.4 points) surpassed any pairwise combination. This pattern is consistent with complementary functional roles among the modules; a conclusive demonstration of synergy would still require a controlled, matched-capacity analysis. Grad-CAM provides a qualitative interpretation at the level of feature-attention saliency rather than a mechanistic claim: as modules are added in turn, the warm-color activations evolve from incomplete and diffuse coverage in the YOLOv11n baseline to concentrated, target-aligned activation in the full HSSD-YOLO, with concomitant suppression of background responses. We read this as qualitative evidence compatible with the intended roles of the three modules; a rigorous mechanistic attribution would require complementary tools such as integrated gradients or layer-wise relevance propagation applied to matched image sets. A plausible explanation for this cascade effect is the functional complementarity of the three modules: shallow-level edge gradient features extracted by MBE-Stem may provide more discriminative sampling references for ADCN’s deformable convolutions, while EGAR-FPN subsequently integrates these edge cues across all feature pyramid levels, forming a coherent enhancement chain from initial feature extraction through multi-scale fusion.
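The comparison against linear superposition can be checked from the figures quoted in this section alone. The sketch below is a bound, not a reproduction of Table 4: the single-module gain of EGAR-FPN is not stated here, so it is replaced by an upper bound (no single module exceeded ADCN's 1.8 points), and the mapping of the three dual-module gains to module pairs is likewise not assumed. The function name `synergy_excess` is illustrative.

```python
def synergy_excess(joint_gain: float, single_gains: list[float]) -> float:
    """Joint gain minus the sum of individual gains (positive = superadditive)."""
    return joint_gain - sum(single_gains)

MBE_STEM_GAIN = 0.8     # single-module gain quoted in the text
ADCN_GAIN = 1.8         # largest single-module gain quoted in the text
EGAR_FPN_UPPER = 1.8    # ASSUMPTION: upper bound only, not the measured value

FULL_GAIN = 5.4
SMALLEST_DUAL_GAIN = 2.9

# Lower bound on how far the full model exceeds linear superposition.
full_excess = synergy_excess(FULL_GAIN, [MBE_STEM_GAIN, ADCN_GAIN, EGAR_FPN_UPPER])

# Lower bound for the MBE-Stem + ADCN pair, whichever dual gain belongs to it.
pair_excess = synergy_excess(SMALLEST_DUAL_GAIN, [MBE_STEM_GAIN, ADCN_GAIN])

print(round(full_excess, 1), round(pair_excess, 1))
```

Both bounds are positive (at least 1.0 and 0.3 percentage points), so the superadditivity claim holds arithmetically for these cases. As noted above, this does not by itself separate module interaction from added capacity.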
Comparison with representative detection algorithms further corroborates the practical utility of HSSD-YOLO. The Transformer-based RT-DETR-l, despite competitive detection accuracy (mAP@0.5 of 93.4%), carries 32.0 M parameters and 92.2 GFLOPs—approximately six times and eight times those of HSSD-YOLO, respectively—rendering it poorly suited to resource-constrained agricultural embedded platforms. Faster R-CNN exhibited the weakest overall performance (mAP@0.5–0.95 of 58.0%), primarily because its Region Proposal Network does not handle densely distributed small-target seeds under high-speed motion conditions effectively. Even compared with the larger YOLOv5s (7.2 M parameters), HSSD-YOLO achieves a substantially smaller model footprint (5.2 M parameters) alongside a 19.9 percentage point improvement in mAP@0.5–0.95 (from 57.5% to 77.4%). Taken together, these results indicate that domain-specific architectural optimization addressing the characteristic degradations of high-speed seeding—motion blur, dense small targets, and cross-variety morphological diversity—yields substantially greater gains than simply scaling model capacity.
The design principles embodied in HSSD-YOLO—explicit edge modeling, attention-guided adaptive sampling, and edge-injected cross-scale information aggregation—offer methodological insights that may transfer to other agricultural vision tasks sharing similar degradation characteristics (motion-induced edge loss, dense target arrangement, morphological variability). Empirical generalizability, however, is established only for the seed-detection task on the present dataset. Transfer to rice-panicle density estimation in unmanned harvesters, occluded fruit detection in orchards, or weed identification in cereal fields is a plausible hypothesis that must be tested directly, rather than asserted on the strength of the present experiments. We also note that HSSD-YOLO shows strong within-setting robustness to the simulated perturbations introduced during augmentation (motion blur, noise, illumination shift); robustness outside the reported acquisition setting has not been tested empirically and is not asserted.
It is worth stating the scope of this contribution plainly. The validated task here is single-frame seed detection. Extending this to seed counting—per-frame detections aggregated via a temporal association or tracking stage—and further to seeding-rate estimation by integration over time has not been empirically tested in this study. Counting accuracy, inter-frame consistency, and any form of closed-loop seeding-rate control should therefore be read as downstream capabilities enabled by the present detector rather than demonstrated by it.
From detection to counting. The intended downstream use of HSSD-YOLO is to aggregate per-frame detections over a continuous video stream through a lightweight multi-object tracker such as SORT or ByteTrack. Each detected seed then carries a temporal identity, so one physical seed traversing several frames contributes exactly once to the cumulative count; per-unit-time seed flux follows by differencing cumulative identities. The frame-level precision and recall reported here characterize the input to that counting stage. It should be emphasized that this detection-to-counting pipeline remains entirely conceptual at present: counting accuracy, identity-switch rate, temporal consistency across consecutive frames, and any form of closed-loop seeding-rate control have not been experimentally validated in this work. The end-to-end counting error depends on tracker choice, association thresholds, and the interaction between detection confidence and track initialization, all of which lie beyond the present study and constitute the immediate next step.
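The counting idea described above can be sketched with a deliberately simplified tracker. The code below is neither SORT nor ByteTrack: it uses greedy IoU matching between consecutive frames, with no motion model and no optimal assignment, and it drops a track as soon as it goes unmatched. The function names, the IoU threshold of 0.3, and the toy boxes are all assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_seeds(frames, iou_threshold=0.3):
    """Return the cumulative number of distinct seeds after each frame.

    frames is a list of per-frame detection lists. A detection that overlaps
    a track from the previous frame keeps that identity; an unmatched
    detection opens a new identity and increments the count.
    """
    active = []        # boxes of tracks seen in the previous frame
    total = 0
    cumulative = []
    for detections in frames:
        unmatched = list(range(len(active)))
        for det in detections:
            best, best_iou = None, iou_threshold
            for t in unmatched:
                score = iou(active[t], det)
                if score >= best_iou:
                    best, best_iou = t, score
            if best is None:
                total += 1              # new physical seed
            else:
                unmatched.remove(best)  # same seed as in the previous frame
        active = list(detections)
        cumulative.append(total)
    return cumulative

# Toy stream: one seed falls through three frames, a second enters at frame 2.
frames = [
    [(10, 0, 20, 10)],
    [(10, 4, 20, 14), (40, 0, 50, 10)],
    [(10, 8, 20, 18), (40, 4, 50, 14)],
]
cumulative = count_seeds(frames)
flux = [b - a for a, b in zip([0] + cumulative, cumulative)]  # new seeds per frame
print(cumulative, flux)
```

For the toy stream the cumulative count is `[1, 2, 2]` and the per-frame flux is `[1, 1, 0]`. A missed detection in a single frame would split one seed into two identities under this scheme, which is one reason a production pipeline would rely on a tracker with motion prediction and track persistence.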
Prospects for integration with pneumatic-transport physics. The present study stays within the scope of computer vision and does not attempt aerodynamic or particle-flow modeling. As a natural extension, the per-tube seed-flux time series produced by the counting pipeline could in future work be cross-referenced with the inter-tube coefficient of variation $V_n$ defined in Section 2.1, and, when available, with blower static pressure and airflow–velocity telemetry, allowing deviations linked to tube blockage or two-phase-flow instability to be flagged at the vision output without modifying the detector.
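As a rough illustration of that cross-referencing step, the sketch below computes a coefficient of variation from per-tube seed counts. The definition in Section 2.1 is not reproduced in this part of the article, so the formula used here (sample standard deviation divided by the mean, in percent) is an assumption, as are the function name and the example counts.

```python
import statistics

def inter_tube_cv(counts_per_tube):
    """Coefficient of variation across tubes, in percent.

    ASSUMPTION: sample standard deviation (n - 1 denominator) over the mean;
    the article's own definition may differ.
    """
    if len(counts_per_tube) < 2:
        raise ValueError("need counts from at least two tubes")
    mean = statistics.fmean(counts_per_tube)
    if mean == 0:
        raise ValueError("mean count is zero; CV is undefined")
    return 100.0 * statistics.stdev(counts_per_tube) / mean

# Hypothetical seed counts from four tubes over the same time window.
balanced = [98, 102, 100, 100]
one_tube_restricted = [98, 102, 100, 60]
print(round(inter_tube_cv(balanced), 2), round(inter_tube_cv(one_tube_restricted), 2))
```

A sharp rise in this value between consecutive time windows would be one simple trigger for the blockage flag mentioned above; any alarm threshold would have to be calibrated on real seeding data.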
Several limitations of the current study merit acknowledgment. First, all images were acquired on the single testbed described in Section 2.2 using one camera (Hikvision MV-CS004-10UM, USB 3.0), one backlight, one lens configuration, and laboratory-controlled lighting, covering only three cultivars: japonica rice (‘Wuyoudao No. 4’), indica rice (‘Fengyouxiangzhan’), and wheat (‘Jimai 22’). Crops with markedly different morphology (e.g., maize, soybean, rapeseed) were not evaluated, and cross-crop generalization should not be assumed without retraining. Robustness against different sensors, optics, illumination, field dust, chassis vibration, or sub-zero temperatures has likewise not been tested; any claim of practical applicability is bounded accordingly, and multi-sensor, multi-site, and multi-crop validation remain priorities for subsequent studies. Second, although the computational complexity is effectively controlled (5.2 M parameters, 11.8 GFLOPs, 85.1 FPS on the RTX 4090 D reference platform), the reported throughput was obtained on a high-end desktop GPU rather than on a field-side embedded platform. Systematic benchmarking on the NVIDIA Jetson Orin NX/AGX family commonly used in agricultural edge devices has not been conducted in this study. Post-quantization (INT8 or FP16) accuracy and throughput, inference latency under continuous operation, thermal throttling behavior, and power consumption on such devices therefore remain to be characterized. These factors are critical for field-ready deployment, they bound any claim of practical deployability made in this paper, and they represent an immediate next step. Third, regarding environmental extremes, our robustness tests used real high-speed footage supplemented by synthetic degradation, yet the harshest field conditions (for example, heavy chassis vibration during turns and sub-zero temperatures affecting camera response) were not represented and will need dedicated on-farm trials.
Future research should address these limitations by conducting cross-dataset validation across diverse geographic regions and operational environments, applying parameter reduction strategies such as weight sparsification and teacher–student learning to enable edge-side deployment, and integrating the seed detection algorithm with seed counting and seeder actuator control systems to realize closed-loop seeding-rate feedback, thereby bridging the gap between algorithmic benchmarking and full-scale field application.
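One quantity that can be estimated before any embedded benchmarking is the storage needed for the weights at each numeric precision. The sketch below assumes that every one of the 5.2 M parameters is stored at the stated precision and ignores activations, buffers, and runtime overhead; it says nothing about latency, power, or post-quantization accuracy, which still have to be measured on the target device. The names are illustrative.

```python
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_storage_mb(num_params: float, precision: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e6

# 5.2 M parameters, as reported for HSSD-YOLO.
for precision in ("fp32", "fp16", "int8"):
    print(precision, round(weight_storage_mb(5.2e6, precision), 1))
```

Under these assumptions the weights occupy about 20.8 MB in FP32, 10.4 MB in FP16, and 5.2 MB in INT8.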

5. Conclusions

This study tackled the fundamental difficulties of real-time seed detection in rice–wheat dual-purpose high-speed pneumatic seeders—specifically, rapid seed motion, severe dense occlusion, and edge feature degradation induced by motion blur—by proposing HSSD-YOLO, a target recognition architecture derived from the YOLOv11 backbone and incorporating three targeted improvement modules: MBE-Stem, ADCN, and EGAR-FPN. MBE-Stem recovers seed contour features degraded by motion blur through learnable directional gradient operators. ADCN, equipped with the proposed RSCA mechanism, enhances adaptive sampling for morphologically diverse seeds via joint spatial-channel attention. EGAR-FPN injects explicit edge prior information into multi-scale feature fusion, improving boundary discrimination for densely overlapping targets.
On the self-constructed dataset of high-speed motion images comprising indica rice, japonica rice, and wheat seeds, HSSD-YOLO achieved 96.6% mAP@0.5 and 77.4% mAP@0.5–0.95, improvements of 2.5 and 5.4 percentage points over the YOLOv11n baseline model. Benchmarked against mainstream detection algorithms, including YOLOv5s, YOLOv8n, YOLOv9t, YOLOv10n, RT-DETR-l, and Faster R-CNN, HSSD-YOLO ranked first on the accuracy criteria while keeping the model size at 5.2 M parameters and delivering an inference throughput of 85.1 FPS on the RTX 4090 D reference platform. This throughput exceeds the conventional 30 FPS real-time threshold but remains below the camera’s 526.5 FPS maximum acquisition rate; in practice, frame-skipping can bridge this gap owing to the substantial temporal redundancy at full acquisition speed, whereas embedded-platform throughput, including post-quantization performance, remains to be measured. Ablation studies showed that each of the three proposed modules improves detection performance on its own and that their joint integration produces gains exceeding linear superposition, a pattern consistent with synergy: the maximum single-module improvement in mAP@0.5–0.95 was 1.8 percentage points, whereas full model integration achieved a 5.4 percentage point gain. Comparative experiments on attention mechanisms and feature pyramid architectures further showed that RSCA and EGAR-FPN outperform mainstream alternatives, including SE, CBAM, CA, BiFPN, and ASFF, on this dataset. Overall, HSSD-YOLO supplies a computationally feasible detection algorithm for accurate per-frame localization of seeds under high-speed seeding conditions, establishing a prerequisite for vision-based seeding-quality monitoring.
Full implementation of seeding-rate estimation with counting-error characterization, temporal consistency across continuous video, closed-loop actuator feedback, and systematic benchmarking on agricultural embedded platforms (including INT8/FP16 post-quantization accuracy, latency, and power consumption) lie outside the scope of this paper and constitute necessary directions for subsequent research.

Author Contributions

Conceptualization, Y.Y. and Y.Z.; methodology, Y.Y. and Z.H.; software, Y.Y. and J.L.; validation, Y.Y., Z.H. and X.S.; formal analysis, Y.Y.; investigation, Y.Y.; resources, Y.Z.; data curation, Y.Y. and X.S.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Z.; visualization, Y.Y. and Z.H.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Key R&D Program of China (Grant No. 2021YFD2000403), the Guangdong Provincial Science and Technology Plan Project (Grant No. 2021B1212040009), and China Agriculture Research System for Rice (CARS-01).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to the restrictions imposed by the data management policies of the institution and the funding agencies.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MBE-Stem: Motion Blur Enhanced Stem module
ADCN: Attention-enhanced Deformable Convolutional Network
RSCA: Residual Spatial-Channel Attention
EGAR-FPN: Edge-Guided Adaptive Recalibration Feature Pyramid Network
MSEG: Multi-Scale Edge Guidance
EAFI: Edge-Aware Feature Injection
BASR: Bidirectional Adaptive Spatial Recalibration

References

  1. Pei, K.; Dan, Y. Precision agriculture promoting the development of modern agriculture: Theoretical mechanism and practical evidence. Res. Agric. Mod. 2018, 39, 551–558.
  2. Fan, M.; Shen, J.; Yuan, L.; Jiang, R.; Chen, X.; Davies, W.J.; Zhang, F. Improving crop productivity and resource use efficiency to ensure food security and environmental quality in China. J. Exp. Bot. 2012, 63, 13–24.
  3. Qiu, B.; Hu, X.; Chen, C.; Tang, Z.; Yang, P.; Zhu, X.; Yan, C.; Jian, Z. Maps of cropping patterns in China during 2015–2021. Sci. Data 2022, 9, 479.
  4. Liu, W.; Zhou, J.; Zhang, T.; Zhang, P.; Yao, M.; Li, J.; Sun, Z.; Ma, G.; Chen, X.; Hu, J. Key Technologies in Intelligent Seeding Machinery for Cereals: Recent Advances and Future Perspectives. Agriculture 2024, 15, 8.
  5. Yadelew, Z.; Tadesse, T.M.; Tarekegn, W. Appropriate seed source and rate enhanced the productivity of bread wheat varieties under irrigated conditions in North Mecha, Amhara region, Ethiopia. Heliyon 2024, 10, e31568.
  6. Ding, Y.; Zheng, G.; Zhang, W.; Qi, B.; Wang, Y.; Xia, Q.; Wang, R.; Zhang, H. Design and Evaluation of a High-Speed Airflow-Assisted Seeding Device for Pneumatic Drum Type Soybean Precision Seed Metering Device. Agronomy 2025, 15, 2202.
  7. Tang, H.; Xu, C.; Wang, Z.; Wang, Q.; Wang, J. Optimized Design, Monitoring System Development and Experiment for a Long-Belt Finger-Clip Precision Corn Seed Metering Device. Front. Plant Sci. 2022, 13, 814747.
  8. Wang, S.; Yi, S.; Zhao, B.; Li, Y.; Wang, G.; Li, S.; Sun, W. Photoelectric sensor-based belt-type high-speed seed guiding device performance monitoring method and system. Comput. Electron. Agric. 2024, 227, 109489.
  9. Xu, L.; Hu, B.; Li, J.; Ren, L.; Guo, M.; Mao, Z.; Cai, Y.; Sun, S. An efficient seeding state monitoring system of a pneumatic dibbler based on an interdigital capacitive sensor. Comput. Electron. Agric. 2023, 209, 107856.
  10. Rossi, S.; Scola, I.R.; Bourges, G.; Šarauskis, E.; Karayel, D. Improving the seed detection accuracy of piezoelectric impact sensors for precision seeders. Part I: A comparative study of signal processing algorithms. Comput. Electron. Agric. 2023, 215, 108449.
  11. Wang, J.; Zhang, Z.; Wang, F.; Jiang, Y.; Zhou, W. Design and Experiment of Monitoring System for Rice Hill-direct-seeding Based on Piezoelectric Impact Method. Trans. Chin. Soc. Agric. Mach. 2019, 50, 74–84.
  12. Khan, Z.; Shen, Y.; Liu, H. Object Detection in Agriculture: A Comprehensive Review of Methods, Applications, Challenges, and Future Directions. Agriculture 2025, 15, 1351.
  13. Gao, F.; Fang, W.; Sun, X.; Wu, Z.; Zhao, G.; Li, G.; Li, R.; Fu, L.; Zhang, Q. A novel apple fruit detection and counting methodology based on deep learning and trunk tracking in modern orchard. Comput. Electron. Agric. 2022, 197, 107000.
  14. Koirala, A.; Walsh, K.B.; Wang, Z.; McCarthy, C. Deep learning for real-time fruit detection and orchard fruit load estimation: Benchmarking of ‘MangoYOLO’. Precis. Agric. 2019, 20, 1107–1135.
  15. Sun, J.; Zhou, J.; He, Y.; Jia, H.; Rottok, L.T. Detection of rice panicle density for unmanned harvesters via RP-YOLO. Comput. Electron. Agric. 2024, 226, 109371.
  16. Chen, J.; Wang, H.; Zhang, H.; Luo, T.; Wei, D.; Long, T.; Wang, Z. Weed detection in sesame fields using a YOLO model with an enhanced attention mechanism and feature fusion. Comput. Electron. Agric. 2022, 202, 107412.
  17. Maheswaran, S.; Sathesh, S.; Gomathi, R.D.; Indhumathi, N.; Prasanth, S.; Charumathi, K.; Balanisharitha, P.; Murugesan, G.; Duraisamy, P. Automated Weed Identification And Classification Using Artificial Intelligence. In Proceedings of the 15th International Conference on Computing Communication and Networking Technologies, Kamand, India, 24–28 June 2024; pp. 1–7.
  18. Sathesh, S.; Maheswaran, S. The Design and Development of Delta Arm for Multi-Purpose Agribots. IETE J. Res. 2024, 70, 7526–7536.
  19. Zhou, W.; Gao, L.; Sun, F.; Bian, Y. YOLO-MCS: A Lightweight Loquat Object Detection Algorithm in Orchard Environments. Agriculture 2026, 16, 262.
  20. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090.
  21. Xing, H.; Wan, Y.; Zhong, P.; Lin, J.; Huang, M.; Yang, R.; Zang, Y. Design and experimental analysis of real-time detection system for The seeding accuracy of rice pneumatic seed metering device based on the improved YOLOv5n. Comput. Electron. Agric. 2024, 227, 109614.
  22. Qian, Y.; Cao, P.; Yin, W.; Dai, F.; Hu, F.; Yan, Z. Calculation method of surface shape feature of rice seed based on point cloud. Comput. Electron. Agric. 2017, 142, 416–423.
  23. Martín-Gómez, J.J.; Rewicz, A.; Goriewa-Duba, K.; Wiwart, M.; Tocino, Á.; Cervantes, E. Morphological Description and Classification of Wheat Kernels Based on Geometric Models. Agronomy 2019, 9, 399.
  24. He, S.; Qian, C.; Jiang, Y.; Qin, W.; Huang, Z.; Huang, D.; Wang, Z.; Zang, Y. Design and optimization of the seed feeding device with DEM-CFD coupling approach for rice and wheat. Comput. Electron. Agric. 2024, 219, 108814.
  25. Liu, Q.; Sun, Z.; Li, G.; Chen, F.; Zhou, X. Effects of Different Sowing Rates on Rice Grain Yield Under Drill-seeded Conditions. Barley Cereal Sci. 2018, 35, 26–29.
  26. Dai, Y.; Luo, X.; Zhang, M.; Lan, F.; Zhou, Y.; Wang, Z. Design and experiments of the key components for centralized pneumatic rice dry direct seeding machine. Trans. Chin. Soc. Agric. Eng. 2020, 36, 1–8.
  27. Cao, B. Effects of Sowing Methods and Sowing Rates on Nitrogen Operation, Yield and Quality of Winter Wheat. Master’s Thesis, Shanxi Agricultural University, Jinzhong, China, 2020.
  28. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
  29. He, L.; Zhou, Y.; Liu, L.; Cao, W.; Ma, J. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032.
  30. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698.
  31. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
  32. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More Deformable, Better Results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9308–9316.
  33. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
  34. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
  35. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  36. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  37. Jaiswal, S.K.; Agrawal, R. A Comprehensive Review of YOLOv5: Advances in Real-time Object Detection. Res. Comput. Sci. Technol. 2024, 12, 75–80.
  38. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
  39. Hidayatullah, P.; Syakrani, N.; Sholahuddin, M.R.; Gelar, T.; Tubagus, R. YOLOv8 to YOLO11: A Comprehensive Architecture In-depth Comparative Review. arXiv 2025, arXiv:2501.13400.
  40. Yaseen, M. What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2409.07813.
  41. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the 38th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–15 December 2024; pp. 107984–108011.
  42. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
  43. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  44. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  45. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
  46. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
  47. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
  48. Qiao, Y.; Guo, Y.; He, D. Cattle body detection based on YOLOv5-ASFF for precision livestock farming. Comput. Electron. Agric. 2023, 204, 107579.
  49. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. arXiv 2023, arXiv:2306.15988.
  50. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 51094–51112.
Figure 1. Schematic of high-speed air-assisted centralized metering system for rice/wheat.
Figure 2. Testbed constructed based on the high-speed air-assisted centralized metering system and image acquisition setup.
Figure 3. Sample image of japonica rice (‘Wuyoudao No. 4’), indica rice (‘Fengyouxiangzhan’) and wheat (‘Jimai 22’).
Figure 4. Data collection. Typical sample images from the dataset employed to train the deep learning model.
Figure 5. Effects of augmentation.
Figure 6. HSSD-YOLO detection algorithm.
Figure 7. Structural design of the MBE-Stem module.
Figure 8. Architecture of the proposed ADCN and RSCA modules.
Figure 9. Three core modules of EGAR-FPN: MSEG, EAFI and BASR.
Figure 10. Three-dimensional bar chart comparing detection performance indicators among various models.
Figure 11. Side-by-side detection outputs for indica rice (left column), japonica rice (center column), and wheat (right column). The top row shows the original images; the rows below show, in order, Faster R-CNN, YOLOv5s, YOLOv11n, and HSSD-YOLO. Box colors: green = TP, blue = FP, red = FN.
Figure 12. Representative failure cases of HSSD-YOLO on the test set. (a) Cluster-induced missed detection: one or two seeds inside a dense cluster receive no corresponding prediction (wheat, per-tube supply rate exceeding $q_{\max}$). (b) Elongated-target merging: a single elongated box spans two adjacent seeds, so the second kernel is counted as a missed detection (indica rice under strong longitudinal blur). (c) Duplicate detection on a single seed: one kernel carries two overlapping boxes, and the redundant box is counted as a false positive. Box colors: green = TP, blue = FP, red = FN.
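The TP/FP/FN color coding in Figures 11 and 12 implies a rule for matching predicted boxes to ground-truth boxes. This section does not state the rule the authors used, so the sketch below assumes a common one: greedy one-to-one matching in descending confidence order at an IoU threshold of 0.5. The function names and the toy boxes are hypothetical and serve only to reproduce failure modes (b) and (c) from Figure 12.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    preds: List[Tuple[Box, float]], gts: List[Box], iou_thr: float = 0.5
) -> Tuple[int, int, int]:
    """Greedy matching (assumed rule). Returns (TP, FP, FN) counts."""
    matched = set()  # indices of ground-truth boxes already claimed
    tp = fp = 0
    # Higher-confidence predictions claim ground-truth boxes first.
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1  # no free ground-truth box overlaps enough
    fn = len(gts) - len(matched)  # ground-truth boxes left unclaimed
    return tp, fp, fn


# Case (c): two overlapping boxes on one seed -> one TP, one FP.
seed = (10.0, 10.0, 30.0, 50.0)
duplicate_preds = [((10.0, 10.0, 30.0, 50.0), 0.9), ((11.0, 12.0, 31.0, 52.0), 0.6)]
duplicate_result = match_detections(duplicate_preds, [seed])

# Case (b): one elongated box over two adjacent seeds -> one TP, one FN.
adjacent_seeds = [(0.0, 0.0, 10.0, 20.0), (0.0, 20.0, 10.0, 40.0)]
merged_preds = [((0.0, 0.0, 10.0, 36.0), 0.8)]
merged_result = match_detections(merged_preds, adjacent_seeds)
```

Under this assumed rule each ground-truth seed can be claimed only once, which is why the redundant box in case (c) becomes a false positive rather than a second true positive.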
Figure 13. Grad-CAM heatmaps for three dense-scenario crops (japonica rice, indica rice, and wheat, one per column). The top row shows the original images; the rows below follow the ablation sequence: Baseline → +MBE-Stem → +MBE-Stem+ADCN → full HSSD-YOLO. Red/orange tones mark high-activation areas.
Figure 14. Radar chart comparison of different attention mechanisms in the ADCN module. Three axes: Precision (P), Recall (R), and mAP@0.5.
Figure 15. Radar chart comparison of different feature pyramid network structures. Three axes: Precision (P), Recall (R), and mAP@0.5.
Table 1. Training hyperparameters. All listed values are dimensionless except image resolution (pixels).

| Hyperparameter | Value |
|---|---|
| Total training epochs | 300 |
| Batch size | 32 |
| Input image resolution | 640 × 640 |
| Initial learning rate | 0.01 |
| Optimizer | SGD |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
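For readers who want to reuse the settings in Table 1, the following is a minimal sketch that collects them in one dictionary. The key names follow the training arguments of the Ultralytics YOLO package, which is an assumption: this section does not say which training framework the authors used. The dataset file `seeds.yaml` in the comment is hypothetical.

```python
# Values copied from Table 1; key names are an assumption (Ultralytics-style).
TRAIN_CFG = {
    "epochs": 300,           # total training epochs
    "batch": 32,             # batch size
    "imgsz": 640,            # square input resolution, 640 x 640 pixels
    "lr0": 0.01,             # initial learning rate
    "optimizer": "SGD",
    "momentum": 0.937,
    "weight_decay": 0.0005,
}


def to_cli_args(cfg: dict) -> str:
    """Render the settings as space-separated key=value pairs."""
    return " ".join(f"{key}={value}" for key, value in cfg.items())


cli_args = to_cli_args(TRAIN_CFG)

# Possible use with Ultralytics installed (not executed here):
#   from ultralytics import YOLO
#   model = YOLO("yolo11n.yaml")
#   model.train(data="seeds.yaml", **TRAIN_CFG)  # seeds.yaml is hypothetical
```

Keeping the values in a single mapping makes it easier to train the baseline and the modified network under identical settings, which the comparisons in the later tables rely on.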
Table 2. Comparison of detection results for different models.

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|
| YOLOv5s | 85.6 | 84.4 | 88.4 | 57.5 |
| YOLOv7-tiny | 87.0 | 85.9 | 89.7 | 60.6 |
| YOLOv8n | 87.8 | 86.6 | 90.7 | 66.1 |
| YOLOv9t | 90.0 | 88.4 | 91.9 | 68.6 |
| YOLOv10n | 90.2 | 88.7 | 92.3 | 70.2 |
| YOLOv11n (Baseline) | 91.0 | 90.2 | 94.1 | 72.0 |
| RT-DETR-l | 90.8 | 89.8 | 93.4 | 71.8 |
| Faster R-CNN | 84.7 | 83.4 | 87.3 | 58.0 |
| HSSD-YOLO (Ours) | 94.4 | 93.7 | 96.6 | 77.4 |
Table 3. Comparison of model parameters, GFLOPs, and FPS of representative models.

| Model | Params (M) | GFLOPs | FPS |
|---|---|---|---|
| YOLOv5s | 7.2 | 16.5 | 118.2 |
| YOLOv11n (Baseline) | 2.6 | 6.5 | 142.5 |
| RT-DETR-l | 32.0 | 92.2 | 63.3 |
| Faster R-CNN | 41.5 | 134.5 | 22.4 |
| HSSD-YOLO (Ours) | 5.2 | 11.8 | 85.1 |
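The FPS column in Table 3 can be read as a per-frame time budget, which is often the more useful figure when sizing a real-time pipeline. The sketch below does only that arithmetic on the tabulated values. The helper names are hypothetical, and the conversion assumes that FPS reflects single-stream inference throughput on the authors' test hardware, which this section does not specify.

```python
# (parameters in millions, GFLOPs, FPS) copied from Table 3.
TABLE3 = {
    "YOLOv5s": (7.2, 16.5, 118.2),
    "YOLOv11n": (2.6, 6.5, 142.5),
    "RT-DETR-l": (32.0, 92.2, 63.3),
    "Faster R-CNN": (41.5, 134.5, 22.4),
    "HSSD-YOLO": (5.2, 11.8, 85.1),
}


def latency_ms(fps: float) -> float:
    """Mean time per frame in milliseconds implied by a throughput in FPS."""
    return 1000.0 / fps


def relative_cost(model: str, ref: str = "YOLOv11n") -> dict:
    """Cost of `model` relative to `ref`, using Table 3 values only."""
    params, gflops, fps = TABLE3[model]
    ref_params, ref_gflops, ref_fps = TABLE3[ref]
    return {
        "params_ratio": params / ref_params,
        "gflops_ratio": gflops / ref_gflops,
        "latency_ms": latency_ms(fps),
        "extra_latency_ms": latency_ms(fps) - latency_ms(ref_fps),
    }


hssd_cost = relative_cost("HSSD-YOLO")
for key, value in hssd_cost.items():
    print(f"{key}: {value:.2f}")
```

By this arithmetic, HSSD-YOLO's 85.1 FPS corresponds to roughly 11.8 ms per frame against about 7.0 ms for the YOLOv11n baseline, at twice the parameter count.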
Table 4. Ablation study results of each proposed module. Baseline = YOLOv11n. ✓ = module enabled, ✗ = module disabled. Values in parentheses denote improvements over the Baseline in percentage points.

| Configuration | MBE-Stem | ADCN (RSCA) | EGAR-FPN | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|---|
| YOLOv11n (Baseline) | ✗ | ✗ | ✗ | 94.1 | 72.0 |
| +MBE-Stem | ✓ | ✗ | ✗ | 94.4 (+0.3) | 72.8 (+0.8) |
| +ADCN (RSCA) | ✗ | ✓ | ✗ | 95.5 (+1.4) | 73.8 (+1.8) |
| +EGAR-FPN | ✗ | ✗ | ✓ | 94.9 (+0.8) | 73.3 (+1.3) |
| +MBE-Stem+ADCN | ✓ | ✓ | ✗ | 95.7 (+1.6) | 75.1 (+3.1) |
| +MBE-Stem+EGAR-FPN | ✓ | ✗ | ✓ | 95.2 (+1.1) | 74.9 (+2.9) |
| +ADCN+EGAR-FPN | ✗ | ✓ | ✓ | 96.3 (+2.2) | 75.7 (+3.7) |
| HSSD-YOLO (All) | ✓ | ✓ | ✓ | 96.6 (+2.5) | 77.4 (+5.4) |
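The abstract's statement that the combined gains exceed linear superposition can be checked against Table 4 by comparing the sum of the three single-module gains with the gain of the full model. The sketch below performs that comparison on the tabulated values only. The function name and data layout are hypothetical, and results are rounded to one decimal place to match the table's precision.

```python
# (mAP@0.5, mAP@0.5-0.95) in percent, copied from Table 4.
BASELINE = (94.1, 72.0)
SINGLE_MODULE = {
    "MBE-Stem": (94.4, 72.8),
    "ADCN (RSCA)": (95.5, 73.8),
    "EGAR-FPN": (94.9, 73.3),
}
FULL_MODEL = (96.6, 77.4)


def superposition_check(metric: int) -> tuple:
    """Return (sum of single-module gains, full-model gain, surplus).

    `metric` is 0 for mAP@0.5 and 1 for mAP@0.5-0.95; all values are in
    percentage points, rounded to one decimal place.
    """
    additive = round(
        sum(scores[metric] - BASELINE[metric] for scores in SINGLE_MODULE.values()), 1
    )
    actual = round(FULL_MODEL[metric] - BASELINE[metric], 1)
    return additive, actual, round(actual - additive, 1)


map50 = superposition_check(0)
map50_95 = superposition_check(1)
print("mAP@0.5      (additive, actual, surplus):", map50)
print("mAP@0.5-0.95 (additive, actual, surplus):", map50_95)
```

On these numbers the surplus appears in mAP@0.5–0.95, where the full model gains 5.4 points against an additive 3.9, while at mAP@0.5 the single-module gains sum to the same 2.5 points as the full model.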
Table 5. Comparison of various attention mechanisms within the ADCN module. All models use the HSSD-YOLO architecture with MBE-Stem and EGAR-FPN. Only the attention mechanism in the ADCN offset-mask prediction branch is varied.

| Attention Mechanism | Params (M) | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|---|
| None | 4.6 | 91.9 | 90.9 | 95.9 | 75.3 |
| SE | 4.9 | 92.5 | 91.5 | 96.2 | 75.4 |
| CBAM | 5.0 | 93.1 | 92.0 | 96.0 | 75.8 |
| CA | 4.9 | 92.6 | 91.5 | 96.3 | 76.3 |
| ECA | 4.7 | 92.4 | 91.0 | 95.6 | 75.0 |
| RSCA (Ours) | 5.2 | 94.4 | 93.7 | 96.6 | 77.4 |
Table 6. Performance comparison between various feature pyramid architectures. A unified HSSD-YOLO backbone (MBE-Stem + ADCN) is adopted across all models. Only the Neck/FPN structure is varied. PAN-FPN is the default YOLOv11 neck.

| FPN Structure | Params (M) | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5–0.95 (%) |
|---|---|---|---|---|---|
| PAN-FPN (Default) | 4.3 | 92.4 | 91.3 | 95.7 | 75.1 |
| BiFPN | 4.6 | 93.3 | 91.3 | 95.9 | 75.6 |
| ASFF | 4.8 | 93.0 | 91.8 | 95.7 | 75.2 |
| AFPN | 4.5 | 92.8 | 91.8 | 96.3 | 75.9 |
| Gold-YOLO Neck | 5.0 | 92.8 | 91.9 | 95.9 | 76.4 |
| EGAR-FPN (Ours) | 5.2 | 94.4 | 93.7 | 96.6 | 77.4 |

Citation (MDPI and ACS Style): Yao, Y.; Huang, Z.; Li, J.; Sun, X.; Zang, Y. HSSD-YOLO: A Motion-Blur-Robust Object Detection Framework for Real-Time Seed Detection in High-Speed Pneumatic Seeders. Agriculture 2026, 16, 1160. https://doi.org/10.3390/agriculture16111160