A Lightweight Shape-Aware YOLO Network for Field Strawberry Maturity Detection Under Complex Orchard Conditions

Leng, Tingting; Zhang, Yuxin; Fan, Yamin; Gao, Huajuan; Zhou, Binhong; Mu, Jiong

doi:10.3390/agronomy16111074


by Tingting Leng ^1,2, Yuxin Zhang ^1,2, Yamin Fan ^1,2, Huajuan Gao ^1,2, Binhong Zhou ^1,2 and Jiong Mu ^1,*

¹ College of Information Engineering, Sichuan Agricultural University, Ya’an 625014, China

² Sichuan Key Laboratory of Agricultural Information Engineering, Sichuan Agricultural University, Ya’an 625014, China

^* Author to whom correspondence should be addressed.

Agronomy 2026, 16(11), 1074; https://doi.org/10.3390/agronomy16111074

Submission received: 21 April 2026 / Revised: 10 May 2026 / Accepted: 28 May 2026 / Published: 29 May 2026

(This article belongs to the Section Precision and Digital Agriculture)


Abstract

Precise, non-destructive detection of fruit maturity is a cornerstone of modern precision agriculture, directly impacting harvest scheduling and post-harvest quality control. In the case of strawberries (Fragaria × ananassa), in-field automated assessment is persistently hampered by the fruit’s diminutive size, subtle physiological colour transitions, and frequent occlusion by foliage. To overcome these limitations, we developed SMLO-YOLO, a specialised lightweight vision system designed to reliably detect different maturity stages on edge devices under complex orchard conditions. The proposed architecture incorporates a Cross-Scale Aggregation Neck (HDP-Neck) driven by entropy-guided dynamic sampling, which concentrates computational resources on fruit regions while filtering background noise. Additionally, we introduce a Shape-aware Intersection-over-Union (ShapeIoU) loss and a Boundary- and Class-aware Knowledge Distillation (BCKD) strategy to specifically address the challenge of detecting overlapping clusters and low-maturity fruits. Validation on custom datasets collected from commercial orchards in Sichuan and Shanxi demonstrated that the final SMLO-YOLO model, after BCKDloss-based knowledge distillation, achieved an mAP50 of 92.4% at an inference speed of 256.41 FPS, with 6.49 M parameters and 15.0 GFLOPs. These metrics indicate that the system balances detection accuracy with the throughput and compactness required by agricultural robotics, offering a robust tool for objective, real-time maturity monitoring.

Keywords:

Fragaria × ananassa; non-destructive assessment; precision agriculture; lightweight neural networks; maturity grading

1. Introduction

Strawberry (Fragaria × ananassa) is an economically important berry crop with high nutritional and commercial value. Its bright colour, distinctive flavour, and abundant bioactive compounds, including vitamin C, phenolics, flavonoids, and anthocyanins, make it an important fresh fruit and functional food resource [1]. In commercial production, however, the value of strawberry fruit is strongly dependent on the accuracy of harvest timing. Fruits harvested too early often fail to develop desirable colour, aroma, and sweetness, whereas over-mature fruits are more vulnerable to mechanical damage, softening, decay, and postharvest quality deterioration. Therefore, precise and non-destructive maturity assessment is not only a quality evaluation problem, but also a key technical link connecting harvest scheduling, automatic grading, robotic picking, and postharvest management.

Traditional maturity assessment in strawberry production still relies largely on a visual assessment of fruit colour, surface gloss, firmness or softness, and grower experience. Although this approach is flexible, it is inherently subjective and labour-intensive, and its consistency decreases when production scale expands or when fruits are distributed unevenly under dense foliage. With the rapid development of digital agriculture, deep learning-based fruit detection has become an important technical pathway for replacing subjective manual judgement with objective visual perception [2]. In intelligent harvesting systems, visual perception is no longer a separate recognition module; rather, it determines whether the robot can accurately identify fruit targets, estimate their positions, and make reliable picking decisions in real-time [3]. For this reason, object detection and localisation have become fundamental tasks in fruit harvesting robots, especially under complex orchard conditions where occlusion, overlapping fruits, illumination variation, and background interference frequently occur [4].

Compared with many other orchard fruits, such as apples, tomatoes, cherries, pitaya, and passion fruits, strawberry maturity detection presents a more fine-grained and unstable visual recognition problem. Strawberries are usually small in size, densely distributed, and easily occluded by leaves or neighbouring fruits. More importantly, their maturity changes gradually rather than discretely, and the visual difference between low- and medium-maturity stages can be subtle under natural illumination. Recent studies have attempted to improve YOLO-based models for strawberry maturity detection. CES-YOLOv8 demonstrated that optimising the feature extraction and detection structure of YOLOv8 can improve strawberry maturity recognition in field scenes [5]. CR-YOLOv9 further showed that multi-stage maturity detection can benefit from enhanced representation of ripening-related visual features [6]. With the emergence of YOLOv11, YOLOv11-GSF provided a newer framework for strawberry ripeness detection in agricultural environments [7]. LBS-YOLO highlighted the need to reduce model complexity while maintaining maturity recognition accuracy [8]. These studies indicate that deep learning has become a feasible tool for strawberry maturity detection. However, in practical field scenarios, low-maturity strawberries remain difficult to detect because they are often small, greenish-white, and visually similar to leaves.

The difficulty of strawberry maturity recognition is closely related to several common challenges in agricultural object detection. For example, green fruit detection based on an optimised YOLOX-m model showed that colour similarity between fruits and leaves can significantly weaken target separability [9]. This problem is highly relevant to immature strawberry detection, where the fruit surface has not yet developed strong red pigmentation. Meanwhile, real-time fruit detection on CPU platforms has shown that agricultural detection models must be evaluated not only in terms of accuracy, but also whether they can run efficiently on resource-limited devices [10]. Lightweight fruit detection models for orchard environments further confirm that dense distributions, irregular occlusion, and natural illumination changes are persistent constraints in practical deployment [11]. Studies on embedded passion fruit detection also suggest that a compact model design is necessary when detection systems are expected to operate on field robots or portable devices [12]. In automated mulberry harvesting, lightweight YOLOv8n-based detection has been used to improve the deployability of fruit recognition models [13]. Similarly, GreenFruitDetector revealed that low contrast between fruit and vegetation can lead to missed detections when the model fails to preserve weak target cues [14].

Therefore, lightweight design is not merely an engineering preference in agricultural vision; it is a practical requirement imposed by field deployment. In fruit detection tasks, the model must maintain sufficient representational ability to distinguish small and occluded targets while reducing excessive parameters, memory consumption, and floating-point operations. Lightweight YOLOv5s-based pitaya detection has demonstrated that compact models can support fruit recognition under both daytime and nighttime light-supplement environments [15]. The use of S-YOLO for greenhouse tomato detection further illustrates that accuracy and efficiency must be jointly optimised rather than treated as separate objectives [16]. YOLOv8-CML extended this idea to colour-changing melon ripening detection, where the model needs to capture both fruit location and maturity-related colour variation [17]. A pineapple maturity analysis based on MobileNetV3-YOLOv4 also shows that lightweight backbones can reduce computational cost in natural environments [18]. More recent YOLOv11-based apple detection indicates that the latest YOLO architectures are being actively adapted for lightweight orchard perception [19]. The use of GPC-YOLO for tomato maturity detection similarly reflects the trend of redesigning YOLOv8n-like structures for agricultural maturity recognition [20]. Embedded apple detection studies further suggest that model size and inference latency are decisive factors for practical orchard systems [21]. Lightweight cherry detection also confirms that small fruit targets require compact yet detail-preserving feature extraction strategies [22].

Although YOLO-series models have become widely used in agricultural detection, other object detection frameworks still provide useful methodological references. Faster R-CNN-based apple detection has shown strong localisation ability in complex orchard environments [23]. However, its two-stage detection pipeline generally increases computational burden, which is why lightweight Faster R-CNN variants based on MobileNetV3 have been explored for densely planted pitaya orchards [24]. Comparative research on the detection of date fruits further indicates that YOLO and Faster R-CNN differ not only in accuracy, but also in inference speed and deployment suitability [25]. RetinaNet-based fruit detection provides another perspective by using focal loss and multi-scale feature fusion to address class imbalance and complex field backgrounds [26]. Studies comparing YOLOv8, Faster R-CNN, and RetinaNet in olive fruit detection show that the optimal detector must be selected according to both accuracy and real-time requirements [27]. Research on the detection and classification of date fruits also reveals that agricultural detection tasks often involve not only target localisation, but also subtle category discrimination among visually similar fruit classes [28]. These findings suggest that, for strawberry maturity detection, the model should combine the speed advantage of YOLO with stronger feature fusion, shape-aware localisation, and maturity-sensitive classification.

Knowledge distillation offers a promising strategy for improving lightweight models without increasing inference complexity. In object detection, distillation transfers useful feature, response, or relational knowledge from a high-capacity teacher model to a compact student model, thereby improving the performance of small detectors [29]. More general studies on knowledge distillation also regard it as an effective approach for model compression and efficient deployment under limited computational resources [30]. In agricultural scenarios, reconstructed feature and dual distillation have been used to improve lightweight tea shoot detection, showing that distillation can enhance the representation ability of compact models in complex field backgrounds [31]. Knowledge distillation has also been combined with pruning for colour-changing melon ripeness detection, demonstrating its potential in fruit maturity recognition tasks where both accuracy and efficiency are required [32]. Nevertheless, most existing distillation strategies are designed as general compression methods. They do not explicitly consider the specific error sources in strawberry maturity detection, such as weak colour transition in low-maturity fruits, boundary ambiguity under occlusion, and confusion between adjacent maturity stages.

Overall, despite progress in accuracy and lightweight exploration, three gaps remain. First, adaptation to fine-grained cues is insufficient: most models under-capture small low-ripeness targets and the near-elliptic fruit shape, and only weakly model the green-to-red transition and occluded local textures. Second, balancing lightness and accuracy is difficult: some methods sacrifice accuracy for compactness and lack robustness to illumination changes and fruit overlap under real-time constraints. Third, the integration of specialised techniques is low: cross-domain modules such as partial convolution and learned upsampling are still fragmented, and dedicated datasets for complex field scenes are scarce.

Guided by the practical demands of field maturity detection, this study proposes SMLO-YOLO, an improved detector built on YOLOv11s that combines small-object enhancement, structural compression and knowledge distillation, and evaluates it on a dataset covering complex field scenarios. The main contributions are as follows:

1. Complex-Scene Strawberry Dataset: Images were collected in Sichuan and Shanxi across major cultivars, covering leaf occlusion, fruit overlap and uneven illumination. Three maturity levels—low, medium and high—were annotated, and a high-quality dataset was constructed through augmentation and quality control to support the training of lightweight models.

2. SMLO-YOLO Architecture: To reconcile feature transfer with lightweight design, the HDP-Neck integrates cross-scale alignment (HSPAN), dynamic sampling (DySample) and selective convolution (C3K2_PConv), stabilising feature transfer while reducing redundancy. A decoupled EfficientHead enhances class discrimination, and ShapeIoU improves localisation of near-elliptic fruits, addressing small-fruit misses and boundary bias.

3. BCKDloss Knowledge Distillation: To overcome the limited adaptability of generic distillation, a scheme tailored to small and occluded strawberries transfers the teacher’s discriminative capability without increasing model size, thereby improving robustness under complex field conditions.

2. Materials and Methods

2.1. Dataset Collection

2.1.1. In-Field Data Acquisition

Strawberry images were acquired during the 2024–2025 winter season at commercial production facilities in Sichuan and Shanxi, China. Three cultivars (Hongyan, Zhangji, and Sweet Charlie) were selected to ensure morphological and physiological diversity. High-fidelity images were captured using a Canon EOS 90D camera (Canon Inc., Tokyo, Japan) under diffuse natural light (09:00–11:30 and 14:00–16:30) to avoid high-contrast midday shadows. Camera settings included a 5184 × 3456 resolution, aperture-priority mode (f/4–f/8), and automatic shutter speed (1/100–1/1000 s), with a 0.5–1.5 m working distance to encompass both fruit and canopy architecture. To reflect complex field constraints, scenes deliberately incorporated challenging conditions such as foliage occlusion (5–30%), fruit overlap (10–20%), and uneven illumination (20–40% contrast), so that the images support non-contact assessment of external maturity.

2.1.2. Maturity Classification and Annotation

Maturity was categorised using LabelImg v1.8.1 into three levels based on commercial horticultural standards and surface pigmentation: (a) Low: <30% red coverage (greenish-white); (b) Medium: 30–80% red with a distinct ripening boundary and no gloss; and (c) High: >80% uniform red with natural gloss and no white/green areas. Labels were assigned independently by two trained assessors, with discrepancies (<5%) resolved through joint review to ensure ground-truth consistency (Figure 1).

To reduce subjectivity in maturity labelling, the visible fruit surface was evaluated according to a predefined annotation rubric based on the proportion of red pigmentation. Two trained annotators independently labelled each strawberry instance. For borderline samples near the maturity thresholds or samples affected by occlusion, blur, or incomplete fruit morphology, the labels were jointly reviewed and corrected. Ambiguous samples that could not be reliably assigned to one maturity class were excluded during quality control.
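The red-coverage thresholds of the annotation rubric can be written down as a small rule. The following is a minimal sketch only: the function name `maturity_level` is hypothetical, the handling of the exact 30% and 80% boundaries follows the ranges quoted above, and the gloss and white/green-area criteria, which the annotators also used, are not modelled.

```python
def maturity_level(red_fraction: float) -> str:
    """Map the visible red-coverage fraction (0..1) to a maturity label.

    Thresholds follow the rubric in the text: low < 30%, medium 30-80%,
    high > 80%. Gloss and residual white/green areas are ignored here
    (an assumption made to keep the sketch short).
    """
    if not 0.0 <= red_fraction <= 1.0:
        raise ValueError("red_fraction must lie in [0, 1]")
    if red_fraction < 0.30:
        return "low"
    if red_fraction <= 0.80:
        return "medium"
    return "high"
```

In the study itself the labels were assigned by human assessors, so a rule like this would at most serve as a consistency check on borderline samples.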

2.1.3. Dataset Preprocessing and Augmentation

The original 1075 images were first divided into training, validation, and test sets using a stratified 7:2:1 split. The increase from 1075 to 3305 images was caused by training-set augmentation, which was used to improve robustness under rotation, illumination variation, horizontal viewpoint changes, and image noise. Data augmentation was applied only to the training set after the split, while the validation and test sets were kept unchanged to avoid potential data leakage. To improve model generalisation under fluctuating orchard illumination conditions, targeted augmentation strategies, including random rotation, brightness/contrast adjustment, horizontal flipping, and Gaussian noise, were used. At the annotated-instance level, the training set contained 2510 low-maturity, 988 medium-maturity, and 1311 high-maturity strawberry instances before augmentation. After augmentation, these numbers increased to 7476, 3022, and 3956, respectively. The validation set contained 1925 low-maturity, 825 medium-maturity, and 960 high-maturity instances, and the test set was preserved in its original, non-augmented form for final model evaluation. Following quality control to remove blurred images, incomplete fruit morphology, and ambiguous labels, a final dataset of 3305 images was retained. All images were standardised to 640 × 640 pixels using bilinear interpolation to preserve texture and colour cues for the SMLO-YOLO architecture (Figure 2).
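The split-then-augment order described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names `stratified_split` and `augment_training_set` are hypothetical, the stratification key is left to the caller because the text does not say which attribute was used, and the augmenters are placeholders for rotation, brightness/contrast adjustment, flipping, and Gaussian noise.

```python
import random
from collections import defaultdict


def stratified_split(items, key, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split items into train/val/test so each stratum keeps roughly 7:2:1."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    train, val, test = [], [], []
    for _, group in sorted(groups.items()):
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def augment_training_set(train, augmenters):
    """Return the originals plus one augmented copy per augmenter.

    Only the training set is passed in, so validation and test images
    are never augmented and cannot leak into training.
    """
    augmented = list(train)
    for sample in train:
        for augment in augmenters:
            augmented.append(augment(sample))
    return augmented
```

Calling `stratified_split` first and `augment_training_set` only on the returned training list reproduces the ordering that the text relies on to avoid data leakage.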

2.2. Methods

2.2.1. SMLO-YOLO Network Structure

The SMLO-YOLO architecture, illustrated in Figure 3, is developed as a lightweight vision system derived from the YOLOv11s framework, specifically optimised for the non-destructive quantification of strawberry maturity in complex field imagery. The system addresses three primary agronomic challenges: the visual ambiguity between early-stage green fruits and foliage, the real-time constraints of embedded agricultural hardware, and the detection errors induced by fruit overlapping and leaf occlusion.

The model restructures the multi-scale feature flow through a bottom-up integration of extraction, fusion, and regression modules. In the feature-fusion stage, the neck is reconstructed with a Cross-Scale Aggregation Network (HSPAN) to preserve the weak semantic signals of low-contrast immature targets. To maintain computational efficiency without sacrificing the fine-grained texture cues—such as achene distribution patterns—C3K2_PConv replaces standard convolutions, effectively reducing parameter redundancy. Furthermore, the DySample unit utilises entropy-guided dynamic sampling to allocate resolution based on information density, thereby suppressing background noise from soil and vines while concentrating focus on fruit-bearing regions.

At the detection stage, an EfficientHead is employed to decouple classification and regression tasks. By incorporating channel attention, this head enhances the system’s ability to discriminate between the subtle physiological transitions of low, medium, and high maturity levels while minimising false positives from the vegetative background. To improve localisation for the fruit’s near-elliptic morphology, a Shape-aware Intersection-over-Union (ShapeIoU) loss is introduced. This loss function accounts for shape similarity and local overlap compensation, significantly reducing boundary “drift” in dense clusters. Collectively, these optimisations enable SMLO-YOLO to achieve superior recall for low-maturity targets and higher localisation accuracy, meeting the stringent requirements for deployment on mobile agricultural robots.

2.2.2. HDP-Neck

HDP-Neck is the core neck for adapting YOLOv11 to strawberry maturity detection. It tackles three field pain points: low-ripeness fruit colour close to leaves causing feature attenuation, boundary and semantic breakage under overlap and occlusion, and the trade-off between lightness and accuracy. A unified scheme integrates HSPAN, DySample, and C3K2_PConv to optimise the full chain of sampling, convolution, and fusion, strengthening multi-scale context while cutting invalid computations and feeding a more discriminative, less redundant input to the head.

In the forward pass, the three submodules work in tandem to form a closed loop from information focusing to discriminative enhancement to cross-scale fusion. DySample first divides the P3, P4, and P5 features into 16 × 16 blocks, estimates the information entropy of each block, keeps high-entropy blocks at the original resolution to preserve texture and colour transitions, and downsamples low-entropy background blocks by a factor of two to four. C3K2_PConv then splits the channels with a factor of four: C_s (one quarter) receives a 3 × 3 convolution to sharpen edges and local red regions, while C_b (three quarters) passes global cues through unchanged; both are then fused with a residual path and a 1 × 1 convolution to enhance cross-channel interaction, retaining detail while greatly reducing the 3 × 3 cost. Finally, HSPAN unifies P3 to P5 to 256 channels, derives temperature-scaled softmax weights from global average pooling, and performs bidirectional residual fusion (top-down to inject semantics and bottom-up to inject details), followed by two to three C3k2 refinements, and outputs maps aligned with the detection head.

This “focus–refine–associate” design improves feature fusion while maintaining a lightweight structure. DySample reduces redundant background computation, C3K2_PConv decreases convolutional redundancy, and HSPAN strengthens cross-scale feature interaction for small and occluded strawberries. Therefore, HDP-Neck mainly provides an efficient feature-fusion foundation for subsequent detection, loss-function optimisation, and knowledge distillation. The final mAP50 of 92.4% was achieved only after incorporating the proposed BCKDloss-based knowledge distillation strategy.

2.2.3. HSPAN

As shown in Figure 4, HSPAN is rebuilt from HS-FPN for strawberry maturity detection to fix cross-scale transfer gaps under a lightweight budget. Small low-ripeness fruits rely on P3, medium fruits on P4, and large or overlapping high-ripeness fruits on P5; weak coupling among these scales leads to missed small fruits and shifted boxes. HSPAN enforces dense bidirectional association among P3–P5.

Three upgrades are as follows: (1) add a bottom-up path, downsampling P3 to P4 and P5 via 3 × 3 stride-2 conv with residual aggregation while keeping the original top-down flow; (2) apply 2–3 rounds of C3k2 refinement after each P3–P5 fusion to enhance pink-to-green transitions and achene patterns; (3) unify fusion channels at 256 with shared kernels while retaining channel attention. Neck size becomes 1.97 M params and 7.0 GFLOPs, down 12% and 8% versus HS-FPN.

Dynamic weights and two-way residual fusion drive coupling. After unifying P3–P5 to 256 channels (1 × 1 conv) and computing per-level global-average scores, temperature-scaled softmax is as follows:

α_{i} = \frac{\exp (S_{i} / τ)}{\sum_{j = 3}^{5} \exp (S_{j} / τ)},

(1)

Top-down semantic correction is as follows:

F_{4}^{'} = F_{4} + \mathrm{Up} (F_{5}) \cdot α_{5},

(2)

Bottom-up detail supplement is as follows:

F_{4}^{''} = F_{4}^{'} + \mathrm{Down} (F_{3}) \cdot α_{3} .

(3)

The final maps pass through C3k2. Residual links amplify weak signals from small fruits and leaves, reducing misses.

Performance: Low-ripeness recall rises from 0.83 to 0.87, and the high-ripeness box-offset error drops by 15%. HSPAN thus builds a robust P3–P5 loop with dynamic weights, bidirectional fusion, and multi-round refinement, strengthening small and overlapping targets without notable cost.
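The weighting and fusion of Equations (1)–(3) can be illustrated with a short sketch. It rests on several assumptions: feature maps are flattened to plain lists, the Up and Down resampling steps are taken as already applied so that all three inputs share the P4 grid, the temperature default of 1.0 is a placeholder because the text does not report τ, and the function names are hypothetical.

```python
import math


def level_weights(scores, tau=1.0):
    """Temperature-scaled softmax over per-level scores S_3..S_5 (Eq. 1).

    scores maps a pyramid level (3, 4, 5) to its global-average score.
    """
    scaled = {level: s / tau for level, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {level: math.exp(v - peak) for level, v in scaled.items()}
    total = sum(exps.values())
    return {level: e / total for level, e in exps.items()}


def fuse_p4(f4, f5_up, f3_down, alpha):
    """Apply Eq. 2 then Eq. 3 element-wise on the P4 grid.

    f5_up is Up(F5) and f3_down is Down(F3), both already resampled.
    """
    top_down = [a + b * alpha[5] for a, b in zip(f4, f5_up)]             # Eq. 2
    return [a + c * alpha[3] for a, c in zip(top_down, f3_down)]         # Eq. 3
```

A lower τ sharpens the weights towards the level with the highest score, while a higher τ moves them towards a uniform average of the three levels.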

2.2.4. DySample

The DySample unit is designed to be lightweight while focusing computation on information-rich regions. Strawberry field images have high information density in fruit regions and low information density in background regions (soil, empty vines). By quantifying local information entropy and adjusting resolution dynamically, DySample significantly reduces the computational load of background regions without losing the key features of the fruit. Its structure is shown in Figure 5.

The core workflow is divided into two steps:

Local Information Density Evaluation: The input feature map F is divided into M \times N local blocks (set to 16 × 16 local blocks in the experiment). For each local block B_{m, n} (1 \leq m \leq M, 1 \leq n \leq N), the Shannon information entropy H (B_{m, n}) is calculated to quantify its information density. The formula is as follows:

H (B_{m, n}) = - \sum_{k = 1}^{K} p_{m, n, k} \log_{2} p_{m, n, k},

(4)

In the above equation, K denotes the number of channels in the feature map, and p_{m, n, k} denotes the normalised pixel-value probability of local block B_{m, n} in the k-th channel. A higher entropy value indicates that the local block contains more effective target information, such as fruit colour and texture cues. Conversely, a lower entropy value indicates that the local block is dominated by background regions.

Figure 5. The structure diagram of DySample.

Dynamic Sampling Execution: Set the information entropy threshold T (obtained through validation-set tuning in the experiment as T = 0.6), and perform classified sampling on local blocks:

High-Information Block, H (B_{m, n}) \geq T: “1× downsampling” is used (i.e., the original resolution is maintained, with only slight compression through a 3 × 3 convolution with a stride of 1). The formula is expressed as:

B_{m, n}^{sampled} = {Conv}_{3 \times 3, stride = 1} (B_{m, n}),

(5)

This ensures that the colour texture of the fruit (such as the red gradient of high-ripening strawberries and the bluish-white spots of low-ripening strawberries) and edge details are completely retained.

Low-Information Block, H (B_{m, n}) < T: “2–4 times downsampling” is used; the feature size is compressed and background noise is suppressed through mean pooling. The formula is:

B_{m, n}^{sampled} = {AvgPool}_{s \times s, stride = s} (B_{m, n}), \quad s \in \{2, 4\} .

(6)
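The two steps above (entropy scoring, then threshold-based sampling) can be sketched in a few lines. This is a simplified illustration with assumptions: each block is summarised by one non-negative response per channel, which is normalised across the K channels to form p_{m,n,k} (the text does not spell out the normalisation); the rule for choosing s = 2 versus s = 4 is not given, so the stride for low-information blocks is a parameter; and the function names are hypothetical.

```python
import math


def block_entropy(channel_responses):
    """Shannon entropy in bits of one block, following Eq. 4.

    channel_responses holds K non-negative values, one per channel; they
    are normalised to sum to one before the entropy is computed.
    """
    total = sum(channel_responses)
    if total <= 0:
        return 0.0  # an empty block carries no information
    probs = [v / total for v in channel_responses]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def sampling_stride(entropy, threshold=0.6, low_info_stride=2):
    """Stride 1 keeps resolution (Eq. 5); stride s pools the block (Eq. 6)."""
    if low_info_stride not in (2, 4):
        raise ValueError("low_info_stride must be 2 or 4")
    return 1 if entropy >= threshold else low_info_stride


def plan_sampling(blocks, threshold=0.6, low_info_stride=2):
    """Return the stride chosen for every block in an M x N grid."""
    return [
        [sampling_stride(block_entropy(b), threshold, low_info_stride) for b in row]
        for row in blocks
    ]
```

A block whose response is spread evenly over four channels has an entropy of 2 bits and is kept at full resolution, whereas a block dominated by a single channel scores 0 and is pooled.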

2.2.5. C3K2_PConv

Traditional C3k2 blocks in YOLOv11 convolve all channels with a 3 × 3 kernel, so parameters and compute scale quadratically with channel count; this strains embedded budgets yet still blurs the fine cues that separate low-ripeness fruit from leaves. We introduce C3K2_PConv, a FasterNet-style partial convolution that preserves key features while sharply cutting cost.

Let the input be X \in R^{C \times H \times W}. Split the channels by a ratio n_{split} (default four):

C = C_{s} + C_{b},

(7)

C_{s} = \lfloor C / n_{split} \rfloor,

(8)

C_{b} = C - C_{s} .

(9)

Only the convolved subset X_{s} (one quarter of the channels) receives the 3 × 3 conv; the pass-through subset X_{b} (three quarters) bypasses the conv:

PConv (X) = Concat ({Conv}_{3 \times 3} (X_{s}), X_{b}),

(10)

Ignoring bias, norm, and activation, single-layer complexity follows

{FLOP}_{std} \approx 9 H W C^{2},

(11)

{FLOP}_{sparse} \approx 9 H W C_{s}^{2} \approx \frac{9 H W C^{2}}{n_{split}^{2}} .

(12)

With C = 256 and n_{split} = 4, the 3 × 3 parameters drop from 589,824 to 36,864, i.e., to 1/n_{split}^{2} = 1/16 (about 0.0625×) of the original compute and parameter load. The bypass also lowers memory traffic, so end-to-end latency improves beyond FLOPs alone.
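The channel split of Equations (7)–(9) and the cost ratio implied by Equations (11)–(12) can be checked with a few lines of arithmetic. The helper names below are hypothetical, and bias, normalisation, and activation terms are ignored, as in the equations.

```python
def split_channels(channels, n_split=4):
    """Return (C_s, C_b): convolved and pass-through channel counts (Eqs. 7-9)."""
    c_s = channels // n_split
    return c_s, channels - c_s


def conv3x3_weights(c_in, c_out):
    """Weight count of a 3 x 3 convolution without bias."""
    return 9 * c_in * c_out


def pconv_saving(channels, n_split=4):
    """Return (standard weights, partial-conv weights, ratio) for one layer."""
    c_s, _ = split_channels(channels, n_split)
    standard = conv3x3_weights(channels, channels)
    partial = conv3x3_weights(c_s, c_s)
    return standard, partial, partial / standard


print(pconv_saving(256))  # (589824, 36864, 0.0625)
```

Because H and W multiply both FLOP counts identically, the same 1/16 ratio holds for per-layer compute as for the weights.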

C3K2_PConv keeps the C3k2 skeleton, replacing two 3 × 3 standard convolutions with PConv while retaining residuals and the multi-branch topology. The X_{s} branch sharpens local details such as achene density and fruit-edge contours; the X_{b} branch preserves slow-varying global cues such as chromatic differences between low-ripeness fruit and leaves and the overall luminance of high-ripeness fruit. After residual fusion, an optional 1 × 1 pointwise conv (for cross-channel interaction) and DropPath can be added. Interfaces and tensor shapes remain unchanged, making the module plug-and-play for existing training, distillation, and quantisation workflows.

In practice, although constant terms from 1 × 1 layers and residual paths slightly reduce the theoretical gain, bandwidth-limited edge devices benefit disproportionately from the reduced memory access. The complementary roles of X_{s} and X_{b} yield a balanced representation that couples global semantics with local detail, meeting fine-grained requirements while staying lightweight (Figure 6).

2.2.6. EfficientHead

Traditional heads struggle to balance accuracy and speed for strawberry maturity. Dense 3 × 3 refinement treats all channels equally, expressed as

Conv (X) = X * W,

where X is the input map, W the fixed kernel, and the asterisk denotes convolution. This drowns weak cues such as faint local red on low-ripeness fruit and the green-to-red transition on medium-ripeness fruit, causing small-target misses. Fixed kernels also respond poorly to gradual colour gradients, blurring the boundary between half-red and nearly fully red. Under occlusion or overlap, computation is further wasted on background, leaving stems and local reds under-modelled, so fully red fruit may be misclassified as semi-red.

We propose EfficientHead, combining lightweight refinement with decoupled outputs. In the stem, dense conv is replaced by either double 3 × 3 group conv or partial conv plus 1 × 1, enabling on-demand computing. Group conv sets the group count g to approximately C / 16, letting each group specialise: for example, red-dominated groups strengthen high-ripeness cues while green-dominated groups separate low-ripeness fruit from leaves. Partial conv uses a channel split with n_{div} = 4: one quarter of channels receive 3 × 3 conv to capture edges and achene texture; the remaining three quarters bypass conv to keep slow-varying global colour gradients and shape. Both options cut redundant work, lowering stem compute by over sixty percent and avoiding over-smoothing. Measured on low-ripeness fruit under two centimetres, feature response rises by a factor of 2.3, reducing small-target misses.

For decoupled outputs, parallel 1 × 1 branches separate box regression and maturity classification. The regression branch outputs 4 \times reg\_max channels and, with a distribution focal loss (DFL) layer, models position as a distribution, reducing small-fruit offset error by fifteen percent. The classification branch outputs n_{c} = 3 channels for low, medium, and high maturity and trains independently, lowering confusion between half-red and full-red by twelve percent.

EfficientHead adapts well to occlusion and overlap. Channel-selective processing concentrates compute on fruit regions, lifting foreground channel response to more than three times the background while trimming invalid background cost. The regression branch’s probabilistic modelling further captures local contours of occluded fruit, raising the detection rate for occluded cases to 0.87 compared with 0.79 for traditional heads, and strengthening maturity grading reliability in complex scenes.
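How a regression branch with 4 × reg_max channels "models position as a distribution" can be illustrated by the usual decoding step of distribution-based box regression: each box side gets a softmax over reg_max bins, and the predicted distance is the expectation over bin indices. The sketch below assumes this standard decoding; the default of 16 bins is an assumption (the text does not report the value used), and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_box_distances(logits, reg_max=16):
    """Turn 4 * reg_max logits into four expected distances, in bin units.

    The logits are grouped per box side (left, top, right, bottom); each
    group is read as a discrete distribution over bins 0 .. reg_max - 1.
    """
    if len(logits) != 4 * reg_max:
        raise ValueError("expected 4 * reg_max logits")
    distances = []
    for side in range(4):
        chunk = logits[side * reg_max:(side + 1) * reg_max]
        probs = softmax(chunk)
        distances.append(sum(i * p for i, p in enumerate(probs)))
    return distances
```

A sharply peaked distribution yields a confident edge location, while a flat one signals an ambiguous boundary, which is the situation that arises for occluded fruit.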

2.2.7. ShapeIoU Loss Function

As illustrated in Figure 7, traditional box losses such as CIoU primarily optimise overlap, centre distance, and aspect ratio, but are suboptimal for strawberries: fruits are near-elliptic, shape cues vary with maturity, and field images often show only partial contours due to occlusion and overlap. Misfits in shape thus lead to maturity grading errors even when the predicted position is close.

We introduce ShapeIoU, which fuses positional IoU with an elliptic shape term for

B_{pred} = (x_{pred}, y_{pred}, w_{pred}, h_{pred}),

(13)

B_{gt} = (x_{gt}, y_{gt}, w_{gt}, h_{gt}) .

(14)

Positional overlap:

IoU = \frac{| B_{pred} \cap B_{gt} |}{| B_{pred} \cup B_{gt} |} .

(15)

Fit ellipses to obtain (α_{gt}, β_{gt}, θ_{gt}) and (α_{pred}, β_{pred}, θ_{pred}). The normalised elliptic discrepancy is

ShapeDis = \left( \frac{α_{pred} - α_{gt}}{α_{gt}} \right)^{2} + \left( \frac{β_{pred} - β_{gt}}{β_{gt}} \right)^{2} + \left( \frac{θ_{pred} - θ_{gt}}{π / 2} \right)^{2} ,

(16)

ShapeIoU combines both with weight α = 0.35 (a scalar weight, distinct from the ellipse parameters α_{gt} and α_{pred}):

ShapeIoU = IoU - α \cdot ShapeDis ,

(17)

L_{reg} = 1 - ShapeIoU = 1 - IoU + α \cdot ShapeDis .

(18)
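Equations (15) to (18) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: boxes are taken as (centre x, centre y, width, height), the ellipse parameters are assumed to be supplied by a separate fitting step that the text does not detail, and the function names are hypothetical. The scalar weight is called `weight` in the code to keep it apart from the ellipse parameter α.

```python
import math


def box_iou(pred, gt):
    """Eq. (15): IoU of two boxes given as (cx, cy, w, h)."""
    px1, py1 = pred[0] - pred[2] / 2, pred[1] - pred[3] / 2
    px2, py2 = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def shape_dis(ell_pred, ell_gt):
    """Eq. (16): normalised discrepancy between (alpha, beta, theta) triples."""
    a_p, b_p, t_p = ell_pred
    a_g, b_g, t_g = ell_gt
    return (((a_p - a_g) / a_g) ** 2
            + ((b_p - b_g) / b_g) ** 2
            + ((t_p - t_g) / (math.pi / 2)) ** 2)


def shape_iou_loss(pred, gt, ell_pred, ell_gt, weight=0.35):
    """Eqs. (17)-(18): L_reg = 1 - IoU + weight * ShapeDis."""
    return 1.0 - box_iou(pred, gt) + weight * shape_dis(ell_pred, ell_gt)


# Hypothetical example: a slightly shifted box with a slightly rotated ellipse.
example_loss = shape_iou_loss(
    pred=(11.0, 10.0, 4.0, 6.0),
    gt=(10.0, 10.0, 4.0, 6.0),
    ell_pred=(2.0, 3.0, 0.1),
    ell_gt=(2.0, 3.0, 0.0),
)
```

With identical boxes and identical ellipse parameters the loss is zero; any mismatch in the ellipse triple adds a penalty even when the box overlap is perfect, which is the behaviour the shape term is meant to provide.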

Effects: For occluded fruits, the shape term guides the model to infer the full ellipse from partial contours, reducing offsets. Across maturities, rounder fully red fruits have a ratio α_{gt} / β_{gt} closer to one, while greener fruits tend to have larger ratios; binding these parameters to labels improves grading and cuts confusions such as half-red versus green. Unlike plain IoU, which can rate a slender box highly when overlap is large, ShapeIoU penalises shape mismatch, yielding a more accurate measure of well-positioned and well-fitted bounding boxes.

2.2.8. Knowledge Distillation

Early small strawberries with faint pink–green transitions are often buried by leaf or soil noise, causing misses. Occlusion and overlap also hide maturity cues, leading to mix-ups such as half-red being judged as green or fully red being judged as half-red. To raise accuracy under a lightweight budget, we adopt knowledge distillation with a high-capacity teacher and a compact student (Figure 8).

Teacher: A stronger model sharing the YOLOv11s infrastructure captures fine edges of small fruits, completion rules in occluded regions, and weak green-to-red gradients, supplying high-quality signals.

Student: The lightweight improved YOLOv11 fits device limits but underperforms without distillation; it learns the teacher’s feature logic via targeted transfer.

Targeted distillation is carried out as follows.

Feature Layer: Distil the neck, which carries colour gradients and shape cues critical to maturity.

Apply a dataset-tuned random mask M to the student’s neck feature S to occlude non-core regions and force completion:

S_{masked} = S ⊙ M .

(19)

Reconstruction Loss: A projection G(\cdot) maps S_{masked} to the teacher neck feature T. Use mean-squared error over all elements:

L_{MGD} = \frac{1}{C \cdot H \cdot W} \sum_{c = 1}^{C} \sum_{h = 1}^{H} \sum_{w = 1}^{W} \left( G(S_{masked})_{c, h, w} - T_{c, h, w} \right)^{2} .

(20)

Total Loss: Balance task and distillation terms:

L_{total} = (1 - α) L_{task} + α L_{MGD},

(21)

where L_{task} includes cross-entropy for maturity classes and ShapeIoU for box regression, and α here is the weight that balances the task and distillation terms.
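The masking, reconstruction, and weighting steps in Equations (19) to (21) can be sketched as follows. The sketch works on flattened feature lists (all C · H · W elements in one list) and uses an identity function as a placeholder for the projection G, which in practice would be a learned module. The keep probability, the seed, the helper names, and the example weight of 0.2 are assumptions for illustration only; the text does not give these values.

```python
import random


def random_mask(n, keep_prob, seed=0):
    """Binary mask M with n entries; 1.0 keeps an element, 0.0 occludes it."""
    rng = random.Random(seed)
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]


def masked_mse_distill(student, teacher, mask, project=lambda x: x):
    """Eqs. (19)-(20): mean-squared error between G(S * M) and T.

    `project` stands in for the learned projection G; the identity default
    is a placeholder, not the module used in the paper.
    """
    if not (len(student) == len(teacher) == len(mask)):
        raise ValueError("student, teacher and mask must have equal length")
    masked = [s * m for s, m in zip(student, mask)]       # Eq. (19)
    generated = project(masked)
    squared = [(g - t) ** 2 for g, t in zip(generated, teacher)]
    return sum(squared) / len(teacher)                    # Eq. (20)


def total_loss(task_loss, distill_loss, alpha):
    """Eq. (21): (1 - alpha) * L_task + alpha * L_MGD."""
    return (1.0 - alpha) * task_loss + alpha * distill_loss


# Hypothetical flattened features with one occluded element.
student = [1.0, 2.0, 3.0, 4.0]
teacher = [1.0, 2.0, 3.0, 4.0]
mask = [1.0, 0.0, 1.0, 1.0]
distill = masked_mse_distill(student, teacher, mask)
combined = total_loss(task_loss=0.5, distill_loss=distill, alpha=0.2)
```

With an identity projection the masked element contributes the whole error, which shows why a trainable G is needed: it has to reconstruct the occluded positions from their unmasked neighbours for the loss to fall.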

After distillation, mAP50 rises toward the teacher, small-fruit misses drop, and occlusion-induced mix-ups decline, while parameters, compute, and memory remain unchanged—preserving the model’s lightweight profile.

2.3. Experimental Environment Settings

The specific initial training parameters of this experiment are listed in Table 1. The experiments were conducted on an NVIDIA RTX 3080 Ti GPU with 12 GB memory. The software environment included Python 3.10, Ubuntu 22.04, CUDA 11.8, and PyTorch 2.1.2. The system was also equipped with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40 GHz. The training and testing environments were kept consistent to ensure repeatability of the experimental results.

2.4. Evaluation Metrics

To comprehensively evaluate the performance of SMLO-YOLO, this study employed seven key indicators: Precision, mean Average Precision (mAP), Recall, frames per second (FPS), number of parameters (Parameters), computational complexity (GFLOPs), and model weight size (Weights). The confusion-matrix terms on which these indicators are based are defined in Table 2.

TP (True Positive): the number of positive samples correctly predicted as positive by the model; FP (False Positive): the number of negative samples wrongly predicted as positive; FN (False Negative): the number of positive samples wrongly predicted as negative; TN (True Negative): the number of negative samples correctly predicted as negative. Precision: the proportion of predicted positive samples that are actually positive.

Precision = \frac{TP}{TP + FP} .

(22)

Average Precision (AP) and mAP (mAP50): AP is the area under the precision–recall curve, where p(r) is precision as a function of recall r, and mAP is the mean of the AP values over all N categories. mAP50 is the mAP computed at an IoU threshold of 0.5.

AP = \int_{0}^{1} p(r) \, dr ,

(23)

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP_{i} .

(24)

Recall: The proportion of all actual positive samples that are correctly identified by the model:

Recall = \frac{TP}{TP + FN} .

(25)

FPS (Frames Per Second): The number of image frames processed per second, used to measure the speed of model inference, where N here is the number of frames processed and T is the total inference time in seconds:

FPS = \frac{N}{T} .

(26)

GFLOPs: Billions of floating-point operations required for one forward pass, used to evaluate the computational complexity of the model (an operation count, not a per-second rate).

Parameters: The total number of trainable parameters, reflecting the complexity of the model. Weights (Model Weights): Storage size of the model file (MB), which affects the efficiency of model deployment.
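The indicator definitions in Equations (22) to (26) can be written as small helper functions. This is a minimal sketch with hypothetical function names: the AP integral is approximated with a plain trapezoidal rule over sampled precision–recall points, whereas detection toolkits usually apply an interpolated variant, so the numbers it produces are illustrative rather than directly comparable to the reported values.

```python
def precision(tp, fp):
    """Eq. (22): TP / (TP + FP); returns 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (25): TP / (TP + FN); returns 0.0 when there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Eq. (23), approximated by the trapezoidal rule.

    `recalls` must be sorted in ascending order and aligned with `precisions`.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have equal length")
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        if width < 0:
            raise ValueError("recalls must be sorted in ascending order")
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_ap(ap_per_class):
    """Eq. (24): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def fps(num_frames, seconds):
    """Eq. (26): frames processed per second."""
    return num_frames / seconds


# Hypothetical precision-recall samples for one class.
ap_example = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
```

For a three-class task such as low, medium, and high maturity, `mean_ap` would be called with three AP values, one per class, each computed from detections matched at an IoU threshold of 0.5 to obtain mAP50.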

Together, these indicators quantify the performance of SMLO-YOLO in terms of detection accuracy, real-time performance, and lightweight design, and provide a quantitative basis for the ablation experiment, neck comparison experiment, loss-function selection, knowledge distillation, and comparison with mainstream models.

3. Results

3.1. Ablation Experiment

Table 3 presents the incremental effects of the main structural components of SMLO-YOLO. Compared with the baseline YOLOv11s, EfficientHead improved mAP50 from 89.5% to 90.4%, indicating that the decoupled and lightweight detection head enhanced maturity-related classification and regression. The introduction of HDP-Neck (listed as YOLOv11s-HDC in Table 3) improved multi-scale feature fusion for small and occluded strawberries, achieving an mAP50 of 90.5% while reducing the number of parameters and GFLOPs. When HDP-Neck and EfficientHead were combined into the HDCE architecture, the model maintained a compact structure, with 6.49 M parameters and 15.0 GFLOPs, while achieving an mAP50 of 90.0%.

It should be noted that the mAP50 of 92.4% was not obtained at the HDCE stage. This final performance was achieved after applying the proposed BCKDloss-based knowledge distillation strategy in the subsequent experiments. Therefore, HDP-Neck, EfficientHead, ShapeIoU, and BCKDloss contributed to the final model from different aspects: HDP-Neck enhanced cross-scale feature fusion, EfficientHead improved lightweight detection, ShapeIoU refined shape-aware localisation, and BCKDloss further improved the discriminative ability of the lightweight student model without increasing inference complexity.
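To make the size of the lightweighting effect concrete, the short sketch below computes the relative reductions of the HDCE model against the YOLOv11s baseline from the values in Table 3. The dictionary keys and the helper name are hypothetical; only the numbers come from the table.

```python
# Values copied from Table 3 (baseline YOLOv11s vs. YOLOv11s-HDCE).
baseline = {"params": 9_413_961, "size_mb": 19.2, "gflops": 21.6}
hdce = {"params": 6_493_481, "size_mb": 13.3, "gflops": 15.0}


def relative_reduction(before, after):
    """Percentage reduction of `after` relative to `before`."""
    return 100.0 * (before - after) / before


# Percentage reductions, rounded to one decimal place.
reductions = {
    key: round(relative_reduction(baseline[key], hdce[key]), 1)
    for key in baseline
}
```

All three quantities fall by roughly 31%, which is the budget within which the later ShapeIoU and distillation gains are obtained, since neither of those steps changes the network structure.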

3.2. Neck Comparative Experiment

For the neck, seven enhancement schemes were implemented on the improved YOLOv11s model and compared: HDP (listed as YOLOv11s-HDC in Table 4), YOLOv11s-ASF, YOLOv11s-SPDConv, YOLOv11s-goldyolo, YOLOv11s-bifpn, YOLOv11s-CARAFE, and YOLOv11s-GDFPN. The comparison was designed to verify the lightweight performance, detection accuracy, and inference efficiency of each scheme in strawberry maturity detection. Table 4 presents the results for the different modules. HSPAN + DySample + C3K2_PConv (HDP) achieved the most robust compromise across the three types of indicators. In terms of lightweighting, it has only 6.81 M parameters, significantly fewer than comparable designs (such as the 17.43 M of SPDConv). In terms of accuracy, it achieved the highest mAP50 of 90.5%, with a Recall of 0.87 and a Precision of 0.84, and its detection of small targets and occluded samples was more stable. In terms of efficiency, although 188.68 FPS is not the fastest, it is sufficient to meet the real-time field requirement (≥15 FPS) while maintaining high precision and low computational overhead.

Among the other schemes, SPDConv has medium accuracy, with an mAP50 of 89.3%, but the highest resource consumption, which makes it difficult to deploy. ASF offers the fastest inference (416.67 FPS), but at the cost of accuracy and lightweight design: its mAP50 is 88.3%, and both its parameter count and model size are higher than those of HDP. goldyolo and GDFPN each have advantages in either accuracy or speed, but neither offers an overall advantage in cost and accuracy. In summary, through the combined effect of HSPAN cross-scale alignment, DySample dynamic sampling, and C3K2_PConv lightweight convolution, HDP strengthened the modelling and fusion of key strawberry cues (such as the fine edge texture of small fruits and colour completion in occluded areas) while strictly controlling computation and model size. Subsequent experiments therefore adopted HDP as the default neck configuration to further optimise the detection head and distillation strategy.

3.3. Loss Function Experiment

To clarify the influence of the IoU-type loss function on the HDCE-equipped improved YOLOv11s model (fixed at 6.49 M parameters, a model size of 13.3 MB, and 15.0 GFLOPs), PIoU, PIoU2, CIoU, MPDIoU, and the ShapeIoU proposed in this paper were compared. The results are shown in Table 5. ShapeIoU delivered the most balanced performance: its mAP50 reached 90.5% (the highest), which is 0.7, 0.9, and 0.8 percentage points higher than PIoU, CIoU, and MPDIoU, respectively. By integrating an elliptical shape-fit measure for strawberries, it improves bounding-box regression for small and occluded fruits. Its FPS is 344.83, meeting the real-time requirement in the field, and its Recall of 0.858 and Precision of 0.837 indicate a balance between missed and false detections. The other losses each have shortcomings: PIoU has the highest Precision (0.845) but a lower mAP50 (89.8%) and weak small-fruit detection; PIoU2 has the highest FPS but the lowest Precision, indicating more false detections; CIoU shows poor compatibility and insufficient accuracy, with the lowest mAP50 (89.6%); and MPDIoU is the slowest (208.33 FPS).

In conclusion, ShapeIoU was more suitable for strawberry morphology than the other IoU-based losses. Under the same model size and computational cost, it improved mAP50 to 90.5%, indicating that the shape-aware localisation constraint helped refine bounding-box regression for near-elliptical and partially occluded strawberry fruits. This improvement was achieved before knowledge distillation and should be distinguished from the final 92.4% mAP50 obtained with BCKDloss.

3.4. Knowledge Distillation Experiment

To evaluate the impact of different knowledge distillation loss functions on the performance of the improved YOLOv11s strawberry maturity detection based on the HDCE architecture, and to determine the best solution that does not undermine the lightweight characteristics, this study tested five loss functions: BCKDloss (proposed in this paper), CWDloss, L1loss, L2loss and MIMICloss. The experiment adopted a consistent “teacher–student” framework and training configuration, and only changed the loss function to eliminate irrelevant interference. The experimental results are shown in Table 6.

After applying BCKDloss-based knowledge distillation, the final SMLO-YOLO achieved the best overall performance, with an mAP50 of 92.4%, which was 1.9, 1.4, 0.8, and 2.1 percentage points higher than CWDloss, L1loss, L2loss, and MIMICloss, respectively. The model also achieved a Recall of 86.5%, a Precision of 85.3%, and an inference speed of 256.41 FPS, which satisfied the real-time requirement for field applications.

It is important to note that all student models maintained the same lightweight structure, with 6.49 M parameters, 13.3 MB model size, and 15.0 GFLOPs. Therefore, the performance improvement brought by BCKDloss was not caused by increased model complexity, but by more effective knowledge transfer from the teacher model to the lightweight student model. This result confirms that the reported 92.4% mAP50 corresponds to the complete SMLO-YOLO model after knowledge distillation.

3.5. Comparison with Mainstream Models

As summarised in Table 7, the final SMLO-YOLO model after BCKDloss-based knowledge distillation outperformed the compared mainstream lightweight detectors across key metrics. In terms of accuracy, it achieved the highest mAP50 of 92.4%, surpassing YOLOv6s and YOLOv8s by 4.6 and 3.6 percentage points, respectively. The comparative performance of different models is further illustrated in Figure 9. This result confirms the effectiveness of the complete SMLO-YOLO framework rather than any single module alone.

This accuracy was achieved while maintaining a compact model structure, with only 6.49 M parameters, 15.0 GFLOPs, and an inference speed of 256.41 FPS. Compared with YOLOv10s, SMLO-YOLO reduced computational complexity while remaining suitable for deployment on resource-constrained edge devices. Furthermore, the model limits false positives effectively, achieving a Precision of 0.853, the highest among all compared models. Although its inference speed of 256.41 FPS is lower than that of the fastest compared model, YOLOv6s (454.55 FPS), it comfortably exceeds the real-time threshold (≥15 FPS) required for agricultural machinery. The confusion matrices before and after model improvement are shown in Figure 10. The heat-map visualisation results under different sample densities are presented in Figure 11. Representative detection results before and after improvement are shown in Figure 12.

4. Discussion

SMLO-YOLO was developed to address the main challenges of field strawberry maturity detection, including small targets, leaf occlusion, fruit overlap, and subtle colour transitions between maturity stages. Previous YOLO-based studies have improved strawberry ripeness detection through feature enhancement and lightweight design, such as CES-YOLOv8, CR-YOLOv9, YOLOv11-GSF, and LBS-YOLO [5,6,7,8]. Compared with these methods, SMLO-YOLO further combines cross-scale feature fusion, shape-aware localisation, and knowledge distillation, which makes it more suitable for complex orchard environments.

The improvement of SMLO-YOLO mainly comes from the coordinated optimisation of several task-specific components. HDP-Neck strengthened multi-scale feature representation for small and occluded fruits, while EfficientHead reduced redundant computation and improved maturity classification. ShapeIoU introduced strawberry-shape information into bounding-box regression, and BCKDloss enhanced the discriminative ability of the lightweight student model without increasing inference complexity. These results suggest that effective strawberry maturity detection depends not only on lightweight model design, but also on preserving weak visual cues related to fruit colour, boundary, and shape.

However, several limitations remain. The current validation was limited to three cultivars and two production regions, and the model was mainly tested under daytime field conditions. Its robustness under broader cultivars, different illumination conditions, disease interference, and real edge-device deployment still requires further evaluation. In addition, maturity assessment was mainly based on external visual cues, while physiological indicators such as firmness, soluble solid content, and acidity were not included. Therefore, the current model should be regarded as a visual maturity detection method rather than a complete physiological maturity evaluation system.

5. Conclusions

This study proposed SMLO-YOLO, a lightweight strawberry maturity detection model designed for complex field environments. The model integrates HDP-Neck, EfficientHead, ShapeIoU, and BCKDloss-based knowledge distillation to improve the detection of strawberries at different maturity stages under conditions such as small targets, occlusion, overlap, and subtle colour variation.

The final SMLO-YOLO model achieved an mAP50 of 92.4%, Recall of 86.5%, Precision of 85.3%, 256.41 FPS, 6.49 M parameters, and 15.0 GFLOPs. These results demonstrate that the proposed method can maintain high detection accuracy while preserving a compact model structure and real-time inference capability. Therefore, SMLO-YOLO provides a feasible visual perception solution for strawberry maturity monitoring in agricultural robots, handheld devices, and edge-based orchard systems.

Future work will expand the dataset, include more cultivars and environmental conditions, introduce maturity-related physiological indicators, and validate the model in closed-loop robotic harvesting scenarios to improve the robustness and replicability of strawberry maturity detection methods.

Author Contributions

Conceptualization, T.L. and J.M.; methodology, T.L.; software, T.L. and H.G.; validation, T.L., H.G. and B.Z.; formal analysis, T.L. and B.Z.; investigation, Y.Z., Y.F. and H.G.; resources, Y.Z. and Y.F.; data curation, Y.Z., Y.F. and T.L.; writing—original draft preparation, T.L.; writing—review and editing, J.M.; visualisation, H.G.; supervision, J.M.; project administration, J.M.; funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in the Zenodo repository at https://doi.org/10.5281/zenodo.18812332.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cianciosi, D.; Armas Diaz, Y.; Qi, Z.; Yang, B.; Chen, G.; Cassotta, M.; Gracia Villar, S.; Dzul Lopez, L.A.; Rivas Garcia, L.; Forbes Hernandez, T.Y.; et al. Strawberry as a health promoter: An evidence-based review. Where are we 10 years later? Food Funct. 2025, 16, 5705–5732.
2. Gong, X.; Wu, Q. Fruit detection methods based on deep learning in agricultural planting: A systematic literature review. IEEE Access 2025, 13, 96092–96110.
3. Huang, Y.; Xu, S.; Chen, H.; Li, G.; Dong, H.; Yu, J.; Zhang, X.; Chen, R. A review of visual perception technology for intelligent fruit harvesting robots. Front. Plant Sci. 2025, 16, 1646871.
4. Shi, X.; Wang, S.; Zhang, B.; Ding, X.; Qi, P.; Qu, H.; Li, N.; Wu, J.; Yang, H. Advances in object detection and localization techniques for fruit harvesting robots. Agronomy 2025, 15, 145.
5. Chen, Y.; Zhong, H.; Liu, S. CES-YOLOv8: Strawberry maturity detection based on the improved YOLOv8. Agronomy 2024, 14, 1353.
6. Ye, R.; Shao, G.; Gao, Q.; Zhang, H.; Li, T. CR-YOLOv9: Improved YOLOv9 multi-stage strawberry fruit maturity detection application integrated with CRNET. Foods 2024, 13, 2571.
7. Ma, H.; Zhao, Q.; Zhang, R.; Hao, C.; Dong, W.; Zhang, X.; Li, F.; Xue, X.; Sun, G. YOLOv11-GSF: An optimised deep learning model for strawberry ripeness detection in agriculture. Front. Plant Sci. 2025, 16, 1584669.
8. Fu, H.; Li, X.; Li, Z.; Zhu, L.; Feng, Y. LBS-YOLO: A lightweight model for strawberry ripeness detection. Front. Plant Sci. 2025, 16, 1715263.
9. Jia, W.; Xu, Y.; Lu, Y.; Yin, X.; Pan, N.; Jiang, R.; Ge, X. An accurate green fruits detection method based on optimised YOLOX-m. Front. Plant Sci. 2023, 14, 1187734.
10. Mao, D.; Sun, H.; Li, X.; Yu, X.; Wu, J.; Zhang, Q. Real-time fruit detection using deep neural networks on CPU: An edge AI application. Comput. Electron. Agric. 2023, 204, 107517.
11. Yang, X.; Zhao, W.; Wang, Y.; Yan, W.Q.; Li, Y. Lightweight and efficient deep learning models for fruit detection in orchards. Sci. Rep. 2024, 14, 26086.
12. Sun, Q.; Li, P.; He, C.; Song, Q.; Chen, J.; Kong, X.; Luo, Z. A lightweight and high-precision passion fruit YOLO detection model for deployment in embedded devices. Sensors 2024, 24, 4942.
13. Qiu, H.; Zhang, Q.; Li, J.; Rong, J.; Yang, Z. Lightweight mulberry fruit detection method based on improved YOLOv8n for automated harvesting. Agronomy 2024, 14, 2861.
14. Wang, J.; Shang, Y.; Zheng, X.; Zhou, P.; Li, S.; Wang, H. GreenFruitDetector: Lightweight green fruit detector in orchard environment. PLoS ONE 2024, 19, e0312164.
15. Li, H.; Gu, Z.; He, D.; Wang, X.; Huang, J.; Mo, Y.; Li, P.; Huang, Z.; Wu, F. A lightweight improved YOLOv5s model and its deployment for detecting pitaya fruits in daytime and nighttime light-supplement environments. Comput. Electron. Agric. 2024, 220, 108914.
16. Sun, X. Enhanced tomato detection in greenhouse environments: A lightweight model based on S-YOLO with high accuracy. Front. Plant Sci. 2024, 15, 1451018.
17. Chen, G.; Hou, Y.; Cui, T.; Li, H.; Shangguan, F.; Cao, L. YOLOv8-CML: A lightweight target detection method for color-changing melon ripening in intelligent agriculture. Sci. Rep. 2024, 14, 14400.
18. Li, Y.; Ma, X.; Wang, J. Pineapple maturity analysis in natural environment based on MobileNet V3-YOLOv4. Smart Agric. 2023, 5, 35–44.
19. Du, X.; Zhang, X.; Li, T.; Chen, X.; Yu, X.; Wang, H. YOLO-WAS: A lightweight apple target detection method based on improved YOLO11. Agriculture 2025, 15, 1521.
20. Dong, Y.; Qiao, J.; Liu, N.; He, Y.; Li, S.; Hu, X.; Yu, C.; Zhang, C. GPC-YOLO: An improved lightweight YOLOv8n network for the detection of tomato maturity in unstructured natural environments. Sensors 2025, 25, 1502.
21. Olguín-Rojas, J.C.; Vasquez, J.I.; López-Canteñs, G.d.J.; Herrera-Lozada, J.C.; Mota-Delfin, C. A lightweight YOLO-based architecture for apple detection on embedded systems. Agriculture 2025, 15, 838.
22. Li, Q.; Li, K.; Zhao, K.; Wu, Y.; Jin, T.; Ji, X. CHERRY-YOLO: Lightweight model for cherry detection based on improved YOLOv8n. Eng. Agrícola 2025, 45, e20240136.
23. Kong, X.; Li, X.; Zhu, X.; Guo, Z.; Zeng, L. Detection model based on improved Faster-RCNN in apple orchard environment. Intell. Syst. Appl. 2024, 21, 200325.
24. Nan, Y.; Zhang, H.; Zheng, J.; Yang, K. Pitaya detection using an improved lightweight Faster R-CNN based on MobileNetV3 in densely planted pitaya orchards. J. Agric. Eng. 2025, 56, 1886.
25. Lipiński, S.; Sadkowski, S.; Chwietczuk, P. Application of AI in date fruit detection—Performance analysis of YOLO and Faster R-CNN models. Computation 2025, 13, 149.
26. Peng, H.; Chen, H.; Zhang, X.; Liu, H.; Chen, K.; Xiong, J. Retinanet_G2S: A multi-scale feature fusion-based network for fruit detection of punna navel oranges in complex field environments. Precis. Agric. 2024, 25, 889–913.
27. Osco-Mamani, E.; Santana-Carbajal, O.; Chaparro-Cruz, I.; Ochoa-Donoso, D.; Alcazar-Alay, S. The detection and counting of olive tree fruits using deep learning models in Tacna, Perú. AI 2025, 6, 25.
28. Almutairi, A.; Alharbi, J.; Alharbi, S.; Alhasson, H.F.; Alharbi, S.S.; Habib, S. Date fruit detection and classification based on its variety using deep learning technology. IEEE Access 2024, 12, 190666–190677.
29. Shehzadi, T.; Noor, R.; Ifza, I.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Knowledge distillation in object detection: A survey from CNN to transformer. Sensors 2026, 26, 292.
30. Moslemi, A.; Briskina, A.; Dang, Z.; Li, J. A survey on knowledge distillation: Recent advancements. Mach. Learn. Appl. 2024, 18, 100605.
31. Zheng, Z.; Zuo, G.; Zhang, W.; Zhang, C.; Zhang, J.; Rao, Y.; Jiang, Z. Learning lightweight tea detector with reconstructed feature and dual distillation. Sci. Rep. 2024, 14, 23669.
32. Chen, G.; Hou, Y.; Chen, H.; Cao, L.; Yuan, J. A lightweight color-changing melon ripeness detection algorithm based on model pruning and knowledge distillation: Leveraging dilated residual and multi-screening path aggregation. Front. Plant Sci. 2024, 15, 1406593.

Figure 1. Sample pictures of strawberries at different ripeness levels.

Figure 2. Examples of data augmentation: (a) original images; (b) augmented images.

Figure 3. Structure diagram of the SMLO-YOLO network.

Figure 4. Structure diagram of HSPAN. CA is the abbreviation of Channel Attention.

Figure 6. Schematic diagram of the structure of C3K2_PConv. (a) Conventional convolution; (b) grouped convolution; (c) partial convolution; B_Pconv is the abbreviation of Bottleneck_PConv.

Figure 7. Comparison between the CIoU loss function and the ShapeIoU loss function.

Figure 8. Diagram of the core steps of knowledge distillation.

Figure 9. Illustration of comparative experimental results across different models.

Figure 10. Confusion matrices before and after model improvement. (a) YOLOv11s; (b) SMLO-YOLO.

Figure 11. Heat maps before and after improvement of YOLOv11s. (a) Sparse samples; (b) moderately dense samples; (c) dense samples.

Figure 12. Detection results before and after improvement. (a) Dense samples; (b) moderately dense samples; (c) sparse samples.

Table 1. Initial Training Parameters.

Parameter	Form/Value
epochs	100
batch	16
close_mosaic	10
workers	8
optimizer	SGD
lrf (final learning-rate factor)	0.01
cache	False
validation batch size	16
input image size	640 × 640
momentum	0.937
weight decay	0.0005
warm-up epochs	3.0

Table 2. Definition of Indicators.

Actual \ Predicted	Positive	Negative
Positive	TP	FN
Negative	FP	TN

Table 3. Results of Ablation Experiment.

Model	Parameters	Model size (MB)	GFLOPs	mAP50 (%)	FPS	Recall (%)	Precision (%)
YOLOv11s-baseline	9,413,961	19.2	21.6	89.5	243.90	85.1	83.8
YOLOv11s-EfficientHead	8,915,913	18.1	19.5	90.4	333.33	84.1	86.4
YOLOv11s-HDC	6,808,745	13.9	16.7	90.5	188.68	87.0	84.0
YOLOv11s-HDCE (ours)	6,493,481	13.3	15.0	90	208.33	85.2	84.1

Table 4. Comparison Results of different Neck Improvement Modules.

Model	Parameters	Model size (MB)	GFLOPs	mAP50 (%)	FPS	Recall (%)	Precision (%)
YOLOv11s-ASF	10,055,561	22.5	26.7	88.3	416.67	83.6	8.3
YOLOv11s-SPDConv	17,431,881	35.2	41.1	89.3	200	84.7	83.3
YOLOv11s-goldyolo	13,251,561	27.1	25.6	89.3	163.93	83.8	84.1
YOLOv11s-bifpn	7,077,045	14.5	21.6	89.9	217.39	85.1	83.6
YOLOv11s-CARAFE	9,578,641	19.5	21.6	89.3	232.56	83.6	84.8
YOLOv11s-GDFPN	13,737,769	28.4	28.9	88.5	370.37	84.8	83.2
YOLOv11s-HDC (ours)	6,808,745	13.9	16.7	90.5	188.68	87.0	84.2

Table 5. Comparison of Different IoU Loss Functions.

Model	Parameters	Model size (MB)	GFLOPs	mAP50 (%)	FPS	Recall (%)	Precision (%)
YOLOv11s-HDCE + PIoU	6,493,481	13.3	15.0	89.8	322	85.3	84.5
YOLOv11s-HDCE + PIoU2	6,493,481	13.3	15.0	90	384.62	86.5	83.5
YOLOv11s-HDCE + CIoU	6,493,481	13.3	15.0	89.6	370.37	86.0	83.9
YOLOv11s-HDCE + MPDIoU	6,493,481	13.3	15.0	89.7	208.33	86.0	84.1
YOLOv11s-HDCE + ShapeIoU (ours)	6,493,481	13.3	15.0	90.5	344.83	85.8	83.7

Table 6. Comparison Results of Different Distillation Methods.

Distillation loss	Parameters	Model size (MB)	GFLOPs	mAP50 (%)	FPS	Recall (%)	Precision (%)
CWDloss	6,493,481	13.3	15.0	90.5	384.62	84.8	85.4
L1loss	6,493,481	13.3	15.0	91	250	87.3	84.6
L2loss	6,493,481	13.3	15.0	91.6	312.5	85.9	85.7
MIMICloss	6,493,481	13.3	15.0	90.3	357.14	84.9	85.4
BCKDloss (ours)	6,493,481	13.3	15.0	92.4	256.41	86.5	85.3

Table 7. Comparative Experiments.

Model	Parameters	Model size (MB)	GFLOPs	mAP50 (%)	FPS	Recall (%)	Precision (%)
YOLOv5s	9,212,697	18.5	23.8	89.1	217.39	84.7	83.5
YOLOv6s	16,298,009	32.8	44.0	87.8	454.55	82.0	83.3
YOLOv8s	11,126,745	22.6	28.4	88.8	166.67	83.4	84.7
YOLOv9s	7,168,249	15.2	26.7	89.7	370.37	86.5	84.7
YOLOv10s	7,219,161	16.5	21.4	88.9	312.50	84.5	81.1
YOLOv11s	9,413,961	19.2	21.6	89.5	243.90	85.1	83.8
SMLO-YOLO (ours)	6,493,481	13.3	15.0	92.4	256.41	86.5	85.3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
