StrawPose-Lite: A Lightweight Pose Network for Strawberry Picking Point Prediction on Edge Devices

Liu, Haojiang; Liang, Yunsen; He, Qile; Li, Bingbing; Wang, Wanshu; He, Hongyu; Xu, Yaoxue; Yao, Yujie; Cao, Xiangyu; Yin, Yongqi; Duan, Xuliang; Pang, Tao

doi:10.3390/agriculture16111185


by Haojiang Liu ¹,†, Yunsen Liang ²,†, Qile He ³, Bingbing Li ², Wanshu Wang ¹, Hongyu He ¹, Yaoxue Xu ², Yujie Yao ², Xiangyu Cao ¹, Yongqi Yin ⁴, Xuliang Duan ²,* and Tao Pang ¹,*

¹ College of Electrical and Mechanical Engineering, Sichuan Agricultural University, Ya’an 625014, China

² College of Information Engineering, Sichuan Agricultural University, Ya’an 625014, China

³ College of Science, Sichuan Agricultural University, Ya’an 625014, China

⁴ College of Food Science and Engineering, Yangzhou University, Yangzhou 210095, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Agriculture 2026, 16(11), 1185; https://doi.org/10.3390/agriculture16111185

Submission received: 10 April 2026 / Revised: 1 May 2026 / Accepted: 26 May 2026 / Published: 28 May 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Strawberry harvesting perception in greenhouse environments requires visual models that remain reliable under occlusion while staying compact enough for edge-side inference. To address this requirement, this study develops StrawPose-Lite, a lightweight pose network for strawberry picking point prediction based on YOLOv11n-pose. The network combines ADown and C3Ghost to reduce redundant computation while preserving informative structure, and it adopts a six-keypoint pose definition derived from strawberry phenotypic characteristics. In this representation, the pedicel–fruit junction is used as the final visual picking point, whereas the remaining peak, curvature, and bottom keypoints provide geometric support when the visible contour is incomplete. The keypoint branch is further enhanced by P2-guided multi-scale fusion and SimAM-based refinement to improve sensitivity to fine pedicel-related cues under strict lightweight constraints. StrawPose-Lite contains 0.73 M parameters and requires 3.0 GFLOPs, and it achieves a pose mAP@0.5:0.95 of 79.2% on the public validation split. On the independent field deployment set, the TensorRT INT8 version achieved a pure network inference throughput of 277 FPS on a Jetson Orin NX 16G Super platform, with a measured total software latency of 5.01 ms under the embedded pipeline. These results indicate that StrawPose-Lite provides an effective balance between pose accuracy, model compactness, and inference speed for strawberry picking point perception on edge devices.

Keywords: strawberry harvesting; picking point prediction; keypoint detection; lightweight pose network; pose estimation; edge deployment

1. Introduction

Strawberry (Fragaria × ananassa) is a high-value crop in protected horticulture because of its flavor and nutritional quality [1,2]. Its soft epidermis and rapid postharvest deterioration demand timely and careful harvesting [3,4]. In commercial practice, however, strawberry picking remains highly dependent on manual labor, which increases production cost and makes harvesting capacity vulnerable to seasonal labor shortages [5,6]. Harvesting robots are therefore increasingly regarded as a practical route toward automated strawberry production [7,8,9]. In such systems, reliable visual perception in cluttered greenhouse environments is a prerequisite for downstream motion planning and end effector control [10].

Machine vision serves as the core perception module in harvesting robots and is mainly responsible for target recognition and spatial localization. Early strawberry vision systems relied largely on hand-crafted image processing. Hayashi et al. used color thresholding for strawberry detection and maturity estimation, but the method remained limited to relatively controlled conditions [11,12]. With the rapid development of deep learning, R-CNN- and YOLO-based detectors substantially improved recognition accuracy and inference speed [13,14,15]. Research attention has accordingly shifted from coarse fruit detection toward finer geometric description for picking point estimation [16,17,18]. Although instance segmentation can provide accurate contours, dense pixel-level prediction usually comes with higher computational cost, which makes real-time deployment on edge devices more difficult [19,20,21,22,23].

Compared with segmentation, keypoint and pose estimation networks can output fruit stem locations and pose geometry at lower computational cost [24,25,26,27]. They also remain more usable when the fruit body is partly occluded [28]. Recent agricultural vision studies have further highlighted the importance of lightweight design for edge deployment. Ma et al. reported robust strawberry keypoint detection with an enhanced YOLOv8-pose variant on Jetson-class hardware [17]. Li et al. developed SGSNet for lightweight strawberry growth stage detection [29], and Chen et al. proposed YOLO-Chili for efficient picking point localization in complex environments [30]. Gan et al. developed a lightweight YOLO11-based strawberry perception framework for edge platforms [31], Zhang et al. reduced greenhouse strawberry instance segmentation cost through a lightweight YOLOv8n-MCP design [32], and Xiao et al. proposed YOLO-PDGT for lightweight unripe pomegranate detection and counting [33]. Beyond agricultural detection tasks, recent edge inference work has also emphasized that deployment efficiency depends not only on accuracy but also on hardware-aware computational organization [8]. Together, these studies indicate that compact perception models are becoming a central issue in agricultural robotics.

Even so, edge deployment for real greenhouse harvesting remains challenging [34,35,36]. Picking decisions depend on very small structures such as pedicels and junction regions, and these weak cues are easily attenuated during repeated downsampling, especially in lightweight models [37]. In addition, leaf occlusion, fruit overlap, and non-uniform illumination often reduce keypoint visibility and disturb pose reasoning. On agricultural robots, the perception model must also share limited on-board computing resources with planning, control, and communication tasks [33]. These constraints make it necessary to develop models that remain compact and fast without giving up too much localization reliability. Based on this gap, the hypothesis of this study is that picking point stability under lightweight constraints can be improved not by increasing model capacity, but by preserving junction-related local cues through a task-oriented pose representation and selective feature path redesign. Therefore, the novelty of this work lies in the integration of a pedicel-sensitive six-keypoint representation with a lightweight YOLO-pose refinement strategy for strawberry picking point prediction, rather than in proposing a new general-purpose network primitive.

To address these issues, this study develops StrawPose-Lite on top of YOLOv11n-pose. The model introduces anti-aliasing downsampling to better preserve pedicel-related detail, replaces the neck refinement block with C3Ghost to reduce redundancy, and strengthens the keypoint branch through P2-guided multi-scale fusion together with the Simple Parameter-Free Attention Module (SimAM). In addition, a six-keypoint annotation scheme derived from strawberry phenotypic characteristics is introduced to improve the stability of pedicel–fruit junction prediction under occlusion. The contribution of this work is therefore positioned as a task-oriented architectural customization for junction-sensitive strawberry pose estimation, rather than the introduction of a fundamentally new primitive module. The objective is to maintain reliable picking point localization while satisfying the lightweight computational constraints required for deployment.

In summary, StrawPose-Lite combines a pedicel-sensitive six-keypoint representation with a lightweight backbone–neck–head redesign tailored to edge-side strawberry perception. Relative to existing lightweight pose designs, the main novelty lies in stabilizing picking point prediction under partial occlusion while preserving a low computational footprint that remains suitable for embedded inference.

2. Materials and Methods

The overall workflow of this study consisted of dataset construction and annotation, lightweight model design, model training, and performance evaluation, which together formed the experimental pipeline for strawberry picking point prediction.

2.1. Dataset Construction and Annotation

To support strawberry picking point prediction, a dedicated dataset construction pipeline was established in this study. Unlike general fruit detection datasets, the present task requires not only fruit-level localization but also a stable representation of the pedicel–fruit junction and its surrounding geometry. The dataset was therefore organized from two complementary sources, processed under a unified image setting, and annotated in a joint bounding-box and six-keypoint format. The following subsections describe the public development set, the independent field set, the training time preprocessing and augmentation pipeline, and the annotation protocol adopted for model development and deployment-oriented evaluation. The overall workflow is shown in Figure 1.

2.1.1. Public and Field Dataset Construction

As shown in Figure 1a, the adopted data source consists of two parts. The first is a public development set derived from StrawDI and its standardized subset StrawDI_Db1 [38]. StrawDI was collected in real production environments in Spain during 2018–2019 using smartphone cameras and covers multiple growth stages, viewpoints, and occlusion conditions. From this source, 3100 representative images were selected and uniformly resized to 1008 × 756 to form the working subset used in this study. Although the original release contains an official split, that split was not adopted here. Because a separate field-collected set was reserved for deployment-oriented evaluation, StrawDI_Db1 was used only as a public development set and was re-partitioned into 2790 training images and 310 validation images. This split was used consistently for model training, hyperparameter tuning, convergence analysis, ablation studies, and same-protocol model comparison.

To further examine performance outside the public source domain, an independent field-collected set containing 1200 images was established in April 2024 at Kaixin Picking Park in Ya’an, Sichuan Province, China. Images were captured with a Huawei P40 Pro smartphone (Huawei Technologies Co., Ltd., Shenzhen, China) in a fixed 4:3 format. This set includes several conditions commonly encountered during strawberry harvesting, including fruit occlusion, mixed maturity, foliage interference, specular reflection, and complex background clutter. For the deployment-oriented evaluation, the field set followed the same bounding-box and six-keypoint annotation protocol described in Section 2.1.3. Importantly, this field set was not used for model training, structural design, or hyperparameter adjustment. It was reserved only for the deployment-oriented evaluation reported in Section 3.5.

2.1.2. Data Preprocessing and Augmentation

To reduce domain inconsistency between the two data sources, all source images were uniformly resized to 1008 × 756 before annotation management. During network training, the online pipeline used a 640 × 640 model input together with the augmentation settings recorded in the run configuration: HSV perturbation (h = 0.015, s = 0.7, v = 0.4), translation = 0.1, scale = 0.5, mosaic = 1.0, RandAugment, and random erasing = 0.4. These operations were used to improve tolerance to illumination variation, background disturbance, and partial occlusion, as illustrated schematically in Figure 1b.
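For readers who want to reproduce the pipeline, the augmentation values listed above can be collected in one configuration mapping. The sketch below is only an illustration: the key names follow the Ultralytics hyperparameter naming as an assumption (the actual run configuration file is not shown in the text), and `letterbox_scale` is a hypothetical helper showing how a 4:3 source image would fit the square network input if aspect-preserving resizing is used.

```python
# Augmentation values from Section 2.1.2. The key names follow
# Ultralytics-style hyperparameter naming, which is an assumption here;
# only the values come from the text.
AUGMENTATION = {
    "imgsz": 640,
    "hsv_h": 0.015,
    "hsv_s": 0.7,
    "hsv_v": 0.4,
    "translate": 0.1,
    "scale": 0.5,
    "mosaic": 1.0,
    "auto_augment": "randaugment",
    "erasing": 0.4,
}


def letterbox_scale(src_w: int, src_h: int, imgsz: int) -> float:
    """Scale factor that fits an image inside an imgsz x imgsz input
    while keeping its aspect ratio (hypothetical helper)."""
    return min(imgsz / src_w, imgsz / src_h)


# A 1008 x 756 source image is limited by its longer side (the width).
scale = letterbox_scale(1008, 756, AUGMENTATION["imgsz"])
resized = (round(1008 * scale), round(756 * scale))
print(resized)
```

Under this assumption, the remaining rows of the square input would be filled by padding rather than by stretching the fruit geometry.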

2.1.3. Annotation Protocol and Quality Control

To support joint fruit detection and keypoint localization, each strawberry instance was annotated with both a bounding box and six keypoints. Annotations were first created using LabelMe (v5.4.1, https://github.com/wkentaro/labelme, accessed on 9 April 2026) and then converted into standard YOLO-pose format. Following the joint modeling principle of the fruit body and the pedicel–fruit junction region, a short pedicel segment directly connected to the fruit body was intentionally retained in the annotation so that the network could learn redundant geometric cues around the final picking point.

(1): Bounding-box annotation.

Each bounding box covers the strawberry fruit body together with a short pedicel segment immediately above the pedicel–fruit junction. This design allows the model to learn the local spatial relationship between the fruit and the junction region while avoiding an unnecessarily long stem span that would introduce irrelevant background variation. The retained pedicel segment was therefore used as a local geometric prior for picking point prediction rather than as a full stem annotation.

(2): Keypoint annotation.

Because a single picking point is highly sensitive to occlusion and boundary ambiguity, a six-keypoint strategy was used, as shown in Figure 1c and Figure 2. The geometry of each fruit was described by three construction lines and their endpoints: K0–K3 for the central growth axis, K5–K1 for the maximum transverse diameter, and K4–K2 for the longitudinal curvature line. Among these points, K0 denotes the pedicel–fruit junction and serves as the final visual picking point, whereas the remaining keypoints provide auxiliary geometric constraints. Compared with a single-point label or a bounding-box-only description, this design preserves richer shape context and helps stabilize regression when the fruit contour is incomplete or partly occluded.

The six feature points are denoted K0 to K5, and their definitions are listed in Table 1.

For practical annotation, K0 was placed at the visible pedicel–fruit junction and used as the final visual picking point. K1 and K5 were placed at the right and left maximum transverse boundary points, K2 and K4 at representative curvature points on the right and left lateral contours, and K3 at the lowest visible fruit boundary along the central growth axis. For occluded or invisible keypoints, visibility flags of 0, 1, and 2 were assigned according to the annotation status. Partly occluded but inferable keypoints were annotated with the corresponding visibility flag, whereas completely invisible and non-inferable keypoints were marked as invisible. To reduce subjective bias and improve consistency, a dual annotator protocol with conflict arbitration was adopted. Labels judged consistently by the two primary annotators were accepted directly, whereas inconsistent cases were reviewed and resolved by a third senior researcher.

After annotation, all labels were converted to YOLO-pose format. Each instance was represented by one class label, normalized bounding-box coordinates, and six keypoint triplets. This unified representation supports end-to-end joint learning and efficient inference on edge devices.
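As an illustration of this representation, the following sketch parses one label line with one class label, four normalized box values, and six keypoint triplets (23 fields in total). The sample line is made up, and the names `PoseLabel` and `parse_pose_line` are hypothetical. The field order (class, center x, center y, width, height, then x, y, visibility per keypoint) is assumed to follow the usual YOLO-pose text layout.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_KEYPOINTS = 6  # K0..K5


@dataclass
class PoseLabel:
    class_id: int
    box: Tuple[float, float, float, float]     # cx, cy, w, h (normalized)
    keypoints: List[Tuple[float, float, int]]  # (x, y, visibility flag)


def parse_pose_line(line: str) -> PoseLabel:
    """Parse one instance from a YOLO-pose style label line."""
    fields = line.split()
    expected = 1 + 4 + 3 * NUM_KEYPOINTS  # 23 fields per instance
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    class_id = int(fields[0])
    cx, cy, w, h = (float(v) for v in fields[1:5])
    keypoints = []
    for k in range(NUM_KEYPOINTS):
        x, y, v = fields[5 + 3 * k: 8 + 3 * k]
        keypoints.append((float(x), float(y), int(float(v))))
    return PoseLabel(class_id, (cx, cy, w, h), keypoints)


# Made-up instance: K3 is marked invisible (flag 0), K4 is occluded (flag 1).
sample = (
    "0 0.50 0.55 0.20 0.30 "
    "0.50 0.42 2 0.59 0.52 2 0.57 0.62 2 "
    "0.00 0.00 0 0.43 0.62 1 0.41 0.52 2"
)
label = parse_pose_line(sample)
picking_point = label.keypoints[0]  # K0, the pedicel-fruit junction
valid = [kp for kp in label.keypoints if kp[2] > 0]
print(picking_point, len(valid))
```

Keypoints with a flag greater than 0 are counted as valid here, matching the validity criterion used for the dataset statistics.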

2.1.4. Evaluation Protocol

In all core algorithmic experiments, the re-partitioned StrawDI_Db1 subset served as the public development set, with 2790 images used for training and 310 used for validation (Table 2). The validation subset was used for convergence monitoring and quantitative evaluation. Accordingly, the results in Section 3.2, Section 3.3 and Section 3.4 were all obtained on the public validation set. The independent field-collected dataset of 1200 images was not used in model development and was reserved exclusively for edge deployment evaluation in Section 3.5.

To further clarify the supervision scale of the pose estimation task, Table 2 reports not only the number of images but also the number of strawberry instances, total keypoint slots, valid keypoints, and keypoint validity rate in each dataset split. Since each strawberry instance was annotated with six keypoints, the total number of keypoint slots was calculated as the number of instances multiplied by six. Valid keypoints refer to annotated keypoints with visibility flags greater than 0. The public development set contained 7626 strawberry instances and 45,756 keypoint slots, of which 43,820 were valid keypoints. The independent field-collected set contained 2952 strawberry instances and 17,712 keypoint slots, of which 16,740 were valid keypoints. Overall, the two datasets contained 10,578 strawberry instances and 60,560 valid keypoints, providing sufficient annotated supervision for training, validation, and deployment-oriented evaluation.
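The slot counts and totals quoted above follow from simple arithmetic, which the short sketch below reproduces. The validity rates it prints are computed from the quoted counts and are not copied from Table 2. The variable and function names are illustrative.

```python
KEYPOINTS_PER_INSTANCE = 6

# Instance and valid-keypoint counts quoted in the text.
SPLITS = {
    "public development set": {"instances": 7626, "valid": 43820},
    "independent field set": {"instances": 2952, "valid": 16740},
}


def summarize(instances: int, valid: int):
    """Return (total keypoint slots, keypoint validity rate)."""
    slots = instances * KEYPOINTS_PER_INSTANCE
    return slots, valid / slots


summary = {name: summarize(**counts) for name, counts in SPLITS.items()}
total_instances = sum(c["instances"] for c in SPLITS.values())
total_valid = sum(c["valid"] for c in SPLITS.values())

for name, (slots, rate) in summary.items():
    print(f"{name}: {slots} slots, validity rate {rate:.2%}")
print(f"overall: {total_instances} instances, {total_valid} valid keypoints")
```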

2.2. Baseline Model

Ultralytics YOLOv11n-pose was used as the baseline model for joint object and keypoint detection. This architecture adopts a heatmap-free representation and directly regresses keypoint coordinates, which avoids the extra postprocessing burden of heatmap-based methods. YOLO11 follows a single-stage multi-scale dense prediction framework. Its main design includes the C3k2 backbone block, PSA/C2PSA-based neck fusion, and lightweight task heads, which makes it a suitable starting point for deployment-oriented refinement.

2.3. Architecture of the Proposed StrawPose-Lite Model

To address the visual challenges of strawberry harvesting in unstructured greenhouse scenes, StrawPose-Lite was developed on top of YOLOv11n-pose. As shown in Figure 3, the model preserves the original single-stage detection pipeline and modifies only the components most relevant to picking point prediction. The overall design objective is straightforward: retain a lightweight deployment budget while improving the localization stability of fine strawberry structures under occlusion.

In this study, the complete model is consistently referred to as StrawPose-Lite, whereas the enhanced keypoint branch is referred to as the keypoint branch enhancement module, which consists of P3–P5 adaptive fusion, gated P2 feature injection, and SimAM-based response refinement. The design involves three targeted changes. First, ADown replaces the stride 2 convolution in the main backbone downsampling stages to suppress aliasing before feature resolution drops. Second, C3Ghost is used in the neck refinement stage to reduce redundant feature generation at lower cost. Third, the keypoint branch is strengthened through P3-oriented multi-scale fusion, gated injection of high-resolution P2 detail, and SimAM-based response refinement, while the detection branch remains unchanged. In this way, the original detection pathway is preserved, whereas pedicel-related feature representation is selectively enhanced.

2.3.1. ADown-Based Anti-Aliasing Downsampling

ADown was selected because it provides a compact compromise between anti-aliasing and local-detail retention [39,40]. Compared with a direct stride 2 convolution, its average pooling prefilter can reduce aliasing before resolution reduction, while the parallel max pooling branch helps retain locally salient responses. Heavier alternatives, such as additional high-resolution branches or attention blocks in the backbone, may preserve detail but would increase computational cost. Therefore, ADown was used as a lightweight downsampling strategy consistent with the deployment-oriented objective of this study.

To reduce the loss of fine pedicel cues during backbone downsampling, the standard stride 2 convolution was replaced by ADown in the main downsampling stages except the first layer (Figure 4). In implementation, ADown first applies average pooling with kernel size 2 and stride 1 as a lightweight low-pass prefilter. The filtered feature map is then split into two channel groups. One branch applies a 3 × 3 convolution with stride 2 to preserve semantic extraction, whereas the other branch uses 3 × 3 max pooling followed by a 1 × 1 convolution to retain locally salient responses. The final output is formed by concatenating the two branches along the channel dimension.

X_{lp} = \mathrm{Pool}(X)

(1)

where X denotes the input feature map, Pool(·) denotes local average pooling with a kernel size of 2 and a stride of 1 rather than global average pooling, and X_{lp} denotes the low-pass filtered feature map. Equation (1) describes only the average pooling prefilter before downsampling. The complete ADown operation is further described by Equations (2) and (3). Equation (2) describes the stride 2 convolution branch for semantic extraction, whereas Equation (3) describes the max pooling plus 1 × 1 convolution branch for retaining local salient responses.

F_1 = \mathrm{Conv}_{3 \times 3,\, s = 2}(X_{lp})

(2)

F_2 = \mathrm{Conv}_{1 \times 1}(\mathrm{MaxPool}(X))

(3)

Compared with a standard stride 2 convolution, this design reduces high-frequency aliasing before downsampling and distributes feature extraction between a semantic branch and a local response branch. The practical motivation for introducing ADown in this study was therefore not to guarantee an isolated accuracy gain, but to lower computational load while preserving weak structural cues that are easily suppressed in lightweight backbones.

In Equations (2) and (3), F_1 and F_2 denote the outputs of the two branches. Conv_{3×3, s=2}(·) denotes a 3 × 3 convolution with stride 2. MaxPool(·) denotes max pooling, and Conv_{1×1}(·) denotes a 1 × 1 convolution. The final ADown output is obtained by concatenating F_1 and F_2 along the channel dimension. The two branches play complementary roles: F_1 preserves the main semantic transition during resolution reduction, whereas F_2 retains sharper local responses around weak edges and narrow pedicel structures.
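To make the resolution bookkeeping of ADown concrete, the sketch below traces tensor shapes through the two branches as described in the text (prefilter, channel split, two branches, concatenation). It is a shape trace rather than the actual layer. The padding values and the stride 2 of the max pooling are assumptions taken from the commonly used reference ADown layout, since the text does not state them, and the function names are illustrative.

```python
def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def adown_output_shape(c_in: int, c_out: int, height: int, width: int):
    """Trace (channels, height, width) through an ADown-style block."""
    # Low-pass prefilter: average pooling, kernel 2, stride 1
    # (padding 0 is an assumption).
    h = out_size(height, 2, 1, 0)
    w = out_size(width, 2, 1, 0)

    # The filtered map is split into two channel groups; each branch
    # produces half of the output channels.
    half_out = c_out // 2

    # Branch 1: 3x3 convolution, stride 2 (padding 1 assumed).
    f1 = (half_out, out_size(h, 3, 2, 1), out_size(w, 3, 2, 1))

    # Branch 2: 3x3 max pooling (stride 2, padding 1 assumed),
    # followed by a 1x1 convolution that keeps the spatial size.
    f2 = (half_out, out_size(h, 3, 2, 1), out_size(w, 3, 2, 1))

    assert f1[1:] == f2[1:], "branches must align before concatenation"
    return (f1[0] + f2[0], f1[1], f1[2])


# Example: an 80 x 80 map with 64 channels mapped to 128 channels.
shape = adown_output_shape(64, 128, 80, 80)
print(shape)  # (128, 40, 40)
```

Under these assumptions, the stride 1 prefilter shrinks the map by one pixel (80 to 79), and both stride 2 branches then land on exactly half of the original resolution.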

2.3.2. C3Ghost-Based Lightweight Feature Extraction

To reduce neck redundancy, the original C3k2 block in the neck refinement stage was replaced by C3Ghost (Figure 5) [41]. In this design, a smaller set of intrinsic feature maps is first generated by primary convolution, and additional feature maps are then produced by inexpensive Ghost operations rather than by repeatedly using full convolutions. This mechanism is motivated by the observation that many intermediate feature maps in lightweight detectors are partially redundant and can be approximated by cheaper linear transformations.

Y' = X * f

(4)

Equations (4)–(6) describe the generation of intrinsic features, the construction of Ghost features through inexpensive transformations, and the final channel-wise concatenation, respectively. In the present model, C3Ghost is introduced in the neck rather than throughout the whole backbone so that parameter reduction is achieved at a stage where feature refinement is important, while the early backbone representation remains stable.

y_{i,j} = \Phi_{i,j}(y'_i), \quad i = 1, \dots, m, \quad j = 1, \dots, s - 1

(5)

Y = [Y', Y'']

(6)

By introducing Ghost operations into the CSP-style C3 structure, C3Ghost lowers parameter count and GFLOPs while keeping adequate feature representation for real-time edge deployment.
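The parameter saving behind the Ghost idea can be illustrated with a small count. The sketch below compares a standard convolution with a Ghost-style layer that generates m = c_out / s intrinsic maps by a primary convolution and the remaining maps by cheap depthwise operations. The layer sizes, the ratio s = 2, and the cheap kernel size d = 3 are assumptions chosen for illustration. Bias and normalization parameters are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in: int, c_out: int, k: int, s: int = 2, d: int = 3) -> int:
    """Weights of a Ghost-style layer.

    m intrinsic maps come from a primary k x k convolution; each intrinsic
    map then yields (s - 1) extra maps through d x d depthwise operations.
    """
    m = c_out // s
    primary = c_in * m * k * k
    cheap = m * (s - 1) * d * d
    return primary + cheap


# Illustrative layer: 64 -> 128 channels with a 3 x 3 kernel.
dense = standard_conv_params(64, 128, 3)
ghost = ghost_conv_params(64, 128, 3)
print(dense, ghost, round(dense / ghost, 2))
```

For this illustrative layer the saving approaches the ratio s, because the cheap depthwise operations contribute only a small share of the weights.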

2.3.3. Keypoint Branch Enhancement Module

Because the pedicel–fruit junction is small and often occluded, the original keypoint head of YOLOv11n-pose was further modified. The detection branch was kept unchanged, whereas the keypoint branch was strengthened through adaptive fusion of P3–P5 features, gated injection of high-resolution P2 detail, and SimAM refinement before K0–K5 regression. In the YOLO feature pyramid, P2, P3, P4, and P5 are feature maps at different scales. P2 has the highest spatial resolution among these levels. It keeps more fine details around small structures, such as the pedicel–fruit junction. P3, P4, and P5 have lower spatial resolutions and larger receptive fields. They provide stronger semantic context for keypoint localization under occlusion. Therefore, the keypoint branch uses P3–P5 fusion for contextual information and introduces P2 features to recover fine structural cues.

Adaptive Multi-Scale Feature Fusion

To handle the uneven contribution of P3, P4, and P5 features to keypoint regression, an adaptive fusion strategy was introduced for the P3-oriented keypoint path [42]. In implementation, P3, P4, and P5 features are first projected to a common channel dimension and spatially aligned. Learnable fusion weights are then normalized through softmax so that the output feature can aggregate contextual information from multiple receptive fields while remaining dominated by the task-relevant scale. Equation (7) formalizes this weighted fusion, where the temperature coefficient T controls the sharpness of the softmax fusion distribution. In this study, T was fixed at 1.0 and was not treated as a tunable hyperparameter.

F_i^{out} = \sum_{j=3}^{5} w_{i,j} \cdot R(F_j \to F_i), \quad w_i = \mathrm{softmax}(T \cdot \omega_i), \quad \sum_j w_{i,j} = 1

(7)

In the current implementation, the fusion matrix is learnable, and the P3-oriented spatial correction convolution is initialized to zero. This zero initialization makes the additional branch start from a neutral state, so that the model first behaves like a stable weighted fusion module and then gradually learns spatially adaptive corrections during training. Figure 6 summarizes this mechanism.
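A minimal NumPy sketch of the weighted fusion in Equation (7) is given below. It assumes that the P3, P4, and P5 features have already been projected and resampled to the P3 shape, so the resampling operator R is not modeled, and it omits the zero-initialized spatial correction convolution. The function names and the toy feature values are illustrative.

```python
import numpy as np


def softmax(logits, temperature: float = 1.0) -> np.ndarray:
    """Softmax over fusion logits, scaled by T as written in Equation (7)."""
    z = temperature * np.asarray(logits, dtype=np.float64)
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()


def fuse_to_p3(aligned_features, omega, temperature: float = 1.0):
    """Weighted sum of aligned P3/P4/P5 maps with softmax-normalized weights."""
    weights = softmax(omega, temperature)
    stacked = np.stack(aligned_features, axis=0)    # (3, C, H, W)
    fused = np.tensordot(weights, stacked, axes=1)  # (C, H, W)
    return fused, weights


# Toy features with constant values 1, 2, 3 and equal (zero) fusion logits.
features = [np.full((4, 8, 8), v) for v in (1.0, 2.0, 3.0)]
fused, weights = fuse_to_p3(features, omega=[0.0, 0.0, 0.0], temperature=1.0)
print(weights, fused.shape, fused[0, 0, 0])
```

With equal logits the three scales contribute equally. Training would move the logits so that the most task-relevant scale dominates the sum.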

High-Resolution P2 Feature Injection

To compensate for the loss of high-frequency geometric detail in the keypoint branch, high-resolution information from P2 was injected only into the P3 keypoint feature through a gated fusion mechanism. The keypoint branch first projects P2 to the target channel dimension with a 1 × 1 convolution and then downsamples it to the spatial size of the P3 keypoint branch. A learnable channel-wise gate is subsequently used to control the injection strength of this high-resolution signal.

\tilde{F}_{P2 \to P3} = R(D(P(F_{P2})) \to P3)

(8)

In the deployed configuration, the gate is initialized with p2_init = −2.0, which yields a small initial sigmoid response and therefore prevents the P2 path from overwhelming the original P3 representation at the beginning of training. The fused feature is then formed by interpolating between the P3 fusion output and the gated P2 projection. This design keeps the semantic context of P3 while gradually introducing useful high-frequency detail around pedicel edges.

F_{P3}^{new} = (1 - a) \cdot F_{P3}^{msf} + a \cdot \tilde{F}_{P2 \to P3}

(9)
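The effect of the gate initialization can be checked numerically. The sketch below assumes that the gate value a in Equation (9) is the sigmoid of a learnable per-channel logit, which is consistent with the "initial sigmoid response" wording above. The per-channel feature values are toy numbers and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def inject_p2(p3_fused, p2_projected, gate_logits):
    """Interpolation of Equation (9), applied per channel.

    Each list holds one toy value per channel; gate_logits holds the
    learnable pre-sigmoid gate of each channel.
    """
    out = []
    for f3, f2, g in zip(p3_fused, p2_projected, gate_logits):
        a = sigmoid(g)
        out.append((1.0 - a) * f3 + a * f2)
    return out


P2_INIT = -2.0
a0 = sigmoid(P2_INIT)  # initial gate value, roughly 0.12
mixed = inject_p2([1.0, 1.0], [0.0, 0.0], [P2_INIT, P2_INIT])
print(round(a0, 4), [round(v, 4) for v in mixed])
```

At initialization the P2 path therefore contributes roughly 12% of the fused response, so the P3 representation dominates until training raises the gate.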

SimAM-Based Parameter-Free Attention

After multi-scale fusion and gated P2 injection, SimAM [43] is applied in the keypoint branch before the final 1 × 1 output convolution. Unlike SE or CBAM, SimAM does not introduce additional convolution kernels or fully connected layers. Instead, it builds an energy-based attention map from intra-channel statistics and sharpens responses on informative spatial locations with negligible extra overhead.

Equations (10) and (11) describe the energy mapping and the final sigmoid-based recalibration used in SimAM. In the present network, SimAM is placed after two 3 × 3 RepConv layers. This placement allows the keypoint branch to refine already integrated local features and improves the response contrast around fine pedicel-related cues without increasing the parameter count. After SimAM refinement, the final 1 × 1 output convolution maps the keypoint branch feature to six keypoint triplets for each predicted strawberry instance. Each triplet contains the normalized x-coordinate, y-coordinate, and visibility confidence of one keypoint. Therefore, K0–K5 are obtained by direct coordinate regression rather than by heatmap decoding. The structure of the SimAM refinement path is shown in Figure 7.

y = \frac{(X - \mu)^{2}}{4 \left( \frac{\sum (X - \mu)^{2}}{n} + \lambda \right)} + 0.5

(10)

\hat{Y} = X \odot \mathrm{sigmoid}(y)

(11)
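Equations (10) and (11) translate almost directly into array code. The NumPy sketch below operates on a single feature map of shape (C, H, W). Two details are assumptions taken from the reference SimAM implementation rather than from the text: n is set to H·W − 1, and λ defaults to 1e-4.

```python
import numpy as np


def simam(x: np.ndarray, lam: float = 1e-4) -> np.ndarray:
    """Parameter-free SimAM recalibration for a (C, H, W) feature map."""
    n = x.shape[1] * x.shape[2] - 1              # assumption: H*W - 1
    mu = x.mean(axis=(1, 2), keepdims=True)      # per-channel mean
    d = (x - mu) ** 2                            # squared deviation
    # Equation (10): per-location energy-based response.
    y = d / (4.0 * (d.sum(axis=(1, 2), keepdims=True) / n + lam)) + 0.5
    # Equation (11): sigmoid recalibration of the input feature.
    return x * (1.0 / (1.0 + np.exp(-y)))


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 16))
refined = simam(feat)
print(refined.shape)
```

Because no learnable tensor appears in the function, the refinement adds no parameters. Locations that deviate strongly from their channel mean receive a sigmoid factor closer to 1.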

2.4. Evaluation Metrics

To evaluate the detection and pose estimation performance of the proposed network, precision (P), recall (R), average precision (AP), mean average precision (mAP), parameter count (Params), GFLOPs, and frames per second (FPS) were used. These metrics jointly characterize localization accuracy, computational burden, and deployment behavior. In this study, pose mAP@0.5:0.95 was used as the primary pose metric, while precision, recall, AP, model size, GFLOPs, and FPS were reported as complementary indicators. For the YOLO-pose evaluation, the reported P, R, AP, and mAP values refer to pose/keypoint metrics rather than bounding-box metrics, and keypoint matching follows the COCO-style Object Keypoint Similarity (OKS) protocol. Keypoints with visibility flags greater than 0 were treated as valid labeled keypoints for pose evaluation. Precision and recall are defined in Equations (12) and (13), and AP and mAP are defined in Equations (14) and (15).

P = \frac{TP}{TP + FP} \times 100\%

(12)

R = \frac{TP}{TP + FN} \times 100\%

(13)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

AP = \int_{0}^{1} p(R) \, dR

(14)

mAP = \frac{\sum_{i=1}^{n} AP_i}{n}

(15)
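Equations (12)–(15) can be sketched in a few lines. The code below is a simplified illustration: AP is approximated by trapezoidal integration of a given precision–recall curve, whereas COCO-style evaluators use an interpolated curve, and all counts and curve values are made-up numbers. The threshold list corresponds to the 0.5:0.95 range of the primary pose metric, assuming the usual step of 0.05.

```python
def precision(tp: int, fp: int) -> float:
    """Equation (12), in percent."""
    return 100.0 * tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Equation (13), in percent."""
    return 100.0 * tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions) -> float:
    """Equation (14), approximated by trapezoidal integration of p(R)."""
    pairs = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pairs, pairs[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_ap(ap_values) -> float:
    """Equation (15): plain average of the AP values."""
    return sum(ap_values) / len(ap_values)


# Ten OKS thresholds: 0.50, 0.55, ..., 0.95.
OKS_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up example values.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=10)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
print(p, r, ap, len(OKS_THRESHOLDS))
```

In this scheme, mAP@0.5:0.95 is obtained by computing one AP per OKS threshold and passing the ten values to `mean_ap`.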

AP is derived from the precision–recall curve under a given OKS threshold. For pose evaluation, mAP@0.5:0.95 is calculated by averaging AP over OKS thresholds from 0.50 to 0.95 with a step size of 0.05. FPS is a hardware- and software-pipeline-dependent indicator because it is affected by preprocessing, model inference, postprocessing, backend optimization, and device state. In this manuscript, pure network inference throughput and the latency breakdown were reported separately in the edge deployment evaluation. Equation (16) is therefore used only to describe recorded throughput under the stated hardware and software configuration rather than a device-independent model property. FPS is calculated as follows:

FPS = \frac{1000}{a + b + c}

(16)
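As a worked example of Equation (16), the sketch below converts between latency and throughput. It assumes that a, b, and c denote per-stage latencies in milliseconds (for instance preprocessing, inference, and postprocessing). Since the per-stage breakdown is not given at this point, only the totals quoted in the abstract are used, and the function names are illustrative.

```python
def fps_from_latency(*stage_ms: float) -> float:
    """Equation (16): frames per second from per-stage latencies in ms."""
    total = sum(stage_ms)
    if total <= 0:
        raise ValueError("total latency must be positive")
    return 1000.0 / total


def latency_from_fps(fps: float) -> float:
    """Inverse relation: milliseconds per frame for a given throughput."""
    return 1000.0 / fps


# Totals quoted in the abstract.
pipeline_fps = fps_from_latency(5.01)   # total software latency of 5.01 ms
network_ms = latency_from_fps(277.0)    # pure network throughput of 277 FPS
print(round(pipeline_fps, 1), round(network_ms, 2))
```

The gap between the two figures shows why pure network throughput and end-to-end pipeline latency are reported separately.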

3. Results and Analysis

3.1. Experimental Setup

Model training was conducted on Ubuntu 22.04 using an NVIDIA RTX 5090 GPU (32 GB; NVIDIA Corporation, Santa Clara, CA, USA), an Intel Xeon Platinum 8470Q processor (Intel Corporation, Santa Clara, CA, USA), Python 3.9, PyTorch 2.9.0, and CUDA 12.8 (NVIDIA Corporation, Santa Clara, CA, USA). The main hardware and training settings are listed in Table 3. For same-scale lightweight comparisons, all candidate models were run under the same data split, input size, and augmentation pipeline, and were optimized under a unified training schedule in this study.

All models were trained for 100 epochs with an input size of 640 × 640 and a batch size of 64. The recorded training configuration used an initial learning rate of 0.01, momentum of 0.937, weight decay of 5 × 10⁻⁴, warmup over three epochs, AMP, close_mosaic = 15, RandAugment, and random erasing. Unless otherwise stated, all comparison models shared the same split, training schedule, augmentation pipeline, and evaluation protocol. The 640 × 640 input size reported here refers to the online network input during training and inference, whereas the source images were standardized to 1008 × 756 during dataset preparation. For the repeated training stability analysis, the two models were retrained three times under identical settings but with different random initializations and data shuffling.
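The schedule implied by these settings can be worked out with a few lines of arithmetic. The sketch assumes that the last partial batch of each epoch is kept and that `close_mosaic = 15` disables mosaic augmentation for the final 15 epochs, as in the Ultralytics trainer. The derived counts are illustrative and are not reported in the manuscript.

```python
import math

TRAIN_IMAGES = 2790
BATCH_SIZE = 64
EPOCHS = 100
WARMUP_EPOCHS = 3
CLOSE_MOSAIC = 15

# Optimizer steps per epoch, keeping the last partial batch (assumption).
iters_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
total_iters = iters_per_epoch * EPOCHS
warmup_iters = iters_per_epoch * WARMUP_EPOCHS

# First epoch (1-indexed) trained without mosaic augmentation.
first_epoch_without_mosaic = EPOCHS - CLOSE_MOSAIC + 1

print(iters_per_epoch, total_iters, warmup_iters, first_epoch_without_mosaic)
```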

3.2. Convergence Behavior and Pose Localization Performance

The model was trained on the public development set and evaluated on its corresponding validation split. As shown in Figure 8, the convergence behavior and validation performance of StrawPose-Lite were compared with those of the YOLOv11n-pose baseline. Figure 8a–c show the validation loss, pose mAP@0.5:0.95, and precision–recall curve of StrawPose-Lite, whereas Figure 8d–f present the corresponding results of the baseline model.

For StrawPose-Lite, the validation loss decreased rapidly during the early training stage and then gradually stabilized, indicating that the lightweight structural modifications did not cause unstable optimization. Although its initial validation loss was higher than that of the baseline, the loss dropped quickly within the first few epochs and finally reached a comparable stable range. The pose mAP@0.5:0.95 also increased sharply at the beginning of training and then approached a stable plateau near the end of training. Compared with the baseline, StrawPose-Lite achieved a slightly higher final pose mAP@0.5:0.95, which is consistent with the quantitative comparison reported in the ablation study.

The precision–recall curves further show the difference between the two models. StrawPose-Lite achieved an mAP@0.5 of 0.937, while the baseline reached 0.927. The PR curve of StrawPose-Lite remained closer to the upper-right region over most recall levels, suggesting that the proposed keypoint branch enhancement and lightweight feature refinement design helped maintain precision while improving recall coverage. When recall approached its upper limit, both models showed a clear precision drop, which is common in dense and partially occluded strawberry scenes. Overall, Figure 8 indicates that StrawPose-Lite preserves stable convergence while providing slightly better validation performance than the YOLOv11n-pose baseline under the same training protocol.

3.3. Ablation Studies

To evaluate the independent contributions and combined effects of ADown, C3Ghost, and the keypoint branch enhancement module, ablation experiments were conducted on the YOLOv11n-pose baseline. The baseline was progressively equipped with single modules, two-module combinations, and the full three-module configuration under the same hardware environment, data split, training settings, and evaluation protocol. The results of the module-wise ablation study are listed in Table 4.

Using ADown alone reduced GFLOPs from 6.60 to 2.40 and increased FPS from 167 to 291, with a limited decrease in pose mAP@0.5:0.95 from 78.40% to 77.90%. This indicates that ADown mainly serves as a computational efficiency module: it removes a large amount of redundant cost while preserving most pose performance. Using C3Ghost alone reduced parameters from 2.66 M to 0.71 M and model size from 5.39 MB to 1.65 MB, with a pose mAP@0.5:0.95 of 77.80%, showing that the main benefit of C3Ghost is structural compression in the neck rather than an isolated accuracy gain. Using the keypoint branch enhancement module alone increased recall from 89.60% to 94.10% and slightly improved mAP@0.5, while pose mAP@0.5:0.95 remained close to the baseline. This pattern suggests that the keypoint branch enhancement is most helpful for recovering difficult instances and preserving geometric detail.

When ADown and C3Ghost were combined, the model reached the lowest computational load and the highest inference speed, but the reduction in representational capacity slightly weakened pose accuracy. After the keypoint branch enhancement module was added, the full StrawPose-Lite configuration achieved 84.0% precision, 90.5% recall, 93.7% mAP@0.5, and 79.2% mAP@0.5:0.95 with 0.73 M parameters and 3.0 GFLOPs. Relative to the baseline, parameter count decreased by 72.6%, GFLOPs decreased by 54.5%, and FPS increased from 167 to 279. The best overall balance therefore comes from a complementary division of labor: ADown reduces downsampling cost, C3Ghost compresses neck redundancy, and the keypoint branch enhancement module compensates for the fine detail loss that would otherwise accompany aggressive lightweighting.
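The relative changes quoted above can be reproduced from the absolute values given in this section. The short sketch below uses only the baseline and StrawPose-Lite figures stated in the text (2.66 M and 0.73 M parameters, 6.60 and 3.0 GFLOPs, 167 and 279 FPS). The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage decrease of `proposed` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - proposed) / baseline * 100.0


params_drop = relative_reduction(2.66, 0.73)   # parameters, in millions
gflops_drop = relative_reduction(6.60, 3.0)    # GFLOPs
fps_gain = (279 / 167 - 1.0) * 100.0           # relative speed-up in percent

print(f"params -{params_drop:.1f}%, GFLOPs -{gflops_drop:.1f}%, FPS +{fps_gain:.1f}%")
```

The computed reductions agree with the 72.6% and 54.5% reported in the text, and the FPS change from 167 to 279 corresponds to a relative increase of about 67%.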

To further verify that the observed improvement was not caused by random fluctuation, YOLOv11n-pose and StrawPose-Lite were each trained three times under identical settings. Table 5 presents the repeated training stability and statistical significance analysis between the two models. The reported performance values are expressed as mean ± standard deviation, and the p-values were calculated using a paired t-test. The small standard deviations indicate that the observed gains remained stable across repeated training runs. In addition, the p-values for Pose mAP@0.5 and Pose mAP@0.5:0.95 were both below 0.05, providing supplementary statistical evidence for the observed improvements under the current experimental setup.
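For readers unfamiliar with the test, the following is a minimal sketch of a paired t-test on three matched runs. The per-run values are placeholders chosen for illustration and are not the entries of Table 5. The function name is hypothetical, and the closed-form p-value is valid only for two degrees of freedom, that is, exactly three pairs. A paired test also presumes that run i of one model is matched with run i of the other. How the runs were paired is not described in this section.

```python
import math
from statistics import mean, stdev


def paired_t_test_three_runs(a: list[float], b: list[float]) -> tuple[float, float]:
    """Paired t-test for exactly three matched runs (df = 2).

    Returns (t statistic, two-sided p-value). For df = 2 the Student t CDF
    is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)), so the two-sided p-value is
    1 - |t| / sqrt(t^2 + 2).
    """
    if len(a) != 3 or len(b) != 3:
        raise ValueError("this sketch only covers three paired runs")
    diffs = [y - x for x, y in zip(a, b)]
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    if se == 0:
        raise ValueError("identical differences: t statistic is undefined")
    t = mean(diffs) / se
    p = 1.0 - abs(t) / math.sqrt(t * t + 2.0)
    return t, p


# PLACEHOLDER values for illustration only; NOT the results in Table 5.
baseline_runs = [78.3, 78.4, 78.5]
proposed_runs = [79.1, 79.2, 79.4]

t_stat, p_value = paired_t_test_three_runs(baseline_runs, proposed_runs)
print(f"t = {t_stat:.2f}, two-sided p = {p_value:.4f}")
```

With only three pairs the test has two degrees of freedom, so a p-value below 0.05 requires differences that are both consistent in sign and similar in size across runs.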

3.4. Comparison with Lightweight Pose Models and Reference Architectures

To further assess the performance of StrawPose-Lite, comparison experiments were conducted against representative lightweight pose model-family implementations under the same hardware environment, dataset split, input size, augmentation pipeline, training settings, and evaluation protocol. The compared YOLOv8n-pose, YOLOv11n-pose, YOLOv12n-pose [44], YOLOv26n-pose [45], and YOLOv11s-pose [46] entries were treated as same-protocol lightweight implementation variants for this study. No additional source code version or pretrained weight claim is made beyond the documented experimental setup. A qualitative comparison on an occluded strawberry example is shown in Figure 9, and the quantitative results are listed in Table 6.

Among the lightweight pose models, StrawPose-Lite achieved the highest mAP@0.5 (93.7%) and a pose mAP@0.5:0.95 of 79.2%, which was higher than YOLOv8n-pose (78.4%), YOLOv12n-pose (77.6%), and YOLOv26n-pose (77.9%). Compared with YOLOv11s-pose, StrawPose-Lite used far fewer parameters (0.73 M vs. 9.70 M) and lower GFLOPs (3.0 vs. 22.3), while the difference in pose mAP@0.5:0.95 was only 0.2 percentage points. This result indicates that the proposed model maintains competitive pose accuracy within a much tighter deployment budget. The absolute gain over the YOLOv11n-pose baseline is 0.8 percentage points in pose mAP@0.5:0.95, so the main contribution should be interpreted as efficiency-oriented improvement with preserved accuracy rather than as a large accuracy jump.

In terms of precision and recall, StrawPose-Lite also maintained a balanced profile. Relative to YOLOv12n-pose, precision increased by 1.1 percentage points while recall decreased by 1.3 percentage points. Compared with YOLOv11s-pose, recall increased by 2.4 percentage points with a 0.7 percentage point reduction in precision.

To explore the feasibility of other vision architectures for strawberry keypoint detection, several typical MMDetection-based backbones were also compared. For consistency, the MMDetection-based baselines were implemented within the MMDetection framework and uniformly trained using the Mask R-CNN architecture, and StrawPose-Lite was evaluated separately under the same dataset split and metrics. The compared backbones include ResNet-101, Swin Transformer, and CSWin Transformer [47,48,49]. These MMDetection-based models are included only as a broad reference for large-capacity architectures and are not intended as a direct comparison under the same deployment setting. Because the deployment environments and training methods differ, FPS, latency, and power consumption are not compared in absolute terms in this section. The analysis therefore focuses on mAP, Params, and GFLOPs; see Table 7 for details.

Among the large-capacity reference backbones, the two transformer-based models achieved the highest overall accuracy, with mAP@0.5:0.95 ranging from 82.0% to 84.0% and mAP@0.5 reaching up to 96.0%, which is consistent with their larger capacity and stronger global context modeling ability. In contrast, StrawPose-Lite required only 3.0 GFLOPs and 0.73 M parameters. Compared with the strongest MMDetection-based model in this group, StrawPose-Lite reduced computational cost and parameter count by more than 98%, while showing a moderate decline in absolute accuracy. The computational cost of such large-capacity models is impractical for battery-powered agricultural robots that depend on edge computing devices like the NVIDIA Jetson.

StrawPose-Lite achieved a favorable balance between pose accuracy, computational cost, and real-time deployment performance. These results support its potential as an edge-side visual perception module for strawberry harvesting systems.

3.5. Edge Deployment Evaluation

For agricultural robots, on-board edge computing is required for real-time operation under limited hardware resources [50]. To examine deployment behavior, StrawPose-Lite was deployed on a Jetson Orin NX 16G Super platform (NVIDIA Corporation, Santa Clara, CA, USA). Following the protocol in Section 2.1.4, the model was trained on the public development set and evaluated on the independent field-collected dataset reserved for deployment testing.

Figure 10 shows the robotic platform used for the edge deployment evaluation.

The system consists of a tracked mobile base, a robotic arm, an end effector, a camera, and an on-board Jetson Orin NX 16G Super unit. In this study, the platform was used to examine the edge-side deployment feasibility of StrawPose-Lite for strawberry picking point perception under practical greenhouse conditions. The experiments mainly focused on embedded inference behavior and real-time 2D picking point prediction on the Jetson device. Therefore, closed-loop harvesting functions involving manipulator trajectory execution, end effector operation, harvesting success rate evaluation, cycle time analysis, and fruit damage assessment were not included in the present work.

The deployment procedure and runtime workflow are shown in Figure 11.

During deployment, the trained PyTorch model was exported to ONNX and then optimized with TensorRT (NVIDIA Corporation, Santa Clara, CA, USA) for inference on the Jetson platform. During runtime, RGB images were captured by the camera and processed on the embedded device. The system output strawberry bounding boxes and six keypoints (K0–K5). Among them, K0 denotes the pedicel–fruit junction and serves as the final visual picking point.

For a fair comparison, StrawPose-Lite and the YOLOv11n-pose baseline were deployed under the same embedded inference settings. The input size was fixed at 640 × 640, the batch size was set to 1, and the IoU threshold was set to 0.7. The Jetson device was tested under the default runtime and power configuration of the JetPack environment, without manual overclocking or additional clock-locking settings. Therefore, the reported speed reflects the recorded deployment condition rather than a benchmark optimized for a specific power mode. Table 8 reports the preprocessing latency, network inference latency, postprocessing latency, and total software latency under the embedded software pipeline. By contrast, the FPS values in Table 9 were calculated from pure network inference time only.
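The distinction between total software latency and pure network inference throughput can be made concrete with a small timing harness. The sketch below is not the measurement code used in this study. The function `time_stages`, the stand-in stage callables, and the warm-up count are all hypothetical, and the simulated inference step only sleeps for about one millisecond. On a GPU backend, asynchronous execution would additionally require a device synchronization before each timestamp, which is omitted here because the stand-ins run on the CPU.

```python
import time
from statistics import mean


def time_stages(frames, preprocess, infer, postprocess, warmup=2):
    """Mean per-frame latency in ms for each stage, plus their sum."""
    frames = list(frames)
    if len(frames) <= warmup:
        raise ValueError("need more frames than warm-up iterations")
    records = {"preprocess": [], "inference": [], "postprocess": []}
    for i, frame in enumerate(frames):
        t0 = time.perf_counter()
        x = preprocess(frame)
        t1 = time.perf_counter()
        y = infer(x)
        t2 = time.perf_counter()
        postprocess(y)
        t3 = time.perf_counter()
        if i < warmup:
            continue  # discard warm-up iterations
        records["preprocess"].append((t1 - t0) * 1e3)
        records["inference"].append((t2 - t1) * 1e3)
        records["postprocess"].append((t3 - t2) * 1e3)
    stats = {stage: mean(values) for stage, values in records.items()}
    stats["total"] = stats["preprocess"] + stats["inference"] + stats["postprocess"]
    return stats


def fake_infer(x):
    """Stand-in for a model call; the sleep only imitates inference time."""
    time.sleep(0.001)
    return x


stats = time_stages([[0] * 8] * 10, lambda f: f, fake_infer, lambda y: y)
pure_inference_fps = 1000.0 / stats["inference"]  # inference time only
pipeline_fps = 1000.0 / stats["total"]            # all three stages
print(stats, pure_inference_fps, pipeline_fps)
```

Because the total includes preprocessing and postprocessing, the throughput derived from it can never exceed the pure inference throughput, which is why the two quantities are reported in separate tables.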

Across all three deployment formats, StrawPose-Lite achieved higher field set accuracy and higher throughput than the baseline under the recorded deployment logs. The latency breakdown in Table 8 further shows that StrawPose-Lite consistently reduced preprocessing, inference, postprocessing, and total software latency across PyTorch, TensorRT FP16, and TensorRT INT8. Under PyTorch inference, StrawPose-Lite achieved 82.1% precision, 87.4% recall, 91.6% mAP@0.5, 74.1% mAP@0.5:0.95, and 104 FPS. After TensorRT FP16 optimization, the model maintained similar accuracy and ran at 234 FPS. Under TensorRT INT8, speed further increased to 277 FPS, while mAP@0.5 and mAP@0.5:0.95 decreased to 90.4% and 71.6%, respectively. These results support the suitability of StrawPose-Lite for compact edge-side visual inference under the tested configuration.

4. Discussion

Identifying the pedicel–fruit junction as a reliable cutting or picking point is relevant not only to strawberry harvesting but also to robotic harvesting of fruit and vegetable crops more broadly. In many harvesting scenarios, the target point is a small structural region around the stem, pedicel, or fruit junction rather than the fruit body itself, so picking point prediction requires more detailed local geometric perception than general fruit detection. For strawberries, this junction region is small and easily affected by occlusion, overlap, and uneven illumination. The six-keypoint representation adds shape context around K0: the width, curvature, and longitudinal keypoints provide auxiliary constraints when the fruit contour is incomplete. This design reduces reliance on a single visible junction point while keeping K0 as the final visual picking point. A qualitative comparison between the YOLOv11n-pose baseline and StrawPose-Lite under a representative field scene is shown in Figure 12.

In the same field scene with overlapping fruits, mixed maturity, and partial occlusion near the junction region, the YOLOv11n-pose baseline can detect the main strawberry targets, but its predictions are more easily affected by adjacent fruits, sepals, and local contour ambiguity. By comparison, StrawPose-Lite shows more stable target localization and keypoint geometry around the visible fruit body. These findings also respond to the limitations identified in the literature review. Previous agricultural perception studies have improved fruit detection, instance segmentation, or lightweight deployment efficiency, but many of them still focus on fruit-level recognition or general target localization. For harvesting-oriented perception, the unresolved challenge is to preserve small junction-related cues under occlusion while keeping the model compact enough for embedded inference. The present results suggest that combining a six-keypoint geometric representation with selective lightweight feature preservation can partially address this gap: ADown and C3Ghost reduce redundant computation, whereas the keypoint branch enhancement module compensates for fine detail loss around the pedicel–fruit junction. The auxiliary geometric points help constrain the position of K0 when the pedicel–fruit junction is partly disturbed by leaves or adjacent fruits. This indicates that the proposed representation can provide useful structural support for junction-sensitive picking point prediction under practical field interference.

The field-collected dataset shows a clear domain gap from the public development split. Field images contain stronger occlusion, mixed maturity, foliage interference, specular reflection, and cluttered backgrounds. These conditions weaken junction-related cues and introduce stem-like structures near the target region. The lower accuracy on the field set (an mAP@0.5:0.95 of 74.1% under PyTorch inference, compared with 79.2% on the public validation split) can therefore be attributed to the higher visual complexity of greenhouse scenes and is consistent with practical deployment conditions. It also suggests that aggregate pose mAP should be interpreted together with qualitative failure patterns. Although StrawPose-Lite improves the stability of 2D picking point prediction in the representative visual comparison, severe junction occlusion and dense fruit overlap may still cause K0 deviation when the anatomical junction is almost invisible.

Recent YOLO-based agricultural perception studies have increasingly emphasized lightweight model refinement, multi-scale feature representation, and deployment on embedded edge devices. Within this context, StrawPose-Lite should be understood as a task-oriented lightweight pose framework rather than a new general-purpose YOLO architecture. The use of ADown, C3Ghost, P2-guided multi-scale fusion, and SimAM refinement aims to reduce redundant computation while preserving local structural cues around the pedicel–fruit junction. This design objective is consistent with the practical requirement of harvesting robots, where the perception model must operate under limited on-board computational resources. The proposed representation may also benefit other berry crops and junction-sensitive greenhouse targets. Its main advantage is the use of local fruit geometry beyond a single picking point, which is useful when junction localization must remain stable under partial contour loss. For physical harvesting, however, the 2D K0 output still needs spatial conversion: camera calibration and depth estimation, such as RGB-D or stereo reconstruction, are required before trajectory planning and actuation. In an embedded robotic system, the visual model must also share computing resources with other real-time modules, including image acquisition, depth estimation, 3D localization, trajectory planning, and end effector control. Adding this spatial conversion remains the next step for extending the current visual perception model toward practical robotic use.
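To indicate what the spatial conversion of K0 involves, the sketch below applies the standard pinhole back-projection to a single pixel with a known depth. It is not part of the present study. The intrinsic parameters, the pixel location, and the depth value are hypothetical, and lens distortion is ignored. The pixel coordinates are also assumed to be expressed in the image frame for which the camera was calibrated, so any resizing applied for the network input would have to be undone first.

```python
def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) at depth Z into camera coordinates.

    X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    if fx <= 0 or fy <= 0:
        raise ValueError("focal lengths must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


# Hypothetical intrinsics and K0 observation, for illustration only.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
k0_u, k0_v, k0_depth = 380.0, 210.0, 0.5  # pixel coordinates and depth in metres

k0_xyz = backproject_pixel(k0_u, k0_v, k0_depth, FX, FY, CX, CY)
print("K0 in camera frame (m):", k0_xyz)
```

The resulting point is expressed in the camera frame. A further hand-eye calibration would be needed to transform it into the manipulator frame before trajectory planning.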

5. Conclusions

This study proposed StrawPose-Lite, a lightweight pose network for strawberry picking point prediction. By combining ADown, C3Ghost, and the keypoint branch enhancement module with a six-keypoint annotation scheme derived from strawberry phenotypic characteristics, the model improved pedicel-related pose estimation within a strict lightweight budget. On the public validation split, StrawPose-Lite contained 0.73 M parameters, required 3.0 GFLOPs, and achieved a pose mAP@0.5:0.95 of 79.2%. On the independent field-collected deployment set, the TensorRT INT8 version achieved a pure network inference throughput of 277 FPS on the Jetson Orin NX 16G Super. These results show a useful balance between pose accuracy, model compactness, and edge-side inference speed, and they support lightweight visual perception for strawberry harvesting robots. The study provides a task-oriented pose design for junction-sensitive picking point localization, which is especially relevant when fruit contours are partly occluded or incomplete. However, the current evidence remains limited to 2D visual perception and embedded inference. Future work should integrate depth localization, camera calibration, and manipulator-level validation, which are needed to extend StrawPose-Lite from edge-side perception to complete robotic harvesting under dynamic greenhouse conditions.

Author Contributions

Conceptualization, H.L., X.D. and T.P.; methodology, H.L., Y.L., X.C. and Y.Y. (Yongqi Yin); software, H.L., Y.L., Q.H., Y.Y. (Yujie Yao) and X.C.; validation, H.L., Y.L., Q.H., B.L. and Y.Y. (Yujie Yao); formal analysis, H.L., Y.L., Q.H., Y.X., X.D. and T.P.; investigation, H.L., B.L., W.W., H.H., Y.X. and X.C.; resources, H.H., Y.Y. (Yongqi Yin), X.D. and T.P.; data curation, Q.H., B.L., W.W., H.H., Y.X. and Y.Y. (Yujie Yao); writing—original draft preparation, H.L., Y.L. and Q.H.; writing—review and editing, Y.L., Y.Y. (Yongqi Yin), X.D. and T.P.; visualization, H.L. and W.W.; supervision, X.D. and T.P.; project administration, X.D. and T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Sichuan Agricultural University.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The StrawDI_Db1 dataset used in this study is publicly available through the StrawDI project website (https://strawdi.github.io/, accessed on 9 April 2026) and the Zenodo repository (https://doi.org/10.5281/zenodo.14648986, accessed on 9 April 2026). The independent field-collected images used for deployment-oriented evaluation were collected during the project and are not publicly released because they involve site-specific acquisition records; they may be made available from the corresponding author upon reasonable request, subject to project and data use restrictions.

Acknowledgments

The authors are grateful for the dedicated efforts and collaborative contributions of all team members in this research.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

ADown	A Downsampling Convolutional Layer
AMP	Automatic Mixed Precision
C3Ghost	C3 Module with Ghost Bottlenecks
ASFF	Adaptive Spatial Feature Fusion
FPS	frames per second
GFLOPs	Giga Floating-Point Operations
INT8	8-bit Integer Quantization
mAP	mean average precision
NMS	Non-Maximum Suppression
PR	precision–recall
YOLO	You Only Look Once

References

Lei, J.J.; Jiang, S.; Ma, R.Y.; Xue, L.; Zhao, J.; Dai, H.P. Current status of strawberry industry in China. In IX International Strawberry Symposium 1309; International Society for Horticultural Science, Ed.; International Society for Horticultural Science: Leuven, Belgium, 2021; pp. 349–352.
Xiao, F.; Wang, H.; Xu, Y.; Zhang, R. Fruit Detection and Recognition Based on Deep Learning for Automatic Harvesting: An Overview and Review. Agronomy 2023, 13, 1625.
Wang, C.; Pan, W.; Zou, T.; Li, C.; Han, Q.; Wang, H.; Yang, J.; Zou, X. A Review of Perception Technologies for Berry Fruit-Picking Robots: Advantages, Disadvantages, Challenges, and Prospects. Agriculture 2024, 14, 1346.
Tan, Y.; Liu, X.; Zhang, J.; Wang, Y.; Hu, Y. A Review of Research on Fruit and Vegetable Picking Robots Based on Deep Learning. Sensors 2025, 25, 3677.
Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent robots for fruit harvesting: Recent developments and future challenges. Precis. Agric. 2022, 23, 1856–1907.
Zhang, J.; Kang, N.; Qu, Q.; Zhou, L.; Zhang, H. Automatic fruit picking technology: A comprehensive review of research advances. Artif. Intell. Rev. 2024, 57, 54.
Chen, Z.; Lei, X.; Yuan, Q.; Qi, Y.; Ma, Z.; Qian, S.; Lyu, X. Key Technologies for Autonomous Fruit- and Vegetable-Picking Robots: A Review. Agronomy 2024, 14, 2233.
Zhang, Y.; Zhang, T.; Gramatikov, S.; Ge, J.; Zhang, Z.; Chama, L.; Zhang, X.; Heng, P.-A.; Da, X. Breaking the Edge: Enabling Efficient Neural Network Inference on Integrated Edge Devices. IEEE Trans. Cloud Comput. 2025, 13, 694–710.
Sánchez-Molina, J.A.; Rodríguez, F.; Moreno, J.C.; Sánchez-Hermosilla, J.; Giménez, A. Robotics in greenhouses: A scoping review. Comput. Electron. Agric. 2024, 221, 108750.
Xiao, X.; Jiang, Y.; Wang, Y. Key Technologies for Machine Vision for Picking Robots: Review and Benchmarking. Mach. Intell. Res. 2025, 22, 2–16.
Hayashi, S.; Shigematsu, K.; Yamamoto, S.; Kobayashi, K.; Kohno, Y.; Kamata, J.; Kurita, M. Evaluation of a strawberry-harvesting robot in a field test. Biosyst. Eng. 2010, 105, 160–171.
Wang, C.; Wang, H.; Han, Q.; Zhang, Z.; Kong, D.; Zou, X. Strawberry Detection and Ripeness Classification Using YOLOv8+ Model and Image Processing Method. Agriculture 2024, 14, 751.
Yang, S.; Wang, W.; Gao, S.; Deng, Z. Strawberry ripeness detection based on YOLOv8 algorithm fused with LW-Swin Transformer. Comput. Electron. Agric. 2023, 214, 108360.
Shen, S.; Duan, F.; Tian, Z.; Han, C. A Novel Deep Learning Method for Detecting Strawberry Fruit. Appl. Sci. 2024, 14, 4213.
Yu, Y.; Zhang, K.; Liu, H.; Yang, L.; Zhang, D. Real-Time Visual Localization of the Picking Points for a Ridge-Planting Strawberry Harvesting Robot. IEEE Access 2020, 8, 116556–116568.
Tafuro, A.; Adewumi, A.; Parsa, S.; Amir Ghalamzan, E.; Debnath, B. Strawberry picking point localization, ripeness, and weight estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 2295–2302.
Ma, Z.; Dong, N.; Gu, J.; Cheng, H.; Meng, Z.; Du, X. STRAW-YOLO: A detection method for strawberry fruits targets and key points. Comput. Electron. Agric. 2025, 230, 109853.
Dai, S.; Bai, T.; Zhao, Y. Keypoint Detection and 3D Localization Method for Ridge-Cultivated Strawberry Harvesting Robots. Agriculture 2025, 15, 372.
Song, G.; Wang, J.; Ma, R.; Shi, Y.; Wang, Y. Study on the fusion of improved YOLOv8 and depth camera for bunch tomato stem picking point recognition and localization. Front. Plant Sci. 2024, 15, 1447855.
Bai, Y.; Mao, S.; Zhou, J.; Zhang, B. Clustered tomato detection and picking point location using machine learning-aided image analysis for automatic robotic harvesting. Precis. Agric. 2023, 24, 727–743.
Pérez-Borrero, I.; Marín-Santos, D.; Gegúndez-Arias, M.E.; Cortés-Ancos, E. A fast and accurate deep learning method for strawberry instance segmentation. Comput. Electron. Agric. 2020, 178, 105736.
Le Louëdec, J.; Cielniak, G. 3D shape sensing and deep learning-based segmentation of strawberries. Comput. Electron. Agric. 2021, 190, 106374.
Tang, C.; Chen, D.; Wang, X.; Ni, X.; Liu, Y.; Liu, Y.; Mao, X.; Wang, S. A fine recognition method of strawberry ripeness combining Mask R-CNN and region segmentation. Front. Plant Sci. 2023, 14, 1211830.
Zheng, S.; Liu, Y.; Weng, W.; Jia, X.; Yu, S.; Wu, Z. Tomato Recognition and Localization Method Based on Improved YOLOv5n-seg Model and Binocular Stereo Vision. Agronomy 2023, 13, 2339.
Rong, Q.; Hu, C.; Hu, X.; Xu, M. Picking point recognition for ripe tomatoes using semantic segmentation and morphological processing. Comput. Electron. Agric. 2023, 210, 107923.
Huang, Y.; Zhong, Y.; Zhong, D.; Yang, C.; Wei, L.; Zou, Z.; Chen, R. Pepper-YOLO: A lightweight model for green pepper detection and picking point localization in complex environments. Front. Plant Sci. 2024, 15, 1508258.
Rong, J.; Wang, P.; Wang, T.; Hu, L.; Yuan, T. Fruit pose recognition and directional orderly grasping strategies for tomato harvesting robots. Comput. Electron. Agric. 2022, 202, 107430.
Huang, Y.; Li, G.; Li, J.; Chen, H.; Lin, H.; Yang, C.; Chen, R. Accurate localization of fruit targets and picking points with multi-dimensional attention and dynamic upsampling. Comput. Electron. Agric. 2025, 236, 111211.
Li, Z.; Wang, J.; Gao, G.; Lei, Y.; Zhao, C.; Wang, Y.; Bai, H.; Liu, Y.; Guo, X.; Li, Q. SGSNet: A lightweight deep learning model for strawberry growth stage detection. Front. Plant Sci. 2024, 15, 1491706.
Chen, H.; Zhang, R.; Peng, J.; Peng, H.; Hu, W.; Wang, Y.; Jiang, P. YOLO-Chili: An Efficient Lightweight Network Model for Localization of Pepper Picking in Complex Environments. Appl. Sci. 2024, 14, 5524.
Gan, Y.; Ren, X.; Liu, H.; Chen, Y.; Lin, P. Real-time lightweight strawberry ripeness detection framework based on YOLO11 deployed on edge computing platform. J. Food Meas. Charact. 2025, 19, 8469–8491.
Zhang, X.; Zhang, G.; Wang, J.; Yang, J.; Ge, Q.; Zhao, R.; Wang, Y. Efficient instance segmentation for strawberry in greenhouses using YOLOv8n-MCP on edge devices. Inf. Process. Agric. 2025, 12, 539–549.
Xiao, Q.; Liu, Y.; Fan, C.; Wang, Y.; Fan, J.; Wu, C.; Hou, W.; Li, Y.; Wang, Y. YOLO-PDGT: A Lightweight and Efficient Algorithm for Unripe Pomegranate Detection and Counting. Measurement 2025, 254, 117852.
Droukas, L.; Doulgeri, Z.; Tsakiridis, N.L.; Triantafyllou, D.; Kleitsiotis, I.; Mariolis, I.; Giakoumis, D.; Tzovaras, D.; Kateris, D.; Bochtis, D. A Survey of Robotic Harvesting Systems and Enabling Technologies. J. Intell. Robot. Syst. 2023, 107, 21.
Tituaña, L.; Gholami, A.; He, Z.; Xu, Y.; Karkee, M.; Ehsani, R. A small autonomous field robot for strawberry harvesting. Smart Agric. Technol. 2024, 8, 100454.
Parsa, S.; Debnath, B.; Khan, M.A.; Amir Ghalamzan, E. Modular autonomous strawberry picking robotic system. J. Field Robot. 2024, 41, 2226–2246.
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1580–1589.
Perez-Borrero, I.; Marin-Santos, D.; Cortes-Ancos, E.; Gegundez-Arias, M.E. StrawDI—The Strawberry Digital Images Data Set; Zenodo: Geneva, Switzerland, 2020.
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XXXI; Springer: Cham, Switzerland, 2024; pp. 1–22.
Lu, D.; Wang, Y. MAR-YOLOv9: A multi-dataset object detection method for agricultural fields based on YOLOv9. PLoS ONE 2024, 19, e0307643.
Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone that Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 390–391.
Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 11863–11874. Available online: https://api.semanticscholar.org/CorpusID:235825945 (accessed on 9 April 2026).
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection. arXiv 2025, arXiv:2509.25164.
Ultralytics. Ultralytics YOLO11. Ultralytics Docs 2024. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 9 April 2026).
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 9992–10002.
Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 12124–12134.
Gong, R.; Zhang, H.; Li, G.; He, J. Edge Computing-Enabled Smart Agriculture: Technical Architectures, Practical Evolution, and Bottleneck Breakthroughs. Sensors 2025, 25, 5302.

Figure 1. Pipeline for constructing the strawberry picking point dataset. (a) Image sources, including the public StrawDI-derived development set and the independent field-collected set. (b) Preprocessing and online data augmentation used during model training. (c) Joint annotation of strawberry bounding boxes and six keypoints. (d) Dataset usage, including training, validation, and deployment-oriented evaluation.

Figure 2. Six-keypoint annotation scheme for strawberry picking point modeling.

Figure 3. Overall architecture of the proposed StrawPose-Lite model. The main modified components are ADown in the backbone, C3Ghost in the neck, and the enhancement module in the keypoint branch. This enhancement module consists of P3–P5 adaptive fusion, gated P2 feature injection, and SimAM refinement.

Figure 4. Structure of the ADown anti-aliasing downsampling module.

Figure 5. Structure of the C3Ghost lightweight feature extraction module.

Figure 6. P3-oriented adaptive multi-scale fusion in the keypoint branch. Projected P3–P5 features are aligned to a common resolution, fused by learnable softmax weights, and refined by a zero-initialized spatial correction path before high-resolution P2 injection.
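
As a rough illustration of the fusion described in the Figure 6 caption, the following NumPy sketch aligns three feature maps to the P3 resolution, blends them with softmax-normalized weights, and adds a residual correction whose weights start at zero. The function names, the nearest-neighbor upsampling, the scalar per-scale logits, and the 1 × 1 channel-mixing form of the correction path are assumptions made for illustration; the caption does not specify them, and the P2 injection step is not shown.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map (assumed alignment method)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_p3_p5(p3, p4, p5, logits, corr_weight):
    """Blend P3-P5 at the P3 resolution and add a residual correction.

    logits      : (3,) learnable fusion logits, one per scale (assumed scalar form)
    corr_weight : (C, C) weights of a hypothetical 1x1 correction path
    """
    aligned = [p3, upsample_nearest(p4, 2), upsample_nearest(p5, 4)]
    w = softmax(logits)
    fused = sum(wi * f for wi, f in zip(w, aligned))
    # Residual correction: contributes nothing while corr_weight is all zeros.
    correction = np.einsum("oc,chw->ohw", corr_weight, fused)
    return fused + correction, w


rng = np.random.default_rng(0)
C = 8
p3 = rng.standard_normal((C, 16, 16))
p4 = rng.standard_normal((C, 8, 8))
p5 = rng.standard_normal((C, 4, 4))

logits = np.zeros(3)            # equal weights before any training
corr_weight = np.zeros((C, C))  # zero-initialized correction path

out, w = fuse_p3_p5(p3, p4, p5, logits, corr_weight)
print(out.shape, w)
```

With zero logits and a zero correction weight, the output reduces to the plain average of the aligned maps, so a zero-initialized correction leaves the fused feature unchanged at the start of training.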

Figure 7. Structure of the SimAM refinement path used in the keypoint branch. SimAM is inserted after the RepConv refinement stack and before the final output layer so that fine pedicel-related responses can be recalibrated without extra learnable parameters.
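
The parameter-free recalibration mentioned in the Figure 7 caption can be sketched in a few lines of NumPy. The version below follows the commonly used SimAM formulation (an energy term computed per channel from the spatial mean and variance, followed by a sigmoid gate); the regularization constant `e_lambda = 1e-4` and the (C, H, W) layout are assumptions, and the surrounding RepConv stack and output layer are not modeled.

```python
import numpy as np


def simam(x, e_lambda=1e-4):
    """Apply SimAM-style attention to a (C, H, W) feature map.

    No learnable parameters are involved: the gate depends only on how far
    each activation lies from its channel mean, relative to the channel variance.
    """
    n = x.shape[1] * x.shape[2] - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2
    var = d.sum(axis=(1, 2), keepdims=True) / n
    # Inverse energy: larger for activations that stand out within their channel.
    e_inv = d / (4.0 * (var + e_lambda)) + 0.5
    gate = 1.0 / (1.0 + np.exp(-e_inv))
    return x * gate


demo = np.random.default_rng(0).standard_normal((3, 8, 8))
refined = simam(demo)
print(refined.shape)
```

Because the gate is a sigmoid of a value that is at least 0.5, every activation is scaled by a factor between roughly 0.62 and 1, with distinctive responses attenuated the least.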

Figure 8. Training convergence and validation performance of StrawPose-Lite and the YOLOv11n-pose baseline on the public validation split. (a) Validation loss of StrawPose-Lite; (b) pose mAP@0.5:0.95 of StrawPose-Lite; (c) precision–recall curve of StrawPose-Lite; (d) validation loss of YOLOv11n-pose baseline; (e) pose mAP@0.5:0.95 of YOLOv11n-pose baseline; (f) precision–recall curve of YOLOv11n-pose baseline.

Figure 9. Qualitative comparison of keypoint prediction results from different pose models on a representative occlusion case. (a) YOLOv11n-pose baseline; (b) YOLOv8n-pose; (c) YOLOv11s-pose; (d) YOLOv12n-pose; (e) YOLOv26n-pose; (f) StrawPose-Lite. All sub-images are displayed with the same crop size and visual scale. K0 denotes the pedicel–fruit junction, whereas the auxiliary keypoints provide geometric context for width, curvature, and longitudinal extent.

Figure 10. Robotic platform used for edge deployment evaluation of StrawPose-Lite.

Figure 11. Edge deployment framework and runtime workflow of StrawPose-Lite for strawberry picking point prediction, including image acquisition, preprocessing, network inference, and postprocessing outputs.

Figure 12. Qualitative comparison between the YOLOv11n-pose baseline and StrawPose-Lite on a representative field scene. (a) YOLOv11n-pose baseline; (b) StrawPose-Lite. The scene contains overlapping fruits, mixed maturity, and partial occlusion near the pedicel–fruit junction.

Table 1. Definitions and geometric roles of the six keypoints.

Keypoint	Anatomical Meaning	Geometric Role	Used for Final Picking Point
K0	Pedicel–fruit junction	Final visual picking point	Yes
K1	Right peak point	Right-side width constraint	No
K2	Right curvature point	Right contour curvature constraint	No
K3	Bottom point	Longitudinal stability constraint	No
K4	Left curvature point	Left contour curvature constraint	No
K5	Left peak point	Left-side width constraint	No
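
For readers who want to work with the six-keypoint scheme programmatically, Table 1 can be encoded as a small Python structure. This is only a sketch: the `Keypoint` class and the helper names are hypothetical, and the left/right mirror mapping is an assumption about how a horizontal-flip augmentation could keep the labels consistent, not something stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    index: int
    name: str
    role: str
    is_picking_point: bool


# Direct transcription of Table 1.
KEYPOINTS = (
    Keypoint(0, "Pedicel-fruit junction", "Final visual picking point", True),
    Keypoint(1, "Right peak point", "Right-side width constraint", False),
    Keypoint(2, "Right curvature point", "Right contour curvature constraint", False),
    Keypoint(3, "Bottom point", "Longitudinal stability constraint", False),
    Keypoint(4, "Left curvature point", "Left contour curvature constraint", False),
    Keypoint(5, "Left peak point", "Left-side width constraint", False),
)


def mirror_name(name):
    """Swap a leading 'Right'/'Left' so that each side maps to its counterpart."""
    if name.startswith("Right"):
        return "Left" + name[len("Right"):]
    if name.startswith("Left"):
        return "Right" + name[len("Left"):]
    return name


def flip_indices(keypoints):
    """Index permutation to apply after a horizontal flip (assumed use case)."""
    by_name = {kp.name: kp.index for kp in keypoints}
    return [by_name[mirror_name(kp.name)] for kp in keypoints]


picking_point = next(kp for kp in KEYPOINTS if kp.is_picking_point)
flip_idx = flip_indices(KEYPOINTS)
print(picking_point.index, flip_idx)
```

Under this mapping, K0 and K3 lie on the symmetry axis and map to themselves, while the peak and curvature points exchange sides.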

Table 2. Dataset composition, instance statistics, and usage in this study.

Dataset	Split	Images	Strawberry Instances	Total Keypoint Slots	Valid Keypoints	Validity Rate	Role	Usage
StrawDI_Db1	Train	2790	6863	41,178	39,460	95.82%	Public development set	Training
StrawDI_Db1	Validation	310	763	4578	4360	95.24%	Public development set	Validation, convergence analysis, ablation study, and model comparison
StrawDI_Db1	Total	3100	7626	45,756	43,820	95.79%	Public development set	Core algorithmic experiments
Field-collected set	Field	1200	2952	17,712	16,740	94.52%	Independent field set	Edge deployment evaluation only
Total	—	4300	10,578	63,468	60,560	95.42%	—	Training, validation, and deployment-oriented evaluation
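
The slot counts and validity rates in Table 2 follow from the instance counts and the six-keypoint definition: each instance contributes six keypoint slots, and the validity rate is the share of slots with a valid annotation. The short sketch below recomputes them from the tabulated instance and valid-keypoint counts; the variable and function names are illustrative, and recomputed percentages may differ slightly from the tabulated ones in the last decimal place.

```python
KEYPOINTS_PER_INSTANCE = 6

# (strawberry instances, valid keypoints) copied from Table 2.
SPLITS = {
    "StrawDI_Db1 train": (6863, 39460),
    "StrawDI_Db1 validation": (763, 4360),
    "Field-collected set": (2952, 16740),
}


def keypoint_stats(instances, valid):
    """Return (total keypoint slots, validity rate in percent)."""
    slots = instances * KEYPOINTS_PER_INSTANCE
    return slots, 100.0 * valid / slots


stats = {name: keypoint_stats(*counts) for name, counts in SPLITS.items()}

total_instances = sum(inst for inst, _ in SPLITS.values())
total_valid = sum(valid for _, valid in SPLITS.values())
total_slots, total_rate = keypoint_stats(total_instances, total_valid)

for name, (slots, rate) in stats.items():
    print(f"{name}: {slots} slots, {rate:.2f}% valid")
print(f"Total: {total_slots} slots, {total_rate:.2f}% valid")
```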

Table 3. Experimental hardware and training settings.

Item	Configuration
Operating system	Ubuntu 22.04
Python	3.9
PyTorch	2.9.0
CUDA	12.8
GPU	NVIDIA RTX 5090 (32 GB)
CPU	Intel Xeon Platinum 8470Q
RAM	90 GB
Input image size	640 × 640
Epochs	100
Batch size	64

Table 4. Ablation results of different module combinations based on the YOLOv11n-pose baseline. The keypoint branch enhancement module consists of P3–P5 adaptive fusion, gated P2 feature injection, and SimAM refinement. Bold values indicate the numerically best result in each column under the same experimental protocol; no additional threshold was applied.

Baseline	ADown	C3Ghost	Keypoint Branch Enhancement Module	Pose P (%)	Pose R (%)	Pose mAP@0.5 (%)	Pose mAP@0.5:0.95 (%)	Params (M)	GFLOPs	Size (MB)	FPS
√				82.90	89.60	92.70	78.40	2.66	6.60	5.39	167
√	√			82.90	91.70	92.90	77.90	0.64	2.40	1.52	291
√		√		83.20	91.70	92.80	77.80	0.71	2.70	1.65	285
√			√	81.80	94.10	93.10	78.30	0.81	3.50	1.94	268
√	√	√		80.90	92.20	93.10	78.00	0.63	2.30	1.52	293
√	√		√	82.90	91.30	93.20	77.90	0.74	3.10	1.80	276
√		√	√	84.80	90.00	93.20	78.00	0.81	3.40	1.94	270
√	√	√	√	84.00	90.50	93.70	79.20	0.73	3.00	1.80	279

The symbol √ indicates that the corresponding module was included in the model configuration.
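
To make the trade-off in Table 4 easier to read, the relative savings of the full configuration (last row) over the baseline (first row) can be computed directly from the tabulated values. The sketch below does only that arithmetic; the dictionary keys and the helper name are illustrative choices.

```python
# First and last rows of Table 4.
BASELINE = {"params_m": 2.66, "gflops": 6.60, "size_mb": 5.39, "fps": 167, "map_50_95": 78.40}
FULL = {"params_m": 0.73, "gflops": 3.00, "size_mb": 1.80, "fps": 279, "map_50_95": 79.20}


def reduction_pct(before, after):
    """Relative reduction in percent, measured against the 'before' value."""
    return 100.0 * (before - after) / before


summary = {
    "params_reduction_pct": reduction_pct(BASELINE["params_m"], FULL["params_m"]),
    "gflops_reduction_pct": reduction_pct(BASELINE["gflops"], FULL["gflops"]),
    "size_reduction_pct": reduction_pct(BASELINE["size_mb"], FULL["size_mb"]),
    "fps_speedup": FULL["fps"] / BASELINE["fps"],
    "map_gain_points": FULL["map_50_95"] - BASELINE["map_50_95"],
}

for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

Read this way, the full configuration removes roughly 73% of the parameters and 55% of the GFLOPs while gaining 0.8 percentage points of pose mAP@0.5:0.95.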

Table 5. Repeated training stability and statistical significance analysis between YOLOv11n-pose and StrawPose-Lite. Note: Bold values indicate the better result between YOLOv11n-pose and StrawPose-Lite.

Model	Pose mAP@0.5 (%)	Pose mAP@0.5:0.95 (%)
YOLOv11n-pose	92.727 ± 0.110	78.407 ± 0.105
StrawPose-Lite	93.710 ± 0.115	79.200 ± 0.110
p-value	0.00044	0.00084
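
One way to relate the means and standard deviations in Table 5 to a significance test is Welch's two-sample t statistic, sketched below using only the summary statistics. This is not necessarily the procedure behind the tabulated p-values: the table does not state which test was used or how many training runs were repeated, so both the choice of Welch's test and `N_RUNS = 3` are assumptions made purely for illustration.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, n_a, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = std_a ** 2 / n_a
    vb = std_b ** 2 / n_b
    t = (mean_b - mean_a) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df


N_RUNS = 3  # hypothetical number of repeated runs per model (not given in Table 5)

# YOLOv11n-pose (a) versus StrawPose-Lite (b), values from Table 5.
t_map50, df_map50 = welch_t(92.727, 0.110, 93.710, 0.115, N_RUNS, N_RUNS)
t_map5095, df_map5095 = welch_t(78.407, 0.105, 79.200, 0.110, N_RUNS, N_RUNS)

print(f"mAP@0.5:      t = {t_map50:.2f}, df = {df_map50:.2f}")
print(f"mAP@0.5:0.95: t = {t_map5095:.2f}, df = {df_map5095:.2f}")
```

A p-value would then be read from the t distribution with the computed degrees of freedom; a different run count or test would change the result.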

Table 6. Comparison of StrawPose-Lite with representative state-of-the-art pose models under identical experimental settings. Note: Bold values indicate the best result in each column.

Model (Pose)	Params (M)	GFLOPs	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Size (MB)	FPS
YOLOv11n-pose baseline	2.66	6.60	82.90	89.60	92.70	78.40	5.39	167
YOLOv8n-pose	3.09	8.40	83.60	90.60	93.00	78.40	6.13	142
YOLOv11s-pose	9.70	22.3	84.70	88.10	93.20	79.40	18.8	86
YOLOv12n-pose	2.64	6.60	82.90	91.80	92.90	77.60	5.43	158
YOLOv26n-pose	2.68	6.70	83.70	90.40	93.00	77.90	5.46	171
StrawPose-Lite	0.73	3.00	84.00	90.50	93.70	79.20	1.80	279

Table 7. Comparison of StrawPose-Lite with representative MMDetection-based backbones in accuracy and computational efficiency. Note: Bold values indicate the best result in each column.

Model	mAP@0.5:0.95 (%)	mAP@0.5 (%)	GFLOPs	Params (M)
ResNet-101	65.00	83.00	336	63.00
Swin	82.00	94.00	342	60.00
CSWin	84.00	96.00	339	54.00
StrawPose-Lite	79.20	93.70	3.0	0.73

Table 8. Latency breakdown of YOLOv11n-pose and StrawPose-Lite on Jetson Orin NX 16G Super. Note: Bold values indicate the lowest latency in each column.

Model (Pose)	Format	Preprocess (ms)	Inference (ms)	Postprocess (ms)	Total (ms)
YOLOv11n-pose baseline	PyTorch	1.1	14.08	1.3	16.48
YOLOv11n-pose baseline	TensorRT FP16	0.9	5.68	0.9	7.48
YOLOv11n-pose baseline	TensorRT INT8	0.9	4.76	0.9	6.56
StrawPose-Lite	PyTorch	1.0	9.62	1.0	11.62
StrawPose-Lite	TensorRT FP16	0.7	4.27	0.7	5.67
StrawPose-Lite	TensorRT INT8	0.7	3.61	0.7	5.01

Table 9. Edge deployment results of YOLOv11n-pose baseline and StrawPose-Lite under PyTorch, TensorRT FP16, and TensorRT INT8 on the independent field dataset. The reported FPS refers to pure network inference throughput. Note: Bold values indicate the best result in each column.

Model (Pose)	Format	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Pure Network Inference FPS
YOLOv11n-pose baseline	PyTorch	80.40	85.80	90.20	72.60	71
YOLOv11n-pose baseline	TensorRT FP16	80.20	85.60	90.00	72.40	176
YOLOv11n-pose baseline	TensorRT INT8	79.60	84.20	89.00	69.90	210
StrawPose-Lite	PyTorch	82.10	87.40	91.60	74.10	104
StrawPose-Lite	TensorRT FP16	81.90	87.20	91.40	73.90	234
StrawPose-Lite	TensorRT INT8	81.30	85.80	90.40	71.60	277
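
The pure network inference FPS in Table 9 appears to correspond to the reciprocal of the inference latency in Table 8, whereas an end-to-end rate would also have to include preprocessing and postprocessing. The sketch below checks this relationship using the tabulated values; the dictionary layout and function names are illustrative, and the end-to-end figure is a derived quantity that assumes the three stages run sequentially.

```python
# (preprocess, inference, postprocess) latencies in ms from Table 8.
LATENCY_MS = {
    ("YOLOv11n-pose", "PyTorch"): (1.1, 14.08, 1.3),
    ("YOLOv11n-pose", "TensorRT FP16"): (0.9, 5.68, 0.9),
    ("YOLOv11n-pose", "TensorRT INT8"): (0.9, 4.76, 0.9),
    ("StrawPose-Lite", "PyTorch"): (1.0, 9.62, 1.0),
    ("StrawPose-Lite", "TensorRT FP16"): (0.7, 4.27, 0.7),
    ("StrawPose-Lite", "TensorRT INT8"): (0.7, 3.61, 0.7),
}

# Pure network inference FPS from Table 9.
REPORTED_FPS = {
    ("YOLOv11n-pose", "PyTorch"): 71,
    ("YOLOv11n-pose", "TensorRT FP16"): 176,
    ("YOLOv11n-pose", "TensorRT INT8"): 210,
    ("StrawPose-Lite", "PyTorch"): 104,
    ("StrawPose-Lite", "TensorRT FP16"): 234,
    ("StrawPose-Lite", "TensorRT INT8"): 277,
}


def fps_from_latency(latency_ms):
    """Frames per second for a stage that takes latency_ms per frame."""
    return 1000.0 / latency_ms


pure_fps = {key: fps_from_latency(lat[1]) for key, lat in LATENCY_MS.items()}
# Sequential-pipeline assumption: total latency is the sum of the three stages.
end_to_end_fps = {key: fps_from_latency(sum(lat)) for key, lat in LATENCY_MS.items()}

for key in LATENCY_MS:
    print(key, f"pure {pure_fps[key]:.1f} FPS, end-to-end {end_to_end_fps[key]:.1f} FPS")
```

Under that sequential assumption, the 5.01 ms total latency of the INT8 model corresponds to roughly 200 frames per second for the whole software pipeline, compared with 277 FPS for the network alone.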


Share and Cite

MDPI and ACS Style

Liu, H.; Liang, Y.; He, Q.; Li, B.; Wang, W.; He, H.; Xu, Y.; Yao, Y.; Cao, X.; Yin, Y.; et al. StrawPose-Lite: A Lightweight Pose Network for Strawberry Picking Point Prediction on Edge Devices. Agriculture 2026, 16, 1185. https://doi.org/10.3390/agriculture16111185

